Non-Unicode Japanese Font Encodings
Several encoding schemes for Japanese have been developed by the government of Japan, with some manufacturers adding their own non-standard extensions over the years. This page describes some of the most common standards, ending with a description of how those encodings map to Unicode fonts, and how one such Japanese font was added to Unifont.
Because Unicode code points are specified as hexadecimal numbers, this web page will give character assignments in a mixture of decimal and hexadecimal. Hexadecimal numbers will begin with "0x", matching the convention established in the C programming language. Hexadecimal Unicode code points will begin with "U+", matching the Unicode convention.
In the Beginning
Long before the advent of computers, telegraph operators in Japan transmitted Japanese using Wabun code (和文モールス符号), a series of short and long tones transmitted like Morse code. Wabun code used katakana, along with the Japanese comma and period (full stop). Two other sequences indicated a switch from International Morse code to Wabun code, and a switch from Wabun code back to International Morse code. According to Ken Lunde, the use of katakana by telegraph operators was the reason why the first Japanese computer encoding standard used katakana, with such standards work beginning in 1969 [Lunde, pp. 103, 144].
ASCII
ASCII interoperability has been a design consideration in Japanese fonts from the start. Hence a brief description of ASCII will help explain some later concepts on this web page.
ASCII assigns control codes, also referred to as the "C0" control codes, to byte values (code points) 0 through 31 (hexadecimal 0x00 through 0x1F), the space character to code point 32 (0x20), printable characters to code points 33 through 126 (0x21 through 0x7E), and the delete code to code point 127 (0x7F).
The printable characters, from 33 through 126, form a range of 94 characters. For this reason, many multi-byte Asian encoding schemes use groups of 94 glyphs. Japanese instances will be described later in this web page.
With ASCII only occupying the lower 128 code points of a byte, the upper 128 code points (128 through 255, or 0x80 through 0xFF) have been used for many other international encodings. Very often, the range 128 through 159 (0x80 through 0x9F) is reserved as the set of "C1" control codes, with code points 160 through 254 or 255 (0xA0 through 0xFE or 0xFF) being printable characters. Thus, often the term graphics left ("GL") is used to refer to the range of printable characters from 33 through 126, and the term graphics right ("GR") is used to refer to the range of printable characters from 160 through 254 or 255.
JIS X 0201 Encoding: a First Step
The first Japanese computer encoding standard only used one byte per character. It was originally named JIS C 6220. In 1987, it was renamed as JIS X 0201. Like Wabun code, this standard allowed a combination of the Latin alphabet and katakana.
JIS X 0201 assigns katakana characters and some punctuation marks to byte values (code points) 161 through 223 (hexadecimal 0xA1 through 0xDF). These characters have been directly mapped to Unicode, as the "Halfwidth CJK punctuation" and "Halfwidth Katakana variants" in the range U+FF61 through U+FF9F.
Code points from 0 through 127 (hexadecimal 0x00 through 0x7F) are reserved for ASCII, with two common exceptions:
- The Yen symbol ("¥") can occupy position 92 (0x5C, which in ASCII is the backslash character, "\")
- The Overline symbol ("‾"), to represent a long vowel, can occopy position 126 (which in ASCII is the tilde character, "~").
Code point 127 (0x7F) is reserved for the Delete code.
JIS X 0201 is specified for use on seven- and eight-bit systems. Seven-bit systems were more common in the days of serial modems and RS-232 connections that used the eighth bit for parity. With today's Internet, eight-bit systems are all but universal.
This standard left much of the eight-bit space of 256 code points undefined, which allowed for expansion to add support for hiragana and also for kanji, the Japanese ideographs. Because there are thousands of kanji characters, at least a two-byte code is required to cover the entire set. The next section describes such a code.
JIS X 0208 Encoding
Recall from the ASCII section above that there are 94 printable ASCII characters, in the range 33 through 126 (hexadecimal 0x21 through 0x7E). JIS X 0208 specified an encoding matrix of 94 rows, with each row containing 94 cells. When the government of Japan created this format, they referred to this 94 row by 94 cell arrangement as 区点 (kuten), from 区 (ku, meaning "ward" or "section") and 点 (ten, meaning "point"). The common English usage today is to refer to this 94 ku by 94 ten array as 94 rows by 94 cells.
The code points in the first row of JIS X 0208 are numbered 1-1 through 1-94, code points in the second row are numbered 2-1 through 2-94, and so on, up to 94-94. This allows up to 94 × 94 = 8836 character assignments.
In JIS X 0208, rows 1 through 15 are used to encode non-kanji characters; rows 9 through 15 have not been assigned characters. Rows 16 through 47 are used to encode Level 1 kanji, which Japanese children learn in school. Rows 48 through 84 are used to encode Level 2 kanji. Rows 85 through 94 are currently unused.
The two-byte JIS codes can be mapped to byte sequences in several ways. A very popular method (for a while, the most popular method) allowed JIS X 0201 characters to be combined with JIS X 0208 characters. This method is called Shift-JIS, and is defined in the JIS X 0208 standard. The next section provides a brief overview.
Shift-JIS Encoding
JIS X 0208 specifies a way for its 94-by-94 encoding to mix with JIS X 0201 encoding. JIS X 0201 codes remain the same: code points 0 through 127 are unchanged, as are the punctuation codes and katakana at code points 161 through 223.
A two-byte JIS X 0208 sequence encoded using Shift-JIS has a value in the ranges 129 through 159 (0x81 through 0x9F) or 224 through 239 (0xE0 through 0xEF). The second byte will have a value in the ranges 64 through 126 (0x40 through 0x7E) or 128 through 252 (0x80 through 0xFC). Code point 127 (0x7F), the delete code point, is skipped over, as is the JIS X 0201 punctuation and katakana range of 161 through 223 (0xA1 through 0xDF). Thus Shift-JIS allows intermingling JIS X 0201 codes with JIS X 0208 codes in the same document.
Some descriptions of Shift-JIS present mathematical formulas for mapping a two-byte JIS X 0208 code. The purpose of those formulas is to skip over the delete character (127, or 0x7F) always, and to skip over the range 161 through 223 (0xA1 through 0xDF) on the first byte. They first add 32 (hexadecimal 0x20) to the row and cell numbers of a code, so that row and cell numbers range from 33 through 126 instead of 1 through 94. A (hopefully) more intuitive description of this mapping follows instead of just repeating formulas.
- The number 94 is 0x5E in hexadecimal, which in binary is 101 1110. Thus representing a number that has values from 1 – 94 requires 7 bits. Representing two such numbers, to cover the entire JIS X 0208 code point space, requires at least 14 bits.
- The first byte of a Shift-JIS two-byte code will have a value in the range 129 – 159 or 224 – 239. This is 0x81 – 0x9F or 0xE0 – 0xEF. This forms a range of 15 + 16 + 16 = 47 values. 47 is half of 94, so every two row numbers of a JIS X 0208 code are mapped to the same first byte value among these 47 byte values.
- The second byte of a Shift-JIS two-byte code maps to the lower range of values, 64 – 126 or 128 – 158 (hexadecimal 0x40 – 0x7E or 0x80 – 0x9E) if the code point's row number is odd, skipping over the delete code point, 127 (0x7F). The second byte maps to the upper range of 159 – 252 (0x9F – 0xFC) if the row number is even. Thus the second byte in a Shift-JIS two-byte code can have 188 possible values, which in hexadecimal is 0xBC and in binary is 1011 1100.
- In summary, a JIS X 0208 Shift-JIS two-byte code can have up to 47 × 188 = 8836 possible values. 94 × 94 = 8836, so this two-byte code exactly covers all possible JIS X 0208 codes.
Shift-JIS and FONTX2 Fonts
In 1990, IBM Japan introduced the DOS/V operating system for the IBM PC/AT to support kanji on a PC with a VGA graphics display. Originally, "DOS/V" was an abbreviation for "DOS/VGA". Later, as PC graphics cards provided resolutions beyond the 640 × 480 pixel resolution of VGA graphics controllers, the "V" was said to stand for "Variable".
In November 1991, Takashi Oyama (小山 隆史, handle "lepton") published FONTX version 1.1 as a font driver to replace the DOS/V font driver. It allowed different Shift-JIS encoded kanji fonts to be displayed under DOS/V. In January 1992, he published version 2.0 of the driver, FONTX2. This became the most popular kanji font encoding format under DOS/V.
Many legacy Japanese FONTX2 font files are still available on the Internet today and the format remains in use for displaying Shift-JIS encoded kanji on embedded systems.
FONTX2 files have two formats: a small file format for single byte fonts, and a large file format for double byte Shift-JIS fonts. The tables below provide an overview of these formats.
FONTX2 Single Byte Format | ||
---|---|---|
First Byte | Byte Length | Contents |
0 | 6 | "FONTX2" in ASCII |
6 | 8 | Font Name in ASCII |
14 | 1 | Glyph Width (pixels), ncols |
15 | 1 | Glyph Height (pixels), nrows |
16 | 1 | Code Flag: 0 for Alphanumeric Keyboard |
17 | Bitmaps for up to 256 left-justified glyphs, 8 pixels/byte |
Glyphs in a FONTX2 file contain a whole number of bytes per row, with pixels left-justified to the highest bit in the last byte in each row if the glyph width is not a multiple of 8 pixels. Thus the size of the glyph bitmap section of a single byte FONTX2 font file, with 256 glyphs per font, is
For example, a single byte FONTX2 file containing 16 × 16 pixel glyphs would have a bitmap size of
The total file size would be 17 header bytes + 4096 glyph bitmap bytes = 4113 bytes.
The format for a double byte Shift-JIS FONTX2 font file is as follows:
FONTX2 Double Byte (Shift-JIS) Format | ||
---|---|---|
First Byte | Byte Length | Contents |
0 | 6 | "FONTX2" in ASCII |
6 | 8 | Font Name in ASCII |
14 | 1 | Glyph Width (pixels), ncols |
15 | 1 | Glyph Height (pixels), nrows |
16 | 1 | Code Flag: 1 for Shift-JIS |
17 | 1 | Number of Code Blocks, nb |
18 | 2 | Code Block 1 Start |
20 | 2 | Code Block 1 End |
14 + (4 × nb) | 2 | Code Block nb Start |
16 + (4 × nb) | 2 | Code Block nb End |
18 + (4 × nb) | Start of Glyph Bitmaps |
The two-byte Code Block Start and Code Block End fields are stored in little-endian order: low-order byte first, high-order byte second. Glyph bitmaps are stored one byte at a time, so each bitmap is read in the order that bytes appear in the file.
As with single byte FONTX2 files, the size of each glyph is
and so the total file size of a double byte FONTX2 font file is
The glyph bitmaps follow one after the other, with no gaps between block boundaries. Dividing glyphs into runs of blocks allows skipping over the large sections of a two byte code space that contain no Shift-JIS encodings. These two qualities make FONTX2 a very space-efficient Shift-JIS font format.
FONTX2 to BDF Font Conversion
There are numerous Japanese fonts in the MS-DOS FONTX2 format that are in the public domain. This motivated me to create conversion software to allow their use on other systems. I chose BDF as the initial output format because of its support on many Unix-like platforms and embedded systems, and because of its flexibility in supporting bitmap glyphs of various sizes.
The following tarball contains the program fontx2bdf
,
which will read a single- or double-byte format FONTX2 file and
output a BDF version 2.1 font file. The ".sig
" file
is the GPG signature for the tarball. Those code points in the input
file that have valid mappings to Unicode Plane 0 are included
in the BDF output. The program is licensed under GPL version 2,
and the resulting header file for mapping JIS X code points to
Unicode code points is still in the public domain.
fontx2bdf-1.1.tar.gz.sig
The header file, jisx2uni.h
, is an expanded
version of the file eucjp2uni.h
(which is
linked towards the end of this page). I renamed the file and
its arrays to better reflect that the mapping arrays are ordered
for JIS X encodings rather than in any particular information
interchange scheme such as EUC (Extended Unix Code). This
header file contains three new arrays for mapping single- and
double-byte FONTX2 font files to Unicode code points.
Below is a graphic of such a one-byte FONTX2 font,
dflhn16x.fnt
. This can be
converted to a BDF font with this command:
fontx2bdf dflhn16x.fnt dflhn16x.bdf
Note that the glyph in position 0x5C is a backslash, not
a Yen symbol. Some Japanese fonts map 0x5C to the Yen
symbol, such as those encoded with IBM's Code Page 1041.
Keeping the backslash in its ASCII position preserves its
use as the DOS directory separator in pathnames and
as the character that precedes control codes in C program
syntax (for example, '\n
' for "newline").
The original FONTX2 files from December 2001 are in the tarball below, with the SHA-256 checksum:
db5a2203bf08aaa089afa756a7afddeeeb245b281f78529c7d442f98298d1ed5 dflfnt.tar.gz
The font name appearing within the font file is "DR F ANK". "ANK" is a common abbreviation for a single-byte Japanese font that contains Alphabet, Numerals, and Katakana.
JIS X 0213 Encoding
Although JIS X 0208 can encode over 8000 characters, this is not enough to encode all existing kanji. In the year 2000, JIS X 0213 was released with the addition of planes, or in Japanese 面 (men, meaning "surface"), to the existing row-cell ordering. The pre-existing JIS X 0208 row-cell arrangement was preserved in Plane 1; new characters are assigned to Plane 2. Each of the two JIS X 0213 planes contains a code point space of 94 rows with 94 cells per row.
JIS X 0213 was revised in 2004 and in 2012, but preserving the original two plane encoding notation. The 2004 version modified the glyphs of 168 kanji.
A JIS X 0213 code point is often written as plane-row-cell; for example, 1-1-1 represents the first code point in Plane 1 and 2-94-94 represents the last code point in Plane 2. Currently, Plane 2 only has character assignments in 25 of its 94 rows: 1, 3 – 5, 8, 12 – 15, and 78 – 94. This is significant in Shift-JIS encoding of JIS X 0213, described in the next section.
Shift-JIS for JIS X 0213 Encoding
A JIS X 0213 code point in Plane 1 has the same Shift-JIS encoding order and values that a JIS X 0208 code point would have. Plane 2 code points map to a first Shift-JIS byte in the range 240 – 252 (0xF0 – 0xFC). The second Shift-JIS byte is encoded the same for Plane 2 as for Plane 1:
- If the row number was odd, code points are assigned second Shift-JIS byte values in order in the ranges 64 through 126 and 128 through 158 (in hexadecimal, 0x40 – 0x7E and 0x80 – 0x9E), skipping over the delete code point, 127 (0x7F).
- If the row number was even, code points are assigned second Shift-JIS byte values in order in the range 159 – 252 (in hexadecimal, 0x9F – 0xFC).
The mapping of the first byte in a two-byte Shift-JIS sequence for Plane 2 code points is not entirely in order. This is an artifact of nearing the end of available Shift-JIS codes and wanting to calculate the first byte value by a mathematical formula:
Shift-JIS First & Second Bytes for JIS X 0213 Plane 2 | ||||
---|---|---|---|---|
JIS X 0213 Row (Plane 2) | Shift-JIS First Byte | Second Byte Range Half | ||
Decimal | Decimal | Hexadecimal | Lower | Upper |
1 | 240 | 0xF0 | X | |
3 | 241 | 0xF1 | X | |
4 | 241 | 0xF1 | X | |
5 | 242 | 0xF2 | X | |
8 | 240 | 0xF0 | X | |
12 | 242 | 0xF2 | X | |
13 | 243 | 0xF3 | X | |
14 | 243 | 0xF3 | X | |
15 | 244 | 0xF4 | X | |
78 | 244 | 0xF4 | X | |
79 | 245 | 0xF5 | X | |
80 | 245 | 0xF5 | X | |
81 | 246 | 0xF6 | X | |
82 | 246 | 0xF6 | X | |
83 | 247 | 0xF7 | X | |
84 | 247 | 0xF7 | X | |
85 | 248 | 0xF8 | X | |
86 | 248 | 0xF8 | X | |
87 | 249 | 0xF9 | X | |
88 | 249 | 0xF9 | X | |
89 | 250 | 0xFA | X | |
90 | 250 | 0xFA | X | |
91 | 251 | 0xFB | X | |
92 | 251 | 0xFB | X | |
93 | 252 | 0xFC | X | |
94 | 252 | 0xFC | X |
Although not entirely straightforward, the Shift-JIS mapping of a JIS X 0213 Plane 2 code point still maps one odd row and one even row to the same first byte value, as with the one-plane JIS X 0208 standard.
The Shift-JIS two-byte encoding of JIS X 0213 allows up to 15 + 16 + 16 + 13 = 60 values for the first byte of a two-byte Shift-JIS encoded JIS X 0213 code point, and the same 188 values for a second byte as Shift-JIS encoding of JIS X 0208. This is a total of 60 × 188 = 11,280 possible code points. Even if this were extended to allow first byte values of 253 through 255 (0xFD through 0xFF), that would only add 3 × 188 = 564 more code points, for a total of 11,844 possible total code points.
Thus Shift-JIS encoding is approaching its practial limit. It cannot encode all possible code points of a two plane, 94 row by 94 cell encoding such as JIS X 0213 in two bytes.
Two other non-Unicode encoding methods were introduced in 1994 that do not have this limitation: ISO-2022 and Extended Unix Code. They are described in the next two sections.
ISO-2022 and Japanese Font Encoding
By the 1990s, many computer encoding schemes had evolved to represent various languages. The most basic of these used the unassigned upper range of byte values, 128 through 255 (0x80 through 0xFF). JIS X 0201, which assigned a set of Japanese punctuation marks and katakana to the range 161 – 223 (0xA1 – 0xDF) was one such encoding method. The row-cell arrangement of two-byte character sets, with rows and cells ranging from 1 – 94, was an ordering of characters that could be mapped to one or more computer encodings.
ISO-2022 was released in 1994 to address two problems involving international character sets, including multi-byte Asian scripts:
- Providing a way to mix different international scripts in one byte stream for transmission between computers, and
- Providing a standard way to encode two-byte character sets (and even three-byte character sets if necessary).
ISO-2022 can transmit all character sets using only seven-bit codes. This makes it suitable for communication between systems that may use the eighth bit for parity, as was common in the days of dial-up modems and RS-232 connections. It does this by using different escape sequences and control codes to indicate a shifting from one character set type or range to another. It encodes 94 × 94 element row-cell character sets such as the two-byte JIS codes by adding 32 to each row or cell value (0x20 in hexadecimal). This maps all 94 × 94 row-cell bytes to the range of printable seven-bit characters: 33 – 126 (0x21 – 0x7E).
The ISO-2022 assignment of row and cell character values to byte ranges 33 – 126 (0x21 – 0x7E) is also how two-byte Japanese fonts are commonly encoded. Fonts supporting the two-plane JIS X 0213 standard may place Plane 1 and Plane 2 glyphs in two separate files, or in a single file with Plane 1 glyphs starting with a plane digit of 1 or 3, and with Plane 2 glyphs starting with a plane digit of 2 or 4.
The main contents of a two-byte JIS encoding are the kanji characters. In JIS X 0213, these begin at plane 1, row 16. In font encodings that follow the ISO-2022 convention, this kanji starting point corresponds to 0x3000. The first 15 rows of Plane 1 contain non-kanji characters.
JIS X 0213 Plane 1, rows 16 through 47 contain Level 1 kanji, which school children learn. These glyphs are in locations 0x3000 through 0x4F7E in a JIS X 0213 Plane 1 font that uses the ISO-2022 encoding convention.
Plane 1, rows 48 through 47 contain Level 2 kanji. These glyphs are in locations 0x5000 through 0x72FF.
The remaining rows after Plane 1, row 84 contain less frequently used kanji.
JIS X 0213:2004 Plane 1 has glyphs assigned for all possible code points. Plane 2 has glyphs assigned in the following rows:
JIS X 0213:2004 Plane 2 Rows Assigned |
||
---|---|---|
Ordinal Row | Font Encoding Row | |
Decimal | Decimal | Hexadecimal |
1 | 33 | 0x21 |
3 – 5 | 35 – 37 | 0x23 – 0x25 |
8 | 40 | 0x28 |
12 – 15 | 44 – 47 | 0x2C – 0x2F |
68 – 94 | 110 – 126 | 0x6E – 0x7E |
The two tables below contain links to JIS X 0213 glyph maps,
created with the Unifont unihex2png
program. The first
character (in Plane 1, location 0x2121) is a space character,
so its cell appears blank.
JIS X 0213:2004 Glyphs Plane 1 (All cells have glyph assignments) |
|||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 2A | 2B | 2C | 2D | 2E | 2F | |
30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 3A | 3B | 3C | 3D | 3E | 3F |
40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 4A | 4B | 4C | 4D | 4E | 4F |
50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 5A | 5B | 5C | 5D | 5E | 5F |
60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 6A | 6B | 6C | 6D | 6E | 6F |
70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 7A | 7B | 7C | 7D | 7E |
Plane 2 does not have characters assigned to all locations. Those rows with assignments are colored green.
JIS X 0213:2004 Glyphs Plane 2 (Gray cells have no glyph assignments) |
|||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 2A | 2B | 2C | 2D | 2E | 2F | |
30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 3A | 3B | 3C | 3D | 3E | 3F |
40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 4A | 4B | 4C | 4D | 4E | 4F |
50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 5A | 5B | 5C | 5D | 5E | 5F |
60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 6A | 6B | 6C | 6D | 6E | 6F |
70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 7A | 7B | 7C | 7D | 7E |
ISO-2022 provided a standard method for exchange of seven-bit characters between computers, notably for email. However, the base standard, with its multi-byte shift codes, is not an efficient method for computer information storage, so it is not described further. ISO-2022 does allow for an eight-bit mode popular for storing Asian character sets, known as Extended Unix Code. This is described in the next section.
Extended Unix Code-JP Encoding
The Extended Unix Code can map to many Asian glyph encodings that use planes of 94 by 94 characters. EUC-JP is the standard for encoding Japanese, and it fully supports JIS X 0213.
Regardless of the character set being encoded, EUC reserves decimal code points 0 – 31 (C0 control codes), 32 (space), 127 (delete), and 128 – 159 (C1 control codes). In hexadecimal, 0 – 32 is 0x00 – 0x20 and 127 – 159 is 0x7F – 0x9F.
EUC-JP uses the range 0x21 – 0x7E for ASCII characters, although some EUC-JP documents use the ASCII code point for the backslash character ("\"), 0x5C, for the Yen symbol and the code point for the tilde character ("~"), 0x7E, for the Overline character (Unicode code point U+203E). This ASCII range is referred to as EUC Code Set 0, and was also covered by the older Japanese standard JIS X 0201-ro.
EUC-JP adds 128 (hexadecimal 0x80) to each ISO-2022 multi-byte code point (which will have a range of 33 through 126, or 0x20 through 0x7E), so resulting byte values will lie above the ASCII range of 0x00 – 0x7F, and above the range 0x80 – 0x9F reserved for control codes. Thus JIS X 0213 code points in each plane ranging from 0x2121 through 0x7E7E using ISO-2022 encoding, when 0x8080 is added, will have EUC-JP encoding range hexadecimal 0xA1A1 through 0xFEFE.
Following this arrangement, EUC-JP encodes JIS X 0213 Plane 1 with a two-byte sequence in the range 0xA1A1 through 0xFEFE.
EUC-JP also adds 0x8080 to JIS X 0213 Plane 2 glyphs, and precedes those glyphs with the byte 0x8F. This sequence was also used for the older JIS X 0212 standard.
Finally, EUC-JP indicates the half-width katakana set from JIS X 0201 with byte value 0x8E followed by their single byte values of 0xA1 through 0xDF.
The following table summarizes these EUC-JP code sets:
EUC Code Set | Bytes per Code | Byte Sequence | Standard(s) |
---|---|---|---|
0 | 1 | 0x00–0x7F | ASCII, JIS X 0201-ro |
1 | 2 | 0xA1–0x7E, 0xA1–0x7E | JIS X 0208, JIS X 0213 Plane 1 |
2 | 2 | 0x8E, 0xA1–0xDF | JIS X 0201-jp Punctuation & Half-width Hiragana |
3 | 3 | 0x8F, 0xA1–0x7E, 0xA1–0x7E | JIS X 0212, JIS X 0213 Plane 2 |
Mapping JIS X 0213 / EUC-JP to Unicode
As discussed in the last section, all JIS X 0213 code points map to a byte sequence in the EUC-JP encoding. These code points also map to Unicode.
Most JIS X 0213 characters map to code points in Unicode Plane 0, the Basic Multilingual Plane (BMP). However, 303 characters from JIS X 0213 Plane 2 map to Unicode Plane 2, the Supplementary Ideographic Plane (SIP). Therefore, to support the entire JIS X 0213 repertoire, a font must support some glyphs in Unicode Plane 2.
As a further twist, twenty-five JIS X 0213 code points in JIS Plane 1 are a combination of two Unicode code points, and so cannot be mapped into a single Unicode code point. They are as follows:
JIS Plane 1 Code Point | Jiskan 16 Bitmap | Unicode Code Points |
---|---|---|
2477 | U+304B + U+309A | |
2478 | U+304D + U+309A | |
2479 | U+304F + U+309A | |
247A | U+3051 + U+309A | |
247B | U+3053 + U+309A | |
2577 | U+30AB + U+309A | |
2578 | U+30AD + U+309A | |
2579 | U+30AF + U+309A | |
257A | U+30B1 + U+309A | |
257B | U+30B3 + U+309A | |
257C | U+30BB + U+309A | |
257D | U+30C4 + U+309A | |
257E | U+30C8 + U+309A | |
2678 | U+31F7 + U+309A | |
2B44 | U+00E6 + U+0300 | |
2B48 | U+0254 + U+0300 | |
2B49 | U+0254 + U+0301 | |
2B4A | U+028C + U+0300 | |
2B4B | U+028C + U+0301 | |
2B4C | U+0259 + U+0300 | |
2B4D | U+0259 + U+0301 | |
2B4E | U+025A + U+0300 | |
2B4F | U+025A + U+0301 | |
2B65 | U+02E9 + U+02E5 | |
2B66 | U+02E5 + U+02E9 |
Note: Japanese OpenType fonts will recognize the two combinations of U+02E5 and U+02E9, and render glyphs that resemble the JIS X 0213 Plane 1 glyphs 0x2B65 and 0x2B66. Bitmapped fonts (BDF, PCF, etc.) and simple TrueType fonts (such as GNU Unifont) will render the two Unicode glyphs separately.
JIS X 0213 Unicode Coverage
The following are the Unicode ranges covered by the public domain
Jiskan16 JIS X 0213 font [table created with the help of the Unifont
unicoverage
utility]:
Percent Covered Range Script ------- ----- ------ 71.1% U+0080..U+00FF C1 Controls and Latin-1 Supplement 60.2% U+0100..U+017F Latin Extended - A 7.2% U+0180..U+024F Latin Extended - B 57.3% U+0250..U+02AF IPA Extensions 18.8% U+02B0..U+02FF Spacing Modifier Letters 28.6% U+0300..U+036F Combining Diacritical Marks 34.0% U+0370..U+03FF Greek and Coptic 25.8% U+0400..U+04FF Cyrillic 0.8% U+1E00..U+1EFF Latin Extended Additional 1.6% U+1F00..U+1FFF Greek Extended 21.4% U+2000..U+206F General Punctuation 2.1% U+20A0..U+20CF Currency Symbols 10.0% U+2100..U+214F Letterlike Symbols 42.2% U+2150..U+218F Number Forms 14.3% U+2190..U+21FF Arrows 21.5% U+2200..U+22FF Mathematical Operators 7.8% U+2300..U+23FF Miscellaneous Technical 1.6% U+2400..U+243F Control Pictures 41.2% U+2460..U+24FF Enclosed Alphanumerics 25.0% U+2500..U+257F Box Drawing 24.0% U+25A0..U+25FF Geometric Shapes 10.9% U+2600..U+26FF Miscellaneous Symbols 6.2% U+2700..U+27BF Dingbats 1.6% U+2900..U+297F Supplemental Arrows - B 2.3% U+2980..U+29FF Miscellaneous Mathematical Symbols - B 54.7% U+3000..U+303F CJK Symbols and Punctuation 94.8% U+3040..U+309F Hiragana 100.0% U+30A0..U+30FF Katakana 100.0% U+31F0..U+31FF Katakana Phonetic Extensions 24.6% U+3200..U+32FF Enclosed CJK Letters and Months 11.3% U+3300..U+33FF CJK Compatibility 2.5% U+3400..U+4DBF CJK Unified Ideographs Extension A 45.3% U+4E00..U+9FFF CJK Unified Ideographs 16.0% U+F900..U+FAFF CJK Compatibility Ideographs 4.7% U+FB50..U+FDFF Arabic Presentation Forms - A 6.2% U+FE30..U+FE4F CJK Compatibility Forms 42.1% U+FF00..U+FFEF Halfwidth and Fullwidth Forms 0.7% U+20000..U+2A6FF CJK Unified Ideographs Extension B
JIS X 0213 Fonts and GNU Unifont
Jiskan 16
Unifont version 12.1.02 added a unifont_jp
variant of
the main Unicode Basic Multilingual Plane (Plane 0) font.
This substitutes the default Unifont CJK glyphs with glyphs from
the public domain Jiskan16 BDF font in these Unicode ranges:
Unicode Range | Unicode Script |
---|---|
U+3300..U+33FF | CJK Compatibility |
U+3400..U+4DBF | CJK Unified Ideographs Extension A |
U+4E00..U+9FFF | CJK Unified Ideographs |
U+F900..U+FAFF | CJK Compatibility Ideographs |
U+FE30..U+FE4F | CJK Compatibility Forms |
U+FF00..U+FFEF | Halfwidth and Fullwidth Forms |
In addition, the 303 kanji in the range U+20000..U+2A6FF
(CJK Unified Ideographs Extension B) from Jiskan16 are
added to the Unifont Plane 0 glyphs. The resulting
unifont_jp
font file provides complete coverage
of the JIS X 0213 standard in a Unicode font; all other JIS
code points lie within the Basic Multilingual Plane, which
Unifont already covers completely.
I chose the Jiskan 16 by 16 pixel glyphs because they were the
clearest free set of JIS glyphs that I could find. However,
some of the glyphs do seem a little wide. If anyone who is
a Japanese native wishes to improve any of the glyphs from
the Jiskan font, the 94 cell glyph PNG files are linked in
the JIS X 0213 Plane 1 and Plane 2 tables n the
"ISO-2022 and Japanese Font Encoding" section above on this web page.
Those graphics images were created with the unihex2png
utility, which is part of the GNU Unifont distribution.
Instructions for modifying these 16-by-16 pixel glyphs
appear on the
Unifont Utilities
section of this website.
The Jiskan 16 Plane 1 and Plane 2 glyphs are in the two Unifont-format hex files below, with code points ranging from 0x2121 through 0x7E7E as per ISO-2022 encoding. The original .bdf.gz font files are also below (copied from the Gentoo GNU/Linux distribution); they contain the list of contributors to those files. These four files are in the Public Domain:
- Jiskan 16 Plane 1 Glyphs, JIS X 0213:2004: jiskan16-1.hex, jiskan16-2004-1.bdf.gz
- Jiskan 16 Plane 2 Glyphs:, JIS X 0213:2000 jiskan16-2.hex, jiskan16-2000-2.bdf.gz
- Jiskan 16 Plane 1 Glyphs, JIS X 0213:2000: jiskan16-2000-1.bdf.gz — Not used in Unifont
Izumi 16
Unifont 12.1.03 switched to Izumi 16 glyphs. Following the release of Unifont 12.1.02, someone notified me of a public domain JIS X 0213 font at http://izumilib.web.fc2.com/izumi-bf/dl-eng.html. This font seems to have started as the Jiskan 16 font and improved some glyphs.
The two relevant font files, izmg16-2004-1.bdf and izmg16-2004-2.bdf, contain blank glyphs that are at duplicate locations of non-blank glyphs: 45 duplicates for the first font and 13 for the second. I suspect that this was the result of the original glyphs being encoded in FONTX2 format, with a blank glyph specified at hexadecimal code points xx7F (to avoid ending one block and beginning another in the FONTX2 font when spanning xx7F), which were then converted into code points xx60 in the BDF font. After over a month without a response from the person who maintains that website, I decided to remove the duplicates and redistribute those two font files on this website. They are here:
- Izumi 16 Plane 1 Glyphs, JIS X 0213:2004: izmg16-2004-1.bdf.gz
- Izumi 16 Plane 2 Glyphs, JIS X 0213:2004: izmg16-2004-2.bdf.gz
Izumi 16 Plane 1 Glyphs — 45 Duplicates Removed: 2160, 2560, 2760, 2960, 2B60, 2F60, 3160, 3360, 3560, 3760, 3960, 3B60, 3D60, 3F60, 4160, 4360, 4560, 4760, 4960, 4B60, 4D60, 4F60, 5160, 5360, 5560, 5760, 5960, 5B60, 5D60, 5F60, 6160, 6360, 6560, 6760, 6960, 6B60, 6D60, 6F60, 7160, 7360, 7560, 7760, 7960, 7B60, 7D60.
Izumi 16 Plane 2 Glyphs — 13 Duplicates Removed: 2160, 2360, 2560, 2D60, 2F60, 6F60, 7160, 7360, 7560, 7760, 7960, 7B60, 7D60.
JIS/EUC-JP Conversion Mapping
The following C language header file contains arrays
that programmers can use to map from JIS or EUC-JP encoded
documents to Unicode. It is the mapping used to convert
the Jiskan16 font files to Unicode code points for the
Japanese version of Unifont, "unifont_jp
".
This header file is in the Public Domain:
File Checksums
Finally, here are SHA256 checksums for the above files:
895f26a841ba28f7277c5629fec6841a65deb1cbd49487bf0407446d30390fdb eucjp2uni.h 005345196615e692c54ef67286b44dd0eaa3a899d066cd407a4b3a810822c20c izmg16-2004-1.bdf.gz 987aeec6d39e0dd82d3c41749578589e3a4ddb65c74ae52948030f6092c8c956 izmg16-2004-2.bdf.gz cb177cddeb3cf73bd3e62dc93a7db1ad8c162256a50b93ecf7fc3bd641ea26d1 jiskan16-1.hex fdf0eb6ad56a5611c03071c8395eeba7ca54c057f4078bf0aac7b9ea05336074 jiskan16-2.hex 34e05cb59ae562d53b4b8fa5620454e9a060208f76da915c7da844f581e94830 jiskan16-2000-1.bdf.gz 14d5db8001d4f289df98ce1371f782e81add21281bf4b7f8cd7041cb1c6306cd jiskan16-2000-2.bdf.gz d79a57b81d3626414cb207eafec8d7b94fc2e7efabb5dce9d3ae58416e82ea3d jiskan16-2004-1.bdf.gz
Sources
JIS Standards:
Lunde, Ken. CJKV Information Procecessing. O'Reilly, 1999. ISBN 1-56592-224-7. By far the most informative source of information on CJKV processing, although dated: it was published before JIS X 0213 was released.
Wikipedia: articles on JIS and EUC-JP encodings.
JIS X 0213 Code Mapping Tables at http://x0213.org/codetable/index.en.html. These were taken as the canonical Japanese encoding, using fullwidth alternates when they were specified, and using Windows code page mappings except for three Plane 1 mappings that would have been duplicates of other code point mappings; the mappings adopted for Unifont were 0x2142 → U+2016, 0x215D → U+2212, and 0x2141 → U+301C.
The GNU libiconv library at https://ftp.gnu.org/gnu/libiconv/, for comparison with data tables at x0213.org.
Gentoo GNU/Linux distribution at https://gentoo.org/, the source for the public domain Jiskan 16 × 16 Plane 1 and Plane 2 BDF fonts.
FONTX2 Format:
https://ja.wikipedia.org/wiki/DOS/V [in Japanese]: History of DOS/V and Takashi Oyama's development of the FONTX2 font format.
http://www.hmsoft.co.jp/lepton/software/dosv/fontx.htm [in Japanese]: Takashi Oyama's description of the FONTX2 format.
http://elm-chan.org/docs/dosv/fontx_e.html [in English]: Description of FONTX2 format.