Solutions Asia

Half- and Full-width Characters in CJK (and Normalization)

May 22, 2011

CHARACTER SIZES

Latin characters in half-width and full-width

Asia Ａｓｉａ


Katakana in half-width and full-width

ｱｼﾞｱ アジア


Kanji (full-width only)

亜細亜

The terms half- and full-width refer to the relative width of a character's glyph. The distinction matters for CJK languages because most of their characters are too complex to display in a half-width cell. Half-width characters go back to the early days of computing, when single-byte character representation was the norm and Japanese and Korean computer manufacturers displayed very limited ranges of their languages' characters at low resolution in single-byte encodings. Some people still refer to half-width characters as single-byte characters and full-width characters as double-byte characters, but this should be discouraged because it has not necessarily been true for some time: depending on the encoding scheme, half-width characters may be encoded using multiple bytes and full-width characters may be encoded using more than two bytes.


As exciting as the history of byte encoding is, we are not going to cover it here. Suffice it to say that character width is not something that should be handled by character code mapping; it belongs in fonts, style sheets, or other display markup. The only reason this issue is still with us today is that, for backwards-compatibility reasons, the Unicode Consortium decided to include the half-width/full-width distinction in its mappings. So instead of taking some admittedly significant conversion pain five or ten years ago, we will have to deal with the half-/full-width issue for the foreseeable future. This is surprising considering the outstanding job of rationalization and simplification that the Unicode Consortium has managed to pull off in almost all other areas of the extremely complicated subject of CJK character encoding.


What’s the issue?

Why is half- versus full-width such an issue? The problem is that handling the character width distinction in character code mapping means all CJK text needs to be normalized before searching, sorting, filtering, matching, etc., or the results will not be what users expect. Someone searching for a word expects all matches to be returned, not just those in the half-width form they happened to type into the query (or, conversely, only the full-width matches if they typed the query in full-width form).
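
To make this concrete, here is a minimal, hypothetical Java snippet (not taken from any particular search product) showing that a query and a text that differ only in character width do not match without normalization:

public class WidthMismatchExample {
    public static void main(String[] args) {
        String text = "ｱｼﾞｱ";   // half-width Katakana "Asia"
        String query = "アジア"; // full-width Katakana "Asia"
        // The code points differ, so a naive comparison or substring
        // search fails even though users consider these the same word.
        System.out.println(text.equals(query));   // false
        System.out.println(text.contains(query)); // false
    }
}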


Chinese, Japanese and Korean

In Chinese, half- and full-width are usually referred to as 半角 (bànjiǎo) and 全角 (quánjiǎo) in mainland China, and 半形 (bànxíng) and 全形 (quánxíng) in Taiwan. In Japanese, the terms are 半角 (hankaku) and 全角 (zenkaku). In Korean, the terms 반각 (bangak) and 전각 (jeongak) are used.


The half-/full-width distinction is less of an issue for Chinese than for Japanese and Korean because, in Chinese, it applies only to Latin characters, symbols, and numbers. In Japanese and Korean, the distinction also applies to Katakana and Hangul characters respectively, in addition to Latin characters, numbers, etc.


Relevant Unicode Ranges from Unicode 6.0

The width size column in the table below is a gross simplification but is sufficient for our purposes. Those interested in understanding East Asian Width in more detail should refer to Unicode Standard Annex #11.


Range            Content                                                          Width Size
0x0020-0x007F    ASCII (Latin characters, symbols, punctuation, numbers)         HALF-WIDTH
0x1100-0x11FF    Hangul Jamo (Korean)                                             FULL-WIDTH
0x3000-0x303F    CJK punctuation                                                  FULL-WIDTH
0x3040-0x309F    Hiragana (Japanese)                                              FULL-WIDTH
0x30A0-0x30FF    Katakana (Japanese)                                              FULL-WIDTH
0x3130-0x318F    Hangul Compatibility Jamo (Korean, for KS X 1001 compatibility)  FULL-WIDTH
0xAC00-0xD7AF    Hangul Syllables (Korean)                                        FULL-WIDTH
0xF900-0xFAFF    CJK Compatibility Ideographs                                     FULL-WIDTH
0xFF00-0xFFEF    Halfwidth and Fullwidth Forms (full-width Latin,                 HALF-WIDTH AND FULL-WIDTH
                 half-width Katakana and Hangul)
0x4E00-0x9FFF    CJK Unified Ideographs - common and uncommon                     FULL-WIDTH
0x3400-0x4DBF    CJK Unified Ideographs Extension A - rare                        FULL-WIDTH
0x20000-0x2A6DF  CJK Unified Ideographs Extension B - very rare                   FULL-WIDTH
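
As a quick way to see which of these blocks a given character falls into, Java's standard Character.UnicodeBlock API can be used; the sample characters below are purely illustrative:

public class UnicodeBlockCheck {
    public static void main(String[] args) {
        // half-width 'A', full-width 'Ａ', half-width 'ｱ', full-width 'ア', kanji '亜'
        char[] samples = { '\u0041', '\uff21', '\uff71', '\u30a2', '\u4e9c' };
        for (char c : samples) {
            System.out.printf("U+%04X  %s%n", (int) c, Character.UnicodeBlock.of(c));
        }
    }
}

The half-width Latin 'A' reports BASIC_LATIN, while both the full-width Latin and the half-width Katakana characters report HALFWIDTH_AND_FULLWIDTH_FORMS.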


NORMALIZATION

The next question is how to best normalize text to ignore the differences between half-width and full-width forms when doing searches, sorts, etc.


Consulting the table in the previous section shows that there is always a full-width version of a character but not always a half-width one, so at first it might seem easiest to simply convert everything to full-width for normalization.


However, that is probably not the best solution, for two reasons. First, a large amount of software for English and other languages that use the Latin character set targets only the character codes of the half-width ASCII versions, and all of that software would have to be rewritten. Second, Latin characters generally look ugly in full-width form.


For those reasons, the preferable normalization solution is to convert all Latin characters, symbols, punctuation, and numbers to their half-width ASCII forms and everything else to its full-width form.


Example code in Java for half-/full-width normalization is included in the appendix at the end (Java has this functionality built in, but the code is included for explanatory purposes). The code is simple and can easily be implemented in other languages that do not have Unicode normalization built in.


Unicode Standard Annex #15 defines normalization forms for Unicode text (covering much more than half-/full-width normalization). Without going into the details, the Unicode Normalization Form most useful for normalizing searches and the like is NFKC, as it normalizes both half-/full-width differences and the different compatibility equivalents of a single CJK character to the same string.


Built-in Normalization

Many computer languages and applications have the needed normalization built in. In Java, for example, which includes Unicode Normalization Form functionality, it is as easy as:


String normalized_string = java.text.Normalizer.normalize(unnormalized_string, java.text.Normalizer.Form.NFKC);


Microsoft offers similar functionality for Unicode normalization of strings, and IBM products such as its search offerings provide equivalent normalization. For example, in Japanese, a full-width alphanumeric character is normalized to the half-width character, a half-width Katakana character to the full-width character, and so on.
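
For example, running the Java one-liner above over a short mixed-width string (a small sketch; the sample text is just illustrative):

import java.text.Normalizer;

public class NFKCExample {
    public static void main(String[] args) {
        String mixed = "Ａｓｉａ ｱｼﾞｱ"; // full-width Latin plus half-width Katakana
        String normalized = Normalizer.normalize(mixed, Normalizer.Form.NFKC);
        // NFKC folds the Latin characters to half-width ASCII and the
        // Katakana to full-width, giving "Asia アジア"
        System.out.println(normalized);
    }
}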


Complexity of Normalization

Normalization of Unicode text can be quite complex. Take, for example, the simple space character and its many variants in Unicode:


U+0020  SPACE (ASCII; sometimes considered a control code)
U+2000-U+200A  other space characters
U+00A0  NO-BREAK SPACE
U+200B  ZERO WIDTH SPACE
U+2060  WORD JOINER
U+3000  IDEOGRAPHIC SPACE
U+FEFF  ZERO WIDTH NO-BREAK SPACE

For this and a large number of other reasons, I strongly recommend using the Unicode NFKC normalization form from a standard library whenever possible. Where that is not possible, please see the code sample below.
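
As a quick check of how NFKC treats the space variants listed above (as I understand the Unicode decomposition data, the no-break and ideographic spaces fold to a plain ASCII space, while the zero-width characters are left alone):

import java.text.Normalizer;

public class SpaceNormalizationCheck {
    public static void main(String[] args) {
        // no-break space, em space, zero width space, word joiner,
        // ideographic space, zero width no-break space
        char[] spaces = { '\u00a0', '\u2003', '\u200b', '\u2060', '\u3000', '\ufeff' };
        for (char c : spaces) {
            String n = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC);
            System.out.printf("U+%04X -> U+%04X%n", (int) c, (int) n.charAt(0));
        }
    }
}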


JAVA CODE


import java.util.Map;
import java.util.HashMap;
 
/**
 * Halfwidth and Fullwidth Character Normalization for CJK
 * http://solutions.asia
 *
 * See the Unicode Standard 6.0 – Halfwidth and Fullwidth Forms
 * http://unicode.org/charts/PDF/UFF00.pdf
 *
 * For Chinese, Japanese and Korean, some characters have Unicode mappings to
 * both a halfwidth and a fullwidth version. This code normalizes them
 * to halfwidth for Latin characters, numbers and punctuation, and fullwidth
 * for everything else.
 * Fine for half-/full-width normalization, but not fully equivalent to NFKC
 * normalization.
 */
public class CJKHalfFullWidthNormalize {
   
    private static final Map<Character, Character> charCodeMap;
    // Key — Original Character
    // Value — Replacement character
    static {
        charCodeMap = new HashMap<Character, Character>();
        // TO HALFWIDTH CHARACTERS
        // ASCII variants (Latin Symbols, Punctuation, Numbers, and Alphabet)
        for (char key = '\uff01'; key <= '\uff5e'; key++) {
            char value = (char) (key - '\ufee0');
            charCodeMap.put(key, value);
        }
        // Brackets
        charCodeMap.put('\uff5f', '\u2985'); // left white parenthesis
        charCodeMap.put('\uff60', '\u2986'); // right white parenthesis
        // Symbol Variants
        charCodeMap.put('\uffe0', '\u00a2'); // Cent sign
        charCodeMap.put('\uffe1', '\u00a3'); // Pound sign
        charCodeMap.put('\uffe2', '\u00ac'); // Not sign
        charCodeMap.put('\uffe3', '\u00af'); // Macron
        charCodeMap.put('\uffe4', '\u00a6'); // Broken Bar
        charCodeMap.put('\uffe5', '\u00a5'); // Yen sign
        charCodeMap.put('\uffe6', '\u20a9'); // Won sign
        // Space (strictly speaking not listed in Unicode 6.0 Halfwidth and
        // Fullwidth forms but including here as the ideographic space can
        // cause issues)
        charCodeMap.put('\u3000', '\u0020'); // SPACE
        // TO FULLWIDTH CHARACTERS
        // CJK punctuation
        charCodeMap.put('\uff61', '\u3002'); // ideographic full stop
        charCodeMap.put('\uff62', '\u300c'); // left corner bracket
        charCodeMap.put('\uff63', '\u300d'); // right corner bracket
        charCodeMap.put('\uff64', '\u3001'); // ideographic comma
        // Katakana variants
        charCodeMap.put('\uff65', '\u30fb'); // Middle Dot
        charCodeMap.put('\uff66', '\u30f2'); // Wo
        charCodeMap.put('\uff67', '\u30a1'); // A small
        charCodeMap.put('\uff68', '\u30a3'); // I small
        charCodeMap.put('\uff69', '\u30a5'); // U small
        charCodeMap.put('\uff6a', '\u30a7'); // E small
        charCodeMap.put('\uff6b', '\u30a9'); // O small
        charCodeMap.put('\uff6c', '\u30e3'); // Ya small
        charCodeMap.put('\uff6d', '\u30e5'); // Yu small
        charCodeMap.put('\uff6e', '\u30e7'); // Yo small
        charCodeMap.put('\uff6f', '\u30c3'); // Tsu small
        charCodeMap.put('\uff70', '\u30fc'); // Prolonged Sound Mark
        charCodeMap.put('\uff71', '\u30a2'); // A
        charCodeMap.put('\uff72', '\u30a4'); // I
        charCodeMap.put('\uff73', '\u30a6'); // U
        charCodeMap.put('\uff74', '\u30a8'); // E
        charCodeMap.put('\uff75', '\u30aa'); // O
        charCodeMap.put('\uff76', '\u30ab'); // Ka
        charCodeMap.put('\uff77', '\u30ad'); // Ki
        charCodeMap.put('\uff78', '\u30af'); // Ku
        charCodeMap.put('\uff79', '\u30b1'); // Ke
        charCodeMap.put('\uff7a', '\u30b3'); // Ko
        charCodeMap.put('\uff7b', '\u30b5'); // Sa
        charCodeMap.put('\uff7c', '\u30b7'); // Shi
        charCodeMap.put('\uff7d', '\u30b9'); // Su
        charCodeMap.put('\uff7e', '\u30bb'); // Se
        charCodeMap.put('\uff7f', '\u30bd'); // So
        charCodeMap.put('\uff80', '\u30bf'); // Ta
        charCodeMap.put('\uff81', '\u30c1'); // Chi
        charCodeMap.put('\uff82', '\u30c4'); // Tsu
        charCodeMap.put('\uff83', '\u30c6'); // Te
        charCodeMap.put('\uff84', '\u30c8'); // To
        charCodeMap.put('\uff85', '\u30ca'); // Na
        charCodeMap.put('\uff86', '\u30cb'); // Ni
        charCodeMap.put('\uff87', '\u30cc'); // Nu
        charCodeMap.put('\uff88', '\u30cd'); // Ne
        charCodeMap.put('\uff89', '\u30ce'); // No
        charCodeMap.put('\uff8a', '\u30cf'); // Ha
        charCodeMap.put('\uff8b', '\u30d2'); // Hi
        charCodeMap.put('\uff8c', '\u30d5'); // Hu
        charCodeMap.put('\uff8d', '\u30d8'); // He
        charCodeMap.put('\uff8e', '\u30db'); // Ho
        charCodeMap.put('\uff8f', '\u30de'); // Ma
        charCodeMap.put('\uff90', '\u30df'); // Mi
        charCodeMap.put('\uff91', '\u30e0'); // Mu
        charCodeMap.put('\uff92', '\u30e1'); // Me
        charCodeMap.put('\uff93', '\u30e2'); // Mo
        charCodeMap.put('\uff94', '\u30e4'); // Ya
        charCodeMap.put('\uff95', '\u30e6'); // Yu
        charCodeMap.put('\uff96', '\u30e8'); // Yo
        charCodeMap.put('\uff97', '\u30e9'); // Ra
        charCodeMap.put('\uff98', '\u30ea'); // Ri
        charCodeMap.put('\uff99', '\u30eb'); // Ru
        charCodeMap.put('\uff9a', '\u30ec'); // Re
        charCodeMap.put('\uff9b', '\u30ed'); // Ro
        charCodeMap.put('\uff9c', '\u30ef'); // Wa
        charCodeMap.put('\uff9d', '\u30f3'); // N
        charCodeMap.put('\uff9e', '\u3099'); // Voiced Sound Mark
        charCodeMap.put('\uff9f', '\u309a'); // Semi-Voiced Sound Mark
        // Hangul variants
        charCodeMap.put('\uffa0', '\u3164'); // Hangul Filler
        // Hangul First Range
        // KIYEOK to HIEUH
        for (char key = '\uffa1'; key <= '\uffbe'; key++) {
            char value = (char) (key - '\uce70');
            charCodeMap.put(key, value);
        }
        // Hangul Second Range
        // A to E
        for (char key = '\uffc2'; key <= '\uffc7'; key++) {
            char value = (char) (key - '\uce73');
            charCodeMap.put(key, value);
        }
        // Hangul Third Range
        // YEO to OE
        for (char key = '\uffca'; key <= '\uffcf'; key++) {
            char value = (char) (key - '\uce75');
            charCodeMap.put(key, value);
        }
        // Hangul Fourth Range
        // YO to YU
        for (char key = '\uffd2'; key <= '\uffd7'; key++) {
            char value = (char) (key - '\uce77');
            charCodeMap.put(key, value);
        }
        // More Hangul variants
        charCodeMap.put('\uffda', '\u3161'); // Hangul EU
        charCodeMap.put('\uffdb', '\u3162'); // Hangul YI
        charCodeMap.put('\uffdc', '\u3163'); // Hangul I
        // Symbol Variants
        charCodeMap.put('\uffe8', '\u2502'); // Forms Light Vertical
        charCodeMap.put('\uffe9', '\u2190'); // Leftwards Arrow
        charCodeMap.put('\uffea', '\u2191'); // Upwards Arrow
        charCodeMap.put('\uffeb', '\u2192'); // Rightwards Arrow
        charCodeMap.put('\uffec', '\u2193'); // Downwards Arrow
        charCodeMap.put('\uffed', '\u25a0'); // Black Square
        charCodeMap.put('\uffee', '\u25cb'); // White Circle
    }
   
    /**
     * Takes an unnormalized (halfwidth/fullwidth) string and prints the normalized string
     */
    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("Usage: java CJKHalfFullWidthNormalize <text>");
            return;
        }
        String unnormalized = args[0];
        System.out.println("Unnormalized:\t " + unnormalized);
        char[] buffer = unnormalized.toCharArray();
        int bufferLen = buffer.length;
        for (int i = 0; i < bufferLen; i++) {
            if (charCodeMap.containsKey(buffer[i])) {
                buffer[i] = charCodeMap.get(buffer[i]);
            }
        }
        System.out.println("Normalized:\t " + new String(buffer));
    }
   
}