Character Encoding

ASCII uses 7 bits. Extended ASCII also uses the 8th bit, but there is no single standard for it. Problems started when ASCII had too few bits to represent more characters.

How does Unicode work? It maps characters to code points (numbers). How code points are represented on disk is an independent concern, and different Unicode encodings do it differently. UTF-8, UTF-16 and UTF-32 are the most common encodings. Thanks to this separation of concerns, if there is an issue with a certain encoding (say UTF-32 takes too much space), it could be phased out without affecting the mapping from characters to code points. In a nutshell, you map all characters to code points, and then map code points to bytes using an encoding (like UTF-8, UTF-16 or UTF-32).
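To make the two-step mapping concrete, here is a minimal Java sketch. The class name CodePointVsEncoding and the helper hex are made up for illustration; the charset constants come from the standard library. It takes one character, prints its code point, and then prints the bytes that two different encodings produce for that same code point.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointVsEncoding {
    // Hex dump of the bytes a string produces under the given encoding.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u0905"; // DEVANAGARI LETTER A
        // Step 1: character -> code point (independent of any encoding).
        System.out.printf("code point: U+%04X%n", s.codePointAt(0));
        // Step 2: code point -> bytes (depends on the encoding chosen).
        System.out.println("UTF-8:    " + hex(s, StandardCharsets.UTF_8));    // E0 A4 85
        System.out.println("UTF-16BE: " + hex(s, StandardCharsets.UTF_16BE)); // 09 05
    }
}
```

The code point stays the same in both lines of output; only the byte representation changes.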

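Let's look at some terminology. A coded character set is a set of characters mapped to numbers, e.g. ASCII. A code point is a valid integer value in the Unicode code space (U+0000 to U+10FFFF); most of them, but not all, are assigned to characters. What is a code unit? It is the fixed-size chunk an encoding works in: a UTF-8 code unit is 8 bits, a UTF-16 code unit is 16 bits, and a UTF-32 code unit is 32 bits. A code point is encoded as one or more code units.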
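As a rough illustration of code units, the sketch below counts how many units one code point needs in each encoding. The class and method names are hypothetical. It relies on the fact that a Java String is stored as UTF-16 code units, so length() counts 16-bit units, and that UTF-32 always uses exactly one unit per code point.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitCount {
    // UTF-8 code units are 8 bits wide, so this is just the byte count.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Java strings are sequences of 16-bit UTF-16 code units.
    static int utf16Units(String s) {
        return s.length();
    }

    // UTF-32 uses one 32-bit unit per code point.
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                   // U+0041
            "\u0905",                              // U+0905
            new String(Character.toChars(0x10400)) // U+10400
        };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d%n",
                    s.codePointAt(0), utf8Units(s), utf16Units(s), utf32Units(s));
        }
    }
}
```

This prints 1/1/1 units for U+0041, 3/1/1 for U+0905 and 4/2/1 for U+10400.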

UTF32:- Fixed-size encoding of 4 bytes per code point. The problem is that text that would fit in ASCII now takes 4 times the space, so files become 4 times bigger. For example, code point 1 is stored as 00000000 00000000 00000000 00000001.
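A small sketch of the fixed-size idea: the hypothetical helper encodeUtf32Be below writes a code point as a 4-byte big-endian integer, which is what UTF-32BE without a byte order mark amounts to. It does not validate that the input is a legal code point.

```java
import java.nio.ByteBuffer;

class Utf32Sketch {
    // Encodes one code point as 4 big-endian bytes (no validation, no BOM).
    static byte[] encodeUtf32Be(int codePoint) {
        return ByteBuffer.allocate(4).putInt(codePoint).array();
    }

    public static void main(String[] args) {
        // 'A' is U+0041: one byte in ASCII, four bytes here.
        for (byte b : encodeUtf32Be('A')) {
            System.out.printf("%02X ", b);
        }
        System.out.println(); // prints 00 00 00 41
    }
}
```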

UTF16:- Variable-length encoding. A character is represented by one or two 16-bit code units. So for ASCII text your files will be 2 times as big at best (assuming every character needs just one 16-bit chunk). You also have to deal with endianness. A byte order mark (BOM) tells the reader which endianness was used. If the BOM is not present, the reader has to assume one. Since UTF-16 is variable length, how do we know whether a particular character is one chunk or two? Code points above U+FFFF are encoded as a surrogate pair: a high surrogate in the range 0xD800 to 0xDBFF followed by a low surrogate in the range 0xDC00 to 0xDFFF. These ranges are reserved and never stand for characters on their own, so they act as markers, much like the marker bits in UTF-8.
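The surrogate mechanism can be sketched by hand. The class and method names below are illustrative only; the arithmetic is the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves). Java's own Character.toChars does the same job and is the better choice in real code.

```java
class SurrogateSketch {
    // Splits a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
    // Input validation is omitted to keep the sketch short.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    // A code unit in 0xD800..0xDBFF announces that a second unit follows.
    static boolean startsPair(char unit) {
        return unit >= 0xD800 && unit <= 0xDBFF;
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10400);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D801 DC00
        System.out.println(startsPair(pair[0])); // true
        System.out.println(startsPair('A'));     // false
    }
}
```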

UTF8:- There was a need for compatibility with ASCII. Meaning, if I create a file using the ASCII character set and then read it using a Unicode encoding, it should still work. Here UTF-8 comes to the rescue. It is a variable-length encoding: a UTF-8 character can take 1, 2, 3 or 4 bytes. Since it is variable length, how do we know how many bytes a character occupies? Every byte carries marker bits in its most significant bit positions.

0xxxxxxx -> 1 byte unicode char
110xxxxx (leading byte) 10xxxxxx (continuation byte) -> 2 byte unicode char
1110xxxx (leading byte) 10xxxxxx (continuation byte) 10xxxxxx (continuation byte) -> 3 byte unicode char
11110xxx (leading byte) 10xxxxxx (continuation byte) 10xxxxxx (continuation byte) 10xxxxxx (continuation byte) -> 4 byte unicode char
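The patterns above translate directly into bit masks. The following sketch (class and method names are hypothetical) reads a leading byte and reports how long the sequence claims to be. It only checks the marker bits; it does not reject leading bytes that a strict UTF-8 decoder would refuse for other reasons.

```java
class Utf8Length {
    // Returns the sequence length announced by a leading byte,
    // or -1 for a continuation byte or an unknown pattern.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;              // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // 10xxxxxx or 11111xxx
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x41)); // 1
        System.out.println(sequenceLength((byte) 0xE0)); // 3
        System.out.println(sequenceLength((byte) 0x85)); // -1 (continuation byte)
    }
}
```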

How many possible code points could exist? Let's take a character that needs 4 bytes in UTF-8. It will be of the form 11110xxx 10yyyyyy 10zzzzzz 10mmmmmm. The code point is xxxyyyyyyzzzzzzmmmmmm, which is 21 bits; the other bits are all marker bits. So the 4-byte form has room for 2^21 values. Unicode, however, caps the code space at U+10FFFF (1,114,112 code points), the highest value UTF-16 can reach, so not every 21-bit value is a valid code point.
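The gap between what 21 payload bits can hold and what Unicode actually defines is easy to check. In this short sketch the class and method names are made up; Character.MAX_CODE_POINT is the standard Java constant for U+10FFFF.

```java
class CodePointLimits {
    // Number of distinct values that fit in the 21 payload bits of a 4-byte sequence.
    static int payloadCapacity() {
        return 1 << 21;
    }

    // Number of code points in the Unicode code space U+0000..U+10FFFF.
    static int codeSpaceSize() {
        return Character.MAX_CODE_POINT + 1;
    }

    public static void main(String[] args) {
        System.out.println(payloadCapacity()); // 2097152
        System.out.println(codeSpaceSize());   // 1114112
    }
}
```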

There is an important reason why continuation bytes are marked. If you connect to a stream of bytes midway, the first byte you see might be a continuation byte, so you cannot start decoding from there. You need to skip ahead until a new character starts.
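Here is one way this resynchronization could look, as a sketch with hypothetical names. It skips bytes of the form 10xxxxxx until it reaches a byte that can start a character.

```java
import java.nio.charset.StandardCharsets;

class Utf8Resync {
    // True for bytes of the form 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Index of the first byte at or after 'from' that can start a character.
    static int nextBoundary(byte[] stream, int from) {
        int i = from;
        while (i < stream.length && isContinuation(stream[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // U+0905 followed by 'B' encodes as E0 A4 85 42.
        byte[] bytes = "\u0905B".getBytes(StandardCharsets.UTF_8);
        // Pretend we joined the stream at index 1, in the middle of the first character.
        System.out.println(nextBoundary(bytes, 1)); // 3, the index of 'B'
    }
}
```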

Up to code point 127, where UTF-8 uses a single byte, there is no difference between the ASCII mapping and the Unicode mapping. This means that if you pass a standard ASCII file to a UTF-8 reader, it will read it correctly. Note: it will NOT be able to read an extended ASCII file correctly, because bytes with the 8th bit set are not valid single-byte characters in UTF-8.
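Both halves of this claim can be tried out in a few lines. The sketch below uses hypothetical helper names and picks ISO-8859-1 as one example of an "extended ASCII" encoding, which is an assumption on my part since there are many such encodings. Java's String constructor substitutes the replacement character U+FFFD for bytes it cannot decode.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiCompat {
    // True if the text produces identical bytes in ASCII and in UTF-8.
    static boolean sameBytes(String text) {
        return Arrays.equals(
                text.getBytes(StandardCharsets.US_ASCII),
                text.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes raw bytes as UTF-8; malformed input becomes U+FFFD.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(sameBytes("hello")); // true

        // U+00E9 is the single byte 0xE9 in ISO-8859-1.
        byte[] latin1 = "\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        String decoded = decodeAsUtf8(latin1);
        // The lone byte is not valid UTF-8, so the original character is lost.
        System.out.println(decoded.equals("\u00E9")); // false
    }
}
```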

Decoding UTF-8
Say we had a UTF-8 character represented with two code units, 110xxxxx 10yyyyyy. You strip the marker bits and join the remaining payload bits. The result is the code point xxxxxyyyyyy. Now look up the Unicode mapping to see which character xxxxxyyyyyy stands for.
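The strip-and-merge step for the two-byte case can be written with two masks and a shift. This is a minimal sketch with made-up names; it assumes the caller already knows the input is a well-formed two-byte sequence.

```java
class Utf8DecodeTwo {
    // Decodes a two-byte sequence 110xxxxx 10yyyyyy into its code point.
    static int decodeTwoBytes(byte lead, byte continuation) {
        int x = lead & 0x1F;         // keep the 5 payload bits of the leading byte
        int y = continuation & 0x3F; // keep the 6 payload bits of the continuation byte
        return (x << 6) | y;         // xxxxxyyyyyy
    }

    public static void main(String[] args) {
        // C3 A9 is 11000011 10101001, which decodes to U+00E9.
        int codePoint = decodeTwoBytes((byte) 0xC3, (byte) 0xA9);
        System.out.printf("U+%04X%n", codePoint); // U+00E9
    }
}
```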

Encoding UTF-8
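Say you want to encode अ, which has code point U+0905. Note that 0905 is hexadecimal (2309 in decimal). In binary it is 1001 0000 0101, which is 12 bits. A two-byte sequence carries only 11 payload bits, so we need the three-byte form 1110xxxx 10yyyyyy 10zzzzzz, which carries 16. Pad the value with leading zeros to 16 bits and split it as 0000 100100 000101. Let's populate the trailing byte first with the lowest six bits: 10-000101. The middle byte takes the next six bits: 10-100100. The leading byte takes the remaining four bits: 1110-0000. So the UTF-8 encoding for अ is 11100000 10100100 10000101 (E0 A4 85). Here is the Unicode standard for Devanagari.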
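Hand encoding is easy to get wrong, so it helps to cross-check against code. The sketch below is a hypothetical encoder that covers all four sequence lengths; it does not validate its input (surrogates and values above U+10FFFF are not rejected). For U+0905 it produces the bytes E0 A4 85, the same as Java's built-in UTF-8 encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Encode {
    // Encodes one code point as UTF-8. No input validation.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] {(byte) cp};                  // 0xxxxxxx
        }
        if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                  // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))};               // 10xxxxxx
        }
        if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                 // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),                     // 11110xxx
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        byte[] mine = encode(0x0905);
        byte[] reference = "\u0905".getBytes(StandardCharsets.UTF_8);
        for (byte b : mine) {
            System.out.printf("%02X ", b);
        }
        System.out.println();                               // E0 A4 85
        System.out.println(Arrays.equals(mine, reference)); // true
    }
}
```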

Characters in Java
Java characters are 2 bytes. Yes, even an ASCII character that could have taken just one byte will take two bytes in memory. Advantage? Well, a single char in Java can hold any code point from 0 to 2^16 - 1 (U+0000 to U+FFFF). That means a single char can represent a lot of characters, including many non-Latin scripts. What happens to code points above that range (e.g. Deseret, LONG I U+10400)? We use two char values to represent them. Read this document to know more about representing supplementary characters. In Java, characters in the range U+10000 to U+10FFFF are called supplementary characters. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP).
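A short sketch of what this means in practice, using only standard methods on String and Character (the class name and the helper charsNeeded are made up). The point to notice is that length() counts char values, not characters, so the two differ for supplementary characters.

```java
class JavaCharSketch {
    // Number of Java char values needed to hold the given code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        String bmp = "\u0905";                                         // inside the BMP
        String supplementary = new String(Character.toChars(0x10400)); // outside the BMP

        System.out.println(bmp.length());           // 1
        System.out.println(supplementary.length()); // 2 (a surrogate pair)
        // codePointCount counts characters rather than char values.
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
        System.out.println(Character.isSupplementaryCodePoint(0x10400));             // true
        System.out.println(charsNeeded(0x0905) + " " + charsNeeded(0x10400));        // 1 2
    }
}
```

When walking through a string character by character, codePointAt together with Character.charCount (or the codePoints() stream) avoids splitting a surrogate pair in half.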

Here is example code that writes hello world in Devanagari script to a text file.

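import java.io.*;
import java.nio.charset.StandardCharsets;

public class HindiTest {
    public static void main(String[] args) throws IOException {
        // Hello World in Devanagari script.
        String content = "\u0939\u0947\u0932\u094B \u0935\u0932\u0921";
        File newTextFile = new File("hinditest.txt");
        // try-with-resources flushes and closes the writer, even on failure.
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(newTextFile), StandardCharsets.UTF_8))) {
            out.append(content).append("\r\n");
        }
    }
}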

Here is an awesome lecture series on Unicode.