{"id":1130,"date":"2013-03-27T07:01:34","date_gmt":"2013-03-27T07:01:34","guid":{"rendered":"http:\/\/www.softwareeverydayblog.com\/?p=1130"},"modified":"2013-03-31T22:47:14","modified_gmt":"2013-03-31T22:47:14","slug":"unicode-2","status":"publish","type":"post","link":"https:\/\/www.softwareeverydayblog.com\/?p=1130","title":{"rendered":"Character Encoding"},"content":{"rendered":"<p>ASCII uses 7 bits. Extended ASCII uses the 8th bit as well, but that&#8217;s NOT a single standard: many incompatible 8-bit extensions exist. Problems started when ASCII had too few bits to represent more characters.<\/p>\n<p>How does Unicode work? It maps characters to code points (numbers). How code points are represented on disk is an independent concern, and different Unicode encodings do it differently. UTF-8, UTF-16 and UTF-32 are the most common encodings. Thanks to this separation of concerns, if there is an issue with a certain encoding (say UTF-32 takes too much space), it could be phased out without affecting the mapping from characters to code points. In a nutshell, you <strong>map all chars to numbers or code points<\/strong> and <strong>map code points to an encoding<\/strong> (like UTF-8, UTF-16 or UTF-32).<\/p>\n<p>Let&#8217;s look at some terminology. A coded character set is a set of characters mapped to numbers, e.g. ASCII. A code point is an integer value in the Unicode code space (U+0000 to U+10FFFF) that can be mapped to a character. What is a code unit? It is the smallest fixed-size chunk of bits an encoding works with: UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units and UTF-32 uses 32-bit code units. A single code point is encoded as one or more code units.<\/p>\n<p><strong>UTF32<\/strong>:- Fixed-size encoding of 4 bytes per code point. The problem is that text which fits in ASCII now takes 4 times the space, so files become 4 times bigger. For example, code point 1 in big-endian byte order is 00000000 00000000 00000000 00000001.<\/p>\n<p><strong>UTF16<\/strong>:- Variable-length encoding. A character is represented by one or two chunks of 16 bits. So, at best, an ASCII file becomes 2 times as big (assuming every character needs just one 16-bit chunk). You also have to deal with problems like endianness. 
<strong>Byte order marks<\/strong> help us determine endianness. If the byte order mark is not present, a reader has to assume some endianness. Since UTF-16 is variable length, how do we know whether a particular char is two chunks or one? It uses reserved ranges of 16-bit values called surrogates: a code unit in the range D800 to DBFF (high surrogate) is followed by one in the range DC00 to DFFF (low surrogate), and together they encode one code point above U+FFFF. Any other code unit is a complete character on its own.<\/p>\n<p><strong>UTF8<\/strong>:- There was a need for compatibility with ASCII. Meaning, if I create a file using the ASCII character set and then read it using a Unicode encoding, it should work. UTF-8 comes to the rescue. It is a variable-length encoding: a UTF-8 character can have 1, 2, 3 or 4 bytes. Since it&#8217;s variable length, <strong>how do we know how many bytes a character occupies<\/strong>? Every byte has some marker bits set in its most significant bit positions.<\/p>\n<p>0xxxxxxx -> 1-byte Unicode char<br \/>\n110xxxxx (leading byte) 10xxxxxx (continuation byte) -> 2-byte Unicode char<br \/>\n1110xxxx (leading byte) 10xxxxxx (continuation byte) 10xxxxxx (continuation byte) -> 3-byte Unicode char<br \/>\n11110xxx (leading byte) 10xxxxxx (continuation byte) 10xxxxxx (continuation byte) 10xxxxxx (continuation byte) -> 4-byte Unicode char<\/p>\n<p><strong>How many possible code points could exist<\/strong>? Let&#8217;s take a character that needs 4 bytes to be represented in UTF-8. It will be of the form 11110xxx 10yyyyyy 10zzzzzz 10mmmmmm. The code point is xxxyyyyyyzzzzzzmmmmmm, which is 21 bits; all other bits are marker bits. So the 4-byte form has room for 2^21 values. Unicode itself, however, caps the code space at U+10FFFF (1,114,112 code points), so not every 21-bit pattern is a valid code point.<\/p>\n<p>There is an <strong>important reason why continuation bytes are marked<\/strong>. Assuming you connect to a stream of bytes, you might come across a continuation byte first, so you cannot start reading from there. You need to skip ahead until a new character starts.<\/p>\n<p>Up to code point 127, which is the single-byte representation, there is no difference between the ASCII mappings and the Unicode mappings. 
This means that <strong>if you pass a standard ASCII file to a UTF-8 reader it will read it correctly<\/strong>. Note: it will NOT be able to read an extended ASCII file correctly, because a byte above 127 is never a single-byte character in UTF-8.<\/p>\n<p><strong>Decoding UTF-8<\/strong><br \/>\nSay we had a UTF-8 char represented with two code units 110xxxxx 10yyyyyy. You strip the marker bits and merge the code point bits. The result is the code point xxxxxyyyyyy. Now go to the Unicode mapping and check what character xxxxxyyyyyy stands for.<\/p>\n<p><strong>Encoding UTF-8<\/strong><br \/>\nSay you want to encode \u0905, which has code point U+0905. Note that 905 is hexadecimal (2309 in decimal); in binary it is 100100000101. That is 12 bits, more than the 11 bits a 2-byte sequence can hold, so we need the 3-byte form 1110xxxx 10yyyyyy 10zzzzzz, which holds 16 bits. Pad the value to 16 bits: 0000 100100 000101. Populate the trailing byte first: 10-000101. The middle byte is 10-100100 and the leading byte is 1110-0000. So the UTF-8 encoding for \u0905 is 11100000 10100100 10000101 (E0 A4 85 in hex). Here is the <a href=\"http:\/\/www.unicode.org\/charts\/PDF\/U0900.pdf\" title=\"Devanagri Unicode\" target=\"_blank\">Unicode chart for Devanagari<\/a>.<\/p>\n<p><strong>Characters in Java<\/strong><br \/>\nA Java char is 2 bytes. Yes, even an ASCII char that could have taken just one byte takes two bytes in memory. Advantage? Well, a single char in Java can hold code points from 0 to 2^16 - 1 (U+FFFF). That means a single char can represent a lot of characters (even non-Latin ones). What happens to code points above U+FFFF (e.g. Deseret LONG I, U+10400)? Two char values (a surrogate pair) are used to represent them. <a href=\"http:\/\/www.oracle.com\/technetwork\/articles\/javase\/supplementary-142654.html\" title=\"supplementary characters\" target=\"_blank\">Read this document<\/a> to know more about representing supplementary characters. In Java, characters in the range U+10000 to U+10FFFF are called supplementary characters. 
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP).<\/p>\n<p>Here is example code that writes hello world in Devanagari script to a text file.<\/p>\n<pre lang=\"java\">\r\nimport java.io.BufferedWriter;\r\nimport java.io.File;\r\nimport java.io.FileOutputStream;\r\nimport java.io.OutputStreamWriter;\r\nimport java.io.Writer;\r\n\r\n\/\/ Hello World in Devanagari script.\r\nString content = \"\\u0939\\u0947\\u0932\\u094B \\u0935\\u0932\\u0921\";\r\nFile newTextFile = new File(\"hinditest.txt\");\r\n\/\/ Name the charset explicitly; try-with-resources flushes and closes the writer.\r\ntry (Writer out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(newTextFile), \"UTF-8\"))) {\r\n    out.append(content).append(\"\\r\\n\");\r\n}\r\n<\/pre>\n<p><a href=\"http:\/\/www.youtube.com\/watch?v=B1Sf1IhA0j4&#038;list=PLhQN_EIoIKBRA0yVTsWDoJzEKZwJY0p3l\" title=\"Unicode lecture series\" target=\"_blank\">Here is an awesome lecture series on Unicode<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>ASCII uses 7 bits, extended ASCII allows you to use the 8th bit, but that&#8217;s NOT a standard. Problems started when ASCII had too few bits to represent more characters. How does UNICODE work? Map characters to code points (or numbers). Representation of code points to disk is an independent concern. 
Different Unicode encodings [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1130","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/www.softwareeverydayblog.com\/index.php?rest_route=\/wp\/v2\/posts\/1130","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.softwareeverydayblog.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.softwareeverydayblog.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.softwareeverydayblog.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.softwareeverydayblog.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1130"}],"version-history":[{"count":49,"href":"https:\/\/www.softwareeverydayblog.com\/index.php?rest_route=\/wp\/v2\/posts\/1130\/revisions"}],"predecessor-version":[{"id":1191,"href":"https:\/\/www.softwareeverydayblog.com\/index.php?rest_route=\/wp\/v2\/posts\/1130\/revisions\/1191"}],"wp:attachment":[{"href":"https:\/\/www.softwareeverydayblog.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1130"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.softwareeverydayblog.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1130"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.softwareeverydayblog.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1130"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}