A few days ago I was at FOSDEM, the Free and Open Source Software Developers' European Meeting. One of the talks there, by Andrei Zmievski, was about PHP 6 and the Unicode support it promises. I have long been waiting for Unicode support in PHP, because past projects taught me the limitations we currently have to work around. So I went into the talk expecting to hear that PHP could now speak Unicode natively.
But to my surprise they had implemented much more. Not only will PHP 6 speak and understand Unicode, it will also understand locales and even full language definitions, and convert between them, all wrapped in an object-oriented way of working.
Let me first explain a bit about text encoding and why you should be annoyed by it. The most basic encoding we currently have is ASCII, which consists of 128 characters. These are the most basic characters you will use, and they are in no way complete for any language. Because ASCII was so limited, extended sets were created. However, these extended sets were created by multiple companies and groups, resulting in the mess that is Latin-1, Windows-1252 and Mac Roman. The problem arises because plain text doesn't have a header that declares its encoding; the bytes just start. So to detect an encoding, a program has to read part of the file and guess. In the case of the extended ASCII encodings this quite often fails, because they are for the most part the same. On the web we see this phenomenon quite often: certain characters on a page look "weird" because your browser thinks the text is Latin-1 or Windows-1252, while it actually was Mac Roman.
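To make the guessing problem concrete, here is a minimal sketch in Python (chosen only as a neutral illustration; it is not PHP 6 code). The word "café" and the variable names are made up for the example. The same bytes, written as Mac Roman, turn into something else when the reader assumes Windows-1252:

```python
# The bytes carry no label, so the reader has to pick an encoding.
original = "caf\u00e9"                   # "café"
raw = original.encode("mac_roman")       # what ends up in the file

as_mac = raw.decode("mac_roman")         # the right guess
as_windows = raw.decode("cp1252")        # a plausible but wrong guess

print(raw)          # the ASCII part "caf" is identical in both encodings
print(as_mac)
print(as_windows)   # only the last, non-ASCII character comes out wrong
```

Only the byte above 127 is affected, which is exactly why automatic detection has so little to go on.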
To solve this mess and some other problems with text, like Asian characters, Unicode was invented. Unicode in itself is not a text encoding but an abstraction. Basically, Unicode assigns a number (a code point) to every character known to man, currently about 100,000 of them. Text encodings like UTF-8 and UTF-16 then map those code points to bytes, each in its own way. UTF-8 works in units of 8 bits and uses one to four of them per code point, while UTF-16 works in units of 16 bits and uses one or two. These encodings (and there are many more) differ in how efficient they are for different texts, but for Western languages UTF-8 is currently the most efficient and the most widely supported.
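The difference in size is easy to see in a small sketch. This is Python used purely for illustration, and the four sample characters are my own choice, not something from the talk:

```python
# One code point each: ASCII letter, accented letter, euro sign, musical G clef.
samples = ["a", "\u00e9", "\u20ac", "\U0001D11E"]

sizes = {}
for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")   # "-le" avoids the byte order mark
    sizes[ch] = (len(utf8), len(utf16))
    print(f"U+{ord(ch):04X}: {len(utf8)} byte(s) in UTF-8, {len(utf16)} in UTF-16")
```

Plain ASCII text costs one byte per character in UTF-8 and two in UTF-16, which is where the efficiency for Western languages comes from.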
Now to jump back to PHP. You might think PHP supports UTF-8, but in reality it doesn't. When Unicode was set up, they made sure the first range of characters was the same as ASCII, and because of the way Unicode numbers are expressed in UTF-8, ASCII text is byte-for-byte identical in UTF-8. PHP simply treats a string as a sequence of bytes, so it doesn't even know it's dealing with UTF-8. This of course goes horribly wrong as soon as you go outside the scope of ASCII.
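What "horribly wrong" looks like can be sketched with a byte string. This is Python, with a bytes object standing in for a byte-oriented string; the sample word is an assumption for the example and none of this is PHP's actual API:

```python
text = "na\u00efve"               # "naïve": 5 characters
data = text.encode("utf-8")       # the same text as raw UTF-8 bytes

char_count = len(text)            # counts characters
byte_count = len(data)            # counts bytes, one more than the characters

# Cutting after "3 characters" by counting bytes splits the ï in half.
first_three_bytes = data[:3]
try:
    first_three_bytes.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(char_count, byte_count, cut_is_valid)
```

Any length, substring or reverse operation that counts bytes has the same problem once a character takes more than one byte.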
Now in PHP 6 a string will basically have two modes: Unicode and binary blob. When it's in Unicode mode it will contain the actual Unicode code points for the text you gave it. And they have created a wide range of objects to do all kinds of crazy stuff with that Unicode.
Firstly they added a text iterator which, as you might guess, will iterate over your text. It can do this in multiple modes: per Unicode code point, per character, per word, per line and per sentence. The difference between a Unicode code point and a character is that an actual character can be made up of multiple code points. This is called composition. For instance U+0041 (A) + U+030A (combining ring above) = U+00C5 (Å).
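Composition can be tried out with Unicode normalization. The sketch below uses Python's standard unicodedata module, only as an illustration of the Unicode rule itself, not of the PHP 6 text iterator:

```python
import unicodedata

decomposed = "A\u030A"                                   # A + combining ring above
composed = unicodedata.normalize("NFC", decomposed)      # compose where possible
back_again = unicodedata.normalize("NFD", composed)      # and decompose again

print(len(decomposed), len(composed))                    # code points, not characters
print(f"U+{ord(composed):04X}")
```

Both forms look like the same single character on screen, which is why iterating per code point and iterating per character are different things.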
The text iterator will become the new way to work with text in PHP; because of how it's implemented it will be useful for all kinds of text manipulations like extracting pieces of text and such. So you could just get the first two sentences, instead of using substr and hoping for the best.
Secondly, PHP will support ICU locales. Locales are basically a definition of local differences in text, numbers, time zones, etc. They are defined on the level of language, script, country and variant. So you could for instance define "nl_NL" for the Dutch language as used in the Netherlands. That identifier should be read as <language>_<country>, but a full definition can contain <language>_<script>_<country>_<variant>@<keywords>. The language is obvious, as is the country. The script denotes which set of characters is going to be used, variants define small differences (like the euro vs. guilder difference), and the keywords can be used to define exceptions to the chosen locale. Once this is defined, PHP will do things correctly based on that locale.
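To show how such an identifier breaks down, here is a rough sketch of a parser in Python. The function parse_locale_id is hypothetical and heavily simplified (it is not ICU's or PHP's parser), it assumes keywords are written as key=value pairs separated by semicolons, and the "PREEURO" variant in the usage lines is only an illustrative name for the euro/guilder case:

```python
def parse_locale_id(locale_id):
    """Split a <language>_<script>_<country>_<variant>@<keywords> identifier.

    Hypothetical, simplified helper for illustration only.
    """
    base, _, keyword_part = locale_id.partition("@")
    fields = [f for f in base.split("_") if f]
    parts = {"language": fields[0], "script": None, "country": None,
             "variant": None, "keywords": {}}

    for field in fields[1:]:
        is_script = len(field) == 4 and field.isalpha()
        is_country = (len(field) == 2 and field.isalpha()) or \
                     (len(field) == 3 and field.isdigit())
        seen_later = parts["country"] or parts["variant"]
        if is_script and not parts["script"] and not seen_later:
            parts["script"] = field
        elif is_country and not seen_later:
            parts["country"] = field
        elif parts["variant"] is None:
            parts["variant"] = field
        else:
            parts["variant"] += "_" + field

    # Assumed keyword syntax: key=value pairs separated by ";".
    for pair in filter(None, keyword_part.split(";")):
        key, _, value = pair.partition("=")
        parts["keywords"][key] = value
    return parts


dutch = parse_locale_id("nl_NL")
serbian = parse_locale_id("sr_Latn_RS")
german = parse_locale_id("de_DE@collation=phonebook")
old_dutch = parse_locale_id("nl_NL_PREEURO")   # illustrative variant name
print(dutch, serbian, german, old_dutch, sep="\n")
```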
Think for instance about sorting. You would normally think A goes before B, then C and so on. But when you introduce characters that fall outside the simple alphabet, like "Ä", it becomes a bit more tricky, and when you also add "Å" it becomes impossible to sort a text without knowing the local rules of sorting. Because yes: how this should be sorted differs per locale. For instance, in Germany "Ä" goes right after the normal "A", but in Sweden "Ä" goes after "Z". When the locale is set, PHP will automatically do this kind of stuff correctly.
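The German and Swedish rules can be imitated with two hand-written sort keys. This is a toy sketch in Python: the word list, the key functions and the hard-coded alphabet are all assumptions for the example, and a real implementation would use ICU's collation data rather than anything like this:

```python
import unicodedata

words = ["zon", "\u00e4pple", "apa"]        # "zon", "äpple", "apa"

def german_like_key(word):
    # Treat "ä" like a plain "a" by dropping the combining marks.
    decomposed = unicodedata.normalize("NFD", word.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Swedish alphabet: å, ä and ö are separate letters that come after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_like_key(word):
    letters = unicodedata.normalize("NFC", word.casefold())
    return [SWEDISH_ALPHABET.index(c) if c in SWEDISH_ALPHABET
            else len(SWEDISH_ALPHABET) + ord(c)
            for c in letters]

german_order = sorted(words, key=german_like_key)
swedish_order = sorted(words, key=swedish_like_key)
print(german_order)    # the ä-word lands among the a-words
print(swedish_order)   # the ä-word lands after the z-word
```

The same three words come out in two different orders, and neither order is wrong.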
Finally, converting text is going to be a lot better. For instance, strtoupper will transform text correctly, so the text "fußball" will become "FUSSBALL". Even more impressive, you can convert text between scripts. So if you want to convert Japanese katakana to Latin (romaji), it will do this correctly for you.
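The "fußball" case is a nice one to try, because the uppercase form is longer than the original. A small sketch in Python, whose built-in str.upper follows the Unicode case mapping, next to a byte-wise uppercase that only knows ASCII (a stand-in for a byte-oriented function, not PHP's actual implementation):

```python
lower = "fu\u00dfball"                        # "fußball"

# Unicode-aware: ß has no single uppercase letter here and becomes "SS".
upper = lower.upper()

# ASCII-only: works on bytes and leaves everything above 127 untouched.
ascii_only = lower.encode("utf-8").upper().decode("utf-8")

print(upper, len(lower), len(upper))
print(ascii_only)
```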
When PHP 6 becomes stable, working with different languages and locales will become much more powerful and much easier. As someone who has personally felt the frustration of doing Unicode stuff in PHP 5, I can't wait until it's here.