Does PHP really need Unicode support?

Published on 02.10.2014, by lubosdz

Bits of history

PHP development has drawn plenty of talk and criticism over its missing, long-awaited native Unicode (UTF-8/16/32) support. This article presents a few thoughts of my own that I have gathered on the topic over time.

In my early programming years, some 10 years ago, I was pretty frustrated with the complexity of handling accented characters. Along with other folks, I wondered why PHP does not support Unicode - it would make simple things like splitting multibyte strings so much easier.

Well, after more years, more debates and reading PHP internals, I am backing off. Unicode support is still not built in, and PHP 6 is viewed as a failed attempt (not quite fairly, since most of the nice features planned for PHP 6 were ported into PHP 5.3 and 5.4).

The PHP core developers are professionals, and over time they have proved to make the right decisions, even though sometimes against the majority of userland developers. From 2005 to 2008 significant effort went into researching possible Unicode implementations. In the end, tests showed a huge increase in memory usage and a significant performance loss. Scripts ported to Unicode were often broken, migrating existing PHP extensions proved too difficult, and there were issues with plurals, currency, date formats etc. All in all, development uncovered many previously unknown issues that blocked the whole PHP 6 Unicode effort.

That's what happened behind the scenes, mainly during 2006 - 2008.

Developer's point of view

Now let's take a look from a developer's perspective.

As a programmer, I often deal with accented characters - German, Czech, Slovak, sometimes Russian. After 10 years of practical coding, I estimate that only about 2 - 3% of programming code deals with localized strings. Localization is often handled by storing strings in a database - so PHP acts just as a middle tier that passes the bytes through and does not need Unicode. 95% of applications use UTF-8 nowadays, and most localized solutions consist simply of translating huge arrays of strings from the source language (English) into the local language.

  • So - does PHP really need native Unicode support? I don't think so.

  • Do we need to call PHP functions by localized names, e.g. prinť(){ ... } ? No, PHP will always stay written in English. And if one enjoys localized names, one can still write one's own localized function and method names in a UTF-8 encoded script, e.g. Myclass::nájdiČísloTřicetPět(){ ... }.
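
To illustrate the second point, here is a minimal sketch of a user-defined method with an accented name. It works because PHP accepts bytes above 0x7F in identifiers, so a script saved as UTF-8 can use such names. The class body and the parameter name are my own assumptions for illustration; only the class and method names come from the text above.

```php
<?php
// Illustrative only: this file must be saved as UTF-8.
class Myclass
{
    // "nájdiČísloTřicetPět" roughly means "find number thirty-five".
    // The parameter $čísla ("numbers") is a hypothetical name.
    public static function nájdiČísloTřicetPět(array $čísla): bool
    {
        // Strict comparison, so the string "35" does not match.
        return in_array(35, $čísla, true);
    }
}

var_dump(Myclass::nájdiČísloTřicetPět([7, 35, 42])); // bool(true)
```

Note that PHP compares such names byte by byte, so the case-insensitivity of function names applies only to the ASCII letters.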

Don't get me wrong - I still think it would be a GREAT feature, but it's just not worth implementing at the mentioned costs.

Performance first, localization comes after. The real benefits to a minority of users would be outweighed by the problems caused to the majority.

Personally, I don't mind writing those few mb_* functions (from the mbstring extension) when occasionally splitting a localized string or calculating string length.
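
For readers who have not hit the problem yet, a short sketch of the difference between the byte-oriented string functions and their mb_* counterparts. It assumes the mbstring extension is loaded and the script is saved as UTF-8; the sample word is my own choice.

```php
<?php
// Assumes the mbstring extension and a UTF-8 encoded source file.
$word = 'Čísla'; // Czech/Slovak for "numbers": 5 characters, 7 bytes in UTF-8

$bytes = strlen($word);                    // counts bytes: 7
$chars = mb_strlen($word, 'UTF-8');        // counts characters: 5

$broken = substr($word, 0, 3);             // cuts "í" in half, result is invalid UTF-8
$prefix = mb_substr($word, 0, 2, 'UTF-8'); // "Čí"

// mb_str_split() exists since PHP 7.4; on older versions
// preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) gives the same result.
$letters = mb_str_split($word, 1, 'UTF-8'); // ["Č", "í", "s", "l", "a"]

printf("%d bytes, %d characters, prefix %s\n", $bytes, $chars, $prefix);
```

The plain functions are still the right tool whenever the string is only passed along or compared as a whole, which is the common case described above.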

Conclusion (I.)

From practical experience, only circa 2 - 3% of coding deals with localization.

There are other possible solutions to localization, like:

  • storing accented strings in databases with multibyte support
  • storing accented strings in a single array in UTF-8 encoded PHP scripts
  • using a framework's solution for localization (*.po files, collecting strings etc.)
  • PHP's l10n/i18n facilities and the intl extension, which provide a partial solution to plural, message, date, currency etc. formatting issues
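
The array approach from the list above can be sketched in a few lines. This is a simplified illustration, not how any particular framework does it: the catalogue contents and the helper name t() are assumptions of mine.

```php
<?php
// Hypothetical message catalogue: keys are English source strings,
// values are Slovak translations. The file is saved as UTF-8.
$messages = [
    'Hello, {name}!' => 'Ahoj, {name}!',
    'Search'         => 'Hľadať',
];

// t() is an illustrative helper name, not a PHP built-in.
function t(string $source, array $params, array $messages): string
{
    // Fall back to the English source when no translation exists.
    $text = $messages[$source] ?? $source;

    // Build a replacement map such as ['{name}' => 'Jana'].
    $pairs = [];
    foreach ($params as $key => $value) {
        $pairs['{' . $key . '}'] = (string) $value;
    }

    return $pairs ? strtr($text, $pairs) : $text;
}

echo t('Hello, {name}!', ['name' => 'Jana'], $messages), "\n"; // Ahoj, Jana!
echo t('Logout', [], $messages), "\n";                         // Logout
```

Nothing here needs to know about characters: the translated strings are looked up and printed as opaque byte sequences, which is why this style of localization works without native Unicode support.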

If native Unicode (UTF-8/16/32) support is ever added to PHP, it might happen at a time when processors deliver such performance that doubling the response time of a page load won't matter at all :-) Or some genius will come up with a revolutionary solution ...

Conclusion (II.)

So, naming the next release PHP 7 (and not PHP 6) was the correct decision :-) There are already books referring to PHP 6 and web hosts offering PHP 6 hosting ... this might cause confusion among people not following PHP's evolution.
