Exploring PHP character sets
PHP as we see it in version 5 knows very little about character sets. As we have seen previously, using UTF-8 means that the things that people see as characters may be one, two or three bytes long in simple cases, and four bytes long for less common characters such as those outside the Basic Multilingual Plane. But when PHP looks at a string using something like the strlen function, the only thing it counts is bytes. The length returned by strlen for a single UTF-8 character could therefore be anywhere from 1 to 4.
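A minimal sketch of this byte counting follows. The sample characters and the $samples array are chosen purely for illustration, and each one is written with explicit byte escapes so that the result does not depend on how the source file itself is encoded.

```php
<?php
// Illustrative samples only: each value is a single character
// spelled out as its raw UTF-8 bytes.
$samples = array(
    'A'       => "\x41",              // U+0041, one byte
    'e-acute' => "\xC3\xA9",          // U+00E9, two bytes
    'euro'    => "\xE2\x82\xAC",      // U+20AC, three bytes
    'emoji'   => "\xF0\x9F\x98\x80",  // U+1F600, four bytes
);

foreach ($samples as $name => $char) {
    // strlen() reports the number of bytes, not the number of characters.
    echo $name, ': ', strlen($char), " byte(s)\n";
}
```

Every entry holds exactly one character as a reader would see it, yet the reported lengths run from 1 to 4.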
On the plus side, PHP will not damage or alter strings. So if we have a string that contains UTF-8 characters, it can be moved around, stored, retrieved, and sent to the browser, all without any adverse events. Provided, that is, we do not attempt to do the kind of processing that is liable to go wrong!
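As a hedged illustration of both halves of that statement, the sketch below first copies a UTF-8 string untouched and then applies substr, one example of byte-oriented processing that can go wrong. The variable names and the sample word are assumptions made for the example, and bin2hex is used only to make the bytes visible.

```php
<?php
// "caf" followed by U+00E9 (e with acute accent), written as raw UTF-8 bytes.
$word = "caf\xC3\xA9";

// Assigning or passing the string around leaves every byte as it was.
$copy = $word;
echo bin2hex($copy), "\n";   // 636166c3a9

// substr() works in bytes, so asking for the first four "characters"
// cuts the two-byte sequence in half and leaves a stray lead byte.
$cut = substr($word, 0, 4);
echo bin2hex($cut), "\n";    // 636166c3
```

The truncated result ends in a lone 0xC3 byte, which is not valid UTF-8 on its own and would typically be shown by a browser as a replacement character.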
It is possible to subscript the individual bytes of a character string by writing something like $string[0]. Here it is essential to remember that what we will get is a byte, and not...