Regular Expression to match Multi-Byte
Here's something I figured out today: how to match a UTF-8 multibyte character with a regular expression without enabling Unicode support.
The task was to match a single character following the ✆ symbol. The initial approach would look like this:
preg_match('/✆./', $input, $match);
And that works fine for inputs like ✆A. But what if your input is ✆Ф? Then $match will not look like you expected.
That's because you didn't take UTF-8 into account. Without Unicode support, a dot only matches a single byte. That Ф character is two bytes!
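To see the problem for yourself, it helps to dump the matched bytes as hex. This is just a small sketch; I'm writing ✆ and Ф as byte escapes (my own choice, so the snippet does not depend on the file encoding).

```php
<?php
// Inspect what the byte-oriented dot actually captures (no /u modifier).
$phone = "\xE2\x9C\x86";       // ✆ (U+2706) as raw UTF-8 bytes
$input = $phone . "\xD0\xA4";  // ✆ followed by Ф (U+0424), which is two bytes

preg_match('/' . $phone . './', $input, $match);

// $match[0] holds the three bytes of ✆ plus only the first byte of Ф.
echo bin2hex($match[0]), "\n"; // e29c86d0
echo strlen($match[0]), "\n";  // 4, not the 5 bytes of "✆Ф"
```

The trailing d0 is only half of Ф, so the match ends in the middle of a character.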
The obvious solution is using PCRE's /u modifier:
preg_match('/✆\X/u', $input, $match);
However, the /u modifier can be very slow, since PCRE then has to treat the input as UTF-8 rather than as raw bytes. That doesn't matter for this simple example, but for my use case it did.
But there is another way. We simply want to identify a single UTF-8 character. Their length varies between 1 and 4 bytes. The first 128 code points (byte values 0 to 127) are encoded exactly like ASCII. A byte above 127 is part of a multibyte sequence, and the first byte of such a sequence tells us how long it is. Since we can match on the byte level, it should be possible to match that lead byte and the right number of bytes following it.
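The lead byte ranges can be spelled out in a few lines of PHP. This is a minimal sketch, not part of my actual solution: utf8SequenceLength() is a hypothetical helper name, and it only looks at the first byte without validating the bytes that follow.

```php
<?php
// Hypothetical helper: expected byte length of a UTF-8 sequence,
// judged from its first byte only. Returns 0 if the byte cannot start one.
function utf8SequenceLength(string $bytes): int
{
    if ($bytes === '') {
        return 0;
    }
    $b = ord($bytes[0]);
    if ($b <= 0x7F) { return 1; }                // same as ASCII
    if ($b >= 0xC2 && $b <= 0xDF) { return 2; }  // lead byte of a 2-byte sequence
    if ($b >= 0xE0 && $b <= 0xEF) { return 3; }  // lead byte of a 3-byte sequence
    if ($b >= 0xF0 && $b <= 0xF4) { return 4; }  // lead byte of a 4-byte sequence
    return 0; // 0x80-0xBF are continuation bytes; 0xC0, 0xC1, 0xF5-0xFF never occur in valid UTF-8
}

echo utf8SequenceLength("A"), "\n";            // 1
echo utf8SequenceLength("\xD0\xA4"), "\n";     // 2 (Ф)
echo utf8SequenceLength("\xE2\x9C\x86"), "\n"; // 3 (✆)
```

These are the same ranges a byte-level regular expression can use as character classes.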
This is what I came up with:
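$multi2 = '(?:[\xC2-\xDF].)';    // lead byte of a 2-byte sequence plus 1 more byte
$multi3 = '(?:[\xE0-\xEF]..)';   // lead byte of a 3-byte sequence plus 2 more bytes
$multi4 = '(?:[\xF0-\xF4]...)';  // lead byte of a 4-byte sequence plus 3 more bytes
$latin = '[0-9A-Za-z]';          // single-byte ASCII letters and digits
$anychar = "(?:$multi4|$multi3|$multi2|$latin)";
preg_match("/✆$anychar/", $input, $match);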
Seems to work fine in my limited testing so far.
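For anyone who wants to repeat that kind of testing, here is a rough check of the pattern against characters of each length. The wrapper function charAfterPhone() and the capture group are my additions for illustration, and the sample characters are written as byte escapes.

```php
<?php
// The byte-level pattern from above, wrapped in a hypothetical helper for testing.
function charAfterPhone(string $input): ?string
{
    $multi2  = '(?:[\xC2-\xDF].)';
    $multi3  = '(?:[\xE0-\xEF]..)';
    $multi4  = '(?:[\xF0-\xF4]...)';
    $latin   = '[0-9A-Za-z]';
    $anychar = "(?:$multi4|$multi3|$multi2|$latin)";
    $phone   = "\xE2\x9C\x86"; // ✆ (U+2706) as raw UTF-8 bytes

    // Capture group 1 is the single character following ✆.
    if (preg_match("/$phone($anychar)/", $input, $match)) {
        return $match[1];
    }
    return null;
}

// A (1 byte), Ф (2 bytes), € (3 bytes), 😀 (4 bytes)
foreach (["A", "\xD0\xA4", "\xE2\x82\xAC", "\xF0\x9F\x98\x80"] as $char) {
    $found = charAfterPhone("\xE2\x9C\x86" . $char . " trailing text");
    echo bin2hex($char), ' => ', var_export($found === $char, true), "\n";
}
```

One thing to keep in mind: the dots accept any byte except a newline, so this assumes the input is valid UTF-8. A lead byte followed by plain ASCII would still match.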
Update: as Bruno points out in the comments, we can simplify this some more by taking into account that the follow-up bytes of UTF-8 characters are always in the 128-191 range. So we can simply match any byte optionally followed by bytes in that range. No need to “count” the follower bytes. Nice.
preg_match("/✆.[\x80-\xbf]*/", $input, $match);
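The same kind of check works for the simplified version. Again only a sketch: the capture group and the byte escapes for the sample characters are my additions.

```php
<?php
$phone = "\xE2\x9C\x86"; // ✆ (U+2706) as raw UTF-8 bytes
// Single-quoted, so the \x80 and \xbf escapes reach PCRE unchanged.
$pattern = '/' . $phone . '(.[\x80-\xbf]*)/';

$results = [];
// A (1 byte), Ф (2 bytes), 😀 (4 bytes) and a punctuation character
foreach (["A", "\xD0\xA4", "\xF0\x9F\x98\x80", "?"] as $char) {
    preg_match($pattern, $phone . $char . "rest", $match);
    $results[] = ($match[1] === $char); // did we capture exactly one character?
}
var_export($results);
```

Note that this version is a bit more permissive than the first one: the leading dot also accepts punctuation and other ASCII characters, not just letters and digits.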