Four Bytes

I once wrote a full set of character encoding converters in C.

EUC-JP, Shift_JIS, ISO-2022-JP, UTF-8. Each had a different byte structure. EUC-JP checked the high bit to identify the start of a multibyte sequence. Shift_JIS used the range of the first byte, but the second byte was troublesome: it could be 0x5C, the same value as the ASCII backslash. ISO-2022-JP switched character sets with escape sequences. Email used this one. The same character "あ" had different byte representations in every encoding. In C, you processed this one byte at a time, carrying state forward as you went. The structure seeped into your bones.
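
For the record, "あ" is A4 A2 in EUC-JP, 82 A0 in Shift_JIS, E3 81 82 in UTF-8, and 24 22 wrapped in escape sequences in ISO-2022-JP. The 0x5C problem looks roughly like the sketch below. It is not my original converter code; the function names are made up for illustration, and the lead-byte ranges (0x81 to 0x9F, 0xE0 to 0xFC) are the commonly used ones, which vendor variants may extend.

```c
#include <stddef.h>

/* Nonzero if b starts a two-byte Shift_JIS character.
 * Assumption: the common lead-byte ranges; vendor variants may differ. */
static int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Naive scan: treats every 0x5C byte as a backslash, including
 * the second byte of a two-byte character. */
size_t count_backslash_naive(const unsigned char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] == 0x5C) {
            count++;
        }
    }
    return count;
}

/* Encoding-aware scan: skips the trail byte after a lead byte,
 * so a 0x5C in second position is never counted. */
size_t count_backslash_sjis(const unsigned char *s, size_t n)
{
    size_t count = 0;
    size_t i = 0;
    while (i < n) {
        if (sjis_is_lead(s[i]) && i + 1 < n) {
            i += 2;             /* two-byte character, skip both bytes */
        } else {
            if (s[i] == 0x5C) {
                count++;        /* a real single-byte backslash */
            }
            i += 1;
        }
    }
    return count;
}
```

Feed both functions the bytes 83 5C 5C, which is "ソ" followed by one real backslash. The naive scan reports two backslashes; the aware scan reports one.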

Then the flip phone era brought emoji.

DoCoMo, au, J-Phone. Three carriers, each defining their own emoji. Crammed into unused regions of Shift_JIS. No compatibility, naturally. DoCoMo's sun and au's sun were different code points. No official conversion map existed. Each carrier's gateway performed its own conversion, and what couldn't be converted became "〓" — the geta mark. Every time I saw that placeholder glyph, I thought: someone's feeling just got erased.

Now emoji live in Unicode, and most of them take four bytes in UTF-8. The chaotic three-kingdom war is over. Everyone on the planet sends the same emoji at the same code point. For those of us who once stared at raw byte sequences in C, this peace is almost unbelievable.
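
Where the four bytes come from can be shown with a small encoder. This is a minimal sketch of the standard UTF-8 bit layout; `utf8_encode` is a name chosen here for illustration, not a library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx + 2 trail bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                     /* surrogates are not encodable */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx + 3 trail bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

U+1F600, the grinning face, comes out as F0 9F 98 80. Anything at U+10000 or above lands in the four-byte branch, which is where most emoji sit. The sun, U+2600, is older and fits in three bytes: E2 98 80.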

Incidentally: MySQL's utf8 only handles up to three bytes per character. To store emoji you need utf8mb4. A thing called utf8 that is not UTF-8. What MySQL named utf8 was an incomplete implementation of the real thing. That naming confused countless engineers. Same name, different contents. Character encodings have always been like that.
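
If you want to know whether a string will fit in a three-byte column before you send it, a byte scan is enough. The sketch below assumes the input is already valid UTF-8, and `utf8_needs_mb4` is a hypothetical helper name; it is not how MySQL itself performs the check.

```c
#include <stddef.h>

/* Returns 1 if the string contains any four-byte UTF-8 sequence,
 * that is, any character a three-byte-only column cannot hold.
 * Assumption: s is valid UTF-8. In valid UTF-8, bytes 0xF0..0xF4
 * appear only as the lead byte of a four-byte sequence. */
int utf8_needs_mb4(const unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0xF0 && s[i] <= 0xF4) {
            return 1;
        }
    }
    return 0;
}
```

Given F0 9F 98 80 it returns 1. Given E3 81 82, the bytes of "あ", it returns 0.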