Mojibake

Mojibake (文字化け), literally "changed characters", is the garbled text that results when a character encoding is mishandled somewhere along the way, such as in email or Usenet postings.

It may happen for several reasons:

  • The encoding is described incorrectly (for example, the text is labelled with the wrong encoding).
  • The eighth bit of each byte is removed.
  • The computer does not recognize the encoding.
  • Server-side software processes multibyte text as if it were single-byte encoded.
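
Three of these causes can be reproduced in a few lines. The following is a minimal Python sketch, not a description of any particular mail or news software: the sample string and the variable names are illustrative assumptions, and Python's built-in codecs stand in for whatever program mishandles the text.

```python
# Illustrative sketch: reproduce three of the causes listed above
# with Python's built-in codecs. The sample text is an assumption.
text = "文字化け"
euc = text.encode("euc_jp")  # two bytes per character, each with the eighth bit set

# Incorrect description of encoding: EUC-JP bytes treated as Shift_JIS.
wrong_label = euc.decode("shift_jis", errors="replace")

# Removal of the eighth bit: clear the top bit of every byte.
stripped = bytes(b & 0x7F for b in euc)
seven_bit = stripped.decode("ascii")  # printable ASCII, no longer Japanese

# Multibyte text handled as single-byte: cutting at a byte count
# splits the second character in half.
truncated = euc[:3].decode("euc_jp", errors="replace")

print(wrong_label)
print(seven_bit)
print(truncated)
```

In each case the bytes or their interpretation change while the intended text does not, which is why the result looks like a different script or like random punctuation.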

Types of mojibake and diagnostics

Mojibake may appear in several forms:

  • A line of question marks such as ???????. This happens when the computer displays characters from eight-bit encodings as question marks; it is common in Usenet posts from Outlook Express. In most cases the characters have actually been replaced with question marks, which makes the original text impossible to recover.
  • A line of accented letters. This happens when the computer mistakes eight-bit encoded Japanese for text in a European encoding, such as those used for French or Swedish, which use the eighth bit for accented letters.
  • Streams of unreadable kanji. This happens when:
      1. bytes are lost in an encoding
      2. eight-bit encodings such as UTF-8 are misinterpreted as other encodings such as EUC-JP
  • Text such as ?$B%1!<%?%$?(B. This happens when the escape characters are removed from ISO-2022-JP encoded Japanese (here they have been turned into question marks), so that the rest of each escape sequence, such as $B and (B, and the seven-bit bytes of the text remain visible.
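
The forms listed above can be reproduced, and in one case reversed, with a short experiment. This is a hedged Python sketch: the sample words and variable names are assumptions made for illustration, Shift_JIS is used as the wrong encoding in the kanji case where the list names EUC-JP, and Python's codecs only approximate what a particular mail or news client does.

```python
# Illustrative sketch of the forms of mojibake described above.
# The sample words and variable names are assumptions.
text = "文字化け"
word = "ケータイ"

# Question marks: the characters were replaced during conversion.
question_marks = text.encode("ascii", errors="replace").decode("ascii")  # "????"

# Accented letters: EUC-JP bytes read as a European encoding (Latin-1).
accented = text.encode("euc_jp").decode("latin-1")

# Unreadable kanji: UTF-8 bytes read as a different Japanese encoding.
stray_kanji = text.encode("utf-8").decode("shift_jis", errors="replace")

# ISO-2022-JP with each ESC byte (0x1B) turned into a question mark.
jis = word.encode("iso2022_jp")
no_escapes = jis.replace(b"\x1b", b"?").decode("ascii")  # "?$B%1!<%?%$?(B"

# Diagnostics: the Latin-1 misreading keeps every byte, so it can be undone.
# The question marks cannot, because the original bytes are gone.
recovered = accented.encode("latin-1").decode("euc_jp")

print(question_marks)
print(accented)
print(stray_kanji)
print(no_escapes)
print(recovered)
```

Reversing a misreading only works when no bytes were dropped or replaced along the way, which is why the question-mark form is usually the unrecoverable one.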

Links

Nora Stevens Heath: Avoiding Mojibake by John de Hoog (Wataru Tenga)