UTF-8

UTF-8 stands for Unicode Transformation Format, 8-bit. It is an 8-bit encoding of Unicode that is compatible with ASCII. Each byte in UTF-8 data is one of three kinds: an ASCII byte, the first byte of a multibyte sequence, or a continuation byte. The rules are as follows:

  • Unicode characters U+0000 to U+007F (the ASCII characters) are encoded simply as the bytes 0x00 to 0x7F. This means that files and strings which contain only seven-bit ASCII characters have the same encoding under both ASCII and UTF-8.
  • All UCS characters greater than U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set, that is, each byte is greater than or equal to 0x80 (binary 10000000). As a result, no ASCII byte (0x00 to 0x7F) can appear as part of any other character.
    • The first byte of a sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD (binary 11000000 to 11111101). The number of leading one bits in this byte gives the total length of the sequence, so the number of bytes that follow equals the number of ones after the initial 1. For example, a two-byte sequence starts with a byte of the form 110XXXXX, and a three-byte sequence starts with a byte of the form 1110XXXX.
    • All further bytes in a multibyte sequence are in the range 0x80 to 0xBF (binary 10000000 to 10111111).
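  • In the original design, UTF-8 encoded characters may theoretically be up to six bytes long. The current definition (RFC 3629) limits Unicode to U+10FFFF, so no valid character needs more than four bytes.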
  • The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
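
The first-byte rule above can be sketched in a few lines of Python. This is only an illustration: the function name `sequence_length` is hypothetical, and the sketch follows the original six-byte scheme described in the list rather than the stricter four-byte limit used today.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total length in bytes of the sequence that starts with first_byte.

    Hypothetical helper for illustration. It accepts first bytes up to 0xFD,
    as in the original six-byte scheme, and raises ValueError for bytes that
    can never start a sequence.
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if first_byte < 0x80:
        return 1  # ASCII: a single byte stands for itself
    if first_byte < 0xC0:
        raise ValueError("continuation byte (0x80 to 0xBF) cannot start a sequence")
    if first_byte >= 0xFE:
        raise ValueError("0xFE and 0xFF are never used in UTF-8")

    # Count the leading one bits: 110XXXXX -> 2, 1110XXXX -> 3, and so on.
    length = 0
    mask = 0x80
    while first_byte & mask:
        length += 1
        mask >>= 1
    return length


if __name__ == "__main__":
    for b in (0x41, 0xC3, 0xE2, 0xF0):
        print(f"0x{b:02X} starts a sequence of {sequence_length(b)} byte(s)")
```

The number of bytes that follow the first byte is the returned length minus one.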

In practice

  • ASCII characters remain unchanged under UTF-8.
  • Characters from U+0080 to U+07FF are encoded as two bytes.
  • The remaining characters in the Basic Multilingual Plane, that is, the Unicode characters up to U+FFFF, are encoded as three bytes.
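
As a minimal sketch of these three cases, the following Python function encodes a single code point from the Basic Multilingual Plane by hand. The name `encode_bmp` is hypothetical and chosen only for this example. Real code would simply call `str.encode("utf-8")`, which the usage lines use as a cross-check.

```python
def encode_bmp(code_point: int) -> bytes:
    """Encode one code point in the range U+0000 to U+FFFF as UTF-8.

    Hypothetical helper for illustration only. Surrogate code points
    (U+D800 to U+DFFF) are rejected because they are not characters.
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("code point outside the Basic Multilingual Plane")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")

    if code_point <= 0x7F:
        # One byte: the ASCII value itself.
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Two bytes: 110XXXXX 10XXXXXX (5 + 6 payload bits).
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # Three bytes: 1110XXXX 10XXXXXX 10XXXXXX (4 + 6 + 6 payload bits).
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    for ch in ("A", "é", "€"):
        manual = encode_bmp(ord(ch))
        # The hand-rolled result should match Python's built-in encoder.
        assert manual == ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {manual.hex(' ')} ({len(manual)} byte(s))")
```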

Advantages of UTF-8

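    The format is stateless, allows easy resynchronization, and is robust against missing bytes. Because first bytes and continuation bytes occupy disjoint ranges, a decoder that starts in the middle of a character or encounters a damaged sequence can skip forward to the next byte outside the range 0x80 to 0xBF and resume decoding there.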
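
Resynchronization can be illustrated with a short Python sketch. The function name `next_boundary` is hypothetical. It relies only on the rule that continuation bytes lie in the range 0x80 to 0xBF, and it does not validate the sequence that follows the boundary.

```python
def next_boundary(data: bytes, pos: int = 0) -> int:
    """Return the index of the first byte at or after pos that can start a character.

    Hypothetical helper for illustration: it skips continuation bytes
    (0x80 to 0xBF) and stops at the first byte outside that range.
    """
    while pos < len(data) and 0x80 <= data[pos] <= 0xBF:
        pos += 1
    return pos


if __name__ == "__main__":
    original = "aé€".encode("utf-8")   # 1-byte, 2-byte and 3-byte characters
    damaged = original[2:]             # cut in the middle of "é"

    start = next_boundary(damaged)
    # Only the truncated character is lost; the rest decodes normally.
    print(damaged[start:].decode("utf-8"))
```

Python's built-in decoder offers similar recovery through the `errors` argument, for example `damaged.decode("utf-8", errors="replace")`.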