UTF-7


UTF-7 (7-bit Unicode Transformation Format) is a variable-length character encoding that was proposed for representing Unicode-encoded text using a stream of ASCII characters, for example for use in Internet e-mail messages.

The basic Internet e-mail standard, SMTP, specifies that the transmission format is US-ASCII and does not allow byte values above the ASCII range. MIME provides a way to declare the character set, which allows other character sets such as UTF-8 and UTF-16 to be used. However, the underlying transmission infrastructure is still not guaranteed to be 8-bit clean, so a content transfer encoding has to be applied on top of them. Unfortunately, base64 makes even US-ASCII characters unreadable, and UTF-8 combined with quoted-printable is very inefficient, requiring 6-9 bytes for non-ASCII characters from the BMP and 12 bytes for characters outside the BMP.
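The byte counts quoted above follow from simple arithmetic: quoted-printable spends three ASCII characters (=XX) on every byte above the ASCII range, and UTF-8 uses two or three such bytes for a non-ASCII BMP character and four beyond the BMP. A minimal Python sketch can check this for three sample characters; it uses the standard quopri module, and the helper name qp_utf8_len is purely illustrative:

```python
import quopri

def qp_utf8_len(ch: str) -> int:
    """Number of bytes one character occupies as quoted-printable UTF-8."""
    return len(quopri.encodestring(ch.encode("utf-8")))

# U+00A3 needs 2 UTF-8 bytes, U+20AC needs 3, U+1F600 (outside the BMP) needs 4.
for ch in ["\u00a3", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}: {qp_utf8_len(ch)} bytes")
# prints 6, 9 and 12 bytes respectively
```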

Provided certain rules are followed during encoding, UTF-7 can be sent in e-mail without a separate MIME content transfer encoding, but it must still be explicitly identified as the text character set. In addition, if used within e-mail headers such as "Subject:", UTF-7 must be contained in MIME encoded words identifying the character set. Since encoded words force the use of either quoted-printable or base64, UTF-7 was designed not to use the = sign as an escape character, which avoids double escaping when it is combined with quoted-printable.

UTF-7 is generally not used as a native representation within applications because it is very awkward to process. The 8BITMIME extension has also been introduced, which reduces the need to encode messages in a 7-bit format. Despite the size advantage of UTF-7 over the combination of UTF-8 with either quoted-printable or base64, the Internet Mail Consortium recommends against its use.

A modified form of UTF-7 is currently used in the IMAP e-mail retrieval protocol for mailbox names. See section 5.1.3 of RFC 2060 for details.

Description

UTF-7 was first standardized in RFC 1642, A Mail-Safe Transformation Format of Unicode. This RFC has been obsoleted by RFC 2152.

Some characters can be represented directly as single ASCII bytes. The first group, known as "direct characters", contains all 62 alphanumeric characters and 9 symbols: ' ( ) , - . / : ?. The direct characters are considered very safe to include literally. The other main group, known as "optional direct characters", contains all other printable characters in the range U+0020-U+007E except ~ \ + and space. Using the optional direct characters reduces size and enhances human readability, but also increases the chance of breakage by things like badly designed mail gateways.

Space, tab, carriage return and line feed may also be represented directly as single ASCII bytes. However, if the encoded text is to be used in e-mail, care is needed to ensure that these characters are used in ways that do not require further content transfer encoding to be suitable for e-mail.
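The character groups described above can be written down as sets. The following Python sketch is only an illustration: the names DIRECT, OPTIONAL_DIRECT, WHITESPACE and classify are not taken from any standard, and the + sign is assumed to be excluded from the optional direct characters because it introduces base64 blocks.

```python
import string

# Direct characters: the 62 alphanumerics plus 9 symbols.
DIRECT = frozenset(string.ascii_letters + string.digits + "'(),-./:?")

# Optional direct characters: every other printable, non-space ASCII
# character (U+0021-U+007E) except "~", "\" and "+" (assumption: "+" is
# reserved as the shift character).
PRINTABLE = frozenset(chr(c) for c in range(0x21, 0x7F))
OPTIONAL_DIRECT = PRINTABLE - DIRECT - frozenset("~\\+")

# Characters that may also appear as single ASCII bytes.
WHITESPACE = frozenset(" \t\r\n")

def classify(ch: str) -> str:
    """Return the group a single character falls into."""
    if ch in DIRECT:
        return "direct"
    if ch in OPTIONAL_DIRECT:
        return "optional"
    if ch in WHITESPACE:
        return "whitespace"
    return "encoded"  # must be escaped or sent as base64

print(len(DIRECT), len(OPTIONAL_DIRECT))
print(classify("A"), classify("="), classify("~"))
```

A conservative encoder would treat only the "direct" and "whitespace" groups as literal and send everything else through base64, trading size for robustness.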

Other characters must be encoded in UTF-16 and then in modified base64 (base64 without the trailing = padding). The start of such a block of modified base64 encoded UTF-16 is indicated by a + sign. The end is indicated by any character not in the modified base64 set. As a special case, if the character after the modified base64 is a - (ASCII hyphen-minus), it is consumed. A literal + character is encoded as +-.
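The rules above can be sketched in Python as a small encoder and decoder. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names and the LITERAL set are hypothetical, UTF-16 is taken in big-endian byte order, every base64 block is closed with an explicit - (always permitted, since the - is consumed), and malformed input is not handled.

```python
import base64
import string

# Assumption: direct, optional direct and whitespace characters are all
# emitted literally; "+", "~" and "\" are not in this set.
LITERAL = frozenset(string.ascii_letters + string.digits + "'(),-./:?"
                    + '!"#$%&*;<=>@[]^_`{|}' + " \t\r\n")
B64_ALPHABET = frozenset(string.ascii_letters + string.digits + "+/")

def modified_base64(text: str) -> str:
    """Base64 of the UTF-16 (big-endian) form, without '=' padding."""
    raw = text.encode("utf-16-be")
    return base64.b64encode(raw).decode("ascii").rstrip("=")

def utf7_encode(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch == "+":                      # literal plus sign
            out.append("+-")
            i += 1
        elif ch in LITERAL:                # sent as a single ASCII byte
            out.append(ch)
            i += 1
        else:                              # collect a run for one base64 block
            j = i
            while j < len(text) and text[j] != "+" and text[j] not in LITERAL:
                j += 1
            out.append("+" + modified_base64(text[i:j]) + "-")
            i = j
    return "".join(out)

def utf7_decode(data: str) -> str:
    out, i = [], 0
    while i < len(data):
        if data[i] != "+":
            out.append(data[i])
            i += 1
            continue
        j = i + 1                          # scan to the end of the base64 block
        while j < len(data) and data[j] in B64_ALPHABET:
            j += 1
        block = data[i + 1:j]
        if block:
            padded = block + "=" * (-len(block) % 4)   # restore '=' padding
            out.append(base64.b64decode(padded).decode("utf-16-be"))
        elif j < len(data) and data[j] == "-":
            out.append("+")                # "+-" stands for a literal plus
        if j < len(data) and data[j] == "-":
            j += 1                         # the terminating '-' is consumed
        i = j
    return "".join(out)

print(utf7_encode("1 + 1 = 2"))   # 1 +- 1 = 2
print(utf7_encode("\u00a31"))     # +AKM-1
print(utf7_decode("+AKM-1"))
```

Closing every block with - costs at most one byte per block but removes the need to inspect the following character when encoding.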

Examples

  • "Hello, World!" is encoded as "Hello, World!"
  • "1 + 1 = 2" is encoded as "1 +- 1 = 2"
  • "£1" is encoded as "+AKM-1". The pound sign (U+00A3) is the single 16-bit unit 0x00A3 in UTF-16, whose bits convert into modified base64 as:
    • 000000₂ = 0₁₀ = 'A',
    • 001010₂ = 10₁₀ = 'K', and
    • 0011[00]₂ = 12₁₀ = 'M', where the last two bits in the last base64 digit are padding.
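The bit grouping in the last example can be reproduced step by step. The Python sketch below is illustrative only (the helper name sextets and the ALPHABET constant are not from the article) and assumes big-endian UTF-16:

```python
import string

# Base64 alphabet: index 0-25 = A-Z, 26-51 = a-z, 52-61 = 0-9, then + and /.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def sextets(data: bytes) -> list[str]:
    """Split bytes into 6-bit groups, zero-padding the final group."""
    bits = "".join(f"{byte:08b}" for byte in data)
    bits += "0" * (-len(bits) % 6)
    return [bits[i:i + 6] for i in range(0, len(bits), 6)]

groups = sextets("\u00a3".encode("utf-16-be"))   # bytes 0x00 0xA3
digits = "".join(ALPHABET[int(g, 2)] for g in groups)

for g in groups:
    print(g, int(g, 2), ALPHABET[int(g, 2)])
print("+" + digits + "-")   # +AKM-
```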
