Unicode and E-mail

Many E-mail clients now offer some support for Unicode in E-mail bodies. Most do not send in Unicode by default, and few systems are likely to be set up with fonts capable of displaying the full range of Unicode characters.

Unicode support for E-mail subject lines and E-mail addresses is more problematic, because several different standards are needed to retrofit non-ASCII data onto the originally ASCII-only E-mail protocol:
  • RFC 2047 provides support for encoding non-ASCII values, such as real names and subject lines, in E-mail headers
  • RFC 3490 provides support for encoding non-ASCII domain names
However, mailbox names (the part of the E-mail address before the '@' sign) are still limited to a subset of ASCII printable characters by RFC 2822.
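
The two header-level mechanisms listed above can be illustrated with a minimal sketch using only the Python standard library. The subject text and the domain "bücher.example" are made-up sample values, and Python's built-in "idna" codec is assumed here as a stand-in for an RFC 3490 implementation; this is an illustration, not a description of how any particular mail client does it.

```python
from email.header import Header, decode_header, make_header

# RFC 2047: wrap a non-ASCII header value in an ASCII-only "encoded-word".
subject = "Grüße aus Köln"  # sample value, not from the article
encoded_subject = Header(subject, charset="utf-8").encode()

# Decoding the encoded-word recovers the original text.
decoded_subject = str(make_header(decode_header(encoded_subject)))

# RFC 3490 (IDNA): convert a non-ASCII domain name to its ASCII form.
domain = "bücher.example"  # sample value, not from the article
ascii_domain = domain.encode("idna").decode("ascii")  # "xn--bcher-kva.example"

print(encoded_subject)
print(decoded_subject)
print(ascii_domain)
```

The encoded subject may come out as either the "Q" or the "B" variant of the encoded-word, depending on which the library judges shorter; both are valid and decode to the same text.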

Unicode support in message bodies

HTML e-mail can use HTML entities to use characters from anywhere in Unicode even if the HTML source text for the e-mail is in a legacy encoding. For details of this see Unicode and HTML. The rest of this article will deal with e-mail messages where the actual raw text (whether HTML markup or plain text) is in an encoding that covers the whole of Unicode.

As with every encoding other than US-ASCII, Unicode text in e-mail requires MIME to declare which Unicode transformation format the text uses. To use Unicode in e-mail headers, the text has to be encoded as a MIME "Encoded-Word" with a Unicode encoding as the charset.
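
As a rough illustration of the MIME labelling described above, the following sketch builds a small message with Python's standard email package. The addresses and the Greek sample text are invented for the example, and base64 is chosen explicitly as the transfer encoding so the result stays seven-bit clean; other choices are equally valid.

```python
from email.message import EmailMessage

body = "Καλημέρα κόσμε\n"  # sample text, not from the article

msg = EmailMessage()
msg["From"] = "sender@example.org"        # hypothetical address
msg["To"] = "recipient@example.org"       # hypothetical address
msg["Subject"] = "Καλημέρα"               # emitted as a MIME encoded-word
# Declare the charset via MIME and request a 7-bit-safe transfer encoding.
msg.set_content(body, charset="utf-8", cte="base64")

wire = msg.as_bytes()  # the message as it would be handed to a mail server

print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
print(wire.decode("ascii"))
```

The Content-Type header carries the charset parameter that tells the receiving client how to interpret the decoded body, while the serialized message itself contains only ASCII bytes.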

UTF-7, although sometimes considered deprecated, has one advantage over other Unicode encodings: it does not require a transfer encoding to fit within the seven-bit limits of many legacy Internet mail servers. UTF-8 and UTF-16, on the other hand, must be transfer-encoded in base64 or quoted-printable for safe transmission across seven-bit mail servers (i.e., those that do not advertise 8BITMIME).
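
The difference can be checked with a short sketch, assuming Python's built-in "utf-7" codec and the standard base64 and quopri modules; the sample string is invented for the example.

```python
import base64
import quopri

text = "Hi Mom ☺!"  # sample text, not from the article

utf7 = text.encode("utf-7")
utf8 = text.encode("utf-8")

# UTF-7 output is already seven-bit; raw UTF-8 is not.
utf7_is_7bit = all(b < 128 for b in utf7)
utf8_is_7bit = all(b < 128 for b in utf8)

# A transfer encoding makes the UTF-8 bytes seven-bit safe.
utf8_base64 = base64.b64encode(utf8)
utf8_qp = quopri.encodestring(utf8)

print(utf7, utf7_is_7bit)
print(utf8, utf8_is_7bit)
print(utf8_base64)
print(utf8_qp)
```

Quoted-printable keeps mostly-ASCII text readable in its encoded form, while base64 tends to be more compact for text that is mostly non-ASCII.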

Unicode in various mail clients

Evolution

View > Character Encoding > Unicode
Tools > Settings > Mail Preferences and Composer Preferences > set the default Character Encoding to Unicode

Mozilla Thunderbird

View > Character Encoding > Unicode
Tools > Options… > Fonts > Outgoing Mail / Incoming Mail (change to Unicode)

For Mac: Preferences > Display > Formatting > Fonts… > Character Encoding (bottom of the window).

MS Outlook

Outlook supports sending mail in UTF-7 and UTF-8 but does not do so by default. When replying, Outlook uses the same encoding as the message it is replying to. All Unicode characters can be entered in the edit box, but ones not available in the selected encoding will be silently replaced (usually with a question mark: ?) when sending the message.

Lotus Notes

Notes can also send Unicode:
  1. From the menu, select File -> Preferences -> User Preferences.
  2. Under Basics -> Additional Options, tick Enable UNICODE Display.
  3. Click Mail, then Internet.
  4. Under "Multilingual Internet mail," choose an option.

Scribe/InScribe

Scribe displays Unicode with its default settings. You can override the charset specified in the headers by right-clicking on the body and using the "Change Charset" menu to select a new charset. You can also configure preferred charsets for 8-bit text and US-ASCII in the receive options. When sending, a suitable legacy charset (8-bit, e.g. ISO-8859-?? or Windows-???) is chosen automatically; however, if the message has a complicated script or a mixture of scripts, UTF-8 is used by default. You can set a preferred legacy charset in the sending options panel to override the default choice. Characters not available in the current font are substituted from another font installed on the system (if available).


This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.