Subject: Character proposal
Date: Sun, 24 Nov 1996 21:54:23 -0600
From: David R Tribble
To: "Unicode, Inc."

Subject: Proposal to modify the Unicode character set
Title: More permanently unassigned codes
Date: 1996-11-24 Sun
Author: David R. Tribble
Email: david.tribble@central.beasys.com, dtribble@flash.net
http://www.flash.net/~dtribble/
Phone: +1-972-738-6125 15:00-00:00 GMT, +1-972-964-1720 00:00-05:00 GMT
Mail: 6004 Cave River Dr. Plano, TX 75093-6951 USA

Abstract:

A proposal to define additional character codes as permanently unassigned.

Unicode character codes U+FFF0..U+FFF7 (eight code points) are to be defined as permanently unassigned codes and therefore cannot represent any valid Unicode characters within conforming implementations. This restricts conforming Unicode character data streams (files, transmissions, etc.) from containing characters in the range U+FFF0..U+FFF7 and U+FFFE..U+FFFF. (Such characters are represented as "" in the code charts.) Programs are free to ignore these characters if they are encountered in a data stream or to substitute a valid Unicode character code (such as U+FFFD) in their place.

In addition to the two existing code values U+FFFE and U+FFFF, eight more permanently unassigned codes would give programs extra freedom in representing Unicode character streams internally.

Rationale:

Consider a revision control program that is designed to store and retrieve Unicode files. Being strictly conforming, this program must properly handle all valid Unicode characters (U+0000..U+FFFD), preserving them intact. This includes all character codes in the private use and surrogates areas, whether this program understands them or not.
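
As a rough illustration of the input handling described in the abstract, the sketch below filters a sequence of 16-bit code values as such a program might when reading a file. It assumes the proposal is adopted, so U+FFF0..U+FFF7 are treated like U+FFFE and U+FFFF. The names is_permanently_unassigned, sanitize, and drop are hypothetical and chosen only for this example; all other codes, including private use and surrogate values, pass through unchanged.

```python
REPLACEMENT_CHARACTER = 0xFFFD  # U+FFFD, named in the abstract as a possible substitute


def is_permanently_unassigned(code):
    """Return True for U+FFFE, U+FFFF, and the proposed range U+FFF0..U+FFF7.

    Assumption: the proposal is adopted; Unicode 2.0 itself lists only
    U+FFFE and U+FFFF as permanently unassigned.
    """
    return 0xFFF0 <= code <= 0xFFF7 or code in (0xFFFE, 0xFFFF)


def sanitize(codes, drop=False):
    """Return a new list of 16-bit code values with unassigned codes handled.

    By default each permanently unassigned code is replaced with U+FFFD.
    With drop=True such codes are ignored (removed) instead.
    Every other code is preserved intact.
    """
    result = []
    for code in codes:
        if is_permanently_unassigned(code):
            if not drop:
                result.append(REPLACEMENT_CHARACTER)
        else:
            result.append(code)
    return result
```

For example, sanitize([0x0041, 0xFFF3, 0xE000]) yields [0x0041, 0xFFFD, 0xE000]: the proposed unassigned code is replaced, while the private use code U+E000 is kept as is.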

Consider also that this program might be implemented so that it must represent certain control codes outside the Unicode character set; such control codes might be necessary to indicate text storage boundaries (at a level below that of the boundary codes provided by the U+2000 block), or compression control sequences, or even physical boundaries such as fixed-length end-of-record or end-of-block markers. These "sentinel" codes embody information at a representational level below that of the Unicode text upon which they operate.

A convenient internal representation of these sentinel codes is to simply embed them within the stream of Unicode characters, using codes that are invalid Unicode characters. As the Unicode standard is currently defined (v2.0), the only codes available for such use are those that are defined as "permanently unassigned", specifically, U+FFFE and U+FFFF. (Codes within the private use area cannot be used, since such character codes may already be in use in the Unicode files that the program wishes to store/retrieve.)

By adding a few more permanently unassigned codes to the character set, programs like the one described will have more flexibility in the way they choose to represent Unicode characters internally. (Note that, when used as internal sentinel codes, the unassigned character codes will never be written to or read from files or transmission streams.)

References:

"The Unicode Standard, Version 2.0", Addison-Wesley, Jul 1996, sect. 2.3, p. 2-10, "Allocation Areas".
---, sect. 2.3, p. 2-12, "Non-Graphic Characters, Reserved, and Unassigned Codes".
---, sect. 2.4, p. 2-13, "Special Non-Character Codes".
---, sect. 2.4, p. 2-14, "The Replacement Character".
---, sect. 3-8, p. 3-38, "Special Character Properties".
---, sect. 5.4, p. 5-3, "Unassigned and Private Use Character Codes".
---, sect. 6.2, p. 6-72,6-74, "General Punctuation".
---, sect. 6.2, p. 6-131, "Specials: U+FEFF, U+FFF0-U+FFFF".

End.