Subject:        Character proposal 
  Date:         Sun, 24 Nov 1996 21:54:23 -0600 
  From:         David R Tribble <dtribble@flash.net>
    To:         "Unicode, Inc." <unicode-inc@unicode.org>


Subject:        Proposal to modify the Unicode character set
Title:          More permanently unassigned codes
Date:           1996-11-24 Sun
Author:         David R. Tribble
Email:          david.tribble@central.beasys.com,
                dtribble@flash.net
                http://www.flash.net/~dtribble/
Phone:          +1-972-738-6125  15:00-00:00 GMT,
                +1-972-964-1720  00:00-05:00 GMT
Mail:           6004 Cave River Dr.
                Plano, TX 75093-6951
                USA

Abstract:
    A proposal to define additional character codes as permanently
    unassigned.

    Unicode character codes U+FFF0..U+FFF7 (eight code points) are to be
    defined as permanently unassigned codes and therefore cannot
    represent any valid Unicode characters within conforming
    implementations.

    This prohibits conforming Unicode character data streams (files,
    transmissions, etc.) from containing characters in the ranges
    U+FFF0..U+FFF7 and U+FFFE..U+FFFF.  (Such characters are represented
    as "<not a character>" in the code charts.)  Programs are free to
    ignore these characters if they are encountered in a data stream or
    to substitute a valid Unicode character code (such as U+FFFD) in
    their place.
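
    As an illustration only, the two permitted behaviors (ignoring the
    codes, or substituting U+FFFD) might be sketched as follows.  The
    function names and the "substitute" flag are hypothetical and are
    not part of this proposal; the code ranges are the ones listed
    above.

```python
# Minimal sketch: handling permanently unassigned codes found in an
# incoming data stream. All names here are illustrative assumptions.

REPLACEMENT_CHARACTER = "\uFFFD"


def is_permanently_unassigned(ch: str) -> bool:
    """True for U+FFF0..U+FFF7 (proposed) and U+FFFE..U+FFFF (existing)."""
    cp = ord(ch)
    return 0xFFF0 <= cp <= 0xFFF7 or 0xFFFE <= cp <= 0xFFFF


def sanitize(text: str, substitute: bool = True) -> str:
    """Replace each permanently unassigned code with U+FFFD, or drop it
    entirely when substitute is False. All other codes pass through."""
    out = []
    for ch in text:
        if is_permanently_unassigned(ch):
            if substitute:
                out.append(REPLACEMENT_CHARACTER)
            # else: ignore the code by not copying it
        else:
            out.append(ch)
    return "".join(out)
```

    Codes outside the two ranges, including the other codes in
    U+FFF8..U+FFFD, are passed through unchanged by this sketch.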

    In addition to the two existing code values U+FFFE and U+FFFF, eight
    more permanently unassigned codes would give programs extra freedom
    in representing Unicode character streams internally.

Rationale:
    Consider a revision control program that is designed to store and
    retrieve Unicode files.  Being strictly conforming, this program
    must properly handle all valid Unicode characters (U+0000..U+FFFD),
    preserving them intact.  This includes all character codes in the
    private use and surrogates areas, whether this program understands
    them or not.  Consider also that this program might be implemented
    so that it must represent certain control codes outside the Unicode
    character set; such control codes might be necessary to indicate
    text storage boundaries (at a level below that of the boundary codes
    provided by the U+2000 block), or compression control sequences, or
    even physical boundaries such as fixed-length end-of-record or
    end-of-block markers.  These "sentinel" codes embody information at
    a representational level below that of the Unicode text upon which
    they operate.

    A convenient internal representation of these sentinel codes is to
    simply embed them within the stream of Unicode characters, using
    codes that are invalid Unicode characters.  As the Unicode standard
    is currently defined (v2.0), the only codes available for such use
    are those that are defined as "permanently unassigned",
    specifically, U+FFFE and U+FFFF.  (Codes within the private use area
    cannot be used, since such character codes may already be in use in
    the Unicode files that the program wishes to store/retrieve.)

    By adding a few more permanently unassigned codes to the character
    set, programs like the one described will have more flexibility in
    the way they choose to represent Unicode characters internally.
    (Note that, when used as internal sentinel codes, the unassigned
    character codes will never be written to or read from files or
    transmission streams.)
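
    The sentinel technique described above might be sketched as
    follows.  This is a sketch under stated assumptions, not a
    prescribed implementation: the choice of U+FFF0 as an end-of-record
    marker and the function names are hypothetical.  The sentinel
    exists only in the internal buffer; it is filtered out of incoming
    text and removed again before text is written out.

```python
# Minimal sketch: embedding an internal end-of-record sentinel in a
# buffer of Unicode text. The sentinel assignment is an assumption.

END_OF_RECORD = "\uFFF0"  # hypothetical internal use of a proposed code


def _is_reserved(ch: str) -> bool:
    """True for the permanently unassigned codes named in this proposal."""
    cp = ord(ch)
    return 0xFFF0 <= cp <= 0xFFF7 or 0xFFFE <= cp <= 0xFFFF


def to_internal(records):
    """Build one internal buffer from external records, ending each
    record with the sentinel. Reserved codes in the input are replaced
    with U+FFFD so they cannot be mistaken for sentinels."""
    parts = []
    for rec in records:
        clean = "".join("\uFFFD" if _is_reserved(c) else c for c in rec)
        parts.append(clean + END_OF_RECORD)
    return "".join(parts)


def to_external(buffer):
    """Split the internal buffer back into records. The sentinel itself
    is never emitted to the caller."""
    pieces = buffer.split(END_OF_RECORD)
    return pieces[:-1]  # drop the empty piece after the final sentinel
```

    Private use codes such as U+E000 pass through both functions
    untouched, which is the property that rules them out as sentinels.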

References:
    "The Unicode Standard, Version 2.0", Addison-Wesley, Jul 1996, sect.
    2.3, p. 2-10, "Allocation Areas".

    ---, sect. 2.3, p. 2-12, "Non-Graphic Characters, Reserved, and
    Unassigned Codes".

    ---, sect. 2.4, p. 2-13, "Special Non-Character Codes".

    ---, sect. 2.4, p. 2-14, "The Replacement Character".

    ---, sect. 3.8, p. 3-38, "Special Character Properties".

    ---, sect. 5.4, p. 5-3, "Unassigned and Private Use Character
    Codes".

    ---, sect. 6.2, p. 6-72, 6-74, "General Punctuation".

    ---, sect. 6.2, p. 6-131, "Specials: U+FEFF, U+FFF0-U+FFFF".

End.