David Tribble's Proposal for ISO C and C++
Unicode Character Types

Date:	08 Jan 1998
Version:	1.01
Supersedes:	1.00

Unicode Character Types

Types

The standard header <unicode.h> contains declarations for the following types:

    unicode_t
    utf8_t

The unicode_t type is an unsigned integer type of at least 16 bits. It is capable of representing any 16-bit Unicode character value, which covers the range [0x0000,0xFFFF].

The utf8_t type is an unsigned integer type of at least 32 bits. It is capable of holding a sequence of one to four 8-bit bytes representing the UTF-8 encoding of a single Unicode character.
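The relationship between the two types can be sketched in C as follows. This is only an illustration under stated assumptions: the typedefs are one possible choice that meets the minimum widths, the function name unic_to_utf8() is hypothetical and not part of this proposal, and the packing order (first UTF-8 byte in the most significant occupied position) is an assumption, since the text above does not fix a layout.

```c
#include <stdint.h>

/* Assumed typedefs; the proposal only requires minimum widths. */
typedef uint_least16_t unicode_t;   /* at least 16 bits */
typedef uint_least32_t utf8_t;      /* at least 32 bits */

/*
 * Hypothetical helper: encode a 16-bit Unicode value as UTF-8 and pack
 * the resulting bytes into a utf8_t, first byte most significant.
 * A 16-bit value needs at most three bytes, so the fourth byte of the
 * utf8_t is never used by this sketch.
 */
utf8_t unic_to_utf8(unicode_t c)
{
    utf8_t u = (utf8_t)(c & 0xFFFFu);

    if (u < 0x80u)                      /* 1 byte:  0xxxxxxx */
        return u;
    if (u < 0x800u)                     /* 2 bytes: 110xxxxx 10xxxxxx */
        return ((0xC0u | (u >> 6)) << 8)
             |  (0x80u | (u & 0x3Fu));
    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    return ((0xE0u | (u >> 12)) << 16)
         | ((0x80u | ((u >> 6) & 0x3Fu)) << 8)
         |  (0x80u | (u & 0x3Fu));
}
```

Under these assumptions, the code 0x0041 packs to 0x41 and the code 0x20AC packs to 0xE282AC.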

Unicode Character Codes

etc...

Examples

etc...

Constants

The standard header <unicode.h> contains declarations for the following constants with their corresponding values:

    UNIC__C0		0x0000	// C0 Controls
    UNIC__LATIN		0x0020	// Basic Latin (ASCII, ISO-646)
    UNIC__C1		0x0080	// C1 Controls
    UNIC__LATIN_1	0x00A0	// Latin-1 Supplement
    UNIC__LATIN_A	0x0100	// Latin Extended-A
    UNIC__LATIN_B	0x0180	// Latin Extended-B
    UNIC__IPA		0x0250	// IPA Extensions
    UNIC__SPACING	0x02B0	// Spacing Modifier Letters
    UNIC__DIACRIT	0x0300	// Combining Diacritical Marks
    UNIC__GREEK		0x0370	// Greek
    UNIC__CYRILLIC	0x0400	// Cyrillic
    UNIC__ARMENIAN	0x0530	// Armenian
    UNIC__HEBREW	0x0590	// Hebrew
    UNIC__ARABIC	0x0600	// Arabic
    UNIC__DEVANAGARI	0x0900	// Devanagari
    UNIC__BENGALI	0x0980	// Bengali
    UNIC__GURMUHKI	0x0A00	// Gurmukhi
    UNIC__GUJARATI	0x0A80	// Gujarati
    UNIC__ORIYA		0x0B00	// Oriya
    UNIC__TAMIL		0x0B80	// Tamil
    UNIC__TELUGU	0x0C00	// Telugu
    UNIC__KANNADA	0x0C80	// Kannada
    UNIC__MALAYALAM	0x0D00	// Malayalam
    UNIC__THAI		0x0E00	// Thai
    UNIC__LAO		0x0E80	// Lao
    UNIC__TIBETAN	0x0F00	// Tibetan
    UNIC__GEORGIAN	0x10A0	// Georgian
    UNIC__HANGUL	0x1100	// Hangul Jamo
    UNIC__LATIN_E	0x1E00	// Latin Extended Additional
    UNIC__GREEK_E	0x1F00	// Greek Extended
    UNIC__PUNCT		0x2000	// General Punctuation
    UNIC__SUPER		0x2070	// Superscripts and Subscripts
    UNIC__CURRENCY	0x20A0	// Currency Symbols
    UNIC__DIACRIT_SYM	0x20D0	// Combining Diacritical Marks for Symbols
    UNIC__LETTERLIKE	0x2100	// Letterlike Symbols
    UNIC__NUMBER	0x2150	// Number Forms
    UNIC__ARROW		0x2190	// Arrows
    UNIC__MATH		0x2200	// Mathematical Operators
    UNIC__TECHNICAL	0x2300	// Miscellaneous Technical
    UNIC__CONTROL	0x2400	// Control Pictures
    UNIC__OCR		0x2440	// Optical Character Recognition
    UNIC__ENCL_ALPHA	0x2460	// Enclosed Alphanumerics
    UNIC__BOX		0x2500	// Box Drawing
    UNIC__BLOCK		0x2580	// Block Elements
    UNIC__GEOMETRIC	0x25A0	// Geometric Shapes
    UNIC__SYMBOL	0x2600	// Miscellaneous Symbols
    UNIC__DINGBAT	0x2700	// Dingbats
    UNIC__CJK_PUNCT	0x3000	// CJK Symbols and Punctuation
    UNIC__HIRIGANA	0x3040	// Hiragana
    UNIC__KATAKANA	0x30A0	// Katakana
    UNIC__BOPOMOFO	0x3100	// Bopomofo
    UNIC__HANGUL_COMPAT	0x3130	// Hangul Compatibility Jamo
    UNIC__KANBUN	0x3190	// Kanbun
    UNIC__ENCL_CJK	0x3200	// Enclosed CJK Letters and Months
    UNIC__CJK_COMPAT	0x3300	// CJK Compatibility
    UNIC__CJK		0x4E00	// CJK Unified Ideographs
    UNIC__HANGUL_SYL	0xAC00	// Hangul Syllables
    UNIC__CJK_COMPAT2	0xF900	// CJK Compatibility Ideographs
    UNIC__ALPHABETIC	0xFB00	// Alphabetic Presentation Forms
    UNIC__ARABIC_A	0xFB50	// Arabic Presentation Forms-A
    UNIC__HALF		0xFE20	// Combining Half Marks
    UNIC__CJK_COMPAT3	0xFE30	// CJK Compatibility Forms
    UNIC__SMALL		0xFE50	// Small Form Variants
    UNIC__ARABIC_B	0xFE70	// Arabic Presentation Forms-B
    UNIC__HALF_FULL	0xFF00	// Halfwidth and Fullwidth Forms
    UNIC__SPECIAL	0xFFF0	// Specials

    UNIC_MIN		0x0000	// Minimum Unicode code
    UNIC_MAX		0xFFFF	// Maximum Unicode code

    UNIC_NUL		0x0000	// Null
    UNIC_SOH		0x0001	// Start of Heading
    UNIC_STX		0x0002	// Start of Text
    UNIC_ETX		0x0003	// End of Text
    UNIC_EOT		0x0004	// End of Transmission
    UNIC_ENQ		0x0005	// Enquiry
    UNIC_ACK		0x0006	// Acknowledge
    UNIC_BEL		0x0007	// Bell (Alarm)
    UNIC_BS		0x0008	// Backspace
    UNIC_HT		0x0009	// Horizontal Tab
    UNIC_LF		0x000A	// Linefeed
    UNIC_VT		0x000B	// Vertical Tab
    UNIC_FF		0x000C	// Formfeed
    UNIC_CR		0x000D	// Carriage Return
    UNIC_SO		0x000E	// Shift Out
    UNIC_SI		0x000F	// Shift In
    UNIC_DLE		0x0010	// Data Link Escape
    UNIC_DC1		0x0011	// Device Control 1
    UNIC_DC2		0x0012	// Device Control 2
    UNIC_DC3		0x0013	// Device Control 3
    UNIC_DC4		0x0014	// Device Control 4
    UNIC_NAK		0x0015	// Negative Acknowledge
    UNIC_SYN		0x0016	// Synchronous Idle
    UNIC_ETB		0x0017	// End of Transmission Block
    UNIC_CAN		0x0018	// Cancel
    UNIC_EM		0x0019	// End of Medium
    UNIC_SUB		0x001A	// Substitute
    UNIC_ESC		0x001B	// Escape
    UNIC_FS		0x001C	// File Separator
    UNIC_GS		0x001D	// Group Separator
    UNIC_RS		0x001E	// Record Separator
    UNIC_US		0x001F	// Unit Separator
    UNIC_SP		0x0020	// Space
    UNIC_DEL		0x007F	// Delete (Rubout)
    UNIC_NBSP		0x00A0	// No-Break Space
    UNIC_NQSP		0x2000	// En Quad
    UNIC_MQSP		0x2001	// Em Quad
    UNIC_ENSP		0x2002	// En Space
    UNIC_EMSP		0x2003	// Em Space
    UNIC_3MSP		0x2004	// Three-Per-Em Space
    UNIC_4MSP		0x2005	// Four-Per-Em Space
    UNIC_6MSP		0x2006	// Six-Per-Em Space
    UNIC_FSP		0x2007	// Figure Space
    UNIC_PSP		0x2008	// Punctuation Space
    UNIC_THSP		0x2009	// Thin Space
    UNIC_HSP		0x200A	// Hair Space
    UNIC_ZWSP		0x200B	// Zero-Width Space
    UNIC_ZWNJ		0x200C	// Zero-Width Non-Joiner
    UNIC_ZWJ		0x200D	// Zero-Width Joiner
    UNIC_LRM		0x200E	// Left-to-Right Mark
    UNIC_RLM		0x200F	// Right-to-Left Mark
    UNIC_LSEP		0x2028	// Line Separator
    UNIC_PSEP		0x2029	// Paragraph Separator
    UNIC_LRE		0x202A	// Left-to-Right Embedding
    UNIC_RLE		0x202B	// Right-to-Left Embedding
    UNIC_PDF		0x202C	// Pop Directional Formatting
    UNIC_LRO		0x202D	// Left-to-Right Override
    UNIC_RLO		0x202E	// Right-to-Left Override
    UNIC_ISS		0x206A	// Inhibit Symmetric Swapping
    UNIC_ASS		0x206B	// Activate Symmetric Swapping
    UNIC_IAFS		0x206C	// Inhibit Arabic Form Shaping
    UNIC_AAFS		0x206D	// Activate Arabic Form Shaping
    UNIC_NADS		0x206E	// National Digit Shapes
    UNIC_NODS		0x206F	// Nominal Digit Shapes
    UNIC_IDSP		0x3000	// Ideographic Space
    ...
    UNIC_REPL		0xFFFD	// Replacement
    UNIC_RES0		0xFFF0	// Reserved 0
    UNIC_ZWNBSP		0xFEFF	// Zero-Width No-Break Space
    UNIC_BOM		0xFEFF	// Byte Order Mark
    UNIC_BOM_R		0xFFFE	// Byte Order Mark, reversed
    UNIC_NAC		0xFFFF	// Not A Character

These constants may be implemented either as enumeration constants or as preprocessor macros.

Constants that begin with the prefix UNIC__ identify the groups into which the Unicode characters are arranged. Each such constant represents the first Unicode character code within a specific group.
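Because each group constant marks the first code of its group, a range test against two adjacent constants can classify a character. The sketch below defines a few of the constants as enumeration constants and adds two helper functions. The helper names are hypothetical and not part of this proposal, and the sketch assumes that a group extends up to the start of the next listed group, which may take in codes that are not assigned to any character.

```c
#include <stdint.h>

typedef uint_least16_t unicode_t;   /* assumed typedef */

/* A few group constants from the table, as enumeration constants. */
enum {
    UNIC__GREEK    = 0x0370,
    UNIC__CYRILLIC = 0x0400,
    UNIC__ARMENIAN = 0x0530
};

/* Hypothetical helper: nonzero if c lies in the Greek group. */
int unic_is_greek(unicode_t c)
{
    return c >= UNIC__GREEK && c < UNIC__CYRILLIC;
}

/* Hypothetical helper: nonzero if c lies in the Cyrillic group. */
int unic_is_cyrillic(unicode_t c)
{
    return c >= UNIC__CYRILLIC && c < UNIC__ARMENIAN;
}
```

For example, unic_is_greek(0x03B1) is nonzero for Greek small letter alpha, while unic_is_greek(0x0410) is zero because that code falls in the Cyrillic group.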

A few other constants bear special meanings, as described below.

Constant `UNIC_MIN`

    const unicode_t  UNIC_MIN;

The UNIC_MIN constant represents the lowest valid Unicode character value, which is 0x0000 (U+0000).

Constant `UNIC_MAX`

    const unicode_t  UNIC_MAX;

The UNIC_MAX constant represents the highest Unicode character value, which is 0xFFFF (U+FFFF).

Constant `UNIC_NAC`

    const unicode_t  UNIC_NAC;

The UNIC_NAC constant represents the "Not A Character" Unicode character value, which is 0xFFFF (U+FFFF).

Constant `UNIC_NUL`

    const unicode_t  UNIC_NUL;

The UNIC_NUL constant represents the "Null" Unicode character value, which is 0x0000 (U+0000). Unicode character strings are usually terminated with this character code.
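As an illustration of the termination convention, a string length function can scan for UNIC_NUL in the same way strlen() scans for a null byte. This is a minimal sketch: the typedef and the macro definition are assumptions about how an implementation might provide them, and unic_strlen() is a hypothetical name that this proposal does not define.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint_least16_t unicode_t;   /* assumed typedef */
#define UNIC_NUL  0x0000u           /* assumed macro form of the constant */

/*
 * Hypothetical helper: count the character codes that precede the
 * terminating UNIC_NUL in a Unicode character string.
 */
size_t unic_strlen(const unicode_t *s)
{
    size_t n = 0;

    while (s[n] != UNIC_NUL)
        n++;
    return n;
}
```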

Constant `UNIC_BOM`

    const unicode_t  UNIC_BOM;

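The UNIC_BOM constant represents the "Byte Order Mark" Unicode character value, which is 0xFEFF (U+FEFF). This character code occurs at the start of Unicode character streams; when it is read as 0xFEFF, the stream has a "normal" byte ordering, i.e., the same ordering that the reader assumes. A big-endian stream, for example, begins with the bytes 0xFE 0xFF.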

Constant `UNIC_BOM_R`

    const unicode_t  UNIC_BOM_R;

The UNIC_BOM_R constant represents a byte-swapped "Byte Order Mark" value, which is 0xFFFE (U+FFFE). This code is not itself a valid Unicode character; when it is read from a Unicode character stream, it indicates a "reversed" byte ordering, i.e., the bytes of each character code are swapped relative to the ordering that the reader assumes.
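A reader that finds the reversed mark at the start of a buffer can restore the expected ordering by swapping the two bytes of every code. The following sketch shows one way to do this. The typedef and macro definitions are assumptions, and unic_swap() and unic_fix_order() are hypothetical names that this proposal does not define.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint_least16_t unicode_t;   /* assumed typedef */
#define UNIC_BOM    0xFEFFu         /* assumed macro forms */
#define UNIC_BOM_R  0xFFFEu

/* Hypothetical helper: exchange the two low-order bytes of a code. */
unicode_t unic_swap(unicode_t c)
{
    return (unicode_t)(((c & 0x00FFu) << 8) | ((c >> 8) & 0x00FFu));
}

/*
 * Hypothetical helper: if the buffer starts with UNIC_BOM_R, swap the
 * bytes of all n codes in place so that the first code becomes UNIC_BOM.
 * Returns 1 if the buffer was swapped, 0 if it was left unchanged.
 */
int unic_fix_order(unicode_t *buf, size_t n)
{
    size_t i;

    if (n == 0 || buf[0] != UNIC_BOM_R)
        return 0;
    for (i = 0; i < n; i++)
        buf[i] = unic_swap(buf[i]);
    return 1;
}
```

A buffer that already starts with UNIC_BOM, or that carries no mark at all, is left untouched by this sketch.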

Functions

The standard header <unicode.h> contains declarations for the following library functions:

    unicode()

Function `unicode()`

    int  unicode(unicode_t *t);

The unicode() function ...

Footnotes

1. etc...

Prior Art

etc...
