David Tribble's Proposal for ISO C and C++
Unicode Character Types

Date: 08 Jan 1998
Version: 1.01
Supersedes: 1.00

Unicode Character Types


The standard header <unicode.h> contains declarations for the following types:


The unicode_t type is an unsigned integer type of at least 16 bits. It is capable of representing any 16-bit Unicode character value, which covers the range [0x0000,0xFFFF].

The utf8_t type is an unsigned integer type of at least 32 bits. It is capable of holding a sequence of one to four 8-bit bytes representing the UTF-8 encoding of a single Unicode character.
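
As an illustration only, the two types might be declared as shown below. The typedefs based on <stdint.h> are an assumption, and pack_utf8() is a hypothetical helper that is not part of this proposal. The order in which the bytes are packed into a utf8_t value (lead byte in the most significant occupied position) is also an assumption, since the text above does not specify one. Note that under standard UTF-8 a 16-bit value needs at most three bytes.

```c
#include <stdint.h>

/* Assumed typedefs: the proposal only requires "at least 16 bits" and
   "at least 32 bits", so the least-width types are one possible fit. */
typedef uint_least16_t unicode_t;
typedef uint_least32_t utf8_t;

/* Hypothetical helper: encode one 16-bit Unicode value as UTF-8 and
   pack the resulting bytes into a utf8_t, lead byte first (highest).
   No special treatment is given to the surrogate range. */
static utf8_t pack_utf8(unicode_t c)
{
    uint_least32_t u = (uint_least32_t)c & 0xFFFFu;

    if (u < 0x80u)                  /* 1 byte:  0xxxxxxx */
        return (utf8_t)u;

    if (u < 0x800u)                 /* 2 bytes: 110xxxxx 10xxxxxx */
        return (utf8_t)(((0xC0u | (u >> 6)) << 8)
                      |  (0x80u | (u & 0x3Fu)));

    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    return (utf8_t)(((0xE0u | (u >> 12)) << 16)
                  | ((0x80u | ((u >> 6) & 0x3Fu)) << 8)
                  |  (0x80u | (u & 0x3Fu)));
}
```

Under these assumptions, pack_utf8(0x20AC) yields 0xE282AC, i.e., the bytes E2 82 AC.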

Unicode Character Codes





The standard header <unicode.h> contains declarations for the following constants with their corresponding values:

    UNIC__C0		0x0000	// C0 Controls
    UNIC__LATIN		0x0020	// Basic Latin (ASCII, ISO-646)
    UNIC__C1		0x0080	// C1 Controls
    UNIC__LATIN_1	0x00A0	// Latin-1 Supplement
    UNIC__LATIN_A	0x0100	// Latin Extended-A
    UNIC__LATIN_B	0x0180	// Latin Extended-B
    UNIC__IPA		0x0250	// IPA Extensions
    UNIC__SPACING	0x02B0	// Spacing Modifier Letters
    UNIC__DIACRIT	0x0300	// Combining Diacritical Marks
    UNIC__GREEK		0x0370	// Greek
    UNIC__CYRILLIC	0x0400	// Cyrillic
    UNIC__ARMENIAN	0x0530	// Armenian
    UNIC__HEBREW	0x0590	// Hebrew
    UNIC__ARABIC	0x0600	// Arabic
    UNIC__DEVANAGARI	0x0900	// Devanagari
    UNIC__BENGALI	0x0980	// Bengali
    UNIC__GURMUHKI	0x0A00	// Gurmukhi
    UNIC__GUJARATI	0x0A80	// Gujarati
    UNIC__ORIYA		0x0B00	// Oriya
    UNIC__TAMIL		0x0B80	// Tamil
    UNIC__TELUGU	0x0C00	// Telugu
    UNIC__KANNADA	0x0C80	// Kannada
    UNIC__MALAYALAM	0x0D00	// Malayalam
    UNIC__THAI		0x0E00	// Thai
    UNIC__LAO		0x0E80	// Lao
    UNIC__TIBETAN	0x0F00	// Tibetan
    UNIC__GEORGIAN	0x10A0	// Georgian
    UNIC__HANGUL	0x1100	// Hangul Jamo
    UNIC__LATIN_E	0x1E00	// Latin Extended Additional
    UNIC__GREEK_E	0x1F00	// Greek Extended
    UNIC__PUNCT		0x2000	// General Punctuation
    UNIC__SUPER		0x2070	// Superscripts and Subscripts
    UNIC__CURRENCY	0x20A0	// Currency Symbols
    UNIC__DIACRIT_SYM	0x20D0	// Combining Diacriticals for Symbols
    UNIC__LETTERLIKE	0x2100	// Letterlike Symbols
    UNIC__NUMBER	0x2150	// Number Forms
    UNIC__ARROW		0x2190	// Arrows
    UNIC__MATH		0x2200	// Mathematical Operators
    UNIC__TECHNICAL	0x2300	// Miscellaneous Technical
    UNIC__CONTROL	0x2400	// Control Pictures
    UNIC__OCR		0x2440	// Optical Character Recognition
    UNIC__ENCL_ALPHA	0x2460	// Enclosed Alphanumerics
    UNIC__BOX		0x2500	// Box Drawing
    UNIC__BLOCK		0x2580	// Block Elements
    UNIC__GEOMETRIC	0x25A0	// Geometric Shapes
    UNIC__SYMBOL	0x2600	// Miscellaneous Symbols
    UNIC__DINGBAT	0x2700	// Dingbats
    UNIC__CJK_PUNCT	0x3000	// CJK Symbols and Punctuation
    UNIC__HIRIGANA	0x3040	// Hiragana
    UNIC__KATAKANA	0x30A0	// Katakana
    UNIC__BOPOMOFO	0x3100	// Bopomofo
    UNIC__HANGUL_COMPAT	0x3130	// Hangul Compatibility Jamo
    UNIC__KANBUN	0x3190	// Kanbun
    UNIC__ENCL_CJK	0x3200	// Enclosed CJK Letters and Months
    UNIC__CJK_COMPAT	0x3300	// CJK Compatibility
    UNIC__CJK		0x4E00	// CJK Unified Ideographs
    UNIC__HANGUL_SYL	0xAC00	// Hangul Syllables
    UNIC__CJK_COMPAT2	0xF900	// CJK Compatibility Ideographs
    UNIC__ALPHABETIC	0xFB00	// Alphabetic Presentation Forms
    UNIC__ARABIC_A	0xFB50	// Arabic Presentation Forms-A
    UNIC__HALF		0xFE20	// Combining Half Marks
    UNIC__CJK_COMPAT3	0xFE30	// CJK Compatibility Forms
    UNIC__SMALL		0xFE50	// Small Form Variants
    UNIC__ARABIC_B	0xFE70	// Arabic Presentation Forms-B
    UNIC__HALF_FULL	0xFF00	// Halfwidth and Fullwidth Forms
    UNIC__SPECIAL	0xFFF0	// Specials

    UNIC_MIN		0x0000	// Minimum Unicode code
    UNIC_MAX		0xFFFF	// Maximum Unicode code

    UNIC_NUL		0x0000	// Null
    UNIC_SOH		0x0001	// Start of Heading
    UNIC_STX		0x0002	// Start of Text
    UNIC_ETX		0x0003	// End of Text
    UNIC_EOT		0x0004	// End of Transmission
    UNIC_ENQ		0x0005	// Enquiry
    UNIC_ACK		0x0006	// Acknowledge
    UNIC_BEL		0x0007	// Bell (Alarm)
    UNIC_BS		0x0008	// Backspace
    UNIC_HT		0x0009	// Horizontal Tab
    UNIC_LF		0x000A	// Linefeed
    UNIC_VT		0x000B	// Vertical Tab
    UNIC_FF		0x000C	// Formfeed
    UNIC_CR		0x000D	// Carriage Return
    UNIC_SO		0x000E	// Shift Out
    UNIC_SI		0x000F	// Shift In
    UNIC_DLE		0x0010	// Data Link Escape
    UNIC_DC1		0x0011	// Device Control 1
    UNIC_DC2		0x0012	// Device Control 2
    UNIC_DC3		0x0013	// Device Control 3
    UNIC_DC4		0x0014	// Device Control 4
    UNIC_NAK		0x0015	// Negative Acknowledge
    UNIC_SYN		0x0016	// Synchronous Idle
    UNIC_ETB		0x0017	// End of Transmission Block
    UNIC_CAN		0x0018	// Cancel
    UNIC_EM		0x0019	// End of Medium
    UNIC_SUB		0x001A	// Substitute
    UNIC_ESC		0x001B	// Escape
    UNIC_FS		0x001C	// File Separator
    UNIC_GS		0x001D	// Group Separator
    UNIC_RS		0x001E	// Record Separator
    UNIC_US		0x001F	// Unit Separator
    UNIC_SP		0x0020	// Space
    UNIC_DEL		0x007F	// Delete (Rubout)
    UNIC_NBSP		0x00A0	// No-Break Space
    UNIC_NQSP		0x2000	// En Quad
    UNIC_MQSP		0x2001	// Em Quad
    UNIC_ENSP		0x2002	// En Space
    UNIC_EMSP		0x2003	// Em Space
    UNIC_3MSP		0x2004	// Three-Per-Em Space
    UNIC_4MSP		0x2005	// Four-Per-Em Space
    UNIC_6MSP		0x2006	// Six-Per-Em Space
    UNIC_FSP		0x2007	// Figure Space
    UNIC_PSP		0x2008	// Punctuation Space
    UNIC_THSP		0x2009	// Thin Space
    UNIC_HSP		0x200A	// Hair Space
    UNIC_ZWSP		0x200B	// Zero-Width Space
    UNIC_ZWNJ		0x200C	// Zero-Width Non-Joiner
    UNIC_ZWJ		0x200D	// Zero-Width Joiner
    UNIC_LRM		0x200E	// Left-to-Right Mark
    UNIC_RLM		0x200F	// Right-to-Left Mark
    UNIC_LSEP		0x2028	// Line Separator
    UNIC_PSEP		0x2029	// Paragraph Separator
    UNIC_LRE		0x202A	// Left-to-Right Embedding
    UNIC_RLE		0x202B	// Right-to-Left Embedding
    UNIC_PDF		0x202C	// Pop Directional Formatting
    UNIC_LRO		0x202D	// Left-to-Right Override
    UNIC_RLO		0x202E	// Right-to-Left Override
    UNIC_ISS		0x206A	// Inhibit Symmetric Swapping
    UNIC_ASS		0x206B	// Activate Symmetric Swapping
    UNIC_IAFS		0x206C	// Inhibit Arabic Form Shaping
    UNIC_AAFS		0x206D	// Activate Arabic Form Shaping
    UNIC_NADS		0x206E	// National Digit Shapes
    UNIC_NODS		0x206F	// Nominal Digit Shapes
    UNIC_IDSP		0x3000	// Ideographic Space
    UNIC_REPL		0xFFFD	// Replacement
    UNIC_RES0		0xFFF0	// Reserved 0
    UNIC_ZWNBSP		0xFEFF	// Zero-Width No-Break Space
    UNIC_BOM		0xFEFF	// Byte Order Mark
    UNIC_BOM_R		0xFFFE	// Byte Order Mark, reversed
    UNIC_NAC		0xFFFF	// Not A Character

These constants may be implemented as either enumeration constants or as preprocessor macros.
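
The two permitted forms could look like the sketch below, shown for a handful of the constants. The UNIC_USE_MACROS switch and the unic_is_sentinel() function are hypothetical and exist only to illustrate the point; neither is part of this proposal.

```c
#ifdef UNIC_USE_MACROS      /* hypothetical switch, for illustration only */

/* Form 1: preprocessor macros. */
#define UNIC_NUL    0x0000
#define UNIC_SP     0x0020
#define UNIC_BOM    0xFEFF
#define UNIC_NAC    0xFFFF

#else

/* Form 2: enumeration constants. */
enum {
    UNIC_NUL = 0x0000,
    UNIC_SP  = 0x0020,
    UNIC_BOM = 0xFEFF,
    UNIC_NAC = 0xFFFF
};

#endif

/* Either form is an integer constant expression, so it can be used
   in case labels.  Hypothetical example function. */
static int unic_is_sentinel(unsigned long c)
{
    switch (c) {
    case UNIC_NUL:
    case UNIC_NAC:
        return 1;
    default:
        return 0;
    }
}
```

One consideration when choosing between the forms: in C (before C23) an enumeration constant must be representable as an int, so on an implementation with a 16-bit int the values above 0x7FFF could only be provided as macros.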

Constants that begin with UNIC__ indicate the groups into which the Unicode characters are arranged. Each such constant represents the first Unicode character code within a specific group.
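
Because each UNIC__ constant marks the start of a group, a program could find the group of a character by searching for the largest group start that does not exceed the character code. The following is a minimal sketch of that idea; unic_group_start() and the unic_groups table are hypothetical names, the unicode_t typedef is an assumption, and the table is deliberately abbreviated to the first eleven groups.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint_least16_t unicode_t;       /* assumed definition */

/* A subset of the group constants listed above. */
#define UNIC__C0        0x0000
#define UNIC__LATIN     0x0020
#define UNIC__C1        0x0080
#define UNIC__LATIN_1   0x00A0
#define UNIC__LATIN_A   0x0100
#define UNIC__LATIN_B   0x0180
#define UNIC__IPA       0x0250
#define UNIC__SPACING   0x02B0
#define UNIC__DIACRIT   0x0300
#define UNIC__GREEK     0x0370
#define UNIC__CYRILLIC  0x0400

/* Hypothetical table of group starts, in ascending order.  Abbreviated:
   every code at or above 0x0400 is reported as Cyrillic here. */
static const unicode_t unic_groups[] = {
    UNIC__C0, UNIC__LATIN, UNIC__C1, UNIC__LATIN_1, UNIC__LATIN_A,
    UNIC__LATIN_B, UNIC__IPA, UNIC__SPACING, UNIC__DIACRIT,
    UNIC__GREEK, UNIC__CYRILLIC
};

/* Return the first code of the group containing c: a binary search
   for the last table entry that is <= c.  Entry 0 is 0x0000, so such
   an entry always exists. */
static unicode_t unic_group_start(unicode_t c)
{
    size_t lo = 0;
    size_t hi = sizeof unic_groups / sizeof unic_groups[0];

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (unic_groups[mid] <= c)
            lo = mid;
        else
            hi = mid;
    }
    return unic_groups[lo];
}
```

This approach attributes any unassigned codes between two groups to the preceding group, which may or may not be what a caller wants.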

A few other constants bear special meanings, as described below.

Constant UNIC_MIN

    const unicode_t  UNIC_MIN;

The UNIC_MIN constant represents the lowest valid Unicode character value, which is 0x0000 (U+0000).

Constant UNIC_MAX

    const unicode_t  UNIC_MAX;

The UNIC_MAX constant represents the highest Unicode character value, which is 0xFFFF (U+FFFF).

Constant UNIC_NAC

    const unicode_t  UNIC_NAC;

The UNIC_NAC constant represents the "Not A Character" Unicode character value, which is 0xFFFF (U+FFFF).

Constant UNIC_NUL

    const unicode_t  UNIC_NUL;

The UNIC_NUL constant represents the "Null" Unicode character value, which is 0x0000 (U+0000). Unicode character strings are usually terminated with this character code.
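
For illustration, the length of a UNIC_NUL-terminated string could be computed as in the sketch below. The unic_strlen() function is a hypothetical name and is not declared by this proposal; the unicode_t typedef is likewise an assumption.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint_least16_t unicode_t;   /* assumed definition */

#define UNIC_NUL    0x0000

/* Hypothetical helper: count the character codes that precede the
   terminating UNIC_NUL, in the manner of strlen(). */
static size_t unic_strlen(const unicode_t *s)
{
    size_t n = 0;

    while (s[n] != UNIC_NUL)
        n++;
    return n;
}
```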

Constant UNIC_BOM

    const unicode_t  UNIC_BOM;

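The UNIC_BOM constant represents the "Byte Order Mark" Unicode character value, which is 0xFEFF (U+FEFF). This character code typically occurs at the start of a Unicode character stream. A reader that obtains 0xFEFF is interpreting the stream in its "normal" byte ordering; when the bytes arrive as 0xFE followed by 0xFF, that ordering is big-endian.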

Constant UNIC_BOM_R

    const unicode_t  UNIC_BOM_R;

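The UNIC_BOM_R constant represents a byte-swapped "Byte Order Mark" Unicode character value, which is 0xFFFE (U+FFFE). U+FFFE is not a valid Unicode character, so a reader that obtains it where a byte order mark is expected can conclude that the stream is in "reversed" byte ordering relative to the way it is being read; when the bytes arrive as 0xFF followed by 0xFE, the stream is little-endian.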
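
The two byte order constants could be used together to determine the ordering of a stream from its first two bytes. The sketch below assumes the stream begins with a byte order mark; the unic_order_t type and the unic_detect_order() function are hypothetical names that are not part of this proposal.

```c
#include <stddef.h>

#define UNIC_BOM    0xFEFF
#define UNIC_BOM_R  0xFFFE

/* Hypothetical result type. */
typedef enum {
    UNIC_ORDER_UNKNOWN,     /* no byte order mark found */
    UNIC_ORDER_BIG,         /* bytes FE FF: big-endian */
    UNIC_ORDER_LITTLE       /* bytes FF FE: little-endian */
} unic_order_t;

/* Hypothetical helper: assemble the first two bytes as a big-endian
   16-bit value and compare it against the two constants. */
static unic_order_t unic_detect_order(const unsigned char *p, size_t n)
{
    unsigned long v;

    if (n < 2)
        return UNIC_ORDER_UNKNOWN;

    v = ((unsigned long)p[0] << 8) | p[1];
    if (v == UNIC_BOM)
        return UNIC_ORDER_BIG;
    if (v == UNIC_BOM_R)
        return UNIC_ORDER_LITTLE;
    return UNIC_ORDER_UNKNOWN;
}
```

Assembling the value byte by byte keeps the result independent of the byte order of the machine running the code.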


The standard header <unicode.h> contains declarations for the following library functions:


Function unicode()

    int  unicode(unicode_t *t);

The unicode() function ...


1. etc...

Prior Art



Text Copyright ©1998 David R. Tribble, all rights reserved.