| Date: | 08 Jan 1998 |
| Version: | 1.01 |
| Supersedes: | 1.00 |
The standard header <unicode.h> contains declarations for the following types:
unicode_t
utf8_t
The unicode_t type is
an unsigned integer type of at least 16 bits.
It is capable of representing any 16-bit Unicode character value,
which covers the range [0x0000,0xFFFF].
The utf8_t type is
an unsigned integer type of at least 32 bits.
It is capable of holding a sequence of one to four 8-bit bytes
representing the UTF-8 encoding of a single Unicode character.
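As an illustration of how the two types relate, the following is a minimal sketch that encodes one unicode_t value as UTF-8 and packs the resulting bytes into a utf8_t. The typedefs are stand-ins that merely satisfy the minimum widths stated above, the helper name utf8_pack_sketch is hypothetical, and the packing order (first byte in the highest-order occupied position) is an assumption, since the text does not specify one.

```c
/* Stand-ins for the types declared by <unicode.h>; these widths are
   assumptions that satisfy the stated minimums (16 and 32 bits). */
typedef unsigned short unicode_t;
typedef unsigned long  utf8_t;

/* Hypothetical helper, not part of the header: returns the UTF-8
   encoding of c packed into a utf8_t, with the first byte of the
   sequence in the highest-order occupied position (assumed layout).
   Surrogate code values receive no special treatment here. */
static utf8_t utf8_pack_sketch(unicode_t c)
{
    utf8_t u = (utf8_t)c & 0xFFFFUL;

    if (u < 0x80UL)                                  /* 1 byte:  0xxxxxxx */
        return u;
    if (u < 0x800UL)                                 /* 2 bytes: 110xxxxx 10xxxxxx */
        return ((0xC0UL | (u >> 6)) << 8)
             |  (0x80UL | (u & 0x3FUL));
    return ((0xE0UL | (u >> 12)) << 16)              /* 3 bytes: 1110xxxx ... */
         | ((0x80UL | ((u >> 6) & 0x3FUL)) << 8)
         |  (0x80UL | (u & 0x3FUL));
}
```

Under this assumed layout, a 16-bit character value never needs more than three of the four available bytes.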
etc...
etc...
The standard header <unicode.h> contains declarations for the following constants with their corresponding values:
UNIC__C0 0x0000 // C0 Controls
UNIC__LATIN 0x0020 // Basic Latin (ASCII, ISO-646)
UNIC__C1 0x0080 // C1 Controls
UNIC__LATIN_1 0x00A0 // Latin-1 Supplement
UNIC__LATIN_A 0x0100 // Latin Extended-A
UNIC__LATIN_B 0x0180 // Latin Extended-B
UNIC__IPA 0x0250 // IPA Extensions
UNIC__SPACING 0x02B0 // Spacing Modifier Letters
UNIC__DIACRIT 0x0300 // Combining Diacritical Marks
UNIC__GREEK 0x0370 // Greek
UNIC__CYRILLIC 0x0400 // Cyrillic
UNIC__ARMENIAN 0x0530 // Armenian
UNIC__HEBREW 0x0590 // Hebrew
UNIC__ARABIC 0x0600 // Arabic
UNIC__DEVANAGARI 0x0900 // Devanagari
UNIC__BENGALI 0x0980 // Bengali
UNIC__GURMUHKI 0x0A00 // Gurmukhi
UNIC__GUJARATI 0x0A80 // Gujarati
UNIC__ORIYA 0x0B00 // Oriya
UNIC__TAMIL 0x0B80 // Tamil
UNIC__TELUGU 0x0C00 // Telugu
UNIC__KANNADA 0x0C80 // Kannada
UNIC__MALAYALAM 0x0D00 // Malayalam
UNIC__THAI 0x0E00 // Thai
UNIC__LAO 0x0E80 // Lao
UNIC__TIBETAN 0x0F00 // Tibetan
UNIC__GEORGIAN 0x10A0 // Georgian
UNIC__HANGUL 0x1100 // Hangul Jamo
UNIC__LATIN_E 0x1E00 // Latin Extended Additional
UNIC__GREEK_E 0x1F00 // Greek Extended
UNIC__PUNCT 0x2000 // General Punctuation
UNIC__SUPER 0x2070 // Superscripts and Subscripts
UNIC__CURRENCY 0x20A0 // Currency Symbols
UNIC__DIACRIT_SYM 0x20D0 // Combining Diacritical Marks for Symbols
UNIC__LETTERLIKE 0x2100 // Letterlike Symbols
UNIC__NUMBER 0x2150 // Number Forms
UNIC__ARROW 0x2190 // Arrows
UNIC__MATH 0x2200 // Mathematical Operators
UNIC__TECHNICAL 0x2300 // Miscellaneous Technical
UNIC__CONTROL 0x2400 // Control Pictures
UNIC__OCR 0x2440 // Optical Character Recognition
UNIC__ENCL_ALPHA 0x2460 // Enclosed Alphanumerics
UNIC__BOX 0x2500 // Box Drawing
UNIC__BLOCK 0x2580 // Block Elements
UNIC__GEOMETRIC 0x25A0 // Geometric Shapes
UNIC__SYMBOL 0x2600 // Miscellaneous Symbols
UNIC__DINGBAT 0x2700 // Dingbats
UNIC__CJK_PUNCT 0x3000 // CJK Symbols and Punctuation
UNIC__HIRIGANA 0x3040 // Hiragana
UNIC__KATAKANA 0x30A0 // Katakana
UNIC__BOPOMOFO 0x3100 // Bopomofo
UNIC__HANGUL_COMPAT 0x3130 // Hangul Compatibility Jamo
UNIC__KANBUN 0x3190 // Kanbun
UNIC__ENCL_CJK 0x3200 // Enclosed CJK Letters and Months
UNIC__CJK_COMPAT 0x3300 // CJK Compatibility
UNIC__CJK 0x4E00 // CJK Unified Ideographs
UNIC__HANGUL_SYL 0xAC00 // Hangul Syllables
UNIC__CJK_COMPAT2 0xF900 // CJK Compatibility Ideographs
UNIC__ALPHABETIC 0xFB00 // Alphabetic Presentation Forms
UNIC__ARABIC_A 0xFB50 // Arabic Presentation Forms-A
UNIC__HALF 0xFE20 // Combining Half Marks
UNIC__CJK_COMPAT3 0xFE30 // CJK Compatibility Forms
UNIC__SMALL 0xFE50 // Small Form Variants
UNIC__ARABIC_B 0xFE70 // Arabic Presentation Forms-B
UNIC__HALF_FULL 0xFF00 // Halfwidth and Fullwidth Forms
UNIC__SPECIAL 0xFFF0 // Specials
UNIC_MIN 0x0000 // Minimum Unicode code
UNIC_MAX 0xFFFF // Maximum Unicode code
UNIC_NUL 0x0000 // Null
UNIC_SOH 0x0001 // Start of Heading
UNIC_STX 0x0002 // Start of Text
UNIC_ETX 0x0003 // End of Text
UNIC_EOT 0x0004 // End of Transmission
UNIC_ENQ 0x0005 // Enquiry
UNIC_ACK 0x0006 // Acknowledge
UNIC_BEL 0x0007 // Bell (Alarm)
UNIC_BS 0x0008 // Backspace
UNIC_HT 0x0009 // Horizontal Tab
UNIC_LF 0x000A // Linefeed
UNIC_VT 0x000B // Vertical Tab
UNIC_FF 0x000C // Formfeed
UNIC_CR 0x000D // Carriage Return
UNIC_SO 0x000E // Shift Out
UNIC_SI 0x000F // Shift In
UNIC_DLE 0x0010 // Data Link Escape
UNIC_DC1 0x0011 // Device Control 1
UNIC_DC2 0x0012 // Device Control 2
UNIC_DC3 0x0013 // Device Control 3
UNIC_DC4 0x0014 // Device Control 4
UNIC_NAK 0x0015 // Negative Acknowledge
UNIC_SYN 0x0016 // Synchronous Idle
UNIC_ETB 0x0017 // End of Transmission Block
UNIC_CAN 0x0018 // Cancel
UNIC_EM 0x0019 // End of Medium
UNIC_SUB 0x001A // Substitute
UNIC_ESC 0x001B // Escape
UNIC_FS 0x001C // File Separator
UNIC_GS 0x001D // Group Separator
UNIC_RS 0x001E // Record Separator
UNIC_US 0x001F // Unit Separator
UNIC_SP 0x0020 // Space
UNIC_DEL 0x007F // Delete (Rubout)
UNIC_NBSP 0x00A0 // No-Break Space
UNIC_NQSP 0x2000 // En Quad
UNIC_MQSP 0x2001 // Em Quad
UNIC_ENSP 0x2002 // En Space
UNIC_EMSP 0x2003 // Em Space
UNIC_3MSP 0x2004 // Three-Per-Em Space
UNIC_4MSP 0x2005 // Four-Per-Em Space
UNIC_6MSP 0x2006 // Six-Per-Em Space
UNIC_FSP 0x2007 // Figure Space
UNIC_PSP 0x2008 // Punctuation Space
UNIC_THSP 0x2009 // Thin Space
UNIC_HSP 0x200A // Hair Space
UNIC_ZWSP 0x200B // Zero-Width Space
UNIC_ZWNJ 0x200C // Zero-Width Non-Joiner
UNIC_ZWJ 0x200D // Zero-Width Joiner
UNIC_LRM 0x200E // Left-to-Right Mark
UNIC_RLM 0x200F // Right-to-Left Mark
UNIC_LSEP 0x2028 // Line Separator
UNIC_PSEP 0x2029 // Paragraph Separator
UNIC_LRE 0x202A // Left-to-Right Embedding
UNIC_RLE 0x202B // Right-to-Left Embedding
UNIC_PDF 0x202C // Pop Directional Formatting
UNIC_LRO 0x202D // Left-to-Right Override
UNIC_RLO 0x202E // Right-to-Left Override
UNIC_ISS 0x206A // Inhibit Symmetric Swapping
UNIC_ASS 0x206B // Activate Symmetric Swapping
UNIC_IAFS 0x206C // Inhibit Arabic Form Shaping
UNIC_AAFS 0x206D // Activate Arabic Form Shaping
UNIC_NADS 0x206E // National Digit Shapes
UNIC_NODS 0x206F // Nominal Digit Shapes
UNIC_IDSP 0x3000 // Ideographic Space
...
UNIC_REPL 0xFFFD // Replacement
UNIC_RES0 0xFFF0 // Reserved 0
UNIC_ZWNBSP 0xFEFF // Zero-Width No-Break Space
UNIC_BOM 0xFEFF // Byte Order Mark
UNIC_BOM_R 0xFFFE // Byte Order Mark, reversed
UNIC_NAC 0xFFFF // Not A Character
These constants may be implemented either as enumeration constants or as preprocessor macros.
Constants whose names begin with UNIC__ identify the groups
into which the Unicode characters are arranged.
Each such constant has the value of the first Unicode character code within
its group.
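Because each group constant marks the first code of its group, the group containing a character can be found by locating the largest group start that does not exceed the character's value. The following is a minimal sketch of that idea over the first six groups only. The typedef and enumeration are stand-ins for the header's declarations, and the table group_starts and the helper unic_group_sketch are hypothetical names introduced for illustration.

```c
#include <stddef.h>

typedef unsigned short unicode_t;   /* stand-in for the header's type */

/* The first six group constants, with the values listed above. */
enum {
    UNIC__C0      = 0x0000,
    UNIC__LATIN   = 0x0020,
    UNIC__C1      = 0x0080,
    UNIC__LATIN_1 = 0x00A0,
    UNIC__LATIN_A = 0x0100,
    UNIC__LATIN_B = 0x0180
};

/* Group starts in ascending order (hypothetical table; a complete one
   would list every UNIC__ constant). */
static const unicode_t group_starts[] = {
    UNIC__C0, UNIC__LATIN, UNIC__C1,
    UNIC__LATIN_1, UNIC__LATIN_A, UNIC__LATIN_B
};

/* Hypothetical helper: returns the start of the group containing c,
   i.e. the largest listed start that is <= c, by binary search. */
static unicode_t unic_group_sketch(unicode_t c)
{
    size_t lo = 0;
    size_t hi = sizeof group_starts / sizeof group_starts[0];

    /* Invariant: group_starts[lo] <= c; entries at hi and beyond are > c. */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (group_starts[mid] <= c)
            lo = mid;
        else
            hi = mid;
    }
    return group_starts[lo];
}
```

With this truncated table, every code at or above 0x0180 is reported as UNIC__LATIN_B. Even with a complete table, codes that fall in an unassigned gap between two groups would be attributed to the preceding group, so a stricter lookup would also need each group's last code, which the constants above do not provide.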
A few other constants bear special meanings, as described below.
UNIC_MIN
const unicode_t UNIC_MIN;
The UNIC_MIN constant represents
the lowest valid Unicode character value, which is 0x0000 (U+0000).
UNIC_MAX
const unicode_t UNIC_MAX;
The UNIC_MAX constant represents
the highest Unicode character value, which is 0xFFFF (U+FFFF).
UNIC_NAC
const unicode_t UNIC_NAC;
The UNIC_NAC constant represents
the "Not A Character" Unicode character value, which is 0xFFFF (U+FFFF).
UNIC_NUL
const unicode_t UNIC_NUL;
The UNIC_NUL constant represents
the "Null" Unicode character value, which is 0x0000 (U+0000).
Unicode character strings are usually terminated with this
character code.
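The terminating convention can be illustrated with a length function analogous to strlen(). This is a minimal sketch: the typedef and macro are stand-ins for the header's declarations, and unic_strlen_sketch is a hypothetical name, not a function this header is stated to provide.

```c
#include <stddef.h>

typedef unsigned short unicode_t;   /* stand-in for the header's type */
#define UNIC_NUL 0x0000             /* value as listed above */

/* Hypothetical helper: counts the character codes that precede the
   terminating UNIC_NUL. The terminator itself is not counted. */
static size_t unic_strlen_sketch(const unicode_t *s)
{
    const unicode_t *p = s;

    while (*p != UNIC_NUL)
        ++p;
    return (size_t)(p - s);
}
```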
UNIC_BOM
const unicode_t UNIC_BOM;
The UNIC_BOM constant represents
the "Byte Order Mark" Unicode character value, which is 0xFEFF (U+FEFF).
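This character code occurs at the start of a Unicode character stream to
mark its byte ordering. A reader that decodes the value 0xFEFF is reading
the stream in its "normal" byte ordering, i.e., the ordering in which it was
written; the byte sequence FE FF corresponds to big-endian ordering.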
UNIC_BOM_R
const unicode_t UNIC_BOM_R;
The UNIC_BOM_R constant represents a
byte-swapped "Byte Order Mark" Unicode character value,
which is 0xFFFE (U+FFFE).
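A reader that decodes this value where a Byte Order Mark is expected is
reading the stream in a "reversed" byte ordering and must swap the two bytes
of each character code; the byte sequence FF FE is how a little-endian
stream begins.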
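A reader might use the two constants as follows: inspect the first code value of a stream, and if it equals UNIC_BOM_R, swap the bytes of every subsequent value. The following is a minimal sketch under the assumption that the stream has already been read as 16-bit units. The typedef and macros are stand-ins for the header's declarations, and both helper names are hypothetical.

```c
typedef unsigned short unicode_t;   /* stand-in for the header's type */
#define UNIC_BOM   0xFEFF           /* values as listed above */
#define UNIC_BOM_R 0xFFFE

/* Hypothetical helper: exchanges the two bytes of a 16-bit code value. */
static unicode_t unic_swap_sketch(unicode_t c)
{
    return (unicode_t)(((c & 0x00FF) << 8) | ((c >> 8) & 0x00FF));
}

/* Hypothetical helper: given the first code value of a stream, returns
   1 if the remaining values must be byte-swapped and 0 otherwise
   (including the case where no mark is present at all). */
static int unic_needs_swap_sketch(unicode_t first)
{
    return first == UNIC_BOM_R;
}
```

Treating a missing mark as "no swap needed" is a design choice of this sketch; an application may prefer to fall back on an externally specified byte ordering instead.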
The standard header <unicode.h> contains declarations for the following library functions:
unicode()
unicode()
int unicode(unicode_t *t);
The unicode() function
...
1. etc...
etc...