Date: 08 Jan 1998
Version: 1.01
Supersedes: 1.00
The standard header <unicode.h> contains declarations for the following types:
unicode_t utf8_t
The unicode_t type is an unsigned integer type of at least 16 bits.
It is capable of representing any 16-bit Unicode character value,
which covers the range [0x0000,0xFFFF].
The utf8_t type is an unsigned integer type of at least 32 bits.
It is capable of holding a sequence of one to four 8-bit bytes
representing the UTF-8 encoding of a single Unicode character.
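As an illustration only, the two types might be defined and used as sketched below. The typedef choices, the helper name unic_to_utf8_sketch, and the packing order (first encoded byte in the most significant occupied position) are assumptions; the text above does not specify any of them.

```c
#include <stdint.h>

/* Assumed definitions; any unsigned types of sufficient width would do. */
typedef uint_least16_t unicode_t;   /* at least 16 bits */
typedef uint_least32_t utf8_t;      /* at least 32 bits */

/* Hypothetical helper (not part of the specification): encode one 16-bit
   character value as UTF-8 and pack the resulting 1 to 3 bytes into a
   utf8_t, with the first encoded byte in the most significant position.
   No special treatment is given to surrogate or reserved codes. */
static utf8_t unic_to_utf8_sketch(unicode_t c)
{
    utf8_t u = (utf8_t)c & 0xFFFFu;

    if (u < 0x80u)                      /* 1 byte: 0xxxxxxx */
        return u;
    if (u < 0x800u)                     /* 2 bytes: 110xxxxx 10xxxxxx */
        return ((0xC0u | (u >> 6)) << 8)
             | (0x80u | (u & 0x3Fu));
    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    return ((0xE0u | (u >> 12)) << 16)
         | ((0x80u | ((u >> 6) & 0x3Fu)) << 8)
         | (0x80u | (u & 0x3Fu));
}
```

Under this packing, a value in the 16-bit range never needs more than three bytes, so the fourth byte of a utf8_t stays zero.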
etc...
etc...
The standard header <unicode.h> contains declarations for the following constants with their corresponding values:
UNIC__C0            0x0000  // C0 Controls
UNIC__LATIN         0x0020  // Basic Latin (ASCII, ISO-646)
UNIC__C1            0x0080  // C1 Controls
UNIC__LATIN_1       0x00A0  // Latin-1 Supplement
UNIC__LATIN_A       0x0100  // Latin Extended-A
UNIC__LATIN_B       0x0180  // Latin Extended-B
UNIC__IPA           0x0250  // IPA Extensions
UNIC__SPACING       0x02B0  // Spacing Modifier Letters
UNIC__DIACRIT       0x0300  // Combining Diacritical Marks
UNIC__GREEK         0x0370  // Greek
UNIC__CYRILLIC      0x0400  // Cyrillic
UNIC__ARMENIAN      0x0530  // Armenian
UNIC__HEBREW        0x0590  // Hebrew
UNIC__ARABIC        0x0600  // Arabic
UNIC__DEVANAGARI    0x0900  // Devanagari
UNIC__BENGALI       0x0980  // Bengali
UNIC__GURMUHKI      0x0A00  // Gurmukhi
UNIC__GUJARATI      0x0A80  // Gujarati
UNIC__ORIYA         0x0B00  // Oriya
UNIC__TAMIL         0x0B80  // Tamil
UNIC__TELUGU        0x0C00  // Telugu
UNIC__KANNADA       0x0C80  // Kannada
UNIC__MALAYALAM     0x0D00  // Malayalam
UNIC__THAI          0x0E00  // Thai
UNIC__LAO           0x0E80  // Lao
UNIC__TIBETAN       0x0F00  // Tibetan
UNIC__GEORGIAN      0x10A0  // Georgian
UNIC__HANGUL        0x1100  // Hangul Jamo
UNIC__LATIN_E       0x1E00  // Latin Extended Additional
UNIC__GREEK_E       0x1F00  // Greek Extended
UNIC__PUNCT         0x2000  // General Punctuation
UNIC__SUPER         0x2070  // Superscripts and Subscripts
UNIC__CURRENCY      0x20A0  // Currency Symbols
UNIC__DIACRIT_SYM   0x20D0  // Combining Diacriticals for Symbols
UNIC__LETTERLIKE    0x2100  // Letterlike Symbols
UNIC__NUMBER        0x2150  // Number Forms
UNIC__ARROW         0x2190  // Arrows
UNIC__MATH          0x2200  // Mathematical Operators
UNIC__TECHNICAL     0x2300  // Miscellaneous Technical
UNIC__CONTROL       0x2400  // Control Pictures
UNIC__OCR           0x2440  // Optical Character Recognition
UNIC__ENCL_ALPHA    0x2460  // Enclosed Alphanumerics
UNIC__BOX           0x2500  // Box Drawing
UNIC__BLOCK         0x2580  // Block Elements
UNIC__GEOMETRIC     0x25A0  // Geometric Shapes
UNIC__SYMBOL        0x2600  // Miscellaneous Symbols
UNIC__DINGBAT       0x2700  // Dingbats
UNIC__CJK_PUNCT     0x3000  // CJK Symbols and Punctuation
UNIC__HIRIGANA      0x3040  // Hiragana
UNIC__KATAKANA      0x30A0  // Katakana
UNIC__BOPOMOFO      0x3100  // Bopomofo
UNIC__HANGUL_COMPAT 0x3130  // Hangul Compatibility Jamo
UNIC__KANBUN        0x3190  // Kanbun
UNIC__ENCL_CJK      0x3200  // Enclosed CJK Letters and Months
UNIC__CJK_COMPAT    0x3300  // CJK Compatibility
UNIC__CJK           0x4E00  // CJK Unified Ideographs
UNIC__HANGUL_SYL    0xAC00  // Hangul Syllables
UNIC__CJK_COMPAT2   0xF900  // CJK Compatibility Ideographs
UNIC__ALPHABETIC    0xFB00  // Alphabetic Presentation Forms
UNIC__ARABIC_A      0xFB50  // Arabic Presentation Forms-A
UNIC__HALF          0xFE20  // Combining Half Marks
UNIC__CJK_COMPAT3   0xFE30  // CJK Compatibility Forms
UNIC__SMALL         0xFE50  // Small Form Variants
UNIC__ARABIC_B      0xFE70  // Arabic Presentation Forms-B
UNIC__HALF_FULL     0xFF00  // Halfwidth and Fullwidth Forms
UNIC__SPECIAL       0xFFF0  // Specials

UNIC_MIN            0x0000  // Minimum Unicode code
UNIC_MAX            0xFFFF  // Maximum Unicode code

UNIC_NUL            0x0000  // Null
UNIC_SOH            0x0001  // Start of Heading
UNIC_STX            0x0002  // Start of Text
UNIC_ETX            0x0003  // End of Text
UNIC_EOT            0x0004  // End of Transmission
UNIC_ENQ            0x0005  // Enquire
UNIC_ACK            0x0006  // Acknowledge
UNIC_BEL            0x0007  // Bell (Alarm)
UNIC_BS             0x0008  // Backspace
UNIC_HT             0x0009  // Horizontal Tab
UNIC_LF             0x000A  // Linefeed
UNIC_VT             0x000B  // Vertical Tab
UNIC_FF             0x000C  // Formfeed
UNIC_CR             0x000D  // Carriage Return
UNIC_SO             0x000E  // Shift Out
UNIC_SI             0x000F  // Shift In
UNIC_DLE            0x0010  // Data Link Escape
UNIC_DC1            0x0011  // Device Control 1
UNIC_DC2            0x0012  // Device Control 2
UNIC_DC3            0x0013  // Device Control 3
UNIC_DC4            0x0014  // Device Control 4
UNIC_NAK            0x0015  // Negative Acknowledge
UNIC_SYN            0x0016  // Synchronous Idle
UNIC_ETB            0x0017  // End of Transmission Block
UNIC_CAN            0x0018  // Cancel
UNIC_EM             0x0019  // End of Medium
UNIC_SUB            0x001A  // Substitute
UNIC_ESC            0x001B  // Escape
UNIC_FS             0x001C  // File Separator
UNIC_GS             0x001D  // Group Separator
UNIC_RS             0x001E  // Record Separator
UNIC_US             0x001F  // Unit Separator
UNIC_SP             0x0020  // Space
UNIC_DEL            0x007F  // Delete (Rubout)
UNIC_NBSP           0x00A0  // No-Break Space
UNIC_NQSP           0x2000  // En Quad
UNIC_MQSP           0x2001  // Em Quad
UNIC_ENSP           0x2002  // En Space
UNIC_EMSP           0x2003  // Em Space
UNIC_3MSP           0x2004  // 3-Em Space
UNIC_4MSP           0x2005  // 4-Em Space
UNIC_6MSP           0x2006  // 6-Em Space
UNIC_FSP            0x2007  // Figure Space
UNIC_PSP            0x2008  // Punctuation Space
UNIC_THSP           0x2009  // Thin Space
UNIC_HSP            0x200A  // Hair Space
UNIC_ZWSP           0x200B  // Zero-Width Space
UNIC_ZWNJ           0x200C  // Zero-Width Non-Joiner
UNIC_ZWJ            0x200D  // Zero-Width Joiner
UNIC_LRM            0x200E  // Left-to-Right Mark
UNIC_RLM            0x200F  // Right-to-Left Mark
UNIC_LSEP           0x2028  // Line Separator
UNIC_PSEP           0x2029  // Paragraph Separator
UNIC_LRE            0x202A  // Left-to-Right Embedding
UNIC_RLE            0x202B  // Right-to-Left Embedding
UNIC_PDF            0x202C  // Pop Directional Formatting
UNIC_LRO            0x202D  // Left-to-Right Override
UNIC_RLO            0x202E  // Right-to-Left Override
UNIC_ISS            0x206A  // Inhibit Symmetric Swapping
UNIC_ASS            0x206B  // Activate Symmetric Swapping
UNIC_IAFS           0x206C  // Inhibit Arabic Form Shaping
UNIC_AAFS           0x206D  // Activate Arabic Form Shaping
UNIC_NADS           0x206E  // National Digit Shapes
UNIC_NODS           0x206F  // Nominal Digit Shapes
UNIC_IDSP           0x3000  // Ideographic Space
...
UNIC_REPL           0xFFFD  // Replacement
UNIC_RES0           0xFFF0  // Reserved 0
UNIC_ZWNBSP         0xFEFF  // Zero-Width No-Break Space
UNIC_BOM            0xFEFF  // Byte Order Mark
UNIC_BOM_R          0xFFFE  // Byte Order Mark, reversed
UNIC_NAC            0xFFFF  // Not A Character
These constants may be implemented as either enumeration constants or as preprocessor macros.
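The two permitted forms could look like the following sketch. Which constants use which form is an arbitrary choice made here for illustration; a real header would define each name exactly once.

```c
/* Form 1 (illustrative): enumeration constants. */
enum {
    UNIC_NUL = 0x0000,
    UNIC_SP  = 0x0020
};

/* Form 2 (illustrative): preprocessor macros. */
#define UNIC_NBSP 0x00A0
#define UNIC_NAC  0xFFFF
```

One practical difference: before C23, an enumeration constant must have a value representable as an int, so on an implementation with a 16-bit int the values above 0x7FFF (such as 0xFFFF) can only be provided as macros.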
Constants whose names begin with UNIC__ identify the groups
into which the Unicode characters are arranged.
Each such constant represents the first Unicode character code within
a specific group.
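Because each group constant marks the first code of its group, a character's group can be found by locating the last group start that does not exceed the character code. The sketch below shows the idea with an abbreviated table; the helper name unic_group_start_sketch and the unicode_t typedef are assumptions, and only the first few groups are included.

```c
#include <stddef.h>

typedef unsigned short unicode_t;   /* assumed: at least 16 bits */

/* A few of the group constants from the list, written here as macros. */
#define UNIC__C0        0x0000
#define UNIC__LATIN     0x0020
#define UNIC__C1        0x0080
#define UNIC__LATIN_1   0x00A0
#define UNIC__LATIN_A   0x0100
#define UNIC__LATIN_B   0x0180
#define UNIC__IPA       0x0250

/* Hypothetical helper (not part of the specification): return the first
   code of the group containing c.  The table is abbreviated, so any code
   at or above UNIC__IPA is reported as belonging to the IPA group. */
static unicode_t unic_group_start_sketch(unicode_t c)
{
    /* Group starts in ascending order. */
    static const unicode_t starts[] = {
        UNIC__C0, UNIC__LATIN, UNIC__C1, UNIC__LATIN_1,
        UNIC__LATIN_A, UNIC__LATIN_B, UNIC__IPA
    };
    size_t i = sizeof starts / sizeof starts[0];

    /* Walk down from the highest group start until one is <= c. */
    while (i > 1 && starts[i - 1] > c)
        i--;
    return starts[i - 1];
}
```

Since only group starts are defined, this approach treats every code up to the next group start as part of the preceding group, including any unassigned codes in between.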
A few other constants bear special meanings, as described below.
UNIC_MIN
const unicode_t UNIC_MIN;
The UNIC_MIN constant represents
the lowest valid Unicode character value, which is 0x0000 (U+0000).
UNIC_MAX
const unicode_t UNIC_MAX;
The UNIC_MAX constant represents
the highest Unicode character value, which is 0xFFFF (U+FFFF).
UNIC_NAC
const unicode_t UNIC_NAC;
The UNIC_NAC constant represents
the "Not A Character" Unicode character value, which is 0xFFFF (U+FFFF).
UNIC_NUL
const unicode_t UNIC_NUL;
The UNIC_NUL constant represents
the "Null" Unicode character value, which is 0x0000 (U+0000).
Unicode character strings are usually terminated with this
character code.
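A terminated string can therefore be measured the same way as a C string. The following is a minimal sketch; the function name unic_strlen_sketch and the typedef are assumptions and are not declared by the header as described.

```c
#include <stddef.h>

typedef unsigned short unicode_t;   /* assumed: at least 16 bits */
#define UNIC_NUL 0x0000

/* Hypothetical helper: count the character codes that precede the
   terminating UNIC_NUL. */
static size_t unic_strlen_sketch(const unicode_t *s)
{
    size_t n = 0;

    while (s[n] != UNIC_NUL)
        n++;
    return n;
}
```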
UNIC_BOM
const unicode_t UNIC_BOM;
The UNIC_BOM constant represents
the "Byte Order Mark" Unicode character value, which is 0xFEFF (U+FEFF).
This character code occurs within Unicode character streams, typically
as the first code, to indicate a "normal" byte ordering, i.e., big-endian
ordering (the byte 0xFE followed by the byte 0xFF).
UNIC_BOM_R
const unicode_t UNIC_BOM_R;
The UNIC_BOM_R constant represents a
byte-swapped "Byte Order Mark" Unicode character value,
which is 0xFFFE (U+FFFE).
A reader that finds this character code where it expects a Byte Order Mark
is seeing a "reversed" byte ordering, i.e., the stream is little-endian
(the byte 0xFF followed by the byte 0xFE) and each code must be byte-swapped.
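A reader might use the two constants as sketched below to decide whether the codes of a stream need byte-swapping. The helper names and the return convention are assumptions made for illustration; the header as described declares no such functions.

```c
typedef unsigned short unicode_t;   /* assumed: at least 16 bits */
#define UNIC_BOM    0xFEFF
#define UNIC_BOM_R  0xFFFE

/* Hypothetical helper: exchange the two bytes of a 16-bit code. */
static unicode_t unic_swap_sketch(unicode_t c)
{
    return (unicode_t)(((c & 0x00FFu) << 8) | ((c >> 8) & 0x00FFu));
}

/* Hypothetical helper: classify the first code read from a stream.
   Returns 0 if it is UNIC_BOM (codes can be used as read),
           1 if it is UNIC_BOM_R (every following code must be swapped),
          -1 if it is neither (no byte order mark present). */
static int unic_needs_swap_sketch(unicode_t first)
{
    if (first == UNIC_BOM)
        return 0;
    if (first == UNIC_BOM_R)
        return 1;
    return -1;
}
```

When the result is 1, applying unic_swap_sketch() to each later code restores the intended values; swapping UNIC_BOM_R itself yields UNIC_BOM.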
The standard header <unicode.h> contains declarations for the following library functions:
unicode()
unicode()
int unicode(unicode_t *t);
The unicode()
function
...
1. etc...
etc...