module Uucd : sig .. end
Uucd decodes the data of the
Unicode character database
from its XML representation. It provides high-level (but not
necessarily efficient) access to the data so that efficient
representations can be extracted.
Uucd decodes the representation described in Annex #42 of
Unicode 6.3.0. Subsequent versions may be decoded as
long as no new cases are introduced in parsed enumerated
properties.
Consult the basics.
Note. All strings returned by the module are UTF-8 encoded.
Release 1.0.0, Unicode version 6.3.0, Daniel Bünzli <daniel.buenzli@erratique.ch>
type cp = int
val is_cp : int -> bool
val is_scalar_value : int -> bool
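The two predicates above are listed without a description. Assuming they follow the Unicode Standard's definitions (code points range over U+0000..U+10FFFF, and scalar values are code points excluding the surrogates U+D800..U+DFFF), a minimal sketch of equivalent checks might look as follows. These are stand-alone re-implementations for illustration, not the module's actual code.

```ocaml
(* Assumption: a code point is any integer in U+0000..U+10FFFF. *)
let is_cp (n : int) : bool = 0x0000 <= n && n <= 0x10FFFF

(* Assumption: a scalar value is a code point that is not a surrogate
   (surrogates are U+D800..U+DFFF). *)
let is_scalar_value (n : int) : bool =
  (0x0000 <= n && n <= 0xD7FF) || (0xE000 <= n && n <= 0x10FFFF)
```

Under these assumptions a surrogate such as 0xD800 is a code point but not a scalar value.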
module Cpmap : Map.S with type key = cp
Properties are referenced by their name and property values by
their abbreviated name. To understand their semantics refer to the
standard.
type props
type 'a prop
val find : props -> 'a prop -> 'a option
find ps p is the value of property p in ps, if any.
val unknown_prop : string * string -> string prop
unknown_prop (ns, n) is a property read from an XML attribute
whose expanded name is (ns, n). This can be used to access a
property unknown to the module.
In alphabetical order.
val age : [ `Unassigned | `Version of int * int ] prop
val alphabetic : bool prop
val ascii_hex_digit : bool prop
val bidi_class : [ `AL
| `AN
| `B
| `BN
| `CS
| `EN
| `ES
| `ET
| `FSI
| `L
| `LRE
| `LRI
| `LRO
| `NSM
| `ON
| `PDF
| `PDI
| `R
| `RLE
| `RLI
| `RLO
| `S
| `WS ] prop
val bidi_control : bool prop
val bidi_mirrored : bool prop
val bidi_mirroring_glyph : cp option prop
val bidi_paired_bracket : [ `Cp of cp | `Self ] prop
val bidi_paired_bracket_type : [ `C | `N | `O ] prop
val block : [ `ASCII
| `Aegean_Numbers
| `Alchemical
| `Alphabetic_PF
| `Ancient_Greek_Music
| `Ancient_Greek_Numbers
| `Ancient_Symbols
| `Arabic
| `Arabic_Ext_A
| `Arabic_Math
| `Arabic_PF_A
| `Arabic_PF_B
| `Arabic_Sup
| `Armenian
| `Arrows
| `Avestan
| `Balinese
| `Bamum
| `Bamum_Sup
| `Batak
| `Bengali
| `Block_Elements
| `Bopomofo
| `Bopomofo_Ext
| `Box_Drawing
| `Brahmi
| `Braille
| `Buginese
| `Buhid
| `Byzantine_Music
| `CJK
| `CJK_Compat
| `CJK_Compat_Forms
| `CJK_Compat_Ideographs
| `CJK_Compat_Ideographs_Sup
| `CJK_Ext_A
| `CJK_Ext_B
| `CJK_Ext_C
| `CJK_Ext_D
| `CJK_Radicals_Sup
| `CJK_Strokes
| `CJK_Symbols
| `Carian
| `Chakma
| `Cham
| `Cherokee
| `Compat_Jamo
| `Control_Pictures
| `Coptic
| `Counting_Rod
| `Cuneiform
| `Cuneiform_Numbers
| `Currency_Symbols
| `Cypriot_Syllabary
| `Cyrillic
| `Cyrillic_Ext_A
| `Cyrillic_Ext_B
| `Cyrillic_Sup
| `Deseret
| `Devanagari
| `Devanagari_Ext
| `Diacriticals
| `Diacriticals_For_Symbols
| `Diacriticals_Sup
| `Dingbats
| `Domino
| `Egyptian_Hieroglyphs
| `Emoticons
| `Enclosed_Alphanum
| `Enclosed_Alphanum_Sup
| `Enclosed_CJK
| `Enclosed_Ideographic_Sup
| `Ethiopic
| `Ethiopic_Ext
| `Ethiopic_Ext_A
| `Ethiopic_Sup
| `Geometric_Shapes
| `Georgian
| `Georgian_Sup
| `Glagolitic
| `Gothic
| `Greek
| `Greek_Ext
| `Gujarati
| `Gurmukhi
| `Half_And_Full_Forms
| `Half_Marks
| `Hangul
| `Hanunoo
| `Hebrew
| `High_PU_Surrogates
| `High_Surrogates
| `Hiragana
| `IDC
| `IPA_Ext
| `Imperial_Aramaic
| `Indic_Number_Forms
| `Inscriptional_Pahlavi
| `Inscriptional_Parthian
| `Jamo
| `Jamo_Ext_A
| `Jamo_Ext_B
| `Javanese
| `Kaithi
| `Kana_Sup
| `Kanbun
| `Kangxi
| `Kannada
| `Katakana
| `Katakana_Ext
| `Kayah_Li
| `Kharoshthi
| `Khmer
| `Khmer_Symbols
| `Lao
| `Latin_1_Sup
| `Latin_Ext_A
| `Latin_Ext_Additional
| `Latin_Ext_B
| `Latin_Ext_C
| `Latin_Ext_D
| `Lepcha
| `Letterlike_Symbols
| `Limbu
| `Linear_B_Ideograms
| `Linear_B_Syllabary
| `Lisu
| `Low_Surrogates
| `Lycian
| `Lydian
| `Mahjong
| `Malayalam
| `Mandaic
| `Math_Alphanum
| `Math_Operators
| `Meetei_Mayek
| `Meetei_Mayek_Ext
| `Meroitic_Cursive
| `Meroitic_Hieroglyphs
| `Miao
| `Misc_Arrows
| `Misc_Math_Symbols_A
| `Misc_Math_Symbols_B
| `Misc_Pictographs
| `Misc_Symbols
| `Misc_Technical
| `Modifier_Letters
| `Modifier_Tone_Letters
| `Mongolian
| `Music
| `Myanmar
| `Myanmar_Ext_A
| `NB
| `NKo
| `New_Tai_Lue
| `Number_Forms
| `OCR
| `Ogham
| `Ol_Chiki
| `Old_Italic
| `Old_Persian
| `Old_South_Arabian
| `Old_Turkic
| `Oriya
| `Osmanya
| `PUA
| `Phags_Pa
| `Phaistos
| `Phoenician
| `Phonetic_Ext
| `Phonetic_Ext_Sup
| `Playing_Cards
| `Punctuation
| `Rejang
| `Rumi
| `Runic
| `Samaritan
| `Saurashtra
| `Sharada
| `Shavian
| `Sinhala
| `Small_Forms
| `Sora_Sompeng
| `Specials
| `Sundanese
| `Sundanese_Sup
| `Sup_Arrows_A
| `Sup_Arrows_B
| `Sup_Math_Operators
| `Sup_PUA_A
| `Sup_PUA_B
| `Sup_Punctuation
| `Super_And_Sub
| `Syloti_Nagri
| `Syriac
| `Tagalog
| `Tagbanwa
| `Tags
| `Tai_Le
| `Tai_Tham
| `Tai_Viet
| `Tai_Xuan_Jing
| `Takri
| `Tamil
| `Telugu
| `Thaana
| `Thai
| `Tibetan
| `Tifinagh
| `Transport_And_Map
| `UCAS
| `UCAS_Ext
| `Ugaritic
| `VS
| `VS_Sup
| `Vai
| `Vedic_Ext
| `Vertical_Forms
| `Yi_Radicals
| `Yi_Syllables
| `Yijing ] prop
val canonical_combining_class : int prop
val cased : bool prop
val case_folding : [ `Cps of cp list | `Self ] prop
val case_ignorable : bool prop
val changes_when_casefolded : bool prop
val changes_when_casemapped : bool prop
val changes_when_lowercased : bool prop
val changes_when_nfkc_casefolded : bool prop
val changes_when_titlecased : bool prop
val changes_when_uppercased : bool prop
val composition_exclusion : bool prop
val dash : bool prop
val decomposition_mapping : [ `Cps of cp list | `Self ] prop
val decomposition_type : [ `Can
| `Com
| `Enc
| `Fin
| `Font
| `Fra
| `Init
| `Iso
| `Med
| `Nar
| `Nb
| `None
| `Sml
| `Sqr
| `Sub
| `Sup
| `Vert
| `Wide ] prop
val default_ignorable_code_point : bool prop
val deprecated : bool prop
val diacritic : bool prop
val east_asian_width : [ `A | `F | `H | `N | `Na | `W ] prop
val expands_on_nfc : bool prop
val expands_on_nfd : bool prop
val expands_on_nfkc : bool prop
val expands_on_nfkd : bool prop
val extender : bool prop
val fc_nfkc_closure : [ `Cps of cp list | `Self ] prop
val full_composition_exclusion : bool prop
val general_category : [ `Cc
| `Cf
| `Cn
| `Co
| `Cs
| `Ll
| `Lm
| `Lo
| `Lt
| `Lu
| `Mc
| `Me
| `Mn
| `Nd
| `Nl
| `No
| `Pc
| `Pd
| `Pe
| `Pf
| `Pi
| `Po
| `Ps
| `Sc
| `Sk
| `Sm
| `So
| `Zl
| `Zp
| `Zs ] prop
val grapheme_base : bool prop
val grapheme_cluster_break : [ `CN | `CR | `EX | `L | `LF | `LV | `LVT | `PP | `RI | `SM | `T | `V | `XX ] prop
val grapheme_extend : bool prop
val grapheme_link : bool prop
val hangul_syllable_type : [ `L | `LV | `LVT | `NA | `T | `V ] prop
val hex_digit : bool prop
val hyphen : bool prop
val id_continue : bool prop
val id_start : bool prop
val ideographic : bool prop
val ids_binary_operator : bool prop
val ids_trinary_operator : bool prop
val indic_syllabic_category : [ `Avagraha
| `Bindu
| `Consonant
| `Consonant_Dead
| `Consonant_Final
| `Consonant_Head_Letter
| `Consonant_Medial
| `Consonant_Placeholder
| `Consonant_Repha
| `Consonant_Subjoined
| `Modifying_Letter
| `Nukta
| `Other
| `Register_Shifter
| `Tone_Letter
| `Tone_Mark
| `Virama
| `Visarga
| `Vowel
| `Vowel_Dependent
| `Vowel_Independent ] prop
val indic_matra_category : [ `Bottom
| `Bottom_And_Right
| `Invisible
| `Left
| `Left_And_Right
| `NA
| `Overstruck
| `Right
| `Top
| `Top_And_Bottom
| `Top_And_Bottom_And_Right
| `Top_And_Left
| `Top_And_Left_And_Right
| `Top_And_Right
| `Visual_Order_Left ] prop
val iso_comment : string prop
val jamo_short_name : string prop
val join_control : bool prop
val joining_group : [ `Ain
| `Alaph
| `Alef
| `Alef_Maqsurah
| `Beh
| `Beth
| `Burushaski_Yeh_Barree
| `Dal
| `Dalath_Rish
| `E
| `Farsi_Yeh
| `Fe
| `Feh
| `Final_Semkath
| `Gaf
| `Gamal
| `Hah
| `Hamza_On_Heh_Goal
| `He
| `Heh
| `Heh_Goal
| `Heth
| `Kaf
| `Kaph
| `Khaph
| `Knotted_Heh
| `Lam
| `Lamadh
| `Meem
| `Mim
| `No_Joining_Group
| `Noon
| `Nun
| `Nya
| `Pe
| `Qaf
| `Qaph
| `Reh
| `Reversed_Pe
| `Rohingya_Yeh
| `Sad
| `Sadhe
| `Seen
| `Semkath
| `Shin
| `Swash_Kaf
| `Syriac_Waw
| `Tah
| `Taw
| `Teh_Marbuta
| `Teh_Marbuta_Goal
| `Teth
| `Waw
| `Yeh
| `Yeh_Barree
| `Yeh_With_Tail
| `Yudh
| `Yudh_He
| `Zain
| `Zhain ] prop
val joining_type : [ `C | `D | `L | `R | `T | `U ] prop
val line_break : [ `AI
| `AL
| `B2
| `BA
| `BB
| `BK
| `CB
| `CJ
| `CL
| `CM
| `CP
| `CR
| `EX
| `GL
| `H2
| `H3
| `HL
| `HY
| `ID
| `IN
| `IS
| `JL
| `JT
| `JV
| `LF
| `NL
| `NS
| `NU
| `OP
| `PO
| `PR
| `QU
| `RI
| `SA
| `SG
| `SP
| `SY
| `WJ
| `XX
| `ZW ] prop
val logical_order_exception : bool prop
val lowercase : bool prop
val lowercase_mapping : [ `Cps of cp list | `Self ] prop
val math : bool prop
val name : [ `Name of string | `Pattern of string ] prop
In the `Pattern case occurrences of the character '#'
(U+0023) in the string must be replaced by the value of the code
point as four to six uppercase hexadecimal digits (the minimal
needed). E.g. the pattern "CJK UNIFIED IDEOGRAPH-#" associated
to code point U+3400 gives the name "CJK UNIFIED IDEOGRAPH-3400".
val name_alias : (string * [ `Abbreviation | `Alternate | `Control | `Correction | `Figment ]) list prop
val nfc_quick_check : [ `False | `Maybe | `True ] prop
val nfd_quick_check : [ `False | `Maybe | `True ] prop
val nfkc_quick_check : [ `False | `Maybe | `True ] prop
val nfkc_casefold : [ `Cps of cp list | `Self ] prop
val nfkd_quick_check : [ `False | `Maybe | `True ] prop
val noncharacter_code_point : bool prop
val numeric_type : [ `De | `Di | `None | `Nu ] prop
val numeric_value : [ `Frac of int * int | `NaN | `Num of int64 ] prop
val other_alphabetic : bool prop
val other_default_ignorable_code_point : bool prop
val other_grapheme_extend : bool prop
val other_id_continue : bool prop
val other_id_start : bool prop
val other_lowercase : bool prop
val other_math : bool prop
val other_uppercase : bool prop
val pattern_syntax : bool prop
val pattern_white_space : bool prop
val quotation_mark : bool prop
val radical : bool prop
type script = [ `Arab
| `Armi
| `Armn
| `Avst
| `Bali
| `Bamu
| `Batk
| `Beng
| `Bopo
| `Brah
| `Brai
| `Bugi
| `Buhd
| `Cakm
| `Cans
| `Cari
| `Cham
| `Cher
| `Copt
| `Cprt
| `Cyrl
| `Deva
| `Dsrt
| `Egyp
| `Ethi
| `Geor
| `Glag
| `Goth
| `Grek
| `Gujr
| `Guru
| `Hang
| `Hani
| `Hano
| `Hebr
| `Hira
| `Hrkt
| `Ital
| `Java
| `Kali
| `Kana
| `Khar
| `Khmr
| `Knda
| `Kthi
| `Lana
| `Laoo
| `Latn
| `Lepc
| `Limb
| `Linb
| `Lisu
| `Lyci
| `Lydi
| `Mand
| `Merc
| `Mero
| `Mlym
| `Mong
| `Mtei
| `Mymr
| `Nkoo
| `Ogam
| `Olck
| `Orkh
| `Orya
| `Osma
| `Phag
| `Phli
| `Phnx
| `Plrd
| `Prti
| `Qaai
| `Rjng
| `Runr
| `Samr
| `Sarb
| `Saur
| `Shaw
| `Shrd
| `Sinh
| `Sora
| `Sund
| `Sylo
| `Syrc
| `Tagb
| `Takr
| `Tale
| `Talu
| `Taml
| `Tavt
| `Telu
| `Tfng
| `Tglg
| `Thaa
| `Thai
| `Tibt
| `Ugar
| `Vaii
| `Xpeo
| `Xsux
| `Yiii
| `Zinh
| `Zyyy
| `Zzzz ]
val script : script prop
val script_extensions : script list prop
val sentence_break : [ `AT
| `CL
| `CR
| `EX
| `FO
| `LE
| `LF
| `LO
| `NU
| `SC
| `SE
| `SP
| `ST
| `UP
| `XX ] prop
val simple_case_folding : [ `Cp of cp | `Self ] prop
val simple_lowercase_mapping : [ `Cp of cp | `Self ] prop
val simple_titlecase_mapping : [ `Cp of cp | `Self ] prop
val simple_uppercase_mapping : [ `Cp of cp | `Self ] prop
val soft_dotted : bool prop
val sterm : bool prop
val terminal_punctuation : bool prop
val titlecase_mapping : [ `Cps of cp list | `Self ] prop
val uax_42_element : [ `Char | `Noncharacter | `Reserved | `Surrogate ] prop
val unicode_1_name : string prop
val unified_ideograph : bool prop
val uppercase : bool prop
val uppercase_mapping : [ `Cps of cp list | `Self ] prop
val variation_selector : bool prop
val white_space : bool prop
val word_break : [ `CR
| `DQ
| `EX
| `Extend
| `FO
| `HL
| `KA
| `LE
| `LF
| `MB
| `ML
| `MN
| `NL
| `NU
| `RI
| `SQ
| `XX ] prop
val xid_continue : bool prop
val xid_start : bool prop
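The substitution rule stated for the `Pattern case of the name property can be sketched as below. The helper expand_name is hypothetical and not part of Uucd; it relies on the "%04X" format, which pads to at least four uppercase hexadecimal digits and so yields four to six digits for any code point up to U+10FFFF.

```ocaml
(* Hypothetical helper, not part of Uucd: turns a value of the name
   property into the actual character name for code point [cp]. *)
let expand_name (cp : int) (n : [ `Name of string | `Pattern of string ])
  : string =
  match n with
  | `Name s -> s
  | `Pattern p ->
      (* Minimal number of uppercase hex digits, but at least four. *)
      let hex = Printf.sprintf "%04X" cp in
      let b = Buffer.create (String.length p + 6) in
      (* Replace every '#' (U+0023) by the hexadecimal code point. *)
      String.iter
        (fun c ->
           if c = '#' then Buffer.add_string b hex else Buffer.add_char b c)
        p;
      Buffer.contents b
```

For instance, expand_name 0x3400 (`Pattern "CJK UNIFIED IDEOGRAPH-#") evaluates to "CJK UNIFIED IDEOGRAPH-3400", matching the example given for the name property.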
In alphabetical order. For now Unihan properties are always
represented as strings.
val kAccountingNumeric : string prop
val kAlternateHanYu : string prop
val kAlternateJEF : string prop
val kAlternateKangXi : string prop
val kAlternateMorohashi : string prop
val kBigFive : string prop
val kCCCII : string prop
val kCNS1986 : string prop
val kCNS1992 : string prop
val kCangjie : string prop
val kCantonese : string prop
val kCheungBauer : string prop
val kCheungBauerIndex : string prop
val kCihaiT : string prop
val kCompatibilityVariant : string prop
val kCowles : string prop
val kDaeJaweon : string prop
val kDefinition : string prop
val kEACC : string prop
val kFenn : string prop
val kFennIndex : string prop
val kFourCornerCode : string prop
val kFrequency : string prop
val kGB0 : string prop
val kGB1 : string prop
val kGB3 : string prop
val kGB5 : string prop
val kGB7 : string prop
val kGB8 : string prop
val kGradeLevel : string prop
val kGSR : string prop
val kHangul : string prop
val kHanYu : string prop
val kHanyuPinlu : string prop
val kHanyuPinyin : string prop
val kHDZRadBreak : string prop
val kHKGlyph : string prop
val kHKSCS : string prop
val kIBMJapan : string prop
val kIICore : string prop
val kIRGDaeJaweon : string prop
val kIRGDaiKanwaZiten : string prop
val kIRGHanyuDaZidian : string prop
val kIRGKangXi : string prop
val kIRG_GSource : string prop
val kIRG_HSource : string prop
val kIRG_JSource : string prop
val kIRG_KPSource : string prop
val kIRG_KSource : string prop
val kIRG_MSource : string prop
val kIRG_TSource : string prop
val kIRG_USource : string prop
val kIRG_VSource : string prop
val kJHJ : string prop
val kJIS0213 : string prop
val kJapaneseKun : string prop
val kJapaneseOn : string prop
val kJis0 : string prop
val kJis1 : string prop
val kKPS0 : string prop
val kKPS1 : string prop
val kKSC0 : string prop
val kKSC1 : string prop
val kKangXi : string prop
val kKarlgren : string prop
val kKorean : string prop
val kLau : string prop
val kMainlandTelegraph : string prop
val kMandarin : string prop
val kMatthews : string prop
val kMeyerWempe : string prop
val kMorohashi : string prop
val kNelson : string prop
val kOtherNumeric : string prop
val kPhonetic : string prop
val kPrimaryNumeric : string prop
val kPseudoGB1 : string prop
val kRSAdobe_Japan1_6 : string prop
val kRSJapanese : string prop
val kRSKanWa : string prop
val kRSKangXi : string prop
val kRSKorean : string prop
val kRSMerged : string prop
val kRSUnicode : string prop
val kSBGY : string prop
val kSemanticVariant : string prop
val kSimplifiedVariant : string prop
val kSpecializedSemanticVariant : string prop
val kTaiwanTelegraph : string prop
val kTang : string prop
val kTotalStrokes : string prop
val kTraditionalVariant : string prop
val kVietnamese : string prop
val kXHC1983 : string prop
val kWubi : string prop
val kXerox : string prop
val kZVariant : string prop
type block = (cp * cp) * string
type named_sequence = string * cp list
type normalization_correction = cp * cp list * cp list * (int * int * int)
type standardized_variant = cp list * string * [ `Final | `Initial | `Isolate | `Medial ] list
type cjk_radical = string * cp * cp
type emoji_source = cp list * int option * int option * int option
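As an illustration of how the block type might be consumed, the sketch below looks up the name of the block containing a code point. It assumes that the pair of code points in a block is the inclusive first and last code point of the range, which is not stated here. The helper block_name and the sample list are hypothetical; the types are redeclared locally so that the sketch stands alone.

```ocaml
(* Local copies of the types shown above, so the sketch is stand-alone. *)
type cp = int
type block = (cp * cp) * string

(* Hypothetical helper, not part of Uucd. Assumes each block is an
   inclusive (first, last) code point range paired with its name. *)
let block_name (blocks : block list) (cp : cp) : string option =
  match List.find_opt (fun ((lo, hi), _) -> lo <= cp && cp <= hi) blocks with
  | Some (_, name) -> Some name
  | None -> None

(* Hand-written sample data; a decoded database would provide the
   real list. *)
let sample : block list =
  [ (0x0000, 0x007F), "Basic Latin";
    (0x0080, 0x00FF), "Latin-1 Supplement" ]
```

With this sample, block_name sample 0x41 is Some "Basic Latin" and block_name sample 0x100 is None. A linear scan is enough for a sketch; a real application would likely extract a more efficient representation.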
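type t = {
  description : string;
  repertoire : props Cpmap.t;
  blocks : block list;
  named_sequences : named_sequence list;
  provisional_named_sequences : named_sequence list;
  normalization_corrections : normalization_correction list;
  standardized_variants : standardized_variant list;
  cjk_radicals : cjk_radical list;
  emoji_sources : emoji_source list;
}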
Note. Absence of an optional top-level field in the database
is denoted by the neutral element of its type (empty string, empty
list, Cpmap.empty). This means that the module doesn't
distinguish between absence of a field and presence of the field
with empty data (but incurs no problems in this context).
val cp_prop : t -> cp -> 'a prop -> 'a option
cp_prop ucd cp p is the property p of the code point cp
in ucd's repertoire, if cp is in the repertoire and the property
exists for cp.
type src = [ `Channel of Pervasives.in_channel | `String of string ]
type decoder
val decoder : [< src ] -> decoder
decoder src is a decoder that inputs from src.
val decode : decoder -> [ `Error of string | `Ok of t ]
decode d decodes a database from d or returns an error.
val decoded_range : decoder -> (int * int) * (int * int)
decoded_range d is the range of characters spanning the `Error
decoded by d. A pair of line and column numbers respectively one and
zero based.
The database and subsets of it for Unicode 6.3.0 are available here.
Databases with groups should be preferred: they maximize value
sharing and improve parsing performance.
A database is decoded as follows:
let ucd_or_die inf =
  try
    (* Read from standard input when the file name is "-". *)
    let ic = if inf = "-" then stdin else open_in inf in
    let d = Uucd.decoder (`Channel ic) in
    match Uucd.decode d with
    | `Ok db -> db
    | `Error e ->
        (* Report the error with the input range that triggered it. *)
        let (l0, c0), (l1, c1) = Uucd.decoded_range d in
        Printf.eprintf "%s:%d.%d-%d.%d: %s\n%!" inf l0 c0 l1 c1 e;
        exit 1
  with Sys_error e -> Printf.eprintf "%s\n%!" e; exit 1

let ucd = ucd_or_die "/tmp/ucd.all.grouped.xml"
The convenience function Uucd.cp_prop can be used to query
the property of a given code point. For example the
general category of U+1F42B
is given by:
let u_1F42B_gc = Uucd.cp_prop ucd 0x1F42B Uucd.general_category
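The value returned by such a query is an option over polymorphic variants, so it is typically consumed by pattern matching. The sketch below shows one way this could look. The function describe_gc is hypothetical, handles only two categories explicitly, and is written against plain variant values so that it does not need a decoded database.

```ocaml
(* Hypothetical helper, not part of Uucd: describes the result of a
   general_category lookup. None covers both "code point not in the
   repertoire" and "property absent". *)
let describe_gc = function
  | None -> "no general category available"
  | Some `So -> "symbol, other"
  | Some `Cn -> "unassigned"
  | Some _ -> "some other category"
```

Since U+1F42B (BACTRIAN CAMEL) has general category So, applying describe_gc to the value computed above would be expected to give "symbol, other", provided the decoded database contains that code point.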