SWI-Prolog -- unicode

Availability: :- use_module(library(unicode)). (can be autoloaded)

[nondet] unicode_property(?Code, ?Property)

Query the Unicode character database for Code. Code is an integer code point (0 .. 0x10FFFF) or a single-character atom; Property is a term of the form Name(Value) drawn from the list below.

This predicate is a thin wrapper over utf8proc's property struct, so its vocabulary matches the utf8proc documentation. In mode (+,?) the predicate enumerates the properties of the given code on backtracking; in mode (-,?) it enumerates the codes that match Property; in mode (+,+) it acts as a semi-deterministic test that succeeds at most once or fails.
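
The calling modes can be illustrated with a minimal sketch. The helper names code_properties/2 and has_property/2 are hypothetical and introduced only for this example; the sketch assumes unicode_property/2 behaves as described in this section.

```prolog
:- use_module(library(unicode)).

% code_properties(+Code, -Properties): collect every property that
% unicode_property/2 enumerates for Code, using the (+,?) mode.
code_properties(Code, Properties) :-
    findall(P, unicode_property(Code, P), Properties).

% has_property(+Code, +Property): the (+,+) mode used as a test.
% once/1 discards any choice point the nondet predicate may leave.
has_property(Code, Property) :-
    once(unicode_property(Code, Property)).
```

A query such as code_properties(0'A, Ps) should then bind Ps to a list of Name(Value) terms; the exact contents depend on the installed Unicode data.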

Supported properties:

category(Atom)

Unicode general category. Atom is one of cc, cf, cn, co, cs, ll, lm, lo, lt, lu, mc, me, mn, nd, nl, no, pc, pd, pe, pf, pi, po, ps, sc, sk, sm, so, zl, zp, zs. When querying, the single capital letter of a major category stands for all of its subcategories (for instance, 'L' matches lu, ll, lt, lm and lo); e.g.

?- unicode_property(0'A, category('L')).
true.
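
Building on the query above, the following sketch filters the letters out of a text. The predicates letter_code/1 and letters_only/2 are hypothetical names used for illustration, and the code assumes that category('L') matches every letter subcategory as documented here.

```prolog
:- use_module(library(unicode)).

% letter_code(+Code): Code belongs to some letter subcategory.
letter_code(Code) :-
    once(unicode_property(Code, category('L'))).

% letters_only(+Text, -Letters): keep only the letter code points of Text.
% Letters is returned as a string.
letters_only(Text, Letters) :-
    text_to_string(Text, String),
    string_codes(String, Codes),
    include(letter_code, Codes, LetterCodes),
    string_codes(Letters, LetterCodes).
```

The same pattern works for other major categories, for example category('N') for numbers or category('P') for punctuation.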

combining_class(Integer)

Canonical combining class (0 for base characters, 230 for accents above, etc.).

bidi_class(Atom)

Bidirectional class. One of l, lre, lro, r, al, rle, rlo, pdf, en, es, et, an, cs, nsm, bn, b, s, ws, on.

bidi_mirrored(Bool)

true if the character is mirrored for bidi (parentheses, brackets, math operators, ...).

decomp_type(Atom)

Compatibility decomposition type. One of font, nobreak, initial, medial, final, isolated, circle, super, sub, vertical, wide, narrow, small, square, fraction, compat. Fails when there is no decomposition.

ignorable(Bool)

true if the character is a "default ignorable" code point.

boundclass(Atom)

UAX#29 grapheme-cluster break class. One of start, other, cr, lf, control, extend, l, v, t, lv, lvt, regional_indicator, spacingmark, prepend, zwj, extended_pictographic, e_zwg.

width(Integer)

Display width in fixed-width cells, 0..3. Zero for combining marks and control characters, 1 for most characters, 2 for "wide" characters (CJK, emoji).
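
As a rough illustration, the width property can be summed to estimate how many terminal cells a text occupies. The names code_width/2 and string_display_width/2 are hypothetical, and the fallback to width 1 when no width is reported is an assumption of this sketch rather than documented behaviour.

```prolog
:- use_module(library(unicode)).

% code_width(+Code, -Width): cell width of one code point.
% Assumption: fall back to 1 if unicode_property/2 reports no width.
code_width(Code, Width) :-
    (   unicode_property(Code, width(W))
    ->  Width = W
    ;   Width = 1
    ).

% string_display_width(+Text, -Width): sum of the per-code-point widths.
string_display_width(Text, Width) :-
    text_to_string(Text, String),
    string_codes(String, Codes),
    maplist(code_width, Codes, Widths),
    sum_list(Widths, Width).
```

Summing per code point is only an approximation: sequences that render as a single glyph, such as emoji joined with a zero-width joiner, would need grapheme-cluster segmentation first.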

ambiguous_width(Bool)

true if the character has East-Asian Ambiguous width: normally one column, but two in a legacy CJK context.

uppercase(Code)

lowercase(Code)

titlecase(Code)

Single-code-point case mapping. Fails when the code point has no mapping of that kind; for example, unicode_property(0'A, uppercase(_)) fails because 'A' is already upper-case. For characters whose case mapping produces more than one code point (e.g. U+00DF LATIN SMALL LETTER SHARP S, whose upper-case form is "SS"), use unicode_map/3 with the [casefold] option or unicode_casefold/2 for a full string-level transformation.
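
Because these properties fail when no mapping exists, callers usually want a fallback to the original code point. The sketch below shows one way to do that; simple_upcase/2 and simple_upcase_string/2 are hypothetical helper names, and the code assumes the uppercase(Code) property behaves as described above.

```prolog
:- use_module(library(unicode)).

% simple_upcase(+Code, -Upper): single-code-point upper-case mapping,
% or Code itself when unicode_property/2 reports no mapping.
simple_upcase(Code, Upper) :-
    (   unicode_property(Code, uppercase(U))
    ->  Upper = U
    ;   Upper = Code
    ).

% simple_upcase_string(+Text, -Upper): map every code point of Text.
simple_upcase_string(Text, Upper) :-
    text_to_string(Text, String),
    string_codes(String, Codes),
    maplist(simple_upcase, Codes, UpperCodes),
    string_codes(Upper, UpperCodes).
```

This mapping is one-to-one by construction, so it leaves characters with multi-code-point mappings, such as U+00DF, unchanged.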

indic_conjunct_break(Atom)

Indic_Conjunct_Break property (Unicode 15.1 and later; UAX#44). One of none, linker, consonant, extend. Used by the UAX#29 grapheme-cluster-break algorithm to keep conjuncts together in Devanagari, Bengali, etc.

See also: http://www.unicode.org/reports/tr44/ (Unicode property database)