SWI-Prolog -- Unicode Prolog source

2.15.1.9 Unicode Prolog source

The ISO standard specifies the Prolog syntax in ASCII characters. As SWI-Prolog supports Unicode in source files, we must extend the syntax. This section describes the implications for source files, while writing international source files is described in section 3.1.3.

The SWI-Prolog Unicode character classification is currently based on version 14.0.0 of the Unicode standard. Please note that char_type/2 and friends, intended to be used with all text except Prolog source code, are based on the C library locale-based classification routines.

Quoted atoms and strings
Any character of any script can be used in quoted atoms and strings. The escape sequences \uXXXX and \UXXXXXXXX (see section 2.15.1.3) were introduced to specify Unicode code points in ASCII files.
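As a small illustration of these escapes, the following toplevel queries use only ASCII source text to build atoms and strings that hold non-ASCII code points. The particular atoms and code points are arbitrary choices made for this sketch.

```prolog
% \uXXXX takes exactly four hexadecimal digits, \UXXXXXXXX exactly eight.
?- atom_codes('caf\u00E9', Codes).
Codes = [99, 97, 102, 233].

% Both escape forms can denote the same code point (U+00E9).
?- '\u00E9' == '\U000000E9'.
true.

% The escapes work the same way in strings (U+03B1, U+03B2).
?- string_codes("\u03B1\u03B2", Codes).
Codes = [945, 946].
```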
Atoms and Variables
We handle them in one item as they are closely related. The Unicode standard defines a syntax for identifiers in computer languages (see http://www.unicode.org/reports/tr31/). In this syntax identifiers start with ID_Start followed by a sequence of ID_Continue codes. Such sequences are handled as a single token in SWI-Prolog. The token is a variable iff it starts with an uppercase character or an underscore (_). Otherwise it is an atom. Note that many languages do not have the notion of character case. In such languages variables must be written as _name.
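The rule can be explored with a small helper. The sketch below is not part of the system: token_kind/2 is a hypothetical predicate introduced here for illustration. It reads a list of character codes as a term and reports what the reader produced. Code lists are used so that the example itself stays ASCII.

```prolog
% token_kind(+Codes, -Kind): read the text as a single term and report
% whether the reader produced a variable, an atom or something else.
% (Hypothetical helper, defined only for this illustration.)
token_kind(Codes, Kind) :-
    atom_codes(Text, Codes),
    term_to_atom(Term, Text),
    (   var(Term)
    ->  Kind = variable
    ;   atom(Term)
    ->  Kind = atom
    ;   Kind = other
    ).
```

Following the rule above, these queries are expected to answer as indicated in the comments:

```prolog
?- token_kind([0x00C9, 0x74, 0x00E9], K).   % starts with uppercase E-acute: variable
?- token_kind([0x00E9, 0x74, 0x00E9], K).   % starts with lowercase e-acute: atom
?- token_kind([0x65E5, 0x672C], K).         % two CJK ideographs, no case: atom
?- token_kind([0x5F, 0x65E5, 0x672C], K).   % same, prefixed with _: variable
```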
Numbers

Decimal number characters (Nd) are accepted to form numbers, regardless of the Unicode block in which they appear. Currently this is supported for integers, rational numbers (see section 2.15.1.6) and floating point numbers. In any number, all digits must come from the same block, i.e., if the numerator of a rational uses Indian script, so must the denominator. All special characters such as the sign, the rational separator, the floating point ., and the floating point exponent must use their usual ASCII characters.
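The following sketch shows one way to try this from the toplevel. parse_number/2 is a hypothetical helper written for this illustration, and the example assumes a SWI-Prolog version that implements the behaviour described in this section. Syntax errors are mapped to failure to keep the queries simple.

```prolog
% parse_number(+Codes, -Number): read the text as a Prolog term and
% succeed only if the result is a number.  Syntax errors cause failure.
% (Hypothetical helper, defined only for this illustration.)
parse_number(Codes, Number) :-
    atom_codes(Text, Codes),
    catch(term_to_atom(Number, Text), _, fail),
    number(Number).
```

According to the description above, the first query should bind N to 123, while the second should fail because its digits come from two different blocks:

```prolog
?- parse_number([0x0661, 0x0662, 0x0663], N).   % Arabic-Indic digits 1, 2, 3
?- parse_number([0x31, 0x0662], N).             % ASCII 1 followed by Arabic-Indic 2
```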
White space
All characters marked as separators (Z*) in the Unicode tables are handled as layout characters.
Control and unassigned characters
Control and unassigned (C*) characters produce a syntax error if encountered outside quoted atoms/strings and outside comments. Quoted writing (e.g., writeq/1) of an atom or string that contains one of these characters causes the atom or string to be quoted and the control or unassigned characters to be written using an escape sequence. See section 2.15.1.3.
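The effect on quoted writing can be inspected as follows. This is a minimal sketch: quoted_text/2 and roundtrips/1 are hypothetical helpers, and the atom containing code 1 (a control character) is an arbitrary example. The exact escape sequence printed is the one documented in section 2.15.1.3 and is deliberately not shown here.

```prolog
% quoted_text(+Atom, -Text): the text that writeq/1 prints for Atom.
% (Hypothetical helper, defined only for this illustration.)
quoted_text(Atom, Text) :-
    with_output_to(string(Text), writeq(Atom)).

% roundtrips(+Atom): reading the quoted text back yields the same atom.
roundtrips(Atom) :-
    quoted_text(Atom, Text),
    term_string(Term, Text),
    Term == Atom.
```

```prolog
% Show the quoted form of an atom holding a control character.
?- atom_codes(A, [0'a, 1, 0'b]), quoted_text(A, Text).

% The quoted form can be read back to obtain the original atom.
?- atom_codes(A, [0'a, 1, 0'b]), roundtrips(A).
```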
Other characters
The first 128 characters follow the ISO Prolog standard. Unicode symbol and punctuation characters (general category S* and P*) act as gluing symbol characters (i.e., just like ==: an unquoted sequence of symbol characters is combined into an atom).
Other characters (this is mainly No: a numeric character of other type) are currently handled as ‘solo’.
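The gluing behaviour of symbol characters can be checked with a sketch like the one below. symbol_atom/1 is a hypothetical helper introduced for this illustration, and U+2192 (RIGHTWARDS ARROW, general category Sm) is an arbitrary choice of symbol character.

```prolog
% symbol_atom(+Codes): the text reads as one atom consisting of exactly
% these codes, i.e., the symbol characters were glued into one token.
% (Hypothetical helper, defined only for this illustration.)
symbol_atom(Codes) :-
    atom_codes(Text, Codes),
    catch(term_to_atom(Term, Text), _, fail),
    atom(Term),
    atom_codes(Term, Codes).
```

If the reader follows the rule described above, this goal is expected to succeed:

```prolog
?- symbol_atom([0x2192, 0x2192]).   % two arrows read as a single atom
```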