SGML or XML files are loaded through the common predicate load_structure/3. This is a predicate with many options. For simplicity a number of commonly used shorthands are provided: load_sgml_file/2, load_xml_file/2, and load_html_file/2.
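load_structure(+Source, -ListOfContent, +Options)
Parse Source and return the resulting structure in ListOfContent. Source is either a term of the format stream(StreamHandle) or a file-name. Options is a list of options controlling the conversion process.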
A proper XML document contains only a single toplevel element whose name matches the document type. Nevertheless, a list is returned for consistency with the representation of element content. The ListOfContent consists of the following types:
Atom
Atoms are used to represent CDATA. Note this is possible in SWI-Prolog, as there is no length-limit on atoms and atom garbage collection is provided.
element(Name, ListOfAttributes, ListOfContent)
Name is the name of the element. ListOfAttributes is a list of Name=Value pairs for attributes. Attributes of type CDATA are returned literally. Multi-valued attributes (NAMES, etc.) are returned as a list of atoms. Handling of attributes of the types NUMBER and NUMBERS depends on the number(+NumberMode) attribute set through set_sgml_parser/2 or load_structure/3. By default they are returned as atoms, but automatic conversion to Prolog integers is supported. ListOfContent defines the content of the element.
sdata(Text)
If an entity with declared content SDATA is encountered, this term is returned holding the data in Text.
ndata(Text)
If an entity with declared content NDATA is encountered, this term is returned holding the data in Text.
pi(Text)
If a processing instruction is encountered (<?...?>), Text holds the text of the processing instruction. Please note that the <?xml ...?> instruction is handled internally.
The Options list controls the conversion process. Currently defined options are below. Other options are passed to sgml_parse/2.
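dtd(?DTD)
Reference to a DTD object. If specified, the <!DOCTYPE ...> declaration is ignored and the document is parsed and validated against the provided DTD. If provided as a variable, the created DTD is returned. See section 3.5.
dialect(+Dialect)
Specify the parsing dialect. Supported are sgml (default), html4, html5, html (same as html4), xhtml, xhtml5, xml and xmlns. See the option dialect of set_sgml_parser/2 for details.
shorttag(+Bool)
Define whether SHORTTAG abbreviation is accepted. Without SHORTTAG, a / is accepted with warning as part of an unquoted attribute-value, though /> still closes the element-tag in XML mode. It may be set to false for parsing HTML documents to allow for unquoted URLs containing /.
space(+SpaceMode)
Sets the space-handling mode for the initial environment. This mode is inherited by the other environments, which can override the inherited value using the XML reserved attribute xml:space. See section 3.2.
number(+NumberMode)
Determines how attributes of type NUMBER and NUMBERS are handled. If token (default) they are passed as an atom. If integer the parser attempts to convert the value to an integer. If successful, the attribute is passed as a Prolog integer. Otherwise it is still passed as an atom. Note that SGML defines a numeric attribute to be a sequence of digits. The - sign is not allowed and 1 is different from 01. For this reason the default is to handle numeric attributes as tokens. If conversion to integer is enabled, negative values are silently accepted.
case_sensitive_attributes(+Boolean)
Treat attribute values as case sensitive. The default is true for XML and false for SGML and HTML dialects.
case_preserving_attributes(+Boolean)
Treat attribute values as case insensitive but do not alter their case. The default is false. Setting this option sets the case_sensitive_attributes to the same value. This option was added to support HTML quasi quotations and most likely has little value in other contexts.
system_entities(+Boolean)
Define whether SYSTEM entities are expanded. The default is false.
defaults(+Bool)
Determines how default and fixed values from the DTD are used. By default, defaults are included in the output if they do not appear in the source. If false, only the attributes occurring in the source are emitted.
entity(+Name, +Value)
Defines (overwrites) an entity definition. At the moment, only CDATA entities can be specified with this construct. Multiple entity options are allowed.
max_memory(+Max)
Sets the maximum buffer size available for input data and CDATA output. Using max_memory(0) (the default) means no resource limit will be enforced.
cdata(+Representation)
Specify the representation of cdata elements. Supported are atom (default), and string. The choice is not obvious. Strings are allocated on the Prolog stacks and subject to normal stack garbage collection. They are quicker to create and avoid memory fragmentation. However, multiple copies of the same string are stored multiple times, while the text is shared if atoms are used. Strings are also useful for security sensitive information as they are invisible to other threads and cannot be enumerated using, e.g., current_atom/1. Finally, using strings allows for resource usage limits using the global stack limit (see set_prolog_stack/2).
attribute_value(+Representation)
Specify the representation of attribute values. Supported are atom (default), and string. See above for the advantages and disadvantages of using strings.
keep_prefix(+Boolean)
If true, xmlns namespaces with prefixes are returned as ns(Prefix, URI) terms. If false (default), the prefix is ignored and the xmlns namespace is returned as just the URI.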
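As an illustration of how the Source argument and the options combine, the sketch below parses XML held in a string instead of a file. The helper name parse_xml_text/2 is hypothetical and not part of the library, and open_string/2 is assumed to be available (SWI-Prolog 7 or later).

```prolog
:- use_module(library(sgml)).

%  parse_xml_text(+Text, -Content)
%  Hypothetical helper: parse Text as XML and return the content list
%  in the representation described above.
parse_xml_text(Text, Content) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), Content,
                       [ dialect(xml),      % XML rules, no namespace handling
                         space(remove)      % drop layout-only CDATA
                       ]),
        close(In)).
```

With these options, the query `?- parse_xml_text('<a x="1"><b>hi</b></a>', C).` should bind C to a one-element list holding element(a, [x='1'], [element(b, [], [hi])]); the attribute value stays an atom because number(token) is the default.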
SGML2PL has four modes for handling white-space. The initial mode can be switched using the space(SpaceMode) option to load_structure/3 and set_sgml_parser/2. In XML mode, the mode is further controlled by the xml:space attribute, which may be specified both in the DTD and in the document. The defined modes are:
space(preserve)
White space is passed literally to the application. Note that \r\n is still translated to \n. To preserve whitespace exactly, use space(strict).
space(default)
In addition to sgml space-mode, all consecutive white-space is reduced to a single space-character. This mode canonicalises all white space.
space(remove)
In addition to default, all leading and trailing white-space is removed from CDATA objects. If, as a result, the CDATA becomes empty, nothing is passed to the application. This mode is especially handy for processing 'data-oriented' documents, such as RDF. It is not suitable for normal text documents. Consider the HTML fragment below. When processed in this mode, the spaces between the three modified words are lost. This mode is not part of any standard; XML 1.0 allows only default and preserve.
Consider adjacent <b>bold</b> <ul>and</ul> <it>italic</it> words.
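The effect of the space mode on this fragment can be explored with a small experiment. The sketch below parses a text under a chosen mode; the predicate names parse_with_space/3 and demo_fragment/1, and the wrapping <p> element, are assumptions made for this illustration only.

```prolog
:- use_module(library(sgml)).

%  parse_with_space(+Mode, +Text, -Content)
%  Hypothetical helper: parse Text as XML using space(Mode).
parse_with_space(Mode, Text, Content) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), Content,
                       [ dialect(xml), space(Mode) ]),
        close(In)).

%  The fragment from the text, wrapped in <p> to form one document.
demo_fragment('<p>Consider adjacent <b>bold</b> <ul>and</ul> <it>italic</it> words.</p>').
```

Running `?- demo_fragment(T), parse_with_space(remove, T, C1), parse_with_space(preserve, T, C2).` and comparing C1 with C2 shows whether the single-space CDATA between the inline elements survives in each mode.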
The parser can operate in two modes: sgml mode and xml mode, as defined by the dialect(Dialect) option. Regardless of this option, if the first line of the document reads as below, the parser is switched automatically into XML mode.
<?xml ... ?>
Currently switching to XML mode implies:
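XML empty elements: the construct <element [attribute...] /> is recognised as an empty element.
Predefined entities: the following entities are predefined: lt (<), gt (>), amp (&), apos (') and quot (").
Case sensitivity: in XML mode, names are treated case-sensitive, except for the DTD reserved names (i.e. ELEMENT, etc.).
Character classes: in XML mode, underscores (_) and colon (:) are allowed in names.
White-space handling: white space mode is set to preserve. In addition to setting white-space handling at the toplevel, the XML reserved attribute xml:space is honoured. It may appear both in the document and the DTD. The remove extension is honoured as xml:space value. For example, the DTD statement below ensures that the pre element preserves space, regardless of the default processing mode.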
<!ATTLIST pre xml:space nmtoken #fixed preserve>
Using the dialect xmlns, the parser will interpret XML namespaces. In this case, the names of elements are returned as a term of the format
URL:LocalName
If an identifier has no namespace and there is no default namespace it is returned as a simple atom. If an identifier has a namespace but this namespace is undeclared, the namespace name rather than the related URL is returned.
Attributes declaring namespaces (xmlns:<ns>=<url>) are reported as if xmlns were not a defined resource.
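The qualified-name representation can be inspected directly. The sketch below, in which the helper name qualified_root/2 and the namespace URL are assumptions chosen for illustration, parses a text in xmlns mode and returns the name of the first element found.

```prolog
:- use_module(library(sgml)).

%  qualified_root(+Text, -Name)
%  Hypothetical helper: parse Text with namespace support and return
%  the (possibly URL-qualified) name of the first toplevel element.
qualified_root(Text, Name) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), Content, [ dialect(xmlns) ]),
        close(In)),
    memberchk(element(Name, _, _), Content).
```

For the input '<x:a xmlns:x="http://example.org/ns"/>' the name should come back as 'http://example.org/ns':a, that is, the prefix x is replaced by the URL it was declared for.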
In many cases, getting attribute-names as url:name is not desirable. Such terms are hard to unify and sometimes multiple URLs may be mapped to the same identifier. This may happen due to poor version management, poor standardisation or because the application doesn't care too much about versions. This package defines two call-backs that can be set using set_sgml_parser/2 to deal with this problem.
The call-back xmlns is called as XML namespaces are noticed. It can be used to extend a canonical mapping for later use by the urlns call-back. The following illustrates this behaviour. Any namespace containing rdf-syntax in its URL or that is used as the rdf namespace is canonicalised to rdf. This implies that any attribute and element name from the RDF namespace appears as rdf:<name>.
    :- dynamic
        xmlns/3.

    on_xmlns(rdf, URL, _Parser) :-
        !,
        asserta(xmlns(URL, rdf, _)).
    on_xmlns(_, URL, _Parser) :-
        sub_atom(URL, _, _, _, 'rdf-syntax'),
        !,
        asserta(xmlns(URL, rdf, _)).

    load_rdf_xml(File, Term) :-
        load_structure(File, Term,
                       [ dialect(xmlns),
                         call(xmlns, on_xmlns),
                         call(urlns, xmlns)
                       ]).
The library provides iri_xml_namespace/3 to break down an IRI into its namespace and localname:
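iri_xml_namespace(+IRI, -Namespace, -Localname)
Split an IRI into its Namespace (an IRI) and Localname (an XML name). The Localname is the longest last part of the IRI that satisfies the syntax of an XML name. With IRI schemas that are designed to work with XML namespaces, this will typically break the IRI on the last # or /. Note however that this can produce unexpected results. E.g., in the example below, one might expect the namespace to be http://example.com/images#, but an XML name cannot start with a digit.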
    ?- iri_xml_namespace('http://example.com/images#12345', NS, L).
    NS = 'http://example.com/images#12345',
    L = ''.
As we see from the example above, the Localname can be the empty atom. Similarly, Namespace can be the empty atom if IRI is an XML name. Applications will often have to check for either or both these conditions. We decided against failing in these conditions because the application typically wants to know which of the two conditions (empty namespace or empty localname) holds. This predicate is often used for generating RDF/XML from an RDF graph.
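A caller that needs to distinguish these conditions might wrap the predicate as sketched below. The wrapper name iri_split_kind/2 and the result terms split/2, no_localname/1 and no_namespace/1 are hypothetical names introduced for this illustration.

```prolog
:- use_module(library(sgml)).

%  iri_split_kind(+IRI, -Kind)
%  Hypothetical wrapper that classifies the result of
%  iri_xml_namespace/3 so callers can test for the two edge cases.
iri_split_kind(IRI, Kind) :-
    iri_xml_namespace(IRI, NS, Local),
    (   Local == ''
    ->  Kind = no_localname(NS)      % nothing usable as an XML name
    ;   NS == ''
    ->  Kind = no_namespace(Local)   % the whole IRI is an XML name
    ;   Kind = split(NS, Local)
    ).
```

For instance, 'http://example.com/images#photo' should be classified as split('http://example.com/images#', photo), while the IRI from the example above ends up as no_localname/1.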
The DTD (Document Type Definition) is a separate entity in sgml2pl, that can be created, freed, defined and inspected. Like the parser itself, it is filled by opening it as a Prolog output stream and sending data to it. This section summarises the predicates for handling the DTD.
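load_dtd(+DTD, +File, +Options)
Define the DTD by loading File. Defined options are the dialect option from open_dtd/3 and the encoding option from open/4. Notably the dialect option must match the dialect used for subsequent parsing using this DTD.
open_dtd(+DTD, +Options, -OutStream)
Open a DTD as an output stream. The option dialect(Dialect) defines the DTD dialect. The default is sgml. Using xml or xmlns processes the DTD case-sensitively.
dtd(+DocType, -DTD)
Find the DTD representing the indicated doctype. If a doctype has no associated DTD, a file is searched for using the file search path dtd using the call: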
    ...,
    absolute_file_name(dtd(Type),
                       [ extensions([dtd]),
                         access(read)
                       ], DtdFile),
    ...
Note that DTD objects may be modified while processing erroneous documents. For example, loading an SGML document starting with <?xml ...?> switches the DTD to XML mode, and encountering unknown elements adds these elements to the DTD object. Re-using a DTD object to parse multiple documents should be restricted to situations where the documents processed are known to be error-free.
The DTD html is handled separately. The Prolog flag html_dialect specifies the default html dialect, which is either html4 or html5 (default). Note that HTML5 has no DTD; the loaded DTD is an informal DTD that includes most of the HTML5 extensions (http://www.cs.tut.fi/~jkorpela/html5-dtd.html). In addition, the parser sets the dialect flag of the DTD object. This is used by the parser to accept HTML extensions. Next, the corresponding DTD is loaded.
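dtd_property(+DTD, ?Prop)
This predicate is used to examine the content of a DTD. Among the defined properties are:
element(Name, Omit, Content)
Omit is a term of the format omit(OmitOpen, OmitClose), where both arguments are booleans (true or false) representing whether the open- or close-tag may be omitted. Content is the content-model of the element represented as a Prolog term. This term takes the following form:
rcdata
Like cdata, but entity-references are expanded.
*(SubModel)
Zero or more occurrences of SubModel.
?(SubModel)
Zero or one occurrence of SubModel.
+(SubModel)
One or more occurrences of SubModel.
,(SubModel1, SubModel2)
SubModel1 followed by SubModel2.
|(SubModel1, SubModel2)
Either SubModel1 or SubModel2.
attribute(Element, Attribute, Type, Default)
Type is one of cdata, entity, id, idref, name, nmtoken, notation, number or nutoken. For DTD types that allow for a list, the notation list(Type) is used. Finally, the DTD construct (a|b|...) is mapped to the term nameof(ListOfValues). Default describes the SGML default. It is one of required, current, conref or implied. If a real default is present, it is one of default(Value) or fixed(Value).
notations(ListOfNotations)
Returns a list holding the names of all NOTATION declarations.
notation(Name, Decl)
Decl is a list with elements system(+File) and/or public(+PublicId).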
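To make the "DTD as an output stream" idea concrete, the sketch below fills a fresh DTD object from text and then inspects it. It relies on the library predicates new_dtd/2, open_dtd/3, dtd_property/2 and free_dtd/1; the helper names dtd_from_text/3 and dtd_element_names/2 and the sample declarations are assumptions made for this illustration.

```prolog
:- use_module(library(sgml)).

%  dtd_from_text(+DocType, +Text, -DTD)
%  Hypothetical helper: create a DTD object and define it by writing
%  the declarations in Text to the stream returned by open_dtd/3.
dtd_from_text(DocType, Text, DTD) :-
    new_dtd(DocType, DTD),
    open_dtd(DTD, [], Out),
    call_cleanup(write(Out, Text), close(Out)).

%  dtd_element_names(+DTD, -Names)
%  Sorted list of the element names declared in DTD.
dtd_element_names(DTD, Names) :-
    dtd_property(DTD, elements(Names0)),
    msort(Names0, Names).
```

A possible session: `?- dtd_from_text(x, '<!ELEMENT x - - (y|#PCDATA)*> <!ELEMENT y - - (#PCDATA)>', DTD), dtd_element_names(DTD, Names), free_dtd(DTD).` The caller remains responsible for releasing the DTD with free_dtd/1.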
As this parser allows for processing partial documents and processes the DTD separately, the DOCTYPE declaration plays a special role.
If a document has no DOCTYPE declaration, the parser returns a list holding all elements and CDATA found. If the document has a DOCTYPE declaration, the parser will open the element defined in the DOCTYPE as soon as the first real data is encountered.
Some documents have no DTD. One of the neat facilities of this library is that it builds a DTD while parsing a document with an implicit DTD. The resulting DTD contains all elements encountered in the document. For each element the content model is a disjunction of elements and possibly #PCDATA that can be repeated. Thus, if we found element y and CDATA in element x, the model is:
<!ELEMENT x - - (y|#PCDATA)*>
Any encountered attribute is added to the attribute list with the type CDATA and default #IMPLIED.
The example below extracts the elements used in an unknown XML document.
    elements_in_xml_document(File, Elements) :-
        load_structure(File, _,
                       [ dialect(xml),
                         dtd(DTD)
                       ]),
        dtd_property(DTD, elements(Elements)),
        free_dtd(DTD).
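free_sgml_parser(+Parser)
Destroy all resources related to the parser. This does not destroy the DTD if the parser was created using the dtd(DTD) option.
set_sgml_parser(+Parser, +Option)
Sets attributes of the parser. Among the defined attributes are:
charpos(Offset)
Sets the current character location. See also the file(File) option.
position(Position)
Set the source location from a stream position term as obtained using stream_property(Stream, position(Position)).
dialect(Dialect)
Set the markup dialect. Known dialects include:
html, html4
The same as sgml, but implies shorttag(false) and accepts XML empty element declarations (e.g., <img src="..."/>).
html5
In addition to html, accept attributes named data- without warning. This value initialises the charset to UTF-8.
xhtml, xhtml5
These document types are processed as xml. Dialect xhtml5 accepts attributes named data- without warning.
xml
This dialect is selected automatically if the processing instruction <?xml ...> is encountered. See section 3.3 for details.
xmlns
Process the document as XML with namespace support. See also the qualify_attributes option below.
qualify_attributes(Boolean)
How to handle unqualified attributes (i.e., without an explicit namespace) in XML namespace (xmlns) mode. Default and standard compliant is not to qualify such elements. If true, such attributes are qualified with the namespace of the element they appear in. This option is for backward compatibility as this is the behaviour of older versions. In addition, the namespace document suggests unqualified attributes are often interpreted in the namespace of their element.
number(NumberMode)
If token (default), attributes of type number are passed as a Prolog atom. If integer, such attributes are translated into Prolog integers. If the conversion fails (e.g. due to overflow) a warning is issued and the value is passed as an atom.
encoding(Encoding)
Set the initial encoding. XML documents may change the encoding using the encoding= attribute in the header. Explicit use of this option is only required to parse non-conforming documents. Currently accepted values are iso-8859-1 and utf-8.
doctype(Element)
Defines the toplevel element expected. If a <!DOCTYPE declaration has been parsed, the default is the defined doctype. The parser can be instructed to accept the first element encountered as the toplevel using doctype(_). This feature is especially useful when parsing part of a document (see the parse option to sgml_parse/2).
get_sgml_parser(+Parser, -Option)
Retrieve information on the current status of the parser. Among the defined options are:
source(-Stream)
The current source stream of the parser, for use in the on_begin, etc. callbacks from sgml_parse/2.
dialect(-Dialect)
Return the current dialect used by the parser (sgml, html, html5, xhtml, xhtml5, xml or xmlns).
event_class(-Class)
Denotes the cause of the current call-back event. The value shorttag means the current event (begin or end) is caused by an element written down using the shorttag notation (<tag/value/>).
allowed(-Elements)
Determines which elements may be inserted at the current location. This information is returned as a list of element names. If character data is allowed in the current location, #pcdata is part of Elements. If no element is open, the doctype is returned.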
This option is intended to support syntax-sensitive editors. Such an editor should load the DTD, find an appropriate starting point and then feed all data between the starting point and the caret into the parser. Next it can use this option to determine the elements allowed at this point. Below is a code fragment illustrating this use given a parser with loaded DTD, an input stream and a start-location.
    ...,
    seek(In, Start, bof, _),
    set_sgml_parser(Parser, charpos(Start)),
    set_sgml_parser(Parser, doctype(_)),
    Len is Caret - Start,
    sgml_parse(Parser,
               [ source(In),
                 content_length(Len),
                 parse(input)   % do not complete document
               ]),
    get_sgml_parser(Parser, allowed(Allowed)),
    ...
Input is a stream. A full description of the option-list is below.
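cdata(+Representation)
Specify the representation of cdata elements. Supported are atom (default), and string. See load_structure/3 for details.
parse(Unit)
Defines how much of the input is parsed. Values include:
element
The parser stops after reading the first element. Using source(Stream), this implies reading is stopped as soon as the element is complete, and another call may be issued on the same stream to read the next element.
content
The value content is like element but assumes the element has already been opened. It may be used in a call-back from call(on_begin, Pred) to parse individual elements after validating their headers.
declaration
Stop the parser after reading the first declaration. This is especially useful to parse only the doctype declaration.
input
This value is intended to be used in conjunction with the allowed(Elements) option of get_sgml_parser/2. It disables the parser's default to complete the parse-tree by closing all open elements.
max_errors(+MaxErrors)
Sets the maximum number of errors. Using max_errors(-1) makes the parser continue, no matter how many errors it encounters. If the limit is reached, an exception of the format below is raised:
error(limit_exceeded(max_errors, Max), _)
syntax_errors(+ErrorMode)
Defines how syntax errors are handled. In mode style, dubious input is reported using print_message/2 with severity informational.
xml_no_ns(+Mode)
Error handling if an XML namespace is not defined. The default generates an error. If quiet, the error is suppressed. Can be used together with call(urlns, Closure) to provide external expansion of namespaces. See also section 3.3.1.
call(+Event, :PredicateName)
Issue call-backs on the specified events. The defined events are:
begin
An open-tag has been parsed. The named handler is called with three arguments: Handler(+Tag, +Attributes, +Parser).
end
A close-tag has been parsed. The named handler is called with two arguments: Handler(+Tag, +Parser).
cdata
CDATA has been parsed. The named handler is called with two arguments: Handler(+CDATA, +Parser), where CDATA is an atom representing the data.
pi
A processing instruction has been parsed. The named handler is called with two arguments: Handler(+Text, +Parser), where Text is the text of the processing instruction.
decl
A declaration (<!...>) has been read. The named handler is called with two arguments: Handler(+Text, +Parser), where Text is the text of the declaration with comments removed. This option is especially useful for highlighting declarations and comments in editor support, where the location of the declaration is extracted using get_sgml_parser/2.
error
An error has been encountered. The named handler is called with three arguments: Handler(+Severity, +Message, +Parser), where Severity is one of warning or error and Message is an atom representing the diagnostic message. The location of the error can be determined using get_sgml_parser/2. If this option is present, errors and warnings are not reported using print_message/3.
xmlns
When parsing in xmlns mode, a new namespace declaration is pushed on the environment. The named handler is called with three arguments: Handler(+NameSpace, +URL, +Parser). See section 3.3.1 for details.
urlns
When parsing in xmlns mode, this predicate can be used to map a URL into either a canonical URL for this namespace or another internal identifier. See section 3.3.1 for details.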
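The begin event can be exercised with a few lines of code. The sketch below collects the tag names of all open-tags in document order; the names on_begin_tag/3, seen_tag/1 and tags_in_text/2 are hypothetical, and the handler is assumed to be defined in the module from which sgml_parse/2 is called.

```prolog
:- use_module(library(sgml)).

:- dynamic seen_tag/1.

%  Call-back for the begin event: Handler(+Tag, +Attributes, +Parser).
on_begin_tag(Tag, _Attributes, _Parser) :-
    assertz(seen_tag(Tag)).

%  tags_in_text(+Text, -Tags)
%  Hypothetical helper: parse Text as XML in call-back mode and return
%  the names of all opened elements in document order.
tags_in_text(Text, Tags) :-
    retractall(seen_tag(_)),
    setup_call_cleanup(
        open_string(Text, In),
        setup_call_cleanup(
            new_sgml_parser(Parser, []),
            ( set_sgml_parser(Parser, dialect(xml)),
              sgml_parse(Parser,
                         [ source(In),
                           call(begin, on_begin_tag)
                         ])
            ),
            free_sgml_parser(Parser)),
        close(In)),
    findall(Tag, seen_tag(Tag), Tags).
```

For example, `?- tags_in_text('<a><b/><b/></a>', Tags).` should yield Tags = [a, b, b]. Because no document(Term) option is given, no parse tree is built, which is the point of the call-back interface for large inputs.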
In some cases, only part of a document needs to be parsed. One option is to use load_structure/3 or one of its variations and extract the desired elements from the returned structure. This is a clean solution, especially on small and medium-sized documents. It is, however, unsuitable for parsing really big documents. Such documents can only be handled with the call-back output interface realised by the call(Event, Action) option of sgml_parse/2. Event-driven processing is not very natural in Prolog.
The SGML2PL library allows for a mixed approach. Consider the case where we want to process all descriptions from RDF elements in a document. The code below calls process_rdf_description(Element) on each element that is directly inside an RDF element.
    :- dynamic
        in_rdf/0.

    load_rdf(File) :-
        retractall(in_rdf),
        open(File, read, In),
        new_sgml_parser(Parser, []),
        set_sgml_parser(Parser, file(File)),
        set_sgml_parser(Parser, dialect(xml)),
        sgml_parse(Parser,
                   [ source(In),
                     call(begin, on_begin),
                     call(end, on_end)
                   ]),
        close(In).

    on_end('RDF', _) :-
        retractall(in_rdf).

    on_begin('RDF', _, _) :-
        assert(in_rdf).
    on_begin(Tag, Attr, Parser) :-
        in_rdf,
        !,
        sgml_parse(Parser,
                   [ document(Content),
                     parse(content)
                   ]),
        process_rdf_description(element(Tag, Attr, Content)).