SWI-Prolog -- Predicates for parsing RDF/XML

3 Predicates for parsing RDF/XML

The parser is designed to operate in various environments and therefore provides interfaces at various levels. First we describe the top level defined in library(rdf), simply parsing a RDF-XML file into a list of triples. Please note these are not asserted into the database because it is not necessarily the final format the user wishes to reason with and it is not clean how the user wants to deal with multiple RDF documents. Some options are using global URI's in one pool, in Prolog modules or using an additional argument.

load_rdf(+File, -Triples)

Same as load_rdf(File, Triples,[]).

load_rdf(+File, -Triples, +Options)

Read the RDF-XML file File and return a list of Triples. Options defines additional processing options. Currently defined options are:

base_uri(BaseURI): If provided local identifiers and identifier-references are globalised using this URI. If omited or the atom [], local identifiers are not tagged.
blank_nodes(Mode): If Mode is share (default), blank-node properties (i.e. complex properties without identifier) are reused if they result in exactly the same triple-set. Two descriptions are shared if their intermediate description is the same. This means they should produce the same set of triples in the same order. The value noshare creates a new resource for each blank node.
expand_foreach(Boolean): If Boolean is true, expand rdf:aboutEach into a set of triples. By default the parser generates rdf(each(Container), Predicate, Subject).
lang(Lang): Define the initial language (i.e. pretend there is an xml:lang declaration in an enclosing element).
ignore_lang(Bool): If true, xml:lang declarations in the document are ignored. This is mostly for compatibility with older versions of this library that did not support language identifiers.
convert_typed_literal(:ConvertPred): If the parser finds a literal with the rdf:datatype=Type attribute, call ConvertPred(+Type, +Content, -Literal). Content is the XML element contentas returned by the XML parser (a list). The predicate must unify Literal with a Prolog representation of Content according to Type or throw an exception if the conversion cannot be made.
This option servers two purposes. First of all it can be used to ignore type declarations for backward compatibility of this library. Second it can be used to convert typed literals to a meaningful Prolog representation. E.g. convert’42’to the Prolog integer 42 if the type is xsd:int or a related type.
namespaces(-List): Unify List with a list of NS=URL for each encountered xmlns:NS=URL declaration found in the source.
entity(+Name, +Value): Overrule entity declaration in file. As it is common practice to declare namespaces using entities in RDF/XML, this option allows for changing the namespace without changing the file. Multiple of these options are allowed.

The Triples list is a list of rdf(Subject, Predicate, Object) triples. Subject is either a plain resource (an atom), or one of the terms each(URI) or prefix(URI) with the obvious meaning. Predicate is either a plain atom for explicitely non-qualified names or a term NameSpace:Name. If NameSpace is the defined RDF name space it is returned as the atom rdf. Finally, Object is a URI, a Predicate or a term of the format literal(Value) for literal values. Value is either a plain atom or a parsed XML term (list of atoms and elements).

3.1 RDF Object representation

The Object (3rd) part of a triple can have several different types. If the object is a resource it is returned as either a plain atom or a term NameSpace:Name. If it is a literal it is returned as literal(Value), where Value takes one of the formats defined below.

An atom
If the literal Value is a plain atom is a literal value not subject to a datatype or xml:lang qualifier.
lang(LanguageID, Atom)
If the literal is subject to an xml:lang qualifier LanguageID specifies the language and Atom the actual text.
A list
If the literal is an XML literal as created by parseType="Literal" , the raw output of the XML parser for the content of the element is returned. This content is a list of element(Name, Attributes, Content) and atoms for CDATA parts as described with the SWI-Prolog SGML/XML parser.
type(Type, StringValue)
If the literal has an rdf:datatype=Type a term of this format is returned.

3.2 Name spaces

XML name spaces are identified using a URI. Unfortunately various URI's are in common use to refer to RDF. The rdf_parser.pl module therefore defines the namespace as a multifile/1 predicate, that can be extended by the user. For example, to parse the Netscape OpenDirectory structure.rdf file, the following declarations are used:

:- multifile
        rdf_parser:rdf_name_space/1.

rdf_parser:rdf_name_space('http://www.w3.org/TR/RDF/').
rdf_parser:rdf_name_space('http://directory.mozilla.org/rdf').
rdf_parser:rdf_name_space('http://dmoz.org/rdf').

The initial definition of this predicate is given below.

rdf_name_space('http://www.w3.org/1999/02/22-rdf-syntax-ns#').
rdf_name_space('http://www.w3.org/TR/REC-rdf-syntax').

3.3 Low-level access

The above defined load_rdf/[2,3] is not always suitable. For example, it cannot deal with documents where the RDF statement is embedded in an XML document. It also cannot deal with really large documents (e.g. the Netscape OpenDirectory project, currently about 90 MBytes), without huge amounts of memory.

For really large documents, the sgml2pl parser can be programmed to handle the content of a specific element (i.e. <rdf:RDF>) element-by-element. The parsing primitives defined in this section can be used to process these one-by-one.

xml_to_rdf(+XML, +BaseURI, -Triples)

Process an XML term produced by load_structure/3 using the dialect(xmlns) output option. XML is either a complete <rdf:RDF> element, a list of RDF-objects (container or description) or a single description of container.

process_rdf(+Input, :OnTriples, +Options)

Exploits the call-back interface of sgml2pl, calling OnTriples(Triples, File:Line) with the list of triples resulting from a single top level RDF object for each RDF element in the input as well as the source-location where the description started. Input is either a file name or term stream(Stream). When using a stream all triples are associated to the value of the base_uri option. This predicate can be used to process arbitrary large RDF files as the file is processed object-by-object. The example below simply asserts all triples into the database:

assert_list([], _).
assert_list([H|T], Source) :-
        assert(H),
        assert_list(T, Source).

?- process_rdf('structure,rdf', assert_list, []).

Options are described with load_rdf/3. The option expand_foreach is not supported as the container may be in a different description. Additional it provides embedded:

embedded(Boolean): The predicate process_rdf/3 processes arbitrary XML documents, only interpreting the content of rdf:RDF elements. If this option is false (default), it gives a warning on elements that are not processed. The option embedded(true) can be used to process RDF embedded in xhtml without warnings.