In some cases applications wish to process small portions of large SGML, XML or RDF files. For example, the OpenDirectory project by Netscape has produced a 90MB RDF file representing the main index. The parser described here can process this document as a unit, but loading takes 85 seconds on a Pentium-II 450 and the resulting term requires about 70MB of global stack. One option is to process the entire document and output it as a Prolog fact-base of RDF triples, but in many cases this is undesirable. Another example is a large SGML file containing online documentation. The application normally wishes to show the user only small portions at a time, so loading the entire document into memory is again undesirable.
Using the parse(element) option, we open a file, seek (using seek/4) to the position of the element and read the desired element.
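The open, seek and read sequence described above can be sketched as follows. This is a minimal illustration, not part of the package: read_element_at/3 and parse_one_element/2 are hypothetical helper names, and the sketch assumes that Offset is a byte offset at which an element's start tag begins and that the file is XML.

```prolog
:- use_module(library(sgml)).

%!  read_element_at(+File, +Offset, -Content) is det.
%
%   Hypothetical helper: parse the single XML element that starts at
%   byte Offset in File. Content is the list produced by document/1.
read_element_at(File, Offset, Content) :-
    setup_call_cleanup(
        open(File, read, In, [type(binary)]),
        (   seek(In, Offset, bof, _),        % jump to the element start
            parse_one_element(In, Content)
        ),
        close(In)).

%   Assumption: the stream is already positioned on the start tag.
parse_one_element(In, Content) :-
    new_sgml_parser(Parser, []),
    call_cleanup(
        (   set_sgml_parser(Parser, dialect(xml)),
            sgml_parse(Parser,
                       [ source(In),
                         document(Content),
                         parse(element)      % stop after one element
                       ])
        ),
        free_sgml_parser(Parser)).
```

Only the requested element is turned into a Prolog term, so memory use should depend on the size of that element rather than on the size of the file.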
The index can be built using the call-back interface of sgml_parse/2. For example, the following code makes an index of the structure.rdf file of the OpenDirectory project:
    :- use_module(library(sgml)).

    :- dynamic
        location/3.                     % Id, File, Offset

    rdf_index(File) :-
        retractall(location(_,_,_)),    % clear any previous index
        open(File, read, In, [type(binary)]),
        new_sgml_parser(Parser, []),
        set_sgml_parser(Parser, file(File)),
        set_sgml_parser(Parser, dialect(xml)),
        sgml_parse(Parser,
                   [ source(In),
                     call(begin, index_on_begin)
                   ]),
        close(In).

    % Called at each begin tag; records elements carrying an r:id.
    index_on_begin(_Element, Attributes, Parser) :-
        memberchk('r:id'=Id, Attributes),
        get_sgml_parser(Parser, charpos(Offset)),
        get_sgml_parser(Parser, file(File)),
        assert(location(Id, File, Offset)).
The following code extracts the RDF element with the required id:
    rdf_element(Id, Term) :-
        location(Id, File, Offset),
        load_structure(File, Term,
                       [ dialect(xml),
                         offset(Offset),
                         parse(element)
                       ]).