The library(rdf_db) module provides several hooks for extending its functionality. Database updates can be monitored and acted upon through the features described in section 3.4. The predicate rdf_load/2 can be hooked to deal with different formats such as Turtle, different input sources (e.g. HTTP) and different strategies for caching results.
The hooks below are used to add new RDF file formats and sources from which to load data to the library. They are used by the modules listed below, which are distributed with the package. Please examine the source code if you want to add new formats or locations.
library(semweb/turtle)
library(semweb/rdf_zlib_plugin)
library(semweb/rdf_http_plugin)
library(http/http_ssl_plugin): combined with library(semweb/rdf_http_plugin) to load RDF from HTTPS servers.
library(semweb/rdf_persistency)
library(semweb/rdf_cache)
file(+Name), stream(+Stream) or url(Protocol, URL). If this hook succeeds, the RDF will be read from Stream using rdf_load_stream/3. Otherwise the default open functionality for file and stream is used.
xml .owl. Format is either a built-in format (xml or triples) or a format understood by the rdf_load_stream/3 hook.
This module uses library(zlib) to load compressed files on the fly. The extension of the file must be .gz. The file format is deduced from the extension that remains after stripping the .gz extension, e.g. rdf_load('file.rdf.gz').
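As a minimal sketch of the above, assuming the plugin is loaded next to library(semweb/rdf_db): the helper load_compressed/1 and the file name in the usage line are hypothetical and only serve as illustration.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_zlib_plugin)).   % adds on-the-fly .gz handling

%  load_compressed(+File) is a hypothetical wrapper. The RDF format is
%  taken from the extension that precedes .gz, so no format option is given.
load_compressed(File) :-
    rdf_load(File, []).
```

A call such as `?- load_compressed('file.rdf.gz').` would then be expected to read the file as RDF/XML.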
This module allows for rdf_load('http://...'). It exploits library(http/http_open). The format of the document is determined from the mime-type returned by the server if this is one of text/rdf+xml, application/x-turtle or application/turtle. As RDF mime-types are not yet widely supported, the plugin uses the extension of the URL if the claimed mime-type is not one of the above. In addition, it recognises text/html and application/xhtml+xml, scanning the XML content for embedded RDF.
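The following sketch shows how loading over HTTP might look once the plugin is loaded. The helpers load_remote/1 and load_remote/2 are hypothetical names; the second one passes the format(Format) option of rdf_load/2 for servers whose mime-type and URL extension are both unhelpful.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_http_plugin)).   % lets rdf_load/2 accept http URLs
:- use_module(library(semweb/turtle)).            % needed if the server returns Turtle

%  load_remote(+URL): rely on the mime-type or the URL extension.
load_remote(URL) :-
    rdf_load(URL, []).

%  load_remote(+URL, +Format): state the format explicitly, e.g. xml or turtle.
load_remote(URL, Format) :-
    rdf_load(URL, [format(Format)]).
```

Loading library(http/http_ssl_plugin) as well should extend the same calls to HTTPS URLs.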
The library(semweb/rdf_cache) library defines the caching strategy for triple sources. When using large RDF sources, caching triples greatly speeds up loading RDF documents. The cache library implements two caching strategies that are controlled by rdf_set_cache_options/1.
Local caching: This approach applies to files only. Triples are cached in a sub-directory of the directory holding the source. This directory is called .cache (_cache on Windows). If the cache option create_local_directory is true, a cache directory is created if possible.
Global caching: This approach applies to all sources, except for unnamed streams. Triples are cached in the directory defined by the cache option global_directory.
When loading an RDF file, the system scans the configured cache files unless cache(false) is specified as an option to rdf_load/2 or caching is disabled. If caching is enabled but no cache exists, the system will try to create a cache file. First it will try to do this locally. On failure it will try the configured global cache.
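A possible configuration is sketched below, using only option names documented for rdf_set_cache_options/1 and the cache(false) option of rdf_load/2. The helper names and the directory passed in are hypothetical.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_cache)).

%  enable_global_cache(+Dir): hypothetical helper that switches caching on
%  and points the global strategy at Dir, creating Dir when it is missing.
enable_global_cache(Dir) :-
    rdf_set_cache_options([ enabled(true),
                            global_directory(Dir),
                            create_global_directory(true)
                          ]).

%  load_uncached(+Source): bypass the cache for a single load.
load_uncached(Source) :-
    rdf_load(Source, [cache(false)]).
```

Setting the options once at start-up, before the first rdf_load/2 call, keeps all later loads consistent.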
enabled(Boolean): If true, caching is enabled.
local_directory(Name): Plain name of local directory. Default .cache (_cache on Windows).
create_local_directory(Bool): If true, try to create local cache directories.
global_directory(Dir): Writeable directory for storing cached parsed files.
create_global_directory(Bool): If true, try to create the global cache directory.
read, it returns the name of an existing file. If write it returns where a new cache file can be overwritten or created.
The library(semweb/rdf_litindex) library exploits the primitives of section 4.5.1 and the NLP package to provide indexing on words inside literal constants. It also allows for fuzzy matching using stemming and ‘sounds-like’ matching based on the double metaphone algorithm of the NLP package.
sounds(Like, Words), stem(Like, Words) or prefix(Prefix, Words). On compound expressions, only combinations that provide literals are returned. Below is an example after loading the ULAN (Unified List of Artist Names from the Getty Foundation) database and showing all words that sound like ‘rembrandt’ and appear together in a literal with the word ‘Rijn’. Finding this result from the 228,710 literals contained in ULAN requires 0.54 milliseconds (AMD 1600+).
?- rdf_token_expansions(and('Rijn', sounds(rembrandt)), L).
L = [sounds(rembrandt, ['Rambrandt', 'Reimbrant', 'Rembradt', 'Rembrand', 'Rembrandt', 'Rembrandtsz', 'Rembrant', 'Rembrants', 'Rijmbrand'])]
Here is another example, illustrating handling of diacritics:
?- rdf_token_expansions(case(cafe), L).
L = [case(cafe, [cafe, café])]
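The same specification terms can be used to go from words to resources. The sketch below assumes rdf_find_literals/2 from this library; the helper subject_with_literal/2 is a hypothetical name, and the lookup only considers triples whose object is stored as literal(Text).

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_litindex)).
:- use_module(library(lists)).

%  subject_with_literal(+Spec, -Subject): enumerate subjects that have a
%  literal matching Spec, e.g. and('Rijn', sounds(rembrandt)).
subject_with_literal(Spec, Subject) :-
    rdf_find_literals(Spec, Literals),     % literals matching the token spec
    member(Literal, Literals),
    rdf(Subject, _Predicate, literal(Literal)).
```

With the ULAN data loaded, `?- subject_with_literal(and('Rijn', sounds(rembrandt)), S).` would be expected to enumerate the matching artist resources on backtracking.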
rdf_litindex:tokenization(Literal, -Tokens). On failure it calls tokenize_atom/2 from the NLP package and deletes the following: atoms of length 1, floats, integers that are out of range and the English words and, an, or, of, on, in, this and the. Deletion first calls the hook rdf_litindex:exclude_from_index(token, X). This hook is called as follows:
no_index_token(X) :- exclude_from_index(token, X), !.
no_index_token(X) :- ...
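An application could use this hook to add its own stop words. The following is a sketch only: extra_stop_word/1 and the two words are hypothetical, and the hook is declared multifile here on the assumption that other code may define clauses for it as well.

```prolog
:- use_module(library(semweb/rdf_litindex)).
:- multifile rdf_litindex:exclude_from_index/2.

%  Hypothetical application-specific stop words.
extra_stop_word(with).
extra_stop_word(from).

%  Succeed for tokens that must be kept out of the word index.
rdf_litindex:exclude_from_index(token, Token) :-
    atom(Token),
    extra_stop_word(Token).
```

The hook must be in place before literals are tokenized; words indexed earlier are not affected by it.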
‘Literal maps’ provide a relation between literal values, intended to create additional indexes on literals. The current implementation can only deal with integers and atoms (string literals). A literal map maintains an ordered set of keys. The ordering uses the same rules as described in section 4.5. Each key is associated with an ordered set of values. Literal map objects can be shared between threads, using a locking strategy that allows for multiple concurrent readers.
Typically, this module is used together with rdf_monitor/2 on the channels new_literal and old_literal to maintain an index of words that appear in a literal. Further abstraction using Porter stemming or Metaphone can be used to create additional search indices. These can map either directly to the literal values, or indirectly to the plain word-map. The SWI-Prolog NLP package provides complementary building blocks, such as a tokenizer, Porter stem and Double Metaphone.
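To make the key-to-values relation concrete, here is a small sketch that builds a literal map by hand instead of through rdf_monitor/2. It assumes the literal map predicates exported by library(semweb/rdf_db); the helper name and the sample data are made up for illustration.

```prolog
:- use_module(library(semweb/rdf_db)).

%  literal_map_demo(-Values): map words (keys) to the literals that
%  contain them (values) and ask for literals containing both words.
literal_map_demo(Values) :-
    rdf_new_literal_map(Map),
    setup_call_cleanup(
        true,
        ( rdf_insert_literal_map(Map, rembrandt, 'Rembrandt van Rijn'),
          rdf_insert_literal_map(Map, rijn,      'Rembrandt van Rijn'),
          rdf_insert_literal_map(Map, rembrandt, 'Rembrandt Peale'),
          % A key list selects the values associated with all keys.
          rdf_find_literal_map(Map, [rembrandt, rijn], Values)
        ),
        rdf_destroy_literal_map(Map)).
```

In this sketch the query should only return the literal that is registered under both keys.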
rdf_litindex.pl.
not(Key). If not-terms are provided, there must be at least one positive keyword. The negations are tested after establishing the positive matches.
The library(semweb/rdf_persistency)
provides reliable persistent storage for the RDF data. The store uses a
directory with files for each source (see rdf_source/1)
present in the database. Each source is represented by two files, one in
binary format (see rdf_save_db/2)
representing the base state and one represented as Prolog terms
representing the changes made since the base state. The latter is called
the journal.
cpu_count or 1 (one) on systems where this number is unknown. See also concurrent/3.
true, suppress loading messages from rdf_attach_db/2.
true, nested log transactions are added to the journal information. By default (false), no log-term is added for nested transactions.
The database is locked against concurrent access using a file lock in Directory. An attempt to attach to a locked database raises a permission_error exception. The error context contains a term rdf_locked(Args), where Args is a list containing time(Stamp) and pid(PID). The error can be caught by the application. Otherwise it prints:
ERROR: No permission to lock rdf_db `/home/jan/src/pl/packages/semweb/DB'
ERROR: locked at Wed Jun 27 15:37:35 2007 by process id 1748
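An application that prefers to handle a locked database itself might wrap the attach call as sketched below. The helper attach_or_report/1 is hypothetical; the pattern matches any permission_error and deliberately does not inspect the context term, whose exact shape is not assumed here.

```prolog
:- use_module(library(semweb/rdf_persistency)).

%  attach_or_report(+Dir): attach the persistent store in Dir. If the
%  directory is locked by another process, report this and fail instead
%  of letting the exception propagate.
attach_or_report(Dir) :-
    catch(rdf_attach_db(Dir, []),
          error(permission_error(Action, Type, Culprit), _Context),
          ( format(user_error, "Cannot ~w ~w ~p~n", [Action, Type, Culprit]),
            fail
          )).
```

Other errors, such as a missing or unreadable directory, are not caught and propagate as usual.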
false, the journal and snapshot for the database are deleted and further changes to triples associated with DB are not recorded. If Bool is true a snapshot is created for the current state and further modifications are monitored. Switching persistency does not affect the triples in the in-memory RDF database.
min_size(KB): only journals larger than KB Kbytes are merged with the base state. Flushing a journal takes the following steps, ensuring a stable state can be recovered at any moment.
.new
..new
file over the base state.
Note that journals are not merged automatically for two reasons. First of all, some applications may decide never to merge as the journal contains a complete changelog of the database. Second, merging large databases can be slow and the application may wish to schedule such actions at quiet times or scheduled maintenance periods.
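Since merging is left to the application, a maintenance job could call rdf_flush_journals/1 explicitly. The sketch below assumes that predicate with the min_size(KB) option described above; merge_large_journals/1 and the threshold in the usage line are hypothetical.

```prolog
:- use_module(library(semweb/rdf_persistency)).

%  merge_large_journals(+MinKB): merge every journal larger than MinKB
%  Kbytes into its base state. Intended to be run at a quiet moment.
merge_large_journals(MinKB) :-
    rdf_flush_journals([min_size(MinKB)]).
```

For example, `?- merge_large_journals(1024).` would leave journals smaller than one megabyte untouched.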
The above predicates suffice for most applications. The predicates in this section provide access to the journal files and the base state files and are intended to provide additional services, such as reasoning about the journals, loaded files, etc. A library, library(rdf_history), is under development, exploiting these features to support wiki-style editing of RDF.
Using rdf_transaction(Goal, log(Message)), we can add additional records to enrich the journal of affected databases with Message and some additional bookkeeping information. Such a transaction adds a term begin(Id, Nest, Time, Message) before the change operations on each affected database and end(Id, Nest, Affected) after the change operations. Here is an example call and the content of the journal file mydb.jrn. A full explanation of the terms that appear in the journal is in the description of rdf_journal_file/2.
?- rdf_transaction(rdf_assert(s,p,o,mydb), log(by(jan))).
start([time(1183540570)]).
begin(1, 0, 1183540570.36, by(jan)).
assert(s, p, o).
end(1, 0, []).
end([time(1183540578)]).
Using rdf_transaction(Goal, log(Message, DB)), where DB is an atom denoting a (possibly empty) named graph, the system guarantees that a non-empty transaction will leave a possibly empty transaction record in DB. This feature assumes named graphs are named after the user making the changes. If a user action does not affect the user's graph, such as deleting a triple from another graph, we still find a record of all actions performed by that user in the journal of that user.
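As an illustration of this convention, the hypothetical helper below removes triples from an arbitrary graph while naming the acting user's graph in the log term. The message term retract/4 is made up; any term can serve as the message.

```prolog
:- use_module(library(semweb/rdf_db)).

%  retract_as_user(+User, ?S, ?P, ?O, +Graph): delete matching triples
%  from Graph and log the action against the graph named after User.
retract_as_user(User, S, P, O, Graph) :-
    rdf_transaction(rdf_retractall(S, P, O, Graph),
                    log(retract(S, P, O, Graph), User)).
```

When the store is persistent, the journal of the User graph should then contain a begin/end record for the action even if no triple of that graph changed.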
time(Stamp).
time(Stamp).
log(Message). Id is an integer counting the logged transactions to this database. Numbers are increasing and designed for binary search within the journal file. Nest is the nesting level, where ‘0’ is a toplevel transaction. Time is a time-stamp, currently using float notation with two fractional digits. Message is the term provided by the user as argument of the log(Message) transaction.
log(Message). Id and Nest match the begin-term. Others gives a list of other databases affected by this transaction and the Id of these records. The terms in this list have the format DB:Id.
.trp for the base state and .jrn for the journal.
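Because the journal consists of plain Prolog terms, it can be inspected with the standard term reader. The sketch below collects the messages of all begin/4 terms from an already opened stream. The helper journal_log_messages/2 is hypothetical, and the code assumes the journal terms can be read with default read options.

```prolog
%  journal_log_messages(+In, -Messages): read journal terms from stream In
%  and collect the Message argument of every begin(Id, Nest, Time, Message).
journal_log_messages(In, Messages) :-
    read_term(In, Term, []),
    (   Term == end_of_file
    ->  Messages = []
    ;   Term = begin(_Id, _Nest, _Time, Message)
    ->  Messages = [Message|Rest],
        journal_log_messages(In, Rest)
    ;   journal_log_messages(In, Messages)      % skip all other terms
    ).
```

The file to open for a given database can be obtained from rdf_journal_file/2. A linear scan like this ignores the ordering of Id values, which a larger tool could exploit for binary search.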