Storing RDF triples in main memory provides much better performance than using external databases. Unfortunately, although memory is fairly cheap these days, main memory is severely limited when compared to disks. Memory usage breaks down to the following categories. Rough estimates of the memory usage are given for 64-bit systems. 32-bit systems use slightly more than half these amounts.
Bucket arrays are resized if necessary. Old triples remain at their original location. This implies that a query may need to scan multiple buckets. The garbage collector may relocate old indexed triples. It does so by copying the old triple. The old triple is later reclaimed by GC. The memory of reindexed triples will be reused, but many reindexed triples may result in significant memory fragmentation.
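The effect of leaving old triples in place can be illustrated with a toy model. The sketch below is not the store's implementation: it assumes that a bucket is selected as the hash value modulo the table size, and the predicate name candidate_buckets/3 is hypothetical. Given the sizes a table has had over time, it lists the buckets a query would have to scan for one hash value.

```prolog
:- use_module(library(lists)).

%!  candidate_buckets(+Hash, +Sizes, -Buckets) is det.
%
%   Toy model (assumption: bucket = Hash mod Size). Sizes is the
%   list of sizes the table has had. Buckets is the sorted list of
%   distinct bucket numbers in which a triple with this hash may
%   still reside.
candidate_buckets(Hash, Sizes, Buckets) :-
    findall(Bucket,
            ( member(Size, Sizes),
              Bucket is Hash mod Size
            ),
            Buckets0),
    sort(Buckets0, Buckets).
```

In this model a table that grew from 4 to 8 to 16 buckets requires up to three bucket scans per lookup, which is the overhead that reindexing by the garbage collector removes.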
The hash parameters can be controlled with rdf_set/1. Applications that are tight on memory and for which the query characteristics are more or less known can optimize performance and memory by fixing the hash-tables. By fixing the hash-tables we can tailor them to the frequent query patterns, we avoid the need to check multiple hash buckets (see above), and we avoid memory fragmentation due to optimizing triples for resized hashes.
    set_hash_parameters :-
        rdf_set(hash(s,   size, 1048576)),
        rdf_set(hash(p,   size, 1024)),
        rdf_set(hash(sp,  size, 2097152)),
        rdf_set(hash(o,   size, 1048576)),
        rdf_set(hash(po,  size, 2097152)),
        rdf_set(hash(spo, size, 2097152)),
        rdf_set(hash(g,   size, 1024)),
        rdf_set(hash(sg,  size, 1048576)),
        rdf_set(hash(pg,  size, 2048)).
The index is one of s, p, sp, o, po, spo, g, sg or pg. Parameter is one of:
permission_error exception.
The garbage collector
The RDF store has a garbage collector that runs in a separate thread named __rdf_GC. The garbage collector removes the following objects:
rdfs:subPropertyOf relations that are related to old queries.
In addition, the garbage collector reindexes triples that were added to a hash-table before the table was resized. The most recent resize operation leads to the largest number of triples that require reindexing, while the oldest resize operation causes the largest slowdown. The parameter optimize_threshold, controlled by rdf_set/1, can be used to determine the number of most recent resize operations for which triples will not be reindexed. The default is 2.
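As a minimal sketch, the threshold could be changed for all indexes as shown below. It assumes that optimize_threshold is set per index through the same hash(Index, Parameter, Value) term that is used for size above. The helper name set_optimize_threshold/1 is hypothetical.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(lists)).

%!  set_optimize_threshold(+Threshold) is det.
%
%   Hypothetical helper: apply the same optimize_threshold to every
%   triple index. Assumes rdf_set/1 accepts the term
%   hash(Index, optimize_threshold, Threshold).
set_optimize_threshold(Threshold) :-
    forall(member(Index, [s, p, sp, o, po, spo, g, sg, pg]),
           rdf_set(hash(Index, optimize_threshold, Threshold))).
```

A lower threshold means more triples are reindexed after a resize, which trades memory fragmentation for faster lookups. A higher threshold does the opposite.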
Normally, the garbage collector does its job in the background at a low priority. The predicate rdf_gc/0 can be used to reclaim all garbage and optimize all indexes.
Warming up the database
The RDF store performs many operations lazily or in background threads. For maximum performance, perform the following steps:
    warm_indexes :-
        ignore(rdf(s, _, _)),
        ignore(rdf(_, p, _)),
        ignore(rdf(_, _, o)),
        ignore(rdf(s, p, _)),
        ignore(rdf(_, p, o)),
        ignore(rdf(s, p, o)),
        ignore(rdf(_, _, _, g)),
        ignore(rdf(s, _, _, g)),
        ignore(rdf(_, p, _, g)).
Predicates:
__rdf_GC performs garbage collection as long as it is considered ‘useful’.
Using rdf_gc/0 should only be needed to ensure a fully clean database for analysis purposes such as leak detection.
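For such an analysis, a sketch along the following lines may be useful. It forces a full collection and then reads the triple count, so that two measurements taken at different times can be compared without pending garbage distorting them. The predicate name clean_triple_count/1 is hypothetical. rdf_gc/0 and rdf_statistics/1 are taken from library(semweb/rdf_db).

```prolog
:- use_module(library(semweb/rdf_db)).

%!  clean_triple_count(-Count) is det.
%
%   Hypothetical helper: run the garbage collector to completion
%   and then report the number of triples in the store.
clean_triple_count(Count) :-
    rdf_gc,
    rdf_statistics(triples(Count)).
```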
The duplicate marks are used to reduce the administrative load of avoiding duplicate answers. Normally, the duplicates are marked by a background thread that is started on the first query that produces a substantial number of duplicates.
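Applications that prefer to pay these costs up front rather than during the first queries might combine the warm-up steps as sketched below. This assumes that library(semweb/rdf_db) exports rdf_warm_indexes/0 and rdf_update_duplicates/0. The wrapper name warm_database/0 is hypothetical, and it should be called only after all data has been loaded.

```prolog
:- use_module(library(semweb/rdf_db)).

%!  warm_database is det.
%
%   Hypothetical wrapper: create the triple indexes and build the
%   duplicate administration eagerly instead of waiting for the
%   first queries to trigger this work.
warm_database :-
    rdf_warm_indexes,
    rdf_update_duplicates.
```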