Suitability of RDF

RDF in OpenTox

RDF seems to fit the scope of OpenTox best: data are held by different servers around the world, and more nodes should be able to join in the future (without necessarily notifying anyone of their existence). Parsing of data coming from different machines and different implementations should be expected. The structure of the documents exchanged between these nodes has to be extensible, to allow for the introduction of new features, new algorithms and new web services in the future. More than a few times throughout the project we had to introduce new ontological classes or properties to provide new services; these ontological definitions were then reused here and there, and we did not have to reinvent things from scratch.

All these requirements paved the way for the adoption of RDF, thanks to its ability to be extended and to be combined with resources around the world that are not directly related to OpenTox (e.g. the FOAF ontology or the Knouf BibTeX ontology). I raise no doubts here about the extensibility and versatility of RDF, and I predict that a few years from now it will be ubiquitous. I will, however, try to summarize some concerns about its suitability for certain purposes.

RDF vs ARFF

The next figure is quite revealing about the efficiency of RDF when used to describe actual data rather than metadata, and especially when used to model data over which little or no inference is needed. It visualizes the discrepancy between RDF and ARFF when these two file formats are used to represent a dataset resource in OpenTox; the y-axis is the processing time (download the dataset, parse it, train an MLR model) in ms.
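The ARFF branch of that measurement boils down to a handful of Weka calls (ARFF is Weka's native format). Below is a hedged sketch, not the actual benchmark code: the choice of the last attribute as the modelling target, and the assumption that the service returns ARFF for this URI with numeric descriptor columns, are mine.

    // Hedged sketch: download, parse and train an MLR model on an ARFF dataset.
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ArffBenchmarkSketch {
        public static void main(String[] args) throws Exception {
            long start = System.currentTimeMillis();
            DataSource source = new DataSource(
                    "http://apps.ideaconsult.net:8080/ambit2/dataset/585036");
            Instances data = source.getDataSet();          // download + parse
            data.setClassIndex(data.numAttributes() - 1);  // assumed target column
            LinearRegression mlr = new LinearRegression(); // multiple linear regression
            mlr.buildClassifier(data);                     // train
            System.out.println("Total time (ms): " + (System.currentTimeMillis() - start));
        }
    }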

ARFF is as simple as possible: a header declares the contained features (read: columns) and their datatypes (numeric, nominal, string), and the main body of the ASCII-formatted (plain-text) ARFF document contains nothing more than the feature values per compound as comma-separated lists. ARFF is not much more than a simple CSV or XLS file.
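For illustration, a tiny ARFF file could look like the following; the relation name, feature names and values are made up.

    % Hypothetical example; feature names and values are made up
    @relation example_dataset
    @attribute compound_uri string
    @attribute molecular_weight numeric
    @attribute logP numeric
    @data
    'http://example.org/compound/1',180.16,-0.67
    'http://example.org/compound/2',46.07,-0.31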

In terms of the resources needed for parsing, ARFF once again outperforms RDF. We downloaded and parsed the dataset http://apps.ideaconsult.net:8080/ambit2/dataset/585036; we needed 2.79 GB(!) of heap memory for RDF compared to 1 MB using ARFF. Taking into account that a server application has to run multiple such parsings in parallel, if RDF/XML is used one needs to buy lots of GB of RAM. This is illustrated in the following figure:

Just imagine what kind of hardware one would need to cope with "large" datasets (the word large is quoted here because, from another point of view, such a dataset can be considered small).

Some concerns about RDF

Would the above results be better with some other Java library? Perhaps... this is something to be examined. The truth is that needing 2.79 GB of RAM to parse an object whose information can be wrapped in 1 MB using a different serialization raises lots of questions, primarily about Jena as a parsing library (widely used though it is). So maybe with some other library we would get (significantly) different results. The design of a good (read: fast and scalable) RDF parser remains an open challenge for the IT community.
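For context, the measurement above used the usual whole-model approach. The following is a hedged Jena sketch of that pattern, not the exact benchmark code; package names follow recent Apache Jena releases (older releases used com.hp.hpl.jena.*).

    // Hedged sketch of whole-model RDF parsing with Jena: every triple is
    // materialized on the heap before anything can be read out of the model.
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class RdfParseSketch {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            // Downloads and parses the complete RDF document into memory.
            model.read("http://apps.ideaconsult.net:8080/ambit2/dataset/585036");
            System.out.println("Triples in memory: " + model.size());
        }
    }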

However, in this paragraph I'd like to focus on what cannot be done with RDF. RDF represents unordered, unstructured information, meaning that anything can be found anywhere in the document. Although strict in a sense, you don't know what to expect on the next line as you read the file: there is no header and main body, nor are the ontological definitions placed at a predefined location in the document. An RDF document represents knowledge in a machine-comprehensible way, but one has to read the whole document to draw any conclusions, and most probably has to keep all the RDF triples in memory, or at least a large part of them.

Looking at the structure of an ARFF (or CSV) file, it is straightforward to see how to retrieve any vector of the dataset: you just read the corresponding line. In RDF, in order to retrieve such a vector, one has to parse the entire dataset. So, one library might well outperform another (we don't yet have evidence for that), but with RDF, extensibility and scalability get a divorce!
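To make the contrast concrete, here is a hedged sketch of what retrieving a single compound's values looks like with Jena once the whole model (loaded as in the previous sketch) is already in memory; the property and compound URIs below are made-up placeholders, not the actual OpenTox vocabulary.

    // Hypothetical lookup of one compound's values. The model must already hold
    // the *entire* parsed dataset; the URIs are illustrative placeholders.
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.RDFNode;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.rdf.model.StmtIterator;

    public class VectorLookupSketch {
        static void printVector(Model model, String compoundUri) {
            Resource compound = model.createResource(compoundUri);
            Property value = model.createProperty("http://example.org/hypothetical#value");
            StmtIterator it = model.listStatements(compound, value, (RDFNode) null);
            while (it.hasNext()) {
                System.out.println(it.nextStatement().getObject());
            }
        }
    }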



And what about streaming RDF parsers?

First of all, let me mention that I haven't found any well-documented streaming RDF parsers. A streaming RDF parser is supposed to be the counterpart of StAX for RDF, returning triple after triple through an Iterator. But what you need to extract is not triples... it is vectors! So in any case you have to accumulate the triples and wait until you have collected all of them before you can transform them into vectors. Think of it as having a large m-by-n matrix A where the iterator returns one element A(i,j) at a time, with i and j arriving in practically random order, while some values of A might not be in the document at all. Therefore, streaming RDF parsers don't offer a solution!
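To see why streaming does not help, consider the following sketch of consuming such an iterator; the Triple interface is a hypothetical placeholder rather than the API of any real parser, and the flat subject/predicate/object-to-vector mapping is a deliberate simplification.

    // Sketch of the accumulation problem with a streaming RDF parser.
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    public class StreamingSketch {
        interface Triple {             // hypothetical: one parsed RDF statement
            String subject();
            String predicate();
            String object();
        }

        // Builds compound -> (feature -> value) vectors from a triple stream.
        static Map<String, Map<String, String>> collect(Iterator<Triple> triples) {
            Map<String, Map<String, String>> vectors = new HashMap<>();
            while (triples.hasNext()) {
                Triple t = triples.next();
                // Triples belonging to one compound may appear anywhere in the
                // stream, so every partially filled vector must stay in memory
                // until the stream is exhausted; nothing can be emitted early.
                vectors.computeIfAbsent(t.subject(), k -> new HashMap<>())
                       .put(t.predicate(), t.object());
            }
            return vectors; // only now are the vectors complete
        }
    }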