Suitability of RDF

The use of RDF in OpenTox - An overview of the experience gained with RDF technologies and RDF parsing/serialization in Java, and an evaluation of the choice of RDF as the common exchange format among OpenTox web services.

Summary

In this article we introduce the reader to the notions of the Semantic Web (or Web of Data), the Resource Description Framework (RDF) and Web Ontologies (OWL). We explain how these technologies are incorporated into the OpenTox framework for the needs of predictive toxicology, and we share our experience with RDF parsing and serialization. We present various benchmarking results on RDF processing and evaluate the use of RDF in OpenTox, outlining its pros and cons.

 

Semantic Web

Let us first describe the web as we already know it, in order to identify certain inadequacies. The World Wide Web is characterized by such a plethora of information that one can find practically everything: news, the weather, tutorials, music, videos and much more. The information on the web is distributed in such a way that different servers own different data and, one could say, keep it to themselves. Data are formatted in HTML (or other loosely structured formats), are hardly linked to data provided by other servers or applications, and do not allow any kind of reasoning to be carried out by a machine. The knowledge found in a web page is tucked away: it cannot be understood by any software, it cannot be processed, and no inference can be applied to it.

There is a huge potential in a web of data, otherwise called the semantic web. According to w3.org, the semantic web provides a common framework that allows data to be shared and reused across application, enterprise and community boundaries. A web of data requires a certain infrastructure that links data to each other and allows their integration.

Suppose Bob has on his web page a short paragraph, formatted in HTML, where he talks about himself. This is what it looks like:

<h3>Bob Smith - Personal Info</h3>
<p>Hi Folks, my name is Bob Smith and I am currently working for <em>XYZ international</em> - check out the <a href=...>company site</a>.</p>
<p>For more information you can mail me at bob_smith[at]yahoo.it or call me in the office. The number is +34567891011. I was born in Nicago in 1985.</p>

This is fine when it is to be presented to a human, but what about a machine? What would a software application understand from the above text, and what deductions could it make? Further questions arise: for example, what if one wants to find on the web a person whose first name is Bob and who is 25 to 27 years old? For that to be feasible we need an infrastructure that governs how data are formatted, so that computers are able to understand them. As Ivan Herman has put it, "Imagine a web where documents are available for download on the Internet but there are no hyperlinks among them".

Let us now see what Bob Smith did to allow clients to parse this information. He introduced his own XML schema and published the following document:

<Person>
<name>Bob</name>
<surname>Smith</surname>
<birthdate>02-15-1985</birthdate>
<birthplace>Nicago</birthplace>
<mail>bob_smith[at]yahoo.it</mail>
<workplace>XYZ international</workplace>
<workPhone>+34567891011</workPhone>
<workURL>http://xyz.com</workURL>
</Person>

Here the information is much more structured than before. However, a machine has to be aware of Bob Smith's XML schema in order to parse the document and "understand" it. And since anyone can invent an XML schema to describe whatever he has in mind, this alone does not take us much closer to a web of data...

Admittedly, the ground has shifted today, and the whole WWW is moving towards a semantic web where data are self-described and elaborate queries become feasible.

 

Semantic Web References

  1. The Semantic Web article on Wikipedia is a good introduction to the basics.
  2. The home page of the W3C Semantic Web Activity.
  3. What is the semantic web? An article by purl.org.
  4. An article about the semantic web by Sean B. Palmer: The semantic web, takes form.

 


The Resource Description Framework

The Resource Description Framework (RDF) is a collection of W3C specifications that serves as a data model for metadata of web resources. RDF introduces the notion of triples of the form (s,p,o), where s is the "subject", that is, the resource on the web being described, p stands for a well-defined property, and o is an object that can be either a web resource or a literal value. Both the subject s and the property p are web resources identified by their URIs. Here is an example of a triple:

( <http://www.youtube.com/watch?v=M_bvT-DGcWw>, <http://music.org/#band>, <http://music.org/#pinkfloyd> )

Resources are identified by a URI, which need not be a URL (a web address). The term "web resources" refers to information that can be found on the web, including but not restricted to web pages. RDF exists in various equivalent representations: RDF/XML, which is XML-formatted, RDF/Turtle, N-Triples and others.
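As an illustration, here is a minimal sketch using Apache Jena (the Java library discussed later in this article; the package names below are those of recent Jena releases) that builds the triple above and serializes it in two of these equivalent representations:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class TripleDemo {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // The subject, property and object of the triple, each identified by a URI
        Resource video = model.createResource("http://www.youtube.com/watch?v=M_bvT-DGcWw");
        Property band  = model.createProperty("http://music.org/#band");
        Resource floyd = model.createResource("http://music.org/#pinkfloyd");
        video.addProperty(band, floyd);      // asserts the triple (s, p, o)
        model.write(System.out, "TURTLE");   // same model serialized as Turtle...
        model.write(System.out, "RDF/XML");  // ...and as RDF/XML
    }
}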

 

RDF references

  1. A quick introduction to RDF. A good guide for beginners.
  2. Wikipedia article on RDF.
  3. James Hollenbach, Joe Presbrey, and Tim Berners-Lee, Using RDF Metadata To Enable Access Control on the Social Semantic Web
  4. RDF tutorial by W3C

 


Web Ontologies

An ontology is a formal representation of knowledge as a set of cognitive entities (ontological classes) and properties over these entities, which are used to construct semantically meaningful documents (e.g. RDF). An ontology communicates the way knowledge is represented as a set of logical predicates and governs any inference algorithm on the domain to which it applies. Web ontologies, in particular, are introduced to describe knowledge that is exchanged over the Internet or any other network and should be "comprehended" (read: parsed) by computers. For further information, please refer to the following links:

 

References on Web Ontologies

  1. Definition of an ontology and its utility in information sciences, on Wikipedia.
  2. A Wikipedia article on the Web Ontology Language (OWL).
  3. An overview of OWL by the W3C.

 


RDF in OpenTox

RDF seems to fit the scope of OpenTox best: data are held by different servers around the world, and it should be possible to add more nodes in the future (without necessarily notifying anyone of their existence). Parsing of data coming from different machines and different implementations is to be expected. The structure of the documents exchanged between these nodes has to be extensible, to allow for the introduction of new features, new algorithms and new web services in the future. Throughout the project, we had to introduce new ontological classes and properties more than a few times in order to provide new services. These ontological definitions were reused here and there, and we did not have to reinvent things from scratch.

All these requirements paved the way for the adoption of RDF, for its ability to extend and to combine with other resources around the world not directly related to OpenTox (e.g. the FOAF ontology or the Knouf BibTeX ontology). Here we raise no doubts about the extensibility and versatility of RDF, and I predict that a few years from now it will be ubiquitous. I will try, however, to summarize some concerns about its suitability for certain purposes.

 

RDF vs ARFF

The next figure is revealing about the efficiency of RDF when used to describe actual data rather than metadata, and especially when used to model data over which little or no inference is needed. It visualizes the discrepancy between RDF and ARFF when these two file formats are used to model a dataset resource in OpenTox. The y-axis is the processing time (download dataset, parse, train an MLR model) in ms.

ARFF is as simple as possible: a header is enough to declare the contained features (read: columns) and their datatypes (numeric, nominal, string). The main body of the ASCII-formatted (plain-text) ARFF document contains nothing more than the set of feature values per compound, as a comma-separated list. ARFF is not much more than a simple CSV or XLS file.
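For illustration, here is what a minimal ARFF file might look like (the feature names and values below are made up for the example, not taken from an actual OpenTox dataset):

@RELATION toy_dataset

@ATTRIBUTE compound_uri string
@ATTRIBUTE molecular_weight numeric
@ATTRIBUTE logP numeric

@DATA
'http://example.org/compound/1',180.16,1.19
'http://example.org/compound/2',151.17,0.46

Note that the k-th line after @DATA is exactly the k-th compound's feature vector, which is what makes row-wise retrieval trivial.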

In terms of the resources needed for parsing, ARFF once again outperforms RDF. We downloaded and parsed the dataset http://apps.ideaconsult.net:8080/ambit2/dataset/585036; we needed 2.79GB(!) of heap memory with RDF, compared to 1MB with ARFF. Taking into account that a server application has to run many such parsing jobs in parallel, if RDF/XML is used one needs to buy many GBs of RAM. This is illustrated in the following figure:

Just imagine what kind of hardware one would need to cope with "large" datasets (the word "large" is quoted because, from another point of view, such datasets can be considered small).
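For reference, a crude way to observe heap usage around such a parse in Java is sketched below; a profiler gives far more reliable figures, so treat anything this prints as indicative only:

public class HeapProbe {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // suggest a collection so the baseline is cleaner
        long before = rt.totalMemory() - rt.freeMemory();

        // ... download and parse the dataset here ...

        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println("Approx. heap used: "
                + (after - before) / (1024 * 1024) + " MB");
    }
}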

 

Some concerns about RDF

Would the above results be better with some other Java library? Perhaps... this is something to be examined. The truth is that needing 2.79GB of RAM to parse an object that carries no more information than can be wrapped in 1MB with a different serialization raises a lot of questions, primarily about Jena as a parsing library (although it is widely used). So maybe with some other library we would obtain (significantly) different results. The design of a good (read: fast and scalable) RDF parser remains an open challenge for the IT community.
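To make the setting concrete, the following sketch shows what a typical full-model parse with Jena looks like (again with recent Jena package names; the dataset URI is the one mentioned above). The entire graph is materialized in memory before a single triple can be inspected:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.StmtIterator;

public class DatasetParse {
    public static void main(String[] args) {
        // Jena builds the whole model in memory before we can query it
        Model model = ModelFactory.createDefaultModel();
        model.read("http://apps.ideaconsult.net:8080/ambit2/dataset/585036");
        // Only after the entire document has been parsed can we iterate the triples
        StmtIterator it = model.listStatements();
        long count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        System.out.println("Parsed " + count + " triples");
    }
}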

However, in this paragraph I would like to focus on what cannot be done using RDF. RDF represents non-ordered, non-structured information, meaning that anything can be found anywhere in the document. Although strict in a sense, you do not know what to expect on the next line as you read the file. It has no header and no main body, nor are the ontological definitions placed at a predefined location in the document. RDF is a document in which knowledge is represented in a machine-comprehensible way, but one has to read the whole document to draw any conclusions, and most probably has to keep all RDF triples in memory, or at least a large part of them.

If one takes a look at the structure of an ARFF file (or a CSV), it is straightforward to see how to retrieve any vector of the dataset. In RDF, in order to retrieve such a vector, one has to parse the entire dataset. So, after all, one library might outperform another (we do not yet have evidence for that), but with RDF, extensibility and scalability get divorced!



And what about streaming RDF parsers?

First of all, let me mention that I have not found any well-documented streaming RDF parsers. A streaming RDF parser is supposed to be the counterpart of StAX for RDF, returning triple after triple through an Iterator. But what we need to parse is not triples... it is vectors! So, in any case, you have to keep a stack of triples and wait until you have collected all of them before transforming them into vectors. Think of it as having a large m-by-n matrix A where the iterator returns one element A(i,j) at a time, with i and j being practically random, while some values of A might not be in the document at all. Therefore, streaming RDF parsers do not offer a solution! A sketch of this accumulation pattern follows.
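The sketch below illustrates the accumulation pattern; the Triple type and the iterator are hypothetical stand-ins, not the API of any particular library:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class StreamingSketch {

    // Hypothetical minimal triple type; real streaming APIs differ
    static final class Triple {
        final String subject, predicate, object;
        Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
    }

    // Even with a streaming iterator, vectors can only be assembled after
    // ALL triples have been seen: the "matrix elements" A(i,j) arrive in
    // practically random order, so everything must be buffered in memory.
    static Map<String, Map<String, String>> collect(Iterator<Triple> triples) {
        Map<String, Map<String, String>> rows = new HashMap<>();
        while (triples.hasNext()) {
            Triple t = triples.next();
            rows.computeIfAbsent(t.subject, s -> new HashMap<>())
                .put(t.predicate, t.object);
        }
        return rows; // only now can each row be turned into a feature vector
    }
}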