RDF benchmarking

From OpenTox

Introduction

Profiling is one of the most powerful debugging approaches. It is a form of dynamic program analysis that aims to detect runtime errors that may occur under high load, to find possible memory leaks, and to estimate algorithmic efficiency. The latter can be assessed by correlating the computational time with the complexity of the underlying problem while monitoring memory allocation.

In OpenTox, RDF has been chosen as the main data exchange format. This offers great flexibility in data modeling, and meta-information can easily be attached to the various nodes of the data model so that the exchanged messages are meaningful. However, parsing such documents has caused trouble for developers, mainly in terms of performance. The scope of this experiment is to assess the efficiency of Jena, a tool for parsing and creating RDF documents, and to suggest strategies that minimize the computational time.

In what follows we will focus on datasets as all other resources in OpenTox are much smaller in size. Every single operation is always performed at least 10 times for statistical purposes.
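
The repeated-measurement approach can be sketched as follows. This is a minimal illustration, not the ToxOtis test code: the class name RepeatedTiming and the placeholder workload are assumptions, and the lambda syntax requires Java 8 or later, which is newer than the JVM used in the experiment.

```java
/** Minimal repeated-timing helper (hypothetical; not part of ToxOtis). */
class RepeatedTiming {

    /** Runs the task the given number of times and returns each elapsed time in milliseconds. */
    static double[] timeMillis(Runnable task, int runs) {
        double[] samples = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = (System.nanoTime() - start) / 1e6;
        }
        return samples;
    }

    /** Arithmetic mean of the samples. */
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /** Sample standard deviation (n - 1 in the denominator). */
    static double stdDev(double[] x) {
        if (x.length < 2) {
            return 0.0;
        }
        double m = mean(x);
        double ss = 0.0;
        for (double v : x) {
            ss += (v - m) * (v - m);
        }
        return Math.sqrt(ss / (x.length - 1));
    }

    public static void main(String[] args) {
        // Placeholder workload; in the experiment this would be a download or parse step.
        double[] samples = timeMillis(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        }, 10);
        System.out.printf("mean = %.3f ms, std dev = %.3f ms%n", mean(samples), stdDev(samples));
    }
}
```

Reporting the standard deviation next to the mean makes it easier to judge whether a difference between two configurations is larger than the run-to-run noise.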


Conditions

The measurements were made on a Linux machine (2.6.31-22-generic kernel, x86_64 GNU/Linux) with 3.8GB of RAM and an Intel Core 2 Duo CPU P8700 @2.53GHz. The SDK ToxOtis was used to perform the measurements (version 0.1.1.13) which includes Weka version 3.6.2 (latest stable version) and Jena version 2.6.2. These libraries run on a Sun™ JVM, version 1.6.0.20 with Java™ SE Runtime Environment (build 1.6.0.20-b02).

The ping time for apps.ideaconsult.net at the time of the experiment was on average 109.096ms (mean dev. 0.450ms, 200 packets transmitted with 0% packet loss).

Results

Computational Time for datasets of various sizes

The class java.net.HttpURLConnection was used to establish a connection and open a data stream between the client and the remote resource at http://apps.ideaconsult.net:8080/ambit2/dataset/id (where id=9,10). The stream was buffered using a BufferedInputStream with a specified buffer size. The effect of this size on the efficiency is discussed in a following section. Jena was used to read the data from the remote stream and parse it into an OntModel object. The following figure shows how the download and parsing time is affected by the dimension of the dataset (number of compounds and features). The results that follow in this section were obtained without any inference engine activated during the parsing. The effect of such a choice is discussed later in this document.
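
The connection and buffering step described above might look roughly like the sketch below. It uses only JDK classes and simply drains the stream instead of handing it to Jena, so it measures download time only. The class name BufferedDownload, the Accept header value and the 8192-byte buffer are assumptions made for illustration, not details taken from the experiment.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical helper that reads a remote resource through a buffer of a chosen size. */
class BufferedDownload {

    /** Reads the stream to its end through a buffer of the given size; returns the bytes read. */
    static long drain(InputStream raw, int bufferSize) throws IOException {
        long total = 0;
        try (InputStream in = new BufferedInputStream(raw, bufferSize)) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                total += n;
            }
        }
        return total;
    }

    /** Opens the URL, asks for RDF/XML (assumed media type) and returns the elapsed time in ms. */
    static double timeDownloadMillis(String url, int bufferSize) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestProperty("Accept", "application/rdf+xml");
        long start = System.nanoTime();
        try {
            drain(con.getInputStream(), bufferSize);
        } finally {
            con.disconnect();
        }
        return (System.nanoTime() - start) / 1e6;
    }

    public static void main(String[] args) throws IOException {
        String url = "http://apps.ideaconsult.net:8080/ambit2/dataset/9";
        System.out.printf("download took %.1f ms%n", timeDownloadMillis(url, 8192));
    }
}
```

Keeping the stream handling in a method that accepts any InputStream makes it possible to exercise the buffering logic locally without depending on the remote service.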

Download+Parse RDF document

A linear correlation between the complexity of the dataset and the computational time is revealed for datasets of the above dimensions (up to 60 features and 1000 compounds). The following figure presents the time needed by ToxOtis to parse the OntModel object provided by Jena into an in-house object, namely an instance of org.opentox.toxotis.core.Dataset:

Parsing of Ontological Models
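
A linear relation between dataset size and computational time, as reported in this section, can be checked with an ordinary least-squares fit. The sketch below is one way to do this and is not the analysis used for the figures. The class name LinearFit is hypothetical, and the numbers in main are synthetic placeholders, not measured values.

```java
/** Hypothetical least-squares helper for time-versus-size data. */
class LinearFit {

    /** Returns {slope, intercept, r} for the line y = slope * x + intercept. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs of equal length");
        }
        int n = x.length;
        double mx = 0.0;
        double my = 0.0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;

        double sxx = 0.0;
        double sxy = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxx += dx * dx;
            sxy += dx * dy;
            syy += dy * dy;
        }
        double slope = sxy / sxx;
        double intercept = my - slope * mx;
        double r = sxy / Math.sqrt(sxx * syy); // Pearson correlation coefficient
        return new double[] {slope, intercept, r};
    }

    public static void main(String[] args) {
        // Synthetic placeholder values, NOT measurements from the experiment.
        double[] compounds = {100, 200, 400, 800, 1000};
        double[] millis = {210, 395, 810, 1590, 2010};
        double[] f = fit(compounds, millis);
        System.out.printf("slope = %.4f ms/compound, intercept = %.2f ms, r = %.4f%n", f[0], f[1], f[2]);
    }
}
```

A correlation coefficient close to 1 supports the linear reading, while the intercept gives a rough estimate of the fixed overhead that does not depend on dataset size.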

Finally, the computational effort for the conversion of the Dataset object into a weka.core.Instances object was recorded for datasets of various dimensions and the results are presented in the following figure:

Conversion into Weka Instances

It is also important to mention that the time Jena needs to parse the data model as it reads it from the remote stream is much higher than the time ToxOtis needs to parse this model and/or convert it into a Weka object.

Pie chart with execution times


The buffer size

The size of the buffer used to read the input stream was not found to affect the download time drastically. The following plot shows the buffer size against the time needed to open the remote stream, read it and create an OntModel object (read the triples); it reveals only a weak relation between the two measures.

Impact of the buffer size

Inference Engines and RDF Specifications

Inference engines should be enabled only if absolutely necessary, as they slow down the parsing procedure. Three of the specifications supported by Jena, without any inference engine, are compared in this section. From the following plot it is evident that the OWL Full specification performs worse than OWL DL and OWL Lite when used for parsing datasets. It also seems that OWL DL performs slightly better than OWL Lite. The results presented here are based on the dataset at http://apps.ideaconsult.net:8080/ambit2/dataset/9?max=x where x takes values from 1 to 1000.

Comparison of specifications

In ToxOtis, OWL DL is used as the specification for parsing RDF documents.

Allocation of Resources

In this section we present results about memory allocation issues that might appear during the processing of large datasets. In Java, objects are allocated dynamically on the heap, and their memory is reclaimed by the garbage collector at some later point rather than as soon as they are no longer used. This is why we observe such deviations in the plot below, which shows the bytes allocated by the virtual machine versus the number of compounds contained in the dataset that is processed. The results presented here also come from the dataset at http://apps.ideaconsult.net:8080/ambit2/dataset/9?max=x where x takes values from 1 to 1000.

Allocation of system resources - 1
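
One simple way to sample heap usage from inside the JVM is through java.lang.Runtime, as sketched below. Whether the measurements above were taken this way is not stated, so this is an assumption, and the class name HeapProbe is hypothetical. Because garbage is collected lazily, the value includes objects that are no longer reachable but not yet reclaimed, which is one source of the scatter described in this section.

```java
/** Hypothetical probe that reports the heap bytes currently in use by the JVM. */
class HeapProbe {

    /** Heap bytes in use, including garbage that has not been collected yet. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedBytes();

        // Allocate 64 blocks of 1 MiB as a stand-in for a large in-memory dataset.
        byte[][] blocks = new byte[64][];
        for (int i = 0; i < blocks.length; i++) {
            blocks[i] = new byte[1 << 20];
        }
        long during = usedBytes();

        blocks = null;
        System.gc(); // only a hint; the JVM may ignore it
        long after = usedBytes();

        System.out.printf("before = %d, during = %d, after gc hint = %d bytes%n", before, during, after);
    }
}
```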

The following figure presents the system load, memory usage and network activity over a time interval of 5 minutes.

Allocation of system resources - 2

The CPU load profile is given in the following figure:

Allocation of system resources - 3


XML Writer used by the dataset service

Measurements were made to reveal any performance difference between StAX and Jena for RDF serialization. They showed that StAX outperforms the internal implementation of Jena for creating RDF documents. The measurements were made on http://apps.ideaconsult.net:8080/ambit2/dataset/9 (21 features, 1000 compounds) using the URL queries ?rdfwriter=jena and ?rdfwriter=stax for the Jena and the StAX writer respectively. Jena was about 14 seconds slower than StAX, based on 32 successive measurements that are presented in the following figure:

StAX vs Jena
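
To give an idea of what streaming RDF/XML serialization with StAX involves, the sketch below writes a small dataset skeleton with javax.xml.stream.XMLStreamWriter. It is not the writer used by the dataset service: the class name StaxRdfSketch, the compound URI pattern, and the choice of the ot namespace URI and element names are assumptions made for illustration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical StAX-based writer producing a small RDF/XML dataset skeleton. */
class StaxRdfSketch {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    // Assumed vocabulary namespace, used here for illustration only.
    static final String OT_NS = "http://www.opentox.org/api/1.1#";

    /** Streams one dataset node with the given number of data entries into a string. */
    static String writeDataset(String datasetUri, int compounds) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartDocument();
        w.writeStartElement("rdf", "RDF", RDF_NS);
        w.writeNamespace("rdf", RDF_NS);
        w.writeNamespace("ot", OT_NS);

        w.writeStartElement("ot", "Dataset", OT_NS);
        w.writeAttribute("rdf", RDF_NS, "about", datasetUri);
        for (int i = 1; i <= compounds; i++) {
            w.writeStartElement("ot", "dataEntry", OT_NS);
            w.writeStartElement("ot", "DataEntry", OT_NS);
            w.writeEmptyElement("ot", "compound", OT_NS);
            // Hypothetical compound URI pattern.
            w.writeAttribute("rdf", RDF_NS, "resource", datasetUri + "/compound/" + i);
            w.writeEndElement(); // ot:DataEntry
            w.writeEndElement(); // ot:dataEntry
        }
        w.writeEndElement(); // ot:Dataset
        w.writeEndElement(); // rdf:RDF
        w.writeEndDocument();
        w.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(writeDataset("http://example.org/dataset/1", 2));
    }
}
```

A streaming writer emits each element as soon as it is produced and never builds an in-memory graph of triples, which is a plausible reason for the performance gap reported above.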

Impact of the Representation Language

Here we examine the impact of the representation language on the time needed to download a remote entity and parse it into an OntModel object. Based on 20 measurements for each representation (RDF/XML, TURTLE and N3-TRIPLES), it becomes evident that RDF/XML is parsed faster than TURTLE while N3-TRIPLES performs remarkably worse.

Representation Languages


Different Implementations of Model

Different implementations of com.hp.hpl.jena.rdf.model.Model were tested to assess the potential use of an alternative implementation. The following results are based on 12 measurements on http://apps.ideaconsult.net:8080/ambit2/dataset/9. The source code of the test can be found at … The first measurement was discarded. The use of OntModel seems to be (at least statistically) equivalent to the default Model created by the ModelFactory, while both outperform non-reifying models.

Dependence on the implementation of Model