Last modified on 11/22/11 00:08:53

XML Adapter

As of version 0.0.2, the XML processing part of X2R provides two completely independent components.

The original XMLAdapter was designed as part of the X2R Framework and implements the DataSourceAdapter interface. It connects to a running instance of the Sedna XML database, extracts the XML data from it and pushes it to the supplied repository connection. Thus, any data to be converted must be placed in Sedna first. See the Sedna homepage for details.

In 0.0.2 another component has been introduced: the StreamingXMLAdapter. It converts XML data to RDF by reading an InputStream, so in principle it can process arbitrary XML data. The initial use case for it was crawling XML files generated by Mediawiki, the popular wiki engine which powers (among others) Wikipedia itself.

Both the XMLAdapter and the StreamingXMLAdapter use the same mapping language, though with certain differences, which will be described in more detail below.

Java Usage

XMLAdapter

It can connect to a remote Sedna database, get the data and convert it to RDF. In order to use it you have to prepare a DataSourceConfiguration object, a connection to the target repository and the URI of the context where the triples are to be put.

XMLAdapter adapter = new XMLAdapter();

// prepare the configuration
DataSourceConfiguration c = new DataSourceConfigurationImpl();
c.put(XMLAdapter.KEY_HOST, "192.168.10.121");
c.put(XMLAdapter.KEY_USERNAME, "SYSTEM");
c.put(XMLAdapter.KEY_PASSWORD, "MANAGER");
c.put(XMLAdapter.KEY_DBNAME, "dblp");
c.put(XMLAdapter.KEY_COLLECTION, "dblp");
c.put(XMLAdapter.KEY_FILENAME, "dblp.xml");

// obtain the mapping string
String mappingString = "";

// the connection to the target repository
RepositoryConnection conn = null;

// the context where the triples are to be created
URI targetContext = null;

DumpReport rep = adapter.dump(c, mappingString, conn, targetContext);

The configuration options are outlined below:

Name | Java Constant | Description
host | XMLAdapter.KEY_HOST | the host name of the database server
username | XMLAdapter.KEY_USERNAME | the login
password | XMLAdapter.KEY_PASSWORD | the password
dbname | XMLAdapter.KEY_DBNAME | the name of the logical database on the database server
collection | XMLAdapter.KEY_COLLECTION | the name of the collection which contains the XML document that is to be converted (can be null, if the required document is not in a collection)
filename | XMLAdapter.KEY_FILENAME | the name of the document which is to be converted

StreamingXMLAdapter

The streaming XML adapter doesn't connect to an external database. It works with an InputStream containing XML data. It has two important advantages:

  • can get the input data from anywhere (files, compressed files, remote documents etc.)
  • can process arbitrarily large inputs

Lack of random access to the XML tree makes the mapping a little less expressive. See the section on the mapping language for details.

In order to use a streaming XML adapter you need the source data (an InputStream), the target repository connection and the context, the mapping, and a parent URI, which will usually be the URI of the file the input stream comes from. The parent URI can be referred to in the patterns within the mapping, with #{parentUri}.

StreamingXMLAdapter adapter = new StreamingXMLAdapter();

// the source data
InputStream is = null;
// the mapping string
String mappingString = null;
// the target repository connection
RepositoryConnection conn = null;
// the target context
URI ctx = null;
// the value for the #{parentUri} pattern
String parentUri = null;
adapter.dump(is, mappingString, conn, ctx, parentUri);
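Because the adapter only needs an InputStream, compressed input can be handled by wrapping the raw stream before it is handed over. The sketch below uses only the JDK; the class and method names are made up for illustration and are not part of X2R. The readAll method merely stands in for whatever consumes the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedInputSketch {

    /** Wraps a gzip-compressed stream so that the consumer reads plain XML bytes. */
    public static InputStream openGzip(InputStream compressed) {
        try {
            return new GZIPInputStream(compressed);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the demo: gzip-compresses a string in memory. */
    public static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a stream to the end; a stand-in for the component that consumes it. */
    public static String readAll(InputStream in) {
        try (InputStream stream = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<dblp><article key=\"ar/Mylka01\"/></dblp>";
        // In real use the inner stream would be a FileInputStream or a URL stream.
        InputStream is = openGzip(new ByteArrayInputStream(gzip(xml)));
        System.out.println(readAll(is));
    }
}
```

In real use the stream returned by openGzip would be passed as the first argument of dump instead of being read by readAll.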

Mapping Language Constructs

The mapping language of the X2R XML adapter has been inspired by the language of D2RQ. Here's an excerpt of the mapping we used to simulate the D2RQ mapping from dblp.l3s.de to work with the DBLP.uni-trier.de XML dumps.

xml2r:Mapping

Since version 0.0.2 each mapping file must contain one instance of the xml2r:Mapping class. Version 0.0.1 did not use it.

Properties

xml2r:namespaceDefinition specifies a prefix and a namespace URI. The prefix can later be used within the XPath expressions specified with xml2r:nodeXPath and xml2r:pattern properties, to support XML documents with namespaces. The value of this property is a resource which carries two additional properties, xml2r:namespacePrefix and xml2r:namespaceUri. See the Example section below.

Example

Description
An example which shows an xml2r:Mapping construct with a namespace prefix definition. The namespace prefix is later used in the nodeXPath and in relative XPath expressions inside string patterns.
Source data
<mediawiki 
 xmlns="http://www.mediawiki.org/xml/export-0.4/" 
 version="0.4" 
 xml:lang="ko">
 <page>
   <title>A test title</title>
   <id>3</id>
   <revision>
     <id>5496901</id>
     <timestamp>2010-08-07T05:14:24Z</timestamp>
     <contributor>
       <username>ITurtle</username>
       <id>11473</id>
     </contributor>
     <minor />
     <text xml:space="preserve">
      Some test text
     </text>
   </revision>
 </page>
</mediawiki>
Mapping
:wikipediaMapping a xml2r:Mapping ;
 xml2r:namespaceDefinition [
  xml2r:namespacePrefix "wiki";
  xml2r:namespaceUri 
   "http://www.mediawiki.org/xml/export-0.4/"
 ] .

:publicationMap a xml2r:ClassMap ;
 xml2r:belongsToMapping :wikipediaMapping ;
 xml2r:nodeXPath "/wiki:mediawiki/wiki:page" ;
 xml2r:uriPattern 
  "http://pubs.org/publications/${wiki:id/text()}" ;
 xml2r:class 
  <http://some.cool.ontology/2008/ont#Publication> .
   
:titleBridge a xml2r:PropertyBridge ;
 xml2r:belongsToClassMap :publicationMap ;
 xml2r:property dc:title ;
 xml2r:pattern "${wiki:title/text()}" .

:contributorBridge a xml2r:PropertyBridge ;
 xml2r:belongsToClassMap :publicationMap ;
 xml2r:property dc:contributor ;
 xml2r:pattern 
"${wiki:revision/wiki:contributor/wiki:username}" 
 .

:textBridge a xml2r:PropertyBridge ;
 xml2r:belongsToClassMap :publicationMap ;
 xml2r:property dc:text ;
 xml2r:pattern "${wiki:revision/wiki:text/text()}" .
Result
<http://pubs.org/publications/3> a coolont:Publication ;
   dc:title "A test title" ;
   dc:contributor "ITurtle" ;
   dc:text "Some test text" .
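The effect of a namespace prefix definition can be reproduced with the standard JDK XPath API. The sketch below is only an illustration of the concept, not the X2R implementation (which is not shown on this page); the class and method names are invented. It binds a prefix to a namespace URI and shows that the prefixed expression matches elements in the document's default namespace, while the unprefixed one does not.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespacePrefixSketch {

    /** Evaluates an XPath expression with one prefix bound to a namespace URI. */
    public static String evaluate(String xml, final String prefix,
                                  final String namespaceUri, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required, otherwise prefixes never match
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String p) {
                    return prefix.equals(p) ? namespaceUri : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return namespaceUri.equals(uri) ? prefix : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return namespaceUri.equals(uri)
                            ? Collections.singletonList(prefix).iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ns = "http://www.mediawiki.org/xml/export-0.4/";
        String xml = "<mediawiki xmlns=\"" + ns + "\"><page>"
                + "<title>A test title</title></page></mediawiki>";
        System.out.println(evaluate(xml, "wiki", ns,
                "/wiki:mediawiki/wiki:page/wiki:title/text()"));
    }
}
```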

xml2r:ClassMap

An equivalent of the D2RQ ClassMap construct. For XML it defines a set of nodes from the XML tree. Each of those nodes will be converted to a resource in the resulting RDF graph, where it serves as the subject of RDF triples. In the StreamingXMLAdapter there can only be one xml2r:ClassMap in the mapping. A mapping for the normal database-backed XMLAdapter can contain more than one xml2r:ClassMap.

Properties

xml2r:belongsToMapping ties the xml2r:ClassMap with the mapping it belongs to.
xml2r:nodeXPath an XPath expression which defines the sequence of nodes. It should start with the name of the root element of the XML document. Only a very limited subset of XPath is supported here; processing is done via a StreamingPathFilter from the Nux library. See the StreamingPathFilter javadocs for details. The XPath expression can have one step (the entire document will be loaded into memory at once and processed as a whole) or more than one.
xml2r:uriPattern an equivalent of the D2RQ uriPattern property. It is a template from which URIs of resources are generated. See the section on patterns below for more information.
xml2r:class generates an rdf:type triple for each generated RDF resource.
xml2r:delegatedFrom marks the ClassMap as a "delegated" ClassMap. This property is always used in conjunction with xml2r:onElementName. It allows the user to specify a nodeXPath which iterates over more than one element type and to process each element type in a different way. This property MUST NOT occur together with xml2r:nodeXPath. A ClassMap can be either a top-level ClassMap (with nodeXPath) or a delegated ClassMap (with delegatedFrom and onElementName), but not both at the same time.
xml2r:onElementName states that this xml2r:ClassMap definition is to be applied to elements which have the given name.
xml2r:setVariable for each element where this ClassMap is applied, a named variable is set. Afterwards it can be used in patterns. This feature usually makes sense in cases where a single nodeXPath first iterates over some header element where the variable is set, and the value is later used in ordinary "data" elements.

Basic Example

Description
The simplest example. The ClassMap iterates over article elements. For each element one resource is created (according to the uriPattern). With each resource comes one triple with the rdf:type predicate.
Source data
<dblp>
  <article key="ar/Mylka01">
    <title>First Article</title>
  </article>
  <article key="ar/Mylka02">
    <title>Second Article</title>
  </article>
</dblp>
Mapping
:m a xml2r:Mapping .

:articleMap a xml2r:ClassMap ;
   xml2r:belongsToMapping :m ;
   xml2r:nodeXPath "/dblp/article" ;
   xml2r:uriPattern "http://articles.org/${data(@key)}" ;
   xml2r:class <http://some.cool.ontology/2008/ont#Article> .
Result
<http://articles.org/ar/Mylka01> a <http://some.cool.ontology/2008/ont#Article> .
<http://articles.org/ar/Mylka02> a <http://some.cool.ontology/2008/ont#Article> .
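What this ClassMap does can be imitated in a few lines of plain Java. The following is a rough sketch using the JDK's DOM and XPath 1.0 support rather than the X2R code itself; the class and method names are hypothetical. Since the JDK engine has no data() function, the relative expression @key is used in place of data(@key).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ClassMapSketch {

    private static final String RDF_TYPE =
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    /** Emits one N-Triples rdf:type statement per node selected by nodeXPath. */
    public static List<String> typeTriples(String xml, String nodeXPath,
                                           String uriPrefix, String classUri) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(nodeXPath, doc,
                    XPathConstants.NODESET);
            List<String> triples = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // The key is evaluated relative to the current node, like a uriPattern.
                String key = xpath.evaluate("@key", nodes.item(i));
                triples.add("<" + uriPrefix + key + "> <" + RDF_TYPE + "> <"
                        + classUri + "> .");
            }
            return triples;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<dblp>"
                + "<article key=\"ar/Mylka01\"><title>First Article</title></article>"
                + "<article key=\"ar/Mylka02\"><title>Second Article</title></article>"
                + "</dblp>";
        for (String triple : typeTriples(xml, "/dblp/article",
                "http://articles.org/", "http://some.cool.ontology/2008/ont#Article")) {
            System.out.println(triple);
        }
    }
}
```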

Delegated ClassMap example

Description
An example with delegated ClassMaps. We have two types of XML elements: boys and girls. Boys like cars, girls like flowers. We want to specify a single xml2r:nodeXPath, but attach different properties, depending on the name of the XML element. If the element is a <boy>, we want the ont:Boy type with an ont:favouriteCar property. For <girl> we want RDF resources with the ont:Girl RDF type and ont:favouriteFlower properties.
Source data
<people>
 <boy>
  <name>Antoni</name>
  <favouriteCar>Porsche</favouriteCar>
 </boy>
 <girl>
  <name>Alina</name>
  <favouriteFlower>Daisy</favouriteFlower>
 </girl>
</people>
Mapping
:peopleMapping a xml2r:Mapping .

:peopleMap a xml2r:ClassMap ;
  xml2r:belongsToMapping :peopleMapping ;
  xml2r:nodeXPath "/people/*" .
   
:boyMap a xml2r:ClassMap ;
  xml2r:delegatedFrom :peopleMap ;
  xml2r:onElementName "boy" ;
  xml2r:uriPattern 
    "http://www.example.com/people/${name/text()}" ;
  xml2r:class ont:Boy .
    
:boyNameBridge a xml2r:PropertyBridge ;
  xml2r:belongsToClassMap :boyMap ;
  xml2r:property ont:name ;
  xml2r:pattern "${name/text()}" .
   
:boyCarBridge a xml2r:PropertyBridge ;
  xml2r:belongsToClassMap :boyMap ;
  xml2r:property ont:favouriteCar ;
  xml2r:pattern "${favouriteCar/text()}" .
    
:girlMap a xml2r:ClassMap ;
  xml2r:delegatedFrom :peopleMap ;
  xml2r:onElementName "girl" ;
  xml2r:uriPattern 
    "http://www.example.com/people/${name/text()}" ;
  xml2r:class <http://www.example.com/ontology#Girl> .
    
:girlNameBridge a xml2r:PropertyBridge ;
  xml2r:belongsToClassMap :girlMap ;
  xml2r:property ont:name ;
  xml2r:pattern "${name/text()}" .
   
:girlFlowerBridge a xml2r:PropertyBridge ;
  xml2r:belongsToClassMap :girlMap ;
  xml2r:property ont:favouriteFlower ;
  xml2r:pattern "${favouriteFlower/text()}" .   
Result
<http://www.example.com/people/Antoni> a ont:Boy ;
   ont:name "Antoni" ;
   ont:favouriteCar "Porsche" .
   
<http://www.example.com/people/Alina> a ont:Girl ;
   ont:name "Alina" ;
   ont:favouriteFlower "Daisy" .
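Conceptually, delegation is a dispatch on the element name. The sketch below mimics that idea with a plain lookup table; it is an assumption about the behaviour described above, not the actual X2R code, and the class name, the table layout and the output format are all invented for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DelegatedDispatchSketch {

    // element name -> { RDF type, child element holding the type-specific property }
    private static final Map<String, String[]> DELEGATES = new HashMap<>();
    static {
        DELEGATES.put("boy", new String[] {"ont:Boy", "favouriteCar"});
        DELEGATES.put("girl", new String[] {"ont:Girl", "favouriteFlower"});
    }

    private static String childText(Element parent, String name) {
        return parent.getElementsByTagName(name).item(0).getTextContent().trim();
    }

    /** Walks the children of the root (like "/people/*") and dispatches by name. */
    public static List<String> convert(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> statements = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace between elements
                }
                String[] delegate = DELEGATES.get(child.getNodeName());
                if (delegate == null) {
                    continue; // no delegated map registered for this element name
                }
                Element element = (Element) child;
                String name = childText(element, "name");
                statements.add("<http://www.example.com/people/" + name + "> a "
                        + delegate[0] + " ; ont:name \"" + name + "\" ; ont:"
                        + delegate[1] + " \"" + childText(element, delegate[1]) + "\" .");
            }
            return statements;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<people>"
                + "<boy><name>Antoni</name><favouriteCar>Porsche</favouriteCar></boy>"
                + "<girl><name>Alina</name><favouriteFlower>Daisy</favouriteFlower></girl>"
                + "</people>";
        for (String statement : convert(xml)) {
            System.out.println(statement);
        }
    }
}
```

Elements without a registered delegate are silently skipped in this sketch; whether X2R does the same is not stated here.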

Variable Example

Description
An example from the Mediawiki mapping. The nodeXPath first meets the <siteinfo> element, where the "baseUri" variable is set. Its value is computed via a rather tricky XPath expression, which takes the common prefix http://en.wikipedia.org/wiki/ out of the address of the main page http://en.wikipedia.org/wiki/Main_Page, i.e. the substring from the beginning up to the last slash. The value of that variable is later used in the uriPatterns of the resources corresponding to each Mediawiki article. This feature allows the resulting RDF resources to have the same URIs as they would have on a live Mediawiki instance on the web. Note that those URI patterns also use a urlify string transformer (written as mwurlify in this mapping), to make sure that special characters in entry titles are escaped properly in URIs.
Source data
<mediawiki 
 xmlns="http://www.mediawiki.org/xml/export-0.5/">
 <siteinfo>
  <sitename>Wikipedia</sitename>
  <base>
    http://en.wikipedia.org/wiki/Main_Page
  </base>
 </siteinfo>
 <page>
  <title>Somecity, Somecountry</title>
 </page>
</mediawiki>
Mapping
:wikipediaMapping a xml2r:Mapping ;
  xml2r:namespaceDefinition [
    xml2r:namespacePrefix 
      "wiki";
    xml2r:namespaceUri 
      "http://www.mediawiki.org/xml/export-0.5/"
  ] .

:mediawikiElementClassMap a xml2r:ClassMap ;
  xml2r:belongsToMapping :wikipediaMapping;
  xml2r:nodeXPath "/wiki:mediawiki/wiki:*" .
   
:siteinfoMap a xml2r:ClassMap ;
 xml2r:delegatedFrom :mediawikiElementClassMap;
 xml2r:onElementName "siteinfo" ;
 xml2r:setVariable [
  xml2r:variableName "baseUri" ;
  xml2r:pattern 
   "${fn:replace(wiki:base/text(),\"/[^/]*$\",\"/\")}" 
 ] .

:publicationMap a xml2r:ClassMap ;
  xml2r:delegatedFrom :mediawikiElementClassMap;
  xml2r:onElementName "page" ;
  xml2r:uriPattern 
    "&{baseUri}${wiki:title/text()||mwurlify}" ;
  xml2r:class ont:Publication .
Result
<http://en.wikipedia.org/wiki/Somecity%2C_Somecountry> a ont:Publication .
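The two string operations in this example can be tried out in isolation. The sketch below applies the same regular expression as the fn:replace call to obtain the base URI, and then escapes the title. The mwUrlify method is a guess reconstructed from the result URI above (spaces become underscores, the rest is percent-encoded); the real transformer may behave differently, and all names here are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class BaseUriVariableSketch {

    /** Cuts everything after the last slash, like the fn:replace call in the mapping. */
    public static String baseUri(String mainPageUrl) {
        return mainPageUrl.trim().replaceAll("/[^/]*$", "/");
    }

    /** Assumed behaviour: spaces become underscores, other characters are percent-encoded. */
    public static String mwUrlify(String title) {
        try {
            return URLEncoder.encode(title.trim().replace(' ', '_'), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Combines the variable value and the escaped title, like the uriPattern. */
    public static String pageUri(String mainPageUrl, String title) {
        return baseUri(mainPageUrl) + mwUrlify(title);
    }

    public static void main(String[] args) {
        System.out.println(pageUri("http://en.wikipedia.org/wiki/Main_Page",
                "Somecity, Somecountry"));
    }
}
```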

xml2r:PropertyBridge

An equivalent of the D2RQ PropertyBridge construct. Each occurrence defines a predicate-object pair, which is attached to a resource generated from the xml2r:nodeXPath of the corresponding xml2r:ClassMap.

Properties

xml2r:belongsToClassMap Ties the xml2r:PropertyBridge with a corresponding xml2r:ClassMap.
xml2r:property The predicate for the generated triple, an equivalent of d2rq:property construct.
xml2r:pattern Generates a plain literal object for the generated triples using a pattern. See the section on Patterns below for more information.
xml2r:uriPattern Generates a URI object for the generated triples using a pattern. See the section on Patterns below for more information.

Examples

Description
The xml2r:nodeXPath specifies that the converter should iterate over all publications. For each publication two triples are generated: the type triple (specified by xml2r:class) and the dc:title triple. The predicate is specified by xml2r:property, while the value comes from a relative XPath expression, title/text(). It is evaluated within the context of each publication node in the sequence defined by xml2r:nodeXPath, and returns the text of the <title> element.
Source data
<dblp>
<mastersthesis key="ms/Swiderska92">
<author>Alina Swiderska</author>
<title>
DFAE: Distributed Feature Aquisition Environment
</title>
<year>1992</year>
<school>
AGH University of Science and Technology
</school>
</mastersthesis>

<mastersthesis key="ms/Mylka97">
<author>Antoni Mylka</author>
<title>
Efficient Data Maintenance at RDF databases
</title>
<year>1997</year>
<school>
AGH University of Science and Technology
</school>
</mastersthesis>

<article key="tr/oct/BAS2008-423">
<author>Michal Fronczyk</author>
<title>How Semantic Web changed the world.</title>
<journal>
Digital System Research Center Report
</journal>
<volume>BAS2008-423</volume>
<year>2008</year>
<ee>db/labs/oct/BAS2008-423.html</ee>
<ee>
http://www.fronczyk.com/2008/bas2008-423
</ee>
<cdrom>octTR/bas2008-423.pdf</cdrom>
</article>

</dblp>
Mapping
:m a xml2r:Mapping .

:publicationMap a xml2r:ClassMap ;
   xml2r:belongsToMapping :m ;
   xml2r:nodeXPath "/dblp/*" ;
   xml2r:uriPattern 
      "http://pubs.org/${data(@key)}" ;
   xml2r:class 
      <http://example.org/Publication> .
   
:titleBridge a xml2r:PropertyBridge ;
   xml2r:belongsToClassMap :publicationMap ;
   xml2r:property dc:title ;
   xml2r:pattern "${title/text()}" .
Result
<http://pubs.org/ms/Swiderska92> a <http://example.org/Publication> .
<http://pubs.org/ms/Mylka97> a <http://example.org/Publication> .
<http://pubs.org/tr/oct/BAS2008-423> a <http://example.org/Publication> . 

<http://pubs.org/ms/Swiderska92> dc:title "DFAE: Distributed Feature Aquisition Environment" .
<http://pubs.org/ms/Mylka97> dc:title "Efficient Data Maintenance at RDF databases" .
<http://pubs.org/tr/oct/BAS2008-423> dc:title "How Semantic Web changed the world." .
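The key point of the example is that the pattern's XPath expression is evaluated once per node, with that node as the context. A small sketch with the JDK XPath API makes this visible. It is not the X2R implementation; the names are invented, @key stands in for data(@key), and trimming the literal is an assumption made so that the output matches the result shown above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PropertyBridgeSketch {

    /** For every node selected by nodeXPath, evaluates valueXPath relative to it. */
    public static List<String> titleTriples(String xml, String nodeXPath,
                                            String valueXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(nodeXPath, doc,
                    XPathConstants.NODESET);
            List<String> triples = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Both expressions use the current publication node as their context.
                String key = xpath.evaluate("@key", nodes.item(i));
                String value = xpath.evaluate(valueXPath, nodes.item(i)).trim();
                triples.add("<http://pubs.org/" + key + "> dc:title \"" + value + "\" .");
            }
            return triples;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<dblp>"
                + "<mastersthesis key=\"ms/Mylka97\"><title>\n"
                + "Efficient Data Maintenance at RDF databases\n</title></mastersthesis>"
                + "<article key=\"tr/oct/BAS2008-423\">"
                + "<title>How Semantic Web changed the world.</title></article>"
                + "</dblp>";
        for (String triple : titleTriples(xml, "/dblp/*", "title/text()")) {
            System.out.println(triple);
        }
    }
}
```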

Patterns

Both the xml2r:pattern and xml2r:uriPattern constructs share a similar syntax. They are strings with constant and variable elements. The variable elements are:

  • relative XPath expressions, enclosed in ${ }. The following rules apply:
    • If a pattern contains more than one XPath expression and those expressions yield more than one value within the context of a single element, then the cartesian product of the value sets of all XPath expressions in the pattern is used.
    • If the value of an XPath expression is a subtree, the text from that subtree is used, not the elements and their attributes. For instance, in XHTML you can simply write ${/html/body} to get all character content of the body without the markup.
  • a "special" pseudo-expression ${parentUri}, available only in the streaming adapter. Its value is specified as an argument to the dump method.
  • variable references, enclosed in &{ }. If a variable is unset, an empty string is returned.
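The cartesian product rule and the handling of unset variables can be sketched with a tiny pattern expander. This is an illustrative toy, not the X2R pattern engine: the XPath expressions are not evaluated here but looked up in a map of precomputed values, and the class and method names are made up. It also assumes that an expression with no values yields no result at all, which the text above does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternExpansionSketch {

    // Matches ${...} (expression) and &{...} (variable reference).
    private static final Pattern PLACEHOLDER = Pattern.compile("([$&])\\{([^}]*)\\}");

    /**
     * Expands a pattern into all strings it can produce.
     * xpathValues maps each expression to the values it yielded for one element.
     */
    public static List<String> expand(String pattern,
                                      Map<String, List<String>> xpathValues,
                                      Map<String, String> variables) {
        List<String> results = new ArrayList<>();
        results.add("");
        Matcher m = PLACEHOLDER.matcher(pattern);
        int last = 0;
        while (m.find()) {
            String constant = pattern.substring(last, m.start());
            List<String> values;
            if (m.group(1).equals("&")) {
                // An unset variable contributes an empty string.
                String value = variables.get(m.group(2));
                values = Collections.singletonList(value == null ? "" : value);
            } else {
                List<String> found = xpathValues.get(m.group(2));
                values = found == null ? Collections.<String>emptyList() : found;
            }
            // Cartesian product: every partial result is combined with every value.
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (String value : values) {
                    next.add(prefix + constant + value);
                }
            }
            results = next;
            last = m.end();
        }
        String tail = pattern.substring(last);
        List<String> finished = new ArrayList<>();
        for (String prefix : results) {
            finished.add(prefix + tail);
        }
        return finished;
    }

    public static void main(String[] args) {
        Map<String, List<String>> values = new HashMap<>();
        values.put("author/text()", Arrays.asList("A", "B"));
        values.put("year/text()", Arrays.asList("1992", "1997"));
        Map<String, String> variables = new HashMap<>();
        for (String s : expand("${author/text()} (${year/text()})", values, variables)) {
            System.out.println(s);
        }
    }
}
```

With two values for each of the two expressions, the demo prints four strings, one per combination.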

String transformers

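A string transformer is applied by appending || and the transformer name to an expression, as in ${wiki:title/text()||mwurlify} in the variable example.

  • mediawiki: removes the Mediawiki markup from the string and leaves just the readable text.
  • urlify: converts all characters that are illegal in URLs to their percent-encoded equivalents.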