Wednesday, January 27, 2016

SPARQL SNOMED CT with Jena Fuseki

Above is what I tweeted, and then people asked for the link, so I had to write up the recipe in a blog post. I hope this answers the questions.

The result is a web user interface (and a REST endpoint) for running SPARQL queries on the SNOMED CT ontology.

The ingredients:
  1. SNOMED CT RF2 files - We have a license and receive the CD version.
  2. Java 8 <https://www.oracle.com/java/>
  3. Git <https://git-scm.com/downloads>
  4. Apache Maven <https://maven.apache.org/download.cgi>
  5. IHTSDO snomed-publish <https://github.com/IHTSDO/snomed-publish>
  6. Apache Jena & Fuseki <https://jena.apache.org/>
  7. Patience or fast machines ;-)
The machine:
  1. I used a VM with 2.8 GHz / 4 GB. I guess 8 GB would be better, but this worked for me
Directions:

1. Preparations
  1. Make sure you have Java 8 installed
  2. Get your hands on the RF2 files from SNOMED CT (I used the "snapshot" version)
    1. Delta contains SNOMED CT changes since the last release
    2. Full contains full history of SNOMED CT since 2002
    3. Snapshot contains the latest version of SNOMED CT
  3. Download and install Git
  4. I like the command line tools, but you could also go for the GUI version SourceTree <https://www.sourcetreeapp.com/>
  5. Download and install Maven
2. Build IHTSDO snomed-publish

The IHTSDO snomed-publish package contains a component called the rdfs-exporter that can convert the RF2 format into triples. The triples file is needed to populate the triple store (TDB) that Fuseki uses.

Some background information:
https://github.com/IHTSDO/snomed-publish/blob/master/config/README.md
https://github.com/IHTSDO/snomed-publish/tree/master/client/rdfs-export-main

Some packages don't compile, and I don't know why. We only need rdfs-export-main, but it depends on some of the other components. Luckily Maven has the "-fn" (fail never) option, which keeps Maven from stopping when a component fails.

Open a cmd box and go to the snomed-publish folder and run:
> mvn clean install -fn
You will see a lot of output and some errors. Just ignore them.
Now go to the client/rdfs-export-main folder and run:
> mvn install 
The result will be the rdfs-export.jar file in the client/rdfs-export-main/target folder.

3. Convert the RF2 files into triples

Go to the RF2 folder and run (fill in the path to the jar):
> java -Xms4000m -jar /rdfs-export.jar -c sct2_Concept_Snapshot_INT_20150731.txt -d sct2_Description_Snapshot_INT_20150731.txt -t sct2_Relationship_Snapshot_INT_20150731.txt -if RF2 -of N3 -o sct.n3
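If you don't want to type the release date into three file names by hand, the command above can be assembled by a small script. This is only a sketch: the flags are the ones used above, but the helper names (`pick_one`, `build_export_command`) and the wildcard patterns are my own assumptions, and the jar path is still something you have to fill in yourself.

```python
import fnmatch


def pick_one(file_names, pattern):
    """Return the single name matching `pattern`, or raise if there is not exactly one."""
    matches = sorted(fnmatch.filter(file_names, pattern))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one file matching {pattern!r}, found {len(matches)}"
        )
    return matches[0]


def build_export_command(file_names, jar_path, output="sct.n3", heap="4000m"):
    """Build the rdfs-export command line for the snapshot files in `file_names`.

    `file_names` is a plain list of names, e.g. os.listdir() of the RF2 folder.
    The patterns are an assumption based on the file names used in this post.
    """
    concepts = pick_one(file_names, "sct2_Concept_Snapshot*.txt")
    descriptions = pick_one(file_names, "sct2_Description_Snapshot*.txt")
    relationships = pick_one(file_names, "sct2_Relationship_Snapshot*.txt")
    return [
        "java", f"-Xms{heap}", "-jar", jar_path,
        "-c", concepts,
        "-d", descriptions,
        "-t", relationships,
        "-if", "RF2", "-of", "N3", "-o", output,
    ]
```

From inside the RF2 folder you could pass `os.listdir(".")` as `file_names` and hand the resulting list to `subprocess.run`. The script fails loudly when a pattern matches zero or several files, which is safer than silently picking the wrong release.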

The output will look something like this:

4a. Run rdfparse on the N3 file

In theory the N3 file can be used directly by the Jena tdbloader, but running rdfparse first still seems necessary.
> rdfparse sct.n3 > sct-pure.n3

4b. Create the Triple Store

Create the triple database from the SCT N3 file:
> tdbloader --loc=d:\work\TDB sct-pure.n3

5. Start Fuseki

> fuseki-server --loc=d:\work\TDB /sct

N.B.
The default Shiro config of Fuseki stops it from showing any datasets when accessed from anywhere other than localhost. Simply editing run/shiro.ini and changing "/$/** = localhostFilter" to "/$/** = anon" does the trick.
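If you rebuild the server often, the shiro.ini change can be scripted. Below is a minimal sketch of the text replacement only. The function name `open_admin_to_anon` is hypothetical, and it assumes the rule appears on a line of its own, as in the default file. Keep in mind that this opens the admin pages to everyone who can reach the server.

```python
def open_admin_to_anon(shiro_text):
    """Return `shiro_text` with the '/$/** = localhostFilter' rule switched to 'anon'.

    Every other line is passed through unchanged, including its line ending.
    """
    result = []
    for line in shiro_text.splitlines(keepends=True):
        key, sep, value = line.partition("=")
        if sep and key.strip() == "/$/**" and value.strip() == "localhostFilter":
            # Keep whatever line ending the original line had.
            ending = line[len(line.rstrip("\r\n")):]
            line = "/$/** = anon" + ending
        result.append(line)
    return "".join(result)
```

Read run/shiro.ini, pass its contents through the function, and write the result back before starting Fuseki.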


Now you can browse to localhost:3030 and you should see something like this:

Click on "query" and have fun sparql-ing SNOMED CT!
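The web form is not the only way in: the same queries can be sent over HTTP. Here is a rough sketch using only the Python standard library. It assumes the dataset was started as /sct on port 3030 as above and that Fuseki exposes its query service at /sct/query. The function names are my own, so adjust the endpoint if your setup differs.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for a dataset started as "/sct" on the default port.
DEFAULT_ENDPOINT = "http://localhost:3030/sct/query"


def build_query_request(query, endpoint=DEFAULT_ENDPOINT):
    """Build a SPARQL protocol POST request that asks for JSON results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def run_select(query, endpoint=DEFAULT_ENDPOINT):
    """Send a SELECT query and return one {variable: value} dict per result row."""
    with urllib.request.urlopen(build_query_request(query, endpoint)) as response:
        results = json.load(response)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in results["results"]["bindings"]
    ]
```

With Fuseki running, something like `run_select("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")` should return five rows as plain dictionaries.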

Some SPARQL examples:

Get properties and values of a specific class "Procedure (procedure)"

PREFIX sct:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?property ?obj
WHERE {
  sct:71388002 ?property ?obj .
}

Check if a class is a kind of "Procedure (procedure)" regardless of path length.

SELECT ?property ?obj
WHERE {
  sct:250404007 ?property ?obj .
  sct:250404007 rdfs:subClassOf+ sct:71388002 .
}
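The "+" after rdfs:subClassOf is a SPARQL property path meaning "one or more steps", which is what makes the check independent of path length. To illustrate the idea outside the triple store, here is a toy sketch in Python. The hierarchy and the name `is_kind_of` are made up for the example and have nothing to do with real SNOMED CT identifiers.

```python
def is_kind_of(child, ancestor, parents):
    """True if `ancestor` is reachable from `child` in one or more subClassOf steps.

    `parents` maps each class to the list of its direct superclasses.
    """
    seen = set()
    stack = list(parents.get(child, ()))  # start one step up: "+" needs at least one step
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return False


# Toy hierarchy: biopsy -> surgical procedure -> procedure
toy_parents = {
    "biopsy": ["surgical procedure"],
    "surgical procedure": ["procedure"],
}
```

Here `is_kind_of("biopsy", "procedure", toy_parents)` is true even though the two are two steps apart, while a class is not a kind of itself, because "+" requires at least one step.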