2014/10/14

back to the beginning ... a library module

It has been a while since my last post. But like my friend who blogs here - check it out, he has style - I just can't stop doing it.

In my last post I presented the KBOData site. For all its fancy features, the real work is to get all the raw data (1.12 Gigabytes worth of csv files) into the correct format for the - monthly - database load. The database is Stardog so the csv has to be transformed into one of the rdf formats. Turtle was selected.

For your information, the 1.12 Gigabytes of csv gets turned into 9.68 Gigabytes worth of triples.

Now there are a lot of tools available in NetKernel and we could definitely write our own csv-processor but there are good libraries available. I selected Super CSV and created a library module with it. A library module provides - in case that wasn't clear - functionality to other modules.

I'm not going to discuss the whole module (which you can find here; the module name is urn.org.elbeesee.supercsv). If you've followed the Back to the beginning series, most of it should be familiar. I am going to discuss the new stuff though.

I removed the class file and the supercsv jar file before checking the module into Github (both to save space on Github and to avoid errors due to a different environment). This means the module will not work as is; you'll need to add the jar and compile it yourself.

One. The version in module.xml matches the version of the Super CSV jar file (2.2.0 at the time of writing). This is good practice when you wrap 3rd party software (as we are doing here).

Two. The module contains a lib directory underneath the module's root. This is where we're going to put the 3rd party jars. In this case super-csv-2.2.0.jar which you can get from the Super CSV download.

Three. We add functionality. The active:csvfreemarker accessor takes a csv file as input, applies a freemarker template to each line and writes the output to a different file. It assumes that the first row of the csv contains the column headers.
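
To make the per-line idea concrete, here is a minimal sketch in plain Java. It is not the accessor from the module: it uses neither Super CSV nor Freemarker, it splits naively on commas (no quoting or escaping, which is what Super CSV handles for real), and a simple ${column} substitution stands in for the Freemarker template. The class name, the method names and the sample template are assumptions made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

public class CsvTemplateSketch {

    // Replaces every ${column} placeholder with the row's value for that column.
    // This is a stand-in for a Freemarker template, not the Freemarker API.
    public static String render(String template, Map<String, String> row) {
        String result = template;
        for (Map.Entry<String, String> entry : row.entrySet()) {
            result = result.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    // Treats the first line as the column headers, then renders the template
    // once per data line and writes each result on its own output line.
    public static void process(Reader in, Writer out, String template) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String headerLine = reader.readLine();
        if (headerLine == null) {
            return; // empty input, nothing to write
        }
        String[] header = headerLine.split(",", -1);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] values = line.split(",", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                // Missing trailing columns become empty strings.
                row.put(header[i], i < values.length ? values[i] : "");
            }
            out.write(render(template, row));
            out.write("\n");
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        String csv = "firstname,lastname\ntom,geudens\npeter,rodgers\n";
        StringWriter out = new StringWriter();
        process(new StringReader(csv), out, "${lastname} ${firstname}");
        System.out.print(out); // one rendered line per csv data row
    }
}
```

The real accessor does the same loop, but with a proper csv parser feeding a map of column name to value into the template engine.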

We could export the Super CSV classes so that they can be used directly in other modules. While there may be cases where this is useful, this often quickly leads to classloader hell. Keeping the 3rd party functionality wrapped inside is the best way to go.

Four. The accessor itself contains nothing special (you'll find it - minus the freemarker processing - in the examples on the Super CSV site).

Five. The active:csvfreemarker response just mentions that the input file has been processed. It is the side-effect (the output file) that we are interested in, so the response is expired immediately.

Six. A unittest is provided. You need to replace the values for infile, outfile and the stringEquals value with your own. The input file could for example contain this :

firstname,lastname
tom,geudens
peter,rodgers
tony,butterfield
tom,mueck
rené,luyckx


Which will result in this output file :
geudens tom
rodgers peter
butterfield tony
mueck tom
luyckx rené


Note that Freemarker allows a lot more in its templates than is shown in the unittest. Here's one from KBOData :

<#if CONTACTTYPE == "TEL">
<http://data.kbodata.be/organisation/${ENTITYNUMBER?replace(".", "_")}#id> <http://www.w3.org/2006/vcard/ns#hasTelephone> "${VALUE}" .
<#elseif CONTACTTYPE == "EMAIL">
<http://data.kbodata.be/organisation/${ENTITYNUMBER?replace(".", "_")}#id> <http://www.w3.org/2006/vcard/ns#hasEmail> "${VALUE}" .
<#elseif CONTACTTYPE == "WEB">
<http://data.kbodata.be/organisation/${ENTITYNUMBER?replace(".", "_")}#id> <http://www.w3.org/2006/vcard/ns#hasURL> "${VALUE}" .
</#if>
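
If you don't read Freemarker fluently, the same per-row decision can be sketched in plain Java. This is only an illustration of what the template does, not code from the module: the class and method names are made up, the row is assumed to arrive as a map keyed by column header, and the sample entity number and phone number are hypothetical. It writes each triple in the plain "subject predicate object ." form and, like the template, does not escape the value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ContactTripleSketch {

    private static final String VCARD = "http://www.w3.org/2006/vcard/ns#";
    private static final String ORGANISATION = "http://data.kbodata.be/organisation/";

    // Returns one Turtle triple for a contact row, or an empty string when the
    // CONTACTTYPE is not TEL, EMAIL or WEB (the template emits nothing then).
    public static String toTriple(Map<String, String> row) {
        String type = row.get("CONTACTTYPE");
        String predicate;
        if ("TEL".equals(type)) {
            predicate = "hasTelephone";
        } else if ("EMAIL".equals(type)) {
            predicate = "hasEmail";
        } else if ("WEB".equals(type)) {
            predicate = "hasURL";
        } else {
            return "";
        }
        // Same substitution as ${ENTITYNUMBER?replace(".", "_")} in the template.
        String subject = ORGANISATION + row.get("ENTITYNUMBER").replace(".", "_") + "#id";
        // VALUE is inserted as-is; quotes or backslashes in it are not escaped.
        return "<" + subject + "> <" + VCARD + predicate + "> \"" + row.get("VALUE") + "\" .";
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("ENTITYNUMBER", "0123.456.789"); // hypothetical entity number
        row.put("CONTACTTYPE", "TEL");
        row.put("VALUE", "012345678");           // hypothetical phone number
        System.out.println(toTriple(row));
    }
}
```

Keeping this logic in a template rather than in Java is what lets the same accessor serve every csv file: only the template changes.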


Seven. Usage. In fact the unittest shows how to use the library module. You import the public space and provide freemarker templates as res:/resources/freemarker/[template].freemarker resources. Enjoy !




P.S. I noticed that so far I have assumed that you know how to set up your development environment in order to build a NetKernel module. If this is not the case, there are tutorials for Eclipse and IntelliJ in the documentation.

P.P.S. Applying a Freemarker request to every single line takes - even in NetKernel and even using the lifesaver for batch processing - a while (depending on the size of the input of course). In a future post I'll discuss how we can fan out the requests.