Wednesday, December 2, 2009

The Value of Tools

Several things have happened recently to make me think about, and appreciate, the value of tools. Tabbed browsers are one example. Many institutions limit people to a specific browser, still thinking of a web browser as something you browse the web with rather than something you do your work with. Yet as more apps become web based, tabs become increasingly valuable. To me a tabbed browser is a productivity tool.
I’m into this thing called the Semantic Web. I’ve been doing some projects at work, but I’ve also decided to start doing some stuff on my own behalf. That means I have to keep everything separate, including the apps I use to achieve the results. At work I have access to some great software, for example Pipeline Pilot, but it comes at a cost. It’s true that this software does not understand anything about RDF and semantics. But I can easily cobble together something that gives me RDF as an end result, and its native ease of use makes processing and manipulating files and data trivial.
Over the next couple of months I will be RDF-ising collections of data available from the FDA website. They have huge collections of data in various silos, available for download mainly in flat-file format.
My thoughts on this are: it’s great that the data is there, but each of us has to do something with each file (normally load it into a relational database), and that doesn’t help create meaning from the data because nothing is linked.
My idea is to start creating RDF versions of the datasets and show how each dataset may give rise to new insights. More importantly, I want to start linking the datasets so that more thought-provoking questions can be asked across them. Here are a couple of examples.
1. Given the published adverse events, can I use statistical methods (possibly disproportionality analysis) to spot drugs that have an unusually high rate of reports in a given month, and is this rate increasing over time?
2. Given an adverse event for a drug, you may then want to find all the other related drugs that could be affected. So I intend to link the drugs in the adverse events to the Orange Book (the FDA’s list of approved drugs). Each drug has a list of ingredients, so any other drug with common or overlapping ingredients may also be affected.
Now this may seem silly, but if a company waits for the FDA to spot a trend like this, it is likely that the drug will be withdrawn. Wouldn’t it be great if the company could see the trend itself and intervene to find out what is happening? It might be that the drug is being prescribed in the wrong way, and all they need to do to remedy this is send out new guidelines and some notes to the doctors. The cost would be negligible compared to losing the drug from the market.
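To make the first example a little more concrete, here is a minimal sketch of one common disproportionality measure, the proportional reporting ratio (PRR): the share of a drug’s reports that mention a given event, divided by the same share for all other drugs. The function names and the sample counts are invented for illustration and are not FDA data; a real analysis would need proper case counting and significance checks on top of this.

```python
from collections import Counter, defaultdict


def prr(reports, drug, event):
    """Proportional reporting ratio for one drug/event pair.

    reports: iterable of (drug, event) tuples, one per adverse event report.
    Returns None when the ratio is undefined (a zero denominator).
    """
    a = b = c = d = 0
    for (r_drug, r_event), n in Counter(reports).items():
        if r_drug == drug and r_event == event:
            a += n  # this drug, this event
        elif r_drug == drug:
            b += n  # this drug, any other event
        elif r_event == event:
            c += n  # any other drug, this event
        else:
            d += n  # any other drug, any other event
    if a + b == 0 or c == 0:
        return None
    return (a / (a + b)) / (c / (c + d))


def prr_by_month(dated_reports, drug, event):
    """Group (month, drug, event) tuples by month and compute the PRR for each."""
    by_month = defaultdict(list)
    for month, r_drug, r_event in dated_reports:
        by_month[month].append((r_drug, r_event))
    return {m: prr(rows, drug, event) for m, rows in sorted(by_month.items())}


# Invented counts, purely for illustration (hypothetical drugs and months).
sample = (
    [("2009-10", "drug_a", "nausea")] * 2
    + [("2009-10", "drug_a", "headache")] * 8
    + [("2009-10", "drug_b", "nausea")] * 5
    + [("2009-10", "drug_b", "headache")] * 95
    + [("2009-11", "drug_a", "nausea")] * 4
    + [("2009-11", "drug_a", "headache")] * 6
    + [("2009-11", "drug_b", "nausea")] * 5
    + [("2009-11", "drug_b", "headache")] * 95
)
trend = prr_by_month(sample, "drug_a", "nausea")
```

A PRR well above 1 means the event is reported disproportionately often for that drug, and a value that climbs from one month to the next is the kind of trend a company would want to see early.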
There is also the opportunity to start linking this data into and/or with the Linking Open Data project, which already includes some great information, e.g. DrugBank, MedDRA etc.
I’ve started converting the Orange Book data. It seems relatively simple, with just three flat text files. However, on inspection of the products.txt file you find that it describes not only the product but also the drug and its multiple ingredients, each with a dose. All of this is in a single line of data with multiple, different delimiters. So how do I manipulate this text into a better format from which to create RDF? Well, without my expensive tools it’s not been easy. KNIME and Talend have helped, but I don’t always want to think of my data as simple tables. I want to create more detailed relationships, to split, pivot and recombine data, and I want to do it without writing lots of Java, Perl or Python. Quite a bit has been done in our old friend Excel, and all of this is just preparing the data ready to be RDF-ised! You might expect published data to be in tip-top condition, but it’s amazing how much data cleaning needs to be done.
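For anyone who does end up scripting it, here is a small sketch of the split-and-recombine step. It assumes that fields in products.txt are separated by "~", that ingredient names and strengths are each separated by ";", and that the dosage form and route share one ";"-separated field; check all of that against the header row of your own download. The sample line, the `example.org` namespace and the property names are invented placeholders, not an official vocabulary.

```python
import re

# Assumed layout: the first five "~"-separated columns of products.txt.
FIELDS = ["ingredient", "df_route", "trade_name", "applicant", "strength"]
BASE = "http://example.org/orangebook/"  # placeholder namespace


def parse_product(line):
    """Split one product line into a dict with (ingredient, dose) pairs."""
    values = [v.strip() for v in line.rstrip("\n").split("~")]
    if len(values) < len(FIELDS):
        raise ValueError("expected at least %d fields" % len(FIELDS))
    row = dict(zip(FIELDS, values))
    names = [n.strip() for n in row["ingredient"].split(";")]
    doses = [s.strip() for s in row["strength"].split(";")]
    # Pair each ingredient with a dose only when the counts line up.
    if len(doses) != len(names):
        doses = [None] * len(names)
    form, _, route = row["df_route"].partition(";")
    return {
        "trade_name": row["trade_name"],
        "applicant": row["applicant"],
        "dosage_form": form.strip(),
        "route": route.strip(),
        "ingredients": list(zip(names, doses)),
    }


def slug(text):
    """Lower-case the text and collapse anything non-alphanumeric to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def literal(text):
    """Quote a string as an N-Triples literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(product):
    """Emit N-Triples lines: product -> component -> ingredient and strength."""
    p_slug = slug(product["trade_name"])
    prod = f"<{BASE}product/{p_slug}>"
    out = [f"{prod} <{BASE}tradeName> {literal(product['trade_name'])} ."]
    for name, dose in product["ingredients"]:
        comp = f"<{BASE}component/{p_slug}/{slug(name)}>"
        out.append(f"{prod} <{BASE}hasComponent> {comp} .")
        out.append(f"{comp} <{BASE}ingredient> <{BASE}ingredient/{slug(name)}> .")
        if dose is not None:
            out.append(f"{comp} <{BASE}strength> {literal(dose)} .")
    return out


# Invented sample line in the assumed shape (hypothetical product and applicant).
line = "ACETAMINOPHEN; CODEINE PHOSPHATE~TABLET;ORAL~EXAMPLETAB~EXAMPLE PHARMA~300MG;30MG"
product = parse_product(line)
triples = to_ntriples(product)
```

Because every product points at a shared ingredient URI, two products with an ingredient in common end up linked through it, which is what the "overlapping ingredients" question needs.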
A friend recently had a similar experience. He had left a semantic company to work on his own. Then came the questions. “What’s the easiest way to manipulate data?” My answers were similar to the above. “How do I create RDF?” “That’s another good question,” I replied. “How do I show the data?” “Exhibit is a good start, but it won’t deal with much data, and you have to have it in JSON format.” “Ah, that could be a problem,” he said. Of course there is Babel to do the conversion. “But I can’t use that,” he said, “it’s private data.”
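For private data, one way round the Babel problem is to build the JSON locally. The sketch below assumes the simple layout Exhibit data files use, a top-level `items` list in which each item carries a `label` and a `type`; the helper name and the sample records are made up for illustration.

```python
import json


def to_exhibit(records, label_key, item_type):
    """Wrap a list of dicts in an Exhibit-style {"items": [...]} structure."""
    items = []
    for rec in records:
        # Drop empty values so they do not show up as blank facets.
        item = {k: v for k, v in rec.items() if v not in (None, "", [])}
        item["label"] = str(rec[label_key])
        item["type"] = item_type
        items.append(item)
    return {"items": items}


# Invented records (hypothetical product names and values).
records = [
    {"trade_name": "EXAMPLETAB", "route": "ORAL", "ingredient": ["ACETAMINOPHEN", "CODEINE PHOSPHATE"]},
    {"trade_name": "SAMPLEDROPS", "route": "", "ingredient": ["EXAMPLE INGREDIENT"]},
]
data = to_exhibit(records, label_key="trade_name", item_type="Product")
text = json.dumps(data, indent=2)  # save this string as the Exhibit data file
```

Nothing leaves your machine, so the privacy objection goes away, although Exhibit’s limit on how much data it can handle still applies.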
Now my friend is a techie and was able to produce scripts / apps to do the stuff he needed. I’m not, so I continue to struggle.
However, there is some light. TopQuadrant have released a free version of their TopBraid Composer software. It’s a “MUST HAVE” app for your desktop if you intend to do anything with semantic data. I’m not going to go into it here; there is so much in it that I’m taking a while to get to grips with it all.
At the end of the day I just want to do something with the data and not spend most of my time creating it. It’s hard to show people the hoops I have had to jump through and expect them to think anything other than “This semantic web stuff seems a lot of hassle”. It’s hard to “cross that chasm” unless you have tools, and it’s impossible to expect someone to do this while trying to understand what the semantic web is all about in the first place.

1 comment:

Michael Waclawiczek said...

Interesting blog. I encourage you to learn more about expressor's semantic integration system. Although we are not a Semantic Web type application, we do solve many complex data integration problems that greatly benefit from our semantic metadata abstraction.