Monday, February 1, 2010

FDA Electronic Orange Book in RDF

I was hoping to have posted this entry before Christmas, but I ran into a few issues.
Issue 1: I bought a Wii console and got a bit tied up in it. Issue 2: trying to find a suitable application in which to demo the data in this blog.
Due to the size of the data it is impossible to use Exhibit as a front end, so I have been trying to find out what other tools are out there. Unfortunately I’m still looking, so I’ve decided to publish the data files while the search continues.
Preparing this data was an experience in learning some new tools and some new features of old favourites. Everything was completed using two products: Knime for the original data manipulation and clean-up, while TopBraid Composer was used to generate the RDF and manipulate it into the appropriate graph structure. I’ve made quite a few decisions along the way, including some around URIs, but I won’t bore you with the details.
In the end I’m pretty pleased with the result, but frustrated that I can’t create a UI to show you. I have even tried to work out a SPARQL query to create a sample of linked data which could be shown, but I’ve failed in that task too.
For those of you who don’t know, the FDA Electronic Orange Book contains details of drug products, active ingredients and some patent information. Having created the data I was pretty amazed at the results.
For example it contains details of 24.5K products used to create 15.7K drugs, which are made up of just 1.8K active ingredients!
I’ve zipped the files into one folder containing orangebook.owl (holding the concepts and properties used) and FDAEOB.owl (containing the RDF data).
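If you want to sanity-check those numbers yourself once you have the files, a minimal sketch in Python with rdflib would be something along these lines. I’m assuming hypothetical class names (ob:Product, ob:Drug, ob:Ingredient) and a hypothetical namespace here, so check orangebook.owl for the real concept names before running it.

    # A minimal sketch using rdflib; the class names and namespace below are
    # assumptions, so check orangebook.owl for the actual concepts first.
    from rdflib import Graph

    g = Graph()
    g.parse("orangebook.owl", format="xml")  # concepts and properties
    g.parse("FDAEOB.owl", format="xml")      # the instance data

    counts = """
    PREFIX ob: <http://example.org/orangebook#>
    SELECT (COUNT(DISTINCT ?p) AS ?products)
           (COUNT(DISTINCT ?d) AS ?drugs)
           (COUNT(DISTINCT ?i) AS ?ingredients)
    WHERE {
      { ?p a ob:Product } UNION { ?d a ob:Drug } UNION { ?i a ob:Ingredient }
    }
    """
    for row in g.query(counts):
        print(row.products, row.drugs, row.ingredients)
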
I’ve also been doing some experiments and found that it is pretty easy to link ingredients, products and drug entries to instances in the Linked Open Data cloud using DBpedia and other sources. I’m going to give this some more thought and publish an RDF file containing these links. I’m hoping this will increase the utility of the RDF Orange Book data by giving people some initial jumping-off points to other sources so that yet more information can be combined. This is the whole point of the Semantic Web for me.
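To give a flavour of what that link file might look like, here is a rough sketch, again in Python with rdflib. The ingredient URI, the namespace and even the choice of owl:sameAs are all illustrative; I haven’t settled on the final modelling yet.

    # A rough sketch of the kind of link file I have in mind: owl:sameAs
    # triples pointing ingredients at DBpedia resources. The ingredient URI
    # and namespace are made up for illustration.
    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL

    OB = Namespace("http://example.org/orangebook#")   # hypothetical namespace
    DBR = Namespace("http://dbpedia.org/resource/")

    links = Graph()
    links.bind("owl", OWL)
    links.add((OB["ingredient/IBUPROFEN"], OWL.sameAs, DBR["Ibuprofen"]))

    links.serialize(destination="orangebook-dbpedia-links.rdf", format="xml")
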
You can grab the files here

Wednesday, December 2, 2009

The Value of Tools

Several things have happened recently to make me think about and appreciate the value of tools. Tabbed browsers are one example. Many institutions limit people to a specific browser, still thinking of a web browser as something you browse the web with rather than something you do your work in. Yet as more apps become web based, the value of tabs becomes increasingly important. To me a tabbed browser is a productivity tool.
I’m into this thing called the Semantic Web. I’ve been doing some projects at work, but I’ve also decided to start doing some stuff on my own behalf. That means I have to keep everything separate, including the apps I use to achieve the results. At work I have access to some great software, for example Pipeline Pilot, but it comes at a cost. It’s true that this software does not understand anything about RDF and semantics, but I can easily cobble together something that gives me RDF as an end result, and its ease of use means that processing and manipulating files and data is trivial.
Over the next couple of months I will be RDF-ising collections of data available from the FDA website. They have huge collections of data in various silos available for download, mainly in flat file format.
My thoughts around this are: it’s great that the data is there, but it leaves each of us to do something with each file (normally create an RDBMS database), and it doesn’t help create meaning from the data because nothing is linked.
My idea is to start creating RDF versions of the datasets and show how each dataset may give rise to new insights. More importantly, I want to start linking the datasets to allow more thought-provoking questions to be asked across them. Here are a couple of examples.
1. Given the adverse events published, can I use statistical methods (possibly disproportionality analysis) to spot drugs that have an unusually high incidence rate in a given month, and is this rate increasing over time?
2. Given an adverse event for a drug, you may then want to find all other related drugs that could be impacted. So I intend to link the drugs in the adverse events to the Orange Book (the list of all approved drugs). Each drug has a list of ingredients, and so any other drug with common or overlapping ingredients may also be affected (a rough sketch of such a query follows below).
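Here is a rough sketch of question 2 in SPARQL, run through Python and rdflib. It assumes the adverse event data has already been RDF-ised and shares, or has been linked to, the Orange Book drug URIs; every file, class and property name below is made up for illustration.

    # A sketch only: the file names, classes and properties are assumptions,
    # not the real Orange Book or adverse event vocabularies.
    from rdflib import Graph

    g = Graph()
    g.parse("FDAEOB.owl", format="xml")            # Orange Book data
    g.parse("adverse_events.rdf", format="xml")    # hypothetical AE dataset

    related = """
    PREFIX ae: <http://example.org/adverse-events#>
    PREFIX ob: <http://example.org/orangebook#>
    SELECT DISTINCT ?otherDrug
    WHERE {
      ?event     a ae:AdverseEvent ;
                 ae:drug ?drug .
      ?drug      ob:hasIngredient ?ingredient .
      ?otherDrug ob:hasIngredient ?ingredient .
      FILTER (?otherDrug != ?drug)
    }
    """
    for row in g.query(related):
        print(row.otherDrug)
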
Now this may seem silly, but if a company waits for the FDA to spot such a trend, it is likely that the drug will be withdrawn. Wouldn’t it be great if the company could see the trend and intervene to find out what is happening? It might be that the drug is being prescribed in the wrong way, and to remedy this all they need to do is send out new guidelines and some notes to the doctors. The cost would be negligible compared to losing the drug from the market.
There is also the opportunity to start linking this data into and/or with the Linked Open Data project, which already includes some great information, e.g. DrugBank, MedDRA etc.
I’ve started converting the Orange Book data. It seems relatively simple with just three flat text files. However, on inspection of the products.txt file you find that it not only describes the product but also the drug and the multiple ingredients, each with a dose. All of this is in a single line of data with multiple, and different, delimiters. So how do I manipulate this text into a better format from which to create RDF? Well, without my expensive tools it’s not been easy. Knime and Talend have helped, but I don’t always want to think of my data as simple tables; I want to create more detailed relationships, split, pivot and recombine data, and I want to do it without writing lots of Java, Perl or Python. Quite a bit has been done in our old friend Excel, and all of this is just preparing the data ready to be RDF-ised! You might expect published data to be in tip-top condition, but it’s amazing how much data cleaning needs to be done.
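To show the shape of the problem, here is a rough Python sketch of pulling one product line apart before RDF-ising it. The delimiters and column positions are assumptions from memory rather than a specification, so treat it as illustrative only.

    # Illustrative only: I'm assuming '~' between fields and ';' between the
    # multiple ingredients, with strengths listed in the same order as the
    # ingredients. Check the real file layout before relying on this.
    def parse_product_line(line):
        fields = line.rstrip("\n").split("~")
        ingredients = [i.strip() for i in fields[0].split(";")]
        strengths = [s.strip() for s in fields[3].split(";")]  # position assumed
        trade_name = fields[2]                                  # position assumed
        return {
            "trade_name": trade_name,
            "ingredients": list(zip(ingredients, strengths)),
        }

    example = "IBUPROFEN; PSEUDOEPHEDRINE HCL~N~ADVIL COLD AND SINUS~200MG; 30MG"
    print(parse_product_line(example))
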
A friend recently had a similar experience. He had left a semantic company to work on his own. Then came the questions. “What’s the easiest way to manipulate data?” The answers were similar to the above. “How do I create RDF?” “That’s another good question,” I replied. “How do I show the data?” Answer: “Exhibit is a good start, but it won’t deal with much data, and you have to have it in JSON format.” “Ah, that could be a problem,” he said. But of course there is Babel. “But I can’t,” he said, “it’s private data.”
Now my friend is a techie and was able to produce scripts / apps to do the stuff he needed. I’m not, so I continue to struggle.
However there is some light. TopQuadrant have released a free version of their TopBraidComposer software. It’s a “MUST HAVE” app for your desktop if you intend to do anything with semantic data. I’m not going to talk about it, but there is so much in it that I’m taking a while to get to grips with it all.
At the end of the day I just want to do something with the data and not spend most of my time creating it. It’s hard to show people the hoops I have had to jump through and expect them to think anything other than, “This semantic web stuff seems a lot of hassle”. It’s hard to “cross that chasm” unless you have tools, and it’s impossible to expect someone to do this while trying to understand what the semantic web is all about in the first place.

Wednesday, November 25, 2009

SDTM Ontology for Clinical Data

I've been fortunate enough to work with clinical data this year. It's been a new experience with new challenges.

I was a bit scared at first, since clinical data is always thought of as the holy grail of Pharma data. I was expecting something very complicated, but in reality it's quite straightforward. There were two main challenges. The first was getting around a preconception that because this is regulated data you can't do anything with it. To me that means you can't touch the original data, but it shouldn't stop you duplicating the data and using it elsewhere to feed back into discovery research, or to look at other questions that the trial may not have focussed on. But all that's another story. The second challenge, after winning the first battle, was actually getting someone to agree to give you some data. Even within an organisation this proved tricky, for reasons that lead to yet another story, but that's for another time.

With those resolved I started to look at the data. The data was in a format aligned with the SDTM standard published by CDISC. The first thing that struck me was actually the simplicity of the data. I was expecting something a lot scarier, but it was scary how badly put together the data structures were. They seemed to be aligned with the physical CRFs used to submit the data, which made the data a bit powerless rather than powerful. However, with linked data we can change all of that.

So all we need now is the SDTM ontology to align and format the data against. Asking around I found some partial bits and pieces but nothing substantial to use with all the data I had, so there was no choice other than to write my own. Having studied the SDTM standard it didn't take long, but there were several choices to make along the way. The version I came up with isn't the way I would have designed it from a true ontology approach. There was an existing standard and I didn't want to stray too far away from it so that the concepts would remain familiar to people. Yet I also had to enable the data.

When looking at the ontology, those familiar with clinical data and SDTM will see the usual concepts such as demographics, subject visits, adverse events and medical history, along with all the others. What I did do was extract some of the major linking elements, which were not in fact separately described in the standard. So you will see concepts such as "unique subject", study and a few others. I've also added the properties with names matching the standard. In some cases I've also added some constraints, but not all. Concepts are of course linked with object properties, and I've used the "unique subject" class as the centre of attention, linking to just about everything. That's on purpose, as in most cases you want to correlate the subject with all the other data.
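
To illustrate why the "unique subject" hub matters, here is a rough sketch of the sort of question it makes easy: pulling demographics back alongside adverse events for every subject. The file names are placeholders and the class and property names are indicative only, so check the ontology itself for the exact terms.

    # Indicative only: file names are placeholders and the SDTM class and
    # property names below may not match the published ontology exactly.
    from rdflib import Graph

    g = Graph()
    g.parse("sdtm_ontology.owl", format="xml")
    g.parse("study_data.rdf", format="xml")

    q = """
    PREFIX sdtm: <http://cdisc.org/CDISC/Ontology/SDTM#>
    SELECT ?subject ?age ?sex ?aeTerm
    WHERE {
      ?subject a sdtm:UniqueSubject .
      ?dm sdtm:subject ?subject ; sdtm:age ?age ; sdtm:sex ?sex .
      ?ae sdtm:subject ?subject ; sdtm:adverseEventTerm ?aeTerm .
    }
    """
    for row in g.query(q):
        print(row.subject, row.age, row.sex, row.aeTerm)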

I've used the ontology in several pieces of software and it's worked pretty well. Asking questions of the data has been straightforward and has produced some pretty good demos.

I'm not saying the ontology is brilliant. I'm sure you'll find mistakes, omissions and areas where I've not gone into enough detail. However I'd like to do my bit for the community and put the ontology out there for people to use. Feel free to alter it for your needs and please give feedback on your experiences with it.

You can get it here. Have fun.

p.s. I've used the namespace "http://cdisc.org/CDISC/Ontology/SDTM#", which is not an official one.


Tuesday, November 24, 2009

It’s been a while

It's been a while since I last posted. I started the blog with good intentions but due to some controversy I laid off for a while and didn't get back into it.

Hopefully I'll keep going this time!

It's been an eventful year, with some highlights being:

The CSHALS conference was great, with some fabulous talks on Semantic Web stuff really being applied to life science areas. It really made me think. Thanks to Eric Neumann for inviting me to speak on the InChI stuff I have previously blogged about; I'm just sorry I made a bit of a hash of it in the allotted time.

SemTech was always going to be a highlight, especially taking part in two talks. Unfortunately travel budgets didn't allow my attendance, but the talks went ahead with others stepping in to do my bits as well as their own. Special thanks to Dean Allemang and Alan Greenblatt.

I've worked in some different areas this year, most notably with clinical data. This was a real eye-opener, if not for all the right reasons. I'll blog about this later and publish my version of an ontology I have put together to represent the SDTM standard used in the clinical arena.

I'm currently doing some projects working on FDA published data and I hope to publish some demos on this in the near future.

Back soon with lots to write about.... Hopefully.

Wednesday, November 19, 2008

On Track.... Quite Literally!

I've been riding motorcycles for 27 years, and during those years I've watched the bike racing on TV and said to myself "I'll do that one day". I had planned to do it before I was 40 but somehow that just seemed to pass me by. Even though the thought had become increasingly prominent in my mind over the last couple of years, I did nothing about it. I guess this was partly due to the stories I had heard of complete bedlam, with novices and full-blown racers being on the track at the same time, and partly because I couldn't bear the thought of my bike and myself in the kitty litter.

I don't really know what happened, but one day I found myself looking at the IAM website and reading a write up on a "Rider Skills" course held at Mallory Park race track... before I knew it I was signed up in a novice group for the afternoon event on 9th October.

As the weeks passed and the date approached, I became increasingly nervous, what had I done! What had I let myself in for! The appalling weather in late September and early October didn't help with these thoughts either. The day before the event I left work trembling even though the forecast was for an exceptionally nice sunny day and high temperatures. The day had arrived, and I had planned to set off around 9.30am for a nice leisurely trip from Reading to Mallory, aiming to get there in time to watch the later part of the morning session, well that was the plan. I didn't get into the garage until 9.50 as I seemed to spend ages fumbling around getting everything ready, then the bike refused to start! It's never done this before! Was it trying to tell me something, had she refused at the first hurdle? Eventually the bike decided to fire up and we were off. My riding was terrible... I could not seem to concentrate, but as the miles rolled by my nerves seemed to relax somewhat.

Arriving at Mallory, I joined a group of IAM riders that was growing as time passed. There was just about every kind of bike imaginable, including a massive BMW K1200LT. I found out later that a Goldwing had taken part in the morning session! I didn't have time to check out the track so instead I got chatting to the others waiting in the car park whilst I ate my sandwiches. As it turned out there were quite a few first timers, all of whom seemed as apprehensive as me. The sound from the track was impressive as bikes roared around having fun; I really should have gone and taken a peek, as you can't see from the car park. Then 12.30pm arrived, the sound from the track ceased, the main gates opened to allow access to the centre of the track and I was in a line of bikes heading for the registration office. The first thing was to sign up and get allocated to an instructor, then grab a cup of tea and wait in the briefing room for the pre-session talk. Here we met the organisers Roy Aston and John Lickley who, along with others, gave us our welcome and pre-session briefing. This consisted of learning how things worked, the layout of the circuit, consideration for others, what the flags meant, what not to do (it was not a race and there were no talent scouts watching), and also something that I can only describe as the "fandango" manoeuvre that was used to swap places with the instructor and fellow group members down the start/finish straight. This manoeuvre was necessary to have your turn at the front of the group. Briefing finished, on the bikes and down to the pit area where we met our instructors and were given coloured and numbered bibs. Our group turned out to be quite small with only three pupils (me (Phil), Paul and Phil) and our instructor Paul Jones. Paul was an amazing guy who rides bikes at a very high standard as a day job and has been a track instructor on skills days for a couple of decades. He was also an extremely fun guy who lived for bikes. Paul explained that he had a syllabus to teach us and we could only progress through the stages if he was happy that we had all reached the required level for each.

Our first session was simple: one lap with the instructor leading, execute a fandango down the straight and continue until everyone had had one lap at the front, at which point he would lead us back into the pits. This was all to be carried out at around 60mph, but don't look at the speedo! It sounded simple, but as we were the last group to exit the pits the track was quite full, and checking that you didn't get in anyone's way on the first corner was a little nerve-racking. A couple of laps into this session it hit me... "Oh My God I'm on a track!" I whooped as I took off my helmet back in the pits, much to the amusement of a couple of people. "I've done a lap," I said. "Err... four actually," Paul replied. Paul explained that the first session wasn't much more than a look at the track and a huge ice breaker for our nerves. It really worked.

The next session used to be "no braking laps" but had been replaced with "single gear laps". Paul wanted us to use 3rd gear, with 2nd allowed at the hairpin and the bus stop, but to use the brakes as little as possible, and each person would have two laps at the front. He was looking to ensure that we had perception skills and that we could judge throttle response appropriately for the hazards. Out on the track things felt good, the Fazer having bags of engine braking, and there were no dramas for any of the group. Our pace hadn't been any higher than the first session's, perhaps even slower in some sections with no brakes.

Back in the pits, the brief for the third session was any gear allowed but hard braking. He wanted to see a smooth transition from throttle (accelerating) to brakes with no coasting. Braking had to be completed with the bike upright before turning in and without upsetting the bike's stability, i.e. smooth application of the brakes, hard braking with a big squeeze, and a gentle release so that the bike did not dive and buck. Each person now had three laps at the front, but he also wanted us to "up the pace" somewhat to ensure the desired level of braking was achieved. This would also allow us to experiment with braking distances, which he hoped would get shorter and shorter with each lap. Just for good measure Paul said that as I was last in the line I should be an expert by the time I was leading the group. No pressure then! As it turned out the following laps became rather processional following other groups, which meant we didn't really achieve the pace Paul was looking for. That was until half way through my first lap when the entire field of riders in front peeled off into the pits, leaving me with an open track. It looked so enticing and scary at the same time. Up to the hairpin and I braked way too early, coasting the final few metres. Next time around, with increased pace and braking later for all the corners, things really started to make sense. It's amazing the feel you get from the brakes and how controlled you can be.

Back in the pits and Paul seemed pleased, so on with the next session, which was about gear selection for various parts of the track. We now had four laps at the front but we no longer needed to stick to our line formation once complete. Paul had one last piece of advice before going out: "Overtake when you get the chance". Out on the track the advice about short shifting for various sections made sense and seemed to be working. I was starting to use the throttle more effectively through the Esses, giving more drive down the straight where we were now passing other riders, much to our jubilation. It was my turn in front and I had been making steady progress, with more throttle and brakes being used as everything started to fall into place. However, on one lap I fixed my sights on passing a rider some distance ahead on the straight. I gave it more throttle than I had before and was approaching him fast. At some point I realised that I was not going to get past and brake to my usual entry speed before needing to turn into Gerrards. After a very brief moment of mental debate I decided to accept my higher entry speed (still a lot slower than some) and attempt an overtake through the corner. What a feeling it was as I rode around the outside and accelerated out of the corner. I'm not sure if they heard my "Yahoo!" in the pits. The only problem was that in the excitement I suddenly realised I had missed my braking point for the next chicane, Edwina's. On with the brakes and I squeezed as never before. Gentle release and turn in and I was through. Phew! Back in the pits Paul commented that it was nice to see the increase in pace and the braking point moving so far on. If only he knew!

The next and final session was "out on your own". Paul would be hooning around keeping an eye on us, but we could do our own thing. We still had 45 minutes left of track time, but before we did anything he made sure we took on plenty of fluids. I went out with 30 minutes to go, and 15 minutes in I was pretty tired and starting to make mistakes. I eased off a bit and just enjoyed the track and the experience for the rest of the session until the chequered flag came out. Paul seemed happy with us, stating that we had doubled our speed and halved our braking distance.

Finally there was a pit lane debrief with everyone, just to calm things down, before heading home.

I was a lot slower than many people around the track, in fact I was the slowest in my group, but I had experienced my bike and myself in a whole new light. What a day it had been and one I would recommend to everyone.

Thanks to John, Roy, Paul and everyone else who made the day possible, it was greatly appreciated by everyone.

Photos of the event can be seen at http://www.photoboxgallery.com/roberthands or my personal pics can be viewed at http://www.flickr.com/photos/pashworth/sets/72157607960474703/. I didn't say I was Valentino Rossi so don't expect too much.

Cool URIs for Molecules

Firstly I have to say that the title of this blog is a direct plagiarism of the W3C paper "Cool URIs for the Semantic Web" by Leo Sauermann et al.

Introduction: The nature of the problem

Part of our role in informatics is to create and maintain databases that hold chemical structure information. These databases allow users to search for chemical structures or sub-portions (substructures) of them. In order to achieve this we implement chemical search engines provided by vendors. These chemical search engines provide a means to rapidly filter chemical structures through the creation of specific indexes and atom-matching functionality. The need to have a chemical search engine is not in doubt; however, using it to join information creates a huge technical overhead and inflexibility within the data. Within an organisation there may be many chemical databases, each serving a different purpose: internal molecules; externally available molecules from vendors, including reagents and screening molecules; competitor molecules and related information. All of these form a wealth of information that could be related to chemical structure.

There are several situations where we would wish to link, merge or extend the capabilities of these chemical databases.

e.g.

  1. Many companies create their own chemical database of available compounds supplied by various companies around the world. To do this we bring together chemical catalogues (supplied in a variety of formats) based on the chemical structure using a chemical search engine. Some software vendors supply a ready correlated database in their particular database format, but these systems are often too restrictive, as they do not allow us to update the system with the information we desire.
  2. When company mergers happen (all too frequently!) there is always a mad scramble to import the contents of one company's chemical database into the other's. This can take weeks to complete and yet we may still lose information, as the schema is not set up to handle all of the information that the other holds.
  3. We may be generating new sets of data properties that our existing chemical databases cannot cope with. If the same chemicals are in multiple databases then we have quite a job extending the scope of each schema and updating the appropriate primary ID for the data source.

When we bring chemical information together from various sources we need to carry out an "exact match" search using a chemical search engine. This can be quite (extremely) time-consuming and may need to be rerun often if used to create a joining table of dependent systems.

There is a growing need within the industry to merge more and more information. Take for example the plethora of information being generated in the external literature each day. Wouldn't it be nice to merge all the latest information gleaned about a molecule from the literature to enhance our total knowledge about it? The problem here is that the literature is not marked up well enough (yet), so we have to resort to image extraction and conversion to a chemical structure that can be submitted to a chemical search engine on one of our databases. This is a lot of work!

Wouldn't it be great if we had an identifier that simply and uniquely identified each and every chemical structure?

Other than the chemical structure itself, there isn't really any way to tell if two structures are the same. Yes, there are standards for naming molecules, but depending on which software you use the name can be composed differently, which brings in ambiguity. There are also textual representations of chemical structure such as SMILES, the mol file (CTAB), InChI and a variety of other formats from different vendors; however each can be formed differently, bringing in the ambiguity factor once again. One might say that within a company we might be able to produce a single identifier, but this could (and probably would) be different from company to company.

How can we produce a single unique identifier for each molecule that anyone can calculate and use?


Background

This work started out over a year ago when I was looking for a project that would demonstrate the value of the semantic web, or more precisely linked data.

I was attending the W3C RDF from RDB meeting in Boston. I was still a semantic dummy at that time (perhaps I still am, but that's a different topic) and I was still trying to bring it all together in my head. I found the meeting of great value and well worth attending, especially as afterwards I found myself meeting and heading off down the pub with Eric Neumann and a few others. Eric and I got chatting about various things, including the use of immutable identifiers in URIs. Eric chatted about his headache of how to achieve this for proteins, and then went on to describe some thoughts about what could be used for identifiers in the chemical world. Eric proposed the use of the InChI string for the identifier and mentioned how cool it would be to have a URI that not only encoded chemical structure, but one from which chemical properties could also be calculated. As a chemist turned informatics person I had more than a passing interest in this idea, and I wondered if this was the angle I had been looking for to create my internal demo. I had a few doubts about the use of the InChI string as it also suffers from ambiguity issues. However, I could see that the recently introduced (at that time) InChIKey could provide the answer. More importantly it was produced and distributed free by IUPAC, a respected organisation that is responsible for many standards in the chemical sciences.


The Project

I decided to create a demo that would illustrate the combination of chemical entities from a variety of external suppliers without the need for a chemical search engine. Gathering information from the external suppliers was an easy task, as there are a large number of suppliers out there, all supplying the data in the well-known SD file text format. I needed to generate the InChIKey for each molecule using the IUPAC software, combine this with a namespace to create the URI and assign all the information in the SD file to that URI. Then all I had to do was create an RDF file for each of the suppliers and the URIs should do the rest of the work for me. In addition to the IUPAC software to generate the InChIKeys, I also used Pipeline Pilot for some text processing and TopBraid Composer as the environment to bring it all together. The final icing on the cake was the use of TopBraid Live to demonstrate a model-driven application that can change with the data.

To give a little more detail on the construction of the URI: an InChIKey looks something like AAFXCOKLHHTKDM-UHFFFAOYAA. If you click on the link it will show a picture of the molecule it represents. As for the namespace, I wanted to use something that everyone could adopt. I felt that if I could choose something that perhaps IUPAC could ratify then it would have a good chance of adoption. In the end I went for a reserved namespace of "urn:iupac:inchikey", so the above molecule's full URI would be

<urn:iupac:inchikey#AAFXCOKLHHTKDM-UHFFFAOYAA>

I think that's quite a cool URI for a molecule, but I have no idea if others would agree or indeed if IUPAC would be prepared to get involved in the Semantic Web and reserve the namespace for everyone's use.
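
For anyone wondering what the mechanics look like, here is a minimal Python/rdflib sketch of turning an InChIKey into the molecule URI and hanging a couple of SD file fields off it. Generating the InChIKey itself is left to the IUPAC software or a cheminformatics toolkit, and the property namespace and values below are placeholders, not a standard.

    # A minimal sketch; the property namespace and values are placeholders.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import XSD

    EX = Namespace("http://example.org/chem#")  # hypothetical property namespace

    def molecule_uri(inchikey):
        return URIRef("urn:iupac:inchikey#" + inchikey)

    g = Graph()
    mol = molecule_uri("AAFXCOKLHHTKDM-UHFFFAOYAA")
    g.add((mol, EX.molecularFormula, Literal("CxHyNzOw")))  # placeholder value
    g.add((mol, EX.molecularWeight, Literal("123.45", datatype=XSD.decimal)))

    print(g.serialize(format="turtle"))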

The demos worked incredibly well and were well received. The URIs worked brilliantly: each time I imported a new RDF dataset from a vendor the data automatically updated, so we could see which chemicals were supplied by which vendors and any overlaps that existed. We could also play around with various OWL restrictions and inferencing techniques to categorise things like building block suppliers, trusted suppliers or various sorts of chemicals based on properties. I also went as far as adding some of our own in-house molecules, logistics information and people information from our HR database.

Starting with a small dataset and a basic model I could create something very powerful and much more advanced than the data itself might suggest, within a matter of minutes. It was like working magic and all without complicated changes to interfaces and database schemas and the use of chemical search engines. Although it is fair to say that a chemical search engine would be needed if substructure searching was intended.


The Web Demos

The demos mentioned above were part of my work within my company. However I want to show you something, so I have reworked a completely new and much smaller dataset to produce some Exhibit demos. This data is completely fictitious so please do not read anything into it. Before I point you at the demos I'll discuss some basic details.

A very simple ontology was created consisting of the classes Company, Catalogue, Entry and Item, with a few object and data properties. This is a very generic ontology that you could fit to any catalogue, not just chemicals. The idea was that a chemical supplier could provide a variety of chemical subsets, e.g. building blocks, screening compounds etc. Each of these subsets would become a "Catalogue" from the "Company". An "Entry" in a catalogue could have many "Items" associated with it; however in this case there is only one. The "Entry" URI was derived from the catalogue ID and the "Item" URI is the chemical URI generated using the InChIKey. Since the "Item" is referenced using the chemical URI, the idea would be that an "Item" would join itself to as many "Catalogues", via an entry, as required. Each "Catalogue" was created in a separate RDF file and contained two basic properties for the molecule: molecular formula and molecular weight.
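
As a concrete illustration, a single catalogue entry under that little ontology might hang together roughly as in the sketch below. The namespaces, property names and values are illustrative rather than the ones in the actual demo files.

    # Illustrative namespaces, properties and values only.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    CAT = Namespace("http://example.org/catalogue#")
    company = URIRef("http://example.org/company/AcmeChem")
    catalogue = URIRef("http://example.org/catalogue/AcmeChem/building-blocks")
    entry = URIRef("http://example.org/entry/AcmeChem/BB-0001")
    item = URIRef("urn:iupac:inchikey#AAFXCOKLHHTKDM-UHFFFAOYAA")  # the chemical URI

    g = Graph()
    g.add((company, RDF.type, CAT.Company))
    g.add((catalogue, RDF.type, CAT.Catalogue))
    g.add((catalogue, CAT.suppliedBy, company))
    g.add((entry, RDF.type, CAT.Entry))
    g.add((entry, CAT.inCatalogue, catalogue))
    g.add((entry, CAT.hasItem, item))
    g.add((item, RDF.type, CAT.Item))
    g.add((item, CAT.molecularWeight, Literal("123.45")))  # placeholder value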

And so we move to the first demo, here all I have done is create an empty .owl file and import each of the RDF files representing the catalogues using TopBraidComposer. For simplicity I then exported the data to a single JSON file, but the individual RDF files work just the same.

In an attempt to spice things up a little and play with Exhibit, I created geographical locations for the companies and displayed them on a map. The demo works on the basis that we are displaying the Items (molecules) in the main forms, and a series of facets allow the items to be filtered based on object and data properties. The map updates to show the location of the suppliers for the filtered molecules. In addition, three views on the molecules were created: a simple list view, a table with chemical properties, and a general thumbnail viewer for the structures.

The second demo attempts to show the ability to enhance the basic dataset with more information. Data enhancement is always a key issue with traditional RDBMS systems, so I wanted to show just how easy and simple it could and should be.

The data could have been anything, it may have been extra information we want to add or something that someone else has published and that we just want to use. The Semantic Web makes this possible.

In the end I just created a new RDF file with some additional chemical properties and some text based structural representations attached to the relevant molecule URI. Simply by importing this new RDF file we have a new range of properties to be used in facets or displayed in views. No tricks, no messing with RDBMS tables and columns, it's just the molecule URI doing its job.
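
A rough sketch of why this "just works": parsing the enhancement file into the same graph simply adds more statements about the same molecule URIs, with no schema changes anywhere. The file names here are illustrative.

    # Illustrative file names; the point is that both files talk about the
    # same molecule URIs, so their statements simply merge.
    from rdflib import Graph, URIRef

    g = Graph()
    g.parse("catalogues.rdf", format="xml")        # the original demo data
    g.parse("extra_properties.rdf", format="xml")  # the new enhancement file

    # Everything now known about one molecule, old and new properties together:
    mol = URIRef("urn:iupac:inchikey#AAFXCOKLHHTKDM-UHFFFAOYAA")
    for prop, value in g.predicate_objects(mol):
        print(prop, value)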

The third and final demo (well, at the moment) attempts to look at how a company might integrate some of its own data with this supplier database. To illustrate this I created a new ontology that had the classes Company, Compound and Assay, with data properties for holding things like company ID and assay result, and object properties to link the instances. All of this data is again fictitious, but you might have guessed that, as I located "MyBiotech" in the Bahamas. Importing this new ontology and then running some simple inferencing, so that a "Compound" is also an "Item" even if it doesn't have a catalogue entry, produced the result you see. The facets have now been mixed from the two ontologies so we can filter on compounds that have certain properties and assay results in a particular assay. I've also added a new table to display the activity results.
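
In the demo TopBraid's inferencing did the work; outside of it, a rough stand-in using a SPARQL UPDATE over assumed class names would look something like this sketch.

    # A stand-in for the demo's inferencing, with assumed file and class names.
    from rdflib import Graph

    g = Graph()
    g.parse("catalogues.rdf", format="xml")
    g.parse("mybiotech.rdf", format="xml")

    g.update("""
    PREFIX cat: <http://example.org/catalogue#>
    PREFIX my:  <http://example.org/mybiotech#>
    INSERT { ?c a cat:Item }
    WHERE  { ?c a my:Compound }
    """)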

Several common questions came up when I first demonstrated this version:

  • "How do I see our companies compounds only"
  • "How do I see the compounds in our company that are not available elsewhere"
  • "How do I see the compounds in our company that are available elsewhere"
  • "How do I see supplier compounds only"

All of these are possible, but I admit that I did have to do some mindset changing to get them to see it. A slightly more detailed ontology might have helped, but we can readily update that to adapt to the needs. A sketch of one such query follows below.
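
As a sketch of the second question ("compounds in our company that are not available elsewhere"), the query below looks for company compounds whose molecule URI has no catalogue entry from any supplier. The class and property names are assumptions, as before.

    # Assumed file, class and property names throughout.
    from rdflib import Graph

    g = Graph()
    g.parse("combined_demo.rdf", format="xml")

    q = """
    PREFIX cat: <http://example.org/catalogue#>
    PREFIX my:  <http://example.org/mybiotech#>
    SELECT ?compound
    WHERE {
      ?compound a my:Compound .
      FILTER NOT EXISTS { ?entry cat:hasItem ?compound }
    }
    """
    for row in g.query(q):
        print(row.compound)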

We've been talking about doing something similar to this for real in our company for many years. It is unlikely to happen with RDBMS systems due to the effort involved, but I think all of this starts to build a case about why we should be adopting the Semantic Web approach within the enterprise.

It remains to be seen if the IUPAC namespace and molecule URI is the right way to go. But I believe that we have to start providing common languages for inter-enterprise data communication. This would be especially useful if the external literature were marked up using a common standard naming convention. How simple it would be to extract and combine new information if this were the case.

This is just my attempt to start the ball rolling in my particular area.

Friday, November 14, 2008

It’s Happened Again

Over the last couple of years I have been fortunate to attend Semantic Technology and Life Science conferences. I really enjoy chatting to people, finding out what they are doing and sharing experiences. I've been particularly interested by several discussions with people who have tried Semantic technology projects and failed. I never really understood why these projects failed, but many of these people put it down to various aspects of Semantic technologies such as immaturity, difficulty with this and that, etc. It wasn't until I spoke to people who had completed a successful semantic project after several failures, and merged that with some of my own experiences, that I started to understand. One individual told me quite bluntly that it wasn't until individuals with semantic web experience were brought into a project that success stories started to happen. It wasn't through lack of anything (enthusiasm, brain power etc.), but it seemed that existing technology teams just couldn't make semantic technology work for them. The result was often that semantic technology was thrown out and disparaged.

It happened again today!

Some time ago a software company presented a project they were about to embark on. It looked great: they had thought about a flexible system and were going to use RDF. Today we had the update and initial demo of the software. It didn't start well for me as the technology slides had no mention of RDF and semantic technologies. Instead they had spent ages creating a standard service-based RDBMS architecture. It got worse when they spoke about how flexible their system was. I started to probe with a few questions. "How many concept types can you deal with?" ... "Currently four, but we are looking to add more," they said... "But of course I can create my own concepts, can't I?" ... "Err, no, that would be through consulting to customise the system." ... "But I can create my own attributes and specify how the concepts interact, can't I?" ... "Err, no, but we are looking to add that later." ... "OK, so why is concept x not a subtype of concept y?" ... "That's because we brought that concept into the project a bit later and it was difficult to add it without a lot of work."

I started to suggest that if semantic technology had been used, the concepts could have been created as ontologies which could change on the fly, as could the relationships and the attributes around them, and we would not be restricted by having to recreate schemas in the relational back end to cope. The explanation came back that when they looked into using semantic technology they had little success, blaming this and that and stating it was immature, lacking x, y and z etc. It became clear that they had not understood the fundamentals of the semantic web and were trying to fit semantic technology into a predefined Java, OO world; they had not really understood that you have to think and design a bit differently. I must say my blood did boil a little and I was a bit rude, which I should not have been; it was unprofessional of me, so I do apologise.

Don't get me wrong, I'm not having a go at anyone; this is just my view on things as I have seen and heard them. But if it is true that we have to get individuals with semantic experience in on these projects for them to become a true success, then how are we going to do it? These companies don't seem to know they need them, saying, "We don't need semantic technology people as the technology isn't up to much. If our guys can't make it work it's no good." There isn't a huge glut of semantic techies out there for them to choose from even if they did realise.

It's a tricky one.