Wednesday, November 19, 2008

Cool URIs for Molecules

Firstly I have to say that the title of this blog is a direct plagiarism of the W3C paper "Cool URIs for the Semantic Web" by Leo Sauermann et al.

Introduction: - The nature of the problem

Part of our role in informatics is to create and maintain databases that hold chemical structure information. These databases allow users to search for chemical structures or sub portions (substructures) of them. In order to achieve this we implement chemical search engines provided by vendors. These chemical search engines provide a means to rapidly filter chemical structures through creation of specific indexes and atom matching functionality. The need to have a chemical search engine is not in doubt; however using it to join information creates a huge technical overhead and inflexibility within the data. Within an organisation there may be many chemical databases each serving a different purpose. E.g. Internal molecules, externally available molecules from vendors including reagents and screening molecules, competitor molecules and related information, all form a wealth of information that could be related to chemical structure.

There are several situations where we would wish to link, merge or extend the capabilities of these chemical databases.

e.g.

  1. Many companies create their own chemical database of available compounds supplied by various companies around the world. To do this we bring together chemical catalogues (supplied in a variety of formats) based on the chemical structure using a chemical search engine. Some software vendors supply a ready correlated database in their particular database format, but these systems are often too restrictive, as they do not allow us to update the system with the information we desire.
  2. When company mergers happen (all too frequently!) there is always a mad scramble to import the contents of one company's chemical database into the others. This can take weeks to complete and yet we may still loose information as the schema is not set up to handle all of the information that the other holds.
  3. We may be generating new sets of data properties that our existing chemicals databases cannot cope with. If the same chemicals are in multiple databases then we have quite a job extending the scope of each schema and updating the appropriate primary id for the data source.

When we bring chemical information together from various sources we need to carry out an "Exact match" search using a chemical search engine. This can be quite (extremely) time consuming and may need to be rerun often if used to create a joining table of dependant systems.

There is a growing need within the industry to merge more and more information. Take for example the plethora of information being generated in the external literature each day. Wouldn't it be nice to merge all the latest information gleaned about a molecule from the literature to enhance our total knowledge about it? The problem here is that the literature is not marked up well enough (yet), so we have to resort to image extraction and conversion to a chemical structure that can be submitted to a chemical search engine on one of our databases. This is a lot of work!

Wouldn't it be great if we had an identifier that simply and uniquely identified each and every chemical structure?

Other than the chemical structure itself, there isn't really anyway to tell if two structures are the same. Yes, there are standards for naming molecules but depending on which software you use the name can be composed differently which brings in ambiguity. There are also textual representations of chemical structure such as "smiles", mol file (CTAB), INCHI and a variety of other formats from different vendors, however each can be formed differently bringing in the ambiguity factor once again. One might say that within a company we might be able to produce a single identifier, but this could (and probably would) be different from company to company.

How can we produce a single unique identifier for each molecule that anyone can calculate and use?


Background

This work started out over a year ago when I was looking for a project that would demonstrate the value of the semantic web, or more precisely linked data.

I was attending the W3C
RDF from RDB meeting in Boston. I was still a semantic dummy at that time (perhaps I still am, but that's a different topic) and I was still trying to bring it all together in my head. I found the meeting of great value and well worth attending, especially as afterwards I found myself meeting and heading off down the pub with Eric Neumann and a few others. Eric and I got chatting about various things including the use of immutable identifiers in URIs. Eric chatted about his headache of how to achieve this for proteins, and then went on to describe some thoughts about what could be used for identifiers in the chemical world. Eric proposed the use of the INCHI string for the identifier and mentioned how cool it would be to have a URI that not only encoded chemical structure, but one that chemical properties could also be calculated from. As a chemist turned informatics person I had more than a passing interest in this idea and I wondered if this was the angle I had been looking for to create my internal demo. I had a few doubts about the use of the INCHI string as it also suffers from ambiguity issues. However, I could see that the recently introduced (at that time) InChiKey could provide the answer. More importantly it was produced and distributed free by IUPAC, a respected organisation that is responsible for many standards in the chemical sciences.


The Project

I decided to create a demo that would illustrate the combination of chemical entities from a variety of external suppliers without the need for a chemical search engine. Gathering information from the external suppliers was an easy task as there are a large number of suppliers out there all supplying the data in the well known SD file text format. I needed to generate the InChiKey for each molecule using the IUPAC software, combine this with a namespace to create the URI and assign all the information in the SD file to that URI. Then all I had to do was create an RDF file for each of the suppliers and the URI's should do the rest of the work for me. In addition to the IUPAC software to generate the InChiKeys, I also used Pipeline Pilot for some text processing and TopBraidComposer as the environment to bring it all together. The final icing on the cake was the use of TopBraidLive to demonstrate a model driven application that can change with the data.

To give a little more details to the construction of the URI; an InChiKey looks something like AAFXCOKLHHTKDM-UHFFFAOYAA. If you click on the link it will show a picture of the molecule it represents. As for the namespace, I wanted to use something that everyone could adopt. I felt that if I could choose something that perhaps IUPAC could ratify then it would have a good chance of adoption. In the end I went for a reserved namespace of "urn:iupac:inchikey" so the above molecules full URI would be

<urn:iupac:inchikey#AAFXCOKLHHTKDM-UHFFFAOYAA>

I think that's quite a cool URI for a molecule, but I have no idea if others would agree or indeed if IUPAC would be prepared to get involved in the Semantic Web and reserve the namespace for everyone's use.

The demos worked incredibly well and were well received. The URI's worked brilliantly, each time I imported a new RDF dataset from a vendor the data automatically updated so we could see which chemicals were supplied by which vendors and any overlaps that existed. We could also play around with various OWL restrictions and inferencing techniques to categorise things like building block suppliers, trusted suppliers or various sorts of chemicals based on properties. I also went as far as adding some of our own in-house molecules, logistics information and people information from our HR database.

Starting with a small dataset and a basic model I could create something very powerful and much more advanced than the data itself might suggest, within a matter of minutes. It was like working magic and all without complicated changes to interfaces and database schemas and the use of chemical search engines. Although it is fair to say that a chemical search engine would be needed if substructure searching was intended.


The Web Demos

The demos mentioned above were part of my work within my company. However I want to show you something, so I have reworked a completely new and much smaller dataset to produce some Exhibit demos. This data is completely fictitious so please do not read anything into it. Before I point you at the demos I'll discuss some basic details.

A very simple ontology was created consisting of the classes, Company, Catalogue, Entry and Item with a few object and data properties. This is a very generic ontology that you could fit to any catalogue not just chemicals. The idea was that a chemical supplier could provide a variety of chemical subsets. E.g. Building Blocks, Screening Compounds etc. Each of these subsets would become a "Catalogue" from the "Company". An "Entry" in a catalogue could have many "Items" associated with it, however in this case there is only one. The "Entry" URI was derived from the catalogue ID and the "Item" URI is the chemical URI generated using the InChiKey. Since the "Item" is referenced using the chemical URI the idea would be that an "Item" would join itself to as many "Catalogues", via an entry, as required. Each "Catalogue" was created in a separate RDF file and contained two basic properties for the molecule, molecular formula and molecular weight.

And so we move to the first demo, here all I have done is create an empty .owl file and import each of the RDF files representing the catalogues using TopBraidComposer. For simplicity I then exported the data to a single JSON file, but the individual RDF files work just the same.

In an attempt to spice things up a little and play with Exhibit, I created geographical locations for the companies and displayed them on a map. The demo works on the basis that we are displaying the Items (molecules) in the main forms and a series of facets allow the items to be filtered based on object and data properties. The map updates to show the location of the suppliers for the filtered molecules. In addition three views on the molecules were created; a simple list view, a table with chemical properties, and a general thumbnail viewer for the structures.

The second demo attempts to show the ability to enhance the basic dataset with more information. Data enhancement is always a key issue with traditional RDMS systems so I wanted to show just how easy and simple it could and should be.

The data could have been anything, it may have been extra information we want to add or something that someone else has published and that we just want to use. The Semantic Web makes this possible.

In the end I just created a new RDF file with some additional chemical properties and some text based structural representations attached to the relevant molecule URI. Simply by importing this new RDF file we have a new range of properties to be used in facets or displayed in views. No tricks, no messing with RDBMS tables and columns, it's just the molecule URI doing its job.

The third and final demo (well at the moment) attempts to look at how a company might integrate some of its own data with this supplier database. To illustrate this I created a new ontology that had the classes, Company, Compound, Assay with data properties for holding things like company ID, assay result and object properties that would link the instances. All of this data is again fictitious, but you might have guessed that, as I located "MyBiotech" in the Bahamas. Importing this new ontology and then running some simple inferencing, so that a "Compound" was also an "Item" even if it didn't have a catalogue entry, produced the result you see. The facets have now been mixed from the two ontologies so we can filter on compounds that have certain properties and assay results in a particular assay. I've also added a new table to display the activity results.

Several common questions that came up when I first demonstrated this version were;

  • "How do I see our companies compounds only"
  • "How do I see the compounds in our company that are not available elsewhere"
  • "How do I see the compounds in our company that are available elsewhere"
  • "How do I see supplier compounds only"

All of these are possible, but I admit that I did have to do some mindset changing to get them to see it. A slightly more detailed ontology might have helped. But we can readily update that to adapt to the needs.

We've been talking about doing something similar to this for real in our company for many years. It is unlikely to happen with RDBMS systems due to the effort involved, but I think all of this starts to build a case about why we should be adopting the Semantic Web approach within the enterprise.

It remains to been seen if the IUPAC namespace and molecule URI is the right way to go. But I believe that we have to start providing common languages for inter enterprise data communication. This would be especially useful if the external literature were marked up using a common standard naming convention. How simple it would be to extract and combine new information if this were the case.

This is just my attempt to start the ball rolling in my particular area.

2 comments:

L said...

This is terrific stuff, Phil.

Based on your experience here, what are the key differentiators for RDF and friends? I see a few things highlighted in what you've done:

1) URIs as identifiers. (Though the real key seems to be the deterministic nature of Inchikeys?)

2) Flexible graph model for adding arbitrary properties to a unique identifier.

3) Inferencing for classification.

Is that a fair characterization?

Lee

Phil Ashworth said...

Hi Lee. I'm sorry for the tardy reply. There are so many differentiators for me, many of which are not highlighted in the blog, but in essence you are spot on. It is very difficult for many people to see the difference between RDMS tables and ontologies, but they are so far apart (to me). To me the RDF data is self describing (see Dean Allemang's blog, which I agree with). It is also not fixed in it's design, so it allows you splice and dice the data how you want even if a relationship doesn't currently exist. So that leads to inferencing or just the ability to add new relationships through sparql constructs etc. I'm a great fan of these. I view the ontology as the face that you want to put on the data allowing you to make it what you want but also mixing it with a face that someone else has defined as well. That's cool. I partly mentioned this in example 3 when I merged the company data.
I won't let this turn into a blog in it's own right, but one of the mistakes I made in the beginning was to look at the technology stack around all of this and assumed that it was the top (OWL) that I needed to concentrate on. This was a mistake drawn from working with other technologies. This made life tough in understanding what the semantic web was really about, but once I got passed this and realised it was the rdf at the bottom of stack that was most important for my work and how key the URI was to this, life became really simply. (well simpler). I've seen other people struggle like this, but i hope I've been able to pass on my experiences so they haven't had to suffer as I did. The key for me to understanding the complexities of the semantic web was to understand and embrace it's simplicity. Does that make sense?
Cheers
Phil