Wednesday, November 19, 2008

On Track.... Quite Literally!

I've been riding motorcycles for 27 years, and during the course of those years I've watched the bike racing on TV and said to myself "I'll do that one day". I had planned to do it before I was 40, but somehow that just seemed to pass me by. Even though the thought came increasingly to the forefront of my mind over the last couple of years, I did nothing about it. I guess this was partly due to the stories I had heard of complete bedlam, with novices and full-blown racers being on the track at the same time, and partly because I couldn't bear the thought of my bike and myself in the kitty litter.

I don't really know what happened, but one day I found myself looking at the IAM website and reading a write-up on a "Rider Skills" course held at the Mallory Park race track... before I knew it I was signed up in a novice group for the afternoon event on 9th October.

As the weeks passed and the date approached, I became increasingly nervous. What had I done? What had I let myself in for? The appalling weather in late September and early October didn't help with these thoughts either. The day before the event I left work trembling, even though the forecast was for an exceptionally nice sunny day and high temperatures. The day arrived, and I had planned to set off around 9.30am for a nice leisurely trip from Reading to Mallory, aiming to get there in time to watch the later part of the morning session. Well, that was the plan. I didn't get into the garage until 9.50, as I seemed to spend ages fumbling around getting everything ready, and then the bike refused to start! It had never done this before! Was it trying to tell me something? Had she refused at the first hurdle? Eventually the bike decided to fire up and we were off. My riding was terrible... I could not seem to concentrate, but as the miles rolled by my nerves seemed to relax somewhat.

Arriving at Mallory, I joined a group of IAM riders that was growing as time passed. There was just every kind of bike imaginable, including a massive BMW K1200LT. I found out later that a Goldwing had taken part in the morning session! I didn't have time to check out the track, so instead I got chatting to the others waiting in the car park whilst I ate my sandwiches. As it turned out, there were quite a few first-timers, all of whom seemed as apprehensive as me. The sound from the track was impressive as bikes roared around having fun; I really should have gone and taken a peek, as you can't see the track from the car park.

Then 12.30pm arrived, the sound from the track ceased, the main gates opened to allow access to the centre of the track and I was in a line of bikes heading for the registration office. The first thing was to sign up and get allocated to an instructor, then grab a cup of tea and wait in the briefing room for the pre-session talk. Here we met the organisers Roy Aston and John Lickley who, along with others, gave us our welcome and pre-session briefing. This consisted of learning how things worked, the layout of the circuit, consideration for others, what the flags meant, what not to do (it was not a race and there were no talent scouts watching), and also something that I can only describe as the "fandango" manoeuvre that was used to swap places with the instructor and fellow group members down the start/finish straight. This manoeuvre was necessary to have your turn at the front of the group.

Briefing finished, it was on with the bikes and down to the pit area, where we met our instructors and were given coloured and numbered bibs. Our group turned out to be quite small, with only three pupils (me (Phil), Paul and Phil) and our instructor Paul Jones. Paul was an amazing guy who rides bikes to a very high standard as a day job and has been a track instructor on skills days for a couple of decades. He was also an extremely fun guy who lived for bikes. Paul explained that he had a syllabus to teach us and we could only progress through the stages if he was happy that we had all reached the required level for each.

Our first session was simple: one lap with the instructor leading, execute a fandango down the straight and continue until everyone had had one lap at the front, at which point he would lead us back into the pits. This was all to be carried out at around 60mph, but don't look at the speedo! It sounded simple, but as we were the last group to exit the pits the track was quite full, and checking that you didn't get in anyone's way on the first corner was a little nerve-wracking. A couple of laps into this session it hit me: "Oh my God, I'm on a track!" I whooped as I took off my helmet back in the pits, much to the amusement of a couple of people. "I've done a lap" I said. "Err... four, actually," Paul replied. Paul explained that the first session wasn't much more than a look at the track and a huge ice breaker for our nerves. It really worked.

The next session used to be "no braking laps" but had been replaced with "single gear laps". Paul wanted us to use 3rd gear, with 2nd allowed at the hairpin and the bus stop, but to use the brakes as little as possible, and each person would have two laps at the front. He was looking to ensure that we had the perception skills to judge throttle response appropriately for the hazards. Out on the track things felt good, the Fazer having bags of engine braking, and there were no dramas for any of the group. Our pace hadn't been any higher than the first session's, perhaps even slower in some sections with no brakes.

Back in the pits, the brief for the third session was any gear allowed, but hard braking. He wanted to see a smooth transition from throttle (accelerating) to brakes with no coasting. Braking had to be completed with the bike upright, before turning in and without upsetting the bike's stability, i.e. smooth application of the brakes, hard braking with a big squeeze, then a gentle release so that the bike did not dive and buck. Each person now had three laps at the front, but he also wanted us to up the pace somewhat to ensure the desired level of braking was achieved. This would also allow us to experiment with braking distances, which he hoped would get shorter and shorter with each lap. Just for good measure, Paul said that as I was last in the line I should be an expert by the time I was leading the group. No pressure then! As it turned out, the following laps became rather processional following other groups, which meant we didn't really achieve the pace Paul was looking for. That was until halfway through my first lap, when the entire field of riders in front peeled off into the pits, leaving me with an open track. It looked so enticing and scary at the same time. Up to the hairpin and I braked way too early, coasting the final few metres. Next time around, with increased pace and braking later for all the corners, things really started to make sense. It's amazing the feel you get from the brakes and how controlled you can be.

Back in the pits, Paul seemed pleased, so on with the next session, which was about gear selection for various parts of the track. We now had four laps at the front, but we no longer needed to stick to our line formation once complete. Paul had one last piece of advice before going out: "Overtake when you get the chance". Out on the track, the advice about short shifting for various sections made sense and seemed to be working. I was starting to use the throttle more effectively through the Esses, giving more drive down the straight, where we were now passing other riders, much to our jubilation. It was my turn in front and I had been making steady progress, with more throttle and brakes being used as everything started to fall into place. However, on one lap I fixed my sights on passing a rider some distance ahead on the straight. I gave it more throttle than I had before and was approaching him fast. At some point I realised that I was not going to get past and brake down to my usual entry speed before needing to turn into Gerrards. After a very brief moment of mental debate I decided to accept my higher entry speed (still a lot slower than some) and attempt an overtake through the corner. What a feeling it was as I rode around the outside and accelerated out of the corner. I'm not sure if they heard my "Yahoo!" in the pits. The only problem was that in the excitement I suddenly realised I had missed my braking point for the next chicane, Edwina's. On with the brakes, and I squeezed as never before. Gentle release, turn in, and I was through. Phew! Back in the pits, Paul commented that it was nice to see the increase in pace and the braking point moving so far on. If only he knew!

The next and final session was "out on your own". Paul would be roaming around keeping an eye on us, but we could do our own thing. We still had 45 minutes of track time left, but before we did anything he made sure we took on plenty of fluids. I went out with 30 minutes to go, and 15 minutes in I was pretty tired and starting to make mistakes. I eased off a bit and just enjoyed the track and the experience for the rest of the session, until the chequered flag came out. Paul seemed happy with us, stating that we had doubled our speed and halved our braking distances.

Finally there was a pit lane debrief with everyone, just to calm things down, before heading home.

I was a lot slower than many people around the track, in fact the slowest in my group, but I had experienced my bike and myself in a whole new light. What a day it had been, and one I would recommend to everyone.

Thanks to John, Roy, Paul and everyone else who made the day possible, it was greatly appreciated by everyone.

Photos of the event can be seen at http://www.photoboxgallery.com/roberthands, or my personal pics can be viewed at http://www.flickr.com/photos/pashworth/sets/72157607960474703/. I didn't say I was Valentino Rossi, so don't expect too much.

Cool URIs for Molecules

Firstly I have to say that the title of this post is a direct plagiarism of the W3C paper "Cool URIs for the Semantic Web" by Leo Sauermann et al.

Introduction: The Nature of the Problem

Part of our role in informatics is to create and maintain databases that hold chemical structure information. These databases allow users to search for chemical structures or sub-portions (substructures) of them. To achieve this we implement chemical search engines provided by vendors. These chemical search engines provide a means to rapidly filter chemical structures through the creation of specific indexes and atom-matching functionality. The need for a chemical search engine is not in doubt; however, using it to join information creates a huge technical overhead and inflexibility within the data. Within an organisation there may be many chemical databases, each serving a different purpose: internal molecules; externally available molecules from vendors, including reagents and screening compounds; and competitor molecules and related information. All of these form a wealth of information that could be related to chemical structure.

There are several situations where we would wish to link, merge or extend the capabilities of these chemical databases, for example:

  1. Many companies create their own chemical database of available compounds supplied by various companies around the world. To do this we bring together chemical catalogues (supplied in a variety of formats) based on the chemical structure, using a chemical search engine. Some software vendors supply a ready-correlated database in their particular database format, but these systems are often too restrictive, as they do not allow us to update the system with the information we desire.
  2. When company mergers happen (all too frequently!) there is always a mad scramble to import the contents of one company's chemical database into the other's. This can take weeks to complete, and yet we may still lose information, as the schema is not set up to handle all of the information that the other holds.
  3. We may be generating new sets of data properties that our existing chemical databases cannot cope with. If the same chemicals are in multiple databases then we have quite a job extending the scope of each schema and updating the appropriate primary ID for each data source.

When we bring chemical information together from various sources we need to carry out an "exact match" search using a chemical search engine. This can be quite (extremely) time-consuming and may need to be rerun often if used to create a joining table between dependent systems.

There is a growing need within the industry to merge more and more information. Take for example the plethora of information being generated in the external literature each day. Wouldn't it be nice to merge all the latest information gleaned about a molecule from the literature to enhance our total knowledge about it? The problem here is that the literature is not marked up well enough (yet), so we have to resort to image extraction and conversion to a chemical structure that can be submitted to a chemical search engine on one of our databases. This is a lot of work!

Wouldn't it be great if we had an identifier that simply and uniquely identified each and every chemical structure?

Other than the chemical structure itself, there isn't really any way to tell if two structures are the same. Yes, there are standards for naming molecules, but depending on which software you use the name can be composed differently, which brings in ambiguity. There are also textual representations of chemical structure, such as SMILES, the MOL file (CTAB), InChI and a variety of other formats from different vendors; however, each can be formed differently, bringing in the ambiguity factor once again. Within a single company we might be able to produce a single identifier, but this could (and probably would) be different from company to company.
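
To make that ambiguity concrete, here is a minimal sketch using the open-source RDKit toolkit, purely my choice for illustration (it plays no part in the work described here): two different SMILES strings describe the same molecule, and while a single toolkit canonicalises them consistently, another vendor's software may emit a different, equally valid, canonical string.

    # Minimal sketch of the SMILES ambiguity problem (RDKit for illustration).
    from rdkit import Chem

    # Two different SMILES strings that both describe ethanol.
    for smi in ["OCC", "C(C)O"]:
        mol = Chem.MolFromSmiles(smi)
        # Within one toolkit, canonicalisation maps both to a single string...
        print(smi, "->", Chem.MolToSmiles(mol))

    # ...but the canonical form is toolkit-specific: another vendor's software
    # may produce a different (equally valid) canonical SMILES for the same
    # molecule, so the string alone cannot serve as a universal identifier.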

How can we produce a single unique identifier for each molecule that anyone can calculate and use?


Background

This work started out over a year ago when I was looking for a project that would demonstrate the value of the semantic web, or more precisely linked data.

I was attending the W3C RDF from RDB meeting in Boston. I was still a semantic dummy at that time (perhaps I still am, but that's a different topic) and I was still trying to bring it all together in my head. I found the meeting of great value and well worth attending, especially as afterwards I found myself meeting and heading off down the pub with Eric Neumann and a few others. Eric and I got chatting about various things, including the use of immutable identifiers in URIs. Eric talked about his headache of how to achieve this for proteins, and then went on to describe some thoughts about what could be used for identifiers in the chemical world. Eric proposed the use of the InChI string for the identifier and mentioned how cool it would be to have a URI that not only encoded chemical structure, but one from which chemical properties could also be calculated. As a chemist turned informatics person I had more than a passing interest in this idea, and I wondered if this was the angle I had been looking for to create my internal demo. I had a few doubts about the use of the InChI string, as it also suffers from ambiguity issues. However, I could see that the recently introduced (at that time) InChIKey could provide the answer. More importantly, it was produced and distributed free by IUPAC, a respected organisation that is responsible for many standards in the chemical sciences.


The Project

I decided to create a demo that would illustrate the combination of chemical entities from a variety of external suppliers without the need for a chemical search engine. Gathering information from the external suppliers was an easy task, as there are a large number of suppliers out there, all supplying their data in the well-known SD file text format. I needed to generate the InChIKey for each molecule using the IUPAC software, combine this with a namespace to create the URI, and assign all the information in the SD file to that URI. Then all I had to do was create an RDF file for each of the suppliers and the URIs should do the rest of the work for me. In addition to the IUPAC software to generate the InChIKeys, I also used Pipeline Pilot for some text processing and TopBraid Composer as the environment to bring it all together. The final icing on the cake was the use of TopBraid Live to demonstrate a model-driven application that can change with the data.
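
As a rough sketch of that pipeline (using RDKit in place of the IUPAC binary and rdflib in place of the TopBraid tooling; the file name and vendor namespace are hypothetical):

    # Sketch only: read a supplier SD file, mint a molecule URI from the
    # InChIKey, and attach every SD data field to that URI as RDF.
    from rdkit import Chem
    from rdflib import Graph, Literal, Namespace, URIRef

    INCHIKEY_NS = "urn:iupac:inchikey#"                  # proposed namespace
    VENDOR = Namespace("http://example.org/vendor-a/")   # hypothetical terms

    g = Graph()
    for mol in Chem.SDMolSupplier("vendor_a_catalogue.sdf"):
        if mol is None:                  # skip records RDKit cannot parse
            continue
        uri = URIRef(INCHIKEY_NS + Chem.MolToInchiKey(mol))
        for tag in mol.GetPropNames():   # copy the SD data fields across
            g.add((uri, VENDOR[tag], Literal(mol.GetProp(tag))))

    g.serialize("vendor_a.rdf", format="xml")   # one RDF file per supplier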

To give a little more detail on the construction of the URI: an InChIKey looks something like AAFXCOKLHHTKDM-UHFFFAOYAA. If you click on the link it will show a picture of the molecule it represents. As for the namespace, I wanted to use something that everyone could adopt. I felt that if I could choose something that perhaps IUPAC could ratify then it would have a good chance of adoption. In the end I went for a reserved namespace of "urn:iupac:inchikey", so the above molecule's full URI would be

<urn:iupac:inchikey#AAFXCOKLHHTKDM-UHFFFAOYAA>

I think that's quite a cool URI for a molecule, but I have no idea if others would agree or indeed if IUPAC would be prepared to get involved in the Semantic Web and reserve the namespace for everyone's use.

The demos worked incredibly well and were well received. The URIs worked brilliantly: each time I imported a new RDF dataset from a vendor the data automatically updated, so we could see which chemicals were supplied by which vendors and any overlaps that existed. We could also play around with various OWL restrictions and inferencing techniques to categorise things like building-block suppliers, trusted suppliers or various sorts of chemicals based on their properties. I also went as far as adding some of our own in-house molecules, logistics information and people information from our HR database.
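
To show the shape of that merging step, here is a sketch with rdflib; the file names and the suppliedBy predicate are invented, standing in for whatever each vendor's vocabulary actually provides.

    # Sketch: merging vendor datasets is just loading them, because all
    # suppliers key their data on the same molecule URI.
    from rdflib import Graph

    g = Graph()
    for dataset in ["vendor_a.rdf", "vendor_b.rdf", "vendor_c.rdf"]:
        g.parse(dataset)   # triples about the same URI simply accumulate

    # Which molecules are offered by more than one supplier?
    q = """
        PREFIX v: <http://example.org/vendor/>
        SELECT ?mol (COUNT(DISTINCT ?vendor) AS ?n)
        WHERE { ?mol v:suppliedBy ?vendor }
        GROUP BY ?mol
        HAVING (COUNT(DISTINCT ?vendor) > 1)
    """
    for row in g.query(q):
        print(row.mol, row.n)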

Starting with a small dataset and a basic model, I could create something very powerful and much more advanced than the data itself might suggest, within a matter of minutes. It was like working magic, all without complicated changes to interfaces and database schemas and without the use of chemical search engines. Although it is fair to say that a chemical search engine would still be needed if substructure searching was intended.


The Web Demos

The demos mentioned above were part of my work within my company. However, I want to show you something, so I have put together a completely new and much smaller dataset to produce some Exhibit demos. This data is completely fictitious, so please do not read anything into it. Before I point you at the demos, I'll discuss some basic details.

A very simple ontology was created consisting of the classes Company, Catalogue, Entry and Item, with a few object and data properties. This is a very generic ontology that you could fit to any catalogue, not just chemicals. The idea was that a chemical supplier could provide a variety of chemical subsets, e.g. building blocks, screening compounds, etc. Each of these subsets would become a "Catalogue" from the "Company". An "Entry" in a catalogue could have many "Items" associated with it; however, in this case there is only one. The "Entry" URI was derived from the catalogue ID and the "Item" URI is the chemical URI generated using the InChIKey. Since the "Item" is referenced using the chemical URI, the idea is that an "Item" would join itself to as many "Catalogues", via an entry, as required. Each "Catalogue" was created in a separate RDF file and contained two basic properties for the molecule: molecular formula and molecular weight.
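
For concreteness, this is roughly what one catalogue entry might look like when built with rdflib; the namespace, property names and property values are assumptions based only on the class names above.

    # Sketch: one Entry in one Catalogue, pointing at a molecule Item
    # whose URI is derived from its InChIKey.
    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    CAT = Namespace("http://example.org/catalogue#")   # hypothetical ontology
    company   = CAT["AcmeChem"]
    catalogue = CAT["AcmeChem-BuildingBlocks"]
    entry     = CAT["AcmeChem-BuildingBlocks-BB0001"]  # from the catalogue ID
    item      = URIRef("urn:iupac:inchikey#AAFXCOKLHHTKDM-UHFFFAOYAA")

    g = Graph()
    g.add((catalogue, RDF.type, CAT.Catalogue))
    g.add((catalogue, CAT.providedBy, company))
    g.add((entry, RDF.type, CAT.Entry))
    g.add((entry, CAT.inCatalogue, catalogue))
    g.add((entry, CAT.hasItem, item))
    g.add((item, RDF.type, CAT.Item))
    g.add((item, CAT.molecularFormula, Literal("C2H6O")))   # illustrative
    g.add((item, CAT.molecularWeight, Literal(46.07)))      # values only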

And so we move to the first demo. Here all I have done is create an empty .owl file and import each of the RDF files representing the catalogues using TopBraid Composer. For simplicity I then exported the data to a single JSON file, but the individual RDF files work just the same.

In an attempt to spice things up a little and play with Exhibit, I created geographical locations for the companies and displayed them on a map. The demo works on the basis that we are displaying the Items (molecules) in the main forms, and a series of facets allows the items to be filtered based on object and data properties. The map updates to show the location of the suppliers for the filtered molecules. In addition, three views on the molecules were created: a simple list view, a table with chemical properties, and a general thumbnail viewer for the structures.

The second demo attempts to show the ability to enhance the basic dataset with more information. Data enhancement is always a key issue with traditional RDBMS systems, so I wanted to show just how easy and simple it could and should be.

The data could have been anything: extra information we wanted to add, or something that someone else has published and that we just want to use. The Semantic Web makes this possible.

In the end I just created a new RDF file with some additional chemical properties and some text-based structural representations attached to the relevant molecule URI. Simply by importing this new RDF file we have a new range of properties to be used in facets or displayed in views. No tricks, no messing with RDBMS tables and columns, it's just the molecule URI doing its job.
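
A sketch of that enhancement step, again with rdflib; the file name and the property names and values in the inline Turtle are fictitious:

    # Sketch: new properties attach to the existing molecule URI, and
    # importing the file is the entire "schema change".
    from rdflib import Graph

    enhancement = """
    @prefix chem: <http://example.org/chemprops#> .
    <urn:iupac:inchikey#AAFXCOKLHHTKDM-UHFFFAOYAA>
        chem:logP   "1.4" ;
        chem:smiles "CCO" .
    """

    g = Graph()
    g.parse("catalogues.rdf")                   # the existing merged data
    g.parse(data=enhancement, format="turtle")  # new facets just appear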

The third and final demo (well, at the moment) looks at how a company might integrate some of its own data with this supplier database. To illustrate this I created a new ontology with the classes Company, Compound and Assay, with data properties for holding things like company ID and assay result, and object properties to link the instances. All of this data is again fictitious, but you might have guessed that, as I located "MyBiotech" in the Bahamas. Importing this new ontology and then running some simple inferencing, so that a "Compound" is also an "Item" even if it doesn't have a catalogue entry, produced the result you see. The facets are now mixed from the two ontologies, so we can filter on compounds that have certain properties and assay results in a particular assay. I've also added a new table to display the activity results.
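
That inferencing step might look like this, using rdflib with the owlrl reasoner as a stand-in for TopBraid's inferencing; the file names and namespaces are hypothetical.

    # Sketch: one subClassOf triple plus an RDFS closure makes every
    # in-house Compound show up as an Item, catalogue entry or not.
    import owlrl
    from rdflib import Graph, Namespace, RDFS

    CAT = Namespace("http://example.org/catalogue#")
    MY  = Namespace("http://example.org/mybiotech#")

    g = Graph()
    g.parse("catalogues.rdf")    # supplier data
    g.parse("mybiotech.rdf")     # in-house compounds and assay results

    g.add((MY.Compound, RDFS.subClassOf, CAT.Item))
    owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)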

Several common questions came up when I first demonstrated this version:

  • "How do I see our companies compounds only"
  • "How do I see the compounds in our company that are not available elsewhere"
  • "How do I see the compounds in our company that are available elsewhere"
  • "How do I see supplier compounds only"

All of these are possible, but I admit that I did have to do some mindset changing to get them to see it. A slightly more detailed ontology might have helped, but we can readily update that to adapt to the needs. One of the queries is sketched below.
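
For example, the second question ("not available elsewhere") reduces to a single filter in SPARQL, sketched here with the same hypothetical vocabulary as before:

    # Sketch: in-house compounds that no supplier catalogue entry points at.
    from rdflib import Graph

    g = Graph()
    g.parse("combined.rdf")   # hypothetical merged dataset from the demos

    q = """
        PREFIX cat: <http://example.org/catalogue#>
        PREFIX my:  <http://example.org/mybiotech#>
        SELECT ?mol
        WHERE {
            ?mol a my:Compound .
            FILTER NOT EXISTS { ?entry cat:hasItem ?mol }
        }
    """
    for row in g.query(q):
        print(row.mol)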

We've been talking about doing something similar to this for real in our company for many years. It is unlikely to happen with RDBMS systems due to the effort involved, but I think all of this starts to build a case for why we should be adopting the Semantic Web approach within the enterprise.

It remains to be seen whether the IUPAC namespace and molecule URI are the right way to go, but I believe that we have to start providing common languages for inter-enterprise data communication. This would be especially useful if the external literature were marked up using a common standard naming convention; how simple it would be to extract and combine new information if that were the case.

This is just my attempt to start the ball rolling in my particular area.

Friday, November 14, 2008

It’s Happened Again

Over the last couple of years I have been fortunate to attend semantic technology and life science conferences. I really enjoy chatting to people, finding out what they are doing and sharing experiences. I've been particularly interested by several discussions with people who have tried semantic technology projects and failed. I never really understood why these projects failed, but many of these people put it down to various aspects of semantic technologies: immaturity, difficulty with this and that, etc. It wasn't until I spoke to people who had completed a successful semantic project after several failures, and merged that with some of my own experiences, that I started to understand. One individual told me quite bluntly that it wasn't until individuals with Semantic Web experience were brought into a project that the success stories started to happen. It wasn't through lack of enthusiasm or brain power, but it seemed that existing technology teams just couldn't make semantic technology work for them. The result was often that semantic technology was thrown out and disparaged.

It happened again today!

Some time ago a software company presented a project they were about to embark on. It looked great: they had thought about a flexible system and were going to use RDF. Today we had the update and initial demo of the software. It didn't start well for me, as the technology slides had no mention of RDF or semantic technologies. Instead they had spent ages creating a standard service-based RDBMS architecture. It got worse when they spoke about how flexible their system was. I started to probe with a few questions. "How many concept types can you deal with?" ... "Currently four, but we are looking to add more," they said. "But of course I can create my own concepts, can't I?" ... "Err, no, that would be through consulting to customise the system." ... "But I can create my own attributes and specify how the concepts interact, can't I?" ... "Err, no, but we are looking to add that later." ... "OK, so why is concept x not a subtype of concept y?" ... "That's because we brought that concept into the project a bit later and it was difficult to add it without a lot of work."

I started to suggest that if semantic technology had been used, the concepts could have been created as ontologies, which could change on the fly, as could the relationships and attributes around them, and we would not have been restricted by having to recreate schemas in the relational back end to cope. The explanation came back that when they looked into using semantic technology they had little success, blaming this and that and stating that it was immature, lacking x, y and z, etc. It became clear that they had not understood the fundamentals of the Semantic Web and were trying to fit semantic technology into a predefined Java, OO world, without really understanding that you have to think and design a bit differently. I must say my blood did boil a little and I was a bit rude, which I should not have been; it was unprofessional of me, so I do apologise.
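
To illustrate the point I was trying to make (a toy sketch with invented names, not their system): in RDF a brand-new concept type is just more data, not a schema migration.

    # Sketch: adding a new concept type at runtime with rdflib.
    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/app#")
    g = Graph()

    # The extra concept the vendor couldn't add without consulting work:
    g.add((EX.ClinicalCandidate, RDF.type, RDFS.Class))
    g.add((EX.ClinicalCandidate, RDFS.subClassOf, EX.Compound))
    g.add((EX.cmpd42, RDF.type, EX.ClinicalCandidate))
    g.add((EX.cmpd42, EX.phase, Literal("II")))
    # No tables altered, no redeploy: a subclass-aware query over
    # EX.Compound now sees cmpd42 as well.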

Don't get me wrong, I'm not having a go at anyone; this is just my view on things as I have seen and heard them. But if it is true that we have to get individuals with semantic experience in on these projects for them to become a true success, then how are we going to do it? These companies don't seem to know they need them, saying "we don't need semantic technology people as the technology isn't up to much. If our guys can't make it work, it's no good". And there isn't a huge glut of semantic techies out there for them to choose from, even if they did realise.

It's a tricky one.