Chemical Bonding InChI by InChI
By Daniel Pollock
March 30, 2009
Important Details: The International Chemical Identifier (InChI, pronounced “INchee”) was developed several years ago by chemistry’s governing body, the International Union of Pure and Applied Chemistry (IUPAC), together with the US government’s National Institute of Standards and Technology (NIST). InChIs are generated by a free computer algorithm which analyses a chemical structure to provide a unique, machine-readable identifier - much like an ISBN for chemical substances. The InChI traces its roots back to a paper in 2003, and its aim is to serve as a single public, industry-standard format for identifying chemical structures and so form a basis for sharing chemical information on the web. The first formal release was in August 2006.
With over 40 million chemical substances currently known to exist, and with well over 100 indexes available, organising, searching and cross-referencing substances is a huge challenge. The InChI has seen slow take-up and, in an effort to raise enthusiasm for its use, the Royal Society of Chemistry (RSC) worked with ChemZoo (the software team at open-access chemistry search and aggregation engine ChemSpider) to develop a free “InChI resolver”. Launched at the Spring American Chemical Society (ACS) meeting, this can turn an InChI into a shorter, search-engine friendly “InChI key” (a 25-letter code also developed by IUPAC and NIST). US Government agencies (such as the National Institutes of Health, NIST, The Cancer Institute, and the FDA) plan to tag entries with InChIs.
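To make the “25-letter code” concrete, here is a minimal sketch in Python of what an InChI key looks like to a machine. It is not the resolver's own code. It assumes the standard key layout of three hyphen-separated blocks of 14, 10 and 1 upper-case letters, which is not spelled out above, and it uses the key for ethanol purely as an illustrative value. The function names are our own. It checks only the shape of a key and does not compute one from a structure.

```python
import re

# Assumed layout of a standard InChI key: 14 letters, 10 letters, 1 letter,
# separated by hyphens (25 letters in total).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text: str) -> bool:
    """Return True if the text has the shape of an InChI key (shape check only)."""
    return INCHIKEY_PATTERN.fullmatch(text) is not None


def count_letters(key: str) -> int:
    """Count the letters in a key, ignoring the hyphens."""
    return sum(ch.isalpha() for ch in key)


# Illustrative values for ethanol: the full InChI and its shorter key.
ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ETHANOL_KEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

if __name__ == "__main__":
    print(looks_like_inchikey(ETHANOL_KEY))    # shape matches
    print(looks_like_inchikey(ETHANOL_INCHI))  # a full InChI is not a key
    print(count_letters(ETHANOL_KEY))
```

The fixed length and plain alphabet are what make the key friendlier to search engines than the full InChI, which contains slashes, commas and other punctuation.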
Extracting references to substances within the literature is tough, too, due to the nuances of describing the often-complex molecular structures on the written page - many need to be visualised in 3D and, to aid readability, their complex names are conventionally referred to by numbers within articles. Here the machine-readable InChI is helping the RSC (via its Project Prospect) and now Nature Publishing Group (via the recently launched Nature Chemistry) in pushing towards born-digital journal article services. Both use combinations of automated text mining and manual editorial input to extract and mark up entities within articles, and both are encouraging authors to add further metadata to submissions.
* RSC journal articles attach machine-readable InChI codes, SMILES strings and CML (Chemical Markup Language) to chemical names, and cross-reference IUPAC Gold Book terms, Open Biomedical Ontologies (Gene, Sequence and Cell), related RSC articles and link to 2D graphics.
* Nature Chemistry functionality includes InChI and SMILES mark-up, pop-up structures in articles and a common database of NPG proprietary substance pages which can be used as hubs for linking to further information.
Since the InChI is machine-readable, information providers can streamline the identification and cross-referencing of substances within journal articles which use InChIs, and as the reliability of automated abstracting techniques (e.g. OSCAR) improves, economically “retro-tagging” the literature could become a possibility.
Implications: The InChI is an open and non-proprietary standard which can be shared between competing players and independently verified. It can therefore facilitate searching across multiple data sources via the open web, and provides a way to connect the various registries of substances in existence.
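As a rough illustration of how a shared identifier can connect registries, the Python sketch below joins two substance lists on their InChI strings. The registry contents and the record IDs (“A-001”, “B-17” and so on) are hypothetical and invented for the example; only the InChI strings for water, methane and ethanol are real. It is a sketch of the idea, not a description of how any provider actually does it.

```python
def cross_reference(registry_a: dict, registry_b: dict) -> dict:
    """Map each InChI found in both registries to its pair of local record IDs."""
    shared = registry_a.keys() & registry_b.keys()
    return {inchi: (registry_a[inchi], registry_b[inchi]) for inchi in sorted(shared)}


# Hypothetical registries: each maps an InChI to that provider's own record ID.
REGISTRY_A = {
    "InChI=1S/H2O/h1H2": "A-001",                  # water
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": "A-002",  # ethanol
}
REGISTRY_B = {
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": "B-17",   # ethanol
    "InChI=1S/CH4/h1H4": "B-42",                   # methane
}

if __name__ == "__main__":
    for inchi, (id_a, id_b) in cross_reference(REGISTRY_A, REGISTRY_B).items():
        print(inchi, id_a, id_b)
```

Neither provider needs to know the other's numbering scheme. Because the InChI is derived from the structure itself, the two records for ethanol line up without any central authority assigning the match.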
This in turn speaks to the trend we see across the web, namely that users want the convenience of a one-stop shop search. Scientists want to search “the whole of science” - or at least information held by all providers in their vertical area of interest. So projects such as the open-access ChemSpider are working towards a common point of discovery, allowing users to locate specific information for a chemical structure and then either access the data immediately via open access links or have the information necessary to continue their searches in commercially available systems.
The current gold standard for identifying chemical substances is the proprietary Chemical Abstracts Service (CAS) Registry Number system, owned and operated by the American Chemical Society (ACS). We do not yet know if CAS plans to map its database to InChI. However, given that CAS has been criticised for its proprietary approach in the past, and took until April 2008 to release a web-based version of its flagship SciFinder database, in Outsell’s opinion we may have to wait a while yet.
However, we do hope that this is not the case since it is important that information providers do not Balkanize their information if they are not to get lost in the web (see Insights 18 July 2008, Nature Publishing Group Sets the Cat Amongst the Pigeons of Open Access, But Maybe We’re All Missing the Point). The point here is that open standards can benefit all by making information (products) easier to discover, and this speaks to one of the core demands of the networked environment. So, for example, CAS’s index of 40 million substances is not threatened by open standards and, in fact, our view is that mapping CAS numbers to a standard such as InChI can only help to make it more accessible. And with over 20 million substances now indexed by ChemSpider, the InChI could emerge as a - if not the - industry standard index of chemical substances on the web.
Meanwhile, whilst we can see the reaction of the big chemistry publishers and abstracting services, we can reflect on a sobering question: why is it taking government and voluntary contributions to build an industry standard? Surely that should have been the territory of the information providers? In chemistry it seems, as everywhere, the web changes everything.
Friday, 10 April 2009