The Long and Winding Road towards FAIR Data as an Integral Component of the Computational Modelling and Dissemination of Chemistry

Abstract: Fifty years of the author's activities as a "computational chemist" are described, linked by the theme of scientific data. For the first half of this period, data was often handled in a very impoverished way, appearing in printed form as supporting information or as appendices in e.g. PhD theses; these methods not only resulted in considerable loss of information but also made re-use of the data difficult and error-prone. The second period charts the explosive growth of the digital information era and describes the author's own involvement in developing some of the required infrastructures that are nowadays often taken for granted. This culminates in the present-day increasing use of data repositories holding FAIR data, which is now firmly regarded as a first-class citizen of the scientific process. Many examples of these experiences are included in the essay for readers to experiment with themselves.


Introduction
This story covers a fifty-year timespan of the author's own experiences in chemistry with the theme of data. It starts in the early 1970s, when the increasing computational power and memory of digital computers of that era, coupled with emerging approximate solutions of the Schrödinger equation, were on the verge of enabling acceptable chemical accuracy to be achieved for quantitative modelling of real-world problems. This resulted in an explosive growth in the quantity and quality of the data produced as an offspring of this marriage, which was accompanied by the gradual realisation during the 1980s that the three-hundred-year-old model [1] of scientific publishing, the "printed journal", was itself going to have to evolve and adapt to the coming era of data-rich digital science.
As with the song, my own journey along the winding road started in 1970, with the topic I had selected with the help of the project supervisor for my final undergraduate year literature review essay at Imperial College. This was the determination of the association constants of donor-acceptor complexes in solution using NMR chemical shift measurements. The equations yielding an association constant for a 1 : 1 stoichiometric equilibrium were simple to analyse using graph paper, but already deviations from the expected linear correlation had been noticed and tentatively attributed to incursion of 2 : 1 stoichiometries (Scheme 1). [2] I had already tried my hand at computer programming whilst at school in 1967, a course offered to those who thought they might not entirely enjoy the alternative prospect of running a quarter marathon around the local Wimbledon Common every Wednesday afternoon. An enlightened and far-sighted chemistry course director, Bryan Levitt, had just introduced a programming course for the undergraduate students at Imperial College, which I had joined in 1968, so it was with great enjoyment that I took this course a year later. Armed with the new skill of knowing how to write and run a Fortran program, I was seeking a suitable problem to solve using it. With the NMR review, I soon realised I had found it.
The equation relating the more complex equilibrium (Scheme 1) to the chemical shift response could no longer be easily plotted on graph paper, but it could be expressed in terms of a linear least-squares programmable solution. This then yielded values for both K1 and K2 and, as a bonus for doing it this way, standard errors for both equilibrium constants (if you skip forward to Figure 1 below, you will see them there as well). Younger readers may not immediately realise that if one then wanted to perform a computer-based analysis, one had to sit down, code a program and gain access to the single mainframe computer available on site to run the code. There was no existing market then for off-the-shelf software for the task. As a technical university, Imperial College had offered computing facilities since around 1962, a mere 18 years, it might be noted, after Colossus, the first modern programmable computer, had started to operate at Bletchley Park. When I submitted my essay, I came to realise I had actually presented something of a grading challenge for the assessors, since it had morphed from a simple review of the literature into a full-blown research project, which it was not supposed to be! For the report, I had typed out the required equations in an appendix, including the least-squares summations as encoded in the Fortran language, as well as the chemical shift data extracted visually from various articles and finally the analysed parameters and their errors. I do remember my conclusion, which was that the second equilibrium constant K2 could be very large indeed, as also was its standard error. All this information is now lost, since such undergraduate efforts were only archived for a few years on paper before being discarded. My own printed copy of the essay has also vanished, as indeed has the computer program, which had only existed in the form of punched cards (digital computer files were not yet in common use). This loss must have set a flag in my mind for future years; in 1970 there was no easy way to preserve digital data and my project supervisor, who was not familiar with computer programming, had not been able to offer the advice I would have needed as an undergraduate to publish the results in a journal for posterity. The words written here therefore are the only record of this experience.
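By way of partial restitution, a minimal sketch of the kind of analysis involved is given below, re-imagined in modern Python with a non-linear least-squares routine rather than the original linearised Fortran formulation. It assumes fast exchange and a donor held in large excess, and the "data" are entirely synthetic; it is illustrative only, not the original method.

```python
# Sketch: non-linear least-squares fit of K1 and K2 (and the limiting shifts of
# the 1:1 and 2:1 complexes) to an NMR titration curve. Fast exchange and a
# large excess of donor D are assumed; the "data" are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def delta_obs(D, K1, K2, d1, d2):
    """Observed complexation shift for A + D <=> AD (K1), AD + D <=> AD2 (K2)."""
    f1 = K1 * D            # relative population of the 1:1 complex
    f2 = K1 * K2 * D**2    # relative population of the 2:1 complex
    return (d1 * f1 + d2 * f2) / (1.0 + f1 + f2)

rng = np.random.default_rng(1)
D_conc = np.array([0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2])          # mol/L, invented
shifts = delta_obs(D_conc, 4.0, 0.5, 1.2, 2.0) + rng.normal(0, 0.01, D_conc.size)

popt, pcov = curve_fit(delta_obs, D_conc, shifts, p0=[1.0, 1.0, 1.0, 1.0])
errors = np.sqrt(np.diag(pcov))     # standard errors, the "bonus" mentioned above
for name, value, err in zip(["K1", "K2", "d1", "d2"], popt, errors):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```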
Another opportunity to exploit digital computers came with my PhD project from 1971-1974, for which I was attempting to measure rate constants and kinetic isotope effects for a variety of hydrogen exchange reactions of the substituted and hindered indoles and related molecules that I was going to synthesize in the laboratory. [3] The strategy was to use substituent- and steric-induced changes to the rate constants and derived kinetic isotope effects to try to construct semi-quantitative transition state models for the reactions. After three years' effort, I finally produced a first stab at such a model [4] (Figure 1), and it was trying to verify the correctness of such models that in fact ended up propelling most of my subsequent career in chemistry. But before then, I had to find a solution to a more immediate experimental problem of analysing my kinetics. As with the NMR shifts, [2] some of the kinetic plots I was obtaining were distinctly non-linear and not tractable to analysis using graph paper, the time-honoured tool then in use by everyone in the Challis group that I had joined. I happily settled down to coding the non-linear least-squares equations (now iterative) [3] resulting from the steady-state solution of the rate equation, with one immediate practical benefit to my social life! This related to the "infinite time measurement", which normally had to be made in person after at least ten half-lives when using traditional methods of analysis. Now the infinity reading could itself be predicted using only the earlier measurements. The time saved in not having to wait did not go to waste but was put to good social use in the Harington pub! On this occasion, posterity has access to my analyses in the form of archived journal articles, [5,6,7] and some (but not all) of the computer codes produced during this period have also survived [8] as a result of my finding and archiving them at a much later date in a data repository.
Figure 1. The properties of a transition state as inferred using kinetic and isotope data. [6]
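The trick of predicting the infinity reading can be sketched in a few lines of modern Python (the original code was of course Fortran, and the steady-state expressions more involved): for a simple first-order process the "infinite time" reading is simply treated as another fitted parameter, so it never has to be measured. The rate law, rate constant and data below are invented for illustration.

```python
# Sketch: non-linear least-squares fit of first-order kinetics in which the
# "infinite time" reading A_inf is a fitted parameter rather than a measurement.
# The rate constant and the data are invented for illustration.
import numpy as np
from scipy.optimize import curve_fit

def first_order(t, A0, Ainf, k):
    """Observable (e.g. absorbance or an NMR integral) for A -> products."""
    return Ainf + (A0 - Ainf) * np.exp(-k * t)

rng = np.random.default_rng(7)
t = np.linspace(0, 3.0, 12)                       # hours; well short of ten half-lives
obs = first_order(t, 1.00, 0.05, 1.5) + rng.normal(0, 0.005, t.size)

popt, pcov = curve_fit(first_order, t, obs, p0=[1.0, 0.0, 1.0])
A0, Ainf, k = popt
errs = np.sqrt(np.diag(pcov))
print(f"k = {k:.3f} +/- {errs[2]:.3f} h^-1, predicted infinity reading = {Ainf:.3f}")
```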
Here I am however getting ahead of my story chronologically and I should return to that painful transition state model, constructed over three years and requiring some 1500 separate kinetic experiments (most lasting 10-48 hours). In 1974, the computers best capable of solving approximate solutions of the Schrödinger equations using the new breeds of "semiempirical" SCF-MO methods in a reasonable time (days) were to be found in the USA, not the UK, and so it was off to Austin, Texas, to start modelling transition states with Michael Dewar. I should recount the story of how I ended up in Texas. In 1969, whilst I was still an undergraduate, Derek Barton had won the Nobel Prize in Chemistry. By then known primarily as a synthetic chemist, Barton early in his career had actually been a molecular modeller; he liked to say [9] that he had actually started the field in the late 1940s! In those days, outside of Bletchley Park, calculations were mostly done on a mechanical calculator, not a computer. But by 1974, Barton had already heard of the great work being done by his good friend Michael Dewar in Austin using computers to model chemical phenomena, and he realised that no-one in the UK was exploiting these methods in the area of physical organic chemistry and synthesis. So he summoned me, possibly aware of my use of computers for kinetic analyses, and informed me in what seemed like a non-negotiable way that he would arrange for me to go to Texas. As it happens, three years later, he came to a party thrown by Michael (whose parties were indeed famous) and repeated our earlier conversation, but in reverse: I should abandon any thoughts of staying in the USA and should immediately return to the UK to start practising my new skills! Again, probably non-negotiable. So if Barton had been a modeller turned synthetic chemist, my career was developing as the mirror image.
Once in Austin, I soon came across a variation on the least-squares minimisation procedure (of residuals) which acted instead on the molecular energy (actually heat of formation) produced by the locally written semi-empirical (MINDO/3) codes and which allowed the geometries of transition states to be located by evaluating numerical (later analytical) first and second energy derivatives with respect to the 3N-6 coordinates of the molecule. [10] These optimisation methods and their successors are now a mainstay of computational modelling, although of course the size of the molecules has increased from perhaps up to ten atoms to many hundreds and the semiempirical methods have largely been superseded by more reliably accurate solutions of the quantum equations such as density functional and coupled cluster theories. In Austin, I was able to compare the computed structure and kinetic isotope effects (KIE) for the transition state of a Diels-Alder cycloaddition with measured KIE values, [11] using the so-called Bigeleisen-Mayer (BM) partition function ratios first formulated some 27 years earlier. [12] These functions required access to all the (computed) normal mode wavenumbers for both the reactants AND the transition state, something that for several decades after the BM formulation had been considered a holy grail, regarded as unachievable in the foreseeable future from the perspective of 1947. I could not help but note at the time that whilst it had taken ~3 years of experimental measurement to crudely approximate a single transition state model, the eventual calculation of a more quantitative model based on quantum mechanics alone could be done in a few days if not a few hours. Indeed, jumping forward a few decades to 2014, I took the opportunity of modelling [13] all the basic reactions reported in my PhD thesis, using levels of theory that had by now achieved chemical accuracies of a few kcal/mol, including computed values for many of the isotope effects I had measured. This process took about two weeks of elapsed time and revealed a pleasing congruence between my experimental measurements and the isotope effects computed using the new transition state models (Figure 2), with one exception that was clearly the fault of the experimenter! I should however now return to my overall theme in this article of data. How much of the data dating from around 1976 on which the reported conclusions [11] relating to the predicted kinetic isotope effects for the Diels-Alder reaction were based has survived? During this era, data in both numerical and graphical form would be submitted as supporting information (SI) with the main manuscript, both being in printed form on paper (hence the term "scientific paper"). The SI component comprises in this case ten printed pages, consisting of four photocopies (a very new technology then) of computer printout (Figure 3) and six pages of probable typescript, being the atomic coordinates of two computed transition states and of the reactant and product states, the result no doubt of several hours of transcription from computer printout by a dedicated secretary. The four photocopied pages, which are presumably free of transcription errors, originate from one of the programs I had brought with me from my PhD efforts in London, together with a newly programmed implementation of the Bigeleisen-Mayer equations, the code for which also still exists. [14]
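For readers curious about what such a calculation involves, the sketch below gives a minimal Python rendering of the harmonic Bigeleisen-Mayer expression for a kinetic isotope effect (no tunnelling correction), assuming one already has the real normal-mode wavenumbers of both isotopologues of reactant and transition state, plus the magnitudes of the two imaginary wavenumbers. The numbers shown are placeholders, not data from ref. [11].

```python
# Sketch: harmonic Bigeleisen-Mayer kinetic isotope effect from normal-mode
# wavenumbers (cm^-1) of light/heavy isotopologues; no tunnelling correction.
import numpy as np

H, C, KB = 6.62607015e-34, 2.99792458e10, 1.380649e-23  # J s, cm/s, J/K

def u(wavenumbers_cm, T):
    """Dimensionless u_i = h c nu_i / k T for each real mode."""
    return H * C * np.asarray(wavenumbers_cm) / (KB * T)

def rpfr(light_cm, heavy_cm, T):
    """Reduced isotopic partition function ratio, Prod u_h sinh(u_l/2) / (u_l sinh(u_h/2))."""
    ul, uh = u(light_cm, T), u(heavy_cm, T)
    return np.prod(uh * np.sinh(ul / 2.0) / (ul * np.sinh(uh / 2.0)))

def kie(react_l, react_h, ts_l, ts_h, nu_imag_l, nu_imag_h, T=298.15):
    """k(light)/k(heavy): imaginary-mode factor times ratio of reactant and TS RPFRs."""
    return (nu_imag_l / nu_imag_h) * rpfr(react_l, react_h, T) / rpfr(ts_l, ts_h, T)

# Placeholder wavenumbers for a toy 3-mode reactant and 2-real-mode transition state
react_light, react_heavy = [3000.0, 1400.0, 900.0], [2200.0, 1350.0, 880.0]
ts_light, ts_heavy = [1500.0, 950.0], [1450.0, 930.0]
print(f"KIE = {kie(react_light, react_heavy, ts_light, ts_heavy, 1200.0, 900.0):.2f}")
```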
The other six pages of SI, being human transcriptions, would need to be checked for errors introduced during this process before being usefully relied upon. Neither is directly re-usable, since the printed contents of the transcription or photocopy would have to be reversed back into digital form. In fact, back in 1975 or so, the data from the computers was never returned to the user in digital form; it was always directly printed onto paper. A group of around 10 researchers could easily generate a ton of such paper in one year; indeed when the output storeroom was filled, we would recycle the paper with a local company and earn enough income from this for an excellent night out in the Texas Tavern! Such storage also brought other difficulties. On one occasion, a published article was challenged in press, and I was asked to go to the storeroom to try to find the original printouts to check them for errors, the student who had done the work having left the group. After "only" 3-4 hours, I was lucky enough to stumble on the printouts in question and spotted the errors immediately from this source, without having to type all the numbers in again. Another month or so, and those printouts would have been recycled for beer! These experiences no doubt seeded in my mind that the volume of data per published article in the area I was working in was likely to expand considerably from a mere 10 printed pages, and that transcription in both directions not only risked introducing errors and consumed much time, but was very soon going to become entirely unviable. Such issues meant that the re-usability of data in this form was almost non-existent. Access to such SI was also non-trivial and not instant; one had to contact the publisher and ask them to send a copy by post. In the UK, this material was in fact held by the British Library in an underground vault, from where a copy had to be requested. It would be fair to say that such data from a scientific article, whether associated with SI or even found in the main body of the article, was probably very rarely re-used in any productive sense by most researchers, and likewise rarely checked in any sense by reviewers of journal articles.
Figure 3. Example analysis dating from ~1976 of literature data used to derive activation parameters for a Diels-Alder reaction, presented as Supporting Information. [11]
After my three year stay in Austin, I returned to the UK as demanded by Barton, where I continued to adopt the, in retrospect unsatisfactory, data practices described above in all my scientific outputs. Around 1980 the other Nobel Prize winner in the department, Geoffrey Wilkinson (then Head of Department), asked to see me. He was anecdotally known to be highly sceptical about the usefulness of any chemist who used computers rather than doing real chemistry in a laboratory. Mysteriously he said little when we met, instead leading me to a corner of his own laboratory and pointing to a stack of NMR spectra. This was at a time when he was working on his famous catalyst, based on rhodium-phosphine complexes, and these were all 31P spectra. None of his students or co-workers had been able to analyse them, which was causing serious delays to the hoped-for publications describing their reactions, and the students were starting to fret. With that famous Yorkshire twinkle in his eye, he wondered merely whether I might fancy having a go and thrust them into my hands. Clearly implied was that if I did not succeed, his misgivings about computational chemists would be fully confirmed! No pressure then! As it happens, I was well prepared. I have mentioned that on three prior occasions I had delved into minimisation codes, whether of squares of residuals or of computed energies, and had also applied them to the parametrisation of semi-empirical SCF-MO methods such as MNDO. It was actually rather easy to connect the minimiser codes to the famous LAOCOON NMR simulation program and minimise the residual difference between the measured and predicted lineshape. Now, Rh-31P and 2J(31P-31P) couplings in such complexes are famously large and the 31P spectra are accordingly grossly second order in appearance, which was the cause of the difficulties noted above. But the minimisation code did not let me down and soon I was getting near perfect matches between experiment and theory (Figure 4). Adopting the minimalist conversational approach he had previously inflicted on me, I returned to Wilkinson's office and, as silently as I could, placed simulation and real spectrum on his desk, together with a listing of the extracted chemical shifts and J-couplings. Then I left. I do not remember Wilkinson saying much at the time either. A few weeks went by without further communication between us, until one day a fat envelope appeared in my mailbox. It contained drafts of three articles, [15,16,17] with me as a starred co-author! I continued to meet him occasionally to chat about chemistry until a few hours before his unexpected death in 1996, and never was there any disparaging remark about computational chemists. As for data, well those three articles still contain only images of NMR spectra, with no digital data available other than the couplings and chemical shifts. The code I wrote for this task is also lost. Addressing these problems was still to be some decades away.
In 1988 I was invited to sit on the advisory boards of scientific journals, and it was thus that I found myself at a meeting of the RSC Perkin journals, the Perkin 2 journal being one target for my computational modelling articles at that time. Almost all articles submitted to these journals now contained large amounts of data, either taking the form of computational models or e.g. of spectroscopic data deriving from synthetic chemistry. All of it was still submitted on paper, but the volume was growing almost exponentially. Something had to be done, and done rapidly. As a computational modeller, I also took a keen interest in improving institutional infrastructures and in particular digital networks and their bandwidth capacity. These networks were increasingly being used to connect large (but mostly still mainframe) computers to new generations of graphical display terminals and plotters located in offices and laboratories. Although still used, computer printouts were now being rapidly replaced by digital files sent from source to display devices, and then between that device and perhaps a more sophisticated phototypesetter, which preceded laser printers and which could handle colour and high resolutions. By 1987 in fact, the local area network in the chemistry department at Imperial College had already evolved to connect around 40-50 Macintosh computers not only with laser printers but, via the newly emerging Internet, to information sources such as CAS online (in Columbus, Ohio), to databases of structural information such as crystal structures at the national chemical databank service at Daresbury and, using a now-forgotten protocol called Gopher, to remote collections of data files, a method which had started to supplement an existing one called FTP (file transfer protocol). Journals did not yet make much use of these new infrastructures, but I did suggest that they should exploit these at that 1988 advisory board meeting.
This idea lay rather dormant until 1993, by which time the Internet had acquired a new and friendly user interface known as the World-Wide-Web and was on the cusp of "going viral". Primed with all my previous experiences and having identified three other chemists, Peter Murray-Rust, Ben Whittaker and Mark Winter, who had also noticed this phenomenon, we sat down and wrote down our ideas of how this new medium could solve a host of issues, not least of which being the sharing of re-usable scientific data. [18,19] The Web as it is now known had indeed been invented to address exactly such sharing and exchange, in that case of the data generated from experiments using particle accelerators at CERN. Peter and I went on to develop CML (Chemical Markup Language), [20] itself an early implementation of XML (extensible markup language), as a method of formally capturing the richer semantic meanings of data, something poorly handled by the previous generations of data formats in chemistry. Journals now started to adopt this new medium, but without entirely knowing what its potential might be. I found myself in 1995 as part of a funded project (CLIC) to help re-invent and hence chart that new course for how the modern journal might start to deliver scientific outputs to readers. [21] The title of the article we wrote describing this project summarises the intent: "The Case for Content Integrity in Electronic Chemistry Journals", by which we meant that data was going to be very much part of the content and should be delivered intact, without emasculation or loss. Whenever a molecule was discussed in the main body of an article, a "popup" panel could be programmed into the journal article which would contain a rotatable 3D model of the molecule in question. The coordinates of this model could then be downloaded by the reader to be re-used by them as they wished, in the manner we had earlier illustrated. [22] The coordinates could also be contained in a standard form which became known as a media type, and the standard types in common use in chemistry were identified and each given a proposed designation. [23] These types will re-emerge later in this narrative. The concepts were perhaps in retrospect too problematic for mainstream publishers to incorporate and nowadays it will be difficult to find such data-containing popups as an integral part of journal articles. A good modern example of how this nevertheless operates nowadays is found on the search page of the Cambridge Crystallographic Data Centre (Figure 5), relating to a recent project described in more detail below. [24] The centre panel contains a 3D rotatable representation of the crystallographic structure that can be analysed and manipulated using the software tool JSmol. [25,22] The realisation soon dawned that re-inventing journals to also be repositories of accessible and reusable data might be too ambitious; it would be more successful if each aspect had its own dedicated infrastructure, one for the journal and the second for the data (now known as being hosted on a repository), each with its separate identity and each with the status of a scientific work. The two could be intimately and bidirectionally linked by that seminal invention now associated with the World-Wide-Web, the hyperlink or URL.
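To give a flavour of what such semantic markup looks like, the sketch below uses Python's standard xml library to build a minimal CML-style record for a water molecule. The element and attribute names follow common CML conventions, but the snippet is illustrative rather than a complete or validated CML document.

```python
# Sketch: building a minimal CML-style molecule record with Python's standard
# library. Element/attribute names follow typical CML conventions; illustrative only.
import xml.etree.ElementTree as ET

mol = ET.Element("molecule", id="water", xmlns="http://www.xml-cml.org/schema")
atoms = ET.SubElement(mol, "atomArray")
for aid, elem, x, y, z in [("a1", "O", 0.000, 0.000, 0.117),
                           ("a2", "H", 0.000, 0.757, -0.469),
                           ("a3", "H", 0.000, -0.757, -0.469)]:
    ET.SubElement(atoms, "atom", id=aid, elementType=elem,
                  x3=str(x), y3=str(y), z3=str(z))
bonds = ET.SubElement(mol, "bondArray")
for refs in ("a1 a2", "a1 a3"):
    ET.SubElement(bonds, "bond", atomRefs2=refs, order="1")

print(ET.tostring(mol, encoding="unicode"))
```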
During the period ~1994-2005, the printed and bound journal started to be replaced by web-based hosts, whilst databases and repositories continued to evolve separately. Data objects in a database were then not considered a separate publication, in the sense of the esteem in which they were held by the community (not contributing to e.g. "altmetrics" such as the h-index). In my own area of computing transition state models as part of reaction mechanisms, no databases have become established, even to this day, where such models can be deposited for others to search, inspect and re-use. Instead, computational chemists mostly retained the model of data shown in Figure 3, albeit this time not on paper but in the marginally improved form of an electronically delivered PDF file. Molecular coordinates largely retained the same appearance, and access to such data for re-use now depended on an informed select/copy/paste operation from the PDF file into a text editor, where the coordinates could then be further edited to remove any artefacts introduced by this operation and wrapped with the appropriate commands for a new computation. I have lost count of the number of times I have done this as part of the refereeing of a submitted article. The PDF format was designed as a long-term archival medium, but certainly not as one optimised for expressing re-usable data. My treasured transition states were still not being optimally served by the publication processes, a situation that persists to this day in most journal articles.
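The sort of clean-up that select/copy/paste demands is easy to automate once one has done it a few times. The sketch below is a small hypothetical Python helper, written for this essay rather than taken from any published tool, which extracts element/x/y/z lines from text pasted out of a PDF and rewrites them as a standard XYZ block.

```python
# Sketch: tidying coordinates pasted from a PDF supporting-information file into
# a clean XYZ block. A hypothetical helper, for illustration only.
import re

pasted = """
C   -1.2345   0.5678    0.0000
H -2.001  1.123   0.456
O    0.987   -0.654  0.321
"""  # imagine line-break and spacing artefacts from a PDF viewer

LINE = re.compile(r"^\s*([A-Z][a-z]?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s*$")

def to_xyz(text, comment="recovered from PDF SI"):
    atoms = [m.groups() for m in map(LINE.match, text.splitlines()) if m]
    lines = [str(len(atoms)), comment]
    lines += [f"{el:<2s} {float(x):12.6f} {float(y):12.6f} {float(z):12.6f}"
              for el, x, y, z in atoms]
    return "\n".join(lines)

print(to_xyz(pasted))
```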
A little after the turn of the millennium however, another important innovation was introduced. This was the concept of a persistent identifier or PID, which has proved to be the essential and robust piece of infrastructure that allows a journal article to be seamlessly and more reliably linked to any associated data and indeed to other related resources. The best known of this genre is the DOI or digital object identifier, which brings with it the following new concepts:
1. A DOI, to be associated with a publication, is obtained through a process known as registration with an agency; registration ensures the identifier is entirely unique and will remain persistently so. The registration agency for journal articles, CrossRef, [26] had started operation in 2000, but it took another ten years for a further agency to be formed, known as DataCite, [27] which would issue DOIs for the publication of data and other non-article research objects. You may observe that such objects include computer software, and that two examples of these are already cited here. [8,14]
2. The registration process includes issuing a DOI in exchange for metadata associated with the object. This metadata is simply data describing the object, and originally was restricted to fields such as authors, institution, title, description, date and similar general-purpose information, but is now gradually evolving into much richer subject-specific descriptors.
3. The DOI comes with a prefix known as the "resolution service". This converts the characters of the DOI into a resolved URL, the latter being the Web-based hyperlink address pointing to the so-called landing page for the object in question. This directly addresses one of the increasingly recognised deficiencies in the original Web implementation, known as "link rot". The URL had already proved to be often less than persistent, even over a short period of a year or so, never mind decades; this was due to internal reorganisations or indeed relocations of web servers, which often meant that the URL would change. A PID service allows a mapping to any new URL to be made without the user even being aware of the change; in theory the DOI should just always work, and the resolution service will redirect the original URL (that of the DOI and its prefix) to any new location of the resource.
4. PIDs and their associated metadata were designed with the capability of being handled or processed not only by a human, the traditional target of the scientific publishing processes, but by unsupervised machines on as large a scale as is needed (a short sketch of such machine-driven resolution follows below). That aspect relates to the enabling of the emerging field of artificial intelligence or AI, and we are likely to see much more of this aspect in the next twenty years or so.
The introduction of PIDs now allowed a more stable infrastructure to be constructed, comprising a journal serving articles and a repository serving data, the two being bidirectionally connected using unique PIDs. So in 2005 we started a new project with enlightened colleagues Matt Harvey and Andrew McLean to install such a data repository, [28] a project that has now reached its second incarnation. [29] Since that original date, most of my group's publications have had an accompanying "data DOI", which we increasingly started to cite in the bibliography of the article. That bibliographic component of a scientific article is now also regarded as metadata in its own right, [30] and is increasingly included in the Crossref metadata record of a published article.
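As an example of what "machine-processable" means in practice, the sketch below resolves a DOI and asks the resolver for machine-readable metadata rather than the human landing page, using the widely supported content-negotiation convention of sending an Accept header to https://doi.org. The DOI shown (10.14469/hpc/1885, taken from the Figure 7 caption below) is one of the author's data DOIs, though the exact fields returned will depend on the registration agency, and network access is of course assumed.

```python
# Sketch: machine resolution of a DOI via content negotiation at doi.org.
# Sending an Accept header requests metadata (CSL JSON here) instead of the
# human-readable landing page. Network access and agency support assumed.
import json
import urllib.request

doi = "10.14469/hpc/1885"  # a data DOI cited in Figure 7 of this article
req = urllib.request.Request(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
)
with urllib.request.urlopen(req) as resp:
    record = json.load(resp)

# Print a few of the general-purpose metadata fields described above
for field in ("DOI", "type", "title", "publisher", "issued"):
    print(field, ":", record.get(field))
```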
We are starting to see the "article", previously a single object, being disassembled into at least three components, comprising the narrative or story, the associated data on which the story is based and the bibliography establishing the context of the story with other articles. Much like Rutherford's atom, after three hundred years of existence as the "article" (the term "paper" now seems quaintly obsolete), we are now seeing it starting to be split into separate (sometimes referred to as first-class) objects! An associated publishing experiment involved constructing one such new object, what became known as a WEO, or web-enhanced object, to at least in part replace some of the SI associated with the article and to act as the "pop-up" of the original CLIC journal project. [21] Creating a WEO required a fair bit of expertise and devotion, and it would be fair to say that the concept did not catch on with most authors. It was also in those days hosted by the journal publisher and was susceptible to "link-rot", since it had no PID of its own. A good example of a WEO can be found associated with this article [31] dating from 2010, wherein you will find a WEO link. [32] At some stage in the last ten years this link rotted and, to be fair to the journal, we were informed at the time of publication that this might happen in the future. When we were alerted to this by someone who wanted access to our data, we relocated the original WEO and gave it its own PID, [33] where you the reader can currently interact with it (at least in 2021).
You can also find it here as Figure 6, where you can view a static visual image for comparison. Our imagining of what a WEO could offer included adding hyperlinks, as seen in its right-hand column labelled OAI archive. These links took the form of PIDs, which themselves point to the data source, held on our first-generation data repository, for the property displayed in the WEO. These data include a full computed wavefunction for the molecule in question, taking the form of the final formatted checkpoint file from the program calculation. This then allows anyone to derive other properties from the wavefunction which had not been originally reported (an example of the I or Interoperability of FAIR, see below). In the context of where I started this story, this checkpoint file can also contain the full force constant matrix, which allows e.g. any given isotope effect of interest to be computed. [34] Most of the objects displayed in the WEO are 3D rotatable models, particularly orbitals and other isosurface properties, whose perception on a static two-dimensional journal page can be poor. This experiment using WEOs made us realise however that they are high-maintenance objects, with probably a relatively short expected lifetime (less than a decade?) before something decays, most probably the software environment used to create e.g. the 3D animated model. A long-term solution to this aspect is one challenge facing us in the future.
The second phase [29] of the project in 2015 involved using the DataCite agency [27] to register the DOI and using a new repository designed in part to exploit rich metadata. At this stage we started thinking deeply about the design of such metadata. Most of it would be generated automatically by scripts running on the repository, so that the human need not be deterred from the process of data publication. What characteristics or attributes would that metadata have? The following aspects gradually emerged, and here I give them their currently known descriptions, these being independently coined around that period in the form of the acronym FAIR. [35] Before expanding this, I must note that the design of FAIR data is targeted not just at humans but also at automated and unsupervised processes, as already mentioned in the context of PIDs. FAIR data therefore means it is:
1. F = Findable. The metadata can be directly searched to find data with specified properties, independently of any associated article if need be. Such searchable metadata would include the media type [23] or container in which the data finds itself and any molecular descriptors it might relate to, in the form of e.g. an InChI identifier [36] (a sketch of such a metadata search follows this list).
2. A = Accessible. The data should be capable of being acquired (the content "negotiated") by scripted and, if necessary, unsupervised processes, and not merely by a human responding to its description on a Web landing page.
3. I = Interoperable. A human or an automated process should be capable of transforming the data into a form suitable for any new analysis, including forms not previously anticipated by the creators of the data, and with any semantic meanings of the data appropriately handled.
4. R = Reusable. A rather flexible term which in part also relates to interoperability, but which also includes any conditions that might be imposed on re-use through licenses.
Each of these four presents different aspects of the metadata required to FAIR-enable the data. Each term is also inevitably a work in progress.
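To make the F and A aspects concrete, the sketch below queries the DataCite metadata store through its public REST API (assumed here to be https://api.datacite.org/dois with a free-text query and a prefix filter) for records registered under the author's repository prefix 10.14469 that mention an InChI. The exact query syntax and the fields returned should be checked against the current DataCite documentation.

```python
# Sketch: a Findability query against the DataCite metadata store (REST API),
# restricted to one repository prefix; endpoint and query syntax as assumed above.
import json
import urllib.parse
import urllib.request

base = "https://api.datacite.org/dois"
params = {"query": "InChI", "prefix": "10.14469", "page[size]": "5"}
url = base + "?" + urllib.parse.urlencode(params)

with urllib.request.urlopen(url) as resp:
    results = json.load(resp)

for record in results.get("data", []):
    attrs = record.get("attributes", {})
    titles = attrs.get("titles") or [{}]
    print(attrs.get("doi"), "-", titles[0].get("title", "(no title)"))
```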
Perhaps it is appropriate now to illustrate the journey so far with a concrete example, and to do this I will use an article written in 2018 relating to the reaction between an amine and a carboxylic acid using a boron-based catalyst. [24] Data was generated from many sources by several groups at three different institutions during this project, some relating to the synthesis of new molecules and their characterisation using NMR spectroscopy, some relating to crystallographic characterisation and, on my part, relating to modelling the proposed mode of catalytic action by locating transition states on potential energy surfaces. These types of data were made available in FAIR form and given their own citations in the bibliography of the article, [37] whilst data for which presentation in FAIR form still represented a challenge in 2018 was placed in the traditional SI PDF container. The visual identification of the FAIR data in the article itself also took a new form in the PDF version hosted by the journal (Figure 7), in which each step of the proposed catalytic cycle is not only labelled with a DOI resolving to the data collection on which it is based, but also has a hyperlink to that data embedded directly into the image.
Figure 7. In ref. [24] this appears as Figure 9, to which is appended the following caption (in part): The DOI for data repository entries for individual species are shown in the form e.g. 10.14469/hpc/1885. All DOIs in this figure are available as clickable links (final paginated PDF only).
From article context to data is just one click away! Give it a go yourself!
The WEO itself has now taken on a new form, no longer being hosted by the journal publisher as was previously the case, [31] but instead itself published on a separate data repository and having its own DOI. A recent example [38,39] illustrates how the concept of a WEO is now largely reduced to a stylesheet shell for the data, serving only to display it on demand using appropriate script software and retrieving the data, also only on demand, directly from a repository when the DOI of the required dataset is invoked. That floating window is also back (Figure 8).
Another feature of this article [24] was how the NMR spectroscopic data was handled. Conventionally, the data is recorded on a spectrometer and made available on a file server for the researcher to collect. When acquired, the first action is to subject it to a Fourier transform; the resulting spectrum is then annotated and finally saved as a PDF format file, where it is aggregated into the supporting information document. This container may hold many tens of such spectra. In the process, the original spectrometer data in the form of an FID is replaced by the spectrum, and the former is then discarded or lost, along with the original data resolution. This suppresses any ability for others to re-analyse the original instrument data, or to check that the subsequent transforms have been appropriately handled. For example, it is known that the FT technique can introduce artefacts known as Gibbs-Wilbraham oscillations, [40] but the loss of pre-FT data means that these could not now be investigated. In contrast, our model [41] for disseminating NMR data [24] presents it again as an object identified with a DOI, [42] which allows access to the original unprocessed instrument data for each reported molecule, together with the instrument parameters used for the acquisition. Because such time-domain data cannot be interpreted (by a human) without transformation to a frequency-domain spectrum, a free license which allows it to be loaded into a commercial analysis program (MestreNova) for such analysis is also provided, in the form of an mnpub access file. [41] Similar procedures can be followed for other types of spectroscopic data. It was also a pleasure for me to return to first year undergraduate students, where this story started, and help introduce them to an experiment designed by my colleague Ed Smith, in which they first make a new and unique organic ester, one never before made by anyone, record its NMR spectrum, and then publish the data as described above. In doing so, they attain their very first scientific publication (albeit of data) and become a published researcher, as evidenced in their ORCID (Open Researcher and Contributor ID) record. [46] How I would have loved to have done this fifty years earlier for my analyses of first and second equilibrium constants as also derived from NMR spectra.
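To see why access to the pre-FT data matters, the short sketch below simulates a single-line FID and Fourier transforms it twice, once after truncating the decay early: the truncated version shows the sinc-like (Gibbs-type) wiggles around the peak that can only be diagnosed, and cured by apodization or longer acquisition, if the time-domain data is still available. All acquisition parameters are invented for illustration.

```python
# Sketch: why the time-domain FID matters. A truncated FID produces sinc-like
# (Gibbs-type) wiggles around the peak after Fourier transformation.
# All acquisition parameters are invented for illustration.
import numpy as np

sw = 2000.0                      # spectral width, Hz (complex sampling rate)
n = 8192                         # number of complex points
t = np.arange(n) / sw            # time axis, s
fid = np.exp(2j * np.pi * 250.0 * t) * np.exp(-t / 0.5)   # one line at 250 Hz, T2* = 0.5 s

def spectrum(signal):
    """Zero-fill to the full length and FT; return frequency axis and real part."""
    padded = np.zeros(n, dtype=complex)
    padded[: signal.size] = signal
    freq = np.fft.fftshift(np.fft.fftfreq(n, d=1.0 / sw))
    return freq, np.fft.fftshift(np.fft.fft(padded)).real

freq, full = spectrum(fid)               # FID acquired to ~8 x T2*: clean line
_, truncated = spectrum(fid[: n // 16])  # FID cut short: truncation wiggles appear

# Compare the worst baseline excursion near (but outside) the peak in each case
window = (np.abs(freq - 250.0) > 20) & (np.abs(freq - 250.0) < 200)
print("max baseline ripple, full FID     :", round(np.abs(full[window]).max(), 2))
print("max baseline ripple, truncated FID:", round(np.abs(truncated[window]).max(), 2))
```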
The generous labelling of many of the data components of a scientific article with their own DOI creates a well-populated and rich metadata record for that data, as held by the registration agency. One of the value-added services that this agency can then provide is to create an indexed metadata store (MDS) and offer a tool for searching it, as a complement to the well-established routes for searching for journal articles or for entries in bespoke databases. A friendly interface for such searching has recently been constructed [43] by a collaborator, Charles Romain, one of the new generation of young and enthusiastic converts to FAIR data and, yes, it too has been assigned a DOI (Figure 9). Because all these article, data, software and now research presentation [44] objects have PIDs, the metadata for each object can cross-reference the other such objects, allowing a so-called PID graph to be constructed [45] which can be used to reveal relationships between all these different entities. The task of exploiting all this interconnected linked chemical data and information has only just begun. [46] It has been a fascinating journey along that winding road so far, one which started a mere fifty years ago with the mature and established 300+ year-old model of scientific journal publishing and, with the help of some extraordinary technical innovations in computing and networking, has started to drag it into the 21st century. Most of that new story has taken place in the last quarter century, and I suspect an equally diverse and probably entirely unexpected new story will emerge in the next quarter as well. I cannot wait!