Perhaps the most famous data retrieval case in the history of science comes from orbital mechanics at the turn of the seventeenth century. Copernicus had laid the foundations for a viable heliocentric system; Kepler stood ready to finalise it. Between the two, both problem and solution, lay the mysteries of Mars, ‘the wanderer planet’. The data that Kepler needed already existed, in a database of naked eye observations painstakingly constructed over two decades by the Danish astronomer Tycho Brahe.
The problem was twofold. Brahe had nailed his colours to a mixed system at odds with that of Copernicus; and his data was his claim to posterity. He employed Kepler as an assistant, but jealously guarded access to the full observational data set.
Kepler did, eventually, gain access to the data. It wasn’t easy, nor always amicable (though allegations that he murdered Brahe to achieve it have been discounted) – but it was done. He still had to learn how to retrieve it productively, but six years of mining and analysis finally bridged the gap to produce a final, successful, validated model.
Things have changed almost unrecognisably over the four or five centuries since Copernicus, Kepler and Brahe, but some features recognisably remain amid the new. Investment in research is balanced against the advantages of shared access. Boundaries, proprietary or otherwise, remain between researchers and data repositories. Murder and less extreme espionage methods may be rare (though not unheard of) as means of gaining access to data stores, but Kepler would no doubt recognise in essence the processes of negotiation and persuasion that allow those boundaries to be permeated.
The biggest early 21st century data retrieval issue, however, is a different one. Acquisition in large quantity is becoming ever easier. Storage is, in relative terms, becoming cheaper. The headache often becomes how to ensure that one retrieves the right data for particular purposes from the ever-ballooning volumes that are thus becoming available.
And then there is the problem of storage format obsolescence. Unlikely as it may seem, digital information, which is by definition recent and which (you might think) ought to be more easily accessible and more carefully curated, can sometimes be harder to reach than older analogue stores. Reusing information after it has passed its initial shelf life is a bigger issue than it ought to be. As a recent case of my own shows (see box ‘Retrieving the recent past’), multiple rapidly changing factors combine to deny retrievability remarkably quickly and opaquely. The most future-proof data storage option is probably plain text CSV (comma separated value) files, stored online; but even that is not guaranteed, and frequent review is the only real guarantee of posterity. Methods that provide transparent access to contemporary but incompatible databases (as described below) also suggest ways to retain access to older ones as new developments occur.
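As a minimal sketch of that ‘plain text first’ principle (in Python, with entirely hypothetical file and column names), the fragment below writes a small table of observations to CSV and keeps a plain-text manifest alongside it, so that a future review has both the data and its provenance in the most portable form available:

```python
import csv
from datetime import date

# Hypothetical observations: rows of (timestamp, variable, value).
rows = [
    ("2013-02-27T09:00:00", "C1_ppb", 4.2),
    ("2013-02-27T09:00:00", "C2_ppb", 1.7),
]

# Plain-text CSV: readable by almost anything, now or in twenty years' time.
with open("observations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp_utc", "variable", "value"])  # header row
    writer.writerows(rows)

# A side-car manifest, also plain text: units, provenance and the date of the
# last retrievability check, so that 'frequent review' has something to review.
with open("observations_manifest.txt", "w") as f:
    f.write("source: field station logger (hypothetical)\n")
    f.write("units: parts per billion\n")
    f.write("last_reviewed: " + date.today().isoformat() + "\n")
```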
Some areas of scientific work generate more product, and correspondingly bigger headaches, than others: the life sciences, and the mapping of genomes in particular, are especially fecund, for example. A scan of patents shows a marked rise in the number of applications relating to data retrieval methods since the blossoming of the Human Genome Project (HGP). Clinical research programmes increasingly accumulate and share data, with a concomitant need to manage its use efficiently. Few fields are slow to feed the flood, however. Whether probing the picoseconds after the Big Bang or seeking to map every fragment of rock in our solar system, the same pattern of explosive data growth is present everywhere.
The key word so far, and not one universally common among statisticians, is ‘database’. Data analysts (especially of my generation, but things are really only just beginning to change) tend to think in terms of lists, or tables, or worksheets of data, rather than databases. In local terms, we are perhaps right to do so; most statistical tests are applied to quite small subsets of the data available. When a subset contains a few hundred cases from two variables, it looks like (and is) a pair of lists, just as much as if it were only the dozen or so cases in a textbook. But these days the subset will probably have been retrieved from a very much larger database – and the criteria for that retrieval will range far and wide through other variables, and intercase comparators, not visible in the final subset itself.
In an ongoing investigation which I’ve just pulled up, for instance, the comparative concentrations in a scent of two specific chemicals (call them C1 and C2) have been retrieved at 73 different time points for hypothesis testing. Those paired lists, however, have been retrieved from an operational historian database currently containing more than 300 million cases (and growing) of just over 2,000 variables. They do not represent a randomised selection, nor a systematised extraction; they are the result of painstakingly developed queries based on the hypotheses to be tested. They represent the concentrations of C1 and C2 at, and only at, all moments when a dozen other variables meet specific criteria: other concentrations, temperature range, light level, atmospheric humidity, wind direction and the occurrence of a very specific and relatively infrequent phenomenon in insect flight. Depending on what this and previous tests show, the query will be adjusted to extract different cases from the same two variables; and so on. The queries also allow the new entries being continually added to the database to be included or excluded, as appropriate, from any time frame. In essence, it’s an idea that Kepler (who arrived at his elliptic orbit solution by progressing from analysis of retrieved triplets on a mistaken circular assumption) would have understood perfectly 400 years ago; in detailed execution, it can only happen with very recent computerised methods.
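The flavour of that kind of conditional retrieval can be sketched in a few lines, assuming (purely for illustration) that the historian’s contents have been mirrored into an ordinary SQL store; the table, column names and thresholds below are hypothetical, and the real system’s query syntax will of course differ:

```python
import sqlite3

# Hypothetical mirror of the historian: one wide table of time-stamped readings.
conn = sqlite3.connect("habitat_mirror.db")

# Retrieve C1 and C2 only at moments when the other variables meet the
# hypothesis-driven criteria; the rest of the 300-million-case store stays put.
query = """
SELECT timestamp, c1_ppb, c2_ppb
FROM readings
WHERE temperature_c BETWEEN 18.0 AND 24.0
  AND light_lux     > 5000
  AND humidity_pct  BETWEEN 40 AND 60
  AND wind_dir_deg  BETWEEN 200 AND 250
  AND flight_event  = 1               -- the infrequent insect-flight phenomenon
  AND timestamp     < '2013-03-01'    -- exclude entries still being added
ORDER BY timestamp;
"""

pairs = conn.execute(query).fetchall()   # paired lists of (t, C1, C2)
print(len(pairs), "time points retrieved for hypothesis testing")
```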
While that example was an in situ entomological observational study, historian databases are most widely used in process analysis, fed by sensors and other data generators integrated into the machinery of industrial contexts. Since they are time-series based, they are (in principle, at least) very easy to link with other databases – not necessarily from the same or even similar context – to produce burgeoning data complexity. It would be perfectly feasible, for example, to relate the database from my insect habitat to that from a nearby factory, an adjacent motorway, an industrial dairy, a weather station – and that sort of combinatorial multiple database approach is becoming increasingly common.
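Because everything is keyed by time, linking two such sources can be as simple as a tolerance-based join on timestamps. The sketch below, in Python with the pandas library and hypothetical file names, pairs each habitat reading with the most recent observation from a nearby weather station:

```python
import pandas as pd

# Hypothetical extracts: one from the insect-habitat historian, one from a
# nearby weather station, both time-stamped but sampled at different rates.
habitat = pd.read_csv("habitat_extract.csv", parse_dates=["timestamp"])
weather = pd.read_csv("weather_station.csv", parse_dates=["timestamp"])

# merge_asof needs both frames sorted by the join key.
habitat = habitat.sort_values("timestamp")
weather = weather.sort_values("timestamp")

# For each habitat reading, attach the most recent weather observation taken
# within the preceding ten minutes; unmatched readings get missing values.
combined = pd.merge_asof(
    habitat,
    weather,
    on="timestamp",
    direction="backward",
    tolerance=pd.Timedelta("10min"),
)

print(combined.head())
```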
The retrieval tools do not necessarily have to be in the same software as the analytic, of course. Flexibility often argues for their separation, in fact. Efficiency, however, favours an integrated system, and this is increasingly reflected by software suppliers.
StatSoft’s Statistica analysis product, for instance, long ago evolved into the core of much bigger aggregate solutions aimed at specific purposes. Database management was built in from an early stage. Data mining has been a priority for a long time, leading to the development of various product clusters including process control and investigation. Automation is handled by SVB, Statistica’s specific implementation of VBA (Visual Basic for Applications). For the Enterprise Edition there is a specific add-in for retrieving analytic data from OSIsoft’s PI data historian product (which, by comparison with my insect study’s collection rate of well under a million cases a day, can cope with a capture resolution of half a million events per second), and retrieval can also be extended through other VBA methods. An interface provides for defining the data repository and the method by which data are to be retrieved, collections of queries specifying data to be retrieved and analysed, and metadata specifying appropriate treatment of the retrieved data for the analysis in hand.
In a different direction, Weisang’s analytic software FlexPro, whose central focus is time series (it refers to variables as ‘signals’), emphasises an uncompromisingly database-oriented view of its numerical content over the more usual worksheet approach. Again, a product-specific automation model allows programmed retrieval of specified data from this store, with the database approach providing a rigorous environment for sample design planning.
The provision of automation methods sometimes obscures the fact that most heavyweight data analysis programs are, behind their graphical user interfaces, actually programming languages in their own right, with considerable automated data retrieval potential. There are also long-term developments, such as Microsoft’s ODBC (Open Database Connectivity) standard, which facilitate access to generic data stores by analytic tools.
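As an illustration of the ODBC route (a sketch only, using the pyodbc library from Python; the data source name, credentials and table are hypothetical), the same sort of conditional extraction can be driven from outside any particular analytic package:

```python
import pyodbc  # generic ODBC bridge; DSN, credentials and table below are hypothetical

# Connect through an ODBC data source name configured on the machine; the
# analytic tool neither knows nor cares what kind of store sits behind it.
conn = pyodbc.connect("DSN=FieldStudy;UID=reader;PWD=example")

cursor = conn.cursor()
cursor.execute(
    "SELECT timestamp, c1_ppb, c2_ppb FROM readings WHERE flight_event = ?",
    1,
)

for row in cursor.fetchmany(5):   # peek at the first few retrieved cases
    print(row.timestamp, row.c1_ppb, row.c2_ppb)

conn.close()
```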
VSNi’s GenStat is a good example of this, its present interactive graphical face being a relatively recent development on top of a data-analysis-specific high-level language with a long scientific pedigree. Logical structures and expressions, loops and conditional branching, free (as well as fixed) field input, and the ability to incorporate user programs into the main program resources as transparent extensions alongside native directives and procedures all provide far more scope for automated, responsively adaptive approaches to data retrieval than most users ever dream of. It also has an unusually flexible and extensible file import facility, which permits users to design their own format templates or (see box ‘Retrieving the recent past’) draw on the experience of a user community that may have already trodden the same or similar paths.
Genetics is, as I noted earlier, one of the drivers behind the flood of data which has made retrieval such a high priority area. From it, and particularly from the rise of genome sequencing and the HGP, have developed two key practical coping concepts, which point the way to more general solutions: federated database systems (FDBS) and genome browsers, of which BioMart and Ensembl are good representative examples.
BioMart, like other FDBS, is a project designed to provide single entry point access via portals to multiple and disparate databases. Geographical distribution is irrelevant: of the 45 databases currently federated at the time of writing, 30 are in Europe, 11 in the Americas and four in Asia. Open source in structure, it is designed to ‘promote collaboration and code reuse’, provide ‘unified access to disparate, geographically distributed data sources’ and be ‘data agnostic and platform independent, such that existing databases can easily be incorporated’[1]. In this it offers hope of backward and forward compatibility for some of the storage format obsolescence issues mentioned above, as well as addressing contemporary incompatibility problems.
Ensembl is one of several genome browsers designed to bring bioinformatic data retrieval and relational database principles under a single interface umbrella, providing researchers with a unified retrieval view. Automated annotation of sequence data produces a MySQL database, which Ensembl then makes freely available to researchers. Several levels of access are available, from a web-based GUI to large dataset retrieval through the BioMart data mining tool or tightly defined direct SQL queries[2]. Developed in early response to the HGP, it now includes other key model organisms (such as fruit fly, mouse and zebrafish) and an expanded range of genomic data. It focuses on vertebrates, but a sister project, Ensembl Genomes, has extended the scope to bacteria, fungi, invertebrate metazoa, plants and protists.
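At the lightweight end of that spectrum, a single gene record can be pulled back programmatically in a handful of lines. The sketch below uses Ensembl’s public REST service from Python (the gene symbol is an arbitrary example, and field names may vary between releases); bulk retrieval would instead go through BioMart or direct SQL against the public database servers:

```python
import requests  # Ensembl's public REST service: rest.ensembl.org

server = "https://rest.ensembl.org"
endpoint = "/lookup/symbol/homo_sapiens/BRCA2"   # arbitrary example gene

# Ask for the gene record as JSON.
response = requests.get(server + endpoint,
                        headers={"Content-Type": "application/json"})
response.raise_for_status()

gene = response.json()
print(gene["id"], gene["seq_region_name"], gene["start"], gene["end"])
```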
There are similar tools and approaches being developed in other areas of science, though some are less open and distributed than others.
Mapping extreme-orbit objects in the solar system to make them predictable is a particular case, crying out for a shared database in which all positional ‘stranger’ observations can be logged for subsequent query-based retrieval and analysis as the data set grows and patterns begin to be discerned. Which brings us neatly full circle to Brahe and Kepler.
References and Sources
1. BioMart project. www.biomart.org [cited 2013-03-01].
2. Kinsella, R.J., et al., Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database (Oxford), 2011. 2011: p. bar030.
Retrieving the recent past

I was asked to revisit and reanalyse data from a 20-year-old longitudinal study in the light of new knowledge. Large and detailed, the data set was backed up onto VHS video cassettes.
Magnetic tape deteriorates with time, and the stored signal also tends to ‘print through’ from layer to layer. Finding a VHS cassette player that could be connected to a PC was a challenge. But a university IT department was able to solve these problems, copying the content onto a backed-up network drive.
After investigating several defunct VHS backup systems, I was rescued by a helpful hobbyist in Albania who had a copy of the necessary restore program and a 286 PC on which it would run. We were now able to decode the backups to yield sets of files created and saved by the spreadsheet Wingz.
Wingz was a spreadsheet program from Informix, ground-breaking in its day, far ahead of its time... but that day and time came to an end in 1994. GenStat, from VSNi, has a well-deserved reputation for being able to import a wide range of file formats, so I tried it. No dice: even GenStat was stumped. I sent a hopeful email off to VSNi asking if they had any suggestions, and their ‘expert in NZ’ offered to develop a complete direct file format import solution in a week or two, but also suggested an immediate workaround.
The immediate workaround involved a ‘player’ application for Wingz files, used for educational chemistry models by Professor Tom O’Haver at the University of Maryland. I could load the epidemiology files into this and then save them in WK2 (middle-period Lotus 1-2-3) format. WK2 files are readable by a number of current worksheet-oriented products, so could then be saved yet again in any form I wished. (An alternative would have been to copy and paste via the Windows clipboard, but file saves were more elegant and preserved fuller numerical precision.)
After rigorous checks that the conversion preserved the integrity of the original data, it became a fairly simple process to convert the whole archive. But the fact that this degree of ingenuity should be required to read information recorded only two decades ago does make you think.
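For what it is worth, the integrity checking amounted to the kind of cell-by-cell comparison sketched below, assuming (hypothetically) that a sample from each end of the conversion chain has been exported to CSV; the file names are invented:

```python
import csv
import math

def load(path):
    """Read a CSV file into a list of rows (lists of strings)."""
    with open(path, newline="") as f:
        return list(csv.reader(f))

# Hypothetical exports: a sample taken at each end of the conversion chain.
original = load("sample_original.csv")
converted = load("sample_converted.csv")

# Cell-by-cell comparison, with a small numerical tolerance to allow for
# benign differences in number formatting rather than genuine data change.
mismatches = 0
for r, (row_a, row_b) in enumerate(zip(original, converted)):
    for c, (a, b) in enumerate(zip(row_a, row_b)):
        try:
            ok = math.isclose(float(a), float(b), rel_tol=1e-9)
        except ValueError:
            ok = (a == b)              # non-numeric cells must match exactly
        if not ok:
            mismatches += 1
            print("row", r, "column", c, ":", repr(a), "!=", repr(b))

print("mismatches:", mismatches)
```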