Keywords: big data, reproducibility, data aggregation, bioinformatics, imaging
Webpages:
https://github.com/kuwisdelu/matter,
http://bioconductor.org/packages/release/bioc/html/matter.html

A common challenge in many areas of data science is the proliferation of large and heterogeneous datasets, stored in disjoint files and specialized formats, and exceeding the available memory of a computer. It is often important to work with these data on a single machine, e.g. to quickly explore the data, or to prototype alternative analysis approaches on limited hardware. Current solutions for working with such data on disk on a single machine in R involve wrapping existing file formats and structures (e.g., NetCDF, HDF5, database approaches) or converting them to very simple flat files (e.g., bigmemory, ff).
Here we argue that it is important to enable more direct interactions with such data in R. Direct interactions avoid the time and storage costs of creating converted files. They minimize the loss of information that can occur during conversion, and therefore improve the accuracy and reproducibility of the analytical results. They can also best leverage the rich resources of over 10,000 packages already available in R.
We present matter, a novel paradigm and package for direct interactions with complex, larger-than-memory data on disk in R.
matter provides transparent access to datasets on disk, and allows us to build a single dataset from many smaller data fragments in custom formats, without reading them into memory. This is accomplished by means of a flexible data representation that allows the structure of the data in memory to differ from its structure on disk. For example, what matter presents as a single, contiguous vector in R may be composed of many smaller fragments from multiple files on disk. This allows matter to scale to large datasets, whether stored in large stand-alone files or in large collections of smaller files.
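To make this concrete, the following is a minimal sketch of composing one vector from two on-disk fragments. It assumes the matter_vec() constructor with paths, datamode, offset, and extent arguments, as in matter 1.x (later releases renamed some of these, e.g. to path and type), so the exact argument names may need adjusting for your installed version.

```r
library(matter)

# Write two binary fragments to disk, simulating pre-existing files
# in a custom format (5 doubles each)
f1 <- tempfile()
f2 <- tempfile()
writeBin(as.double(1:5), f1)
writeBin(as.double(6:10), f2)

# Present both fragments as one contiguous length-10 vector,
# without reading either file into memory
# (offset is the byte offset within each file; extent is the
#  number of elements to take from each file)
x <- matter_vec(paths = c(f1, f2), datamode = "double",
                offset = c(0, 0), extent = c(5, 5))

# Subsetting reads only the requested elements from disk
x[3:8]
</imports>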
To illustrate the utility of matter, we will first compare its performance to bigmemory and ff using data in flat files, which all three approaches can access easily. In tests on simulated datasets larger than 1 GB, and on common analyses such as linear regression and principal component analysis, matter consumed the same or less memory and completed the analyses in comparable time. It was therefore similar to, or more efficient than, the available solutions.
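As a rough sketch of how such analyses can run out of memory, the example below assumes matter's bigglm method for matter matrices (which streams chunks of rows from disk through biglm) and that irlba can fall back to matter's on-disk matrix multiplication; the data, variable names, and tuning values here are illustrative, not the benchmark setup.

```r
library(matter)
library(biglm)
library(irlba)

# Simulated data, written out-of-memory as a matter matrix
# (matter_mat() stores the data in a temporary file on disk)
x <- matter_mat(matrix(rnorm(1e4), nrow = 1000, ncol = 10))
colnames(x) <- c(paste0("x", 1:9), "y")

# Linear regression on the on-disk matrix; chunks of rows are
# streamed from disk rather than loading the whole dataset
fit <- bigglm(y ~ x1 + x2, data = x, chunksize = 100)
coef(fit)

# Truncated PCA via implicitly restarted Lanczos bidiagonalization,
# using matter's matrix products on the disk-backed data
pca <- irlba(x, nv = 2, nu = 2)
```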
Next, we will illustrate the advantage of matter in a research area that works with complex formats. Mass spectrometry imaging (MSI) relies on imzML, a common open format for data representation and sharing across mass spectrometry vendors and workflows. Results of a single MSI experiment are typically stored in multiple files. An integration of matter with the R package Cardinal allowed us to perform statistical analyses of all the datasets in a public GigaScience repository of MSI datasets, ranging from under 1 GB up to 42 GB in size. All of the analyses were performed on a single laptop computer. Due to the structure of imzML, these analyses would not have been possible with the existing alternative solutions for working with larger-than-memory datasets in R.
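A sketch of this workflow, assuming Cardinal's readMSIData() reader and its spatialShrunkenCentroids() segmentation (the file path and tuning parameters are placeholders):

```r
library(Cardinal)

# Attach an imzML dataset; Cardinal uses matter to access the
# spectra on disk, so even a multi-gigabyte experiment is not
# read into memory ("example.imzML" is a placeholder path)
mse <- readMSIData("example.imzML")

# Downstream analyses then stream spectra from disk as needed,
# e.g. spatially-aware segmentation of the image
ssc <- spatialShrunkenCentroids(mse, r = 1, k = 5, s = 3)
```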
Finally, we will demonstrate applications of matter to large datasets in other formats, in particular text data arising in genomics and natural language processing, and will discuss approaches to using matter when developing new statistical methods for such datasets.