Norwich, UK (Scicasts) - Data integration or die – the motto of all big data industries today also applies to biosciences, say researchers at The Genome Analysis Centre (TGAC). The time has come for the academic community to establish common rules for formatting biological datasets, helping computational researchers to integrate new data into the existing databases, argues the team of Dr. Maria Victoria Schneider in their latest article “Data integration in biological research: an overview”, published in the Journal of Biological Research.

Access to all existing data in a standardized format will also improve the analysis of large datasets, say the researchers, and allow bioinformaticians to build generic algorithms, which all biologists can use to put their results into perspective.

“Data integration principles are fundamental in providing tools that are user friendly and allow [biologists] to focus their efforts on the actual study of the data instead of being lost in the process of looking for the data they need,” say the authors.

Computational sciences currently define two main approaches to data integration – the so-called “eager” and the “lazy”. The eager approach relies on building centralized databases, such as UniProt or GenBank, where the data from existing sources is stored in one data repository, also known as a data warehouse.

In the lazy approach, the data remains in distributed sources online and is integrated on demand, using in-house software or web-based user interfaces. The best known example of this approach is ExPASy - a bioinformatics resource portal that allows users to access a number of scientific databases and research tools (e.g. compare protein or gene sequences, calculate physical or chemical parameters of a protein).
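
The difference between the two approaches can be illustrated with a minimal Python sketch. It is not taken from the article: the two in-memory dictionaries stand in for remote databases, and the names `SOURCE_A`, `SOURCE_B`, `build_warehouse` and `lazy_lookup`, as well as the records themselves, are hypothetical.

```python
# Hypothetical stand-ins for two independent databases (made-up records).
SOURCE_A = {"P1": {"name": "example protein 1"}}
SOURCE_B = {"P1": {"length": 120}, "P2": {"length": 85}}


def build_warehouse(sources):
    """Eager: copy every record from every source into one central store up front."""
    warehouse = {}
    for source in sources:
        for key, fields in source.items():
            warehouse.setdefault(key, {}).update(fields)
    return warehouse


def lazy_lookup(key, sources):
    """Lazy: leave the data where it is and query each source only on demand."""
    merged = {}
    for source in sources:
        merged.update(source.get(key, {}))
    return merged


sources = [SOURCE_A, SOURCE_B]

# Eager: integrate once, then answer queries from the central copy.
warehouse = build_warehouse(sources)
eager_result = warehouse.get("P1", {})

# Lazy: integrate at query time, touching the sources directly.
lazy_result = lazy_lookup("P1", sources)
```

Both calls return the same merged record here. In practice the trade-off is that a warehouse must be refreshed whenever a source changes, while on-demand queries depend on every source being reachable at the moment of the request.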

In both cases, to allow seamless integration of all datasets, the entries need to follow a particular format and be computer-friendly. In their article, the researchers describe some of the most commonly used standards, such as FASTA for gene and protein sequences and SAM for sequence alignments. They also provide a list of resources where scientists can learn about other formats, so that they can adjust their data outputs for further re-use.
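
As an illustration of why a shared format matters, the short Python sketch below reads FASTA-style text, in which each record starts with a `>` header line followed by one or more lines of sequence. The function name `parse_fasta` and the example records are hypothetical, and the sketch covers only this basic layout, not every variant found in practice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping from record identifier to full sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first word after '>'; the rest is a description.
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated until the next header.
            records[current_id] += line
    return records


# Made-up example input, not real biological data.
example = """>seq1 hypothetical example record
ATGCGT
ACGT
>seq2
MKV
"""

records = parse_fasta(example.splitlines())
```

Because the layout is predictable, a few lines of generic code can read sequences from any tool that writes it, which is the kind of re-use the authors argue for.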

“Standards facilitate data re-use,” say the authors. “Absence of standards means substantial loss of productivity and less data available to researchers.”

“Funnily,” remark Dr. Schneider and her team, “the Roslin Bioinformatics Law’s First Law declaims: ‘The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.’”

The problem is further aggravated by the absence of standard bioinformatics training for biologists and an important gap between computer specialists and scientists who claim that “they are not good with computers”.

“There is no well-defined curriculum for bioinformaticians at the moment,” says Dr. Schneider. “Here at TGAC, we are spending a lot of time establishing graduate programs for students. I think that every biologist today should receive at least some basic training in bioinformatics and it needs to be done as early in their studies as possible.”

Dr. Schneider currently leads the 361° Division at TGAC, where she specializes in bioinformatics training and works on bridging the gap between biologists and computational researchers. In the article, her team lists free professional development courses that can help biologists acquire the knowledge they need to understand bioinformatics and work with computer scientists.

One such international initiative that Dr. Schneider is involved with, the Global Organisation for Bioinformatics Learning, Education & Training (GOBLET), provides a range of workshops and short courses, from Evolution and Genomics to Python programming and building protein signalling networks. Each aims to equip biologists with a set of specific tools they can use to enhance their research.

“The importance of biologists in data integration is huge. They are those who produce and analyse data, which need to be shared for a better science,” adds Dr. Allegra Via, Assistant Professor in the Biocomputing Group of Sapienza, University of Rome, and a senior author of the paper.

“This article is a wake-up call for biologists,” says Dr. Schneider. “Data integration should not be left on the shoulders of bioinformaticians. It needs to be a community effort and everyone has to pro-actively get involved in this.”

Publication: Data integration in biological research: an overview. Lapatas, V et al. Journal of Biological Research (September, 2015).