Mycoplasma contamination is a familiar headache to molecular biologists, but a new study has found that these tiny bacteria have also managed to jump out of the lab and into a well-known database: the 1000 Genomes Project.
The 1000 Genomes Project is an international consortium with over 70 participating research groups. Part of the motivation was also to examine contamination of a different sort: whereas GeneChip data reflect gene expression, the 1000 Genomes Project deals with the genomes themselves.
This database, jointly kept in Cambridge and Boston, was the first to include genome sequences from a large number of people and has so far uncovered over 35 million genetic variants. From a practical point of view, all this information occupies just under 200 terabytes, of which Langdon used a random sample of 6 terabytes.
The initial comparison analysis identified millions of sequence matches to different species of Mycoplasma. After filtering out the ones present in the human reference genome, Langdon ended up with multiple non-human DNA sequences matching one or more Mycoplasma genes. The final result: about 7% of the analysed sequences - supposedly human - turned out to be Mycoplasma contamination after all. This value varied widely between the labs involved in the original analysis, and with how long ago the sequencing was done. Langdon believes the contamination message is getting through to the labs, as recent data tended to be cleaner.
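For readers who want a feel for the filtering logic, the following is a toy sketch in Python. It is not Langdon's actual pipeline: exact k-mer matching stands in for a real sequence-comparison tool, and the sequences, the value of `k` and the function names are all made up for illustration. It flags reads that share a subsequence with a contaminant reference, ignoring any subsequence that also occurs in the human reference.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def contamination_fraction(reads, contaminant_ref, human_ref, k=8):
    """Fraction of reads containing at least one k-mer that occurs in the
    contaminant reference but not in the human reference."""
    # Discard k-mers shared with the human reference, so that sequence
    # common to both organisms is not counted as contamination.
    specific = kmers(contaminant_ref, k) - kmers(human_ref, k)
    flagged = [read for read in reads if kmers(read, k) & specific]
    return len(flagged) / len(reads) if reads else 0.0


# Hypothetical toy data, far shorter than any real genome or read set.
human_ref = "ACGTACGTTTGACCAGT"
contaminant_ref = "GGGCCCAAATTTGGGCCC"
reads = ["ACGTACGTTT", "GGGCCCAAAT", "TTGACCAGT", "CAAATTTGGG"]

print(contamination_fraction(reads, contaminant_ref, human_ref))
```

In this toy data set, two of the four reads match only the contaminant reference. A real analysis would have to tolerate sequencing errors and work at a vastly larger scale, which exact matching on short strings does not attempt.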
However, when it comes to ways to eliminate this problem from the computer scientists' point of view, answers get somewhat murkier. “It is feasible,” Langdon says, referring to the possibility of developing automated ways of removing Mycoplasma contamination, “but the question is whether people want to do it.”
“There's a problem in the sense that Mycoplasma is known to be endemic in terms of microbiology labs, but the implication that a lot of it has already got uploaded into computer databases hasn't really been taken seriously as yet,” he adds.
Despite Langdon’s concerns, the 1000 Genomes Project recognises that there are Mycoplasma sequence reads in some of the raw data sets. However, they believe this hasn’t affected published results and analyses. “All the human variation results that are the primary output of the project are derived from the sequencing reads that align to the human genome reference, which do not include the [Mycoplasma] reads,” explains Richard Durbin, acting head of computational genomics at the Wellcome Trust Sanger Institute and one of the project’s co-chairs.
In fact, most of the database users opt for sequences aligned to the human reference genome, so they’re unlikely to find any problems unless they actively go looking for trouble in the uncharted territory of raw data. In that case “they should not be so surprised to find non-human sequences!” says Durbin.
Nevertheless, the 1000 Genomes Project will make available a list of the cell lines containing the Mycoplasma reads identified by Langdon, for those potentially affected by this analysis or interested in exploring it further.