Archive: Garbage In, Garbage Out: Wrestling with Contamination in Microbial Sequencing Projects
August 15th, 2018
By: Noah Fierer, Jessica Henley, and Matt Gebert
They are insidious. They are difficult to eliminate. They strike fear in the hearts of microbiologists worldwide. Hard drive failures? Third reviewers? Bedbugs lurking in conference hotels? Nope. Contaminants in your high-throughput sequence data.
High-throughput sequencing approaches are now routinely used to characterize microbial communities. One common problem when running these types of analyses is contamination, i.e. sequences assigned to a given sample may not necessarily originate from the microbial DNA found in that sample. Contamination is a persistent problem and the possibility of contamination should worry anyone generating or analyzing microbial sequence data. The literature is filled with examples of published datasets that are highly likely to be contaminated - e.g. finding lots of skin bacteria in atmospheric samples collected from the middle of the Pacific Ocean or reports of brain and placental microbiomes. Contamination can be particularly problematic when dealing with lower biomass samples (e.g. air or tap water samples), but it can also cause problems when analyzing higher biomass samples, with contaminants introducing novel taxa and potentially contributing to significant run-to-run variation. While contamination is most likely to be encountered when analyzing bacterial communities as bacterial DNA is everywhere (‼), contamination can also cause problems when analyzing other microbial taxa (e.g. fungal DNA can also be a common contaminant).
In our lab, we analyze the microbial communities in hundreds of samples per month, primarily using marker gene and shotgun metagenomic sequencing approaches, and these samples come from a wide range of environments (including tombstones, showerheads, indoor air, and soil – to name a few). As a result, we are frequently asked about strategies to recognize, minimize, and deal with potential contamination in these datasets, whether the data are generated via shotgun metagenomic sequencing, functional gene sequencing, or marker gene sequencing (e.g. 16S rRNA gene sequencing).
What are the potential sources of contamination?
The first thing to point out is there are two dominant ways that microbial contaminants can be introduced into a high-throughput sequencing dataset:
1) Contamination introduced from lab reagents or during sample collection/handling/processing. Examples include: foreign DNA from Pseudomonas, Bradyrhizobium, Deinococcus, or other common reagent contaminants (see here and here) introduced during DNA extraction, PCR, or library preps; skin bacteria shed from the hand of a researcher during sample collection or processing; or fungal spores introduced into samples processed in a mycology lab (or a moldy barn).
2) Contamination from sample cross-over prior to or during sequencing. This cross-over can happen either before the DNA sequencing itself (e.g. some DNA from sample x got into sample y at the PCR step) or during sequencing, for example through misassignments during de-multiplexing (reads from sample x were mis-identified as belonging to sample y) (see here and here).
Strategies for minimizing contamination
Ideally one would eliminate all possible sources of contamination, but this can be difficult (if not impossible). Below are some tips for minimizing these potential sources of contamination and detecting the most likely source of any contamination. This is not meant to be an exhaustive list and we are sure there are many tips and tricks we haven't yet figured out. Also, many of these points are probably obvious to anyone who has ever set foot in a lab, but hopefully they are helpful to readers who are new to these types of DNA sequencing-based community analyses or who are simply wondering why their dataset may look like crap:
- Run lots of controls and sequence the controls. For every batch of samples, it is good to include multiple DNA extraction 'blanks' as well as 'no template control' (NTC) samples – a point that has been raised in a number of recent papers (e.g. here, here, and here). These 'negative controls' should ideally be sequenced. A faint or non-existent band on a gel does not mean your samples are contaminant-free: the concentration of amplicons pooled per sample for sequencing is often very low, so even a faint band on a gel could mean you end up with a lot of contaminating sequences in your final dataset. One advantage of sequencing the 'blanks' is that you will know if you have contamination, which specific taxa might be contaminants, and where that contamination might be coming from (e.g. sample cross-over versus reagent contamination).
- Test reagents for contamination prior to starting a study. Bacterial DNA can be surprisingly common in DNA extraction kits, PCR reagents, or elution buffers. Even a trusted vendor can have bad batches of reagents or kits. Find out if your master mix or polymerase has been tested for bacterial DNA contamination (most are not – if not, test the reagents yourself). Be sure to record reagent lot numbers: you'll avoid repeatedly testing reagents and be able to determine the source of contamination more easily. The master mix we typically use is not specifically tested for bacterial DNA contamination by the manufacturer, so we know we need to test each lot before using it for projects. Be sure that all the consumables you are using are PCR grade. Finally, it is also important to reduce or eliminate any potential contaminating DNA that may be present in reagents/buffers used during sample processing. Even if autoclaving of reagents is possible, autoclaving itself is rarely sufficient – extracellular DNA can be remarkably persistent.
- Basic lab technique is important. There are a lot of standard wet lab procedures that can minimize the risk of sample cross-over when doing DNA extractions, PCRs, or library preps. Before starting any sample processing, make sure you are set up for success. Keep different sets of pipettors for pre- and post-PCR work. If possible, set up PCR reactions in a PCR workstation (but DO NOT work with amplicon in the workstation). Always use filter pipet tips. Clean your area with diluted bleach or RNase Away and ethanol before and after working. If your lab cannot be devoted solely to sequencing prep (an infeasible option for many labs), it is best to devote a small area of the lab to the molecular work, away from any culturing, to prevent any cross-over between projects. Do not rely on UV light to clean your workstation; that just gives a false sense of security. Always wear gloves and change them frequently while working. Gloves are not where you want to reduce trash or pinch pennies. Be mindful of what others are doing around you. If you are extracting DNA, you don't want someone near you sloshing around amplicon or cleaning their soil sieves. Try to avoid putting high biomass samples next to low biomass samples on the same plate and be careful when re-using aliquots of PCR primers that have been used by someone less experienced.
- Check taxa for plausibility. OK – this is a no-brainer. Before you start running machine learning algorithms, constrained ordinations, network analyses, and other fancy data analyses – look at the taxa that are in your samples – are they what you might expect to see? Granted, there are occasions when your samples are so weird it is hard to know a priori what taxa you might see, but for most sample types there should be some pre-existing annotated data available to make comparisons. Are you seeing lots of skin bacteria in a deep-sea sediment sample? A pristine forest soil sample dominated by E. coli? If so, something may be amiss.
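The plausibility check in the last tip can be partly scripted as a first pass. Below is a minimal sketch, not a vetted tool: it assumes you already have per-sample read counts by genus, and the watch list of skin-associated genera, the 5% threshold, the sample names, and the counts are all hypothetical values chosen for illustration. It only tells you which samples deserve a closer look by eye.

```python
# Hypothetical sketch: flag samples where "unexpected" genera make up a large
# share of reads. The watch list, threshold, and data are illustrative only.

def unexpected_fraction(counts, unexpected_genera):
    """Return the fraction of a sample's reads assigned to watch-list genera."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    flagged = sum(n for genus, n in counts.items() if genus in unexpected_genera)
    return flagged / total


def flag_implausible_samples(table, unexpected_genera, max_fraction=0.05):
    """List (sample, fraction) pairs where watch-list genera exceed max_fraction."""
    flagged = []
    for sample, counts in table.items():
        frac = unexpected_fraction(counts, unexpected_genera)
        if frac > max_fraction:
            flagged.append((sample, round(frac, 3)))
    return flagged


# Assumed watch list of skin-associated genera; adapt it to your sample type.
SKIN_GENERA = {"Staphylococcus", "Corynebacterium", "Cutibacterium"}

# Invented read counts for two deep-sea sediment samples.
sediment_table = {
    "sediment_A": {"Staphylococcus": 10, "Desulfobacter": 900, "Woeseia": 90},
    "sediment_B": {"Staphylococcus": 400, "Corynebacterium": 200, "Desulfobacter": 400},
}

print(flag_implausible_samples(sediment_table, SKIN_GENERA))
```

With these invented counts, only sediment_B is flagged, because skin-associated genera account for well over half of its reads. A flag is a prompt to go look at the sample, not evidence of contamination on its own.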
Bioinformatic/statistical strategies for dealing with contaminants in your data
OK – so let's say you followed all these procedures and you still appear to have contamination. Now what do you do? The answer: it depends (which we realize is an annoying answer). There is no universal pipeline or set of guidelines you can use to identify contaminated samples or remove sequences originating from contaminants, and one should be careful when considering the removal of contaminants in the first place. There are a handful of statistical approaches or bioinformatic pipelines that have been developed to identify and remove contaminants (e.g. here, here, and here). However, how you handle potential contamination in downstream analyses will depend on the source of contamination, what you want to do with your dataset, the extent of contamination, and whether the contaminating taxa likely overlap with the taxa you would expect to see in your samples (e.g. if you are analyzing skin-associated bacterial communities that have been contaminated with skin bacteria from the person who processed the samples – you clearly have a problem).
One question we often get is – 'Well, I ran numerous negative controls, can I just remove those taxa from my output files, thereby excluding all the contaminants from downstream analyses?' The answer, in a nutshell, is no. Blindly removing taxa from a taxon table will absolutely cause issues and artifacts downstream, and the right approach depends on the source of the contamination. Reagent contamination, where a few specific taxa are abundant in negative controls, can be relatively easy to handle by identifying and removing those contaminant reads. However, if the contamination is a product of sample cross-over, simply removing all of the taxa that are abundant in the negative controls could lead to the removal of abundant taxa found in the actual samples. Likewise, your threshold for determining if contamination of your negative controls is problematic will depend on the read coverage across your sample set and how that compares to your controls (e.g. if you get only 100 reads from a few of your negative controls, and >100,000 reads from all your samples – it is unlikely that you have serious contamination problems). Approaching contamination bioinformatically cannot and should not be done with a line of code or the push of a button. Dealing with contamination must be done with the same level of care and thoughtfulness that would go into analyzing the actual biological data. As in microbiology more generally, there are no quick remedies.
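None of this replaces judgment, but a first look at the negative controls can be scripted to inform that judgment. Below is a minimal sketch that deliberately removes nothing. It assumes a taxon table stored as read counts per taxon per sample; the sample names, the placeholder taxon_A and taxon_B, the counts, and the 10% cutoff are all hypothetical. It reports (1) how the read depth of the controls compares to that of the samples, and (2) for each taxon that is abundant in the controls, how abundant it also is in the real samples.

```python
from statistics import median

# Hypothetical sketch: summarize negative controls without removing any taxa.
# All names, counts, and cutoffs below are invented for illustration.

def read_depths(table):
    """Total reads per sample."""
    return {sample: sum(counts.values()) for sample, counts in table.items()}


def control_to_sample_depth_ratio(table, controls):
    """Median read depth of the controls divided by median depth of the samples."""
    depths = read_depths(table)
    control_depths = [depths[s] for s in controls]
    sample_depths = [d for s, d in depths.items() if s not in controls]
    return median(control_depths) / median(sample_depths)


def mean_relative_abundance(table, names, taxon):
    """Mean fraction of reads assigned to `taxon` across the named samples."""
    fractions = []
    for name in names:
        total = sum(table[name].values())
        fractions.append(table[name].get(taxon, 0) / total if total else 0.0)
    return sum(fractions) / len(fractions)


def summarize_control_taxa(table, controls, min_control_fraction=0.1):
    """For taxa abundant in controls, report (taxon, mean in controls, mean in samples)."""
    samples = [s for s in table if s not in controls]
    control_taxa = {taxon for s in controls for taxon in table[s]}
    rows = []
    for taxon in sorted(control_taxa):
        in_controls = mean_relative_abundance(table, controls, taxon)
        if in_controls < min_control_fraction:
            continue
        in_samples = mean_relative_abundance(table, samples, taxon)
        rows.append((taxon, round(in_controls, 3), round(in_samples, 3)))
    return rows


CONTROLS = ["blank_1", "blank_2"]
table = {
    "soil_1": {"taxon_A": 60000, "taxon_B": 40000},
    "soil_2": {"taxon_A": 50000, "taxon_B": 49000, "Pseudomonas": 1000},
    "blank_1": {"Pseudomonas": 90, "taxon_A": 10},
    "blank_2": {"Pseudomonas": 60, "taxon_A": 40},
}

print(control_to_sample_depth_ratio(table, CONTROLS))
for row in summarize_control_taxa(table, CONTROLS):
    print(row)
```

In this invented example the controls have about a thousandth of the read depth of the samples, matching the 100 versus 100,000 reads scenario above. Pseudomonas dominates the blanks but is rare in the samples, a pattern consistent with reagent contamination. taxon_A shows up in the blanks but is also a dominant member of the soil samples, which looks more like sample cross-over; stripping it out would remove an abundant, real taxon. How to act on either pattern remains a decision for the analyst.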
A final plea
The key to dealing with potential contamination in downstream analyses is to be transparent in how you identify and remove contaminants and not just use arbitrary criteria or blindly follow pre-existing pipelines. More generally - when it comes to contaminants in sequence data, just remember the old adage - garbage in, garbage out. Relying on bioinformatics to save the day and salvage contaminated datasets is sketchy at best. Better to reduce the contamination in the first place.
We are sure we have missed many important points to consider or other tips/tricks to reduce contamination. Please feel free to contribute any comments that may be helpful to those who have had the patience to read this entire post.