By Noah Fierer, Tess Brewer, & Mallory Choudoir
Typically, when we analyze 16S rRNA gene data for bacterial and archaeal community analyses, we start by clustering sequences into operational taxonomic units (OTUs), i.e. grouping sequences that fall within a fixed similarity threshold (with OTUs often, but not always, defined at the ≥97% sequence similarity level). A similar approach is typically used for the processing of other marker genes commonly used for taxonomic analyses, including 18S rRNA gene sequences for eukaryotic analyses or ITS sequences for fungal analyses.
There has been a lot of recent discussion about whether microbial ecologists should abandon OTUs and instead quantify the diversity and composition of microbial communities using exact sequence variants (ESVs, also known as unique sequence variants, zero-radius OTUs, and sub-OTUs). Essentially, this approach avoids clustering sequences at a somewhat arbitrarily defined similarity threshold (e.g. 97%) and instead uses only unique, identical 16S rRNA sequences for downstream community analyses. Unlike OTUs, ESVs can differ from one another by as little as 1 base pair. Of course, resolving sequence variants is not trivial, as it is difficult to distinguish between ESVs that are ‘real’ and those that are just a product of amplification or sequencing errors. Fortunately, there are a number of tools now available to ‘de-noise’ the data and identify ESVs (including Deblur, DADA2, oligotyping, and UNOISE2). Here we do not dwell on how these algorithms actually work, nor do we want to enter into a potentially contentious debate about which one of these tools is better. Instead, we want to focus on the pros and cons of using ESVs versus the more traditional OTU approach for microbial community analyses. To cluster or not to cluster, that is the question.
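To make the distinction concrete, here is a minimal toy sketch of the two ways of defining units. It is not how Deblur, DADA2, UNOISE2, or any real OTU picker works: the reads are hypothetical, they are assumed to be pre-aligned and of equal length, there is no de-noising step, and the greedy clustering is just one simple way to apply a similarity threshold.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned reads."""
    if len(a) != len(b):
        raise ValueError("this toy comparison assumes equal-length reads")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def exact_variants(reads):
    """ESV-style table: every unique sequence is its own unit (no de-noising here)."""
    return Counter(reads)


def greedy_otus(reads, threshold=0.97):
    """OTU-style table: greedy centroid clustering, most abundant sequences first."""
    otus = {}  # centroid sequence -> total read count
    for seq, count in Counter(reads).most_common():
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid] += count  # absorbed into an existing OTU
                break
        else:
            otus[seq] = count  # no centroid was close enough: start a new OTU
    return otus


# Hypothetical 100 bp reads: a parent sequence, a variant with 1 mismatch
# (99% identity), and a variant with 5 mismatches (95% identity).
parent = "ACGT" * 25
one_off = "T" + parent[1:]
five_off = "CATGC" + parent[5:]
reads = [parent] * 50 + [one_off] * 30 + [five_off] * 20

esv_table = exact_variants(reads)
otu_table = greedy_otus(reads)
print(len(esv_table), "ESVs;", len(otu_table), "OTUs")
```

In this toy case the single-mismatch variant is kept as its own ESV but is absorbed into the parent's OTU at the 97% threshold, while the 95% variant forms a separate unit under both approaches.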
Advantages of using ESVs:
Improved taxonomic resolution. OK – this is an easy one. Since 16S rRNA sequences clustered at the 97% sequence similarity threshold could effectively encompass different microbial strains that may have diverged millions of years ago and have distinct phenotypic or ecological attributes, it makes intuitive sense to abandon OTUs. Using ESVs should improve our ability to detect biogeographical patterns, differentiate between pathogenic and non-pathogenic lineages, or discriminate between strains that have distinct environmental preferences (as demonstrated by recent work on the vaginal microbiome).
ESVs as consistent labels. When we use an OTU-based approach, there is not a single sequence for all members of that OTU; instead, the representative sequence stands in for a cloud of divergent sequences. By definition, when we use an ESV-based approach, the sequences within each ESV are identical to one another. This has a couple of important ramifications (as outlined previously). Most notably, this means that different datasets are more readily compared against one another. One does not need to guess if a new sequence would be assigned to a pre-existing OTU – a direct comparison of ESVs across studies is relatively straightforward and computationally efficient, as one is comparing apples to apples.
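As a rough illustration of why consistent labels make cross-study comparisons simple, consider the sketch below. The two studies and their sequences are hypothetical, and we assume both targeted the same gene region and trimmed reads to the same length (otherwise identical organisms would not yield identical strings).

```python
# Two hypothetical studies, each with an ESV table keyed by the sequence itself.
study_a = {"ACGTACGTAC": 120, "ACGTACGTAT": 15, "TTGCAAGGCC": 40}
study_b = {"ACGTACGTAC": 300, "GGCATTACGA": 22}

# Because the label *is* the sequence, shared ESVs are a plain set intersection.
shared = study_a.keys() & study_b.keys()

# Merging the two tables is a key-wise sum; no re-clustering is needed.
merged = {
    seq: study_a.get(seq, 0) + study_b.get(seq, 0)
    for seq in study_a.keys() | study_b.keys()
}

print(sorted(shared))
print(len(merged), "ESVs in the merged table")
```

With OTUs, by contrast, the same comparison would typically require re-clustering the pooled sequences or mapping new reads against the old representative sequences.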
Disadvantages of using ESVs:
Too much diversity. When it comes to microbial diversity, there really can be too much of a good thing. For example, an individual soil sample can harbor thousands of bacterial or fungal OTUs (even when sequences are clustered at the ≥97% sequence similarity level). In addition to high alpha diversity, soil microbial communities also have high beta diversity – a high degree of inter-sample variation in community composition, with relatively few bacterial or fungal taxa shared between any pair of soils. Such high levels of taxonomic diversity can complicate downstream statistical analyses, making it difficult to identify specific taxa, or sub-sets of taxa, that change in abundance across gradients or sample categories. We would expect that using an ESV approach would effectively increase alpha and beta diversity by increasing the taxonomic resolution (though this is not necessarily true, see figure below). Depending on the downstream analyses being conducted, the increased resolution of ESVs may actually make analyses more difficult than if similar sequences were clustered into ‘species’ or OTUs, by increasing alpha diversity and reducing the overlap between samples.
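The expected effect on alpha and beta diversity can be sketched with a toy example. Everything here is hypothetical: the two soil samples, the ESV names, and the ESV-to-OTU mapping are made up, and Jaccard dissimilarity on presence/absence is just one of many possible beta diversity metrics.

```python
def richness(sample):
    """Alpha diversity as the number of taxa with at least one read."""
    return sum(1 for count in sample.values() if count > 0)


def jaccard_dissimilarity(a, b):
    """Beta diversity as 1 - (shared taxa / total taxa), using presence/absence."""
    taxa_a = {taxon for taxon, count in a.items() if count > 0}
    taxa_b = {taxon for taxon, count in b.items() if count > 0}
    union = taxa_a | taxa_b
    if not union:
        return 0.0
    return 1 - len(taxa_a & taxa_b) / len(union)


def collapse(sample, esv_to_otu):
    """Sum ESV counts into the OTU each ESV is assigned to."""
    collapsed = {}
    for esv, count in sample.items():
        otu = esv_to_otu[esv]
        collapsed[otu] = collapsed.get(otu, 0) + count
    return collapsed


# Hypothetical mapping: two closely related ESVs per OTU.
esv_to_otu = {"esv1": "otuA", "esv2": "otuA", "esv3": "otuB", "esv4": "otuB"}
soil_1 = {"esv1": 10, "esv3": 5}
soil_2 = {"esv2": 8, "esv3": 7, "esv4": 2}

esv_beta = jaccard_dissimilarity(soil_1, soil_2)
otu_beta = jaccard_dissimilarity(
    collapse(soil_1, esv_to_otu), collapse(soil_2, esv_to_otu)
)
print("ESV richness:", richness(soil_1), richness(soil_2), "dissimilarity:", esv_beta)
print("OTU dissimilarity:", otu_beta)
```

At the ESV level the two soils share only one of four units, whereas at the OTU level they share both. Note that the richness of the first soil is the same either way, so higher resolution does not automatically mean higher alpha diversity.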
Divergence in rRNA operons. Many bacterial and archaeal taxa harbor multiple rRNA operons and, even within a given genome, the multiple rRNA genes are rarely identical. In fact, 16S rRNA genes within a single genome typically diverge by at least 4 base pairs, with rRNA gene dissimilarity increasing with increasing copy numbers (see here). In some cases, a single bacterial genome can contain 16S rRNA genes that differ by more than 40 base pairs! Fungal ITS analyses can also be affected by this intra-genomic heterogeneity, as ITS sequences can diverge by up to 20% within a single multinucleate spore (see here). Thus, multiple ESVs could come from the same cell or population of identical cells – a problem that would be significantly reduced when using an OTU approach.
Sensitivity to data quality. It happens – our data is not always of perfect quality, and sequencing errors (or PCR amplification errors) could be more common than we might like (as anyone who has used the 2×300 bp Illumina chemistry would know well). One of the main challenges facing pipelines designed to identify ESVs revolves around how to effectively ‘de-noise’ the data (discriminate between PCR or sequencing errors and ‘real’ biological variation). The higher the rate of PCR or sequencing errors, the more reads will be tossed during the ‘de-noising’ step of ESV pipelines (as many as 80% of reads in worst case scenarios). This loss of data may or may not be an issue, but an accumulation of single base pair errors will clearly lead to removal of more reads prior to downstream analyses when using an ESV-based approach versus an OTU-based approach.
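For some intuition on why error rates matter so much when every base counts, the back-of-the-envelope sketch below estimates the fraction of reads expected to contain no errors at all. It assumes errors occur independently at a constant per-base rate, which is a simplification, and the rates and read length are hypothetical. It is not a model of what any de-noising tool actually discards, since those tools can correct or reassign erroneous reads rather than simply dropping them.

```python
def error_free_fraction(per_base_error_rate: float, read_length: int) -> float:
    """Expected fraction of reads with zero errors, assuming independent errors."""
    if not 0.0 <= per_base_error_rate <= 1.0:
        raise ValueError("per_base_error_rate must be between 0 and 1")
    return (1.0 - per_base_error_rate) ** read_length


# Hypothetical per-base error rates for a 250 bp read.
for rate in (0.001, 0.005, 0.01):
    fraction = error_free_fraction(rate, 250)
    print(f"error rate {rate:.3f}: {fraction:.1%} of reads expected to be error-free")
```

Even modest per-base error rates leave a minority of long reads completely error-free, whereas a 97% OTU threshold would tolerate several mismatches per read.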
Clearly the ESV approach is not inherently better than the more traditional OTU-based approach; each has advantages and disadvantages. Hopefully microbial ecologists will avoid the ‘lumpers versus splitters’ debate that has plagued plant/animal taxonomy over the years and instead use the approach best suited to the data in hand and the questions being asked. If your data is high quality, you want improved taxonomic resolution, and you are not concerned about the intra-genomic heterogeneity in the targeted marker genes, an ESV-based approach could be advantageous. Otherwise, a more standard OTU-based approach might be your best bet.
Figure: Richness (# of ESVs or OTUs) at a sequencing depth of 2000 16S rRNA gene reads per sample. Results from 300 outdoor dust samples collected from across the globe with the data processed using two different ESV pipelines (DADA2 and UNOISE2) and a more standard OTU-based approach (uPARSE) where OTUs were clustered at the ≥97% sequence similarity level. Contrary to expectations, alpha diversity is not necessarily higher when using an ESV approach.
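Comparing richness at a fixed depth of 2000 reads per sample implies putting all samples on an equal footing first. One common way to do this is rarefaction, i.e. randomly subsampling each sample without replacement to the same number of reads. The sketch below shows that idea with a hypothetical sample; we are assuming this approach for illustration, not describing the exact procedure used to generate the figure.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample a sample's reads without replacement to a fixed depth.

    Returns None if the sample has fewer reads than the requested depth.
    """
    if sum(counts.values()) < depth:
        return None
    # Expand the count table into one entry per read, then draw `depth` of them.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    drawn = random.Random(seed).sample(pool, depth)
    rarefied = {}
    for taxon in drawn:
        rarefied[taxon] = rarefied.get(taxon, 0) + 1
    return rarefied


# Hypothetical sample with 5000 reads spread over three taxa.
sample = {"taxon_a": 3000, "taxon_b": 1500, "taxon_c": 500}
rarefied = rarefy(sample, depth=2000)
print("reads kept:", sum(rarefied.values()), "richness:", len(rarefied))

# Samples with too few reads are dropped rather than padded.
shallow = rarefy({"taxon_a": 100}, depth=2000)
```

The same function can be applied to an ESV table or an OTU table, so richness from the two approaches is compared at the same sequencing depth.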