Intragenomic heterogeneity and its implications for ESVs

By Angela Oliverio and Noah Fierer

October 9, 2017

In a recent blog post, we focused on the advantages and disadvantages of using exact sequence variants (ESVs) versus OTUs to cluster marker gene sequences for microbial community analyses. Just to recap – let’s say we sequenced a pool of 16S rRNA genes from a bacterial community found in your environment of choice. The traditional approach has been to cluster these sequence reads at a set threshold, e.g. all 16S rRNA gene sequences that are >97% similar are considered to be from the same operational taxonomic unit or ‘OTU’ (a term used by microbiologists who are afraid of the word ‘species’). An alternative approach is to avoid clustering sequences based on some arbitrarily defined level of similarity and instead identify exact sequence variants (ESVs) in the dataset, with each ESV having a unique 16S rRNA sequence (even if two ESVs differ by only a single nucleotide). The implicit assumption is that each ESV represents a distinct bacterial taxon, with downstream community analyses based on the relative abundance or presence/absence of these ESVs across a given set of samples. Here we want to follow up on a particular point raised in the previous post – the potential for intragenomic heterogeneity to complicate interpretation of ESV-based community analyses.

Here is the problem. A single bacterial cell/genome can often have multiple rRNA operons and these rRNA operons are not necessarily identical. In other words, an ESV-based approach has the potential to divide a single population of genetically identical cells into multiple ESVs. Thus, we could potentially inflate estimates of diversity and end up with more ESVs than the number of unique cells/genomes found in a given community. This intragenomic heterogeneity in rRNA operons has been reasonably well-documented. We know that approximately 48% of bacteria have more than one rRNA operon (Pei et al. 2010), with some bacteria having >10 rRNA operons in a single genome. We also know that these rRNA operons can have variable nucleotide composition. For example, Sun et al. (2013) found that of 2,013 bacterial and archaeal genomes analyzed, 952 genomes (47%) had intragenomic variation within the 16S rRNA genes. In their analysis, this variation led to up to a 123.7% overestimation of prokaryotic diversity when using ESVs with full-length 16S rRNA gene sequence data. The extent of nucleotide variation depends on the length and region of the rRNA operon sequenced – the regions V1 and V6 had the most intragenomic heterogeneity while V4 and V5 had the least (Sun et al. 2013).
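To make the inflation argument concrete, here is a minimal Python sketch that counts ESVs versus genomes for a made-up community. The sequences, genome names, and the function `esv_inflation` are all hypothetical, and the percentage is computed as (ESVs - genomes) / genomes, which is our assumption for illustration and not necessarily the exact metric used by Sun et al. (2013).

```python
def esv_inflation(genomes):
    """Compare the number of genomes with the number of distinct 16S variants.

    genomes: dict mapping a genome name to the list of 16S sequences
             found in that genome (one entry per rRNA operon copy).
    Returns (n_genomes, n_esvs, percent_overestimate).
    """
    n_genomes = len(genomes)
    # Every distinct sequence, pooled across all genomes, becomes one ESV.
    esvs = {seq for copies in genomes.values() for seq in copies}
    n_esvs = len(esvs)
    percent_over = 100.0 * (n_esvs - n_genomes) / n_genomes
    return n_genomes, n_esvs, percent_over


# Toy data (hypothetical): short strings stand in for 16S reads.
toy = {
    "genome_A": ["ACGTACGT", "ACGTACGT", "ACGTACGA"],              # 3 operons, 2 variants
    "genome_B": ["TTGCACGT"],                                      # 1 operon
    "genome_C": ["GGGCACGT", "GGGCACGT"],                          # 2 identical operons
    "genome_D": ["CCGTACGT", "CCGTACGA", "CCGTACGG", "CCGTACGT"],  # 4 operons, 3 variants
}

n_genomes, n_esvs, percent_over = esv_inflation(toy)
print(f"{n_genomes} genomes, {n_esvs} ESVs, {percent_over:.1f}% overestimate")
```

In this toy community, two of the four genomes carry non-identical operons, so an ESV count would exceed the number of genomes actually present.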

Intragenomic heterogeneity in rRNA operons is not a phenomenon restricted to bacteria and archaea. In fact, eukaryotic microbes likely have even more substantial intragenomic heterogeneity in their rRNA operon sequences. Protist and fungal genomes often have hundreds to thousands of rRNA operons and these operons can be highly divergent. For example, two fungi, Rhizophagus irregularis and Gigaspora margarita, were found to have high intra-isolate nucleotide variation, with the average sequence similarity across rRNA operons for SSU, LSU, and ITS at 99%, 96%, and 94%, respectively (Thiéry et al. 2016). Within protists, the oligotrich and peritrich ciliates are known to have extremely large rDNA copy numbers per cell (up to 310,000 copies), and intra-isolate sequence diversity has been documented in these groups (Gong et al. 2013). For both protistan and fungal communities, the potential for this intragenomic heterogeneity to lead to an overestimation of taxon richness has been noted previously (Thiéry et al. 2016; Gong et al. 2013).

Now, given that many microbes have multiple rRNA operons and that these operons are often distinct, we would expect that ESV-based analyses of microbial communities could, on occasion, split a single taxon into multiple ESVs. If so, the relative abundances of those ESVs originating from the same cell/genome would be highly correlated across a set of samples. In fact, that is exactly what we observe in an ongoing study of the sourdough microbiome where we sequenced a region of the 16S rRNA gene from >500 samples and then used UNOISE3 to cluster those sequences into ESVs. What we find is that some of those ESVs are very well-correlated with one another (see figure). This, in and of itself, is not evidence that these ESVs are originating from the same population of cells – they could just be distinct lineages that strongly co-occur. However, three further observations point to a single genome as the source. All of these ESVs are classified as belonging to the same species (Lactobacillus rossiae) and diverge from one another by only 1-3 base pairs. This genus of bacteria has an average of 5 rRNA operons. And the ratios between the multiple ESVs are close to 2:1, 3:1, or 4:1, which is exactly what we would expect if these ESVs are originating from the same genome. Of course, to confirm that the multiple ESVs are just picking up intragenomic heterogeneity, we would have to cultivate and sequence this strain, but we haven’t done that yet.
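The three checks described above (near-identical sequences, tightly correlated abundances, and a near-integer abundance ratio) can be sketched as a simple screen. This is an illustration only, not the analysis we ran: the function names, thresholds, and toy data are all hypothetical, the sequences are assumed to be pre-aligned and of equal length, and only integer ratios such as 2:1 or 3:1 are considered (a 3:2 split would be missed).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation signal
    return sxy / sqrt(sxx * syy)


def mismatches(a, b):
    """Number of differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def candidate_pairs(abund, seqs, min_r=0.95, max_mismatch=3,
                    max_copies=5, ratio_tol=0.15):
    """Flag ESV pairs that might come from one genome (thresholds are assumptions)."""
    hits = []
    for a, b in combinations(sorted(abund), 2):
        if mismatches(seqs[a], seqs[b]) > max_mismatch:
            continue
        r = pearson(abund[a], abund[b])
        if r < min_r:
            continue
        total_a, total_b = sum(abund[a]), sum(abund[b])
        if min(total_a, total_b) == 0:
            continue
        # Ratio of the more abundant ESV to the less abundant one.
        ratio = max(total_a, total_b) / min(total_a, total_b)
        nearest = round(ratio)
        if 1 <= nearest <= max_copies - 1 and abs(ratio - nearest) <= ratio_tol:
            hits.append((a, b, round(r, 3), nearest))
    return hits


# Toy data (hypothetical): read counts for three ESVs across five samples.
abund = {
    "esv1": [40, 80, 0, 120, 20],
    "esv2": [20, 41, 0, 59, 10],   # tracks esv1 at roughly 1:2
    "esv3": [5, 90, 30, 2, 60],    # unrelated pattern
}
seqs = {
    "esv1": "ACGTACGTAC",
    "esv2": "ACGTACGTAT",          # 1 mismatch from esv1
    "esv3": "TTGTACCTAG",          # 4 mismatches from esv1
}

hits = candidate_pairs(abund, seqs)
for a, b, r, ratio in hits:
    print(f"{a} and {b}: r = {r}, ratio close to {ratio}:1")
```

A pair flagged this way is only a candidate. As noted above, strongly co-occurring but distinct lineages would pass the same screen, so genome sequencing remains the way to confirm.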

So – where do we go from here? To be clear, we still think there are many good reasons for using ESVs and we are not advocating otherwise. No reason to throw the baby out with the bath water. However, we do think it is important to carefully consider how intragenomic heterogeneity might impact ESV-based analyses of microbial communities. Due to the high likelihood that multiple rRNA operons within a given genome have variable sequences, one might incorrectly assume that distinct taxa/ESVs strongly co-occur when, in reality, multiple ESVs could be coming from a single population of cells with identical genomes. When using ESVs to characterize microbial communities, it is important to consider a few points. If multiple ESVs are highly correlated and phylogenetically similar, that might suggest that those ESVs are just a product of intragenomic heterogeneity, particularly if the ratio of their abundances falls close to what rRNA operon copy numbers would predict (e.g. 2:1 or 3:1). Of course, to confirm whether highly correlated ESVs represent distinct taxa that tend to strongly co-occur or whether the patterns are a product of intragenomic heterogeneity, it is necessary to obtain genomic information for the taxon of interest and quantify intragenomic heterogeneity directly. Regardless, the growing number of researchers using ESVs for microbial community analyses should be aware that their ESVs may not necessarily be coming from distinct taxa.