In this paper we present HirBin, a new method for functional annotation and identification of differentially abundant functions in metagenomes. HirBin uses a data-centric approach to improve gene quantification where bins are identified by supervised annotation using known functional domains, and then divided into sub-bins using unsupervised clustering over all samples. Due to this two-step binning procedure HirBin has the ability to identify changes between metagenomes that would be overlooked by traditional methods. In a case study, HirBin was used to compare metagenomes from individuals with type 2 diabetes to healthy controls . The analysis showed that the number of sub-bins detected by HirBin was up to a five-fold larger than the more general bins (TIGRFAMs). Furthermore, a large proportion of the sub-bins were differentially abundant and many of these effects could not be identified at the bin level. A substantial number of bins that were found to be non-significant had at least one sub-bin that was significantly differentially abundant. This is a result of the broad functional classification of the bins where gene counts from multiple sub-bins are pooled. Our results suggest that effects are diluted at the bin level, due to the large number of sequences belonging to the bin that are not changing in abundance, or changing in the opposite direction, which makes the effect hard to identify in the statistical analysis. This was also underlined by the bins studied in detail (Fig. 3), which showed that significant changes present at more specific sub-bins can be masked at the bin level due to dilution of the significant effects. Furthermore, integrative analysis using GSEA showed, compared to the analysis using traditional bins, a higher number of significantly over-represented Gene Ontology terms when using sub-bins calculated with HirBin. This suggest that the increased sensitivity to detect changes in more specific functions enabled by HirBin can increase the data interpretability and provide more biologically relevant results.
The effects of dilution and the ability to detect differential abundance was further examined and quantified using resampling of an additional independent metagenomic dataset, where effects were systematically added using down-sampling of reads . When effects were added to more specific sub-bins (80% sequence similarity), the estimated fold-change at the bin level was close to zero, and, as a consequence, the statistical power (the ability of the method to detect the change) was substantially reduced (97.6% decrease). When the effect was added to less specific sub-bins (50% sequence similarity) the power of detecting the change at the bin level increased, but were still reduced (the power of detecting the change at the bin level was 23.7%.). These results confirm that effects introduced at more specific functional levels are likely to be diluted to the extent that they are hard to identify based on the quantitative information generated by traditional binning strategies. There is thus a significant gain of statistical power in performing the analysis at a more refined functional level, where the effect of dilution (i.e., the impact of the sequences that are not changing in abundance, or changing in another direction) will be reduced.
Changes in the abundance of biochemical functions and pathways between metagenomes reflects the microbial community response to a perturbation or change in environment. These changes are caused, explicitly or implicitly, by selection pressures acting on specific microbial phenotypes. The exact nature of these selection pressures is in most situations complex affecting both general and more specific biological functions in the microbial community. The stringency of the sub-bins derived by HirBin should therefore be based on underlying hypothesis and decided based on whether general or more specific effects are in focus. As a consequence, the stringency cutoff has been made adaptable and can be set by specifying the minimum amino acid similarity allowed for the functional domains defining each sub-bin. More stringent sub-bins (higher sequence similarity) are thus in general more homogenous, both sequence-wise and functionally. It is however important to note that the stringency of the sub-bins will affect the number of associated sequence reads, i.e., the coverage of the annotated regions. Since sub-bins are formed by splitting less specific bins and sub-bins, the number of sequence reads will decrease with increased stringency. Both analyzed datasets showed almost a ten-fold decrease in the average abundance of the sub-bins (75% sequence similarity) compared to the general bins (Additional file 7). The 112 bins that were not detected by HirBin at the 50% sub-bin level, but were significant at the bin level all had a relatively low average abundance, and are therefore harder to detect when splitting up the bins to sub-bins. The ability of detecting differential abundance is dependent on the total read count and an increased stringency will thus result in a reduction of the statistical power. Many modern metagenomes are however sequenced at high depths in order to ensure a satisfactory de novo assembly of the reference [29, 30] and it is very likely that this results in enough sequence reads for a suitable statistical power, even for the more stringent sub-bins. However, the reduction of power at more specific levels will not be as drastic for metagenomic datasets with a high number of biological samples. However, for less comprehensively characterized metagenomes with a few number of samples, the sub-bin stringency should be set with sequencing depth in mind. Furthermore, applying a high stringency cut-off may result in sub-bins formed around sample-specific gene variants. This phenomenon was seen in both human gut datasets analyzed in this study where the number of sub-bins classified as representative (present in at least 75% of the samples) decreased with increasing stringency (a drop from 15,740 to 10,798at a sequence similarity of 50 and 75% respectively). These results are in concordance with the previously reported high diversity of the human gut microbiome between individuals where the overlap at more specific taxonomic levels often is small [25, 26, 31]. Thus, metagenomes that are genetically diverse will, in general, exhibit a lower rate of stringent sub-bins present in the majority of the samples. The sequence similarity cutoff in the sub-binning step should thus be chosen in a way to assure that the clusters are large enough to be representative over the samples, but still specific enough to capture the effect at the right functional level.
Previous shotgun metagenomics studies, both from the human microbiome and environmental samples have shown that natural microbial communities are very complex and show a high diversity, both at the species and, consequently, the functional level [25, 32, 33]. In order to capture the change in abundance between different samples in the binning process it is therefore necessary to first capture the high diversity present in each functional domain. Profile databases such as PFAM and TIGRFAM have been specifically designed to be as broad as possible and thus to include a large number of taxonomically diverse gene variants [16, 17]. As a result, these domains are general classifications of functions. We show in this paper, that for many functional domains only subsets of the annotated sequences change in abundance, which means that the effect might be missed when comparing all the reads associated with that domain. The two-step procedure in HirBin makes it possible to detect those changes by first performing a broader annotation and then zooming in at more specific clusters of the functional domain (in the sub-binning step), giving a higher resolution in the binning process. It should, however, be emphasized that HirBin does not provide any refined functional information, since the sub-bins are annotated with the same function as their parent bin. HirBin will, however, identify functional differences that are only present at the higher resolution. For selected sub-bins it is possible to align the sequences to a reference database for further investigation of their function (can be done directly using the HirBin function blastSubbinSequences). For example, when the sub-bins for TIGR03537 in Fig. 3a are compared to GenBank (nr database) we found that the sequences in sub-bins 1 and 2 (at 50% sequence similarity) matched different variants of the aminotransferases (Additional file 8). All sequences in sub-bin 1 annotates as histidinol-phosphate transaminase (HisPAT, E.C. 126.96.36.199) involved in histidine biosynthesis  while the sequences in sub-bin 2 annotates as LL-diaminopimelate aminotransferase (LL-DAP-AT, E.C. 188.8.131.52), involved in lysine biosynthesis . This shows an example where HirBin was able to separate two functionally different gene variants, and identify one as increasing and one as decreasing in abundance between the studied conditions. It should be noted that the HisP aminotransferases in sub-bin 1 were annotated mainly in bacteria from the Bacteroidetes phylum (in Bacteroides sp. and Prevotella sp.), while the LL-DAP aminotransferases in sub-bin 2 were mainly found in the Firmicutes phylum (in Coprococcus sp., Clostridium sp. and Eubacterium sp.). The observed changes in abundance for these could therefore be an indirect effect of changes in the taxonomic composition between t2d and control and does not necessarily reflect a direct impact on the aminotransferases. We argue, however, that it is important to be able to differentiate between these variants in the binning process, in order to identify differentially abundant functions, and facilitating the interpretation of the data.
The ability of detecting differentially abundant bins and sub-bins is dependent on the reference sequences. For many metagenomes, comprehensive catalogs of representative genomes and genes are lacking or completely missing, and contigs assembled de novo from the data, are therefore typically used as references . The approach used by HirBin is general and can be applied to most types of references. HirBin can also be combined with different types of databases for annotation of the reference. For the datasets analyzed in this study, we used HMM-based annotation using the TIGRFAM database, which contains a comprehensive catalogue of bacterial functions. It is, however, possible to generate annotation based on other functional domains (e.g., PFAM and FOAM) or by BLAST-based sequence alignments against protein databases (e.g., KEGG Orthologies). Functional annotation of the reference sequences can be performed using the HirBin function functionalAnnotation, using a database of choice. The annotation can also be supplied by the user as a tab-separated annotation file with coordinates of any functional annotation. This makes it also possible to combine HirBin with many of the existing binning algorithms. Furthermore, the gene quantification in HirBin is done by matching reads against the reference sequence. HirBin can use a variety of mappers for this purpose, such as bowtie2 , or mappers using distributed computing for large-scale metagenomes, such as Tentacle . HirBin is thus highly adaptable and should hence be applicable to a wide range of metagenomes and experimental designs.
HirBin uses sample-wide clustering of the functional domains to identify the sub-bins. An alternative approach would be to instead cluster the individual reads matching each functional domain. This clustering problem is however substantially harder due to the short length of the reads generated by the current sequencing technology (most modern metagenomes have a read length between 100 and 150 bases) and the often high error rate . Furthermore, the number of reads are larger than the number of proteins, which also would increase the computational complexity. The clustering of functional domains in HirBin was done by UCLUST, which employs a centroid-based algorithm that identify clusters with sequences with a similarity above a specified cut-off . UCLUST highly efficient and capable to calculate the sub-bins within an acceptable timeframe (<1 h), even for a large number of protein sequences (there were e.g., 23 million functional domains in the data from Qin et al. ). UCLUST have been shown to have an overall good and robust performance when clustering a large number of sequences . It should however be pointed out that UCLUST does not perform hierarchical clustering and there is thus no guarantee that sub-bins at different sequence similarity cutoffs are formed as perfect subsets . For the datasets in this study, this effect was present, but in general minor where 92.5% of the sub-bins were perfect subsets of a less stringent sub-bin. Replacing UCLUST with e.g., agglomerative hierarchical clustering based on complete linkage, would result in sub-bins that are more homogenous and strictly monotonously decreasing with increased stringency . In order to compare UCLUST with hierarchical clustering we compared the number of sub-bins produced when clustering the sequences in 4 selected bins using both methods (Additional file 9). Hierarchical clustering resulted in a moderate increase in the number of sub-bins (on average 31%). However, the computational time was up to 2000 times longer using hierarchical clustering due to the high computational complexity (O(n^3) for agglomerative hierarchical clustering). Thus, UCLUST constitutes a good compromise between run-time and cluster quality and enables HirBin to be applied to very large reference databases.