The quality of a metagenomic assembly is assessed by its similarity to the reference set at either the nucleotide or protein level. As metagenomic assemblies often contain multiple contigs for the same reference, unique sets of nucleotide level contigs do not necessarily contain the same protein information, as ORF prediction can be disrupted at the edges of a contig. We thus aim to assess the assemblers in both contexts, as different metagenomics experiments may desire accuracy for different information.
The real mock communities were utilized to test the ability of an assembler to find low abundance species, while having an equal abundance community present as a control. Species missing from both communities do not necessarily indicate a difficulty in estimating the abundance of scarce organisms, but rather a difficulty in assembling the sequence for the species.
Within the balanced community, all assemblers covered a similar number of bases and PRG. However, while Omega, metaSPAdes and IDBA-UD had much larger longest contigs, MEGAHIT has the most stable contig size, evidenced by its N50 almost equaling its largest contig, and large linear range (Fig. 2b). While there is no clear choice for the best assembler for nucleotide level information for the balanced community as metaVelvet had a much lower number of misassemblies than the other community despite having lower scores in the other metrics (Fig. 2), Omega makes 5–14 fold more errors than the other assemblers, making its output uninformative.
To understand the ability of each assembler to identify low abundance organisms, we compared the staggered community performance to the balanced community performance. When shifting from assembling the balanced community to the staggered community, metaSPAdes has a comparable number of misassemblies (Fig. 2c), a 600 KB shorter longest contig (Fig. 2a), 20 M less bases in its assembly (Fig. 2b) and 30% less PRG (Fig. 2d). IDBA-UD also had a performance drop when used on the staggered community, though it was less of a drop than metaSPAdes; IDBA-UD’s longest contig fell by 400 kb (Fig. 2a), but had a larger increase in misassemblies than metaSPAdes (Fig. 2c). In comparison to IDBA-UD, metaSPAdes appears to capture longer and more accurate nucleotide level information. It is also important to note that MEGAHIT has a large PRG (on par with metaSPAdes) despite its short longest contig and smaller N50 than metaSPAdes.
In the balanced communities, MEGAHIT and metaVelvet both have shorter longest contigs than metaSPAdes and IDBA-UD (Fig. 2a), but cover a comparable number of nucleotides in their assemblies (Fig. 2b). The shift to assembling a staggered community causes MEGAHIT to make the second most misassemblies of all assemblers tested (Fig. 2c). However, MEGAHIT’s PRG is similar to that of metaSPAdes (Fig. 2d). Despite metaVelvet having the smallest contigs, its N50 remains unchanged by the staggered community (Fig. 2b). By examining the correlation of PRG with true species abundance, we are able to see that metaSPAdes and MEGAHIT are capturing species across all abundances, while IDBA-UD misses a few at low abundance.
MetaVelvet, on the other hand, has a lower PRG yet detects species in low abundance well (Figs. 2d and 5), indicating that it is missing information from abundant species. The number of misassemblies for metaVelvet decreases when shifting to the staggered community as well. There are two possible explanations: metaVelvet is skipping lowly abundant species, thus not capturing their sequence and reproducing the same errors as in the balanced community; or metaVelvet is missing the low-abundant species and thus not incorporating them into chimeric contigs, thereby both missing some sequence data and skipping chimeric contigs as compared to the balanced community. Regardless of the cause, metaVelvet captures the most accurate nucleotide level information for scarce species, albeit in small chunks. A tool that combines both metaVelvet and metaSPAdes may result in the longest and most accurate contigs for low-abundant species.
In a separate pattern from the other four assemblers, Omega has the largest longest contig size in both the balanced to the staggered communities (Fig. 2a), yet a PRG in-between that of metaVelvet and the other assemblers (Fig. 2d). Furthermore, the number of misassemblies in Omega remains far above (5–24×) the others in both the balanced and staggered communities (Fig. 2c). The low PRG combined with the high number of missassemblies, large contig size and large size of misassembled contigs (Additional file 1: Tables S4 and S5) indicates that Omega is potentially over scaffolding, similar to metaSPAdes, yet only capturing a small amount of the population, similar to metaVelvet. This combination indicates that Omega captures a small, yet highly erroneous, portion of the community. The high number of errors may be due to the overlap graph approach of Omega.
To understand how well assemblers can delineate strains of the same species, synthetic communities of multiple strains from the same species of microbes in balanced abundance were simulated. Unsurprisingly, the assemblers did not perform as well on these communities than the previous mock communities (Fig. 3). While metaSPAdes continued to have the largest contigs and N50, MEGAHIT consistently had the largest assembly size and largest PRG. The number of misassemblies appears to depend more on the species being assembled than the assembler being used since the number of misassemblies per community is close across assemblers, except for E. coli with IDBA-UD, which created 4 times as many misassemblies as metaSPAdes. Thus, MEGAHIT is an excellent choice for recovering the different serotypes within a microbial community. MEGAHIT, for example, would be ideal for detecting a particular pathogen in a community of similar but non-pathogenic species.
To evaluate the effect of the breakpoints between contigs generated by the assemblers on protein abundance prediction, we used MetaGene to call ORFs from assembled contigs. The only reads used for assembly were those that came from the reference genome, therefore, only two types of ORFs can be predicted by MetaGene: 1) ORFs from the reference data set that were assembled correctly; or 2) aberrant ORFs, which are not present in the reference. These ORFs are the only possibilities as it is not possible for an ORF that is absent from the reference to be a novel and correct ORF since the BEI mock community is comprised of organisms with known complete references, and the single species communities were simulated data sets.
In the balanced community, the assemblers can recapitulate over 99% of the reference ORFs, and only vary by 1% for aberrant ORFs; the staggered community, however, has a larger disparity (Fig. 4a). Overall, as points shift to the right on the X-axis, they also shift down the Y-axis, indicating a relationship between the number of aberrant ORFs and the number of missed reference ORFs. The increase in the number of aberrant ORFs, however, is much larger than the number of missing reference ORFs.
While metaVelvet creates accurate contigs (Fig. 3d), the number of breakpoints within the contigs causes a large loss of reference ORFs from the data set. MetaVelvet does, however, creates the smallest number of aberrant ORFs. MetaSPAdes has the least number of missing reference ORFs, and the most number of aberrant ORFs. This relationship is complementary to our previous notion that metaVelvet, while having a much smaller amount of the metagenomic data set covered by its contigs, has a much higher quality in the assembly for low-abundant microbes. Similarly, metaSPAdes, while capturing the most information, is highly prone to making mistake in low abundance organisms during its scaffolding process.
The trade-off of a larger change in the number of aberrant ORFs created than the number of reference ORFs found is apparent in the single species communities as well (Fig. 4b), though the ordering of accuracy within assemblers is shifted. Notably, the organism being assembled has a much larger role in the capability of an assembler to accurately assemble ORFs than the assembler itself. Despite the large role species plays in assembler accuracy across all communities, metaSPAdes consistently misses the largest number of ORFs from the reference, and metaVelvet captures the highest number of ORFs from the reference. IDBA-UD had a large change depending on the community, having the lowest number of aberrant ORFs for B. fragilis, but the largest by a wide margin for S. aureus. Over all communities, MEGAHIT is consistently in the middle or the lowest, furthering its prowess for strongly related community assembly.
We also assessed how sensitive each assembler was to the relative abundance of the organisms present in its ability to successfully reconstruct the expected ORFs. This analysis was done by comparing the relative abundance of each species relative to the absolute number of missing ORFs from that species for each assembler (Fig. 5, top). MetaSPAdes is the most linear with its drop in performance with low abundant species, whereas MEGAHIT and IDBA-UD both have a large, quick drop at mid abundance. MetaVelvet has a bimodal distribution, with a large number of missing ORFs at low abundance, and then almost no missing ORFs at high abundance. These results further metaSPAdes as a strong choice for ORF prediction in diverse communities where important functions might only be found at low abundances, while also suggesting that metaVelvet might be appropriate for ORF prediction in the case where one favors accurate information for the most prevalent functions in the community.
Finally, some efforts have examined functional capabilities of a community as a whole. It is extremely difficulty and infeasible, however, to accurately measure a community’s protein abundances for ORF abundance comparison. We thus used a proxy to measure how each assembler distorts the true abundances of ORFs. To do so, we used the concordance of species coverage measured by mapping reads to the joint reference genomes with the average coverage of ORFs called by the assembler for each species. All assemblers recapitulate the mock balanced community to relatively the same abundances, with identical CODs (r 2 = 0.99, Additional file 2: Figure S1, bottom). We expect this similarity due to the similarity between all previous metrics examined for the balanced community.
There is a small difference, however, between an assembler’s ability to determine the relative abundances of species within the staggered community. This difference mirrors the ability of each assembler to recreate reference ORFs. MetaSPAdes is able to most accurately reproduce the relative abundances compared to IDBA-UD, MEGAHIT or metaVelvet (Fig. 5, top). Though the difference in the COD is quite small for metaSPAdes, IDBA-UD and MEGAHIT, metaSPAdes has more normally distributed ORF coverage profiles at the lower abundances than IDBA-UD and MEGAHIT, indicating it is finds a more consistent abundance across the ORFs it reassembles. Furthermore, it misses no species, while MEGAHIT and IDBA-UD each miss one, and only call a single ORF for another (Fig. 5, bottom).