RED HOT Contributors


SMRT genome assembly corrects reference errors, resolving the genetic basis of virulence in Mycobacterium tuberculosis


Genome assembly and methylation motif detection

Using the data from two sequencing runs (SMRTCells), the genome assembled with 217x average coverage into a single contig containing 4426109 base pairs after circularization and polishing. Applying the same protocol using data from only one of the two SMRTCells (103x average coverage) resulted in an identical sequence. Figure


shows sequencing coverage and GC-content as a function of genome position.

Fig. 1

Sequencing Coverage and GC-content by Genome Position. GC-content and coverage are shown in 1kb windows. The coverage plot refers to reads mapped to our assembly during the final polishing round. Reads with mapping quality values less than 10 were not used in polishing and are not counted here. Imposing linearity in the contig despite circularity of the genome creates mapping difficulties at the contig edges, resulting in irregularities in apparent sequencing coverage at these sites

Circularization was impeded by discrepancies in the edges of the contig, where an IS6110 insertion was present in only one of the two edges. It appears heterogeneously in our sample, as aligning our reads against our assembly shows that a minority of reads have interrupted mapping to this segment while the majority do not. With regard to base modifications, N6-methyladenine was detected in 99.67% of the instances of the partner sequence motifs CTGGAG and CTCCAG. The methylation of these motifs in both H37Ra and H37Rv was previously reported by Zhu and colleagues in H37Ra as part of their study of mycobacterial methylomes [13].

Direct comparison with the hitherto H37Ra reference genome

Comparison of our assembly with the H37Ra reference sequence (NC_009525.1, hereafter referred to as H37RaJH, for Johns Hopkins) showed significant variation. We found 33 single nucleotide polymorphisms (SNPs), and 77 insertions and deletions in our assembly with respect to H37RaJH (Additional file 1).

Structural variations

Two of the insertions with respect to H37RaJH were substantial structural variations: one was an insertion of IS6110 into the gene corresponding to Rv1764 and the other was an in-frame insertion of 3456bp into the PPE54 gene.

The insertion of IS6110 into Rv1764 (an IS6110 transposase) is unsurprising, as IS6110 insert frequently into that general region of the genome, as well as within their transposase [14, 15]. This insertion was the heterogeneous insertion responsible for the discrepant contig ends in our raw genome assembly. Such heterogeneity implies either a lack of selection pressure on the insertion in culture, a recent emergence of the insertion, or both.

The 3456bp insertion in ppe54 with respect to H37RaJH incidentally corresponds to a tandem duplication of a 1728bp sequence at the same site in H37Rv with 100% identity. The complete absence of this tandem repeat at this site in H37RaJH, however, is not necessarily an assembly error, as this is also observed in several clinical isolates (unpublished data). This, along with the 100% identity between each 1728bp duplicate of the tandem repeat with respect to H37Rv, lead us to believe that both the duplication in our sequence and the deletion observed in H37RaJH are instances of in vitro evolution, following the divergence of the lineages from which H37RaJH and our assembly were drawn. These two structural variations, or, at least, very similar structural variations, have been observed previously in virulent strains of M. tuberculosis, and therefore likely do not contribute to virulence attenuation in H37Rv (unpublished data) [14, 16], but shed light on in vitro evolution of this strain [8, 17].

Analysis of motif variants in H37Ra and H37Rv

With the knowledge that the CTGGAG/CTCCAG motifs are methylated in both H37Ra and H37Rv [13], we determined the motif variants, or sequence polymorphisms that create or destroy motifs, between H37Rv and H37Ra. By first comparing H37RaJH to H37Rv, we see that all but two motif variants were due to structural variations. Both of these variants instantiate the CTGGAG motif in H37Ra where it is absent in the H37Rv reference sequence. The first is due to the GT polymorphism at H37Rv position 2043284 (upstream of PPE30) in H37RaJH, but this variant is contradicted by our H37Ra assembly. The second is due to the TG polymorphism at H37Rv position 2718852 (upstream of nadD) and confirmed by our H37Ra assembly, yet also appears in CDC1551 and is a previously reported sequencing error in H37Rv [8] that has not been applied to the current reference. Based on these results, DNA methylation and motif variants do not play a role in the attenuation of virulence in H37Ra.

Status of previously reported “H37Ra-specific” polymorphisms

With our assembly, we aimed to replicate the study performed by Zheng and colleagues when they first assembled the H37Ra genome [5]. In their study, they compared their assembly with H37Rv, then filtered out variants also present in CDC1551 (NC_002755.2) to find mutations likely specific to H37Ra [5]. Zheng and colleagues identified a set of mutations in H37Ra unique with respect to H37Rv and CDC1551 as “H37Ra-specific”. These mutations fall within or adjacent to (which we term “affecting”) 56 genes in H37Rv, which we refer to as the high-confidence (HC) gene set. While comparing the variants, Zheng and colleagues also discovered sequencing errors in the H37Rv reference sequence [5], a number of which were corrected in NC_000962.3 [9], the version used in our study.

To see how well the HC genes are supported by our assembly of H37Ra, we determined variants with respect to H37Rv for our assembly and H37RaJH and performed set comparisons after excluding mutations shared with CDC1551 and other finished assemblies for H37Rv (H37RvBroad: NC_018143.2; H37RvSiena: NZ_CP007027.1; H37RvTMC102: NZ_CP009480.1) (Additional files




). We then categorized the HC genes as follows. We labeled a gene “contradicted” if all mutations affecting it were observed only in H37RaJH. We labeled a gene “supported” if all mutations affecting it were observed in both H37Ra assemblies. Otherwise, we labeled a gene “adjusted” if it had a different variant profile between H37RaJH and our assembly in a manner distinct from the two categories defined above. Figure


shows example classifications based on these criteria.

Fig. 2

Example Classification of Genes Based on Variant Comparisons. Considering the profile of H37Ra-specific variants (those with respect to H37Rv not also appearing in CDC1551), a given gene (blue arrow) is categorized as “supported”, “contradicted”, or “adjusted” by our H37Ra assembly as a result of comparison with the hitherto reference sequence NC_009525.1. The illustration shows examples of the different variant profiles a gene could have and their resulting classifications. Genes in the “supported” and “contradicted” categories are strictly those where our assembly either fully matches the H37Ra reference (supported) or the H37Rv reference (contradicted). Multiple factors may cause a gene to be classified as “adjusted”. Such genes may have variant profiles not fully meeting the criteria of “supported” or “contradicted”, or they may have novel H37Ra-specific variants observed only in our assembly

We first noted that two of the HC variants reported by Zheng and colleagues, those affecting nadD (Rv2421c) and nrdH (Rv3053), were included erroneously (Table 1 d). These variants were a TG mutation 44 bases upstream of nadD, at H37Rv position 2718852, and a 14bp deletion in the promoter of nrdH. These mutations, although confirmed by our assembly, also appear in CDC1551 and thus cannot be considered H37Ra-specific.

Of the variants in the remaining 54 HC genes, our assembly contradicts 36 (Table


), adjusts 4 (Table


), and confirms 14 (Table


). We then considered how these results affect the picture of how the genotypic differences between H37Rv and H37Ra give rise to the phenotypic differences observed between the two strains, which are discussed below and depicted graphically in Fig.


. As our analysis focused on the HC gene set reported by Zheng and colleagues [


], we did not re-evaluate whether additional genes and variants should belong to this grouping. We did, however, carefully consider all variants unique to our assembly (Table


, Additional file


) and their potential effect on the organism’s phenotype.

Fig. 3

Visualization of the Reduced Set of H37Ra-specific Variants and Their Effect on Phenotype. Our assembly contradicts many variants previously thought to be H37Ra-specific, reducing the number of genes that may contribute to H37Ra’s virulence attenuation. Several of these genes have been reassigned function since the first published assembly of the H37Ra genome [5], which is reflected in the figure. a The set of genes identified to carry H37Ra-specific polymorphisms in the original H37Ra genome publication [5] and their contribution to phenotype as understood at that time. 56 genes are affected, the majority of which were PE_PPE genes or were of unknown function. b The set of genes with H37Ra-specific variants confirmed by our assembly is reduced markedly, particularly in PE_PPE genes, highlighting the strength of single-molecule sequencing in resolving GC-rich and repetitive stretches of DNA. Genes with functions not yet characterized were also reduced significantly.*Though in a few instances this was because these genes’ function was characterized between 2008 and now, most were due to our assembly showing that they matched that of H37Rv and, therefore, are not H37Ra-specific. **For lpdA, the altered copy number in H37Ra was found not to be specific to the avirulent phenotype. However, the observed altered expression of lpdA in H37Ra may be due to altered regulation from PhoP. blue(bigstar )The H37Ra-specific variant(s) in these genes have been shown to confer a phenotypic change in H37Ra relative to H37Rv in wet-lab studies. For these genes, the mechanisms affected by the H37Ra-specific variant are illustrated in detail (see Fig. 4 for hadC and phoP). For other genes, their general function is described or briefly illustrated

Fig. 4

Cell Wall Differences in H37Ra and H37Rv. a State of knowledge following publication of H37RaJH. At this time it was known that the SNP in the DNA-binding site of phoP abrogated synthesis of sulfolipids (yellow) and acyltrehaloses (purple and red) of the mycomembrane outer leaflet, while two SNPs in pks12, both of which were refuted in our assembly, were thought to cause the observed lack of phthiocerol dimycocerosates (blue) in H37Ra. b Current state of knowledge. Advances were made in understanding the inner leaflet. A single nucleotide, frameshift deletion in the now annotated hadC gene was shown by Slama and colleagues [33] to alter the mycolic acid profile in three distinct ways: i. Lower proportion of oxygenated mycolic acids (K-MA and Me-MA; green and blue carbon skeletons, respectively) to α-MAs (orange carbon skeleton). There are seven Me-MAs depicted in H37Rv compared to three in H37Ra, reflecting the proportions reported by Slama and colleagues [33]. ii. Extra degree of unsaturation (red circles) in H37Ra mycolic acids due to truncation of the HadC protein in H37Ra. iii. Shorter chain lengths of mycolic acids in H37Ra. Note that Me-MAs have larger loops in H37Rv than in H37Ra, and that the height of the α-MAs is shorter in H37Ra than H37Rv. Carbon chain lengths are based on results reported by Slama and colleagues. The folding geometry of the mycolic acids is depicted in panel B, as described by Groenewald and colleagues [56], and inspired by the illustration style of Minnikin and colleagues [57]

Table 1

Status of Genes Previously Reported as Affected by H37Ra-specific Mutations

(a) Genes with all High-Confidence Variants Contradicted by our Assembly



Probable conserved integral membrane protein




PE-PGRS family protein PE_PGRS2




Probable dihydroxy-acid dehydratase IlvD (dad)


[48, 58]



PE-PGRS family protein PE_PGRS4




Possible conserved secreted protein

masks sequencing error in H37Rv




PE-PGRS family protein PE_PGRS7




Possible MarR-family transcriptional regulatory protein





PE-PGRS family protein PE_PGRS16




PE-PGRS family protein PE_PGRS20




PE-PGRS family protein PE_PGRS22


[36, 38] ∙



Probable PHOH-like protein PhoH2




PPE family protein PPE18


[25, 36] ∙



PE family protein PE15




PE-PGRS family protein PE_PGRS27




PPE family protein PPE30

SNV instantiates CTGGAG motif



Conserved hypothetical protein




Polyketide synthase Pks12




Class a beta-lactamase BlaC




RNA polymerase sigma factor, ECF subfamily, SigC


[18, 19]



PE-PGRS family protein PE_PGRS36

Likely pseudogene



Adenosine kinase

Synonymous mutation




PE-PGRS family protein PE_PGRS41




Probable transposase for IS6110




Conserved hypothetical alanine, arginine-rich protein




Conserved hypothetical protein




Conserved hypothetical protein




Conserved protein

Synonymous mutation




Probable transposase

labeled intergenic in H37RaJH




Conserved hypothetical alanine and proline-rich protein

labeled intergenic in H37RaJH




NAD(P)H quinone reductase LpdA

tandem repeat copy number variation




PPE family protein




PE-PGRS family protein PE_PGRS52


[36] ∙



Probable 3-hydroxyacyl-thioester dehydratase HtdY




PE-PGRS family protein PE_PGRS53


[37] ∙



PE-PGRS family protein PE_PGRS59


[36, 38] ∙



Hypothetical arginine and proline rich protein

One deletion also at ftsH -57bp


(b) Genes with Different H37Ra-specific Variant Profiles in our Assembly



Putative transposase of insertion element IS6110

disrupted by IS6110 in our assembly




PPE family protein

tandem repeat copy number variation

[36] ∙



PE-PGRS family protein PE_PGRS54


[3638] ∙



PE-PGRS family protein PE_PGRS57


[36, 37] ∙

(c) Genes with High-Confidence Variant Profiles Fully Confirmed by our Assembly



Probable conserved membrane protein




Possible conserved transmembrane protein




Probable peptide synthetase Nrp (peptide synthase)


[9, 24]



(3R)-hydroxyacyl-ACP dehydratase subunit HadA


[33, 38]



(3R)-hydroxyacyl-ACP dehydratase subunit HadC


[25, 33]



Member of Two-component response complex PhoPR


[18, 23, 25, 49, 58]



PPE family protein PPE13


[36, 38] ∙



Probable para-aminobenzoate synthase component I





Unknown protein




NTP Pyrophosphohydrolase, MazG


[31, 32, 48, 58, 60]



Probable phospholipase C 4 (fragment) PlcD


[8, 61]



PE-PGRS family protein Wag22


[36] ∙



PPE family protein PPE38

exact, adjacent duplication of PPE38

[25, 40] ∙



ESX-1 secretion-associated protein EspK.


(d) Genes with Variant Profiles Erroneously Declared as H37Ra-specific



Probable nicotinate-nucleotide adenylyltransferase NadD

SNV instantiates CTGGAG motif




Probable glutaredoxin electron transport component of NRDEF NrdH



Table 2

Genes with Variants in H37Ra Unique to our Assembly

Rv0279ca, b


Two substitutions

Both mutations are not specific to H37Ra

Rv0383ca, b


A459399C – 84bp upstream of Rv0383c

Potential sequencing error in H37Rv [8]

Rv1450ca, b


208bp inframe insertion




insertion of IS6110


Rv3303ca, b


174bp insertion 12bp upstream

Tandem repeat CNV



1728bp insertion

Tandem duplication with respect to H37Rv



multiple variants




multiple variants

Only two are H37Ra-specific

Accuracy of the H37Rv reference sequence

Ioerger and colleagues listed 73 polymorphisms (excluding those in PE_PPE genes) with respect to the H37Rv reference shared between six H37Rv strains from different laboratories, but considered all but one of them as errors in the reference sequence because they also appeared in the H37Ra reference [8]. The remaining polymorphism was a AC transversion at position 459399, a position upstream of Rv0383c masked by a 55bp deletion in H37RaJH. Interestingly, our assembly contradicts this 55bp deletion, but is in perfect concordance with the transversion at position 459399. The revelation that H37Ra is in fact the same as all H37Rv strains at this position invalidates the maximum parsimony tree in Fig. 1 of their publication [8]. Thus, through our improved assembly of the H37Ra genome, we have identified an additional error in H37Rv, the standard reference genome of M. tuberculosis.

SNPs previously reported to cause expression changes in H37Ra are contradicted by our assembly

Interestingly, SNPs in the putative promoter regions of two genes, phoH2 and sigC, found by Zheng and colleagues to be up-regulated in vitro and down-regulated in macrophage in H37Ra relative to H37Rv, were contradicted by our assembly [5]. Zheng and colleagues attributed this differential expression to these (now contradicted) SNPs, but it appears there instead must be a distal causative factor driving the observed expression changes of both genes. The SNP affecting sigC has been cited as the cause of the differential expression of SigC in macrophages relative to H37Rv [18, 19], illustrating how incorrect sequences can propagate through the literature.

SNPs previously thought to affect polyketide synthesis in H37Ra are contradicted by our assembly

Altered polyketide synthesis has been proposed as one of the primary mechanisms attenuating virulence in H37Ra, through disrupting phthiocerol dimycocerosate (PDIM) production, which has shown to manifest deleteriously in H37Ra [20, 21]. Our assembly contradicts both reported SNPs in pks12 (polyketide synthase 12) of H37RaJH. This means that some factor other than disruption of pks12 causes the observed lowered PDIM production in H37Ra. Thus, it remains unclear which (epi)genomic factor(s) underlie the observed reduction in PDIM synthesis in H37Ra, as supported variants (those in phoP and nrp) once considered to cause this reduction [22] have been shown not to [23, 24]. However, it is possible the decreased production of PDIMs is merely an artifact of repeated subculturing in vitro [17].

Variants in phoP, mazG, and hadC account for much of the virulence attenuation in H37Ra

Of all the HC genes, only variants in phoP, mazG, and hadC have been connected strongly with virulence attenuation in H37Ra through wet-lab work, each of which our assembly supports.

Of these, the most thoroughly studied is the nsSNP (S219L) in the DNA-binding region of phoP, part of the two component PhoPR regulatory system. There is an abundance of literature linking phoP to virulence attenuation in H37Ra, through several mechanisms, including disrupted sulfolipid and trehalose synthesis (Fig. 4), diminished ESAT-6 secretion, and additional downstream effects from altered expression of other genes under its regulon [5, 18, 23, 2530]. However, several of these studies also show that phoP alone [23, 29] is not responsible for virulence attenuation in H37Ra, but rather that the genomic cause behind virulence attenuation in H37Ra is multifactorial.

The second gene, mazG, has a nsSNP (A219E) in a region coding for a highly conserved alpha-helix residue in its protein product, a nucleoside triphosphate (NTP) pyrophosphohydrolase [5]. MazG exhibits diminished hydrolysis activity in H37Ra relative to both MazG in H37Rv and MazG of the fast-growing Mycobacterium smegmatis. Wild-type MazG hydrolyzes all NTPs, including those that are mutagenic and appear more frequently with oxidative stress (Fig. 3 b), which is experienced by the bacterium inside activated macrophages [31]. This decreased ability to hydrolyze mutagenic NTPs contributes to virulence attenuation in H37Ra [32].

In the third gene, hadC, there is a frameshift-inducing 1-bp insertion, which creates a premature stop codon and results in truncation of HadC. hadC is a member of the essential hadA-hadB-hadC gene cluster, which forms two hydratases (HadAB and HadBC) of the M. tuberculosis fatty acid synthase II system. Our assembly and H37RaJH both show a 5-bp insertion in hadA which, along with hadC, are the only genes with variants in H37Ra [33] that encode proteins known to be necessary for mycolic acid synthesis.

Recent complementation and knockout studies using hadC from H37Ra and H37Rv showed that intact HadC is necessary for cord formation, and that the truncated form H37R a/h a d C affects length and oxygenation of mycolic acids (Fig. 4 b). Furthermore, when tested in murine lung and spleen, H37R a/h a d C Rv grew an intermediate amount of colony forming units, between that of H37Ra and H37Rv, at a level commensurate with H37R v Δ h a d C which suggests that the H37Ra hadC variant underlies some of its virulence attenuation [33].

Interestingly, while both our assembly and H37RaJH harbor a 5-bp insertion in hadA, sequences obtained by Lee, Slama, and their respective colleagues do not [29, 33]. These two sequences were both derived from a culture from Institut Pasteur, while ours and that of Zheng and colleagues [5] were acquired directly from ATCC, which suggests that the two cultures diverged in vitro prior to sequencing despite sharing the same ATCC identifier. We expect the deleterious effects of h a d C Ra shown by Slama and colleagues would be exacerbated by the 5bp insertion in our assembly, as it results in an abnormal HadAB enzyme which, when normal, has been posited to compensate for faulty HadBC [33]. However, the experiments discussed above indicate that the hadC variant alone is sufficient to attenuate virulence, and is one of the primary sources of attenuation in H37Ra.

Copy number variation in lpdA promoter

The polymorphism reported in H37RaJH that affects lpdA (NAD(P)H quinone reductase) is a third (as opposed to the two in H37Rv) 58bp repeat in its promoter region. Our assembly reveals an additional two copies of this 58bp region, resulting in a total of five copies of the repeat. LpdA has been shown to protect bacilli from oxidative stress and improve M. tuberculosis survival in a mouse model [34]. However the copy number of this tandem repeat in our assembly matched two of the H37Rv assemblies—H37RvBroad and H37RvTMC102—meaning this copy number variation is not specific to the avirulent strain and does not contribute to the phenotype of H37Ra. Despite the contradiction of this copy number variation being H37Ra-specific, the observed differential expression of lpdA with respect to H37Rv [5] may contribute to virulence, perhaps through altered regulation by PhoP, as lpdA is under its regulon [30].

Variants affecting uncharacterized hypothetical genes

Several genes classified with unknown or hypothetical functions were among the HC genes of H37RaJH (Table 1). Our assembly contradicts all variants in the majority of these, leaving three which we supported in full.

Though none of these genes have an implicated role in virulence in the literature, they may in reality. These genes should be investigated, as they are three of the few supported HC genes yet to be explored. The value of exploring hypothetical genes is evidenced by the recent discovery of a significant contribution of HadC [33]—the function of which was unknown when H37RaJH was published—to virulence attenuation in H37Ra (Fig. 4).

Significant reduction of H37Ra-specific variants in PE_PPE genes

The PE_PPE family of genes is unique to mycobacteria but poorly characterized, both functionally and genomically, in M. tuberculosis, the latter owing to the family’s high-GC content and repetitive nature [35]. Evidence for contribution from PE_PPE family members to virulence has amassed support since 2008[3639], but this gene family was the most drastically altered by our assembly: while PE_PPE genes comprise approximately 10% of the genome, they account for nearly half (16/36) of the contradicted genes. It is likely that the majority of these are errors in H37RaJH rather than manifestations of hypervariability, as few PE_PPE genes fell into the adjusted or novel categories, as one would expect if they were due to hypervariability.

Consequently, some extant work examining polymorphic PE_PPE genes between H37RaJH and H37Rv is invalidated by our assembly. For example, our assembly contradicts or changes the variant profile of all four PE_PPE genes reported to be positively selected for in H37Ra in an evolutionary genomics study by Zhang and colleagues [38] using H37RaJH.

Another study affected profoundly by our results is that of Kohli and colleagues [36], which used H37RaJH and H37Rv in an in silico genomic and proteomic comparison of PE_PPE family genes. Though our assembly renders much of the results from their analyses invalid, applying their methodology to our updated assembly would yield interesting results.

Our assembly contains polymorphisms in 6 of 22 genes that encode PE_PPE family members reported as unique to H37RaJH (Table 1, Fig. 3 b). Of the three PE_PPE family members fully corroborated by our sequence, one was the duplication of ppe38, which McEvoy and colleagues have also identified in 3 different samples of H37Rv, suggesting this duplication likely plays no role in virulence [40]. All 3 of the adjusted PE_PPE family members, as well as the supported Wag22 and PPE13, belong to PE_PPE sublineage V. Sublineage V members comprise the majority of PE_PPE proteins that interact with the host, and are overrepresented in proteomic studies of in vivo infection [35]. This enrichment of subfamily V PE_PPE family members in the set of supported or adjusted genes suggests they may be more integral to virulence attenuation in H37Ra than other PE_PPE family members. The role of PE_PPE family members in virulence should become better understood as more genomes are sequenced using third-generation platforms.

In addition to the differences due to sequence alterations in PE_PPE family genes, the corroborated polymorphism in phoP may confer altered expression of many PE_PPE family proteins, as at least 13 are under its regulon [35], which could mediate some virulence attenuation.

The precise roles of PE_PPE family members have yet to be elucidated in full. It is difficult to evaluate rigorously the effect of each PE_PPE variant, as their function in wild-type M. tuberculosis is poorly characterized [35]. Moreover, their contribution to virulence may well require complexities of the native host environment beyond what can be replicated in vitro or ex vivo with current technology. Thus, the role the polymorphisms in this family play in the phenotype of H37Ra compel further research, which our reduction of variants has made more tractable.


We Support OUR Contributors

Get Our Newsletter

 Receive podcast updates
Exclusive insights
Patient Engagement Tips from industry experts
We hate SPAM as much as you do and promise to keep your email address safe.
  • Subscribe to the Podcast