In this study we compared mRNA and protein expression across diverse datasets: mouse inner ear tissues, mouse organs, cancer cell lines and primate lymphoblastoids. We observed that the correlations in protein expression between groups are higher than the correlations in mRNA expression, across all datasets. It was previously observed that across taxa protein levels are more conserved than mRNA levels. We showed this phenomenon across tissues as well, and explained it by changes in the transcript level that are attenuated at the protein levels. A direct outcome of this phenomenon is the compression of large differences in mRNA expression to smaller ones in the protein domain. This is the first observation of this phenomenon for non-proliferating tissues, though it was previously seen in proliferative ones. Moreover, the aforementioned studies used OLS regression, which is known to suffer from a strong dilution bias. Using the more robust MA regression instead, we provided evidence for such compression in EAR, PRIMATE and in MMT (except for one tissue pair). In NCI60 and the brain-cerebellum pair [MMT] the regression results supported expansion, instead of compression.
When comparing tissues that are very similar in level of expression, small biases can render the regression invalid. In order to solve this issue, we tried a non-parametric approach, which can be less powerful but is not dependent on an underlying linear model. Using this approach, we showed buffering for all datasets except NCI60. We therefore conclude that a partial buffering between translation and transcription exists in the MMT, EAR, and PRIMATE datasets. For NCI60, the results were insignificant, and supported neither compression nor its opposite, signal amplification. Perhaps a more powerful test (for example, a random effects model) may provide the answer. For the PRIMATE dataset such an observation was made previously. In this study, by addressing some of the limitations of that statistical analysis, we reaffirmed the correctness of the observation (Additional file 2: Supplementary Results).
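The dilution bias of OLS mentioned above can be illustrated with a small simulation. The sketch below is not the analysis code of this study: the function names, the simulated log fold changes, and the noise levels are all assumptions chosen for illustration. It also takes MA to mean major axis regression, and assumes equal measurement error in the mRNA and protein domains, which is the condition under which the major axis slope is unbiased.

```python
import numpy as np


def ols_slope(x, y):
    """Ordinary least squares slope of y on x: cov(x, y) / var(x)."""
    cov = np.cov(np.asarray(x, dtype=float), np.asarray(y, dtype=float), ddof=1)
    return cov[0, 1] / cov[0, 0]


def major_axis_slope(x, y):
    """Major axis slope: direction of the first principal axis of (x, y)."""
    cov = np.cov(np.asarray(x, dtype=float), np.asarray(y, dtype=float), ddof=1)
    sxx, syy, sxy = cov[0, 0], cov[1, 1], cov[0, 1]
    if sxy == 0:
        raise ValueError("major axis slope is undefined when cov(x, y) is zero")
    diff = syy - sxx
    return (diff + np.sqrt(diff ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)


def simulate_log_fold_changes(n_genes=5000, true_slope=1.0,
                              signal_sd=2.0, noise_sd=1.0, seed=0):
    """Hypothetical data: noisy mRNA and protein log fold changes between two
    tissues that share one latent signal. All parameters are illustrative."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(0.0, signal_sd, n_genes)
    mrna = latent + rng.normal(0.0, noise_sd, n_genes)
    protein = true_slope * latent + rng.normal(0.0, noise_sd, n_genes)
    return mrna, protein


mrna, protein = simulate_log_fold_changes()
print("OLS slope:", round(ols_slope(mrna, protein), 3))
print("MA slope: ", round(major_axis_slope(mrna, protein), 3))
```

In this setting the true slope is 1 (no compression), yet noise in the mRNA measurements pulls the OLS slope towards zero, which would be misread as compression. The major axis slope stays close to 1 here, but only because the equal-error assumption holds in the simulation.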
We did not necessarily expect to see the same phenomena in cancer cell lines as in healthy tissues, for obvious reasons: cell lines are programmed to proliferate, whereas cells in healthy tissues divide slowly, if at all; cell lines somewhat lose their resemblance to their tissue of origin, thus becoming more similar to a “global cancer pattern”; and cell lines of the same origin may diverge in their transcriptomic and proteomic profiles as they follow different paths of cancer evolution. In addition, the post-transcriptional regulation may be altered or even damaged in cancer. We showed one manifestation of these biological differences, namely the lesser ability to separate NCI60 samples based on their origin, compared to the EAR and MMT datasets. Since the cell lines are more similar to each other in their expression profiles, the compression effect is expected to be less dominant in cancer.
A translational model has been proposed, where transcriptional signals are amplified by translational regulation. The existence of an amplifying mechanism might appear to contradict the buffering suggested here. However, the authors studied budding yeast, a single cell type. In this model an increase in the mRNA level of a transcript would translate into an exponential increase of the matching protein, while our analysis is based on multiple tissues. In each tissue the transcriptional, translational and post-translational regulations are fine-tuned to enable the correct function of the tissue. Both mechanisms can coexist, i.e., the expression profiles that we observed result from a balance between compressing and amplifying mechanisms. The first is related to the tissue identity (perhaps through epigenetic marks), and the second is connected to the way the translational apparatus of a cell functions. A very similar argument was made in , in the context of different species. We speculate that the contradicting evidence we observe for buffering in groups that are more similar to one another might be the result of such a balance; i.e., in such groups, the balance between the two mechanisms leans towards amplification.
What biological mechanism explains the buffering observation? Decoupling is achieved by changing the translation rates, the protein degradation rates, or both. We cannot distinguish between these three options using our analysis, yet according to the literature, protein translation is assumed to be the major contributor to the variance of protein concentration, and was shown to change through tissue differentiation. Hence we can speculate that the translation rate is the factor that is changing between the two tissues. In a different context, however, that of expression quantitative trait loci in LCLs, the buffering observed between protein and mRNA was attributed mainly to protein degradation. In Supplementary Results [Additional file 2] we discuss explanations from the literature [6, 7] as to how the coordination of translation and transcription is achieved, and demonstrate that alternative polyadenylation, one of the proposed mechanisms, plays only a minor role, if any, in this balance in the EAR dataset.
We acknowledge the possibility that mRNA measurement error might cause an overestimation of the buffering effect. It is well known that distinct tissues may contain different amounts of RNase, which degrades mRNA to different degrees and with different specificities. Given the impact mRNA integrity has on transcript quantification, these differences may result in measurement errors that are inconsistent between tissues. By using ribosome profiling data instead of RNA-seq measurements, one can avoid this problem altogether, and obtain more rigorous results. Another source of error is the number of amplification cycles and the precise PCR conditions used for each sample. We used Spearman’s correction to mitigate the between-replicate error, but we did not account for systematic errors between tissues. Tighter experimental controls, together with more elaborate statistical normalization techniques, can address this potential error.
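For readers unfamiliar with Spearman’s correction for attenuation, a minimal sketch follows. It divides the observed correlation by the geometric mean of the two reliabilities, each estimated as the correlation between replicates. The two-replicate layout, the function names, and the simulated values are assumptions made for illustration and do not reproduce the pipeline used in this study.

```python
import numpy as np


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


def disattenuated_correlation(x1, x2, y1, y2):
    """Spearman's correction: r_xy / sqrt(r_xx * r_yy).

    x1, x2 are two replicate measurements of one variable (e.g. mRNA),
    y1, y2 two replicates of the other (e.g. protein). The observed
    correlation is averaged over the four cross-replicate pairs so that
    it refers to single measurements, matching the reliabilities."""
    r_xy = np.mean([corr(x1, y1), corr(x1, y2), corr(x2, y1), corr(x2, y2)])
    r_xx = corr(x1, x2)  # reliability of x, from its replicates
    r_yy = corr(y1, y2)  # reliability of y, from its replicates
    return r_xy / np.sqrt(r_xx * r_yy)


def simulate_replicates(n=20000, rho=0.8, noise_sd=0.7, seed=1):
    """Hypothetical data: two latent variables with correlation rho,
    each measured twice with independent noise."""
    rng = np.random.default_rng(seed)
    tx = rng.normal(size=n)
    ty = rho * tx + np.sqrt(1.0 - rho ** 2) * rng.normal(size=n)

    def noisy(t):
        return t + rng.normal(0.0, noise_sd, n)

    return noisy(tx), noisy(tx), noisy(ty), noisy(ty)


x1, x2, y1, y2 = simulate_replicates()
print("raw correlation:      ", round(corr(x1, y1), 3))
print("corrected correlation:", round(disattenuated_correlation(x1, x2, y1, y2), 3))
```

The correction only removes noise that is independent between replicates. An error shared by all replicates of a tissue, such as tissue-specific RNA degradation, inflates the estimated reliability and is left uncorrected, which is the limitation noted above.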
We demonstrated how the prediction of protein levels can be improved by taking the range compression into account. Models that allow PTR to vary between tissues in a direction that buffers the change in protein levels (RFCB) performed better than models that did not allow this variation or ignored RNA levels altogether. The improvement in the prediction error was between 9 and 24%, depending on the dataset. The largest improvement was achieved in the EAR, but in this dataset the prediction was very good to begin with. In the PRIMATE dataset the smaller improvement of 14% can make a large difference in the prediction quality. This enhanced ability to predict protein levels can be utilized, for example, to better predict disease status using machine learning. The higher accuracy exhibited by the RFCB method in the prediction of the NRAS protein level in breast cancer cell lines supports its usage in disease status evaluation, as overexpression of NRAS is associated with poor prognosis in breast cancer. In the future, as understanding of the mRNA-protein relationship improves, more sophisticated prediction tools can be developed that will be aware of this mechanism and explore different features of it (for example, whether it saturates at higher mRNA expression levels).
If buffering worked in the linear fashion captured by the FCB model, and the noise level was similar in the measurements of protein and mRNA, we would expect the correlations between tissue pairs in the protein and the mRNA domains to be almost equal. We observed, however, that the correlations in the protein domain were higher. This is a surprising finding, especially in light of the higher noise level in protein, suggesting that a more powerful nonlinear buffering model could be described. Another support for a stronger buffering comes from the number of DE genes we found, which was much higher in the mRNA domain. As mentioned, the protein measurements are slightly noisier, though probably not to the extent that justifies these high differences.
In the enrichment analysis we observed that the functionalities represented at the protein domain were, by and large, a subset of the functionalities represented at the mRNA domain, which were far more numerous. The fact that we find fewer enrichment categories in protein is partially explained by the missingness pattern in the protein measurements: we have less chance to detect categories in which some or all of the genes are lowly expressed in the protein domain (or characterized by low detectability by MS). Focusing on the subset of genes with full measurements in protein allows a fairer comparison, but nearly ignores the possible differences between those ‘low expression’ categories. In that comparison we found a similar number of enrichment categories for protein and mRNA. The lists differ greatly; however, we notice that the categories that were found in the protein and not in the mRNA were represented in the analysis of the full, non-filtered mRNA data. We can conclude that all the functionalities that are represented in the protein are also evident in the mRNA data. For the opposite direction it is much harder to tell; to accurately answer this question we need to somehow predict the missing values in the protein, or develop an enrichment analysis tool that is aware of the ‘missing not at random’ nature of the data.
Why does one tissue maintain higher mRNA levels but the same protein levels compared to another, when such practice requires more energy from the cell? We suggest that functionally distinct tissues possess different mRNA profiles but similar protein profiles at rest, as part of a preparation for a stimulus. Under some stimulus a translational inhibition is removed from a gene (or group of genes) that is DE between the tissues only at the mRNA domain, so that the tissue that possesses higher levels of the gene’s transcript will synthesize the protein faster. Indeed, one of the virtues attributed to translational control is the possibility of rapid response to external stimuli. Moreover, when exposing mammalian cells to stress induced by dithiothreitol, mRNA- and protein-level regulation contribute equally to the change in protein expression, demonstrating the importance of protein-level regulation under stress. If our suggestion is correct, it might be beneficial to measure both mRNA and protein levels in order to deduce functionality of genes. If a gene is DE at the protein domain, then the protein is important to the function of the resting tissue. If a gene is DE only at the mRNA domain, then it is required for the tissue functionality under some stimulus.
The fact that the vestibular up-regulated genes are enriched for response to stimulus and chemicals only in the mRNA domain might be a manifestation of this hypothesis, as a role for these responses in the normal development of the ear is not known. Also fitting this hypothesis are the multiple immune related terms found in the mRNA domain, in the analysis of the non-filtered data. Nevertheless, the absence of these terms from the protein analysis might be related to a relatively low expression of the genes in these categories. In the MMT analysis we see a similar pattern. Response to stress terms are enriched in mRNA data and not in protein, and those of immune system response are unique to mRNA. In the literature we can find examples where the translational regulation of genes changes in response to heat shock, hypoxic stress, changes in iron concentration, and exposure to EGF. It would be interesting to explore whether the genes activated in these responses are highly expressed in the mRNA domain, compared to a tissue that is not normally subjected to these types of stress, even before the actual exposure.