General patterns

As 85% of the maize genome is repetitive sequence [26, 49], and 63% is structurally recognizable TE sequence [48], TEs contribute more to the maize genome than sequence that is uniquely ‘maize.’ As in most plant genomes [11], retrotransposons contribute more base pairs to the maize genome than do DNA transposons (Table 1 and Fig 2B). This is a consequence of the high number of copies (Fig 2A) and the large size of individual retrotransposons (Fig 2C), likely due to a ‘copy and paste’ replication mode that leaves existing copies intact when generating new copies. Also as in other plant genomes [57, 58], several superfamilies of DNA transposons in the maize genome are found closer to genes than are retrotransposons (Fig 2B). This is likely due to targeted insertion into euchromatic sequences [59, 60] and to differences in removal through natural selection after insertion [61, 62].

TE superfamilies

The bulk of TE sequence is often described at a finer scale, that of individual superfamilies of TEs. Each TE superfamily defined in the maize genome has representatives across the tree of life [63–65], suggesting an ancient origin of these genomic parasites. Some superfamilies have retained dramatic and consistent differences in their spatial patterning across chromosomes over hundreds of millions of years. For example, the superfamily RLG is enriched near centromeres in all plants [66–68] including maize (Fig 3A), highlighting a genomic niche that allows long-term survival near the centromere. Similar patterns exist at deep time scales for DNA transposon superfamilies, which preferentially insert near genes in both monocots and dicots [57, 58, 69, 70] and in maize are enriched on chromosome arms where genes are concentrated (Fig 3A).

These patterns likely reflect the evolution of different ecological strategies of TEs in the genome. Kidwell and Lisch (1997) [7] described two extremes to the ‘ecology of the genome’: one, a TE that preferentially inserts far from genes, into low recombination heterochromatic regions, and a second, risky TE that inserts near low copy sequences and is more likely to disrupt gene function. We observe these extremes at play in the maize genome, in that LTR retrotransposons dominate the heterochromatic space, with over half of all copies greater than 16 kb from a gene (Fig 2B), and most copies heavily methylated (Fig 6A, 6B and 6C). The alternate strategy also exists in the maize genome, with risky insertions near genes and transcribed regions seen for several TIR superfamilies. For example, over half of Mutator transposons (DTM) are found within 1 kb of a gene (and over one quarter of DTM within 100 bp of a gene) (Fig 2B). This likely results from the preferential insertion of DTM elements upstream of genes [30, 60, 71, 72]. We note that the TIR copies we identify are farther from genes (17.2 kb) than previously reported for grass genomes [49, 58]. We believe this may be because previous analyses relied on preferential assembly and identification of genic TEs; indeed, subsetting to the 893 TIR families found in the Maize TE Database [49] results in a much reduced 1.6 kb median distance to genes. On a genome-wide scale at the level of all TEs, the spatial patterns we observe could result from either preferential insertion or differential removal by selection after insertion. Further characterization of these ecological strategies will be facilitated by investigating TE polymorphism across maize individuals [33, 73] and de novo recent insertions that selection has not yet acted on [74].
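The distance summaries discussed above can be illustrated with a minimal sketch. It assumes TE copies and genes are available as simple coordinate tuples on the same chromosome; the function names, the input layout, and the toy coordinates are hypothetical and are not taken from our pipeline. Distance is defined here as the gap between the nearest interval edges, with overlapping features given a distance of 0.

```python
from statistics import median

def distance_to_nearest_gene(te, genes):
    """Gap in bp between a TE (start, end) and the closest gene (start, end).

    Overlapping intervals get distance 0. Every gene on the chromosome is
    scanned, which is slow but keeps the sketch easy to check by hand.
    """
    te_start, te_end = te
    return min(
        max(0, g_start - te_end, te_start - g_end)
        for g_start, g_end in genes
    )

def median_distance_by_family(tes, genes_by_chrom):
    """Median distance to the nearest gene for each TE family.

    tes: iterable of (family, chrom, start, end) tuples (hypothetical layout).
    genes_by_chrom: dict mapping chrom -> list of (start, end) gene intervals.
    """
    per_family = {}
    for family, chrom, start, end in tes:
        d = distance_to_nearest_gene((start, end), genes_by_chrom[chrom])
        per_family.setdefault(family, []).append(d)
    return {family: median(ds) for family, ds in per_family.items()}

# Toy data, invented for illustration only.
genes = {"chr1": [(1000, 2000), (50000, 52000)]}
tes = [
    ("famA", "chr1", 2100, 2300),    # 100 bp downstream of the first gene
    ("famA", "chr1", 900, 950),      # 50 bp upstream of the first gene
    ("famA", "chr1", 1500, 1600),    # inside the first gene
    ("famB", "chr1", 20000, 21000),  # far from both genes
]
medians = median_distance_by_family(tes, genes)
```

Grouping by family rather than superfamily in such a summary is a deliberate choice: two families in the same superfamily can have very different medians, which a pooled superfamily median would hide.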

TE families

While superfamily level observations are useful for gaining an overview of the distribution and survival of TEs in a genome, more detailed study on a time scale relevant to the evolution of the genus Zea comes from studying TE families. Maize TE families are shared with closely related host species, but the number of shared families rapidly decreases with phylogenetic distance. Many families are shared with congeners Zea diploperennis [75–77] and Zea luxurians [78], but few families investigated are found in maize’s sister genus Tripsacum (1 mya divergence; [79]) [75–77, 80, 81], and the only families shared between maize and Sorghum (12 mya; [55]) are shared only as a result of horizontal transfer events between the species [82]. This suggests that in order to understand TE evolution at a timescale relevant to maize as a species, it is essential to understand families of TEs, rather than the aggregate properties of superfamilies or orders.

Indeed, our family-level analysis also reveals patterns obscured when TEs are averaged together at the level of superfamily. For example, despite the fact that the RLG superfamily is enriched in centromeric and pericentromeric domains (Fig 3A), the second largest family RLG00003 (homologous to the RLG family huck [83]) is predominantly found on chromosome arms (Fig 3D). While many RLG elements contain a chromodomain targeting domain in their polyprotein [84] allowing targeted insertion to centromeres, RLG00003 does not (S12(G) Fig). This lack of a chromodomain may explain a proximal cause of the observed niche of RLG00003, although other factors are certainly at play, as other families with centromeric enrichment also lack chromodomains (S12(G) Fig). DNA transposons are also best described at the family level. While Mutator (DTM) elements are found a median distance of 2.5 kb from genes (Fig 2B) and have long been observed to target insertions near genes in maize [30, 60, 72], the second largest family, DTM13640, is found a median distance of 34 kb away from genes (Fig 2B). The mechanism for gene targeting seems to be mediated through recognition of open chromatin [60, 71], but precise details of the targeting are unknown. Further investigation into the families that insert near and far from genes may pinpoint how their molecular mechanisms of targeting may differ.

Furthermore, the timing of transpositional activity varies extensively between families. Most TE families in maize acquired most of their insertions within the last 1 million years (Fig 4). Some TE families have bursts of activity, punctuated by a lack of surviving new insertions, while others appear to be headed towards extinction. All of these timings are much more recent than allopolyploidy in maize (≈ 12 mya) [55], and families show little subgenome bias in their distribution (S9(B) Fig), suggesting that these represent lineages evolving within maize.

Maize was domesticated from teosinte (Zea mays subsp. parviglumis) 9,000 years ago [85, 86]. It is tempting to address the contribution of TEs to this major transition, especially given the contribution of TE insertions to maize domestication and improvement [87–89]. Although we caution that mutation rates and estimation can complicate ascertainment (see below), 46,949 TEs across all 13 superfamilies have an estimated age of less than 9,000 years, and 24,630 TEs have an estimated age of 0. This suggests that transposition has been ongoing since the divergence of maize from its wild ancestor, but we caution that we lack appropriate confidence intervals for these estimates, especially as non-zero age requires observing at least one nucleotide mutation.

The family-level ecology of the genome

It can be difficult to predict exactly why a particular TE family differs from other families. Community ecologists aim to understand the environmental factors that give rise to the observed diversity of organisms living in one place, including not just features of the environment but also interactions between species. TE families are analogous to species in the genomic ecosystem, and because the genomic environment a TE experiences is constrained to the cell, TEs are forced to interact in both time (Fig 4) and space (Fig 3). We predict each family of TE is adapted to its genomic ecological niche, where the genomic features we measure represent the environmental conditions and resources limiting a species’ ecological niche [90]. TEs additionally can act as ecosystem engineers, modifying the environment they insert into, and generating new habitat for future colonization [10, 44].

In the genomic ecosystem, we can observe interactions between species much like we would see in a traditional ecosystem. We see a number of patterns, including cyclical dynamics of TE activity through time for several families, sustained activity through time, and a reduction in new copies towards the present (Fig 4 and S4 Fig). This means that the genomic environment a newly inserted TE experiences is affected by the activity and abundance of all other TE families in the genome. At one extreme, members of the same family can even encode different proteins required for retrotransposition in different TE copies, where both types are required to be transcribed and translated for either to transpose. Such a system approaches a mutualism, where the success of one type depends on another. Previous knowledge of these systems was limited to the maize retrotransposon families Cinful, which codes for polyprotein domains, and Zeon, which codes for GAG [25] (represented here by a single family, RLG00001). This strategy has been successful in maize; RLG00001 [48, 77], for example, makes up 135 Mb of sequence. Sorghum, in contrast, has a genome the size of maize [91] and lacks homologs to RLG00001. Such symbiotic relationships within a TE family have been thought of as remarkably rare [92, 93]; however, we identify 25 LTR retrotransposon families where GAG and POL protein domains are found in separate TE copies but less than 1% of copies contain both, suggesting that this pattern is much more prevalent than previously described. These types of elements are best classified as subtypes of a single family, because the cis components of the LTR are recognized by protein domains of both GAG and POL proteins, leading to homogenization of sequence signals. As noted by Le Rouzic et al. (2007) [92], symbiotic TE families face a major barrier in being horizontally transferred, as both copies must be transmitted through an already rare process. Their prevalence in the maize genome thus supports instead a long term coevolution of the maize genome and the TEs that live within it, specializing and diversifying with different ecological strategies.
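The criterion for such split families (GAG and POL present in separate copies, with fewer than 1% of copies carrying both) can be written as a short sketch. It assumes each TE copy has already been annotated with a set of protein domain labels; the labels "GAG" and "POL", the function name, and the toy families are hypothetical and only illustrate the rule as stated in the text.

```python
def is_split_family(copies, max_both_fraction=0.01):
    """Return True if GAG and POL occur in separate copies of a family.

    copies: list of sets of domain labels, one set per TE copy
            (hypothetical annotation format).
    A family qualifies when at least one copy has GAG without POL, at least
    one copy has POL without GAG, and the fraction of copies carrying both
    is strictly below max_both_fraction.
    """
    n = len(copies)
    if n == 0:
        return False
    gag_only = sum(1 for d in copies if "GAG" in d and "POL" not in d)
    pol_only = sum(1 for d in copies if "POL" in d and "GAG" not in d)
    both = sum(1 for d in copies if "GAG" in d and "POL" in d)
    return gag_only > 0 and pol_only > 0 and both / n < max_both_fraction

# Toy families, invented for illustration only.
split_family = [{"GAG"}] * 60 + [{"POL"}] * 40
autonomous_family = [{"GAG", "POL"}] * 100
borderline_family = [{"GAG"}] * 99 + [{"POL"}] * 99 + [{"GAG", "POL"}] * 2
```

Here `is_split_family(split_family)` is true, while the autonomous family and the borderline family (exactly 1% of copies with both domains) are not counted, because the threshold is applied as a strict inequality.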

Unlike most contemporary ecological communities, which are censused when a researcher surveys them, the genomic ecosystem carries a record of past transposition. We can investigate this past ‘fossil’ record using the age of individual TE copies. This allows a robust analysis of the features that define TE survival across time. The TEs we see today are a readout of the joint processes of new transposition—which may not be uniform through time—and removal through selection, deletion, and drift [62]. Survival of a TE can be measured by its age or time since insertion, as our observation of a TE is conditioned on the fact it has not been removed by either neutral processes or selection. Changes in the TE community over time give rise to evolution.

Although relative age differences between TE insertions are limited only by our ability to count mutations, absolute age estimates can be shifted by mutation rate estimates. We use a maize-specific mutation rate [94], which leads to a five times younger estimated age of maize LTR retrotransposons than the 3–6 million years originally estimated by SanMiguel et al. (1998) [95]. Additionally, as nucleotide mutation rates in TEs may be higher than in other parts of the genome (≈ 2-fold higher in TEs in Arabidopsis thaliana [96]), we consider our age estimates to represent an upper bound of TE age. Nonetheless, age represents a comparable metric of survival in the genome, especially when summarized across multiple copies and families. Our random forest model predicting age of TEs thus relates the action of transposition to the processes that occur afterwards on an evolutionary time scale. The model shows that TE superfamily and family size provide the best predictive power for age (Fig 7B).
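The dependence of absolute age on the mutation rate can be made concrete with a minimal sketch. It assumes the standard LTR-divergence estimator, T = K / (2μ), where K is the proportion of differing sites between the two LTRs of one copy and μ is the substitution rate per site per year; no multiple-hit correction is applied. The two rates below are illustrative values that are not stated in this section (3.3e-8 for a maize-specific rate and 6.5e-9 for the older estimate), and the function names are hypothetical.

```python
def insertion_age(divergence, mu):
    """Age in years from LTR-LTR divergence, T = K / (2 * mu).

    divergence: proportion of differing sites between the two LTRs (K).
    mu: substitutions per site per year (assumed value, see lead-in).
    """
    return divergence / (2.0 * mu)

def min_nonzero_age(ltr_length, mu):
    """Smallest non-zero age that can be observed for an LTR of given length.

    One substitution in ltr_length aligned sites is the smallest non-zero
    divergence, so anything younger is reported as age 0.
    """
    return insertion_age(1.0 / ltr_length, mu)

# Illustrative rates, assumed for this sketch only.
MU_MAIZE = 3.3e-8
MU_OLDER = 6.5e-9

k = 0.01  # 1% divergence between the two LTRs of a hypothetical copy
age_new = insertion_age(k, MU_MAIZE)
age_old = insertion_age(k, MU_OLDER)
fold_change = age_old / age_new      # equals MU_MAIZE / MU_OLDER
resolution = min_nonzero_age(1000, MU_MAIZE)
```

Under these assumed rates the same divergence yields an age about five times younger with the higher rate, and a single substitution in a 1 kb LTR already corresponds to roughly 15,000 years. This is why copies with an estimated age of 0, and counts of copies younger than domestication, should be read with the resolution limit in mind.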

Another TE feature with high predictive power for age and survival in the genome is the length of the TE, both of itself and its length including copies nested within it. For most families, there is a negative correlation between TE length and age (Fig 7C). However, we find that the relationship between TE length and age in maize is often nuanced, with some long TEs surviving over millions of years (Fig 7C and S13(A) and S3(B) Fig). In other taxa, selection is stronger on long TEs, mediated by a higher potential for nonhomologous (ectopic) recombination [97–99]. Although a number of factors may contribute to these patterns, it also seems likely that a genome as repetitive and TE-rich as maize perhaps could not have evolved without mechanisms to prevent improper pairing of nonhomologous sequences with high nucleotide similarity [100].

Other predictors of age are expected. For example, we expect a new insertion to be younger if we show that the TE disrupts another TE, which we see for most families shown in Fig 7C. Additionally, we expect the proportion of segregating sites in the TE and the region flanking a TE insertion to be positively associated with TE age, as they reflect a count of the mutations that have accumulated on the haplotype carrying the TE. There is a positive relationship between age and segregating sites for most families shown in Fig 7C. We note that imprecise repair of a double stranded break after excision of a TIR element [70] could obscure this signal to some extent, increasing the number of flanking SNPs while decreasing the average frequency of the TE. Consistent with this mechanism, the superfamily DTT, which excises precisely without introducing nucleotide mutations [101], shows lower median flanking segregating sites per base pair (0.0295) than TIR elements from other superfamilies (0.0310) (Wilcoxon Signed-rank test shows significant effect of DTT superfamily; p < 2.2e-16).

Elevated CHH methylation of TEs has been found in recently activated TEs in Arabidopsis thaliana [102] and in TEs near genes in maize [103, 104]. We find complicated, nonlinear relationships of CHH methylation with age (Fig 7F and 7G). These differences between and within families may reflect a natural senescence of TE copies. Young copies not yet silenced by the genome lack CHH methylation, intermediate age copies are effectively silenced with higher CHH methylation levels, and the oldest TEs with low CHH methylation are defunct copies incapable of transposition that are no longer silenced. More detailed study of recently active maize TE families will allow understanding of the temporal dynamics of transcriptional and post-transcriptional silencing of TEs.

In spite of previous predictions, distance to a gene and recombination are not found in the top 30 explanatory variables of age. Old TEs are underrepresented near genes in humans and Arabidopsis thaliana [105, 106], consistent with selection against such insertions. Recombination has been implicated in both the removal of TEs and in modifying their impact on fitness via ectopic recombination [6]. We believe that both distance to a gene and recombination rate reflect broad-level summaries of genomic regions, such that they are not predictive in our model once other local features are included. For example, regions with high recombination rate generally show low CG methylation in maize [107], but a subset of genes in such regions show CG methylation across the gene body. Since CG methylation plays a role in TE survival (Fig 7B), inclusion of this feature in our models will thus reduce the importance of recombination rate. Similarly, CHH methylation is most prominent in regions of the genome close to genes, presumably a result of RNA-directed DNA methylation reinforcing the boundary between heterochromatin and euchromatin [103, 104]. As this elevated CHH methylation is often over the TE closest to a gene [104], the distance of a TE to the closest gene may provide largely redundant information beyond what is captured by measurements of CHH methylation. Finally, despite many other features being correlated with either gene density or recombination rate, the two are inextricably linked, as recombination in maize primarily initiates in genes [108]. Together, these combine to reveal few patterns in the relationship between distance to gene, recombination rate, and age of TE copies (S13(C) and S13(D) Fig).

Finally, in spite of the fact our model includes more than 400 features of the genomic environment, TE taxonomy contributes substantially to prediction of TE age (Fig 7A and 7B). We have seen that the relevance and direction of effect of individual features can differ among families (Fig 7C), essentially generating family-specific niches in the genomic ecosystem. In fact, there is no genomic feature we measure which shows even the same direction of correlation with age across all families. The importance of taxonomy in our model suggests that there are unmeasured latent variables that are best captured with superfamily and family labels. This further emphasizes that the analysis of TEs in maize should focus on family, as each family is surviving in a slightly different way, exploiting a unique genomic niche.
