What is the potential function of ultra-conserved elements in the genome?

If selection pressure results in conservation of DNA sequences, what is the most plausible explanation for the existence of ultra-conserved elements (refs here and here), given that there hasn't been any significant validation of the functional significance of these elements other than a lot of bioinformatic analyses across different genome datasets? If these are of such high significance, does this mean that there is still some significant gap in our understanding of fundamental biology, or is there another explanation? The second reference in particular shows that there are UCEs shared between plants and animals that are not syntenic (which is not necessarily a surprise), so it might suggest that at least one class of UCEs is associated with structural rather than functional elements.

It appears that there are at least several different 'classes' of ultra-conserved elements, based on the number of matching/identical bps, their spatial distribution across the genome and the species in which they exist. Even though there is probably no single explanation that would account for all the possible functions they can have, it is surprising that they are difficult to test functionally. This is again probably due to a lack of understanding about their properties and therefore no real method to validate their function. I think this is where we need to think outside the box to come up with the answer.

What would be the most obvious (and possibly not so obvious) function for UCEs?

Probably development, in particular transcriptional regulation. To quote each link in turn,

They are found in clusters across the human genome, principally around genes that are implicated in the regulation of development, including many transcription factors. These highly conserved non-coding sequences are likely to form part of the genomic circuitry that uniquely defines vertebrate development.


[Highly conserved non-coding sequences] are significantly associated with transcription factors showing specific functions fundamental to animal development, such as multicellular organism development and sequence-specific DNA binding. The majority of these regions map onto ultraconserved elements and we demonstrate that they can act as functional enhancers within the organism of origin, as well as in cross-transgenesis experiments


Here we report that 45% of these sequences functioned reproducibly as tissue-specific enhancers of gene expression at embryonic day 11.5. While directing expression in a broad range of anatomical structures in the embryo, the majority of the 75 enhancers directed expression to various regions of the developing nervous system.

These regions tend to be highly clustered in around 200 areas, and most of them are non-coding. ncRNA is often regulatory, and those UCE clusters are associated closely with developmental genes. That being said, not all of them are clustered near known genic regions, which might be a good indicator that there are heretofore unknown genes in those areas; UCEs might be useful for discovery. And here's a paper trying to give a role to one in cancer.

It's worth noting that just because a sequence has diverged through specialization does not mean that its function has not been conserved to the same degree. For instance, non-coding RNAs are characterized not by sequence homology alone but also by structural conservation. This structure-function relationship exists to an even greater degree in other parts of the genome, such as retroelements, and beyond the genome in proteins, especially antibodies.

Conservation might indicate selective pressure as a general rule, but this isn't a given; you can delete some ultraconserved elements and still get viable mice…

The Genome Biology of Effector Gene Evolution in Filamentous Plant Pathogens

Filamentous pathogens, including fungi and oomycetes, pose major threats to global food security. Crop pathogens cause damage by secreting effectors that manipulate the host to the pathogen's advantage. Genes encoding such effectors are among the most rapidly evolving genes in pathogen genomes. Here, we review how the major characteristics of the emergence, function, and regulation of effector genes are tightly linked to the genomic compartments where these genes are located in pathogen genomes. The presence of repetitive elements in these compartments is associated with elevated rates of point mutations and sequence rearrangements with a major impact on effector diversification. The expression of many effectors converges on an epigenetic control mediated by the presence of repetitive elements. Population genomics analyses showed that rapidly evolving pathogens show high rates of turnover at effector loci and display a mosaic in effector presence-absence polymorphism among strains. We conclude that effective pathogen containment strategies require a thorough understanding of the effector genome biology and the pathogen's potential for rapid adaptation.

Keywords: effectors; epigenetics; gene regulation; genome evolution; pangenome; polymorphism; population genomics.


The comparative genomics revolution of the past decade rests upon the notion that variation in levels of sequence conservation along the genome is informative for defining functional genomic elements (e.g., Birney et al. 2007; Roy et al. 2010; Dunham et al. 2012). Functional regions (exons, enhancers, promoters, etc.) are predicted to be constrained by natural selection in their sequence evolution, and thus should show less sequence divergence between species than nonfunctional regions of the genome. Consistent with this expectation, sequence conservation information has substantially improved ab initio gene and RNA prediction (e.g., Carter and Durbin 2006; Pedersen et al. 2006).

While sequence conservation is an appealing source of information, surprisingly little is known about the biological roles of many conserved sequences, particularly those that do not encode proteins. Human ultraconserved elements (UCEs) best epitomize this paradox. Bejerano et al. (2004) described hundreds of stretches of the human genome of length 200 bp or greater that are perfectly conserved in alignments of the human, mouse, and rat genomes, representing approximately 100 Myr of evolution. The vast majority of these elements occur in regions with no known annotation, and less than one-fourth of UCEs overlap a known transcript. Since their initial description, only limited progress has been made in elucidating the function of vertebrate UCEs. Some UCEs seem to serve a role in gene regulation (Bernstein et al. 2006; Lee et al. 2006; Pennacchio et al. 2006; Paparidis et al. 2007; Visel et al. 2008). Indeed, some elements function specifically as distal enhancers for neighboring developmental genes (Pennacchio et al. 2006; Paparidis et al. 2007; Visel et al. 2008). This role in development is also supported by bioinformatic analyses which demonstrate clustering in regions enriched for transcription factors and developmental genes (Bejerano et al. 2004). Other elements have been shown to function as transcriptional regulators, a subset of which are altered in human cancer (Calin et al. 2007; Ferreira et al. 2012; Lin et al. 2012). However, knockout mouse strains of four separate UCEs showed no detectable effects on viability or fecundity (Ahituv et al. 2007). These results are particularly surprising given that each of these four elements had been previously shown to have tissue-specific in vivo enhancer activity in mouse transgenic assays (Pennacchio et al. 2006). Thus, to what extent are UCEs essential for fitness and development of the organism?
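The operational definition quoted above (stretches of at least 200 bp that are identical across every aligned genome) can be sketched in a few lines. This is only an illustrative sketch, not the pipeline used by Bejerano et al.: the function name `ultraconserved_runs`, the toy sequences, and the assumption that the input is a pre-computed, equal-length, gapped alignment are all hypothetical.

```python
from typing import List, Tuple


def ultraconserved_runs(aligned: List[str], min_len: int = 200) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column intervals in which every aligned
    sequence carries the same unambiguous base for at least min_len columns.

    Assumes `aligned` holds rows of one multiple alignment (equal length,
    '-' for gaps); this is an illustrative helper, not a published method.
    """
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(s) != width for s in aligned):
        raise ValueError("aligned sequences must have equal length")

    runs = []
    start = None  # first column of the current conserved run, if any
    for col in range(width):
        bases = {s[col].upper() for s in aligned}
        # A column counts as conserved only if all rows agree on A, C, G or T.
        conserved = len(bases) == 1 and bases <= set("ACGT")
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                runs.append((start, col))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and width - start >= min_len:
        runs.append((start, width))
    return runs


# Toy alignment with a lowered threshold so the example stays readable.
human = "ACGTACGTAA"
mouse = "ACGTACCTAA"
rat = "ACGTACGTAA"
print(ultraconserved_runs([human, mouse, rat], min_len=4))  # [(0, 6)]
```

With the default `min_len=200` the same function applies the length threshold from the text; the hard part in practice is producing the whole-genome alignment, which this sketch takes as given.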

Inferential evidence from population and evolutionary genetics suggests that UCEs are indeed very important for organismal fitness. UCEs are under strong purifying selection in human populations (Katzman et al. 2007), are depleted among segregating segmental duplications and copy number variants (Chiang et al. 2008), and are nearly indispensable within mammalian genomes over deeper evolutionary timescales (McLean and Bejerano 2008). An alternative hypothesis to explain the existence of UCEs is that they are simply mutational coldspots of the genome. Fortunately, we can distinguish between these two hypotheses using predictions from probabilistic population genetic models. Such analyses demonstrate that human UCEs appear to be strongly constrained by selection and thus are predicted to be functional. Human UCEs were investigated using targeted resequencing from human populations and a hierarchical Bayesian analysis, and were found to be under roughly 3-fold stronger negative selection (i.e., constraint) compared with nonsynonymous sites (amino acid changing sites; Katzman et al. 2007). Put another way, levels of selection on amino acid sequences, our previous gold standard for sequence conservation, are only a fraction of what we observe acting on UCEs in humans. This pattern also generalizes to the entire “tail” of the distribution of conserved sequences. For example, independent sets of conserved noncoding sequences (non-CDS), by varying definitions, are under strong selection in both humans (Drake et al. 2005) and Drosophila (Casillas et al. 2007).

Thus, while UCEs must be important to fitness, the question remains as to what aspects of fitness they encode. Here, we present a comprehensive set of UCEs within the Drosophila genome that we have uncovered using 12 fully sequenced fruit fly genomes. We show using population genetic data that these elements are highly constrained by natural selection both historically and currently within Drosophila melanogaster populations. Further, we show that several UCEs are transcribed and thus likely correspond to novel ncRNAs.


SINEs are classified as non-LTR retrotransposons because they do not contain long terminal repeats (LTRs). [4] There are three types of SINEs common to vertebrates and invertebrates: CORE-SINEs, V-SINEs, and AmnSINEs. [3] SINEs have 50-500 base pair internal regions which contain a tRNA-derived segment with A and B boxes that serve as an internal promoter for RNA polymerase III. [5] [3]

Internal structure

SINEs are characterized by their different modules, which are essentially sections of their sequence. SINEs can, but do not necessarily have to, possess a head, a body, and a tail. The head is at the 5' end of the short-interspersed nuclear element and is evolutionarily derived from an RNA synthesized by RNA polymerase III, such as ribosomal RNAs and tRNAs; the 5' head indicates which endogenous element the SINE was derived from and whose transcriptional machinery it was able to parasitically utilize. [1] For example, the 5' head of the Alu SINE is derived from 7SL RNA, a sequence transcribed by RNA polymerase III which codes for the RNA element of SRP, an abundant ribonucleoprotein. [6] The body of a SINE is of unknown origin but often shares much homology with a corresponding LINE, which allows SINEs to parasitically co-opt endonucleases coded by LINEs (which recognize certain sequence motifs). Lastly, the 3' tail of SINEs is composed of short simple repeats of varying lengths; these simple repeats are sites where two (or more) short-interspersed nuclear elements can combine to form a dimeric SINE. [7] Short-interspersed nuclear elements which possess only a head and tail are called simple SINEs, whereas short-interspersed nuclear elements which also possess a body or are a combination of two or more SINEs are complex SINEs. [1]

Short-interspersed nuclear elements are transcribed by RNA polymerase III, which is known to transcribe ribosomal RNA and tRNA, two types of RNA vital to ribosomal assembly and mRNA translation. [8] SINEs, like tRNAs and many small nuclear RNAs, possess an internal promoter and thus are transcribed differently than most protein-coding genes. [1] In other words, short-interspersed nuclear elements have their key promoter elements within the transcribed region itself. Though transcribed by RNA polymerase III, SINEs and other genes possessing internal promoters recruit different transcriptional machinery and factors than genes possessing upstream promoters. [9]

Changes in chromosome structure influence gene expression primarily by affecting the accessibility of genes to transcriptional machinery. The chromosome has a very complex and hierarchical system of organizing the genome. This system of organization, which includes histones, methyl groups, acetyl groups, and a variety of proteins and RNAs, allows different domains within a chromosome to be accessible to polymerases, transcription factors, and other associated proteins to different degrees. [10] Furthermore, the shape and density of certain areas of a chromosome can affect the shape and density of neighboring (or even distant) regions on the chromosome through interactions facilitated by different proteins and elements. Non-coding RNAs such as short-interspersed nuclear elements, which have been known to associate with and contribute to chromatin structure, can thus play a major role in regulating gene expression. [11] Short-interspersed nuclear elements can similarly be involved in gene regulation by modifying genomic architecture.

In fact, Usmanova et al. 2008 suggested that short-interspersed nuclear elements can serve as direct signals in chromatin rearrangement and structure. The paper examined the global distribution of SINEs in mouse and human chromosomes and determined that this distribution was very similar to the genomic distributions of genes and CpG motifs. [12] The distribution of SINEs relative to genes was significantly more similar than that of other non-coding genetic elements and even differed significantly from the distribution of long-interspersed nuclear elements. [12] This suggested that the SINE distribution was not a mere accident caused by LINE-mediated retrotransposition but rather that SINEs possessed a role in gene regulation. Furthermore, SINEs frequently contain motifs for YY1 polycomb proteins. [12] YY1 is a zinc-finger protein that acts as a transcriptional repressor for a wide variety of genes essential for development and signaling. [13] Polycomb protein YY1 is believed to mediate the activity of histone deacetylases and histone acetyltransferases to facilitate chromatin re-organization; this is often to facilitate the formation of heterochromatin (the gene-silencing state). [14] Thus, the analysis suggests that short-interspersed nuclear elements can function as a 'signal-booster' in the polycomb-dependent silencing of gene-sets through chromatin re-organization. [12] In essence, it is the cumulative effect of many types of interactions that leads to the difference between euchromatin, which is not tightly packed and generally more accessible to transcriptional machinery, and heterochromatin, which is tightly packed and generally not accessible to transcriptional machinery. SINEs seem to play an evolutionary role in this process.

In addition to directly affecting chromatin structure, there are a number of ways in which SINEs can potentially regulate gene expression. For example, long non-coding RNA can directly interact with transcriptional repressors and activators, attenuating or modifying their function. [15] This type of regulation can occur in different ways: the RNA transcript can directly bind to the transcription factor as a co-regulator, or the RNA can regulate and modify the ability of co-regulators to associate with the transcription factor. [15] For example, Evf-2, a long non-coding RNA, has been known to function as a co-activator for certain homeobox transcription factors which are critical to nervous system development and organization. [16] Furthermore, RNA transcripts can interfere with the functionality of the transcriptional complex by interacting or associating with RNA polymerases during the transcription or loading processes. [15] Moreover, non-coding RNAs like SINEs can bind or interact directly with the DNA duplex coding the gene and thus prevent its transcription. [15]

Also, many non-coding RNAs are distributed near protein-coding genes, often in the reverse direction. This is especially true for short-interspersed nuclear elements, as seen in Usmanova et al. These non-coding RNAs, which lie adjacent to or overlap gene-sets, provide a mechanism by which transcription factors and machinery can be recruited to increase or repress the transcription of local genes. The particular example of SINEs potentially recruiting the YY1 polycomb transcriptional repressor is discussed above. [12] Alternatively, this arrangement also provides a mechanism by which local gene expression can be curtailed and regulated, because the transcriptional complexes can hinder or prevent nearby genes from being transcribed. There is research to suggest that this phenomenon is particularly seen in the gene regulation of pluripotent cells. [17]

In conclusion, non-coding RNAs such as SINEs are capable of affecting gene expression on a multitude of different levels and in different ways. Short-interspersed nuclear elements are believed to be deeply integrated into a complex regulatory network capable of fine-tuning gene expression across the eukaryotic genome.

The RNA coded by the short-interspersed nuclear element does not code for any protein product but is nonetheless reverse-transcribed and inserted back into an alternate region in the genome. For this reason, short-interspersed nuclear elements are believed to have co-evolved with long-interspersed nuclear elements (LINEs), as LINEs do in fact encode protein products which enable them to be reverse-transcribed and integrated back into the genome. [4] SINEs are believed to have co-opted the proteins coded by LINEs, which are contained in two reading frames. Open reading frame 1 (ORF 1) encodes a protein which binds to RNA and acts as a chaperone to facilitate and maintain the LINE protein-RNA complex structure. [18] Open reading frame 2 (ORF 2) codes for a protein which possesses both endonuclease and reverse transcriptase activities. [19] This enables the LINE mRNA to be reverse-transcribed into DNA and integrated into the genome based on the sequence motifs recognized by the protein's endonuclease domain.

LINE-1 (L1) is transcribed and retrotransposed most frequently in the germ-line and during early development; as a result, SINEs move around the genome most during these periods. SINE transcription is down-regulated by transcription factors in somatic cells after early development, though stress can cause up-regulation of normally silent SINEs. [20] SINEs can be transferred between individuals or species via horizontal transfer through a viral vector. [21]

SINEs are known to share sequence homology with LINEs, which gives a basis by which the LINE machinery can reverse transcribe and integrate SINE transcripts. [22] Alternatively, some SINEs are believed to use a much more complex system of integrating back into the genome; this system involves the use of random double-stranded DNA breaks (rather than an insertion site created by the endonuclease coded by related long-interspersed nuclear elements). [22] These DNA breaks are utilized to prime reverse transcriptase, ultimately integrating the SINE transcript back into the genome. [22] SINEs nonetheless depend on enzymes coded by other DNA elements and are thus known as non-autonomous retrotransposons, as they depend on the machinery of LINEs, which are known as autonomous retrotransposons. [23]

The theory that short-interspersed nuclear elements have evolved to utilize the retrotransposon machinery of long-interspersed nuclear elements is supported by studies which examine the presence and distribution of LINEs and SINEs in taxa of different species. [24] For example, LINEs and SINEs in rodents and primates show very strong homology at the insertion-site motif. [24] Such evidence is a basis for the proposed mechanism in which integration of the SINE transcript can be co-opted with LINE-coded protein products. This is specifically demonstrated by a detailed analysis that profiled LINEs and SINEs in over 20 rodent species, mainly L1s and B1s respectively; these are families of LINEs and SINEs found at high frequencies in rodents along with other mammals. [24] The study sought to provide phylogenetic clarity within the context of LINE and SINE activity.

The study arrived at a candidate taxon believed to be the first instance of L1 LINE extinction; as expected, it found no evidence that B1 SINE activity occurred in species which did not have L1 LINE activity. [24] Also, the study suggested that B1 short-interspersed nuclear element silencing in fact occurred before L1 long-interspersed nuclear element extinction; this is because B1 SINEs are silenced in the genus most closely related to the genus which does not contain active L1 LINEs (though the genus with B1 SINE silencing still contains active L1 LINEs). [24] Another genus was also found which similarly contained active L1 long-interspersed nuclear elements but did not contain B1 short-interspersed nuclear elements; the opposite scenario, in which active B1 SINEs were present in a genus which did not possess active L1 LINEs, was not found. [24] This result was expected and strongly supports the theory that SINEs have evolved to co-opt the RNA-binding proteins, endonucleases, and reverse transcriptases coded by LINEs. In taxa which do not actively transcribe and translate the protein products of long-interspersed nuclear elements, SINEs lack the means by which to retrotranspose within the genome. The results obtained in Rinehart et al. are thus very supportive of the current model of SINE retrotransposition.

Insertion of a SINE upstream of a coding region may result in exon shuffling or changes to the regulatory region of the gene. Insertion of a SINE into the coding sequence of a gene can have deleterious effects and unregulated transposition can cause genetic disease. The transposition and recombination of SINEs and other active nuclear elements is thought to be one of the major contributions of genetic diversity between lineages during speciation. [21]

Short-interspersed nuclear elements are believed to have parasitic origins in eukaryotic genomes. These SINEs have mutated and replicated themselves a large number of times on an evolutionary time-scale and thus form many different lineages. Their early evolutionary origin has caused them to be ubiquitous in many eukaryotic lineages.

Alu elements, short-interspersed nuclear elements of about 300 nucleotides, are the most common SINE in humans, with >1,000,000 copies throughout the genome, which is over 10 percent of the total genome; this is not uncommon among other species. [25] Alu element copy number differences can be used to distinguish between and construct phylogenies of primate species. [21] Canines differ primarily in their abundance of SINEC_Cf repeats throughout the genome, rather than other gene or allele level mutations. These dog-specific SINEs may code for a splice acceptor site, altering the sequences that appear as exons or introns in each species. [26]

Apart from mammals, SINEs can reach high copy numbers in a range of species, including nonbony vertebrates (elephant shark) and some fish species (coelacanths). [27] In plants, SINEs are often restricted to closely related species and have emerged, decayed, and vanished frequently during evolution. [28] Nevertheless, some SINE families such as the Au-SINEs [29] and the Angio-SINEs [30] are unusually widespread across many often unrelated plant species.

There are >50 human diseases associated with SINEs. [20] When inserted near or within the exon, SINEs can cause improper splicing, become coding regions, or change the reading frame, often leading to disease phenotypes in humans and other animals. [26] Insertion of Alu elements in the human genome is associated with breast cancer, colon cancer, leukemia, hemophilia, Dent's disease, cystic fibrosis, neurofibromatosis, and many others. [4]

MicroRNAs

The role of short-interspersed nuclear elements in gene regulation within cells has been supported by multiple studies. One such study examined the correlation between a certain family of SINEs and microRNAs (in zebrafish). [31] The specific family of SINEs being examined was the Anamnia V-SINEs; this family of short-interspersed nuclear elements is often found in the untranslated region of the 3' end of many genes and is present in vertebrate genomes. [31] The study involved a computational analysis in which the genomic distribution and activity of the Anamnia V-SINEs in Danio rerio zebrafish was examined; furthermore, the potential of these V-SINEs to generate novel microRNA loci was analyzed. [31] It was found that genes which were predicted to possess V-SINEs were targeted by microRNAs with significantly higher hybridization E-values (relative to other areas in the genome). [31] The genes that had high hybridization E-values were genes particularly involved in metabolic and signaling pathways. [31] Almost all miRNAs identified to have a strong ability to hybridize to putative V-SINE sequence motifs in genes have been identified (in mammals) to have regulatory roles. [31] These results, which establish a correlation between short-interspersed nuclear elements and different regulatory microRNAs, strongly suggest that V-SINEs have a significant role in attenuating responses to different signals and stimuli related to metabolism, proliferation and differentiation. Many other studies must be undertaken to establish the validity and extent of the role of short-interspersed nuclear element retrotransposons in regulatory gene-expression networks. In conclusion, though not much is known about the role and mechanism by which SINEs generate miRNA gene loci, it is generally understood that SINEs have played a significant evolutionary role in the creation of "RNA-genes"; this is also touched upon in the discussion of SINEs and pseudogenes.

With such evidence suggesting that short-interspersed nuclear elements have been evolutionary sources for microRNA loci generation, it is important to further discuss the potential relationships between the two as well as the mechanism by which the microRNA regulates RNA degradation and, more broadly, gene expression. A microRNA is a non-coding RNA generally 22 nucleotides in length. [32] This non-protein-coding oligonucleotide is itself coded by a longer nuclear DNA sequence, usually transcribed by RNA polymerase II, which is also responsible for the transcription of most mRNAs and snRNAs in eukaryotes. [33] However, some research suggests that some microRNAs that possess upstream short-interspersed nuclear elements are transcribed by RNA polymerase III, which is widely implicated in the transcription of ribosomal RNA and tRNA, two transcripts vital to mRNA translation. [34] This provides an alternate mechanism by which short-interspersed nuclear elements could be interacting with or mediating gene-regulatory networks involving microRNAs.

The regions coding miRNA can be independent RNA-genes, often being anti-sense to neighboring protein-coding genes, or can be found within the introns of protein-coding genes. [35] The co-localization of microRNA and protein-coding genes provides a mechanistic foundation by which microRNA regulates gene expression. Furthermore, Scarpato et al. reveal (as discussed above) that genes predicted through sequence analysis to possess short-interspersed nuclear elements (SINEs) were targeted and hybridized by microRNAs significantly more than other genes. [31] This provides an evolutionary path by which the parasitic SINEs were co-opted and utilized to form RNA-genes (such as microRNAs) which have evolved to play a role in complex gene-regulatory networks.

The microRNAs are transcribed as part of longer RNA strands of generally about 80 nucleotides which, through complementary base-pairing, are able to form hairpin loop structures. [36] These structures are recognized and processed in the nucleus by the nuclear protein DiGeorge Syndrome Critical Region 8 (DGCR8), which recruits and associates with the Drosha protein. [37] This complex is responsible for cleaving the hairpin structure from the longer primary transcript, yielding the pre-microRNA, which is transported to the cytoplasm. The pre-miRNA is processed by the protein DICER into a double-stranded RNA of about 22 nucleotides. [38] Thereafter, one of the strands is incorporated into a multi-protein RNA-induced silencing complex (RISC). [39] Among these proteins are proteins from the Argonaute family, which are critical to the complex's ability to interact with and repress the translation of the target mRNA. [40]

Understanding the different ways in which microRNA regulates gene expression, including mRNA translation and degradation, is key to understanding the potential evolutionary role of SINEs in gene regulation and in the generation of microRNA loci. This, in addition to SINEs' direct role in regulatory networks (as discussed in SINEs as long non-coding RNAs), is crucial to beginning to understand the relationship between SINEs and certain diseases. Multiple studies have suggested that increased SINE activity is correlated with certain gene-expression profiles and post-transcriptional regulation of certain genes. [41] [42] [43] In fact, Peterson et al. 2013 demonstrated that high SINE RNA expression correlates with post-transcriptional downregulation of BRCA1, a tumor suppressor implicated in multiple forms of cancer, notably breast cancer. [43] Furthermore, studies have established a strong correlation between transcriptional mobilization of SINEs and certain cancers and conditions such as hypoxia; this can be due to the genomic instability caused by SINE activity as well as more direct downstream effects. [42] SINEs have also been implicated in countless other diseases. In essence, short-interspersed nuclear elements have become deeply integrated in countless regulatory, metabolic and signaling pathways and thus play an inevitable role in causing disease. Much remains to be learned about these genomic parasites, but it is clear they play a significant role within eukaryotic organisms.

The activity of SINEs, however, has left genetic vestiges which do not seem to play a significant role, positive or negative, and which manifest themselves in the genome as pseudogenes. SINEs, however, should not be mistaken for RNA pseudogenes. [1] In general, pseudogenes are generated when processed mRNAs of protein-coding genes are reverse-transcribed and incorporated back into the genome (RNA pseudogenes are reverse-transcribed RNA genes). [44] Pseudogenes are generally functionless, as they descend from processed RNAs independent of their evolutionary context, which includes introns and different regulatory elements which enable transcription and processing. These pseudogenes, though non-functional, may in some cases still possess promoters, CpG islands, and other features which enable transcription; they thus can still be transcribed and may possess a role in the regulation of gene expression (like SINEs and other non-coding elements). [44] Pseudogenes thus differ from SINEs in that they are derived from transcribed, functional RNA, whereas SINEs are DNA elements which retrotranspose by co-opting the transcriptional machinery of RNA genes. However, there are studies which suggest that retrotransposable elements such as short-interspersed nuclear elements are not only capable of copying themselves into alternate regions of the genome but are also able to do so for random genes. [45] [46] Thus SINEs may be playing a vital role in the generation of pseudogenes, which themselves are known to be involved in regulatory networks. This is perhaps another means by which SINEs have been able to influence and contribute to gene regulation.


Assembly produced 632,401 contigs (min = 224 bp, max = 17,453 bp) with a mean length of 396.6 bp (±0.27 bp 95% CI) for a total of 250,802,355 bp. Fully 9,194 contigs were over one Kb in length. After identifying UCE loci and removing potential paralogs, we recovered 4,018 UCE loci. After filtering UCE loci for quality, calling SNPs, phasing (reconstructing haplotypes), and applying additional quality filters, we identified 2,635 loci that contained data for all individuals and were variable. This complete matrix of variable loci included a total of 9,449 SNPs (averaging 3.6 sites per locus). Per-site sequencing depth for these SNPs averaged 26.3 reads (±16.9 SD). An additional 587 loci exhibited variation but the data were not of sufficient quality (i.e., GQ < 10) among all individuals to confidently call both alleles. There were 796 high-quality invariant loci (loci with invariant data, rather than an absence of data), providing a full dataset of 3,431 loci with mean length of 1153.6 bp (±4.95 bp 95% CI). The shortest locus was 228 bp, the longest 2,543 bp, and 2,482 loci were longer than one Kb (Fig. S2). The total length of these loci was 3,957,876 bp. The distribution of SNP variation among loci confidently called for all individuals is given in Fig. 2. Nucleotide diversity (π) was 0.000519 overall, 0.000523 for snow buntings, and 0.000493 for McKay’s buntings.

Figure 2: Distribution of single nucleotide polymorphisms (SNPs) per locus.

No alleles showed fixed differences (FST = 1.0) between the two populations, and few alleles showed strong segregation. No variable sites had an FST value above 0.9, and there were only three each at 0.86 and 0.72 (Fig. S3; two of these sites were on the same locus). One of the five loci with the highest FST values was Z-linked; all of the others were on different chromosomes (figshare). There were 128 Z-linked loci among the 2,635 variable loci. As noted, only one showed high FST between the two species. The two populations had an overall FST = 0.034, which was significant (P = 0.03). The average distance between taxa (dxy) was 5.3 × 10⁻⁴, and the net average distance (dA) was 2.0 × 10⁻⁵. DAPC in adegenet assigned all individuals to their correct taxon of origin (retaining the first four PCs), with 100% probabilities for each, indicating a high level of genomic diagnosability (Fig. S4).
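The net distance reported above can be cross-checked against the other summary statistics. The following is a minimal sketch, assuming Nei's standard definition of net divergence, dA = dxy − (πX + πY)/2; the excerpt does not say which formula or software the authors used, and the inputs are the rounded values quoted in the text.

```python
# Rounded values quoted in the text. The formula (Nei's net divergence)
# is an assumption; the excerpt does not state how d_A was computed.
PI_SNOW = 0.000523    # nucleotide diversity, snow buntings
PI_MCKAY = 0.000493   # nucleotide diversity, McKay's buntings
D_XY = 5.3e-4         # average distance between taxa


def net_divergence(d_xy, pi_x, pi_y):
    """Between-taxon distance minus the mean within-taxon diversity."""
    return d_xy - (pi_x + pi_y) / 2.0


d_a = net_divergence(D_XY, PI_SNOW, PI_MCKAY)
print(f"d_A = {d_a:.1e}")
```

With these rounded inputs the result is roughly 2.2 × 10⁻⁵, which agrees with the reported 2.0 × 10⁻⁵ to within the precision of a dxy given to two significant figures.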

Fully 2,510 loci were in Hardy–Weinberg equilibrium; 124 were not (one was triallelic). McKay’s buntings had fewer unique alleles (4,238) than snow buntings (4,389), concordant with the smaller population size of McKay’s buntings. Bartlett’s test rejected homogeneity of variance between observed heterozygosity (Ho = 0.18, 0.19) and expected heterozygosity (He = 0.20, 0.22), but Ho did not differ from He (t = −3.1653, df = 2,633, P = 1.0).
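A per-locus Hardy–Weinberg check of the kind summarized above can be illustrated with a chi-square goodness-of-fit test on genotype counts. The excerpt does not say which test or software was used, so this is only a sketch for a biallelic locus; the function name and the example counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions at a biallelic locus.

    Arguments are the observed counts of AA, AB and BB individuals.
    Returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts that match Hardy-Weinberg proportions exactly.
chi2_fit, p_fit = hwe_chi_square(n_aa=25, n_ab=50, n_bb=25)
# Hypothetical counts with no heterozygotes at all.
chi2_dev, p_dev = hwe_chi_square(n_aa=50, n_ab=0, n_bb=50)
```

With small samples or rare alleles an exact test is usually preferred over the chi-square approximation; the approximation is used here only because it needs nothing beyond the standard library.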

The four-gametes test suggested that recombination occurred in hundreds of loci. For 405 loci, locus lengths were shortened by IMgc to meet the four-gametes test, and for 252 loci one or more individuals were removed to meet the same criteria (a few of these loci had both done; IMgc automatically performs one or the other or both operations to obtain non-recombinant sequence data). There were thus 15.4–24.9% of variable loci exhibiting patterns indicative of recombination. As noted in the Methods, these sequence data, together with all other unchanged sequences, were not used further; we used only SNP data for further analyses.
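The logic of the four-gametes test can be sketched in a few lines: for two biallelic sites, observing all four two-site haplotypes is taken as evidence of recombination (or recurrent mutation) between them. This is an illustration only, not IMgc's implementation; the function name and toy haplotypes are hypothetical. The last lines reproduce the quoted percentage range, on the assumption that it corresponds to 405/2,635 and (405 + 252)/2,635.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Return pairs of site indices at which all four two-site haplotypes occur."""
    n_sites = len(haplotypes[0])
    # Only biallelic sites are informative for the test.
    biallelic = [
        i for i in range(n_sites)
        if len({h[i] for h in haplotypes}) == 2
    ]
    violations = []
    for i, j in combinations(biallelic, 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations


# Toy phased haplotypes (hypothetical): sites 0 and 2 show all four gametes.
recombinant = ["AACG", "AATG", "GACG", "GATG"]
# Only three gametes across sites 0 and 2: compatible with no recombination.
non_recombinant = ["AAC", "AAT", "GAT"]

violating_pairs = four_gamete_violations(recombinant)
clean_pairs = four_gamete_violations(non_recombinant)

# Assumed origin of the 15.4-24.9% range quoted in the text.
VARIABLE_LOCI = 2635
pct_low = round(100 * 405 / VARIABLE_LOCI, 1)
pct_high = round(100 * (405 + 252) / VARIABLE_LOCI, 1)
```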

In testing our six two-population models with δaδi, the highest maximum log composite likelihood was obtained for the split-with-migration model (−112.76), which made it the best-fitting model for these data (model 2 in Fig. 1). We obtained successively lower likelihood values for the neutral (−588.45), isolation with bidirectional migration and population growth (−803.30), and isolation with population growth and no migration (−2026.93) models. The final model tested, split-bidirectional-migration, had an intermediate likelihood of −286.49. The split-with-no-migration model was unstable under all conditions tried, and we could not get it to run to convergence. We provide jackknifed estimates and CIs for the best-fitting, split-with-migration model in Table 1.
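The model comparison described above amounts to ranking the models by their maximum log composite likelihood. A minimal sketch using only the values quoted in the text follows; the dictionary keys are shorthand labels, and the model that failed to converge has no value and is left out.

```python
# Log composite likelihoods quoted in the text. The split-with-no-migration
# model did not converge, so it has no value and is omitted.
log_likelihoods = {
    "split-with-migration": -112.76,
    "split-bidirectional-migration": -286.49,
    "neutral": -588.45,
    "isolation, bidirectional migration, population growth": -803.30,
    "isolation, population growth, no migration": -2026.93,
}

# Higher (less negative) log-likelihood means a better fit.
ranked = sorted(log_likelihoods.items(), key=lambda kv: kv[1], reverse=True)
best_model, best_ll = ranked[0]

for name, ll in ranked:
    print(f"{name:55s} {ll:10.2f}  delta = {best_ll - ll:8.2f}")
```

Ranking on raw likelihood takes no account of differences in the number of parameters between models; the excerpt reports only the likelihoods, so no penalized criterion is attempted here.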

Model parameters | Parameter (±95% CI) | Estimates (±95% CI) | Lower–upper bounds | Biological units
nu1 (pop size McKay’s) | 3.52 (±0.54) | 109,330 (±16,790) | 92,540–126,120 | Individuals McKay’s
nu2 (pop size snow) | 5.95 (±1.79) | 184,991 (±55,523) | 129,467–240,514 | Individuals snow
T (split time) | 1.44 (±0.37) | 241,491 (±62,429) | 179,061–303,920 | Years
m1 (migration) | 1.65 (±0.39) | 2.90 (±0.10) | 2.8–3.0 | Individuals using nu1
m2 (migration) | 1.65 (±0.39) | 4.90 (±0.35) | 4.6–5.2 | Individuals using nu2
theta | 249.97 (±32.71) a | 31,072 (±4,066) a | 27,006–35,138 | Ancestral population individuals
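The relationship between the scaled parameters and the biological units in the table can be sketched as follows. The conversion formulas are the usual δaδi scaling conventions (population sizes relative to the ancestral size, migration in units of 2 × N_ref × m) and are an assumption, since the excerpt does not state them; they do reproduce the table values to within rounding. The split time is not converted, because the generation time it requires is not given in this excerpt.

```python
# Ancestral (reference) population size, taken from the theta row of the table.
N_REF = 31_072


def to_individuals(nu, n_ref=N_REF):
    """Convert a relative population size (nu) to a number of individuals."""
    return nu * n_ref


def migrants_per_generation(m_scaled, nu):
    """Convert scaled migration M = 2 * N_ref * m to migrants per generation.

    The receiving population has size nu * N_ref, so
    N * m = (nu * N_ref) * M / (2 * N_ref) = M * nu / 2.
    (Assumed convention; not stated in the source.)
    """
    return m_scaled * nu / 2.0


n_mckay = to_individuals(3.52)                       # table: 109,330
n_snow = to_individuals(5.95)                        # table: 184,991
mig_mckay = migrants_per_generation(1.65, 3.52)      # table: 2.90
mig_snow = migrants_per_generation(1.65, 5.95)       # table: 4.90

print(round(n_mckay), round(n_snow), round(mig_mckay, 2), round(mig_snow, 2))
```

The small differences from the tabulated estimates are what one would expect if the published values were computed from unrounded parameters.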


We have developed a powerful new genomic tool for estimating phylogenetic relationships among members of the hyperdiverse insect order Hymenoptera. By extending and improving prior work (Faircloth et al. 2012), we identified over 1500 highly conserved genomic regions between distantly related Hymenoptera taxa, collected these loci from 14 genome-enabled and 30 non-genome-enabled taxa using in silico and in vitro techniques and used the resulting genome-scale sequence data to accurately infer both deep (c. 220–300 Ma) and relatively shallow (≤1 Ma) relationships. Although other phylogenomic approaches have been employed among arthropods (Johnson et al. 2013), this is the first time that sequence capture of conserved regions has been used to collect genome-scale DNA data from this group.

Compared to recent phylogenetic studies investigating higher-level relationships within Hymenoptera (Sharkey 2007; Heraty et al. 2011; Klopfstein et al. 2013), the UCE data recovered all well-established relationships with complete support. In addition, the UCE data suggest a novel relationship within the Aculeata, in which the ants are sister to all remaining aculeate lineages included here. The aculeates contain all major lineages of social insects (except termites) including ants, vespid wasps and several lineages of social bees. Aculeata also includes the most important group of pollinators (bees). Hence, understanding relationships among the aculeates is critical to provide the comparative framework needed to study the origins and evolution of sociality and pollination biology in this group (Danforth 2013). Until recently, phylogenetic studies of aculeates have been based on a relatively small number of characters and have produced conflicting results (Brothers 1999; Pilgrim et al. 2008; Peters et al. 2011; Debevec et al. 2012). A recent transcriptome-based study (Johnson et al. 2013) sequenced key lineages within Aculeata and produced a fully resolved phylogeny of aculeate lineages, recovering a novel relationship in which ants are sister to the Apoidea (spheciform bees+wasps). Our UCE data set did not recover this relationship. Instead, we found ants to be sister to all remaining aculeate lineages with complete support, but there were several nodes within each clade receiving moderate (≥58%) support. Our study also differed from Johnson et al. (2013) in the placement of vespid wasps as sister to the tiphioid-pompiloid wasps (Chyphotidae+Pompilidae+Sapygidae) and the scoliid wasps as sister to the spheciform wasps+bees (Apoidea). Previous work by Debevec et al. (2012) also recovered this placement of scoliid wasps as sister to the spheciform wasps+bees.

Given the importance of resolving relationships among aculeate lineages, we tested the effects of removing sawfly lineages on the topology and support inferred across the UCE tree presented in Fig. 1. Following inference from this updated data set with RAxML, the resulting phylogeny (Fig. S6, Supporting information) had the same topology as the tree including sawflies, except that in Fig. 1, two nonaculeate taxa, Evaniella and Orthognalys, form a clade with maximum support, while in Fig. S6 (Supporting information), these taxa form a grade, also with maximum support. Support values for internal nodes were marginally higher in the tree excluding sawflies. The stability of the recovered relationships within Aculeata between these two trees and across different assembly methods suggests that neither the count of loci, nor the total amount of data, nor the assembly approach is driving the differences we observed between our results and those of Johnson et al. (2013).

Rather, taxon sampling (e.g. our study does not include any chrysidoid wasps) or other differences among each data set including size, analytical approach, nucleotide composition, locus type, the number of independent loci sampled and matrix completeness could explain the differences in topology we observed. For example, Johnson et al. (2013) collected and analysed both larger and smaller amounts of data (175 404–3 001 657 sites) of a different type (amino acid residues) from fewer taxa (n = 19) that included variable counts of loci (308–5214 genes) spanning a range of matrix completeness (50–100%), and they inferred their phylogeny using concatenated maximum likelihood, concatenated Bayesian and summary-statistic gene-tree species-tree approaches. In contrast, we collected and analysed a less variable amount of data (102 418–469 081 sites), from a larger number of taxa (n = 41–43) that included variable counts of loci (196–638 loci) spanning a small range of matrix completeness (70–75%). We inferred the phylogeny using a concatenated maximum-likelihood approach. The types of differences between these two studies and their effects on phylogenetic reconstruction are the sorts of questions that deserve the bulk of current and future analytical effort in phylogenomics.

A major advantage of the UCE approach we describe over transcriptome-based methods is that it does not require specially preserved tissues. Here, we successfully extracted and enriched DNA from insect specimens that ranged from 12 years old to weeks old using a variety of collection methods, including several that were suboptimal for DNA preservation (ethanol preserved or dry pinned) and resulted in the extraction of little DNA (Table S1, Supporting information). Furthermore, we successfully generated and enriched UCE loci from genomic libraries constructed using as little as 70 ng of DNA. This finding is significant because many arthropod taxa are small, yielding very low amounts of DNA, and our results suggest we can successfully prepare and enrich libraries from low DNA inputs. New library preparation approaches, including the Hyper Prep Kit (Kapa Biosystems) and the NEBNext Ultra Kit (New England Biolabs), should make it possible to use even less DNA in the future without resorting to expensive modifications of protocol. The ability to use small, moderately old and sometimes low-quality specimens with the UCE approach we describe means that much of the available materials in museums and other collections can be used as a DNA source for phylogenomic studies – making it possible to sequence very rare and, often, very important taxa.

Note Added in Proof

In a response to previous critiques which appeared as this Perspective was in final revision, ENCODE investigators admit to some difficulties around defining function (Kellis et al. 2014). Remarkably, however, these authors focus on reconciling "the strengths and limitations of biochemical, evolutionary, and genetic approaches for defining functional DNA segments" but avoid dealing with the central conceptual issue, which is the problematic nature of "function" itself. A simple folk-philosophical dismissal of this issue leaves the confusion over "junk DNA" unresolved.

Author summary

Gene expression is regulated at different levels and by different mechanisms in eukaryotes. At the DNA level, transcription factors (TFs) are thought to play a key role by binding short motifs in promoters or enhancers. In Plasmodium falciparum, the causative agent of severe malaria in humans, different levels of gene regulation are also present, but very few TFs have been identified and validated so far. We propose here a computational method for the identification of a new type of regulatory element called long regulatory elements (LREs). In contrast to TF motifs, which are usually 6–12 bp long, LREs may span dozens or hundreds of base pairs. Moreover, no computational method has been specifically dedicated to their identification until now. We show with our method that, depending on species and conditions, LREs may play an important role in gene regulation. For P. falciparum, these elements appear to determine a very large part of gene expression variation in all stages of the parasite life cycle.

Citation: Menichelli C, Guitard V, Martins RM, Lèbre S, Lopez-Rubio J-J, Lecellier C-H, et al. (2021) Identification of long regulatory elements in the genome of Plasmodium falciparum and other eukaryotes. PLoS Comput Biol 17(4): e1008909.

Editor: Ilya Ioshikhes, University of Ottawa, CANADA

Received: September 7, 2020 Accepted: March 24, 2021 Published: April 16, 2021

Copyright: © 2021 Menichelli et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The source code (Python) of DExTER is available in a git repository, which also provides the R scripts for reproducing the main experiments described in the paper.

Funding: The work was supported by funding from CNRS (International Associated Laboratory "miREGEN", C-H.L. & L.B.), INSERM-ITMO Cancer (BIO2015-04 "LIONS", C-H.L. & S.L. & L.B.), Plan d’Investissement d’Avenir (#ANR-11-BINF-0002 "Institut de Biologie Computationnelle", C-H.L. & S.L. & L.B. and #ANR-11-LABX-0024-01 "ParaFrap", J-J.L-R. & V.G.), Labex NUMEV (GEM Flagship project, C-H.L. & S.L. & L.B.), CNRS/INSERM funding Défi Santé numérique (project REGAI, C-H.L.), the Fondation pour la Recherche Médicale (DEQ2018033199, J-J.L-R. & R.M.M.), and the program ATIP-Avenir (J-J.L-R.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.


Comparative genomics has substantially contributed to detecting and classifying functional regions in genomes and understanding genome evolution [1, 2]. The foundation for most comparative genomics analyses is an alignment between entire genomes. Several computational methods rely on genome alignments for annotating coding and non-coding genes, and genome alignments have been used to detect novel coding exons, revise exon-intron boundaries, and correct the positions of annotated start or stop codons [3–9]. Many gene or exon finders utilize genome alignments to increase the reliability of their predictions [10–14]. In addition, genome alignments provide an effective way to project genes from a reference species annotation to aligned (query) species [15–17]. Genome alignments have also been used to identify regions that evolve under purifying selection and thus likely have a biological function [18, 19]. Approximately 3–15% of the human genome is estimated to be evolutionarily constrained [20], and most of the constraint detected in genome alignments is located in conserved non-exonic elements that often overlap cis-regulatory elements such as enhancers [21, 22]. Furthermore, genome alignments have been instrumental for understanding the evolution of genomes, which uncovered genomic determinants of trait differences [23–30] and provided insights into evolutionary history and species’ biology [31–34].

A key factor affecting the power of comparative analyses is the number of species included in the genome alignment. Because higher taxonomic coverage increases the power to detect evolutionary constraint [35] and yields more robust results in phylogenetic and evolutionary studies [36, 37], it is desirable to include many sequenced genomes to capture the diversity of species in a respective clade. While the availability of sequenced genomes was a limiting factor in the past, advances in sequencing and assembly technology have led to a wealth of sequenced genomes, illustrated by the availability of >100 mammalian genomes.

To provide a comparative genomics resource that reflects the increased availability of sequenced mammals and is easily accessible to genomics experts and non-experts, we generated a multiple genome alignment of 120 mammals. We used the human gene annotation and Coding Exon-Structure Aware Realigner (CESAR) to provide comparative gene annotations for all 119 non-human mammals. Furthermore, we demonstrate the utility of the high species coverage in our alignment by (i) quantifying how variable ultraconserved elements are among placental mammals and (ii) identifying cis-regulatory elements (enhancers) that arose in the placental mammal lineage and showing that these enhancers are significantly associated with placenta-related genes. To facilitate comparative analyses using our resources, we provide the multiple genome alignment, a phylogenetic tree, conserved regions including GERP++ and PhastCons conservation scores, and the comparative gene annotations in a UCSC genome browser installation [38].
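To make the idea of scanning a multiple alignment for ultraconserved stretches concrete, here is a toy sketch that reports runs of gap-free columns identical in every aligned sequence. It is not the authors' pipeline; the function name, the length threshold and the three short sequences are all hypothetical, and real analyses work on whole-genome alignments with far longer minimum lengths.

```python
def identical_runs(alignment, min_len):
    """Return half-open (start, end) column ranges that are identical and
    gap-free across all rows and at least min_len columns long."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    runs = []
    start = None
    # Iterate one column past the end so that a trailing run is closed.
    for col in range(length + 1):
        conserved = (
            col < length
            and alignment[0][col] != "-"
            and all(row[col] == alignment[0][col] for row in alignment)
        )
        if conserved:
            if start is None:
                start = col
        else:
            if start is not None and col - start >= min_len:
                runs.append((start, col))
            start = None
    return runs


# Toy alignment of three hypothetical species (15 columns each).
toy = [
    "ACGTACGTTTGACCA",
    "ACGTACGTATGACCA",
    "ACGTACG-TTGACCA",
]

long_runs = identical_runs(toy, min_len=7)
all_runs = identical_runs(toy, min_len=5)
```

Counting how many species break perfect identity within such a run is one simple way to express how variable an otherwise ultraconserved element is as more genomes are added to the alignment.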

An Ultraconserved Brain-Specific Enhancer Within ADGRL3 (LPHN3) Underpins Attention-Deficit/Hyperactivity Disorder Susceptibility

Background: Genetic factors predispose individuals to attention-deficit/hyperactivity disorder (ADHD). Previous studies have reported linkage and association to ADHD of gene variants within ADGRL3. In this study, we functionally analyzed noncoding variants in this gene as likely pathological contributors.

Methods: In silico, in vitro, and in vivo approaches were used to identify and characterize evolutionary conserved elements within the ADGRL3 linkage region (207 Kb). Family-based genetic analyses of 838 individuals (372 affected and 466 unaffected patients) identified ADHD-associated single nucleotide polymorphisms harbored in some of these conserved elements. Luciferase assays and zebrafish green fluorescent protein transgenesis tested conserved elements for transcriptional enhancer activity. Electromobility shift assays were used to verify transcription factor-binding disruption by ADHD risk alleles.

Results: An ultraconserved element was discovered (evolutionary conserved region 47) that functions as a transcriptional enhancer. A three-variant ADHD risk haplotype in evolutionary conserved region 47, formed by rs17226398, rs56038622, and rs2271338, reduced enhancer activity by 40% in neuroblastoma and astrocytoma cells (pBonferroni < .0001). This enhancer also drove green fluorescent protein expression in the zebrafish brain in a tissue-specific manner, sharing aspects of endogenous ADGRL3 expression. The rs2271338 risk allele disrupts binding of YY1 transcription factor, an important factor in the development and function of the central nervous system. Expression quantitative trait loci analysis of postmortem human brain tissues revealed an association between rs2271338 and reduced ADGRL3 expression in the thalamus.

Conclusions: These results uncover the first functional evidence of common noncoding variants with potential implications for the pathology of ADHD.

Keywords: ADGRL3; ADHD; Cis-acting regulatory element; Enhancer; Evolutionary conserved regions; Genetics; LPHN3; Latrophilin; Zebrafish.