Non-allelic or non-alletic
I stumbled across the term in my Human Genetics textbook. The textbook didn't explain it, and a quick Google search only showed scientific papers that refer to 'recombinations with non-allelic genes' without explaining what they are.
It is my understanding that a non-allelic gene is one that affects a trait controlled by another gene, but not in the typical dominant/recessive manner. If I am correct, this is the same thing as epistasis, meaning that one gene modifies how another is expressed. Epistatic genes exist for horse coat color. One gene is the coat color itself (horses may have either brown or red color alleles at this locus); the second gene is for a protein that allows the coat color gene to be expressed. For example, a horse has two brown alleles at the coat color locus. It also has wild-type alleles at the second (modifier) locus. This modifier allows the color brown to be expressed and the horse is brown: B/B, wt/wt --> brown. If the horse has the two brown alleles but has mutant alleles at the modifier locus, then coat color is not expressed and the horse is white: B/B, mt/mt --> white.
Because the modifier alleles are separate genes from the coat color alleles, the two are non-allelic - yet they both affect the same trait in some way.
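The horse example can be written out as a small lookup. This is only a sketch of the two-locus logic described in this answer: the function name, the "r" symbol for the red allele, the dominance of brown over red, and the treatment of wt/mt heterozygotes as pigmented are all assumptions made for illustration, since the answer only specifies the B/B, wt/wt and B/B, mt/mt cases.

```python
def coat_phenotype(color_genotype, modifier_genotype):
    """Phenotype for the two-locus horse example.

    color_genotype: two alleles at the coat color locus, e.g. ("B", "B").
    modifier_genotype: two alleles at the modifier locus, "wt" or "mt".
    """
    # Assumption: only a homozygous mutant modifier (mt/mt) blocks pigment.
    if set(modifier_genotype) == {"mt"}:
        return "white"
    # Assumption: "B" (brown) is dominant over "r" (red); the answer does not say.
    return "brown" if "B" in color_genotype else "red"


print(coat_phenotype(("B", "B"), ("wt", "wt")))  # brown
print(coat_phenotype(("B", "B"), ("mt", "mt")))  # white
```

The modifier check comes first because the modifier genotype overrides whatever the color locus carries, which is what makes the interaction epistatic rather than a dominance relationship between alleles of one gene.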
I hope this helps.
The context of the term "non-allelic" is non-allelic recombinations which means recombinations between genes that are not exactly alleles but have enough sequence homology to permit recombination.
From "Gene Interactions: Allelic and Non-Allelic" (biologydiscussion.com)
'Non-allelic or inter-allelic interactions… occur where the development of single character is due to two or more genes affecting the expression of each other in various ways.'
'Two non-allelic gene pairs affect the same character. The dominant allele of each of the two factors produces separate phenotypes when they are alone. When both the dominant alleles are present together, they produce a distinct new phenotype. The absence of both the dominant alleles gives rise to yet another phenotype.'
'When a gene or gene pair masks or prevents the expression of other non-allelic gene, called epistasis. The gene which produces the effect called epistatic gene and the gene whose expression is suppressed called hypostatic gene.'
From "Genetic Terminology" (ncbi.nlm.nih.gov)
'Whereas Drosophila geneticists used to talk of two loci for a gene, and human geneticists used to talk of two genes at a locus, modern geneticists talk of “two alleles of a gene” or “two alleles at a locus”; this last, which is nowadays so common, is the terminology that will thus be used in this book. It then follows (rather awkwardly) that two alleles at the same locus are allelic to each other, whereas two alleles that are at different loci are non-allelic to each other.'
From slides "Interaction of genes" (slide_9) (slideshare.net)
'Non-Allelic gene interaction: In inter allelic genetic interactions, the independent (non homologous) genes located on the same or on different chromosomes interact with one another for the expression of single phenotypic trait of an organism.'
My understanding of it got me to the point that non-allelic recessive means a single allele can show its effect in the complete absence of its partner, whether it is dominant or recessive, as in the case of hemophilia. Because at the molecular level we normally define a recessive allele as one that cannot encode any enzyme. Help me to get through this.
Non allelic genes are genes located in different positions of chromosomes, which control the development of different characters
A non-allelic gene is a gene at a different chromosomal locus that can nevertheless affect another gene through various kinds of interaction. Human characters are determined not only by allelic genes but also by the effects of non-allelic genes. A pigment gene is allelic with respect to itself, having two alleles at one locus of a chromosome. But this pigment gene can affect a color trait that is controlled at a different locus. This effect is due to non-allelic interaction.
Alleles of two or more independent genes interact to produce a phenotypic expression different from normal expression
Gene vaccines are a new approach to immunization and immunotherapy in which, rather than a live or inactivated organism (or a subunit thereof), one or more genes that encode proteins of the pathogen are delivered. The goal of this approach is to generate immunity against diseases for which traditional vaccines and treatments have not worked, to improve vaccines, and to treat chronic diseases. Gene vaccines make use of advances in immunology and molecular biology to more specifically tailor immune responses (cellular or humoral, or both) against selected antigens. They are still under development in research and clinical trials. The mechanisms for inducing cellular (as opposed to humoral) responses against a particular antigen have been elucidated. Gene vaccines provide a means to generate specific cellular responses while still generating antibodies, if desired. In addition, by delivering only the genes that encode the particular proteins against which a protective or therapeutic immune response is desired, the potential limitations and risks of certain other approaches can be avoided. This article describes the rationale for, immunologic mechanisms involved in, and design of gene vaccines under development. Preclinical and clinical studies of these vaccines are discussed for various clinical applications, focusing on infectious diseases.
An intein (internal protein) is a protein sequence that is translated as an insertion within a host protein. The intein is then post-translationally excised, simultaneous with the ligation of the two flanking segments of the host protein [1–7]. The result of intein excision is two proteins derived from a single initial translation product: (i) the free intein sequence, and (ii) the mature form of the host protein, with the two halves (the N-terminal and C-terminal external proteins, or exteins) ligated by a peptide bond. The reactions in which the intein is excised from the precursor protein and the flanking exteins are joined are mediated primarily by the intein itself, although the first residue of the C-extein also has an important role. The term intein strictly refers to a protein molecule, but the gene segment encoding the intein is also often referred to as an intein.
In addition to containing sequences necessary for their excision and the splicing of their flanking exteins, many inteins have a homing endonuclease domain. Inteins carrying such domains are often referred to as full-length inteins. Some inteins lack a homing endonuclease domain, containing only those sequences necessary for their excision and extein splicing. These are known as mini-inteins. Most of the homing endonuclease domains found in full-length inteins belong to the LAGLIDADG family. Homing endonucleases are believed to promote the spread of an intein through the gene pool of the host species via a recombination process (homing). In a diploid cell heterozygous for the intein, cleavage of the empty allele by the homing endonuclease will be followed by DNA repair performed by the host repair machinery, using the occupied allele as a template. This will result in the cell becoming homozygous for the intein. In this way, the intein gene is duplicated and can spread throughout a population. Most inteins have no known function, and thus are considered to be selfish or parasitic elements. However, inteins are efficiently removed from the host protein [10–14], so their effect on the host phenotype is minimal.
The homing pathway is dependent on the homing endonuclease recognition of the target site and on the allelic homology of the surrounding sequences. If an intein homing endonuclease were to cut an ectopic site, this would not precipitate homologous recombination (gene conversion) of the intein sequence because of the lack of flanking homology. For this reason, it is apparently very difficult for inteins to move to (or colonise) a new site, and such ectopic movement is likely to be a very rare event. This belief is supported by the finding that allelic inteins (i.e. inteins inserted at corresponding sites in homologous genes), even in distantly related species, are usually more closely related to each other than they are to non-allelic inteins, including those from the same species [5, 9, 15].
Inteins are rarities, and have a puzzling distribution among genes and species: the majority of species do not carry any known inteins, while some species have many; for example, the archaeon Methanococcus jannaschii has 19 distinct inteins. The species that carry inteins do not cluster together on evolutionary trees, but are phylogenetically dispersed, and closely related species do not necessarily have similar sets of inteins. Inteins have only been found in microorganisms. The vast majority of genes have no known inteins, but some genes contain multiple inteins. For instance, replication factor C of M. jannaschii contains three distinct inteins and a ribonucleotide reductase of Trichodesmium erythraeum contains four. Of the more than 80 distinct (non-allelic) inteins described, most (>75%) are found in genes involved in replication or transcription, such as DNA polymerases and helicases, or in related processes such as the metabolism of nucleotides (together these genes could be said to have information-processing functions).
The reasons behind the unusual distribution of inteins are currently unknown. One possible explanation for their phylogenetic distribution is that inteins were formerly much more widespread than they are now, but over time they have been randomly lost on many independent occasions in different lineages, resulting in their current sporadic appearances. It is also possible that their distribution is partly a result of horizontal transfer (that is, movement between species that might be only distantly related). The predominance of inteins in information-processing genes may reflect the horizontal transfer of inteins via virus infection. The genomes of phage and viruses consist predominantly of genes involved in information processing. It is possible that the pattern of multiple coincident insertions is also a reflection of the inteins occurring predominantly in the subset of genes that are common to cellular organisms and their infecting viruses. Three of the allelic intein groups have members that are genomic and viral. For example, RIR1-l allelic inteins are found in eubacteria, eubacterial phages and the eukaryote iridescent viruses, DnaB-b allelic inteins are present in eubacteria and their phages, while Pol-c allelic inteins are found in archaea and in eukaryote viruses (mimivirus and the Heterosigma akashiwo virus (HaV)).
In total, five distinct inteins have been found in eukaryotic nuclear genes. These appear in the VMA1 gene that encodes a subunit of a vacuolar membrane adenosine triphosphatase [10, 18]; PRP8, encoding an essential component of the spliceosome; GLT1, glutamate synthase; CHS2, chitin synthase 2; and ThrRS, threonyl tRNA synthetase (submitted by S. Pietrokovski to InBase). All of these nuclear-encoded inteins have been found exclusively in fungi. VMA inteins have been found in a variety of hemiascomycete yeasts, including Saccharomyces cerevisiae, Kluyveromyces lactis and Candida tropicalis. The PRP8 intein was first found in the basidiomycete fungus Cryptococcus neoformans. Since then, PRP8 inteins have been found in some additional Cryptococcus species (C. gattii and C. laurentii), in a variety of ascomycete fungi, including Aspergillus fumigatus, Histoplasma capsulatum and Botrytis cinerea [14, 22], and in three species of Penicillium. GLT1 inteins have been identified in a small number of ascomycetes (Debaryomyces hansenii, Pichia guilliermondii, Podospora anserina and Phaeosphaeria nodorum). The CHS2 intein has been found in only one species, P. anserina, despite a large number of fungal CHS2 gene sequences being available in GenBank. Finally, the fifth eukaryotic nuclear full-length intein gene, ThrRS, was very recently identified in the ascomycete yeast C. tropicalis (Pietrokovski, InBase). An allelic mini-intein is also found in the closely related yeast Candida parapsilosis. In addition to these nuclear intein genes, three intein genes have been found in chloroplast genomes: there are allelic inteins in the DnaB helicase genes of the chloroplasts of the cryptophyte alga Guillardia theta and the red alga Porphyra purpurea, and a distinct intein in the ClpP protease gene of the chloroplasts of the green alga Chlamydomonas eugametos [26, 27].
Furthermore, inteins have been identified in viruses of eukaryotes: allelic inteins have been found in the DNA polymerase B genes of Acanthamoeba polyphaga mimivirus and HaV01. A distinct full-length intein appears in the RIR1 gene of Chilo iridescent virus [29, 30], with two other insect iridoviruses (Costelytra zealandica iridescent virus and Wiseana iridescent virus) containing allelic mini-inteins (Pietrokovski, InBase). We have detected an intein in a helicase of PBCV (Paramecium bursaria Chlorella virus PBCV) NY2A that is not present in the homologous sites of other PBCV strains (authors' unpublished data and InBase).
DNA-dependent RNA polymerases are complex proteins consisting of several polypeptides including two large and several smaller subunits. Eukaryote nuclei generally encode three RNA polymerases: RNA polymerase I synthesizes a pre-rRNA, 45S, which matures into 28S, 18S and 5.8S rRNAs that will form the major RNA sections of the ribosome. RNA polymerase II synthesizes precursors of messenger RNAs and most small nuclear RNAs. RNA polymerase III synthesizes transfer RNAs, 5S ribosomal RNAs and other small RNAs found in the nucleus and cytoplasm. Some of the various subunits of the different RNA polymerases (including the two largest subunits) are encoded by genes that are homologous (paralogous) throughout cellular life. Some viruses also contain homologous genes encoding their own RNA polymerase.
Here we report the identification and characterisation of seven previously undetected intein-coding sequences from eukaryotic nuclear genomes. These were all identified in genes encoding the second largest subunits of RNA polymerase. They are inserted at six distinct (non-allelic) sites. Four were found in fungi (an ascomycete, a zygomycete and two chytrids), one was found in the slime mould Dictyostelium discoideum, and one in the green alga Chlamydomonas reinhardtii. The last was an intein identified in a viral remnant embedded in the nuclear genome of the oomycete Phytophthora ramorum. Partial sequences of inteins allelic to this latter intein were also identified in the RNA polymerase of a strain of the Emiliania huxleyi virus and in a sequence generated by the Sargasso Sea Metagenomics Project. Analysis of these intein sequences leads to insights into the origins and evolution of inteins in eukaryotes.
We developed a model for detecting NAHR from paired-end read data that addresses many of the issues that typically arise due to repeats. Our model rigorously analyzes read depth inside repeats, termed repeat read depth, using a novel homology-based framework that is robust to read mappings inside repeats. The model also identifies NAHR breakpoints inside highly homologous repeats using hybrid reads (reads generated from the breakpoint of a new repeat formed from the hybridization of two other repeats), which hitherto have gone unnoticed and unstudied. The foundation of the model is the current biological knowledge of the mechanism of NAHR (the rules of NAHR [11,34,35]) and a database of segmental duplications, allowing our model to construct hypothetical NAHR breakpoint junctions exactly and align all relevant reads to them. Because any individual read will overlap only a small number of variational positions (VPs, the positions that distinguish homologous repeats), mapping algorithms often concordantly map reads generated from novel hybrid repeats to a specific existing repeat in the reference genome, resulting in what we term phantom concordance. Our model addresses this challenge by explicitly calculating the probability of each potential mapping location, including for candidate novel hybrid repeats, for every read mapped to a repeat in the reference. In so doing, it geometrically accumulates evidence from multiple reads for or against the existence of hybrid elements in an individual.
Overall, explicitly modeling the outcomes of the NAHR mechanism allows our model to capitalize on several key features:
1. We know exactly where NAHR events might occur and what their breakpoint regions would look like using a biological model of the mechanism of NAHR.
2. We exactly construct hypothetical NAHR breakpoint junctions inside repeats (novel with respect to the reference) and manually align reads to them, allowing us to analyze hybrid reads that result from putative NAHR events and are distinguished by VPs.
3. We evaluate all paired-end reads for evidence of NAHR breakpoints (in feature 2), not just discordantly mapped ones, preventing our model from overlooking hybrid reads that were mapped with phantom concordance to other areas of the genome.
4. We evaluate read depth over groups of disjoint homologous regions, termed repeat read depth, rather than over consecutive linear subintervals of the genome (for example sliding windows or bins).
These concepts are summarized in Figure 2. We implemented our model in a C++ program that we call detect-NAHR. We describe each aspect of the model in detail below.
Schematic example of our model’s approach to detecting NAHR events from paired-end read data. The bottom half shows our framework for repeats and potential NAHR events. Each pair of homologous LCRs represents a potential NAHR event, annotated E_1,…,E_6. We then group homologous LCRs into equivalence classes, folding the reference genome by homology. We focus on E_2 and the blue LCRs for this example. We collect all paired-end reads homologous to the blue LCRs. In the middle, we analyze the schematic blue data for two cases: no NAHR event at E_2 versus NAHR deletion at E_2. For this schematic, assume that E_2 indeed resulted in an NAHR deletion. According to the mechanism of NAHR, a deletion at E_2 results in a hybrid LCR. There are two major components to each analysis: the repeat read depth and the alignment of reads. For the repeat read depth, we compare the observed number of reads across all blue LCRs against the expected number. In the no-event case, there are three blue LCRs and so we expect 3× blue reads, but we only observed 2× blue reads. For the NAHR deletion, a novel hybrid LCR was formed by hybridizing the dark blue and aqua blue LCRs; thus, we expect 2× blue reads, as observed. For alignments, we focus on the hybrid reads that span the NAHR breakpoint in the hybrid blue LCR. Since the hybrid blue LCR is novel with respect to the reference genome, the hybrid reads can be mapped concordantly to blue LCRs in the reference with very few errors, which is termed phantom concordance. As they are mapped concordantly, they are ignored by most other existing structural variation detection algorithms. But when we align them against the hybrid LCR, many of the read errors are resolved. LCR, low-copy repeat; NAHR, non-allelic homologous recombination.
Modeling non-allelic homologous recombination events and breakpoints
The foundation of NAHR is the pair of repeats that mediate the rearrangement. Any pair of highly homologous repeats may potentially mediate an NAHR rearrangement. For such a pair of homologous repeats, we refer to the interval on the reference genome from the beginning of the first LCR to the end of the second LCR as a potential NAHR locus, which may experience an NAHR event in the individual with respect to the reference.
We define the space of potential NAHR events as all possible pairs of homologous repeats in the human reference genome. The Human Segmental Duplication Database (HSDD) lists all pairs of sequences of length ≥1 kb and identity ≥90%, termed segmental duplications or LCRs [15,39] (obtained from ), on the reference (GRCh37). We consider the space of repeats to be the LCRs listed in the HSDD (see Additional file 1: Section S7.24 for justification).
Every entry in the HSDD (that is, a pair of homologous LCRs) therefore represents a distinct potential NAHR event E, where E is a random variable whose potential NAHR outcomes are determined by the locations and orientations of its pair of LCRs, that is, by the rules of NAHR as in Section ‘Mechanism of non-allelic homologous recombination’. In this study, we restrict our attention to deletion events and duplication events.
The relationship between the potential NAHR events E_1,…,E_n is complicated by the layout of LCRs across the genome. For example, between a pair of homologous LCRs might lie a third LCR homologous to neither of its flanking neighbors. Therefore, when calculating the probabilities of various NAHR scenarios, events at certain potential NAHR loci must be considered simultaneously because an event at one locus may have implications for events at another locus. For illustration, Additional file 1: Figure S12a contains a schematic genome and several hypothetical NAHR events. Therein, we see that potential NAHR events E_3 and E_4, although mediated by non-homologous LCRs, do in fact impact each other: if E_3 occurs as an NAHR deletion, then one of the mediating LCRs for E_4 is deleted, and hence E_4 cannot undergo NAHR, and vice versa. This is an exclusivity constraint between E_3 and E_4 (see  for an analysis of exclusivity constraints). In addition, E_4 and E_6 must be considered simultaneously because the mediating LCRs are all homologous. On the other hand, E_4 and E_5 are not related because their mediating LCRs are not homologous and the regions potentially affected are non-homologous and disjoint. All of the types of relationships illustrated here are unambiguous and completely determined by the layout of the reference genome. Thus, potential NAHR events in the reference genome have varying degrees of complexity, that is, the number of other potential NAHR events whose outcomes must be considered simultaneously. Note that this extends beyond the notion of repeat families: potential NAHR events are related not only because the mediating LCRs share homology, but because the intervening regions may intersect or be partially homologous.
When an NAHR event E occurs, it must have a breakpoint B. We assume that B occurs somewhere within the two LCRs. Technically, the breakpoint can be at any position along the LCRs. But given a pair of consecutive VPs on a repeat, all possible breakpoints occurring in the interim region will therefore result in exactly the same hybrid repeat. We therefore restrict the space of potential NAHR breakpoints to be the set of VPs that distinguish a given pair of repeats.
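The reduction of the breakpoint space to VPs can be checked on a toy pair of repeats. The following is a minimal sketch, not the detect-NAHR implementation: the function names and the two 10-base sequences are hypothetical, and the repeats are assumed to be already aligned with equal length and no gaps.

```python
def variational_positions(lcr_a, lcr_b):
    """Indices at which two pre-aligned, equal-length repeat copies differ."""
    if len(lcr_a) != len(lcr_b):
        raise ValueError("this sketch assumes gap-free, equal-length repeats")
    return [i for i, (a, b) in enumerate(zip(lcr_a, lcr_b)) if a != b]


def hybrid_repeat(lcr_a, lcr_b, breakpoint):
    """Repeat that follows lcr_a before `breakpoint` and lcr_b from it onward."""
    return lcr_a[:breakpoint] + lcr_b[breakpoint:]


# Hypothetical toy repeats that differ at positions 2 and 7.
lcr_a = "ACGTACGTAC"
lcr_b = "ACCTACGAAC"
vps = variational_positions(lcr_a, lcr_b)

# Every breakpoint between the two consecutive VPs yields the same hybrid.
hybrids = {hybrid_repeat(lcr_a, lcr_b, bp) for bp in range(vps[0] + 1, vps[1] + 1)}
print(vps, hybrids)
```

Because the two copies are identical between consecutive VPs, the position of the switch inside that stretch cannot change the resulting sequence, so one representative breakpoint per VP interval is enough.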
VPs have been investigated and utilized in NAHR detection previously [3,11,42,43], although not explicitly named as such. For example, Ou et al. used VPs to locate the breakpoint region of experimentally validated NAHR events in the same way we describe above, showing the switch in VP pattern surrounding the breakpoint region.
The rules of NAHR thus provide us with a well-defined space of possible outcomes and breakpoints. For any putative rearrangement, we may therefore exactly construct the entire affected region, including the breakpoint within the LCRs. Being able to construct every hypothetical outcome and calculate its probability, we compare all possibilities against each other via Bayes’ rule (see ‘Materials and methods’).
We introduce a novel approach for evaluating read depth, called repeat read depth, which considers read depth over collections of homologous regions. Since we group homologous regions together, we avoid the classic issue of uncertainty in the mappings of short reads into repeats. Indeed, for evaluating read depth across homologous LCRs, we do not need to worry which paralog a certain read came from, but only that it came from some paralog of an LCR. Thus, we never attempt to determine the correct mapping for any read at any stage in our model.
Instead, we address repetitive regions by associating homologous regions of the genome with each other, that is, forming equivalence classes of homologous regions, in a manner similar to the de Bruijn and A-Bruijn graph formulations by Pevzner [44,45] (see Additional file 1: Section S7.12). Knowing exactly which regions compose each equivalence class, we use probability theory to determine the impact of a given NAHR event and breakpoint on the expected read depth of a certain set of regions. We model the number of reads in a region as a negative binomial distribution (an overdispersed Poisson), approximated by a normal in large regions. This form is based on studies that have found that the distribution of fragments along the reference genome has greater variation than a Poisson distribution, and is biased by the GC content of the fragment [46,47].
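As a rough illustration of repeat read depth, the sketch below scores an observed read count for one equivalence class under two hypothesized copy numbers, using a normal approximation with negative-binomial variance. It is a simplified stand-in for the model described here: the function names, the per-copy read count, the dispersion value and the observed count are hypothetical, and GC bias is ignored.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def repeat_depth_loglik(observed, copies, reads_per_copy, dispersion):
    """Log-likelihood of the pooled read count over one equivalence class.

    The count is pooled over all homologous copies, so no read needs to be
    assigned to a particular paralog.
    """
    mean = copies * reads_per_copy
    # Negative-binomial variance (overdispersed Poisson): mean + mean^2 / size.
    var = mean + mean ** 2 / dispersion
    return normal_logpdf(observed, mean, var)


# Hypothetical numbers: about 100 reads per copy, 200 reads observed in the class.
ll_no_event = repeat_depth_loglik(200, copies=3, reads_per_copy=100, dispersion=50)
ll_deletion = repeat_depth_loglik(200, copies=2, reads_per_copy=100, dispersion=50)
print(ll_no_event, ll_deletion)
```

With these made-up values the two-copy hypothesis fits the pooled count better than the three-copy one, mirroring the 3× versus 2× comparison in the schematic.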
We do not take any read mappings as given. Instead, given a collection of possible NAHR events, we consider all of the locations in the individual’s hypothetical genome that may have (concordantly) generated each paired-end read. Paired-end reads that were generated from an NAHR breakpoint, that is, hybrid reads, can be mapped to either of the two mediating LCRs (A or B) with few mismatches, that is, with phantom concordance. Figure 3 shows a schematic example of paired-end reads generated from the breakpoint region of an NAHR duplication. Such paired-end reads may contain the switch in VP pattern, and thus provide evidence completely analogous to the evidence presented by Ou et al. and Kidd et al. when justifying breakpoint calls in repetitive regions.
Schematic example of an NAHR duplication with paired-end read data. (a) A schematic reference genome and paired-end read data. The dark green and light green regions are homologous LCRs that form a potential NAHR event locus. Green nucleotides are the VPs that distinguish the two LCRs. We suppose the individual experienced an NAHR duplication at this locus, and that the two paired-end reads shown were generated from the breakpoint region, that is, they are hybrid reads. We consider two possible outcomes: no event and an NAHR duplication. (b) If no NAHR event occurs at this locus, then this locus of the individual’s genome is the same as in the reference. Notice that the paired-end reads are aligned concordantly to these LCRs, albeit with a small number of errors at the VPs; we call this phantom concordance. Suppose the probability of a read error is 2%. Then here, the likelihood of the mediating LCR is 0.98×0.02=0.0196 for each paired-end read. (c) The hybrid LCR formed from the NAHR duplication event is shown with aligned paired-end reads. The hybrid LCR is novel to the individual; it does not exist in the original reference genome. Notice that the VPs switch from dark green to light green after the breakpoint in the hybrid LCR. For simplicity in this schematic, we calculate the probability of a paired-end read’s alignment according to the agreement between its mates’ bases at the VPs, although in the algorithm a full alignment probability is calculated. The likelihood that the paired-end reads came from the hybrid LCR is 0.98²=0.9604 for each read. LCR, low-copy repeat; NAHR, non-allelic homologous recombination; VP, variational position.
Because multiple independent reads increase the probability of a breakpoint in proportion to the product of the probabilities of the individual reads, even a limited number of supporting reads can provide strong evidence of a breakpoint. Figure 3 gives a schematic illustration of the power gained from two informative reads, each containing only two VPs.
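The arithmetic of the Figure 3 schematic can be reproduced directly. This is a sketch of the simplified VP-only calculation from the caption (2% read error, two VPs per read), not the full alignment probability used by the algorithm; the function name is hypothetical.

```python
import math


def vp_likelihood(matches, mismatches, error_rate=0.02):
    """Likelihood of a read given how many of its VPs agree with a candidate repeat."""
    return (1 - error_rate) ** matches * error_rate ** mismatches


# Existing mediating LCR: one VP agrees, one disagrees (phantom concordance).
per_read_no_event = vp_likelihood(matches=1, mismatches=1)
# Hybrid LCR: both VPs agree.
per_read_hybrid = vp_likelihood(matches=2, mismatches=0)

# Independent reads multiply, so the ratio grows geometrically with read count.
ratio_one_read = per_read_hybrid / per_read_no_event
ratio_two_reads = ratio_one_read ** 2
print(per_read_no_event, per_read_hybrid, ratio_one_read, ratio_two_reads)
```

Each read favors the hybrid by a factor of about 49 under these assumptions, so two such reads together favor it by roughly 2,400, which is why a handful of hybrid reads can be decisive.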
To calculate the probability of a set of reads given an individual’s hypothetical genome, we consider each possible generating location to be a priori equally likely and calculate each read’s likelihood of each generating location using the context-sensitive hidden Markov model (Additional file 1: Section S7.14). For reads with homology to the regions surrounding NAHR breakpoints in the hypothetical genome, we construct the NAHR breakpoint junction (that is, a novel sequence with respect to the reference) and perform pairwise alignments of all relevant paired-end reads to the NAHR breakpoint region using the context-sensitive conditional hidden Markov model; for more details, see Additional file 1: Section S7.21.
To estimate the sensitivity and specificity of our model, we randomly drew 20 one-copy NAHR events, changed the reference genome accordingly, simulated 15× paired-end read data, and tested our model’s ability to recover the NAHR events (details in Additional file 1: Section S7.20). The 20 drawn NAHR events served as positive controls, while the remaining 304 potential NAHR loci without a drawn event served as negative controls. We repeated this procedure 18 times. Our estimated specificity and sensitivity are 99.8% and 61.4%, respectively, indicating that our model is conservative and does not make a large number of false positive predictions.
Non-allelic homologous recombination calls on real data
As discussed above, both experimental validation and computational analysis of NAHR are extremely difficult due to the repetitiveness of the relevant regions. To address the issues of validation in repeat regions, we developed a set of statistical tests for NAHR regions to be used to isolate a conservative set of reliable NAHR event calls.
False discovery rate via read depth
Our model integrates read depth and paired-end read alignments from both unique and repetitive regions into a single, joint probabilistic model of high-throughput sequencing data and NAHR events with breakpoints. An NAHR deletion or duplication call can be evaluated by a simpler statistical test that is separate from our Bayesian model and uses only the read depth in the putatively affected region.
The statistic we chose for the read depth of a region is the ratio γ of the observed number of reads over the expected number of reads under the null hypothesis (no NAHR event). The expected number of reads is sensitive to GC, following Speed & Benjamini. Note that γ does not depend on the length of the region at hand. If there was no event, we would expect γ≈1. We calculated the false discovery rate (fdr) for each potential NAHR locus from γ in a manner based on the method developed by Efron. Details of this test are in Additional file 1: Section 7.17.
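A minimal sketch of the γ statistic follows. The function name and the read counts are hypothetical, the expected count is taken as given rather than derived from a GC-aware model, and the fdr step based on Efron's method is not reproduced here.

```python
def depth_ratio(observed_reads, expected_reads):
    """gamma = observed / expected read count under the no-event null."""
    if expected_reads <= 0:
        raise ValueError("expected_reads must be positive")
    return observed_reads / expected_reads


# Hypothetical counts: a region showing half of the expected reads.
gamma_short = depth_ratio(150, 300)
# A region twice as long doubles both counts, so gamma is unchanged.
gamma_long = depth_ratio(300, 600)
print(gamma_short, gamma_long)
```

Using a ratio rather than a raw difference is what makes loci of different lengths comparable on one scale before the fdr is computed.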
Breakpoint log odds
Modeling the mechanics of NAHR allowed our model to construct every possible breakpoint region and to align reads against them. We quantify the evidence of a called breakpoint by calculating an odds ratio of alignment probabilities for reads relevant to the breakpoint region. Given a breakpoint B, we take a small region around B together with all of its paralogous regions as a set of regions, and collect all reads mapped to any region in this set. The likelihood P_0 of the null hypothesis (there was no NAHR, and so B is not a breakpoint) can then be computed by aligning each read to every location in the set. The likelihood P_A of the alternative hypothesis (B is indeed the breakpoint) is calculated by including the newly formed hybrid breakpoint region in the set, removing from it the pair of regions that together form the hybrid, and aligning each read to each region in this modified set. The log-odds ratio log(P_A/P_0) represents how much more likely the existence of a specific breakpoint is compared to the null case (no breakpoint).
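The breakpoint log odds can be sketched on toy sequences. This is an illustration under strong assumptions, not the paper's method: a per-base match/mismatch product stands in for the context-sensitive hidden Markov model, reads and regions have equal length, candidate locations are weighted equally, natural logarithms are used, and all names and sequences are hypothetical.

```python
import math


def align_prob(read, region, error_rate=0.02):
    """Toy ungapped alignment probability: per-base match or error."""
    p = 1.0
    for r, g in zip(read, region):
        p *= (1 - error_rate) if r == g else error_rate
    return p


def log_likelihood(reads, regions):
    """Sum over reads of log(mean alignment probability over candidate regions)."""
    return sum(
        math.log(sum(align_prob(read, reg) for reg in regions) / len(regions))
        for read in reads
    )


def breakpoint_log_odds(reads, paralogs, parent_a, parent_b, hybrid):
    """log(P_A / P_0): hybrid replaces its two parent regions under the alternative."""
    null_regions = list(paralogs)
    alt_regions = [r for r in paralogs if r not in (parent_a, parent_b)] + [hybrid]
    return log_likelihood(reads, alt_regions) - log_likelihood(reads, null_regions)


# Hypothetical repeats differing at two positions, and their hybrid.
lcr_a, lcr_b = "ACGTACGTAC", "ACCTACGAAC"
hybrid = "ACGTACGAAC"
reads = [hybrid, hybrid]  # two error-free hybrid reads
log_odds = breakpoint_log_odds(reads, [lcr_a, lcr_b], lcr_a, lcr_b, hybrid)
print(log_odds)
```

In this toy case each read mismatches either parent at exactly one position, so each contributes log(0.98/0.02) and the two reads together give a log odds of about 7.8.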
Conservative call set
We analyzed low-coverage Illumina paired-end read data for 44 individuals obtained from the publicly available database of the 1000 Genomes Project, focusing on the detection of NAHR deletions and duplications. We chose the 44 low-coverage individuals with the largest datasets. We analyzed the same 324 potential NAHR deletion/duplication loci for each individual. This subset of all possible NAHR loci was chosen strictly based on computational constraints (following Section ‘Modeling non-allelic homologous recombination events and breakpoints’, we chose the 324 loci with the smallest computational complexity, as determined by the number of potential NAHR loci that must be simultaneously considered during probability calculations). Analyzing the same 324 potential NAHR events across 44 individuals gives a total space of 44×324=14,256 possible NAHR event calls. To isolate a reliable set of NAHR deletion/duplication calls, we required a candidate call to have fdr ≤0.01 according to the repeat-sensitive read-depth test of Section ‘False discovery rate via read depth’. We found that this fdr threshold gives a good separation between positive and negative NAHR calls (Additional file 1: Section S7.7 and Additional file 1: Figures S9 and S10).
Our results are summarized in Table 1. Of our NAHR event calls, 1,043 passed the fdr threshold, which were called at 109 distinct potential NAHR loci when collapsed across the 44 genomes. Of the 1,043 calls, 722 were duplications and 321 were deletions. Notice that the total number of distinct loci with a positive NAHR event call in some individual is not the sum of the number of such distinct loci for deletions and duplications separately this is because some loci were called as NAHR deletions in some individuals, but were called as NAHR duplications in others. The median number of positive NAHR calls per individual is 24 (7.41% of all 324 tested loci). Comparing against structural variation calls and experimentally validated rearrangements reported in [3,4,17], we found that only 106 of our 1,043 calls (21 of the 109 distinct loci with a positive NAHR event call) were previously reported, and of the 106 previously reported calls, 59 were positively experimentally validated (see Additional file 1: Section S7.6).
Since our model evaluates both read depth and read alignments to make NAHR event calls, it is possible for our model to make a high-confidence NAHR event call at a locus due to a strong read-depth signal, and yet have much lower confidence about the location of the breakpoint of that NAHR event. We can identify the subset of our positive NAHR event calls that have strong evidence of a breakpoint by imposing a threshold of 6 on the breakpoint log odds (see Section ‘Breakpoint log odds’) for each called NAHR event (see Additional file 1: Section S7.25 for the choice of threshold). Of the 1,043 positive NAHR event calls across the 44 individuals, 512 calls had breakpoints with log odds ≥6. Below, we analyze in detail the impact of our positive NAHR calls on genes using the 512 NAHR calls with high-confidence breakpoints.
To understand the kinds of reads that supported our 512 high-confidence NAHR breakpoint calls, Additional file 1: Figure S7 contains a histogram of the number of discordant paired-end reads supporting each such call. Recall that nearly all other structural variation algorithms detect structural variation using only discordantly mapped reads. But of the 512 NAHR event calls with a high-confidence breakpoint, 425 (83%) were supported by zero discordant paired-end reads, guaranteeing them to be undetectable by other algorithms. Another 10.4% would be very unlikely to be detected by other algorithms since so few (≤2) discordant reads support them. Thus, most of the support for our high-confidence NAHR breakpoints comes from paired-end reads that were mapped with phantom concordance, that is, mapped concordantly to a highly homologous region of an LCR from which they were not actually generated. Note that 90% of the 512 high-confidence breakpoints were supported by ≥4 hybrid reads (see Additional file 1: Section S7.5).
Non-allelic homologous recombination events across individuals and relation to ancestry
Repetitive regions pose difficulties not only for detecting rearrangements, but in constructing the reference genome as well. As such, an immediate concern would be that putative NAHR rearrangements reflect anomalies in the reference genome rather than genuine rearrangements. In such cases, we would expect that all (or nearly all) of the individuals tested would display such a signal. Further, the erroneous signal displayed by each individual would perhaps be of slightly different magnitude, and a criticism could be that our model merely chooses some of the individuals to make a call on according to some arbitrary threshold imposed on a signal that does not actually separate the data well.
Additional file 1: Figure S9 shows that, in general, a given potential NAHR locus has NAHR event calls in only a subset of the 44 tested individuals. Among the loci in which an NAHR event was called positive in at least one individual, the number of individuals with some NAHR event at a particular locus has median 5 (11.4%), which is far from all 44. Further, only 7 (6.4%) loci had a positive NAHR event call in 34 (77.3%) of the tested individuals.
Additionally, if our called NAHR events are true genetic polymorphisms (as opposed to artifacts), then their presence or absence among different individuals should be correlated with ancestry. To explore the relationship of our detected NAHR events to ancestry, we tested the hypothesis that the occurrence of NAHR events was independent of ancestry. Of our 1,043 calls passing the fdr threshold, 431 calls (41.3%) were in individuals of African ancestry, 309 (29.6%) in individuals of Asian ancestry, and 303 (29.1%) in individuals of European ancestry. These numbers are reasonable, as the reference genome is European. Testing the relationship between ancestry (African, Asian or European) and NAHR loci (109 distinct loci with a positive NAHR event call), we find that ancestry and NAHR events are overall not independent (χ² = 302.8, degrees of freedom = 216 and P = 8.8×10^−5). That is, the occurrence of NAHR deletions and duplications across the genome is not independent of ancestry. Because of the limited sample size (44 individuals), we were unable to identify specific ancestry-related events that are statistically significant after correction for multiple comparisons.
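To make the independence test above more concrete, here is a minimal Python sketch of a Pearson chi-square statistic for a contingency table whose rows are ancestry groups and whose columns are loci. The function names and the toy counts are invented for illustration and are not the study's data; only the degrees-of-freedom check uses numbers from the text (3 ancestry groups and 109 loci). The sketch assumes every row and column total is non-zero and does not compute a P value, which would need the chi-square distribution from a statistics library.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a test of independence on an r x c table."""
    return (n_rows - 1) * (n_cols - 1)

def chi_square_statistic(table):
    """Pearson chi-square statistic for a table given as a list of rows.

    Assumes all row and column totals are non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if ancestry and locus were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# Toy table: 3 ancestry groups (rows) x 2 loci (columns); invented counts.
toy_table = [
    [10, 20],
    [20, 10],
    [15, 15],
]
toy_stat = chi_square_statistic(toy_table)
toy_dof = degrees_of_freedom(len(toy_table), len(toy_table[0]))
print(f"chi-square = {toy_stat:.3f}, degrees of freedom = {toy_dof}")

# Shape used in the text: 3 ancestry groups x 109 distinct loci.
print(degrees_of_freedom(3, 109))
```

The last line reproduces the 216 degrees of freedom quoted in the text, since (3 − 1) × (109 − 1) = 216.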
Impact on genes
We searched a database of Ensembl genes on the reference genome (obtained from BioMart) and found that 216 genes were affected by the 109 distinct loci with some NAHR event call passing the fdr threshold. The median number of affected genes per individual was 52. The affected genes included several highly studied genes, such as hemoglobin (HBA1, HBA2, HBMA and HBZ), haptoglobin (HP and HPR), and those involved in drug metabolism (CYP2E1). A full list of genes affected by our called NAHR events can be found in Additional file 1: Section S7.2.
A more detailed analysis of the impact of an NAHR event on a gene depends on the relative locations of the gene, the mediating LCRs, the NAHR breakpoints, and any pseudogenes. The simplest case is when a gene is contained in the region between the two mediating LCRs, but not intersecting either LCR; then an NAHR deletion or duplication will completely delete or duplicate the gene, respectively. If a gene is contained within one of the mediating LCRs, then the exact locations of the breakpoints are very important. If the gene lies within the breakpoints, then it will be completely deleted or duplicated, as before. If the gene lies outside of the breakpoints, then it will be physically unaffected, although the distance to its promoter or regulatory elements may change. If the breakpoint intersects the gene, then a new fusion gene will arise. The composition of this fusion gene will depend on what was lying at the homologous position on the other LCR: another gene or a pseudogene. Finally, sometimes a gene actually contains a pair of LCRs (as does the haptoglobin gene HP, for example), in which case an NAHR deletion or duplication will cause a contraction or expansion of the gene, respectively. Clearly the exact location of the breakpoint within the gene will have major implications for transcription. Several instances of these scenarios are highlighted below in Section ‘Case study’ and Additional file 1: Table S6.
NAHR is an important evolutionary mechanism for the creation of pseudogenes and the creation of novel genes via fusion or contraction/expansion. For each of our 512 NAHR event calls with a high-confidence breakpoint (see Section ‘Conservative call set’), we searched a database obtained from BioMart of all Ensembl genes and pseudogenes to determine the impact of the called NAHR event and breakpoint. Table 2 contains the results. In particular, notice that 381 genes and 12 pseudogenes were duplicated completely, and 19 novel genes were formed via fusion.
We now demonstrate the various facets of our model by investigating a single NAHR event call in detail: a two-copy duplication of 20.5 kb on chromosome 1 with GRCh37 reference breakpoints 155,184,704 and 155,205,331 for Yoruban individual NA19129. This rearrangement is novel: it was not reported in any of the previous validation studies [3,4,17]. The called breakpoints are deep inside the mediating LCRs: 4,531 bp and 4,564 bp inside LCRs of lengths 10,583 and 12,491, respectively. This highlights the crucial role that VPs play in detecting breakpoints of NAHR events inside repeats, as we describe in detail below. Indeed, the called breakpoints are nearly right in the middle of the mediating LCRs, far away from any flanking unique regions that could have been used to anchor mates of any overlapping paired-end reads, as some algorithms attempt to do.
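The cases listed above can be sketched as a small interval classifier. This is a simplified illustration, not the authors' procedure: the function name, the inclusive (start, end) interval convention and the toy coordinates are all assumptions, and the sketch ignores pseudogenes and whatever lies at the homologous position on the other LCR.

```python
def classify_gene_impact(gene, lcr_a, lcr_b, bp_left, bp_right):
    """Classify how an NAHR deletion/duplication affects a gene.

    gene, lcr_a, lcr_b: (start, end) tuples, inclusive, on one chromosome.
    bp_left < bp_right: called breakpoints bounding the rearranged segment.
    """
    g_start, g_end = gene

    def contains(outer, inner):
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    # Gene spans both mediating LCRs: the gene itself contracts or expands.
    if contains(gene, lcr_a) and contains(gene, lcr_b):
        return "contraction/expansion"
    # A breakpoint falls strictly inside the gene: a fusion gene arises.
    if g_start < bp_left < g_end or g_start < bp_right < g_end:
        return "fusion"
    # Gene lies entirely between the breakpoints: fully deleted/duplicated.
    if bp_left <= g_start and g_end <= bp_right:
        return "complete deletion/duplication"
    # Otherwise the gene lies outside the breakpoints.
    return "physically unaffected"

# Toy locus (invented coordinates): two LCRs with one breakpoint in each.
lcr_a, lcr_b = (100, 200), (500, 600)
bp_left, bp_right = 150, 550

for gene in [(250, 400), (110, 140), (120, 180), (50, 700)]:
    print(gene, "->", classify_gene_impact(gene, lcr_a, lcr_b, bp_left, bp_right))
```

The check for a gene containing both LCRs comes first because such a gene also contains both breakpoints and would otherwise be reported as a fusion.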
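The breakpoint offsets quoted above can be checked with a few lines of arithmetic. The sketch below uses the breakpoints from the text together with the LCR coordinates given in the figure caption for this locus, and treats those coordinates as inclusive, which is an assumption; the variable names are ours.

```python
# Mediating LCR coordinates (from the figure caption) and called breakpoints.
lcr_a = (155180173, 155190755)
lcr_b = (155200767, 155213257)
breakpoints = (155184704, 155205331)

# LCR lengths, treating the coordinates as inclusive.
len_a = lcr_a[1] - lcr_a[0] + 1
len_b = lcr_b[1] - lcr_b[0] + 1

# Distance of each breakpoint from the start of its LCR.
offset_a = breakpoints[0] - lcr_a[0]
offset_b = breakpoints[1] - lcr_b[0]

# Distance between the two breakpoints.
span = breakpoints[1] - breakpoints[0]

print(len_a, len_b)        # 10583 12491
print(offset_a, offset_b)  # 4531 4564
print(span)                # 20627
```

The lengths and offsets match the figures in the text, and the distance between the breakpoints (20,627 bp) is in line with the duplication size of roughly 20.5 kb reported there.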
Note that we detected an NAHR event at this locus in exactly one other individual: NA19190, also Yoruban. The call for NA19190 was identical to that for NA19129: also a two-copy duplication, and with the same breakpoints.
For illustrative purposes, we collected all paired-end reads for NA19129 that displayed the switch in VPs as implied by the called NAHR two-copy duplication breakpoints. Figure 4 shows a multiple alignment of these paired-end reads against each of the LCRs A and B that mediated the NAHR duplications, and against the hybrid LCR BA resulting from these duplications. There are two VPs v_1 and v_2 of interest at this locus of the mediating LCRs. Our model called a two-copy NAHR duplication between LCRs A and B, with both duplications having a breakpoint in the region [v_1+1, v_2]. When the reads are aligned against LCR A, all of the reads agree with the reference at v_2, but disagree with the reference at v_1, and they all display the same incorrect base (G instead of A). On the other hand, when the reads are aligned against LCR B, the situation is reversed. Finally, when the reads are aligned to hybrid BA, which would hypothetically result from an NAHR duplication with breakpoints in the region [v_1+1, v_2], all of the reads agree with the reference at both v_1 and v_2. This is very strong evidence in favor of the hybrid over either mediating LCR; indeed, the log-odds ratio (see Section ‘Breakpoint log odds’) of the reads aligning to the mediating LCRs versus the hybrid LCR is 12.3.
Multiple alignments of the paired-end reads that display the expected switch in VP patterns. These were at the breakpoint of a duplication on chromosome 1 with breakpoints 155,184,704 and 155,205,331 for Yoruban individual NA19129. The called NAHR duplication was mediated by LCR A (red) with coordinates [155180173,155190755] and LCR B (blue) with coordinates [155200767,155213257]. At the top is a schematic representation of the reference genome (not to scale) in the region chr 1:[155180173,155213257]. For presentation, we collected all paired-end reads that display the expected switch in VPs. This same collection of paired-end reads (mates connected by dots) is shown in three multiple alignments: against the predicted hybrid LCR, against LCR A, and against LCR B. VPs are colored according to which reference LCR’s VP pattern they agree with. Positions in the reads that disagree with the reference/hybrid LCR are colored yellow, while those that agree are colored appropriately. When aligning the reads to LCR A, notice that the reads perfectly agree at the VPs on the right-hand side of the alignment, but completely disagree with the VPs of the left-hand side of the alignment. But when aligning the reads to LCR B, the situation is reversed. Finally, when aligning to the hybrid LCR, all disagreements between the reads and the reference are resolved. The log-odds ratio of the probability of the reads given that there was no NAHR event (that is, the hybrid does not exist) versus the probability of the reads given that the two-copy duplication indeed occurred (that is, the hybrid LCR does exist) is −13: strong support for the two-copy duplication and the specific hybrid LCR. LCR, low-copy repeat; NAHR, non-allelic homologous recombination; VP, variational position.
Unaware of the rules of NAHR, a naive approach may dismiss the disagreements at v_1 and v_2 as SNPs or read errors, or may ignore the reads for containing too little information. For example, of the eight paired-end reads shown spanning the breakpoint, all of them were mapped concordantly by BWA (and designated as properly paired). Further, seven of the eight paired-end reads have a mate with mapping quality 0. Mates that are given mapping quality 0 are considered to contain too little information to be confidently mapped to a unique location, and are ignored by many structural variation algorithms, including BreakDancer and PEMer [23,27]. But when we precisely construct the hypothetical hybrid LCR that results from an NAHR duplication at this locus according to the rules of NAHR, we see that in fact both mates of all of these paired-end reads contribute a significant amount of power: seven out of eight of the reads have a posterior probability >0.96 for mapping to the hybrid LCR as opposed to either mediating LCR. Thus, although the difference in signal between the hybrid LCR and the mediating LCRs is slight (literally a few SNPs), there is much discriminative power to be gained from the low read-error rate (approximately 2%) and from multiple reads displaying the signal.
For contrast, we considered all paired-end reads homologous to the region about the same called breakpoints, but in European individual NA07051 and Yoruban individual NA18501, for whom we did not call any NAHR events at this locus. To summarize the above, for NA19129, we found eight paired-end reads informative of the called breakpoint that displayed a switch in VPs; that is, eight reads simultaneously correctly matched VPs from both mediating LCRs. On the other hand, for NA07051 there were eight paired-end reads potentially informative of a breakpoint at the same location as called for NA19129. However, none of them correctly matched VPs from both LCRs, that is, none displayed a switch in VPs consistent with a hybrid LCR. Instead, five paired-end reads correctly matched VPs from the first mediating LCR but not the second, while three paired-end reads correctly matched the second mediating LCR but not the first. The situation is similar for NA18501. The corresponding multiple alignments for NA07051 and NA18501 of all paired-end reads in the region of the breakpoint called for NA19129 are in Additional file 1: Section S7.10.
In Section ‘False discovery rate via read depth’, we infer the read depth in the repetitive regions and plot its signal in Figure 5a alongside the unique read-depth signal. Note that the signal in repetitive regions transitions from following the expected signal for no event to the expected signal under the called two-copy duplication, as we expect. Further, the transition from null to alternative expected read depth occurs near the called breakpoints; this shows the significant amount of information contained even in reads from highly repetitive regions. Including inferred read depth from repeat regions together with the unique region, we calculate the fdr to be 1.3×10^−3. This again indicates that a two-copy duplication is much more likely than a null event at this locus. Indeed, we were able to detect NAHR events involving only repetitive regions (for example, NAHR between tandem LCRs), as shown in Additional file 1: Table S6.
Observed and expected read-depth signals for a two-copy duplication on chromosome 1. There are breakpoints 155,184,704 and 155,205,331 for two individuals. The mediating LCRs have coordinates [155180173,155190755] and [155200767,155213257]. For perspective, the observed and expected read depths are shown for an additional 100 kb flanking the breakpoints. The expected read-depth signal appears as a dotted line: red for the called NAHR two-copy duplication, blue for no NAHR event. The solid green line is the inferred observed read-depth signal in repetitive regions (see Section ‘False discovery rate via read depth’). The solid black line is the observed read-depth signal in the unique region of this locus. Vertical thin black lines mark the called breakpoints. Read-depth curves are calculated as sliding 1,250-bp window sums for presentation. The expected read-depth signals are highly non-uniform due to the GC-bias in fragment distribution (see Section ‘Read depth’). (a) Read depth in unique and repeat regions for NA19129. Notice that the observed read-depth signal for the unique sequence in between the mediating LCRs follows the expected read-depth signal of a two-copy NAHR duplication (red dotted line) much more closely than the expected read-depth signal if there was no NAHR event (blue dotted line). We also see the inferred observed read-depth signal transition from closely following the expected no-event signal (blue) to the expected duplication signal (red) and back again at approximately the location of the breakpoints. Together, the observed read-depth signal in the unique regions and the inferred observed read-depth signal in the repetitive regions give an fdr = 1.3×10^−3: strong support for the proposed two-copy NAHR duplication. (b) Unique read depth and inferred repeat read depth are shown at the same locus for European individual NA07051. Our model determined there was not an NAHR event for this individual at this locus. The fdr at this locus for NA07051 is >0.99.
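The VP switch described for this locus can be illustrated with a toy comparison of VP patterns. In the sketch below, only the bases at the first VP (A on LCR A, G in the reads) come from the text; the bases at the second VP, the function names and the representation of each read as a two-character string are hypothetical choices made for illustration.

```python
def hybrid_pattern(left_lcr, right_lcr, breakpoint_index):
    """VP pattern of a hybrid: left LCR before the breakpoint, right LCR after."""
    return left_lcr[:breakpoint_index] + right_lcr[breakpoint_index:]

def mismatches(read, pattern):
    """Number of VPs at which the read disagrees with the pattern."""
    return sum(1 for r, p in zip(read, pattern) if r != p)

# Bases at the two VPs of interest. The first VP (A vs G) follows the text;
# the second VP's bases (C vs T) are hypothetical placeholders.
lcr_a = "AC"
lcr_b = "GT"

# Hybrid BA with the breakpoint between the two VPs:
# first VP taken from LCR B, second VP taken from LCR A.
hybrid_ba = hybrid_pattern(lcr_b, lcr_a, 1)

# Eight reads that all show the switch in VPs.
reads = ["GC"] * 8

for name, pattern in [("LCR A", lcr_a), ("LCR B", lcr_b), ("hybrid BA", hybrid_ba)]:
    total = sum(mismatches(read, pattern) for read in reads)
    print(f"{name}: {total} VP mismatches across {len(reads)} reads")
```

Each read carries one mismatch against either mediating LCR and none against the hybrid, which is the pattern a naive caller might dismiss as SNPs or read errors.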
chr, chromosome.
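The caption notes that the read-depth curves are sliding 1,250-bp window sums. A minimal Python sketch of such a window sum over a per-base depth array follows; the function name and the toy depth values are illustrative assumptions, and a small window is used so the result is easy to inspect.

```python
def sliding_window_sums(depth, window):
    """Sum of `depth` over every full window of `window` consecutive positions."""
    if window <= 0 or window > len(depth):
        return []
    current = sum(depth[:window])
    sums = [current]
    for i in range(window, len(depth)):
        # Slide by one position: add the entering value, drop the leaving one.
        current += depth[i] - depth[i - window]
        sums.append(current)
    return sums

# Toy per-base read depth (invented values) with a window of 2 positions.
toy_depth = [1, 2, 3, 4, 5]
print(sliding_window_sums(toy_depth, 2))  # [3, 5, 7, 9]
```

For the figure itself the window would be 1,250 positions rather than 2.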
For perspective, Figure 5b shows the unique and inferred read-depth signal at the same locus for European individual NA07051. Our model determined that this individual did not have an NAHR event at this locus. The fdr at this locus for individual NA07051 is >0.99. It is already obvious from the read-depth signal graph alone that individual NA07051 did not experience an NAHR event at this locus, and the fdr reinforces this conclusion. This also serves as another example of the strong difference in signal between positively called NAHR events and negatively called ones.
Relations to disease
The two-copy NAHR duplication studied here affects the genes GBA and MTX1 and their respective pseudogenes GBAP1 and MTX1P1. Mutations in GBA cause Gaucher’s disease and are strongly associated with Parkinson’s disease in populations worldwide [51,52]. Mutations in MTX1 have also been linked to Parkinson’s disease. Somatic mutations in GBA have also been linked to lung cancer, and somatic mutations in GBAP and MTX1 have also been linked to endometrium cancer.
The gene context of our two-copy NAHR duplication is shown in Additional file 1: Figure S11. According to our called breakpoints, two identical fusion GBA genes are created from the called two-copy NAHR duplication. The fusion genes consist of the first 1.1 kb of GBA followed by the last 12.5 kb of pseudogene GBAP1. The breakpoint occurs 1,093 bp inside GBA, and is 901 bp inside the coding region of GBA. Two additional complete copies of the pseudogene MTX1P1 are also formed.
Other important examples of non-allelic homologous recombination
To highlight the biological impact of NAHR, we briefly present four more positive NAHR event calls and their impact on several highly studied genes, including RNASE2, RNASE3, FLG, CYP2E1, SPRN, SYCE1, HP, HPR, and TXNL4B. Additional file 1: Table S6 contains basic information for each called NAHR event, as well as figures similar to those above demonstrating the read-depth signal, and genome context. This table also describes the genes affected by each call and their functions or associated genomic disorders. All calls presented in the table are novel.
Difference Between Gene and Allele
Gene vs Allele
A gene is a part of the DNA. Alleles on the other hand refer to different versions of the same gene. There are other more subtle differences between the two and this is what we are going to explore on this page:
- Genes are the different parts of the DNA that decide the genetic traits a person is going to have. Alleles are the different sequence versions of a gene; together they determine how a single characteristic appears in an individual.
- Another important difference between the two is that alleles occur in pairs. They are also differentiated into recessive and dominant categories. Genes do not have any such differentiation.
- An interesting difference between alleles and genes is that different alleles of a gene can produce contrasting phenotypes. When the two alleles an individual carries for a gene are identical, the individual is called homozygous. However, if the pair consists of different alleles, the individual is called heterozygous. In heterozygous individuals, the dominant allele is the one that is expressed.
- The dominance of an allele is determined by whether AA and Aa individuals are alike phenotypically. Dominant alleles are easier to detect because they are expressed whether they are paired with the same allele or with a different one.
- Alleles are basically different versions of the same gene. To put it another way: if your eye color were decided by a single gene, the color blue would be carried by one allele and the color green by another. Fascinating, isn’t it?
- All of us inherit one copy of each gene from each of our parents, and the set of genes is essentially the same in every person. So what causes the differences between individuals? It is the result of the alleles.
- The difference between the two becomes more pronounced in the case of traits. A trait refers to what you see, so it is the physical expression of the genes themselves. Alleles determine the different versions of the genes that we see. A gene is like a machine that has been put together. However, how it works will depend on the alleles.
Both alleles and genes play an all important role in the development of living forms. The difference is most colorfully manifest in humans of course! So next time you see the variety of hair color and eye color around you, take a moment and admire the phenomenal power of both the gene and the allele!
1. Genes are something we inherit from our parents- alleles determine how they are expressed in an individual.
2. Alleles occur in pairs but there is no such pairing for genes.
3. A pair of alleles produces opposing phenotypes. No such generalization can be assigned to genes.
4. Alleles determine the traits we inherit.
5. The genes we inherit are the same for all humans. However, how these manifest themselves is actually determined by alleles!
What Is Gene Splicing?
Gene splicing is a technique used in genetic engineering where the DNA of a living thing is edited, in some cases replacing existing genes with genes taken from another plant or animal. Enzymes are used to cut the DNA strand and remove a piece, which can then be replaced with new information. This can transfer characteristics from one species to another or give an organism entirely new characteristics.
One famous example of gene splicing was the implantation of an antifreeze gene into tomatoes, creating a frost-resistant fruit. Since the antifreeze gene originally came from a fish adapted to thrive in cold waters, opponents of genetic modification began calling the result a "fish tomato," despite the fact that no fish characteristics were actually transferred into the plant. Another example involves the transference of bioluminescent genes into various organisms to aid in studying their life processes in the laboratory.
In some cases, entirely new genes are created to be spliced into the organism's DNA. The biotech firm LS9 created an entirely synthetic batch of genes to plant into a sugar-eating bacterium, causing the microorganism to excrete a compound functionally identical to diesel fuel. Other bacteria have been altered to produce lifesaving medicines, and another project re-engineered the bacterium responsible for tooth decay to halt its production of lactic acid, eliminating its ability to damage tooth enamel.
What is Hypostatic Gene?
The hypostatic gene is the gene whose expression is affected by the epistatic gene in an epistatic interaction. The phenotype of the hypostatic gene is altered by the influence of the epistatic gene. Hence, the expression of the hypostatic gene depends on the epistatic gene. More often than not, the epistatic gene suppresses the expression of the phenotype of the hypostatic gene.
Figure 02: Coat Colour of Labrador Retriever
For instance, the alleles that determine whether the coat of a Labrador retriever is black or brown (chocolate) are alleles of the hypostatic gene, while the yellow coat colour, which appears regardless of the black or brown alleles, results from the epistatic gene.
Mutations change genes and provide the raw material for evolution. Genes are sections of DNA that contain the instructions for making proteins or other molecules, and so determine the physical characteristics of each organism. Genetic mutations that increase an organism's number of offspring and chances of survival are more likely to be passed on to future generations. Changes to when or where a gene is switched on (so-called regulatory mutations) can also provide fitness benefits and can therefore be selected for during evolution.
Transposable elements are sequences of DNA that are also called ‘jumping genes’ because they can make copies of themselves and these copies of the transposable element can move to other locations in the genome. Some transposable elements contain sequences that switch on nearby genes. If different copies of a transposable element that contains such a regulatory sequence insert themselves in more than one place, it can result in a network of genes that can all be controlled in the same way. The regulatory sequences contained within transposable elements are not always optimal, but they can be fine-tuned through evolution.
A fruit fly called Drosophila miranda has a transposable element called ISX that has, over time, placed up to 77 regulatory sequences around one of this species' sex chromosomes. Just as in humans, female flies are XX and males are XY but having only one copy of the X chromosome means that male flies need to increase the expression of certain genes to produce a full-dose of the molecules made by the genes. This process is called dosage compensation and in 2013 the 77 ISX regulatory sequences on the fruit fly's X chromosome were shown to help recruit the molecular machinery that carries out dosage compensation to nearby genes, albeit inefficiently. Now Ellison and Bachtrog—who also conducted the 2013 study—report how these transposable elements have been fine-tuned to make them more effective for dosage compensation.
Ellison and Bachtrog uncovered two mutations that make the ISX transposable element better at recruiting the dosage compensation molecular machinery. ISX spread around different locations along the fly's X chromosome before these mutations arose; this means that initially none of the 77 insertions carried the two mutations, but now 30% of the 77 elements have the mutations in all flies, and 41% have them in only some flies.
The same mutations have spread between the different ISX elements because transposable elements with the mutations have been used to directly convert other ISX elements without them. These mutations have also become more common in the fruit fly population by being passed on to offspring and increasing their survival. These two routes have accelerated the fine-tuning of these transposable elements for use in gene regulation. This implies that regulatory sequences derived from transposable elements evolve in a way that is fundamentally different from those that arise by other means, as the direct conversion between these insertions allows fine-tuning mutations to spread more rapidly.
Human Endogenous Retroviruses (HERVs) are inherited DNA proviruses arising from retroviral infections of germ-line cells and subsequent integration into the host genome [1,2].
After integration, the provirus may experience numerous amplifications. The human genome is estimated to contain thousands of copies of these interspersed repetitive elements. In fact, they make up a significant fraction (approximately 8%) of its sequence content [3,4,5]. Many families of HERVs exhibit high transcriptional activity in different human tissues, and both beneficial [6,7,8] and detrimental effects [8,9,10] have been described. Similarly to other endogenous proviruses, they consist of a 5–10 kb sequence that codes for viral proteins, flanked by two Long Terminal Repeats (LTRs) (0.3–1.6 kb in length) that contain regulatory elements for viral protein expression [11].
Although the majority of HERV genes are highly defective due to several inactivating mutations in their coding sequences (ranging from single nucleotide changes to large deletions), LTR elements still retain their regulatory activity, participating in the regulation of cell-type specific gene expression in mammals [12,13,14,15]. Furthermore, LTRs flanking a HERV might recombine with each other, leading to the excision of the coding region and thus becoming solitary elements (“Solo LTRs”) [16,17]. Solo LTRs are quite common in the human genome and play important roles in genome evolution [18,19].
In general, it has been hypothesized that LTRs may be important contributors to the genome plasticity of the host species due to their capability to undergo ectopic recombination events such as unequal crossing over [19].
Recent data suggested that ectopic (non-allelic) gene conversion might have played a role in the evolution of flanking LTRs in humans [20]. Gene conversion mediates the transfer of genetic information from a “donor” sequence to a highly similar “acceptor” [21] and this can have two major effects: on one hand, it can act as a homogenizing force, increasing similarity between paralogous sequences, while on the other hand, it can generate an excess of genetic diversity among allelic copies of the “acceptor” sequence. Although episodes of non-allelic gene conversion between flanking LTRs have been previously hypothesized through an inter-species comparative analysis in different primates [18,19,20], there is little information available regarding the dynamics and the pervasiveness of this process in humans (also considering “solo LTR” elements).
Intra-species sequence diversity comparisons among different individuals could shed light on the dynamics of gene conversion among LTRs. However, in most of the genome, the phenomenon of diploidy complicates the comparative sequencing strategies because of sequence diversity between alleles. The haploid Male Specific region of the human Y chromosome (MSY) has no such limitations. Moreover, because of its genetic features, the MSY is an ideal tool to study the dynamics of ectopic gene conversion in humans [22,23,24,25].
The purpose of this study is to gain new insights into the evolutionary history of LTRs by determining the nature and the extent at which gene conversion occurs among these elements. To this end, we carried out an intra-species phylogenetic analysis of 52 LTRs belonging to 14 different subfamilies on several human Y chromosomes, representing a wide range of worldwide human MSY diversity.
A recent history of gene conversion among LTRs associated with the MSY may be postulated: 1) by the identification of a higher than expected nucleotide diversity within LTR elements, which act as gene conversion “acceptor” sequences [22,24,26] and 2) by the occurrence of multiple phylogenetically equivalent Single Nucleotide Polymorphisms (SNPs) occurring at closely spaced positions (clusters of SNPs) showing the derived allele to be the same as the paralogous base on the donor [23,24,25,27].
By exploiting the haploid nature of the human Y chromosome and its known phylogeny, we hypothesized that some LTR elements have acted as gene-conversion “acceptors” and re-sequenced two of them in a wider sample set. Our comparative re-sequencing analysis provides direct evidence of new LTR-related gene conversion hotspots on the human Y chromosome and suggests complex genetic links among elements spread across the entire human genome. Moreover, we identify a third form of productive recombination of the human MSY, autosome-to-Y gene conversion. Finally, we found that some LTRs are characterized by an extremely high density of polymorphisms showing one of the highest nucleotide diversities in the human genome, as well as a complex patchwork of sequences derived from different elements.
Non-allelic gene conversion enables rapid evolutionary change at multiple regulatory sites encoded by transposable elements
Transposable elements (TEs) allow rewiring of regulatory networks, and the recent amplification of the ISX-element dispersed 77 functional but suboptimal binding-sites for the dosage-compensation-complex to a newly-formed X-chromosome in Drosophila. Here we identify two linked refining-mutations within ISX that interact epistatically to increase binding affinity to the dosage-compensation-complex. Selection has increased the frequency of this derived haplotype in the population, which is fixed at 30% of ISX-insertions and polymorphic among another 41%. Sharing of this haplotype indicates that high levels of gene-conversion among ISX-elements allow them to 'crowd-source' refining-mutations, and a refining-mutation that occurs at any single ISX-element can spread in two dimensions: horizontally across insertion sites by non-allelic gene-conversion, and vertically through the population by natural selection. This describes a novel route by which fully functional regulatory elements can arise rapidly from TEs and implicates non-allelic gene-conversion as having an important role in accelerating the evolutionary fine-tuning of regulatory networks.