What options exist to quantitatively detect enzyme activity in non-model insect tissue?

Evolution/genomics person here: what are the options to measure the activity or presence of broad categories of enzymes--like peroxidases or catalases--active in a specific tissue (in a non-model insect)?

I understand that antibody-based methods, e.g. ELISA, may be one approach, but I am unsure whether that is the best option or what the alternatives would be. Are any sequence-based approaches available if I know the peptide sequence (or at least a specific functional domain, i.e. a Pfam ID)?

Frontiers in Physiology


Twenty years ago, the first version of the human genome was sequenced and published (Venter et al., 2001). Since then, sequencing technologies and the ‘omics data sets they produce have become indispensable to biological research. The progression, over the past two decades, from Sanger sequencing to next-generation sequencing (NGS) and, more recently, long-read sequencing of DNA and RNA has driven cost reductions, improved accessibility, technological progress and availability of supporting tools and resources. Many open-source software packages and annotation pipelines have been developed, not only for genomics but also for the wider family of ‘omics disciplines including proteomics, metabolomics (e.g. glycomics and lipidomics) and others. For the purposes of this review, the term ‘omics refers strictly to the genomics, transcriptomics and proteomics approaches that have become increasingly popular in the bioadhesion literature over the past decade. Characterisation of single genes or proteins in isolation does not constitute genomics or proteomics. ‘Omics studies produce data for the entire system under investigation, which are then refined by various means to reveal the genes or proteins of interest, and their functions. Transcriptomics via RNA sequencing (RNA-seq) provides a ‘bottom-up’ method for identifying putative proteins based on the principle that molecules of messenger RNA (mRNA) are used to translate genes into proteins somewhat quantitatively. The proteins are then secreted in an unmodified state, or as post-translationally modified variants that can be identified through proteomics. Genes of interest can be targeted by various means, but often by combining differential tissue sampling, analysis and prediction methods.

At the time of writing, over 55000 genome records were publicly available on the National Center for Biotechnology Information (NCBI) servers. The number of transcriptome and proteome data sets publicly available is less clear because data are deposited in a variety of archives. However, total numbers for these types of data sets are also in the thousands. For example, a search of the NCBI SRA archive with the keyword ‘transcriptomic’ provides links to over 5000 BioProjects, and over 11000 proteomic data sets are available on the EMBL-EBI PRIDE server. These numbers exemplify the ‘omics revolution throughout the biosciences, from the study of human diseases to crop plant production and novel compound discovery (Fukushima et al., 2009; Tanaka, 2010). It is unsurprising, therefore, that the ‘omics approach has also been adopted within bioadhesion research and it is timely to ask what the impact has been on our basic understanding of bioadhesion mechanisms.

Adhesion via a secreted chemical bioadhesive may either be reversible or irreversible, facilitating temporary or permanent attachment (Hennebert et al., 2015b; Lengerer & Ladurner, 2018). Many species use adhesion for essential processes on which their survival depends. Bioadhesion likely evolved multiple times independently and, in each case, it was an adaptation of another pre-existing physiological process. So, while bioadhesion mechanisms are diverse and complex, they are also rooted in core physiological processes that can be interrogated using ‘omics-based approaches. Indeed, it is possible that ‘omics-based studies of adhesion could identify either ancient physiological processes, such as salivary secretion (Sehnal & Akai, 1990; Yan et al., 2020), from which adhesion evolved in different lineages, or similarities among lineages through the presence of functional gene and protein domains. The desirable characteristics of bioadhesive interfaces in nature are not provided by chemistry alone, of course. So-called ‘wet’ (Wilhelm et al., 2017) and ‘dry’ (Labonte et al., 2016) adhesion systems rely on mechanics to enhance their performance: modulating adhesive contact area (Crawford et al., 2016), dissipating energy under stress at the micro- (Cohen et al., 2019) and nanoscales (Phang et al., 2010) and enabling controlled detachment (Federle & Labonte, 2019). Scale is important, since extrapolations from experiments at the nanoscale will be unlikely to reflect the true properties at the micro- or macroscales (Desmond et al., 2015). These mechanical phenomena are beyond the scope of this review, the focus of which is on identification and characterisation of secreted materials.

Adhesives secreted by aquatic invertebrates contain proteins, glycans (polysaccharides) and lipids in varying proportions. Often metals are involved and are instrumental to crosslinking (Richter, Grunwald & von Byern, 2018). Interest in bioadhesives has been driven to a significant degree by the demand for novel biomimetic adhesives with capabilities beyond the synthetic glues currently available to consumers. Understanding the mechanisms that control adhesion in aquatic systems is considered to be central to the development of bio-inspired adhesives for the construction, biomaterial and manufacturing industries, as well as for clinical therapies (Palacio & Bhushan, 2012). Many current synthetic adhesives are damaging to the surfaces they are applied to and are contaminating, toxic or hazardous to the environment. In addition, most of those currently available have low efficacy on hydrated surfaces. Substitution with bio-inspired adhesives could therefore provide more suitable and sustainable alternatives (Richter et al., 2018).

Marine bioadhesives of biomimetic interest were recently reviewed by Almeida, Reis & Silva (2020). The purpose of this review is not to provide a similar application-focused overview. Rather, we aim to identify the trends in bioadhesion research that have informed current understanding and ask, ‘where next?’. The resurgent focus on the basic biology of aquatic adhesion systems is welcome and has been driven, in part, by the ‘omics revolution. But it is now timely to look beyond the use of these data, to identify ways to build rigour and consistency into the analyses, and to ensure that the conclusions of studies are both sufficient and meaningful. This article therefore covers strengths, opportunities, limitations and future challenges presented by ‘omics in the context of bioadhesion research. To illustrate some key points more effectively, we have included original data and analyses where appropriate.


First-strand buffer, 5×

  • 375 mM KCl
  • 15 mM MgCl2
  • 250 mM Tris·Cl, pH 8.3
  • Store up to 1 year at −20°C

Pfu-sso7d buffer, 5×

  • 0.5 mg/ml bovine serum albumin (BSA; Sigma B6917)
  • 50 mM KCl
  • 10 mM MgSO4
  • 50 mM ammonium acetate
  • 150 mM Tris·Cl, pH 10
  • 0.5% (v/v) Triton X-100
  • Store up to 1 year at −20°C
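Both recipes above are 5× stocks, so the concentrations in a finished reaction are one fifth of the listed values. The following is a minimal sketch of that arithmetic; the function names and the 20 µl reaction volume are illustrative assumptions, not part of the protocol.

```python
# Stock concentrations copied from the two 5x recipes above.
FIRST_STRAND_5X = {
    "KCl (mM)": 375,
    "MgCl2 (mM)": 15,
    "Tris-Cl pH 8.3 (mM)": 250,
}
PFU_SSO7D_5X = {
    "BSA (mg/ml)": 0.5,
    "KCl (mM)": 50,
    "MgSO4 (mM)": 10,
    "ammonium acetate (mM)": 50,
    "Tris-Cl pH 10 (mM)": 150,
    "Triton X-100 (% v/v)": 0.5,
}


def working_concentrations(stock, fold=5):
    """Return the 1x concentrations obtained by diluting a `fold`x stock."""
    return {name: value / fold for name, value in stock.items()}


def stock_volume_ul(reaction_volume_ul, fold=5):
    """Volume of `fold`x buffer needed for a reaction of the given final volume."""
    return reaction_volume_ul / fold


if __name__ == "__main__":
    for name, value in working_concentrations(FIRST_STRAND_5X).items():
        print(f"{name}: {value:g}")
    # Hypothetical 20 ul reaction, chosen only to illustrate the calculation.
    print(stock_volume_ul(20), "ul of 5x buffer")
```

The same two helpers apply to any concentrated stock by changing `fold`.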

Sodium citrate buffer, pH 6.4, 0.1 M

  • Purchase citric acid monohydrate (210.14 g/mol) and sodium citrate dihydrate (294.12 g/mol)
  • Prepare 80 ml of distilled water in a suitable container
  • Add 0.21 g of citric acid monohydrate and 2.65 g of sodium citrate dihydrate and dissolve
  • Adjust the solution to the final desired pH using HCl or NaOH
  • Add distilled water until the volume is 0.1 liter
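As a quick consistency check on the citrate recipe, the masses and molar masses given above can be converted to moles to confirm that the total citrate concentration comes out near 0.1 M in 0.1 liter. This is a sketch only; the function names are illustrative, and the scaling helper assumes the recipe scales linearly with volume.

```python
CITRIC_ACID_MONOHYDRATE = 210.14   # g/mol, as stated in the recipe
SODIUM_CITRATE_DIHYDRATE = 294.12  # g/mol, as stated in the recipe


def citrate_molarity(acid_g, salt_g, volume_l):
    """Total citrate concentration (mol/l) from the two weighed components."""
    acid_mol = acid_g / CITRIC_ACID_MONOHYDRATE
    salt_mol = salt_g / SODIUM_CITRATE_DIHYDRATE
    return (acid_mol + salt_mol) / volume_l


def scale_recipe(volume_l, acid_g=0.21, salt_g=2.65, reference_volume_l=0.1):
    """Scale the weighed masses linearly to a different final volume."""
    factor = volume_l / reference_volume_l
    return acid_g * factor, salt_g * factor


if __name__ == "__main__":
    print(f"total citrate: {citrate_molarity(0.21, 2.65, 0.1):.4f} M")
    acid, salt = scale_recipe(1.0)
    print(f"for 1 liter: {acid:.2f} g acid, {salt:.2f} g salt")
```

The pH still has to be checked and adjusted at the bench, since the calculation only covers total citrate, not the final pH.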

Cell-Free Strategies for Sustainable Materials Biomanufacturing

Living cells and organisms have evolved highly complex enzymes and metabolic processes that generate extremely diverse biochemistries. Exploring these natural biochemistries may lead to important foundational advances in our understanding of natural product synthesis. Foundational discoveries in functional genomics, cellular metabolism and natural product synthesis are also important, because they might inspire novel biosynthetic pathway designs for biological materials production. In synthetic biology, cell-free metabolic engineering (CF-ME) approaches can reconstitute entire biosynthetic pathways using either cell extracts from diverse species, engineered cells and/or cell-free synthesized recombinant enzymes (Karim and Jewett, 2018; Martin et al., 2018; Yim et al., 2019; Bowie et al., 2020) (Figure 1). Also, cell-free protein synthesis and cell extract biotransformation reactions can be combined to create more complex cell-free reactions (Karim and Jewett, 2018; Kelwick et al., 2018). Another important advantage in using cell-free approaches is that pathway reaction bottlenecks can be identified, through the direct addition of the required recombinant enzymes, enzyme co-factors or chemical substrates needed for each stage of a biosynthetic pathway (Dudley et al., 2015). Increasingly sophisticated combinatorial CF-ME strategies, together with high-throughput automation, deep data omics and design of experiments (DoE) approaches to cell-free reaction optimization have been deployed (Caschera et al., 2018; Jiang et al., 2018; Dopp et al., 2019). These advancements have considerably improved the feasibility of refactoring and optimizing fine chemical or natural product biosynthetic pathways within short timeframes (Dudley et al., 2015; Korman et al., 2017; Moore et al., 2017a; Wilding et al., 2018).

Cell-free synthetic biology approaches have also been directed toward the de novo bioproduction of biological materials, including biopolymers or their monomers, cellulosic materials and nanoparticles (Table 1). However, the maximum cell-free bioproduction yields or reaction efficiencies of several reported materials were generally low or unspecified. Examples of cell-free produced materials and their reported maximum production yields and reaction efficiencies include bio-cellulose (3.726 ± 0.05 g/L; 57.68%) (Ullah et al., 2015), chitin (yields not stated) (Jaworski et al., 1963; Endoh et al., 2006), lactic acid (6.6 ± 0.1 mM; 47.4 ± 3.9%) (Kopp et al., 2019), gold nanoparticles (yields not stated) (Chauhan et al., 2011; Krishnan et al., 2016), (R)-3-hydroxybutyrate-CoA (32.87 ± 6.58 μM) (Kelwick et al., 2018), silver nanoparticles (yields not specified) (Costa Silva et al., 2017) and silk fibroin (yields not specified) (Greene et al., 1975; Lizardi et al., 1979). Poor cell-free production yields and efficiencies can be due to a variety of factors including rapid depletion of reaction energy mix components (e.g., ATP, amino acids), the formation of inhibitory waste products (e.g., inorganic phosphates) or unwanted side reactions that divert reaction fluxes away from desirable pathways (Caschera and Noireaux, 2014). Because of these limitations, cell-free synthetic biology may not be an ideal production method for some biological materials. Nevertheless, whilst actual cell-free material production yields can be relatively low, these approaches are still beneficial for prototyping different biosynthetic pathways, substrates or reaction conditions to boost both in vitro and whole-cell production yields. An exemplar is the use of cell-free assays to characterize polyhydroxyalkanoates (PHAs) biosynthetic pathways from phaCAB operons that also enhanced in vivo PHAs production (Kelwick et al., 2018). 
Furthermore, the same study also demonstrated that the cell-extract biotransformation of whey permeate into 3-hydroxybutyrate (3HB) could be simultaneously coupled with the cell-free protein synthesis of a potential Acetyl-CoA recycling enzyme (Kelwick et al., 2018). This highlights that combinatorial cell-free reaction formats can be a useful strategy for bioplastic pathway prototyping and optimization.

Interestingly, in some cases, cell-free bioproduction may actually be a more desirable manufacturing route. For instance, several in vitro gold or silver nanoparticle production studies reported desirable nanoparticle characteristics (e.g., size/zeta potential) and/or easier purification protocols within cell-free bioproduction reactions than whole-cell production methods (Krishnan et al., 2016; Costa Silva et al., 2017). Cell-free bioproduction can also be carried out at industrially relevant scales, as illustrated by Sutro Biopharma, who have developed a highly scalable good manufacturing practices (GMP)-compliant cell-free protein synthesis platform for producing therapeutic proteins within 100 L bioreactor reaction volumes (Zawada et al., 2011). For cell-free materials production, a highly efficient synthetic biochemistry module was developed to convert glucose into bio-based chemicals, including the PHA bioplastic monomer polyhydroxybutyrate (PHB) (Opgenorth et al., 2016). To achieve this, purified recombinant enzymes were used to reconstitute core elements of the pentose, bifido, glycolysis and PHB pathways (Opgenorth et al., 2016). Cell-free PHB production yields (40 g/L) and efficiencies (90%) were impressive and are promisingly close to industrially attractive scales (Opgenorth et al., 2016). These improvements in PHB production are also welcome since PHB, as well as other PHAs biopolymers, are biodegradable and can potentially be used as ‘drop-in’ replacements for oil-derived plastics (e.g., food packaging) or as biomaterials for tissue engineering (Choi S.Y. et al., 2019; Tarrahi et al., 2020). PHAs are also an interesting example because of their industrial importance and the diversity of cell-free strategies that have been applied to PHAs research (Table 1). 
Building upon these examples, CF-ME approaches could be used to explore a greater diversity of PHAs biopolymers given that PHA biopolymers can be composed of a variety of different wildtype and/or synthetic monomers (many different monomers exist) to create complex co-polymers, with an array of material characteristics (Choi S.Y. et al., 2019). Future cell-free synthetic biology explorations of PHAs are likely to unlock novel PHAs biopolymers with unique characteristics (Chen and Hajnal, 2015) and therefore, accelerate bioplastic materials development.

Microbial (in vivo) PHAs have been manufactured commercially at industrial scales over the last several decades. Unfortunately, the commercial impact of PHA-based bioplastics has historically been limited by their higher production costs compared with oil-derived plastics (Chen et al., 2020). However, more efficient PHAs production processes have been devised through the rational design of phaCAB biosynthetic pathways (Hiroe et al., 2012; Kelwick et al., 2015b; Li et al., 2016; Tao et al., 2017; Zhang X. et al., 2019), key metabolite recycling processes (e.g., Acetyl-CoA) (Matsumoto et al., 2013; Beckers et al., 2016), alternative microbial production hosts (e.g., Halomonas sp.; Tan et al., 2011) and the use of industrially sourced, low-cost feedstocks (e.g., whey permeate) (Wong and Lee, 1998; Ahn et al., 2000; Kim, 2000; Nikel et al., 2006; Cui et al., 2016; Nielsen et al., 2017). Interestingly, several of these microbial PHAs production strategies are also compatible with cell-free synthetic biology reactions. In particular, using locally sourced, low-cost feedstocks (e.g., whey permeate) may help to make cell-extract based PHAs production more economically viable (Kelwick et al., 2018). A similar approach has already been piloted for cell-free lactic acid production from spent coffee grounds (Kopp et al., 2019) and could become a generalized strategy for sustainable cell-free materials bioproduction (Figure 2). We would argue that combining cell-free extracts with local feedstocks enables immediate access to highly diverse cellular biochemistries and low-cost substrates (e.g., waste feedstocks) that could potentially be used for the sustainable biomanufacturing of a diverse array of biological materials (Yan and Fong, 2015; Le Feuvre and Scrutton, 2018). 
Furthermore, just as lyophilized cell-free reactions enable the on-demand production of biotherapeutics (Pardee et al., 2016b), we likewise envision that cell-free reactions might one day lead to rapid and distributed, on-demand biological materials production or bio-functionalization.

Figure 2. Sustainable cell-free biomanufacturing of biological materials. Schematic depicts the local, on-demand cell-free mediated, biomanufacturing of biological materials. Local feedstocks can potentially be utilized as replacements for expensive reaction energy mix components, or to provide the enzymatic co-factors and biosynthetic pathway substrates that are required to produce biological materials of interest.

Material & methods

Insect cultures, bacterial growth and female infection procedure

All insects used in this study were from stock cultures maintained in standard laboratory conditions (24 ± 2°C, 70% RH, in permanent darkness). They were provided ad libitum with bran flour and water, supplemented once a week with apple.

Females were primed with the Gram-positive Bacillus thuringiensis (Bt) and the Gram-negative Serratia entomophila (Se). These bacteria are known natural pathogens of T. molitor [30] and they induce contrasting immune priming responses within and between generations in this insect [15,16,19]. The bacteria were all obtained from the Pasteur Institute (Bt: CIP53.1; Se: CIP102919) and suspensions for immune priming were prepared as described in [15]. Briefly, the bacteria were grown overnight at 28°C in liquid broth medium (10 g bacto-tryptone, 5 g yeast extract, 10 g NaCl in 1000 mL of distilled water, pH 7). They were then inactivated in 0.5% formaldehyde prepared in PBS for 30 min and rinsed three times in PBS. Inactivation was tested by plating a sample of the bacterial solution on sterile broth medium with 1% of bacterial agar and incubating it at 28°C for 24 hr. Aliquots were kept at −20°C until use.

For all experiments, immune priming was performed on virgin females (10 ± 2 days post-emergence) by first chilling them on ice for 10 min to immobilize them and then injecting a 5-μL suspension of inactivated bacteria (Bt or Se) or of buffer only (procedural control for the effect of the injection). Injections were done through the pleural membrane between the second and third abdominal segments using sterile glass capillaries that had been pulled out to a fine point with an electrode puller (Narishige PC-10). For both injected bacteria, the concentration was adjusted to 10⁸ microorganisms per mL using a Neubauer improved cell counting chamber [16]. Immediately after their treatment, the females were paired with a virgin and immunologically naive male of the same age and allowed to produce eggs in a Petri dish supplied with wheat flour, apple and water in standard laboratory conditions (24°C, 70% RH, dark).
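The adjustment to 10⁸ microorganisms per mL with a counting chamber comes down to a short calculation, sketched here under stated assumptions: a chamber depth of 0.1 mm (the usual depth of an improved Neubauer chamber), a counted square of 1 mm², and hypothetical counts and dilution factor. None of these counting details are given in the text.

```python
def cells_per_ml(mean_count_per_square, square_area_mm2=1.0,
                 depth_mm=0.1, dilution_factor=1.0):
    """Concentration from a chamber count.

    The volume over one square is area x depth in mm^3 (= microliters);
    dividing by 1000 converts it to milliliters.
    """
    volume_ml = square_area_mm2 * depth_mm / 1000
    return mean_count_per_square * dilution_factor / volume_ml


def fold_dilution_needed(measured_per_ml, target_per_ml=1e8):
    """How many-fold the suspension must be diluted to reach the target."""
    return measured_per_ml / target_per_ml


def cells_injected(concentration_per_ml=1e8, injection_volume_ul=5.0):
    """Number of cells delivered by one injection."""
    return concentration_per_ml * injection_volume_ul / 1000


if __name__ == "__main__":
    # Hypothetical count: 250 cells per 1 mm^2 square in a 100-fold dilution.
    measured = cells_per_ml(250, dilution_factor=100)
    print(f"measured: {measured:.2e} per mL")
    print(f"dilute {fold_dilution_needed(measured):.1f}-fold to reach 1e8 per mL")
    print(f"cells per 5 uL injection: {cells_injected():.0f}")
```

With the values stated in the text (10⁸ per mL, 5 μL), each injection delivers about 5 × 10⁵ inactivated bacteria.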

De novo assembly of T. molitor Illumina transcriptome (RNAseq)

In the absence of a genome, a reference T. molitor transcriptomic database (Fig 1-1) was necessary to identify candidate proteins from the proteomic study. Because we aimed to generate a database enriched in transcripts involved in immune and stress responses, the sequencing was performed on pooled RNA from individuals of various developmental stages (third instar larvae, pupae, adults), sexes (males and females) and physiological conditions. More precisely, groups of 10 individuals of third instar larvae, adult males or adult females were either not treated, injected with B. thuringiensis or S. entomophila bacteria as described above, or injected with the drug phenobarbital (0.1%), known to induce a detoxification response in insects [69]. We also used RNA from 30 pupae (untreated because of their higher sensitivity to stress), resulting in a total of 150 individuals. RNA was extracted 48 h after priming using the TRIzol method (TRIzol LS Reagent, Invitrogen) according to the manufacturer's instructions.

Sequencing was carried out on an Illumina HiSeq2000 Genome Analyser platform using paired-end (2x100bp) read technology with RNA fragmented to an average of 380 nucleotides. Sequencing of two technical replicates was performed by Eurofins MWG Operon and resulted in a total of 70 and 51 million paired-end reads. Quality control measures, including the filtering of high-quality reads based on the quality score given in fastq files (FastQC, version 0.10.1), removal of reads containing primer/adaptor sequences and trimming of read length, were carried out using Trimmomatic (version 0.3.1 [70]). Reads with a Phred-like score <20 or a read length of less than 40 nucleotides were removed. After this quality filtering, the two technical replicates resulted in a total of 58 and 44 million reads that were pooled to obtain a reference transcriptome. The de novo transcriptome assembly was carried out using Trinity (version 2014/09/07 [71]) with k-mers sized 25, T = 50 and the Jaccard similarity coefficient (a Trinity option to reduce chimeric transcripts). Our de novo transcriptome contains 110,963 transcripts with an N50 (sequence length of the shortest contig at 50% of the total transcriptome length) of 1261 nucleotides and an ExN50 of 1135 nucleotides. We used Bowtie (version 0.12.9 [72]) to align reads to our transcriptome. Complete metrics can be found in Table 1. To reduce the number of transcripts we performed a super-assembly with TGICL (version 2.1 [73]) using default settings.
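The N50 statistic defined above can be computed directly from a list of contig lengths. The sketch below follows that definition (sort contigs from longest to shortest and report the length at which the running total first reaches half of the assembly length); the function name and the toy lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Length of the shortest contig in the set of longest contigs
    that together cover at least 50% of the total assembly length."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy contig lengths, for illustration only.
    toy_lengths = [2, 3, 4, 5, 6]
    print(n50(toy_lengths))
```

In practice the lengths would be read from the assembled FASTA file; the ExN50 reported above additionally weights transcripts by expression and is not covered by this sketch.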

Translation of the transcriptome was performed using FrameDP (version 1.2.0 [74]), and Uniprot (version of 29 April 2015) was used as the database for the machine learning phase of detection of the best genetic code. The resulting predicted proteins (45,505) were compared using BLASTP (version 2.2.30+, e-value of 10⁻⁵) against the Tr. castaneum proteome available at Uniprot (version of 14 April 2015). Functional proteome annotation was predicted using InterProScan (version 5 [75]) on the GALAXY-BBRIC INRA platform using the InterPro database (version 52). CEGMA analysis was used to validate the quality of the transcriptome-proteome [76]. Our reference transcriptome, its proteome and annotation were used as a resource for candidate genes and proteins in the following experiments. They are available for download on the IHPE laboratory website and on the NCBI database under the BioProject ID PRJNA646689 with SRA numbers SRR12235350 and SRR12235349.

Global egg proteome and AMPs analysis

Proteomic and peptidomic analyses were performed on eggs that were either laid or extracted directly from the ovaries of primed and control females. Two complementary experiments, 2D-DiGE (Fig 1-2a) and mass spectrometry (Fig 1-2b), were conducted to characterize the proteins and peptides, respectively, that were differentially abundant between eggs from primed females and eggs from control females.

Considering that T. molitor females were reported to protect their eggs through TGIP from day 2 to day 8 post-maternal priming [17], only eggs produced after the third day following the maternal immune treatment were collected (Fig 6A). Eggs were either collected directly from the ovaries or freshly laid in the Petri dish. In the latter case, laid eggs were always collected within 16 hours after laying to make sure that the embryo had not started its development. To make this possible, couples were transferred into a new Petri dish supplied with fresh flour, apple and water three days after the maternal treatment. Eggs produced in this new Petri dish were collected 16 hours after the couple was transferred (Fig 6A). At the moment of egg collection, females were chilled on ice for 10 min and then dissected in ice-cold PBS to remove the ovaries. Eggs isolated from ovaries were then rinsed in clean cold PBS to remove small pieces of sticky ovarian tissue, gently dried for a few seconds on sterile filter paper, immediately frozen in liquid nitrogen and stored at -80°C until use. Females used to collect laid eggs were not dissected. Their eggs were sieved and treated as above.

Egg sampling strategies for the proteomic approach (panel A), antimicrobial activity following the RNAi experiment (panel B), kinetics of gene expression by RT-qPCR (panel C) and antibacterial activity following priming of mothers (panel D) are indicated. A full description of the procedure for each experiment is given in the corresponding part of the Material and Methods section.

A two-dimensional difference gel electrophoresis (2D-DiGE) approach was used to qualitatively and quantitatively determine the differential abundance of proteins (of size ranging from 10 to 150 kDa) between conditions (eggs from primed females vs eggs from naïve females, eggs primed with Bt vs eggs primed with Se). 2D-DiGE uses direct labeling of proteins with fluorescent dyes (CyDyes: Cy2, Cy3, and Cy5) prior to their isoelectrofocusing to solve the known quantitation and reproducibility problem of 2D-Electrophoresis.

A total of 433 eggs were collected 3 days post-priming (70–74 per condition) from 117 different females (16–24 per condition) injected at 2 or 5–7 different dates for eggs in ovaries and freshly laid (sampled within 16 h post-laying), respectively. For each condition, eggs were pooled into 6 different biological replicates, each constituted of 11–17 eggs from 2–7 different females injected at the same date when possible. All information on female priming date, egg sampling and pooling is available in S6 Table. For each replicate, eggs were ground in UTTC buffer (urea, 7 M; thiourea, 2 M; Tris, 30 mM; CHAPS, 4%; pH 8.5) using a sterile pestle and incubated at RT for 2 hours. After centrifugation (5 min at 10,000 g), the supernatant was collected and the protein concentration was quantified using the 2D-Quant Kit (GE Healthcare) following manufacturer’s instructions, before being stored at -80°C until use. Fifty micrograms of proteins were labeled with either Cy3 or Cy5 while 50 μg of an equimolar pool of proteins from all extracts were labeled with Cy2 as an internal standard. The CyDye minimal labeling of the purified proteins was performed following manufacturer’s instructions (GE Healthcare). A dye swap was performed to ensure that the observed differences between the three different priming conditions for eggs laid and eggs in ovaries were not due to different labeling efficiencies of the dyes. The DiGE labeling setup is available in S6 Table. Labeled proteins were then mixed together and diluted to a final volume of 340 μL with rehydration buffer (urea, 7 M; thiourea, 2 M; CHAPS, 4%; DTT, 65 mM) containing 0.2% of Bio-Lyte 3/10 ampholyte (Bio-Rad). Isoelectrofocusing was performed as previously described [77]. Briefly, the sample was loaded on a 17 cm ReadyStrip IPG strip with a non-linear 3–10 pH gradient (Bio-Rad) for passive (5h) and active rehydration (14h at 50V). 
Focusing was performed using the following program: 50 V for 1 h, 250 V for 1 h, 8,000 V for 1 h, and a final step at 8,000 V for a total of 50,000 V.h with a slow ramping voltage (quadratically increasing voltage) at each step. Rehydration and focusing were both performed on a Protean IEF Cell system (Bio-Rad). After reduction with DTT and alkylation with iodoacetamide of the proteins, strips were loaded on top of a 12%/0.32% acrylamide/piperazine diacrylamide gel and run at 25 mA/gel for 30 min followed by 75 mA/gel for 8 h using a Protean II XL system (Bio-Rad). Protein standards (Unstained Precision Plus Protein Standards, Bio-Rad) were loaded on Whatman papers placed on the left side of the gels. Gels were scanned using a ChemiDoc MP Imaging System (Bio-Rad) associated with Image Lab software version 4.0.1 (Bio-Rad) using the blue (530/28 filter), green (605/50 filter) and red (695/55 filter) epi-illumination parameters for scanning Cy2, Cy3 and Cy5-labeled proteins, respectively. The qualitative and quantitative comparative analysis of digitized proteome maps was conducted using the image analysis software PDQuest 7.4.0 (Bio-Rad). Only spots that were significantly (p < 0.05, Mann-Whitney test) and at least 1.5-fold differentially abundant between conditions were considered. In order to identify the proteins in the different spots of interest, classical 2D-SDS-PAGE was conducted on each condition separately and gels were stained following a mass spectrometry-compatible silver staining procedure previously described in [77]. Spots were excised using a Onetouch Plus Spot Picker Disposable (Harvard Apparatus) equipped with specific 1.5-mm methanol-washed tips. For each spot, the protein in the gel plug was trypsin-digested and the digested peptides were analyzed with a nano-LC1200 system coupled to a Q-TOF 6550 mass spectrometer equipped with a nanospray source and an HPLC-chip cube interface (Agilent Technologies) as previously described [77]. 
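The spot selection rule stated above (p < 0.05 in a Mann-Whitney test and a 1.5-fold difference) can be sketched in a few lines. This is not the PDQuest implementation: it assumes a two-sided exact test without ties, takes fold change as the ratio of group means, and uses hypothetical normalized spot volumes for six replicates per condition.

```python
from itertools import combinations


def mann_whitney_exact_p(a, b):
    """Two-sided exact Mann-Whitney p-value (assumption: no tied values)."""
    pooled = sorted(a + b)
    if len(set(pooled)) != len(pooled):
        raise ValueError("ties are not handled in this sketch")
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n_a, n_total = len(a), len(pooled)
    expected = n_a * (n_total + 1) / 2          # mean rank sum under the null
    deviation = abs(sum(rank[v] for v in a) - expected)
    extreme = total = 0
    # Enumerate every possible assignment of ranks to group a.
    for combo in combinations(range(1, n_total + 1), n_a):
        total += 1
        if abs(sum(combo) - expected) >= deviation:
            extreme += 1
    return extreme / total


def fold_change(a, b):
    """Ratio of group means, expressed as a value >= 1 in either direction."""
    ratio = (sum(a) / len(a)) / (sum(b) / len(b))
    return max(ratio, 1 / ratio)


def is_differential(a, b, min_fold=1.5, alpha=0.05):
    """Apply both criteria: fold change and Mann-Whitney significance."""
    return fold_change(a, b) >= min_fold and mann_whitney_exact_p(a, b) < alpha


if __name__ == "__main__":
    # Hypothetical normalized spot volumes, six biological replicates each.
    control = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
    primed = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5]
    print(is_differential(control, primed))
```

With six replicates per group, full enumeration covers only 924 rank assignments, so the exact test is cheap; larger designs would normally use a statistics library instead.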
Protein identification was performed by extracting the peak lists and comparing them with the T. molitor translated transcriptome database using the PEAKS studio 7.5 proteomics workbench (Bioinformatics Solutions Inc., build 20150615). The searches were performed with the following specific parameters: enzyme specificity, trypsin; three missed cleavages permitted; fixed modification, carbamidomethylation (C); variable modifications, oxidation (M), pyro-glu from E and Q; monoisotopic mass tolerance for precursor ions, 20 ppm; mass tolerance for fragment ions, 50 ppm; MS scan mode, quadrupole; and MS/MS scan mode, time of flight. Only significant hits with a false discovery rate (FDR) ≤ 1% for peptides and a protein cutoff of −logP ≥ 13 (corresponding to a p-value of 0.05) and a number of unique peptides ≥ 3 were considered for top hits. To ensure a proper identification of the proteins, a BLAST search against the NCBI nr database was performed and the conserved domains of the sequence were retrieved using the NCBI CD-search available at [78]. For each protein, pI and molecular mass were calculated with the ExPASy Compute pI/Mw tool.
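Two of the thresholds quoted above are easy to check numerically: the stated correspondence between a score of 13 and a p-value of 0.05, and the ppm mass tolerances. The sketch below assumes the score is −10·log10(p), which is consistent with that correspondence; the m/z values are hypothetical and the function names are illustrative.

```python
import math


def score_from_p_value(p):
    """Score under the assumed convention score = -10 * log10(p)."""
    return -10 * math.log10(p)


def p_value_from_score(score):
    """Inverse of score_from_p_value."""
    return 10 ** (-score / 10)


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tolerance_ppm):
    """True if the absolute mass error does not exceed the tolerance."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tolerance_ppm


if __name__ == "__main__":
    print(f"score for p = 0.05: {score_from_p_value(0.05):.2f}")
    # Hypothetical precursor at m/z 500.005 against a theoretical 500.000,
    # checked with the 20 ppm precursor tolerance quoted in the text.
    print(within_tolerance(500.005, 500.0, 20))
```

The same `within_tolerance` check applies to fragment ions with the 50 ppm tolerance given in the search parameters.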

A limitation of the 2D-DiGE approach is its inability to reliably reveal small proteins (<10 kDa) and peptides, which include the AMPs whose role in TGIP has already been demonstrated [8,16,21,23,27,29,79,80]. We investigated their abundance in eggs freshly laid (sampled within 16 h post-laying) by mothers 3 days after Bt and Se priming compared to PBS-primed ones. Two replicates consisting of 51 and 31 eggs and 49 and 43 eggs were prepared for the control and B. thuringiensis conditions, respectively. Only 52 eggs were available from S. entomophila-primed mothers and they constituted the only replicate for this condition. Eggs originated from a total of 54 different females primed at 8 different dates (S6 Table). Due to the limited number of replicates, only qualitative differences were extrapolated from the data to identify candidate proteins. We first incubated each egg pool in trifluoroacetic acid (TFA) 0.1% to perform an acidic extraction of peptides, which were then purified using a Sep-Pak C18 Plus Light Cartridge (Waters) following manufacturer’s instructions. Peptides were eluted in 5 mL acetonitrile 60% / TFA 0.1% and lyophilized overnight. Protein concentration was estimated from absorbance at 205 nm using a NanoDrop One Spectrophotometer (Thermo Fisher Scientific), indicating that we extracted approx. 60 μg of protein per egg sample. Peptides were then suspended in 50 μL of acetonitrile 2% / TFA 0.1% and their quality was first assessed by MALDI-TOF (Matrix Assisted Laser Desorption Ionization—Time of Flight) fingerprinting with the HCCA Biotyper matrix (Bruker Daltonic, Germany) mixed with three different dilutions for each sample. MALDI analysis was run on an Autoflex III Smartbeam (Bruker) controlled by the Bruker Compass 1.4 software. 
Once validated, samples were diluted in ammonium bicarbonate (ABC) 100 mM, reduced by adding DTT (11 mM final concentration, incubation 30 min at 56°C), and alkylated with iodoacetamide (34 mM final, 1h at room temperature, in the dark). Samples were then acidified with diluted TFA, centrifuged 10 min 15,000 g at 4°C, and the supernatant volume was reduced by speed-vacuum. The supernatants were then analyzed by top-down LC-MS/MS: one-tenth of each sample was injected on an Ultimate 3000 nano-HPLC system equipped with a 75-μm C18 column and connected to a Q-Exactive Orbitrap high resolution mass spectrometer operated in data-dependent acquisition positive mode (all from Thermo Scientific, Germany). Components, solvents, and operating parameters for nano-LC-MS/MS analysis were as described by [81]. Spectra from the acquired MS/MS files were matched to the sequences of the database of T. molitor proteins we built for this study, using the software Proteome Discoverer (version 1.4) set with the following parameters: no enzyme and 12 / 144 amino acids as minimum / maximum peptide lengths, respectively, tolerance of 10 ppm / 0.02 Da for precursors and fragment ions, carbamidomethylation of cysteine set as a fixed modification, C-terminal protein amidation, and methionine and tryptophan oxidation set as variable modifications, analysis in high-confidence mode (false discovery rate 1%). The Sequest HT algorithm implemented in Proteome Discoverer software was used for the database search. A short list of fifteen known or candidate AMPs proteins was extracted from this database, by screening the annotations obtained with the transcriptomic database and by a literature search. High scores indicate both a reliable identification and an abundance of matching peptides, potentially reflecting a high abundance of the related protein. Low scores, usually below 20, rather suggest a weak identification of the protein and/or a low abundance in the sample. 
For readability, only scores higher than 20 are highlighted in Table 3, but all scores (low and high) are discussed. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD018772.

Functional invalidation of candidate immune genes using RNA interference

This experiment aimed to test the contribution of candidate AMP genes to egg protection through TGIP by knocking down their expression using RNAi technology (Fig 1-3). Despite extensive investigation efforts, large individual variation in AMP expression levels prevented the observation of a firm effect of injecting a single AMP dsRNA on the knock-down of AMP expression in mothers. Efforts to reduce variability included using the non-intrusive delivery method of dsRNA ingestion [60,61], testing the effect of a single or double exposure to dsRNA, and testing several time points after exposure to dsRNA. None of these trials decreased the variation in AMP expression levels in mothers or produced a significant effect on egg antimicrobial activity. Of note, it has been shown in the Drosophila model that the knock-down of single TEP (thioester-containing protein) genes with redundant functions was insufficient to significantly impact the insect phenotype, and that a change in resistance to microbial infection was only observed upon the simultaneous knock-down of all 4 TEP genes [82]. Therefore, we decided to inject females with a cocktail of dsRNAs targeting different candidate AMP genes to maximize the potential phenotypic effect on the eggs. For this purpose, on day 1, 10-day post-emergence females were chilled on ice for 10 min and then injected with 5 μL of either i) PBS as a control for the injection procedure, ii) GFP-dsRNA (0.6 μg/μL) as a control for non-relevant dsRNA, or iii) a cocktail of dsRNA from tenecins 1 and 4, coleoptericins A and B, and attacin-2 (0.12 μg/μL each, 0.6 μg/μL total dsRNA). Double-stranded RNAs (dsRNA) were synthesized by in vitro transcription using the MEGAscript T7 kit (Ambion, UK) according to the manufacturer’s recommendations. 
Regions of approximately 200 bp were synthesized from tenecin-1 (GenBank accession number Q27023), tenecin-4 (GenBank accession number AB669089), coleoptericin-A (GenBank accession number KF957599.1), coleoptericin-B (GenBank accession number KF957600.1), and attacin-2 (GenBank accession number MF754108.1). The non-relevant dsRNA used as a negative control consisted of a 208 bp portion of the jellyfish green fluorescent protein (gfp5) RNA (GenBank accession number L29345).

Females were individually placed in a 12-well plate containing bran flour and a 10 μL drop of gelose for hydration (1% agar-agar, 10% sucrose) until day 3. On day 3, all females were immune primed by injection of a 5 μL suspension of inactivated B. thuringiensis ($10^{8}$ bacteria/mL) as described above (Fig 6B). On day 4, females were injected again with either i) PBS as a control for the injection procedure, ii) GFP-dsRNA (0.6 μg/μL) as a control for non-relevant dsRNA, or iii) a cocktail of dsRNA from tenecins 1 and 4, coleoptericins A and B, and attacin-2, and then individually transferred to 55 mm diameter Petri dishes containing one 15-day-old male, bran flour and gelose for hydration (Fig 6B). Couples were maintained in Petri dishes to allow reproduction until day 7, when females were separated from males and placed in fresh individual Petri dishes containing bran flour and gelose for egg production. Females were removed from the Petri dishes on day 9, and eggs laid 6 days after immune priming were maintained in Petri dishes for 3–4 days of maturation. Eggs (3–4 days old) were isolated by sieving the flour (mesh size 600 μm) [31] and immediately frozen in liquid nitrogen for future analysis of their antimicrobial activity (Fig 6B). A total of 33 females, leading to the laying of 123 eggs, were used for each treatment (control, GFP-dsRNA and AMPs-dsRNA).

Temporal dynamics of candidate immune genes expression

To determine the temporal dynamics of the TGIP response in eggs of primed mothers, we assessed, in both females and eggs, the production of transcripts of the candidate immune genes found to be differentially expressed at the proteomic level in the previous experiments (Fig 1-4). We specifically wanted to quantify the production of these transcripts in both females and eggs at days 1, 5 and 12 post-priming, which correspond to eggs before protection, at maximal protection and after protection has ended, respectively [17] (Fig 6C). In addition, as antibacterial protection in maternally primed eggs varies with time post-oviposition (egg age) according to the pathogen that primed the mother [19], we further assessed the effect of egg age on the amount of transcripts (Fig 6C). For this purpose, groups of virgin females 10 days (± 2 days) post-emergence were, as explained above, immune primed by injection with B. thuringiensis (N = 44) or S. entomophila (N = 47) or injected with PBS as control (N = 43). Each female was then paired with an immunologically naïve virgin male of the same age immediately after the priming injection and allowed to produce eggs. Within each immune treatment, a minimum of 9 couples were randomly sacrificed at days 1, 5 and 12 post-priming, after having provided 6 eggs. The remaining couples were transferred into a new Petri dish the day before their sacrifice to control for egg age. Immediately after sacrifice, each female was stored in liquid nitrogen, and 2 of her eggs were also frozen and stored in liquid nitrogen. The remaining eggs (2 for each time-point) were allowed to age for 3 days or 7 days post-oviposition before storage in liquid nitrogen. Hence, within each egg laying sequence, each female contributed eggs that were allowed to age for 1, 3 or 7 days post-oviposition before being frozen in liquid nitrogen for later examination. 
Details of the females, number of eggs used and dates of injection are available in S6 Table.

Total RNA was extracted from both adult females and eggs with the Direct-zol RNA MiniPrep kit (Zymo Research, Irvine, CA, USA). Briefly, frozen eggs or adult females were lysed with a tissue grinder in TRIzol reagent (Life Technologies, Carlsbad, CA, USA). RNA was then purified according to the manufacturer’s instructions, including the optional in-column DNase I treatment, and stored at -80°C. RNA concentration and purity were checked by absorbance measurement using an Epoch spectrophotometer (Biotek, Winooski, VT, USA). Reverse transcription of RNA into cDNA was performed with the Maxima H minus first strand cDNA synthesis kit (Thermo Fisher Scientific, Waltham, MA, USA) using random hexamers according to the manufacturer’s instructions. Depending on the quantity of RNA extracted, 1 μg was used for RT when possible (from adult females and older eggs), or a smaller quantity otherwise (from younger eggs).

Quantitative PCR analyses were performed at the “qPCR Haut Débit (qPHD) Montpellier GenomiX” platform using the Labcyte Echo 525 liquid handler for pre-PCR preparation and the LightCycler 480 System (Roche, Basel, Switzerland) for PCR runs. PCR reactions were performed in a 1.5 μL total volume comprising 0.5 μL of cDNA (diluted 1:80 or 1:20 with ultrapure water for adult mothers or eggs, respectively), 0.75 μL of No Rox SYBR MasterMix dTTP Blue (Takyon Eurogentec, Liege, Belgium), and 100 nM of each primer. PCR amplification efficiencies were established for each target and housekeeping gene by calibration curves using twofold serial dilutions of cDNA (from 1:20 to 1:2580) in triplicate. Amplification efficiencies were calculated from the slope of the log-linear portion of the calibration curves by the LightCycler 480 Software release 1.5 (Roche). Only primer pairs with an amplification efficiency of 2 were retained. All details about the primers used are reported in S7 Table. The cycling program was as follows: a denaturation step at 95°C for 3 min, then 45 cycles of amplification (denaturation at 95°C for 10 s, annealing and elongation at 60°C for 45 s). Quantitative PCR ended with a melting curve step from 65 to 97°C with a heating rate of 0.11°C/s and continuous fluorescence measurement. For each condition, PCR experiments used 6–18 biological replicates of adult females and 3–4 biological replicates of eggs (pools of 6 eggs from 3 different females, 2 eggs per female), in addition to three technical replicates. The mean value of Ct was calculated. Corrected melting curves were checked using the Tm-calling method of the LightCycler 480 Software release 1.5. The relative expression of each gene was calculated with the ΔΔCt method, as all primer pairs (target and housekeeping genes) presented the same PCR amplification efficiency. 
For each target gene, the ΔCt was calculated with respect to the mean value of two reference genes coding for 18S and 28S ribosomal RNA. For adult mothers, the ΔΔCt values were calculated with respect to the ΔCt values of control mothers (PBS-injected) sacrificed 1 day post-injection. For the eggs, the ΔΔCt values were calculated independently for each laying day, with respect to the ΔCt values of eggs laid by PBS-injected mothers. Relative expression values (R) of genes between different conditions were calculated according to the formula $R = 2^{-\Delta\Delta C_t}$ [83].
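The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names and the Ct values in the usage note are hypothetical, and the slope-to-efficiency conversion is the standard calibration-curve relation, assumed here to match what the LightCycler software reports.

```python
from statistics import mean


def efficiency_from_slope(slope):
    """Amplification factor per cycle from the slope of a Ct vs log10(dilution)
    calibration curve (standard relation; a perfect doubling gives 2)."""
    return 10 ** (-1.0 / slope)


def delta_ct(ct_target, ct_references):
    """Delta-Ct: target Ct minus the mean Ct of the reference genes
    (here, assumed to be the 18S and 28S rRNA genes)."""
    return ct_target - mean(ct_references)


def relative_expression(ct_target_treated, ct_refs_treated,
                        ct_target_control, ct_refs_control):
    """R = 2^(-ddCt), valid only when all primer pairs amplify with factor 2."""
    ddct = (delta_ct(ct_target_treated, ct_refs_treated)
            - delta_ct(ct_target_control, ct_refs_control))
    return 2 ** (-ddct)
```

For example, with hypothetical Ct values, `relative_expression(22.0, [10.0, 12.0], 24.0, [10.0, 12.0])` gives a ΔΔCt of −2 and therefore a fourfold higher expression in the treated sample than in the control.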

Antibacterial assay

Antibacterial activity in the eggs, used to check egg protection at the phenotypic level, was measured using a standard diffusion zone assay [16,17,19]. On the one hand, this assay allowed us to test the effect of RNAi invalidation on egg antibacterial activity in the experiment aiming to invalidate the expression of candidate AMPs in eggs laid 5 days post maternal priming and 3 days post-oviposition (Fig 6B). On the other hand, it was used in the experiment linking the antibacterial activity of eggs from naïve, PBS-, B. thuringiensis- and S. entomophila-treated mothers, laid at 1 and 5 days post-priming and 7 days post-oviposition (Fig 6D), with the quantification of candidate AMP gene transcripts in eggs from similarly treated mothers (Fig 6C).

Individual egg samples were thawed on ice, and egg extracts were prepared by mashing eggs into an acetic acid solution (0.05%, 5 mL per egg). After centrifugation (3,500 g, 2 min, 4°C), 2 mL of the supernatant was used to measure antibacterial activity on zone of inhibition plates seeded with the bacterium Arthrobacter globiformis, obtained from the Pasteur Institute (CIP105365). An overnight culture of the bacterium was added to Broth medium containing 1% agar to achieve a final concentration of $10^{5}$ cells per mL. Six milliliters of this seeded medium was then poured into a Petri dish and allowed to solidify. Sample wells were made using a Pasteur pipette fitted with a ball pump. Plates were then incubated overnight at 28°C. The diameter of inhibition zones was then measured for each sample.

Statistical analyses

The sizes of egg zones of inhibition following RNA interference treatments were tested for normality using the Shapiro-Wilk test and compared using Student's t-test (p<0.05), whereas the proportions of eggs exhibiting antibacterial activity according to RNA interference treatment were compared using Fisher exact tests (p<0.05). The presence of antibacterial activity among the eggs laid by control, PBS-, B. thuringiensis- and S. entomophila-treated females at 1 day or 5 days post-priming was analyzed using binomial logistic regressions.
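As an illustration of the Fisher exact test used for the proportions of eggs with antibacterial activity, the sketch below computes a two-sided p-value for a 2×2 table from the hypergeometric distribution, using only the Python standard library. The function name and the counts in the usage note are hypothetical and are not data from the study, which used SPSS rather than this code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows could be treatments (e.g. dsGFP vs dsAMP) and columns the number of
    eggs with / without a zone of inhibition. The p-value sums the
    probabilities of all tables that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x "active" eggs in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(low, high + 1)
                if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, total)
```

With hypothetical counts of 3 active and 0 inactive eggs in one group against 0 active and 3 inactive in the other, `fisher_exact_two_sided(3, 0, 0, 3)` returns 0.1, which shows how little power the test has with very small samples.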

Expression of AMPs measured by RT-qPCR following RNAi treatment was compared between dsAMP- and dsGFP-treated mothers using a Mann-Whitney test, because the data were not normally distributed according to the Shapiro-Wilk test (p<0.01).

The relative expression of the 11 candidate immune genes measured in adult females primed with B. thuringiensis or S. entomophila, and in their 7-day-old eggs laid 1, 5 and 12 days post-priming, relative to PBS-injected control females, was analyzed using Mann-Whitney tests (p<0.05), following normality verification using the Shapiro-Wilk test. Analyses were made using IBM® SPSS® Statistics 19 software.

Sensitivity Analysis for $k_{\max}^{\mathrm{vivo}}$ Values

Flux Variability Analysis.

FBA requires an objective function to optimize, and there are various alternatives (e.g., biomass, growth rate, ATP yield, etc.). The pFBA minimizes the total sum of fluxes and was shown to agree well with measured fluxes and expression data in E. coli (63, 65). In the parsimonious algorithm, the solution space is very small. Nevertheless, flux can still deviate considerably for a few reactions and, thus, affect the value of their $k_{\max}^{\mathrm{vivo}}$. To estimate the variability of our pFBA solution, we perform Flux Variability Analysis (FVA) for all reactions with measured $k_{\max}^{\mathrm{vivo}}$ values. For this we calculate the minimum flux ($v_{\min}$) and maximum flux ($v_{\max}$) supported by each reaction in the condition in which $k_{\max}^{\mathrm{vivo}}$ was obtained. We constrain the model by a fixed total sum of fluxes (from pFBA) and fixed growth rate and environmental conditions (from measurements). For each reaction, we get a range for $k_{\max}^{\mathrm{vivo}}$:

$$\text{range of } k_{\max}^{\mathrm{vivo}} = \left[ \frac{v_{\min}}{E},\ \frac{v_{\max}}{E} \right] \qquad \text{[S1]}$$

Fig. S2 shows the range of $k_{\max}^{\mathrm{vivo}}$ relative to the value obtained from the original pFBA solution. pdxH is the only enzyme that shows significant variability. Because E. coli has an alternative anaerobic enzyme for this reaction (71), the $v_{\min}$ through pdxH can go down to zero (i.e., when only the anaerobic enzyme is used), resulting in large variability in the $k_{\max}^{\mathrm{vivo}}$ of pdxH. Because we only use aerobic conditions in our analysis, the suggested variability is probably irrelevant.
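The range in Eq. S1 can be illustrated with a short sketch. It assumes that E denotes the measured enzyme abundance and that the FVA bounds have already been obtained from a constraint-based model; the function names, units and numbers are hypothetical, and no FBA solver is invoked here.

```python
def kmax_vivo_range(v_min, v_max, enzyme_abundance):
    """Lower and upper k_max_vivo estimates from FVA flux bounds.

    v_min, v_max: flux bounds (e.g. mmol gCDW^-1 h^-1), assumed to come from FVA.
    enzyme_abundance: E in matching units (e.g. mmol gCDW^-1), assumed measured.
    """
    if enzyme_abundance <= 0:
        raise ValueError("enzyme abundance must be positive")
    return (v_min / enzyme_abundance, v_max / enzyme_abundance)


def range_relative_to_pfba(v_min, v_max, v_pfba):
    """Bounds expressed relative to the pFBA flux; E cancels out of the ratio."""
    if v_pfba == 0:
        raise ValueError("pFBA flux must be non-zero")
    return (v_min / v_pfba, v_max / v_pfba)
```

For a reaction whose minimum flux can drop to zero, as described for pdxH, `kmax_vivo_range(0.0, 2.0, 0.5)` returns `(0.0, 4.0)`, so the lower estimate collapses to zero while the upper estimate is unchanged.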

The range of $k_{\max}^{\mathrm{vivo}}$ relative to $k_{\mathrm{cat}}$ measurements. Flux variability analysis was performed for the pFBA solution for all reactions (N = 132). Data points correspond to Fig. 2 in the main text. The y = x line is shown in black; the dashed brown line represents the best fit by orthogonal regression in $\log_{10}$ scale; error bars (typically so small as to be within the size of the points themselves) represent the range between the upper and lower $k_{\max}^{\mathrm{vivo}}$ estimates.

ATP Requirement.

The nongrowth related ATP requirement (also known as maintenance ATP requirement) in the iJO1366 model is set to support 3.15 mmol gCDW$^{-1}$ h$^{-1}$ of ATP, based on experimental data (23). This value, however, may change across conditions. Because our analysis of enzyme catalytic rates is performed across many conditions, we performed a sensitivity analysis by calculating $k_{\max}^{\mathrm{vivo}}$ values given 0% or 200% of the flux through the ATP maintenance reaction in the model, and found a negligible effect on $k_{\max}^{\mathrm{vivo}}$ values, as can be seen in Fig. S3.

Nongrowth related ATP requirement has negligible effect on $k_{\max}^{\mathrm{vivo}}$ values. (A) $k_{\max}^{\mathrm{vivo}}$ values given a flux of 6.30 mmol gCDW$^{-1}$ h$^{-1}$ (200% of the reported maintenance value, x axis) compared with 3.15 mmol gCDW$^{-1}$ h$^{-1}$ (y axis) through the ATP maintenance reaction. (B) $k_{\max}^{\mathrm{vivo}}$ values given a flux of zero (x axis) compared with 3.15 mmol gCDW$^{-1}$ h$^{-1}$ (y axis) through the ATP maintenance reaction. The blue line indicates the y = x line.

Results and discussion

Sequencing and assembly

The input data for the whole-genome assembly included five libraries (see Table S1). Most of the data were sequences from the ends of short (400 bp) fragments. We also generated three longer fragment libraries of 3, 10, and 12.5 kbp that assisted greatly in linking contigs into scaffolds. After trimming to remove low-quality data, the data set contained 248 million pairs, or about 500 million reads of 151 bp. Given an estimated genome size of 606 Mbp, this represented approximately 120× coverage of the genome. The total length of the assembled genome was 667 Mb, about 10% or 24% longer than the genome size estimated using flow cytometry (606 Mb) (Horjales et al., 2003) or K-mer depth distribution of sequenced reads (538–540 Mb; Figure S1), respectively. Both MaSuRCA and SOAPdenovo2 produced good assemblies, but the MaSuRCA assembly had an N50 scaffold size of 366 541 bp versus 355 832 bp for SOAPdenovo2. As a quality check, we aligned reads from one of the paired-end libraries back to both assemblies and about 12% more pairs of reads aligned consistently – at the right distance and orientation – to the MaSuRCA assembly. We therefore chose the MaSuRCA assembly as the basis for annotation and subsequent analysis. However, the SOAPdenovo2 assembly was superior for the mitochondrion and chloroplast: it assembled the chloroplast into a single molecule and the mitochondrion into seven contigs. We used these as the final organelle assemblies and removed all scaffolds from the MaSuRCA assembly that matched the organelles. The initial assembly had an N50 contig size of 44 398 bp and an N50 scaffold size of 366 541 bp, based on an estimated genome size of 606 Mbp (Table 2). AMOScmp identified 5426 overlapping scaffolds that it merged into 237 larger scaffolds. In the scaffolding step using the assembled transcriptome, 2397 scaffolds were linked into 1063 scaffolds based on transcript alignments. 
The final assembly contained 27 032 scaffolds over 1000 bp, with a total length of 667 Mbp and an N50 size of 464 955 bp (based on a genome size of 606 Mbp). The total of assembled contigs was 687 Mb, with an N50 length of 46 148 bp and a total of 221 640 contigs. The walnut physical map (Luo et al., 2015 ) is a potential supplemental resource for this whole genome shotgun (WGS) assembly. In total, 26 596 bacterial artificial chromosome (BAC) end sequences associated with J. regia BioSample SAMN00188677 were obtained from NCBI. Of these, 22 050 were identified with scaffolds in the J. regia physical map. In total 20 821 of these physically mapped BAC end sequences uniquely aligned to 4005 scaffolds in the draft assembly. In addition, 145 BAC ends aligned to more than one scaffold. The combined length, representing a unique assignment of BAC ends to scaffolds, of the target scaffolds totaled 546 Mbp (File S1), a majority (79%) of the sequence in our WGS assembly.
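The coverage and N50 figures quoted above follow from simple arithmetic, sketched below. The function names are hypothetical; passing a fixed genome size instead of the summed sequence lengths gives the genome-size-based variant of N50 that the text describes (often called NG50). This is not the code used by the assemblers.

```python
def sequencing_coverage(n_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases over genome size."""
    return n_reads * read_length / genome_size


def n50(lengths, genome_size=None):
    """Length L such that sequences of length >= L contain at least half of
    the reference total.

    The reference total is the sum of the lengths (classic N50) or, when
    genome_size is given, the estimated genome size. Returns None when the
    sequences never reach half of the reference total.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return None
```

Using the numbers in the text, `sequencing_coverage(2 * 248_000_000, 151, 606_000_000)` is roughly 124, in line with the reported approximate 120× coverage. On a toy set of lengths, `n50([2, 3, 4, 5, 6])` returns 5, while `n50([2, 3, 4, 5, 6], genome_size=30)` returns 4 because more sequences are needed to reach half of the larger reference size.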

Estimated genome size (Mb) 606
Chromosome number (2n) 32
Total size of assembled contigs (Mb) 687
Number of contigs 221 640
Largest contig (bp) 603 060
N50 length (contig) (bp) 46 148
Number of scaffolds 186 636
Total size of assembled scaffolds (Mb) 712
Number of scaffolds >1000 bp 27 032
Total size of assembled scaffolds (>1000 bp) (Mb) 667
N50 length (scaffolds) 464 955
Longest scaffolds (Mb) 3
Number of gaps 35 004
Mean gaps length (bp) 712
Total size of gaps (Mb) 25
GC content (%) 37

Assembly validation

Our validation strategy using PacBio reads aligned to the assembly showed that about 88% of the assembly was covered by PacBio reads, at an average identity of around 80.2%. We also assessed the linking potential of the PacBio reads by aligning subsets of 1000 reads to the current assembly using three different aligners (nucmer, bwa and blasr) and counting how many PacBio reads aligned to more than one scaffold. The results varied significantly among aligners (Table S2). Overall, between 0.18 and 1.35% of the PacBio reads had linking potential.

Transcriptome sequencing and assembly

Over one billion (1 062 838 572) paired-end reads were sequenced across 19 libraries. After quality control and trimming, the number of reads was reduced to 978 128 921. Trinity assembled 132 041 772 bp into 111 944 transcripts with a mean length of 1180 bp and an N50 of 1833. Of these transcripts, 78 645 were unique ‘genes’. Detection of true open reading frames (ORFs) and validation techniques produced 35 836 ORFs (Table 3), 16 594 of which were full-length (46%). The N50 for full-length sequences is 1449. The genome was scaffolded using 29 785 partial and full-length ORFs (representing unique ORFs) and further analyzed.

Total reads 1 062 838 572
Total reads after QC 978 128 921
Total no. of transcripts 111 944
Total no. of Trinity genes 78 645
N50 (all) 1833
All contigs, median length 750
All contigs, mean 1180
All contigs, assembled bases 132 041 772
Total no. of full-length sequences 16 594
Total no. of ORF sequences 35 836
Total no. of selected frame sequences 29 785
Total no. of non-contaminated sequences 28 932
Total no. of annotated sequences 25 373
Total no. of sequences without annotation 4140
Total no. of uninformative hits (unknown function) 9379

Genome annotation

We annotated the assembled genome with the semi-automated genome annotation pipeline MAKER-P using expressed sequence tag (EST) and protein sequences from related species (Table S3) and the assembled J. regia transcriptome, which yielded 32 498 gene models. A total of 5172 scaffolds ≥5000 nt long were annotated with at least one gene model (range 1–325 annotations, average 6.28). Several gene features were explored to validate the MAKER models: completeness (canonical start and stop codons), transcriptome support, protein evidence (similarity to Ricinus communis, Vitis vinifera or Cucumis sativus proteins) and recognizable protein domains (PfamA and/or the Panther database) (Table 4). We discarded two mono-exonic genes with domains associated with retroelements. Based on these features, we classified the remaining 32 496 gene models into three categories: (i) high-quality, full-length genes (16 852 sequences) that are multiexonic or monoexonic, complete, supported by expression data or protein evidence and annotated with at least one protein domain; (ii) high-quality partial genes (8782 sequences) that are multiexonic, partial, supported by expression data or protein evidence and annotated with at least one protein domain; and (iii) low-quality genes (6862 sequences) that are multiexonic or monoexonic, complete or partial, and either not supported by expression data or protein evidence or not annotated with a protein domain.

Multi-exonic Mono-exonic Total genes
Number of genes 26 249 6247 32 496
Complete (contain canonical start/stop codons) 15 555 5964 21 519
Supported by walnut RNA sequencing data 19 884 3982 23 866
Contain Pfam and/or Panther domain/s 21 516 4332 25 848
Supported by protein evidence 25 772 6240 32 012

The genomic features were similar to those of other sequenced plant genomes (Table 5). Annotated protein-coding regions were only about 5% of the genome and about 28% of the total genic space, due to the longer length of introns (Table S4). Most intron/exon junctions were GT-AG canonical splice junctions (Table S5). Interestingly, the next most frequent junctions were GC, AT, AC, resembling proportions of alternative non-canonical splice junctions GC-AG and AT-AC described in other plant species (note that %AG-3′ – %GT-5′ = 0.0573% is similar to %GC-5′ = 0.0516% and %AT-5′ = 0.0072% is similar to %AC-3′ = 0.0079%) (Sparks and Brendel, 2005 ).

Juglans regia Pyrus communis Pyrus bretschneideri Fragaria vesca Malus × domestica Vitis vinifera Citrus sinensis
Predicted genes 32 496 43 419 42 812 34 809 54 921 30 434 29 445
Average gene length (bp) (including introns) 4358 3320 2776 2792 2802 3399 3041.21
Average CDS length (bp) 1222 1209 1172 1160 1155 1255
Number of exons 172 273 221 804 202 169 174 375 273 226 149 351 258 142
Average exon length (bp) 229.5 237 248 232 273 130 312
Number of single-exon genes 6249 10 909 12 310 5915 10 378
Number of introns 139 775 178 385 159 357 139 567 218 353 118 917 213 755
Introns per gene (multi-exon) 5.3 5.49 5.22 4.83 4.9 5.8
Average intron length (bp) 730 398 386 407 491 213 359

The completeness of the MAKER gene models was evaluated with CEGMA (Table S6). Coverage of the 248 ultraconserved set of core eukaryotic genes (CEG) was 82% (complete coverage) and 99% (partial coverage). Similar analysis of the transcriptome used for annotation (28 932) and all identified ORFs (35 836) showed a slightly lower coverage, suggesting that MAKER marginally increased the coverage provided by the transcriptome assembly. We analyzed the contribution of the different datasets (plant proteins and de novo transcriptome) to the final MAKER models (Figure 2). Among high-quality MAKER models (16 852), all were supported by protein evidence from other plant species and about 80% by the transcriptome assembly (Figure 2). Among fully covered MAKER models, the support from protein evidence is reduced but not the transcriptome support, for which more models are completely covered. When 100% coverage was required for both transcript and MAKER models, 35% of models are covered by a full-length transcript.

Analysis of evidence for MAKER models.

Percentage of full-length high-quality MAKER models (16 852) covered by protein (external evidence, green) or RNA sequencing (red) at any coverage, 100% (cv100) and 100% coverage of both the model and the transcript (cv100 both).

Of the MAKER models, 6959 (41.29%; Figure 2) were fully covered by assembled transcripts and 12 209 transcripts were fully covered by MAKER models, indicating that the gene models are longer than the transcripts and that MAKER can ‘extend’ the information provided by the expression data. This is consistent with the observed average coding sequence length (MAKER models) of 1222 nt and average transcript length (RNA-seq) of 965 nt. Of the 32 496 gene models, we annotated 30 843 with a plant protein using databases based on enTAP results. Vitis vinifera, R. communis and Arabidopsis thaliana contributed most to the functional annotation (Figure S2). There were 4716 models with uninformative hits (annotated as hypothetical, predicted or otherwise non-characterized proteins) and 1650 with no hit, potentially representing J. regia-specific proteins. Out of 346 genes identified as J. regia-specific using Markov Cluster Algorithm (MCL) analysis (described below), 230 were represented by these genes with no hit, further supporting their contribution in J. regia. In the complete set of genes, 24 234 were annotated with at least one of 3542 different Interpro accessions and 3558 with one of 713 different enzyme codes from the Enzyme Nomenclature Database (Bairoch, 2000).

Functional classification of annotated genes yielded 11 421 models annotated with 3771 different Gene Ontology (GO) terms. The most abundant molecular functions were binding (19.4%) and nucleotide binding (17.7%). Metabolism (25.4%), cellular (22.1%) and biosynthesis (16%) were the largest biological processes, and membrane components (18.6%) were most abundant among cellular components (Figure S3). The most abundant InterProScan domains were IPR000719 (protein kinase domain), IPR002885 (pentatricopeptide repeat) and IPR001128 (cytochrome P450), which were in 708, 589 and 369 J. regia genes, respectively. Models with a leucine-rich repeat (LRR) were also abundant: 527 genes were annotated with IPR001611 or IPR013210 accessions. The phosphorylation pathways were most represented based on enzyme code assignment; this is not unusual for plants. Of 4716 gene models annotated as uninformative and 1650 with no hit, 2587 and 167, respectively, had a recognizable Pfam or Panther protein domain, contributing to their characterization despite lacking global similarity to characterized proteins.

Orthologous gene family analysis, performed via TRIBE-MCL, included the J. regia MAKER models and a collection of proteins from nine sequenced land plants (Figure 3). The nine species comprised a mix of monocots and dicots and one bryophyte. There were 10 046 families of size five and greater, yielding 9910 families after filtering out families containing retroelements. Of these, 9153 comprised proteins from more than two species. Only 20 families (165 proteins) were unique to J. regia (Tables 6 and 7). Of these, 12 were annotated with at least one descriptive protein domain. No protein domain was identified in eight J. regia-specific families (66 proteins). One abundant J. regia-specific group included characteristic resistance genes containing LRR and/or NB-ARC domains (PF12799, PF13855, PF08263 and PF00931; four families, 24 proteins). Further examination of BLAST annotation results (with at least 70% coverage at the protein level) for these proteins showed that 12 of the 24 were disease resistance genes.

Venn diagram of orthologous protein groups for each species identified in the Markov cluster analysis (MCL) with the Juglans regia MAKER models and the protein collection of nine sequenced plant species.

The nine sequenced plant species were as follows: Arabidopsis thaliana (AT), J. regia (JR), Oryza sativa (OS), Physcomitrella patens (PP), Populus trichocarpa (PT), Ricinus communis (RC), Theobroma cacao (TC), Vitis vinifera (VV), Zea mays (ZM) and Glycine max (GM). The center represents the number of families shared by all species simultaneously. Only families with five or more protein members were considered. Analysis of families with two or more members showed 5205 families shared by all species simultaneously.

Protein family Walnut models Pfam ID [Pfam count] Pfam description
2906 19 PF12796 [8] Ank_2
7392 10 PF03140 [10] DUF247
7940 10 PF12776 [1] Myb_DNA-bind_3
7946 10 No Pfam
7948 10 PF07777 [1] MFMR
7966 10 No Pfam
8408 9 No Pfam
8504 8 PF00171 [1] Aldedh
8677 8 No Pfam
8735 8 No Pfam
8737 8 No Pfam
8754 7 PF00646 [6] F-box
8861 7 PF12799 [4] LRR_4
8975 7 PF00931 [2] NB-ARC
PF13855 [1] LRR_8
8980 7 No Pfam
9091 6 PF14577 [1] SEO_C
9318 6 No Pfam
9633 5 PF13855 [1] LRR_8
9785 5 PF02892 [1] zf-BED
9863 5 PF08263 [3] LRRNT_2
PF12799 [1] LRR_4

Element name Family name Frequency Total length (bp) Percentage of genome
Reju_rnd-2_family-57 LINE/L1 14 754 13 626 865 1.98
Reju_rnd-2_family-119 LTR/Gypsy 15 566 10 789 195 1.57
Reju_rnd-2_family-14 LTR 7952 8 167 833 1.19
Reju_rnd-3_family-29 LINE/L1 7037 7 495 511 1.09
Reju_rnd-2_family-118 LTR/Gypsy 13 456 7 324 814 1.06
Reju_rnd-3_family-5 LTR/Gypsy 4371 7 247 731 1.05
Reju_rnd-3_family-49 LTR 11 849 7 007 548 1.02
Reju_rnd-3_family-80 LINE/L1 4799 6 822 652 0.99
Reju_rnd-2_family-58 LTR 8977 5 941 064 0.86
Reju_rnd-3_family-89 LTR/Gypsy 4203 5 250 213 0.76
Reju_rnd-4_family-64 LTR/Copia 2912 5 086 047 0.74
Reju_rnd-6_family-1 DNA 15 957 4 219 485 0.61
Reju_rnd-4_family-7 LINE/L1 15 016 4 169 235 0.61
Reju_rnd-5_family-10 DNA 14 524 3 857 295 0.56
Reju_rnd-3_family-301 LINE/L1 3953 3 637 049 0.53
Reju_rnd-3_family-7 DNA 15 243 3 303 322 0.48
Reju_rnd-3_family-126 DNA/CMC-EnSpm 5152 3 256 473 0.47
Reju_rnd-3_family-210 DNA 6166 2 720 074 0.40
Reju_rnd-3_family-271 LTR/Gypsy 2209 2 685 568 0.39
Reju_rnd-3_family-231 DNA 16 304 2 644 048 0.38

Annotation of genes in the J. regia genome encoding enzymes involved in the biosynthesis of non-structural phenols

Consistent with the rich array of phenolic compounds in J. regia, eight families with 133 proteins contained the UDP-glucuronosyltransferase (UDPGT; PF00201) domain associated with UDP glycosyltransferase activity. Other domains commonly associated with this set of families were the N-terminal domain of the glycosyltransferase family (Glyco_transf_28; PF03033) and the structurally conserved catalytic domain of protein kinases (Pkin; PF00069). Less frequent were a conserved ferredoxin domain (Fer2; PF00111), a helix–loop–helix domain from a large family of calcium-binding proteins (EF-hand_5; PF13202) and the catalytic domain of the Ulp1 protease family (Peptidase_C48; PF02902). We annotated one member of this cohort with the UDPGT domain (PF00201) as WALNUT_00000341-RA on scaffold jcf7180001222233. This protein was associated with a metabolic process (GO:0008152) and predicted by InterProScan as a UDP-glucuronosyl/UDP-glucosyltransferase (IPR002213). Intriguingly, mammalian proteins that exhibit this domain are classified as UDP-glucuronosyl transferases (EC: they transfer a glucuronic acid to lipophilic substrates and detoxify drugs and carcinogens (Burchell et al., 1991). In plants, however, members of this family are classified as flavonol-3-O-glucosyltransferases (EC: and transfer a glucose moiety from UDP-glucose to a flavonol, a reaction observed in anthocyanin biosynthesis (Hu et al., 2011). Two additional proteins were assigned to one family and clustered with 63 proteins from other genomes by MCL. These proteins were annotated as WALNUT_00030240-RA and WALNUT_00029754-RA by MAKER. Three Pfam domains associated with PPO enzymes were identified in each protein: tyrosinase (PF00264), PPO1_KFDV (PF12143) and PPO1_DWL (PF12142) domains. These domains are associated with oxidation/reduction and catechol oxidase activity in PPO genes and are found in canonical PPO enzymes. 
This result contrasts with a report identifying only a single polyphenol oxidase enzyme gene (JrPPO1) using Southern analysis (Escobar et al., 2008 ). A thorough characterization of JrPPO1, confirming the presence of one or more PPO genes in J. regia, is provided below. While we are confident that our results support the presence of an additional PPO gene, we recognize that our discovery was enabled solely by sequencing the J. regia genome and identifying conserved domains using in silico methods.

Annotation of miRNA sequences in the J. regia genome

We identified 5229 potential miRNA precursors with MIRENA by comparison with known plant miRNA sequences in miRBase. None were an exact match to known miRNAs (all contained one or two mismatches). The plant species with the greatest contribution to the identifications are listed in Table S7. Figure 4(a) shows the size distribution of all identified mature miRNAs predicted from precursors. Our results are biased toward selection of short sequences (compared with the typical 21-nt length of miRNAs in other species) and the absence of expression data hampers validation. Therefore, we applied stringent criteria for filtering. First, we selected only precursors containing mature sequences with previously described functional sizes (20, 21, 22 or 24 nt; 2881 precursors). Second, we selected precursors that contained a mature miRNA with sequence similarity to the most conserved miRNAs among plants (330 precursors; Table S8). Third, RNA secondary structures were manually reviewed to select precursors meeting microRNA structural requirements (Figure 4b), yielding 205 and 66 high- and low-quality loci, respectively (Table S8); 20 were discarded. Seventy-two precursors were also collapsed to 37 different loci due to identification of both strands of the miRNA duplex on the same precursor (Figure 4c). Five long precursors with highly negative minimal folding free energy index (MFEI) values were flagged as low quality because they resembled the structure of fold-back retrotransposons (Figure 4d), which can be confused with true miRNA precursors. Small RNA expression data in future analysis may distinguish them from small interfering RNA precursors and validate the complete set of miRNA loci identified in this study. MiR156 is present in nearly every plant species in high copy number, but was particularly abundant in J. regia. 
Twenty-one MAKER models were annotated as squamosa promoter-binding proteins (SPL), which are typically targeted by this miRNA (Kozomara and Griffiths-Jones, 2014 ). If these data are experimentally validated, this could highlight the importance of SPL genes and miR156 in developmental processes in J. regia. Other consistent results include an abundance of miR168 (two miRNA loci identified), a key component of the silencing machinery that targets Argonaute (Vaucheret, 2008 ) proteins (seven models identified). MiR395 is highly diverse among species, ranging from only six copies in Arabidopsis to 17 in Prunus persica. This miRNA helps regulate sulfate assimilation in Arabidopsis (Matthewman et al., 2012 ) and the 17 high-quality precursors identified in J. regia suggest an important role of sulfur metabolism regulation through gene silencing. Four gene models annotated as ATP-sulfurylases (targeted by miR395 in other species) were identified, the same number of genes as annotated in the Arabidopsis genome (Lamesch et al., 2012 ).
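
The first two filtering steps (functional mature length, then membership of a conserved family) and the later collapsing of duplex strands onto one locus are simple to express in code. The following is a minimal sketch under stated assumptions, not the pipeline used in the study: the `Precursor` record, its field names, the locus labels and the toy sequences are all hypothetical, and the manual review of secondary structures is not modelled.

```python
from dataclasses import dataclass

# Mature miRNA sizes retained by the first filter, as listed in the text.
FUNCTIONAL_LENGTHS = {20, 21, 22, 24}


@dataclass(frozen=True)
class Precursor:
    locus: str    # hypothetical locus identifier
    mature: str   # predicted mature miRNA sequence
    family: str   # hypothetical family label taken from the best database match


def filter_precursors(precursors, conserved_families):
    """Keep precursors with a functional mature length and a conserved family."""
    return [
        p for p in precursors
        if len(p.mature) in FUNCTIONAL_LENGTHS and p.family in conserved_families
    ]


def collapse_by_locus(precursors):
    """Group precursors that report both duplex strands on the same locus."""
    loci = {}
    for p in precursors:
        loci.setdefault(p.locus, []).append(p)
    return loci


# Toy input: placeholder sequences, chosen only for their lengths.
TOY = [
    Precursor("locus1", "A" * 21, "miR156"),   # kept
    Precursor("locus1", "U" * 21, "miR156"),   # kept, same locus (other strand)
    Precursor("locus2", "A" * 19, "miR168"),   # dropped: 19 nt is not a functional size
    Precursor("locus3", "A" * 24, "miR9999"),  # dropped: family not in the conserved set
]

kept = filter_precursors(TOY, conserved_families={"miR156", "miR168", "miR395"})
loci = collapse_by_locus(kept)
```

Here two precursors survive the filters and collapse to a single locus, which mirrors how 72 precursors were merged into 37 loci.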

Identification of miRNA in the walnut genome.

(a) Length distribution of mature miRNA sequences derived from computationally predicted precursors, for the complete set of miRNAs identified (red) and manually curated conserved plant miRNAs in other species (blue).

(b), (c) The RNA secondary structure of two precursors that passed the quality filter. Both miRNA and the complementary strand were computationally identified for precursor (c).

Tandem repeats

We identified 6.77 Mbp of tandem repeats, comprising 0.95% of the genome. The average tandem repeat was 25 bp. Minisatellites (9–100 bp) (Vergnaud and Denoeud, 2000) represented the largest physical portion of the genome (0.51%), followed by satellites (>100 bp, 0.28%) (Csink and Henikoff, 1998) and microsatellites (1–8 bp, 0.15%) (Morgante and Olivieri, 1993) (Table S9). Among microsatellites, dinucleotide repeats were most prevalent at 25% of all tandem repeats and 0.8 Mbp of the genome. The six most frequent consensus sequences had period size two and the AT dinucleotide comprised 0.23 Mbp (0.03%) of the genome (Table S10; Figure S4). Juglans regia had a significantly higher microsatellite density than A. thaliana, C. sativus, V. vinifera or P. trichocarpa (Wegrzyn et al., 2014) (Figure 5). Dinucleotide and pentanucleotide repeats were notably denser in J. regia than in other plant genomes, while other microsatellites were present at comparable or lower densities. The average microsatellite had a copy number of 14.75, while minisatellites and satellites had average copy numbers of 2.84 and 2.30, respectively.
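
The three size classes are defined purely by period length, so they can be written as a small function. This is a minimal sketch: the function names and the toy input are illustrative, and it assumes each repeat arrives as a (period in bp, copy number) pair, roughly what a tandem repeat finder reports.

```python
from collections import defaultdict


def classify_tandem_repeat(period_bp):
    """Assign a tandem repeat to a class using the period-size ranges in the text."""
    if period_bp < 1:
        raise ValueError("period must be at least 1 bp")
    if period_bp <= 8:
        return "microsatellite"   # 1-8 bp
    if period_bp <= 100:
        return "minisatellite"    # 9-100 bp
    return "satellite"            # >100 bp


def bp_by_class(repeats):
    """Sum array length (period x copy number) per class for (period, copies) pairs."""
    totals = defaultdict(float)
    for period_bp, copies in repeats:
        totals[classify_tandem_repeat(period_bp)] += period_bp * copies
    return dict(totals)


# Toy input: (period in bp, copy number); the values are illustrative only.
TOY_REPEATS = [(2, 14.5), (5, 10.0), (30, 3.0), (150, 2.0)]
totals = bp_by_class(TOY_REPEATS)
```

Dividing each class total by the assembly length would give genome percentages comparable to those reported in Table S9.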

A comparison of microsatellite density across species.

Overlapping repeats were resolved by selecting the element with a higher Tandem Repeat Finder score. Tandem repeats with interspersed repeats that overlapped on less than 15% of their length were selected. These data were compared with a select set of sequenced angiosperms.

Interspersed repeats

Interspersed repeats are typically mobile elements that propagate using either DNA or an RNA intermediate. Our analysis revealed that 51.19% (50.38% interspersed plus 0.95% tandem) of the J. regia genome is repetitive. This estimate is higher than a previous estimate of 28.9% using BAC sequences (Wu et al., 2012); it is also greater than that of P. trichocarpa (42%) (Tuskan et al., 2006) but similar to Eucalyptus grandis (50.1%; Myburg et al., 2014) and V. vinifera (55.02%). There were 930 517 interspersed repeats identified in the genome (754 453 full-length and 176 064 partial repeats), with a combined length of 346 567 115 bp. The percentages of de novo and similarity annotations were 92.49% and 7.51%, respectively. The number of unique de novo repeats obtained from RepeatModeler was 811, while 2009 unique repeat elements were identified using a Repbase similarity search (Figure S5). Among interspersed repeats, full-length repeats accounted for 53 151 460 bp (7.7%) and partial repeats for 293 415 655 bp (42.6%) (Figure 6). There were 27 473 interspersed repeats (40 unique elements) that were unclassified (627 733 full length and 3 548 903 partial), representing 0.6% of the genome. The interspersed repeats were 50.38% of the genome, comprising 35.38% retrotransposons and 14.35% DNA transposons and uncharacterized repeats. Among long terminal repeat (LTR) retrotransposons, Gypsies and Copias were most abundant at 8.40% (721 unique elements) and 6.57% (701 unique elements) of the genome. The ratio of genome coverage of Gypsy to Copia elements was 1.27 (Table S11). However, the number of uncharacterized LTRs was 7.09% of the total after a similarity search and manual curation. Dating of the LTRs revealed that Copia elements had a median LTR divergence of 4.6% compared with 5.5% for Gypsy elements, suggesting that Copias may be younger (File S2). 
However, the number of elements and the percentage of the genome covered by uncharacterized LTRs may influence the dating results.
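
Several of the interspersed-repeat figures can be cross-checked against one another with simple arithmetic. The sketch below uses only numbers quoted in the text; the variable names are illustrative. The assembly length implied by the 50.38% figure is a derived value, not one stated in this section.

```python
# Counts and lengths of interspersed repeats, as quoted in the text.
full_count, partial_count = 754_453, 176_064
full_bp, partial_bp = 53_151_460, 293_415_655

total_count = full_count + partial_count   # should equal 930 517
total_bp = full_bp + partial_bp            # should equal 346 567 115 bp

# Assembly length implied by "50.38% of the genome" (derived, not quoted).
implied_genome_bp = total_bp / 0.5038

# Shares of the implied assembly length, for comparison with 7.7% and 42.6%.
full_pct = 100 * full_bp / implied_genome_bp
partial_pct = 100 * partial_bp / implied_genome_bp

# Gypsy-to-Copia coverage ratio recomputed from the rounded percentages.
gypsy_pct, copia_pct = 8.40, 6.57
gypsy_to_copia = gypsy_pct / copia_pct
```

The recomputed ratio rounds to 1.28 rather than the reported 1.27. This small difference is plausibly due to the percentages being rounded before division, though the text does not say so.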

Interspersed repeat content in Juglans regia.

Breakdown of full (using a cutoff of 80–80–80 (Wicker et al., 2007 ), labelled as full) and all (full plus partial length elements, labelled as all) interspersed repeats in the genome. Repeat elements have been further classified into class I (retrotransposons) and class II (DNA transposons). The figure also shows the rRNA content and unclassified repeat elements in J. regia.

Non-LTR elements have no LTR regions and terminate with a simple sequence repeat such as poly(A). Instead of RNA-mediated, element-encoded reverse transcriptase and integrase, target-primed RNA reverse transcription takes place at the site of insertion (Kejnovsky et al., 2012 ). The percentage of non-LTR retrotransposons in the genome was 10.47%. L1/LINE is the largest non-LTR retrotransposon subfamily, at 7.6% of the genome. Among DNA transposons (14.35% of the genome, 263 unique elements), En/Spm (Enhancer Suppressor-mutator) and hAT (Rubin et al., 2001 ) are the major DNA elements at 2.23% and 2.08% of the genome, respectively.

SNP discovery

We found 965 861 high-quality SNPs distributed in 7123 scaffolds. To estimate SNP heterozygosity we conditioned all sites on a required minimum depth of 80 and minimum mapping quality of 20, yielding 398 173 095 sites over 16 077 scaffolds on which high-quality SNPs could potentially be called. This resulted in an overall SNP heterozygosity estimate of 0.0024. Including lower-quality SNPs over the same region increased the estimate slightly to 0.0025. As expected, there was a strong correlation between the length of the scaffold and the number of SNPs. Scaffolds without SNPs tended to be small, representing 41% by count but only 9.9% by length of scaffolds longer than 5000 bp (see Figure S6 for genome heterozygosity, Appendix S1). Only 38 longer scaffolds between 100 and 673 kbp had no SNPs. These results confirmed that most scaffolds without SNPs are short and suggest that shared ancestry is mostly in the distant past. Most genes were on a scaffold with a SNP: 14 469 genes contained a combined 177 702 high-quality SNPs within their genic region. A subset of 10 130 genes had SNPs within their annotated coding sequence.
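
The heterozygosity estimate is the number of high-quality SNPs divided by the number of sites at which a SNP could have been called. A minimal sketch follows, using the thresholds and counts quoted in the text; the function names are illustrative and the per-site filter is a simplification of whatever the variant caller actually applied.

```python
def is_callable_site(depth, mapping_quality, min_depth=80, min_mapq=20):
    """A site counts toward the denominator only if it passes both thresholds."""
    return depth >= min_depth and mapping_quality >= min_mapq


def snp_heterozygosity(n_snps, n_callable_sites):
    """SNPs per callable site."""
    if n_callable_sites <= 0:
        raise ValueError("need at least one callable site")
    return n_snps / n_callable_sites


HIGH_QUALITY_SNPS = 965_861
CALLABLE_SITES = 398_173_095

estimate = snp_heterozygosity(HIGH_QUALITY_SNPS, CALLABLE_SITES)
```

Rounded to four decimal places, `estimate` reproduces the reported value of 0.0024.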

Whole-genome duplication event

There is evidence, based on the fossil record, for whole-genome duplication in the Juglans lineage that occurred some 60 million years ago (Luo et al., 2015). To comprehensively examine the evidence for a whole-genome duplication within the J. regia genome, self-comparisons were performed to identify conserved regions of synteny between homeologous regions of the genome. We identified 8459 pairs of paralogous genes (syntelogs, 12 084 genes), excluding within-scaffold tandem duplication. Of these, 3688 (30.5%) genes occurred more than twice. The distribution of estimated synonymous changes per synonymous site (Ks) was plotted for each of the 4111 pairs (8093 genes) of syntelogs with Ks < 1 (Figure 7). Within this group, only 423 (5.2%) genes occurred more than twice. Two obvious peaks are present. The minor peak, near the origin, probably represents unmerged, allelic haplotypes in the assembly of a single outbred genome. The major peak, representing the majority of syntelogs, has a mode at Ks = 0.33. This is surprisingly consistent with an estimate of the average divergence of 14 pairs of paralogous genes (Ks = 0.274 ± 0.09) (Luo et al., 2015). Gene Ontology terms which were overrepresented in the 8093 genes with Ks < 1 included proteins with functions in transcription regulation and signal transduction components (Figure S7), similar to those identified in other plant genomes (Blanc and Wolfe, 2004; Seoighe and Gehring, 2004; Bekaert et al., 2011). Our results provide strong support for whole-genome duplication and the degree of divergence between syntelogs. Genomic data from other members of the Juglandaceae are needed to determine a more accurate date and phylogenetic placement for the whole genome duplication.
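
Locating the major peak of a Ks distribution amounts to binning the values and taking the fullest bin, while ignoring the near-zero peak attributed to unmerged haplotypes. The sketch below illustrates the idea on toy values. The bin width and the lower cutoff of 0.1 are assumptions made for the example, not parameters reported in the study.

```python
from collections import Counter


def ks_mode(ks_values, bin_width=0.05, min_ks=0.1, max_ks=1.0):
    """Return the centre of the most populated Ks bin within [min_ks, max_ks)."""
    counts = Counter(
        int(ks / bin_width) for ks in ks_values if min_ks <= ks < max_ks
    )
    if not counts:
        raise ValueError("no Ks values inside the requested window")
    fullest_bin, _ = max(counts.items(), key=lambda item: item[1])
    return (fullest_bin + 0.5) * bin_width


# Toy Ks values: a few near zero (haplotype-like) and a cluster around 0.33.
TOY_KS = [0.01, 0.02, 0.03, 0.31, 0.33, 0.34, 0.36, 0.62, 0.9]
major_peak = ks_mode(TOY_KS)
```

With these toy values the fullest bin spans 0.30 to 0.35, so the function returns its centre, 0.325. A kernel density estimate would give a smoother answer on real data.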

Identification of GGT, an enzyme that catalyzes the committed step in the biosynthesis of HTs

Gallic acid, the precursor for the synthesis of gallo- and ellagitannins, is synthesized from 3-dehydroshikimic acid and then converted to β-glucogallin (1-O-galloyl-β-D-glucose), the key intermediate for the synthesis of hydrolysable tannins (HTs) (Niemetz and Gross, 2005; Muir et al., 2011). Recently a glucosyl transferase was shown to catalyze the formation of β-glucogallin (Mittasch et al., 2014). The glycosyltransferase (GT) family of enzymes transfers saccharide moieties from activated donor molecules to acceptor molecules (Lairson et al., 2008). Glycosyltransferases are classified into 97 families by the Carbohydrate-Active enZYmes Database (CAZy) (Lombard et al., 2014), of which family 1 glycosyltransferases, or UDP-dependent glycosyltransferases (UGTs), are the largest in the plant kingdom (Ross et al., 2001; Yonekura-Sakakibara and Hanada, 2011). These UGTs are subdivided into 16 groups (A–P) (Ross et al., 2001; Caputi et al., 2012). Dicots have a wide range of putative UGT genes in their genomes (C. sativus has 85; Malus domestica, 241, etc.; Caputi et al., 2012). A recent genome-wide analysis of UGTs in maize identified 147 genes (Li et al., 2014). We used UGT84A13 (QrGGT), a group L UGT from Quercus robur (English oak) (Mittasch et al., 2014), to query the J. regia transcriptome and identify UGT genes in the genome using an iterative algorithm (see Experimental Procedures). The UGTs contain a conserved segment of 42 amino acids, represented by the PSPG (putative secondary plant glycosyltransferase) motif (Figure 8a). Using the PSPG motif, we identified about 130 different UGT genes expressed in the J. regia genome (Figure 8b). Transcripts of UGTs are differentially expressed in most tissues, with a high number expressed in roots (Table S12). The phylogenetic tree (Figure 8c) shows approximately the same number of groups as previously reported (Caputi et al., 2012). 
We were especially interested in UGTs that exhibit GGT activity, catalyzing the formation of 1-O-galloyl-β-D-glucose. Two genes in J. regia, c48329_g2_i1 (JrGGT1) and c48329_g1_i1 (JrGGT2), were the closest matches to QrGGT and belonged to the L group of UGTs (Caputi et al., 2012; Mittasch et al., 2014). Group L UGTs have enzymatic activities on a wide variety of metabolites including phenylpropanoids (Lim et al., 2003), benzoates (Lim et al., 2002) and indole-3-acetic acid (Jackson et al., 2001). However, phylogenetic grouping could not predict the wide biochemical activities for a bi-functional resveratrol/hydroxycinnamic acid glucosyl-transferase from Concord grape (Hall and De Luca, 2007). A similar caveat is highlighted by the CAZy database, which advises ‘polyspecificity (enzymes with different donor and/or acceptor found in the same family) is common among glycosyltransferase families, making precise functional predictions often unreliable or inaccurate’ ( Thus, the promiscuity and broad substrate specificity (Chakraborty and Rao, 2012) of these enzymes necessitate proper biochemical characterization when classifying their catalytic properties.
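
Searching predicted proteins for a conserved motif such as PSPG is a sequence-based screen that needs only the peptide sequences. One common way to do this is to translate a PROSITE-style pattern into a regular expression. The sketch below shows that translation for a small subset of the PROSITE syntax. The pattern and sequences are toy examples invented for illustration; they are not the 42-residue PSPG profile in Figure 8a, and this is not the iterative algorithm used in the study.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden}, plus optional (n) or (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (e.g. 'W-x(2)-[QL]-H') into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            token = "."                          # any residue
        elif residue.startswith("{"):
            token = "[^" + residue[1:-1] + "]"   # any residue except those listed
        else:
            token = residue                      # literal residue or [allowed set]
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


def scan(sequences, pattern):
    """Return {name: (start, end)} with 1-based inclusive coordinates of the first hit."""
    regex = re.compile(prosite_to_regex(pattern))
    hits = {}
    for name, seq in sequences.items():
        found = regex.search(seq)
        if found:
            hits[name] = (found.start() + 1, found.end())
    return hits


# Toy pattern and sequences, invented for this example.
TOY_PATTERN = "W-x(2)-[QL]-x(2)-H-C-G-W-N-S"
TOY_SEQUENCES = {
    "toy_ugt": "MKTAWAPQLEHCGWNSTLE",
    "toy_other": "MKTAYAPQLEHCGYNSTLE",
}
hits = scan(TOY_SEQUENCES, TOY_PATTERN)
```

As the paragraph above cautions, a motif hit only places a protein in the UGT family. It does not predict which donor or acceptor the enzyme uses.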

Sequence analysis of glucosyltransferases (GT) in Juglans regia.

(a) Prosite profile of the 42 conserved residues in GT sequences.

(b) Partial multiple sequence alignment of the conserved region of GT protein sequences.

(c) Molecular phylogenetic analysis by the maximum likelihood method. The analysis involved 130 amino acid sequences. There were 757 positions in the final dataset. Evolutionary analyses were conducted in MEGA6. Group L UDP-glycosyltransferases are highlighted in red.

To probe the molecular basis of variation in substrate specificity of JrGGT1 and JrGGT2, we modeled the proteins using SWISS-MODEL (Arnold et al., 2006) with PDBid:2PQ6 from Medicago truncatula (which has the closest identity to UGT84A13) as the template. Previously, we proposed a method to suggest mutations in a protein to endow it with a specific catalytic function based on structural and electrostatic homology of residues in the active site to those of a known protein with the desired function (Chakraborty, 2012). We chose the Cα atom as the representative atom for each residue for a uniform comparison. The active sites of these proteins have distinct donor- and acceptor-binding sites (Li et al., 2007). We applied transformations (Chakraborty, 2012) to align three residues (H378, F400 and D402) involved in the donor-binding site (Li et al., 2007). These residues are exactly conserved in both JrGGT proteins. H378 and W381 correspond to the active site His and Trp residues described by Ghose et al. (2015), which are critical for glycosylation of the phenylpropanoid secoisolariciresinol.

This multiple superimposition of the proteins provided a single frame of reference to compare the GTs, clearly demonstrating their overall structural homology (Figure 9). Therefore, it is logical to conclude that the differences between these proteins, the M. truncatula isoflavonoid GT versus the Q. robur and J. regia GGT proteins, lie in their active sites. After superimposition, we took residues within a radius of 5 Å from the triad (H378, F400 and D402). For each residue in PDBid:2PQ6, we found the closest residue in JrGGT1 and JrGGT2, superimposing the proteins, then compared the residues in the active sites of the three proteins. Some positions are occupied by stereochemically different residues, providing a plausible reason for the difference in substrate specificity when comparing the GT and the GGTs (Gouran et al., 2014). For example, the large, negatively charged D201 in PDBid:2PQ6 is replaced by a small, polar serine residue in the J. regia and Q. robur GGTs. We designed four pairs of PCR primers based upon the genomic sequence to confirm the presence of these genes: for each gene, one set of primers was designed to amplify only the coding region and the other to amplify the gene and 2 kb upstream and 1 kb downstream of the coding region (see Appendix S2, Figures S8–S15, Tables S15–S19). The primers (JrGGT1-F-cd and JrGGT1-R-cd; JrGGT2-F-cd and JrGGT2-R-cd) amplified single bands representing the GGT coding region from genomic DNA isolated from different J. regia cultivars (see Appendix S2). JrGGT1 was located on a 7209-bp scaffold designated jcf7180001222249 and JrGGT2 was present on a 7350-bp scaffold designated jcf7180001222233 (see Appendix S2).
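
The active-site comparison described above (select residues within 5 Å of the anchoring triad, then pair each with its nearest residue in a superimposed model) can be sketched with Cα coordinates alone. This is an illustrative sketch, not the authors' software. It assumes the structures are already superimposed, and all coordinates and the model's residue labels are invented.

```python
import math


def residues_near(ca_coords, anchors, radius=5.0):
    """Residue IDs whose C-alpha lies within `radius` angstroms of any anchor residue."""
    return sorted(
        res for res, xyz in ca_coords.items()
        if any(math.dist(xyz, ca_coords[a]) <= radius for a in anchors)
    )


def closest_residue(query_xyz, other_ca_coords):
    """Residue in the other (already superimposed) structure nearest to a point."""
    return min(other_ca_coords, key=lambda res: math.dist(query_xyz, other_ca_coords[res]))


def map_active_site(template, model, anchors, radius=5.0):
    """Pair each template active-site residue with its nearest model residue."""
    return {
        res: closest_residue(template[res], model)
        for res in residues_near(template, anchors, radius)
    }


# Toy C-alpha coordinates in angstroms; labels on the model side are hypothetical.
TEMPLATE = {
    "H378": (0.0, 0.0, 0.0), "F400": (10.0, 0.0, 0.0), "D402": (0.0, 10.0, 0.0),
    "D201": (3.0, 0.0, 0.0), "A50": (30.0, 30.0, 30.0),
}
MODEL = {
    "H380": (0.5, 0.0, 0.0), "F402": (10.0, 0.5, 0.0), "D404": (0.0, 10.5, 0.0),
    "S203": (3.2, 0.0, 0.0), "G60": (29.0, 30.0, 30.0),
}
pairs = map_active_site(TEMPLATE, MODEL, anchors=["H378", "F400", "D402"])
```

In the toy data an aspartate in the template pairs with a serine in the model, the kind of stereochemical difference the text highlights for D201.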

Multiple sequence alignment of the gallate 1-β-glucosyltransferase (GGT) sequences.

(a) Alignment of JrGGT1 and JrGGT2 amino acid sequences from Juglans regia with QrGGT from Quercus robur (Mittasch et al., 2014 ) and a Medicago flavonoid glucosyltransferase (GT) with a known three-dimensional (3D) structure. Residues highlighted in blue correspond to residues in (b). Darker blue indicates the three active site residues that were chosen to anchor the 3D alignments together in the DECAAF modeling. The conserved region used to build the phylogenetic tree (Figure 10) lies within the yellow boxed area.

(b) DECAAF-generated superposition, obtained by superposing Cα atoms (F302, H378 and D402) from the donor-binding site of a GT (PDBid:2PQ6). [PDBid:2PQ6 in red, QrGGT (oak) in green, JrGGT1 in blue and JrGGT2 in orange]. *This residue is stereochemically different in one Protein Data Bank (PDB).

Identification and characterization of the JrPPO1 gene in J. regia

To identify PPO genes in the J. regia genome, we used the nucleotide sequence of JrPPO1 (GenBank ACN86310.1), previously found using Southern analysis (Escobar et al., 2008). This sequence served as a BLASTN query of the J. regia genome assemblies. A single 18 263-bp scaffold (jcf7180001214880) that contained JrPPO1 was found. The Trinity de novo assembly contig c55545_g1_i1, which matched JrPPO1 and is called WALNUT_00030240-RA here, aligned perfectly to the genomic scaffold. This gene contains the three characteristic domains for JrPPO1, PF00264:tyrosinase, PF12143:PPO1_KFDV and PF12142:PPO1_DWL, according to an NCBI CDD search (Marchler-Bauer et al., 2015), validating the genome annotation obtained using MAKER-P.

BLASTN of the JrPPO1 coding sequence to the Trinity de novo assembly of the J. regia transcriptome identified three additional predicted transcripts: c44803_g1_i1, c44803_g2_i1 and c44803_g3_i1. These matched the query with about 80% sequence identity and all aligned to a single genomic ORF that we designated JrPPO2. YeATS (Chakraborty et al., 2015) determined a segment of overlap between c44803_g2_i1 and c44803_g3_i1, indicating that they are part of the same ORF, as independently shown by visualizing the reads constituting JrPPO1 and JrPPO2 on independent scaffolds using Integrated Genome Viewer (IGV; see Appendix S2). JrPPO2 is present on a single 22 390-bp scaffold (jcf7180001216434). It encodes a PPO gene distinct from JrPPO1 but sharing about 71% amino acid sequence identity (see Appendix S2). MAKER-P annotated the entire 1833-bp coding region of JrPPO2 on this scaffold as a single-exon predicted gene, WALNUT_00029754-RA, containing three InterProScan domains (IPR022739, IPR022740 and IPR002227).

To confirm the presence of genes encoding JrPPO1 and JrPPO2, we developed gene-specific PCR primers as described for GGT (see Appendix S2). These primers (JrPPO1-F-cd and JrPPO1-R-cd; JrPPO2-F-cd and JrPPO2-R-cd) amplified single bands representing the JrPPO coding regions from genomic DNA isolated from different J. regia cultivars (see Appendix S2). Polymerase chain reaction targeting of the regulatory sequences of JrPPO1 yielded amplicons for all J. regia cultivars; however, PCR targeting of the JrPPO2 regulatory regions failed to amplify two of five cultivars tested, indicating inter-cultivar variation in the regulatory sequence of JrPPO2. Sanger sequencing of the JrPPO amplicons supported the presence of both JrPPO1 and JrPPO2 with highly conserved but not identical sequences, as predicted by the WGS. The genomic scaffold containing JrPPO1 had no repeats and 43 high-quality SNPs; none of the SNPs occur in the coding region of JrPPO1.

JrPPO1 and JrPPO2 are phylogenetically distinct (Figure 10) but both are closely related to ancestral forms of plant PPO genes. A diversity of PPO genes have been reported in Physcomitrella patens, Glycine max, Populus trichocarpa and other genomes, with variation in sequence lengths and a broad distribution of introns in both monocot and eudicot lineages, for example in P. trichocarpa (PtPPO13) and cherimoya (Annona cherimola AcPPO) (Tran et al., 2012 ). This observed diversity may result from independent bursts of gene duplication in response to environmental adaptation and a diversity of enzymatic functions (Tran et al., 2012 ).

Molecular phylogenetic analysis by the maximum likelihood method.

The evolutionary history was inferred by using the maximum likelihood method based on the Poisson correction model (Zuckerkandl and Pauling, 1965 ). The bootstrap consensus tree inferred from 1000 replicates is taken to represent the evolutionary history of the taxa analyzed (Felsenstein, 1985 ). Branches corresponding to partitions reproduced in fewer than 50% of bootstrap replicates are collapsed. Initial tree(s) for the heuristic search were obtained automatically by applying the maximum parsimony method. The analysis involved 118 amino acid sequences. There were 749 positions in the final dataset. Evolutionary analyses were conducted in MEGA6 (Tamura et al., 2013 ).
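
The Poisson correction converts the observed proportion of differing amino acid sites, p, into an estimated number of substitutions per site, d = −ln(1 − p). A minimal sketch for one pair of aligned sequences follows. The toy sequences are invented, and skipping gapped columns pair by pair is an assumption made here rather than a setting reported for the MEGA6 analysis.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences, skipping gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("sequences differ at every site; distance is undefined")
    return -math.log(1.0 - p)


# Toy alignment: 2 differences over 8 ungapped columns, so p = 0.25.
p = p_distance("MKTAYIAK", "MKSAYLAK")
d = poisson_distance("MKTAYIAK", "MKSAYLAK")
```

For p = 0.25 the corrected distance is about 0.288, slightly larger than p because the correction allows for multiple substitutions at the same site.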

From evaluation of gene expression in different tissues, the two corresponding PPO transcripts are differentially expressed, with JrPPO1 being expressed in most green tissues and JrPPO2 expressed more specifically in callus tissue, where expression of JrPPO1 is lower (Table 8). Multiple sequence alignment (Figure 11) suggests that both proteins are localized to the thylakoid; they contain twin transit peptides typical of such proteins. Scaffold jcf7180001216434 contained 123 high-quality SNPs, with the majority occurring in the non-coding region of the gene. One SNP within the JrPPO2 coding region resulted in a non-synonymous change: a glutamine changed to a lysine (Q-33K), located in the thylakoid signal peptide, which is cleaved. To evaluate the structure–function relationship of active site residues, we modeled the JrPPO1 and JrPPO2 protein sequences using the well-characterized PPOs from V. vinifera and Ipomoea batatas (PDB IDs 2P3X and 1BT2, respectively) in SWISS-MODEL (Biasini et al., 2014). The active site for these enzymes consists of twin Cu-binding subdomains (Figures 11 and 12). The alignment of amino acid residues in the CuA- and CuB-binding region between the four proteins is of particular interest. The histidine residues at positions 87, 108 and 117 that bind CuA are highly conserved, as are histidine residues at positions 239, 243 and 272 of the CuB subdomain. Substrate specificity may be determined by variations in residues at positions 109 and 244. We examined the active site residues by docking the structure of phenylthiourea, a known PPO inhibitor, to JrPPO1 using DOCLASP (Chakraborty, 2014). To mimic changes in the active site due to binding of phenylthiourea, we used the structure of the PPO from I. batatas with bound phenylthiourea (PDB ID: 1BUG; Klabunde et al., 1998) to model JrPPO1 with SWISS-MODEL. A sulfur atom in phenylthiourea is liganded by the copper ions (Figure 12b, Table S13), resulting in conformational changes (Table S14). 
Klabunde et al. (1998) proposed that the aromatic ring of F259 controls access to the catalytic metal center; the phenyl ring of F259 and the imidazole ring of H243 form hydrophobic interactions with the aromatic ring of the inhibitor after undergoing conformational changes; and finally the sulfur of phenylthiourea replaces the hydroxo-bridge present in the apoenzyme. Thus, we hypothesize that PPOs from different species share a similar inhibitory profile, despite obvious differences in sequence and known enzymatic activity.

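| Transcript | CE | CI | CK | EM | FL | HC | HL | HP | HU | IF | LE | LM | LY | PK | PL | PT | RT | SE | VB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c55545_g1_i1 | 63 | 44 | 1844 | 0 | 1657 | 36 | 530 | 140 | 44 | 287 | 1748 | 52 | 1519 | 2 | 0 | 19 | 19 | 1 | 1355 |
| c44803_g1_i1 | 839 | 873 | 80 | 0 | 343 | 46 | 14 | 2 | 19 | 8 | 12 | 2 | 24 | 64 | 0 | 0 | 112 | 96 | 77 |
| c44803_g2_i1 | 1719 | 647 | 131 | 0 | 312 | 57 | 19 | 1 | 9 | 7 | 15 | 1 | 23 | 64 | 3 | 0 | 538 | 124 | 93 |
| c44803_g3_i1 | 804 | 242 | 56 | 0 | 154 | 22 | 7 | 0 | 6 | 2 | 6 | 1 | 10 | 25 | 0 | 0 | 223 | 53 | 37 |

  • CE, callus exterior; CI, callus interior; CK, catkins; EM, embryo; FL, pistillate flower; HC, hull cortex; HL, hull immature; HP, hull peel; HU, hull dehiscing; IF, fruit immature; LE, leaves; LM, leaf mature; LY, leaf young; PK, packing tissue mature; PL, pellicle; PT, packing tissue immature; RT, root; SE, somatic embryo; VB, vegetative bud.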

Full-length multiple sequence alignment of the polyphenol oxidase (PPO) sequences from Juglans regia, Vitis vinifera and Ipomoea batatas.

The full-length, immature PPO amino acid sequence is shown prior to any post-translational modification to the signaling peptides. The chloroplast signaling peptide (CSP) is shown by a dark green bar and the thylakoid signaling peptide (TSP) by a light green bar. Alpha helices are indicated by the thick blocks, beta sheets are indicated by the thick arrows and unstructured turns are indicated by the thin arrows. The amino acid count commences at the first amino acid (Ala) of the 2P3X crystal structure; the numbers corresponding to disulfide bonds and the thioester bridge are derived from the 2P3X structure. The alpha helices and beta sheets correspond to the features described for VvPPO (2P3X) by Virador et al. (2010). The known cleavage sites for the Vv2P3X (Virador et al., 2010) and JrPPO1 (Zekiri et al., 2014b) CSP, TSP and C-terminus are indicated by a black ‘|’ symbol; predicted cleavage sites in JrPPO2 are indicated by a red ‘|’ symbol.

Modeling of the JrPPO1, JrPPO2 and 2P3X protein sequences.

(a) Mapping of active site residues in the vicinity of the CuA- and CuB-binding regions of JrPPO1 and JrPPO2 compared with grapevine polyphenol oxidase (PPO) (2P3X). Active site regions are also compared with sweet potato (1BT1).

(b) Docking phenylthiourea to JrPPO1 using DOCLASP: JrPPO1 was modeled using the PPO from sweet potato (PDB ID:1BUG) by SWISS-MODEL. *This residue is stereochemically different in one protein database.

Biochemical and molecular mechanisms of stress tolerance in Antarctic arthropods

In comparison with temperate insects, the molecular mechanisms of stress tolerance in Antarctic arthropods have received little attention. Nonetheless, early studies in the 1980s characterized biochemical markers of environmental stress, and recent studies have capitalized on advances in molecular biology. Here, we will review the physiological mechanisms of stress tolerance in Antarctic terrestrial arthropods and highlight some avenues for future research.

As in their temperate counterparts, seasonal and stress-induced accumulation of low-molecular-weight osmoprotectants is a hallmark of Antarctic arthropods. Every species profiled thus far accumulates some sort of osmoprotective compound, although the type and amount vary from species to species. For example, the mite A. antarcticus primarily uses glycerol as an osmoprotectant, accumulating levels upwards of 0.5 mol l⁻¹ (Block and Convey, 1995). Likewise, the collembolan C. antarcticus uses glycerol as its chief cryoprotectant and also accumulates glucose and trehalose in response to desiccation (Elnitsky et al., 2008a). In contrast, the midge B. antarctica accumulates very low levels of glycerol and instead relies primarily on glucose, erythritol and trehalose as osmoprotectants (Lee and Baust, 1981; Baust and Lee, 1983). One theme that is emerging from studies of both polar and tropical desiccation-tolerant organisms is the importance of trehalose as an osmoprotectant during dehydration. A range of organisms, including certain bacteria, fungi, plants and metazoans, have been demonstrated to accumulate trehalose in response to desiccation (Elbein et al., 2003). In arthropods, trehalose typically functions as the blood sugar and is the chief osmoprotectant in arthropods capable of anhydrobiosis (Clegg, 2001). Recent studies in C. antarcticus (Elnitsky et al., 2008a) and B. antarctica (Benoit et al., 2007b; Elnitsky et al., 2008b) have implicated its importance during extreme dehydration in Antarctic arthropods as well.

In recent years, physiological studies of Antarctic arthropods have benefited from advances in molecular biology and ‘omics’ technology, although molecular experiments have only been conducted in two species, the collembolan C. antarcticus and the midge B. antarctica. Regardless, these studies are beginning to provide clues about the mechanisms of stress tolerance in some of the world's most extreme arthropods. A summary of the stress-responsive genes identified in Antarctic arthropods is provided in Table 1.

Molecular experiments in C. antarcticus are restricted to two microarray studies of cold tolerance. In the first, animals with ‘low’ supercooling points (<−15°C) were compared with those with ‘high’ supercooling points (>−15°C), to determine which genes are responsible for lowering supercooling points (Purać et al., 2008). This microarray contained a subset of 672 expressed sequence tags (ESTs), and thus was not comprehensive. Nonetheless, expression patterns indicate upregulation of a number of cuticular proteins and other structural constituents in the ‘low’ group, confirming the importance of the cuticle and molt cycle in regulating supercooling capacity in Antarctic collembolans (Worland and Convey, 2008). Other genes upregulated in the ‘low’ group relative to the ‘high’ group include several mitochondrial genes involved in ATP synthesis, suggesting that boosting energy production may be a component of low temperature survival. Specific upregulated genes confirmed by qPCR include a cuticular protein and CHK1, a checkpoint homolog involved in cell cycle regulation. A similar experiment was conducted later with a larger microarray (containing 5400 ESTs), and once again, a number of cuticular genes and genes involved in the molt cycle were upregulated in cold-acclimated collembolans (Burns et al., 2010). In this study, three genes (endocuticle structural glycoprotein SgAbd-4, cuticular protein 49Ah and chitin-binding peritrophin A) were confirmed to be upregulated by qPCR, while an mRNA encoding the extracellular matrix protein tenebrin was verified to be downregulated.

Relative to other Antarctic arthropods, the Antarctic midge B. antarctica has been subjected to the largest number of molecular studies. As in temperate insects, the heat shock proteins are important mediators of stress tolerance in B. antarctica. However, whereas most insects express genes encoding heat shock proteins at very low levels until the proteins are needed, larvae of B. antarctica constitutively express these genes at high levels (Rinehart et al., 2006). While adults of B. antarctica have a typical heat shock protein response, neither heat nor cold increased expression of three different heat shock proteins (small hsp, hsp70 and hsp90) in larvae. This constant presence of heat shock proteins likely provides year-round protection against environmental stress, which can be frequent and unpredictable in maritime Antarctica. Whereas high expression of heat shock proteins typically hinders growth and development (Krebs and Feder, 1997), larvae of B. antarctica are able to circumvent this cost and produce heat shock proteins at high levels even while they are feeding and growing.

Constitutive defenses in B. antarctica are not restricted to heat shock proteins. Larvae also express genes encoding the antioxidant enzyme superoxide dismutase at high levels even in the absence of overt oxidative stress (Lopez-Martinez et al., 2008). Superoxide dismutase mRNA levels, as well as the mRNAs encoding catalase and three heat shock proteins, modestly increase after exposure to sunlight. Indeed, expression of these genes confers extremely high resistance to oxidative damage in B. antarctica, as the antioxidant capacity of B. antarctica larvae is five times greater than that of a temperate freeze-tolerant insect, E. solidaginis (Lopez-Martinez et al., 2008). Adults of B. antarctica have even higher levels of antioxidant capacity, probably because of their near-constant exposure to sunlight as they walk on the surface in search of mates. Resistance to oxidative damage is crucial for Antarctic arthropods, as Antarctic sunlight contains very high levels of UV radiation (Liao and Frederick, 2005), which is intensifying as a result of ozone damage (Weatherhead and Andersen, 2006). Furthermore, repeated bouts of freeze–thaw exposure, which are common in Antarctica, are known to cause oxidative damage in insects (Lalouette et al., 2011).

Recently, aquaporins have been implicated as key regulators of water movement during stressful conditions such as dehydration (Liu et al., 2011) and freezing (Philip et al., 2008; Philip and Lee, 2010). Aquaporins are pore-forming proteins that carry water, and sometimes other solutes, across the cell membrane (Borgnia et al., 1999). Because Antarctic arthropods are challenged by numerous forms of osmotic stress, including freezing, dehydration and immersion in seawater, aquaporins likely play an important role in mediating stress tolerance. Goto et al. (Goto et al., 2011) cloned and characterized the first aquaporin from an Antarctic arthropod, an aquaporin-1-like gene from B. antarctica. When expressed in Xenopus oocytes, this protein is capable of transporting water, but not urea or glycerol, across the cell membrane. This specific aquaporin gene is expressed in several different tissues, indicating that it may play a general role in water movement across cells. However, mRNA expression did not change in response to dehydration, so it is unclear what, if any, role this gene plays in mediating stress tolerance. A second study of B. antarctica aquaporins found immunoreactivity to four different aquaporin antibodies from different species, and some of these were stress inducible (Yi et al., 2011). However, the sequence identity of these aquaporin genes has not been established. In the same study, blocking aquaporins pharmacologically with mercuric chloride reduced the ex vivo freezing tolerance of fat body, midgut and Malpighian tubule tissue, indicating that aquaporins are crucial for water redistribution during freezing. Additionally, mercuric chloride reduced the water loss of midgut tissue, suggesting that aquaporins also play a crucial role in mediating dehydration stress.

Table 1. List of stress-upregulated genes identified in Antarctic arthropods

Genes involved in mobilization of energy reserves and synthesis of osmoprotectants also appear to be essential components of the stress response in B. antarctica. Using qPCR, Teets et al. (Teets et al., 2013) profiled the expression of 11 genes involved in glycogen breakdown, gluconeogenesis, polyol and trehalose metabolism, and proline synthesis. High and low temperatures induce rapid upregulation of genes involved in glucose mobilization, including transcripts encoding glycogen phosphorylase and phosphoenolpyruvate carboxykinase (PEPCK), the rate-limiting enzyme of gluconeogenesis. These results are consistent with previous observations of cold-induced glucose mobilization in larvae of B. antarctica (Teets et al., 2011). In contrast, acute high and low temperatures result in a general downregulation of genes involved in trehalose and proline synthesis. In response to dehydration, gene expression patterns are highly dependent on the type of dehydration experienced. Rapid dehydration at 75% RH has a transcriptional signature similar to that observed in response to heat and cold, namely upregulation of genes encoding glucose-mobilizing enzymes (i.e. PEPCK and glycogen phosphorylase) with concurrent downregulation of genes involved in trehalose and proline synthesis. In contrast, while slow dehydration at 98% RH and cryoprotective dehydration also induce expression of pepck, these treatments upregulate genes involved in trehalose and proline synthesis, consistent with accumulation of trehalose (Benoit et al., 2007b; Elnitsky et al., 2008b) and proline (Teets et al., 2012b) during prolonged dehydration.
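The text does not say how the qPCR fold changes were calculated. As an illustration only, the sketch below uses the widely used 2^-ΔΔCt approach, which assumes roughly 100% amplification efficiency for both the target and the reference gene. The function name and all Ct values are hypothetical and are not taken from the studies cited above.

```python
def fold_change(ct_target_treated, ct_ref_treated,
                ct_target_control, ct_ref_control):
    """Relative expression by the 2^-ddCt method.

    Assumption: target and reference primers both amplify with ~100%
    efficiency, so one cycle corresponds to a two-fold difference.
    """
    # Normalize the target gene to the reference gene within each group.
    d_ct_treated = ct_target_treated - ct_ref_treated
    d_ct_control = ct_target_control - ct_ref_control
    # Compare the treated group with the control group.
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values for a stress-induced transcript:
# the target crosses threshold 3 cycles earlier after treatment,
# while the reference gene is unchanged.
induced = fold_change(ct_target_treated=22.0, ct_ref_treated=18.0,
                      ct_target_control=25.0, ct_ref_control=18.0)
print(induced)
```

A value above 1 indicates upregulation relative to the control and a value below 1 indicates downregulation. In practice the calculation is run on replicate means and the primer efficiencies are checked first.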

Non-targeted, ‘omics’ approaches have also benefited our understanding of stress tolerance in B. antarctica. Using suppressive subtractive hybridization, Lopez-Martinez et al. (Lopez-Martinez et al., 2009) obtained a number of dehydration-responsive clones, and northern blots confirmed that 23 of these were indeed differentially expressed either during dehydration or rehydration. Upregulated genes include those encoding three heat shock proteins (hsp26, hsp70 and hsp90) and genes encoding two antioxidant enzymes (superoxide dismutase and catalase), indicating that protein denaturation and oxidative damage are symptoms of dehydration stress. Other genes upregulated during dehydration include genes coding for cytoskeletal proteins and for proteins involved in membrane restructuring, consistent with previous observations that dehydration causes cytoskeletal reorganization (Chen et al., 2005) and membrane lipid remodeling (Bayley et al., 2001). In addition, several genes are downregulated in response to dehydration, including two electron transport chain genes, suggesting a shutdown of metabolism during dehydration.

To date, a single genome-wide expression study has been conducted in Antarctic arthropods. Using Illumina-based RNA-seq, Teets et al. (Teets et al., 2012b) profiled the expression of 13,500 transcripts in response to both desiccation at a constant temperature and cryoprotective dehydration. Both treatments caused sweeping changes in gene expression: desiccation and cryoprotective dehydration resulted in … and 18%, respectively, of all genes being differentially expressed. These results confirmed the crucial role of heat shock proteins during environmental stress in B. antarctica, as 15 different heat shock protein transcripts were upregulated by one or both dehydration treatments. Concurrently, desiccation caused upregulation of genes involved in the recycling/degradation of proteins and cellular macromolecules, including significant enrichment of proteasomal and autophagy genes. Taken together, these results suggest that coordinated upregulation of heat shock proteins, ubiquitin-mediated proteasome and autophagy function to recycle and remove damaged cellular components during dehydration, thereby conserving energy and promoting cell survival (Fig. 3). We hypothesize that autophagy is particularly important for surviving the Antarctic winter, as this pathway inhibits apoptosis and other forms of cell death during prolonged periods of cellular stress (Teets and Denlinger, 2013). Desiccation and cryoprotective dehydration also cause a general downregulation of central metabolic genes, including genes involved in glycolysis, the tricarboxylic acid cycle and lipid metabolism, suggesting that dehydration causes a molecular shift to hypometabolism that conserves energy and prevents the toxic build-up of metabolic end products (Teets et al., 2012b). Indeed, changes in the metabolome correlate well with gene expression changes, indicating that coordinated changes in gene expression and metabolism govern responses to extreme environmental conditions.
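The statistical procedure behind the reported share of differentially expressed genes is not described here. As a rough sketch of how such a share is commonly obtained, the example below adjusts per-transcript p-values with the Benjamini-Hochberg procedure and counts the transcripts that pass a false discovery rate cutoff. The function names, the 0.05 cutoff and the p-values are all hypothetical; they illustrate the calculation and are not data from the RNA-seq study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fraction_differentially_expressed(p_values, alpha=0.05):
    """Share of transcripts whose adjusted p-value is at or below alpha."""
    if not p_values:
        raise ValueError("p_values must not be empty")
    adjusted = benjamini_hochberg(p_values)
    return sum(q <= alpha for q in adjusted) / len(p_values)


# Hypothetical p-values for ten transcripts (treatment vs control).
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.50, 0.74, 0.90, 0.95, 0.99]
share = fraction_differentially_expressed(p_values)
print(f"{share:.0%} of transcripts pass the cutoff")
```

Dedicated RNA-seq packages also model count dispersion before testing, so this sketch covers only the final multiple-testing and counting step.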
