What is the secondary structure distribution per AA in the Human proteome?

If one were to classify each amino acid residue as being in a coil, beta strand or alpha helix, what would be the distribution of these classes in the human proteome?

Is it 33%-33%-33% or is it biased? If it is biased, why?

I thought it should be around equal, but I ran the whole proteome through PSIPRED and found the distribution to be 60% coil, 30% helix, 10% sheet. Why is that so?
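A tally of this kind can be sketched as follows. This is only a minimal illustration, assuming the per-residue predictions have already been collected as strings over the three-state alphabet H (helix), E (strand) and C (coil). The function name and the toy input are hypothetical and are not part of PSIPRED.

```python
from collections import Counter

def class_distribution(predictions):
    """Return the fraction of residues in each class, pooled over all proteins.

    `predictions` is an iterable of strings, one per protein, where each
    character is 'H', 'E' or 'C' (assumed encoding, see lead-in).
    """
    counts = Counter()
    for ss in predictions:
        counts.update(ss)
    # Only the three expected states contribute to the total.
    total = sum(counts[state] for state in "HEC")
    if total == 0:
        return {}
    return {state: counts[state] / total for state in "HEC"}

# Toy input: two made-up 10-residue "proteins".
toy = ["CCHHHHHHCC", "CCEEECCCHH"]
dist = class_distribution(toy)
print(dist)
```

Note that pooling residues in this way weights long proteins more heavily than short ones; averaging per-protein fractions instead would give a different figure.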

The question asks why the distribution of coil, alpha-helix and beta-strand is 60:30:10 rather than 33:33:33. The answer is:

“Why not?”

This is because there is no reason whatever to expect that three types of structure (or in the case of coils, lack of structure*) should be present in equal amounts in proteins. It is like expecting that the percentages of intron, exon and 'junk' DNA should be equal, or that the percentages of fuel reserves stored as glycogen and fat should be the same. Yes, they belong to the same category, but they are sufficiently different in each case for one not to be surprised if they are not required in equal amounts.

To understand this one needs to look a little more carefully at the occurrence of alpha-helix or beta-strand conformation in the three-dimensional structure of proteins. Three points may be emphasized:

  • Amino acids have particular conformations because they are part of an extended helix or sheet of strands the entirety of which leads to its structural stability - you don't have a random mixture.
  • In many cases the helices or sheets occur in particular combinations to give a family of proteins of similar overall structure. Again, the idea of random mixtures doesn't enter into the equation.
  • These overall structures are suited to particular functions, so the abundance of proteins from a particular structural family will be determined by the requirement for antibodies or ion transporters or signal transducing proteins etc., not by some shake of the dice.

Images illustrating two such families are shown below:

(a) shows an ion transport protein, predominantly alpha-helices, whereas (b) is an immunoglobulin domain with a distinctive pattern of beta strands (as well as some helix). For further examples and information I suggest the EMBL on-line course on protein classification and Berg et al. online, for example Section 7.3.

*Footnote 1: Situations where equal occurrences might be expected

It is worthwhile contrasting the situation for protein secondary structure with some where the prior expectation might well be equal usage, and where deviation from this could be regarded as bias that calls for an explanation:

  • The different proportion of the 20 amino acids in proteins (although a chemist would not expect equal proportions)
  • The different usage of synonymous codons of the genetic code in various species and mRNAs
  • The different usage of the similar molecules with a high phosphoryl group transfer potential: ATP, GTP, UTP and CTP (often loosely termed 'high-energy')

*Footnote 2: A coil is not a secondary structure

As is stated in the Wikipedia entry for Protein Secondary Structure:

The random coil is not a true secondary structure, but is the class of conformations that indicate an absence of regular secondary structure.

It might be mentioned in regard to protein structure that there are smaller three-dimensional motifs that the analysis quoted does not consider. These do not occur in equal proportions either, to nobody's great surprise.

Protein structure prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by computational biology and it is important in medicine (for example, in drug design) and biotechnology (for example, in the design of novel enzymes).

Since 1994, the performance of current methods has been assessed every two years in the CASP experiment (Critical Assessment of Techniques for Protein Structure Prediction). A continuous evaluation of protein structure prediction web servers is performed by the community project CAMEO3D.


The natural history of malaria involves cyclical infection of humans and female Anopheles mosquitoes. In humans, the parasites grow and multiply first in the liver cells and then in the red cells of the blood. In the blood, successive broods of parasites grow inside the red cells and destroy them, releasing daughter parasites (“merozoites”) that continue the cycle by invading other red cells.

The blood stage parasites are those that cause the symptoms of malaria. When certain forms of blood stage parasites (gametocytes, which occur in male and female forms) are ingested during blood feeding by a female Anopheles mosquito, they mate in the gut of the mosquito and begin a cycle of growth and multiplication in the mosquito. After 10-18 days, a form of the parasite called a sporozoite migrates to the mosquito’s salivary glands. When the Anopheles mosquito takes a blood meal on another human, anticoagulant saliva is injected together with the sporozoites, which migrate to the liver, thereby beginning a new cycle.

Thus the infected mosquito carries the disease from one human to another (acting as a “vector”), while infected humans transmit the parasite to the mosquito. In contrast to the human host, the mosquito vector does not suffer from the presence of the parasites.

The malaria parasite life cycle involves two hosts. During a blood meal, a malaria-infected female Anopheles mosquito inoculates sporozoites into the human host. Sporozoites infect liver cells and mature into schizonts, which rupture and release merozoites. (Of note, in P. vivax and P. ovale a dormant stage [hypnozoites] can persist in the liver (if untreated) and cause relapses by invading the bloodstream weeks, or even years later.) After this initial replication in the liver (exo-erythrocytic schizogony), the parasites undergo asexual multiplication in the erythrocytes (erythrocytic schizogony). Merozoites infect red blood cells. The ring stage trophozoites mature into schizonts, which rupture releasing merozoites. Some parasites differentiate into sexual erythrocytic stages (gametocytes). Blood stage parasites are responsible for the clinical manifestations of the disease. The gametocytes, male (microgametocytes) and female (macrogametocytes), are ingested by an Anopheles mosquito during a blood meal. The parasites’ multiplication in the mosquito is known as the sporogonic cycle. While in the mosquito’s stomach, the microgametes penetrate the macrogametes generating zygotes. The zygotes in turn become motile and elongated (ookinetes) which invade the midgut wall of the mosquito where they develop into oocysts. The oocysts grow, rupture, and release sporozoites, which make their way to the mosquito’s salivary glands. Inoculation of the sporozoites into a new human host perpetuates the malaria life cycle.

Human Factors And Malaria

Biologic characteristics and behavioral traits can influence an individual’s risk of developing malaria and, on a larger scale, the intensity of transmission in a population.

Where does malaria transmission occur?

For malaria transmission to occur, conditions must be such that all three components of the malaria life cycle are present:

  • Anopheles mosquitoes, which are able to feed on humans, and in which the parasites can complete the “invertebrate host” half of their life cycle
  • Humans, who can be bitten by Anopheles mosquitoes, and in whom the parasites can complete the “vertebrate host” half of their life cycle
  • Malaria parasites.

In rare cases malaria parasites can be transmitted from one person to another without requiring passage through a mosquito (from mother to child in "congenital malaria" or through transfusion, organ transplantation, or shared needles).


Climate is a key determinant of both the geographic distribution and the seasonality of malaria. Without sufficient rainfall, mosquitoes cannot survive, and if it is not sufficiently warm, parasites cannot survive in the mosquito.

Anopheles lay their eggs in a variety of fresh or brackish bodies of water, with different species having different preferences. Eggs hatch within a few days, with resulting larvae spending 9-12 days to develop into adults in tropical areas. If larval habitats dry up before the process is completed, the larvae die; if rains are excessive, they may be flushed away and destroyed. Life is precarious for mosquito larvae, with most perishing before becoming adults.

Life is usually short for adult mosquitoes as well, with temperature and humidity affecting longevity. Only older females can transmit malaria, as they must live long enough for sporozoites to develop and move to the salivary glands. This process takes a minimum of nine days when temperatures are warm (30°C or 86°F) and will take much longer at cooler temperatures. If temperatures are too cool (15°C or 59°F for Plasmodium vivax, 20°C or 68°F for P. falciparum), development cannot be completed and malaria cannot be transmitted. Thus, malaria transmission is much more intense in warm and humid areas, with transmission possible in temperate areas only during summer months.
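The temperature limits quoted above can be written down as a small lookup. This is a sketch only: the function and constant names are hypothetical, and treating the quoted figure as a strict cut-off (development possible only above it) is an assumption, since the text does not say what happens exactly at the threshold.

```python
# Temperatures (in °C) quoted in the text at which parasite development in
# the mosquito can no longer be completed.
TOO_COOL_C = {"P. vivax": 15.0, "P. falciparum": 20.0}

def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def development_possible(species, temp_c):
    """True if temp_c is above the quoted limit for this species.

    Assumption: the quoted value itself counts as 'too cool'.
    """
    return temp_c > TOO_COOL_C[species]

for species, limit in TOO_COOL_C.items():
    print(species, limit, "°C =", c_to_f(limit), "°F")

# At 18 °C the quoted limits allow P. vivax but not P. falciparum.
print(development_possible("P. vivax", 18))
print(development_possible("P. falciparum", 18))
```

The difference between the two limits is one way of seeing why transmission in cooler, temperate settings is more readily sustained by P. vivax than by P. falciparum.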

In warm climates people are more likely to sleep unprotected outdoors, thereby increasing exposure to night-biting Anopheles mosquitoes. During harvest seasons, agricultural workers might sleep in the fields or nearby locales, without protection against mosquito bites.

Anopheles Mosquitoes

The types (species) of Anopheles present in an area at a given time will influence the intensity of malaria transmission. Not all Anopheles are equally efficient vectors for transmitting malaria from one person to another. Those species that are most prone to bite humans are the most dangerous, as bites inflicted on animals that cannot be infected with human malaria break the chain of transmission. If the mosquito regularly bites humans, the chain of transmission is unbroken and more people will become infected. Some species are biologically unable to sustain development of human malaria parasites, while others are readily infected and produce large numbers of sporozoites (the parasite stage that is infective to humans).

Many of the most dangerous species bite humans indoors. For these species insecticide-treated mosquito nets and indoor residual spraying (whereby the inner walls of dwellings are coated with a long-lasting insecticide) are effective interventions. Both of these interventions require attention to insecticide resistance, which will evolve if the same insecticide is used continuously in the same area.


Biologic characteristics (inborn and acquired) and behavioral traits can influence an individual’s malaria risk and, on a larger scale, the overall malaria ecology.


Characteristics of the malaria parasite can influence the occurrence of malaria and its impact on human populations, for example:

  • Areas where P. falciparum predominates (such as Africa south of the Sahara) will suffer more disease and death than areas where other species, which tend to cause less severe manifestations, predominate
  • P. vivax and P. ovale have stages (“hypnozoites”) that can remain dormant in the liver cells for extended periods of time (months to years) before reactivating and invading the blood. Such relapses can result in resumption of transmission after apparently successful control efforts, or can introduce malaria in an area that was malaria-free
  • P. falciparum (and to a lesser extent P. vivax) have developed strains that are resistant to antimalarial drugs. Such strains are not uniformly distributed. Constant monitoring of the susceptibility of these two parasite species to drugs used locally is critical to ensure effective treatment and successful control efforts. Travelers to malaria-risk areas should use for prevention only those drugs that will be protective in the areas to be visited.

Plasmodium falciparum predominates in Africa south of the Sahara, one reason why malaria is so severe in that area.

Animal Reservoirs

A species of malaria parasite, P. knowlesi, has recently been recognized to be a cause of significant numbers of human infections. P. knowlesi naturally infects macaques living in Southeast Asia. Humans living in close proximity to populations of these macaques may be at risk of infection with this zoonotic parasite.

Areas Where Malaria Is No Longer Endemic

Malaria transmission has been eliminated in many countries of the world, including the United States. However, in many of these countries (including the United States) Anopheles mosquitoes are still present. Also, cases of malaria still occur in non-endemic countries, mostly in returning travelers or immigrants (“imported malaria”). Thus the potential for reintroduction of active transmission of malaria exists in many non-endemic parts of the world. All patients must be diagnosed and treated promptly for their own benefit but also to prevent the reintroduction of malaria.

Genetic Factors

Biologic characteristics present from birth can protect against certain types of malaria. Two genetic factors, both associated with human red blood cells, have been shown to be epidemiologically important. Persons who have the sickle cell trait (heterozygotes for the abnormal hemoglobin gene HbS) are relatively protected against P. falciparum malaria and thus enjoy a biologic advantage. Because P. falciparum malaria has been a leading cause of death in Africa since remote times, the sickle cell trait is now more frequently found in Africa and in persons of African ancestry than in other population groups. In general, hemoglobin-related disorders and other blood cell dyscrasias, such as Hemoglobin C, the thalassemias and G6PD deficiency, are more prevalent in malaria-endemic areas and are thought to provide protection from malarial disease.

Persons who are negative for the Duffy blood group have red blood cells that are resistant to infection by P. vivax. Since the majority of Africans are Duffy negative, P. vivax is rare in Africa south of the Sahara, especially West Africa. In that area, the niche of P. vivax has been taken over by P. ovale, a very similar parasite that does infect Duffy-negative persons.

Other genetic factors related to red blood cells also influence malaria, but to a lesser extent. Various genetic determinants (such as the “HLA complex,” which plays a role in control of immune responses) may equally influence an individual’s risk of developing severe malaria.

Acquired Immunity

Acquired immunity greatly influences how malaria affects an individual and a community. After repeated attacks of malaria a person may develop a partially protective immunity. Such “semi-immune” persons often can still be infected by malaria parasites but may not develop severe disease, and, in fact, frequently lack any typical malaria symptoms.

In areas with high P. falciparum transmission (most of Africa south of the Sahara), newborns will be protected during the first few months of life presumably by maternal antibodies transferred to them through the placenta. As these antibodies decrease with time, these young children become vulnerable to disease and death by malaria. If they survive repeated infections to an older age (2-5 years) they will have reached a protective semi-immune status. Thus in high transmission areas, young children are a major risk group and are targeted preferentially by malaria control interventions.

In areas with lower transmission (such as Asia and Latin America), infections are less frequent and a larger proportion of the older children and adults have no protective immunity. In such areas, malaria disease can be found in all age groups, and epidemics can occur.

Anemia in young children in Asembo Bay, a highly endemic area in western Kenya. Anemia occurs most between the ages of 6 and 24 months. After 24 months, it decreases because the children have built up their acquired immunity against malaria (and its consequence, anemia).

Mother and her newborn in Jabalpur Hospital, State of Madhya Pradesh, India. The mother had malaria, with infection of the placenta.

Pregnancy and Malaria

Pregnancy decreases immunity against many infectious diseases. Women who have developed protective immunity against P. falciparum tend to lose this protection when they become pregnant (especially during the first and second pregnancies). Malaria during pregnancy is harmful not only to the mothers but also to the unborn children. The latter are at greater risk of being delivered prematurely or with low birth weight, with consequently decreased chances of survival during the early months of life. For this reason pregnant women are also targeted (in addition to young children) for protection by malaria control programs in endemic countries.

Behavioral Factors

Human behavior, often dictated by social and economic reasons, can influence the risk of malaria for individuals and communities. For example:

  • Poor rural populations in malaria-endemic areas often cannot afford the housing and bed nets that would protect them from exposure to mosquitoes. These persons often lack the knowledge to recognize malaria and to treat it promptly and correctly. Often, cultural beliefs result in use of traditional, ineffective methods of treatment.
  • Travelers from non-endemic areas may choose not to use insect repellent or medicines to prevent malaria. Reasons may include cost, inconvenience, or a lack of knowledge.
  • Human activities can create breeding sites for larvae (standing water in irrigation ditches, borrow pits).
  • Agricultural work such as harvesting (also influenced by climate) may force increased nighttime exposure to mosquito bites
  • Raising domestic animals near the household may provide alternate sources of blood meals for Anopheles mosquitoes and thus decrease human exposure
  • War, migrations (voluntary or forced) and tourism may expose non-immune individuals to an environment with high malaria transmission.

Human behavior in endemic countries also determines in part how successful malaria control activities will be in their efforts to decrease transmission. The governments of malaria-endemic countries often lack financial resources. As a consequence, health workers in the public sector are often underpaid and overworked. They lack equipment, drugs, training, and supervision. The local populations are aware of such situations when they occur, and cease relying on the public sector health facilities. Conversely, the private sector suffers from its own problems. Regulatory measures often do not exist or are not enforced. This encourages private consultations by unlicensed, costly health providers, and the anarchic prescription and sale of drugs (some of which are counterfeit products). Correcting this situation is a tremendous challenge that must be addressed if malaria control and, ultimately, elimination are to be successful.

Protective Effect of Sickle Cell Trait Against Malaria

The sickle cell gene carries a mutation that results in a single amino acid substitution (valine instead of glutamate at the 6th position) in the beta chain of hemoglobin. Inheritance of this mutated gene from both parents leads to sickle cell disease, and people with this disease have a shorter life expectancy. In contrast, individuals who are carriers for sickle cell disease (with one sickle gene and one normal hemoglobin gene, also known as sickle cell trait) have some protective advantage against malaria. As a result, the frequencies of sickle cell carriers are high in malaria-endemic areas.
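The inheritance pattern described here follows ordinary Mendelian rules, which can be enumerated in a few lines. This is an illustrative sketch, not part of the cited study: the function name is hypothetical, and the letters "A" (normal) and "S" (sickle) are used only as shorthand for the two alleles, so that "AA", "AS" and "SS" correspond to normal, sickle cell trait and sickle cell disease.

```python
from collections import Counter
from itertools import product

def offspring_genotypes(parent1, parent2):
    """Expected genotype proportions for one child of two parents.

    Each parent is a two-letter string of alleles, e.g. "AS" for a carrier.
    Each parent passes on one of their two alleles with equal probability.
    """
    combos = Counter(
        "".join(sorted(a + b)) for a, b in product(parent1, parent2)
    )
    total = sum(combos.values())
    return {genotype: n / total for genotype, n in combos.items()}

# Two carriers (sickle cell trait).
cross = offspring_genotypes("AS", "AS")
print(cross)
```

For two carrier parents this gives one quarter AA, one half AS and one quarter SS per child. These are expected proportions for a single cross and say nothing by themselves about how common each genotype is in a population.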

CDC’s birth cohort studies (Asembo Bay Cohort Project in western Kenya) conducted in collaboration with the Kenya Medical Research Institute allowed an investigation into this issue. It was found that the sickle cell trait provides 60% protection against overall mortality. Most of this protection occurs between 2-16 months of life, before the onset of clinical immunity in areas with intense transmission of malaria.

Graph of survival curves (“survival function estimates”) of children without any sickle cell genes (HbAA), children with sickle cell trait (HbAS), and children with sickle cell disease (HbSS). Those who had the sickle cell trait (HbAS) had a slight survival advantage over those without any sickle cell genes (HbAA), with children with sickle cell disease (HbSS) faring the worst.

Reference: Protective Effects of the Sickle Cell Gene Against Malaria Morbidity and Mortality. Aidoo M, Terlouw DJ, Kolczak MS, McElroy PD, ter Kuile FO, Kariuki S, Nahlen BL, Lal AA, Udhayakumar V. Lancet 2002 359:1311-1312.

Anopheles Mosquitoes

Malaria is transmitted to humans by female mosquitoes of the genus Anopheles. Female mosquitoes take blood meals for egg production, and these blood meals are the link between the human and the mosquito hosts in the parasite life cycle. The successful development of the malaria parasite in the mosquito (from the “gametocyte” stage to the “sporozoite” stage) depends on several factors. The most important are ambient temperature and humidity (higher temperatures accelerate the parasite growth in the mosquito) and whether the Anopheles survives long enough to allow the parasite to complete its cycle in the mosquito host (“sporogonic” or “extrinsic” cycle, duration 9 to 18 days). In contrast to the human host, the mosquito host does not suffer noticeably from the presence of the parasites.

Diagram of Adult Female Mosquito

Map of the world showing the distribution of predominant malaria vectors

Anopheles freeborni mosquito pumping blood

General Information

There are approximately 3,500 species of mosquitoes grouped into 41 genera. Human malaria is transmitted only by females of the genus Anopheles. Of the approximately 430 Anopheles species, only 30-40 transmit malaria (i.e., are “vectors”) in nature. The rest either bite humans infrequently or cannot sustain development of malaria parasites.

Geographic Distribution

Anophelines are found worldwide except Antarctica. Malaria is transmitted by different Anopheles species in different geographic regions. Within geographic regions, different environments support different species.

Anophelines that can transmit malaria are found not only in malaria-endemic areas, but also in areas where malaria has been eliminated. These areas are thus at risk of re-introduction of the disease.

Life Stages

Like all mosquitoes, Anopheles mosquitoes go through four stages in their life cycle: egg, larva, pupa, and adult. The first three stages are aquatic and last 7-14 days, depending on the species and the ambient temperature. The biting female Anopheles mosquito may carry malaria. Male mosquitoes do not bite and so cannot transmit malaria or other diseases. The adult females are generally short-lived, with only a small proportion living long enough (more than 10 days in tropical regions) to transmit malaria.

Adult females lay 50-200 eggs per oviposition. Eggs are laid singly directly on water and are unique in having floats on either side. Eggs are not resistant to drying and hatch within 2-3 days, although hatching may take up to 2-3 weeks in colder climates.


Mosquito larvae have a well-developed head with mouth brushes used for feeding, a large thorax, and a segmented abdomen. They have no legs. In contrast to other mosquitoes, Anopheles larvae lack a respiratory siphon and for this reason position themselves so that their body is parallel to the surface of the water.

Top: Anopheles egg; note the lateral floats.
Bottom: Anopheles eggs are laid singly.

Larvae breathe through spiracles located on the 8th abdominal segment and therefore must come to the surface frequently.

The larvae spend most of their time feeding on algae, bacteria, and other microorganisms in the surface microlayer. They do so by rotating their head 180 degrees and feeding from below the microlayer. Larvae dive below the surface only when disturbed. Larvae swim either by jerky movements of the entire body or through propulsion with the mouth brushes.

Larvae develop through 4 stages, or instars, after which they metamorphose into pupae. At the end of each instar, the larvae molt, shedding their exoskeleton, or skin, to allow for further growth.

Anopheles Larva. Note the position, parallel to the water surface.

The larvae occur in a wide range of habitats but most species prefer clean, unpolluted water. Larvae of Anopheles mosquitoes have been found in fresh- or salt-water marshes, mangrove swamps, rice fields, grassy ditches, the edges of streams and rivers, and small, temporary rain pools. Many species prefer habitats with vegetation. Others prefer habitats that have none. Some breed in open, sun-lit pools while others are found only in shaded breeding sites in forests. A few species breed in tree holes or the leaf axils of some plants.


The pupa is comma-shaped when viewed from the side. This is a transitional stage between larva and adult. The pupa does not feed, but undergoes radical metamorphosis. The head and thorax are merged into a cephalothorax with the abdomen curving around underneath. As with the larvae, pupae must come to the surface frequently to breathe, which they do through a pair of respiratory trumpets on the cephalothorax. After a few days as a pupa, the dorsal surface of the cephalothorax splits and the adult mosquito emerges onto the surface of the water.

The duration from egg to adult varies considerably among species and is strongly influenced by ambient temperature. Mosquitoes can develop from egg to adult in as little as 7 days but usually take 10-14 days in tropical conditions.

Anopheles Adults. Note (bottom row) the typical resting position.


Like all mosquitoes, adult Anopheles have slender bodies with 3 sections: head, thorax and abdomen.

The head is specialized for acquiring sensory information and for feeding. The head contains the eyes and a pair of long, many-segmented antennae. The antennae are important for detecting host odors as well as odors of aquatic larval habitats where females lay eggs. The head also has an elongate, forward-projecting proboscis used for feeding, and two sensory palps.

The thorax is specialized for locomotion. Three pairs of legs and a single pair of wings are attached to the thorax.

The abdomen is specialized for food digestion and egg development. This segmented body part expands considerably when a female takes a blood meal. The blood is digested over time serving as a source of protein for the production of eggs, which gradually fill the abdomen.

Anopheles mosquitoes can be distinguished from other mosquitoes by the palps, which are as long as the proboscis, and by the presence of discrete blocks of black and white scales on the wings. Adult Anopheles can also be identified by their typical resting position: males and females rest with their abdomens sticking up in the air rather than parallel to the surface on which they are resting.

Adult mosquitoes usually mate within a few days after emerging from the pupal stage. In some species, the males form large swarms, usually around dusk, and the females fly into the swarms to mate. The mating habitats of many species remain unknown.

Males live for about a week, feeding on nectar and other sources of sugar. Females will also feed on sugar sources for energy but usually require a blood meal for the development of eggs. After obtaining a full blood meal, the female will rest for a few days while the blood is digested and eggs are developed. This process depends on the temperature but usually takes 2-3 days in tropical conditions. Once the eggs are fully developed, the female lays them and then seeks blood to sustain another batch of eggs.

The cycle repeats itself until the female dies. Females can survive up to a month (or longer in captivity) but most do not live longer than 1-2 weeks in nature. Their chances of survival depend on temperature and humidity, but also upon their ability to successfully obtain a blood meal while avoiding host defenses.

Female Anopheles dirus feeding

Factors Involved in Malaria Transmission and Malaria Control

Understanding the biology and behavior of Anopheles mosquitoes can aid in designing appropriate control strategies. Factors that affect a mosquito’s ability to transmit malaria include its innate susceptibility to Plasmodium, its host choice, and its longevity. Long-lived species that prefer human blood and support parasite development are the most dangerous. Factors that should be taken into consideration when designing a control program include the susceptibility of malaria mosquitoes to insecticides and the preferred feeding and resting location of adult mosquitoes.

Preferred Sources for Blood Meals

One important behavioral factor is the degree to which an Anopheles species prefers to feed on humans (anthropophily) or animals such as cattle (zoophily). Anthropophilic Anopheles are more likely to transmit the malaria parasites from one person to another. Most Anopheles mosquitoes are not exclusively anthropophilic or zoophilic; many are opportunistic and feed upon whatever host is available. However, the primary malaria vectors in Africa, An. gambiae and An. funestus, are strongly anthropophilic and, consequently, are two of the most efficient malaria vectors in the world.

Life Span

Once ingested by a mosquito, malaria parasites must undergo development within the mosquito before they are infectious to humans. The time required for development in the mosquito (the extrinsic incubation period) is 9 days or longer, depending on the parasite species and the temperature. If a mosquito does not survive longer than the extrinsic incubation period, then she will not be able to transmit any malaria parasites.

It is not possible to measure directly the life span of mosquitoes in nature, but many studies have indirectly measured longevity by examination of their reproductive status or via marking, releasing, and recapturing adult mosquitoes. The majority of mosquitoes do not live long enough to transmit malaria, but some may live as long as three weeks in nature. Though evidence suggests that mortality rate increases with age, most workers estimate longevity in terms of the probability that a mosquito will live one day. Usually these estimates range from a low of 0.7 to a high of 0.9. If survivorship is 90% daily, then a substantial proportion of the population would live longer than 2 weeks and would be capable of transmitting malaria. Any control measure that reduces the average lifespan of the mosquito population will reduce transmission potential. Insecticides thus need not kill the mosquitoes outright, but may be effective by limiting their lifespan.
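The arithmetic behind the daily-survival estimates can be sketched as below. It assumes, as the estimates themselves do, that the probability of surviving each day is constant and independent of age, which the text notes is a simplification. The function name is hypothetical.

```python
def fraction_surviving(daily_survival, days):
    """Fraction of a cohort still alive after `days` days, assuming a
    constant, age-independent daily survival probability."""
    return daily_survival ** days

# Compare the low and high ends of the quoted range over two weeks.
for p in (0.7, 0.8, 0.9):
    alive = fraction_surviving(p, 14)
    print(f"daily survival {p:.1f}: {alive:.1%} alive after 14 days")
```

With a daily survival of 0.9, about 23% of mosquitoes are still alive after 14 days, whereas at 0.7 the figure is under 1%. This steep dependence on daily survival is why an intervention that only shortens average lifespan can still reduce transmission substantially.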

Patterns of Feeding and Resting

Most Anopheles mosquitoes are crepuscular (active at dusk or dawn) or nocturnal (active at night). Some Anopheles mosquitoes feed indoors (endophagic) while others feed outdoors (exophagic). After blood feeding, some Anopheles mosquitoes prefer to rest indoors (endophilic) while others prefer to rest outdoors (exophilic). Biting by nocturnal, endophagic Anopheles mosquitoes can be markedly reduced through the use of insecticide-treated bed nets (ITNs) or through improved housing construction to prevent mosquito entry (e.g., window screens). Endophilic mosquitoes are readily controlled by indoor spraying of residual insecticides. In contrast, exophagic/exophilic vectors are best controlled through source reduction (destruction of larval habitats).

Insecticide Resistance

Insecticide-based control measures (e.g., indoor spraying with insecticides, ITNs) are the principal way to kill mosquitoes that bite indoors. However, after prolonged exposure to an insecticide over several generations, mosquitoes, like other insects, may develop resistance, a capacity to survive contact with an insecticide. Since mosquitoes can have many generations per year, high levels of resistance can arise very quickly. Resistance of mosquitoes to some insecticides has been documented within a few years after the insecticides were introduced. There are over 125 mosquito species with documented resistance to one or more insecticides. The development of resistance to insecticides used for indoor residual spraying was a major impediment during the Global Malaria Eradication Campaign. Judicious use of insecticides for mosquito control can limit the development and spread of resistance, particularly via rotation of different classes of insecticides used for control. Monitoring of resistance is essential to alert control programs to switch to more effective insecticides.


Some Anopheles species are poor vectors of malaria, as the parasites do not develop well (or at all) within them. There is also variation within species. In the laboratory, it has been possible to select for strains of An. gambiae that are refractory to infection by malaria parasites. These refractory strains have an immune response that encapsulates and kills the parasites after they have invaded the mosquito's stomach wall. Scientists are studying the genetic mechanism for this response. It is hoped that some day, genetically modified mosquitoes that are refractory to malaria can replace wild mosquitoes, thereby limiting or eliminating malaria transmission.

Malaria Parasites

Malaria parasites are micro-organisms that belong to the genus Plasmodium. There are more than 100 species of Plasmodium, which can infect many animal species such as reptiles, birds, and various mammals. Four species of Plasmodium have long been recognized to infect humans in nature. In addition there is one species that naturally infects macaques which has recently been recognized to be a cause of zoonotic malaria in humans. (There are some additional species which can, exceptionally or under experimental conditions, infect humans.)

Ring-form trophozoites of P. falciparum in a thin blood smear.

Ring-form trophozoites of P. vivax in a thin blood smear.

Trophozoites of P. ovale in a thin blood smear.

Band-form trophozoites of P. malariae in a thin blood smear.

Schizont and ring-form trophozoite of P. knowlesi in a thin blood smear.


Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, SCoV2) is the cause of the early 2020 pandemic coronavirus lung disease 2019 (COVID-19) and belongs to Betacoronaviruses, a genus of the Coronaviridae family covering the α−δ genera (Leao et al., 2020). The large RNA genome of SCoV2 has an intricate, highly condensed arrangement of coding sequences (Wu et al., 2020). Sequences starting with the main start codon contain an open reading frame 1 (ORF1), which codes for two distinct, large polypeptides (pp), whose relative abundance is governed by the action of an RNA pseudoknot structure element. Upon RNA folding, this element causes a −1 frameshift to allow the continuation of translation, resulting in the generation of a 7,096-amino acid 794 kDa polypeptide. If the pseudoknot is not formed, expression of the first ORF generates a 4,405-amino acid 490 kDa polypeptide. Both the short and long polypeptides translated from this ORF (pp1a and pp1ab, respectively) are posttranslationally cleaved by virus-encoded proteases into functional, nonstructural proteins (nsps). ORF1a encodes eleven nsps, and ORF1ab additionally encodes the nsps 12–16. The downstream ORFs encode structural proteins (S, E, M, and N) that are essential components for the synthesis of new virus particles. In between those, additional proteins (accessory/auxiliary factors) are encoded, for which sequences partially overlap (Finkel et al., 2020) and whose identification and classification are a matter of ongoing research (Nelson et al., 2020; Pavesi, 2020). In total, the number of identified peptides or proteins generated from the viral genome is at least 28 on the evidence level, with an additional set of smaller proteins or peptides being predicted with high likelihood.

High-resolution studies of SCoV and SCoV2 proteins have been conducted using all canonical structural biology approaches, such as X-ray crystallography on proteases (Zhang et al., 2020) and methyltransferases (MTase) (Krafcikova et al., 2020), cryo-EM of the RNA polymerase (Gao et al., 2020; Yin et al., 2020), and liquid-state (Almeida et al., 2007; Serrano et al., 2009; Cantini et al., 2020; Gallo et al., 2020; Korn et al., 2020a; Korn et al., 2020b; Kubatova et al., 2020; Tonelli et al., 2020) and solid-state NMR spectroscopy of transmembrane (TM) proteins (Mandala et al., 2020). These studies have significantly improved our understanding of the functions of molecular components, and they all rely on the recombinant production of viral proteins in high amount and purity.

Apart from structures, purified SCoV2 proteins are required for experimental and preclinical approaches designed to understand the basic principles of the viral life cycle and processes underlying viral infection and transmission. Approaches range from studies on immune responses (Esposito et al., 2020) and antibody identification (Jiang et al., 2020) to interactions with other proteins or components of the host cell (Bojkova et al., 2020; Gordon et al., 2020). These examples highlight the importance of broad approaches for the recombinant production of viral proteins.

The research consortium COVID19-NMR founded in 2020 seeks to support the search for antiviral drugs using an NMR-based screening approach. This requires the large-scale production of all druggable proteins and RNAs and their NMR resonance assignments. The latter will enable solution structure determination of viral proteins and RNAs for rational drug design and the fast mapping of compound binding sites. We have recently produced and determined secondary structures of SCoV2 RNA cis-regulatory elements in near completeness by NMR spectroscopy, validated by DMS-MaPseq (Wacker et al., 2020), to provide a basis for RNA-oriented fragment screens with NMR.

We here compile a compendium of more than 50 protocols (see Supplementary Tables SI1–SI23) for the production and purification of 23 of the 30 SCoV2 proteins or fragments thereof (summarized in Tables 1, 2). We defined those 30 proteins as existing or putative ones to our current knowledge (see later discussion). This compendium has been generated in a coordinated and concerted effort between >30 labs worldwide (Supplementary Table S1), with the aim of providing pure mg amounts of SCoV2 proteins. Our protocols include the rational strategy for construct design (if applicable, guided by available homolog structures), optimization of expression, solubility, yield, purity, and suitability for follow-up work, with a focus on uniform stable isotope-labeling.

TABLE 1. SCoV2 protein constructs expressed and purified, given with the genomic position and corresponding PDBs for construct design.


Gamma-irradiation causes more targeted protein damage in D. radiodurans than E. coli

To investigate oxidative damage to bacterial proteins, cultures were exposed to an acute dose of γ-radiation (6.7 kGy) lethal to E. coli but yielding 55–70% survival of D. radiodurans, and protein carbonyls and relative abundance changes were measured by mass spectrometry (Figs 1B, 2, and EV1). Based on previous work (Krisko & Radman, 2010 ), a dosage of radiation lethal to E. coli is required in order to observe any deleterious impact on D. radiodurans survival. Furthermore, our selected dosage approximates the highest reported dosage (7 kGy) used in bulk protein carbonylation measurements from whole cell lysate and dialyzed samples from both species (Krisko & Radman, 2010 ), providing a basis to model the impact of extrinsic protection of proteins by small molecule antioxidants. In order to limit de novo protein synthesis throughout and following irradiation, bacterial cultures were maintained near 0°C using a custom rack design (Dataset EV1 and EV2). Importantly, this resulted in differential relative protein abundances due specifically to oxidative damage (Materials and Methods), distinguishing our results from previous proteomic studies. Protein concentrations upon extraction were similar regardless of irradiation for each species (Appendix Table S1), and SDS–PAGE banding patterns were also qualitatively similar across protein samples extracted from the same species (Appendix Fig S2). Altogether, these results suggest that cell membrane integrity was preserved upon radiation.

Figure 2. Summary of shotgun redox proteomic data

  1. Total carbonyl-bearing proteins detected by shotgun redox proteomic measurement in three biological replicates each of E. coli and D. radiodurans with and without irradiation. The left axis is the number of sequence-unique proteins detected as carbonylated. The right axis is the number of sites in total detected as carbonylated (red) or not oxidized (black) in peptides bearing at least one carbonyl. Stripes indicate carbonylated proteins and carbonylatable sites detected only in irradiated samples. See also Appendix Fig S1.
  2. Volcano plots for relative protein abundance changes measured by mass spectrometry in E. coli (left) and D. radiodurans (right) after irradiation using the same biological replicates as in Fig 2A. Black-circled points are those proteins with significant changes (paired, 2-sided t-test P-value < 0.05) of > 2-fold or < 0.5-fold. Red points are proteins with at least one carbonylated peptide detected. Fold change and P-value cutoffs considered for significance are indicated by dashed lines. See also Fig EV1.

Figure EV1. Survival and carbonyl site sampling limits for proteomic experiments, related to Figs 2 and 3

  1. Survival rates (based on CFU counts) of irradiated E. coli and D. radiodurans corresponding to biological triplicate samples from which proteomic data were acquired. Absolutely no colonies were recovered from E. coli cultures that had been irradiated, even without diluting the samples before plating.
  2. Carbonyl site measurement saturation curves for biological triplicate shotgun redox proteomic measurements in E. coli and D. radiodurans. Exponential saturation functions were fit to the triplicate data points by minimizing the sum of squared errors; the bolded term in each function is the estimated number of total non-redundant carbonyl sites in our samples.

As expected (Krisko & Radman, 2010), we observed carbonylation of more proteins in E. coli (~700 CS in 102 of 1,373 identified proteins) than in D. radiodurans (~400 CS in 70 of 1,264 identified proteins) under either unirradiated or irradiated conditions (Fig 2A and Table EV1). D. radiodurans showed similar detection rates to that in Photobacterium angustum exposed to UVB (62 carbonylated proteins of 1,221 identified) using the same redox proteomic technique (Matallana-Surget et al, 2013). The lesser total protein carbonylation in D. radiodurans was likely due to its effective ROS detoxification mechanisms (Slade & Radman, 2011). CS saturation curves suggest the fewer detected carbonylation events in D. radiodurans account for a greater percent coverage of all in vivo events than is the case for E. coli (85 and 27%, respectively; Fig EV1B), in agreement with the difference in oxidative stress sensitivity between these species. Slightly more unique proteins were detected as carbonylated in a radiation-dependent manner in D. radiodurans (25) than in E. coli (20; Fig 2A). Based on the much lower estimated coverage of all in vivo carbonylation in E. coli, we suggest that extensive damage to the E. coli proteome, leading to more degraded and aggregated proteins, hindered identification of some carbonylated peptides by mass spectrometry.

Relative protein quantification provided clear evidence of contrasting differential protein damage distinguishing these organisms (Fig 2B and Table EV2). Although in E. coli only six proteins showed significant > 2-fold differential relative abundance (paired t-test P-value < 0.05), 163 proteins overall showed > 2-fold changes, albeit with higher variability across replicates. In D. radiodurans, 81 proteins significantly changed in relative abundance by > 2-fold; the magnitude of change was greater on average, with lower variability, than in E. coli. Proteins for which we detected at least one CS decreased in relative abundance more than other proteins in D. radiodurans (unpaired t-test P-value = 0.031), illustrating the expected relationship between carbonylation and degree of protein degradation. However, this relationship was less prominent in our E. coli data (unpaired t-test P-value = 0.104). Hence, although E. coli accumulated more protein carbonyls overall, their distribution is broader across distinct protein species, providing evidence of more protein-specific mechanisms for protection against ROS in D. radiodurans that are absent in E. coli.

Analogous relative peptide quantification was also performed. For D. radiodurans, 148 peptides representing 134 unique proteins significantly increased in relative abundance (fold change > 2, satisfying Benjamini–Hochberg criteria with false discovery rate of 0.05) after irradiation, and one peptide significantly decreased (fold change < 0.5, satisfying Benjamini-Hochberg criteria). For E. coli, 26 peptides representing 25 unique proteins significantly decreased in relative abundance after irradiation, and no peptides significantly increased. No individual carbonylated peptides significantly changed in relative abundance in either species. These observations generally parallel the anticipated contrasting response upon irradiation of these species. However, greater statistical power is achieved when pooling peptides to evaluate abundance changes at the whole-protein level. This is partly because stochastically missed tryptic sites and post-translational modifications lead to imperfect peptide identity when quantifying at the peptide level.
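The peptide-level filter described above combines a fold-change cutoff with Benjamini–Hochberg control of the false discovery rate. A minimal sketch of that combination follows; the function names are hypothetical and the authors' actual pipeline is not shown in the text, so this is only an illustration of the standard step-up procedure.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return booleans marking hypotheses rejected by the
    Benjamini-Hochberg step-up procedure at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose sorted p-value satisfies p_(k) <= k/m * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def significant_changes(fold_changes, p_values, fdr=0.05):
    """Label each peptide 'up' (> 2-fold), 'down' (< 0.5-fold) or None,
    requiring that it also passes the Benjamini-Hochberg criterion."""
    passed = benjamini_hochberg(p_values, fdr)
    labels = []
    for fc, ok in zip(fold_changes, passed):
        if ok and fc > 2:
            labels.append("up")
        elif ok and fc < 0.5:
            labels.append("down")
        else:
            labels.append(None)
    return labels
```

Because the Benjamini–Hochberg threshold tightens as the number of tests grows, testing thousands of peptides individually is more demanding than testing pooled proteins, which is consistent with the remark above about statistical power.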

Broad functional characterization of proteins with substantial relative abundance change (<0.5-fold or > 2-fold) was carried out by Gene Ontology (GO) biological process term enrichment analysis with protein abundance correction (Scholz et al, 2015 ). These proteins in E. coli exhibited no significantly over- or underrepresented GO annotations. In contrast, D. radiodurans proteins with > 2-fold relative increase were overrepresented by proteins involved in translation and broader protein metabolism (Table 1), including many ribosomal subunits. Additionally, D. radiodurans proteins with < 0.5-fold change underrepresented proteins involved in nitrogen compound biosynthesis, indirectly implicating the importance of amino acid and nucleotide synthesis. Therefore, resistance to protein oxidation in D. radiodurans preferentially protects the critical process of proteome regeneration under oxidative stress.

| Retained or lost | GO ID | Over-/underrepresented | % foreground | % background | Fold enrichment | Foreground count | Background count | P-value | GO biological process |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Retained > 2-fold | GO:0006412 | O | 16.67 | 7.58 | 2.20 | 22 | 10 | 0.037 | Translation |
| Retained > 2-fold | GO:0006518 | O | 19.70 | 9.09 | 2.17 | 26 | 12 | 0.022 | Peptide metabolic process |
| Retained > 2-fold | GO:0044267 | O | 21.97 | 11.36 | 1.93 | 29 | 15 | 0.031 | Cellular protein metabolic process |
| Retained > 2-fold | GO:0009059 | O | 24.24 | 12.88 | 1.88 | 32 | 17 | 0.026 | Macromolecule biosynthetic process |
| Retained > 2-fold | GO:0019538 | O | 28.03 | 15.91 | 1.76 | 37 | 21 | 0.025 | Protein metabolic process |
| Retained > 2-fold | GO:0009987 | O | 73.49 | 61.36 | 1.20 | 97 | 81 | 0.049 | Cellular process |
| Lost < 0.5-fold | GO:0044271 | U | 12.12 | 27.27 | 0.44 | 8 | 18 | 0.048 | Cellular nitrogen compound biosynthetic process |

Amino acid composition protects against oxidative damage

Although the relative frequency of carbonylated RKPT residues generally confirmed previous studies (Rao & Moller, 2011; Matallana-Surget et al, 2013), we found lysine to be as susceptible as proline to carbonylation under γ-irradiation (Fig 3A) in D. radiodurans (ratio 1.77 versus 1.66) and to a lesser extent in E. coli (ratio 1.17 versus 1.43). Protein carbonylation by natively generated ROS in eukaryotes (Rao & Moller, 2011) and UV irradiation in P. angustum (Matallana-Surget et al, 2013) both indicated proline as the most ROS susceptible of RKPT and lysine as not especially or least susceptible, respectively. Proline carbonylation often leads to polypeptide self-cleavage, which may explain the relatively low proline content of bacterial ribosomal versus non-ribosomal proteins (Lott et al, 2013), an evolutionary adaptation contributing to protection of translation against oxidative stress. In contrast, lysine, found incorporated into proteins much more frequently, lacks a similar mechanism for self-cleavage upon carbonylation. The more complex role of lysine in oxidative stress is discussed below.

Figure 3. Amino acid prevalence in proteomic data before and after irradiation

  1. Prevalence of individual RKPT residues and prevalence of carbonylated form in experimentally measured peptides combining all three biological replicates of both conditions for each organism. Ratios are given above each pair of bars. All proportions are significantly different between each RKPT and their respective carbonylation state by two-tailed z-test of two proportions (P-values < 0.01; see Materials and Methods), meaning carbonylated proportions are not determined simply by relative prevalence of RKPT. See also Appendix Fig S1.
  2. Prevalence of all canonical amino acids before irradiation of E. coli and D. radiodurans, combining all three biological replicates for each condition. Ratios are given above each pair of bars. All proportions are significantly different between species by two-tailed z-test of two proportions (P-values < 0.01). See also Figs EV1 and EV2.
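Both panels above rely on a two-tailed z-test of two proportions. A minimal sketch of that test is given here, assuming the common pooled-variance form; the source does not say which variant was used, and the function name is illustrative.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-tailed z-test for equality of two proportions.

    x1/n1 and x2/n2 are the observed proportions (counts over totals).
    Uses the pooled standard error, which is an assumption here.
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p1 - p2) / se
    # Two-tailed p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(lys_carbonylated, all_carbonylated, lys_total, all_rkpt_total)` with hypothetical counts would test whether lysine's share among carbonylated sites differs from its share among all RKPT residues.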

Selective amino acid composition is a major adaptation organisms have evolved to thrive in diverse environmental niches (Brbic et al, 2015 ). Comparing compositions between expressed proteomes of E. coli and D. radiodurans under permissive conditions (Fig 3B) revealed significant differences among oxidizable amino acids. Lysine and arginine, both positively charged at physiological pH, differ in ROS susceptibility and exhibited significant usage differences. While highly susceptible lysine was found to be less frequently used in D. radiodurans, less susceptible arginine was overrepresented instead (0.71-fold and 1.57-fold, respectively). Reversibly oxidizable sulfur-containing amino acids, cysteine and methionine, were rare in both species, but significantly less prevalent in D. radiodurans under permissive conditions (0.53-fold and 0.17-fold, respectively). Surface methionines and cysteines help protect proteins from oxidative damage in many organisms due to their own reversible oxidation (Stadtman & Levine, 2003 ). However, cysteine and methionine are metabolically expensive (i.e., stoichiometrically consume the most ATP) for bacterial synthesis (Kaleta et al, 2013 ), and D. radiodurans is auxotrophic for methionine (Zhou et al, 2017 ), which may explain their significantly lower prevalence in slower-growing D. radiodurans despite expected benefits for resistance. Tryptophan and tyrosine, two metabolically inexpensive amino acids that function as integrated antioxidants in some proteins (Moosmann & Behl, 2000 ), were significantly more abundant in D. radiodurans than in E. coli (both

To evaluate the impact of oxidative stress on amino acid prevalence in identified proteins, we compared changes in amino acid composition after γ-irradiation of E. coli and D. radiodurans (Fig EV2). While only seven amino acids significantly changed in E. coli, 16 significantly changed in D. radiodurans and to a greater magnitude. The greatest decrease among RKPT was lysine in both species, further supporting that incorporated lysine is an important mediator of protein oxidative damage under γ-irradiation. Lysine can sometimes be exchanged for histidine in proteins and still preserve protein function as shown in synthetic mutational studies (Yampolsky & Stoltzfus, 2005). Notably, relative histidine prevalence increased modestly (+2%) in E. coli and significantly (+11%) in D. radiodurans after irradiation, suggesting that D. radiodurans has evolved proteins that are more composed of non-carbonylatable histidine rather than lysine as another protein-intrinsic protection mechanism. Indeed, across sequences of functional orthologs and isozymes in these species (Appendix Fig S3) we found 10% greater histidine composition in D. radiodurans than in E. coli as a fraction of total histidine and lysine (paired t-test P-value < 6 × 10⁻⁶⁰). Following irradiation, tyrosine prevalence significantly increased in E. coli (+4%) and in D. radiodurans (+8%), and cysteine increased significantly (+18%) only in D. radiodurans. The most significant decrease in E. coli (−13%) and increase in D. radiodurans (+45%) was for methionine. This contrast suggests a more efficient methionine sulfoxide reductase system under oxidative stress in D. radiodurans. Altogether, these results establish that protein-intrinsic properties, even in primary structure, differ between E. coli and D. radiodurans and affect which proteins withstand the onslaught of ROS-induced oxidative damage.

Figure EV2. Canonical amino acid prevalence change following irradiation of E. coli (left) and D. radiodurans (right), related to Fig 3

Structure- and sequence-based models predict protein vulnerability to carbonylation

Structure-based molecular feature engineering

The computational phase of this study (Fig 1B) involved proteome-wide derivation of 3D structures to investigate molecular properties contributing to ROS susceptibility (Fig 4A, Table EV3, and Materials and Methods). Due to incomplete proteome coverage by crystal structures (<3% for D. radiodurans proteins), computation of molecular features required high-throughput modeling of single-chain proteins, which we performed de novo for D. radiodurans; for E. coli we used published models (Xu & Zhang, 2013b; Yang et al, 2015). The challenge of deriving D. radiodurans proteins by available modeling strategies is summarized in Fig EV3A. The best representative model from alternative methods (Appendix Table S2) for each protein was selected using multiple structure quality metrics (Appendix Table S3). Models generally evaluated comparably to crystal structures for D. radiodurans proteins by these metrics (Fig EV3B and Table EV4). Best representative models were obtained for >95% of D. radiodurans proteins (Fig EV3C), most commonly resulting from I-TASSER (Yang et al, 2015) or ProtMod ( Future replacement with higher quality models or experimentally determined structures could improve the performance of our algorithm.

Figure 4. Feature engineering

  1. Three-dimensional feature engineering from molecular properties. Initial properties that can be determined only with an atomic resolution structure, in the context of an amino acid sequence, or that depend only on amino acid identity are denoted at left. This property list is a non-redundant abbreviated set of all properties considered (see Appendix Table S4 and Materials and Methods for full detail). Columns of the feature matrix at right are alternating property sums and means at spatial scales denoted below matrix. p = a molecular property i = RKPT residue k = neighbor residues of i r = radius length. See also Fig EV3.
  2. Sequence homology-based features for machine learning were derived by performing sequence alignments of all RKPT sites (± 10 residues) anchored at the central residue to compute alignment scores that were then reduced to a computationally manageable number of features by principal component analysis (PCA).

Figure EV3. D. radiodurans proteome structure modeling, related to Figs 4–6

  1. Distribution of D. radiodurans proteins by difficulty of template-based homology modeling and size regimes relevant for determining structure modeling algorithm applicability. Easy signifies ≥ 10 high-confidence homologous templates available. Medium signifies ≥ 1 high-confidence homologous template available. Hard signifies no high-confidence homologous templates available. Proteins ≤ 200 residues long are amenable to ab initio folding. Proteins ≤ 800 residues long are amenable to homology modeling.
  2. Structure quality evaluation criteria and percentage of D. radiodurans protein structures that satisfy published criteria thresholds. Blue plot represents best representative models for D. radiodurans proteins. Gray plot represents best available crystal structures from the PDB for D. radiodurans proteins.
  3. Distribution of methods used to derive best representative protein structures for D. radiodurans. “None” indicates the proteins for which no PDB structure exists, and no modeling method is applicable.

For the first time, we engineered molecular features at multiple spatial scales using 3D structures (Fig 4A, Table EV3, Appendix Table S4, and Materials and Methods) to predict carbonylation. Features were computed with respect to all RKPT across D. radiodurans and E. coli proteomes. These features quantitatively summarize the molecular environment of carbonylatable sites. Statistical summaries of local structural properties were computed as the sums and means of canonical property values for neighboring residues within multiple radii to account for a gradient of scales. This feature engineering strategy enabled incorporation of more molecular properties, with spatial dimensionality, than is possible using sequences alone to represent proteins.
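The sum-and-mean summaries over neighboring residues can be sketched as follows. This is an illustration only: the data layout, the choice of 5 Å and 8 Å radii, and the convention of excluding the site itself are assumptions, and the feature names merely imitate the `property_radius_statistic` pattern seen in names such as posCharge_8A_sum.

```python
import math


def neighborhood_features(site_xyz, neighbors, prop, radii=(5.0, 8.0)):
    """Sum and mean of one property over residues near a site.

    site_xyz  -- (x, y, z) coordinates of the RKPT residue, in angstroms
    neighbors -- list of dicts, each with an 'xyz' tuple and property values
                 (assumed here not to include the site itself)
    prop      -- key of the property to summarize, e.g. 'posCharge'
    radii     -- cutoff distances in angstroms (illustrative values)
    """
    features = {}
    for r in radii:
        values = [n[prop] for n in neighbors
                  if math.dist(site_xyz, n["xyz"]) <= r]
        total = sum(values)
        features[f"{prop}_{r:g}A_sum"] = total
        # Mean is defined as 0.0 when no neighbor falls inside the radius.
        features[f"{prop}_{r:g}A_mean"] = total / len(values) if values else 0.0
    return features
```

Repeating the call for each property and each RKPT site would yield one row of alternating sums and means per site, in the spirit of the feature matrix described for Fig 4A.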

Combining structure- and sequence-based approaches for machine learning

In addition to structure-derived features, we implemented simple sequence alignment-based feature engineering to predict CS (Fig 4B). We defined a local neighborhood centered on each RKPT covered by carbonylated peptides in our proteomic data and performed all-by-all pairwise sequence alignments of these regions, using the alignment score matrix as potential predictive features. This alignment-based approach is agnostic to specific sequence motifs while still leveraging any useful local sequence homology across CS.
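The local-neighborhood idea can be illustrated with a short sketch that extracts a window of ±10 residues around each R, K, P or T and scores every pair of windows. The scoring below is a deliberately simple ungapped identity count standing in for the alignment score, which is an assumption; the source does not specify the alignment method or scoring matrix here, and the subsequent PCA step is omitted.

```python
RKPT = set("RKPT")


def rkpt_windows(sequence, flank=10):
    """Yield (position, window) for every R, K, P or T in `sequence`.

    Windows are padded with '-' at the termini so that the site always
    sits at index `flank` of a window of length 2 * flank + 1.
    """
    padded = "-" * flank + sequence + "-" * flank
    for i, aa in enumerate(sequence):
        if aa in RKPT:
            yield i, padded[i:i + 2 * flank + 1]


def anchored_identity(w1, w2):
    """Ungapped similarity of two equal-length windows anchored at their
    centers: the number of identical residues, ignoring padding."""
    return sum(a == b and a != "-" for a, b in zip(w1, w2))


def score_matrix(windows):
    """All-by-all score matrix; each row is a candidate feature vector."""
    return [[anchored_identity(a, b) for b in windows] for a in windows]
```

Each row of the resulting matrix describes one site by its similarity to every other site, which is why no predefined sequence motif is needed.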

All RKPT from carbonylated peptides were mapped to respective protein structure and sequence to assign carbonylated and non-carbonylated residues. Unlike previous CS prediction efforts (Maisonneuve et al, 2009; Lv et al, 2014; Weng et al, 2017), we did not assume that any given RKPT is deterministically carbonylated or not. Protein carbonylation is an inherently stochastic process. Therefore, we took a probabilistic approach and used all of the carbonylated peptide data regardless of site redundancy or occurrence as carbonylated in one peptide but non-carbonylated in another. Previous approaches also often sampled unmodified RKPT across all detected peptides, carbonylated or not, to define negatives for training. Compared to non-carbonylated peptides, unmodified RKPT on peptides bearing a carbonyl on another residue better represent negative data because it is certain that those molecules were directly exposed to ROS yet did not react with ROS.

Independent probability estimators for CS were trained by logistic regression using structure-based features and sequence-based features and then combined into a stacked model. Each independent model and the stacked model were evaluated by leave-1-out validation and their performance quantified by receiver operating characteristic (ROC) analysis (Fig 5A from data in Tables EV5 and EV6). At the residue scale, our stacked model outperformed (AUCnorm = 0.73) each of its structure- and sequence-based components. Shuffling each feature before training yielded random performance (AUCnorm = 0.54), strongly supporting the predictive power of our engineered features. We also evaluated performance of our model for predicting protein-scale vulnerability to oxidation (Fig 5B) by calculating a CS enrichment metric. Predicted carbonylation enrichments for training set proteins strongly rank correlate with enrichments derived from measured carbonylated peptides (Spearman ρ = 0.82, permutation test P-value = 1.3 × 10⁻²² for E. coli and Spearman ρ = 0.87, permutation test P-value = 7.2 × 10⁻²¹ for D. radiodurans), signifying that our model can predict relative propensity to carbonylation of different protein species. Due to prioritized sensitivity, our model tends to predict higher enrichment values than derived experimentally (1.9-fold on average for E. coli and 1.7-fold for D. radiodurans), but these predicted enrichment values are plausible given the fact that in vivo carbonylation events are undersampled experimentally (Fig EV1B).
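For readers unfamiliar with ROC analysis, the area under the curve can be computed directly from predicted probabilities and observed labels. The sketch below uses the rank-sum identity for the plain AUC; the normalization behind the AUCnorm values quoted above is not described in this excerpt and is not reproduced here.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve.

    Uses the Mann-Whitney identity: AUC equals the probability that a
    randomly chosen positive scores higher than a randomly chosen
    negative, with ties counted as one half.
    labels -- iterable of 0/1 (1 = carbonylated)
    scores -- predicted probabilities, same length as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to the chance diagonal, which is the level the shuffled-feature control is expected to approach.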

Figure 5. Multi-scale validation of protein carbonylation predictor

  1. Residue-scale validation: Receiver operating characteristic (ROC) curves for CS predictors derived by leave-1-out validation. The dashed black line at y=x corresponds to performance expected by chance. Top left = final predictor trained by stacking structure- and sequence-based models. Top middle = predictor trained only on structure-based features. Top right = predictor trained only on sequence-based features. Bottom left = theoretical maximum predictive power for a probability estimator (AUC = 0.98). Bottom middle = same algorithm as used for final predictor but with all features shuffled beforehand. Bottom right = CSPD model developed using metal-catalyzed oxidation (MCO) site data from E. coli. See also Figs EV3 and EV4.
  2. Protein-scale validation: Comparison between predicted CS enrichment from leave-1-out validation to CS enrichment computed from all carbonylated peptides measured for E. coli (left) and D. radiodurans (right). Each point represents a different protein species. Predicted probability-weighted CS enrichment = (sum of carbonylation probabilities across training set sites)/(number of residues in corresponding peptides from experiments). Experimentally measured probability-weighted CS enrichment = (sum of empirical oxidation probabilities across training set sites)/(number of residues in corresponding peptides from experiments). The solid line is the fitted regression line, and dashed lines indicate the boundaries of the 95% confidence interval.
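The two enrichment definitions in the caption share one form: a sum of per-site probabilities divided by the number of residues covered by the corresponding peptides. A minimal sketch, with hypothetical argument names and illustrative numbers:

```python
def cs_enrichment(site_probabilities, n_residues):
    """Probability-weighted CS enrichment for one protein.

    site_probabilities -- carbonylation probability for each training-set
                          site (model output for the predicted value,
                          empirical frequency for the measured value)
    n_residues         -- residues in the corresponding measured peptides
    """
    if n_residues <= 0:
        raise ValueError("n_residues must be positive")
    return sum(site_probabilities) / n_residues


# Illustrative numbers only: three sites on peptides spanning 20 residues.
predicted = cs_enrichment([0.6, 0.7, 0.5], 20)
measured = cs_enrichment([0.5, 0.3, 0.2], 20)
print(f"predicted/measured ratio: {predicted / measured:.2f}")
```

Because both values share the same denominator, their ratio depends only on the summed probabilities, so a model tuned for sensitivity will tend to give ratios above 1.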

Molecular properties explain vulnerability to carbonylation

Of the roughly 400 structure-based features in the modeling, only seven of the logistic regression coefficients were non-zero: relative reactivity with ROS (reactivity_res), codon diversity, whether the RKPT site was a threonine residue, molecular volume, local solvent accessible surface area, local positive charge, and local lysine residues. Codon diversity (AAindCodonDiv_res) itself is unlikely to be causal. Instead, this feature has the same rank order as carbonylation prevalence in D. radiodurans from our experiments (Fig 3A) and is therefore a fortuitous proxy for γ-specific reactivity. Threonine is by far the least frequently carbonylated of RKPT in both species (Fig 3A), and inclusion of this feature (Thr_res) in our model reflects this lower propensity to reaction with ROS.

Aside from the reactivity features differentiating RKPT, all other explanatory properties for ROS susceptibility derived from 3D structures (Fig 6). Accessibility to ROS promotes carbonylation (Fig 6A). The lower the molecular volume of a residue (AAindMolVol_res), the more likely it will react with ROS due to lower steric effects. Similarly, lower local surface area (areaSAS_5A_sum) surrounding a near-surface site indicates less likelihood of shielding by surrounding structure, such as the protrusion in Fig 6D. Local positive charges (posCharge_8A_sum) promote carbonylation by attracting negatively charged superoxide radicals (Fig 6B). Colocalization of highly reactive sites may cause progressive protein misfolding, exposing neighboring residues to ROS (Maisonneuve et al, 2009; Fig 6C). In our model, neighboring lysine residues (Lys_8A_sum) contribute to the probability of carbonylation, lysine being the most prevalently carbonylated RKPT under γ-irradiation in our data (Fig 3A). Polarity leading to solubility of lysine-rich regions could also contribute to this effect. Sites without neighboring lysines are less likely to be carbonylated (Fig 6D).

Figure 6. Molecular properties predicting protein vulnerability to carbonylation

  • A–D. Example sites prone to carbonylation. (A) DRA0302_P252, (B) DR0099_P51, and (C) b0911_K411 and example robust site (D) b3313_P69.

Our algorithm also extends to prediction of metal-catalyzed oxidation

We applied Carbonylated Site and Protein Detection (CSPD) developed by Maisonneuve et al ( 2009 ) to predict CS across our training set (Fig 5A). CSPD performance on our data was essentially random (AUCnorm = 0.53). It is important to note that CSPD was developed using metal-catalyzed oxidation (MCO) data from a set of only 23 carbonylated E. coli proteins derived from samples prepared under similar conditions to our negative controls. However, while we kept our samples on ice after harvesting the exponential phase cells, Maisonneuve et al did not report any similar temperature treatment for their samples. In this way, the samples of Maisonneuve et al being prepared at higher temperature allowed protein synthesis and turnover that would have led to fewer detectable carbonylated proteins than we measured. Furthermore, Maisonneuve et al performed 2D SDS–PAGE and excised only visible spots labeled for carbonylation, which could have further limited the number of distinct carbonylated proteins identified from their samples. In all, we identified 82 carbonylated proteins in our E. coli-negative controls, including 10 in common with the Maisonneuve et al data. The inability of CSPD to generalize to carbonylation from γ-irradiation may be due in part to the experimental differences noted above in addition to a difference in effects of each specific source of ROS. Therefore, to more directly compare algorithmic performance we also used our algorithm to train a model predicting MCO using the same redox proteomic data used to develop CSPD (Fig EV4). CSPD showed modest positive performance on this dataset (AUCnorm = 0.58), the discrepancy in previously reported performance owing to our inclusion of all carbonylated peptides with carbonylated and non-carbonylated residues defined as described above. 
We conclude that CSPD was overfitted to the MCO data and depends on the assumption of deterministic protein carbonylation and on less-strict standards for defining non-carbonylated residues in proteomic data.

Figure EV4. Residue-scale validation of metal-catalyzed oxidation (MCO) predictor, related to Fig 5 A

Furthermore, our stacked model for MCO prediction performed better (AUCnorm = 0.75) than our γ-induced oxidation model with better synergy in stacking the structure- (AUCnorm = 0.72) and sequence-based (AUCnorm = 0.67) models. This performance difference was likely due to the relatively less diverse products of MCO than γ-induced oxidation. ROS production in MCO is more localized because it depends on the presence of Fe or Cu cations to drive the Fenton reaction and therefore affects a smaller number of proteins than γ-induced oxidation. Indeed, data from γ-irradiation experiments include not only CS caused by ROS from water radiolysis but also basal cellular oxidation due to native ROS sources, including MCO and cellular respiration. Thus, oxidation from γ-irradiation is more diverse and complex than MCO products and more challenging for learning structure and sequence signatures.

Intra- and interspecies differences in protein vulnerability to carbonylation

D. radiodurans proteome maintenance is protected from carbonylation

Orthologs and isozymes mapped between E. coli and D. radiodurans (Appendix Fig S3) were compared by their unweighted carbonylation enrichment (Fig 7 and Table EV7) as computed from proteome-wide CS prediction in E. coli and D. radiodurans to reveal functional classes and individual proteins differing in susceptibility between and within these proteomes. Functional classes known to be involved in resistance and recovery from oxidative stress include the following: ribosomal, ribosomal assembly, translation, protein chaperone, protease and peptidase, amino acid and peptide transport, DNA repair, DNA damage response and regulation of repair, native ROS production, ROS detoxification, ROS response, metal transport, terpenoid synthesis, and polyamine accumulation.

Figure 7. Interspecies comparison of predicted protein vulnerability to carbonylation

91% of all data points. This reference region distinguishes outlier points that are distant from the main population. Outliers with associated experimental evidence related to hypersensitivity to oxidative stress are labeled with their protein names. See also Appendix Fig S3 and Fig EV5.

Pairwise orthologs were compared based on protein-intrinsic and extrinsic factors contributing to their propensity to carbonylation (Fig 7). Perpendicular distance to the y = x diagonal represents the relative degree to which one ortholog is intrinsically more or less sensitive given the same ROS dosage on the basis of carbonylation enrichment alone. Protein-extrinsic factors, such as the Mn-dependent scavenging system in D. radiodurans (Daly, 2012 ) and the antioxidant carotenoid deinoxanthin (Tian et al, 2009 ), also contribute to interspecies differences in protein oxidation. Such protein-extrinsic factors act broadly by reducing the effective cellular dosage of ROS. An acute gamma dosage of 7 kGy, approximately the same as in this study, yielded about 3.78-fold more protein carbonyls in E. coli lysate than in D. radiodurans (Materials and Methods) due to small molecules removable by dialysis (Krisko & Radman, 2010 ). Assuming such factors act globally without favoring protection of specific proteins, the degree to which these extrinsic factors differentiate vulnerability to carbonylation between orthologs can be modeled in combination with protein-intrinsic factors simply by computing perpendicular distance to the y = x/3.78 diagonal (Fig 7). By this model, especially susceptible proteins benefit more from an effectively lower dosage of ROS in D. radiodurans.
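The two distance measures described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name is hypothetical, and placing E. coli enrichment on the x axis and D. radiodurans enrichment on the y axis is an assumption, since the axis assignment is not stated here.

```python
import math

def signed_distance_to_line(x, y, fold=1.0):
    """Signed perpendicular distance from the point (x, y) to the line y = x / fold.

    Positive values lie above the line, negative values below it.
    """
    slope = 1.0 / fold
    return (y - slope * x) / math.sqrt(1.0 + slope ** 2)

# Hypothetical ortholog pair (values invented for illustration).
# Assumed axes: x = enrichment in E. coli, y = enrichment in D. radiodurans.
x, y = 2.0, 1.0
intrinsic = signed_distance_to_line(x, y)            # relative to y = x
combined = signed_distance_to_line(x, y, fold=3.78)  # relative to y = x/3.78
print(round(intrinsic, 3), round(combined, 3))
```

Using a signed distance keeps track of which side of the diagonal a pair falls on; taking the absolute value gives the plain perpendicular distance.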

Relative vulnerability to ROS differed between E. coli and D. radiodurans within particular functional classes (Fig 7). We predicted the intrinsic susceptibility of E. coli ribosomal proteins to be more than 2.4-fold greater than across all orthologs (unpaired t-test P-value = 0.01). Accounting for extrinsic ROS protection predicted ribosomal proteins to be the most favored functional class in D. radiodurans over E. coli (1.5-fold, unpaired t-test P-value = 1.2 × 10^−26), in agreement with D. radiodurans ribosomal proteins being enriched among those with relative abundance increases after irradiation. Protein chaperones in E. coli were predicted on average 1.13-fold more intrinsically vulnerable than in D. radiodurans (unpaired t-test P-value = 0.02), a difference further distinguished due to being more than 4.5-fold greater than the difference across all orthologs (unpaired t-test P-value = 0.003) and 1.14-fold greater when accounting for extrinsic protection as well (unpaired t-test P-value = 0.02). E. coli proteins involved in polyamine synthesis and uptake are predicted to be more than 3.7-fold more intrinsically vulnerable than across all orthologs (unpaired t-test P-value = 0.04). Revisiting the observation that methionine usage featured prominently in D. radiodurans proteins retained after irradiation, we predicted that methionine sulfoxide reductases acting on protein-incorporated methionine, MsrB and MsrP, are both 1.4-fold more intrinsically sensitive to carbonylation in E. coli. MsrP was also in the 94th percentile of proteins benefiting from extrinsic protection in D. radiodurans.

Comparison of interspecies outliers reveals proteins involved in oxidative stress resistance

Many proteins involved in coping with oxidative stress were significant outliers in predicted intrinsic vulnerability to carbonylation (Fig 7). There were 111 orthologous pairs greater than 3 standard deviations of distance from the mean of the distribution or greater than 3 standard deviations away from the mean perpendicular distance from the y = x diagonal. We grouped these outliers according to three properties: (i) intrinsic sensitivity or robustness compared to the rest of the proteome, (ii) comparative intrinsic vulnerability between D. radiodurans and E. coli, and (iii) relative effect of ROS detoxification in D. radiodurans over E. coli (Fig EV5).
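A threshold of this kind can be illustrated with a short sketch. It covers only the perpendicular-distance criterion, uses the sample standard deviation, and runs on invented numbers; the function name is hypothetical and this is not the authors' implementation.

```python
import statistics

def flag_outliers(distances, n_sd=3.0):
    """Return indices of values lying more than n_sd standard deviations from the mean."""
    mean = statistics.mean(distances)
    sd = statistics.stdev(distances)  # sample standard deviation (an assumption)
    return [i for i, d in enumerate(distances) if abs(d - mean) > n_sd * sd]

# Invented perpendicular distances: twenty pairs on the diagonal and one far from it.
distances = [0.0] * 20 + [10.0]
print(flag_outliers(distances))
```

Note that a single extreme value inflates the standard deviation it is measured against, so very small samples cannot produce a 3-SD outlier at all.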

Figure EV5. Predicted outliers grouped by comparative intrinsic and extrinsic vulnerability to carbonylation in D. radiodurans and E. coli, related to Fig 7

Proteins predicted as significantly more intrinsically or extrinsically protected from ROS in D. radiodurans relative to E. coli fall into three groups based on the three properties described above. Group 1 proteins were predicted highly carbonylation-prone but more protected intrinsically and extrinsically in D. radiodurans than in E. coli. On average, these 12 proteins were 1.4-fold more CS-enriched in E. coli and above the 99th percentile of extrinsic protection in D. radiodurans. Of 10 proteins detected in both organisms by proteomics, eight had more negative γ-induced relative abundance changes in E. coli than D. radiodurans, with a median E. coli-to-D. radiodurans ratio of 0.47. Ribosomal subunits comprised 11 of these proteins, eight of which are essential in E. coli. E. coli knockouts of rpmI (Nakayashiki & Mori, 2013) are hypersensitive to oxidative stress. Overexpression of rpmG increases resistance to oxidative stress from mitomycin C (Bolt et al, 2015), and GroS overexpression decreases protein carbonyl accumulation (Fredriksson et al, 2005). Seven of these proteins exhibit oxidative stress-induced expression in D. radiodurans (Liu et al, 2003; Slade & Radman, 2011).

Group 2 proteins were predicted as similarly intrinsically carbonylation-prone in both species but significantly extrinsically protected in D. radiodurans. On average, these 22 proteins are above the 86th percentile of extrinsic protection in D. radiodurans. Of 13 proteins detected in both organisms by proteomics, 11 showed substantially more positive γ-induced relative abundance changes in D. radiodurans. In this group, 13 proteins are ribosomal subunits. In E. coli, pstS knockouts are hypersensitive to oxidative stress (Sargentini et al, 2016), rpsL mutants have been shown to affect oxidative stress tolerance (Ballesteros et al, 2001; Miskinyte & Gordo, 2013), and 13 others are essential genes (Baba et al, 2006; Bubunenko et al, 2007). In D. radiodurans, rpsS and hupA knockouts are hypersensitive to oxidative stress (Dulermo et al, 2015), and overexpression of rpsS, rpsT, rplQ, rpsM, rpmB, rplK, rpsL, thpR, rpmE, nrdH, rplR, rplV, and rpsR occurs during oxidative stress (Liu et al, 2003; Slade & Radman, 2011).

Group 3 proteins were predicted significantly more susceptible to carbonylation in E. coli than in D. radiodurans. On average, these 27 proteins were 1.9-fold more CS-enriched in E. coli and above the 95th percentile of extrinsic protection in D. radiodurans. In E. coli, rpmF (Nakayashiki & Mori, 2013; Sargentini et al, 2016) and icd (Krisko et al, 2014) knockouts are hypersensitive to oxidative stress, and osmY (Basak & Jiang, 2012) is also involved in oxidative stress resistance. In D. radiodurans, xseB knockouts are hypersensitive to oxidative stress (Dulermo et al, 2015), and adk, icd, malE, osmC, ppiA, rplB, rpmC, rpsC, and yceI are highly expressed under oxidative stress (Liu et al, 2003; Slade & Radman, 2011; Basu & Apte, 2012). Higher resistance to carbonylation of proteins from these groups sets D. radiodurans apart from E. coli and delineates transgenes that could serve to increase stress tolerance in E. coli.

Interspecies outliers not predicted as significantly more protected from ROS in D. radiodurans fall into two groups. Group 4 proteins were predicted as highly intrinsically robust to carbonylation in both species and therefore not to benefit substantially from extrinsic protection in D. radiodurans. Of these five proteins, three were more intrinsically vulnerable in E. coli, including secE, which is essential in E. coli (Baba et al, 2006), and fdx, which is highly expressed under oxidative stress in D. radiodurans (Liu et al, 2003). Group 5 proteins were predicted as significantly more intrinsically vulnerable to carbonylation in D. radiodurans than in E. coli. These 14 functionally diverse proteins include three known oxidative stress-hypersensitive knockout mutants in D. radiodurans (Dulermo et al, 2015); however, all but two still lie above y = x/3.78 in Fig 7, suggesting that extrinsic protection could still compensate for intrinsic vulnerability differences between these species.


Collection and classification of spliceosome proteins

A total of 244 proteins found in the proteomics analyses of the major human spliceosome [sourced from one or more of the following references ( 2, 4, 8, 37–41)], and 8 proteins specific to the U11/U12 di-snRNP subunit of the minor spliceosome ( Supplementary Table S1 ) ( 42), were downloaded from the NCBI Protein (nr) database. Proteins were classified as ‘abundant’ and ‘non-abundant’ according to ( 2), and they were assigned into groups based mainly on ( 2), followed by references ( 4, 38–40). Proteins classified here as ‘miscellaneous’ were classified in primary sources, variably, as ‘miscellaneous proteins’, ‘miscellaneous splicing factors’, ‘additional proteins’, ‘proteins not reproducibly detected’ and ‘proteins not previously detected’. We disclaim any responsibility for the factual accuracy of the association of proteins with the relevant groups beyond the point of following the primary sources.

Sequence searches, alignments and clustering

Searches of protein homologs in the NCBI Protein (nr) database were carried out at the NCBI using BLASTP/PSI-BLAST ( 43) with default parameter settings. Putative homology was validated by reciprocal BLASTP searches against the Protein database with ‘human’ (NCBI taxon id: 9606) as a taxon search delimiter. Sequence alignments were calculated on the MAFFT server using the Auto strategy ( 44). Clustering analysis of helicase sequences was performed with CLANS ( 45).

Identification and description of structural regions of proteins

Identification of intrinsically ordered and disordered regions of proteins, prediction of protein secondary structure and domain boundaries, as well as fold-recognition (FR) analyses, were carried out via the GeneSilico MetaServer gateway [for references to the original methods, see ( 46)]. In non-trivial cases (usually when putative modeling templates returned by FR scored low and/or various methods disagreed on the best template), FR alignments to the top-scoring templates from the PDB were compared, evaluated and ranked by the PCONS server ( 47), and the PCONS result was used to identify region boundaries. Additional searches were performed on the HHPRED server ( 48).

SCOP database ( 49) IDs used for the purpose of structural domain identification were extracted either from the Protein Data Bank or from the SCOP parseable files on the SCOP website, or assigned using the fastSCOP server ( 50). PFAM domain names were assigned on the PFAM website. SCOP v. 1.75 and PFAM v. 25.0 were used. Structural similarity was compared using the DALI server ( 51).

Assignment of models to structural regions of proteins

In assigning structural models to regions, we followed a four-step procedure ( Figure 1). Whenever a high-resolution experimental structural model (either X-ray or NMR structure) was available, we assigned it to the corresponding sequence region. If a structural similarity to a protein of known structure was predicted for a given region by fold-recognition algorithms (see below for details), we constructed a model for this region by a comparative (template-based) modeling technique, using the detected experimental structures as templates. In the absence of confidently predicted templates, we used de novo folding methods for relatively small fragments likely to form globular domains. For the remaining regions (those without experimentally solved structures and for which the current modeling methodology cannot provide confident predictions of the 3D structure), we generated pro forma models, in which only the primary and (predicted) secondary structure was represented explicitly, while the tertiary arrangement was arbitrary. Pro forma models are not supposed to be reliable at the tertiary level and were constructed for the sake of further analyses (e.g. to initialize protein folding analyses that require some kind of a structural representation as an input).

Rules for selecting and producing structural representations of protein regions. From left to right, structural representations decrease in the average confidence.

For regions with multiple solved structures in the Protein Data Bank, the following criteria of preference were used: (i) structures of the region in complex with other proteins and/or nucleic acids (i.e. in a potentially ‘active’ or ‘functionally relevant’ state) were given priority over structures of the region in isolation, (ii) crystallographic structures were given priority over NMR structures, (iii) higher-resolution crystallographic structures were given priority over lower-resolution structures and (iv) more complete structures were given priority over less complete structures. The following experimental artifacts were removed from experimental structure files or corrected by standard modeling procedures: non-native sequences added to aid in the protein expression and structure determination process (e.g. affinity tags), non-standard amino acids (e.g. selenomethionine was replaced by methionine), and gaps in sequences (e.g. short disordered loop fragments were added). Single chains only were retained if the original PDB file contained multiple chains of the same protein.
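The four preference criteria map naturally onto a lexicographic sort key. The sketch below is only an illustration of that idea: the `Candidate` fields, the identifiers and the example values are all hypothetical, and the authors do not describe how they applied the criteria in practice.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    label: str                   # hypothetical identifier, not a real PDB ID
    in_complex: bool             # solved together with other proteins or nucleic acids
    method: str                  # "xray" or "nmr"
    resolution: Optional[float]  # in angstroms; None for NMR structures
    completeness: float          # fraction of the region that is modeled, 0..1

def preference_key(c):
    """Lower tuples are preferred; fields follow criteria (i) to (iv) in order."""
    return (
        not c.in_complex,                                         # (i) complexes first
        c.method != "xray",                                       # (ii) X-ray before NMR
        c.resolution if c.resolution is not None else math.inf,   # (iii) finer resolution first
        -c.completeness,                                          # (iv) more complete first
    )

def pick_structure(candidates):
    return min(candidates, key=preference_key)

candidates = [
    Candidate("isolated_xray", False, "xray", 1.5, 1.00),
    Candidate("complex_nmr", True, "nmr", None, 0.90),
    Candidate("complex_xray_coarse", True, "xray", 3.0, 0.95),
    Candidate("complex_xray_fine", True, "xray", 2.1, 0.80),
]
print(pick_structure(candidates).label)
```

Because tuples compare element by element, a later criterion only matters when all earlier ones tie, which matches the stated order of priority.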

Comparative models were constructed by default with MODELLER ( 52) based on templates identified in the fold-recognition process. Selected challenging models were constructed using the I-TASSER server ( 53). Selected models were also adjusted with ROSETTA 3.0/3.1 using the loop modeling mode ( 54). De novo models were produced with the ROSETTA 3.0/3.1 AbInitioRelax application and clustered with the Rosetta 3.0/3.1 Cluster Application, following the protocols set out in the ROSETTA User Guide for version 3.1 ( 54). De novo folding was attempted if the following conditions were fulfilled: the region was ≤125 residues in length, predicted to be completely ordered and predicted to contain secondary structure elements. These conditions correspond to the current practical limit of utility of this type of method ( 55). Artificial pro forma spatial representations of protein chains of unknown/uncertain structure or predicted to lack a stable structure were built with UCSF Chimera (v.1.4/1.5) using the Tools>Structure Editing>Build Structure command ( 56). Pro forma constructs reflect only the known primary and predicted secondary structure of the corresponding regions, while their tertiary structure should be regarded as unassigned (and remains to be modeled in the future). Miscellaneous manipulations of structures and models of molecules during this stage were performed in UCSF Chimera ( 56) and Swiss-PdbViewer v. 4.0.1 ( 57).
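Taken together with the four-step procedure described earlier, the de novo conditions amount to a simple decision rule. The following sketch restates that rule in code; the function and argument names are hypothetical, and the boolean inputs stand in for judgments (such as whether a template is "confident") that the authors made by inspection.

```python
def choose_representation(has_experimental_structure,
                          has_confident_template,
                          length,
                          fully_ordered,
                          has_secondary_structure,
                          max_de_novo_length=125):
    """Return the kind of model to build for one protein region."""
    if has_experimental_structure:
        return "experimental"
    if has_confident_template:
        return "comparative"
    # De novo folding is attempted only for short, ordered, structured regions.
    if length <= max_de_novo_length and fully_ordered and has_secondary_structure:
        return "de novo"
    return "pro forma"

# A 90-residue ordered region with predicted helices but no template.
print(choose_representation(False, False, 90, True, True))
```

The order of the checks mirrors the decreasing confidence of the four representations.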

Protein model quality assessment

Assessment of model quality was performed with MetaMQAPII [an updated version of a method described in ( 58)] and QMEAN ( 59).

MetaMQAP predicts the deviation of the query model from the (unknown) native structure and expresses it as the predicted global root mean square deviation (RMSD) and the predicted global distance test total score (GDT_TS) ( 60). The lower the predicted RMSD and the higher the predicted GDT_TS score, the better the model.
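For orientation, the two quantities that MetaMQAP predicts can be computed directly when the native structure is known. The sketch below assumes the per-residue Cα distances after a single fixed superposition are already available (invented numbers here). A full GDT_TS calculation searches over many superpositions, so this is a simplification, and it says nothing about how MetaMQAP arrives at its predictions.

```python
import math

def rmsd(distances):
    """Root mean square deviation over per-residue distances (in angstroms)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Mean percentage of residues within each distance cutoff (standard cutoffs in angstroms)."""
    n = len(distances)
    fractions = [sum(1 for d in distances if d <= c) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Invented model-to-native distances for a five-residue toy example.
distances = [0.5, 1.5, 3.0, 6.0, 10.0]
print(round(rmsd(distances), 2), gdt_ts(distances))
```

The example shows why the two scores complement each other: one badly placed residue dominates the RMSD, while GDT_TS only counts it as lying outside every cutoff.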

QMEAN first calculates an internal score, and then the QMEAN Z-score indicates by how many standard deviations the QMEAN score of the model differs from expected values for experimental structures that have a similar length to the model. High quality models are expected to have positive QMEAN Z-scores, and good models are expected to have a QMEAN Z-score above −2.0. Indicators of accuracy of individual residues were generated by MetaMQAPII and are supplied as B-factor values inside the model files available from the SpliProt3D database website (see below). They can be visualized with the UCSF Chimera command Render By Attribute > (attributes of residues: average B-factor) or with equivalent commands in other molecular visualization programs. Mean values and standard deviations of the QMEAN Z-scores for the six QMEAN contributing factors are provided with this publication ( Supplementary Table S4 ) and the values for all models are provided with the model files. Models of low quality are expected to have a strongly negative QMEAN Z-score, but also strongly negative Z-scores for most of the contributing terms.
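The Z-score itself is an ordinary standardization, sketched below. The reference mean and standard deviation are invented placeholders (QMEAN derives them from experimental structures of similar length), and the three labels simply restate the thresholds quoted above.

```python
def z_score(model_score, reference_mean, reference_sd):
    """Number of standard deviations by which a model score differs from the reference mean."""
    return (model_score - reference_mean) / reference_sd

def quality_label(z):
    """Coarse label following the thresholds quoted in the text."""
    if z > 0.0:
        return "high quality"
    if z > -2.0:
        return "good"
    return "low quality"

# Invented numbers: a model scoring 0.60 against a reference of 0.70 +/- 0.10.
z = z_score(0.60, 0.70, 0.10)
print(round(z, 2), quality_label(z))
```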

As MetaMQAPII is not capable of evaluating multimeric models, for models of protein complexes (11 X-ray models and 2 NMR models) only the quality of the longest chain was evaluated by MetaMQAPII.

Website/database of models

Models and additional data, including alignments of representative sequences annotated with predictions of order/disorder, secondary structure, binding disorder, solvent accessibility and coiled coils, as well as annotations of sites of post-translational modification from UniProt ( 29), are available via the SpliProt3D web server. The entire archive of files available for download is approximately 250 MB.

Visualization of sequence alignments and molecular structures

Sequence alignments were visualized with Jalview v. 2.6.1 ( 61), while molecular structure graphics were produced with UCSF Chimera ( 56).

Protein Structure

Each successive level of protein folding ultimately contributes to its shape and therefore its function.

Learning Objectives

Summarize the four levels of protein structure

Key Takeaways

Key Points

  • Protein structure depends on its amino acid sequence and local, low-energy chemical bonds between atoms in both the polypeptide backbone and in amino acid side chains.
  • Protein structure plays a key role in its function; if a protein loses its shape at any structural level, it may no longer be functional.
  • Primary structure is the amino acid sequence.
  • Secondary structure is local interactions between stretches of a polypeptide chain and includes α-helix and β-pleated sheet structures.
  • Tertiary structure is the overall three-dimensional folding driven largely by interactions between R groups.
  • Quaternary structure is the orientation and arrangement of subunits in a multi-subunit protein.

Key Terms

  • antiparallel: The nature of the opposite orientations of the two strands of DNA or two beta strands that comprise a protein’s secondary structure
  • disulfide bond: A bond, consisting of a covalent bond between two sulfur atoms, formed by the reaction of two thiol groups, especially between the thiol groups of two proteins
  • β-pleated sheet: secondary structure of proteins where N-H groups in the backbone of one fully-extended strand establish hydrogen bonds with C=O groups in the backbone of an adjacent fully-extended strand
  • α-helix: secondary structure of proteins where every backbone N-H creates a hydrogen bond with the C=O group of the amino acid four residues earlier in the same helix.

The shape of a protein is critical to its function because it determines whether the protein can interact with other molecules. Protein structures are very complex, and researchers have only very recently been able to easily and quickly determine the structure of complete proteins down to the atomic level. (The techniques used date back to the 1950s, but until recently they were very slow and laborious to use, so complete protein structures were very slow to be solved.) Early structural biochemists conceptually divided protein structures into four “levels” to make it easier to get a handle on the complexity of the overall structures. To determine how the protein gets its final shape or conformation, we need to understand these four levels of protein structure: primary, secondary, tertiary, and quaternary.

Primary Structure

A protein’s primary structure is the unique sequence of amino acids in each polypeptide chain that makes up the protein. Really, this is just a list of which amino acids appear in which order in a polypeptide chain, not really a structure. But, because the final protein structure ultimately depends on this sequence, this was called the primary structure of the polypeptide chain. For example, the pancreatic hormone insulin has two polypeptide chains, A and B.

Primary structure: The A chain of insulin is 21 amino acids long and the B chain is 30 amino acids long, and each sequence is unique to the insulin protein.

The gene, or sequence of DNA, ultimately determines the unique sequence of amino acids in each peptide chain. A change in nucleotide sequence of the gene’s coding region may lead to a different amino acid being added to the growing polypeptide chain, causing a change in protein structure and therefore function.

The oxygen-transport protein hemoglobin consists of four polypeptide chains, two identical α chains and two identical β chains. In sickle cell anemia, a single amino acid substitution in the hemoglobin β chain causes a change in the structure of the entire protein. When the amino acid glutamic acid is replaced by valine in the β chain, the polypeptide folds into a slightly different shape that creates a dysfunctional hemoglobin protein. So, just one amino acid substitution can cause dramatic changes. These dysfunctional hemoglobin proteins, under low-oxygen conditions, start associating with one another, forming long fibers made from millions of aggregated hemoglobins that distort the red blood cells into crescent or “sickle” shapes, which clog blood vessels. People affected by the disease often experience breathlessness, dizziness, headaches, and abdominal pain.

Sickle cell disease: Sickle cells are crescent shaped, while normal cells are disc-shaped.

Secondary Structure

A protein’s secondary structure is whatever regular structures arise from interactions between neighboring or nearby amino acids as the polypeptide starts to fold into its functional three-dimensional form. Secondary structures arise as H bonds form between local groups of amino acids in a region of the polypeptide chain. Rarely does a single secondary structure extend throughout the polypeptide chain; it is usually confined to a section of the chain. The most common forms of secondary structure are the α-helix and β-pleated sheet structures, and they play an important structural role in most globular and fibrous proteins.

Secondary structure: The α-helix and β-pleated sheet form because of hydrogen bonding between carbonyl and amino groups in the peptide backbone. Certain amino acids have a propensity to form an α-helix, while others have a propensity to form a β-pleated sheet.

In the α-helix chain, the hydrogen bond forms between the oxygen atom in the polypeptide backbone carbonyl group in one amino acid and the hydrogen atom in the polypeptide backbone amino group of another amino acid that is four amino acids farther along the chain. This holds the stretch of amino acids in a right-handed coil. Every helical turn in an alpha helix has 3.6 amino acid residues. The R groups (the side chains) of the polypeptide protrude out from the α-helix chain and are not involved in the H bonds that maintain the α-helix structure.
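The bonding pattern just described can be listed explicitly for an idealized helix. This is a toy sketch with hypothetical function names: it assumes every residue adopts the helical conformation and ignores end effects and side chains.

```python
RESIDUES_PER_TURN = 3.6

def helix_hbond_pairs(n_residues):
    """Pairs (i, i + 4): the backbone C=O of residue i bonds to the N-H of residue i + 4.

    Residues are numbered from 1; helices shorter than five residues have no such pair.
    """
    return [(i, i + 4) for i in range(1, n_residues - 3)]

def helix_turns(n_residues):
    """Number of turns in an ideal helix of the given length."""
    return n_residues / RESIDUES_PER_TURN

print(helix_hbond_pairs(10))
print(round(helix_turns(10), 2))
```

The first four N-H groups and the last four C=O groups of the helix have no partner within it, which is why the list is shorter than the residue count.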

In β-pleated sheets, stretches of amino acids are held in an almost fully-extended conformation that “pleats” or zig-zags due to the non-linear nature of single C-C and C-N covalent bonds. β-pleated sheets never occur alone. They have to be held in place by other β-pleated sheets. The stretches of amino acids in β-pleated sheets are held in their pleated sheet structure because hydrogen bonds form between the oxygen atom in a polypeptide backbone carbonyl group of one β-pleated sheet and the hydrogen atom in a polypeptide backbone amino group of another β-pleated sheet. The β-pleated sheets which hold each other together align parallel or antiparallel to each other. The R groups of the amino acids in a β-pleated sheet point out perpendicular to the hydrogen bonds holding the β-pleated sheets together, and are not involved in maintaining the β-pleated sheet structure.

Tertiary Structure

The tertiary structure of a polypeptide chain is its overall three-dimensional shape, once all the secondary structure elements have folded together. Interactions between polar, nonpolar, acidic, and basic R groups within the polypeptide chain create the complex three-dimensional tertiary structure of a protein. When protein folding takes place in the aqueous environment of the body, the hydrophobic R groups of nonpolar amino acids mostly lie in the interior of the protein, while the hydrophilic R groups lie mostly on the outside. Cysteine side chains form disulfide linkages in the presence of oxygen, the only covalent bond forming during protein folding. All of these interactions, weak and strong, determine the final three-dimensional shape of the protein. When a protein loses its three-dimensional shape, it will no longer be functional.

Tertiary structure: The tertiary structure of proteins is determined by hydrophobic interactions, ionic bonding, hydrogen bonding, and disulfide linkages.

Quaternary Structure

The quaternary structure of a protein is how its subunits are oriented and arranged with respect to one another. As a result, quaternary structure only applies to multi-subunit proteins; that is, proteins made from more than one polypeptide chain. Proteins made from a single polypeptide will not have a quaternary structure.

In proteins with more than one subunit, weak interactions between the subunits help to stabilize the overall structure. Enzymes often play key roles in bonding subunits to form the final, functioning protein.

For example, insulin is a ball-shaped, globular protein that contains both hydrogen bonds and disulfide bonds that hold its two polypeptide chains together. Silk is a fibrous protein that results from hydrogen bonding between different β-pleated chains.

Four levels of protein structure: The four levels of protein structure can be observed in these illustrations.

Figure 1: Gallery of proteins. Representative examples of protein size are shown with examples drawn to illustrate some of the key functional roles they take on. All the proteins in the figure are shown on the same scale to give an impression of their relative sizes. The small red objects shown on some of the molecules are the substrates for the protein of interest. For example, in hexokinase, the substrate is glucose. The handle in ATP synthase is known to exist but the exact structure was not available and thus is only schematically drawn. Names in parentheses are the PDB database entry IDs. (Figure courtesy of David Goodsell).

Proteins are often referred to as the workhorses of the cell. An impression of the relative sizes of these different molecular machines can be garnered from the gallery shown in Figure 1. One favorite example is provided by the Rubisco protein shown in the figure that is responsible for atmospheric carbon fixation, literally building the biosphere out of thin air. This molecule, one of the most abundant proteins on Earth, is responsible for extracting about a hundred Gigatons of carbon from the atmosphere each year. This is ≈10 times more than all the carbon dioxide emissions made by humanity from car tailpipes, jet engines, power plants and all of our other fossil-fuel-driven technologies. Yet carbon levels keep on rising globally at alarming rates because this fixed carbon is subsequently reemitted in processes such as respiration, etc. This chemical fixation is carried out by these Rubisco molecules with a monomeric mass of 55 kDa fixating CO2 one at a time, with each CO2 with a mass of 0.044 kDa (just another way of writing 44 Da that clarifies the 1000:1 ratio in mass). For another dominant player in our biosphere consider the ATP synthase (MW≈500-600 kDa, BNID 106276), also shown in Figure 1, that decorates our mitochondrial membranes and is responsible for synthesizing the ATP molecules (MW=507 Da) that power much of the chemistry of the cell. These molecular factories churn out so many ATP molecules that all the ATPs produced by the mitochondria in a human body in one day would have nearly as much mass as the body itself. As we discuss in the vignette on “What is the turnover time of metabolites?” the rapid turnover makes this less improbable than it may sound.

Figure 2: A Gallery of homooligomers showing the beautiful symmetry of these common protein complexes. Highlighted in pink are the monomeric subunits making up each oligomer. Figure by David Goodsell.

The size of proteins such as Rubisco and ATP synthase and many others can be measured both geometrically, in terms of how much space they take up, and in terms of their sequence size, as determined by the number of amino acids that are strung together to make the protein. Given that the average amino acid has a molecular mass of 100 Da, we can easily interconvert between mass and sequence length. For example, the 55 kDa Rubisco monomer has roughly 500 amino acids making up its polypeptide chain. The spatial extent of soluble proteins and their sequence size often exhibit an approximate scaling property where the volume scales linearly with sequence size, and thus the radii or diameters tend to scale as the sequence size to the 1/3 power. A simple rule of thumb for thinking about typical soluble proteins like the Rubisco monomer is that they are 3-6 nm in diameter, as illustrated in Figure 1, which shows not only Rubisco but many other important proteins that make cells work. In roughly half the cases it turns out that proteins function when several identical copies are symmetrically bound to each other, as shown in Figure 2. These are called homo-oligomers to differentiate them from the cases where different protein subunits are bound together, forming the so-called hetero-oligomers. The most common states are the dimer and tetramer (and the non-oligomeric monomers). Homo-oligomers are about twice as common as hetero-oligomers (BNID 109185).
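The conversion between mass and sequence length, and the 1/3-power scaling of diameter, can be sketched as follows. The 100 Da per residue figure is from the text. The volume per dalton is an assumption brought in for illustration: it corresponds to a typical protein partial specific volume of about 0.73 cm^3/g, and the protein is idealized as a compact sphere. All names are hypothetical.

```python
import math

AVERAGE_RESIDUE_MASS_DA = 100  # rule of thumb from the text

# Assumption (not from the text): partial specific volume of ~0.73 cm^3/g,
# which converts to roughly 1.21e-3 nm^3 per dalton.
VOLUME_PER_DALTON_NM3 = 1.21e-3


def length_from_mass(mass_da):
    """Approximate number of amino acids for a protein of the given mass."""
    return mass_da / AVERAGE_RESIDUE_MASS_DA


def sphere_diameter_nm(mass_da):
    """Diameter of a compact sphere with the volume implied by the mass."""
    volume_nm3 = VOLUME_PER_DALTON_NM3 * mass_da
    radius_nm = (3 * volume_nm3 / (4 * math.pi)) ** (1 / 3)
    return 2 * radius_nm


rubisco_mass_da = 55_000
print(f"Estimated length: {length_from_mass(rubisco_mass_da):.0f} aa")
print(f"Estimated diameter: {sphere_diameter_nm(rubisco_mass_da):.1f} nm")

# Volume is linear in mass, so an 8-fold heavier protein is only 2-fold wider.
fold_wider = sphere_diameter_nm(8 * rubisco_mass_da) / sphere_diameter_nm(rubisco_mass_da)
print(f"Diameter ratio for 8x the mass: {fold_wider:.2f}")
```

Under these assumptions the Rubisco monomer comes out at about 550 residues (consistent with the "roughly 500" quoted above) and about 5 nm across, inside the 3-6 nm rule of thumb. Real proteins are not perfect spheres, so this is a lower bound on the longest dimension rather than a measured size.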

There is an often-surprising size difference between an enzyme and the substrates it works on. For example, in metabolic pathways, the substrates are metabolites which usually have a mass of less than 500 Da while the corresponding enzymes are usually about 100 times heavier. In the glycolysis pathway, small sugar molecules are processed to extract both energy and building blocks for further biosynthesis. This pathway is characterized by a host of protein machines, all of which are much larger than their sugar substrates, with examples shown in the bottom right corner of Figure 1 where we see the relative size of the substrates denoted in red when interacting with their enzymes.

Figure 3: Distribution of protein lengths in E. coli, budding yeast and human HeLa cells. (A) Protein length is calculated in amino acids (AA), based on the coding sequences in the genome. (B) Distributions are drawn after weighting each gene with the protein copy number inferred from mass spectrometry proteomic studies (M. Heinemann, in press, M9+glucose; L. M. F. de Godoy et al., Nature 455:1251, 2008, defined media; T. Geiger et al., Mol. Cell Proteomics 11:M111.014050, 2012). Continuous lines are Gaussian kernel-density estimates for the distributions, serving as a guide to the eye.

Table 1: Median length of coding sequences of proteins based on genomes of different species. The entries in this table are based upon a bioinformatic analysis by L. Brocchieri and S. Karlin, Nuc. Acids. Res., 33:3390, 2005, BNID 106444. As discussed in the text, we propose an alternative metric that weights proteins by their abundance as revealed in recent mass spec proteome-wide censuses. The results are not very different from the entries in this table, with eukaryotes being around 400 aa long on average and bacteria about 300 aa long.

Concrete values for the median gene length can be calculated from genome sequences as a bioinformatic exercise. Table 1 reports these values for various organisms, showing a trend towards longer protein coding sequences when moving from unicellular to multicellular organisms. In Figure 3 we go beyond median protein sizes to characterize the full distribution of coding sequence lengths on the genome, reporting values for three model organisms. If our goal were to learn about the spectrum of protein sizes, this definition based on the genomic length might be enough. But when we want to understand the investment in cellular resources that goes into protein synthesis, or to predict the average length of a protein randomly chosen from the cell, we advocate an alternative definition, which has become possible thanks to recent proteome-wide censuses. For these kinds of questions the most abundant proteins should be given a higher statistical weight in calculating the expected protein length. We thus calculate the weighted distribution of protein lengths shown in Figure 3, giving each protein a weight proportional to its copy number. This distribution represents the expected length of a protein randomly fished out of the cell rather than randomly fished out of the genome. The distributions that emerge from this proteome-centered approach depend on the specific growth conditions of the cell. In this book, we chose as a simple rule of thumb a length of ≈300 aa for the “typical” protein in prokaryotes and ≈400 aa in eukaryotes. The distributions in Figure 3 show this is a reasonable estimate, though it might be an overestimate in some cases.
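The difference between the genome-centered and the proteome-centered definitions can be illustrated with a small sketch. The four lengths and copy numbers below are made up purely for illustration and are not measurements; the function names are hypothetical. The weighted median returned here is the "lower" one, that is, the first length at which the cumulative weight reaches half of the total.

```python
def weighted_mean(lengths, weights):
    """Mean length, with each protein counted in proportion to its weight."""
    return sum(l * w for l, w in zip(lengths, weights)) / sum(weights)


def weighted_median(lengths, weights):
    """First length at which the cumulative weight reaches half the total."""
    half_total = sum(weights) / 2
    cumulative = 0
    for length, weight in sorted(zip(lengths, weights)):
        cumulative += weight
        if cumulative >= half_total:
            return length
    raise ValueError("lengths and weights must be non-empty")


# Hypothetical toy proteome: short proteins are far more abundant.
lengths_aa = [100, 300, 500, 1500]
copies_per_cell = [50_000, 20_000, 5_000, 100]

# Genome-centered view: every gene counts once.
equal_weights = [1] * len(lengths_aa)
genome_mean = weighted_mean(lengths_aa, equal_weights)
genome_median = weighted_median(lengths_aa, equal_weights)

# Proteome-centered view: weight each gene by its protein copy number.
proteome_mean = weighted_mean(lengths_aa, copies_per_cell)
proteome_median = weighted_median(lengths_aa, copies_per_cell)

print(f"Genome-centered:   mean {genome_mean:.0f} aa, median {genome_median} aa")
print(f"Proteome-centered: mean {proteome_mean:.0f} aa, median {proteome_median} aa")
```

In this toy case the abundance weighting pulls the typical length down, because the short proteins dominate the copy numbers. Whether and how far real proteomes shift depends on the organism and the growth conditions, as the text notes.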

One of the charms of biology is that evolution necessitates very diverse functional elements creating outliers in almost any property (which is also the reason we discussed medians and not averages above). When it comes to protein size, titin is a whopper of an exception. Titin is a multi-functional protein that behaves as a nonlinear spring in human muscles with its many domains unfolding and refolding in the presence of forces and giving muscles their elasticity. Titin is about 100 times longer than the average protein with its 33,423 aa polypeptide chain (BNID 101653). Identifying the smallest proteins in the genome is still controversial, but short ribosomal proteins of about 100 aa are common.

It is very common to use GFP tagging of proteins in order to study everything from their localization to their interactions. Armed with the knowledge of the characteristic size of a protein, we are now prepared to revisit the seemingly innocuous act of labeling a protein. GFP is 238 aa long, composed of a beta barrel within which key amino acids form the fluorescent chromophore, as discussed in the vignette on “What is the maturation time for fluorescent proteins?”. As a result, for many proteins the act of labeling should really be thought of as the creation of a protein complex that is now twice as large as the original unperturbed protein.
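To put a number on how much a GFP tag enlarges a protein, the sketch below compares the fused length with the untagged length, using the 238 aa GFP length and the ≈300 aa and ≈400 aa rules of thumb from the text. It ignores any linker between the protein and the tag, which is a simplifying assumption, and the names are illustrative.

```python
GFP_LENGTH_AA = 238  # from the text

# Rule-of-thumb lengths for a "typical" protein, from the text.
TYPICAL_LENGTH_AA = {"prokaryote": 300, "eukaryote": 400}


def fusion_fold_increase(protein_length_aa, tag_length_aa=GFP_LENGTH_AA):
    """Length of the tagged protein relative to the untagged one (no linker)."""
    return (protein_length_aa + tag_length_aa) / protein_length_aa


fold_by_kingdom = {
    kind: fusion_fold_increase(length) for kind, length in TYPICAL_LENGTH_AA.items()
}

for kind, fold in fold_by_kingdom.items():
    print(f"Typical {kind} protein: {fold:.2f}x longer with GFP attached")
```

For the typical lengths the fusion is roughly 1.6 to 1.8 times longer than the original protein, and it reaches a full doubling for any protein of GFP's own length or shorter.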

Discussion and Conclusions

Increasing evidence indicates that a wide range of proteins unrelated in sequence, native structure, and function can form biomolecular condensates (1, 2, 4, 53). These observations suggest that the droplet state may have a generic nature and be accessible to most proteins. This possibility may not be immediately evident from the data currently available because the condensation of different proteins has been reported for experimental conditions often far from physiological ones. Moreover, a full understanding of the interactions driving droplet formation has not been achieved yet, owing to a wide variety of sequence motifs associated with the droplet state.

In this work, we have exploited the observation that a large fraction of the proteins in the human proteome have favorable binding entropies by visiting an ensemble of bound states (54, 55), which is realized via disordered binding modes. We thus hypothesized that the high conformational entropy associated with nonspecific side-chain interactions contributes to the stabilization of the droplet state, and proposed a model to quantify this contribution from the protein sequence. We have shown that droplet-promoting propensities can be predicted using such a generic model, even without the explicit incorporation of specific types of interactions. The specificity of our model originates from local compositional sequence biases, which are used to estimate the entropy in the bound state (23). That is, both hydrophobic and hydrophilic motifs can selectively mediate interactions if they are embedded in an environment of opposite character, explaining how selectivity can be achieved via a wide variety of interactions and contact types. We have shown earlier that this approach is capable of describing ordered and disordered binding under cellular conditions (27).

Using these general principles, we developed the FuzDrop method to predict droplet-promoting profiles and the propensity of proteins to drive droplet formation. Applying this prediction method to different datasets of phase-separating proteins, we described two mechanisms of droplet formation: 1) the driver mechanism, which does not require additional components for phase separation and depends on the overall conformational entropy of the protein, and 2) the client mechanism, which is induced by protein interactions and depends on the presence of specific droplet-promoting regions in the sequence of the protein. Our results indicate that proteins may use the driver or the client mechanism, or a combination of the two, to form droplets.

Our proteome-wide analysis indicates that the presence of droplet-promoting regions is widespread in the sequences in the human proteome. Based on this analysis, we conclude that the droplet state is accessible, even if only transiently, for most proteins. In ∼40% of the human proteome it is predicted to occur spontaneously, whereas an approximately equal fraction may require a variety of cellular components or nonphysiological conditions. Proteins in known membraneless organelles represent a combination of these mechanisms, whereas those identified by high-throughput studies mostly represent droplet clients.

Taken together, these results indicate that the droplet state is likely to be a fundamental state of proteins, alongside the native and amyloid states.


The Viral Protein 35 (VP35), a crucial protein of the Zaire Ebolavirus (EBOV), interacts with a plethora of human proteins to cripple the human immune system. Despite its importance, the entire structure of the tetrameric assembly of EBOV VP35 and the means by which it antagonizes the autophosphorylation of the kinase domain of human protein kinase R (PKR K) is still elusive. We consult existing structural information to model a tetrameric assembly of the VP35 protein where 93% of the protein is modeled using crystal structure templates. We analyze our modeled tetrameric structure to identify interchain bonding networks and use molecular dynamics simulations and normal-mode analysis to unravel the flexibility and deformability of the different regions of the VP35 protein. We establish that the C-terminal of VP35 (VP35 C) directly interacts with PKR K to prevent it from autophosphorylation. Further, we identify three plausible VP35 C–PKR K complexes with better affinity than the PKR K dimer formed during autophosphorylation and use protein design to establish a new stretch in VP35 C that interacts with PKR K. The proposed tetrameric assembly will aid in better understanding of the VP35 protein, and the reported VP35 C–PKR K complexes along with their interacting sites will help in the shortlisting of small molecule inhibitors.


This article is part of the Proteomics in Pandemic Disease special issue.

