Propensity Score for amino acid

What is the meaning of the propensity score of an amino acid? How is it calculated?
(I have not studied biology for the last 8 years and am now going back through it because I need it for my research, so if someone could describe it in simple language that would be very helpful.)

As a simplified response: propensity scores are used in protein structure prediction (e.g. for secondary structure and interface prediction). The interface propensity described below is derived by comparing the amino acid residues found at the interface that enables interactions with other proteins against those found on the accessible surface of the protein as a whole.

The equation is as follows:

Propensity = [probability of the residue in the interface] / [probability of the residue on the surface]


probability of the residue in the interface = [number of residues of this amino acid type in the interface / total number of residues of any type in the interface]


probability of the residue on the surface = [number of residues of this amino acid type on the surface / total number of residues of any type on the surface]
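These ratios are simple to compute. A minimal sketch, using made-up counts purely for illustration:

```python
def interface_propensity(n_type_interface, n_total_interface,
                         n_type_surface, n_total_surface):
    """Propensity = P(residue type in interface) / P(residue type on surface)."""
    p_interface = n_type_interface / n_total_interface
    p_surface = n_type_surface / n_total_surface
    return p_interface / p_surface

# Toy example: a residue type making up 10% of interface residues but only
# 5% of surface residues has propensity 2.0, i.e. it is interface-enriched.
print(interface_propensity(30, 300, 50, 1000))  # 2.0
```

A propensity above 1 means the residue type is over-represented at interfaces relative to the general surface; below 1 means it is under-represented.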

Check out this PLOS paper for an example. PLoS One. 2014; 9(5): e97158.

DiscoTope 2.0 Server

DiscoTope server predicts discontinuous B cell epitopes from protein three dimensional structures. The method utilizes calculation of surface accessibility (estimated in terms of contact numbers) and a novel epitope propensity amino acid score. The final scores are calculated by combining the propensity scores of residues in spatial proximity and the contact numbers.

New in the DiscoTope version 2.0: Novel definition of the spatial neighborhood used to sum propensity scores and half-sphere exposure as a surface measure.

Note: The DiscoTope server has been updated to improve user-friendliness. The server now predicts epitopes in complexes of multiple chains. Also, DiscoTope output files are now easily downloaded and imported into spreadsheets. Furthermore, we have facilitated the visualization of prediction results.


For publication of results, please cite:

Reliable B Cell Epitope Predictions: Impacts of Method Development and Improved Benchmarking
Jens Vindahl Kringelum, Claus Lundegaard, Ole Lund, and Morten Nielsen
PLoS Computational Biology, 2012
Link to Paper


About 20-30% of all proteins encoded in a typical genome are estimated to be localized in membranes [1, 2], where protein-lipid interactions play crucial roles in the conformational stability and biological functions of membrane proteins. Many experimental studies have suggested that physico-chemical properties of the membrane lipid bilayer influence the stability and function of membrane proteins. The thermal [3, 4] and chemical [5] stability of the potassium channel KcsA has been shown to vary according to the lipid composition of the membrane bilayer. It has also been shown that the lipid composition affects protein functions including: ion transport in KcsA [6, 7] and the Ca2+-ATPase of sarcoplasmic reticulum [8, 9], phosphorylation by the diacylglycerol kinase [10] and chemical compound transport by the mechanosensitive channel of large conductance MscL [11]. To complement these experimental studies, statistical analyses have been carried out to reveal amino acid preferences and conservation patterns within the lipid bilayer environment [12–16] using available sequence and structural data. The patterns emerging from these statistical analyses should reflect implicitly the effects of lipid molecules on the structural formation and stability of membrane proteins. However, few of the previous computational studies have taken into account the atomic details of protein-lipid interactions explicitly. A notable exception is all-atom molecular dynamics (MD) simulation: it has become possible to apply this technique to membrane proteins in conditions mimicking biological membranes (reviewed recently by Khalili-Araghi and co-authors [17]). All-atom MD simulations enable us to inspect protein-lipid interactions in atomic detail [18, 19] and can reveal the role of lipids in protein function [20], albeit for a small selection of specific lipid and protein molecules.

In this paper, we attempt to understand the nature of protein-lipid interactions using a computational approach. Given the limited number of crystal structures containing lipid molecules, we decided to combine all known biological phospholipids together and classify the atomic interactions into those involving the "head" and "tail" parts of the lipids. The head and tail groups can be found in most phospholipids constituting a biological membrane and define one of the most essential chemical features of these molecules. Thus, we ask more specifically: "How are the head and tail portions of lipid molecules recognized by amino acid residues in membrane proteins?"

To answer this question, we utilized two available data sources, crystal structures and MD trajectories. Using the crystal structure data, we can include and examine various kinds of proteins and lipids, although the number of lipid molecules observed in each solved structure is limited. Using the MD data, we can obtain detailed information about all the lipid molecules surrounding a protein, although such an analysis is possible only for a small set of protein and lipid types. The combination of these two data sources allows us to assess the biases resulting from a limited variety of data in each data source. The results revealed a common pattern of lipid tail-amino acid interactions observed in both the crystal structures and MD trajectories. We show that the recognition of lipid tails can be explained largely by general lipophilicity and that this effect dominates in the two different situations represented by the crystal structure and MD datasets. In contrast, lipid head groups showed a more complicated and diverse pattern and we discuss how our observations can be related to known experimental data and previously proposed concepts concerning protein-lipid interactions.


Knowledge of three-dimensional protein structures is crucial when investigating protein functions, and such structural knowledge is considered important when designing drugs that target those functions [1]. In general, X-ray crystallography and nuclear magnetic resonance spectroscopy are the methods commonly used to determine protein structures; approximately 80% of the protein structures in the Protein Data Bank (PDB) were obtained by X-ray crystallography [2]. However, both approaches involve very complex, time-consuming, laborious and expensive processes. Because of the difficulties in determining crystal structures, the current protocol yields only a 30% success rate [3]. Thus, many researchers take advantage of computational approaches to directly predict protein crystallization.

Canaves et al. [4] and Goh et al. [5] have proposed methods for extracting informative features to predict protein crystallization. Many sequence-based computational methods, including OB-Score [6], SECRET [7], CRYSTALP [8], XtalPred [9], ParCrys [10], CRYSTALP2 [11], SVMCRYS [12], PPCpred [13] and RFCRYS [14], predict protein crystallization, as shown in Table 1. Both the support vector machine (SVM) [7], [12], [13] and the ensemble mechanism [13], [14] are well-known techniques for enhancing prediction accuracy. Because of the different design aims and benchmarks used, it is not easy to assess which method and features are the most effective. From the study in [14] and Table 1, we can see that the SVM_POLY method (see the work [13]) using SVM has the highest accuracy among the non-ensemble methods. This method is one of the four SVM predictors that are integrated into PPCpred [13]. The state-of-the-art ensemble methods PPCpred and RFCRYS have high prediction accuracies using the SVM and Random Forest classifiers, respectively. PPCpred utilizes a comprehensive set of inputs that are based on energy and hydrophobicity indices, the composition of certain amino acid types, predicted disorder, secondary structure, solvent accessibility, and the content of certain buried and exposed residues [13]. RFCRYS predicts protein crystallization by utilizing the mono-, di- and tri-peptide compositions; the frequencies of amino acids in different physicochemical groups; the isoelectric point; the molecular weight; and the length of the protein sequences [14]. However, the mechanism of these two ensemble classifiers suffers from low interpretability for biologists: it is not clear which sequence features provide the essential contribution to the high prediction accuracy.

Rather than increasing both the complexity of prediction methods and the number of feature types in pursuit of high accuracy, the motivation of this study is to provide a simple and highly interpretable method with comparable accuracy from the viewpoint of biologists. The p-collocated AA pairs (p = 0 for a dipeptide) have been shown to be significant in influencing or enhancing protein crystallization because of the impact of folding corresponding to the interaction between local AA pairs [8], [11]. The p-collocated AA pairs thus provide additional information, reflecting the interaction between local AA pairs, beyond the simple AA composition. This study proposes an ensemble method, SCMCRYS, to predict protein crystallization, in which each classifier is built by using a scoring card method (SCM) [15] that estimates the propensity scores of p-collocated AA pairs to be crystallizable. Compared to the SCM using dipeptide composition in [15], the ensemble classifier of SCMCRYS makes the best use of p-collocated AA pairs. The rules for deciding whether a protein is crystallizable are very simple: a weighted-sum score in the SCM classifier, and a voting method over a number of SCM classifiers in SCMCRYS. Despite this simplicity, the experimental results show that the SCM classifier is comparable to SVM_POLY and to SVM-based classifiers using p-collocated AA pairs, and the SCMCRYS method is comparable to the state-of-the-art ensemble methods PPCpred and RFCRYS.
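The SCM weighted-sum decision rule described above can be sketched in a few lines. The propensity values below are made up for illustration; in SCMCRYS the real scores are estimated from training data with a statistical optimization step:

```python
from collections import Counter

def scm_score(sequence, propensity):
    """SCM weighted-sum score: dipeptide composition dotted with propensity scores."""
    dipeptides = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    counts = Counter(dipeptides)
    total = len(dipeptides)
    return sum((n / total) * propensity.get(dp, 0.0) for dp, n in counts.items())

def scm_classify(sequence, propensity, threshold):
    """A sequence is predicted crystallizable if its score reaches the threshold."""
    return "CRYS" if scm_score(sequence, propensity) >= threshold else "NCRYS"

# Hypothetical propensity scores for two dipeptides only (illustration).
toy_propensity = {"AG": 800.0, "GA": 200.0}
print(scm_score("AGA", toy_propensity))          # 0.5*800 + 0.5*200 = 500.0
print(scm_classify("AGA", toy_propensity, 400))  # CRYS
```

The appeal of this rule is that each dipeptide's contribution to the final score is directly visible, which is the interpretability argument made in the text.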

The propensity scores of dipeptides and amino acids to be crystallizable are highly correlated with the crystallization ability of sequences and can provide insights into protein crystallization. Furthermore, the propensity scores of amino acids can also reveal the relationship between crystallizability and physicochemical properties such as solubility, molecular weight, melting point and conformational entropy of amino acids. This study also proposes a mutagenesis analysis method to illustrate an additional advantage of SCM. We investigate mutagenesis analysis for enhancing protein crystallizability based on the estimated crystallizability scores, solubility scores [15], and physicochemical properties of amino acids. The analysis supports the hypothesis that mutagenesis of the surface residues Ala and Cys has large and small probabilities, respectively, of enhancing protein crystallizability when applying protein engineering approaches.

SCMCRYS: Predicting Protein Crystallization Using an Ensemble Scoring Card Method with Estimating Propensity Scores of P-Collocated Amino Acid Pairs

Existing methods for predicting protein crystallization obtain high accuracy using various types of complemented features and complex ensemble classifiers, such as support vector machine (SVM) and Random Forest classifiers. It is desirable to develop a simple and easily interpretable prediction method with informative sequence features to provide insights into protein crystallization. This study proposes an ensemble method, SCMCRYS, to predict protein crystallization, for which each classifier is built by using a scoring card method (SCM) that estimates the propensity scores of p-collocated amino acid (AA) pairs (p = 0 for a dipeptide). The SCM classifier determines the crystallization of a sequence according to a weighted-sum score. The weights are the composition of the p-collocated AA pairs, and the propensity scores of these AA pairs are estimated using a statistical method with an optimization approach. SCMCRYS predicts crystallization using a simple voting method over a number of SCM classifiers. The experimental results show that the single SCM classifier utilizing dipeptide composition, with an accuracy of 73.90%, is comparable to the best previously developed SVM-based classifier, SVM_POLY (74.6%), and to our proposed SVM-based classifier utilizing the same dipeptide composition (77.55%). The SCMCRYS method, with an accuracy of 76.1%, is comparable to the state-of-the-art ensemble methods PPCpred (76.8%) and RFCRYS (80.0%), which used the SVM and Random Forest classifiers, respectively. This study also investigates mutagenesis analysis based on SCM, and the result supports the hypothesis that the mutagenesis of the surface residues Ala and Cys has large and small probabilities, respectively, of enhancing protein crystallizability, considering the estimated scores of crystallizability and solubility, as well as the melting point, molecular weight and conformational entropy of amino acids in a generalized condition.
The propensity scores of amino acids and dipeptides for estimating the protein crystallizability can aid biologists in designing mutation of surface residues to enhance protein crystallizability. The source code of SCMCRYS is available at


Anticancer peptides (ACPs) are small peptides exerting selective and toxic properties toward cancer cells. Owing to their inherent high penetration, high selectivity and ease of modification, synthetic peptide-based drugs and vaccines 1 – 3 represent a promising class of therapeutic agents. Designed ACPs can improve affinity, selectivity and stability for enhancing cancer cell elimination. The influence of amino acid residues on the anticancer activity of ACPs depends on cationic, hydrophobic and amphiphilic properties, with helical structure driving cell permeability. In particular, cationic amino acid residues (i.e., lysine, arginine, and histidine) can disrupt and penetrate the cancer cell membrane to induce cytotoxicity, whereas anionic amino acids (i.e., glutamic and aspartic acids) afford antiproliferative activity against cancer cells. Furthermore, hydrophobic amino acid residues (i.e., phenylalanine, tryptophan, and tyrosine) exert their effect on cancer cytotoxic activity 1 , 4 , 5 . Moreover, the secondary structure of ACPs, which is formed by cationic and hydrophobic amino acids, plays a crucial role in the peptide-cancer cell membrane interaction that inherently leads to cancer cell disruption and death 1 , 6 . Therefore, it is desirable to develop a simple, interpretable and efficient predictor for achieving accurate ACP identification as well as facilitating the rational design of new anticancer peptides with promising clinical applications.

In the past few years, most existing methods were developed using machine learning (ML) and statistical methods applied to peptide sequence information for discriminating ACPs from non-ACPs 7 – 23 . More details of these existing methods are summarized in two comprehensive review papers 2 , 3 . Amongst the various types of ML approaches, both the support vector machine (SVM) (i.e. AntiCP 8 , Hajisharifi et al.’s method 9 , ACPP 24 , iACP 10 , Li and Wang’s method 11 , iACP-GAEnsC 12 , TargetACP 14 and ACPred 19 ) and the ensemble approach (i.e. MLACP 13 , ACPred 19 , PTPD 21 , ACP-DL 22 , PEPred-Suite 20 , ACPred-FL 15 , ACPred-Fuse 18 , PPTPP 23 and AntiCP_2.0 25 ) were widely used to develop ACP predictors. As summarized in a recent review 2 , TargetACP was developed by integrating split amino acid composition and pseudo position-specific scoring matrix descriptors 14 , and was shown to outperform SVM-based predictors 8 – 12 , 19 , 24 . Meanwhile, the state-of-the-art ensemble methods, comprising PEPred-Suite 20 and ACPred-Fuse 18 , provided the highest prediction accuracies as evaluated on the dataset collected by Rao et al. 18 . ACPred-Fuse was developed using a random forest (RF) model in conjunction with 114 feature descriptors: a total of 114 RF models were trained to generate the class and probabilistic information used for developing a final model. Most recently, Agrawal et al. proposed an updated version of AntiCP called AntiCP2.0 and also provided two high-quality benchmark datasets (i.e. the main and alternative datasets) having the largest number of peptides. AntiCP2.0 was developed with the extremely randomized trees (ETree) algorithm using amino acid composition (AAC) and dipeptide composition (DPC). On the basis of the independent test results reported in the AntiCP2.0 work, AntiCP2.0 was superior to other existing ACP predictors (e.g.
AntiCP 8 , iACP 10 , ACPred 19 , ACPred-FL 15 , ACPred-Fuse 18 , PEPred-Suite 20 ). All in all, much progress has been achieved by existing methods. Nevertheless, two potential drawbacks of existing ACP predictors motivated us to develop a new ACP predictor in this study. First, their mechanisms are not easily interpreted or implemented from the viewpoint of biologists and biochemists: existing ACP models do not provide a straightforward explanation of the underlying mechanism of the biological activity that constitutes ACPs, whereas a simple and easily interpretable model is more useful for further analysis of the characteristics of the anticancer activities of peptides. Second, their accuracy and generalizability still require improvement.

In consideration of these problems, we propose herein the development of a novel ML-based predictor called iACP-FSCM for further improving the prediction accuracy as well as shedding light on the characteristics governing the anticancer activities of peptides. The conceptual framework of the iACP-FSCM approach proposed herein for predicting and analyzing ACPs is summarized in Fig. 1 . The major contributions of iACP-FSCM for predicting and characterizing ACPs can be summarized as follows. Firstly, we propose a novel, flexible scoring card method (FSCM) for the effective and simple prediction and characterization of peptides affording anticancer activity using only sequence information. The FSCM method is an updated version of the SCM method developed by Huang et al. 26 and Charoenkwan et al. 27 , making use of propensity scores of both local and global sequential information. Secondly, unlike the rather complex classification mechanisms afforded by state-of-the-art ensemble approaches 15 , 18 , 20 , the iACP-FSCM method identifies ACPs using only weighted-sum scores between the composition and propensity scores, which is easily understood and implemented by biologists and biochemists. Thirdly, the FSCM-derived propensity scores can be adopted to identify informative physicochemical properties (PCPs) that may provide crucial information pertaining to the local and global properties of ACPs. Finally, comparative results revealed that iACP-FSCM outperformed state-of-the-art ACP predictors for ACP identification and characterization. The iACP-FSCM webserver presented herein has been demonstrated to be robust, as deduced from its superior prediction accuracy, interpretability and public availability, which is instrumental in helping biologists identify ACPs with potential bioactivities.
Furthermore, the proposed FSCM method has great potential for estimating the propensity scores of amino acids and dipeptides that can be used to predict and analyze various bioactivities of peptides such as haemolytic peptides 28 , antihypertensive peptides 29 and antiviral peptides 20 , 23 .


The scoring function for interface-residue identification

1. Side chain energy score

2. Residue conservation score

3. Residue interface propensity

PINUP algorithm for predicting the interface residues

The PINUP algorithm is as follows:

Identification of surface residues. As in a previous study (38), surface residues are defined as those side chains with a relative accessibility of >6% (probe radius = 1.2 Å).

Identification of candidate binding surface patches. A surface patch is defined as a central surface residue and its 19 nearest neighbors, as in a previous study (38). The score of a patch is given by the average value of the scores for all 20 residues, using the above-described scoring function. All of the surface residues are sampled. Solvent vector constraints (32) are applied in order to avoid patches sampling different sides of a protein surface. The top 5% of scored patches are selected. If the number of surface residues for a protein is less than 100, the five top-scored patches are selected instead.

Locating candidate interface residues. Typically, the selected patches overlap with each other; that is, one residue can appear in multiple patches. We rank residues based on the number of top-scored patches to which they belong (the appearance rate in top-scored patches). The top-ranked 15 residues are designated as candidate interface residues. For large proteins with more than 150 surface residues, we retain up to 10% of the total surface residues. If the last candidate residue (e.g. the 15th residue for proteins with fewer than 150 surface residues) has the same appearance rate in the top-scored patches as several other non-candidate residues, all of them are included among the candidate interface residues.

Prediction of a continuous binding interface. The final predicted interface is defined as the largest continuous patch made of the ‘interacting’ candidate interface residues. Two residues are considered interacting if the distance between any two of their respective side-chain atoms is less than 1 Å plus the sum of the van der Waals radii of the two atoms. If a surface residue is surrounded by predicted interface residues and does not interact with other surface residues, that residue is also included as an interface residue. The van der Waals radii for all atom types are from the CHARMM21 parameter set (42).
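The distance criterion and the "largest continuous patch" step can be sketched as follows. The coordinates and van der Waals radii below are illustrative placeholders; a real implementation would read them from the structure file:

```python
import math

def atoms_interact(p1, p2, r1, r2, tol=1.0):
    """True if the inter-atomic distance is < tol (1 A) + sum of vdW radii."""
    return math.dist(p1, p2) < tol + r1 + r2

def residues_interact(atoms1, atoms2):
    """Residues interact if any pair of their side-chain atoms interacts."""
    return any(atoms_interact(p1, p2, r1, r2)
               for p1, r1 in atoms1 for p2, r2 in atoms2)

def largest_continuous_patch(candidates, atoms_of):
    """Largest connected component of candidates under the interaction relation."""
    best, unseen = set(), set(candidates)
    while unseen:
        seed = unseen.pop()
        component, queue = {seed}, [seed]
        while queue:
            current = queue.pop()
            for other in list(unseen):
                if residues_interact(atoms_of[current], atoms_of[other]):
                    unseen.discard(other)
                    component.add(other)
                    queue.append(other)
        if len(component) > len(best):
            best = component
    return best

# Illustrative side-chain atoms: list of (coordinates, vdW radius in A) per residue.
atoms_of = {
    "A": [((0.0, 0.0, 0.0), 1.7)],
    "B": [((3.0, 0.0, 0.0), 1.7)],   # 3.0 < 1.0 + 1.7 + 1.7, so B interacts with A
    "C": [((20.0, 0.0, 0.0), 1.7)],  # too far from both
}
print(sorted(largest_continuous_patch(["A", "B", "C"], atoms_of)))  # ['A', 'B']
```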

There are several parameters, such as the definition of surface residues [Step (1)] and the size of surface patches [Step (2)] in this PINUP algorithm. Effects from varying these parameters are discussed in the Results section.

Protein datasets

We use a set of 57 non-homologous proteins collected by Neuvirth et al. (10) for training and cross validation. In this set, antibodies and antigens are not included, since their specific binding mode is optimized through rapid somatic cell mutations rather than evolution over many years. Our algorithm relies on conservation information and, thus, is not suitable for predicting antigen–antibody interfaces. The structures of the unbound monomers and complexes are obtained from the PDB (43). The program REDUCE (44) is used to add hydrogen atoms to all proteins; non-polar hydrogen atoms and all water molecules are deleted. The binding sites are predicted with the unbound structures. The complex structures are used to define the experimental interface residues for the unbound monomers. A surface residue is considered an interface residue if its accessible surface area decreases by more than 1 Å² upon complexation.
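This interface definition is straightforward to apply once per-residue accessible surface areas (ASA) have been computed for the unbound and complexed forms. The ASA values and residue labels below are made up for illustration:

```python
def interface_residues(asa_unbound, asa_complex, threshold=1.0):
    """Residues whose accessible surface area drops by more than
    `threshold` A^2 upon complexation (1 A^2, as in the text)."""
    return sorted(res for res, asa in asa_unbound.items()
                  if asa - asa_complex.get(res, 0.0) > threshold)

# Toy per-residue ASA values (A^2) for an unbound monomer and the complex.
asa_unbound = {"LYS10": 52.0, "GLY11": 8.0, "ASP12": 30.0}
asa_complex = {"LYS10": 40.0, "GLY11": 7.5, "ASP12": 5.0}
print(interface_residues(asa_unbound, asa_complex))  # ['ASP12', 'LYS10']
```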

To further test PINUP, we use the protein–protein docking benchmark 2.0 established by Chen et al. (45). This benchmark contains 62 protein complexes (excluding antigen–antibody complexes), in which 68 unbound proteins can be considered an independent test set because they share <35% sequence identity with any protein in the 57-protein dataset described above. This 68-protein set contains 42, 18 and 8 proteins with minor, medium and large-scale conformational changes upon complexation, respectively.

There is a significant homologous relation between the 75 proteins used for deriving interface propensity and the 57 proteins used for cross validation. We test the dependence of prediction accuracy on the dataset used for deriving interface propensity and find that the dependence is essentially negligible. Details can be found in the Results section.

Assessment of prediction accuracy

Prediction accuracy is assessed by the coverage of the actual interface by the predicted interface, which is the fraction of correctly predicted interface residues in the total number of observed interface residues, and the accuracy of the predicted interface, which is the fraction of correctly predicted interface residues in the total number of predicted interface residues. The expected accuracy from random prediction is the fraction of observed interface residues in the total number of surface residues.
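The two assessment measures reduce to simple set arithmetic over residue identifiers. A minimal sketch with made-up residue sets:

```python
def coverage_and_accuracy(predicted, observed):
    """Coverage = correctly predicted / observed interface residues;
    accuracy = correctly predicted / predicted interface residues."""
    predicted, observed = set(predicted), set(observed)
    correct = predicted & observed
    return len(correct) / len(observed), len(correct) / len(predicted)

# Toy example: residues 2 and 3 are correctly predicted.
coverage, accuracy = coverage_and_accuracy({1, 2, 3, 4}, {2, 3, 5})
print(round(coverage, 3), round(accuracy, 3))  # 0.667 0.5
```

In information-retrieval terms, coverage is recall and accuracy is precision over interface residues.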

Optimizing the weights

We use a simple grid method to optimize the weights wc and wp. An initial scan suggests the optimal values lie within 0 < wc < 2 and 1 < wp < 10. The final weights are obtained by a grid search within 0 < wc < 2 with a step of 0.2 and 1 < wp < 10 with a step of 1. The parameters are optimized for the highest accuracy.
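The grid search itself is simple to reproduce. The objective function below is a made-up stand-in for the cross-validated prediction accuracy, used only to show the mechanics:

```python
def grid_search(score_fn, wc_grid, wp_grid):
    """Return the (wc, wp) pair on the grid that maximizes score_fn."""
    return max(((wc, wp) for wc in wc_grid for wp in wp_grid),
               key=lambda w: score_fn(*w))

# Grids from the text: 0 < wc < 2 (step 0.2) and 1 < wp < 10 (step 1).
wc_grid = [round(0.2 * i, 1) for i in range(1, 10)]  # 0.2 ... 1.8
wp_grid = list(range(2, 10))                         # 2 ... 9

# Stand-in objective peaked at wc = 1.0, wp = 5 (illustration only).
best = grid_search(lambda wc, wp: -(wc - 1.0) ** 2 - (wp - 5) ** 2,
                   wc_grid, wp_grid)
print(best)  # (1.0, 5)
```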


Whilst randomised controlled trials (RCTs) are the gold standard for evaluating treatment effects, they are often infeasible due to time, cost or ethical constraints. In such situations, observational data may provide valuable information. Unfortunately, observational data analyses are subject to confounding bias. This occurs when patient characteristics that influence the outcome have unbalanced distributions across treatment groups. Any differences observed in the outcome between treatment groups may be partly due to the differences in patient characteristics.

Traditionally, multivariable regression is used to account for the differences in patient characteristics between treatment groups. However, this approach is not always suitable. For example, when the study outcome is binary, a rule of thumb suggests that 10 events should be observed per covariate included in the regression model [1]. This could be infeasible if the outcome is rare and there are many covariates to adjust for. Propensity scores provide a potential solution to this problem. Rosenbaum and Rubin [2] first introduced the propensity score, defined as the probability of treatment assignment conditional on baseline characteristics. Additionally, they demonstrated that conditioning on the propensity score will balance the distribution of characteristics between treatment groups, reducing the chance of confounding bias. Propensity scores are useful for situations with rare binary outcomes because adjusting for the propensity score only is sufficient to improve balance on the measured covariates. They are also useful in situations where the relationship between covariates and treatment is better understood than the relationship between covariates and outcome, since treatment is modelled rather than outcome. Additionally, comparing propensity score distributions between treatment groups can help identify areas of non-overlap in covariate distributions, which are often overlooked when using traditional regression methods [3]. However, it is important to note that propensity scores cannot account for unmeasured confounding: balance will only be improved on covariates used to estimate the propensity score.

Most commonly, propensity scores are estimated using logistic regression. Treatment assignment is regressed on baseline characteristics and the predicted probabilities are the estimated propensity scores. Assuming no unmeasured confounding and no misspecification of the propensity score model, unbiased estimates of treatment effects can be obtained using one of four techniques: matching, stratification, weighting or covariate adjustment. We briefly describe these techniques here, but readers are referred elsewhere for more details [2, 4,5,6,7,8,9]. Matching involves forming matched sets of treated and control patients, on the basis of having similar propensity scores. Stratification involves dividing patients into equally sized strata based on their propensity score and weighting involves assigning propensity-based weights to each patient. Estimated treatment effects can then be obtained by comparing outcomes in the matched set, within strata (an overall estimate can be obtained by pooling the strata-specific estimates) or in the weighted sample. Finally, covariate adjustment is implemented by including the propensity score as a covariate when regressing outcome on treatment. Each of these techniques aim to balance patient characteristics between treatment groups, but misspecification of the propensity score model could prevent achieving adequate balance, thereby leading to residual confounding bias. Hence, an essential step of propensity score implementation is using appropriate diagnostics to assess the propensity score and ensure that it has adequately reduced confounding bias. Many authors [10,11,12,13,14,15,16,17] have made recommendations regarding appropriate use of diagnostics. More specifically, they recommended against the use of hypothesis tests comparing covariate means or proportions and advocated using standardised differences.
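As an example of the recommended diagnostic, the standardised difference for a continuous covariate compares group means in pooled standard-deviation units; values near zero indicate good balance. The covariate values below are made up:

```python
import math

def standardized_difference(treated, control):
    """Standardised mean difference of a covariate between treatment groups,
    using the pooled standard deviation (a common balance diagnostic)."""
    def mean(xs):
        return sum(xs) / len(xs)
    def variance(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

# Identical distributions are perfectly balanced (difference 0); shifting one
# group's mean by one standard deviation gives a standardised difference of 1.
print(standardized_difference([1, 2, 3], [1, 2, 3]))  # 0.0
print(standardized_difference([2, 3, 4], [1, 2, 3]))  # 1.0
```

Unlike a hypothesis test, this quantity does not depend on sample size, which is one reason the cited guidance prefers it for balance checking.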

Despite their introduction in 1983, propensity scores were not commonly applied in the medical literature until around 20 years later. More recently, they have become increasingly popular [10]: over the decade 2007–2017, the number of articles returned from searching ‘propensity scores’ in PubMed more than tripled over each 5-year period. Following this increase in the use of propensity scores, a number of reviews [10, 11, 18,19,20,21,22,23,24,25] assessing their implementation were published. Regrettably, each review found that propensity score implementation was suboptimal, particularly regarding the use of diagnostics. Many authors did not report the use of any propensity score diagnostic, and those who did were often using hypothesis tests, which are widely discouraged. If appropriate diagnostics are not used to demonstrate the balance of potential confounders achieved by the propensity score, readers of the research have no basis for trusting the results. Of the existing reviews of the propensity score literature, only three [11, 19, 21] consider articles from all areas of medicine, and these collectively include articles published up to 2012. Since 2012, there have been numerous publications providing guidance on the use of propensity score diagnostics [10,11,12, 14,15,16,17], or proposing new propensity score diagnostics [26,27,28,29]. Considering these recent developments in methodology and guidance on practice, the use of propensity score diagnostics in recent medical studies may have improved. Therefore, the aim of this review is to update the literature on diagnostic use, with a focus on high-ranking journals. Such journals could be considered more influential as they are often looked to as a beacon of best practice. Furthermore, it may be beneficial to know which types of studies are more or less likely to report the use of suboptimal diagnostics.
This information could help us to identify pockets of good practice and areas where efforts to change practice should be focused. Bearing this in mind, the objectives of this review are to: (1) assess the use of propensity score diagnostics in medical studies published in high-ranking journals and (2) compare use of diagnostics between studies (a) in different research areas and (b) using different propensity score methods.

Materials and methods

Collection of annotations of crystallization trials

We only extracted X-ray crystallography-based experimental trials annotated with the most advanced experimental statuses. These statuses include ‘selected’, ‘cloned’, ‘expressed’, ‘soluble’, ‘purified’, ‘crystallized’, ‘diffraction’, ‘crystal structure’ or ‘in PDB’. We grouped the proteins with the status of ‘crystal structure’ or ‘in PDB’ as crystallizable proteins (defined as the ‘CRYS’ class), and grouped those with other statuses as non-crystallizable proteins (defined as the ‘NCRYS’ class).
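The grouping rule above amounts to a two-class labelling over the status vocabulary, which can be expressed directly:

```python
# Status vocabulary from the annotation pipeline described above.
CRYS_STATUSES = {"crystal structure", "in PDB"}
ALL_STATUSES = {"selected", "cloned", "expressed", "soluble", "purified",
                "crystallized", "diffraction"} | CRYS_STATUSES

def label_protein(status):
    """Map an experimental status annotation to the CRYS / NCRYS class."""
    if status not in ALL_STATUSES:
        raise ValueError(f"unknown status: {status}")
    return "CRYS" if status in CRYS_STATUSES else "NCRYS"

print(label_protein("in PDB"))   # CRYS
print(label_protein("soluble"))  # NCRYS
```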

We only selected the experimental trials annotated with one of two terminal outcomes: ‘work stopped’, or ‘in PDB’/‘crystal structure’.

We only extracted experimental trials dated between 1 January 2009 and 31 December 2014. This ensured that we extracted only recent data and excluded trials that are potentially still ongoing at present.

We eliminated non-crystallizable proteins sharing 100% sequence identity with crystallizable proteins. The sequence identity was quantified by the CD-Hit program [49].
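The text performs this filtering with CD-Hit; at a 100% identity threshold, its effect on full-length sequences can be sketched as exact-duplicate removal (a simplification that ignores CD-Hit's handling of substring matches):

```python
def remove_exact_overlaps(ncrys_seqs, crys_seqs):
    """Drop non-crystallizable sequences identical to any crystallizable one.

    Sketch of the 100%-identity filtering step; the real pipeline uses
    CD-Hit, which also treats a perfect substring as a full-identity hit.
    """
    crys_set = set(crys_seqs)
    return [s for s in ncrys_seqs if s not in crys_set]
```
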

The constructed TTdata includes 81 279 non-crystallizable proteins and 103 247 crystallizable proteins.

Collection of functional annotations

We retrieved functional annotations of the proteins from UniProt, which included 549 008 proteins from the Swiss-Prot database and 50 011 027 proteins from the TrEMBL database (as of 14 July 2015). Swiss-Prot is a collection of entries that are reviewed and manually annotated using a literature search and curator-evaluated computational analysis. TrEMBL is not reviewed; its proteins are annotated computationally. We mapped the proteins in TTdata to both Swiss-Prot and TrEMBL via one-by-one matching of sequences sharing 100% sequence identity. In total, 5849 crystallizable proteins (positive samples) and 4907 non-crystallizable proteins (negative samples) were mapped to the Swiss-Prot database, constituting the Swiss-Prot data set. Additionally, 8491 crystallizable (positive samples) and 21 426 non-crystallizable (negative samples) proteins were mapped to the TrEMBL database, comprising the TrEMBL data set.
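The one-by-one mapping at 100% sequence identity amounts to a dictionary lookup keyed on the sequence; the data structures below are illustrative assumptions, not the paper's actual record format:

```python
def map_by_identity(ttdata, uniprot):
    """Map TTdata proteins to UniProt entries via exact sequence matches.

    ttdata:  {protein_id: sequence}   (assumed format, for illustration)
    uniprot: {accession: sequence}
    Returns {protein_id: accession} for sequences with 100% identity.
    """
    # Invert UniProt so a sequence can be looked up in O(1)
    seq_to_acc = {seq: acc for acc, seq in uniprot.items()}
    return {pid: seq_to_acc[seq]
            for pid, seq in ttdata.items()
            if seq in seq_to_acc}
```
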

Training and benchmark test data sets

We eliminated sequence redundancy (proteins with >25% sequence identity) within the crystallizable proteins contained in either Swiss-Prot or TrEMBL, and likewise within the non-crystallizable proteins contained in each data set. The sequence identity was quantified by using a combination of CD-Hit [49] and BLAST [44]. Eliminating sequence redundancy within each data set was based on the observation that proteins with similar sequences can possess distinct crystallization propensities (CPs) [2]. In total, the Swiss-Prot data set contains 2798 crystallizable and 3096 non-crystallizable proteins (denoted as the ‘SP’ data set), while the TrEMBL data set contains 4994 crystallizable and 9794 non-crystallizable proteins (denoted as the ‘TR’ data set).
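A greedy redundancy-reduction loop of the kind implemented by CD-Hit can be sketched as below; the `identity` callable stands in for the CD-Hit/BLAST alignment step and is supplied by the caller, so this is a structural sketch rather than the actual tool:

```python
def remove_redundancy(seqs, identity, threshold=0.25):
    """Greedy redundancy reduction (sketch of the CD-Hit/BLAST step).

    `identity(a, b)` is assumed to return pairwise sequence identity in
    [0, 1]; a sequence sharing more than `threshold` identity with any
    already-kept sequence is discarded.
    """
    kept = []
    for s in seqs:
        if all(identity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept
```
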

Either the SP data set or the TR data set was randomly divided into six equally sized subsets. The first five subsets were merged to form the training data set (denoted ‘SP_train’ or ‘TR_train’), while the remaining sixth subset served as the independent test data set (denoted ‘SP_test’ or ‘TR_test’).
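The 5:1 split described above can be sketched as follows; the function name and the seeded RNG are illustrative assumptions (the paper does not state how the random division was implemented):

```python
import random

def split_train_test(proteins, n_subsets=6, seed=0):
    """Randomly divide proteins into six subsets; the first five form the
    training set and the sixth is the independent test set, as in the text.
    Any remainder after integer division falls into the test subset.
    """
    shuffled = list(proteins)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    size = len(shuffled) // n_subsets
    train = shuffled[: (n_subsets - 1) * size]
    test = shuffled[(n_subsets - 1) * size:]
    return train, test
```
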

We further eliminated the proteins sharing >25% sequence identity with those used in other predictors. The resulting four data sets were named ‘SP_train_nr’, ‘SP_test_nr’, ‘TR_train_nr’ and ‘TR_test_nr’, respectively. These data sets can be downloaded from

To examine whether the functional features of similar proteins can be used to predict CP, we mapped TTdata-derived sequences to Swiss-Prot and TrEMBL data sets via one-by-one matching of sequences sharing >90% sequence identity. The resultant data sets were named ‘SP0.9’ and ‘TR0.9’, respectively. Hence, each protein in SP0.9 or TR0.9 is associated with one or more orthologous proteins in the Swiss-Prot data set or the TrEMBL data set.


Abnormal bitterness might be associated with dietary danger. In general, hydrolyzed proteins, plant-derived alkaloids and toxins exhibit an unpleasant bitter taste. Thus, bitter taste perception plays a crucial role in protecting animals from poisonous plants and environmental toxins [1]. The taste perception of humans can be categorized into four well-known groups: sweet, bitter, sour and salty, in addition to two controversial groups, i.e. fat taste and amino acid taste [2]. Although abnormal or extreme bitterness tends to be associated with dietary danger, a number of diverse plant-derived foods produce bitterness, such as cucumber, pumpkin, zucchini, squash, lettuce, spinach and kale. In addition, many bitter compounds are important drugs or drug candidates, encompassing ions, alkaloids, polyphenols, glucosinolates and peptides. Proteolytic hydrolysis of peptides and proteins has been known to make foods unfavorable [3,4]. In this process, caseins are digested into peptides containing bulky hydrophobic groups at their C-terminal region [3]. Hence, the bitterness can be attributed to the hydrophobic property of the amino acid side chain at the C-terminus. The successful identification and characterization of bitter peptides is essential for drug development and nutritional research.
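The C-terminal hydrophobicity argument can be made concrete with the Kyte–Doolittle hydropathy scale (Kyte & Doolittle, 1982); note this is a crude heuristic for illustration, not a validated bitterness predictor:

```python
# Kyte-Doolittle hydropathy scale: positive values = hydrophobic residues
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def c_terminal_hydropathy(peptide):
    """Hydropathy of the C-terminal residue of a one-letter-code peptide.

    Positive values indicate the bulky hydrophobic C-termini that the
    text links to the bitterness of casein-derived peptides.
    """
    return KD[peptide[-1]]
```
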

High-throughput experimental approaches for identifying bitter peptides are time-consuming and costly; thus, the development of accurate and fast computational methods is in great demand. In particular, such computational approaches are based on quantitative structure–activity relationship (QSAR) modeling. QSAR is a ligand-based approach that seeks to discern the mathematical relationship between various types of descriptors (X) and the investigated biological activity (Y) through the use of machine learning (ML) models [5]. As mentioned in the Organization for Economic Co-operation and Development (OECD) guideline [[6], [7], [8]], the development of robust QSAR models entails the following characteristics: (i) a defined endpoint; (ii) an unambiguous algorithm; (iii) a defined domain of applicability; (iv) appropriate measures of goodness-of-fit, robustness and predictive ability; and (v) a mechanistic interpretation.
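As a concrete example of a descriptor set X, amino acid composition (AAC) is one of the simplest peptide descriptors used in QSAR modeling; the text does not specify which descriptors are used, so this choice is an illustrative assumption:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, fixed order

def aac_descriptor(peptide):
    """Amino acid composition (AAC) descriptor vector for a peptide.

    Each of the 20 features is the fraction of the peptide made up of
    one residue type; a simple, common choice of X for peptide QSAR.
    """
    counts = Counter(peptide)
    n = len(peptide)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]
```
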