Liu and Vakser BMC Bioinformatics 2011, 12:280
http://www.biomedcentral.com/1471-2105/12/280

RESEARCH ARTICLE Open Access

DECK: Distance and environment-dependent, coarse-grained, knowledge-based potentials for protein-protein docking

Shiyong Liu1 and Ilya A Vakser2*

Abstract

Background: Computational approaches to protein-protein docking typically include scoring aimed at improving the rank of the near-native structure relative to the false-positive matches. Knowledge-based potentials improve modeling of protein complexes by taking advantage of the rapidly increasing amount of experimentally derived information on protein-protein association. An essential element of knowledge-based potentials is defining the reference state for an optimal description of the residue-residue (or atom-atom) pairs in the non-interaction state.

Results: The study presents a new Distance- and Environment-dependent, Coarse-grained, Knowledge-based (DECK) potential for scoring of protein-protein docking predictions. Training sets of protein-protein matches were generated based on bound and unbound forms of proteins taken from the DOCKGROUND resource. Each residue was represented by a pseudo-atom in the geometric center of the side chain. To capture the long-range and the multi-body interactions, residues in different secondary structure elements at protein-protein interfaces were considered as different residue types. Five reference states for the potentials were defined and tested. The optimal reference state was selected and the cutoff effect on the distance-dependent potentials investigated. The potentials were validated on the docking decoys sets, showing better performance than the existing potentials used in scoring of protein-protein docking results.

Conclusions: A novel residue-based statistical potential for protein-protein docking was developed and validated on docking decoy sets.
The results show that the scoring function DECK can successfully identify near-native protein-protein matches and thus is useful in protein docking. In addition to the practical application of the potentials, the study provides insights into the relative utility of the reference states, the scope of the distance dependence, and the coarse-graining of the potentials.

* Correspondence: vakser@ku.edu
2 Center for Bioinformatics and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS 66047, USA
Full list of author information is available at the end of the article

© 2011 Liu and Vakser; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Protein-protein interactions are a key element of life processes. Thus better understanding of these interactions, coupled with our ability to model them, is essential for the fundamental knowledge of their biology and the multitude of biomedical applications.

Computational approaches to structural determination of protein-protein complexes (protein-protein docking) typically involve two steps: the global, often low-resolution, search within a computationally feasible timeframe to detect a set of matches that includes at least one near-native structure (scan stage), and the local refinement of the matches from the scan stage that may involve more computationally expensive protocols. Such refinement often includes scoring aimed at improving the rank of the near-native structure relative to the false-positive matches. Knowledge-based potentials [1,2], physics-based potentials [3], and the hybrid potentials [4-6] have been shown to perform successfully in protein-protein docking benchmark tests. However, the limited ranking ability of the current scoring functions in CAPRI [7] suggests that much work still has to be done.

In structure prediction of individual proteins, the knowledge-based scoring functions gained significant popularity [8-10]. It has been shown that knowledge-based pairwise atomic potentials perform better than the physics-based potentials in the near-native structure refinement [11].

An essential element of knowledge-based potentials is defining the reference state for the optimal description of residue-residue (or atom-atom) pairs in the non-interaction state. For protein-protein interactions, generally, there are three methods of defining the non-interaction state. The first one is based on the large-distance cutoffs (e.g., DFIRE [12], DCOMPLEX with DFIRE-based potential [13], DOPE [14], and volume correction [15,16]), the second one is based on random mixing of residue or atom types (e.g., KBP [17], and DBD-Hunter [18]), and the third one is based on false-positive matches/decoys (e.g., RAPDF [19], PIPER [20], and DARS [2]).

Our approach utilizes reference states based on protein-protein decoys. It was shown that the long-range cooperative interactions [21] play an important role in protein-protein association. However, they are difficult to model based on contact or physics-based potentials. On the other hand, the coarse-grained distance-dependent potentials are a simple way to capture the long-range residue-residue interaction.
In this paper we present a new Distance- and Environment-dependent, Coarse-grained, Knowledge-based (DECK) potential for scoring of protein-protein docking predictions.

Results

Coarse-grained statistical potentials were developed, based on pseudo-atoms at the geometric center of the side chains, with five different reference states. The potentials were trained on sets of unbound and bound protein-protein complexes (see Methods). To select the optimal reference state, the scoring functions were tested on the GRAMM-X decoy set [22]. The success rate for each scoring function for the 61 complexes in the set is shown in Figure 1. The success rate was calculated as the percentage of complexes with at least one hit ranked in top N. A hit was defined as a match with ligand RMSD <5 Å. The success rates in Figure 1 provide a clear comparison of the five reference states, with the reference state 5 yielding the highest success rates overall, especially for the smaller top N values. Thus, further results in this study were obtained with the potentials based on this reference state.

Figure 1 Comparison of scoring results based on five reference states. The success rates were determined on GRAMM-X docking decoys, as the percentage of complexes with at least one hit ranked in top N matches. A hit is defined as a match with ligand RMSD from the native structure <5 Å. (Plot: success rate versus number of predictions, one curve per reference state, Ref state 1 to Ref state 5.)

Our potentials are distance-dependent by design. In the development of distance-dependent potentials, the choice of the distance cutoff is an important consideration. Earlier studies investigated the cutoff effect in protein-protein energy landscapes [23]. For a long-range potential, such as soft Lennard-Jones, a 14 Å cutoff was suggested. This value is close to the 15.5 Å cutoff in DFIRE [12].
In an iterative knowledge-based scoring function for protein-protein recognition, the cutoff distance was set to 10 Å [24]. In the current study, for the scoring function with the reference state 5, cutoffs from 3.2 to 20.8 Å were used to check the cutoff effect on the success rate for the GRAMM-X decoys. The success rates were calculated for a set of top N criteria (Figure 2). The results show a decrease of the success rate for cutoffs >10 Å. This value is close to the cutoff values in ITScore [24]. The cutoff between 8 and 10 Å has little effect on the success rate. Thus, along with the distance-dependent potentials, we tested a contact potential, based on the reference state 5, which included a single 0 - 8 Å bin.

Figure 2 Cutoff effect on the DECK potential. The success rates of scoring, based on reference state 5 with different cutoff values, were obtained on GRAMM-X docking decoys. The cutoffs were tested with 0.2 Å step. The success rates were calculated as the percentage of complexes with hits (ligand RMSD <5 Å) in top N predictions, for different N values. (Plot: success rate versus cutoff, Å, one curve per N from Top 1 to Top 10.)

The potentials were tested on the ZDOCK3.0+ZRANK Decoys developed in Weng’s lab [25]. ZDOCK3.0 [1] implements FFT docking based on shape complementarity, electrostatics, and pairwise contact potentials. ZRANK [5] is an optimized energy function, which includes van der Waals, electrostatics and pairwise atomic contact energy. The dataset included 84 complexes with 54,000 decoys each. At least one near-native hit (a match with the interface Cα RMSD <2.5 Å) was present in 66 complexes. The tested potentials were: DECK 1 and DECK 2 (reference state 5, training sets 1 and 2, correspondingly), Contact Potential (trained on set 2), and DCOMPLEX. The results were compared with ZRANK values from the score file in the decoys set.
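The success-rate measure used in these tests (the percentage of complexes with at least one hit among the top N ranked matches) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `success_rate`, the input layout, and the convention that a lower score ranks first are assumptions made for the example. The percentage is taken over all complexes passed in.

```python
def success_rate(decoys_by_complex, top_n, rmsd_cutoff=5.0):
    """Percentage of complexes with at least one hit in the top N matches.

    decoys_by_complex: dict mapping a complex id to a list of
        (score, rmsd) tuples, one per docking match (hypothetical layout).
    top_n: number of best-ranked matches to inspect per complex.
    rmsd_cutoff: a match is a hit if its RMSD is below this value (in Å).

    Assumption: scores are energy-like, so a lower score ranks first.
    """
    n_success = 0
    for decoys in decoys_by_complex.values():
        # Rank the matches of this complex by score, best (lowest) first.
        ranked = sorted(decoys, key=lambda match: match[0])
        # The complex counts as a success if any top-N match is a hit.
        if any(rmsd < rmsd_cutoff for _, rmsd in ranked[:top_n]):
            n_success += 1
    return 100.0 * n_success / len(decoys_by_complex)


# Toy data: two complexes, two matches each, as (score, ligand RMSD in Å).
toy = {
    "complex_a": [(-2.0, 3.0), (-1.0, 12.0)],
    "complex_b": [(-3.0, 20.0), (-1.0, 4.0)],
}
rate_top1 = success_rate(toy, top_n=1)
rate_top2 = success_rate(toy, top_n=2)
```

In the toy data, only the first complex has a hit ranked first, while both have a hit within the top two, so the rate rises with N in the same way as the curves reported in the figures.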
The success rates are shown in Figure 3A. Overall, ZRANK showed the best results, except for the top 1 predictions, where DECK 2 performed better. DECK 2 was better than Contact Potential and DCOMPLEX for all top N predictions.

A test was also performed on the RosettaDock [4] unbound docking decoy set from the Gray lab. The set includes 54 complexes. Each complex has the top 200 structures from the global search based on unbound structures with rebuilt side chains. This decoy set represents another important facet of protein docking. The ZDOCK3.0+ZRANK set has the rigid body docking output, which typically contains a large number of matches for further structural refinement. The RosettaDock set contains the structures with optimized side-chain conformations, representing an expected output of a flexible structure refinement. Such a refinement is computationally expensive and thus has a significantly smaller number of matches, which are meant to be structurally more accurate than the rigid-body docking output.

DECK 1 and 2, and Contact Potential were tested and compared with RosettaDock, DCOMPLEX and ZRANK score values. The RosettaDock score values were obtained from the file in the decoy set. The scores of DCOMPLEX and ZRANK were computed locally. With a hit defined as a match with ligand RMSD <5 Å, 28 of 54 complexes had at least one hit. The results are shown in Figure 3B. If the hit was redefined as a match with ligand RMSD <10 Å, 37 of 54 complexes in the decoy set had at least one hit. Figure 3C shows the results according to this definition. As the results indicate, in both cases, DECK 2 outperformed other potentials across all top N predictions.

An important activity in the field of protein-protein docking is a community-wide experiment on Critical Assessment of Predicted Interactions (CAPRI; http://www.ebi.ac.uk/msd-srv/capri).
This experiment allows a comparison of different computational methods on a set of prediction targets (co-crystallized protein complexes with the structure of the complex unknown to the predictors). The community of predictors is provided with the coordinates of the separate components of the complex, which they use for the docking and scoring. After the models are submitted by the docking predictors, they are made available to ‘scorer’ groups to re-rank them and submit their own 10 best-ranking matches [7].

Figure 3 Test on ZRANK and RosettaDock decoys. DECK versions 1 and 2 are based on the reference state 5, and trained on set 1 and 2, correspondingly. The success rate was calculated as the percentage of complexes with at least one hit ranked in top N predictions. The definition of the hit is according to the test. (A) Test on ZRANK docking decoys. A hit is defined as a match with interface RMSD <2.5 Å. The ZRANK score and RMSD values were taken from the score file included with the decoys. (B) Test on RosettaDock decoys, with a hit defined as a match with ligand RMSD <5 Å, and (C) with a hit defined as a match with ligand RMSD <10 Å. The RosettaDock scores and RMSD values were taken from the score file included with RosettaDock decoys. DCOMPLEX and ZRANK scores were calculated locally. (Plots A, B and C: success rate versus number of predictions, 1 to 10, with curves for DECK 1, DECK 2, Contact, RosettaDock, DCOMPLEX and ZRank.)

The DECK potential was tested in the CAPRI scoring experiment. According to the CAPRI assessment criteria, it identified two ‘acceptable’ models for target 32, four ‘medium’ models for target 40, four ‘medium’ and three ‘acceptable’ models for target 41, and one ‘acceptable’ model for target 46.
Target 32 was a complex between subtilisin Savinase and α-amylase subtilisin inhibitor. The distribution of the top 10 models for this target is shown in Figure 4 (the best results for the target among twenty scoring teams).

The scoring procedure implementing DECK is available from the authors upon request (liushiyong@gmail.com).

Discussion

The knowledge-based potentials improve modeling of protein complexes by taking advantage of the rapidly increasing amount of experimentally derived information on protein-protein association. The distance dependence of these potentials is supposed to provide a more accurate description of protein-protein interactions by taking into account the structural and physicochemical aspects of the interacting proteins within a broader scope than the immediate contact across the interface. The coarse-graining of the potentials makes them less sensitive to the structural inaccuracies of the proteins, which are unavoidable for unbound X-ray and potentially modeled proteins, especially in high-throughput applications to large interaction networks.

Five reference states for the coarse-grained, distance-dependent, knowledge-based potentials were used in this study. Similar reference states in earlier studies focused on protein structure prediction and protein folding [19,26,27]. We applied a similar form of the potential to protein-protein docking, redefining the reference states based on the non-native matches (docking decoys). The larger number of non-native matches models random protein-protein binding with reasonable accuracy. The long range interactions were accounted for by incorporating the structural environment of the interacting residues. Docking decoys were used as a reference state earlier in DARS potentials [2]. However, our method differs in three key points. The first one is the detailed form of the potential.
DARS is based on the mole fraction potential, uniform reference state, and atomic contact potentials [28] (the random crystal reference state: the atom pairs are randomly exchanged). In our method, the reference states 1 and 2 also include the mole fraction terms. However, they also incorporate the probability of finding residue types at a certain distance [19]. The second point is the way to calculate the observed and the expected probabilities of residue pairs. The observed probability of DARS is based on the native structure. In our study, the observed probability based on the native structure made the results worse when tested on GRAMM-X decoys (data not shown). The main reason was the limited number of nonredundant protein-protein interfaces. So, in our approach the near-native matches were used instead of the native complexes. The DARS approach used 20,000 best scoring matches (shape complementarity only) for calculating the reference probabilities. We used ~160,000 best scoring matches without the near-native hits for calculating the expected probability in each case. The third point is the resolution. Our method is coarse-grained. Because in this work we do not integrate our potential in the FFT search, a direct comparison of the results is difficult. However, both studies show that the reference states based on decoys perform better than the ones based on mole fraction terms.

Overall, the results show that the scoring function DECK can successfully identify near-native protein-protein matches and thus is useful in protein docking.

Figure 4 Example of DECK scoring of protein-protein docking matches. Top 10 models according to DECK scores are shown for CAPRI target 32. The structures are shown in the correct (co-crystallized) position. Binding site residues on the receptor are in red.
Magenta spheres are the geometric centers of the ligand in the top 10 predictions containing two acceptable models (see text for details).

Conclusions

Scoring of predicted protein-protein matches is important for identification of near-native structures in a pool of models. Knowledge-based scoring schemes improve modeling of protein complexes by taking advantage of the rapidly increasing amount of experimentally derived information on protein-protein association. A choice of the reference state for the description of non-interacting residue or atom pairs is an essential element of the knowledge-based potentials. The study presents a new potential for scoring of protein-protein docking predictions. Training sets of protein-protein matches were generated based on the bound and unbound proteins from the DOCKGROUND resource. Each residue was represented by a pseudo-atom in the geometric center of the side chain. To capture the long-range and the multi-body interactions, residues in different secondary structure elements at protein-protein interfaces were considered as different residue types. Five reference states for the potentials were defined and tested. The optimal reference state was selected and the cutoff effect on the distance-dependent potentials investigated. The potentials were validated on the docking decoys sets, showing better performance than the existing potentials used in scoring of protein-protein docking results.
The study also provides insights into the relative utility of the reference states, the scope of the distance dependence and the coarse-graining of the potentials.

Methods

Training sets

The bound and the unbound complexes for the training sets were taken from the DOCKGROUND resource [22,29,30] (http://dockground.bioinformatics.ku.edu). The bound complexes were from the representative bound set and the bound part of the docking benchmark. The unbound complexes were from the docking benchmark. For all the complexes, the docking decoys were generated by GRAMM-X [31] scan (with no scoring and refinement). A match with RMSD of the ligand backbone atoms <5 Å was defined as the near-native one, comparable with CAPRI evaluation criteria [7]. With 160,000 matches per complex, 358 bound complexes from the representative set, and 71 bound complexes and 50 unbound complexes from the docking benchmark set had at least one near-native prediction. Two training sets were compiled: Training Set 1 (408 complexes) including 358 bound complexes from the representative set and 50 unbound complexes from the docking benchmark, and Training Set 2 (429 complexes) including 358 bound complexes from the representative set and 71 bound complexes from the docking benchmark. It is well known that existing protein-protein docking procedures perform differently on bound and unbound structures. Thus, it is interesting to see the difference between the knowledge-based potentials derived from the bound and from the unbound docking, especially with the potentials tested on the unbound docking decoys.

Knowledge-based energy functions

It can be assumed that the probability of structural features at protein-protein interfaces follows the Boltzmann distribution [12,17,19,26,27,32-37]. For a residue-residue pair $(i, j)$ at distance $d$ across the interface, the contribution of binding energy $e(i, j, d)$ can be estimated as:

\[
e(i, j, d) = -RT \ln \frac{\pi(i, j, d)_{obs}}{\pi(i, j, d)_{exp}} \tag{1}
\]

where $\pi(i, j, d)_{obs}$ and $\pi(i, j, d)_{exp}$ are the observed and the expected probability of the residue pair $(i, j)$ at distance $d$ respectively, and $RT$ is set to 1.

The interaction distance was divided into 21 bins. Comparison with the contact potential (Figure 3) suggests that the larger number of bins enhances the performance of the potential. At the same time, increasing the number of bins beyond 21 would contradict the coarse-grained, residue-based nature of the potential.

Five reference states from the existing methodologies were defined. Each residue was represented by a pseudo-atom in the geometric center of the side chain (for GLY, the geometric center of the main chain). The distance between residues $i$ and $j$ was defined as the distance between their pseudo-atoms. An atomic environment potential [38] was used to model multi-body interaction from pairwise contact potentials. To capture the long-range and the multi-body interactions, residues in different secondary structure environments [39,40] (helix, strand, and coil) at protein-protein interfaces were considered as different residue types. The total number of such types was 60 (20 amino acids in three secondary structure states). The secondary structure state was calculated by DSSP [41]. The eight DSSP secondary structure states are usually placed in three groups: helix (G, H and I), strand (E and B) and loop (all others). In our study, besides H and E, other states were designated as O. So the three secondary structure states were: H, E and O.

All residue-residue pairs were from protein-protein interfaces of the near-native matches or non-near-native decoys. A residue was assigned to the interface if its centroid was within 30 Å of any residue centroid of the other docking partner. Different methods of calculating the probabilities in observed and expected states lead to different potentials. In the following part, we will discuss five different methods used to define the reference state.

Reference state 1

The observed probability of residue pair $(i, j)$ was defined as

\[
\pi(i, j, d)_{obs} = \frac{N(i, j, d)_{obs}}{N(d)_{obs}\,\chi_i\,\chi_j} \tag{2}
\]

where $d$ is the distance between residues $i$ and $j$;

\[
N(i, j, d)_{obs} = \sum_{p=1}^{n_p} \sum_{m=1}^{n_m} g_{p,m}(i, j, d) \tag{3}
\]
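The residue typing and Eq. (1) described in this section can be illustrated with a short sketch. It is a hedged example under stated assumptions, not the authors' implementation: the names `ss_state`, `residue_type`, `distance_bin` and `pair_energy` are hypothetical, and the equal-width binning with a caller-supplied cutoff is an assumption, since the text gives the number of bins (21) but not their boundaries. The probabilities passed to `pair_energy` are taken as given; how they are estimated depends on the reference state.

```python
import math

# The 20 standard amino acids (three-letter codes).
AMINO_ACIDS = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
]


def ss_state(dssp_code):
    """Map a DSSP code to H, E or O: H and E are kept, all others become O."""
    return dssp_code if dssp_code in ("H", "E") else "O"


def residue_type(amino_acid, dssp_code):
    """Environment-dependent residue type, e.g. 'ALA_H' (label format is hypothetical)."""
    return f"{amino_acid}_{ss_state(dssp_code)}"


# 20 amino acids in three secondary structure states give 60 residue types.
RESIDUE_TYPES = [f"{aa}_{ss}" for aa in AMINO_ACIDS for ss in ("H", "E", "O")]


def distance_bin(distance, cutoff, n_bins=21):
    """Index of the distance bin, or None if the pair lies outside [0, cutoff).

    Assumption: equal-width bins; the bin boundaries are not given in the text.
    """
    if not 0.0 <= distance < cutoff:
        return None
    width = cutoff / n_bins
    return min(int(distance / width), n_bins - 1)


def pair_energy(p_obs, p_exp, rt=1.0):
    """Eq. (1): e = -RT * ln(p_obs / p_exp), with RT set to 1 as in the text."""
    return -rt * math.log(p_obs / p_exp)


# A pair observed twice as often as expected gets a favorable (negative) energy.
example_energy = pair_energy(0.02, 0.01)
```

With this sign convention, pair types that are enriched in near-native matches relative to the reference state contribute negative energies, and depleted pair types contribute positive ones.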
