1471-2105-14-S2-S8, bioinformatyka, artykuly

Zgryźliwość kojarzy mi się z radością, która źle skończyła.

//-->.pos {position:absolute; z-index: 0; left: 0px; top: 0px;}Guet al. BMC Bioinformatics2013,14(Suppl2):S8http://www.biomedcentral.com/1471-2105/14/S2/S8PROCEEDINGSOpen AccessBuilding Markov state models with solventdynamicsChen Gu1, Huang-Wei Chang1, Lutz Maibaum2, Vijay S Pande3, Gunnar E Carlsson4, Leonidas J Guibas5*FromThe Eleventh Asia Pacific Bioinformatics Conference (APBC 2013)Vancouver, Canada. 21-24 January 2013AbstractBackground:Markov state models have been widely used to study conformational changes of biologicalmacromolecules. These models are built from short timescale simulations and then propagated to extract longtimescale dynamics. However, the solvent information in molecular simulations are often ignored in currentmethods, because of the large number of solvent molecules in a system and the indistinguishability of solventmolecules upon their exchange.Methods:We present a solvent signature that compactly summarizes the solvent distribution in the high-dimensional data, and then define a distance metric between different configurations using this signature. We nextincorporate the solvent information into the construction of Markov state models and present a fast geometricclustering algorithm which combines both the solute-based and solvent-based distances.Results:We have tested our method on several different molecular dynamical systems, including alaninedipeptide, carbon nanotube, and benzene rings. With the new solvent-based signatures, we are able to identifydifferent solvent distributions near the solute. Furthermore, when the solute has a concave shape, we can alsocapture the water number inside the solute structure. Finally we have compared the performances of differentMarkov state models. The experiment results show that our approach improves the existing methods both in thecomputational running time and the metastability.Conclusions:In this paper we have initiated an study to build Markov state models for molecular dynamicalsystems with solvent degrees of freedom. The methods we described should also be broadly applicable to a widerange of biomolecular simulation analyses.BackgroundThe simulation of biological processes at the molecularscale has the potential to give insight into a wide range ofproperties and phenomena that are important to science,engineering, and medicine–with protein folding, or mis-folding, being perhaps the most famous example [1,2].Indeed, simulations can give, in principle, atomic-leveldetail with great temporal precision over a wide range ofapplication areas, thus greatly complementing andexpanding on what one can currently do experimentally.Today, with powerful individual processors, large* Correspondence: guibas@cs.stanford.edu5Department of Computer Science, Stanford University, Stanford, CA 94305,USAFull list of author information is available at the end of the articlecomputer clusters, as well as with very large distributedclusters of processors, one can routinely generate massivequantities of simulation data for a given phenomenon ofinterest, often in full-atomic detail along many trajectories.There is an increasing need to mine such massive datasets in order to gain insight into the fundamental phenom-ena under study. From these data sets, the goal is tounderstand at some more macroscopic level the structureof the paths taken during the simulation. The key chal-lenge facing dynamical simulation on the molecular scaleis to overcome the gap between the timescales whereinteresting biologically relevant conformational changesoccur (typically microseconds or even longer) and thosewe can simulate at atomic resolution (typically nanose-conds). The length of atomic simulations is limited by the© 2013 Gu et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative CommonsAttribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction inany medium, provided the original work is properly cited.Guet al. BMC Bioinformatics2013,14(Suppl2):S8http://www.biomedcentral.com/1471-2105/14/S2/S8Page 2 of 9need to take small time steps, which is determined by thehigh frequency motions.Markov state modelsvisualizing the model. There are also several recent worksdeveloped related to these methods [7-9].Solvent degrees of freedomTo meet such a challenge, a lot of recent effort has beendevoted to constructing stochastic kinetic models, often inthe form ofdiscrete-time Markov state models (MSMs),from relatively short molecular dynamics simulations[3-11]. These models are built from short timescale simu-lations and then propagated to extract long timescaledynamics. The MSMs partition configuration space into anumber of distinct states, calledmetastable states,suchthat the intra-state transitions are fast but the inter-statetransitions are slow. Such separation of timescales ensuresthat the model is Markovian, in that the probability ofbeing in a given state at timet+Δtdepends only on thestate at timet.In a MSM, the time evolution of a vector representingthe population of each state can be calculated asP(nτ) =[T(τ)]nP(0),whereP(nτ)is a vector of state populationsafterntime steps andT(τ)is the column-stochastic tran-sition probability matrix with lag timeτ(simulation timestep). Note that any model is Markovian for a sufficientlylong lag timeτ,because the system is able to converge toan equilibrium distribution from any arbitrary initial dis-tribution after one lag time. The key point is to build amodel with a lag time that is shorter than the timescaleof the process of interest with a reasonable number ofstates.To build such dynamical models, it is necessary to mapout the dominant long lived, kinetically metastable statesand then determine the rates for transitioning betweenthese states. A few different approaches have been devel-oped to generate good state decompositions. If the low-dimensional manifold containing all the slow degrees offreedom is known a priori, then the configuration spacecan be partitioned into free energy basins to define thesemetastable states, such as by examination of the potentialof mean force [10-14]. Without this prior knowledge,some attempts have turned to conformational clusteringtechniques which assume that geo-metrically distinctclusters may also be kinetically distinct [15-18].In [4], Chodera et al. proposed a first algorithm thatcan automatically discover kinetically metastable statesfor the construction of MSMs. They use a geometricclustering algorithm to split the configuration space intoa large number of small microstates, and then lump theminto kinetically distinct macrostates. Later, Bowman et al.developed an open source software package MSMBuilderbased on this framework [6]. The software provides toolsfor clustering data based on geometric relationships andfor constructing and manipulating MSMs based on thisinitial clustering. It also includes tools for verifying thatthe resulting model is Markovian as well as analyzing andSince the dynamics of biological macromolecules areusually coupled with the surrounding solvent, manymolecular simulations involve both a solute and a solvent(typically water). Some previous works have shown thenecessary of accounting for the solvent structure to accu-rately characterize the dynamics and free energy land-scape of the biological macromolecule systems, such asthe RNA hairpin-loop motif [19], alanine dipeptide [20]and the BphC enzyme [21]. In this setting, both soluteand solvent atoms are placed in a box and then move fol-lowing some predefined force field, yielding a sequenceof snapshots of the atom positions. The number of atomsis kept constant in this process.Although people have recognized that solvent coordi-nates may be critical in some phenomena [19-25], in thestep of data analysis people often assume configurationslie exclusively in the configuration space of the macro-molecule, and simply ignore the solvent information. Forexample, in [4], it presume that de-correlation ofmomenta and reorganization of the solvent is faster thanthe process of interest. One difficulty in dealing with sol-vent degrees of freedom is the large number of solventmolecules in a system (typically thousands). Besides, italso requires to account for the indistinguishability ofsolvent molecules upon their exchange. One impressivework in this direction is [22], which used a generic neuralnetwork model to identify reaction coordinates from adatabase of candidate variables including water relatedones. However, to use this approach researchers have todefine the candidate variables. Furthermore, the resultfrom the neural network model may not be easy to inter-pret, which is a drawback as a data exploration tool.In this paper, we propose to generalize the currentmethods to include the solvent degrees of freedom. Wefirst present a new distance metric which encodes thesolvent information in molecular configurations, andthen incorporate it into the construction of MSMs.Finally we apply our method to several biological modelsystems and assess its performance.MethodsMany of the dynamical systems which occur in biochem-istry take place in very high dimensional spaces. Ourmain goal is to develop techniques to obtain the simplestkind of qualitative information about high-dimensionalmolecular dynamical systems. Perhaps the most signifi-cant piece of information one has about the data set isthe distance metric which specifies the distances betweenpairs of points (molecular configurations). For macromo-lecules, a commonly used metric for estimating theGuet al. BMC Bioinformatics2013,14(Suppl2):S8http://www.biomedcentral.com/1471-2105/14/S2/S8Page 3 of 9distance between two molecules is theRMSD distance,defined as the root mean squared deviation of the Carte-sian coordinates of heavy atoms in the molecules after aminimizing rigid body translation and rotation alignment[26,27]. In this section, we design a new distance functionfor comparing the solvent profiles, and then use it toconstruct MSMs with solvent degrees of freedom.Distance functionsf(X, Y)to satisfy the following properties:1.f(X, Y)is continuous in the input spaceXandY,soa small perturbation of the system does not changethe signatures too much.2.f(X, Y)is symmetric inY= {y1,y2, ...,yn}, so the sol-vent molecules are indistinguishable upon theirexchange.3.yi’sfar fromXhave less weights inf(X, Y),becausethese solvent molecules have little impact to thesolute.To meet these properties, we define the signaturef(X,Y)as follows. Given a pointxÎX,we transform the||x−y||space using a Gaussian kernelK x, y= exp−2σ22In molecular simulations, a system consists of both asolute (macromolecule) and a solvent (water). Supposethe solute structure containsmatoms, and the solventinvolvesnwater molecules. We denoteX= {x1,x2, ...,xm}as the set of solute atoms, andY= {y1,y2, ...,yn} repre-senting the set of solvent atoms. (For water molecules,we only record the oxygen atom at the vertex and ignoretwo hydrogen atoms at the tips, so eachyicorresponds tothe oxygen atom of a water molecule). Then, the resultsof the simulations become sequences of point sets {X,Y},which are obtained by sampling at random from the con-figuration space and then following the trajectory for acertain time interval.We first point out two properties when comparing dif-ferent configurations {X,Y}:•m≪n–typically the number of solute atoms isless than 100, while there can be thousands of sol-vent molecules in a system.•{y1, y2, ...,yn} are indistinguishable upon theirexchange–when considering the interaction betweenthe solute and the solvent, we do not care about theidentities ofY.In other words, two configurations areconsidered as the same if they only differ by a permu-tation of solvent molecules.To address the indistinguishability of solvent moleculesupon their exchange, one may consider methods thatcompute the optimal matching between the solvent mole-cules, such as minimum cost flow [28], or the Hungarianalgorithm [29]. However, these matching based algorithmswould requireO(n3) time, which is slow for systems withthousands of solvent molecules. The computational costcan be reduced if we only focus on solvent moleculesaround the solute, such as itsk-nearestneighbors. How-ever, this solution is not stable because a small perturba-tion in the configuration may cause the set ofk-nearestneighbors to vary a lot.We present a new distance function that measures thegeometric similarity between different configuration. Theidea is we compute some signatures/descriptorsf(X, Y)that compactly summarize the high-dimensional data sets{X,Y},and then define the distances using these signa-tures. As mentioned above, we would like the signature,where ||x -y||is the Euclidean distance between pointsxandy,so thatyi’sfar fromXbecome less important.We then define the signature of a single pointxrelativeto spaceYasf(x,Y)=nK x, yi. By summing up alli=1kernelsK(x,yi), the result is invariant under permuta-tions of solvent molecules. Finally, we definef(X, Y)as asignature vector {f(x1,Y), f(x2,Y),...,f(xm, Y)},which takesO(mn)computation time.Intuitively, the signature vectorf(X, Y)summarizes thesolvent distribution around each solute atom. We thendefine the distance between two configurations simply asthe Euclidean distance between their signature vectors.In fact, there are various choices of functions that cansatisfy these properties (1-3), while the one we proposedhere is simple and fast to compute.Constructing Markov state modelsIn this section, we integrate the solvent information intothe construction of MSMs. We will follow and extendthe methods described in [4,6]. Basically, theseapproaches has two steps–asplitstep to reduce the sizeof the data set based on geometric shapes, and then alumpstep to incorporate kinetic information fromtrajectories.SplittingModern computer simulations can easily generate datasets with millions of configurations, making analysis ofthese massive data sets computationally challenging. Animportant method for shrinking the data sets is to apply aclustering algorithm to obtain a family of clusters (micro-states) of much smaller size than the original data set.Here each cluster should be small enough to ensure thatthe intra-state transitions between configurations in thesame cluster are fast.In the split step, allNconfigurations (104- 107) aregrouped intoKmicrostates (102- 104) based on theirGuet al. BMC Bioinformatics2013,14(Suppl2):S8http://www.biomedcentral.com/1471-2105/14/S2/S8Page 4 of 9structural similarity. Due to the large size of the data set,it is more practical to apply a fast geometric clusteringalgorithm, such as thek-centerork-medoidalgorithmwithO(KN)time complexity [30,31]. Another importantfactor is the choice of distance functions in these cluster-ing algorithms. In the traditional solute-based models,the RMSD distance is often used as a standard metric tomeasure the structural similarity. With the distance func-tion we defined between solvent configurations, we areable to identify solvent-based metastable clusters.Furthermore, we may combine these two distance func-tions together to build a model with both solute and sol-vent information.Suppose we want to build a model withKmicrostates,√we first group allNconfigurations intoKsoluteclusters using the RMSD distance, and then indepen-√dently group all configurations intoKsolvent clus-ters using the distance based on solvent signatures. Inthe next step, we consider two configurations to be inthe same microstate if and only if they are assigned toboth the same solute cluster and the same solvent clus-√2ter, and thus there are totallyKstates at the end.Note that some states might be empty if there is noconfiguration assigned to their corresponding {solute,solvent} cluster pairs. In this case, we may increase thenumber of solute/solvent clusters a little bit larger tomake sure that we have at leastKnon-empty states.(An alternate solution is to group all configurations into√√Ksolute clusters first, and then generateKsolvent clusters for configurations within each solute√cluster independently, instead of generatingKglo-bal solvent clusters.) Finally, we form theKmicrostatesby simply merging the smallest states (this step can beskipped if we do not need to form exactlyKmicrostates).More generally, we can generateK1solute clusters andK2solvent clusters (withK1K2≥K),and then combinethem intoKmicrostates. In fact, the traditional solute-based model can be seen as a special case whereK2= 1,and the solvent-based model is a special case whereK1= 1. Note that in this case, the running time for geo-metric clustering becomesO((K1+K2)N). By setting√K1=K2=K, we achieve the optimal running time√OKN–which is much faster thanO(KN)time forlargeK(because we are generating hundreds/thousandsof microstates).Lumpinglost the original metric information after the split step.What one retains, however, is the computation of prob-abilities for transitioning from one microstate toanother. This means that we retain a coarse version ofthe dynamics. In the next step, these microstates arelumped into macrostates based on their kinetic transi-tions in the trajectories. Since this step does not con-sider solute/solvent information about configurations,we simply follow the same approach described in [4].In the lump step, theKmicrostates are grouped intoLmacrostates (< 102) so as to maximize themetastability.The metastabilityQof a decomposition intoLmacro-states is defied as the trace of its transition probabilityLmatrixQ=i=1Tii(τ ). Intuitively, a poor decompositionwould result in a smallQ,as trajectories started in somestates exit rapidly; conversely, a good decomposition withstrongly metastable states would result in a largeQ,astrajectories remain in each state for long times.In the original approach, a simulated annealing algo-rithm [32] is used to optimize the metastability in lump-ing. The algorithm starts with an arbitrary initialsolution that assignsKmicrostates intoLmacrostates.In each step, a microstate is selected uniformly at ran-dom, and reassigned to a new random macrostate (thenew solution is rejected if a macrostate becomes emptyafter this change to ensure that there areLmacrostates).If the new solution has a larger metastabilityQ’than theold solutionQ,the new solution is accepted; otherwiseQ−Qit is accepted by a probability ofexp, whereTTis a temperature parameter which is set to be theinverse of the step number. The allowance for these“downhill”moves can potentially save the method frombecoming stuck at local optima.Results and discussionThe method we described here would be generallyapplicable to a wide range of biomolecular simulationanalyses. In this section, we pick several examples andtest the performance of our method in these differentmodels.Solvent-based clustersBecause the clustering algorithms do not produce clus-ters of any particular uniform shape or size, we haveWe first apply our method to a small alanine dipeptidesystem, which has been used as an example in theMSMBuilder [4,6]. We pick a 5 nanoseconds trajectory ofalanine dipeptide in explicit water, with a frame rate of 1picosecond.In this model, the solute structure Ace-Ala-Nme con-sists of 22 atoms and the solvent contains 885 H2O. Foreach configuration, we extract 10 solute atomsX= {x1,x2, ...,x10} consisting of all heavy atoms on the backboneGuet al. BMC Bioinformatics2013,14(Suppl2):S8http://www.biomedcentral.com/1471-2105/14/S2/S8Page 5 of 9chain (see Figure 1(a)), and alsoY= {y1,y2, ...,y885}representing the water molecules. We next reduce thedimensionality of this point set {X,Y}by computing itssignaturef(X, Y).Intuitively, the signature vectorf(X, Y)summarizes thesolvent distribution around the solute. To see this, wemap the signatures of all configurations onto a lowerdimensional space using the principle component analy-sis (PCA) [33]. Figure 1(b) shows the top three PCAdirections forf(X, Y),where the colors represent weightsfor each dimension. The first principle component isbasically the average off(xi, Y)at all solute atoms, whichrepresents the amount of water around the whole solutestructure. The second principle component distinguishesthe two ends of the backbone chain, which tells uswhether the water molecules are gathered on the leftside or the right side. Furthermore, the third principlecomponent distinguishes the two ends and the middlepart, for example in the case when the two ends arefolded close to each other. A six-states decompositionfor all solvent signatures using thek-centerclustering isshown in Figure 1(c), where the space is partitionedbased on these PCA directions.In protein backbone geometry, it is known that thetorsion anglesjandψare the primary degrees of free-dom of the solute structure. (The solvent coordinateshave been shown to be the next most important degreesof freedom in this dynamical system [20,22].) For exam-ple, Figure 1(d) shows a five-states decomposition usingthek-centerclustering with RMSD distances, projectedonto the (j,ψ)torsion angles map (similar to the man-ual state decomposition described in [14]). However,these solute-based clusters are very different from thosesolvent-based clusters–if we project the solvent clus-ters onto the torsion angles map, they no longer show aclustering behavior (see Figure 1(e)). This also motivatedus the construction of the combination model whichintegrates both solute and solvent information, asdescribed in the splitting section.In the above alanine dipeptide example, the solutestructure is small and may in some sense be consideredas a convex object, because the water molecules rarelyenter the region inside the solute structure. We nextturn to another example of carbon nanotube in water,whose solute atoms form a very concave structure.Because this model simulates water molecules going inand out of a carbon nanotube, it is a good test ofwhether the solvent distribution inside the solute struc-ture can be captured by our method.We have a 10 nanoseconds trajectory of carbon nano-tube in water, with a frame rate of 1 picosecond. ThesoluteXconsists of 144 fixed carbon atoms with a cylind-rical nanostructure, and the solventYcontains 951 H2O.In [23], it has been observed the spontaneous and continu-ous filling of a nonpolar carbon nanotube with a one-dimensionally ordered chain of water molecules, and aminute reduction in the attraction between the tube walland water can dramatically affect pore hydration, leadingto sharp transitions between empty and full states on ananosecond timescale (see Figure 2(a)). This can also beverified using our method by computing thewater numberinside the nanotube, which we define as the integral ofpoint signaturef(x, Y)over the cylindrical regionVinsidethe nanotube. Here we use a normalized Gaussian kernelK x, y=√12πσ3x−yexp−2σ2n2. Note that the waternnumberVf(x,Y)dx=V i=1K x, yidx=i=1VK x, yidx.Ass®0,K(x, yi) converges to the Dirac delta functioncentered atyi, and thus∫VK(x, yi)dx can be seen as an indi-cator functionI(yiÎV).So, the water number roughlycounts the number of water molecules inside the nano-tube, except that it is a continuous function. Figure 2(b)plots the water number inside the nanotube over a periodof 3000 frames. In this figure, we can clearly see that thesystem transits between empty and full states, with fastintra-state transitions and slow inter-state transitions. (In aFigure 1Alanine dipeptide.(a) Solute structure. (b) Top 3 PCA directions for signaturef(X, Y)(s = 1). (c) A solvent-based state decompositionmapped to the PCA space. (d) A solute-based state decomposition on the torsion angles map. (e) Projection of solvent-based clusters onto thetorsion angles map.

zanotowane.pl

doc.pisz.pl

pdf.pisz.pl

hannaeva.xlx.pl

1471-2105-14-S2-S8, bioinformatyka, artykuly

Zgryźliwość kojarzy mi się z radością, która źle skończyła.

Wątki