Methods for Macromolecular Modeling: Overview


Macromolecular simulations are increasingly demonstrating their value as a tool with which to study biomolecular structure and function. The simulation field has made giant leaps through advances in computational techniques, most notably for timestep integration for molecular dynamics, configurational sampling, fast evaluation of the long-range electrostatic interactions, and quantum-mechanical calculations. Yet many fundamental and practical limitations face biomolecular modelers. They include the approximate nature of the governing force fields as well as simulation protocols, the limited range of configurational sampling and relatively short trajectory times, the neglect of quantum effects in classical molecular dynamics, and the enormous computational requirements (e.g., a year of computational time on a lab workstation to simulate a small solvated protein with simplified electrostatics). New algorithmic approaches, hierarchical spatial representations, and improved computing platforms are continuously needed to enhance the reliability of macromolecular simulations and increase their appeal and scope to experimentalists [Schlick, 1999].







The popularity of molecular dynamics simulations

Though a variety of simulation techniques are available, the molecular dynamics (MD) approach tops the list in popularity because of its physical appeal and biological connection. Namely, the trajectories showing the evolution in time and space of molecular conformations follow classical physics; this allows kinetic processes to be followed in detail, linking and expanding upon experimental observations. MD's popularity would be overwhelming if the technique's computational demands, and hence biological scope, were not so limited for large systems. This limitation stems from the numerical stability requirement, which restricts the timestep size used for integrating the equations of motion to a relatively small value (e.g., ~1 fs). This timestep size implies one million or more steps for simulating a mere nanosecond in the life of a biomolecule; this number already translates to days or months of computing time, even on state-of-the-art platforms [Board et al., 1994; Schlick et al., 1999], since each step typically requires at least one expensive force evaluation.

Certainly, code parallelization on multiple-processor machines shaves off computing clock time, as demonstrated by the longest simulation to date -- 1 s, for a villin headpiece, achieved in four months of dedicated CPU time on 256 processors of a Cray T3D/E [Duan & Kollman, 1998], or by IBM's ambitious announcement to fold proteins by 2005 on a unique petaflop computer called Blue Gene with a massively parallel architecture (one million or more processors). Any algorithmic speedup will only help to further reduce MD computational time. Back to Top

Long-timestep algorithms

One of the golden opportunities for work reduction is in the time integration protocol. Since the overall cost of MD simulations is dominated by the number of force evaluations, reducing this number -- typically by increasing the effective timestep size -- can lengthen the physical timespan that can be simulated. Yet, this "timestep problem" [Schlick et al., 1997; Schlick, 1998] has remained a `thorny nut'. Much of this difficulty can be simply attributed to the dilemma of balancing a larger stepsize with lower resolution accuracy. That is, with decreased intervals of molecular observations (force update frequency), the resolution of some fast processes might have to be sacrificed. Another obstacle is practical: the more sophisticated algorithms, such as those described below, are not trivial to incorporate in the context of large programs, since evaluations of the total force depend on many program units. This makes the application of new integrators for MD more complex than fast schemes for electrostatics summations.

In the evolution of certain physical systems -- motion of planets or flow of compound concentrations in chemical reactions -- the effect of the fast processes on the global motion is negligible. In biomolecular systems -- unfortunately for algorithm developers -- vibrational modes are intricately coupled: thus, fast small-amplitude motions can trigger a cascade of events that culminate in large-scale global rearrangements. This intricate coupling limits the traditional mathematical machinery available for MD integration and compels algorithm developers to seek inventive, tailored approaches. Still, against these challenges, mathematicians and other computational scientists have labored in the past decade to understand the numerical limitations of MD integration, devise clever schemes, and design approaches departing from accurate motion-following that instead yield greater overall information on the large thermally-accessible configuration space of macromolecules [Elber, 1996; Schlick, 1998; Doniach & Eastman, 1999; Daggett, 2000]. Back to Top

Recent algorithmic work: general

Our group is developing new mathematical tools to study biomolecular structures to push the modeling limits of current computational methods. Examples include developing new modeling approaches for long DNAs with bound proteins (e.g., [Beard & Schlick, 2001b; Beard & Schlick, 2001a; Huang et al., 2001; Schlick et al., 2000]), efficient techniques for nonlinear optimization by the truncated Newton method (package TNPACK) [Schlick & Overton, 1987; Schlick & Fogelson, 1992a; Schlick & Fogelson, 1992b; Derreumaux et al., 1994; Xie & Schlick, 1999a; Xie & Schlick, 1999b], and distance geometry and image processing techniques based on the singular value decomposition for combinatorial library design [Xie et al., 2000; Xie & Schlick, 2000; Xie et al., 2001]. Back to Top

Recent algorithmic work: long timesteps and the LN algorithm

Much effort has focused on developing long-timestep methods [Schlick et al., 1997; Schlick, 1998; Barth & Schlick, 1998a; Sandu & Schlick, 1999; Batcho et al., 2001]. Our extrapolative multiple-timestep scheme called LN [Barth & Schlick, 1998a] (for its origin in a Langevin/normal modes scheme [Zhang & Schlick, 1993; Zhang & Schlick, 1994]) combines two simple ingredients that work together to alleviate resonances and allow larger timesteps in the LN scheme [Barth et al., 1997]: extrapolative force splitting (which by itself leads to energy drifts but is not as vulnerable to resonance artifacts as are impulse variants) and stochastic dynamics (which eliminate the drifts as well as dampen the mild resonances). The stochastic framework, in the form of Langevin dynamics [Pastor, 1994], represents a departure from Hamiltonian dynamics, but stochastic simulations are appropriate for many thermodynamic and sampling questions, since the same equilibrium states are approached in theory. In practice, the bath coupling parameter (damping constant ) can be set as small as possible to minimize the Hamiltonian perturbation while still ensuring numerical stability [Barth & Schlick, 1998a; Sandu & Schlick, 1999; Schlick & Yang, 2001]. This stochastic coupling was also adopted later by Skeel and coworkers, in the context of a mollified impulse method [Izaguirre, 1999], but the impulse formulation restricts the outer timestep.

Performance of MTS schemes can be assessed by comparing energetic, geometric, and dynamic properties to single-timestep (STS) simulations as well as experimental data where available. In the case of LN trajectories, the reference integrator is an extension of Verlet for Langevin dynamics [Brunger et al., 1982].

The numerical stability and resonance alleviation were shown in trajectories for the proteins bovine pancreatic trypsin inhibitor (BPTI) and lysozyme, a large water system [Barth & Schlick, 1998a; Sandu & Schlick, 1999], solvated DNA [Strahs & Schlick, 2000], and a large DNA/protein system [Schlick & Yang, 2001; Yang et al., 2001], where STS results were well-reproduced for outer timesteps of 50 fs or more. Results have shown that a good parameter choice for a 3-class LN scheme is = 0.5 fs, tm = 1 fs, and t up to 150 fs. If constrained dynamics (SHAKE [Rychaert et al., 1977]) for the light atom bonds are used, the inner timestep can be increased to 1 fs and the medium timestep to around 2 fs. The speedup factors depend on the reference system but can be an order of magnitude (factor of 10 or more) with respect to 0.5 fs inner timesteps [Barth & Schlick, 1998a] or around 5 for 1-fs inner timesteps (when SHAKE is used). Relative mean energy differences between MTS and STS simulations are lower when SHAKE is applied [Schlick & Yang, 2001].

The figure shown here illustrates LN's performance for two solvated biomolecular systems containing a polymerase /DNA primer/template complex (43751 atoms) [Schlick & Yang, 2001; Yang et al., 2001], and TATA-box DNA/TBP complex (41011 atoms) [D. Strahs, X. Qian, and T. Schlick, work in progress]. The `Manhattan plots', reporting the differences in mean energy components as a function of the outer timestep t relative to STS Langevin simulations, show low relative errors in all energy components (below 3%) for outer timesteps up to t = 120 fs. In both cases, the inner and medium timesteps are fixed at 1 and 2 fs, respectively, and the computational speedup factor for t = 120 fs is five or more with respect to STS simulations at 1 fs. Back to Top

LN Applications

The added efficiency introduced by LN has allowed us to sample better the minor-groove bending preferences in A-tract DNA [Strahs & Schlick, 2000], study a large number of TATA variants (13) to assess sequence-dependent deformability and flexibility patterns relative to the wildtype TATA element [Qian et al., 2001], and sample many segments in the pathway of an enzyme opening motion -- polymerase complexed to primer/template DNA -- with the open and closed crystallographic forms serving as experimental anchors [Yang et al., 2001]. For the polymerase application, the accelerated sampling led to the identification of a key step in the pathway (Arg258 rotation, together with release of a catalytic magnesium ion), which is slow and maybe rate-limiting. This produced the intriguing hypothesis that the Arg258 rotation, rather than large subdomain motion per se, is a crucial aspect of pol 's DNA synthesis fidelity. Back to Top

LN Integration with Popular Software

The LN algorithm has been incorporated into the widely used CHARMM program [Barth & Schlick, 1998b] in the context of periodic models with cutoffs in the microcanonical ensemble. Extensions to particle-mesh Ewald (PME) formulations and other thermodynamic ensembles are underway.

We have recently completed the application of a new multiple-timestep integrator combined to PME summation in the AMBER software package [Batcho et al., 2001]. Speedup factors of 5 compared to traditional Verlet schemes have been demonstrated. Details of the theoretical and computational simulations can be found in [Batcho & Schlick, 2001; Batcho & Schlick, 2001a], where we discuss performance speedup and the force balancing involved to maximize accuracy, maintain long-time stability, and accelerate computational times. Compared to prior MTS efforts in the context of the AMBER program, advances are reported by optimizing PME parameters for MTS applications and by using the Position Verlet, rather than Velocity Verlet scheme, for the inner loop [Batcho & Schlick, 2001a]. Moreover, ideas from the Langevin/MTS algorithm LN developed recently are applied to Newtonian formulations here. Our MTS/PME AMBER work suggested that modified Ewald formulations, using tailored alternatives to the Gaussian screening functions for the Coulombic terms, may allow larger timesteps and thus further speedups for both Newtonian and Langevin protocols. Such developments will be reported separately [Batcho & Schlick, 2001b] (see "Pairwise Interaction Splitting"). Back to Top

Ongoing Challenges

The apparent difficulty in optimizing MTS/PME combinations is that the reciprocal term -- which should isolate long-range slow forces -- contains significant force contributions from near-field particle interactions [Heyes, 1981]. This requires further mathematical transformations before adaptation to biomolecular MD. Recall that the summation for the electrostatic energy of a periodic system can be expressed by a lattice sum over all pair interactions and over all lattice vectors (excluding i = j in the primary box). Such a sum is only conditionally convergent. Ewald's trick was to convert this sum into an expression involving a sum of an absolutely and rapidly convergent series in direct and reciprocal space. This transformation is accomplished by representing each point charge as a Gaussian charge density, producing an exponentially decaying function. This Gaussian transformation is counteracted by an analogous subtraction to leave the net result of an effective point charge. Thus, the electrostatic energy for a periodic system is expressed as a real-space (direct) term (energy due to point charges screened by oppositely-charged Gaussians) that is short-range with a singularity at the origin, and an associated canceling term (periodic sum of Gaussians) that is smooth and long-range, summed in reciprocal space using smooth interpolation of Fourier-series values [Darden et al., 1999]. In practice, the width of the Gaussian distribution is chosen to balance the work in the two terms: making the direct sum rapidly decreasing and well-approximated for distances within a specified cutoff distance, and the reciprocal term converging rapidly. However, the rapid decay of the direct Ewald term implies the presence of fast components in the reciprocal term. This in turn means that the outer timestep associated with an MTS/PME code cannot be too large, as found in practice [Figueirido et al., 1997; Cheng & Merz, 1999; Kawata & Mikami, 2000; Batcho et al., 2001]. Subsequently, truly efficient MTS/PME protocols likely need to split the direct (and hence modify the reciprocal) term so that the reciprocal term is truly long-range.

Such ideas require alternative mathematical formulations, or a combination of distance and energy-component criteria for defining the force classes. In addition, smoothing functions and possibly separate treatments for the van der Waals and Coulomb terms may be needed. Some such ideas based on non-Gaussian core functions optimized to isolate the near and far-field contributions in the direct and reciprocal terms, respectively, are presented in [Batcho & Schlick, 2001b] (see "Pairwise Interaction Splitting"), and another approach using an alternative splitting strategy for standard Gaussian-based PME formulations is currently being investigated [X. Qian and T. Schlick, work in progress].

These challenging extensions, as well as the efficient implementation of MTS schemes to constant temperature and pressure simulations [Martyna et al., 1996], will undoubtedly be forthcoming in the near future and will help macromolecular models simulate a variety of physical environments, as appropriate, for less and less computer time. The ultimate bridging of the gap between experimental and simulation time frames will depend on the relative pace of improvements in each domain To be sure, important challenges remain in each arena, and the sum of instrumentation and modeling will continue to be greater than each of its parts. Back to Top





Relevant Links






Click to go back to Research