molPX intro

Guillermo Perez-Hernandez  guille.perez@fu-berlin.de

In this notebook we will be using the 1 millisecond trajectory of Bovine Pancreatic Trypsin Inhibitor (BPTI) generated by DE Shaw Research on the Anton Supercomputer and kindly made available by their lab. The original work is

  • Shaw DE, Maragakis P, Lindorff-Larsen K, Piana S, Dror RO, Eastwood MP, Bank JA, Jumper JM, Salmon JK, Shan Y, Wriggers W: Atomic-level characterization of the structural dynamics of proteins. Science 330:341-346 (2010). doi: 10.1126/science.1187409.

The trajectory has been duplicated and shortened to provide a mock-trajectory set and be able to deal with lists of trajectories of different lenghts:

  • c-alpha_centered.stride.100.xtc
  • c-alpha_centered.stride.100.reversed.xtc
  • c-alpha_centered.stride.100.halved.xtc

Input types and typical usecase

The typical usecase is having molecular dynamics (MD) simulation data in form of trajectory files with extensions like .xtc, .dcd etc and the associated molecular topology as a .pdb or .gro file.

These files are the most general starting point for any analysis dealing with MD, and molpx‘s API has been designed to be able to function without further input:

In [1]:
top = 'notebooks/data/bpti-c-alpha_centered.pdb'
MD_trajfiles = ['notebooks/data/c-alpha_centered.stride.1000.xtc',
               'notebooks/data/c-alpha_centered.stride.1000.reversed.xtc',
                'notebooks/data/c-alpha_centered.stride.1000.halved.xtc'
               ]

dt = 244 #saving interval in the .xtc files, in ns

import molpx
from matplotlib import pylab as plt
%matplotlib notebook
import pyemma
import numpy as np

# This way the user does not have to care where the data are:
top = molpx._molpxdir(top)
MD_trajfiles = [molpx._molpxdir(ff) for ff in MD_trajfiles]

However, molpx relies heavily on the awesome `mdtraj <http://www.mdtraj.org>`__ module for dealing with molecular structures, and so most of molpx‘s functions accept also Trajectory-type objects (native to mdtraj) as alternative inputs.

In [2]:
# Create a memory representation of the trajectories
MD_list = [molpx.generate._md.load(itraj, top=top) for itraj in MD_trajfiles]

The same idea applies to the input of projected trajectories: molpx can take the filenames as inputs (.npy, .dat, .txt etc) or deal directly with numpy.ndarray objects.

** These alternative, “from-memory” input modes (md.Trajectory and np.ndarray objects) avoid forcing the user to read from file everytime an API function is called, saving I/O overhead**

The following cell either reads or generates projected trajectory files for this demonstration. In a real usecase this step (done here using TICA) might not be needed, given that the user might have generated the projected trajectory elsewhere:

In [3]:
# Perform TICA or read from file directly if already .npy-files exist
Y_filenames = [ff.replace('.xtc','.Y.npy') for ff in MD_trajfiles]
try:
    Y = [np.load(ff) for ff in Y_filenames]
except:
    feat = pyemma.coordinates.featurizer(top)
    pairs = feat.pairs(range(feat.topology.n_atoms)[::2])
    feat.add_distances(pairs)
    src  = pyemma.coordinates.source(MD_trajfiles, features=feat)
    tica = pyemma.coordinates.tica(src, lag=10, dim=3)
    Y = tica.get_output()
    [np.save(ff, iY) for ff, iY in zip(Y_filenames, Y)]

Visualize a FES and the molecular structures behind it

Execute the following cell and click either on the FES or on the slidebar. Some input parameters have been comented out for you to try out different modes of input (disk vs memory) as well as different projection indices:

In [4]:
ax, fig, iwd, data_sample, geom = molpx.visualize.FES(
                                                      MD_list,
                                                      #MD_trajfiles,
                                                      top,
                                                      Y_filenames,
                                                      #Y,
                                                      nbins=50,
                                                      #proj_idxs=[1,2],
                                                      axlabel='TIC',
                                          )
iwd

Visualize trajectories, FES and molecular structures

The user can sample structures as they occurr in sequence in the actual trajectory. Depending on the size of the dataset, this can be very time consuming, particularly if data is being read from disk.

In this example, try changing MD_trajfiles to MD_list and/or changing Y_filenames to simply Y and see if it helps.

Furthermore, the objects in memory can be strided down to fewer frames before being parsed to the method. To stride objects being read from the disk, use the stride parameter.

Other commented parameters provide more control on the output of visualize.traj. Uncomment them and see what happens

In [5]:
__, myfig, iwd, __ = molpx.visualize.traj(MD_trajfiles,
                                          top,
                                          Y,
                                          #Y_filenames,
                                          plot_FES = True,
                                          dt = dt*1e-6, tunits='ms',
                                          active_traj=0,
                                          #traj_selection = 1,
                                          #sharey_traj=False,
                                          #max_frames=100,
                                          proj_idxs=[0, 1],
                                          panel_height=2,
                          )
myfig.tight_layout()
iwd
/home/guille/Programs/molPX/molpx/visualize.py:344: RuntimeWarning: divide by zero encountered in log
  _plt.contourf(-_np.log(h).T, extent=irange)

Intermediate steps: using molpx to generate a regspace sample of the data

See the documentation of molpx.generate.sample to find out about all possible options:

molpx.generate.sample(MD_trajectories, MD_top, projected_trajectories, proj_idxs=[0, 1], n_points=100, n_geom_samples=1, keep_all_samples=False, proj_stride=1, verbose=False, return_data=False)
In [6]:
data_sample, geoms = molpx.generate.sample(#MD_list,
                                           MD_trajfiles,
                                           top,
                                           #Y,
                                           Y_filenames,
                                           n_points=200
                                    )
data_sample.shape, geoms

Out[6]:
((192, 2),
 <mdtraj.Trajectory with 192 frames, 58 atoms, 58 residues, and unitcells at 0x7fd65922c1d0>)

Paths samples along the different projections (=axis)

In [8]:
paths_dict, idata = molpx.generate.projection_paths(#MD_list,
                                                    MD_trajfiles,
                                                    top,
                                                    Y_filenames,
                                                    #Y, # You can also directly give the data here
                                                    n_points=50,
                                                    proj_idxs=[0,1],
                                                    n_projs=3,
                                                    proj_dim = 3,
                                                    verbose=False,
                                        )

Interaction with PyEMMA

molpx is using many methods of the coordinates submodule of PyEMMA, and thus it also understands some of PyEMMA‘s classes as input (like clustering objects or streaming transformers). ## Using the TICA object to visualize the most correlated input features If the projected coordinates come from a TICA (or PCA) transformation, and the TICA object is available in memory molpx.visualize.traj can make use of correlation information to display not only the projected coordinates (i.e the TICs, in this case), but also the “original” input features behind it

In [11]:
# Re-do the TICA computation to make sure we have a tica object in memory
feat = pyemma.coordinates.featurizer(top)
pairs = feat.pairs(range(feat.topology.n_atoms)[::2])
feat.add_distances(pairs)
src  = pyemma.coordinates.source(MD_trajfiles, features=feat)
tica = pyemma.coordinates.tica(src, lag=10, dim=3)
Y = tica.get_output()

The method molpx.visualize.correlations tries to provide a visual representation of the projected coordinates by relating them to the input features, which carry more meaning, since they are (usually) familiar parameters such as atom distances, angles, contacts etc.

In [28]:
# Comment or uncomment the optinal parameters and see how the method reacts
#iwd = visualize._nglview.show_mdtraj(MD_list[0][0])
iwd = None
corr, iwd = molpx.visualize.correlations(tica,
                                          n_feats=3,
                                          proj_idxs=[0,1,2],
                                          geoms=MD_list[0][::100],
                                          #verbose=True,
                                          #proj_color_list=['red', 'blue', 'green'],
                                          widget=iwd
                                         )
iwd

Also, molpx.visualize.traj can help in visualizing these correlations by parsing along the tica object itself as projection=tica. Can you spot the differences: * In the nglwidget? * In the trajectories?

In [13]:
# Reuse the visualize.traj method with the tica object as input
__, myfig, iwd, __ = molpx.visualize.traj(MD_trajfiles,
                                          top,
                                          Y,
                                          #Y_filenames,
                                          plot_FES = True,
                                          dt = dt*1e-6, tunits='ms',
                                          active_traj=1,
                                          #traj_selection = 0,
                                          #sharey_traj=False,
                                          #max_frames=100,
                                          proj_idxs=[0,1],
                                          panel_height=2,
                                          projection=tica, ## this is what's new
                                          n_feats=3
                                         )

myfig.tight_layout()
iwd
/home/guille/Programs/molPX/molpx/visualize.py:344: RuntimeWarning: divide by zero encountered in log
  _plt.contourf(-_np.log(h).T, extent=irange)

Use a clustering object as input

If the dataset has already been clustered, and it is that clustering that the user wants to explore, molpx.generate.sample can take this clustering object as an input instead of the the projected trajectories:

In [14]:
# Do "some" clustering
clkmeans = pyemma.coordinates.cluster_kmeans([iY[:,:2] for iY in Y], 5)
17-03-17 11:11:13 pyemma.coordinates.clustering.kmeans.KmeansClustering[7] INFO     Algorithm did not reach convergence criterion of 1e-05 in 10 iterations. Consider increasing max_iter.
In [15]:
data_sample, geoms = molpx.generate.sample(MD_trajfiles, top, clkmeans,
                                     n_geom_samples=50,
                                     #keep_all_samples=True # read the doc for this argument
                                    )
In [16]:
# Plot clusters
plt.figure(figsize=(7,7))
plt.plot(clkmeans.clustercenters[:,0], clkmeans.clustercenters[:,1],' ok')
# FES as background is optional (change the bool to False)
if True:
    plt.contourf(x[:-1], y[:-1], -np.log(h.T), alpha=.50)

# Link the clusters positions with the molecular structures
iwdg = molpx.visualize.sample(data_sample,
                              geoms.superpose(geoms[0]),
                              plt.gca(),
                              clear_lines=False,
                              #plot_path=True
                            )
iwdg
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:6: RuntimeWarning: divide by zero encountered in log

Visual representations for MSMs

Visually inspect the network behind an MSM

In [17]:
MSM = pyemma.msm.estimate_markov_model(clkmeans.dtrajs, 20)
In [18]:
plt.figure(figsize=(7,7))

ax, pos  = pyemma.plots.plot_markov_model(MSM.P,
                                          minflux=5e-4,
                                          arrow_labels=None,
                                          ax=plt.gca(),
                                          arrow_curvature = 2, show_frame=True,
                                          pos=clkmeans.clustercenters)
# Add a background if wanted
h, (x, y) = np.histogramdd(np.vstack(Y)[:,:2], weights=np.hstack(MSM.trajectory_weights()),  bins=50)
plt.contourf(x[:-1], y[:-1], -np.log(h.T), cmap="jet", alpha=.5, zorder=0)
plt.xlim(x[[0,-1]])
plt.xticks(np.unique(x.round()))
plt.yticks(np.unique(y.round()))

plt.ylim(y[[0,-1]])

iwd = molpx.visualize.sample(pos, geoms, plt.gca())
iwd
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:11: RuntimeWarning: divide by zero encountered in log

TPT Reactive Pathway Representation

In [19]:
# Do an MSM with a realistic number of clustercenters
cl_many = pyemma.coordinates.cluster_regspace([iY[:,:2] for iY in Y], dmin=.25)
M = pyemma.msm.estimate_markov_model(cl_many.dtrajs, 20)
cl_many.n_clusters
Out[19]:
123
In [20]:
# Use this object to sample geometries
pos, geom = molpx.generate.sample(MD_trajfiles, top, cl_many)
In [21]:
# Find the most representative microstate of each
# and least populated macrostate
M.pcca(3)
dens_max_i = [distro.argmax() for distro in M.metastable_distributions]
A = np.argmax([M.stationary_distribution[iset].sum() for iset in M.metastable_sets])
B = np.argmin([M.stationary_distribution[iset].sum() for iset in M.metastable_sets])
print(cl_many.clustercenters[dens_max_i[A]],
      cl_many.clustercenters[dens_max_i[B]])
[-0.18704125 -0.77366424] [ 6.71851349  0.03159955]
In [22]:
# Create a TPT object with most_pop, least_pop as source, sink respectively
tpt = pyemma.msm.tpt(M, [dens_max_i[A]], [dens_max_i[B]])
paths, flux = tpt.pathways(fraction=.5)
In [23]:
# Get a path with a decent number of intermediates
sample_path = paths[np.argmax([len(ipath) for ipath in paths])]
In [24]:
plt.figure()
plt.contourf(x[:-1], y[:-1], -np.log(h.T), cmap="jet", alpha=.5, zorder=0)
iwd = molpx.visualize.sample(cl_many.clustercenters[sample_path],
                       geom[sample_path].superpose(geom[sample_path[0]]), plt.gca(),
                       plot_path=True,
                      )
plt.scatter(*cl_many.clustercenters.T, alpha=.25)
iwd
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:2: RuntimeWarning: divide by zero encountered in log
  from ipykernel import kernelapp as app
In [25]:
# Check
# https://github.com/arose/nglview/issues/518
# https://github.com/arose/nglview/issues/517