molPX intro

Guillermo Perez-Hernandez  guille.perez@fu-berlin.de

In this notebook we will be using the 1 millisecond trajectory of Bovine Pancreatic Trypsin Inhibitor (BPTI) generated by DE Shaw Research on the Anton Supercomputer and kindly made available by their lab. The original work is

  • Shaw DE, Maragakis P, Lindorff-Larsen K, Piana S, Dror RO, Eastwood MP, Bank JA, Jumper JM, Salmon JK, Shan Y, Wriggers W: Atomic-level characterization of the structural dynamics of proteins. Science 330:341-346 (2010). doi: 10.1126/science.1187409.

The trajectory has been duplicated and shortened to provide a mock-trajectory set and be able to deal with lists of trajectories of different lenghts:

  • c-alpha_centered.stride.100.xtc
  • c-alpha_centered.stride.100.reversed.xtc
  • c-alpha_centered.stride.100.halved.xtc

Input types and typical usecase

The typical usecase is having molecular dynamics (MD) simulation data in form of trajectory files with extensions like .xtc, .dcd etc and the associated molecular topology as a .pdb or .gro file.

These files are the most general starting point for any analysis dealing with MD, and molpx’s API has been designed to be able to function without further input:

[ ]:
top = 'notebooks/data/bpti-c-alpha_centered.pdb'
MD_trajfiles = ['notebooks/data/c-alpha_centered.stride.1000.xtc',
               'notebooks/data/c-alpha_centered.stride.1000.reversed.xtc',
                'notebooks/data/c-alpha_centered.stride.1000.halved.xtc'
               ]

dt = 244 #saving interval in the .xtc files, in ns

import molpx
from matplotlib import pylab as plt
%matplotlib ipympl
import pyemma
import numpy as np

# This way the user does not have to care where the data are:
top = molpx._molpxdir(top)
MD_trajfiles = [molpx._molpxdir(ff) for ff in MD_trajfiles]

However, molpx relies heavily on the awesome `mdtraj <http://www.mdtraj.org>`__ module for dealing with molecular structures, and so most of molpx’s functions accept also Trajectory-type objects (native to mdtraj) as alternative inputs.

[ ]:
# Create a memory representation of the trajectories
MD_list = [molpx.generate._md.load(itraj, top=top) for itraj in MD_trajfiles]

The same idea applies to the input of projected trajectories: molpx can take the filenames as inputs (.npy, .dat, .txt etc) or deal directly with numpy.ndarray objects.

** These alternative, “from-memory” input modes (md.Trajectory and np.ndarray objects) avoid forcing the user to read from file everytime an API function is called, saving I/O overhead**

The following cell either reads or generates projected trajectory files for this demonstration. In a real usecase this step (done here using TICA) might not be needed, given that the user might have generated the projected trajectory elsewhere:

[ ]:
# Perform TICA or read from file directly if already .npy-files exist
Y_filenames = [ff.replace('.xtc','.Y.npy') for ff in MD_trajfiles]
try:
    Y = [np.load(ff) for ff in Y_filenames]
except:
    feat = pyemma.coordinates.featurizer(top)
    pairs = feat.pairs(range(feat.topology.n_atoms)[::2])
    feat.add_distances(pairs)
    src  = pyemma.coordinates.source(MD_trajfiles, features=feat)
    tica = pyemma.coordinates.tica(src, lag=10, dim=3)
    Y = tica.get_output()
    [np.save(ff, iY) for ff, iY in zip(Y_filenames, Y)]

Visualize a FES and the molecular structures behind it

Execute the following cell and click either on the FES or on the slidebar. Some input parameters have been comented out for you to try out: * different modes of input (disk vs memory) * different projection indices * different number of overlaid structures

[ ]:
mpx_wdg_box = molpx.visualize.FES(MD_list,
                                 #MD_trajfiles,
                                 top,
                                 Y_filenames,
                                 #Y,
                                 nbins=50,
                                 #proj_idxs=[1,2],
                                 proj_labels='TIC',
                                 #n_overlays=5,
                                )
mpx_wdg_box

Visualize trajectories, FES and molecular structures

The user can sample structures as they occurr in sequence in the actual trajectory. Depending on the size of the dataset, this can be very time consuming, particularly if data is being read from disk.

In this example, try changing MD_trajfiles to MD_list and/or changing Y_filenames to simply Y and see if it helps.

Furthermore, the objects in memory can be strided down to fewer frames before being parsed to the method. To stride objects being read from the disk, use the stride parameter.

Again, we have commented out some parameters that provide more control on the output of visualize.traj. Uncomment them and see what happens

[ ]:
mpx_wdg_box = molpx.visualize.traj(MD_trajfiles,
                                   top,
                                   Y,
                                   #Y_filenames,
                                   plot_FES = True,
                                   dt = dt*1e-6, tunits='ms',
                                   #traj_selection = 1,
                                   #sharey_traj=False,
                                   #max_frames=100,
                                   proj_idxs=[0, 1],
                                   panel_height=2,
                                   #proj_labels='TIC'
                                  )
mpx_wdg_box

Intermediate steps: using molpx to generate a regspace sample of the data

See the documentation of molpx.generate.sample to find out about all possible options:

molpx.generate.sample(MD_trajectories, MD_top, projected_trajectories, atom_selection=None, proj_idxs=[0, 1], n_points=100, n_geom_samples=1, keep_all_samples=False, proj_stride=1, verbose=False, return_data=False)
[ ]:
data_sample, geoms = molpx.generate.sample(#MD_list,
                                           MD_trajfiles,
                                           top,
                                           #Y,
                                           Y_filenames,
                                           n_points=200   ,
                                            n_geom_samples=2,
                                    )
data_sample.shape, geoms

Paths samples along the different projections (=axis)

[ ]:
paths_dict, idata = molpx.generate.projection_paths(#MD_list,
                                                    MD_trajfiles,
                                                    top,
                                                    Y_filenames,
                                                    #Y, # You can also directly give the data here
                                                    n_points=50,
                                                    proj_idxs=[0,1],
                                                    n_projs=3,
                                                    proj_dim = 3,
                                                    verbose=False,
                                        )

Interaction with PyEMMA

molpx is using many methods of the coordinates submodule of PyEMMA, and thus it also understands some of PyEMMA’s classes as input (like clustering objects or streaming transformers). ## Using the TICA object to visualize the most correlated input features If the projected coordinates come from a TICA (or PCA) transformation, and the TICA object is available in memory molpx.visualize.traj can make use of correlation information to display not only the projected coordinates (i.e the TICs, in this case), but also the “original” input features behind it

[ ]:
# Re-do the TICA computation to make sure we have a tica object in memory
feat = pyemma.coordinates.featurizer(top)
pairs = feat.pairs(range(feat.topology.n_atoms)[::2])
feat.add_distances(pairs)
src  = pyemma.coordinates.source(MD_trajfiles, features=feat)
tica = pyemma.coordinates.tica(src, lag=10, dim=3)
Y = tica.get_output()

The method molpx.visualize.correlations tries to provide a visual representation of the projected coordinates by relating them to the input features, which carry more meaning, since they are (usually) familiar parameters such as atom distances, angles, contacts etc.

[ ]:
# Comment or uncomment the optinal parameters and see how the method reacts
# You can use a pre-instantiated the widget
#iwd = molpx.visualize._nglwidget_wrapper(MD_list[0][0])
# Or instantiate at the moment of calling visualize.correlations
iwd = None
corr, ngl_wdg = molpx.visualize.correlations(tica,
                                          n_feats=3,
                                          proj_idxs=[0,1,2],
                                          geoms=MD_list[0][::100],
                                          #verbose=True,
                                          #proj_color_list=['red', 'blue', 'green'],
                                          widget=iwd
                                         )
ngl_wdg

Also, molpx.visualize.traj can help in visualizing these correlations by parsing along the tica object itself as projection=tica. In the next cell, can you spot the differences: * In the nglwidget? * In the trajectories?

[ ]:
# Reuse the visualize.traj method with the tica object as input
mpx_wdg_box = molpx.visualize.traj(MD_trajfiles,
                                  top,
                                  Y,
                                  #Y_filenames,
                                  plot_FES = True,
                                  dt = dt*1e-6, tunits='ms',
                                  #traj_selection = 0,
                                  #sharey_traj=False,
                                  #max_frames=100,
                                  proj_idxs=[0,1],
                                  panel_height=1,
                                  projection=tica, ## this is what's new
                                  n_feats=2
                                 )

mpx_wdg_box

Use a clustering object as input

If the dataset has already been clustered, and it is that clustering that the user wants to explore, molpx.generate.sample can take this clustering object as an input instead of the the projected trajectories:

[ ]:
# Do "some" clustering
clkmeans = pyemma.coordinates.cluster_kmeans([iY[:,:2] for iY in Y], 5)
[ ]:
data_sample, geoms = molpx.generate.sample(MD_trajfiles, top, clkmeans,
                                     n_geom_samples=50,
                                     #keep_all_samples=True # read the doc for this argument
                                    )
[ ]:
# Plot clusters
plt.figure(figsize=(4,4))
plt.plot(clkmeans.clustercenters[:,0], clkmeans.clustercenters[:,1],' ok')
# FES as background is optional (change the bool to False)
if True:
    plt.contourf(x[:-1], y[:-1], -np.log(h.T), alpha=.50)

# Link the clusters positions with the molecular structures
linked_ngl_wdg = molpx.visualize.sample(data_sample,
                              geoms.superpose(geoms[0]),
                              plt.gca(),
                              clear_lines=False,
                              #plot_path=True
                            )
linked_ngl_wdg

Visual representations for MSMs

Visually inspect the network behind an MSM

[ ]:
MSM = pyemma.msm.estimate_markov_model(clkmeans.dtrajs, 20)
[ ]:
plt.figure(figsize=(4,4))

ax, pos  = pyemma.plots.plot_markov_model(MSM.P,
                                          minflux=5e-4,
                                          arrow_labels=None,
                                          ax=plt.gca(),
                                          arrow_curvature = 2, show_frame=True,
                                          pos=clkmeans.clustercenters)
# Add a background if wanted
h, (x, y) = np.histogramdd(np.vstack(Y)[:,:2], weights=np.hstack(MSM.trajectory_weights()),  bins=50)
plt.contourf(x[:-1], y[:-1], -np.log(h.T), cmap="jet", alpha=.5, zorder=0)
plt.xlim(x[[0,-1]])
plt.xticks(np.unique(x.round()))
plt.yticks(np.unique(y.round()))

plt.ylim(y[[0,-1]])

linked_ngl_wd, linked_ax_wd = molpx.visualize.sample(pos, geoms, plt.gca(), dot_color='blue')
linked_ngl_wd

TPT Reactive Pathway Representation

[ ]:
# Do an MSM with a realistic number of clustercenters
cl_many = pyemma.coordinates.cluster_regspace([iY[:,:2] for iY in Y], dmin=.25)
M = pyemma.msm.estimate_markov_model(cl_many.dtrajs, 20)
cl_many.n_clusters
[ ]:
# Use this object to sample geometries
pos, geom = molpx.generate.sample(MD_trajfiles, top, cl_many)
[ ]:
# Find the most representative microstate of each
# and least populated macrostate
M.pcca(3)
dens_max_i = [distro.argmax() for distro in M.metastable_distributions]
A = np.argmax([M.stationary_distribution[iset].sum() for iset in M.metastable_sets])
B = np.argmin([M.stationary_distribution[iset].sum() for iset in M.metastable_sets])
print(cl_many.clustercenters[dens_max_i[A]],
      cl_many.clustercenters[dens_max_i[B]])
[ ]:
# Create a TPT object with most_pop, least_pop as source, sink respectively
tpt = pyemma.msm.tpt(M, [dens_max_i[A]], [dens_max_i[B]])
paths, flux = tpt.pathways(fraction=.5)
[ ]:
# Get a path with a decent number of intermediates
sample_path = paths[np.argmax([len(ipath) for ipath in paths])]
[ ]:
plt.figure()
plt.contourf(x[:-1], y[:-1], -np.log(h.T), cmap="jet", alpha=.5, zorder=0)
linked_ngl_wdg, linked_ax_wdg = molpx.visualize.sample(cl_many.clustercenters[sample_path],
                       geom[sample_path].superpose(geom[sample_path[0]]), plt.gca(),
                       plot_path=True,
                      )
plt.scatter(*cl_many.clustercenters.T, alpha=.25)
linked_ngl_wdg
[ ]: