molPX intro¶
Guillermo Perez-Hernandez guille.perez@fu-berlin.de
In this notebook we will be using the 1 millisecond trajectory of Bovine Pancreatic Trypsin Inhibitor (BPTI) generated by DE Shaw Research on the Anton Supercomputer and kindly made available by their lab. The original work is
- Shaw DE, Maragakis P, Lindorff-Larsen K, Piana S, Dror RO, Eastwood MP, Bank JA, Jumper JM, Salmon JK, Shan Y, Wriggers W: Atomic-level characterization of the structural dynamics of proteins. Science 330:341-346 (2010). doi: 10.1126/science.1187409.
The trajectory has been duplicated and shortened to provide a mock-trajectory set and be able to deal with lists of trajectories of different lenghts:
c-alpha_centered.stride.100.xtc
c-alpha_centered.stride.100.reversed.xtc
c-alpha_centered.stride.100.halved.xtc
Input types and typical usecase¶
The typical usecase is having molecular dynamics (MD) simulation data in
form of trajectory files with extensions like .xtc, .dcd
etc and the
associated molecular topology as a .pdb
or .gro
file.
These files are the most general starting point for any analysis dealing
with MD, and molpx
‘s API has been designed to be able to function
without further input:
In [1]:
top = 'notebooks/data/bpti-c-alpha_centered.pdb'
MD_trajfiles = ['notebooks/data/c-alpha_centered.stride.1000.xtc',
'notebooks/data/c-alpha_centered.stride.1000.reversed.xtc',
'notebooks/data/c-alpha_centered.stride.1000.halved.xtc'
]
dt = 244 #saving interval in the .xtc files, in ns
import molpx
from matplotlib import pylab as plt
%matplotlib notebook
import pyemma
import numpy as np
# This way the user does not have to care where the data are:
top = molpx._molpxdir(top)
MD_trajfiles = [molpx._molpxdir(ff) for ff in MD_trajfiles]
However, molpx
relies heavily on the awesome
`mdtraj
<http://www.mdtraj.org>`__ module for dealing with molecular
structures, and so most of molpx
‘s functions accept also
Trajectory
-type objects (native to mdtraj
) as alternative
inputs.
In [2]:
# Create a memory representation of the trajectories
MD_list = [molpx.generate._md.load(itraj, top=top) for itraj in MD_trajfiles]
The same idea applies to the input of projected trajectories: molpx
can take the filenames as inputs (.npy
, .dat
, .txt
etc) or
deal directly with numpy.ndarray
objects.
** These alternative, “from-memory” input modes (md.Trajectory
and
np.ndarray
objects) avoid forcing the user to read from file
everytime an API function is called, saving I/O overhead**
The following cell either reads or generates projected trajectory files for this demonstration. In a real usecase this step (done here using TICA) might not be needed, given that the user might have generated the projected trajectory elsewhere:
In [3]:
# Perform TICA or read from file directly if already .npy-files exist
Y_filenames = [ff.replace('.xtc','.Y.npy') for ff in MD_trajfiles]
try:
Y = [np.load(ff) for ff in Y_filenames]
except:
feat = pyemma.coordinates.featurizer(top)
pairs = feat.pairs(range(feat.topology.n_atoms)[::2])
feat.add_distances(pairs)
src = pyemma.coordinates.source(MD_trajfiles, features=feat)
tica = pyemma.coordinates.tica(src, lag=10, dim=3)
Y = tica.get_output()
[np.save(ff, iY) for ff, iY in zip(Y_filenames, Y)]
Visualize a FES and the molecular structures behind it¶
Execute the following cell and click either on the FES or on the slidebar. Some input parameters have been comented out for you to try out different modes of input (disk vs memory) as well as different projection indices:
In [4]:
ax, fig, iwd, data_sample, geom = molpx.visualize.FES(
MD_list,
#MD_trajfiles,
top,
Y_filenames,
#Y,
nbins=50,
#proj_idxs=[1,2],
axlabel='TIC',
)
iwd
Visualize trajectories, FES and molecular structures¶
The user can sample structures as they occurr in sequence in the actual trajectory. Depending on the size of the dataset, this can be very time consuming, particularly if data is being read from disk.
In this example, try changing MD_trajfiles
to MD_list
and/or
changing Y_filenames
to simply Y
and see if it helps.
Furthermore, the objects in memory can be strided down to fewer frames
before being parsed to the method. To stride objects being read from
the disk, use the stride
parameter.
Other commented parameters provide more control on the output of
visualize.traj
. Uncomment them and see what happens
In [5]:
__, myfig, iwd, __ = molpx.visualize.traj(MD_trajfiles,
top,
Y,
#Y_filenames,
plot_FES = True,
dt = dt*1e-6, tunits='ms',
active_traj=0,
#traj_selection = 1,
#sharey_traj=False,
#max_frames=100,
proj_idxs=[0, 1],
panel_height=2,
)
myfig.tight_layout()
iwd
/home/guille/Programs/molPX/molpx/visualize.py:344: RuntimeWarning: divide by zero encountered in log
_plt.contourf(-_np.log(h).T, extent=irange)
Intermediate steps: using molpx to generate a regspace sample of the data¶
See the documentation of molpx.generate.sample
to find out about all
possible options:
molpx.generate.sample(MD_trajectories, MD_top, projected_trajectories, proj_idxs=[0, 1], n_points=100, n_geom_samples=1, keep_all_samples=False, proj_stride=1, verbose=False, return_data=False)
In [6]:
data_sample, geoms = molpx.generate.sample(#MD_list,
MD_trajfiles,
top,
#Y,
Y_filenames,
n_points=200
)
data_sample.shape, geoms
Out[6]:
((192, 2),
<mdtraj.Trajectory with 192 frames, 58 atoms, 58 residues, and unitcells at 0x7fd65922c1d0>)
Link the PDF plot with the sampled structures and visually explore the FES¶
Click either on the plot or on the widget slidebar: they’re connected!
In [7]:
# Replot the FES
plt.figure(figsize=(7,7))
h, (x,y) = np.histogramdd(np.vstack(Y)[:,:2], bins=50)
plt.contourf(x[:-1], y[:-1], -np.log(h.T), alpha=.50)
# Create the linked widget
linked_wdg = molpx.visualize.sample(data_sample,
geoms.superpose(geoms[0]),
plt.gca(),
clear_lines=True,
#plot_path=True
)
plt.plot(data_sample[:,0], data_sample[:,1],' ok', zorder=0)
# Show it
linked_wdg
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:4: RuntimeWarning: divide by zero encountered in log
Paths samples along the different projections (=axis)¶
In [8]:
paths_dict, idata = molpx.generate.projection_paths(#MD_list,
MD_trajfiles,
top,
Y_filenames,
#Y, # You can also directly give the data here
n_points=50,
proj_idxs=[0,1],
n_projs=3,
proj_dim = 3,
verbose=False,
)
Link the PDF plot with the sampled paths/structures and visually explore the coordinates (separately).¶
Click either on the plot or on the widget slidebar: they’re connected! You can change the type of path between min_rmsd or min_disp and you can also change the coordinate sampled (0 or 1)
In [9]:
# Choose the coordinate and the tyep of path
coord = 1
#path_type = 'min_rmsd'
path_type = 'min_disp'
igeom = paths_dict[coord][path_type]["geom"]
ipath = paths_dict[coord][path_type]["proj"]
# Choose the proj_idxs for the path and the FES
# to be shown
proj_idxs = [0,1]
In [10]:
plt.figure(figsize=(7,7))
h, (x,y) = np.histogramdd(np.vstack(Y)[:,proj_idxs], bins=50)
plt.contourf(x[:-1], y[:-1], -np.log(h.T), alpha=.50)
linked_wdg = molpx.visualize.sample(ipath[:,proj_idxs],
igeom.superpose(igeom[0]),
plt.gca(),
clear_lines=True,
n_smooth = 5,
plot_path=True,
#radius=True,
)
linked_wdg
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:3: RuntimeWarning: divide by zero encountered in log
app.launch_new_instance()
Interaction with PyEMMA
¶
molpx
is using many methods of the coordinates
submodule of
PyEMMA
, and thus it also understands some of PyEMMA
‘s classes as
input (like clustering objects or streaming transformers). ## Using the
TICA object to visualize the most correlated input features If the
projected coordinates come from a TICA (or PCA) transformation, and the
TICA object is available in memory molpx.visualize.traj
can make use
of correlation information to display not only the projected coordinates
(i.e the TICs, in this case), but also the “original” input features
behind it
In [11]:
# Re-do the TICA computation to make sure we have a tica object in memory
feat = pyemma.coordinates.featurizer(top)
pairs = feat.pairs(range(feat.topology.n_atoms)[::2])
feat.add_distances(pairs)
src = pyemma.coordinates.source(MD_trajfiles, features=feat)
tica = pyemma.coordinates.tica(src, lag=10, dim=3)
Y = tica.get_output()
The method molpx.visualize.correlations
tries to provide a visual
representation of the projected coordinates by relating them to the
input features, which carry more meaning, since they are (usually)
familiar parameters such as atom distances, angles, contacts etc.
In [28]:
# Comment or uncomment the optinal parameters and see how the method reacts
#iwd = visualize._nglview.show_mdtraj(MD_list[0][0])
iwd = None
corr, iwd = molpx.visualize.correlations(tica,
n_feats=3,
proj_idxs=[0,1,2],
geoms=MD_list[0][::100],
#verbose=True,
#proj_color_list=['red', 'blue', 'green'],
widget=iwd
)
iwd
Also, molpx.visualize.traj
can help in visualizing these
correlations by parsing along the tica object itself as
projection=tica
. Can you spot the differences: * In the nglwidget?
* In the trajectories?
In [13]:
# Reuse the visualize.traj method with the tica object as input
__, myfig, iwd, __ = molpx.visualize.traj(MD_trajfiles,
top,
Y,
#Y_filenames,
plot_FES = True,
dt = dt*1e-6, tunits='ms',
active_traj=1,
#traj_selection = 0,
#sharey_traj=False,
#max_frames=100,
proj_idxs=[0,1],
panel_height=2,
projection=tica, ## this is what's new
n_feats=3
)
myfig.tight_layout()
iwd
/home/guille/Programs/molPX/molpx/visualize.py:344: RuntimeWarning: divide by zero encountered in log
_plt.contourf(-_np.log(h).T, extent=irange)
Use a clustering object as input¶
If the dataset has already been clustered, and it is that clustering
that the user wants to explore, molpx.generate.sample
can take this
clustering object as an input instead of the the projected trajectories:
In [14]:
# Do "some" clustering
clkmeans = pyemma.coordinates.cluster_kmeans([iY[:,:2] for iY in Y], 5)
17-03-17 11:11:13 pyemma.coordinates.clustering.kmeans.KmeansClustering[7] INFO Algorithm did not reach convergence criterion of 1e-05 in 10 iterations. Consider increasing max_iter.
In [15]:
data_sample, geoms = molpx.generate.sample(MD_trajfiles, top, clkmeans,
n_geom_samples=50,
#keep_all_samples=True # read the doc for this argument
)
In [16]:
# Plot clusters
plt.figure(figsize=(7,7))
plt.plot(clkmeans.clustercenters[:,0], clkmeans.clustercenters[:,1],' ok')
# FES as background is optional (change the bool to False)
if True:
plt.contourf(x[:-1], y[:-1], -np.log(h.T), alpha=.50)
# Link the clusters positions with the molecular structures
iwdg = molpx.visualize.sample(data_sample,
geoms.superpose(geoms[0]),
plt.gca(),
clear_lines=False,
#plot_path=True
)
iwdg
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:6: RuntimeWarning: divide by zero encountered in log
Visual representations for MSMs¶
Visually inspect the network behind an MSM
In [17]:
MSM = pyemma.msm.estimate_markov_model(clkmeans.dtrajs, 20)
In [18]:
plt.figure(figsize=(7,7))
ax, pos = pyemma.plots.plot_markov_model(MSM.P,
minflux=5e-4,
arrow_labels=None,
ax=plt.gca(),
arrow_curvature = 2, show_frame=True,
pos=clkmeans.clustercenters)
# Add a background if wanted
h, (x, y) = np.histogramdd(np.vstack(Y)[:,:2], weights=np.hstack(MSM.trajectory_weights()), bins=50)
plt.contourf(x[:-1], y[:-1], -np.log(h.T), cmap="jet", alpha=.5, zorder=0)
plt.xlim(x[[0,-1]])
plt.xticks(np.unique(x.round()))
plt.yticks(np.unique(y.round()))
plt.ylim(y[[0,-1]])
iwd = molpx.visualize.sample(pos, geoms, plt.gca())
iwd
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:11: RuntimeWarning: divide by zero encountered in log
TPT Reactive Pathway Representation¶
In [19]:
# Do an MSM with a realistic number of clustercenters
cl_many = pyemma.coordinates.cluster_regspace([iY[:,:2] for iY in Y], dmin=.25)
M = pyemma.msm.estimate_markov_model(cl_many.dtrajs, 20)
cl_many.n_clusters
Out[19]:
123
In [20]:
# Use this object to sample geometries
pos, geom = molpx.generate.sample(MD_trajfiles, top, cl_many)
In [21]:
# Find the most representative microstate of each
# and least populated macrostate
M.pcca(3)
dens_max_i = [distro.argmax() for distro in M.metastable_distributions]
A = np.argmax([M.stationary_distribution[iset].sum() for iset in M.metastable_sets])
B = np.argmin([M.stationary_distribution[iset].sum() for iset in M.metastable_sets])
print(cl_many.clustercenters[dens_max_i[A]],
cl_many.clustercenters[dens_max_i[B]])
[-0.18704125 -0.77366424] [ 6.71851349 0.03159955]
In [22]:
# Create a TPT object with most_pop, least_pop as source, sink respectively
tpt = pyemma.msm.tpt(M, [dens_max_i[A]], [dens_max_i[B]])
paths, flux = tpt.pathways(fraction=.5)
In [23]:
# Get a path with a decent number of intermediates
sample_path = paths[np.argmax([len(ipath) for ipath in paths])]
In [24]:
plt.figure()
plt.contourf(x[:-1], y[:-1], -np.log(h.T), cmap="jet", alpha=.5, zorder=0)
iwd = molpx.visualize.sample(cl_many.clustercenters[sample_path],
geom[sample_path].superpose(geom[sample_path[0]]), plt.gca(),
plot_path=True,
)
plt.scatter(*cl_many.clustercenters.T, alpha=.25)
iwd
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:2: RuntimeWarning: divide by zero encountered in log
from ipykernel import kernelapp as app
In [25]:
# Check
# https://github.com/arose/nglview/issues/518
# https://github.com/arose/nglview/issues/517