Welcome to molPX: The Molecular Projection Explorer¶
The Molecular Projection Explorer, molPX, is a python module that provides interactive visualization of projected coordinates of molecular dynamics (MD) trajectories inside a Jupyter notebook.
molPX is based on the incredibly useful nglview IPython/Jupyter widget. Other libraries heavily used are mdtraj and PyEMMA. At the moment, there is also an sklearn dependency that might disappear in the future.

At the moment the API consists of two subpackages:
TL;DR: see molPX in action through the
Find more about the people behind molPX here:
Download and Install¶
If you can’t wait to play around with molPX, and you have the Anaconda scientifc python distribution (which we strongly recommend), the easiest way to get molPX is to issue the conda command:
>>> conda install molPX -c omnia
and jump to the Quick Start section of this document. Otherwise, check out our more exhaustive
Quick Start¶
Start an IPython
console
>>> ipython
Import molpx
and let the example notebook guide you
>>> import molpx
>>> molpx.example_notebook()
Voilà: you should be looking at a jupyter notebook explaining the basic functionality of molPX
Documentation¶
You can find the latest documentation online here You can build a local copy of the html documentation by navigating to the molPX installation directory and issuing:
>>> cd doc
>>> make html
This will generate molPX/docs/build/html/index.html with the html documentation. If you are missing some of the requirements for the documentation , issue:
>>> pip install -r ./source/doc_requirements.txt
If you don’t know where molPX is installed, you can find out this way:
>>> ipython
>>> import molpx
>>> molpx._molpxdir()
The output of the last command is one subdirectory of molPX’s installation directory, so just copy it and issue:
>>> cd the-output-of-the-molpx._molpxdir-command
>>> cd ..
and you are there !
Warnings¶
molPX is currently under heavy development and the API might change rapidly. Stay tuned.
Data Privacy Statement¶
When you import this Python package, some of your metadata is sent to our servers. These are:
- molPX version
- Python version
- Operating System
- Hostname/ mac address of the accessing computer
- Time of retrieval
How to disable this feature easily:¶
Even before you use molPX for the first time:
- Create a hidden folder .molpx in your home folder
- Create a file conf_molpx.py inside of .molpx with the following line: report_status = False
- Restart your ipython/jupyter sessions
Hints:
This is most easily realized from terminal by issuing:
>>> mkdir ~/.molpx >>> echo "report_status = False" >> ~/.molpx/conf_molpx.py
You can check your report status anytime by typing this line in a (i)python terminal
>>> import molpx >>> molpx._report_status()
If you don’t know where your home folder is (for whatever reason), you can find it out by typing in a (i)python terminal
>>> import os >>> os.path.expanduser('~/.molpx')
molpx.visualize¶
The core functionality is to link two interative figures, fig1 and fig2, inside an Ipython/Jupyter notebook, so that an action in fig1 (e.g.a click of the mouse or a slide of a slidebar) will trigger an event in fig2 (e.g. a frame update or point moved) and vice versa. Usually, these two figures contain representations from:
- molecules: an nglviewer widget showing the molecular structure that a particular value of is associated with and
- projected coordinates: a matplotlib figure showing the projected coordinates (e.g. TICs or PCs or any other), \({Y_0, ..., Y_N}\), either as a 2D histogram, \(PDF(Y_i, Y_j)\) or as trajectory views \({Y_0(t), ...Y_N(t)}\)
You are strongly encouraged to check nglview’ documentation, since its functionalities extend beyond the scope of this package and the molecular visualization universe is rich and complex (unlike this module).
The three methods offered by this module are:
molpx.visualize.FES (MD_trajectories, MD_top, ...) |
Return a molecular visualization widget connected with a free energy plot. |
molpx.visualize.sample (positions, geom, ax) |
Visualize the geometries in geom according to the data in positions on an existing matplotlib axes ax |
molpx.visualize.traj (MD_trajectories, ...[, ...]) |
Link one or many projected trajectories , [Y_0(t), Y_1(t)...], with the MD_trajectories that originated them. |
-
molpx.visualize.
FES
(MD_trajectories, MD_top, projected_trajectory, proj_idxs=[0, 1], nbins=100, n_sample=100, axlabel='proj')¶ Return a molecular visualization widget connected with a free energy plot.
Parameters: - MD_trajectories (str, or list of strings with the filename(s) the the molecular dynamics (MD) trajectories.) –
Any file extension that
mdtraj
(.xtc, .dcd etc) can read is accepted.Alternatively, a single
mdtraj.Trajectory
object or a list of them can be given as input. - MD_top (str to topology filename or directly an
mdtraj.Topology
object) – - projected_trajectory (str to a filename or numpy ndarray of shape (n_frames, n_dims)) – Time-series with the projection(s) that want to be explored. If these have been computed externally, you can provide .npy-filenames or readable asciis (.dat, .txt etc). NOTE: molpx assumes that there is no time column.
- proj_idxs (list or ndarray of length 2) – Selection of projection idxs (zero-idxd) to visualize.
- nbins (int, default 100) – The number of bins per axis to used in the histogram (FES)
- n_sample (int, default is 100) – The number of geometries that will be used to represent the FES. The higher the number, the higher the spatial resolution of the “click”-action.
- axlabel (str, default is 'proj') – Format of the labels in the FES plot
Returns: - ax –
pylab.Axis
object - iwd –
nglview.NGLWidget
- data_sample – numpy ndarray of shape (n, n_sample) with the position of the dots in the plot
- geoms –
mdtraj.Trajectory
object with the geometries n_sample geometries shown by the nglwidget
- MD_trajectories (str, or list of strings with the filename(s) the the molecular dynamics (MD) trajectories.) –
-
molpx.visualize.
correlations
(correlation_input, geoms=None, proj_idxs=None, feat_name=None, widget=None, proj_color_list=None, n_feats=1, verbose=False)¶ - Provide a visual and textual representation of the linear correlations between projected coordinates (PCA, TICA)
- and original features.
Parameters: - correlation_input (anything) – Something that could, in principle, be a :obj:`pyemma.coordinates.transformer, like a TICA or PCA object (this method will be extended to interpret other inputs, so for now this parameter is pretty flexible)
- geoms (None or obj:md.Trajectory, default is None) –
The values of the most correlated features will be returned for the geometires in this object. If widget is left to its default, None,
correlations
will create a new widget and try to show the most correlatedfeatures on top of the widget - widget (None or nglview widget) –
Provide an already existing widget to visualize the correlations on top of. This is only for expert use, because no checks are done to see if
correlation_input
and the geometry contained in the widget actually match. Use with caution.Note
- When objects
geoms
andwidget
are provided simultaneously, three things happen: - no new widget will be instantiated
- the display of features will be on top of whatever geometry
widget
contains - the value of the features is computed for the geometry of
geom
Use with caution and clean bookkeeping!
- When objects
- proj_color_list (list, default is None) – projection specific list of colors to provide the representations with. The default None yields blue. In principle, the list can contain one color for each projection (= as many colors as len(proj_idxs) but if your list is short it will just default to the last color. This way, proj_color_list=[‘black’] will paint all black regardless len(proj_idxs)
- proj_idxs: None, or int, or iterable of integers, default is None
- The indices of the projections for which the most correlated feture will be returned If none it will default to the dimension of the correlation_input object
- feat_name : None or str, default is None
- The prefix with which to prepend the labels of the most correlated features. If left to None, the feature
description found in
correlation_input
will be used (if available) - n_feats : int, default is 1
- Number of argmax correlation to return for each feature.
- verbose : Bool, default is True
- print to standard output
Returns: most_corr_idxs, most_corr_vals, most_corr_labels, most_corr_feats, most_corr_atom_idxs, lines, widget, lines
-
molpx.visualize.
sample
(positions, geom, ax, plot_path=False, clear_lines=True, n_smooth=0, widget=None, **link_ax2wdg_kwargs)¶ Visualize the geometries in
geom
according to the data inpositions
on an existing matplotlib axesax
Use this method when the array of positions, the geometries, the axes (and the widget, optionally) have already been generated elsewhere.
Parameters: - positions (numpy nd.array of shape (n_frames, 2)) – Contains the position associated with each frame in
geom
in that order - geom (
mdtraj.Trajectory
object) – Contains n_frames, each frame - ax (matplotlib.pyplot.Axes object) – The axes to be linked with the nglviewer widget
- plot_path (bool, default is False) – whether to draw a line connecting the positions in
positions
- clear_lines (bool, default is True) – whether to clear all the lines that were previously drawn in
ax
- n_smooth (int, default is 0,) – if n_smooth > 0, the shown geometries and paths will be smoothed out by 2*n frames.
See
bmutils.smooth_geom
for more information - widget (None or existing nglview widget) – you can provide an already instantiated nglviewer widget here (avanced use)
- link_ax2wdg_kwargs (dictionary of named arguments, optional) – named arguments for the function
_link_ax_w_pos_2_nglwidget
, which is the one that internally provides the interactivity. Non-expert users can safely ignore this option.
Returns: iwd
Return type: - positions (numpy nd.array of shape (n_frames, 2)) – Contains the position associated with each frame in
-
molpx.visualize.
traj
(MD_trajectories, MD_top, projected_trajectories, active_traj=0, max_frames=2000, stride=1, proj_stride=1, proj_idxs=[0, 1], plot_FES=False, panel_height=1, sharey_traj=True, dt=1.0, tunits='frames', traj_selection=None, projection=None, n_feats=1)¶ Link one or many
projected trajectories
, [Y_0(t), Y_1(t)...], with theMD_trajectories
that originated them.Optionally plot also the resulting FES.
Parameters: - MD_trajectories (str, or list of strings with the filename(s) the the molecular dynamics (MD) trajectories.) –
Any file extension that
mdtraj
(.xtc, .dcd etc) can read is accepted.Alternatively, a single
mdtraj.Trajectory
object or a list of them can be given as input. - MD_top (str to topology filename or directly
mdtraj.Topology
object) – - projected_trajectories (str to a filename or numpy ndarray of shape (n_frames, n_dims)) – Time-series with the projection(s) that want to be explored. If these have been computed externally, you can provide .npy-filenames or readable asciis (.dat, .txt etc). NOTE: molpx assumes that there is no time column.
- active_traj (int, default 0) – Index of the trajectory that will be responsive. (zero-indexing)
- max_frames (int, default is 1000) – If the trajectoy is longer than this, stride to this length (in frames)
- stride (int, default is 1) – Stride value in case of large datasets. In case of having
MD_trajectories
andprojected_trajectories
in memory (and not on disk) the stride can take place also before calling this method. - proj_stride (int, default is 1) – Stride value that was used in the
projected_trajectories
relative to theMD_trajectories
If the originalMD_trajectories
were stored every 5 ps but the projected trajectories were stored every 50 ps,proj_stride
= 10 has to be provided, otherwise an exception will be thrown informing the user that theMD_trajectories
and theprojected_trajectories
have different number of frames. - proj_idxs (iterable of ints, default is [0,1]) – Indices of the projected coordinates to use in the various representations
- plot_FES (bool, default is False) – Plot (and interactively link) the FES as well
- panel_height (int, default 1) – Height, in inches, of each panel of each trajectory subplots
- sharey_traj (bool, default is True) – Force the panels of each projection to have the same yaxes across trajectories (Note: Not across coordinates)
- dt (float, default is 1.0) – Physical time-unit equivalent to one frame of the
projected_trajectories
- tunits (str, default is 'frames') – Name of the physical time unit provided in
dt
- traj_selection (None, int, iterable of ints, default is None) – Don’t plot all trajectories but only few of them. The default None implies that all trajs will be plotted. Note: the data used for the FES will always include all trajectories, regardless of this value
- projection (object that generated the projection, default is None) –
The projected coordinates may come from a variety of sources. When working with pyemma a number of objects might have generated this projection, like a *
pyemma.coordinates.transform.TICA
or a *pyemma.coordinates.transform.PCA
or aPass this object along and observe and the features that are most correlated with the projections will be plotted for the active trajectory, allowing the user to establish a visual connection between the projected coordinate and the original features (distances, angles, contacts etc)
- n_feats (int, default is 1) – If a
projection
is passed along, the first n_feats features that most correlate the the projected trajectories will be represented, both in form of trajectories feat vs t as well as in the nglwidget
Returns: - ax, iwd, data_sample, geoms – return _plt.gca(), _plt.gcf(), widget, geoms
- ax –
pylab.Axis
object - fig –
pylab.Figure
object - iwd –
nglview.NGLWidget
- geoms –
mdtraj.Trajectory
object with the geometries n_sample geometries shown by the nglwidget
- MD_trajectories (str, or list of strings with the filename(s) the the molecular dynamics (MD) trajectories.) –
molpx.generate¶
This module contains methods that generate the needed objects for visualize
of the methods to work.
molpx.generate.projection_paths (...[, ...]) |
Return a path along a given projection. |
molpx.generate.sample (MD_trajectories, ...) |
Returns a sample of molecular geometries and their positions in the projected space |
-
molpx.generate.
projection_paths
(MD_trajectories, MD_top, projected_trajectories, n_projs=1, proj_dim=2, proj_idxs=None, n_points=100, n_geom_samples=100, proj_stride=1, history_aware=True, verbose=False, minRMSD_selection='backbone')¶ Return a path along a given projection. More info on what this means exactly will follow soon.
Parameters: - MD_trajectories (str, or list of strings with the filename(s) the the molecular dynamics (MD) trajectories.) –
Any file extension that
mdtraj
(.xtc, .dcd etc) can read is accepted.Alternatively, a single
mdtraj.Trajectory
object or a list of them can be given as input. - MD_top (str to topology filename or directly
mdtraj.Topology
object) – - projected_trajectories (str to a filename or numpy ndarray of shape (n_frames, n_dims)) – Time-series with the projection(s) that want to be explored. If these have been computed externally, you can provide .npy-filenames or readable asciis (.dat, .txt etc). NOTE: molpx assumes that there is no time column.
- n_projs (int, default is 1) – Number of projection paths to generate. If the input
projected_trajectories
are n-dimensional, in principle up to n-paths can be generated - proj_dim (int, default is 2) – Dimensionality of the space in which distances will be computed
- proj_idxs (int, defaultis None) – Selection of projection idxs (zero-idxd) to visualize. The default behaviour is that proj_idxs = range(n_projs). However, if proj_idxs != None, then n_projs is ignored and proj_dim is set automatically
- n_points (int, default is 100) – Number of points along the projection path. The higher this number, the higher the projected coordinate is resolved, at the cost of more computational effort. It’s a trade-off parameter
- n_geom_samples (int, default is 100) – For each of the
n_points
along the projection path,n_geom_samples
will be retrieved from the trajectory files. The higher this number, the smoother the minRMSD projection path. Also, the longer it takes for the path to be computed - proj_stride (int, default is 1) – The stride of the
projected_trajectories
relative to theMD_trajectories
. This will play a role particularly ifprojected_trajectories
is already strided (because the user is holding it in memory) but the MD-data on disk has not been strided. - history_aware (bool, default is True) – The path-searching algorigthm the can minimize distances between adjacent points along the path or minimize the distance between each point and the mean value of all the other up to that point. Use this parameter to avoid a situation in which the path gets “derailed” because an outlier is chosen at a given point.
- verbose (bool, default is False) – The verbosity level
- minRMSD_selection (str, default is 'backbone') – When computing minRMSDs between a given point and adjacent candidates, use this string to select the atoms that will be considered. Check mdtraj’s selection language here http://mdtraj.org/latest/atom_selection.html
Returns: dictionary of dictionaries containing the projection paths.
paths_dict[idxs][type_of_path]
- idxs represent the index of the projected coordinate ([0], [1]...)
- types of paths “min_rmsd” or “min_disp”
What the dictionary actually contains
paths_dict[idxs][type_of_path]["proj"]
: ndarray of shape (n_points, proj_dim) with the coordinates of the projection along the pathpaths_dict[idxs][type_of_path]["geom"]
:mdtraj.Trajectory
geometries along the path
Return type: paths_dict
- idata :
- list of ndarrays with the the data in
projected_trajectories
- MD_trajectories (str, or list of strings with the filename(s) the the molecular dynamics (MD) trajectories.) –
-
molpx.generate.
sample
(MD_trajectories, MD_top, projected_trajectories, proj_idxs=[0, 1], n_points=100, n_geom_samples=1, keep_all_samples=False, proj_stride=1, verbose=False, return_data=False)¶ Returns a sample of molecular geometries and their positions in the projected space
Parameters: - MD_trajectories (list of strings) – Filenames (any extension that
mdtraj
can read is accepted) containing the trajectory data. There is an untested input mode where the user parses directlymdtraj.Trajectory
objects - MD_top (str to topology filename or directly
mdtraj.Topology
object) – - projected_trajectories ((lists of) strings or (lists of) numpy ndarrays of shape (n_frames, n_dims)) – Time-series with the projection(s) that want to be explored. If these have been computed externally, you can provide .npy-filenames or readable asciis (.dat, .txt etc). Alternatively, you can feed in your own clustering object. NOTE: molpx assumes that there is no time column.
- proj_idxs (int, default is None) – Selection of projection idxs (zero-idxd) to visualize. The default behaviour is that proj_idxs = range(n_projs). However, if proj_idxs != None, then n_projs is ignored and proj_dim is set automatically
- n_points (int, default is 100) – Number of points along the projection path. The higher this number, the higher the projected coordinate is resolved, at the cost of more computational effort. It’s a trade-off parameter
- n_points – For each of the
n_points
along the projection path,n_geom_samples
will be retrieved from the trajectory files. The higher this number, the smoother the minRMSD projection path. Also, the longer it takes for the path to be computed - n_geom_samples (int, default is 1) – This is a trade-off parameter between how smooth the transitons between geometries can be and how long it takes to generate the sample
- keep_all_samples (boolean, default is False) – In principle, once the closest-to-ref geometry has been kept, the other geometries are discarded, and the output sample contains only n_point geometries. HOWEVER, there are special cases where the user might want to keep all sampled geometries. Typical use-case is when the n_points is low and many representatives per clustercenters will be much more informative than the other way around (i know, this is confusing TODO: write this better)
Returns: - pos – ndarray with the positions of the sample
- geom_smpl –
mdtraj.Trajectory
object with the sampled geometries
- MD_trajectories (list of strings) – Filenames (any extension that
Example Jupyter Notebook¶
There are several ways to see the example notebook, which you can find in the molpx/notebooks/
installation directory.
- Play:
molpx
has a method that will launch a working, temporary copy of the example notebook. From an IPython console, just type:>>>> import molpx >>>> molpx.example_notebook()
A Juypter notebook should automagically appear in front of you after a few seconds. This is the most interactive and usefull way to see
molpx
in action, but you’ll only have access to it after downloading and installingmolpx
- Read:
The links you see below are an html-rendered version of the notebook. Click on them to navigate the notebook. Unfortunately for this html documentation,
nglview
‘s output, i.e. the pictures of molecular structures, cannot be stored currently in the notebook file. In short: the html-notebook is lacking the most visually appealing part ofmolpx
.
molPX intro¶
Guillermo Perez-Hernandez guille.perez@fu-berlin.de
In this notebook we will be using the 1 millisecond trajectory of Bovine Pancreatic Trypsin Inhibitor (BPTI) generated by DE Shaw Research on the Anton Supercomputer and kindly made available by their lab. The original work is
- Shaw DE, Maragakis P, Lindorff-Larsen K, Piana S, Dror RO, Eastwood MP, Bank JA, Jumper JM, Salmon JK, Shan Y, Wriggers W: Atomic-level characterization of the structural dynamics of proteins. Science 330:341-346 (2010). doi: 10.1126/science.1187409.
The trajectory has been duplicated and shortened to provide a mock-trajectory set and be able to deal with lists of trajectories of different lenghts:
c-alpha_centered.stride.100.xtc
c-alpha_centered.stride.100.reversed.xtc
c-alpha_centered.stride.100.halved.xtc
Input types and typical usecase¶
The typical usecase is having molecular dynamics (MD) simulation data in
form of trajectory files with extensions like .xtc, .dcd
etc and the
associated molecular topology as a .pdb
or .gro
file.
These files are the most general starting point for any analysis dealing
with MD, and molpx
‘s API has been designed to be able to function
without further input:
In [1]:
top = 'notebooks/data/bpti-c-alpha_centered.pdb'
MD_trajfiles = ['notebooks/data/c-alpha_centered.stride.1000.xtc',
'notebooks/data/c-alpha_centered.stride.1000.reversed.xtc',
'notebooks/data/c-alpha_centered.stride.1000.halved.xtc'
]
dt = 244 #saving interval in the .xtc files, in ns
import molpx
from matplotlib import pylab as plt
%matplotlib notebook
import pyemma
import numpy as np
# This way the user does not have to care where the data are:
top = molpx._molpxdir(top)
MD_trajfiles = [molpx._molpxdir(ff) for ff in MD_trajfiles]
However, molpx
relies heavily on the awesome
`mdtraj
<http://www.mdtraj.org>`__ module for dealing with molecular
structures, and so most of molpx
‘s functions accept also
Trajectory
-type objects (native to mdtraj
) as alternative
inputs.
In [2]:
# Create a memory representation of the trajectories
MD_list = [molpx.generate._md.load(itraj, top=top) for itraj in MD_trajfiles]
The same idea applies to the input of projected trajectories: molpx
can take the filenames as inputs (.npy
, .dat
, .txt
etc) or
deal directly with numpy.ndarray
objects.
** These alternative, “from-memory” input modes (md.Trajectory
and
np.ndarray
objects) avoid forcing the user to read from file
everytime an API function is called, saving I/O overhead**
The following cell either reads or generates projected trajectory files for this demonstration. In a real usecase this step (done here using TICA) might not be needed, given that the user might have generated the projected trajectory elsewhere:
In [3]:
# Perform TICA or read from file directly if already .npy-files exist
Y_filenames = [ff.replace('.xtc','.Y.npy') for ff in MD_trajfiles]
try:
Y = [np.load(ff) for ff in Y_filenames]
except:
feat = pyemma.coordinates.featurizer(top)
pairs = feat.pairs(range(feat.topology.n_atoms)[::2])
feat.add_distances(pairs)
src = pyemma.coordinates.source(MD_trajfiles, features=feat)
tica = pyemma.coordinates.tica(src, lag=10, dim=3)
Y = tica.get_output()
[np.save(ff, iY) for ff, iY in zip(Y_filenames, Y)]
Visualize a FES and the molecular structures behind it¶
Execute the following cell and click either on the FES or on the slidebar. Some input parameters have been comented out for you to try out different modes of input (disk vs memory) as well as different projection indices:
In [4]:
ax, fig, iwd, data_sample, geom = molpx.visualize.FES(
MD_list,
#MD_trajfiles,
top,
Y_filenames,
#Y,
nbins=50,
#proj_idxs=[1,2],
axlabel='TIC',
)
iwd
Visualize trajectories, FES and molecular structures¶
The user can sample structures as they occurr in sequence in the actual trajectory. Depending on the size of the dataset, this can be very time consuming, particularly if data is being read from disk.
In this example, try changing MD_trajfiles
to MD_list
and/or
changing Y_filenames
to simply Y
and see if it helps.
Furthermore, the objects in memory can be strided down to fewer frames
before being parsed to the method. To stride objects being read from
the disk, use the stride
parameter.
Other commented parameters provide more control on the output of
visualize.traj
. Uncomment them and see what happens
In [5]:
__, myfig, iwd, __ = molpx.visualize.traj(MD_trajfiles,
top,
Y,
#Y_filenames,
plot_FES = True,
dt = dt*1e-6, tunits='ms',
active_traj=0,
#traj_selection = 1,
#sharey_traj=False,
#max_frames=100,
proj_idxs=[0, 1],
panel_height=2,
)
myfig.tight_layout()
iwd
/home/guille/Programs/molPX/molpx/visualize.py:344: RuntimeWarning: divide by zero encountered in log
_plt.contourf(-_np.log(h).T, extent=irange)
Intermediate steps: using molpx to generate a regspace sample of the data¶
See the documentation of molpx.generate.sample
to find out about all
possible options:
molpx.generate.sample(MD_trajectories, MD_top, projected_trajectories, proj_idxs=[0, 1], n_points=100, n_geom_samples=1, keep_all_samples=False, proj_stride=1, verbose=False, return_data=False)
In [6]:
data_sample, geoms = molpx.generate.sample(#MD_list,
MD_trajfiles,
top,
#Y,
Y_filenames,
n_points=200
)
data_sample.shape, geoms
Out[6]:
((192, 2),
<mdtraj.Trajectory with 192 frames, 58 atoms, 58 residues, and unitcells at 0x7fd65922c1d0>)
Link the PDF plot with the sampled structures and visually explore the FES¶
Click either on the plot or on the widget slidebar: they’re connected!
In [7]:
# Replot the FES
plt.figure(figsize=(7,7))
h, (x,y) = np.histogramdd(np.vstack(Y)[:,:2], bins=50)
plt.contourf(x[:-1], y[:-1], -np.log(h.T), alpha=.50)
# Create the linked widget
linked_wdg = molpx.visualize.sample(data_sample,
geoms.superpose(geoms[0]),
plt.gca(),
clear_lines=True,
#plot_path=True
)
plt.plot(data_sample[:,0], data_sample[:,1],' ok', zorder=0)
# Show it
linked_wdg
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:4: RuntimeWarning: divide by zero encountered in log
Paths samples along the different projections (=axis)¶
In [8]:
paths_dict, idata = molpx.generate.projection_paths(#MD_list,
MD_trajfiles,
top,
Y_filenames,
#Y, # You can also directly give the data here
n_points=50,
proj_idxs=[0,1],
n_projs=3,
proj_dim = 3,
verbose=False,
)
Link the PDF plot with the sampled paths/structures and visually explore the coordinates (separately).¶
Click either on the plot or on the widget slidebar: they’re connected! You can change the type of path between min_rmsd or min_disp and you can also change the coordinate sampled (0 or 1)
In [9]:
# Choose the coordinate and the tyep of path
coord = 1
#path_type = 'min_rmsd'
path_type = 'min_disp'
igeom = paths_dict[coord][path_type]["geom"]
ipath = paths_dict[coord][path_type]["proj"]
# Choose the proj_idxs for the path and the FES
# to be shown
proj_idxs = [0,1]
In [10]:
plt.figure(figsize=(7,7))
h, (x,y) = np.histogramdd(np.vstack(Y)[:,proj_idxs], bins=50)
plt.contourf(x[:-1], y[:-1], -np.log(h.T), alpha=.50)
linked_wdg = molpx.visualize.sample(ipath[:,proj_idxs],
igeom.superpose(igeom[0]),
plt.gca(),
clear_lines=True,
n_smooth = 5,
plot_path=True,
#radius=True,
)
linked_wdg
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:3: RuntimeWarning: divide by zero encountered in log
app.launch_new_instance()
Interaction with PyEMMA
¶
molpx
is using many methods of the coordinates
submodule of
PyEMMA
, and thus it also understands some of PyEMMA
‘s classes as
input (like clustering objects or streaming transformers). ## Using the
TICA object to visualize the most correlated input features If the
projected coordinates come from a TICA (or PCA) transformation, and the
TICA object is available in memory molpx.visualize.traj
can make use
of correlation information to display not only the projected coordinates
(i.e the TICs, in this case), but also the “original” input features
behind it
In [11]:
# Re-do the TICA computation to make sure we have a tica object in memory
feat = pyemma.coordinates.featurizer(top)
pairs = feat.pairs(range(feat.topology.n_atoms)[::2])
feat.add_distances(pairs)
src = pyemma.coordinates.source(MD_trajfiles, features=feat)
tica = pyemma.coordinates.tica(src, lag=10, dim=3)
Y = tica.get_output()
The method molpx.visualize.correlations
tries to provide a visual
representation of the projected coordinates by relating them to the
input features, which carry more meaning, since they are (usually)
familiar parameters such as atom distances, angles, contacts etc.
In [28]:
# Comment or uncomment the optinal parameters and see how the method reacts
#iwd = visualize._nglview.show_mdtraj(MD_list[0][0])
iwd = None
corr, iwd = molpx.visualize.correlations(tica,
n_feats=3,
proj_idxs=[0,1,2],
geoms=MD_list[0][::100],
#verbose=True,
#proj_color_list=['red', 'blue', 'green'],
widget=iwd
)
iwd
Also, molpx.visualize.traj
can help in visualizing these
correlations by parsing along the tica object itself as
projection=tica
. Can you spot the differences: * In the nglwidget?
* In the trajectories?
In [13]:
# Reuse the visualize.traj method with the tica object as input
__, myfig, iwd, __ = molpx.visualize.traj(MD_trajfiles,
top,
Y,
#Y_filenames,
plot_FES = True,
dt = dt*1e-6, tunits='ms',
active_traj=1,
#traj_selection = 0,
#sharey_traj=False,
#max_frames=100,
proj_idxs=[0,1],
panel_height=2,
projection=tica, ## this is what's new
n_feats=3
)
myfig.tight_layout()
iwd
/home/guille/Programs/molPX/molpx/visualize.py:344: RuntimeWarning: divide by zero encountered in log
_plt.contourf(-_np.log(h).T, extent=irange)
Use a clustering object as input¶
If the dataset has already been clustered, and it is that clustering
that the user wants to explore, molpx.generate.sample
can take this
clustering object as an input instead of the the projected trajectories:
In [14]:
# Do "some" clustering
clkmeans = pyemma.coordinates.cluster_kmeans([iY[:,:2] for iY in Y], 5)
17-03-17 11:11:13 pyemma.coordinates.clustering.kmeans.KmeansClustering[7] INFO Algorithm did not reach convergence criterion of 1e-05 in 10 iterations. Consider increasing max_iter.
In [15]:
data_sample, geoms = molpx.generate.sample(MD_trajfiles, top, clkmeans,
n_geom_samples=50,
#keep_all_samples=True # read the doc for this argument
)
In [16]:
# Plot clusters
plt.figure(figsize=(7,7))
plt.plot(clkmeans.clustercenters[:,0], clkmeans.clustercenters[:,1],' ok')
# FES as background is optional (change the bool to False)
if True:
plt.contourf(x[:-1], y[:-1], -np.log(h.T), alpha=.50)
# Link the clusters positions with the molecular structures
iwdg = molpx.visualize.sample(data_sample,
geoms.superpose(geoms[0]),
plt.gca(),
clear_lines=False,
#plot_path=True
)
iwdg
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:6: RuntimeWarning: divide by zero encountered in log
Visual representations for MSMs¶
Visually inspect the network behind an MSM
In [17]:
MSM = pyemma.msm.estimate_markov_model(clkmeans.dtrajs, 20)
In [18]:
plt.figure(figsize=(7,7))
ax, pos = pyemma.plots.plot_markov_model(MSM.P,
minflux=5e-4,
arrow_labels=None,
ax=plt.gca(),
arrow_curvature = 2, show_frame=True,
pos=clkmeans.clustercenters)
# Add a background if wanted
h, (x, y) = np.histogramdd(np.vstack(Y)[:,:2], weights=np.hstack(MSM.trajectory_weights()), bins=50)
plt.contourf(x[:-1], y[:-1], -np.log(h.T), cmap="jet", alpha=.5, zorder=0)
plt.xlim(x[[0,-1]])
plt.xticks(np.unique(x.round()))
plt.yticks(np.unique(y.round()))
plt.ylim(y[[0,-1]])
iwd = molpx.visualize.sample(pos, geoms, plt.gca())
iwd
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:11: RuntimeWarning: divide by zero encountered in log
TPT Reactive Pathway Representation¶
In [19]:
# Do an MSM with a realistic number of clustercenters
cl_many = pyemma.coordinates.cluster_regspace([iY[:,:2] for iY in Y], dmin=.25)
M = pyemma.msm.estimate_markov_model(cl_many.dtrajs, 20)
cl_many.n_clusters
Out[19]:
123
In [20]:
# Use this object to sample geometries
pos, geom = molpx.generate.sample(MD_trajfiles, top, cl_many)
In [21]:
# Find the most representative microstate of each
# and least populated macrostate
M.pcca(3)
dens_max_i = [distro.argmax() for distro in M.metastable_distributions]
A = np.argmax([M.stationary_distribution[iset].sum() for iset in M.metastable_sets])
B = np.argmin([M.stationary_distribution[iset].sum() for iset in M.metastable_sets])
print(cl_many.clustercenters[dens_max_i[A]],
cl_many.clustercenters[dens_max_i[B]])
[-0.18704125 -0.77366424] [ 6.71851349 0.03159955]
In [22]:
# Create a TPT object with most_pop, least_pop as source, sink respectively
tpt = pyemma.msm.tpt(M, [dens_max_i[A]], [dens_max_i[B]])
paths, flux = tpt.pathways(fraction=.5)
In [23]:
# Get a path with a decent number of intermediates
sample_path = paths[np.argmax([len(ipath) for ipath in paths])]
In [24]:
plt.figure()
plt.contourf(x[:-1], y[:-1], -np.log(h.T), cmap="jet", alpha=.5, zorder=0)
iwd = molpx.visualize.sample(cl_many.clustercenters[sample_path],
geom[sample_path].superpose(geom[sample_path[0]]), plt.gca(),
plot_path=True,
)
plt.scatter(*cl_many.clustercenters.T, alpha=.25)
iwd
/home/guille/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:2: RuntimeWarning: divide by zero encountered in log
from ipykernel import kernelapp as app
In [25]:
# Check
# https://github.com/arose/nglview/issues/518
# https://github.com/arose/nglview/issues/517
- Watch:
- Our Youtube video or the gif animation show
molpx
in action.
About¶
molPX has been developed mostly by Dr. Guillermo Pérez-Hernández in the group of Prof. Dr. Frank Noé, with the priceless help of:
Beyond molPX’s own methods, this module connects two incredibly powerful and incredibly useful python modules:
- mdtraj for handling molecular structures inside python
- nglview IPython/Jupyter widget for in-notebook molecular visualization.
molPX is specially in debt to Dr. Alexander Rose, who, apart from developing the impressive nglview (among other projects) provided the very first proof-of-concept for molPX.
molPX was recently introduced to the community in a PyEMMA workshop in Berlin:
Installation¶
To install molPX , you need a few Python package dependencies. If these dependencies are not available in their required versions, the installation will fail. We recommend one particular way for the installation that is relatively safe, but you are welcome to try another approaches if you know what you are doing.
Anaconda install (Recommended)¶
We strongly recommend to use the Anaconda scientific python distribution in order to install python-based software. Python-based software is not trivial to distribute and this approach saves you many headaches and problems that frequently arise in other installation methods. You are free to use a different approach (see below) if you know how to sort out problems, but play at your own risk.
If you already have a conda installation, directly go to step 3:
Download and install miniconda for Python 2.7 or 3+, 32 or 64 bit depending on your system. Note that you can still use Python 2.7, however we recommend to use Python3:
http://conda.pydata.org/miniconda.html
For Windows users, who do not know what to choose for 32 or 64 bit, it is strongly recommended to read the second question of this FAQ first:
http://windows.microsoft.com/en-us/windows/32-bit-and-64-bit-windows
Run the installer and select yes to add conda to the PATH variable.
If you have installed from a Linux shell, either open a new shell to have an updated PATH, or update your PATH variable by
source ~/.bashrc
(or .tcsh, .csh - whichever shell you are using).Add the omnia-md software channel, and install (or update) molPX:
if the command conda is unknown, the PATH variable is probably not set correctly (see 1. and 2.)
Check installation:
shows you the installed python packages. You should find a molpx 0.1.2 (or later) and ipython, ipython-notebook 3.1 (or later). If ipython is not up to date, you canot use molPX. Please update it by
Python Package Index (PyPI)¶
If you do not like Anaconda for some reason you should use the Python package manager pip to install. This is not recommended, because in the past, various problems have arisen with pip in compiling the packages that molPX depends upon, see this issue for more information.
If you do not have pip, please read the install guide: install guide.
Make sure pip is enabled to install so called wheel packages:
pip install wheel
Now you are able to install binaries if you use MacOSX or Windows.
Install molPX using
pip install molPX
Check your installation
python >>> import molpx >>> molpx.__version__
should print 0.1.2 or later
>>> import IPython >>> IPython.__version__
should print 3.1 or later. If ipython is not up to date, update it by
pip install ipython
Building from Source¶
Building all dependencies from molPX from source is sometimes (if not usually) tricky, takes a long time and is error prone. It is not recommended nor supported by us. If unsure, use the Anaconda installation.
What you can do is clone or download the source from github. After that, just cd to the download directory (and untar/unzip if necessary) and:
>>> cd molPX
>>> python setup.py install
but be aware that success is not guaranteed. See the “Known Issues” below.
Known Issues¶
- A
SandboxViolation
error might appear when installing from source. Until we figure this out,try to install
nglview
externally issuing:>>> conda install nglview -c biocondaor, alternatively
>>> pip install nglview
- Note that molPX only works with
nglview
versions >=0.6.2.1.- The interplay between some modules (nglview, nbextensions, ipywidgets) might limit you to use python3.X on some platforms. Sorry about that.
Changelog¶
0.1.5 (tba)
New features:
molpx.visualize
:- new method
visualize.correlations()
to visualize the feature_TIC_correlation attribute of PyEMMA TICA-objects in the widget - method
visualize.traj()
has new optional parameterprojection
to accept a PyEMMA TICA object to visualize the linear correlations in the widget AND in as trajectories f(t)
- new method
molpx.bmutils
:- new method
bmutils.most_corr_info()
to support the new methods inmolpx.visualize
- new method
- notebooks:
- new section added to showcase the better PyEMMA integration, in particular with TICA
Fixes:
molpx.visualize
:- all calls to
nglview.show_mdtraj()
has been wrapped in_initialize_nglwidget_if_safe
to avoid erroring in tests
- all calls to