molpx.generate

This module contains methods that generate the objects needed for the methods of visualize to work.

molpx.generate.projection_paths(...[, ...]) Return a path along a given projection.
molpx.generate.sample(MD_trajectories, ...) Returns a sample of molecular geometries and their positions in the projected space
molpx.generate.projection_paths(MD_trajectories, MD_top, projected_trajectories, n_projs=1, proj_dim=2, proj_idxs=None, n_points=100, n_geom_samples=100, proj_stride=1, history_aware=True, verbose=False, minRMSD_selection='backbone')

Return a path along a given projection. More info on what this means exactly will follow soon.

Parameters:
  • MD_trajectories (str, or list of strings with the filename(s) of the molecular dynamics (MD) trajectories) –

    Any file extension that mdtraj can read (.xtc, .dcd, etc.) is accepted.

    Alternatively, a single mdtraj.Trajectory object or a list of them can be given as input.

  • MD_top (str to topology filename or directly mdtraj.Topology object) –
  • projected_trajectories (str to a filename or numpy ndarray of shape (n_frames, n_dims)) – Time-series with the projection(s) to be explored. If these have been computed externally, you can provide .npy filenames or readable ASCII files (.dat, .txt, etc.). NOTE: molpx assumes that there is no time column.
  • n_projs (int, default is 1) – Number of projection paths to generate. If the input projected_trajectories are n-dimensional, in principle up to n-paths can be generated
  • proj_dim (int, default is 2) – Dimensionality of the space in which distances will be computed
  • proj_idxs (int, default is None) – Selection of projection idxs (zero-indexed) to visualize. The default behaviour is that proj_idxs = range(n_projs). However, if proj_idxs is not None, then n_projs is ignored and proj_dim is set automatically.
  • n_points (int, default is 100) – Number of points along the projection path. The higher this number, the more finely the projected coordinate is resolved, at the cost of more computational effort.
  • n_geom_samples (int, default is 100) – For each of the n_points along the projection path, n_geom_samples will be retrieved from the trajectory files. The higher this number, the smoother the minRMSD projection path, but the longer the path takes to compute.
  • proj_stride (int, default is 1) – The stride of the projected_trajectories relative to the MD_trajectories. This will play a role particularly if projected_trajectories is already strided (because the user is holding it in memory) but the MD-data on disk has not been strided.
  • history_aware (bool, default is True) – The path-searching algorithm can either minimize the distance between adjacent points along the path or minimize the distance between each point and the mean value of all the other points up to that point. Use this parameter to avoid a situation in which the path gets “derailed” because an outlier is chosen at a given point.
  • verbose (bool, default is False) – The verbosity level
  • minRMSD_selection (str, default is 'backbone') – When computing minRMSDs between a given point and adjacent candidates, use this string to select the atoms that will be considered. See mdtraj’s atom selection language at http://mdtraj.org/latest/atom_selection.html
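
The two strategies described under history_aware can be illustrated with a small, self-contained sketch. This is not molpx's implementation: the function pick_path and the toy candidate points are hypothetical, and the sketch assumes that history_aware=True corresponds to the "mean of all previous points" strategy.

```python
import math


def pick_path(start, candidates_per_step, history_aware=True):
    """Greedily pick one candidate per step (hypothetical helper, not molpx code).

    history_aware=False: minimize the distance to the previously chosen point.
    history_aware=True:  minimize the distance to the mean of all points
                         chosen so far (assumed meaning of the flag).
    """
    path = [start]
    for candidates in candidates_per_step:
        if history_aware:
            # Reference is the running mean of everything chosen up to now
            ref = tuple(sum(coords) / len(path) for coords in zip(*path))
        else:
            # Reference is only the most recent point
            ref = path[-1]
        path.append(min(candidates, key=lambda c: math.dist(c, ref)))
    return path


# Toy data: the second step offers only an outlier, the third step offers
# a point next to the outlier and a point back on the original track.
steps = [
    [(0.1, 0.0)],
    [(3.0, 3.0)],
    [(3.1, 3.0), (0.2, 0.0)],
]
aware_path = pick_path((0.0, 0.0), steps, history_aware=True)
naive_path = pick_path((0.0, 0.0), steps, history_aware=False)
```

In this toy case the adjacent-point strategy follows the outlier, while the running-mean strategy returns to the original track, which is the "derailing" situation the parameter is meant to avoid.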
Returns:

dictionary of dictionaries containing the projection paths.

  • paths_dict[idxs][type_of_path]

    • idxs is the index of the projected coordinate ([0], [1]...)
    • type_of_path is either “min_rmsd” or “min_disp”
  • What the dictionary actually contains

    • paths_dict[idxs][type_of_path]["proj"] : ndarray of shape (n_points, proj_dim) with the coordinates of the projection along the path
    • paths_dict[idxs][type_of_path]["geom"] : mdtraj.Trajectory geometries along the path

Return type:

paths_dict

idata :
list of ndarrays with the data in projected_trajectories
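
As a reading aid for the nested layout described above, the following sketch walks a dictionary with that shape. The helper summarize_paths and the placeholder dictionary are hypothetical; the placeholder uses nested lists instead of ndarrays, None instead of mdtraj.Trajectory objects, and assumes integer keys for idxs.

```python
def summarize_paths(paths_dict):
    """Return (idx, path_type, n_points, proj_dim) for every path found.

    Hypothetical helper: it only relies on the documented layout
    paths_dict[idxs][type_of_path]["proj"].
    """
    rows = []
    for idx, by_type in paths_dict.items():
        for path_type, path in by_type.items():
            proj = path["proj"]
            rows.append((idx, path_type, len(proj), len(proj[0])))
    return rows


# Placeholder with the documented layout (values are made up)
mock_paths = {
    0: {
        "min_rmsd": {"proj": [[0.0, 1.0], [0.5, 1.1], [1.0, 0.9]], "geom": None},
        "min_disp": {"proj": [[0.0, 1.0], [0.5, 1.0], [1.0, 1.0]], "geom": None},
    }
}
summary = summarize_paths(mock_paths)
```

Because the helper only uses len() and indexing, it should work unchanged on arrays of shape (n_points, proj_dim), but that has not been checked against real molpx output.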
molpx.generate.sample(MD_trajectories, MD_top, projected_trajectories, proj_idxs=[0, 1], n_points=100, n_geom_samples=1, keep_all_samples=False, proj_stride=1, verbose=False, return_data=False)

Returns a sample of molecular geometries and their positions in the projected space

Parameters:
  • MD_trajectories (list of strings) – Filenames (any extension that mdtraj can read is accepted) containing the trajectory data. There is an untested input mode where the user passes mdtraj.Trajectory objects directly.
  • MD_top (str to topology filename or directly mdtraj.Topology object) –
  • projected_trajectories ((lists of) strings or (lists of) numpy ndarrays of shape (n_frames, n_dims)) – Time-series with the projection(s) to be explored. If these have been computed externally, you can provide .npy filenames or readable ASCII files (.dat, .txt, etc.). Alternatively, you can feed in your own clustering object. NOTE: molpx assumes that there is no time column.
  • proj_idxs (list of int, default is [0, 1] according to the signature above) – Selection of projection idxs (zero-indexed) to visualize.
  • n_points (int, default is 100) – Number of points at which the projected space is sampled. The higher this number, the more finely the projected coordinate is resolved, at the cost of more computational effort.
  • n_geom_samples (int, default is 1) – For each of the n_points, n_geom_samples will be retrieved from the trajectory files. This is a trade-off parameter between how smooth the transitions between geometries can be and how long it takes to generate the sample.
  • keep_all_samples (boolean, default is False) – By default, once the closest-to-ref geometry has been kept, the other geometries are discarded, and the output sample contains only n_points geometries. However, there are special cases where the user might want to keep all sampled geometries. A typical use case is when n_points is low, so that several representatives per cluster center are more informative than a single one.
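
The effect of keep_all_samples can be sketched on toy data. This is an illustration under stated assumptions, not molpx's code: the function sample_frames is hypothetical, frames are assigned to their nearest center by Euclidean distance, and the candidates for each center are drawn at random from its members.

```python
import math
import random


def sample_frames(centers, frames, n_geom_samples=1, keep_all_samples=False, seed=0):
    """Return frame indices sampled around each center (hypothetical helper)."""
    rng = random.Random(seed)

    # Assign every frame index to its nearest center
    members = {i: [] for i in range(len(centers))}
    for f_idx, pos in enumerate(frames):
        nearest = min(range(len(centers)), key=lambda i: math.dist(pos, centers[i]))
        members[nearest].append(f_idx)

    kept = []
    for i, center in enumerate(centers):
        if not members[i]:
            continue  # empty cluster, nothing to draw from
        drawn = rng.sample(members[i], min(n_geom_samples, len(members[i])))
        if keep_all_samples:
            # Keep every drawn candidate
            kept.extend(drawn)
        else:
            # Keep only the candidate closest to the center
            kept.append(min(drawn, key=lambda f: math.dist(frames[f], center)))
    return kept


# Toy 1-D projection: two centers, three frames near each
centers = [(0.0,), (10.0,)]
frames = [(0.1,), (0.5,), (-0.3,), (9.0,), (10.2,), (11.0,)]
closest_only = sample_frames(centers, frames, n_geom_samples=3)
all_kept = sample_frames(centers, frames, n_geom_samples=3, keep_all_samples=True)
```

With keep_all_samples=False the sketch returns one frame per center; with keep_all_samples=True it returns every drawn candidate, so the sample can hold up to n_points * n_geom_samples entries.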
Returns:

  • pos – ndarray with the positions of the sample
  • geom_smpl – mdtraj.Trajectory object with the sampled geometries
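
Both functions above accept proj_stride, which relates frames of projected_trajectories to frames of the MD data on disk. The sketch below shows the index arithmetic this implies. It assumes the usual striding convention (every proj_stride-th frame, starting at frame 0); md_frame_index is a hypothetical helper and not part of molpx.

```python
def md_frame_index(proj_frame, proj_stride=1):
    """Map a frame index in the projected data to the frame index on disk.

    Assumes the projected data holds every proj_stride-th MD frame,
    starting at frame 0.
    """
    return proj_frame * proj_stride


# Toy data: 1000 MD frames on disk, projection held in memory with stride 10
n_md_frames = 1000
proj_stride = 10
md_frames = list(range(n_md_frames))
projected_frames = md_frames[::proj_stride]

# Frame 7 of the projected data corresponds to frame 70 on disk
frame_on_disk = md_frame_index(7, proj_stride)
```

With proj_stride=1 the mapping is the identity, which matches the default in both signatures.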