Potential Energy Surface Interface

PyVibDMC Requires the potential energy surface to be callable from Python.

Important points and constraints:

All calculations inside the DMC simulation are done in atomic units.
PyVibDMC itself passes only the coordinates, in Bohr, as a nxmx3 (n=num_geoms,m=num_atoms,3=xyz) NumPy array to the Potential Wrapper potential_manager.
- To pass more than one argument to the Python function you are using, you will need to create a dictionary that you pass to the potential manager (example below). Otherwise, you will just write a function like: def example_potential(cds)
PyVibDMC expects the function you write to return a 1D NumPy array of potential values in Hartree.

Please follow the next steps for calling a PES from Python written in Fortran or C/C++ after a brief note about parallelization.

Multiproccesing Pool: Parallelizing Potential Calls

The Potential manager uses multiprocessing by default. This allows the user to parallelize the potential call across CPU cores. This will almost always yield a very large speed-up, and is highly reccomended. PyVibDMC uses Python’s Multiprocessing module to do this. The only argument you need to use in order to take advantage of this is the num_cores parameter:

from pyvibdmc import potential_manager as pm
water_pot = pm.Potential(potential_function=pot_func,
                      python_file=py_file,
                      potential_directory=pot_dir,
                      num_cores=2)

The num_cores parameter specifies the number of Python processes one would spawn at the beginning of the DMC simulation. Each Python process takes up 1 core. If you are working with a 16-core CPU, perhaps use 10 to 12 cores for maximum performance if you are only running one calculation at once. The number of walkers does NOT need to be divisible by the number of cores/processes. If this is run on a laptop with 4 cores, only using 2 cores is recommended.

If you explicity do NOT want to use multiprocessing for some reason, such as the parallelization is done elsewhere, there is a Potential_NoMP object that you can use instead:

from pyvibdmc import potential_manager as pm
water_pot = pm.Potential_NoMP(potential_function=pot_func,
                      python_file=py_file,
                      potential_directory=pot_dir,
                      ch_dir=True)

This works identically to Potential but does not involve the multiprocessing module at all. Because of how this code calls the potential, it does not by default call the Python function from the directory of the python function, instead it calls it from the current working directory that you are running th calculation in. If your python function relies on the code being in the directory where the python function is housed, you can pass the ch_dir=True keyword argument to Potential_NoMP, which will cause the code to chdir into the directory of interest, call the potential, and then cd back out each time the potential is called. ch_dir is set to False by default for computational efficiency.

Passing more than just the coordinates to the potential manager

The potential wrapper can take in more than just the coordinates if desired. However, these arguments must be pickleable due to how multiprocessing works. The other arguments to be passed to each of the multiprocessing instances should be in the form of a Python dictionary:

from pyvibdmc.simulation_utilities import potential_manager as pm
extra_args = {'num_water_molecules': 6, 'other_parameter': True}
water_pot = pm.Potential(potential_function=pot_func,
                      python_file=py_file,
                      potential_directory=pot_dir,
                      num_cores=2,
                      pot_kwargs=extra_args)

Then, when writing the Python function:

def sample_potential(cds, extra_args):
    num_waters = extra_args['num_water_molecules']
    other_param = extra_args['other_parameter']
    ...

In Development: Tensorflow Keras Neural Network Potentials

It is possible to run a DMC simulation using a potential energy surface generated using neural networks. If you have a tensorflow keras model trained and ready to go, you can plug it in to the potential manager’s NN_Potential object:

from pyvibdmc import potential_manager as pm
from pyvibdmc.simulation_utilities.tensorflow_descriptors.tf_coulomb import TF_Coulomb #experimental cds <--> descriptor module
import tensorflow as tf
coulomb_transformer = TF_Coulomb([8,1,1,8,1,1]) # water dimer
extra_args = {'descriptorizer': coulomb_transformer, 'batch_size': 100000}
my_nn_model = tf.keras.models.load_model('/path/to/model')
water_pot = pm.NN_Potential(potential_function=pot_func,
                      python_file=py_file,
                      potential_directory=pot_dir,
                      model=my_nn_model,
                      pot_kwargs=extra_args)

Then, for the potential call:

#some other py_file
def call_nn_potential(cds, model, extra_args):
    batch_size = extra_args['batch_size']
    descriptor = extra_args['descriptorizer']
    transformed_cds = descriptor.get_coulomb(cds)
    y = model.predict(transformed_cds,batch_size=batch_size)
    ...
    return v

This was written with the intention that these models would be evaluated on GPUs, so NN_Potential does not have multiprocessing support and is actually a subclass of Potential_NoMP. As such, the extra_args here do not have to be pickleable. The potential wrapper is not confined to tensorflow, however other ML packages have not been tested within the confines of PyVibDMC

Fortran Potentials: F2PY

F2PY is an easy-to-use interface generator that calls Fortran from Python. It is baked into NumPy, a prerequisite for this PyVibDMC. Here is the Makefile that comes with PyVibDMC that uses f2py:

main :
    gfortran -c -fPIC h2opes_v2.f
    gfortran -c -fPIC calc_h2o_pot.f
    python3 -m numpy.f2py -c -I. h2opes_v2.o -m h2o_pot calc_h2o_pot.f

clean :
    rm *.o *.so

Essentially, we build a Python module in the form of a .so (shared object/library) file. The subroutine h2o_pot in calc_h2o_pot.f calls a subroutine in h2opes_v2.f, so it is necessary to compile h2opes_v2.f before generating the extension. It is optional to also compile calc_h2o_pot.f before the f2py call.

A file called h2o_pot.cpython...so will be generated. This Python module is now importable inside Python. This is done in h2o_potential.py:

# h2o_potential.py
 from h2o_pot import calc_hoh_pot
 import numpy as np


 def water_pot(cds):
     return calc_hoh_pot(cds, len(cds)) # returns a 1D numpy array of potential energy values in Hartree


 if __name__ == '__main__':
     x = np.random.random((100, 3, 3))
     v = water_pot(x)
     print(v)

This program, if called via python h2o_potential.py, will import the .so Python module called h2o_pot, exposing the subroutine calc_hoh_pot.

Now, after we made sure this ran to completion, we can call this potential as done in the tutorial:

Potential(potential_function='water_pot',
          python_file='h2o_potential.py',
          potential_directory='path/to/Partridge_Schwenke_H2O/',
          num_cores=4)

C/C++ Potentials: ctypes

For C/C++ Potentials, we require a bit more legwork on the Python side. We will use ctypes. Once you compile a shared object .so file that calls the potential of interest, using the ctypes module, we can load in that call in Python. Say we had a shared library called lib_expot.so that takes in a pointer to an int, a pointer to a coordinate array of doubles (num_atomsx3 on Python side), and a pointer to the 1D potential array, which in this case is of len(v)=1.

Here is the example of how to load that in and call it:

# call_cpot.py
import ctypes
from numpy.ctypeslib import ndpointer
import numpy as np

def call_a_cpot(cds):
   lib = ctypes.cdll.LoadLibrary("./libexpot.so")
   example_fun = lib.calcpot_
   example_fun.restype = None
   example_fun.argtypes = [ctypes.POINTER(ctypes.c_int),
                   ndpointer(ctypes.c_double, flags="C_CONTIGUOUS"),
                   ndpointer(ctypes.c_double, flags="C_CONTIGUOUS")]
   nw = ctypes.c_int32(6) #some integer that needs to be passed to the potential
   v = np.zeros(1)
   vpot = np.zeros(len(cds))
   for num,coord in enumerate(cds):
       v = np.zeros(1)
       example_fun(ctypes.byref(nw),v,coord)
       vpot[num] = v[0]

In this example, all the looping is done on the Python side, and so only one geometry is fed to the example_fun at a time. Indeed, one could loop over the geometres on the C/C++ side and get a speed-up.

Nonetheless, you may then use this Python function in the Potential object by doing:

Potential(potential_function='call_a_cpot',
           python_file='call_cpot.py',
           potential_directory=pot_dir,
           num_cores=4)

You don’t need to load the shared object file each time, though, thanks to the pot_kwargs option:

#before passing to the potential manager...
lib = ctypes.cdll.LoadLibrary("./libexpot.so")
example_fun = lib.calcpot_
example_fun.restype = None
example_fun.argtypes = [ctypes.POINTER(ctypes.c_int),
                  ndpointer(ctypes.c_double, flags="C_CONTIGUOUS"),
                  ndpointer(ctypes.c_double, flags="C_CONTIGUOUS")]

my_kwargs = {'example_fun': example_fun, 'nw': ctypes.c_int32(6)}
Potential(potential_function='call_a_cpot',
          python_file='call_cpot.py',
          potential_directory=pot_dir,
          num_cores=4,
          pot_kwargs= my_kwargs )

Then…:

# in the python function call...
def call_a_cpot(cds, extra_args):
  ex_fun = extra_args['example_fun']
  nw = extra_args['nw']
  v = np.zeros(1)
  vpot = np.zeros(len(cds))
  for num,coord in enumerate(cds):
      v = np.zeros(1)
      ex_fun(ctypes.byref(nw),v,coord)
      vpot[num] = v[0]

MPI4Py: Parallelizing Potential Calls

If one has a Python function that calls the potential, one can also parallelize the calls to it using MPI. This is handled internally using the futures module in MPI4Py. While one can run on multiple cores using multiprocessing, one can run on multiple cores/nodes in high-performance computing environments using the MPI_Potential:

from pyvibdmc.simulation_utilities.mpi_potential_manager import MPI_Potential
# Import is like this rather than import pyvibdmc as pv so that people don't have to have MPI4Py installed

if __name__=='__main__':
    pot_dir = '.../legacy_mbpol/'
    py_file = 'call_mbpol.py'
    pot_func = 'call_any'

    mbp_dimer_pot = MPI_Potential(potential_function=pot_func,
                          python_file=py_file,
                          potential_directory=pot_dir,
                          pot_kwargs={'nw':2})

As you can see, the syntax for MPI_Potential is similar to the Potential manager above. No other work needs to be done to parallelize the code on the user side, this is all handled internally. The num_cores argument should not be used, as the MPI_Potential manager simply looks for the MPI.COMM_WORLD.get_size() attribute to figure out how many MPI processes to use.

PyVibDMC uses the mpi4py.futures module for MPI parallelization. This module is a relatively new implementation, and does not handle memory in the most efficient way. While you should get close to linear scaling with the number of nodes used for calling the potential, there may be memory / efficiency issues as you scale to larger numbers of nodes.

The more difficult part of the MPI potential manager is setting up the desired MPI environment for the high-performance computing environment one may want to work on. In the McCoy group, we have containers on dockerhub for McCoy group students to use. The dockerfiles for these containers are hosted on GitHub on the McCoy Group GitHub page.

Of course, one could compile or the load the the appropriate MPI module installed on the supercomputer of interest, and simply load that before using this feature. If using the MPI_Potential manager, one should run the code using mpirun

Here is an example sbatch file for running a containerized MPI DMC simulation using Singularity on Mox:

#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --partition=ilahie
#SBATCH --account=ilahie
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28
#SBATCH --time=01:00:00
#SBATCH --mem=120G
#SBATCH --chdir=.

module load singularity
. /.../mcenv.sh
MCENV_IMAGE=/.../mcenv_centos.sif
MCENV_PACKAGES_PATH=/.../packages

module load icc_19-ompi_3.1.4

mcenv=$(mcenv -e)
mpirun --mca mpi_warn_on_fork 0 -n $SLURM_NTASKS $mcenv --exec python -m mpi4py.futures pv_mpi_test.py

The python -m mpi4py.futures pv_mpi_test.py syntax should be used regardless of whether or not one is in a container, amd $SLURM_NTASKS here is the number of nodes * the number of tasks per node.

Alternative Approach (Not recommended): executables and subprocess calls

If for some reason these approaches do not meet your needs, you can always write the (nxmx3) geometries to a file, call an executable that loads in the file, and then reload back in the potential values written to a second file, all in a Python call. This is not recommended as it will be slow, as hard drive reads/writes are slow (especially if you have a hard drive vs an SSD). Nonetheless, here is an example of how to do such a thing:

#pot_call_exec.py
import subprocess as sub
def call_exec(cds):
   exportCoords(cds,'coords.txt') #some function that writes the coordinates to file
   sub.run('./pot_executable',cwd='...',shell=True)
   pots = np.loadtxt('pots.txt')
   return pots

Then, we may use this function in the Potential object:

Potential(potential_function='call_exec',
           python_file='pot_call_exec.py',
           potential_directory=pot_dir,
           num_cores=1) #cannot parallelize executables easily using multiprocessing. Can read/write to mutliple files...