1.7. Accelerating HORTON code with Cython

HORTON was designed to prioritize ease of programming over performance. This is a reasonable trade-off because the vast majority of the runtime of a quantum chemistry code is spent in a relatively small portion of the code. In HORTON, we rewrite these critical portions of code in C++ and link them into Python using the Cython framework. The critical portions are identified with profiling tools.

1.7.1. Before you begin

There are several downsides to accelerating code with Cython. Please make sure they are acceptable to you before starting to code.

  • Developer tools will break: pdb cannot read C++ debug symbols, cProfile cannot see inside the compiled code, IDEs will not syntax highlight Cython correctly, and Valgrind will report false positive memory leaks.
  • Cython is still a fairly young project and its API may not be stable. Don’t be surprised if your code breaks after a few releases.
  • There is a steep learning curve. Unless you are already familiar with C/C++ and with Python profiling tools, you may not obtain the speed-up you expected. Memory allocation and management of arrays are particularly tricky.

1.7.2. Basic example

1.7.2.1. Background

We will take a simplified example from the slicing code of the matrix class. The original Cholesky-decomposed two-electron integrals had a three-index slicing operation that was consuming a significant portion of the total runtime. It was implemented using the numpy.einsum method.

The original code is shown below:

def get_slice_slow(self):
    # B and B_prime are the Cholesky vectors, stored as (nvec, nbasis, nbasis) arrays.
    return numpy.einsum("kac,kbc->abc", B, B_prime)

This code contracts B and B_prime over the Cholesky index k while keeping the index c aligned between the two factors.

\[A_{abc} = \sum_k B_{kac} B'_{kbc}\]

A quick check with the Python cProfile module (python -m cProfile -o slice.pstats get_slice_slow.py; gprof2dot -f pstats slice.pstats | dot -Tpng -o slice.png) showed that the get_slice_slow method accounted for almost 40% of the total runtime. Since this operation is simple to implement in C++, it was a good candidate for Cythonizing.
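For reference, a minimal standalone script of the kind one might profile could look like the following sketch; the script name get_slice_slow.py matches the profiling command above, but the array sizes and module-level setup are illustrative assumptions rather than HORTON code:

#get_slice_slow.py -- hypothetical profiling driver
import numpy

nvec, nbasis = 200, 40
B = numpy.random.rand(nvec, nbasis, nbasis)
B_prime = numpy.random.rand(nvec, nbasis, nbasis)

def get_slice_slow():
    # Contract over the Cholesky index k, as in the equation above.
    return numpy.einsum("kac,kbc->abc", B, B_prime)

if __name__ == '__main__':
    # Repeat the contraction a few times so it dominates the profile.
    for _ in range(10):
        get_slice_slow()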

The C++ code to implement the same operation is below:

//get_slice_abcc.cpp
#include "get_slice_abcc.h"

void get_slice_abcc(double* B, double* B_prime, double* out, long nbasis, long nvec){
    // Accumulate out[a,b,c] += B[k,a,c] * B_prime[k,b,c] over the Cholesky index k.
    // The caller is responsible for zero-initializing out.
    for (long k=0; k<nvec; k++){
        for (long a=0; a<nbasis; a++){
            for (long b=0; b<nbasis; b++){
                for (long c=0; c<nbasis; c++){
                    out[a*nbasis*nbasis + b*nbasis + c] +=
                        B[k*nbasis*nbasis + a*nbasis + c] *
                        B_prime[k*nbasis*nbasis + b*nbasis + c];
                }
            }
        }
    }
}

and the header is below:

//get_slice_abcc.h
void get_slice_abcc(double* B, double* B_prime, double* out, long nbasis, long nvec);

This code now needs to be interfaced with Python using Cython.

1.7.2.2. Cythonizing your code

First, create a Cython .pxd header. This file provides the information Cython needs to link your compiled code into the .pyx file later.

#get_slice_abcc.pxd
cdef extern from "get_slice_abcc.h":
    void get_slice_abcc(double* B, double* B_prime, double* out, long nbasis, long nvec)

You’ll notice that the Cython header is remarkably similar to the C++ header. A few new keywords are introduced, the most significant being cdef. The .pxd files use Python syntax extended with a few extra keywords and with C-style pointer syntax. See the Cython documentation on C++ below for more details on how to Cythonize classes and more.
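As a taste of what that looks like, wrapping a C++ class in a .pxd file follows the same cdef extern pattern; the class and methods below are purely illustrative and not part of HORTON:

#example_class.pxd -- illustrative only
cdef extern from "example_class.h":
    cdef cppclass ExampleClass:
        ExampleClass(long n) except +
        double compute(double x)
        long size()

The except + annotation tells Cython to translate C++ exceptions thrown by the constructor into Python exceptions, and a module wrapping such a class must be compiled in C++ mode.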

The .pyx file is where the brunt of the work by Cython is done. It is also Python syntax with a few extra keywords.

#cext.pyx
cimport numpy as np
import numpy as np

cimport get_slice_abcc


def get_slice_fast(np.ndarray[double, ndim=3] B not None,
                   np.ndarray[double, ndim=3] B_prime not None,
                   np.ndarray[double, ndim=3] out not None,
                   long nbasis, long nvec):
    assert B.flags['C_CONTIGUOUS']
    assert B.shape[0] == nvec
    assert B.shape[1] == B.shape[2] == nbasis
    #etc...

    get_slice_abcc.get_slice_abcc(&B[0, 0, 0], &B_prime[0, 0, 0], &out[0, 0, 0],
                                  nbasis, nvec)
    return out

There are several things to note here:

  • The arguments are statically typed.
  • The Numpy arrays have their datatype declared, as well as their number of dimensions.
  • It is good practice to include safety checks, because code in .pyx files will not give clean stack traces.
  • Python and Numpy use the long datatype by default.
  • You can pass the address of the first element of a Numpy array to a function expecting a double*, as long as the array is contiguous.
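Before get_slice_fast can be used, the .pyx file and the C++ source must be compiled into an extension module. HORTON's actual build configuration is not reproduced here; a generic setuptools/Cython sketch using the file names from this example might look like the following:

#setup.py -- generic sketch, not HORTON's actual build script
import numpy as np
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "cext",
    sources=["cext.pyx", "get_slice_abcc.cpp"],
    include_dirs=[np.get_include()],  # headers needed for cimport numpy
    language="c++",
)

setup(ext_modules=cythonize([ext]))

Running python setup.py build_ext --inplace then produces a cext module from which get_slice_fast can be imported.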

There are several other nuances not illustrated in this example, but they are well covered in the Cython documentation below. Users should be particularly cognizant of whether variables are Python-style (dynamically typed) or C-style (statically typed). In the example above, everything is statically typed because every argument is declared in the function signature.
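To illustrate the distinction, the hypothetical snippet below mixes the two kinds of variables; only the cdef declarations become C variables, while label remains an ordinary Python object:

#typing_example.pyx -- illustrative only
cimport numpy as np

def sum_diagonal(np.ndarray[double, ndim=2] A not None, long n):
    cdef long i               # C-style (statically typed) loop counter
    cdef double total = 0.0   # C-style accumulator; the loop runs without Python objects
    label = 'diagonal sum'    # Python-style (dynamically typed) variable
    for i in range(n):
        total += A[i, i]
    return label, total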

1.7.3. Additional notes

The above example leaves all memory management to the Python interpreter. This is not always possible, especially when implementing iterative algorithms in C/C++ code. Memory that is allocated and deallocated entirely within the C++ code poses no problem. However, if memory must be allocated by C++ and freed by Python, things become more complicated. The reverse case, memory allocated by Python and freed by C++, should be much rarer and is not covered here.

The most common case of memory being allocated in C++ and passed back to Python for management is probably that of Numpy arrays. The snippet below shows how to handle it.

cimport numpy as np
import numpy as np

np.import_array()  # required once per module before using the Numpy C API

cdef double* data = NULL
cdef np.npy_intp dims[3]

# calculate_cholesky allocates the buffer in C++ and returns the number of Cholesky vectors.
nvec = calculate_cholesky(&data)
dims[0] = <np.npy_intp> nvec
dims[1] = <np.npy_intp> nbasis
dims[2] = <np.npy_intp> nbasis

result = np.PyArray_SimpleNewFromData(3, dims, np.NPY_DOUBLE, data)

The function PyArray_SimpleNewFromData creates a new Numpy array around memory that has already been allocated; the Numpy data type and the dimensions must be specified explicitly. Here, data is simply a flat double* buffer of length nvec * nbasis * nbasis, which the new array interprets as a C-contiguous (nvec, nbasis, nbasis) array. Note that the resulting array does not own the buffer: unless ownership is transferred explicitly, Python will never free the memory that was allocated in C++.
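One possible way to transfer ownership is to enable the OWNDATA flag on the new array, as sketched below. The extern declaration of PyArray_ENABLEFLAGS is a common pattern rather than part of the HORTON example, and this approach is only safe if the C++ code allocated the buffer with an allocator compatible with the free() call Numpy will eventually make (e.g. malloc, not new[]):

cdef extern from "numpy/arrayobject.h":
    void PyArray_ENABLEFLAGS(np.ndarray arr, int flags)

result = np.PyArray_SimpleNewFromData(3, dims, np.NPY_DOUBLE, data)
# Numpy will now free the buffer when the array is garbage collected.
PyArray_ENABLEFLAGS(result, np.NPY_OWNDATA)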