Sailing the C with Python for free (and going deep?)

Image adapted by Alexander J. Pfleger.
Original image by Pk0001, CC BY-SA 4.0, via Wikimedia Commons.

In the last blog post we looked at fairly superficial performance improvements for Python programs. Those were limited by the basic performance of Python and by the range of existing modules. This time we try to surpass those limits by creating our own module from scratch – in C++. Again we start with a simple square function, but the concepts carry over to more advanced functions.

To call C code from Python we need a Python binding. There are several libraries to achieve this; one of the most popular is pybind11. It is rather compact and requires only a few additions to existing C code. Since pybind11 uses a C++ compiler, many marvellous C++ features can be used as well.

To begin, we will only allow single integer values for our square function A^2. In C++, the function can be written like this:

int square(int A){
    return A*A;
}

In order to use this function in Python, it needs to be converted to a Python module. This can be done by altering the code as follows:

#include <pybind11/pybind11.h>

int square(int A){
    return A*A;
}

PYBIND11_MODULE(bind_sq, m){
    m.def("squareCPP", &square, "NOTE: squares integers");
}

In the first line, the pybind11 header is included. Then comes the already implemented square function. The last lines generate the actual Python module; a short docstring can be added there as well. After compiling the C++ code, the module can be loaded and used in Python:

from bind_sq import squareCPP

squareCPP(5)  # returns 25
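The compile step mentioned above can look like this on Linux. This is the generic invocation from the pybind11 documentation; the source file name bind_sq.cpp is an assumption:

```shell
# Build the extension in-place; requires pybind11 (e.g. pip install pybind11).
c++ -O3 -Wall -shared -std=c++11 -fPIC \
    $(python3 -m pybind11 --includes) \
    bind_sq.cpp -o bind_sq$(python3-config --extension-suffix)
```

The extension suffix query makes sure the resulting file name matches what your Python interpreter expects to import.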

To use more advanced functions, the same concepts need to be applied. To use NumPy arrays like in the previous example, some further additions need to be made in the C++ code – roughly three lines per array. Those cases are well explained in the pybind11 documentation.
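As a sketch of what those additions look like – the function name, module name, and element-wise kernel here are illustrative, not taken from the original project:

```cpp
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

// Element-wise square of a 1-D NumPy array. The "roughly three lines
// per array" are the buffer handling below: request the input buffer,
// allocate the output array, and request its buffer in turn.
py::array_t<double> square_array(py::array_t<double> A){
    py::buffer_info buf = A.request();            // access the raw input buffer
    auto result = py::array_t<double>(buf.size);  // allocate the output array
    py::buffer_info res = result.request();       // access the output buffer

    double *in  = static_cast<double*>(buf.ptr);
    double *out = static_cast<double*>(res.ptr);
    for (py::ssize_t i = 0; i < buf.size; i++)
        out[i] = in[i] * in[i];
    return result;
}

PYBIND11_MODULE(bind_sq_np, m){
    m.def("squareCPP", &square_array, "squares a NumPy array element-wise");
}
```

From Python, the module is used exactly like before, except that squareCPP now accepts and returns NumPy arrays instead of single integers.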

The performance boost of pybind11 using arrays is shown here:

For this figure, the Jacobi function is used, since it has more impact on the project. The new module is ten times faster than the already optimised Python code, and its performance is similar to that of a stand-alone C++ program. The third line in the plot is generated by a parallelised module and provides a second boost by a factor of ten. We will have a look at this method in the next paragraphs.

To further increase the performance of the function, parallelisation techniques like OpenMP can be used. The C++ code has to be altered slightly, but no changes are needed in the Python project. This keeps the code clean while the parallelisation happens in the background. In the previous figure, the performance of an OpenMP module with 18 threads is compared to that of the plain pybind11 module. Depending on the problem size, a drastic speed-up can be observed.
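The OpenMP change is often a single pragma on the hot loop. A minimal stand-alone sketch – using the element-wise square as a stand-in kernel rather than the actual Jacobi function; compile with -fopenmp to enable the parallelism, otherwise the pragma is simply ignored and the loop runs serially with identical results:

```cpp
#include <vector>

// Element-wise square of a vector. The pragma tells OpenMP to split the
// loop iterations across threads; each iteration writes to its own
// element, so no synchronisation is needed and the result is the same
// with or without -fopenmp.
std::vector<double> square_omp(const std::vector<double>& a){
    std::vector<double> r(a.size());
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(a.size()); i++)
        r[i] = a[i] * a[i];
    return r;
}
```

Because the pragma degrades gracefully to serial execution, the same source compiles and behaves correctly on machines without OpenMP support.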

If you are considering building your own modules right now, I highly recommend using a Linux system, since it makes the setup a lot easier. If you are new to Linux, you can install a distribution like Ubuntu in a virtual machine, without making major changes to your computer.

Currently writing his master's thesis on "Simulations of SOFC systems" and working for the journal JIPSS. Holds bachelor's degrees in electrical engineering and audio engineering, and in physics. Founder of PLANCKS Austria.

