Optimizing Performance of Python Code Using Cython

Alexander Dobrzhansky

Python dates back to 1991, when Guido van Rossum first released it. Over the years, Python has made a name for itself as one of the handiest, best-equipped, and most downright useful programming languages.

The distinctive features of Python include:

  • Speed of development
  • Readability
  • Great ecosystem of libraries
  • Big community

However, execution speed is not one of Python's strengths. So when application performance starts to matter, whether for user experience or for cost, a real question arises: how much do we care?

In some cases, performance can be increased by adding extra hardware, but this option is quite expensive and not always effective. Another possibility is to search for bottlenecks by profiling the code.
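Profiling is the step that tells you where the bottlenecks actually are. As a minimal sketch, the standard-library cProfile and pstats modules can rank functions by cumulative time; `application` and `slow_sum_of_squares` below are hypothetical stand-ins for real application code.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical hot spot: a pure-Python numeric loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def application():
    """Hypothetical application with one cheap step and one expensive step."""
    config = {"size": 200_000}                  # cheap
    return slow_sum_of_squares(config["size"])  # expensive


profiler = cProfile.Profile()
profiler.enable()
result = application()
profiler.disable()

# Sort by cumulative time so the most expensive call chains appear first
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)  # show only the top 5 entries
print(stream.getvalue())
```

The report should list slow_sum_of_squares near the top, which marks it as the candidate for optimization.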

Optimizing Python Code Performance

Once you have found the bottlenecks, the next question is how to remove them. Several tools can help with this.

The main options for optimizing the performance of Python code are:

  • C extensions (be ready to write C)
  • Alternative runtimes: PyPy, Pyston, Grumpy, etc.
  • Cython

Cython as a tool to optimize Python code

So let’s talk about Cython. Cython is an extension to the Python language that allows explicit type declarations and is compiled directly to C. This addresses Python’s large overhead for numerical loops and the difficulty of efficiently making use of existing C code, which Cython code can interact with natively. The Cython language combines the speed of C with the power and simplicity of the Python language.

You may know that Python code can call directly into C modules. Those C modules can be either generic C libraries or libraries built specifically to work with Python. So where does Cython fit in? Cython generates the second kind of module: C libraries that talk to Python's internals. These modules can be combined with the Python code we already have to extend its capabilities and improve performance.
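For contrast, the first kind of module, a generic C library that knows nothing about Python, can be called from plain Python through the standard-library ctypes module. The sketch below assumes a POSIX system (Linux or macOS) where the C standard library is available; `c_strlen` is a hypothetical helper name, not part of any library.

```python
import ctypes
import ctypes.util

# Locate the C standard library (e.g. "libc.so.6" on Linux). If it cannot be
# found, CDLL(None) falls back to the symbols already loaded into the process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Hypothetical wrapper: ask the C library for the length of a byte string."""
    return libc.strlen(data)


print(c_strlen(b"Cython"))  # 6
```

Each such call still crosses the Python/C boundary and converts arguments at runtime, which is part of what a Cython module avoids by talking to Python's internals directly.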

The icing on the cake is that the Cython approach is incremental. In practice, this means a developer can make targeted changes to an existing Python application to speed it up, instead of rewriting the whole application from scratch.

This approach dovetails with the nature of software performance issues generally. In most programs, the vast majority of CPU-intensive code is concentrated in a few hot spots – a version of the Pareto principle, also known as the “80/20” rule. Thus, most of the code in a Python application doesn’t need to be performance-optimized, just a few critical pieces. You can incrementally translate those hot spots into Cython, and so get the performance gains you need where it matters most. The rest of the program can remain in Python for the convenience of the developers.
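One common way to apply this incrementally is to keep a pure-Python implementation as a fallback and use the compiled module only when it has been built. The sketch below assumes an extension module named `mean` that exposes `cython_mean`, like the one built in the example that follows; `fast_mean` is a hypothetical name.

```python
from array import array

try:
    # Use the compiled Cython extension if it has been built (assumed module name)
    from mean import cython_mean as fast_mean
except ImportError:
    # Fall back to an equivalent pure-Python version so the program still runs
    def fast_mean(x):
        total = 0.0
        for xi in x:
            total += xi
        return total / len(x)


# An array of C doubles works with both versions: the Python loop iterates over
# it, and a double[:] memoryview argument accepts it through the buffer protocol.
values = array("d", [1.0, 2.0, 3.0, 4.0])
print(fast_mean(values))  # 2.5
```

The rest of the application calls `fast_mean` and never needs to know which implementation it got.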

Example of optimizing Python code with Cython

For clarity, let's consider a small example. Suppose we have a block of code with a loop that does not run as fast as we would like. We select the piece of code we want to speed up and move it into a separate file, mean.pyx, with the following content:

def cython_mean(double[:] x):
    # A typed loop index and accumulator let Cython emit a plain C loop
    cdef Py_ssize_t i
    cdef double total = 0
    for i in range(x.shape[0]):
        total += x[i]
    return total / x.shape[0]

As you can see, the module uses its own declared C data types and makes no outside calls. Next we need a setup.py file, which plays the role of a Makefile for Python:

from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("mean.pyx"),
)

Then build the extension module in place:

$ python setup.py build_ext --inplace

Now we have a Python module that is ready for integration into any application and easy to call.

>>> import numpy as np
>>> import mean
>>> x = np.random.rand(100000)
>>> mean.cython_mean(x)

Let's also compare its execution speed with an equivalent function written with Numba.

Numba is a NumPy-aware optimizing compiler for Python. It uses the LLVM compiler infrastructure to compile Python to machine code.

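from numba import jit

# nopython mode compiles the whole function to machine code;
# the first call triggers compilation, later calls reuse it
@jit(nopython=True)
def numba_mean(x):
    total = 0.0
    for xi in x:
        total += xi
    return total / len(x)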

When we benchmark this example, IPython's %timeit reports that calling each function on a 100,000-element array takes:

  • ~16 ms for the pure-Python version
  • ~93 µs with Numba
  • ~86 µs with Cython
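The pure-Python version behind the first figure is not shown above. A plausible baseline, assumed here to be the same loop without any compilation, together with a stdlib-only timing harness, might look like the sketch below; `python_mean` is a hypothetical name, and absolute timings will depend on your machine.

```python
import random
import timeit
from array import array


def python_mean(x):
    """Hypothetical pure-Python baseline: the same loop as numba_mean, uncompiled."""
    total = 0.0
    for xi in x:
        total += xi
    return total / len(x)


# 100,000 random doubles stored contiguously, like a NumPy float64 array
data = array("d", (random.random() for _ in range(100_000)))

# Time several batches and keep the best per-call figure,
# a rough stand-in for IPython's %timeit
calls_per_batch = 10
batches = timeit.repeat(lambda: python_mean(data), number=calls_per_batch, repeat=3)
best = min(batches) / calls_per_batch
print(f"python_mean: {best * 1e3:.2f} ms per call")
```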

Summing up Cython advantages and limitations

Cython advantages

  • Faster work with external C libraries
  • Direct communication to the underlying libraries, without Python in the way
  • Capability to use both C and Python memory management
  • Capability to create and manage your own C-level structures and use malloc/free to work with them
  • Cython automatically performs runtime checks for common problems that pop up in C, such as out-of-bounds access on an array
  • Cython allows you to natively access Python structures that use the “buffer protocol” for direct access to data stored in memory
  • Cython code that does not touch Python objects can release the GIL, letting other threads run concurrently
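The buffer protocol is a standard Python mechanism, so it can be illustrated without Cython. In the sketch below, Python's built-in memoryview exposes the raw C doubles stored in an array.array without copying them; a Cython double[:] argument relies on the same protocol to reach the data.

```python
from array import array

# array("d") stores raw C doubles and exposes them through the buffer protocol
data = array("d", [1.0, 2.0, 3.0])

# memoryview gives zero-copy access to that same block of memory
view = memoryview(data)
print(view.format, view.itemsize, view.nbytes)  # d 8 24

view[0] = 10.0   # writes straight into the array's buffer
print(data[0])   # 10.0
```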

Cython limitations

  • Keep in mind that Cython isn’t a magic wand. It doesn’t automatically turn every instance of poky Python code into sizzling-fast C code
  • Little speedup for conventional Python code
  • When Cython encounters Python code that it can't translate completely into C, it turns that code into a series of C calls to Python's internals
  • Little speedup for native Python data structures
  • Cython code runs fastest when it is “pure C”