
Python Performance Testing: A Comprehensive Guide

The following guest post addresses how to improve your service's performance with Sentry and other application profilers for Python. Check out this post to learn more about application profiling and Sentry's upcoming mobile application profiling offering. We're making intentional investments in performance monitoring to make sure we give you all the context you need to solve what's urgent faster. For those who are new to profiling, application profilers augment performance monitoring data by capturing every single function call taking place in a given operation. As you'll read below, profiling data can be difficult to analyze on its own. Combined with Sentry's Error and Performance Monitoring, it gives developers insight not only into what's slow, but also into the function calls that may be the root cause.

Python aims to be a general-purpose programming language, and it succeeds: developers use it for data analysis, machine learning, web development, and system scripts and utilities. It's proven to be a flexible language that's easy to learn. But easy to learn doesn't mean trouble-free. As systems become more reliant on Python, they also become more reliant on its performance. That makes Python performance testing an important skill.

In this article, we’re going to look at Python performance testing. We’ll discuss what it means and then cover two popular performance testing and profiling tools.

Let’s get started.

What is Python Performance Testing? A Quick Definition

Before we discuss Python performance testing, let's agree on a quick definition.

Performance testing verifies and documents a system's speed, reliability, and scalability. Rather than focusing on features and functionality, it examines how software responds to load and how quickly it executes its tasks. This definition covers a lot of ground: it includes end-to-end testing of complex systems as well as testing individual applications and code modules.

Since we’re talking about Python performance testing, we’re going to focus on the latter. We’ll look at examples of testing individual Python code snippets, using techniques that also scale up to applications.

Our primary focus will be on measuring how quickly Python code executes and isolating bottlenecks.

How Do I Calculate the Speed of Execution in Python?

This is a common question. Measuring code speed manually isn’t easy and is error-prone.

A brute force approach using tools like datetime can give you an idea of how long it takes code to run, but where do you put the timings? Will they interfere with the code you’re trying to measure? Fortunately, there are Python tools for testing code execution speed. Depending on which you choose, you can simply test for elapsed time, or you can get a breakdown of individual function calls to look for bottlenecks if your code is running slowly.
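For comparison, here's roughly what that manual approach looks like (a minimal sketch; the placeholder comment stands in for whatever code you want to measure):

import datetime

start = datetime.datetime.now()

# ... the code you want to measure goes here ...

elapsed = datetime.datetime.now() - start
print(f"elapsed: {elapsed.total_seconds()} seconds")

You have to wrap every snippet by hand, and for very fast operations the timing calls themselves can skew the result. The dedicated tools below take care of that bookkeeping for you.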

timeit is a Python module for timing code. It's designed for "small bits" of code rather than entire programs, but it has useful features for running repeated tests and can load code and setup routines from strings. Its output is a list of execution times.

cProfile is Python's built-in profiler. It times Python code and works for small portions of code or entire programs. Its output is a detailed call tree with individual timings.

Let’s look at these tools in more detail.

Python Code Profiling

Let’s look at two Python code snippets. Here’s a for loop that creates pairs from two lists:

output = []
for a in (1, 3, 5, 7, 9, 11):
    for b in (2, 4, 6, 8, 10, 12):
        output.append((a, b))

Here’s a list comprehension that does the same thing:

output = [(a, b) for a in (1, 3, 5, 7, 9, 11) for b in (2, 4, 6, 8, 10, 12)]

Is one of them faster than the other? You probably already know the answer, and you may even know why. Finding out isn’t the point of this post. We’re only using this code as something to run with the tools and as a basis for comparing performance results.

Let’s compare their execution speed and see how they differ.

timeit

You can run timeit from the command line or call it from your code. Both interfaces work the same way and yield the same results, so we'll focus on the callable interface here. Let's dive right in by timing the for loop version of our code snippet:

import timeit

test_code = '''
output = []
for a in (1, 3, 5, 7, 9, 11):
    for b in (2, 4, 6, 8, 10, 12):
        output.append((a, b))
'''

if __name__ == "__main__":
    timings = timeit.timeit(test_code)
    print(timings)

Here’s the output:

1.7423516250000002

We start by importing the module. Then we save the code under test in a string so we can pass it to timeit. Using a triple-quoted string instead of squeezing the code onto one line keeps it readable.

Then we pass it to the timeit method.

Timeit executed the code one million times (the default) and returned the total time, in seconds, it took to run all of the iterations.
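Timeit can also load setup code from a separate string, as mentioned above. The setup runs once before the timed loop and is excluded from the measurement. Here's a minimal sketch of that variation (the tuple variables are our own additions for illustration):

import timeit

setup_code = '''
odds = (1, 3, 5, 7, 9, 11)
evens = (2, 4, 6, 8, 10, 12)
'''

test_code = '''
output = []
for a in odds:
    for b in evens:
        output.append((a, b))
'''

if __name__ == "__main__":
    # The setup string runs once and isn't counted in the timing.
    timings = timeit.timeit(test_code, setup=setup_code)
    print(timings)

This is handy when the code under test depends on data you don't want counted against it.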

Let’s look at the list comprehension next:

import timeit

if __name__ == "__main__":
    timings = timeit.timeit('[(a, b) for a in (1, 3, 5, 7, 9, 11) for b in (2, 4, 6, 8, 10, 12)]')
    print(timings)

Here’s the output:

1.125285292

Since the list comprehension is a single line, we can pass it directly to timeit. This doesn’t affect the performance; it just illustrates two ways to pass code for testing. So, it seems like the list comprehension is faster. That’s not a surprise, but let’s run a few more tests.

First, we can raise the number of runs to get a larger sample, by overriding the default number:

import timeit

if __name__ == "__main__":
    timings = timeit.timeit('[(a, b) for a in (1, 3, 5, 7, 9, 11) for b in (2, 4, 6, 8, 10, 12)]', number=2000000)
    print(timings)

The output roughly doubles to reflect the doubled iteration count, so it looks like the performance is linear:

2.21627325

This makes sense since we’re testing a very simple operation that doesn’t need a lot of memory or computing power.

If we make the same change to the for loop timing, we see a proportional increase.

3.466272334

So, we can see that the difference in performance stays consistent with more iterations. Let’s wrap up with one more variation in how you can use timeit to test execution time.

The repeat function runs your code in multiple sets:

import timeit

if __name__ == "__main__":
    timings = timeit.repeat('[(a, b) for a in (1, 3, 5, 7, 9, 11) for b in (2, 4, 6, 8, 10, 12)]', number=50000, repeat=5)
    print(timings)

This will run the list comprehension 50,000 times in each of five sets. Here's the output:

[0.067692583, 0.05575891699999999, 0.055653583000000006, 0.055781334000000016, 0.0553728340000000]

Timeit returned a list of five timings. So, it’s easy to get a large sampling of timings from a single test execution. It looks like the list comprehension is significantly faster than a for loop.
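If you want a per-loop figure, a common convention is to take the fastest of the repeated runs, since it's the one least disturbed by other activity on the machine, and divide by the iteration count. A quick sketch:

import timeit

if __name__ == "__main__":
    timings = timeit.repeat('[(a, b) for a in (1, 3, 5, 7, 9, 11) for b in (2, 4, 6, 8, 10, 12)]', number=50000, repeat=5)
    # Best of five runs, divided by the iteration count, gives seconds per loop.
    print(min(timings) / 50000)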

Can we see why?

cProfile

Like timeit, cProfile offers both command line and callable interfaces. We’ll use the callable interface here, too.

Here’s the for loop with cProfile:

import cProfile

test_code = '''
output = []
for a in (1, 3, 5, 7, 9, 11):
    for b in (2, 4, 6, 8, 10, 12):
        output.append((a, b))
'''

if __name__ == "__main__":
    cProfile.run(test_code)

Unlike timeit, cProfile prints its statistics to the terminal instead of returning a value, so we don't need to capture a return value and print it:

         39 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:2(<module>)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
       36    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

We have a breakdown of functions with how many times the code called them and how much time was spent in each. But the timings aren’t useful. A single iteration didn’t spend enough time in each function to register.

This illustrates a big difference between timeit and cProfile that goes beyond their output. Timeit runs a piece of code multiple times and gives you an aggregate timing. Timing events at a resolution of less than a microsecond is difficult without the right hardware and kernel. (Arguably microseconds aren’t completely accurate, either.) As a result, timing and comparing larger sets of operations yields more reliable results. That’s why the default number of iterations for timeit is so high.

cProfile runs the code once and provides a breakdown of what happened. If the time spent in a function is less than about a thousandth of a second, it displays as zero. You can overcome this by supplying cProfile with a custom timer, but unless you have the right hardware and kernel, this isn't the best approach.

What you can do is test a larger section of code or borrow from timeit’s approach and run it more than once.

Since we only have a few lines of code, we’ll try the latter.

Let’s add another loop that calls the code three million times:

import cProfile

test_code = '''
for x in range(3000000):
    output = []
    for a in (1, 3, 5, 7, 9, 11):
        for b in (2, 4, 6, 8, 10, 12):
            output.append((a, b))
'''

if __name__ == "__main__":
    cProfile.run(test_code)

By running the loop three million times, we moved the needle in our function timings:

108000003 function calls in 12.213 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    9.414    9.414   12.213   12.213 <string>:2(<module>)
        1    0.000    0.000   12.213   12.213 {built-in method builtins.exec}
108000000    2.799    0.000    2.799    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

cProfile’s output has six columns:

ncalls - the number of times the function was called
tottime - the total time spent in the function, excluding sub-functions
percall - tottime / ncalls
cumtime - the total time spent in the function, including sub-functions
percall - cumtime / primitive calls
filename:lineno(function) - the function's name and location

Note that cProfile lists the functions by name, not by time or by the order in which the test code called them. So, our test ran for about 12 seconds: roughly 9.4 seconds in the loop code itself (listed as <string> because we compiled the code under test from a string) and 2.8 seconds in list's append method. Let's look at the list comprehension in cProfile:

import cProfile

test_code = '''
for x in range(3000000):
    [(a, b) for a in (1, 3, 5, 7, 9, 11) for b in (2, 4, 6, 8, 10, 12)]
'''

if __name__ == "__main__":
        cProfile.run(testcode)

We added the same for loop. Here are the results:

3000003 function calls in 3.498 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.976    0.976    3.498    3.498 <string>:2(<module>)
  3000000    2.522    0.000    2.522    0.000 <string>:3(<listcomp>)
        1    0.000    0.000    3.498    3.498 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

This test ran for only about three and a half seconds! It spent less than a second in the outer loop, about 2.5 seconds in the comprehension itself, and it never called list's append method. Based on the profile, the list comprehension builds its result inside optimized built-in code instead of making 36 append calls per iteration. So, cProfile showed us why one approach performs better than the other.
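One more tip before we wrap up: cProfile orders its report by function name by default. If you'd rather see the most expensive calls first, cProfile.run accepts a sort argument. Here's a quick sketch that sorts the report by total time:

import cProfile

test_code = '''
for x in range(3000000):
    [(a, b) for a in (1, 3, 5, 7, 9, 11) for b in (2, 4, 6, 8, 10, 12)]
'''

if __name__ == "__main__":
    # Sort the printed report by tottime; 'cumtime' and other keys also work.
    cProfile.run(test_code, sort='tottime')

On larger programs, sorting the report is usually the quickest way to spot a bottleneck.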

Python Performance Testing

We’ve looked at Python performance testing and a few examples of using timeit and cProfile to test and profile two code snippets. We saw how easy it is to test code for execution time and bottlenecks using Python’s native tools.

Performance testing and profiling are important steps in the development process. But there’s more you can do to ensure your code is performing as well as it can. We’ll dive deeper into this in a future post.
