Pandas DataFrames - iterrows vs itertuples


The other day I was made aware (thanks to Dinesh Dutt) of a small tip when working with and iterating over Pandas DataFrames.

If you've used Python and tools like Batfish (course here) or Suzieq (course here) to automate your network, then you may be familiar with DataFrames. For those of you who are new to them, a Pandas DataFrame is a,

2D Python data structure that allows you to work with your data via rows and columns, which eases the pain when working with large amounts of data in Python. You can think of it much like a spreadsheet, but 100% Python-based.
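As a minimal sketch of that idea, here is a small DataFrame built by hand. The interface names and values are made up for illustration and are not Batfish output.

```python
import pandas as pd

# Hypothetical interface data, invented for this example.
df = pd.DataFrame(
    {
        "interface": ["eth0", "eth1", "eth2"],
        "mtu": [1500, 9000, 1500],
        "active": [True, True, False],
    }
)

# Filter rows on a column value, much like filtering a spreadsheet.
jumbo = df[df["mtu"] > 1500]

# Aggregate a column: count how many interfaces are active.
active_count = int(df["active"].sum())

print(jumbo)
print(active_count)
```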

Great! So what was the tip?

To speed up your DataFrame iterations, use itertuples instead of iterrows.
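In practice the swap is mostly a change in how each row is read. The sketch below uses a tiny made-up DataFrame (the column names are assumptions, not Batfish columns) to show both styles side by side.

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame({"interface": ["eth0", "eth1"], "mtu": [1500, 9000]})

# iterrows yields (index, Series) pairs; fields are read by column label.
via_iterrows = [(row["interface"], row["mtu"]) for _, row in df.iterrows()]

# itertuples yields namedtuples; fields are read as attributes.
# index=False leaves the DataFrame index out of each tuple.
via_itertuples = [(row.interface, row.mtu) for row in df.itertuples(index=False)]

print(via_iterrows)
print(via_itertuples)
```

Both loops collect the same values. Only the access style differs.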

Is it really that much faster? This called for a quick benchmark test.

# Some imports
from timeit import timeit
from rich import print

# Return all the network interfaces using Batfish.
# Assumes bfq comes from an already initialised pybatfish session with a snapshot loaded.
df = bfq.interfaceProperties().answer().frame()
# Define the 2 iteration methods
def loop_via_itertuples():
    for _ in df.itertuples():
        continue

def loop_via_iterrows():
    for _, _ in df.iterrows():
        continue

With this in place, we can quickly benchmark the two iteration methods by running each function 10 times and collecting the total time taken, like so:

time_iterrows = timeit(loop_via_iterrows, number=10)
time_itertuples = timeit(loop_via_itertuples, number=10)

Finally, we print the results...

# The speed-up is the ratio of the two timings, rounded to a whole number.
print(
    "===\n"
    f"Execution Time for df.iterrows = {time_iterrows}\n"
    f"Execution Time for df.itertuples = {time_itertuples}\n"
    f"Result: itertuples is {time_iterrows / time_itertuples:.0f} times faster than iterrows!\n"
    "===\n"
)

===
Execution Time for df.iterrows = 0.9970452000006844
Execution Time for df.itertuples = 0.07641729999977542
Result: itertuples is 13 times faster than iterrows!
===

13 times faster! So yes, it's certainly faster. As for the final question of why, the TL;DR is (full details here),

The reason iterrows() is slower than itertuples() is due to iterrows() performing a lot of type checks in the lifetime of its call.
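One visible piece of that per-row work is that iterrows wraps every row in a pandas Series, which also means dtypes are not preserved across the row. The sketch below does not show the type checks themselves. It only illustrates the extra object that iterrows builds on each step, using made-up column names.

```python
import pandas as pd

# Hypothetical data: one integer column and one float column.
df = pd.DataFrame({"packets": [1, 2], "ratio": [0.5, 1.5]})

_, first_row = next(df.iterrows())
first_tuple = next(df.itertuples(index=False))

# iterrows builds a Series per row, so the integer column is upcast
# to float64 to share a single dtype with the float column.
print(type(first_row).__name__, first_row.dtype)

# itertuples returns a lightweight namedtuple and keeps each value's own type.
print(type(first_tuple.packets).__name__, type(first_tuple.ratio).__name__)
```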

So there we have it. When iterating over Pandas DataFrames, always use itertuples rather than iterrows.
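If you would like to try the comparison without a Batfish snapshot, here is a minimal sketch that uses a synthetic DataFrame as a stand-in. The row count and column names are arbitrary assumptions, and the exact speed-up will vary with your data and pandas version.

```python
from timeit import timeit

import pandas as pd

# Synthetic stand-in for the Batfish frame: 2,000 rows, 2 integer columns.
df = pd.DataFrame({"a": range(2_000), "b": range(2_000)})


def loop_via_itertuples():
    for _ in df.itertuples():
        pass


def loop_via_iterrows():
    for _ in df.iterrows():
        pass


# Total time for 10 full passes over the DataFrame with each method.
time_iterrows = timeit(loop_via_iterrows, number=10)
time_itertuples = timeit(loop_via_itertuples, number=10)

speedup = time_iterrows / time_itertuples
print(f"itertuples is roughly {speedup:.1f} times faster than iterrows")
```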
