The other day I was made aware (thanks to Dinesh Dutt) of a small tip when working with and iterating over Pandas DataFrames.
If you've used Python and tools like Batfish (course here) or Suzieq (course here) to automate your network, then you may be familiar with DataFrames. For those of you who are new to them, a Pandas DataFrame is a 2D Python data structure that lets you work with your data via rows and columns, which eases the pain of handling large amounts of data in Python. You can think of it much like a spreadsheet, but 100% Python-based.
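To make the spreadsheet comparison concrete, here is a minimal sketch of a DataFrame. The interface names and values are made up purely for illustration and are not taken from Batfish or Suzieq output.

```python
import pandas as pd

# Hypothetical interface data, invented for this example.
interfaces = pd.DataFrame(
    {
        "interface": ["eth0", "eth1", "lo0"],
        "mtu": [1500, 9000, 65536],
        "active": [True, False, True],
    }
)

# Columns are selected by name, much like a spreadsheet column.
mtus = interfaces["mtu"]

# Rows can be filtered with a boolean condition.
active_interfaces = interfaces[interfaces["active"]]

print(interfaces.shape)
print(active_interfaces["interface"].tolist())
```

Each dictionary key becomes a column and each list position becomes a row.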
Great! So what was the tip?
To speed up your DataFrame iterations, use itertuples instead of iterrows.
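Before benchmarking, it may help to see what each method hands back on every iteration. The sketch below uses a tiny made-up DataFrame (`df_demo` and its columns are hypothetical) to show that the two methods expose the same data through different objects.

```python
import pandas as pd

# Hypothetical two-row frame, invented for this example.
df_demo = pd.DataFrame({"interface": ["eth0", "eth1"], "mtu": [1500, 9000]})

# iterrows() yields (index, Series) pairs; fields are looked up by column label.
rows_via_iterrows = [
    (row["interface"], row["mtu"]) for _, row in df_demo.iterrows()
]

# itertuples() yields namedtuples; fields are read as attributes.
rows_via_itertuples = [
    (row.interface, row.mtu) for row in df_demo.itertuples(index=False)
]

print(rows_via_iterrows)
print(rows_via_itertuples)
```

Passing `index=False` leaves the index out of each tuple. With the default `index=True`, the index is the first field of every tuple.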
Is it really that much faster? This called for a quick benchmark test.
# Return all the network interfaces using Batfish.
df = bfq.interfaceProperties().answer().frame()
# Some imports
from timeit import timeit
from rich import print
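# Define the 2 iteration methods
def loop_via_itertuples():
    for _ in df.itertuples():
        continue


def loop_via_iterrows():
    for _, _ in df.iterrows():
        continue


# Helper used when printing the results (assumed implementation:
# the rounded ratio of the two timings).
def time_increase(slower, faster):
    return round(slower / faster)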
With this in place, we can quickly benchmark the two iteration methods by running each function 10 times and collecting the total time taken. Like so:
time_iterrows = timeit(loop_via_iterrows, number=10)
time_itertuples = timeit(loop_via_itertuples, number=10)
Finally, we print the results...
print(
    "===\n"
    f"Execution Time for df.iterrows = {time_iterrows}\n"
    f"Execution Time for df.itertuples = {time_itertuples}\n"
    f"Result: itertuples is {time_increase(time_iterrows, time_itertuples)} times faster than iterrows!\n"
    "===\n"
)
===
Execution Time for df.iterrows = 0.9970452000006844
Execution Time for df.itertuples = 0.07641729999977542
Result: itertuples is 13 times faster than iterrows!
===
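If you don't have a Batfish snapshot to hand, the same comparison can be sketched against a synthetic DataFrame. Everything below is an assumption made for illustration: the column names, the row count and the helper names are invented, and the timings you get will depend on your machine, your pandas version and the shape of your data.

```python
from timeit import timeit

import pandas as pd

# Synthetic stand-in for the Batfish frame; names and sizes are made up.
N_ROWS = 5_000
synthetic_df = pd.DataFrame(
    {
        "interface": [f"eth{i}" for i in range(N_ROWS)],
        "mtu": [1500] * N_ROWS,
        "active": [i % 2 == 0 for i in range(N_ROWS)],
    }
)


def count_via_iterrows(frame):
    """Walk every row with iterrows() and return how many rows were seen."""
    return sum(1 for _ in frame.iterrows())


def count_via_itertuples(frame):
    """Walk every row with itertuples() and return how many rows were seen."""
    return sum(1 for _ in frame.itertuples())


# Time three full passes over the frame with each method.
seconds_iterrows = timeit(lambda: count_via_iterrows(synthetic_df), number=3)
seconds_itertuples = timeit(lambda: count_via_itertuples(synthetic_df), number=3)

print(f"iterrows:   {seconds_iterrows:.4f}s")
print(f"itertuples: {seconds_itertuples:.4f}s")
```

Counting the rows rather than using a bare `continue` gives a cheap sanity check that both loops visited the whole frame.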
13 times faster! So yes, it's certainly faster. And to answer the final question of why, the TL;DR is (full details here):
The reason iterrows() is slower than itertuples() is due to iterrows() performing a lot of type checks in the lifetime of its call.
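One visible side of this extra work is that iterrows() packs every row into its own Series, whereas itertuples() yields plain namedtuples. The sketch below, using a made-up one-row frame (`df_types` and its columns are hypothetical), shows the resulting difference in types. It illustrates one source of per-row overhead and is not a full profile of either method.

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column.
df_types = pd.DataFrame({"mtu": [1500], "load": [0.25]})

_, series_row = next(df_types.iterrows())
tuple_row = next(df_types.itertuples(index=False))

# iterrows() builds a Series per row, so mixed numeric columns are
# coerced to a single common dtype for that row.
print(type(series_row).__name__, series_row.dtype, type(series_row["mtu"]))

# itertuples() returns a namedtuple and keeps each value's own type.
print(type(tuple_row).__name__, type(tuple_row.mtu), type(tuple_row.load))
```

Here the integer `mtu` comes back as a float from iterrows() because the row is stored in a single float64 Series, while itertuples() leaves it as an integer.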
So there we have it: when iterating over Pandas DataFrames, always use itertuples rather than iterrows.