
Big Data - Time is of the Essence

  • mayasandler
  • May 10, 2021
  • 3 min read


I am not the best Python programmer, and I usually learn as I go, according to what I need to do. Therefore, sometimes I have a script that runs correctly but is not the most optimized, time- and memory-wise. Usually that is enough to get the job done, but not when dealing with big data.



I recently needed to work on a large dataset (almost 20 million rows) and had an experience that changed the way I write code. The execution took hours, and the run demanded so much CPU that my computer could not handle it and crashed! I did a lot of reading and realized that when dealing with big data it is highly important to cut both the execution time and the memory usage as much as possible. I had to change my script again and again to make it better, and came up with a few insights I wanted to share here.

Measuring the time to run a script in Python:

How will I know whether my code is efficient? I needed to measure the execution time. There are several ways to do that, but I decided to use the time module and its time.time() function, like so:
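
(The snippet below is just a minimal sketch of the pattern; the loop in the middle stands in for whatever code you want to measure.)

    import time

    start = time.time()

    # the code being measured goes here; this loop is only a stand-in
    total = sum(i * i for i in range(10_000_000))

    end = time.time()
    print('Execution time: {:.2f} seconds'.format(end - start))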


Using iloc[] and repeatedly used parameters:


A good example is when I wanted to create several calculated columns. One of them was the revenue for each trip (i.e., row). The script needed to run over more than 18 million rows, fetch several values from each row, calculate the revenue, and save all the results to a new column. On a big dataset this has to be extremely efficient, and I learned that the hard way.


My original script was:
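
In rough strokes it was a row-by-row loop like the sketch below (df is the trips data frame, already loaded, and the fare, tip and tolls column names are placeholders for illustration, not the actual schema):

    # Loop over every row and look each value up by column name, again and again.
    revenues = []
    for i in range(len(df)):
        fare = df.loc[i, 'fare_amount']      # each lookup resolves the column name anew
        tip = df.loc[i, 'tip_amount']
        tolls = df.loc[i, 'tolls_amount']
        revenues.append(fare + tip + tolls)
    df['revenue'] = revenues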

This process iterates over all the rows and has to look up the correct value by column name several times inside the loop, over millions of data rows. As a result, the execution took more than 60 minutes, and my computer was gasping, using a lot of CPU, and crashing each time!


After changing the script to use iloc[] and pre-fetched parameters, the new, more efficient version looked like this:
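
Again as a sketch with the same placeholder column names, the idea is to resolve the column positions once, keep them as parameters, and access every cell by position with iloc[]:

    # Resolve the column positions once, outside the loop.
    fare_col = df.columns.get_loc('fare_amount')
    tip_col = df.columns.get_loc('tip_amount')
    tolls_col = df.columns.get_loc('tolls_amount')

    revenues = []
    for i in range(len(df)):
        revenues.append(df.iloc[i, fare_col] + df.iloc[i, tip_col] + df.iloc[i, tolls_col])
    df['revenue'] = revenues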

This process was far more efficient and took only 9 minutes! Yay!! :)


Using iloc[] and apply() instead of locating columns by name:


In another example, I needed to change the data type of several columns in each file of my dataset. Each file was medium sized (202,349 rows, 12 columns), and I had 5-6 files per year, times 5 years. Knowing that Pandas works better on arrays, I assumed that if I ran the conversion on a range of columns with the apply() function, I would get a better result than locating each column and changing it one by one.


This is my original code, going over the columns one by one and changing the data type:
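
A sketch of that version (the column names and target dtypes are placeholders, not the real schema):

    # Convert each column by name, one line (and one pass) per column.
    df['passenger_count'] = df['passenger_count'].astype('int32')
    df['trip_distance'] = df['trip_distance'].astype('float32')
    df['fare_amount'] = df['fare_amount'].astype('float32')
    df['tip_amount'] = df['tip_amount'].astype('float32')
    df['tolls_amount'] = df['tolls_amount'].astype('float32')
    df['total_amount'] = df['total_amount'].astype('float32')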

This execution took 29.46 seconds. I had more than 20 files still to go, and this is only one change :)


However, when using iloc[], which is also fast, to select a range of columns and then converting them all with apply(), the execution time decreased significantly:
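
A sketch of the faster version (the 4:10 column range and the float32 target are placeholders; depending on your pandas version it is worth checking df.dtypes afterwards to confirm the new types stuck):

    # Take a positional slice of columns with iloc[] and convert them all in one apply() call.
    converted = df.iloc[:, 4:10].apply(lambda col: col.astype('float32'))
    df[df.columns[4:10]] = converted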

This execution took 8.87 seconds. What a difference for a single medium-sized file!


Next, because I had several years of data and each year comprised several files, I wanted to check what the most efficient way to execute it would be: (1) change the data types in each data frame first and then concatenate the data frames, or (2) first concatenate the data frames and then change the data types in the big data frame. I measured the time for each option, and these are my findings:

Read each file, change the data types column by column, and then concatenate into the big data frame:
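
A sketch of this option, with placeholder file paths, column names and dtypes:

    import pandas as pd

    file_paths = ['trips_2016_1.csv', 'trips_2016_2.csv']   # placeholder paths, one per file

    frames = []
    for path in file_paths:
        df = pd.read_csv(path)
        df['passenger_count'] = df['passenger_count'].astype('int32')
        df['trip_distance'] = df['trip_distance'].astype('float32')
        df['fare_amount'] = df['fare_amount'].astype('float32')
        df['tip_amount'] = df['tip_amount'].astype('float32')
        frames.append(df)
    big_df = pd.concat(frames, ignore_index=True)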

54.73 seconds, but the code is not pretty...


Read the files, concatenate the data frames, and then change the data types on the big data frame:
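
A sketch of the second option, reusing the same placeholder paths and column range:

    # Read everything, concatenate once, then convert once on the big data frame.
    frames = [pd.read_csv(path) for path in file_paths]
    big_df = pd.concat(frames, ignore_index=True)
    big_df[big_df.columns[4:10]] = big_df.iloc[:, 4:10].apply(lambda col: col.astype('float32'))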

53.14 seconds and the code is very readable.


Read the files, change the data type in each data frame, and then concatenate:
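
And a sketch of the third option:

    # Convert each frame with the iloc[]/apply() trick, then concatenate.
    frames = []
    for path in file_paths:
        df = pd.read_csv(path)
        df[df.columns[4:10]] = df.iloc[:, 4:10].apply(lambda col: col.astype('float32'))
        frames.append(df)
    big_df = pd.concat(frames, ignore_index=True)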

52.74 seconds, but it is less readable than the second option.


Surprisingly, the execution times are not that different! So I will choose according to the length and readability of the code, which makes the second option, concatenating the data frames and then changing the data types, the best one.

By the way, using this solution on the full dataset (more than 18 million rows) had its effect: the execution time of changing the data types for the whole database dropped to a third, from 15 minutes down to 5!


Conclusions:


When dealing with big data, it is better to use what we know about how Python and Pandas work, and to always measure our processes so we know what the best solution is for the situation.

 
 
 
