High Performance Big Data Analysis Using NumPy, Numba & Python Asynchronous Programming

Introduction

A couple of months ago a client of mine asked me the following question: “What is the faster data structure object in Python for Big Data analysis today?” I get questions like this one all the time. Some of them are not easy to solve at all and it takes some time to find the right optimized solutions for them. In general, I do this for fun on the weekends and at night.

At that time, based on this question, my first simple answer was the Python List object. I used the List object in many Data Science projects including Data Pipeline and Extract-Transform-Load (ETL) production system. Then the following questions came to mind: Can I use the List object for data manipulation and analysis of millions or billions of rows? What about if I divide a Data Science project into small tasks and run them asynchronously using the latest Python asyncio library? Based on these questions, I decided to spend some time and find out some practical solutions for Big Data analysis using Pythion Data Ecosystem libraries. To make it simple to understand and find the results quickly, the program will calculate the Arithmetic Mean, Median and Sample Standard Deviation values from a float one dimensional NumPy array. For the programs runtime comparison I’ll use the following libraries:

NumPy – The fundamental Pyrhon package for scientific computing.

Numba – Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and FORTRAN, without having to switch languages or Python interpreters.

asyncio – Python asynchronous programming library.

Why use NumPy?

As the NumPy website said: NumPy is the fundamental package for scientific computing with Python. It provides a powerful N-dimensional array object and sophisticated (broadcasting) functions. With NumPy imported library the Python programs performance better with a high execution speed, more convenient for consistency and a lot of numerical and matrix functionalities. Maybe because of this there isn’t a reason to use Python List objects anymore? It’s important to mention that many Python Data Ecosystem libraries are built on top of NumPy like Pandas, SciPy, Matplotlib, etc.

Used Python Algorithms

I have decided to provide simple calculations of the Arithmetic Mean, Median and Sample Standard Deviation to show the Python programs execution time (runtime) and compare them. The test data will be generated using one dimensional NumPy array with float 64-bit data type. The following three Python algorithms were implemented and analyzed:

NumPy array
NumPy array with asyncio asynchronous library
NumPy array with Numba library

NumPy Array Program

Let’s look at the code for each algorithm. Every algorithm has its own class object and a main calling program to follow the Object-Oriented Programming (OOP) methodology. This class object contains the following five methods:

calculate_number_observation() – calculate number of observation
calculate_arithmetic_mean() – calculate arithmetic mean
calculate_median() – calculate median
calculate_sample_standard_deviation() – calculate sample standard deviation
print_exception_message() – print exception message if occurred

Listing 1 shows the summary statistics class object code using NumPy array only.

[codesyntax lang=”python”]

import sys

import traceback

import time

from math import sqrt



class SummaryStatistics(object):

   """

   calculate number of observations, arithmetic mean, median

   and sample standard deviation using standard procedures

   """

   def __init__(self):

       pass

       

   def calculate_number_observation(self, one_dimensional_array):        

       """

       calculate  number of observations

       :param one_dimensional_array: numpy one dimensional array

       :return number of observations value

       """

       number_observation = 0

       try:

           number_observation = one_dimensional_array.size   

       except Exception:

           self.print_exception_message()

       return number_observation



   def calculate_arithmetic_mean(self, one_dimensional_array, number_observation):    

       """

       calculate arithmetic mean

       :param one_dimensional_array: numpy one dimensional array

       :param number_observation: number of observations

       :return arithmetic mean value

       """

       arithmetic_mean = 0.0

       try:

           sum_result = 0.0

           for i in range(number_observation):       

               sum_result += one_dimensional_array[i]    

           arithmetic_mean = sum_result / number_observation

       except Exception:

           self.print_exception_message()

       return arithmetic_mean



   def calculate_median(self, one_dimensional_array, number_observation):      

       """

       calculate  median

       :param one_dimensional_array: numpy one dimensional array

       :param number_observation: number of observations

       :return median value

       """

       median = 0.0

       try:

           one_dimensional_array.sort()    

           half_position = number_observation // 2

           if not number_observation % 2:

               median = (one_dimensional_array[half_position - 1] + one_dimensional_array[half_position]) / 2.0

           else:

               median = one_dimensional_array[half_position]        

       except Exception:

           self.print_exception_message()

       return median



   def calculate_sample_standard_deviation(self, one_dimensional_array, number_observation, arithmetic_mean):    

       """

       calculate sample standard deviation

       :param one_dimensional_array: numpy one dimensional array

       :param number_observation:  number of observations

       :param arithmetic_mean: arithmetic mean value

       :return sample standard deviation value

       """

       sample_standard_deviation = 0.0

       try:

           sum_result = 0.0

           for i in range(number_observation):                   

               sum_result += pow((one_dimensional_array[i] - arithmetic_mean), 2)            

           sample_variance = sum_result / (number_observation - 1)            

           sample_standard_deviation = sqrt(sample_variance)        

       except Exception:

           self.print_exception_message()

       return sample_standard_deviation



   def print_exception_message(self, message_orientation = "horizontal"):

       """

       print full exception message

       :param message_orientation: horizontal or vertical

       :return none

       """

       try:

           exc_type, exc_value, exc_tb = sys.exc_info()            

           file_name, line_number, procedure_name, line_code = traceback.extract_tb(exc_tb)[-1]       

           time_stamp = " [Time Stamp]: " + str(time.strftime("%Y-%m-%d %I:%M:%S %p"))

           file_name = " [File Name]: " + str(file_name)

           procedure_name = " [Procedure Name]: " + str(procedure_name)

           error_message = " [Error Message]: " + str(exc_value)        

           error_type = " [Error Type]: " + str(exc_type)                    

           line_number = " [Line Number]: " + str(line_number)                

           line_code = " [Line Code]: " + str(line_code)

           if (message_orientation == "horizontal"):

               print( "An error occurred:{};{};{};{};{};{};{}".format(time_stamp, file_name, procedure_name, error_message, error_type, line_number, line_code))

           elif (message_orientation == "vertical"):

               print( "An error occurred:\n{}\n{}\n{}\n{}\n{}\n{}\n{}".format(time_stamp, file_name, procedure_name, error_message, error_type, line_number, line_code))

           else:

               pass                    

       except Exception:

           pass

[/codesyntax]

Listing 1. Summary statistics class object code using NumPy array

Listing 2 shows the summary statistics main program. As you can see this program creates the summary_statistics class object and then the methods that they are called. The NumPy library needs to be imported to generate the one dimensional array. The program runtime is calculated by using the clock() method of the time module.

[codesyntax lang=”python”]

 import time

import numpy as np



from class_summary_statistics import SummaryStatistics



def main(one_dimensional_array):

   

#     create summary statistics class object

   summary_statistics = SummaryStatistics()

   

#     calculate number of observation

   number_observation = summary_statistics.calculate_number_observation(one_dimensional_array)

   print("Number of Observation: {} ".format(number_observation))

   

#     calculate arithmetic mean

   arithmetic_mean = summary_statistics.calculate_arithmetic_mean(one_dimensional_array, number_observation)

   print("Arithmetic Mean: {} ".format(arithmetic_mean))

   

#     calculatte median

   median = summary_statistics.calculate_median(one_dimensional_array, number_observation)

   print("Median: {} ".format(median))

   

#     calculate sample standard deviation

   sample_standard_deviation = summary_statistics.calculate_sample_standard_deviation(one_dimensional_array, number_observation, arithmetic_mean)

   print("Sample Standard Deviation: {} ".format(sample_standard_deviation))



if __name__ == '__main__':

   start_time = time.clock()  

   one_dimensional_array = np.arange(100000000, dtype=np.float64)        

   main(one_dimensional_array)

   end_time = time.clock()

   print("Program Runtime: {} seconds".format(round(end_time - start_time, 1)))

[/codesyntax]

Listing 2. Summary statistics main program code using NumPy array

With one million rows the summary statistics main program will show the result below:

Number of Observation: 1000000

Arithmetic Mean: 499999.5

Median: 499999.5

Sample Standard Deviation: 288675.27893349814

Program Runtime: 1.3 seconds

Numpy Array with asyncio Library

Listing 3 shows the summary statistics asyncio class object code with Python asyncio asynchronous library. Note that the main() method starts the event loop asynchronous process with the calculate_number_observation() as the first and unique task.

[codesyntax lang=”python”]

import sys

import time

import traceback

import asyncio

from math import sqrt



class SummaryStatisticsAsyncio(object):

   """

   calculate number of observations, arithmetic mean, median

   and sample standard deviation using asyncio library

   """

   def __init__(self):

       pass

   

   async def calculate_number_observation(self, one_dimensional_array):    

       """

       calculate  number of observations

       :param one_dimensional_array: numpy one dimensional array

       :return none

       """

       try:

           print('start calculate_number_observation() procedure')   

           await asyncio.sleep(0)

           number_observation = one_dimensional_array.size

           print("Number of Observation: {} ".format(number_observation))    

           await self.calcuate_arithmetic_mean(one_dimensional_array, number_observation)

           print("finished calculate_number_observation() procedure")   

       except Exception:

           self.print_exception_message()

           

   async def calcuate_arithmetic_mean(self, one_dimensional_array, number_observation):    

       """

       calculate arithmetic mean

       :param one_dimensional_array: numpy one dimensional array

       :param number_observation: number of observations

       :return none

       """

       try:

           print('start calcuate_arithmetic_mean() procedure')   

           await self.calculate_median(one_dimensional_array, number_observation)

           sum_result = 0.0

           await asyncio.sleep(0)

           for i in range(number_observation):       

               sum_result += one_dimensional_array[i]    

           arithmetic_mean = sum_result / number_observation

           print("Arithmetic Mean: {} ".format(arithmetic_mean))    

           await self.calculate_sample_standard_deviation(one_dimensional_array, number_observation, arithmetic_mean)

           print("finished calcuate_arithmetic_mean() procedure")   

       except Exception:

           self.print_exception_message()

           

   async def calculate_median(self, one_dimensional_array, number_observation):      

       """

       calculate  median

       :param one_dimensional_array: numpy one dimensional array

       :param number_observation: number of observations

       :return none

       """

       try:

           print('starting calculate_median()')   

           await asyncio.sleep(0)

           one_dimensional_array.sort()    

           half_position = number_observation // 2

           if not number_observation % 2:

               median = (one_dimensional_array[half_position - 1] + one_dimensional_array[half_position]) / 2.0

           else:

               median = one_dimensional_array[half_position]        

           print("Median: {} ".format(median))

           print("finished calculate_median() procedure")   

       except Exception:

           self.print_exception_message()

           

   async def calculate_sample_standard_deviation(self, one_dimensional_array, number_observation, arithmetic_mean):    

       """

       calculate sample standard deviation

       :param one_dimensional_array: numpy one dimensional array

       :param number_observation:  number of observations

       :param arithmetic_mean: arithmetic mean value

       :return none

       """

       try:

           print('start calculate_sample_standard_deviation() procedure')   

           await asyncio.sleep(0)

           sum_result = 0.0

           for i in range(number_observation):                   

               sum_result += pow((one_dimensional_array[i] - arithmetic_mean), 2)            

           sample_variance = sum_result / (number_observation - 1)            

           sample_standard_deviation = sqrt(sample_variance)        

           print("Sample Standard Deviation: {} ".format(sample_standard_deviation))

           print("finished calculate_sample_standard_deviation() procedure")   

       except Exception:

           self.print_exception_message()



   def print_exception_message(self, message_orientation = "horizontal"):

           """

           print full exception message

           :param message_orientation: horizontal or vertical

           :return none

           """

           try:

               exc_type, exc_value, exc_tb = sys.exc_info()            

               file_name, line_number, procedure_name, line_code = traceback.extract_tb(exc_tb)[-1]       

               time_stamp = " [Time Stamp]: " + str(time.strftime("%Y-%m-%d %I:%M:%S %p"))

               file_name = " [File Name]: " + str(file_name)

               procedure_name = " [Procedure Name]: " + str(procedure_name)

               error_message = " [Error Message]: " + str(exc_value)        

               error_type = " [Error Type]: " + str(exc_type)                    

               line_number = " [Line Number]: " + str(line_number)                

               line_code = " [Line Code]: " + str(line_code)

               if (message_orientation == "horizontal"):

                   print( "An error occurred:{};{};{};{};{};{};{}".format(time_stamp, file_name, procedure_name, error_message, error_type, line_number, line_code))

               elif (message_orientation == "vertical"):

                   print( "An error occurred:\n{}\n{}\n{}\n{}\n{}\n{}\n{}".format(time_stamp, file_name, procedure_name, error_message, error_type, line_number, line_code))

               else:

                   pass                    

           except Exception:

               pass

   

   def main(self, one_dimensional_array):    

       """

       start the event loop asynchronous process

       :param one_dimensional_array: numpy one dimensional array

       """

       try:

           ioloop = asyncio.get_event_loop()

           tasks = [ioloop.create_task(self.calculate_number_observation(one_dimensional_array))]

           wait_tasks = asyncio.wait(tasks)

           ioloop.run_until_complete(wait_tasks)

           ioloop.close()

       except Exception:

           self.print_exception_message()

[/codesyntax]

Listing 3. Summary statistics asyncio class object code with Python asynchronous library

The summary statistics asyncio main program is shown in Listing 4. As you can see the main() method is the only one to be called.

[codesyntax lang=”python”]

import time

import numpy as np



from class_summary_statistics_asyncio import SummaryStatisticsAsyncio



def main(one_dimensional_array):

   

#     create summary statistics asyncio class object

   summary_statistics_asyncio = SummaryStatisticsAsyncio()



#     call main method

   summary_statistics_asyncio.main(one_dimensional_array)



if __name__ == '__main__':

   start_time = time.clock()  

   one_dimensional_array = np.arange(1000000000, dtype=np.float64)        

   main(one_dimensional_array)

   end_time = time.clock()

   print("Program Runtime: {} seconds".format(round(end_time - start_time, 1)))

[/codesyntax]

Listing 4. Summary statistics asyncio main program code with Python asynchronous library

With one billion rows the summary statistics asyncio main program will show the result below. I have included the printing of the start/finish procedures to show how the asynchronous process works in this particular case.

start calculate_number_observation() procedure

Number of Observation: 1000000000

start calcuate_arithmetic_mean() procedure

starting calculate_median()

Median: 499999.5

finished calculate_median() procedure

Arithmetic Mean: 499999.5

start calculate_sample_standard_deviation() procedure

Sample Standard Deviation: 288675.27893349814

finished calculate_sample_standard_deviation() procedure

finished calcuate_arithmetic_mean() procedure

finished calculate_number_observation() procedure

Program Runtime: 1504.4 seconds

Numpy Array with Numba Library Program

The summary statistics class object code with Numba library is shown in Listing 5. Check the Numba GitHub repository to learn more about this Open Source NumPy-aware optimizing compiler for Python. It’s important to mention that Numba supports CUDA GPU programming. As you can see the debugging code has been removed to run the program in compile mode.

[codesyntax lang=”python”]

import time

from numba import jit

import numpy as np

from math import sqrt



class SummaryStatisticsNumba(object):

   """

   calculate number of observations, arithmetic mean, median

   and sample standard deviation using numba library

   """

   def __init__(self):

       pass

       

   @jit

   def calculate_number_observation(self, one_dimensional_array):    

       """

       calculate  number of observations

       :param one_dimensional_array: numpy one dimensional array

       :return number of observations value

       """        

       number_observation = one_dimensional_array.size

       return number_observation

   

   @jit

   def calcuate_arithmetic_mean(self, one_dimensional_array, number_observation):    

       """

       calculate arithmetic mean

       :param one_dimensional_array: numpy one dimensional array

       :param number_observation: number of observations

       :return arithmetic mean value

       """

       sum_result = 0.0

       for i in range(number_observation):       

           sum_result += one_dimensional_array[i]    

       arithmetic_mean = sum_result / number_observation

       return arithmetic_mean

   

   @jit

   def calculate_median(self, one_dimensional_array, number_observation):      

       """

       calculate  median

       :param one_dimensional_array: numpy one dimensional array

       :param number_observation: number of observations

       :return median value

       """

       one_dimensional_array.sort()    

       half_position = number_observation // 2

       if not number_observation % 2:

           median = (one_dimensional_array[half_position - 1] + one_dimensional_array[half_position]) / 2.0

       else:

           median = one_dimensional_array[half_position]        

       return median

   

   @jit

   def calculate_sample_standard_deviation(self, one_dimensional_array, number_observation, arithmetic_mean):    

       """

       calculate sample standard deviation

       :param one_dimensional_array: numpy one dimensional array

       :param number_observation:  number of observations

       :param arithmetic_mean: arithmetic mean value

       :return sample standard deviation value

       """

       sum_result = 0.0

       for i in range(number_observation):                   

           sum_result += pow((one_dimensional_array[i] - arithmetic_mean), 2)            

       sample_variance = sum_result / (number_observation - 1)            

       sample_standard_deviation = sqrt(sample_variance)        

       return sample_standard_deviation

[/codesyntax]

Listing 5. Summary statistics class object code with Numba library

The summary statistics Numba main program is shown in Listing 6.

[codesyntax lang=”python”]

import time

import numpy as np



from class_summary_statistics_numba import SummaryStatisticsNumba



def main(one_dimensional_array):

   

#     create class summary statistics_numba class object

   class_summary_statistics_numba = SummaryStatisticsNumba()

   

#     calculate number of observation

   number_observation = class_summary_statistics_numba.calculate_number_observation(one_dimensional_array)

   print("Number of Observation: {} ".format(number_observation))

       

#     calculate arithmetic mean

   arithmetic_mean = class_summary_statistics_numba.calcuate_arithmetic_mean(one_dimensional_array, number_observation)

   print("Arithmetic Mean: {} ".format(arithmetic_mean))

   

#     calculatte median

   median = class_summary_statistics_numba.calculate_median(one_dimensional_array, number_observation)

   print("Median: {} ".format(median))

   

#     calculate sample standard deviation

   sample_standard_deviation = class_summary_statistics_numba.calculate_sample_standard_deviation(one_dimensional_array, number_observation, arithmetic_mean)

   print("Sample Standard Deviation: {} ".format(sample_standard_deviation))

   

if __name__ == '__main__':

   start_time = time.clock()  

   one_dimensional_array = np.arange(1000000000, dtype=np.float64)        

   main(one_dimensional_array)

   end_time = time.clock()

   print("Program Runtime: {} seconds".format(round(end_time - start_time, 1)))

[/codesyntax]

Listing 6. Summary statistics Numba main program

With one billion rows the summary statistics Numba main program will show the result below:

Number of Observation: 1000000000

Arithmetic Mean: 499999999.067109

Median: 499999999.5

Sample Standard Deviation: 288675134.73899055

Program Runtime: 40.2 seconds

It’s a very exciting result to see the calculations finished in 40.2 seconds with one billion rows in the NumPy array. I think it’s time to use NumPy array with Numba library wherever possible in Big Data Science projects. Some research and testing may be require for specific cases.

Laptop Hardware Parameters

Here are the laptop hardware parameters used to run the Python programs:

Windows 10 64-bit OS
Intel Core™ i7-2670QM CPU @ 2.20 GHz
16 GB RAM

Programs Runtime Comparison

Table 1 shows the programs runtime execution for 1 million, 10 million, 100 million and 1 billion rows.

Number of Rows	Used Python Algorithms
Number of Rows	NumPy array	NumPy array with asyncio asynchronous library	NumPy array with Numba library
1 million	1.3 seconds	1.4 seconds	0.9 seconds
10 million	12.8 seconds	12.7 seconds	1.1 seconds
100 million	2.16 minutes	2.17 minutes	4.4 seconds
1 billion	22.66 minutes	22.64 minutes	40.2 seconds

Table 1: Programs Runtime Comparison

Conclusions

There is not a practically difference between using NumPy array and NumPy array with asyncio asynchronous library. Although these calculations are not totally sufficient to prove the performances of the asyncio asynchronous library in Python Data Science projects therefore more research may be required to find the right applications.
The combination of NumPy array with Numba library provides the best performance for data manipulation and analysis compared with NumPy array and NumPy array with asyncio asynchronous library. I was very impressed to see 40.2 seconds execution time with one billion rows in the Numpy array. I wonder if R programs can do that today! If not, maybe it’s time for R programmers to learn Python and its Data Ecosystem libraries. One more thing, make sure to write Python programs using OOP methodology for real production environments with Continue Integration software development and deployment practices.
In Data Pipeline and Extract-Transform-Load (ETL) system projects with different types of data sources, the NumPy array with Numba library implementation is one of the best programming practices for Big Data analysis today. There shouldn’t be a need of using Python List objects for it.

Feel free to email Ernest any questions about his article.

Like this article? Subscribe to our weekly newsletter to never miss out!

Follow @DataconomyMedia

Tags: big data analysis Numba Numpy python

High Performance Big Data Analysis Using NumPy, Numba & Python Asynchronous Programming

Related Posts

AI is infiltrating scientific literature day by day

Snorkel Flow update offers a brand new approach to enterprise data management

Compose your dream music from your couch with Suno AI

Xaira secures a billion-dollar bet on the future of AI drug discovery

Microsoft Phi-3 is the tech giant’s next tiny titan

Ray-Ban Meta Glasses boast new styles and AI features

Comments 1

Leave a Reply Cancel reply

LATEST ARTICLES

AI is infiltrating scientific literature day by day

NVIDIA acquires Run:ai for 700 million USD

Snorkel Flow update offers a brand new approach to enterprise data management

Compose your dream music from your couch with Suno AI

Xaira secures a billion-dollar bet on the future of AI drug discovery

Microsoft Phi-3 is the tech giant’s next tiny titan

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.

High Performance Big Data Analysis Using NumPy, Numba & Python Asynchronous Programming

Introduction

Why use NumPy?

Used Python Algorithms

NumPy Array Program

Numpy Array with asyncio Library

Numpy Array with Numba Library Program

Laptop Hardware Parameters

Programs Runtime Comparison

Conclusions

Related Posts

Comments 1

Leave a Reply Cancel reply

LATEST ARTICLES

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.

Follow Us