Last Updated : 08 Apr, 2024
Handling large datasets is a common task in data analysis and manipulation. When working with large datasets, it is important to use efficient techniques and tools to ensure good performance and avoid memory issues. In this article, we will see how we can handle large datasets in Python.
Handle Large Datasets in Python
To handle large datasets in Python, we can use the below techniques:
Reduce Memory Usage by Optimizing Data Types
By default, Pandas assigns data types that may not be memory-efficient. For numeric columns, consider downcasting to smaller types (e.g., int32 instead of int64, float32 instead of float64). For example, if a column holds values like 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, using int8 (8 bits) instead of int64 (64 bits) is sufficient. Similarly, converting object data types to categories can also save memory.
import pandas as pd

# Define the size of the dataset
num_rows = 1000000  # 1 million rows

# Example DataFrame with inefficient datatypes
data = {'A': [1, 2, 3, 4], 'B': [5.0, 6.0, 7.0, 8.0]}
df = pd.DataFrame(data)

# Replicate the DataFrame to create a larger dataset
df_large = pd.concat([df] * (num_rows // len(df)), ignore_index=True)

# Check memory usage before conversion
print("Memory usage before conversion:")
print(df_large.memory_usage().sum())

# Downcast to more memory-efficient datatypes
df_large['A'] = pd.to_numeric(df_large['A'], downcast='integer')
df_large['B'] = pd.to_numeric(df_large['B'], downcast='float')

# Check memory usage after conversion
print("Memory usage after conversion:")
print(df_large.memory_usage().sum())

# Print the resulting data types
print("\nData types after conversion:")
print("Column 'A' dtype:", df_large['A'].dtype)
print("Column 'B' dtype:", df_large['B'].dtype)
Output
Memory usage before conversion:
16000128
Memory usage after conversion:
5000128
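As noted above, converting object (string) columns to the category dtype can also reduce memory usage when a column contains many repeated values. The snippet below is a minimal sketch of that idea; the city column and its values are made up purely for illustration.

import pandas as pd

# A string column with many repeated values is a good candidate for 'category'
df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata'] * 250000})

print("Memory usage as object dtype:", df['city'].memory_usage(deep=True))

# Convert the object column to a categorical column
df['city'] = df['city'].astype('category')

print("Memory usage as category dtype:", df['city'].memory_usage(deep=True))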
Split Data into Chunks
Use the chunksize parameter in pd.read_csv() to read the dataset in smaller chunks, and process each chunk iteratively so that the entire dataset is never loaded into memory at once. The example below simulates this idea by splitting an in-memory DataFrame into fixed-size chunks; a sketch that uses pd.read_csv() directly follows after the output.
import pandas as pd

# Create sample DataFrame
data = {'A': range(10000), 'B': range(10000)}
df = pd.DataFrame(data)

# Process data in chunks of 1000 rows
chunk_size = 1000
for chunk in df.groupby(df.index // chunk_size):
    print(chunk)
Output
(0,        A    B
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
995  995  995
996  996  996
997  997  997
998  998  998
999  999  999

[1000 rows x 2 columns])
(1,          A     B
1000  1000  1000
1001  1001  1001
1002  1002  1002
1003  1003  1003
1004  1004  1004
...    ...   ...
1995  1995  1995
1996  1996  1996
1997  1997  1997
1998  1998  1998
1999  1999  1999

[1000 rows x 2 columns])

... (chunks 2 to 8 follow the same pattern) ...

(9,          A     B
9000  9000  9000
9001  9001  9001
9002  9002  9002
9003  9003  9003
9004  9004  9004
...    ...   ...
9995  9995  9995
9996  9996  9996
9997  9997  9997
9998  9998  9998
9999  9999  9999

[1000 rows x 2 columns])
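The example above works on an in-memory DataFrame. When the data actually lives in a CSV file, the chunksize parameter mentioned earlier makes pd.read_csv() return an iterator of DataFrames instead of one large DataFrame. Below is a minimal sketch of that pattern; the file name large_dataset.csv and the per-chunk sum of column 'A' are assumptions used only for illustration.

import pandas as pd

chunk_size = 1000
total = 0

# Read the CSV file 1000 rows at a time (the file name is an assumed placeholder)
for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):
    # Aggregate each chunk separately so the full file never sits in memory
    total += chunk['A'].sum()

print("Sum of column 'A':", total)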
Use Dask for Parallel Computing
Dask is a parallel computing library that scales Pandas workflows to larger-than-memory datasets by splitting the data into partitions and processing those partitions in parallel.
import dask.dataframe as dd
import pandas as pd

# Create sample DataFrame
data = {'A': range(10000), 'B': range(10000)}
df = pd.DataFrame(data)

# Load data into a Dask DataFrame with 4 partitions
ddf = dd.from_pandas(df, npartitions=4)

# Perform parallelized operations
result = ddf.groupby('A').mean().compute()
print(result)
Output
           B
A           
0        0.0
1        1.0
2        2.0
3        3.0
4        4.0
...      ...
9995  9995.0
9996  9996.0
9997  9997.0
9998  9998.0
9999  9999.0

[10000 rows x 1 columns]
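Note that the example above starts from a Pandas DataFrame that already fits in memory, so it only demonstrates the API. For genuinely larger-than-memory data, Dask can read files directly and evaluate operations lazily, only materializing results when compute() is called. The following sketch assumes a hypothetical large_dataset.csv with columns 'A' and 'B' and is meant only to illustrate the pattern.

import dask.dataframe as dd

# Lazily read the CSV into a Dask DataFrame; no data is loaded yet
# (the file name is an assumed placeholder)
ddf = dd.read_csv('large_dataset.csv')

# Build a lazy computation graph, then execute it in parallel with compute()
result = ddf.groupby('A')['B'].mean().compute()
print(result)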
Conclusion
In conclusion, handling large datasets in Python involves optimizing data types, processing data in smaller chunks, and leveraging parallel computing and lazy evaluation with libraries like Dask to improve performance and reduce memory usage. These techniques help us efficiently process and analyze large datasets for data analysis and manipulation.