Last Updated : 08 Apr, 2024
Handling large datasets is a common task in data analysis and manipulation. When working with large datasets, it is important to use efficient techniques and tools to ensure good performance and avoid memory issues. In this article, we will see how we can handle large datasets in Python.
Handle Large Datasets in Python
To handle large datasets in Python, we can use the below techniques:
Reduce Memory Usage by Optimizing Data Types
By default, Pandas assigns data types that may not be memory-efficient. For numeric columns, consider downcasting to smaller types (e.g., int32 instead of int64, float32 instead of float64). For example, if a column holds values like 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, using int8 (8 bits) instead of int64 (64 bits) is sufficient. Similarly, converting object data types to categories can also save memory.
import pandas as pd

# Define the size of the dataset
num_rows = 1000000  # 1 million rows

# Example DataFrame with inefficient datatypes
data = {'A': [1, 2, 3, 4], 'B': [5.0, 6.0, 7.0, 8.0]}
df = pd.DataFrame(data)

# Replicate the DataFrame to create a larger dataset
df_large = pd.concat([df] * (num_rows // len(df)), ignore_index=True)

# Check memory usage before conversion
print("Memory usage before conversion:")
print(df_large.memory_usage().sum())

# Downcast to more memory-efficient datatypes
df_large['A'] = pd.to_numeric(df_large['A'], downcast='integer')
df_large['B'] = pd.to_numeric(df_large['B'], downcast='float')

# Check memory usage after conversion
print("Memory usage after conversion:")
print(df_large.memory_usage().sum())

# Print the resulting data types
print("\nData types after conversion:")
print("Column 'A' dtype:", df_large['A'].dtype)
print("Column 'B' dtype:", df_large['B'].dtype)
Output
Memory usage before conversion:
16000128
Memory usage after conversion:
5000128
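As noted above, converting object (string) columns to the category dtype can also reduce memory usage when a column contains many repeated values. The snippet below is a minimal sketch of that idea; the city column and its values are made up purely for illustration.

import pandas as pd

# A string column with many repeated values is a good candidate for 'category'
df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata'] * 250000})

print("Memory usage as object dtype:", df['city'].memory_usage(deep=True))

# Convert the object column to a categorical column
df['city'] = df['city'].astype('category')

print("Memory usage as category dtype:", df['city'].memory_usage(deep=True))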
Split Data into Chunks
Use the chunksize parameter in pd.read_csv() to read the dataset in smaller chunks, and process each chunk iteratively so that the entire dataset is never loaded into memory at once. The example below simulates this idea by splitting an in-memory DataFrame into fixed-size chunks; a sketch that uses pd.read_csv() directly follows after the output.
import pandas as pd

# Create sample DataFrame
data = {'A': range(10000), 'B': range(10000)}
df = pd.DataFrame(data)

# Process data in chunks of 1000 rows
chunk_size = 1000
for chunk in df.groupby(df.index // chunk_size):
    print(chunk)
Output
(0,        A    B
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
995  995  995
996  996  996
997  997  997
998  998  998
999  999  999

[1000 rows x 2 columns])
(1,          A     B
1000  1000  1000
1001  1001  1001
1002  1002  1002
1003  1003  1003
1004  1004  1004
...    ...   ...
1995  1995  1995
1996  1996  1996
1997  1997  1997
1998  1998  1998
1999  1999  1999

[1000 rows x 2 columns])

... (chunks 2 to 8 follow the same pattern) ...

(9,          A     B
9000  9000  9000
9001  9001  9001
9002  9002  9002
9003  9003  9003
9004  9004  9004
...    ...   ...
9995  9995  9995
9996  9996  9996
9997  9997  9997
9998  9998  9998
9999  9999  9999

[1000 rows x 2 columns])
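The example above works on an in-memory DataFrame. When the data actually lives in a CSV file, the chunksize parameter mentioned earlier makes pd.read_csv() return an iterator of DataFrames instead of one large DataFrame. Below is a minimal sketch of that pattern; the file name large_dataset.csv and the per-chunk sum of column 'A' are assumptions used only for illustration.

import pandas as pd

chunk_size = 1000
total = 0

# Read the CSV file 1000 rows at a time (the file name is an assumed placeholder)
for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):
    # Aggregate each chunk separately so the full file never sits in memory
    total += chunk['A'].sum()

print("Sum of column 'A':", total)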
Use Dask for Parallel Computing
Dask is a parallel computing library that scales Pandas workflows to larger-than-memory datasets by splitting the data into partitions and processing those partitions in parallel.
import dask.dataframe as dd
import pandas as pd

# Create sample DataFrame
data = {'A': range(10000), 'B': range(10000)}
df = pd.DataFrame(data)

# Load data into a Dask DataFrame with 4 partitions
ddf = dd.from_pandas(df, npartitions=4)

# Perform parallelized operations
result = ddf.groupby('A').mean().compute()
print(result)
Output
           B
A           
0        0.0
1        1.0
2        2.0
3        3.0
4        4.0
...      ...
9995  9995.0
9996  9996.0
9997  9997.0
9998  9998.0
9999  9999.0

[10000 rows x 1 columns]
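Note that the example above starts from a Pandas DataFrame that already fits in memory, so it only demonstrates the API. For genuinely larger-than-memory data, Dask can read files directly and evaluate operations lazily, only materializing results when compute() is called. The following sketch assumes a hypothetical large_dataset.csv with columns 'A' and 'B' and is meant only to illustrate the pattern.

import dask.dataframe as dd

# Lazily read the CSV into a Dask DataFrame; no data is loaded yet
# (the file name is an assumed placeholder)
ddf = dd.read_csv('large_dataset.csv')

# Build a lazy computation graph, then execute it in parallel with compute()
result = ddf.groupby('A')['B'].mean().compute()
print(result)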
Conclusion
In conclusion, handling large datasets in Python involves optimizing data types, processing data in smaller chunks, and leveraging parallel computing and lazy evaluation with libraries like Dask to improve performance and reduce memory usage. These techniques help us efficiently process and analyze large datasets for data analysis and manipulation.