Issue Description:
Hello.
I have discovered a performance regression in the read_csv function of pandas 1.3.4 when handling CSV files with a large number of columns. Loading time increases from a few seconds in the previous version 1.2.5 to several minutes, almost a 60x difference. There are related discussions on GitHub, including #44106 and #44192.
In this repository, archive/presentations/FAT_Star_Tutorial_Measuring_Unintended_Bias_in_Text_Classification_Models_with_Real_Data.ipynb and archive/unintended_ml_bias/Train_Toxicity_Model.ipynb both use the affected API, and there may be more files that do.
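To check the scope, here is a small sketch (my own, not from the repo) that lists notebooks under archive/ containing a read_csv call; it assumes it is run from the repository root:

# Sketch: list notebooks under archive/ that mention pandas.read_csv.
# Assumes the script is run from the repository root.
from pathlib import Path

for path in sorted(Path("archive").rglob("*.ipynb")):
    text = path.read_text(encoding="utf-8", errors="ignore")
    if "read_csv" in text:
        print(path)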
Steps to Reproduce:
I have created a small reproducible example to better illustrate this issue.
The following script was run unchanged under pandas 1.3.4 and 1.3.5; only the timing differs.

import os
import timeit

import numpy
import pandas


def generate_sample():
    # Write a gzip-compressed CSV with 100,000 float columns and 5 rows.
    if not os.path.exists("test_small.csv.gz"):
        nb_col = 100000
        nb_row = 5
        feature_list = {'sample': ['s_' + str(i + 1) for i in range(nb_row)]}
        for i in range(nb_col):
            feature_list['feature_' + str(i + 1)] = list(numpy.random.uniform(low=0, high=10, size=nb_row))
        df = pandas.DataFrame(feature_list)
        df.to_csv("test_small.csv.gz", index=False, float_format="%.6f")


def load_csv_file():
    # Read the header row first so every column can be given an explicit dtype.
    col_names = pandas.read_csv("test_small.csv.gz", low_memory=False, nrows=1).columns
    types_dict = {col: numpy.float32 for col in col_names}
    types_dict.update({'sample': str})
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample", na_filter=False,
                                 dtype=types_dict, low_memory=False)
    print("loaded dataframe shape:", feature_df.shape)


generate_sample()
print(timeit.timeit(load_csv_file, number=1))

# results with pandas v1.3.4
loaded dataframe shape: (5, 100000)
120.37690759263933

# results with pandas v1.3.5
loaded dataframe shape: (5, 100000)
2.8567268839105964
Suggestion
I would recommend upgrading to pandas >= 1.3.5, or exploring other ways to optimize the performance of loading these CSV files.
Any other workarounds or solutions would be greatly appreciated.
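Until an upgrade lands, a minimal guard like the sketch below could at least fail fast instead of silently taking minutes. It assumes the packaging library is available, and it assumes (unverified) that every 1.3.x release before 1.3.5 is affected:

# Sketch of a version guard; assumes the packaging library is installed and
# assumes (unverified) that all 1.3.x releases before 1.3.5 show the slowdown.
from packaging import version
import pandas

pd_version = version.parse(pandas.__version__)
if version.parse("1.3.0") <= pd_version < version.parse("1.3.5"):
    raise RuntimeError(
        f"pandas {pandas.__version__} loads wide CSV files very slowly; "
        "please upgrade to pandas>=1.3.5 before running this loader."
    )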
Thank you!