Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in2csv: Empty CSV reading a specific xlsx file #1129

Closed
aborruso opened this issue Jun 13, 2021 · 23 comments
Closed

in2csv: Empty CSV reading a specific xlsx file #1129

aborruso opened this issue Jun 13, 2021 · 23 comments

Comments

@aborruso
Copy link

Hi,
using in2csv to convert this "governative" COVID-19 file, I have a strange empty CSV that has only one column label.

I run

in2csv -I --sheet stato_clinico covid_19-iss.xlsx

and I have only the value iss_date as result.

If I open it with libreoffice or using other cli apps, I have the real values

iss_date,SESSO,AGE_GROUP,STATO_CLINICO,CASI
12/06/2021,F,0-9,ASINTOMATICO,5446
12/06/2021,F,0-9,CRITICO,66
12/06/2021,F,0-9,LIEVE,1061
12/06/2021,F,0-9,PAUCI-SINTOMATICO,934
12/06/2021,F,0-9,SEVERO,52
12/06/2021,F,10-19,ASINTOMATICO,8205
12/06/2021,F,10-19,CRITICO,220
12/06/2021,F,10-19,LIEVE,2289
12/06/2021,F,10-19,PAUCI-SINTOMATICO,2135
12/06/2021,F,10-19,SEVERO,55
12/06/2021,F,20-29,ASINTOMATICO,8782
12/06/2021,F,20-29,CRITICO,264
12/06/2021,F,20-29,LIEVE,2891
12/06/2021,F,20-29,PAUCI-SINTOMATICO,2625
12/06/2021,F,20-29,SEVERO,240
12/06/2021,F,30-39,ASINTOMATICO,9310
12/06/2021,F,30-39,CRITICO,302
12/06/2021,F,30-39,LIEVE,3569
12/06/2021,F,30-39,PAUCI-SINTOMATICO,2963
12/06/2021,F,30-39,SEVERO,448
12/06/2021,F,40-49,ASINTOMATICO,11240
12/06/2021,F,40-49,CRITICO,518
12/06/2021,F,40-49,LIEVE,4767
12/06/2021,F,40-49,PAUCI-SINTOMATICO,3897
12/06/2021,F,40-49,SEVERO,443
12/06/2021,F,50-59,ASINTOMATICO,11553
12/06/2021,F,50-59,CRITICO,622
12/06/2021,F,50-59,LIEVE,4762
12/06/2021,F,50-59,PAUCI-SINTOMATICO,3980
12/06/2021,F,50-59,SEVERO,723
12/06/2021,F,60-69,ASINTOMATICO,7326
12/06/2021,F,60-69,CRITICO,497
12/06/2021,F,60-69,LIEVE,3190
12/06/2021,F,60-69,PAUCI-SINTOMATICO,2609
12/06/2021,F,60-69,SEVERO,978
12/06/2021,F,70-79,ASINTOMATICO,5491
12/06/2021,F,70-79,CRITICO,403
12/06/2021,F,70-79,LIEVE,2371
12/06/2021,F,70-79,PAUCI-SINTOMATICO,1713
12/06/2021,F,70-79,SEVERO,1418
12/06/2021,F,80-89,ASINTOMATICO,4781
12/06/2021,F,80-89,CRITICO,195
12/06/2021,F,80-89,LIEVE,1520
12/06/2021,F,80-89,PAUCI-SINTOMATICO,1082
12/06/2021,F,80-89,SEVERO,1689
12/06/2021,F,>90,ASINTOMATICO,2075
12/06/2021,F,>90,CRITICO,47
12/06/2021,F,>90,LIEVE,571
12/06/2021,F,>90,PAUCI-SINTOMATICO,414
12/06/2021,F,>90,SEVERO,509
12/06/2021,F,Non noto,ASINTOMATICO,18
12/06/2021,F,Non noto,PAUCI-SINTOMATICO,<5
12/06/2021,F,Non noto,SEVERO,<5
12/06/2021,M,0-9,ASINTOMATICO,5744
12/06/2021,M,0-9,CRITICO,63
12/06/2021,M,0-9,LIEVE,1114
12/06/2021,M,0-9,PAUCI-SINTOMATICO,928
12/06/2021,M,0-9,SEVERO,61
12/06/2021,M,10-19,ASINTOMATICO,9203
12/06/2021,M,10-19,CRITICO,232
12/06/2021,M,10-19,LIEVE,2315
12/06/2021,M,10-19,PAUCI-SINTOMATICO,2065
12/06/2021,M,10-19,SEVERO,89
12/06/2021,M,20-29,ASINTOMATICO,9804
12/06/2021,M,20-29,CRITICO,289
12/06/2021,M,20-29,LIEVE,2967
12/06/2021,M,20-29,PAUCI-SINTOMATICO,2632
12/06/2021,M,20-29,SEVERO,195
12/06/2021,M,30-39,ASINTOMATICO,9535
12/06/2021,M,30-39,CRITICO,264
12/06/2021,M,30-39,LIEVE,3081
12/06/2021,M,30-39,PAUCI-SINTOMATICO,2687
12/06/2021,M,30-39,SEVERO,369
12/06/2021,M,40-49,ASINTOMATICO,11025
12/06/2021,M,40-49,CRITICO,470
12/06/2021,M,40-49,LIEVE,4259
12/06/2021,M,40-49,PAUCI-SINTOMATICO,3407
12/06/2021,M,40-49,SEVERO,726
12/06/2021,M,50-59,ASINTOMATICO,11612
12/06/2021,M,50-59,CRITICO,737
12/06/2021,M,50-59,LIEVE,4611
12/06/2021,M,50-59,PAUCI-SINTOMATICO,3852
12/06/2021,M,50-59,SEVERO,1363
12/06/2021,M,60-69,ASINTOMATICO,8072
12/06/2021,M,60-69,CRITICO,753
12/06/2021,M,60-69,LIEVE,3299
12/06/2021,M,60-69,PAUCI-SINTOMATICO,2407
12/06/2021,M,60-69,SEVERO,1718
12/06/2021,M,70-79,ASINTOMATICO,5133
12/06/2021,M,70-79,CRITICO,594
12/06/2021,M,70-79,LIEVE,2232
12/06/2021,M,70-79,PAUCI-SINTOMATICO,1583
12/06/2021,M,70-79,SEVERO,1933
12/06/2021,M,80-89,ASINTOMATICO,2953
12/06/2021,M,80-89,CRITICO,162
12/06/2021,M,80-89,LIEVE,1124
12/06/2021,M,80-89,PAUCI-SINTOMATICO,692
12/06/2021,M,80-89,SEVERO,1472
12/06/2021,M,>90,ASINTOMATICO,694
12/06/2021,M,>90,CRITICO,22
12/06/2021,M,>90,LIEVE,213
12/06/2021,M,>90,PAUCI-SINTOMATICO,151
12/06/2021,M,>90,SEVERO,244
12/06/2021,M,Non noto,ASINTOMATICO,15
12/06/2021,M,Non noto,LIEVE,<5
12/06/2021,M,Non noto,PAUCI-SINTOMATICO,<5
@jpmckinney
Copy link
Member

Can you run with -v:

in2csv -v -I --sheet stato_clinico covid_19-iss.xlsx

There might be some suppressed error output.

@aborruso
Copy link
Author

aborruso commented Oct 8, 2021

@jpmckinney no error output.

I have only the first cell :(

@jpmckinney
Copy link
Member

Hmm, okay, I will have to test with the file.

@hyper-jarrod
Copy link

I also experienced the same issue. Only the very first cell is outputted.

@jpmckinney jpmckinney changed the title Empty CSV reading a xlsx file in2csv: Empty CSV reading a xlsx file Oct 17, 2023
@jpmckinney jpmckinney changed the title in2csv: Empty CSV reading a xlsx file in2csv: Empty CSV reading a specific xlsx file Oct 17, 2023
@raphox
Copy link

raphox commented Oct 23, 2023

It seems related to the Unnamed columns. Try to use Panda library to read your file using this code:

import pandas as pd

xlsx = pd.read_excel('file.xlsx', sheet_name=0)
xlsx.to_csv('file.csv', index=False)

You will get something like this:

Sales Period,May 2023,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14

The reason could be this https://stackoverflow.com/a/67605541/511514

I don't have a solution yet.

@jpmckinney
Copy link
Member

@raphox Can you share an XLSX file that causes the error?

@jpmckinney
Copy link
Member

Ah, csvkit was trusting the file's metadata, which for @aborruso's file set the max columns and rows to 1.

I made a commit to never trust the file, and to reset dimensions instead. I assume Pandas did the same.

@raphox
Copy link

raphox commented Oct 23, 2023

I could simulate the same issue using this code:

import agate
import agateexcel

table = agate.Table.from_xlsx('file.xlsx', sheet=0)
table.to_csv('file.csv')

The Panda library and Csvkit are using the Agate library.

@aborruso
Copy link
Author

I made a commit to never trust the file, and to reset dimensions instead. I assume Pandas did the same.

thank you very much

@raphox
Copy link

raphox commented Oct 23, 2023

@jpmckinney

Sorry, I can not share my file. It comes from a customer.

Using your commit, I got this error message:

ValueError: Row 1 has 15 values, but Table only has 2 columns.

@jpmckinney
Copy link
Member

jpmckinney commented Oct 23, 2023

The Panda library and Csvkit are using the Agate library.

@raphox Pandas does not use agate.

Sorry, I can not share my file. It comes from a customer.

Is it impossible to create a file (e.g. by modifying your customer file) to reproduce the issue?

Using your commit, I got this error message:

Indeed - it looks like setting reset_dimensions for all files is too aggressive (e.g. even for csvkit's own test files). I instead made it opt-in (--reset-dimensions). However, since this makes the solution harder for a user to discover, I added content to the in2csv command's page.

(For example, csvkit's examples/test.xlsx has data in a column that lacks a header. Setting reset_dimensions causes openpyxl to trim the number of columns to the number of headers. This causes ValueError: Row 1 has 8 values, but Table only has 7 columns. on that file. Without reset_dimensions the unnamed column is within the file's dimensions.)

https://openpyxl.readthedocs.io/en/stable/optimized.html#worksheet-dimensions

@raphox
Copy link

raphox commented Oct 24, 2023

@jpmckinney

Sorry, I got confused. The pandas library uses openpyxl, and it uses the agate, as csvkit.

I can not share the customer's file, and when I try to remove the sensitive data and save the file using Microsoft Excel on MacOS, the problem disappears.

But I got another weird behavior trying to convert this file sales_with_header_in_row_two_google.xlsx. I used the Google Docs to generate it based on my customer's file.

Screenshot 2023-10-24 at 09 46 53.

The result using this file is:

in2csv test/fixtures/files/xlsx/sales_with_header_in_row_two_google.xlsx
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 2 has no name. Using "c".
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 3 has no name. Using "d".
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 4 has no name. Using "e".
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 5 has no name. Using "f".
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 6 has no name. Using "g".
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 7 has no name. Using "h".
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 8 has no name. Using "i".
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 9 has no name. Using "j".
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 10 has no name. Using "k".
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 11 has no name. Using "l".
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 12 has no name. Using "m".
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 13 has no name. Using "n".
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/agate/utils.py:274: UnnamedColumnWarning: Column 14 has no name. Using "o".
Sales Period,2023-05-01 00:00:00,c,d,e,f,g,h,i,j,k,l,m,n,o
Title,Author,ASIN,Marketplace,Units Sold,Units Refunded,Net Units Sold,Royalty Type,Transaction Type,Currency,Avg. List Price without tax,Avg. Offer Price without tax,Avg. File Size (MB),Avg. Delivery Cost,Royalty
Book One,John,ABC123,Amazon.com,1.0,0.0,1.0,0.7,Standard,USD,"7,99","7,99","0,26","0,04","5,56"

Look at the original blank cells ,c,d,e,f,g,h,i,j,k,l,m,n,o in the first row.

@raphox
Copy link

raphox commented Oct 24, 2023

Using pandas:

import pandas as pd

# Lendo o arquivo xlsx
xlsx = pd.read_excel('test/fixtures/files/xlsx/sales_with_header_in_row_two_google.xlsx', sheet_name=0)

# Escrevendo o conteúdo em um arquivo csv
xlsx.to_csv('output_file.csv', index=False)

Result in output_file.csv:

Sales Period,2023-05-01 00:00:00,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
Title,Author,ASIN,Marketplace,Units Sold,Units Refunded,Net Units Sold,Royalty Type,Transaction Type,Currency,Avg. List Price without tax,Avg. Offer Price without tax,Avg. File Size (MB),Avg. Delivery Cost,Royalty
Book One,John,ABC123,Amazon.com,1,0,1,0.7,Standard,USD,"7,99","7,99","0,26","0,04","5,56"

@raphox
Copy link

raphox commented Oct 24, 2023

I am making progress:

Using Pandas:

import pandas as pd

xlsx = pd.read_excel('test/fixtures/files/xlsx/sales_with_header_in_row_two.xlsx', sheet_name=0, index_col=None, header=None)
xlsx.to_csv('output_file.csv', index=False, header=None)

# Results:
Sales Period,May 2023,,,,,,,,,,,,,
Title,Author,ASIN,Marketplace,Units Sold,Units Refunded,Net Units Sold,Royalty Type,Transaction Type,Currency,Avg. List Price without tax,Avg. Offer Price without tax,Avg. File Size (MB),Avg. Delivery Cost,Royalty
A Navy SEAL’s Guide to the Zombie Apocalypse: A Practical Guide for Surviving Anything (even zombies),Peter Four,B09X9RNN91,Amazon.com,1,0,1,70%,Standard,USD,7.99,7.99,0.26,0.04,5.56

I got the same results, using in2csv for the Google Doc, but using the customer's original file I got this:

in2csv test/fixtures/files/xlsx/sales_with_header_in_row_two.xlsx -H
/Users/raphox/.pyenv/versions/3.10.4/lib/python3.10/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default
a
Sales Period

@jpmckinney
Copy link
Member

To skip that useless first row, use -K 1 or --skip-lines 1

Your comments are no longer about the issue, and are support questions (rather than bug reports).

@raphox
Copy link

raphox commented Oct 25, 2023

Thanks @jpmckinney

My goal was just to convert an XLSX file to a CSV version. My objective is to have a small version of the file before importing it into my web application using Ruby on Rails. I don't want to worry about the header or skip rows.

In my case, I used the custom pandas script to do this:

# bin/excel_to_csv_by_panda.py

#!/usr/bin/env python

import sys
import warnings
import pandas as pd

warnings.filterwarnings("ignore")

xlsx = pd.read_excel(sys.argv[1], sheet_name=sys.argv[2], index_col=None, header=None)
xlsx.to_csv(sys.argv[3], index=False, header=None)

Ruby code:

system(
  Rails.root.join('bin/excel_to_csv_by_panda.py').to_s,
  excel_file_path,
  excel_sheet_name,
  csv_file_path
)

@steve-estes
Copy link

steve-estes commented Oct 27, 2023

Ah, csvkit was trusting the file's metadata, which for @aborruso's file set the max columns and rows to 1.

I made a commit to never trust the file, and to reset dimensions instead. I assume Pandas did the same.

I just tried to use the --reset-dimensions parameter, and despite it being csvkit 1.3.0 and not yet documented, it does not seem to recognize the parameter (nor is it in the help menu - though it's in the readthedocs doc as a footer). I'm assuming the below issue is because the commit you mention is ahead of the latest 1.3.0 release.

Screen Shot 2023-10-27 at 3 25 25 PM

Digging into this a bit further, I see 2 issues:

  1. Add this to the documented parameters in the docs
  2. The warnings are annoying and probably unnecessary. You could take the approach of either of the top 2 answers over here, to suppress those during execution, since they will always pop up and they're always irrelevant, and so it kinda obscures what's actually happening with a flood of unneeded warnings.
  3. Over at Line 73 of the agate_xlsx script, after doing reset_dimensions(), it might be helpful to then do what openpyxl suggests and add a line for: sheet.calculate_dimension(force=True) to reset the calculation to the proper dimensions. Unsure if that will add an excessive full scan of the sheet, or if it'll help other things that rely on the dimensions.

Hope that helps, James!

@jpmckinney
Copy link
Member

@steve-estes This feature is not yet part of a release. You can see it listed under "Unreleased" in the changelog: https://csvkit.readthedocs.io/en/latest/changelog.html

You can install csvkit from GitHub using:

pip install git+https://github.com/wireservice/csvkit.git

@jpmckinney
Copy link
Member

The warnings are annoying and probably unnecessary. You could take the approach of either of the top 2 answers over here, to suppress those during execution, since they will always pop up and they're always irrelevant, and so it kinda obscures what's actually happening with a flood of unneeded warnings.

This is a separate issue. I haven't seen such warnings with any files, personally. Please open a new issue, and upload a file that causes the warning.

Over at Line 73 of the agate_xlsx script, after doing reset_dimensions(), it might be helpful to then do what openpyxl suggests and add a line for: sheet.calculate_dimension(force=True) to reset the calculation to the proper dimensions. Unsure if that will add an excessive full scan of the sheet, or if it'll help other things that rely on the dimensions.

calculate_dimension doesn't accept any parameters: https://openpyxl.readthedocs.io/en/stable/_modules/openpyxl/worksheet/worksheet.html?highlight=calculate_dimension#

Where did you find this suggestion?

@steve-estes
Copy link

I will open a new issue about the warning

Over at Line 73 of the agate_xlsx script, after doing reset_dimensions(), it might be helpful to then do what openpyxl suggests and add a line for: sheet.calculate_dimension(force=True) to reset the calculation to the proper dimensions. Unsure if that will add an excessive full scan of the sheet, or if it'll help other things that rely on the dimensions.

calculate_dimension doesn't accept any parameters: https://openpyxl.readthedocs.io/en/stable/_modules/openpyxl/worksheet/worksheet.html?highlight=calculate_dimension#

Where did you find this suggestion?

While tinkering around with command-line python, and having the module tell me as much.

sagemaker-user@studio$ python3
Python 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:45:29) 
[GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import agateexcel
>>> import agate
>>> import csvkit
>>> import openpyxl
>>> wb = load_workbook(filename='active_cds.xlsx', read_only=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'load_workbook' is not defined
>>> wb = openpyxl.load_workbook(filename='active_cds.xlsx', read_only=True)
/opt/conda/lib/python3.9/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default
>>> wb.sheetnames
['queryri.ks_bpcd>Sheet1']
>>> ws = wb['queryri.ks_bpcd>Sheet1']
>>> ws.calculate_dimension()
'A1:A1'
>>> ws.reset_dimensions()
>>> ws.calculate_dimension()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.9/site-packages/openpyxl/worksheet/_read_only.py", line 141, in calculate_dimension
    raise ValueError("Worksheet is unsized, use calculate_dimension(force=True)")
ValueError: Worksheet is unsized, **use calculate_dimension(force=True)**
>>> ws.calculate_dimension(force=True)
'A1:L3367'
>>> ws.calculate_dimension()
'A1:L3367'

By following the referenced code path, it can be seen here on the openpyxl dev repository.

@jpmckinney
Copy link
Member

jpmckinney commented Oct 30, 2023

Thank you, @steve-estes! I've released a new version of agate-excel that recalculates the dimensions (instead of only resetting them to None, as this causes the behavior of treating the width of the header row as the width of the full table).

With this change, it would be possible to have reset_dimensions be used for all Excel files (viz my comment #1129 (comment)). However, recalculating dimensions requires reading the entire file, so I've left this as opt-in in CSV Kit for performance.

Arguably, CSV Kit could favor user experience by instead making a flag to opt out of resetting dimensions, e.g. --no-reset-dimensions. However, not resetting dimensions is the existing behavior, so I'll continue with opt-in for now. We can maybe do opt-out in a major version.

@steve-estes
Copy link

Thank you, @steve-estes! I've released a new version of agate-excel that recalculates the dimensions (instead of only resetting them to None, as this causes the behavior of treating the width of the header row as the width of the full table).

With this change, it would be possible to have reset_dimensions be used for all Excel files (viz my comment #1129 (comment)). However, recalculating dimensions requires reading the entire file, so I've left this as opt-in in CSV Kit for performance.

Arguably, CSV Kit could favor user experience by instead making a flag to opt out of resetting dimensions, e.g. --no-reset-dimensions. However, not resetting dimensions is the existing behavior, so I'll continue with opt-in for now. We can maybe do opt-out in a major version.

Agreed I see no reason to make this opt-out rather than opt-in, given the performance implications.

If we want to be a little bit fancy, one thing agate-excel could do, as a future item, is just first check whether the calculation_dimensions() result is 'A1:A1', and if so, issue a warning ("warning: worksheet bounds are set to A1:A1, use reset_dimensions to recalculate") or just do a recalc automatically. Because there's just no data file where that is going to be correct - some tool in common use out there must be exporting with a default dimension listed in the excel worksheet metadata, and just never recalculating it.

@jpmckinney
Copy link
Member

Good suggestion! I've made that change to agate-excel, so it'll occur automatically via csvkit. I didn't add a warning, as I agree that there is pretty much no chance that converting a single cell is the behavior desired by the user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants