Processing xls file leads to all RAM consumption #42

Closed
quaxsze opened this issue Oct 13, 2020 · 5 comments

@quaxsze

quaxsze commented Oct 13, 2020

Processing an XLS file triggers runaway RAM consumption until the OOM killer is invoked.

    import agate
    import agateexcel

    result = agate.Table.from_xls(filepath)

File size is around 30 MB.
Have you already encountered such a scenario with this function?

@abulte

abulte commented Oct 14, 2020

The file causing the problem (an open data file, tested free of viruses): Donnees_transport_ferroviaire_voyageurs.xlsx

When opening it, LibreOffice says the maximum column count has been reached, but still opens it fine.

clamscan Donnees_transport_ferroviaire_voyageurs.xlsx: OK

----------- SCAN SUMMARY -----------
Known viruses: 8923593
Engine version: 0.102.3
Scanned directories: 0
Scanned files: 1
Infected files: 0
Data scanned: 41.17 MB
Data read: 3.16 MB (ratio 13.03:1)
Time: 32.868 sec (0 m 32 s)

@jpmckinney
Member

jpmckinney commented Oct 21, 2020

@abulte I can't reproduce with the following:

    import agate
    import agateexcel

    result = agate.Table.from_xlsx('Donnees_transport_ferroviaire_voyageurs.xlsx')

@quaxsze In your example, you use from_xls. I suspect the problem is with the XLS file itself (assuming you are using from_xls with an XLS file, not an XLSX file). It might be an upstream issue in xlrd. However, I can't test unless you share the file.

@jpmckinney
Member

By the way, I notice the issue description uses from_xls, but the file in the following comment is .xlsx. XLS and XLSX are very different formats. Be sure to use the correct method :)

@jpmckinney
Member

@abulte Aha, yes, it seems you are using from_xls for an XLSX file: https://github.com/etalab/csvapi/blob/d840eb4874f40832b9a428bd95b8b1b6c1b1c8fe/csvapi/parser.py#L31-L33

This will not work. XLS is a binary format, whereas XLSX is a ZIP file containing XML documents.
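For anyone who cannot trust the file extension, the format difference described above can be checked from the first few bytes of the file. The following is only a sketch, not part of agate-excel: `sniff_excel_format` and `load_excel_table` are hypothetical helper names, and it assumes the XLS file is a standard OLE2 compound file and the XLSX file is a regular ZIP archive starting with a local file header.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of the two container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file, used by legacy .xls
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header, used by .xlsx


def sniff_excel_format(header: bytes) -> str:
    """Return 'xls', 'xlsx' or 'unknown' based on the leading bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "xls"
    if header.startswith(ZIP_MAGIC):
        # Any ZIP archive matches here; this does not prove it is a workbook.
        return "xlsx"
    return "unknown"


def load_excel_table(path):
    """Hypothetical helper: pick the agate-excel reader from the file content."""
    import agate
    import agateexcel  # noqa: F401  (registers from_xls / from_xlsx on agate.Table)

    with Path(path).open("rb") as f:
        kind = sniff_excel_format(f.read(8))
    if kind == "xls":
        return agate.Table.from_xls(str(path))
    if kind == "xlsx":
        return agate.Table.from_xlsx(str(path))
    raise ValueError(f"{path} does not look like an XLS or XLSX file")
```

With a check like this, a `.xlsx` file passed to the wrong code path would be routed to from_xlsx (or rejected) instead of being handed to the XLS parser.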

@abulte

abulte commented Oct 21, 2020

🙈 Thanks!
