
Memory Usage #23

Open
scraperdragon opened this issue Jul 22, 2013 · 6 comments

@scraperdragon
Contributor

This is a combined messytables/xypath issue.

We need to be cautious about the amount of memory we're using:

http://faostat.fao.org/Portals/_Faostat/Downloads/zip_files/FoodSupply_Crops_E_Africa_1.zip

a 1.5 MB zip (15 MB CSV)

with

fh = dl.grab(url)
mt, = list(messytables.zip.ZIPTableSet(fh).tables)
xy = xypath.Table.from_messy(mt)

uses around 3 GB of RAM.

Given that, in the "upload a spreadsheet" tool, people could upload files this big trivially, we'll need to think about memory consumption.

Top tip: dictionaries are horrific.

Dave.
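The dictionary overhead alluded to above can be measured directly. A minimal standalone sketch (not using xypath itself; the Cell class is hypothetical, just to illustrate per-instance dict cost):

```python
import sys

# An ordinary class: every instance carries its own attribute __dict__.
# For millions of table cells, that overhead dominates the actual payload.
class Cell:
    def __init__(self, x, y, value):
        self.x = x
        self.y = y
        self.value = value

c = Cell(0, 0, "spam")

dict_size = sys.getsizeof(c.__dict__)      # the per-instance dict alone
tuple_size = sys.getsizeof((0, 0, "spam")) # the same data as a bare tuple

print(dict_size, tuple_size)
```

On CPython the instance dict alone is larger than a tuple holding the same three values, before even counting the object header.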

@scraperdragon
Contributor Author

Not significantly better with the new changes :( (40%+ RAM locally; estimate ~2 GB)

import StringIO
import requests
import xypath
import messytables

url = 'http://faostat.fao.org/Portals/_Faostat/Downloads/zip_files/FoodSupply_Crops_E_Africa_1.zip'
z = requests.get(url).content
fh = StringIO.StringIO(z)
mt, = list(messytables.zip.ZIPTableSet(fh).tables)
xy = xypath.Table.from_messy(mt)

It's not ZIP specific.

@pwaller
Contributor

pwaller commented Sep 6, 2013

When making large numbers of instances of objects which only have a couple of per-instance variables, you can save a ton of memory by defining __slots__.
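A quick sketch of the suggestion (the class names here are illustrative, not from the xypath code):

```python
import sys

class PlainCell:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class SlottedCell:
    # __slots__ suppresses the per-instance __dict__; the two attributes
    # are stored in fixed slots on the object itself.
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y

plain = PlainCell(1, 2)
slotted = SlottedCell(1, 2)

has_dict = hasattr(slotted, '__dict__')  # False: no attribute dict at all
plain_total = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
slotted_total = sys.getsizeof(slotted)

print(has_dict, plain_total, slotted_total)
```

The saving is per instance, so it compounds across the millions of cell objects a large table produces. The trade-off is that slotted instances can't grow arbitrary new attributes.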

@scraperdragon
Contributor Author

__slots__ has been implemented; performance not yet tested.

@scraperdragon
Contributor Author

Now 33% RAM. Better, but not a vast improvement.

@scraperdragon
Contributor Author

More improvements, driven by a change in this file, mostly from ditching the double-index.

@StevenMaude
Contributor

StevenMaude commented Sep 26, 2016

This remains a problem.

Checking with the same code above (just tidied for ease of copy-pasting):

import StringIO
import requests
import xypath
import messytables

url = 'http://faostat.fao.org/Portals/_Faostat/Downloads/zip_files/FoodSupply_Crops_E_Africa_1.zip'
z = requests.get(url).content
fh = StringIO.StringIO(z)
mt, = list(messytables.zip.ZIPTableSet(fh).tables)
xy = xypath.Table.from_messy(mt)

and running it with /usr/bin/time -v python faostat.py

results in:

Maximum resident set size (kbytes): 3375120
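For checking peak usage from inside a script rather than via /usr/bin/time, the stdlib resource module exposes the same counter (Unix only; a sketch):

```python
import resource

# Peak resident set size of the current process so far.
# Note: ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print("Maximum resident set size: %d" % peak)
```

Calling this at the end of the reproduction script gives the same figure /usr/bin/time -v reports, which makes it easy to assert on memory in a regression test.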
