Memory consumption #263

Merged
merged 5 commits into from
May 28, 2019

luminoso-beaudoin
Contributor

This PR reduces the RAM needed to build ConceptNet to 15 GB. The main change is additional sharding: the "convert" steps, which package input vector embeddings as HDF5 files, now run on shards, and the "propagate" step uses a larger number of shards. The PR also introduces a new internal format for HDF5 files that allows reading them in "horizontal" shards, i.e. reading a selected subset of the rows of a new-format file rather than the entire file.
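
To illustrate the idea of "horizontal" shards, here is a minimal sketch of how a file's rows could be divided into contiguous row ranges, each of which can then be read on its own. The function name `shard_row_ranges` and its parameters are hypothetical and not taken from this PR; the actual code may split rows differently.

```python
def shard_row_ranges(n_rows, n_shards):
    """Split n_rows into n_shards contiguous, half-open (start_row, end_row) ranges.

    Hypothetical helper for illustration only; not part of ConceptNet.
    """
    if n_shards <= 0:
        raise ValueError("n_shards must be positive")
    base, extra = divmod(n_rows, n_shards)
    ranges = []
    start = 0
    for i in range(n_shards):
        # The first `extra` shards take one additional row each,
        # so shard sizes differ by at most one.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


# Each (start_row, end_row) pair could be passed to a reader that loads
# only that slice of rows, so that no single step holds the whole matrix.
for start_row, end_row in shard_row_ranges(10, 3):
    print(start_row, end_row)
```

Keeping the ranges contiguous and non-overlapping means the shards can be concatenated in order to recover the original rows.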

Changes to the resulting vectors (as shown by cn5-vectors evaluate) are minimal.

@rspeer
Member

rspeer commented May 20, 2019

I tried clearing my data/vectors directory and running this build, and during the join_convert step, I got:

Traceback (most recent call last):
  File "/home/rspeer/.virtualenvs/main/bin/cn5-vectors", line 11, in <module>
    load_entry_point('ConceptNet', 'console_scripts', 'cn5-vectors')()
  File "/home/rspeer/.virtualenvs/main/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/rspeer/.virtualenvs/main/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/rspeer/.virtualenvs/main/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/rspeer/.virtualenvs/main/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/rspeer/.virtualenvs/main/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/rspeer/code/conceptnet5/conceptnet5/vectors/cli.py", line 344, in run_join_shard_files
    join_shards(filename, nshards, sort=sort)
  File "/home/rspeer/code/conceptnet5/conceptnet5/vectors/retrofit.py", line 56, in join_shards
    shard = load_hdf(output_filename + ".shard0")
  File "/home/rspeer/code/conceptnet5/conceptnet5/vectors/formats.py", line 48, in load_hdf
    start=start_row, stop=end_row
  File "/home/rspeer/.virtualenvs/main/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 429, in __new__
    data = list(data)
  File "/home/rspeer/code/conceptnet5/conceptnet5/vectors/formats.py", line 46, in <genexpr>
    array.tobytes().decode("utf-8")
  File "/home/rspeer/.virtualenvs/main/lib/python3.6/site-packages/tables/vlarray.py", line 634, in __next__
    self.listarr = self.read(self._startb, self._stopb, self._step)
  File "/home/rspeer/.virtualenvs/main/lib/python3.6/site-packages/tables/vlarray.py", line 821, in read
    listarr = self._read_array(start, stop, step)
  File "tables/hdf5extension.pyx", line 2155, in tables.hdf5extension.VLArray._read_array
ValueError: cannot set WRITEABLE flag to True of this array

I'll see what I can find out about this error.

@rspeer
Member

rspeer commented May 20, 2019

Okay - per pandas-dev/pandas#24839, we should depend on tables >= 3.5.
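
As a rough sketch of what that minimum means in practice, the check below compares a version string against 3.5 using only the standard library. The helper names `version_tuple` and `meets_minimum` are hypothetical and not part of ConceptNet; the constraint itself would more likely be declared in the package's dependency list (for example as `tables>=3.5`).

```python
from itertools import takewhile


def version_tuple(version):
    """Turn a string such as '3.5.1' into (3, 5, 1).

    Only the leading digits of each dot-separated component are kept,
    so '3.5.0rc1' becomes (3, 5, 0). Hypothetical helper for illustration.
    """
    parts = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum="3.5"):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# With PyTables installed, the check could be applied as:
#     import tables
#     meets_minimum(tables.__version__)
print(meets_minimum("3.4.4"), meets_minimum("3.5.1"))
```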

@rspeer rspeer merged commit 7fb141f into master May 28, 2019
@rspeer rspeer deleted the memory-consumption branch May 28, 2019 17:02
rspeer pushed a commit that referenced this pull request Jul 1, 2019
This reverts commit 7fb141f, reversing
changes made to 705144c.