Use sparse arrays for subject vectors #377

Closed
osma opened this issue Jan 27, 2020 · 3 comments

osma commented Jan 27, 2020

Currently subject vectors (wrapped by VectorSuggestionResult and ListSuggestionResult objects) are basic, dense NumPy vectors. These take up a lot of RAM: for example, a single YSO vector takes about 40,000 subjects * 4 bytes (float32) = 160 kB. This adds up, especially for ensemble backends (pav and nn_ensemble) that need to keep many such vectors in memory. SciPy sparse vectors would likely be much more space-efficient, so we should (try to) switch to them.
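
To make the arithmetic concrete, here is a rough sketch comparing the memory footprint of one dense float32 vector with a SciPy CSR representation of the same vector. The figure of 100 nonzero scores per vector is purely an assumption for illustration, and the helper names `dense_bytes` and `sparse_bytes` are hypothetical, not part of the codebase.

```python
import numpy as np
from scipy.sparse import csr_matrix

N_SUBJECTS = 40_000  # approximate vocabulary size, taken from the estimate above
N_NONZERO = 100      # ASSUMPTION: number of subjects with a nonzero score


def dense_bytes(n_subjects=N_SUBJECTS):
    """Bytes used by the data buffer of a dense float32 subject vector."""
    return np.zeros(n_subjects, dtype=np.float32).nbytes


def sparse_bytes(n_subjects=N_SUBJECTS, n_nonzero=N_NONZERO, seed=0):
    """Bytes used by the three CSR buffers for the same vector."""
    rng = np.random.default_rng(seed)
    dense = np.zeros(n_subjects, dtype=np.float32)
    idx = rng.choice(n_subjects, size=n_nonzero, replace=False)
    # Scores are kept strictly positive so that every chosen index is stored.
    dense[idx] = 0.5 + 0.5 * rng.random(n_nonzero)
    row = csr_matrix(dense.reshape(1, -1))
    # CSR stores the nonzero values, their column indices and row pointers.
    return row.data.nbytes + row.indices.nbytes + row.indptr.nbytes


if __name__ == "__main__":
    print("dense :", dense_bytes(), "bytes")
    print("sparse:", sparse_bytes(), "bytes")
```

The saving depends entirely on how sparse the vectors really are; a vector with scores for most subjects would gain little or nothing from CSR, since each stored value also needs an index.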

(related to #339)

osma added this to the Short term milestone Jan 27, 2020

osma commented Jan 27, 2020

On second thought (and after some not-very-successful experimentation), it might make sense to stick to dense NumPy arrays within the VectorSuggestionResult and ListSuggestionResult classes, and only use sparse arrays when large numbers of subject vectors need to be aggregated, for example within the PAV and NN ensemble backends and possibly in the evaluation functionality. There are several kinds of sparse arrays, and each has its own pros and cons in specific usage scenarios.
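
As a minimal sketch of that idea (not the approach actually taken in any PR), the snippet below keeps individual vectors dense and converts them to sparse rows only at the point where many of them are collected. The helper names `to_sparse_row` and `aggregate_mean` are hypothetical, and a plain mean stands in for whatever aggregation a real backend would perform. CSR is used here because it suits row-wise storage and arithmetic; formats such as LIL or DOK are generally better for incremental construction, and CSC for column slicing.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack


def to_sparse_row(dense_vector):
    """Convert one dense subject vector into a 1 x n CSR row."""
    arr = np.asarray(dense_vector, dtype=np.float32)
    return csr_matrix(arr.reshape(1, -1))


def aggregate_mean(dense_vectors):
    """Stack many vectors as sparse rows and return their mean as a dense array.

    Only the stacked collection is sparse; the result is a single dense
    vector again, which is cheap to hold in memory.
    """
    stacked = vstack([to_sparse_row(v) for v in dense_vectors], format="csr")
    # mean() on a sparse matrix returns a 1 x n np.matrix; flatten it.
    return np.asarray(stacked.mean(axis=0)).ravel()


if __name__ == "__main__":
    vectors = [
        [0.0, 0.5, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.25],
    ]
    print(aggregate_mean(vectors))
```

Converting at the aggregation boundary leaves the result classes untouched, at the cost of a dense-to-sparse conversion for every vector that is collected.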

osma commented Feb 4, 2020

PR #381 switches to sparse vectors within the nn_ensemble backend.

osma commented Feb 4, 2020

I think the most important uses for sparse vectors are already covered in #379 and #381. Sparse vectors could potentially be useful in the evaluation functionality (to save RAM), but I don't think that's crucial, and they might be problematic for performance. If it turns out to be necessary, we can open a new issue. Closing this one.
