
750 save csv v2 #941

Merged
merged 23 commits into from
Mar 30, 2022

Conversation

bhagemeier (Member) commented Mar 25, 2022

Description

This feature adds saving of data in CSV format. In order to utilize MPI-IO and write in parallel, the CSV output is normalized to fixed-width fields. This is not identical to np.savetxt(), but it may serve as a starting point to move in that direction. All possible splits for 2D tensors are supported, i.e. None, 0, and 1. save_csv will only ever store at most 2D data, comparable to np.savetxt().
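A minimal sketch of the fixed-width idea (the function name and formatting parameters here are illustrative, not Heat's actual API): padding every value to the same character width makes every row encode to the same byte length, which is what lets each process compute its file offset directly.

```python
# Illustrative fixed-width CSV row encoding (names and widths are
# assumptions for this sketch, not Heat's implementation).
def encode_row(row, width=12, decimals=6, sep=","):
    # Pad every value to the same character width so each row has a
    # fixed byte length; a rank's offset is then rows_before * row_len.
    return sep.join(f"{v:{width}.{decimals}f}" for v in row) + "\n"

# Rows of the same tensor always encode to the same number of bytes:
encode_row([1.5, -2.25, 3.0])
encode_row([100.0, 0.0, -9.9])
```

The fixed width trades file size for the ability to write without any inter-process coordination about row lengths.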

Issue/s resolved: #750

Changes proposed:

  • add ht.save_csv()

Type of change

  • New feature (non-breaking change which adds functionality)

Memory requirements

Requires a buffer (per process) to contain the row that is currently written. Except for very wide rows, it should be negligible. A 3595x3595 tensor taken from a real example worked well and showed only a small increase in the memory footprint.

Performance

  • split=None: no gain
  • split=0: reduces writing time substantially
  • split=1: performs similarly to split=0

Due Diligence

  • All supported (None, 0, 1) split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

Yes: the generic save function can now save CSVs.

skip ci

It has problems when running with mpirun, so we will try a different
approach on a different branch.
This is an implementation of save_csv that covers split None and 0.

resolves #750
Using the collective MPI.File.Write_at_all led to problems with not 
perfectly balanced chunks. The ordinary Write_at is much better for this 
purpose.

On the way, I also removed print statements and comments.
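With fixed-width rows, each process can write independently at a precomputed offset, which is why the independent Write_at works here. The following is a plain-Python stand-in for that pattern (mpi4py's MPI.File.Write_at is replaced by seek/write for illustration; the write order of the "ranks" does not matter):

```python
# Plain-Python stand-in for independent offset-based writes: each rank
# seeks to rows_before * row_len and writes its block, without any
# coordination with other ranks (illustrative, not Heat's code).
import os
import tempfile

row = b"  1.000000,  2.000000\n"  # one fixed-width row, 22 bytes

fd, path = tempfile.mkstemp()
os.close(fd)

def write_chunk(rows_before, data):
    # Open the shared file and write at the offset implied by the
    # number of rows owned by lower ranks.
    with open(path, "r+b") as f:
        f.seek(rows_before * len(row))
        f.write(data)

# "Rank 1" writes first, then "rank 0": order is irrelevant because
# the offsets are disjoint and precomputed.
write_chunk(1, row)
write_chunk(0, row)
```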
floating is the supertype of float32 and float64, not float, which is 
just an alias. Added a corresponding test.
The way we use MPI-IO does not reset existing contents of files and
therefore may leave garbage at the end if the data to be written has a
shorter representation than the existing file. Therefore, we reset by
default, but allow this step to be omitted.
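The stale-bytes problem can be demonstrated without MPI at all; in MPI-IO terms the reset would be a truncation such as File.Set_size(0), sketched here with a plain file:

```python
# Writing shorter data over a longer existing file leaves trailing
# garbage unless the file is truncated first (plain-Python sketch of
# the reset step; not Heat's actual code).
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("1.5,2.5,3.5\n" * 4)  # previous, longer contents (48 bytes)

with open(path, "r+") as f:       # overwrite WITHOUT truncating
    f.write("9.0,9.0\n")
with open(path) as f:
    stale = f.read()              # still 48 bytes: old rows at the end

os.truncate(path, 0)              # the reset step
with open(path, "r+") as f:
    f.write("9.0,9.0\n")
with open(path) as f:
    clean = f.read()              # exactly the 8 new bytes
```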
the difference from split 0 is not so big after all
sys was only there for debugging purposes
remove unreachable else branch and start from common offset for all 
splits
Not synchronizing at the end of writing the file may lead to strange 
effects for imbalanced tensors.

@bhagemeier bhagemeier marked this pull request as ready for review March 28, 2022 08:15
coquelin77 previously approved these changes Mar 28, 2022
bhagemeier (Member, Author) commented:
Just found a bug with split=1, which does not work for nprocs > shape[1]. This needs to be fixed before merging.

Having more processes than chunks in split 1 did not work. Rather than
checking whether we are the last (overall) rank, we check whether we
hold the last chunk of data, and we do not write anything if we have no data.
The last chunk is relevant for deciding whether to append a newline or a
separator at the end of our buffer.
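The edge case above can be made concrete with a small chunk-layout helper (an illustrative even-distribution scheme with the remainder spread over the first ranks; the function name and exact layout are assumptions, not necessarily Heat's):

```python
# Hypothetical chunk layout for split=0/1: with more processes than
# chunks, trailing ranks own zero elements and must not write.
def chunk_bounds(n, nprocs, rank):
    base, rem = divmod(n, nprocs)
    start = rank * base + min(rank, rem)
    count = base + (1 if rank < rem else 0)
    return start, count

# With 3 columns over 5 processes, ranks 3 and 4 hold no data, and the
# "last chunk" (which decides newline vs. separator) belongs to rank 2,
# not to the last overall rank.
```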
codecov bot commented Mar 28, 2022

Codecov Report

Merging #941 (20b2023) into main (3bcab68) will decrease coverage by 4.41%.
The diff coverage is 92.10%.

@@            Coverage Diff             @@
##             main     #941      +/-   ##
==========================================
- Coverage   95.50%   91.08%   -4.42%     
==========================================
  Files          64       64              
  Lines        9801     9875      +74     
==========================================
- Hits         9360     8995     -365     
- Misses        441      880     +439     
Flag | Coverage Δ
gpu  | ?
unit | 91.08% <92.10%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files | Coverage Δ
heat/core/io.py | 89.00% <92.10%> (+0.54%) ⬆️
heat/optim/dp_optimizer.py | 24.19% <0.00%> (-71.89%) ⬇️
heat/core/stride_tricks.py | 81.53% <0.00%> (-12.31%) ⬇️
heat/core/devices.py | 86.66% <0.00%> (-11.12%) ⬇️
heat/nn/data_parallel.py | 84.13% <0.00%> (-10.35%) ⬇️
heat/cluster/spectral.py | 85.71% <0.00%> (-8.58%) ⬇️
heat/core/communication.py | 89.90% <0.00%> (-6.86%) ⬇️
heat/core/tests/test_suites/basic_test.py | 91.26% <0.00%> (-4.86%) ⬇️
heat/core/linalg/qr.py | 97.30% <0.00%> (-2.70%) ⬇️
heat/utils/data/partial_dataset.py | 92.30% <0.00%> (-2.06%) ⬇️
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Sometimes, load_csv complained about files not being available anymore. 
We need to sync through a barrier before unlinking files.
The maximum of a tensor can be less than 0, so abs is needed around the max,
too, before passing it to log10. Also, a value of 0 must be excluded.
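The width computation that this commit fixes can be sketched as follows (the helper name and signature are mine, not Heat's): the integer width must be derived from the largest magnitude, so abs() wraps both extremes, and 0 must be excluded because log10(0) is undefined.

```python
import math

# Illustrative field-width helper: how many digits the integer part of
# any value in [t_min, t_max] can need.
def int_digits(t_min, t_max):
    m = max(abs(t_min), abs(t_max))  # abs: the extremes may both be < 0
    if m == 0:
        return 1                     # all-zero tensor still prints "0"
    return int(math.log10(m)) + 1
```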
bhagemeier (Member, Author) commented:
After thorough testing and finding and resolving two more bugs, I am sufficiently confident to merge this PR.

@bhagemeier bhagemeier merged commit 2b10dd2 into main Mar 30, 2022
@bhagemeier bhagemeier deleted the 750-save-csv-v2 branch March 30, 2022 08:01
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement a save_csv function
3 participants