750 save csv v2 #941
It has problems when running with mpirun, so I will try a different approach on a different branch.
This is an implementation of save_csv that covers split None and 0. resolves #750
Using the collective MPI.File.Write_at_all caused problems when chunks are not perfectly balanced. The non-collective Write_at works much better for this purpose. Along the way, I also removed print statements and comments.
floating is the supertype of float32 and float64, not float, which is just an alias. Added a corresponding test.
The way we use MPI-IO does not reset the existing contents of a file and may therefore leave garbage at the end if the data to be written has a shorter representation than the existing file. Therefore, we reset the file by default, but allow this step to be omitted.
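The stale-tail effect described in this commit message can be reproduced without MPI. The following is only a minimal sketch using ordinary Python file objects; `write_in_place` and its `truncate` flag are hypothetical names chosen for illustration and are not the parameters used in this PR.

```python
import os
import tempfile


def write_in_place(path, data, truncate=True):
    """Write `data` at offset 0 of `path`, optionally emptying the file first.

    Hypothetical helper: opening with "r+b" keeps existing bytes, which mimics
    a writer that does not reset the file before writing.
    """
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        if truncate:
            f.truncate(0)  # drop whatever the file contained before
        f.seek(0)
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")

    # First write a long representation, then a shorter one without resetting.
    write_in_place(path, b"1.000,2.000,3.000\n")
    write_in_place(path, b"9,8\n", truncate=False)
    with open(path, "rb") as f:
        stale = f.read()  # new data followed by leftovers of the old file

    # Same short write, but with the reset step enabled.
    write_in_place(path, b"9,8\n", truncate=True)
    with open(path, "rb") as f:
        clean = f.read()

print(stale)
print(clean)
```

Without the reset, the tail of the older, longer file survives behind the new data, which is why resetting is the safer default.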
the difference from split 0 is not so big after all
sys was only there for debugging purposes
remove unreachable else branch and start from common offset for all splits
Not synchronizing at the end of writing the file may lead to strange effects for imbalanced tensors.
Just found a bug with split=1, which does not work for nprocs>shape[1]. Need to fix before merging.
Having more processes than chunks in split 1 did not work. Rather than checking whether we are the last (overall) rank, we check whether we have the last chunk of data and don't write anything if we have no data. Last chunk is relevant to distinguish newline or separator addition at the end of our buffer.
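The "last chunk" rule above can be sketched in plain Python, without MPI. This is an illustration under stated assumptions: `chunk_bounds` uses a simple block distribution that may differ from how Heat actually assigns chunks, and both function names are hypothetical.

```python
def chunk_bounds(n, nprocs, rank):
    """Return (start, stop) of the block of `n` items owned by `rank`.

    Assumed block distribution: the first `n % nprocs` ranks get one extra item.
    """
    base, extra = divmod(n, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def row_piece(row, nprocs, rank, sep=","):
    """Text that `rank` contributes to one row split along the columns."""
    start, stop = chunk_bounds(len(row), nprocs, rank)
    if start == stop:
        return ""  # this rank holds no data, so it writes nothing
    has_last_chunk = stop == len(row)
    # The holder of the last chunk ends the line; everyone else adds a separator.
    return sep.join(row[start:stop]) + ("\n" if has_last_chunk else sep)


# Five simulated ranks, but only three columns: ranks 3 and 4 stay silent.
row = ["1.0", "2.0", "3.0"]
pieces = [row_piece(row, 5, rank) for rank in range(5)]
line = "".join(pieces)
print(pieces)
print(repr(line))
```

Deciding on "holds the last chunk" rather than "is the last rank" keeps the newline in the right place even when trailing ranks are empty.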
Codecov Report
@@            Coverage Diff             @@
##             main     #941      +/-   ##
==========================================
- Coverage   95.50%   91.08%   -4.42%
==========================================
  Files          64       64
  Lines        9801     9875      +74
==========================================
- Hits         9360     8995     -365
- Misses        441      880     +439
Sometimes, load_csv complained about files not being available anymore. We need to sync through a barrier before unlinking files.
The maximum of a tensor can be less than 0, so we need abs around the max, too, before passing it to log10. Also, a value of 0 must be excluded.
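A minimal sketch of the digit-count problem behind this fix, in plain Python rather than Heat tensors. `integer_digits` is a hypothetical helper, not the code in this PR; it only shows why the absolute value and the zero case matter before calling log10.

```python
import math


def integer_digits(values):
    """Digits needed before the decimal point for the largest magnitude.

    Hypothetical helper: takes the absolute value first, because log10 is
    undefined for negative numbers, and special-cases magnitudes below 1,
    because log10(0) is undefined and fractions still need one leading digit.
    """
    largest = max((abs(v) for v in values), default=0)
    if largest < 1:
        return 1
    return int(math.floor(math.log10(largest))) + 1


print(integer_digits([-5, -120.5, -3]))  # all-negative data
print(integer_digits([0, 0]))            # all zeros
print(integer_digits([0.25, -0.5]))      # magnitudes below one
```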
After thorough testing, finding and resolving 2 more bugs, I am sufficiently confident to merge this PR.
Description
This feature provides saving of data in CSV format. In order to utilize MPI-IO and write in parallel, the CSV format written is normalized to fixed-width. This is not the same as np.savetxt(), but may serve as a starting point to move in that direction. Supports all possible splits for 2D tensors, so None, 0, and 1.
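To illustrate why a fixed-width layout helps parallel writing, here is a small sketch that does not use MPI: because every row has the same byte length, the offset of row i is simply i times the row size, so rows can be written independently and in any order. The field width, number of decimals, and the names `format_row` and `row_bytes` are assumptions for illustration, not the format this PR writes.

```python
import io


def format_row(row, width=8, decimals=3, sep=","):
    """Encode one row with every field padded to `width` characters.

    Assumes each value fits into `width`; a wider value would break the layout.
    """
    fields = [f"{v:{width}.{decimals}f}" for v in row]
    return (sep.join(fields) + "\n").encode("ascii")


def row_bytes(ncols, width=8, sep=","):
    """Byte length of any row: fields, separators, and one newline."""
    return ncols * width + (ncols - 1) * len(sep) + 1


data = [[1.0, -2.5], [3.25, 4.0], [-5.0, 6.125]]
buf = io.BytesIO()  # stands in for the shared output file

# Write the rows out of order, as independent processes might do.
for i in (2, 0, 1):
    buf.seek(i * row_bytes(2))
    buf.write(format_row(data[i]))

text = buf.getvalue().decode("ascii")
print(text, end="")
```

The price of this approach is padding in every field, which is why the output is not byte-identical to what np.savetxt() produces.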
save_csv will only ever store at most 2D data, comparable to np.savetxt().
Issue/s resolved: #750
Changes proposed:
Type of change
Memory requirements
Requires a buffer (per process) to contain the row that is currently written. Except for very wide rows, it should be negligible. A 3595x3595 tensor taken from a real example worked well and showed only a small increase in the memory footprint.
Performance
Due Diligence
Does this change modify the behaviour of other functions? If so, which?
yes: the generic save function can now save CSVs
skip ci