-
Notifications
You must be signed in to change notification settings - Fork 152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix structured array IO #252
Fix structured array IO #252
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should either allow any length of strings or warn the user if their strings are too long. If my suggestion below doesn’t work, we should do
value = value.astype(new_dtype)
idx_long, = (np.char.str_len(value) == 200).nonzero()
if len(idx_long):
idx_str_ = ', '.join(idx_long[:10])
if len(idx_long) > 10:
idx_str += '...'
warn(f'The element(s) {idx_str} are ≥200 bytes long and have been trimmed.')
return value
I've figured out how to make these work without it being too complicated. BTW, I think it might be worth writing out a conversion table of what we think should happen to values of different types when they're written/ read from disk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK great! Please address the unused import reported by codacy and the following comment.
If you want to keep the write error and the IO fixes separate commits, please do
$ git rebase -i --exec='black . && PYTHONPATH=. pytest' master
$ git log --oneline --max-count=3
XXXXXXX (HEAD -> zarr-structured-array-strings) Allow any length of string to be written in a structured array to zarr
XXXXXXX Add useful info to write errors
31659e6 (origin/master) Add repr tests
Else I’ll squash & commit everything.
anndata/readwrite/utils.py
Outdated
# raise AnnDataReadError( | ||
# f"Above error raised while writing key '{key}' of type {type(val)} from {parent}." | ||
# ) | ||
))).with_traceback(sys.exc_info()[2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unclear what this is supposed to do. “Thrown twice”? If you mean you don’t want to see
During handling of the above exception, another exception occurred
… you should use exception chaining (PEP 3134). It creates less confusing output that what this does:
raise ... from e
:
In [12]: try:
...: try:
...: raise Exception('test')
...: except Exception:
...: raise Exception('test2')
...: except Exception as e:
...: raise Exception(f'{e} 3') from e
...:
...:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-12-6e2b8f9e70be> in <module>
2 try:
----> 3 raise Exception('test')
4 except Exception:
Exception: test
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
<ipython-input-12-6e2b8f9e70be> in <module>
4 except Exception:
----> 5 raise Exception('test2')
6 except Exception as e:
Exception: test2
The above exception was the direct cause of the following exception:
Exception Traceback (most recent call last)
<ipython-input-12-6e2b8f9e70be> in <module>
5 raise Exception('test2')
6 except Exception as e:
----> 7 raise Exception(f'{e} 3') from e
8
9
Exception: test2 3
with_traceback
:
In [13]: try:
...: try:
...: raise Exception('test')
...: except Exception:
...: raise Exception('test2')
...: except Exception as e:
...: raise Exception(f'{e} 3').with_traceback(sys.exc_info()[2])
...:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-13-8dde79ff88d6> in <module>
2 try:
----> 3 raise Exception('test')
4 except Exception:
Exception: test
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
<ipython-input-13-8dde79ff88d6> in <module>
4 except Exception:
----> 5 raise Exception('test2')
6 except Exception as e:
Exception: test2
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
<ipython-input-13-8dde79ff88d6> in <module>
5 raise Exception('test2')
6 except Exception as e:
----> 7 raise Exception(f'{e} 3').with_traceback(sys.exc_info()[2])
8
<ipython-input-13-8dde79ff88d6> in <module>
3 raise Exception('test')
4 except Exception:
----> 5 raise Exception('test2')
6 except Exception as e:
7 raise Exception(f'{e} 3').with_traceback(sys.exc_info()[2])
Exception: test2 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the traceback is longer I think it doesn't look as strange for the with_traceback
version, but sure. from
also seems to do some reordering, but either is easy enough to parse.
What I would really like to do, is just modify the existing error to have some extra information in the message. I don't think there's a perfect solution to this since exceptions can define their own string representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, type(e)(msg)
is flaky. The exception type’s __init__
might not accept a message as its first argument.
If you understand the with_traceback
output, please explain it to me because I don’t. The modified exception has 2 locations? huh?
This definitely shouldn't be squashed. The granularity of the commits will be useful in the future. While writing this I wished I still had the commit history from previous implementation so I could see what things I had tried before. This is especially the case for working with poorly documented interfaces like structured dtypes. I could potentially reorder the commits, and squash some formatting ones, but since I typically navigate git history through blames I'm not sure how useful this would be. |
Of course blame should be maximally useful, so there’s two options:
Keeping “Formatting” and “whoops” commits are detrimental to blame’s usefulness. |
@flying-sheep, I've opened a new issue about managing git history so we can split that discussion off. For this PR, I'll follow the strategy I propose in my top comment. Once this is ready to merge, I'll remove formatting commits, and consider squashing other ones. We can figure out a different strategy, but I don't think it's worth holding up this PR for. Back on topic of this PR, is there code that should change? |
No, looks good to go! |
Fixed roundtripping structured dtypes with unicode strings for hdf5. This still fails for zarr.
This makes it so we will always read an object array from disk. However, this writes fixed length unicode data to disk for zarr in structured arrays. In future, we'll write vlength arrays for text data, but that doesn't work at the moment. One difficulty with implementing and testing this is that you can't compare a structured array with variable length strings to a structured array of fixed length strings in numpy. This could use a work-around.
83f52cd
to
6e1dc84
Compare
@flying-sheep, I've removed the formatting commits. I think I want to test this more thoroughly, but I'll do that in a separate PR. Ready to merge. |
Great, thank you! Now let's get this released! |
Fixes #249 and fixes scverse/scanpy#832 by saying we'll always read unicode strings in as variable length. This still needs an editing pass, and another look over in the morning, but I'd like to get feedback as soon as possible. @flying-sheep, thoughts?