Fix structured array IO #252

ivirshup · 2019-11-18T08:40:50Z

Fixes #249 and fixes scverse/scanpy#832 by saying we'll always read unicode strings in as variable length. This still needs an editing pass, and another look over in the morning, but I'd like to get feedback as soon as possible. @flying-sheep, thoughts?

flying-sheep

I think we should either allow any length of strings or warn the user if their strings are too long. If my suggestion below doesn’t work, we should do

value = value.astype(new_dtype)

idx_long, = (np.char.str_len(value) == 200).nonzero()
if len(idx_long):
    idx_str_ = ', '.join(idx_long[:10])
    if len(idx_long) > 10:
        idx_str += '...'
    warn(f'The element(s) {idx_str} are ≥200 bytes long and have been trimmed.')

return value

anndata/compat.py

ivirshup · 2019-11-19T04:54:45Z

I think we should either allow any length of strings or warn the user if their strings are too long.

I've figured out how to make these work without it being too complicated. BTW, np.char.str_len(value) doesn't work on arrays of objects.

I think it might be worth writing out a conversion table of what we think should happen to values of different types when they're written/ read from disk.

anndata/compat.py

anndata/readwrite/h5ad.py

anndata/compat.py

anndata/readwrite/utils.py

flying-sheep

OK great! Please address the unused import reported by codacy and the following comment.

If you want to keep the write error and the IO fixes separate commits, please do

$ git rebase -i --exec='black . && PYTHONPATH=. pytest' master
$ git log --oneline --max-count=3
XXXXXXX (HEAD -> zarr-structured-array-strings) Allow any length of string to be written in a structured array to zarr
XXXXXXX Add useful info to write errors
31659e6 (origin/master) Add repr tests

Else I’ll squash & commit everything.

flying-sheep · 2019-11-19T11:42:53Z

anndata/readwrite/utils.py

-                # raise AnnDataReadError(
-                #     f"Above error raised while writing key '{key}' of type {type(val)} from {parent}."
-                # )
+                ))).with_traceback(sys.exc_info()[2])


Unclear what this is supposed to do. “Thrown twice”? If you mean you don’t want to see

During handling of the above exception, another exception occurred

… you should use exception chaining (PEP 3134). It creates less confusing output that what this does:

raise ... from e:

In [12]: try: ...: try: ...: raise Exception('test') ...: except Exception: ...: raise Exception('test2') ...: except Exception as e: ...: raise Exception(f'{e} 3') from e ...: ...: --------------------------------------------------------------------------- Exception Traceback (most recent call last) <ipython-input-12-6e2b8f9e70be> in <module> 2 try: ----> 3 raise Exception('test') 4 except Exception: Exception: test During handling of the above exception, another exception occurred: Exception Traceback (most recent call last) <ipython-input-12-6e2b8f9e70be> in <module> 4 except Exception: ----> 5 raise Exception('test2') 6 except Exception as e: Exception: test2 The above exception was the direct cause of the following exception: Exception Traceback (most recent call last) <ipython-input-12-6e2b8f9e70be> in <module> 5 raise Exception('test2') 6 except Exception as e: ----> 7 raise Exception(f'{e} 3') from e 8 9 Exception: test2 3

with_traceback:

In [13]: try: ...: try: ...: raise Exception('test') ...: except Exception: ...: raise Exception('test2') ...: except Exception as e: ...: raise Exception(f'{e} 3').with_traceback(sys.exc_info()[2]) ...: --------------------------------------------------------------------------- Exception Traceback (most recent call last) <ipython-input-13-8dde79ff88d6> in <module> 2 try: ----> 3 raise Exception('test') 4 except Exception: Exception: test During handling of the above exception, another exception occurred: Exception Traceback (most recent call last) <ipython-input-13-8dde79ff88d6> in <module> 4 except Exception: ----> 5 raise Exception('test2') 6 except Exception as e: Exception: test2 During handling of the above exception, another exception occurred: Exception Traceback (most recent call last) <ipython-input-13-8dde79ff88d6> in <module> 5 raise Exception('test2') 6 except Exception as e: ----> 7 raise Exception(f'{e} 3').with_traceback(sys.exc_info()[2]) 8 <ipython-input-13-8dde79ff88d6> in <module> 3 raise Exception('test') 4 except Exception: ----> 5 raise Exception('test2') 6 except Exception as e: 7 raise Exception(f'{e} 3').with_traceback(sys.exc_info()[2]) Exception: test2 3

When the traceback is longer I think it doesn't look as strange for the with_traceback version, but sure. from also seems to do some reordering, but either is easy enough to parse.

What I would really like to do, is just modify the existing error to have some extra information in the message. I don't think there's a perfect solution to this since exceptions can define their own string representation.

yes, type(e)(msg) is flaky. The exception type’s __init__ might not accept a message as its first argument.

If you understand the with_traceback output, please explain it to me because I don’t. The modified exception has 2 locations? huh?

ivirshup · 2019-11-20T03:37:06Z

This definitely shouldn't be squashed. The granularity of the commits will be useful in the future. While writing this I wished I still had the commit history from previous implementation so I could see what things I had tried before. This is especially the case for working with poorly documented interfaces like structured dtypes.

I could potentially reorder the commits, and squash some formatting ones, but since I typically navigate git history through blames I'm not sure how useful this would be.

flying-sheep · 2019-11-20T10:33:15Z

Of course blame should be maximally useful, so there’s two options:

Squash everything with a GitHub button (low effort, low granularity, but PR text and commit message contain everything)
Manually rebase: Every commit done to fix tests or formatting or bugs you introduced in earlier commits of the same PR need to be squashed. (high effort, perfect granularity)

Keeping “Formatting” and “whoops” commits are detrimental to blame’s usefulness.

ivirshup · 2019-11-21T06:56:44Z

@flying-sheep, I've opened a new issue about managing git history so we can split that discussion off. For this PR, I'll follow the strategy I propose in my top comment. Once this is ready to merge, I'll remove formatting commits, and consider squashing other ones. We can figure out a different strategy, but I don't think it's worth holding up this PR for.

Back on topic of this PR, is there code that should change?

flying-sheep · 2019-11-21T14:43:30Z

No, looks good to go!

Fixed roundtripping structured dtypes with unicode strings for hdf5. This still fails for zarr.

This makes it so we will always read an object array from disk. However, this writes fixed length unicode data to disk for zarr in structured arrays. In future, we'll write vlength arrays for text data, but that doesn't work at the moment. One difficulty with implementing and testing this is that you can't compare a structured array with variable length strings to a structured array of fixed length strings in numpy. This could use a work-around.

ivirshup · 2019-11-22T23:49:36Z

@flying-sheep, I've removed the formatting commits. I think I want to test this more thoroughly, but I'll do that in a separate PR. Ready to merge.

flying-sheep · 2019-11-25T09:21:44Z

Great, thank you! Now let's get this released!

flying-sheep requested changes Nov 18, 2019

View reviewed changes

anndata/compat.py Outdated Show resolved Hide resolved

flying-sheep requested changes Nov 19, 2019

View reviewed changes

ivirshup mentioned this pull request Nov 21, 2019

Managing git history #254

Closed

ivirshup changed the title ~~[WIP] Fix structured array IO~~ Fix structured array IO Nov 21, 2019

flying-sheep approved these changes Nov 21, 2019

View reviewed changes

ivirshup added 9 commits November 23, 2019 10:35

Add useful info to write errors

85f66ab

Partial fix for structured dtype IO.

496c94d

Fixed roundtripping structured dtypes with unicode strings for hdf5. This still fails for zarr.

Allow any length of string to be written in a structured array to zarr

88c93c4

Fix write error being thrown twice

5df9b77

Tentative change from descr to fields for _to_hdf5_vlen_strings

07bd6ff

Minor changes to dtype access in compat

9c7417d

Code reuse for error handling

6b1a44d

Modify way write exceptions are raised

6e1dc84

ivirshup force-pushed the zarr-structured-array-strings branch from 83f52cd to 6e1dc84 Compare November 22, 2019 23:39

flying-sheep merged commit 5c634f2 into scverse:master Nov 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix structured array IO #252

Fix structured array IO #252

ivirshup commented Nov 18, 2019 •

edited by flying-sheep

Loading

flying-sheep left a comment •

edited

Loading

ivirshup commented Nov 19, 2019 •

edited

Loading

flying-sheep left a comment •

edited

Loading

flying-sheep Nov 19, 2019 •

edited

Loading

ivirshup Nov 20, 2019

flying-sheep Nov 20, 2019 •

edited

Loading

ivirshup commented Nov 20, 2019

flying-sheep commented Nov 20, 2019 •

edited

Loading

ivirshup commented Nov 21, 2019

flying-sheep commented Nov 21, 2019

ivirshup commented Nov 22, 2019

flying-sheep commented Nov 25, 2019

Fix structured array IO #252

Fix structured array IO #252

Conversation

ivirshup commented Nov 18, 2019 • edited by flying-sheep Loading

flying-sheep left a comment • edited Loading

Choose a reason for hiding this comment

ivirshup commented Nov 19, 2019 • edited Loading

flying-sheep left a comment • edited Loading

Choose a reason for hiding this comment

flying-sheep Nov 19, 2019 • edited Loading

Choose a reason for hiding this comment

ivirshup Nov 20, 2019

Choose a reason for hiding this comment

flying-sheep Nov 20, 2019 • edited Loading

Choose a reason for hiding this comment

ivirshup commented Nov 20, 2019

flying-sheep commented Nov 20, 2019 • edited Loading

ivirshup commented Nov 21, 2019

flying-sheep commented Nov 21, 2019

ivirshup commented Nov 22, 2019

flying-sheep commented Nov 25, 2019

ivirshup commented Nov 18, 2019 •

edited by flying-sheep

Loading

flying-sheep left a comment •

edited

Loading

ivirshup commented Nov 19, 2019 •

edited

Loading

flying-sheep left a comment •

edited

Loading

flying-sheep Nov 19, 2019 •

edited

Loading

flying-sheep Nov 20, 2019 •

edited

Loading

flying-sheep commented Nov 20, 2019 •

edited

Loading