Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix structured array IO #252

Merged
merged 9 commits into from
Nov 25, 2019

Conversation

ivirshup
Copy link
Member

@ivirshup ivirshup commented Nov 18, 2019

Fixes #249 and fixes scverse/scanpy#832 by saying we'll always read unicode strings in as variable length. This still needs an editing pass, and another look over in the morning, but I'd like to get feedback as soon as possible. @flying-sheep, thoughts?

Copy link
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should either allow any length of strings or warn the user if their strings are too long. If my suggestion below doesn’t work, we should do

value = value.astype(new_dtype)

idx_long, = (np.char.str_len(value) == 200).nonzero()
if len(idx_long):
    idx_str_ = ', '.join(idx_long[:10])
    if len(idx_long) > 10:
        idx_str += '...'
    warn(f'The element(s) {idx_str} are ≥200 bytes long and have been trimmed.')

return value

anndata/compat.py Outdated Show resolved Hide resolved
@ivirshup
Copy link
Member Author

ivirshup commented Nov 19, 2019

I think we should either allow any length of strings or warn the user if their strings are too long.

I've figured out how to make these work without it being too complicated. BTW, np.char.str_len(value) doesn't work on arrays of objects.

I think it might be worth writing out a conversion table of what we think should happen to values of different types when they're written/ read from disk.

anndata/compat.py Outdated Show resolved Hide resolved
anndata/compat.py Outdated Show resolved Hide resolved
anndata/readwrite/h5ad.py Outdated Show resolved Hide resolved
anndata/readwrite/h5ad.py Outdated Show resolved Hide resolved
anndata/compat.py Outdated Show resolved Hide resolved
anndata/readwrite/utils.py Outdated Show resolved Hide resolved
anndata/readwrite/utils.py Outdated Show resolved Hide resolved
anndata/readwrite/utils.py Show resolved Hide resolved
Copy link
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK great! Please address the unused import reported by codacy and the following comment.

If you want to keep the write error and the IO fixes separate commits, please do

$ git rebase -i --exec='black . && PYTHONPATH=. pytest' master
$ git log --oneline --max-count=3
XXXXXXX (HEAD -> zarr-structured-array-strings) Allow any length of string to be written in a structured array to zarr
XXXXXXX Add useful info to write errors
31659e6 (origin/master) Add repr tests

Else I’ll squash & commit everything.

# raise AnnDataReadError(
# f"Above error raised while writing key '{key}' of type {type(val)} from {parent}."
# )
))).with_traceback(sys.exc_info()[2])
Copy link
Member

@flying-sheep flying-sheep Nov 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear what this is supposed to do. “Thrown twice”? If you mean you don’t want to see

During handling of the above exception, another exception occurred

… you should use exception chaining (PEP 3134). It creates less confusing output that what this does:

raise ... from e:
In [12]: try: 
    ...:     try: 
    ...:         raise Exception('test') 
    ...:     except Exception: 
    ...:         raise Exception('test2') 
    ...: except Exception as e: 
    ...:     raise Exception(f'{e} 3') from e 
    ...:      
    ...:                                                                                                                                                                                                                                     
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-12-6e2b8f9e70be> in <module>
      2     try:
----> 3         raise Exception('test')
      4     except Exception:

Exception: test

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-12-6e2b8f9e70be> in <module>
      4     except Exception:
----> 5         raise Exception('test2')
      6 except Exception as e:

Exception: test2

The above exception was the direct cause of the following exception:

Exception                                 Traceback (most recent call last)
<ipython-input-12-6e2b8f9e70be> in <module>
      5         raise Exception('test2')
      6 except Exception as e:
----> 7     raise Exception(f'{e} 3') from e
      8 
      9 

Exception: test2 3
with_traceback:
In [13]: try: 
    ...:     try: 
    ...:         raise Exception('test') 
    ...:     except Exception: 
    ...:         raise Exception('test2') 
    ...: except Exception as e: 
    ...:     raise Exception(f'{e} 3').with_traceback(sys.exc_info()[2]) 
    ...:                                                                                                                                                                                                                                     
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-13-8dde79ff88d6> in <module>
      2     try:
----> 3         raise Exception('test')
      4     except Exception:

Exception: test

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-13-8dde79ff88d6> in <module>
      4     except Exception:
----> 5         raise Exception('test2')
      6 except Exception as e:

Exception: test2

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-13-8dde79ff88d6> in <module>
      5         raise Exception('test2')
      6 except Exception as e:
----> 7     raise Exception(f'{e} 3').with_traceback(sys.exc_info()[2])
      8 

<ipython-input-13-8dde79ff88d6> in <module>
      3         raise Exception('test')
      4     except Exception:
----> 5         raise Exception('test2')
      6 except Exception as e:
      7     raise Exception(f'{e} 3').with_traceback(sys.exc_info()[2])

Exception: test2 3

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the traceback is longer I think it doesn't look as strange for the with_traceback version, but sure. from also seems to do some reordering, but either is easy enough to parse.

What I would really like to do, is just modify the existing error to have some extra information in the message. I don't think there's a perfect solution to this since exceptions can define their own string representation.

Copy link
Member

@flying-sheep flying-sheep Nov 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, type(e)(msg) is flaky. The exception type’s __init__ might not accept a message as its first argument.

If you understand the with_traceback output, please explain it to me because I don’t. The modified exception has 2 locations? huh?

@ivirshup
Copy link
Member Author

This definitely shouldn't be squashed. The granularity of the commits will be useful in the future. While writing this I wished I still had the commit history from previous implementation so I could see what things I had tried before. This is especially the case for working with poorly documented interfaces like structured dtypes.

I could potentially reorder the commits, and squash some formatting ones, but since I typically navigate git history through blames I'm not sure how useful this would be.

@flying-sheep
Copy link
Member

flying-sheep commented Nov 20, 2019

Of course blame should be maximally useful, so there’s two options:

  1. Squash everything with a GitHub button (low effort, low granularity, but PR text and commit message contain everything)
  2. Manually rebase: Every commit done to fix tests or formatting or bugs you introduced in earlier commits of the same PR need to be squashed. (high effort, perfect granularity)

Keeping “Formatting” and “whoops” commits are detrimental to blame’s usefulness.

@ivirshup ivirshup mentioned this pull request Nov 21, 2019
@ivirshup
Copy link
Member Author

@flying-sheep, I've opened a new issue about managing git history so we can split that discussion off. For this PR, I'll follow the strategy I propose in my top comment. Once this is ready to merge, I'll remove formatting commits, and consider squashing other ones. We can figure out a different strategy, but I don't think it's worth holding up this PR for.


Back on topic of this PR, is there code that should change?

@ivirshup ivirshup changed the title [WIP] Fix structured array IO Fix structured array IO Nov 21, 2019
@flying-sheep
Copy link
Member

No, looks good to go!

Fixed roundtripping structured dtypes with unicode strings for hdf5. This still fails for zarr.
This makes it so we will always read an object array from disk. However, this writes fixed length unicode data to disk for zarr in structured arrays. In future, we'll write vlength arrays for text data, but that doesn't work at the moment.

One difficulty with implementing and testing this is that you can't compare a structured array with variable length strings to a structured array of fixed length strings in numpy. This could use a work-around.
@ivirshup ivirshup force-pushed the zarr-structured-array-strings branch from 83f52cd to 6e1dc84 Compare November 22, 2019 23:39
@ivirshup
Copy link
Member Author

@flying-sheep, I've removed the formatting commits. I think I want to test this more thoroughly, but I'll do that in a separate PR. Ready to merge.

@flying-sheep flying-sheep merged commit 5c634f2 into scverse:master Nov 25, 2019
@flying-sheep
Copy link
Member

Great, thank you! Now let's get this released!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants