Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add missing datasets to public API #4951

Merged
merged 26 commits into from
Oct 29, 2020
Merged

Conversation

hmaarrfk
Copy link
Member

@hmaarrfk hmaarrfk commented Aug 30, 2020

Description

@scikit-image/core I noticed a few datasets were missing from data.__init__.py.

I added them and updated the examples that used them.

However, I am unclear about the exact meaning of these datasets, and their importance.

Help is appreciated elaborating on the description of each dataset.

TODO:

Checklist

For reviewers

  • Check that the PR title is short, concise, and will make sense 1 year
    later.
  • Check that new functions are imported in corresponding __init__.py.
  • Check that new features, API changes, and deprecations are mentioned in
    doc/release/release_dev.rst.

@alexdesiqueira
Copy link
Member

However, I am unclear about the exact meaning of these datasets, and their importance.
Help is appreciated elaborating on the description of each dataset.

Not sure what you mean here @hmaarrfk. Where would we describe? What would you like to see?
There are some descriptions on our GitLab data repo. Is this useful, somehow?

def cells3D():
"""3D microscopy images of cells.

The data for this was provided by the Allen Institute for Cell Science. It
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think information like 598–609 should be in a Notes section.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understanding your data is critical to good analysis and visualization.

What is the difference between the "Notes" and the "Description".

If you feel strongly about it, do reorganize.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We follow the Numpy docstyle. According to them:

The sections of a function’s docstring are:
1. Short summary
A one-line summary that does not use variable names or the function name (...)
(...)
13. Notes
An optional section that provides additional information about the code, possibly including a discussion (...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made some modifications, however, I will point to:

Extended Summary

A few sentences giving an extended description. This section should be used to clarify functionality, not to discuss implementation detail or background theory, which should rather be explored in the Notes section below. You may refer to the parameters and the function name, but parameter descriptions still belong in the Parameters section.

Functionality here, for me, represents things like metadata, depicting voxel size, which are simply not communicated in a numpy array. I've kept those details.

@hmaarrfk
Copy link
Member Author

Thanks for the review @alexdesiqueira! I don't see the need for a Notes section if the description is empty.

IMO: we can reorganize once the description gets crowded.

@pep8speaks
Copy link

pep8speaks commented Aug 31, 2020

Hello @hmaarrfk! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 134:80: E501 line too long (91 > 79 characters)
Line 138:80: E501 line too long (88 > 79 characters)
Line 143:80: E501 line too long (88 > 79 characters)
Line 148:80: E501 line too long (158 > 79 characters)
Line 149:80: E501 line too long (132 > 79 characters)

Comment last updated at 2020-10-29 08:13:43 UTC

@sciunto
Copy link
Member

sciunto commented Aug 31, 2020

CC @mkcor



#####################################################################
# Load and display 3D images
# ==========================
# Three-dimensional data can be loaded with `io.imread`.
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#
# ==========================

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

# Three-dimensional data can be loaded with `io.imread`.
#
# The optional dependency, pooch is required to obtain the dataset.
#####################################################################
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is not necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks


Notes
-----

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This blank line can be deleted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

skimage/data/__init__.py Outdated Show resolved Hide resolved
skimage/data/__init__.py Outdated Show resolved Hide resolved
@sciunto
Copy link
Member

sciunto commented Aug 31, 2020

I think my comments will turn the CI green. Once applied, i'm ready to approve.

Copy link
Member

@jni jni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with @sciunto's suggestions.

skimage/data/tests/test_data.py Outdated Show resolved Hide resolved
skimage/data/__init__.py Outdated Show resolved Hide resolved
skimage/data/__init__.py Outdated Show resolved Hide resolved
skimage/data/__init__.py Outdated Show resolved Hide resolved
@hmaarrfk
Copy link
Member Author

Thank you all for the reviews.

@mkcor suggested that we unify the names in the registry with the names of the functions.

Now that I look at this, I notice that we might be "breaking" a private API. Are you guys OK with that? I'm not sure how many people you've pointed to these functions.

@mkcor
Copy link
Member

mkcor commented Aug 31, 2020

@mkcor suggested that we unify the names in the registry with the names of the functions.

Did I? ;-)

Are you referring to my considerations about one-word names?

@hmaarrfk
Copy link
Member Author

I'm refering to #4951 (comment) :D

@mkcor
Copy link
Member

mkcor commented Aug 31, 2020

I'm refering to #4951 (comment) :D

Oh! Well, if this PR is merged, fetch('data/cells.tif') simply won't work anymore... So this comment of mine wasn't discussing naming conventions.

But 😉 I'm always happy to have a discussion about naming conventions, as @jni knows! I know that we already have cell in our data registry, but I'm not too fond of cells3D. Then, should we make sure all 3D images have a name ending with '3D'...? I prefer documenting this in the docstring.

@hmaarrfk
Copy link
Member Author

Honestly, i don't know what to call this dataset.

Context would help. I would prefer not naming it cells3D either.

Something more descriptive of the experiment would be really nice.

@alexdesiqueira
Copy link
Member

Context would help. I would prefer not naming it cells3D either.

In this case, I'd go for "if isn't broke don't fix it" 😂 cells is the best way to go (until we find a better one).

@jni
Copy link
Member

jni commented Sep 1, 2020

Sorry, I'm not sure about the complete discussion here. Can someone summarise the proposed options? Regarding deprecating names of functions already released, I'd prefer we avoid that, even if we didn't publicise them. Someone might have done data.<tab> and used some of those functions. For stuff that hasn't been released, we can be flexible going forward.

On example is the cells-3d dataset. For that one, if we haven't actually published it in the data. namespace, I think actually we should rework it to contain both channels — there is a membranes.tif file in the tutorials repo that corresponds to the same cells. As to what to call it... I dunno, I kinda like cells3d (lowercase d), other alternatives welcome...

@hmaarrfk
Copy link
Member Author

hmaarrfk commented Sep 3, 2020

@mkcor would you mind helping me attribute the data correctly?

Do you have al link to the original data?

Maybe this sentence isn't correct: "Data of human cells undergoing mitosis taken from [1]_." Strictly speaking, the image wasn't taken from that paper, it was taken from CellProfiler as detailed here: https://scikit-image.org/docs/dev/auto_examples/applications/plot_human_mitosis.html#segment-human-cells-in-mitosis
(The image is part of the data used in the cited paper, but does not ship as is with the paper, even considering the supplemental material, etc.)

@@ -170,4 +173,4 @@ def test_lily_multichannel():
"""
fetch('data/lily.tif')
lily = data.lily()
assert lily.shape == (922, 922, 4)
assert lily.shape == (922, 922, 4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't support this change 😉 (only referring to the EOF thing)

@mkcor
Copy link
Member

mkcor commented Sep 3, 2020

Sorry, I'm not sure about the complete discussion here. Can someone summarise the proposed options? Regarding deprecating names of functions already released, I'd prefer we avoid that, even if we didn't publicise them. Someone might have done data.<tab> and used some of those functions. For stuff that hasn't been released, we can be flexible going forward.

@jni data.mitosis() and data.cells() have not been published yet; they are indeed missing. With this PR, @hmaarrfk is filling this gap while renaming mitosis into human_mitosis -- I don't particularly support this renaming because, then, should we rename kidney into mouse_kidney?

On example is the cells-3d dataset. For that one, if we haven't actually published it in the data. namespace, I think actually we should rework it to contain both channels — there is a membranes.tif file in the tutorials repo that corresponds to the same cells. As to what to call it... I dunno, I kinda like cells3d (lowercase d), other alternatives welcome...

Interesting, I wasn't aware of this. Ok, so let's look into this more carefully before we publish membranes and/or cells3d.

@mkcor
Copy link
Member

mkcor commented Sep 3, 2020

@mkcor would you mind helping me attribute the data correctly?

Do you have a link to the original data?

Sure, it is: https://github.com/CellProfiler/examples/blob/master/ExampleHuman/images/AS_09125_050116030001_D03f00d0.tif as documented under AS_09125_050116030001_D03f00d0.tif at https://gitlab.com/scikit-image/data/#data

@mkcor
Copy link
Member

mkcor commented Sep 3, 2020

mouse_kidney

s/mouse/murine/

@hmaarrfk
Copy link
Member Author

hmaarrfk commented Sep 3, 2020

thanks for the reference for human mitosis.

@jni do you k,now the original link for the cells3d?

@emmanuelle
Copy link
Member

@jni do you k,now the original link for the cells3d?

Knowing more about the dataset might help crafting a new name.

@emmanuelle
Copy link
Member

Actually I'm quite in favor of cells3d because it tells user they should not directly call plt.imshow on the image array.

@hmaarrfk
Copy link
Member Author

Thanks @grlee77 i've fixed a few more thngs.

I'm a little hesitant that we are pointing to master, and not to a specific branch or to more permanent links.

That said, the sample data feels more ephemeral to me, than the library itself.

@grlee77
Copy link
Contributor

grlee77 commented Oct 29, 2020

I see there was one other issue by @jni that remains to be addressed: #4951 (comment)

@jni, was your intention that skimage.data.cells3d should return a shape (60, 256, 256, 2) array created by stacking these two files?

Here is a plot of a single slice of the 3D datasets overlaid for context

from skimage import io
import matplotlib.pyplot as plt

m = io.imread('cells_membrane.tif')
c = io.imread('cells.tif')
fig, axes = plt.subplots(1, 3)
axes[0].imshow(c[30])
axes[0].set_title('cells.tif')
axes[1].imshow(m[30])
axes[1].set_title('cells_membrane.tif')
axes[2].imshow(c[30] + 2*m[30])
axes[2].set_title('overlaid')

cells_and_membranes

@jni
Copy link
Member

jni commented Oct 29, 2020

Yeah, that's my goal/intention. I intend to push to this branch shortly, hold please!

@jni
Copy link
Member

jni commented Oct 29, 2020

Done! Assuming tests pass this is ok from me.

@emmanuelle emmanuelle merged commit 636592a into scikit-image:master Oct 29, 2020
@jni
Copy link
Member

jni commented Oct 29, 2020

🎉 🚀

@hmaarrfk
Copy link
Member Author

Thanks everybody

@jni
Copy link
Member

jni commented Oct 29, 2020

Thank you, @hmaarrfk! You always do the hard work and then it sits in committee for months! But we deeply appreciate it! ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants