Add to provenance log #968

Merged
merged 17 commits into from
Feb 28, 2019

Conversation

shanaxel42
Collaborator

Added fields for:
- core dependencies
- git hash
- OS information

Also refactored the LogEncoder to be in the new utils.logging file.

@shanaxel42 shanaxel42 requested a review from ttung January 23, 2019 16:56
codecov-io commented Jan 23, 2019

Codecov Report

Merging #968 into master will increase coverage by 0.55%.
The diff coverage is 97.36%.

@@            Coverage Diff             @@
##           master     #968      +/-   ##
==========================================
+ Coverage   88.58%   89.14%   +0.55%     
==========================================
  Files         164      164              
  Lines        5977     6570     +593     
==========================================
+ Hits         5295     5857     +562     
- Misses        682      713      +31
| Impacted Files | Coverage Δ | |
|---|---|---|
| starfish/types/__init__.py | 100% <ø> (ø) | ⬆️ |
| starfish/types/_constants.py | 97.61% <100%> (+0.05%) | ⬆️ |
| starfish/__init__.py | 93.33% <100%> (+0.47%) | ⬆️ |
| starfish/pipeline/algorithmbase.py | 100% <100%> (ø) | ⬆️ |
| starfish/imagestack/imagestack.py | 86.2% <100%> (+2.23%) | ⬆️ |
| starfish/util/logging.py | 96.66% <96.66%> (ø) | |
| starfish/starfish.py | 95.77% <0%> (-4.23%) | ⬇️ |
| ...h/spots/_detector/trackpy_local_max_peak_finder.py | 87.65% <0%> (-3.88%) | ⬇️ |
| sptx_format/validate_sptx.py | 83.51% <0%> (-1.73%) | ⬇️ |
| starfish/image/_filter/gaussian_low_pass.py | 95.65% <0%> (-1.02%) | ⬇️ |

... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 800bb18...e4ffa7e.


@lru_cache(maxsize=1)
def get_git_commit_hash():
    return subprocess.check_output(["git", "describe", "--always"]).strip()
Member

This will fail if either or both of the following hold true: starfish was installed via pip, or the user is not in the starfish directory.

Collaborator Author

That is an excellent point

def get_core_dependency_info():
    dependency_info = dict()
    for dependency in CORE_DEPENDENCIES:
        ps = Popen(('pip', 'show', dependency), stdout=PIPE)
Collaborator

it is highly likely that pkg_resources is a more portable way of determining this.
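
A minimal sketch of reading installed versions from package metadata instead of shelling out to `pip show`. Note the substitution: it uses the standard library's `importlib.metadata` (Python 3.8+) rather than `pkg_resources`, since both read the same installed-distribution metadata. The function name `get_dependency_versions` and the `"not installed"` fallback are assumptions for illustration, not what the PR ended up with.

```python
from importlib import metadata
from typing import Dict, Iterable

# Same package names as the CORE_DEPENDENCIES constant discussed in this PR.
CORE_DEPENDENCIES = (
    "numpy", "scikit-image", "pandas", "scikit-learn", "scipy", "xarray", "sympy",
)


def get_dependency_versions(names: Iterable[str] = CORE_DEPENDENCIES) -> Dict[str, str]:
    """Map each distribution name to its installed version string."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Hypothetical fallback; the thread does not say what to record here.
            versions[name] = "not installed"
    return versions
```

No subprocess is spawned, so the result does not depend on which `pip` executable happens to be first on `PATH`.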

@lru_cache(maxsize=1)
def get_git_commit_hash():
    # First check for git repo
    if call(["git", "branch"], stderr=STDOUT, stdout=open(os.devnull, 'w')) != 0:
Collaborator

So this breaks down if someone works in a different git repo that pip-installs starfish: we would then pick up the hash of the repo that depends on starfish, not the hash of starfish itself.

I think what we want is to detect that starfish was installed via pip install -e . (and its conda equivalent), and only in that case do we log the commit hash.

Collaborator Author

Wait, you mean we check for the git hash only if it wasn't installed with pip or conda, right?

Collaborator Author

Ah wait, I think I understand what you mean. What do you think is the best way to check for that? Checking whether the starfish package location doesn't include site-packages?

Collaborator

@ttung ttung Jan 25, 2019

So here's my guess, with basically zero time put into proving its feasibility:

  1. get the path of some module in starfish. resolve it to its .py file (https://stackoverflow.com/questions/7162366/get-location-of-the-py-source-file might give you an idea of how to do this, but I would test it out) as opposed to the .pyc / .pyo file.
  2. run git status --porcelain -- 123 $file. this output will be different, depending on whether the file is tracked by a git repo or not. You may need to cd into the directory first though.
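
The two steps above could be sketched roughly as follows. This is an untested-in-starfish illustration: `source_path`, `classify_porcelain`, and `git_status_of` are hypothetical names, and the category strings are made up. It passes `cwd=` to the subprocess rather than changing directory, and it treats empty output as a valid answer, because `git status --porcelain` prints nothing for a path that has nothing to report (for example a tracked file with no changes).

```python
import inspect
import os
import subprocess
from typing import Optional


def source_path(module) -> Optional[str]:
    """Step 1: resolve a module to its .py source file rather than a .pyc/.pyo."""
    path = inspect.getsourcefile(module)
    return os.path.abspath(path) if path else None


def classify_porcelain(returncode: int, stdout: str) -> str:
    """Interpret the result of `git status --porcelain -- <file>`."""
    if returncode != 0:
        return "not in a git repository"  # git exits non-zero outside a work tree
    if stdout.startswith("??"):
        return "untracked"
    if stdout.strip() == "":
        # Nothing to report: e.g. a tracked file without changes, or an ignored file.
        return "clean or ignored"
    return "changed"  # staged or modified


def git_status_of(path: str) -> str:
    """Step 2: run git from the file's own directory so the caller's cwd is irrelevant."""
    try:
        proc = subprocess.run(
            ["git", "status", "--porcelain", "--", os.path.basename(path)],
            cwd=os.path.dirname(path),
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
    except FileNotFoundError:  # the git executable is not installed
        return "git unavailable"
    return classify_porcelain(proc.returncode, proc.stdout)
```

Because empty output is ambiguous between "tracked and clean" and "ignored", a check that must distinguish the two would need something stricter, such as `git ls-files --error-unmatch`.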

Collaborator

Also, I would investigate pkg_resources to see if you can determine how a package was installed.

Collaborator Author

Having some issues with git status --porcelain, but here is an alternate, simpler idea: what if we just check whether starfish.py is being tracked by git with git ls-files --error-unmatch starfish.py, and if so, do a git describe?

Collaborator

> Having some issues with git status --porcelain, but here is an alternate, simpler idea: what if we just check whether starfish.py is being tracked by git with git ls-files --error-unmatch starfish.py, and if so, do a git describe?

What issues are you having with git status --porcelain?

Collaborator Author

The --porcelain flag doesn't produce any output. Plain git status does.


@lru_cache(maxsize=1)
def get_os_info():
    return {"Platform": platform.system(), "Version:": platform.version()}
Collaborator

python_version is not os_info, but also worth logging.
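
One way to act on this, sketched under assumptions: keep the OS fields in `get_os_info` and report the interpreter version through a separate helper. `get_python_version` is a hypothetical name; the dictionary keys are copied from the code under review, including the trailing colon in `"Version:"`.

```python
import platform
from functools import lru_cache
from typing import Mapping


@lru_cache(maxsize=1)
def get_os_info() -> Mapping[str, str]:
    """Operating-system details only (keys copied from the code under review)."""
    return {"Platform": platform.system(), "Version:": platform.version()}


@lru_cache(maxsize=1)
def get_python_version() -> str:
    """Interpreter version, e.g. '3.6.5'; logged separately from the OS details."""
    return platform.python_version()
```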

@shanaxel42 shanaxel42 requested a review from ttung January 26, 2019 00:52
Collaborator

@ttung ttung left a comment

definitely on the right track! I'm curious what issues you have with git status --porcelain. it's supposed to be a very stable API.



@lru_cache(maxsize=1)
def get_core_dependency_info():
Collaborator

Suggested change
- def get_core_dependency_info():
+ def get_core_dependency_info() -> Mapping[str, str]:



@lru_cache(maxsize=1)
def get_git_commit_hash():
Collaborator

Suggested change
- def get_git_commit_hash():
+ def get_git_commit_hash() -> str:

    # First check if in starfish repo
    try:
        check_output(["git", "ls-files", "--error-unmatch", starfish.__file__])
        return check_output(["git", "describe", "--always"]).strip()
Collaborator

This should not be in the try-block. A failure here would seem to indicate a bug, rather than that starfish is not under git tracking.

Collaborator Author

Are you talking about the git describe or the git ls-files --error-unmatch? Because the second just lists the files under tracking and errors if starfish.__file__ is not one of them. Isn't that the behavior we want?

Collaborator

I am pretty sure that all you need to do is to execute git ls-files os.path.basename(starfish.__file__), after you chdir to os.path.dirname(starfish.__file__).

git ls-files of a file tracked by git, but not while you are in the directory fails:

[tt]:~> git ls-files microscopy/starfish-git/starfish/__init__.py
fatal: not a git repository (or any of the parent directories): .git

but chdir and ls-files and it works:

[tt]:~> cd microscopy/starfish-git/starfish/
[tt]:~/microscopy/starfish-git/starfish:tonytung-pr-684> git ls-files __init__.py
__init__.py
[tt]:~/microscopy/starfish-git/starfish:tonytung-pr-684> 
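
A sketch of the approach described in this comment, with one deliberate substitution: instead of calling `os.chdir`, it passes `cwd=` to the subprocess, which has the same effect for git without touching the caller's working directory. The function name `git_describe_if_tracked` and the `None` return value are assumptions for illustration. `git describe` sits outside the try-block, following the earlier review remark that a failure there would point to a bug.

```python
import os
import subprocess
from typing import Optional


def git_describe_if_tracked(path: str) -> Optional[str]:
    """Return `git describe --always` for the repo tracking `path`, else None."""
    directory, filename = os.path.split(os.path.abspath(path))
    try:
        # Equivalent to `cd directory && git ls-files --error-unmatch filename`,
        # but the working directory of the calling process is left alone.
        subprocess.check_output(
            ["git", "ls-files", "--error-unmatch", filename],
            cwd=directory,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Untracked file, no repository, missing directory, or git not installed.
        return None
    # Deliberately outside the try-block: an error here should surface.
    output = subprocess.check_output(["git", "describe", "--always"], cwd=directory)
    return output.decode().strip()
```

Usage would look like `git_describe_if_tracked(starfish.__file__)`, which yields `None` for a pip-installed copy under site-packages.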



@lru_cache(maxsize=1)
def get_os_info():
Collaborator

Suggested change
- def get_os_info():
+ def get_os_info() -> Mapping[str, str]:



@lru_cache(maxsize=1)
def get_git_commit_hash():
Collaborator

this method, as written, is subject to strange working directory effects. i suspect you should run the commands in the parent path of starfish.__file__.

Collaborator Author

starfish.__file__ already gives the file's full directory path, though, so does it matter?

@shanaxel42 shanaxel42 requested a review from ttung January 29, 2019 19:56
@@ -34,6 +34,10 @@ class Coordinates(AugmentedEnum):
"""
This is name of the provenance log attribute stored on the IntensityTable
"""
CORE_DEPENDENCIES = ['numpy', 'scikit-image', 'pandas', 'scikit-learn', 'scipy', 'xarray', 'sympy']
Collaborator

I would make this a tuple or a set.

Collaborator

ttung commented Feb 9, 2019

I also strongly encourage you to add some tests.

Member

@joshmoore joshmoore left a comment

Thinking about the testing strategy here, in either docker or just via virtualenvs there should likely be three-ish different install strategies, each of which calls get_git_commit_hash:

  • python -m starfish ...
  • pip install -e . && starfish ..
  • python setup.py sdist && pip install dist/build/starfish... && starfish ...

@lru_cache(maxsize=1)
def get_git_commit_hash() -> str:
    # First check if in starfish repo
    os.chdir(os.path.dirname(starfish.__file__))
Member

You'll likely want a finally along with this chdir that returns the user to the original getcwd(); otherwise later file lookups will fail.

Collaborator

+1
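
The `finally` suggestion in this thread could look like the context manager below. It is a generic sketch, not code from the PR; `working_directory` is a hypothetical helper name.

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path: str):
    """Temporarily chdir into `path`, restoring the previous cwd afterwards."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs on normal exit and when the body raises.
        os.chdir(previous)
```

With this in place, the git calls would go inside `with working_directory(os.path.dirname(starfish.__file__)):`, and relative paths used by the caller keep working after the block exits.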

@shanaxel42 shanaxel42 requested a review from ttung February 20, 2019 17:53
Collaborator

@ttung ttung left a comment

> Thinking about the testing strategy here, in either docker or just via virtualenvs there should likely be three-ish different install strategies, each of which calls get_git_commit_hash:
>
>   • python -m starfish ...
>   • pip install -e . && starfish ..
>   • python setup.py sdist && pip install dist/build/starfish... && starfish ...

I would expand the matrix a bit further to state that logging has to work regardless of cwd.

Would recommend shoving this into new travis stages. Can run on a lower frequency (only on commits to master? or if @joshmoore builds a release process, we can have tests that only run when we attempt to cut a release?) if it takes too much time.

@joshmoore joshmoore mentioned this pull request Feb 25, 2019
10 tasks
Collaborator Author

shanaxel42 commented Feb 26, 2019

> > Thinking about the testing strategy here, in either docker or just via virtualenvs there should likely be three-ish different install strategies, each of which calls get_git_commit_hash:
> >
> >   • python -m starfish ...
> >   • pip install -e . && starfish ..
> >   • python setup.py sdist && pip install dist/build/starfish... && starfish ...
>
> I would expand the matrix a bit further to state that logging has to work regardless of cwd.
>
> Would recommend shoving this into new travis stages. Can run on a lower frequency (only on commits to master? or if @joshmoore builds a release process, we can have tests that only run when we attempt to cut a release?) if it takes too much time.

I don't understand the third install example. Also, this seems like a lot just to test this one command: creating a virtualenv or docker container and installing starfish three different ways. One of them, getting travis to check out from git while it is already hooked up to run from a branch, seems messy. I wanna push back a little and say "is anyone actually asking for this git hash info to be included in the log"? For most users (that installed through pip) it's going to result in a line that says "Starfish project not under git tracking" for every command, which might confuse them.

@joshmoore
Member

> I wanna push back a little and say "is anyone actually asking for this git hash info to be included in the log"?

I think this matches my concern: basically getting (and keeping) this right is going to be high cost for low value.

> For most users (that installed through pip) it's going to result in a line that says "Starfish project not under git tracking" for every command, which might confuse them.

*nods* The reverse strategy would be to encode the git hash during the build and otherwise report UNKNOWN.

Collaborator

ttung commented Feb 26, 2019

> I wanna push back a little and say "is anyone actually asking for this git hash info to be included in the log"? For most users (that installed through pip) it's going to result in a line that says "Starfish project not under git tracking" for every command, which might confuse them.

Yes, I think @berl asked for it...? My guess is that you are right: people should not be processing data in a permanent way with unreleased versions, and if that happens, we should just assume that all bets are off. Recording the hash is pretty meaningless when people can be mucking around with the source files.

Proposal: record that it's being run from source and not a released version. Probably the best way to do that is to have the release build set a magic variable.

Collaborator

berl commented Feb 26, 2019

I mentioned the git hash as a possibility for more fine-grained tracking of the codebase to augment the version number if applicable. @ttung's proposal sounds good. Of course no one will need to ever muck about with source files in starfish anyway!

@shanaxel42 shanaxel42 requested a review from ttung February 27, 2019 01:38
Collaborator

@ttung ttung left a comment

out of curiosity, can you dump what the log looks like now? might be good to have the community (i.e., #starfish-dev) look it over to see if it covers all the bases.

@@ -22,6 +22,9 @@
# NOTE: if we move to python 3.7, we can produce this value at call time via __getattr__
__version__ = pkg_resources.require("starfish")[0].version

# Variable to be set by release process
Collaborator

I would make it super explicit, e.g., is_released_version or something like that.

also, cc: #761
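
A sketch of the "magic variable" idea from this thread, under stated assumptions: the variable name `is_release_tag`, the helper names, and the convention that a release build overwrites the variable with its tag are all hypothetical. Only the fallback string is taken from the log output posted in this PR.

```python
from typing import Optional

# Stays None in the source tree; a release build would replace it with the tag name.
is_release_tag = None  # type: Optional[str]


def describe_release(release_tag: Optional[str]) -> str:
    """Turn the release variable into the string stored in the provenance log."""
    if release_tag:
        return release_tag
    return "Running starfish from source"


def get_release_tag() -> str:
    """Read the module-level variable at call time."""
    return describe_release(is_release_tag)
```

Keeping the decision in a pure function makes both branches testable without performing an actual release build.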

@shanaxel42
Collaborator Author

> out of curiosity, can you dump what the log looks like now? might be good to have the community (i.e., #starfish-dev) look it over to see if it covers all the bases.

Example from ISS pipeline:

```
[
{'method': 'WhiteTophat',
 'arguments': {'masking_radius': 15, 'is_volume': False},
 'os': {'Platform': 'Darwin',
        'Version:': 'Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64',
        'Python Version': '3.6.5'},
 'dependencies': {'scikit-learn': '0.20.2', 'pandas': '0.24.1', 'scikit-image': '0.14.2', 'scipy': '1.2.1', 'numpy': '1.16.1', 'sympy': '1.3', 'xarray': '0.11.3'},
 'release tag': 'Running starfish from source',
 'starfish version': '0.0.33'},
{'method': 'FourierShiftRegistration',
 'arguments': {'upsampling': 1000,
               'reference_stack': '"<starfish.ImageStack (r: 1, c: 1, z: 1, y: 140, x: 200)>"'},
 'os': {'Platform': 'Darwin',
        'Version:': 'Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64',
        'Python Version': '3.6.5'},
 'dependencies': {'scikit-learn': '0.20.2', 'pandas': '0.24.1', 'scikit-image': '0.14.2', 'scipy': '1.2.1', 'numpy': '1.16.1', 'sympy': '1.3', 'xarray': '0.11.3'},
 'release tag': 'Running starfish from source',
 'starfish version': '0.0.33'},
{'method': 'BlobDetector',
 'arguments': {'min_sigma': 1,
               'max_sigma': 10,
               'num_sigma': 30,
               'threshold': 0.01,
               'overlap': 0.5,
               'is_volume': True,
               'measurement_function': '"<function mean at 0x10adbabf8>"',
               'detector_method': '"<function blob_log at 0x12af9a048>"'},
 'os': {'Platform': 'Darwin',
        'Version:': 'Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64',
        'Python Version': '3.6.5'},
 'dependencies': {'scikit-learn': '0.20.2', 'pandas': '0.24.1', 'scikit-image': '0.14.2', 'scipy': '1.2.1', 'numpy': '1.16.1', 'sympy': '1.3', 'xarray': '0.11.3'},
 'release tag': 'Running starfish from source',
 'starfish version': '0.0.33'}
]
```
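
For readers wondering how values such as function objects end up as strings in entries like these, here is a hypothetical stand-in for the `LogEncoder` mentioned in the PR description. It is an assumption about the general technique (a `json.JSONEncoder` subclass that falls back to `repr()`), not a copy of the starfish implementation; `ReprFallbackEncoder` is an invented name.

```python
import json


class ReprFallbackEncoder(json.JSONEncoder):
    """Serialize anything JSON cannot handle natively as its repr() string."""

    def default(self, o):
        # json only calls default() for objects it does not know how to encode.
        return repr(o)


entry = {
    "method": "BlobDetector",
    "arguments": {"min_sigma": 1, "measurement_function": len},
}
encoded = json.dumps(entry, cls=ReprFallbackEncoder)
```

The numeric argument survives as a number, while the function is stored as a descriptive string, which keeps the log serializable at the cost of not being able to reconstruct the original object.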

@shanaxel42 shanaxel42 merged commit 3904381 into master Feb 28, 2019
@shanaxel42 shanaxel42 deleted the saxelrod-logging branch February 28, 2019 17:58