
Mostly polished version of AIA timelags notebook #1

Merged
merged 11 commits on Nov 14, 2018

Conversation

wtbarnes
Member

wtbarnes commented Nov 9, 2018

Regarding pangeo-data/astro.pangeo.io-deploy#15

@Cadair can you take a look at this and let me know if you have any suggestions?

@rabernat @martindurant I'd also be interested to see how any of the data viz could be improved with holoviews or the data import simplified with intake.

@mrocklin
Member

mrocklin commented Nov 9, 2018

Looking briefly through this notebook I'm pretty much amazed. I'd be curious to know what performance was like (I suspect that there are areas where we could improve).

@jhamman
Copy link
Member

jhamman commented Nov 9, 2018

A PR on @wtbarnes's fork adds the binder config to run this on binder.pangeo.io: Binder

@martindurant
Contributor

That notebook is very thorough, looks lovely, and is of course rigorous.

From the Intake point of view, it's not immediately obvious to me what you would want to do here. The input is FITS, and I have published an intake FITS driver (in intake-astro), which produces arrays (with WCS as an attribute) or tables. However, the data structure here is more like xarray's, with multiple variables and reference coordinates - except that the coordinate system is more complex than a regular grid. Furthermore, the code uses sunpy classes and functionality rather than typical (x)array stuff.

In short, you certainly could write a FITS->sunpy-classes driver that contains some of the code in the notebook, and that might be very convenient for several datasets along these lines, making it possible to reference them in a public catalogue. You could also write a FITS->xarray driver that may be more generally useful outside of sun stuff, but then you would also need to keep some ingest code separate. Or you could have a combination of the two: a driver for FITS->(x)array and a second driver that takes those datasets as inputs and creates the more specialised data containers from them.
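As a sketch of what the catalogue end of this could look like (the catalogue file and entry name below are hypothetical, not something that exists yet):

```python
import intake

# Hypothetical catalogue with an `aia_171` entry backed by a FITS driver
# such as the one in intake-astro.
cat = intake.open_catalog('solar_catalog.yml')
arr = cat.aia_171.to_dask()  # lazy, dask-backed array with WCS in its attributes
```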

As an aside, we talked about the possibility of more general coordinate systems in xarray, but I don't know where that discussion got to. Of course, there is always a trade-off between the wish to have shared code and data representation frameworks and the flexibility of more domain-specialised classes.

@rabernat
Member

rabernat commented Nov 9, 2018

Wow, this is awesome! Thanks @wtbarnes! I will definitely have something cool to demo at the heliophysics workshop. I especially appreciate the detailed comments and explanations.

A couple of points on the notebook's readability / usability. Note that these are just intended as nerdy tech-talk, not a criticism of your excellent contribution.

  • There are some conda / pip install commands up front. Instead of doing this at the user level, let's make sure the astro.pangeo.io or binder environment is configured correctly from the beginning.
  • Would it make sense to offload cell 8 ("a data structure for stacking multiple FITS files to create an AIA data cube") into a more general-purpose library? It seems like everyone who wants to work with this data would essentially have to copy / paste this cell into each notebook. This is in stark contrast to the situation we have with our climate datasets, where we can just say ds = intake.cat.ECCOv4r3.to_dask() and obtain a full xarray dataset with many different variables / coordinates. (see this example notebook)
  • Why do you call .rechunk in your average routine? It seems like this would slow things down; 100x100 chunks are very small. Why not just use the full image? It still has to be read as a full image, so as far as I understand, you are not saving any I/O here (which is usually the main bottleneck). See the sketch after this list.
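To make the chunking point concrete, a minimal sketch; the shape, sizes, and random data below are invented, not taken from the notebook:

```python
import dask.array as da

# Stand-in for the AIA image stack: (time, y, x), one chunk per full image,
# mirroring how each FITS file is read in one piece.
cube = da.random.random((100, 1024, 1024), chunks=(1, 1024, 1024))

# The time average works directly on full-image chunks; rechunking to small
# 100x100 spatial tiles first only adds a shuffle on top of the same I/O.
avg = cube.mean(axis=0)
result = avg.compute()
```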

@mrocklin
Member

mrocklin commented Nov 9, 2018

cc also @SimonKrughoff

@wtbarnes
Member Author

wtbarnes commented Nov 9, 2018

@martindurant Thanks for all of the suggestions! Xarray would really be a game changer for dealing with solar data. As I think this notebook communicates, dealing with data one FITS file at a time is VERY limiting. Perhaps there should be some functionality in SunPy (or a SunPy-affiliated package) for mapping many FITS files to an xarray dataset that would still have all of the functionality that the SunPy Map object has. Presumably this could leverage all of the work you've already done on intake-astro. There is the ndcube package, developed by @DanRyanIrish, which aims to solve a similar problem. However, I don't know how easily this could be integrated with the Dask/xarray ecosystem.
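A rough sketch of what such a mapping could look like with dask.delayed; the paths, HDU index, image shape, and dtype here are all assumptions, and this skips the metadata handling that makes the SunPy Map useful:

```python
import glob

import dask
import dask.array as da
import numpy as np
from astropy.io import fits

@dask.delayed
def read_image(path):
    # AIA images are often stored in a compressed HDU at index 1 (assumed here);
    # copy the data out before the file is closed.
    with fits.open(path) as hdul:
        return np.array(hdul[1].data, dtype='float32')

# Hypothetical local files; the notebook reads from GCS instead.
paths = sorted(glob.glob('aia_171_*.fits'))
cube = da.stack([
    da.from_delayed(read_image(p), shape=(4096, 4096), dtype='float32')
    for p in paths
])  # one lazy (time, y, x) cube
```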

@rabernat
Member

rabernat commented Nov 9, 2018

The following comment is admittedly very biased, since I am an xarray core developer. So take it with a grain of salt.

What I have observed over the years is that multidimensional labelled arrays + metadata are ubiquitous in the physical sciences. They have been implemented over and over by different domain-specific packages. Refactoring those packages around xarray would allow their developers to focus more on their domain-specific problems. Plus, xarray comes with full dask integration for free.

A great example of how a more domain specific package leverages xarray under the hood can be found in the satpy docs.

Refactoring is not trivial of course, but it can lead to big payoffs.
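As a toy illustration of the pattern (the names, sizes, and attributes here are invented):

```python
import dask.array as da
import numpy as np
import xarray as xr

# Labelled dimensions and metadata wrapped around a lazy dask array.
data = da.random.random((10, 256, 256), chunks=(1, 256, 256))
cube = xr.DataArray(
    data,
    dims=('time', 'y', 'x'),
    coords={'time': np.arange(10)},
    attrs={'instrument': 'AIA', 'wavelength': '171 Angstrom'},
)
mean_image = cube.mean(dim='time')  # still lazy; dask computes it on demand
```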

@wtbarnes
Member Author

wtbarnes commented Nov 9, 2018

@rabernat I hope this is helpful to your talk. I'm happy to clarify anything that is not clear! I suspect many of the solar people in the audience will have at least heard of this timelag technique as it has become fairly popular in the field in the last few years. Of course, the general approach is useful for other analyses as well!

To address your other points:

  • I agree. From what I can tell, the astro-pangeo cloud deployment has the latest release of SunPy installed. A recent feature is the ability to create Map objects from a Dask array, but this is only available in the dev version currently (it will be available in the 1.0 release in a few months or so). Presumably, the SunPy master branch could just be pip-installed, and this could be added to a conda env file. Where exactly would this environment be defined?
  • Definitely, though I'm not exactly sure where it should live. This relates back to @martindurant's comment too and is part of a larger discussion about how to deal with multidimensional data in solar physics. Ideally, SunPy (or ndcube or some other package) would have a standard way to deal with mapping multiple FITS files to a single array object in a way that allows for easy integration with Dask. I'd be interested to hear what people at the workshop have to say about this as well.
  • I have no idea! I just tried it without rechunking and it is a bit faster. I think my thought was that it should be rechunked along the time axis since I am taking a time average.

@wtbarnes
Member Author

wtbarnes commented Nov 9, 2018

@mrocklin Thanks! Dask makes everything so easy! What specifically do you mean by performance?

Add binder configs so this can be run on binder.pangeo.io
@martindurant
Contributor

Ideally, SunPy (or ndcube or some other package) would have a standard way to deal with mapping multiple FITS files to a single array object in a way that allows for easy integration with Dask.

Or, perhaps even more ideally, sunpy could piggy-back on existing interfaces in xarray. How feasible that is, I don't really know, but I think it's a lofty goal.

@Cadair

Cadair commented Nov 9, 2018

(Very quick response to this)

The main blocker the last time sunpy evaluated xarray was the need for an object to calculate the two spatial coordinates. This is primarily because utilising existing libraries for the coupled map-projection spatial dimensions is practically the only way to go.

It's been a few years since I looked at this, but conversations with people at SciPy suggested that this might be possible given enough work.

@rabernat
Member

So is this PR good to go? If so, someone from @pangeo-data/pangeo-astro should merge.

@SimonKrughoff
Member

@rabernat I'm happy to merge this, but I'm not sure how this gets pulled into the pangeo-astro deploy. Is that automatic?

@rabernat
Member

That depends on how you have your cluster set up. I don't know the answer.

@wtbarnes
Member Author

If @Cadair is happy with this then I am as well.

@rabernat I'm happy to make any needed changes or clarifications before your talk/demo as well.

@Cadair

Cadair commented Nov 12, 2018

This lgtm from a quick read. I don't have the time to go over it in detail.

@rabernat
Member

I'm happy to make any needed changes or clarifications before your talk/demo as well.

Right now the notebook is very long and complex for a demo, mostly because of the package management and the classes you define. Two totally optional things you could do to streamline would be the following:

  • Remove the pip and conda installs from the notebook and instead add these to the environment specification of the cluster and / or binder. (I really don't want to start the demo by pip installing stuff.)
  • Offload all your function and class definitions into a standalone utils.py module that lives right next to the notebook, and then import them as needed. This is purely to reduce the length / complexity.

@wtbarnes
Member Author

@rabernat OK I'll try to make those fixes ASAP. The talk is Wednesday correct?

@rabernat
Member

rabernat commented Nov 12, 2018 via email

@rabernat
Member

p.s. THANK YOU so much for doing this!

@SimonKrughoff
Member

That depends on how you have your cluster set up. I don't know the answer.

OK, so maybe a question for @NicWayand or @dsludwig. Apologies, but I haven't had a chance to wrap my head around how all this hangs together yet.

@NicWayand
Member

If a CircleCI job is set up for astro.pangeo.io (per @dsludwig's guide), then you need to create a new PR to https://github.com/pangeo-data/astro.pangeo.io-deploy (likely against the staging branch), and a new image will automatically be created that astro.pangeo.io will load.

@SimonKrughoff
Member

Yes. That all works. What I didn't know was if these examples get pulled into that deployment.

@NicWayand
Member

Ah ok, sorry I don't know that.

@dsludwig
Member

These examples are not set up to be pulled into that deployment. I can look into doing that, or you could copy this example into the image directory here: https://github.com/pangeo-data/astro.pangeo.io-deploy/tree/staging/deployments/astro.pangeo.io/image. That's where the current examples are loaded into the astro.pangeo.io image.

@SimonKrughoff
Member

OK. Thanks a lot. I guess I'd also need to diff the environments to see whether any new packages need to be added.

@wtbarnes
Member Author

Where is the best place to specify the packages to be installed when astro.pangeo.io spins up? Or is it best to just create a new environment?

@dsludwig
Member

dsludwig commented Nov 13, 2018 via email

@rabernat
Member

In the short term, I would focus on making sure the binder works, rather than astro.pangeo.io.

I will have to fork this in order to get it ready for tomorrow.

@wtbarnes
Member Author

Thanks @dsludwig !

@wtbarnes
Member Author

wtbarnes commented Nov 13, 2018

EDIT: looks like that is already there

@rabernat OK, so I should just be able to add

- git+https://github.com/sunpy/sunpy

under the pip section of pangeo-astro-examples/binder/environment.yml?

@wtbarnes
Member Author

wtbarnes commented Nov 13, 2018

I've now pulled out the AIA cube code from the notebook into a separate file, util.py. However, when I try to read in the metadata from many files using client.map, I get a "No module named 'util'" exception.

import dask.bytes

openfiles = dask.bytes.open_files('gcs://pangeo-data/SDO_AIA_Images/diffrot/171/*.fits')
futures = client.map(get_header, openfiles)

where get_header is a function that I've imported into the notebook from util.py.

I don't know whether this would be a problem when launching with binder as well.

@martindurant
Contributor

The cwd of the dask workers is not necessarily the same as your IPython kernel's; it would be best to set a PYTHONPATH or otherwise "install" the file. You should be able to check with client.run(os.getcwd).
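A minimal sketch of that check, assuming a running dask.distributed cluster (the scheduler address below is hypothetical):

```python
import os

from dask.distributed import Client

client = Client('tcp://scheduler:8786')  # hypothetical scheduler address
print(os.getcwd())            # working directory of the notebook kernel
print(client.run(os.getcwd))  # {worker address: worker working directory}
```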

@wtbarnes
Member Author

Hmm, client.run(os.getcwd) just shows the home directory, but even after moving util.py into home I still get a module-not-found error.

@martindurant
Contributor

Maybe you would need util in both, or you could explicitly set the working directory?
Otherwise, I don't have any other immediate suggestions for you :|

@wtbarnes
Member Author

@rabernat I'm having some difficulties with pulling out the data structures into a separate file and properly resolving the paths on the workers. In the interest of time, I'm just including all of the code inline. This could either be placed at the top or the bottom of the notebook to "hide" it during the demo. Sorry this is not ideal.

@rabernat
Member

No worries. This is great as is.

rabernat merged commit e3f4050 into pangeo-data:master on Nov 14, 2018
@rabernat
Member

I went through this as is and was able to get it working by putting the class / function definitions back into the notebook. What @martindurant was missing is the fact that the dask workers can't see your home directory. There is no shared filesystem here. So the only way to get the module to the workers is to actually install it into their environment using pip / conda. Since there is no package for your util module, this is not possible.
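For illustration only, a minimal and entirely hypothetical setup.py that would make the util module pip-installable into the worker images (no such package exists in this PR):

```python
# setup.py: hypothetical packaging of the notebook's util module, so that
# `pip install .` could put it into the notebook and worker environments.
from setuptools import setup

setup(
    name='aia-timelags-util',  # invented name
    version='0.1',
    py_modules=['util'],
)
```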

@wtbarnes were you planning to make a new commit? I screwed up and merged this before you actually updated. Any chance you could push the corrected, working notebook?

@wtbarnes
Member Author

@rabernat OK. At some point, this functionality will hopefully live in an installable package.

I can push a fix now. Sorry this is coming so late! I'm at another meeting right now and on mountain time!

@martindurant
Contributor

@rabernat , of course! Must have been late in the day...
