Integrate jupyter-archive #252
Conversation
I am -1 on this. The point of the split was to keep the server lightweight. Besides, there are many cases where we specifically don't want to have such an endpoint, like in the case of a Voilà application, where we don't want to expose the content of directories... What is the issue with keeping this feature in an extension to the Jupyter server? |
@SylvainCorlay I understand Voilà's requirement. We can think of other features that we could expose to make clients' lives easier but that would not be recommended for specific use cases (Voilà or others). The goal is indeed to see the proliferation of a rich extension ecosystem, and at some point we need to decide whether we integrate an extension that would be used by the masses, or ask the masses to install separate Python packages and change their launch command to enable them. For the archive case, my impression is that it is generic and needed enough to get into the jupyter_server repository, so that end users do not have to install a separate package. To support Voilà's requirements, I was thinking of introducing a notion of profiles. That concept of profiles still needs to be defined/implemented; it is just an idea for now. |
I understand your concern, but as long as a file handler already exists, I don't see why a directory handler could not exist. As for the extension, it's great to have, but it's not shipped by default and so requires a separate install. |
I think that adding new handlers in core makes the server heavier to move forward - while the point of the split was to make it more lightweight. |
I'd consider this to be missing functionality in the notebook server that we are rectifying. |
It is probably ok for this extension to be moved to a jupyter org, and to be a dependency of e.g. lab and/or nbclassic, so that it always gets installed! But I don't see the point of including it in the core repository.
The auto-enabling of server extensions is more security-sensitive than front-end extensions: you install a new package in your environment, and you may have a new endpoint on your server that you were not expecting. I think the way to handle this would be for the ExtensionApp system to allow the app used as the main entry point to disable other extensions (which would be used by Voilà, and e.g. not JupyterLab). |
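The idea of letting an application opt out of auto-enabled server extensions could be sketched with jupyter_server's `jpserver_extensions` trait (a real ServerApp config option; the extension names below are purely illustrative):

```python
# Hypothetical jupyter_server_config.py for an application (e.g. a
# Voilà-style app) that does not want other installed server
# extensions to expose endpoints on its server.
# `c.ServerApp.jpserver_extensions` maps extension package names to
# an enabled/disabled flag; the names used here are examples only.
c.ServerApp.jpserver_extensions = {
    "jupyter_archive": False,        # no directory download/extract endpoints
    "some_other_extension": False,   # any other auto-enabled extension
}
```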
One of the things the server does at its core is serve files, this is fixing a hole in that capability, not adding a new capability. |
I like the "everything is an extension" approach of lab, and I think it is a reasonable thing to do for jupyter_server as well. Making things extensions also makes it easier to have a separation of concerns when building upon it. |
My only usage of Jupyter is through JupyterLab, so I would be fine having it there personally. But @blink1073 has a good point that this is more a missing feature than a new one, considering that a file handler already exists. So it should belong in core. Or, to stay consistent, should we move the existing file handler outside of core? You guys are more experienced than me with the Jupyter ecosystem, so I'll let you decide where you want this to go. |
I agree that this PR belongs in the core server. Regarding @SylvainCorlay's concern about agility and weight, I think we should try to distinguish between today's server extensions and core services. IMHO, an extension brings functionality that a) is not necessarily mainstream, and b) is not an obvious improvement of, or closely related to, existing (core) functionality. Extensions could also be POCs whose end game is to be promoted into the server core once mature (as is the evolution of many lab extensions). It sounds like Voila could leverage service composition where the Contents/Files handlers are not available at runtime, and perhaps we should try to make that easier for applications. |
+1 to get this feature into core. Content is anyway already exposed; we just provide a shortcut to download in one shot what could have been downloaded file by file (disclaimer: I have not yet checked that the content security model is applied in this PR - I assume it is and that we don't allow access to content that should not be allowed). About the comparison with JupyterLab: you can enable/disable a core extension, but it's not that easy... you need to find the correct setting (search for it). |
Please understand that my concern was more architectural (embracing the "everything is an extension" approach) than a concern about Voilà, and is not about the usefulness or the quality of code. If everyone feels that it belongs to core, no problem with moving forward with this. |
I am also -1 to adding new endpoints to the base server. This can and should be done via extensions. We should be conservative about which APIs are guaranteed to be available to consumers. |
Thanks @ivanov. It sounds like we should discuss this further at tomorrow's (July 9) server meeting. These are fundamental concepts regarding direction that we should be on the same page about. |
Side-comment about the functionality -- it might be nice to define restrictions on archive sizes to/from the server that are enforced if this gets added. I'd hate to have the notebook server crash for all users when one user tries to upload a 10GB archive or download a working directory with all their data files. This can technically happen with individual files, but it's a lot easier to do accidentally when exposing directory interactions. |
What about a warning on the JupyterLab side, and/or an option that is set to a certain size by default but can be increased as needed? Both proposals above are frontend-only. |
An option to set the size seems reasonable. A warning is usually too late if you've already started the action, because you're now racing an OOM event. What I've usually seen for doing this dynamically / to support warnings is a configurable server-side default, a 413 response code for oversized requests, and an explicit size limit as an API request parameter. That lets the frontend prompt the user about the issue, and lets interfaces override the limit if they really want to. But really, any interface that keeps the accidental bad action from causing a crash is usually good enough in my experience. |
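The policy described in the comment above could be sketched as a small helper: a configurable server-side default, an optional per-request override, and a 413 status for anything over the effective limit. All names here are hypothetical, not from the PR.

```python
# Sketch of the size-limit policy discussed above (hypothetical names).
# A server-wide default limit applies unless the request passes its own
# explicit limit; oversized payloads get HTTP 413 (Payload Too Large).

DEFAULT_MAX_ARCHIVE_BYTES = 512 * 1024 * 1024  # configurable server default


def effective_limit(requested_limit=None, server_limit=DEFAULT_MAX_ARCHIVE_BYTES):
    """Return the limit to enforce: a request parameter may override the default."""
    return requested_limit if requested_limit is not None else server_limit


def check_size(size_bytes, requested_limit=None):
    """Return an HTTP status code: 200 if acceptable, 413 if oversized."""
    return 200 if size_bytes <= effective_limit(requested_limit) else 413
```

A frontend receiving a 413 could then prompt the user and retry with an explicit, larger limit if the interface allows it.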
Makes sense to me, as long as this is configurable and not hardcoded on the server side. I will let others comment on it. |
Have you decided something about this PR? |
Thanks for pinging here, @hadim. We discussed this a couple of weeks ago (see the notes here). We are planning to accept this change. Unfortunately, our attention has been directed at getting Jupyter Server into JupyterLab before it's 3.0 API freeze scheduled for next week. Once we get to a good place there, we'll circle back and finish reviewing this PR. Thanks for your patience @hadim! I really appreciate the work you're doing here. |
@hadim, are you available to add some missing docstrings to a few spots in this PR? We are getting close to releasing 1.0 and this would make a great addition. |
Sure. Please comment directly in the code and I will try to address your reviews this week. I will also add docstrings where needed. |
@blink1073 let me know whether this is ok for you. By the way, the tests fail on my machine (Python 3.8, Ubuntu 20.04) while they pass on the CI. Errors: FAILED tests/services/contents/test_api.py::test_delete[-inroot] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[Directory with spaces in-inspace] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[unicod\xe9-innonascii] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[foo-a] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[foo-b] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[foo-name with spaces] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[foo-unicod\xe9] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[foo/bar-baz] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[ordering-A] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[ordering-b] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[ordering-C] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[\xe5 b-\xe7 d] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete_dirs - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete_non_empty_dir - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error |
Ah, I see.
Good point, that's because they aren't under the |
Kicking CI to see if the timeout was spurious |
@blink1073 anything else I can do here? |
I think we're good, thanks! I'll bring this up again at the server meeting tomorrow to make sure we want to commit to this API expansion. |
Notes from today's meeting
|
@blink1073 if I set
|
I can make the test pass with |
I am glad y'all are working on this, but I will ask again: is there a reason why the jupyter-archive functionality cannot be an extension? It really feels like no one wants to justify putting it in as part of the core API, as is, other than having a case of the "gimmes". I understand the excitement of a new feature, but can we please recognize how much pain it will be to handle changing these APIs if this ships to everyone from the get-go, versus allowing it to bake as an opt-in extension? There's so much complexity around this that it will become its own can of worms. Here are a few that come to mind:
|
@blink1073 can you clarify the status of this during the next jupyter meeting? |
Hi @hadim, we discussed this in today's meeting. We came to the consensus that we would like to see this continue as a separate server extension, but would eventually like it to become a Jupyter community managed extension. We have other repositories that could become siblings to |
Not sure I agree with all the points raised in this ticket against the integration, but you decided as a community, not me. (At least make it a default in JupyterLab...) |
Two new API endpoints:
- r"/directories/(.*)" : download a directory (after packing it into an archive).
- r"/extract-directories/(.*)" : extract an archive into a directory on the server side.

This implementation is also being discussed at #251
And the original project is located at jupyterlab-contrib/jupyter-archive#2
(I used black to format the files I was working on so some diffs are not related to the PR, sorry for that.)
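The two URL patterns quoted in the PR description are plain Tornado-style regexes with one capture group for the path. A minimal sketch of how they capture the target path (illustrative only, not the PR's actual handler code):

```python
import re

# The two URL patterns from the PR description. In jupyter_server these
# would be registered as Tornado handler routes; here we only show how
# the capture group extracts the directory/archive path.
DIRECTORIES_PATTERN = r"/directories/(.*)"
EXTRACT_PATTERN = r"/extract-directories/(.*)"


def match_directory_path(url):
    """Return the path captured by the download pattern, or None."""
    m = re.fullmatch(DIRECTORIES_PATTERN, url)
    return m.group(1) if m else None


def match_extract_path(url):
    """Return the path captured by the extract pattern, or None."""
    m = re.fullmatch(EXTRACT_PATTERN, url)
    return m.group(1) if m else None
```

In the real server, the captured group would then be resolved through the contents manager so the usual content security checks apply.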