
Integrate jupyter-archive #252

Closed

wants to merge 11 commits into from

Conversation

@hadim

hadim commented Jun 29, 2020

Two new API endpoints:

  • r"/directories/(.*)": download a directory (after packing it into an archive).
  • r"/extract-directories/(.*)": extract an archive into a directory on the server side.

This implementation is also being discussed at #251

And the original project is located at jupyterlab-contrib/jupyter-archive#2

(I used black to format the files I was working on so some diffs are not related to the PR, sorry for that.)

@SylvainCorlay
Contributor

I am -1 on this. The point of the jupyter_server project was to be easily extensible, and as minimal as possible.

Besides, there are many cases where we specifically don't want such an endpoint, as in a Voilà application, where we don't want to expose the content of directories...

What is the issue with keeping this feature in an extension to the Jupyter server?

@echarles
Member

echarles commented Jul 7, 2020

@SylvainCorlay I understand Voilà's requirement. We can think of other features we could expose to make clients' lives easier but that would not be recommended for specific use cases (Voilà or others). The goal is indeed to see a rich extensions ecosystem proliferate, and at some point we need to decide whether we integrate an extension that would be used by the masses, or ask the masses to install separate Python packages and change their launch commands to enable them.

For the archive case, my impression is that it is generic and needed enough to belong in the jupyter_server repository, so that end users do not have to install a separate package.

To support Voilà's requirements, I was thinking of introducing profiles (I think we also discussed that during a weekly meeting). The default profile would enable the default extensions (including this archive one), but we could also have a minimal profile that would serve stricter cases like Voilà.

That concept of profiles still needs to be defined/implemented; it is just an idea for now.

@hadim
Author

hadim commented Jul 7, 2020

I understand your concern, but given that a file handler already exists, I don't see why a directory handler could not exist.

As for the extension, it's great to have, but it's not shipped by default and so requires a separate install.

@SylvainCorlay
Contributor

As for the extension, it's great to have, but it's not shipped by default and so requires a separate install.

I think that adding new handlers to core makes the server harder to move forward, while the point of the split was to make it lighter weight.

@blink1073
Contributor

I'd consider this to be missing functionality in the notebook server that we are rectifying.

@SylvainCorlay
Contributor

SylvainCorlay commented Jul 7, 2020

I'd consider this to be missing functionality in the notebook server that we are rectifying.

It is probably ok for this extension to be moved to a jupyter org, and to be a dependency of e.g. lab and/or nbclassic, so that it always gets installed! But I don't see the point of including it in the core repository.

profiles

The auto-enabling of server extensions is more security-sensitive than that of front-end extensions: you install a new package in your environment, and you may have a new endpoint on your server that you were not expecting.

I think the way to handle this is for the ExtensionApp system to allow the app used as the main entry point to disable other extensions (which would be used by Voilà, and e.g. not by JupyterLab).

@blink1073
Contributor

blink1073 commented Jul 7, 2020

One of the things the server does at its core is serve files, this is fixing a hole in that capability, not adding a new capability.

@SylvainCorlay
Contributor

I like the "everything is an extension" approach of lab, and I think it is a reasonable thing to do for jupyter_server as well. Making things extensions also makes it easier to have a separation of concerns when building upon it.

@hadim
Author

hadim commented Jul 7, 2020

My only usage of Jupyter is through JupyterLab, so I would personally be fine having it there.

But @blink1073 has a good point: it's more a missing feature than a new one, considering that a file handler already exists. So it should belong in core. Or, to stay consistent, should we move the existing file handler out of core?

You guys are more experienced than me about the jupyter ecosystem so I'll let you decide where you want this to go.

@kevin-bates
Member

I agree that this PR belongs in the core server.

Regarding @SylvainCorlay's concern about agility and weight, I think we should try to distinguish between today's server extensions and core services. IMHO, an extension brings functionality that a) is not necessarily mainstream, and b) is not an obvious improvement to, or extension of, existing (core) functionality. Extensions could also be POCs whose end game is to be promoted into server core once mature (as is the evolution of many lab extensions).

It sounds like Voilà could leverage service composition, where the Contents/Files handlers are not available at runtime, and perhaps we should try to make that easier for applications.

@echarles
Member

echarles commented Jul 8, 2020

+1 to get this feature into core. Content is already exposed anyway; we just provide a shortcut to download in one shot what could have been downloaded file by file (disclaimer: I have not yet checked that the content security model is applied in this PR; I assume it is and that we don't allow access to content that should not be allowed).

About the comparison with JupyterLab: you can enable/disable a core extension, but it's not that easy. You need to find the correct setting (search for uninstalled_core_extensions on https://jupyterlab.readthedocs.io/en/stable/user/extensions.html?highlight=uninstalled_core_extensions#settings) and rebuild JupyterLab. And to disable download, you would have to disable the complete file browser; you cannot disable a finer feature like download on its own.

(screenshot of the JupyterLab settings editor)

The profile notion I have introduced maps to #178, which @kevin-bates has pointed out. We could migrate the existing API endpoints to extensions. Content would then be an extension, and the archive endpoint would obviously be part of that content extension, so you would enable/disable the whole range of content endpoints, similarly to how JupyterLab works now (enable/disable the file browser, including file download).

@SylvainCorlay
Contributor

Please understand that my concern was more architectural (embracing the "everything is an extension" approach) than a concern about Voilà, and is not about the usefulness or the quality of code.

If everyone feels that it belongs to core, no problem with moving forward with this.

@ivanov
Contributor

ivanov commented Jul 8, 2020

I am also -1 to adding new endpoints to the base server. This can and should be done via extensions. We should be conservative about which APIs are guaranteed to be available to consumers.

@kevin-bates
Member

Thanks @ivanov. It sounds like we should discuss this further at tomorrow's (July 9) server meeting. These are fundamental concepts regarding direction that we should be on the same page about.

@MSeal
Contributor

MSeal commented Jul 9, 2020

Side-comment about the functionality: if this is getting added, it might be nice to define enforced limits on archive sizes to/from the server. I'd hate to have the notebook server crash for all users when one user tries to upload a 10GB archive or download a working directory with all their data files. This can technically happen with individual files, but it's much easier to do accidentally when exposing directory interactions.

@hadim
Author

hadim commented Jul 9, 2020

What about a warning on the JupyterLab side, and/or an option that defaults to a certain size but can be increased as needed?

Both above proposals are frontend only.

@MSeal
Contributor

MSeal commented Jul 9, 2020

An option to set the size seems reasonable. A warning is usually too late once you've started the action, because you're now racing toward an OOM event. What I've usually seen for doing this dynamically / supporting warnings is a configurable server default, a 413 response code for oversized requests, and an explicit size limit as an API request parameter. This lets the front end prompt the user about the issue, and lets interfaces override the limit if they really want to. But really, any interface that keeps an accidental bad action from causing a crash is usually good enough in my experience.
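
The scheme described above could be sketched like this (all names and the default limit are hypothetical, not part of this PR):

```python
# Hypothetical server-configurable default limit
DEFAULT_MAX_ARCHIVE_BYTES = 100 * 1024 * 1024

def archive_request_status(content_length, requested_limit=None):
    """Return 200 if the request fits within the limit, else 413
    (Payload Too Large). requested_limit models the explicit
    per-request override passed as an API parameter."""
    limit = DEFAULT_MAX_ARCHIVE_BYTES if requested_limit is None else requested_limit
    return 200 if content_length <= limit else 413
```

The 413 lets the front end prompt the user before anything is retried, instead of letting the server race toward an OOM.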

@hadim
Author

hadim commented Jul 9, 2020

Makes sense to me, as long as this is configurable and not hardcoded on the server side. I will let others comment on it.

@hadim
Author

hadim commented Jul 31, 2020

Have you decided anything about this PR?

@Zsailer
Member

Zsailer commented Jul 31, 2020

Thanks for pinging here, @hadim.

We discussed this a couple of weeks ago (see the notes here). We are planning to accept this change.

Unfortunately, our attention has been directed at getting Jupyter Server into JupyterLab before its 3.0 API freeze scheduled for next week. Once we get to a good place there, we'll circle back and finish reviewing this PR.

Thanks for your patience @hadim! I really appreciate the work you're doing here.

@blink1073
Contributor

@hadim, are you available to add some missing docstrings to a few spots in this PR? We are getting close to releasing 1.0 and this would make a great addition.

@hadim
Author

hadim commented Aug 31, 2020

Sure. Please comment directly in the code and I will try to address your reviews this week. I will also add docstrings where needed.

(Resolved review threads on jupyter_server/files/handlers.py and jupyter_server/services/contents/manager.py)
@hadim
Author

hadim commented Sep 1, 2020

@blink1073 let me know whether this is ok for you.

By the way, tests fail on my machine (Python 3.8, Ubuntu 20.04) while they pass on CI. Errors:

FAILED tests/services/contents/test_api.py::test_delete[-inroot] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[Directory with spaces in-inspace] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[unicod\xe9-innonascii] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[foo-a] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[foo-b] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[foo-name with spaces] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[foo-unicod\xe9] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[foo/bar-baz] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[ordering-A] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[ordering-b] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[ordering-C] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete[\xe5 b-\xe7 d] - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete_dirs - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error
FAILED tests/services/contents/test_api.py::test_delete_non_empty_dir - tornado.httpclient.HTTPClientError: HTTP 500: Internal Server Error

@blink1073
Contributor

I guess it's because my / and my /home are two different partitions.

Ah, I see.

Can you guide me on this? I can't find the equivalent for files in api.yaml. (or if you want to do it I am good too :-D )

Good point, that's because they aren't under the /api namespace. I opened #298 to track it

@blink1073
Contributor

Kicking CI to see if the timeout was spurious

@blink1073 blink1073 closed this Sep 2, 2020
@blink1073 blink1073 reopened this Sep 2, 2020
@hadim
Author

hadim commented Sep 2, 2020

@blink1073 anything else I can do here?

@blink1073
Contributor

I think we're good, thanks! I'll bring this up again at the server meeting tomorrow to make sure we want to commit to this API expansion.

@blink1073
Contributor

Notes from today's meeting

  • Can you please make the method for extract-directories a PUT, to align with upload?
  • Can you please check for hasattr(contents_manager, "directory_handler_class") and add handlers that give back a 405 Method Not Allowed for these two endpoints? This covers the case of using a ContentsManager that uses the class in notebook.
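
A sketch of the second point, assuming the check happens inside the handler (class and method names are hypothetical; a real implementation would subclass the server's handler classes):

```python
class DirectoryArchiveGuard:
    """Answer 405 Method Not Allowed when the active ContentsManager
    does not advertise directory-archive support, e.g. a manager
    inherited from the notebook package."""

    def __init__(self, contents_manager):
        self.contents_manager = contents_manager

    def status_for_request(self):
        # The hasattr probe is the one suggested in the meeting notes
        if not hasattr(self.contents_manager, "directory_handler_class"):
            return 405
        return 200
```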

@hadim
Author

hadim commented Sep 3, 2020

@blink1073 if I change ExtractDirectoryHandler to async def put, I get the following error:

                        self.request.body is not None
                        or self.request.body_producer is not None
                    )
                    if (body_expected and not body_present) or (
                        body_present and not body_expected
                    ):
>                       raise ValueError(
                            "Body must %sbe None for method %s (unless "
                            "allow_nonstandard_methods is true)"
                            % ("not " if body_expected else "", self.request.method)
                        )
E                       ValueError: Body must not be None for method PUT (unless allow_nonstandard_methods is true)

@hadim
Author

hadim commented Sep 3, 2020

I can make the test pass with r = await fetch("extract-directories", relative_archive_path, method="PUT", allow_nonstandard_methods=True). What do you want me to do? Keep PUT with allow_nonstandard_methods, or go back to GET?
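
For context, the tornado client check quoted in the traceback boils down to roughly this (a simplified sketch, not tornado's actual code); another option would be to keep PUT and send an explicit empty body (body=b"") so the check passes without allow_nonstandard_methods:

```python
def body_allowed(method, body_present):
    """PUT/POST/PATCH expect a body; other methods must not carry
    one (unless allow_nonstandard_methods is set). An empty bytes
    body still counts as present, which is why body=b"" satisfies
    the PUT case."""
    body_expected = method in ("POST", "PATCH", "PUT")
    return body_expected == body_present
```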

@ivanov
Contributor

ivanov commented Sep 3, 2020

I am glad y'all are working on this, but I will ask again: is there a reason why the jupyter-archive functionality cannot be an extension?

It really feels like no one wants to justify putting it into the core API as-is, other than having a case of the "gimmes". I understand the excitement of a new feature, but can we please recognize how much pain it will be to change these APIs if this ships to everyone from the get-go, versus letting it bake as an opt-in extension?

There's so much complexity around this that it will become its own can of worms. Here are a few that come to mind:

  • support other archive formats
  • exclusion patterns for files
  • changing the default size limit
  • running out of disk space while creating the archive
  • files modified in the middle of archiving
  • recursive and non-recursive folder downloads
  • what happens to symlinks to files and folders - should they be included as links or as content?
  • are subsequent calls to download folder cached?
  • what happens when there's a race condition and a second request to archive arrives while the first is still being archived?
  • should extractions overwrite files, skip existing ones, ask or atomically fail for the whole archive?
  • how do you pass blocksize and compression time-size tradeoff parameters to the archiver? (--fast, --best, -1...-9)
  • how do you utilize another archiver, like pigz as a parallel gzip
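
To illustrate just one of these decisions, the symlink question maps directly onto the dereference flag of Python's tarfile module (a standard-library sketch, not code from this PR):

```python
import os
import tarfile
import tempfile

def symlink_stored_as_link(dereference):
    """Archive a symlink and report whether it was stored as a link
    (True) or followed and stored as file content (False)."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "data.txt")
        with open(target, "w") as f:
            f.write("hello")
        link = os.path.join(d, "link.txt")
        os.symlink(target, link)
        archive = os.path.join(d, "out.tar")
        # dereference=False keeps the symlink; True follows it
        with tarfile.open(archive, "w", dereference=dereference) as tar:
            tar.add(link, arcname="link.txt")
        with tarfile.open(archive) as tar:
            return tar.getmember("link.txt").issym()
```

Each item in the list above hides a similar policy choice that a core API would have to freeze.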

@hadim
Author

hadim commented Sep 7, 2020

@blink1073 can you clarify the status of this during the next jupyter meeting?

@blink1073
Contributor

Thanks @ivanov for bringing up these points. Yes, @hadim, I agree we should discuss further on Thursday.

@blink1073
Contributor

Hi @hadim, we discussed this in today's meeting. We came to the consensus that we would like to see this continue as a separate server extension, but would eventually like it to become a Jupyter community managed extension. We have other repositories that could become siblings to jupyter_server, and are thinking that it would make sense to have a separate GitHub organization dedicated to jupyter_server and related projects.
This extension would be a great addition to those related projects. There are currently two accepted related enhancement proposals: server and kernel providers, but we would need to create a new enhancement proposal in order to create the new org. We will commit to shepherding that enhancement proposal after JupyterCon.
Thank you for your patience as we find the best path forward for this effort.

@hadim
Author

hadim commented Sep 10, 2020

I'm not sure I agree with all the points raised in this ticket against this integration, but you decided as a community, not me.

(at least make it default in JLab...)

@hadim hadim closed this Sep 10, 2020
Zsailer added a commit to Zsailer/jupyter_server that referenced this pull request Nov 18, 2022
* add robots

* subclass jupyter robots

* working robots

* ignore robots logic for test coverage score

* fix command for running robots in make file

* add dependency on jupyter robot framework

* try a different test image

* setup chromedriver

* try installing chrome

* install chromedriver from source

* install chrome

* try different image

* remove webdriver

* top level docker image in rio

* see if new docker image is used

* build chrome and chromedriver from scratch

* different installing command for chrome

* wrong package name in rio

* just install driver

* install webdriver with new docker image

* remove robots from rio

* update manifest to robots

8 participants