Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Bugfix: properly handle timeout when copying project #2655

Conversation

sanderegg
Copy link
Member

@sanderegg sanderegg commented Nov 19, 2021

What do these changes do?

When creating a project FROM a TEMPLATE or AS a TEMPLATE FROM a Study, the webserver does the following:

  1. copies the project skeleton in a new project (with new uuids for project, nodes)
  2. copies the data by calling into the storage API

In case 2. fails or times-out (due to the webserver or more often due to the webbrowser timing out), the project is still listed for the user, although it is not valid and in an undefined state.

This PR aims to fix this issue by creating the new project as hidden by default UNTIL all the data was completely copied. In case of time out or of disconnection, the handler is cancelled and that cancellation is now handled. This will in turn delete the project, thus cleaning up.

As users still need something akin to copying only the skeleton an additional optional query parameter has been added to the creation project endpoint such as:
POST /projects now takes as query parameters the following:

  • as_template: optional to create a template from a study
  • from_template: optional to create a study from a template
  • copy_data: optional (defaults to true) but can be set to false to copy only the study/template skeleton (without outputs)

Bonus:
now lock the study when creating a template

Related issue/s

How to test

Checklist

@sanderegg sanderegg added the a:webserver issue related to the webserver service label Nov 19, 2021
@sanderegg sanderegg added this to the Meerkat milestone Nov 19, 2021
@sanderegg sanderegg self-assigned this Nov 19, 2021
@codecov
Copy link

codecov bot commented Nov 19, 2021

Codecov Report

Merging #2655 (6972448) into master (0b3eb3a) will decrease coverage by 0.5%.
The diff coverage is 97.3%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #2655     +/-   ##
========================================
- Coverage    78.1%   77.6%   -0.6%     
========================================
  Files         635     635             
  Lines       25918   25944     +26     
  Branches     2507    2513      +6     
========================================
- Hits        20259   20145    -114     
- Misses       5000    5140    +140     
  Partials      659     659             
Flag Coverage Δ
integrationtests 65.1% <57.3%> (-1.9%) ⬇️
unittests 73.7% <94.6%> (+<0.1%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...simcore_service_webserver/projects/projects_api.py 74.7% <95.0%> (-9.6%) ⬇️
...re_service_webserver/projects/projects_handlers.py 84.1% <96.9%> (+0.6%) ⬆️
...simcore_service_webserver/projects/project_lock.py 94.7% <100.0%> (+0.2%) ⬆️
.../simcore_service_webserver/projects/projects_db.py 90.2% <100.0%> (-0.2%) ⬇️
...mcore_service_webserver/projects/projects_utils.py 80.5% <100.0%> (+0.8%) ⬆️
...ce_webserver/resource_manager/garbage_collector.py 73.9% <100.0%> (-3.7%) ⬇️
.../server/src/simcore_service_webserver/users_api.py 94.2% <100.0%> (+2.6%) ⬆️
...simcore_service_webserver/computation_subscribe.py 27.3% <0.0%> (-68.5%) ⬇️
...rc/simcore_service_webserver/computation_config.py 68.4% <0.0%> (-26.4%) ⬇️
..._service_webserver/projects/projects_exceptions.py 76.1% <0.0%> (-23.9%) ⬇️
... and 11 more

@sanderegg sanderegg changed the title WIP: 🐛 Bugfix: properly handle timeout when copying project 🐛 Bugfix: properly handle timeout when copying project Nov 22, 2021
@sanderegg sanderegg marked this pull request as ready for review November 22, 2021 10:37
@sanderegg sanderegg force-pushed the maintenance/bugfix/cancellation_of_project_copy branch from 66f964f to c76c7bd Compare November 22, 2021 13:08
Copy link
Member

@pcrespov pcrespov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more to come

db = app[APP_PROJECT_DBAPI]
await director_v2_api.delete_pipeline(app, user_id, UUID(project_uuid))
await db.delete_user_project(user_id, project_uuid)
# requests storage to delete all project's stored data
await delete_data_folders_of_project(app, project_uuid, user_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GitHK please notice this change since it points one of the main source of the problem

  • First, the function ends with _from_db and this call triggers the storage API (which takes very long to repond).
  • Second, the same function was again called via delete_project_data in delete_project ... which means that storage gets two heavy delete calls

@@ -6887,6 +6887,12 @@ paths:
schema:
type: string
description: 'Option to create a template from existing project: as_template={study_uuid}'
- name: copy_data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this option should help implementing ITISFoundation/osparc-issues#560

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a first step

async def update_project_without_enforcing_checks(
self, project_data: Dict, project_uuid: str
async def update_project_without_checking_permissions(
self, project_data: Dict, project_uuid: str, hidden: Optional[bool] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MINOR:

  • for readability on the caller, i would enforce using key in opptionals, i.e.
  • Don't we have a ProjectDict somewhere?
async def update_project_without_checking_permissions(
        self, 
        project_data: ProjectDict, 
        project_uuid: ProjectIDStr, 
        *, 
       hidden: Optional[bool] = None,
    ):
      ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prj_states: ProjectState = await get_project_states_for_user(
user_id, project_uuid, app
)
log.error(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIP: for errors (specially those we need to debug with support) we need good info. Having variable names is more handy and easily achievable using new f-strings in 3.8

            log.error(
            "Project %s already locked in state %s. Please check with support.",
            f"{project_uuid=}",
            f"{prj_states.locked.status=}",
            )

Copy link
Member Author

@sanderegg sanderegg Nov 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm not sure I see the point here. I will get the name of the variable on top?
well I think the message already gives it out right? But I keep that formating for later. I find the f-string on top of the '%s' thing a bit much. looking forward to passing f-strings directly into the log message instead.

Copy link
Member

@pcrespov pcrespov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 great fix. added some comments/thoughts around it that hope are useful. I added @GitHK
and you on some comments to underline some fixes so avoid these issues in future designs
thx! :-)


await clone_data_coro
else:
await clone_data_coro
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QUESTION? why dont' you lock with notification when not as_template? I guess this is creating a normal project so the user in principle does not see in the UI the thumbnail of the dashboard. Nonetheless, logically it should also lock? what would happen in one day the GUI changes? that will lead to an inconsistent state?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the lock is here to prevent people from modifying a project while it is accessed.
in the case of from_template we copy from a template (which is a read-only snapshot per se) to a new study. Per construction the template cannot be modified, therefore there is no need to lock it.
Now that I think of it, we could actually lock the new study until it is complete... maybe something to think about.

else:
await clone_data_coro
# unhide the project if needed since it is now complete
if not new_project_was_hidden_before_data_was_copied:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THOUGHT: perhaps features like these should be better managed with some kind of guards implemented with context managers. Upon enter we can store some fields that can be ensured upon exit ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed.. here I felt a bit bad... the function did already too much before, but I was a bit reluctant of taking the time to rework it completely.

else:
raise web.HTTPCreated(text=json.dumps(project), content_type="application/json")
log.debug("project created successfuly")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIP: debug messages should be content richer and not only plain text. Use the trick i mentioned above with f-strings to display the value of variables in the log e.g.

log.debug("project %s for %s created successfuly with %s", 
   f"{project_id=}", 
   f"{user_id=}", 
   f"{as_template=}",
)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question.. what if the project_id does not exist?

@@ -339,7 +371,7 @@ async def replace_project(request: web.Request):
)

try:
projects_api.validate_project(request.app, new_project)
await projects_api.validate_project(request.app, new_project)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THOUGHT: did you by any chance check the speed improvement here? I guess the biggest will happen when listing projects. I think there is a validation before responding ... BTW did you change it tehre as well? That would definitively make a difference when e.g.e the test users gets too many failing studies (have just removed around 300+ in staging today)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.. not sure we still validate there

@@ -563,7 +568,7 @@ async def remove_all_projects_for_user(app: web.Application, user_id: int) -> No
"The project can be removed as is not shared with write access with other users"
)
try:
await delete_project_from_db(app, project_uuid, user_id)
await delete_project(app, project_uuid, user_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GitHK @sanderegg the new naming makes clear that the entire project is now deleted

@sanderegg sanderegg force-pushed the maintenance/bugfix/cancellation_of_project_copy branch from 9fdfcaa to 4f937d6 Compare November 23, 2021 10:06
@sanderegg sanderegg merged commit 387f1e3 into ITISFoundation:master Nov 23, 2021
@sanderegg sanderegg deleted the maintenance/bugfix/cancellation_of_project_copy branch November 23, 2021 15:14
mrnicegyu11 pushed a commit to mrnicegyu11/osparc-simcore that referenced this pull request Dec 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a:webserver issue related to the webserver service
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants