DIRT migration #301

Open · wpbonelli opened this issue Apr 13, 2022 · 6 comments

Labels: in progress (Someone is working on this issue; please also assign yourself), priority (Should be resolved first, if possible)
wpbonelli commented Apr 13, 2022

Each DIRT user may have any number of image sets. We need to prompt them to transfer their image sets to correspondingly named folders in the data store, so that they can then use plantit to run DIRT. We can prompt the user via:

  • web UI (option in the navbar dropdown, similar to the ‘Log in to GitHub’ button)
  • mass email to DIRT user group

We should also have a dedicated page for this in the documentation (walkthrough + screenshots).

When the user begins, we first detect whether they have any DIRT image sets. If so, for each image set we:

  • transfer its files from tucco’s attached NFS to a smaller temporary staging area (also an NFS) on portnoy
  • transfer the staged files to their own folder in the user’s home directory in the data store, using the DIRT image set name as the folder name and preserving file names
  • decorate the resulting dataset with a metadata tag indicating DIRT origin, along with any attached metadata

We should show some kind of progress monitor in the UI, then send an email notification to the user when the migration completes.
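For concreteness, a minimal sketch of the per-image-set flow, assuming SFTP access to the staging host via Paramiko. The host name, service account, staging path, and the upload_to_data_store / tag_dataset / notify_user helpers are hypothetical placeholders, not the actual implementation:

import os
import paramiko

STAGING_DIR = "/scratch/dirt-staging"  # hypothetical staging path on portnoy

def migrate_image_set(username: str, set_name: str, remote_paths: list[str]):
    """Stage one DIRT image set from the tucco NFS, then push it to the data store."""
    dest = f"/iplant/home/{username}/dirt_migration/{set_name}"
    with paramiko.SSHClient() as ssh:
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect("portnoy.example.org", username="svc-migration")  # hypothetical host/account
        with ssh.open_sftp() as sftp:
            for remote in remote_paths:
                local = os.path.join(STAGING_DIR, set_name, os.path.basename(remote))
                os.makedirs(os.path.dirname(local), exist_ok=True)
                sftp.get(remote, local)            # stage on the temporary NFS
                upload_to_data_store(local, dest)  # hypothetical PyCyAPI/iCommands wrapper
    tag_dataset(dest, {"origin": "DIRT"})  # hypothetical metadata-tagging helper
    notify_user(username, set_name)        # hypothetical email notification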

Data transfer via:

  • PyCyAPI to Terrain? (now supports metadata tagging)
  • iCommands container?

Important: make a final backup of all the DIRT datasets before the migration period ends.

wpbonelli added the priority and in progress labels Apr 13, 2022
wpbonelli self-assigned this Apr 13, 2022
wpbonelli added commits that referenced this issue (Apr 19 – May 15, 2022)
wpbonelli commented May 16, 2022

A minimal version of this is working now. Triggered from the top-right dropdown in the UI.

Remaining tasks:

  • add a corresponding page in the docs
  • send out an email to the DIRT user list

It may also be worthwhile to let users select the folder to transfer their data into, or create a new one, rather than always creating and using the hard-coded path /iplant/home/{username}/dirt_migration as we currently do. It seems unlikely anybody will already have a folder with that name, so we are probably safe from collisions, but still (a collision-safe sketch follows).
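A collision-safe variant is cheap: probe for an existing folder and suffix the name if needed. A minimal sketch, assuming a hypothetical client object with exists and mkdir operations (the real client would wrap PyCyAPI or iCommands):

def unique_migration_folder(client, username: str, base: str = "dirt_migration") -> str:
    """Return a collision-free folder path under the user's home collection."""
    home = f"/iplant/home/{username}"
    candidate = f"{home}/{base}"
    suffix = 1
    # hypothetical client.exists(): probe until we find an unused name
    while client.exists(candidate):
        candidate = f"{home}/{base}_{suffix}"
        suffix += 1
    client.mkdir(candidate)  # hypothetical client.mkdir()
    return candidate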

wpbonelli commented

In as of 92d06ed

wpbonelli commented

Reopening because we should preserve collection names and collection & image metadata. These can be pulled from the DIRT database given an image path. Because files are stored by date on the tucco NFS, rather than by collection membership, we need to do a lookup for each file separately. We may also need to look up usernames, since some users' data seems to be associated with their full name rather than their CyVerse username.

SQL queries

The Drupal CMS produces a fairly unwieldy database schema in which every object is a node, i.e. an entity, and each node's associated information and metadata are scattered across various tables linked by foreign keys. We will need a number of queries to extract the relevant information:

  • Get image fid given image path:

select fid from file_managed where uri like 'public://{path}%';

  • Get image entity_id given field_root_image_fid

select entity_id from field_data_field_root_image where field_root_image_fid = {field_root_image_fid};

  • Get various image metadata given entity_id

select * from field_data_field_root_image_metadata where entity_id = {entity_id};
select * from field_data_field_root_image_resolution where entity_id = {entity_id};
select * from field_data_field_root_img_age where entity_id = {entity_id};
select * from field_data_field_root_img_dry_biomass where entity_id = {entity_id};
select * from field_data_field_root_img_family where entity_id = {entity_id};
select * from field_data_field_root_img_fresh_biomass where entity_id = {entity_id};
select * from field_data_field_root_img_genus where entity_id = {entity_id};
select * from field_data_field_root_img_spad where entity_id = {entity_id};
select * from field_data_field_root_img_species where entity_id = {entity_id};
  • Get collection entity_id given image field_marked_coll_root_img_ref_target_id (entity_id)

select * from field_data_field_marked_coll_root_img_ref where field_marked_coll_root_img_ref_target_id = {entity_id};

  • Get image or collection title given entity_id (nid, node ID)

select * from node where nid = {entity_id};

  • Get collection metadata, location, & various other info given collection entity_id

select * from field_data_field_collection_metadata where entity_id = {entity_id};
select * from field_data_field_collection_location where entity_id = {entity_id};
select * from field_data_field_collection_plantation where entity_id = {entity_id};
select * from field_data_field_collection_harvest where entity_id = {entity_id};
select * from field_data_field_collection_soil_group where entity_id = {entity_id};
select * from field_data_field_collection_soil_moisture where entity_id = {entity_id};
select * from field_data_field_collection_soil_nitrogen where entity_id = {entity_id};
select * from field_data_field_collection_soil_phosphorus where entity_id = {entity_id};
select * from field_data_field_collection_soil_potassium where entity_id = {entity_id};
select * from field_data_field_collection_pesticides where entity_id = {entity_id};
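Chained together, the lookups go path → fid → image entity_id → collection entity_id → title/metadata. A rough sketch of the chain, assuming a DB-API connection (e.g. pymysql) to the Drupal database and that every lookup finds a row (real code needs None checks); the function and variable names are illustrative:

def lookup_collection(conn, path: str):
    """Resolve the owning collection's node id and title for one image file path."""
    with conn.cursor() as cur:
        # image path -> managed file id
        cur.execute("select fid from file_managed where uri like %s", (f"public://{path}%",))
        (fid,) = cur.fetchone()
        # fid -> image node (entity) id
        cur.execute(
            "select entity_id from field_data_field_root_image"
            " where field_root_image_fid = %s",
            (fid,),
        )
        (image_id,) = cur.fetchone()
        # image entity id -> owning collection entity id
        cur.execute(
            "select entity_id from field_data_field_marked_coll_root_img_ref"
            " where field_marked_coll_root_img_ref_target_id = %s",
            (image_id,),
        )
        (collection_id,) = cur.fetchone()
        # entity id -> human-readable collection title
        cur.execute("select title from node where nid = %s", (collection_id,))
        (title,) = cur.fetchone()
    return image_id, collection_id, title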

wpbonelli added commits that referenced this issue (Jun 5 – Jun 6, 2022)
wpbonelli added the blocked (Depends on something else) label Jun 23, 2022
wpbonelli commented

Depends on #312

wpbonelli commented Jun 30, 2022

Occasionally the Celery process running the migration gets killed for excess memory usage, e.g.: Process 'ForkPoolWorker-2' pid:17 exited with 'signal 9 (SIGKILL)'

Might need to give the Celery container more memory.

Update: this could be a Paramiko memory leak where the client and transport don't clean up after themselves properly. Currently we keep a single client open for the entire migration and reuse it for every SFTP download. We might be able to resolve this by opening and closing a new client for each file transferred, at the risk of slowing things down a bit due to connection overhead (sketch below).
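Something like the following: scope the client to a single file transfer with context managers, so Paramiko's client and transport are torn down deterministically even if a download raises. Host, account, and paths are placeholders:

import paramiko

def fetch_one(host: str, user: str, remote: str, local: str):
    """Open a fresh SSH/SFTP session per file so resources are always released."""
    with paramiko.SSHClient() as ssh:
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect(host, username=user)
        with ssh.open_sftp() as sftp:
            sftp.get(remote, local)
    # client and transport are closed here, even on error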

wpbonelli commented

Refactor in progress to use Celery's eventlet worker pool for non-blocking I/O. This should dramatically speed up data transfer, since we can perform many file downloads/uploads concurrently instead of serially (sketch below).
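Roughly: run the worker with the eventlet pool (which monkey-patches the stdlib so socket I/O yields cooperatively) and fan out per-file transfers on green threads. A sketch under those assumptions, reusing the hypothetical fetch_one helper from the previous comment; the app name, broker URL, and pool sizes are illustrative:

# start the worker with the eventlet pool, e.g.:
#   celery -A plantit worker -P eventlet -c 100
import eventlet
from celery import Celery

app = Celery("plantit", broker="redis://localhost:6379/0")  # illustrative broker URL

@app.task
def migrate_files(username: str, transfers: list[tuple[str, str]]):
    """Run each (remote, local) transfer on its own green thread."""
    pool = eventlet.GreenPool(size=50)
    for remote, local in transfers:
        # fetch_one: hypothetical per-file SFTP helper sketched above
        pool.spawn_n(fetch_one, "portnoy.example.org", "svc-migration", remote, local)
    pool.waitall()  # block until every transfer completes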
