DIRT migration #301

Open · wpbonelli opened this issue Apr 13, 2022 · 6 comments

Labels: in progress (Someone is working on this issue; please also assign yourself), priority (Should be resolved first, if possible)
wpbonelli commented Apr 13, 2022

Each DIRT user may have any number of image sets. We need to prompt them to transfer their image sets to correspondingly named folders in the data store, so that they can then use plantit to run DIRT. We can prompt the user via:

  • web UI (option in the navbar dropdown, similar to the ‘Log in to GitHub’ button)
  • mass email to DIRT user group

We should also have a dedicated page for this in the documentation (walkthrough + screenshots).

When the user begins, we first detect whether they have any DIRT image sets. If so, for each image set we:

  • transfer its files from tucco’s attached NFS to a smaller temporary staging area (also an NFS) on portnoy
  • transfer the staged files to their own folder in the user’s home directory in the data store, using the DIRT image set name as the folder name and preserving file names
  • decorate the resulting dataset with a metadata tag indicating DIRT origin, along with any attached metadata

We should show some kind of progress monitor in the UI, then send an email notification to the user when the migration completes.
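For concreteness, a minimal sketch of the per-image-set flow, assuming SFTP access to the staging host via Paramiko. The host name, service account, staging path, and the upload_to_data_store / tag_dataset / notify_user helpers are hypothetical placeholders, not the actual implementation:

import os
import paramiko

STAGING_DIR = "/scratch/dirt-staging"  # hypothetical staging path on portnoy

def migrate_image_set(username: str, set_name: str, remote_paths: list[str]):
    """Stage one DIRT image set from the tucco NFS, then push it to the data store."""
    dest = f"/iplant/home/{username}/dirt_migration/{set_name}"
    with paramiko.SSHClient() as ssh:
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect("portnoy.example.org", username="svc-migration")  # hypothetical host/account
        with ssh.open_sftp() as sftp:
            for remote in remote_paths:
                local = os.path.join(STAGING_DIR, set_name, os.path.basename(remote))
                os.makedirs(os.path.dirname(local), exist_ok=True)
                sftp.get(remote, local)            # stage on the temporary NFS
                upload_to_data_store(local, dest)  # hypothetical PyCyAPI/iCommands wrapper
    tag_dataset(dest, {"origin": "DIRT"})  # hypothetical metadata-tagging helper
    notify_user(username, set_name)        # hypothetical email notification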

Data transfer via:

  • PyCyAPI to Terrain? (now supports metadata tagging)
  • iCommands container?

Important: make a final backup of all the DIRT datasets before the migration period ends.

wpbonelli added the priority and in progress labels Apr 13, 2022
wpbonelli self-assigned this Apr 13, 2022
wpbonelli added commits that referenced this issue (Apr 19 – May 15, 2022)
wpbonelli commented May 16, 2022

A minimal version of this is working now. Triggered from the top-right dropdown in the UI.

Remaining tasks:

  • add a corresponding page in the docs
  • send out an email to the DIRT user list

It may also be worthwhile to let users select the folder to transfer their data into, or create a new one, rather than always creating and using the hard-coded path /iplant/home/{username}/dirt_migration as we currently do. It seems unlikely anybody will already have a folder with that name, so we are probably safe from collisions, but still (a collision-safe sketch follows).
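A collision-safe variant is cheap: probe for an existing folder and suffix the name if needed. A minimal sketch, assuming a hypothetical client object with exists and mkdir operations (the real client would wrap PyCyAPI or iCommands):

def unique_migration_folder(client, username: str, base: str = "dirt_migration") -> str:
    """Return a collision-free folder path under the user's home collection."""
    home = f"/iplant/home/{username}"
    candidate = f"{home}/{base}"
    suffix = 1
    # hypothetical client.exists(): probe until we find an unused name
    while client.exists(candidate):
        candidate = f"{home}/{base}_{suffix}"
        suffix += 1
    client.mkdir(candidate)  # hypothetical client.mkdir()
    return candidate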

wpbonelli commented

In as of 92d06ed

wpbonelli commented

Reopening because we should preserve collection names and collection & image metadata. These can be pulled from the DIRT database given an image path. Because files are stored by date on the tucco NFS, rather than by collection membership, we need to do a lookup for each file separately. We may also need to look up usernames, since some users' data seems to be associated with their full name rather than their CyVerse username.

SQL queries

The Drupal CMS produces a fairly unwieldy database schema in which every object is a node, i.e. an entity, and each node's associated information and metadata are scattered across various tables linked by foreign keys. We will need a number of queries to extract the relevant information:

  • Get image fid given image path:

select fid from file_managed where uri like 'public://{path}%';

  • Get image entity_id given field_root_image_fid

select entity_id from field_data_field_root_image where field_root_image_fid = {field_root_image_fid};

  • Get various image metadata given entity_id

select * from field_data_field_root_image_metadata where entity_id = {entity_id};
select * from field_data_field_root_image_resolution where entity_id = {entity_id};
select * from field_data_field_root_img_age where entity_id = {entity_id};
select * from field_data_field_root_img_dry_biomass where entity_id = {entity_id};
select * from field_data_field_root_img_family where entity_id = {entity_id};
select * from field_data_field_root_img_fresh_biomass where entity_id = {entity_id};
select * from field_data_field_root_img_genus where entity_id = {entity_id};
select * from field_data_field_root_img_spad where entity_id = {entity_id};
select * from field_data_field_root_img_species where entity_id = {entity_id};
  • Get collection entity_id given image field_marked_coll_root_img_ref_target_id (entity_id)

select * from field_data_field_marked_coll_root_img_ref where field_marked_coll_root_img_ref_target_id = {entity_id};

  • Get image or collection title given entity_id (nid, node ID)

select * from node where nid = {entity_id};

  • Get collection metadata, location, & various other info given collection entity_id

select * from field_data_field_collection_metadata where entity_id = {entity_id};
select * from field_data_field_collection_location where entity_id = {entity_id};
select * from field_data_field_collection_plantation where entity_id = {entity_id};
select * from field_data_field_collection_harvest where entity_id = {entity_id};
select * from field_data_field_collection_soil_group where entity_id = {entity_id};
select * from field_data_field_collection_soil_moisture where entity_id = {entity_id};
select * from field_data_field_collection_soil_nitrogen where entity_id = {entity_id};
select * from field_data_field_collection_soil_phosphorus where entity_id = {entity_id};
select * from field_data_field_collection_soil_potassium where entity_id = {entity_id};
select * from field_data_field_collection_pesticides where entity_id = {entity_id};
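Chained together, the lookups go path → fid → image entity_id → collection entity_id → title/metadata. A rough sketch of the chain, assuming a DB-API connection (e.g. pymysql) to the Drupal database and that every lookup finds a row (real code needs None checks); the function and variable names are illustrative:

def lookup_collection(conn, path: str):
    """Resolve the owning collection's node id and title for one image file path."""
    with conn.cursor() as cur:
        # image path -> managed file id
        cur.execute("select fid from file_managed where uri like %s", (f"public://{path}%",))
        (fid,) = cur.fetchone()
        # fid -> image node (entity) id
        cur.execute(
            "select entity_id from field_data_field_root_image"
            " where field_root_image_fid = %s",
            (fid,),
        )
        (image_id,) = cur.fetchone()
        # image entity id -> owning collection entity id
        cur.execute(
            "select entity_id from field_data_field_marked_coll_root_img_ref"
            " where field_marked_coll_root_img_ref_target_id = %s",
            (image_id,),
        )
        (collection_id,) = cur.fetchone()
        # entity id -> human-readable collection title
        cur.execute("select title from node where nid = %s", (collection_id,))
        (title,) = cur.fetchone()
    return image_id, collection_id, title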

wpbonelli added commits that referenced this issue (Jun 5 – Jun 6, 2022)
wpbonelli added the blocked (Depends on something else) label Jun 23, 2022
wpbonelli commented

Depends on #312

wpbonelli commented Jun 30, 2022

Occasionally the Celery process running the migration gets killed for excess memory usage, e.g.: Process 'ForkPoolWorker-2' pid:17 exited with 'signal 9 (SIGKILL)'

Might need to give the Celery container more memory.

Update: this could be a Paramiko memory leak where the client and transport don't clean up after themselves properly. Currently we keep a single client open for the entire migration and reuse it for every SFTP download. We might be able to resolve this by opening and closing a new client for each file transferred, at the risk of slowing things down a bit due to connection overhead (sketch below).
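Something like the following: scope the client to a single file transfer with context managers, so Paramiko's client and transport are torn down deterministically even if a download raises. Host, account, and paths are placeholders:

import paramiko

def fetch_one(host: str, user: str, remote: str, local: str):
    """Open a fresh SSH/SFTP session per file so resources are always released."""
    with paramiko.SSHClient() as ssh:
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect(host, username=user)
        with ssh.open_sftp() as sftp:
            sftp.get(remote, local)
    # client and transport are closed here, even on error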

wpbonelli commented

Refactor in progress to use Celery's eventlet worker pool for non-blocking I/O. This should dramatically speed up data transfer, since we can perform many file downloads/uploads concurrently instead of serially (sketch below).
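Roughly: run the worker with the eventlet pool (which monkey-patches the stdlib so socket I/O yields cooperatively) and fan out per-file transfers on green threads. A sketch under those assumptions, reusing the hypothetical fetch_one helper from the previous comment; the app name, broker URL, and pool sizes are illustrative:

# start the worker with the eventlet pool, e.g.:
#   celery -A plantit worker -P eventlet -c 100
import eventlet
from celery import Celery

app = Celery("plantit", broker="redis://localhost:6379/0")  # illustrative broker URL

@app.task
def migrate_files(username: str, transfers: list[tuple[str, str]]):
    """Run each (remote, local) transfer on its own green thread."""
    pool = eventlet.GreenPool(size=50)
    for remote, local in transfers:
        # fetch_one: hypothetical per-file SFTP helper sketched above
        pool.spawn_n(fetch_one, "portnoy.example.org", "svc-migration", remote, local)
    pool.waitall()  # block until every transfer completes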
