Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting stuck with bunch of batched git-annex processes running #56

Closed
yarikoptic opened this issue Aug 6, 2024 · 6 comments
Closed
Assignees
Labels
bug:stall report Report of a backup2datalad run getting stuck or stalling

Comments

@yarikoptic
Copy link
Member

I think we had similar case some time recently which was "resolved" by killing those few git-annex processes. But looking at the process tree:

root     3826314  0.0  0.0   8580  3456 ?        Ss   Jul05   0:14   /usr/sbin/cron -f
root     2909512  0.0  0.0  10652  2704 ?        S    Jul26   0:00     /usr/sbin/CRON -f
dandi    2909513  0.0  0.0   2576  1536 ?        Ss   Jul26   0:00       /bin/sh -c chronic flock -E 0 -e -n /home/dandi/.run/backup2datalad-cron-nonzarr.lock bash -c '/mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron'
dandi    2909514  0.0  0.0  15732  7296 ?        S    Jul26   3:58         /usr/bin/perl /usr/bin/chronic flock -E 0 -e -n /home/dandi/.run/backup2datalad-cron-nonzarr.lock bash -c /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron
dandi    2909515  0.0  0.0   5460  1792 ?        S    Jul26   0:00           /usr/bin/flock -E 0 -e -n /home/dandi/.run/backup2datalad-cron-nonzarr.lock bash -c /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron
dandi    2909516  0.0  0.0   7064  2048 ?        S    Jul26   0:00             /bin/bash /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron
dandi    2909709  0.0  0.0   7064   776 ?        S    Jul26   0:00               /bin/bash /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron
dandi    2909710  0.0  0.0 2150212 2936 ?        Sl   Jul26   1:00                 /home/dandi/miniconda3/envs/dandisets-2/bin/python /home/dandi/miniconda3/envs/dandisets-2/bin/backups2datalad -l WARNING --backup-root /mnt/backup/dandi --config tools/backups2datalad.cfg.yaml update-from-backup --workers 5 -e 000(026|108|243|876)$
dandi    2938723  0.0  0.0   9892  2304 ?        S    Jul26   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex addurl -c annex.alwayscompact=false --batch --with-files --jobs 3 --json --json-error-messages --json-progress --raw-except=datalad
dandi    2938725  1.7  0.0 1074420924 33152 ?    Sl   Jul26 280:51                     /usr/lib/git-annex.linux/exe/git-annex --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git-annex/git-annex addurl -c annex.alwayscompact=false --batch --with-files --jobs 3 --json --json-error-messages --json-progress --raw-except=datalad
dandi    2938763  0.0  0.0   9956   896 ?        S    Jul26   0:00                       /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs -c annex.alwayscompact=false cat-file --batch
dandi    2938884  0.0  0.0 269824  2912 ?        S    Jul26   0:00                       /home/dandi/miniconda3/envs/dandisets-2/bin/python3.1 /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex-remote-datalad
dandi    2938810  0.0  0.0   9892  2304 ?        S    Jul26   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex examinekey --batch --migrate-to-backend=SHA256E
dandi    2938812  0.0  0.0 1074052072 8320 ?     Sl   Jul26   0:00                     /usr/lib/git-annex.linux/exe/git-annex --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git-annex/git-annex examinekey --batch --migrate-to-backend=SHA256E
dandi    2938876  0.0  0.0   9892  2176 ?        S    Jul26   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex fromkey --force --batch --json --json-error-messages
dandi    2938879  0.0  0.0 1074052204 9728 ?     Sl   Jul26   0:00                     /usr/lib/git-annex.linux/exe/git-annex --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git-annex/git-annex fromkey --force --batch --json --json-error-messages
dandi    2938926  0.0  0.0   9956   896 ?        S    Jul26   0:00                       /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
dandi    2938941  0.0  0.0   9988   640 ?        S    Jul26   0:00                       /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs hash-object -w --no-filters --stdin-paths
dandi    2938950  0.0  0.0   9892  2304 ?        S    Jul26   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex registerurl -c annex.alwayscompact=false --batch --json --json-error-messages
dandi    2938952  0.3  0.0 1074199672 19328 ?    Sl   Jul26  54:27                     /usr/lib/git-annex.linux/exe/git-annex --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git-annex/git-annex registerurl -c annex.alwayscompact=false --batch --json --json-error-messages
dandi    2938978  0.0  0.0   9956   768 ?        S    Jul26   0:00                       /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs -c annex.alwayscompact=false cat-file --batch
dandi    2938979  0.0  0.0   7168  2176 ?        S    Jul26   0:00                       /bin/bash /usr/bin/git-annex-remote-rclone
dandi    2939025  0.0  0.0 265740  2984 ?        S    Jul26   0:00                       /home/dandi/miniconda3/envs/dandisets-2/bin/python3.1 /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex-remote-datalad
dandi    2939266  0.0  0.0   9892  2304 ?        S    Jul26   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex whereis --batch-keys --json --json-error-messages
dandi    2939268  0.3  0.0 1074199668 19584 ?    Sl   Jul26  55:01                     /usr/lib/git-annex.linux/exe/git-annex --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git-annex/git-annex whereis --batch-keys --json --json-error-messages
dandi    2939316  0.0  0.0   9992   640 ?        S    Jul26   0:00                       /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs hash-object -w --no-filters --stdin-paths
dandi    2939336  0.0  0.0  10068   768 ?        S    Jul26   0:00                       /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
dandi    2939378  0.0  0.0 265660  3064 ?        S    Jul26   0:00                       /home/dandi/miniconda3/envs/dandisets-2/bin/python3.1 /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex-remote-datalad
dandi    2939574  0.0  0.0   7168  2304 ?        S    Jul26   0:00                       /bin/bash /usr/bin/git-annex-remote-rclone
dandi    2909711  0.0  0.0   6472  2176 ?        S    Jul26   0:00                 grep -v nothing to save, working tree clean

It looks to me like the process simply did not explicitly exit/kill those batched git-annex processes. Is that a right assumption or they were supposed to exit since their pipe file descriptors were closed? (I think we had some issues like that in git-annex before). WDYT @jwodder ?

The full log is at /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2024.07.26.20.02.11Z.log

FWIW -- it seems like we did go through all dandisets, although the last one "Editing" was out of order:

(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ grep Editing /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2024.07.26.20.02.11Z.log
2024-07-26T16:02:21-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/000070
2024-07-26T16:02:30-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/000253
2024-07-26T16:02:42-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/000289
2024-07-26T16:02:50-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/000723
2024-07-26T16:02:52-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/000940
2024-07-26T16:03:14-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/000940
2024-07-26T16:03:25-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/000952
2024-07-26T16:03:29-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001075
2024-07-26T16:03:35-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001087
2024-07-26T16:03:38-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001087
2024-07-26T16:03:40-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001088
2024-07-26T16:03:43-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001088
2024-07-26T16:03:43-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001089
2024-07-26T16:03:46-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001089
2024-07-26T16:03:52-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001090
2024-07-26T16:03:54-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001091
2024-07-26T16:03:55-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001090
2024-07-26T16:03:57-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001092
2024-07-26T16:03:57-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001091
2024-07-26T16:04:00-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001092
2024-07-26T16:04:07-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001093
2024-07-26T16:04:09-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001094
2024-07-26T16:04:10-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001095
2024-07-26T16:04:10-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001093
2024-07-26T16:04:12-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001094
2024-07-26T16:04:13-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001095
2024-07-26T16:04:18-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001096
2024-07-26T16:04:22-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/001096
2024-07-26T16:04:41-0400 [DEBUG   ] backups2datalad: Editing repository dandisets/000971

and note that some were "edited" multiple times (not sure what "Editing" means here exactly but if any kind of change + commit -- could as well happen).

Looking at one of those sample stuck batched processes:

(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ ps auxw -H  |grep 'annex examinekey --batch'
dandi    3356402  0.0  0.0   6472  2176 pts/8    S+   13:44   0:00       grep annex examinekey --batch
dandi    2938810  0.0  0.0   9892  2304 ?        S    Jul26   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex examinekey --batch --migrate-to-backend=SHA256E
dandi    2938812  0.0  0.0 1074052072 8320 ?     Sl   Jul26   0:00                     /usr/lib/git-annex.linux/exe/git-annex --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git-annex/git-annex examinekey --batch --migrate-to-backend=SHA256E

we see that pipe was opened but never "Waiting for" to be closed:

(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ ls -l /proc/2938812/cwd
lrwxrwxrwx 1 dandi dandi 0 Aug  6 13:44 /proc/2938812/cwd -> /mnt/backup/dandi/dandisets/000874
(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ grep 'annex examinekey --batch.*000874' /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2024.07.26.20.02.11Z.log
2024-07-26T16:02:50-0400 [DEBUG   ] backups2datalad: Opening pipe to `git -c receive.autogc=0 -c gc.auto=0 annex examinekey --batch --migrate-to-backend=SHA256E` [cwd=/mnt/backup/dandi/dandisets/000874]

like it did for other dandisets:

(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ grep 'annex examinekey --batch.*000' /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2024.07.26.20.02.11Z.log
2024-07-26T16:02:16-0400 [DEBUG   ] backups2datalad: Opening pipe to `git -c receive.autogc=0 -c gc.auto=0 annex examinekey --batch --migrate-to-backend=SHA256E` [cwd=/mnt/backup/dandi/dandisets/000070]
2024-07-26T16:02:17-0400 [DEBUG   ] backups2datalad: Waiting for `git -c receive.autogc=0 -c gc.auto=0 annex examinekey --batch --migrate-to-backend=SHA256E` [cwd=/mnt/backup/dandi/dandisets/000070] to terminate
...

suggesting that something lead our script to get stuck (could be some other git-annex process of cause)... so I looked at all of them -- seems all are for the same dandiset and none of them "Waiting for":

(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ ps auxw -H  |grep ' git .*annex.*batc[h]' | while read _ p _; do ls -ld /proc/$p/cwd; done
lrwxrwxrwx 1 dandi dandi 0 Aug  6 13:48 /proc/2938723/cwd -> /mnt/backup/dandi/dandisets/000874
lrwxrwxrwx 1 dandi dandi 0 Aug  6 13:48 /proc/2938810/cwd -> /mnt/backup/dandi/dandisets/000874
lrwxrwxrwx 1 dandi dandi 0 Aug  6 13:48 /proc/2938876/cwd -> /mnt/backup/dandi/dandisets/000874
lrwxrwxrwx 1 dandi dandi 0 Aug  6 13:48 /proc/2938950/cwd -> /mnt/backup/dandi/dandisets/000874
lrwxrwxrwx 1 dandi dandi 0 Aug  6 13:48 /proc/2939266/cwd -> /mnt/backup/dandi/dandisets/000874
(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ grep 'annex.*batch.*000874' /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2024.07.26.20.02.11Z.log
2024-07-26T16:02:50-0400 [DEBUG   ] backups2datalad: Opening pipe to `git -c receive.autogc=0 -c gc.auto=0 annex addurl -c annex.alwayscompact=false --batch --with-files --jobs 3 --json --json-error-messages --json-progress --raw-except=datalad` [cwd=/mnt/backup/dandi/dandisets/000874]
2024-07-26T16:02:50-0400 [DEBUG   ] backups2datalad: Opening pipe to `git -c receive.autogc=0 -c gc.auto=0 annex examinekey --batch --migrate-to-backend=SHA256E` [cwd=/mnt/backup/dandi/dandisets/000874]
2024-07-26T16:02:50-0400 [DEBUG   ] backups2datalad: Opening pipe to `git -c receive.autogc=0 -c gc.auto=0 annex fromkey --force --batch --json --json-error-messages` [cwd=/mnt/backup/dandi/dandisets/000874]
2024-07-26T16:02:50-0400 [DEBUG   ] backups2datalad: Opening pipe to `git -c receive.autogc=0 -c gc.auto=0 annex registerurl -c annex.alwayscompact=false --batch --json --json-error-messages` [cwd=/mnt/backup/dandi/dandisets/000874]
2024-07-26T16:02:51-0400 [DEBUG   ] backups2datalad: Opening pipe to `git -c receive.autogc=0 -c gc.auto=0 annex whereis --batch-keys --json --json-error-messages` [cwd=/mnt/backup/dandi/dandisets/000874]

dandiset itself is dirty:

(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets/000874$ git status
On branch draft
Your branch is up to date with 'github/draft'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   dandiset.yaml

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        rawdata/sub-SP002/ses-OCT/micr/sub-SP002_ses-OCT_sample-01_OCT.nii
        rawdata/sub-SP002/ses-OCT/micr/sub-SP002_ses-OCT_sample-02_OCT.nii
        rawdata/sub-SP002/ses-OCT/micr/sub-SP002_ses-OCT_sample-03_OCT.nii

@jwodder could you try to figure out what could the script potentially waiting on for?

I have not interrupted any process ATM so we could troubleshoot but let's prioritize since that delays update of dandisets.

@jwodder
Copy link
Member

jwodder commented Aug 7, 2024

@yarikoptic First, for the record, the "Editing repository" message means that a GitHub repository is being edited/modified, usually to update the sizes in the repository description, which is the last step in backing up an individual Dandiset. Because multiple Dandisets are backed up concurrently, it's not that unusual for the "Editing" messages to be out of order with respect to Dandiset IDs.

As to the actual problem, when I look at the log messages for Dandiset 000874, the last message is "Done feeding URLs to addurl", but there is no corresponding "Done reading from addurl". In addition, the logs contain multiple messages about starting the download of various JSON files but no corresponding messages about finishing the downloads. I tried downloading one of the listed URLs manually with:

curl -vsSL -H "Authorization: token $DANDI_API_KEY" https://api.dandiarchive.org/api/assets/3e98c412-b4be-4e3d-8709-662e721cba30/download/

(where the value of $DANDI_API_KEY was taken from .git/secrets in the superdataset on drogon) but I got a 403 "You do not have permission to perform this action" error.

If this 403 had occurred when backups2datalad was downloading the file, then there should have been a message logged about it, unless git-annex addurl changed its JSON output. To test this, I ran the following script on my machine (Can you confirm that this should work?):

#!/bin/bash
export DATALAD_dandi_token=---REDACTED---

set -ex -o pipefail

git init foo
cd foo
git annex init

mkdir -p .datalad/providers
cat > .datalad/providers/dandi.cfg <<'EOT'
[provider:dandi]
url_re = https?://api\.dandiarchive\.org/api/.*
authentication_type = http_token
credential = dandi

[credential:dandi]
type = token
EOT

git add .datalad/providers/dandi.cfg
git commit -m 'Add dandi provider config'

git annex initremote datalad \
    type=external \
    externaltype=datalad \
    encryption=none \
    autoenable=true \
    uuid=cf13d535-b47c-5df6-8590-0793cb08a90a

printf '%s %s\n' \
    https://api.dandiarchive.org/api/assets/3e98c412-b4be-4e3d-8709-662e721cba30/download/ \
    derivatives/OCT-pipeline/sub-SP002/micr/sub-SP002_ses-OCT_sample-01_res-20um_OCT.json \
| git annex addurl --batch --with-files --json --json-error-messages --json-progress --raw-except=datalad

which output the following at the addurl step:

[INFO] http session: Authenticating into session for https://api.dandiarchive.org/api/assets/3e98c412-b4be-4e3d-8709-662e721cba30/download/
Access to https://api.dandiarchive.org/api/assets/3e98c412-b4be-4e3d-8709-662e721cba30/download/ has failed.
Do you want to enter other credentials in case they were updated? (choices: yes, no):

Based on the traceback I got when I hit Ctrl-C, it seems this last prompt came from DataLad rather than git-annex. This even happened when I redirected stdout to a pipe, which seems like bad behavior from DataLad. In addition, the stalled addurl process on drogon has a git-annex-remote-datalad descendant. My best guess at the moment is that the addurl process is stalled waiting for user input that will never come.

@yarikoptic
Copy link
Member Author

oh, thank you very much for digging it up -- I should have noticed trailing datalad process!

I thought that it was my outdated token, but didn't check explicitly and changed to a new service admin user's token... but surprise -- even though I could curl endpoint for an embargoed dandiset, I still got this 403 for the assets /download/. The same when I tried my own token again. So there is more to it! Filed

Let's see if may be there would be a quick fix... and I should look into preventing interactivity from datalad...

@jwodder
Copy link
Member

jwodder commented Aug 9, 2024

@yarikoptic So should the addurl (or git-annex-remote-datalad?) process be killed now in order to get everything to finish up? I can't deploy #57 while there's an old stuck backup.

@yarikoptic
Copy link
Member Author

yes, please kill, upgrade datalad in that env to at least 1.0.3 which had relevant fix

Raise exception if an annex remote process without console tries to interact with the user, e.g. prompt for a password. datalad/datalad#7578 (by @christian-monch)

so we should not get it stuck again (may be first check on your script if that is indeed the case).

@jwodder
Copy link
Member

jwodder commented Aug 12, 2024

@yarikoptic All git-annex-remote-datalad processes were killed, the backup finished normally (aside from the failed Dandiset), and datalad in the dandisets-2 environment has been upgraded to 1.1.3.

@yarikoptic
Copy link
Member Author

Thank you! Let's consider this issue addressed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug:stall report Report of a backup2datalad run getting stuck or stalling
Projects
None yet
Development

No branches or pull requests

2 participants