
My YAML file appears to not get processed properly #259

Closed
Chris-Schnaufer opened this issue Jan 18, 2022 · 29 comments
Labels: bug (Something isn't working), in progress (Someone is working on this issue; please also assign yourself), priority (Should be resolved first, if possible)
@Chris-Schnaufer (Contributor)

The site reports that the name field is missing from my YAML file, although it's there. Trying to run also reports an error that the image configuration key is missing, even though it is present as well. My YAML file: https://github.com/Chris-Schnaufer/drone-makeflow/blob/main/plantit.yaml

[Screenshot: 2022-01-18 1:24 PM]

@wpbonelli (Member) commented Jan 20, 2022

Hi @Chris-Schnaufer, thanks for reporting this! It seems there is indeed a bug in the configuration file validation logic: a KeyError is raised while accessing output.path, even though that attribute isn't required:

"validation" : {
      "errors" : [
         "Traceback (most recent call last):\n  File \"/code/plantit/plantit/github.py\", line 404, in list_connectable_repos_by_owner\n    validation = validate_repo_config(config, token)\n  File \"/code/plantit/plantit/github.py\", line 111, in validate_repo_config\n    if config['output']['path'] is not None and type(config['output']['path']) is not str:\nKeyError: 'path'\n"
      ],
      "is_valid" : false
   }

I think the missing name and image errors are cascading consequences of this.
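A minimal defensive rewrite of that check (a sketch only; the real validate_repo_config and config shape live in plantit/github.py) could use dict.get so optional keys don't raise:

```python
# Hypothetical sketch of the failing check in validate_repo_config:
# dict.get with a default avoids the KeyError when the optional
# `output` section or its `path` key is absent from the config.
def output_path_is_valid(config: dict) -> bool:
    path = config.get("output", {}).get("path")
    return path is None or isinstance(path, str)

print(output_path_is_valid({}))                           # no output section at all
print(output_path_is_valid({"output": {}}))               # output present, path absent
print(output_path_is_valid({"output": {"path": "out"}}))  # valid string path
```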

@wpbonelli wpbonelli added bug Something isn't working in progress Someone is working on this issue (please also assign yourself) labels Jan 20, 2022
@wpbonelli wpbonelli self-assigned this Jan 20, 2022
wpbonelli added a commit that referenced this issue Jan 20, 2022
wpbonelli added a commit that referenced this issue Jan 20, 2022
@wpbonelli (Member) commented Jan 22, 2022

Hi @Chris-Schnaufer, this is addressed in v0.0.33 (out tonight). It looks like the drone-makeflow repo is publicly visible now.

I did notice that in the plantit.yaml, the jobqueue section has the old time and mem attributes. These were changed to walltime and memory a few releases back. The old attribute names should still work for now but will be deprecated in a future release. The docs have now been updated to reflect this. Apologies for the delay here.
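For anyone else hitting this, the rename amounts to a change along these lines in plantit.yaml (values here are illustrative, not taken from the repo):

```yaml
jobqueue:
  # old names, still accepted for now but slated for deprecation:
  #   time: "01:00:00"
  #   mem: "4GB"
  walltime: "01:00:00"
  memory: "4GB"
```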

@Chris-Schnaufer (Contributor, Author)

I have updated the YAML to use the correct terms.

Unfortunately, I am now unable to select an input folder:

[Screenshot: 2022-01-24 1:24 PM]

I've also tried using the "Bind project/study" option without any luck:

[Screenshot: 2022-01-24 1:25 PM]

I'm not sure where to go from here.

@wpbonelli (Member) commented Jan 25, 2022

Thanks @Chris-Schnaufer, found the first problem (the 'Selected' alert doesn't handle the case for directory inputs, only file/files). Will patch shortly. Looking into the MIAPPE project binding too.

I really appreciate your help revealing all these issues; I'm not able to reliably test the whole UI surface alone.

wpbonelli added a commit that referenced this issue Jan 25, 2022
@wpbonelli (Member)

Hi @Chris-Schnaufer, the input selection issue should now be resolved (still working on the project binding fix). Apologies for the delay. Please let me know if you are still unable to submit jobs.

I just tested the pipeline and although the submission is successful, the job fails due to a missing plantit-workflow.sh script. Checking inside the agdrone/drone-workflow:1.2 image definition it looks like plantit-workflow.sh does not exist in the /scif/apps/src/ directory:

$ ls /scif/apps/src/
betydb2geojson.py           cyverse_canopycover.sh        cyverse_plotclip.sh        cyverse_soilmask_ratio.sh    git_algo_rgb_plot.py           merge_csv.py           shp2geojson_workflow.jx
betydb2geojson_workflow.jx  cyverse_find_files2json.sh    cyverse_short_workflow.sh  find_files2json.sh           git_rgb_plot_workflow.jx       merge_csv_workflow.jx  soilmask_ratio_workflow.jx
canopycover_workflow.jx     cyverse_greenness-indices.sh  cyverse_shp2geojson.sh     find_files2json_workflow.jx  greenness-indices_workflow.jx  plotclip_workflow.jx   soilmask_workflow.jx
cyverse_betydb2geojson.sh   cyverse_merge_csv.sh          cyverse_soilmask.sh        generate_geojson.sh          jx-args.json                   short_workflow.jx

@Chris-Schnaufer (Contributor, Author)

Hello @w-bonelli. I was able to get back to testing this and I am still having problems. I changed the docker image so that it's pointing to a test version that has the plantit-workflow.sh file in the correct location: the image is chrisatua/development:drone_makeflow.

I'm not sure why it's reporting that it can't find the docker image. I've tried uploading the docker image again and there's no change in the run result. Here are two screenshots showing the step before running the Task, and the Task result. I'm running from the main branch of the following repo: https://github.com/Chris-Schnaufer/drone-makeflow

[Screenshot: 2022-02-16 11:52 AM]

[Screenshot: 2022-02-16 11:48 AM]

On another note, I appear to have two projects with the same GitHub path:
[Screenshot: 2022-02-16 12:04 PM]

@wpbonelli (Member)

Thanks @Chris-Schnaufer, looking into this now.

@wpbonelli (Member) commented Feb 22, 2022

Hi @Chris-Schnaufer, I believe the root cause of the latest issue was that the docker image attribute was not parsed properly from plantit.yaml (due to the comment here). This should now be fixed with v0.1.0. Apologies again for the delay.
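The actual plantit parsing code isn't shown here, but hand-rolled line parsing commonly breaks this way: a trailing comment gets swallowed into the value. A hypothetical illustration (the image name is from this thread; the parsing logic is invented for the example, not plantit's real code):

```python
line = "image: docker://chrisatua/development:drone_makeflow  # test image"

# A naive split keeps the trailing comment in the value,
# so the image lookup fails downstream:
naive = line.split(":", 1)[1].strip()

# Stripping the comment before splitting yields the intended value:
clean = line.split("#", 1)[0].split(":", 1)[1].strip()

print(naive)
print(clean)
```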

I was able to submit AgPipeline/drone-makeflow without errors last night, however the container workflow did not complete successfully. Here was the error message: /usr/bin/sh: 1: /scif/apps/src/plantit-workflow.sh: not found

@Chris-Schnaufer (Contributor, Author)

@w-bonelli That repository has the incorrect Docker image listed. The one at Chris-Schnaufer/drone-makeflow is the one that shows up on my system.

@wpbonelli (Member) commented Feb 28, 2022

Ok, got it. I just did a test run with Chris-Schnaufer/drone-makeflow and received this output:

INPUT FOLDER /scratch/03203/dirt/plantit/2e10ffd5-a5ec-4a4e-9e00-218866cd81fc/input/canopycover_test_data
WORKING FOLDER /scratch/03203/dirt/plantit/2e10ffd5-a5ec-4a4e-9e00-218866cd81fc
Processing with /scratch/03203/dirt/plantit/2e10ffd5-a5ec-4a4e-9e00-218866cd81fc/input/canopycover_test_data/orthoimage_mask.tif /scratch/03203/dirt/plantit/2e10ffd5-a5ec-4a4e-9e00-218866cd81fc/input/canopycover_test_data/plots.json
  Options:  --metadata /scratch/03203/dirt/plantit/2e10ffd5-a5ec-4a4e-9e00-218866cd81fc/input/canopycover_test_data/experiment.yaml
/scif/apps/src/plantit-workflow.sh: line 61: /scif/apps/src/jx-args.json: Read-only file system
Running workflow steps: soilmask plotclip find_files2json canopycover merge_csv
Running app 0 'soilmask'
[soilmask] executing /bin/bash /scif/apps/soilmask/scif/runscript
makeflow: line 0: expected a workflow definition as a JSON object but got error("on line 4, SOILMASK_MASK_FILE: undefined symbol") instead
2022/02/28 11:17:22.61 makeflow[30569] fatal: makeflow: couldn't load /scif/apps/src/soilmask_workflow.jx: Invalid argument
Terminated
Running app 1 'plotclip'
[plotclip] executing /bin/bash /scif/apps/plotclip/scif/runscript
makeflow: line 0: expected a workflow definition as a JSON object but got error("on line 10, PLOTCLIP_SOURCE_FILE: undefined symbol") instead
2022/02/28 11:17:22.97 makeflow[30573] fatal: makeflow: couldn't load /scif/apps/src/plotclip_workflow.jx: Invalid argument
Terminated
Running app 2 'find_files2json'
[find_files2json] executing /bin/bash /scif/apps/find_files2json/scif/runscript
makeflow: line 0: expected a workflow definition as a JSON object but got error("on line 10, FILES2JSON_SEARCH_NAME: undefined symbol") instead
2022/02/28 11:17:23.34 makeflow[30578] fatal: makeflow: couldn't load /scif/apps/src/find_files2json_workflow.jx: Invalid argument
Terminated
Running app 3 'canopycover'
[canopycover] executing /bin/bash /scif/apps/canopycover/scif/runscript
2022/02/28 11:17:23.71 makeflow[30582] fatal: Failed to parse in JX Args File.
Terminated
Running app 4 'merge_csv'
[merge_csv] executing /bin/bash /scif/apps/merge_csv/scif/runscript
makeflow: line 0: expected a workflow definition as a JSON object but got error("on line 10, MERGECSV_SOURCE: undefined symbol") instead
2022/02/28 11:17:24.07 makeflow[30586] fatal: makeflow: couldn't load /scif/apps/src/merge_csv_workflow.jx: Invalid argument
Terminated
Workflow completed

Looks like the root issue is that Singularity makes the filesystem read-only by default. So when plantit-workflow.sh tries to write to /scif/apps/src/jx-args.json it fails and the error cascades.

Would it be possible to alter the way the agdrone workflow accepts configuration info? E.g., allowing the location of jx-args.json to be specified at invocation time instead of expecting it to be at /scif/apps/src?

Another option might be to use the mount attribute in plantit.yaml (this configures a Singularity bind mount mapping the specified path to the working directory on the host), then modify plantit-workflow.sh to git clone the repo and then write directly to jx-args.json instead of /scif/apps/src/jx-args.json. (plantit no longer supports the automatic clone option in plantit.yaml because of some headaches re: handling potential duplicate filenames, but there is no reason a workflow can't manually do it)

@Chris-Schnaufer (Contributor, Author)

Hello @w-bonelli, I have looked at this and have some comments (I'm also not knowledgeable about Singularity):

allowing the location of jx-args.json to be specified at invocation time

There are dependencies built into the container that would require additional writes to the file system to allow this. So I don't think it would work.

Singularity makes the filesystem read-only by default

What are the writable folders on the system? In other words, how do generated files get saved in the container and exported from the container when it's done? Also, the apps that run expect the /scif folder to be writable as part of the app management system; can something be done to enable writing?

Any help on this is appreciated!

@wpbonelli (Member)

Hi @Chris-Schnaufer,

What are the writable folders on the system? In other words, how do generated files get saved in the container and exported from the container when it's done?

Singularity automatically mounts the current directory on the host into the container (as well as /home/$USER and /tmp) under the paths as they appear on the host filesystem. Those are the only writable locations by default, if I understand the docs correctly. So something like singularity exec docker://alpine touch test.txt works, and test.txt will exist in the host working directory after the container exits, but singularity exec docker://alpine touch /opt/test.txt fails with a "Read-only file system" warning.

Also, the apps that run expect the /scif folder to be writable as part of the app management system - can something be done about to enable writing?

I think bind mounts might work as an indirect way of making /scif/apps/src writable by modifying the container's view of the filesystem. Bind mounts allow mapping paths on the host to custom paths within the container. PlantIT supports this via the mount attribute in the plantit.yaml file. If you mount /scif/apps/src (example here), that will overwrite that folder in the container and replace it with the contents of the host working directory, without changing its path as visible to the container. Is there anything at /scif/apps/src in the image definition that is not present in the GitHub repo? If not, the first step in plantit-workflow.sh could be to clone the repo into /scif/apps/src, after which I believe the container could read and write to anything in that folder.
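For concreteness, a mount entry along these lines might do it (a sketch; check the exact schema against the plantit.yaml docs and the linked example):

```yaml
mount:
  - /scif/apps/src   # host working directory backs this container path
```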

I will try this tomorrow or Fri to check that it works as expected. I wish we could provide more straightforward support for writing arbitrary locations but I think it is a pretty fundamental Singularity limitation.

@Chris-Schnaufer (Contributor, Author)

The system works by checking out the repo and then building the docker image. The built solution is what's run.

@wpbonelli (Member)

Trying this now.

@wpbonelli (Member)

Hi @Chris-Schnaufer my apologies again for the delay. I think I have this working now. See the diff here for the changes.

What I did:

  • add bind mounts to plantit.yaml for each location the workflow needs to write to
  • update Dockerfile to copy plantit-workflow.sh into a different location in the container (I used opt/dev, but nearly anywhere should work), so the script isn't overwritten when the host's working directory is mounted to scif/apps/src
  • update the commands entrypoint in plantit.yaml to reflect the new location of the workflow script
  • update plantit-workflow.sh to pull the drone-makeflow repo into (what the container sees as) scif/apps/src (this is a bit involved since git refuses to clone a repo into an occupied directory)
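The last step (cloning into an occupied directory) can be sketched as follows. This demo uses throwaway temp directories as stand-ins for the real repo and the mounted /scif/apps/src: since `git clone` refuses a non-empty target, it initializes a repo in place, fetches, and hard-resets to the fetched branch instead.

```shell
set -e
# Build a stand-in "remote" repo with one tracked file:
src=$(mktemp -d)
cd "$src"
git init -q
git config user.email demo@example.com
git config user.name demo
echo 'echo hello from workflow' > plantit-workflow.sh
git add . && git commit -qm initial && git branch -M main

# The target directory is non-empty, so `git clone` would refuse it:
tgt=$(mktemp -d)
cd "$tgt"
touch preexisting.txt
git init -q
git remote add origin "$src"
git fetch -q origin main
git reset --hard -q FETCH_HEAD   # tracked files materialize; preexisting.txt survives
ls "$tgt"
```

Note that `git reset --hard` overwrites any local files that collide with tracked paths, which is the desired behavior here (the repo contents should win).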

The workflow seems to run successfully. The job log includes the following output:

Processing with /scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/input/canopycover_test_data/orthomosaic.tif /scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/input/canopycover_test_data/plots.json
  Options:  --metadata /scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/input/canopycover_test_data/experiment.yaml
Running workflow steps: soilmask plotclip find_files2json canopycover merge_csv
Running app 0 'soilmask'
[soilmask] executing /bin/bash /scif/apps/soilmask/scif/runscript
parsing /scif/apps/src/soilmask_workflow.jx...
local resources: 28 cores, 257741 MB memory, 15 MB disk
max running local jobs: 28
checking /scif/apps/src/soilmask_workflow.jx for consistency...
/scif/apps/src/soilmask_workflow.jx has 1 rules.
creating new log file /scif/data/soilmask/workflow.jx.makeflowlog...
checking files for unexpected changes...  (use --skip-file-check to skip this step)
starting workflow....
submitting job: ${SCIF_APPROOT}/.venv/bin/python3 ${SCIF_APPROOT}/${SCRIPT_PATH} ${DOCKER_OPTIONS} --working_space "${WORKING_FOLDER}" "${INPUT_GEOTIFF}"
submitted job 16986
{
  "code": 0,
  "file": [
    {
      "path": "/scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/orthomosaicmask.tif",
      "key": "stereoTop",
      "metadata": {
        "data": {
          "name": "soilmask",
          "version": "2.2",
          "ratio": 0.15308405301006114
        }
      }
    }
  ]
}
job 16986 completed
nothing left to do.
Running app 1 'plotclip'
[plotclip] executing /bin/bash /scif/apps/plotclip/scif/runscript
parsing /scif/apps/src/plotclip_workflow.jx...
local resources: 28 cores, 257741 MB memory, 15 MB disk
max running local jobs: 28
checking /scif/apps/src/plotclip_workflow.jx for consistency...
/scif/apps/src/plotclip_workflow.jx has 1 rules.
recovering from log file /scif/data/plotclip/workflow.jx.makeflowlog...
checking for old running or failed jobs...
checking files for unexpected changes...  (use --skip-file-check to skip this step)
starting workflow....
nothing left to do.
Running app 2 'find_files2json'
[find_files2json] executing /bin/bash /scif/apps/find_files2json/scif/runscript
parsing /scif/apps/src/find_files2json_workflow.jx...
local resources: 28 cores, 257741 MB memory, 15 MB disk
max running local jobs: 28
checking /scif/apps/src/find_files2json_workflow.jx for consistency...
/scif/apps/src/find_files2json_workflow.jx has 1 rules.
recovering from log file /scif/data/find_files2json/workflow.jx.makeflowlog...
checking for old running or failed jobs...
checking files for unexpected changes...  (use --skip-file-check to skip this step)
starting workflow....
nothing left to do.
Running app 3 'canopycover'
[canopycover] executing /bin/bash /scif/apps/canopycover/scif/runscript
Running app 4 'merge_csv'
[merge_csv] executing /bin/bash /scif/apps/merge_csv/scif/runscript
parsing /scif/apps/src/merge_csv_workflow.jx...
local resources: 28 cores, 257741 MB memory, 15 MB disk
max running local jobs: 28
checking /scif/apps/src/merge_csv_workflow.jx for consistency...
/scif/apps/src/merge_csv_workflow.jx has 1 rules.
recovering from log file /scif/data/merge_csv/workflow.jx.makeflowlog...
checking for old running or failed jobs...
checking files for unexpected changes...  (use --skip-file-check to skip this step)
starting workflow....
nothing left to do.
Workflow completed

And the workflow.jx.makeflowlog file contents:

# NODE	0	${SCIF_APPROOT}/.venv/bin/python3 ${SCIF_APPROOT}/${SCRIPT_PATH} ${DOCKER_OPTIONS} --working_space "${WORKING_FOLDER}" "${INPUT_GEOTIFF}" 
# CATEGORY	0	default
# SYMBOL	0	default
# PARENTS	0
# SOURCES	0	/scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/input/canopycover_test_data/orthomosaic.tif
# TARGETS	0	/scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/orthomosaicmask.tif
# COMMAND	0	${SCIF_APPROOT}/.venv/bin/python3 ${SCIF_APPROOT}/${SCRIPT_PATH} ${DOCKER_OPTIONS} --working_space "${WORKING_FOLDER}" "${INPUT_GEOTIFF}" 
# FILE 1647819147640506 /scif/data/soilmask/workflow.jx.batchlog 1 0
# STARTED 1647819147659151
# FILE 1647819147677957 /scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/orthomosaicmask.tif 1 1073741824
1647819147678183 0 1 16986 0 1 0 0 0 1
# FILE 1647819150192981 /scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/orthomosaicmask.tif 2 3807181
1647819150193049 0 2 16986 0 0 1 0 0 1
# COMPLETED 1647819150193083
# FILE 1647819150301904 /scif/data/plotclip/workflow.jx.batchlog 1 0
# STARTED 1647819150328675
# COMPLETED 1647819150330269
# FILE 1647819150438240 /scif/data/find_files2json/workflow.jx.batchlog 1 0
# STARTED 1647819150446188
# COMPLETED 1647819150447752
# FILE 1647819150595902 /scif/data/merge_csv/workflow.jx.batchlog 1 0
# STARTED 1647819150602725
# COMPLETED 1647819150604079

I'm not sure what output files are expected, so I'm not sure how to validate results. You may need to specify exact names of expected output files in the output.include.names section of plantit.yaml (or use quite a few output.exclude.names), since the job working directory will have everything from the drone-workflow repo in it.
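That filtering might look something like this in plantit.yaml (a sketch; the included file name is taken from the log above, and the exact schema should be checked against the plantit docs):

```yaml
output:
  include:
    names:
      - orthomosaicmask.tif
  exclude:
    names:
      - jx-args.json
```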

Hope this helps. Please let me know if I can do anything else, and thanks again; this has been a really valuable edge case to explore and figure out how to support.

@Chris-Schnaufer (Contributor, Author)

Wow! Thanks @w-bonelli! This looks great so far and I will look into it further.

Regarding the mounts, is there a reason that the folders under /scif/data/* are separate mounts versus mounting only /scif/data?

@Chris-Schnaufer (Contributor, Author) commented Mar 21, 2022

@w-bonelli I have updated my repo, but the changes aren't reflected in PlantIT. How long does it take for changes to propagate from GitHub? I also have 4 workflows (see image below); is there a way to reduce this? Thanks

[Screenshot: 2022-03-21 1:46 PM]

@wpbonelli (Member)

@Chris-Schnaufer No problem! The workflows refresh every 5 minutes — you may need to reload the page to see the changes reflected. It would be nice to be able to manually refresh particular workflows though. I'll add that to the roadmap.

I also have 4: it looks like 2 are branches under AgPipeline/drone-makeflow (main and develop), 1 is under Chris-Schnaufer/drone-makeflow, and 1 is my own fork w-bonelli/drone-makeflow.

[Screenshot: 2022-03-22 10:22 AM]

De-duplication is planned for branches of the same repo (tracked here) but I have not gotten to it yet.

In the meantime, we can add one of the workflows to the Featured context if you'd like, so it shows up immediately when the user navigates to the workflows view.

[Screenshot: 2022-03-22 10:30 AM]

@Chris-Schnaufer (Contributor, Author)

@w-bonelli thanks for the quick response. I'm not ready to have the workflow featured yet but I will let you know when I think it's ready 👍

@Chris-Schnaufer (Contributor, Author) commented Apr 6, 2022

Hello again @w-bonelli. I was able to make further progress, but something is happening with the machines and the tasks aren't completing. I get different results depending upon the agent.

[Screenshot: 2022-04-06 1:15 PM]

[Screenshot: 2022-04-06 1:13 PM]

@Chris-Schnaufer (Contributor, Author)

Hello @w-bonelli, I am still seeing these issues. Any updates? Thanks

@wpbonelli (Member)

Hi @Chris-Schnaufer, apologies for the delay. Which issue are you seeing? The Stampede2 agent is no longer publicly available, but there is a Sapelo2 agent you should be able to submit to.

@Chris-Schnaufer (Contributor, Author)

Hello, please see the above comment (the last image shows the Sapelo2 issue). Please ignore the Stampede2 agent since it's no longer available. Comment link: #259 (comment)

@wpbonelli (Member) commented Oct 11, 2022

Is it the authentication failed error? Would you mind sharing a more recent task ID?

@Chris-Schnaufer (Contributor, Author)

Yes, the error reads Failed to grant temporary data access.

I tried to reproduce the problem, but I am unable to select the Sapelo2 option.

@wpbonelli (Member) commented Oct 11, 2022

That's likely because the agdrone workflow requests 8 cores, while the public Sapelo2 agent allows a max of 2. If you update your plantit.yaml to request <=2 cores I think it should allow you to submit. I will see if I can reproduce the issue on my own fork later today
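Assuming the cores attribute lives under the jobqueue section (a sketch; attribute name and placement should be checked against the plantit docs):

```yaml
jobqueue:
  cores: 2   # the public Sapelo2 agent allows at most 2
```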

In the longer term we're changing the way orchestration works to take advantage of GitHub Actions, so plantit won't have to manually manage agents or poll clusters for job status. This will involve changes to the plantit.yaml specification, but ultimately it will allow you to plug in your own cluster, removing limitations like this.

I'll update this thread as that work progresses

@Chris-Schnaufer (Contributor, Author)

Hello @w-bonelli. Here is a screenshot of a run I just did that didn't work out:

[Screenshot: 2022-10-12 9:17 AM]

@wpbonelli (Member) commented Oct 13, 2022

I think this occurred because this task's output location is your home folder /iplant/home/schnaufer. The data store only allows granting guest permissions to write to collections inside the user's home directory, not the home folder itself.

Thanks for bringing this to light, I've updated the site to disallow selecting the top-level home folder as the output location.

@Chris-Schnaufer (Contributor, Author)

I am now able to run my workflows. I will open a new issue for any problems found.
