Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the destination of where failed script is written to #1530

Closed
gaow opened this issue Dec 23, 2023 · 19 comments
Closed

Change the destination of where failed script is written to #1530

gaow opened this issue Dec 23, 2023 · 19 comments

Comments

@gaow
Copy link
Member

gaow commented Dec 23, 2023

This is typically what we see when a job failed,

ERROR: [susie_twas_1]: [0]:
Failed to execute Rscript /home/aw3600/.sos/97de343d7da3f0ce/susie_twas_1_0_ff982a12.R
exitcode=1, workdir=/home/mambauser/data
---------------------------------------------------------------------------
[susie_twas]: Exits with 1 pending step (susie_twas_2)

The line Rscript /home/aw3600/.sos/97de343d7da3f0ce/susie_twas_1_0_ff982a12.R is how we track and try to reproduce the error, to debug. However, long story short is that we are working with cloud computing where /home/aw3600/.sos/ is a path in the VM that gets destroyed after a command ends. Although it is possible to copy the entire .sos folder to permanent AWS S3 bucket before the VM dies, it is non-trivial to sync the entire folder ... all we care is this file susie_twas_1_0_ff982a12.R.

I think this conversation was once brought up but I don't remember we have an option to do it yet -- can we specify something on the sos run interface to make these temporary scripts saved to a given folder? I like the behavior of:

R: ... , stdout =, stderr = 

which writes the stderr and stdout to where I want them. I wonder if we can add something like:

R: ..., stdout=, stderr=, debug_script="/path/to/debug/folder"

and only keep the scripts to /path/to/debug/folder when there is an issue -- and change the prompt Failed to execute to pointing to this script?

@BoPeng
Copy link
Contributor

BoPeng commented Dec 23, 2023

Cannot you -v outside_sos:/home/user/.sos to save the content outside of the VM?

@gaow
Copy link
Member Author

gaow commented Dec 23, 2023

The suggestions we get from the vendor team is to avoid any communications between VM and S3 directly especially for frequent file exchanges. That was why their setup is to write data only to specific folders pre-mounted which later gets automatically copied to S3. I guess we can try to mount it as you suggest and see if it is feasible for large embarrassing parallel workflows involving many signature checks. I will report it back here if it seems too slow.

@BoPeng
Copy link
Contributor

BoPeng commented Dec 24, 2023

Let me know if it works for you. It is pretty easy to add this setting from command line but the problem is that ~/.sos/config.yml is the default configuration path and moving ~/.sos away will force users to use this option all the time. It is possible to keep ~/.sos/ but move the command log to somewhere else though.

@gaow
Copy link
Member Author

gaow commented Dec 24, 2023

@BoPeng thank you! Currently we cannot test it properly because the way our vendor set it up is that the S3 bucket is mounted with read-only. The automatic process runs the sync separate from our SoS run so we don't really write there the real time. I am asking them to reconfigure it and am waiting for the response.

It is possible to keep ~/.sos/ but move the command log to somewhere else though.

I think all we would need is to move the failed script ("command log"?) elsewhere. That's what we are interested in. If that's written to a folder that we sync between the VM and S3 that'd be the best. We should be able to leverage that and test it out.

@gaow
Copy link
Member Author

gaow commented Feb 4, 2024

It is possible to keep ~/.sos/ but move the command log to somewhere else though.

@BoPeng I'm sorry, it turns out we do need this feature -- to keep ~/.sos where it is but set the failed script to write to somewhere else, perhaps the output folder? The problem is that we did try to mount S3 bucket to ~/.sos as the cache. However, the I/O is an issue and we encounter SQLite failure frequently for large jobs to the extend that we cannot get our analysis done this way .. We will need to keep ~/.sos local.

Can we change the default for SoS anyways to write the command log to the same folder as where people set stderr to be, and if they did not set it we still write to ~/.sos? Thank you in advance for helping with this emergency fix!

@BoPeng
Copy link
Contributor

BoPeng commented Feb 4, 2024

The problem is that the .sos/... has the complete script that can be executed by Rscript .... etc, but stderr is a file with other content, and cannot be directly executed. Would not be enough if we write the content of the script also to stderr?

@BoPeng
Copy link
Contributor

BoPeng commented Feb 4, 2024

Let me see if #1533 works.

@gaow
Copy link
Member Author

gaow commented Feb 4, 2024

Thank you @BoPeng it's very helpful pointer to where to modify it. I have change it to:

https://github.com/vatlab/sos/pull/1533/files

It seems good. For example for this script:

[1]
R: 
  print(hello)

it says:

Failed to execute Rscript /home/gw/.sos/aee75cfb40461b96/1_0_a0afcf75.R
exitcode=1, workdir=/home/gw/tmp/04-Feb-2024

but when i set stderr file explicitly:

[1]
R: stderr="file.txt"
  print(hello)

the temporary script gets into the current folder properly:

Failed to execute Rscript /home/gw/tmp/04-Feb-2024/1_0_9b3f78e8.R
exitcode=1, workdir=/home/gw/tmp/04-Feb-2024, stderr=file.txt

Do you see any obvious issues with this patch? Please improve as you see fit. I wonder if we can also release a new version to conda for us to pull the changes and apply it to our jobs on AWS. Thanks!

@BoPeng
Copy link
Contributor

BoPeng commented Feb 4, 2024

How about a combination of both patches? I think having the script directly in stderr can be useful especially when stderr is the default (sys.stderr).

@BoPeng
Copy link
Contributor

BoPeng commented Feb 4, 2024

If we do that, for consistency and convenient, we should also write the script to

with open(

This is because sometimes it is not entirely clear what went wrong with a task when it fails due to variable replacement problem.

gaow added a commit that referenced this issue Feb 4, 2024
@gaow
Copy link
Member Author

gaow commented Feb 4, 2024

@BoPeng I brought the other patch back via: ffeac9c instead of writing the entire script, because the script can be very long in many applications.

I think by placing the error message into stderr, it should also reflect into the task status so we don't have to modify task_executor.py?

The patch does not work as is, however. The error message is

TypeError: a bytes-like object is required, not 'str'

I think this is because you opened the stderr file with b option so we need to se.write a byte object not str? I'm not sure how to do that. I tried to wrap it around with encode_msg which bypasses the problem but the output has non-ASCII characters in it ... perhaps you know a quick fix to it :)

@BoPeng
Copy link
Contributor

BoPeng commented Feb 4, 2024

ok, the patch is updated, it should work w/wo option stderr and w/wo task:. Please let me know if it works as expected.

@gaow
Copy link
Member Author

gaow commented Feb 4, 2024

Thanks @BoPeng the patch works well in a couple of tests I tried. But for the conda release -- I see that there are some failed tests on #1533 should they be ignored?

@BoPeng
Copy link
Contributor

BoPeng commented Feb 4, 2024

I will clean up the code (pylint) and make a release.

@gaow gaow closed this as completed Feb 4, 2024
@BoPeng
Copy link
Contributor

BoPeng commented Feb 4, 2024

sos 0.24.5 is relased.

@gaow
Copy link
Member Author

gaow commented Feb 5, 2024

Thank you @BoPeng . It's not here yet: https://anaconda.org/conda-forge/sos but i guess it will show up soon?

@BoPeng
Copy link
Contributor

BoPeng commented Feb 5, 2024

Yes, it should be there in a few hours after the pypi release.

@gaow
Copy link
Member Author

gaow commented Feb 5, 2024

I am not sure if that will be the case ... according to the release history, conda should have version 0.24.4: https://pypi.org/project/sos/#history

However, it is still 0.24.3 https://anaconda.org/conda-forge/sos

Perhaps there are some check fails that prevents it from getting onto conda-forge?

@gaow
Copy link
Member Author

gaow commented Feb 5, 2024

It's posted after I merged PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants