IO Error: interrupt out of nowhere? #288
Maybe try running gcsfuse under strace to see if you can figure out the
source of the interrupt?
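The suggestion above might look something like this. It is only a sketch: the bucket name, mount point, and output path are placeholders.

```shell
# Sketch: run gcsfuse in the foreground under strace.
# -f follows threads and child processes; -o writes the trace to a file
# so it does not get mixed into gcsfuse's own output.
strace -f -o /tmp/gcsfuse.strace \
    gcsfuse --foreground my-bucket /mnt/my-bucket

# Afterwards, search the trace for interrupted system calls:
grep -n "EINTR\|ERESTART" /tmp/gcsfuse.strace
```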
This is what I get when I run it. However, this tells me nothing about the error; I still have no idea where it is coming from.
Can you run with strace but not the gcsfuse debugging flags, in order to
get a trace that’s not all mucked up with intermingled logs?
Is this more helpful? This is the log I get when running it.
You probably need to use the -f argument to strace, in order to trace all
gcsfuse threads. We’re looking for information on the interrupt.
Now that I think harder about it though, the interrupt might be in the
caller of the file system, not gcsfuse. Perhaps you should strace it
instead.
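The suggestion above could be sketched as follows. This is only an illustration: the process ID is a placeholder, and the output path is an assumption.

```shell
# Sketch: attach strace to the already-running caller (e.g. the training
# script) rather than to gcsfuse. <PID> is a placeholder for its process ID.
# -f follows threads and child processes; -e trace=signal narrows the
# output to signal delivery, which is where an interrupt would show up.
strace -f -p <PID> -e trace=signal -o /tmp/caller.strace
```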
Hi! Do you have an update on this problem? I have the same issue. I gave my instance full storage access and, just in case, made sure my service account has full access as well. I still get an interrupt in the middle of training, usually after one or two epochs. It seems like the permissions change randomly over time and my VM instances get locked out, so I'm unable to read images from the bucket. I'm not sure why.
I "intermittently but frequently" get the same issue (I/O error despite having full GCP permissions) when accessing data mounted by gcsfuse. I've opened an issue on the gcsfuse project, but it's not clear to me whether gcsfuse is the problem or whether Google Storage is erroring at a much higher rate than anticipated.
Hi, I have exactly the same issue now. When I save a model trained on my instance, the Input/Output Error often occurs at random. Is this still unresolved?
In #288, multiple reports suggest that interrupts from an unknown source sometimes cancel the operations running in gcsfuse. While we currently cannot identify those interrupts, it is possible to return a more meaningful error code, ECANCELED, instead of the ambiguous EIO, making debugging much easier.
Is there any way to ignore these interrupt signals? We have a use case where we need to ensure the file is uploaded to the bucket regardless of any interrupts.
Hi there. I have exactly the same issue. I use Go 1.17.0 to build code that copies files from disk to a Google Cloud Storage bucket mounted with gcsfuse. The "operation canceled" error occurs often and at random (seemingly without reason). The process keeps running and is not killed.

Machine: GCP instance (no GPU); Ubuntu 18.04; kernel version 5.4.0-1040-gcp

Context logs:
{"level":"ERROR","ts":"2021-08-25T01:26:57.370+0800","logger":"migrator","caller":"migrator.go:65","msg":"transfer files failed","pid":1,"err":"open xxx: operation canceled"}
{"level":"ERROR","ts":"2021-08-25T01:27:08.225+0800","logger":"migrator","caller":"migrator.go:65","msg":"transfer files failed","pid":1,"err":"open xxx: operation canceled"}
Also seeing a lot of "operation canceled" I/O errors. Go 1.16 with a Google Cloud Run container.
I also see this while writing TensorBoard logs to GCS while training on Cloud AI Platform.
This is the same as #1016.
Update: We've addressed this issue in gcsfuse v2.3 (released June 28, 2024) by changing the default behavior to ignore interrupts during file system operations. This should prevent the problem described here. Please upgrade to the latest version and let us know if you continue to experience any difficulties.
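A minimal sketch of checking for and opting into this behavior. The bucket name and mount point below are placeholders, and the exact flag name is an assumption on my part; verify it against `gcsfuse --help` for your installed version.

```shell
# Confirm the installed version is v2.3 or newer, where interrupts are
# reportedly ignored by default:
gcsfuse --version

# On versions where it is not the default, the interrupt-ignoring behavior
# may need to be enabled explicitly when mounting (flag name assumed):
gcsfuse --ignore-interrupts my-bucket /mnt/my-bucket
```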
I have spun up a new GPU instance on Google Cloud that I am using to train my model. The instance reads the data from the persistent disk and writes the checkpoints and log files into a Google Cloud Storage bucket that is mounted with gcsfuse. Unfortunately, I keep getting an I/O error (see below).
After looking at the output from
gcsfuse --foreground --debug_gcs --debug_fuse
it seems like the problem is an interrupt coming from somewhere. To be honest, I have no idea where this interrupt might come from. If I run the training script on a local directory (not a bucket mounted with gcsfuse), everything works fine. Any help would be greatly appreciated :)