Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IO Error: interrupt out of nowhere? #288

Closed
gorjanradevski opened this issue Jul 23, 2018 · 15 comments
Closed

IO Error: interrupt out of nowhere? #288

gorjanradevski opened this issue Jul 23, 2018 · 15 comments
Assignees
Labels
bug Error or flaw in the code with unintended result p2 P2

Comments

@gorjanradevski
Copy link

I have spun up a new GPU instance on Google Cloud that I am using to train my model. The instance reads the data from the persistent disk and logs the checkpoints and log files into a Google Cloud storage bucket that is mounted with gcsfuse. Unfortunately I keep getting an IO error (see below)

INFO:training.training:Found new best! Metric: 0.39190895331317616
INFO:training.training:Saving checkpoint at global step 176
Traceback (most recent call last):
  File "src/train_script.py", line 118, in <module>
    main()
  File "src/train_script.py", line 112, in main
    args.train_path, args.val_path, logdir, args.checkpoint, hparams
  File "/home/gorjanradevski/prosiebensat1/src/training/training.py", line 228, in train_and_evaluate
    checkpoint=logdir,
  File "/home/gorjanradevski/prosiebensat1/src/training/utils.py", line 51, in save_checkpoint
    torch.save(state, filepath)
  File "/home/gorjanradevski/.local/share/virtualenvs/prosiebensat1-YKpXIpQK/lib/python3.6/site-packages/torch/serialization.py", line 161, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/gorjanradevski/.local/share/virtualenvs/prosiebensat1-YKpXIpQK/lib/python3.6/site-packages/torch/serialization.py", line 116, in _with_file_like
    f = open(f, mode)
OSError: [Errno 5] Input/output error: 'log2/logging4/last.pth.tar'

After looking at the output from gcsfuse --foreground --debug_gcs --debug_fuse it seems like the problem is an interrupt coming from somewhere. TBH I have no idea where this interrupt might come from. If I run the train script on a local directory (not a bucket mounted with gcsfuse) everything works fine.

gcs: Req             0x2e: <- StatObject("logging4/last.pth.tar/")
gcs: Req             0x2f: <- StatObject("logging4/last.pth.tar")
gcs: Req             0x2d: -> Read("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs", <nil>) (84.454985ms): OK
fuse_debug: Op 0x0000005b        connection.go:479] -> OK ()
fuse_debug: Op 0x0000005c        connection.go:479] -> OK ()
fuse_debug: Op 0x00000060        connection.go:396] <- FlushFile (inode 7)
fuse_debug: Op 0x00000061        connection.go:396] <- WriteFile (inode 7, handle 3, offset 8192, 3701 bytes)
fuse_debug: Op 0x00000062        connection.go:396] <- SetInodeAttributes (inode 7, mtime 2018-07-23 14:41:01.192127484 +0000 UTC)
gcs: Req             0x30: <- CreateObject("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs")
gcs: Req             0x2f: -> StatObject("logging4/last.pth.tar") (182.243631ms): gcs.NotFoundError: googleapi: Error 404: No such object: luminovo-prosiebensat1-logs/logging4/last.pth.tar, notFound
gcs: Req             0x2e: -> StatObject("logging4/last.pth.tar/") (182.952289ms): gcs.NotFoundError: googleapi: Error 404: No such object: luminovo-prosiebensat1-logs/logging4/last.pth.tar/, notFound
fuse_debug: Op 0x0000005f        connection.go:481] -> Error: "no such file or directory"
fuse_debug: Op 0x00000063        connection.go:396] <- CreateFile (parent 5, name "last.pth.tar")
gcs: Req             0x31: <- CreateObject("logging4/last.pth.tar")
gcs: Req             0x30: -> CreateObject("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs") (265.8374ms): OK
fuse_debug: Op 0x00000060        connection.go:479] -> OK ()
gcs: Req             0x32: <- Read("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs", <nil>)
fuse_debug: Op 0x00000064        connection.go:396] <- interrupt (fuseid 0x00000063)
gcs: Req             0x31: -> CreateObject("logging4/last.pth.tar") (158.469322ms): Put https://www.googleapis.com/upload/storage/v1/b/luminovo-prosiebensat1-logs/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UplIquptC6zgoZrfSBzQNAwsBZZSu8fOXIK0wR9ZN_U4eyRgSRORQLQivVhuk1PGiyUaVfW4HiF2mvWpg0Tg1h1zn2swh8ukbuwNfDKKZjj0ywqksk: net/http: request canceled
fuse_debug: Op 0x00000063        connection.go:481] -> Error: "CreateChildFile: Put https://www.googleapis.com/upload/storage/v1/b/luminovo-prosiebensat1-logs/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UplIquptC6zgoZrfSBzQNAwsBZZSu8fOXIK0wR9ZN_U4eyRgSRORQLQivVhuk1PGiyUaVfW4HiF2mvWpg0Tg1h1zn2swh8ukbuwNfDKKZjj0ywqksk: net/http: request canceled"
fuse: 2018/07/23 14:41:01.543811 *fuseops.CreateFileOp error: CreateChildFile: Put https://www.googleapis.com/upload/storage/v1/b/luminovo-prosiebensat1-logs/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UplIquptC6zgoZrfSBzQNAwsBZZSu8fOXIK0wR9ZN_U4eyRgSRORQLQivVhuk1PGiyUaVfW4HiF2mvWpg0Tg1h1zn2swh8ukbuwNfDKKZjj0ywqksk: net/http: request canceled
gcs: Req             0x32: -> Read("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs", <nil>) (95.750685ms): OK
fuse_debug: Op 0x00000061        connection.go:479] -> OK ()
fuse_debug: Op 0x00000062        connection.go:479] -> OK ()

Any help would be greatly appreciated :)

@jacobsa
Copy link
Contributor

jacobsa commented Jul 24, 2018 via email

@gorjanradevski
Copy link
Author

futex(0xabaeb0, FUTEX_WAIT, 0, NULLgcs: Req 0x13: -> CreateObject("debugtrain2-model:resnet18-pretrained:False--lr:0.01-CyclicalLR-wd:0.0001-F0E/events.out.tfevents.1532684103.dl-instance2") (270.688859ms): OK fuse_debug: Op 0x0000002b connection.go:479] -> OK () ) = 0 gcs: Req 0x15: <- Read("debugtrain2-model:resnet18-pretrained:False--lr:0.01-CyclicalLR-wd:0.0001-F0E/events.out.tfevents.1532684103.dl-instance2", <nil>) write(1, "fuse_debug: Op 0x0000002f "..., 85fuse_debug: Op 0x0000002f connection.go:396] <- interrupt (fuseid 0x0000002e) ) = 85 read(10, gcs: Req 0x14: -> CreateObject("debugtrain2-model:resnet18-pretrained:False--lr:0.01-CyclicalLR-wd:0.0001-F0E/last.pth.tar") (196.018391ms): Put https://www.googleapis.com/upload/storage/v1/b/tryout_fuse2/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UpCfQtZa2TElSOD2HZP9gPpgiqtbvr2Q35kYpVxt2Z9SGwxoJE1tsZfIYIrVsX7THbzXFmNT0sKRdJXb-5nqQsgbqisik8moAfbb1z4I87qolM3i_A: net/http: request canceled

This is what I get when I run strace gcsfuse --foreground --debug_gcs --debug_fuse <bucket name> <directory name>

However, this tells me nothing about the error. I still have no idea from where it is coming from.

@jacobsa
Copy link
Contributor

jacobsa commented Jul 28, 2018 via email

@gorjanradevski
Copy link
Author

Is this more helpful?

fuse: 2018/07/29 17:06:39.363295 *fuseops.CreateFileOp error: CreateChildFile: Put https://www.googleapis.com/upload/storage/v1/b/tryout-fuse/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UrVGRw3Kr2L4mnAAICtSx1ysEf8f3PdAFxDjtco2qUoMqBXjuHxz0B59nOKuVRaXZ4Ie4A7eovcDhZegh5FInrAcinaGptZQYffOwaVgECYZ7bhAuY: net/http: request canceled futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable) futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = 0 futex(0xab9f00, FUTEX_WAKE, 1) = 1 futex(0xc42011e490, FUTEX_WAKE, 1) = 1 futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=805024024}) = 0 futex(0xc42002f290, FUTEX_WAKE, 1) = 1 futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=999897465}) = 0 futex(0xaba498, FUTEX_WAKE, 1) = 1 futex(0xc42002f290, FUTEX_WAKE, 1) = 1 futex(0xc42011e490, FUTEX_WAKE, 1) = 1 futex(0xc42011e490, FUTEX_WAKE, 1) = 1 futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=999020506}) = 0 futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable) epoll_wait(4, [], 128, 0) = 0 futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = 0 epoll_wait(4, [{EPOLLOUT, {u32=1111547376, u64=140669880425968}}], 128, 0) = 1 futex(0xc42002f290, FUTEX_WAKE, 1) = 1 futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=999896692}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal) --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} --- rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=77, tv_nsec=879846836}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal) --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} --- rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=65, tv_nsec=651538671}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal) --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} --- rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=30, tv_nsec=114080962}) = -1 ETIMEDOUT (Connection timed out) futex(0xaba498, FUTEX_WAKE, 1) = 1 futex(0xaba3d0, FUTEX_WAKE, 1) = 1 futex(0xc42002ef10, FUTEX_WAKE, 1) = 1 futex(0xc42011e490, FUTEX_WAKE, 1) = 1 futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=442, tv_nsec=287940936}

I get this log when running strace gcsfuse --foreground <bucket_name> <directory_name>

@jacobsa
Copy link
Contributor

jacobsa commented Jul 29, 2018 via email

@ykim104
Copy link

ykim104 commented Oct 12, 2018

Hi! Do you have an update to this problem?

I have a same issue. I give my instance full storage access and make sure I give my service account full access just in case as well. I still get an interrupt in middle of training usually after one or two epochs. It seems like the permission changes randomly over time and my VM instances get locked out. So I’m unable to read images from the bucket. I’m not sure why.

@carbocation
Copy link

I "intermittently but frequently" get the same issue (i/o error despite having full GCP permissions) when accessing data mounted by gcsfuse. I've opened an issue on the gcsfuse project, but it's not clear to me whether gcsfuse is the problem or whether Google Storage is erroring at a much higher rate than anticipated.

@fregataa
Copy link

hi, I have the exactly same issue now. when I save a model trained in my instance, the Input/Output Error occurs often randomly. is this still unresolved??

lezh added a commit that referenced this issue Jun 7, 2021
In #288, multiple reports suggest that sometimes interrupts from unknown
source would cancel the ops running in Gcsfuse. While we currently
cannot identify those interrupts, it's possible to return a more
meaningful error code, ECANCELED, instead of ambiguous EIO, making debug
much easier.
@nsht
Copy link

nsht commented Jul 14, 2021

Is there any way to ignore these interrupt signals? We have a use case where we need to ensure the file is uploaded to the bucket regardless of any interrupts.

@kunsonx
Copy link

kunsonx commented Aug 24, 2021

Hi there. I have the exactly same issue also.

Use go v.1.17.0 build code with copy files from disk to Google Storage bucker mount by gcsfuse.

The "operation canceled" occurs often and randomly... (seems no reason?!). Process still running not kill.

machine: GCP no GPU instance; ubuntu 18.04;

kernel version:

5.4.0-1040-gcp

context logs:

{"level":"ERROR","ts":"2021-08-25T01:26:57.370+0800","logger":"migrator","caller":"migrator.go:65","msg":"transfer files failed","pid":1,"err":"open xxx: operation canceled"}
{"level":"ERROR","ts":"2021-08-25T01:27:08.225+0800","logger":"migrator","caller":"migrator.go:65","msg":"transfer files failed","pid":1,"err":"open xxx: operation canceled"}

@dwilliams782
Copy link

dwilliams782 commented Dec 29, 2021

Also seeing a lot of operation canceled IO errors.



>Op 0x0000008f connection.go:416] <- interrupt (fuseid 0x0000008e)

>Not retrying Read(".atlantis/atlantis.db", 1640797111622149) after error of type *url.Error ("Get \"https://www.googleapis.com:443/download/storage/v1/b/<my-bucket>/o/.atlantis%2Fatlantis.db?alt=media&generation=1640797111622149\": net/http: request canceled"): &url.Error{Op:"Get", URL:"https://www.googleapis.com:443/download/storage/v1/b/runatlantis-playground/o/.atlantis%2Fatlantis.db?alt=media&generation=1640797111622149", Err:(*errors.errorString)(0xc0000433b0)}" 

>ReadFile: operation canceled, fh.reader.ReadAt: readFull: Get "https://www.googleapis.com:443/download/storage/v1/b/<my-bucket>/o/.atlantis%2Fatlantis.db?alt=media&generation=1640797111622149": net/http: request canceled

@avidullu avidullu added the bug Error or flaw in the code with unintended result label Apr 14, 2022
@flufy3d
Copy link

flufy3d commented May 3, 2022

Also seeing a lot of operation canceled IO errors.

go1.16 with Google Cloud Run Container

@Darktex
Copy link

Darktex commented Jun 16, 2022

I also see this while writing TensorBoard logs on GCS while training on Cloud AI Platform:

Exception in thread Thread-10:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self._record_writer.write(data)
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/compat/tensorflow_stub/io/gfile.py", line 766, in write
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/compat/tensorflow_stub/io/gfile.py", line 160, in append
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/compat/tensorflow_stub/io/gfile.py", line 166, in _write
    f.write(compatify(file_content))
OSError: [Errno 5] Input/output error

@sethiay sethiay added known-issues Issues that are known or not supported yet. and removed bug Error or flaw in the code with unintended result labels May 31, 2023
@sethiay
Copy link
Collaborator

sethiay commented Jun 6, 2023

This is same as #1016.

@sethiay sethiay added bug Error or flaw in the code with unintended result and removed known-issues Issues that are known or not supported yet. labels Jun 6, 2023
@vadlakondaswetha vadlakondaswetha added the p2 P2 label Jun 8, 2023
@ashmeenkaur ashmeenkaur self-assigned this Jul 31, 2024
@ashmeenkaur
Copy link
Collaborator

Update: We've addressed this issue in GCSfuse v2.3 (released June 28, 2024) by changing the default behavior to ignore interrupts during file system operations. This should prevent the problem described here. Please upgrade to the latest version and let us know if you continue to experience any difficulties.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Error or flaw in the code with unintended result p2 P2
Projects
None yet
Development

No branches or pull requests