IO Error: interrupt out of nowhere? #288

gorjanradevski · 2018-07-23T14:57:18Z

I have spun up a new GPU instance on Google Cloud that I am using to train my model. The instance reads the data from the persistent disk and logs the checkpoints and log files into a Google Cloud storage bucket that is mounted with gcsfuse. Unfortunately I keep getting an IO error (see below)

INFO:training.training:Found new best! Metric: 0.39190895331317616
INFO:training.training:Saving checkpoint at global step 176
Traceback (most recent call last):
  File "src/train_script.py", line 118, in <module>
    main()
  File "src/train_script.py", line 112, in main
    args.train_path, args.val_path, logdir, args.checkpoint, hparams
  File "/home/gorjanradevski/prosiebensat1/src/training/training.py", line 228, in train_and_evaluate
    checkpoint=logdir,
  File "/home/gorjanradevski/prosiebensat1/src/training/utils.py", line 51, in save_checkpoint
    torch.save(state, filepath)
  File "/home/gorjanradevski/.local/share/virtualenvs/prosiebensat1-YKpXIpQK/lib/python3.6/site-packages/torch/serialization.py", line 161, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/gorjanradevski/.local/share/virtualenvs/prosiebensat1-YKpXIpQK/lib/python3.6/site-packages/torch/serialization.py", line 116, in _with_file_like
    f = open(f, mode)
OSError: [Errno 5] Input/output error: 'log2/logging4/last.pth.tar'

After looking at the output from gcsfuse --foreground --debug_gcs --debug_fuse it seems like the problem is an interrupt coming from somewhere. TBH I have no idea where this interrupt might come from. If I run the train script on a local directory (not a bucket mounted with gcsfuse) everything works fine.

gcs: Req             0x2e: <- StatObject("logging4/last.pth.tar/")
gcs: Req             0x2f: <- StatObject("logging4/last.pth.tar")
gcs: Req             0x2d: -> Read("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs", <nil>) (84.454985ms): OK
fuse_debug: Op 0x0000005b        connection.go:479] -> OK ()
fuse_debug: Op 0x0000005c        connection.go:479] -> OK ()
fuse_debug: Op 0x00000060        connection.go:396] <- FlushFile (inode 7)
fuse_debug: Op 0x00000061        connection.go:396] <- WriteFile (inode 7, handle 3, offset 8192, 3701 bytes)
fuse_debug: Op 0x00000062        connection.go:396] <- SetInodeAttributes (inode 7, mtime 2018-07-23 14:41:01.192127484 +0000 UTC)
gcs: Req             0x30: <- CreateObject("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs")
gcs: Req             0x2f: -> StatObject("logging4/last.pth.tar") (182.243631ms): gcs.NotFoundError: googleapi: Error 404: No such object: luminovo-prosiebensat1-logs/logging4/last.pth.tar, notFound
gcs: Req             0x2e: -> StatObject("logging4/last.pth.tar/") (182.952289ms): gcs.NotFoundError: googleapi: Error 404: No such object: luminovo-prosiebensat1-logs/logging4/last.pth.tar/, notFound
fuse_debug: Op 0x0000005f        connection.go:481] -> Error: "no such file or directory"
fuse_debug: Op 0x00000063        connection.go:396] <- CreateFile (parent 5, name "last.pth.tar")
gcs: Req             0x31: <- CreateObject("logging4/last.pth.tar")
gcs: Req             0x30: -> CreateObject("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs") (265.8374ms): OK
fuse_debug: Op 0x00000060        connection.go:479] -> OK ()
gcs: Req             0x32: <- Read("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs", <nil>)
fuse_debug: Op 0x00000064        connection.go:396] <- interrupt (fuseid 0x00000063)
gcs: Req             0x31: -> CreateObject("logging4/last.pth.tar") (158.469322ms): Put https://www.googleapis.com/upload/storage/v1/b/luminovo-prosiebensat1-logs/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UplIquptC6zgoZrfSBzQNAwsBZZSu8fOXIK0wR9ZN_U4eyRgSRORQLQivVhuk1PGiyUaVfW4HiF2mvWpg0Tg1h1zn2swh8ukbuwNfDKKZjj0ywqksk: net/http: request canceled
fuse_debug: Op 0x00000063        connection.go:481] -> Error: "CreateChildFile: Put https://www.googleapis.com/upload/storage/v1/b/luminovo-prosiebensat1-logs/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UplIquptC6zgoZrfSBzQNAwsBZZSu8fOXIK0wR9ZN_U4eyRgSRORQLQivVhuk1PGiyUaVfW4HiF2mvWpg0Tg1h1zn2swh8ukbuwNfDKKZjj0ywqksk: net/http: request canceled"
fuse: 2018/07/23 14:41:01.543811 *fuseops.CreateFileOp error: CreateChildFile: Put https://www.googleapis.com/upload/storage/v1/b/luminovo-prosiebensat1-logs/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UplIquptC6zgoZrfSBzQNAwsBZZSu8fOXIK0wR9ZN_U4eyRgSRORQLQivVhuk1PGiyUaVfW4HiF2mvWpg0Tg1h1zn2swh8ukbuwNfDKKZjj0ywqksk: net/http: request canceled
gcs: Req             0x32: -> Read("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs", <nil>) (95.750685ms): OK
fuse_debug: Op 0x00000061        connection.go:479] -> OK ()
fuse_debug: Op 0x00000062        connection.go:479] -> OK ()

Any help would be greatly appreciated :)

The text was updated successfully, but these errors were encountered:

jacobsa · 2018-07-24T02:44:18Z

Maybe try running gcsfuse under strace to see if you can figure out the source of the interrupt?

gorjanradevski · 2018-07-27T13:43:11Z

futex(0xabaeb0, FUTEX_WAIT, 0, NULLgcs: Req 0x13: -> CreateObject("debugtrain2-model:resnet18-pretrained:False--lr:0.01-CyclicalLR-wd:0.0001-F0E/events.out.tfevents.1532684103.dl-instance2") (270.688859ms): OK fuse_debug: Op 0x0000002b connection.go:479] -> OK () ) = 0 gcs: Req 0x15: <- Read("debugtrain2-model:resnet18-pretrained:False--lr:0.01-CyclicalLR-wd:0.0001-F0E/events.out.tfevents.1532684103.dl-instance2", <nil>) write(1, "fuse_debug: Op 0x0000002f "..., 85fuse_debug: Op 0x0000002f connection.go:396] <- interrupt (fuseid 0x0000002e) ) = 85 read(10, gcs: Req 0x14: -> CreateObject("debugtrain2-model:resnet18-pretrained:False--lr:0.01-CyclicalLR-wd:0.0001-F0E/last.pth.tar") (196.018391ms): Put https://www.googleapis.com/upload/storage/v1/b/tryout_fuse2/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UpCfQtZa2TElSOD2HZP9gPpgiqtbvr2Q35kYpVxt2Z9SGwxoJE1tsZfIYIrVsX7THbzXFmNT0sKRdJXb-5nqQsgbqisik8moAfbb1z4I87qolM3i_A: net/http: request canceled

This is what I get when I run strace gcsfuse --foreground --debug_gcs --debug_fuse <bucket name> <directory name>

However, this tells me nothing about the error. I still have no idea from where it is coming from.

jacobsa · 2018-07-28T22:21:57Z

Can you run with strace but not the gcsfuse debugging flags, in order to get a trace that’s not all mucked up with intermingled logs?

gorjanradevski · 2018-07-29T17:10:47Z

Is this more helpful?

fuse: 2018/07/29 17:06:39.363295 *fuseops.CreateFileOp error: CreateChildFile: Put https://www.googleapis.com/upload/storage/v1/b/tryout-fuse/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UrVGRw3Kr2L4mnAAICtSx1ysEf8f3PdAFxDjtco2qUoMqBXjuHxz0B59nOKuVRaXZ4Ie4A7eovcDhZegh5FInrAcinaGptZQYffOwaVgECYZ7bhAuY: net/http: request canceled futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable) futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = 0 futex(0xab9f00, FUTEX_WAKE, 1) = 1 futex(0xc42011e490, FUTEX_WAKE, 1) = 1 futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=805024024}) = 0 futex(0xc42002f290, FUTEX_WAKE, 1) = 1 futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=999897465}) = 0 futex(0xaba498, FUTEX_WAKE, 1) = 1 futex(0xc42002f290, FUTEX_WAKE, 1) = 1 futex(0xc42011e490, FUTEX_WAKE, 1) = 1 futex(0xc42011e490, FUTEX_WAKE, 1) = 1 futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=999020506}) = 0 futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable) epoll_wait(4, [], 128, 0) = 0 futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = 0 epoll_wait(4, [{EPOLLOUT, {u32=1111547376, u64=140669880425968}}], 128, 0) = 1 futex(0xc42002f290, FUTEX_WAKE, 1) = 1 futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=999896692}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal) --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} --- rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=77, tv_nsec=879846836}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal) --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} --- rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=65, tv_nsec=651538671}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal) --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} --- rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=30, tv_nsec=114080962}) = -1 ETIMEDOUT (Connection timed out) futex(0xaba498, FUTEX_WAKE, 1) = 1 futex(0xaba3d0, FUTEX_WAKE, 1) = 1 futex(0xc42002ef10, FUTEX_WAKE, 1) = 1 futex(0xc42011e490, FUTEX_WAKE, 1) = 1 futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=442, tv_nsec=287940936}

I get this log when running strace gcsfuse --foreground <bucket_name> <directory_name>

jacobsa · 2018-07-29T21:00:17Z

You probably need to use the -f argument to strace, in order to trace all gcsfuse threads. We’re looking for information on the interrupt. Now that I think harder about it though, the interrupt might be in the caller of the file system, not gcsfuse. Perhaps you should strace it instead.

ykim104 · 2018-10-12T10:13:11Z

Hi! Do you have an update to this problem?

I have a same issue. I give my instance full storage access and make sure I give my service account full access just in case as well. I still get an interrupt in middle of training usually after one or two epochs. It seems like the permission changes randomly over time and my VM instances get locked out. So I’m unable to read images from the bucket. I’m not sure why.

carbocation · 2020-07-15T11:44:24Z

I "intermittently but frequently" get the same issue (i/o error despite having full GCP permissions) when accessing data mounted by gcsfuse. I've opened an issue on the gcsfuse project, but it's not clear to me whether gcsfuse is the problem or whether Google Storage is erroring at a much higher rate than anticipated.

fregataa · 2021-01-11T23:18:09Z

hi, I have the exactly same issue now. when I save a model trained in my instance, the Input/Output Error occurs often randomly. is this still unresolved??

In #288, multiple reports suggest that sometimes interrupts from unknown source would cancel the ops running in Gcsfuse. While we currently cannot identify those interrupts, it's possible to return a more meaningful error code, ECANCELED, instead of ambiguous EIO, making debug much easier.

nsht · 2021-07-14T08:21:14Z

Is there any way to ignore these interrupt signals? We have a use case where we need to ensure the file is uploaded to the bucket regardless of any interrupts.

kunsonx · 2021-08-24T17:33:45Z

Hi there. I have the exactly same issue also.

Use go v.1.17.0 build code with copy files from disk to Google Storage bucker mount by gcsfuse.

The "operation canceled" occurs often and randomly... (seems no reason?!). Process still running not kill.

machine: GCP no GPU instance; ubuntu 18.04;

kernel version:

5.4.0-1040-gcp

context logs:

{"level":"ERROR","ts":"2021-08-25T01:26:57.370+0800","logger":"migrator","caller":"migrator.go:65","msg":"transfer files failed","pid":1,"err":"open xxx: operation canceled"}
{"level":"ERROR","ts":"2021-08-25T01:27:08.225+0800","logger":"migrator","caller":"migrator.go:65","msg":"transfer files failed","pid":1,"err":"open xxx: operation canceled"}

dwilliams782 · 2021-12-29T16:33:41Z

Also seeing a lot of operation canceled IO errors.



>Op 0x0000008f connection.go:416] <- interrupt (fuseid 0x0000008e)

>Not retrying Read(".atlantis/atlantis.db", 1640797111622149) after error of type *url.Error ("Get \"https://www.googleapis.com:443/download/storage/v1/b/<my-bucket>/o/.atlantis%2Fatlantis.db?alt=media&generation=1640797111622149\": net/http: request canceled"): &url.Error{Op:"Get", URL:"https://www.googleapis.com:443/download/storage/v1/b/runatlantis-playground/o/.atlantis%2Fatlantis.db?alt=media&generation=1640797111622149", Err:(*errors.errorString)(0xc0000433b0)}" 

>ReadFile: operation canceled, fh.reader.ReadAt: readFull: Get "https://www.googleapis.com:443/download/storage/v1/b/<my-bucket>/o/.atlantis%2Fatlantis.db?alt=media&generation=1640797111622149": net/http: request canceled

flufy3d · 2022-05-03T16:40:23Z

Also seeing a lot of operation canceled IO errors.

go1.16 with Google Cloud Run Container

Darktex · 2022-06-16T00:18:58Z

I also see this while writing TensorBoard logs on GCS while training on Cloud AI Platform:

Exception in thread Thread-10:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self._record_writer.write(data)
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/compat/tensorflow_stub/io/gfile.py", line 766, in write
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/compat/tensorflow_stub/io/gfile.py", line 160, in append
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/compat/tensorflow_stub/io/gfile.py", line 166, in _write
    f.write(compatify(file_content))
OSError: [Errno 5] Input/output error

sethiay · 2023-06-06T11:02:02Z

This is same as #1016.

ashmeenkaur · 2024-07-31T05:58:40Z

Update: We've addressed this issue in GCSfuse v2.3 (released June 28, 2024) by changing the default behavior to ignore interrupts during file system operations. This should prevent the problem described here. Please upgrade to the latest version and let us know if you continue to experience any difficulties.

lezh mentioned this issue Oct 28, 2021

Go programs stored on GCS via gcsfuse crash at launch #562

Closed

avidullu added the bug Error or flaw in the code with unintended result label Apr 14, 2022

avidullu mentioned this issue Apr 14, 2022

fh.reader.ReadAt: readFull: context canceled #402

Closed

axolotlgeoff mentioned this issue May 26, 2022

Operations frequently canceled when accessing a gcsfuse mount with a Go application jacobsa/fuse#122

Open

vadlakondaswetha mentioned this issue Mar 20, 2023

interrupted system call error while cloning repository in gcsfuse mounted directory #1016

Closed

sethiay added known-issues Issues that are known or not supported yet. and removed bug Error or flaw in the code with unintended result labels May 31, 2023

sethiay added bug Error or flaw in the code with unintended result and removed known-issues Issues that are known or not supported yet. labels Jun 6, 2023

vadlakondaswetha added the p2 P2 label Jun 8, 2023

ashmeenkaur mentioned this issue Apr 23, 2024

[PR-2][Phase1 Implementation] Ignore GCSFuse interrupts #1860

Merged

ashmeenkaur self-assigned this Jul 31, 2024

ashmeenkaur closed this as completed Jul 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IO Error: interrupt out of nowhere? #288

IO Error: interrupt out of nowhere? #288

gorjanradevski commented Jul 23, 2018

jacobsa commented Jul 24, 2018 via email

gorjanradevski commented Jul 27, 2018

jacobsa commented Jul 28, 2018 via email

gorjanradevski commented Jul 29, 2018

jacobsa commented Jul 29, 2018 via email

ykim104 commented Oct 12, 2018

carbocation commented Jul 15, 2020

fregataa commented Jan 11, 2021

nsht commented Jul 14, 2021

kunsonx commented Aug 24, 2021 •

edited

Loading

dwilliams782 commented Dec 29, 2021 •

edited

Loading

flufy3d commented May 3, 2022

Darktex commented Jun 16, 2022

sethiay commented Jun 6, 2023

ashmeenkaur commented Jul 31, 2024

IO Error: interrupt out of nowhere? #288

IO Error: interrupt out of nowhere? #288

Comments

gorjanradevski commented Jul 23, 2018

jacobsa commented Jul 24, 2018 via email

gorjanradevski commented Jul 27, 2018

jacobsa commented Jul 28, 2018 via email

gorjanradevski commented Jul 29, 2018

jacobsa commented Jul 29, 2018 via email

ykim104 commented Oct 12, 2018

carbocation commented Jul 15, 2020

fregataa commented Jan 11, 2021

nsht commented Jul 14, 2021

kunsonx commented Aug 24, 2021 • edited Loading

dwilliams782 commented Dec 29, 2021 • edited Loading

flufy3d commented May 3, 2022

Darktex commented Jun 16, 2022

sethiay commented Jun 6, 2023

ashmeenkaur commented Jul 31, 2024

kunsonx commented Aug 24, 2021 •

edited

Loading

dwilliams782 commented Dec 29, 2021 •

edited

Loading