S3 filesystem pure virtual method called; terminate called without an active exception #1912

rivershah · 2024-01-01T18:13:56Z

I am getting a core dump during interpreter teardown when using the S3 filesystem. Could I please get guidance on how to handle this issue? Below is a script to reproduce it inside Docker:

FROM tensorflow/tensorflow:2.14.0-gpu

The following environment variables are set:

"AWS_ACCESS_KEY_ID": xxx,
"AWS_SECRET_ACCESS_KEY": xxx,
"AWS_ENDPOINT_URL_S3": xxx,
"AWS_REGION": "us-east-1",
"S3_USE_HTTPS": "1",
"S3_VERIFY_SSL": "1",
"S3_DISABLE_MULTI_PART_DOWNLOAD": "0",
"S3_ENDPOINT": xxx,
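Before running the repro it can help to confirm that the variables listed above are actually visible to the Python process. The following is a minimal sketch; `S3_ENV_VARS` and `missing_env_vars` are hypothetical names introduced here for illustration and are not part of TensorFlow or tensorflow-io.

```python
import os
from typing import Iterable, Mapping

# Variable names copied from the list above; the values are deployment-specific.
S3_ENV_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ENDPOINT_URL_S3",
    "AWS_REGION",
    "S3_USE_HTTPS",
    "S3_VERIFY_SSL",
    "S3_DISABLE_MULTI_PART_DOWNLOAD",
    "S3_ENDPOINT",
)


def missing_env_vars(
    names: Iterable[str] = S3_ENV_VARS,
    environ: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return the names that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    print("missing:", missing if missing else "none")
```

The repro script additionally reads `CLOUD_MOUNT`, so that name can be passed in as well if desired.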

import os

import tensorflow as tf
import tensorflow_io as tfio

def illustrate_core_dump():
    print(f"tf version: {tf.__version__}")
    print(f"tfio version: {tfio.__version__}")
    filename = f"{os.environ['CLOUD_MOUNT']}/tmp/test_tfrecord.tfrecord"
    assert filename.startswith("s3://"), "problem appears to be for s3 filesystem only"
    ds = tf.data.TFRecordDataset(filename, "GZIP")

    for i in ds:
        print(f"i.shape: {i.shape}")


if __name__ == "__main__":
    illustrate_core_dump()
    print("reaches here successfully")
    print("something broken during destruction and tf")

    # during interpreter teardown if s3 filesystem used we will get
    # pure virtual method called
    # terminate called without an active exception
    # Aborted (core dumped)

    # gs:// and file://, which don't rely on tfio, do not exhibit this issue

TF_CPP_MIN_LOG_LEVEL=0 python notebooks/illustrate_core_dump.py 
2024-01-01 18:07:11.253238: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-01 18:07:11.253287: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-01 18:07:11.253323: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-01 18:07:11.262384: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
tf version: 2.14.0
tfio version: 0.35.0
2024-01-01 18:07:14.402239: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.413303: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.416545: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.421598: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.423868: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.426098: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.494277: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.496519: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.498484: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.500342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13589 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
i.shape: ()
reaches here successfully
something broken during destruction and tf
pure virtual method called
terminate called without an active exception
Aborted (core dumped)
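One possible stopgap, not confirmed by anyone in this thread, is to skip interpreter teardown entirely once the real work is done. The sketch below assumes the abort is raised while native destructors or exit handlers run at shutdown; `exit_skipping_teardown` is a hypothetical helper name. Note that `os._exit` also skips `atexit` handlers and does not flush open file objects, so anything that must be persisted has to be written and closed first.

```python
import os
import sys


def exit_skipping_teardown(code: int = 0) -> None:
    """Flush stdio, then end the process without normal interpreter shutdown."""
    sys.stdout.flush()
    sys.stderr.flush()
    # os._exit() terminates immediately: no atexit handlers, no module cleanup.
    os._exit(code)


# Intended use (assumption: called as the very last statement of the script):
#
#     illustrate_core_dump()
#     exit_skipping_teardown(0)
```

This hides the symptom rather than fixing the cause, so it is only a way to keep jobs from reporting a failing exit status while the underlying bug is open.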

rivershah · 2024-01-04T15:26:09Z

tensorflow-io==0.34.0 # works
tensorflow-io==0.35.0 # crashing

Can we please verify why the latest .so is exhibiting this issue? Thank you.

jpambrun · 2024-01-29T21:59:47Z

I had the same issue and it was driving me insane. I have some unrelated custom C++ ops and wasted a day digging into those. I am using S3, and going back to 0.34.0 fixed it.

saimidu · 2024-02-06T00:21:56Z

Facing the same issue, but with tensorflow==2.13 and tensorflow-io==0.34.0 (and also with tensorflow-io==0.35.0). There is no straightforward root cause, and reverting to tensorflow-io==0.33.0 fixes it.

I've also hit the same error with tensorflow==2.14 and tensorflow-io==0.35.0, which is the only version that supports TF 2.14 per the compatibility chart in README.md. Reverting to tensorflow-io==0.33.0 seems to fix it there too.
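Since the reports so far depend on the exact version pair, it may help to print what is installed and compare it with what commenters have reported. This is a minimal sketch: the `REPORTED` table only restates the comments above (it is not an official compatibility list), and `installed_version` and `reports_for` are hypothetical helper names.

```python
from importlib import metadata
from typing import Optional

# Outcomes as reported by commenters up to this point in the thread.
REPORTED = {
    "0.33.0": ["no crash (saimidu, TF 2.13 and TF 2.14)"],
    "0.34.0": ["no crash (rivershah, jpambrun)", "crash (saimidu, TF 2.13)"],
    "0.35.0": ["crash (rivershah, TF 2.14.0)", "crash (saimidu, TF 2.13 and TF 2.14)"],
}


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def reports_for(version: Optional[str]) -> list[str]:
    """Look up what this thread reported for a tensorflow-io version."""
    return REPORTED.get(version or "", [])


if __name__ == "__main__":
    for dist in ("tensorflow", "tensorflow-io"):
        print(dist, installed_version(dist))
    for line in reports_for(installed_version("tensorflow-io")):
        print("thread report:", line)
```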

saimidu · 2024-02-12T23:32:05Z

As an update, I followed the build instructions for tensorflow-io (Ubuntu 22.04 and then Python Wheels), and discovered that this particular "pure virtual method called" error does not occur when I use a locally built wheel for tensorflow-io.

Note: The link in the docker build instructions is broken - https://github.com/tensorflow/io/blob/master/docs/development.md#docker - and the latest image in tfsigio/tfio is about 2 years old.

rivershah · 2024-02-13T06:45:09Z

@saimidu Is there any chance you can please post the steps you took to build? I tried to build but was thwarted by the issues you mentioned.

saimidu · 2024-02-13T23:35:50Z

@rivershah I pulled the ubuntu:22.04 image from Docker Hub:

docker run --name tfio_builder -itd ubuntu:22.04 bash
docker exec -it tfio_builder bash

and installed all the packages and bazel as instructed in https://github.com/tensorflow/io/blob/master/docs/development.md#ubuntu-2204 (without the sudo)

apt-get -y -qq update
apt-get -y -qq install gcc g++ git unzip curl python3-pip python-is-python3 libntirpc-dev
curl -sSOL https://github.com/bazelbuild/bazelisk/releases/download/v1.11.0/bazelisk-linux-amd64
mv bazelisk-linux-amd64 /usr/local/bin/bazel
chmod +x /usr/local/bin/bazel

python3 --version  # made sure I had python version>=3.9
python3 -m pip install -U pip
git clone https://github.com/tensorflow/io
cd io/
git checkout v0.35.0
pip install "tensorflow==2.14.1"
./configure.sh
export TF_PYTHON_VERSION=3.10
bazel build -s --verbose_failures --copt="-Wno-error=array-parameter=" --copt="-I/usr/include/tirpc" //tensorflow_io/... //tensorflow_io_gcs_filesystem/...

I then followed the instructions at https://github.com/tensorflow/io/blob/master/docs/development.md#python-wheels:

python3 setup.py bdist_wheel --data bazel-bin

Then, within the same container, I was able to validate tf-io's S3 filesystem functionality by trying to checkpoint a model to S3.

I'll need to do some additional work to reproduce the failure I got when copying the generated tf-io wheel out into a different container, since I've terminated all of that setup now.

rivershah · 2024-04-05T04:47:57Z

Bumping this issue. It needs looking at to ensure the build process is handled correctly.

rivershah · 2024-05-14T11:46:57Z

This problem persists in tensorflow-io==0.37.0. Please fix; this is rendering S3-based I/O unusable without resorting to old versions.

skye · 2024-05-24T15:00:00Z

@yongtang would you be able to help here? Sounds like this is a pretty serious issue, so it would be much appreciated!!

ruomingp · 2024-05-29T19:23:31Z

This is blocking us from upgrading the tensorstore version. A quick fix will be much appreciated!

CecileRobertMichon · 2024-05-30T21:22:24Z

+1, also running into this issue

yarri-oss · 2024-05-30T23:00:59Z

> @yongtang would you be able to help here? Sounds like this is a pretty serious issue, so it would be much appreciated!!

@yongtang per the comment #1912 (comment) above, assuming my PR #2005 passes, can you please consider a minor release (0.37.1 maybe?) to address the S3 issues discussed above? Thanks!

rivershah · 2024-06-28T09:37:58Z

@yongtang Thanks for the fix. In the interest of us being able to upgrade tensorflow, can you please do a 0.37.1 release as per @yarri-oss's request? Thanks again.

spolloni · 2024-07-02T13:27:32Z

I am still seeing

pure virtual method called
terminate called without an active exception

on 0.37.1. Anyone else?

spolloni · 2024-07-03T01:10:55Z

@yarri-oss how is #2005 supposed to fix this issue?

rivershah · 2024-07-03T15:06:42Z

The problem still persists. It is reproducible with the script above:

pure virtual method called
terminate called without an active exception
Aborted (core dumped)

spolloni · 2024-07-03T22:20:43Z

cc @yongtang -- can we reopen the issue?

yarri-oss · 2024-07-03T22:47:24Z

End users have confirmed this issue fixed.

@spolloni If you can post a repro (with S3 bucket blob) we can investigate further. I would prefer a new issue be opened against your specific repro tho.

spolloni · 2024-07-04T04:23:25Z

> End users have confirmed this issue fixed.

Which users?

> If you can post a repro

The repro has not changed; it is the one posted here: #1912 (comment)

rivershah · 2024-07-04T13:12:18Z

Which users? I posted the issue and it repros as per above

txchen · 2024-09-10T17:59:38Z

We are still having this issue with tensorflow-io 0.37.1, please help to reopen this issue. @yarri-oss

import tensorflow as tf
import tensorflow_io as tfio

tf.io.gfile.glob("s3://mybucket/dir")

my-server ~ > python test_tf.py
2024-09-10 17:55:34.571626: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-09-10 17:55:34.756820: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-09-10 17:55:35.673576: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-10 17:55:35.673613: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-10 17:55:35.679356: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-10 17:55:36.191044: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-09-10 17:55:36.193843: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-10 17:55:40.594805: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
pure virtual method called
terminate called without an active exception
zsh: IOT instruction  python test_tf.py

my-server ~ > echo $?
134
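The exit status 134 and zsh's "IOT instruction" message both point to the same thing: shells report a process killed by signal N as 128 + N, and on Linux signal 6 is SIGABRT (SIGIOT is a historical alias for it). The sketch below decodes such a status; `signal_from_status` is a hypothetical helper name, and it also accepts the negative return codes that Python's `subprocess` module uses for signal deaths.

```python
import signal
from typing import Optional


def signal_from_status(status: int) -> Optional[signal.Signals]:
    """Map a shell exit status (128 + N) or a subprocess returncode (-N) to a signal."""
    if status < 0:
        signum = -status          # subprocess convention: -N means killed by signal N
    elif status > 128:
        signum = status - 128     # shell convention: 128 + N
    else:
        return None               # ordinary exit code, not a signal
    try:
        return signal.Signals(signum)
    except ValueError:
        return None               # number does not correspond to a known signal


if __name__ == "__main__":
    sig = signal_from_status(134)
    print(sig.name if sig else None)
```

A wrapper that launches the training or data job could use this to tell an abort at teardown apart from an ordinary non-zero exit.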

markblee mentioned this issue May 23, 2024

Revert "Bump tensorflow to 2.16.1 and tensorstore to >= 0.1.56." apple/axlearn#484

Merged

yarri-oss added a commit to yarri-oss/io that referenced this issue May 30, 2024

Update build.yml

74ba9c5

Bump Ubuntu version for Linux Wheel to address issue tensorflow#1912

yarri-oss mentioned this issue May 30, 2024

Update build.yml #2005

Merged

yongtang pushed a commit that referenced this issue Jun 18, 2024

Update build.yml (#2005)

21bde2c

Bump Ubuntu version for Linux Wheel to address issue #1912

yongtang mentioned this issue Jul 1, 2024

Bump to 0.37.1 #2023

Merged

yongtang closed this as completed in #2023 Jul 1, 2024

shantanutrip mentioned this issue Jul 18, 2024

TF2.16.2 with tensorflow-io 0.37.1 shows pure virtual method called, terminate called without an active exception causing exit code 134 when working with s3 filesystem #2039

Open

teskobif7 added a commit to teskobif7/io that referenced this issue Aug 14, 2024

Update build.yml (#2005)

cc79ab6

Bump Ubuntu version for Linux Wheel to address issue tensorflow/io#1912
