[Bug]: Timescale segfaults when backfilling data #6540

iliastsa · 2024-01-18T12:59:17Z

What type of bug is this?

Crash

What subsystems and features are affected?

Data ingestion

What happened?

When backfilling data into a hypertable, we get a segfault and the server goes into recovery mode. We've encountered this multiple times, on all TimescaleDB versions since we started using it (which was at v2.11.1 if I'm not mistaken).

We've also noticed that dropping old chunks before backfilling sometimes let the backfill progress further.

TimescaleDB version affected

2.13.1

PostgreSQL version used

15.5

What operating system did you use?

Ubuntu 22.04 LTS x64

What installation method did you use?

Deb/Apt

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

dmesg logs:

postgres[3937358]: segfault at 140 ip 00007f82f81d6661 sp 00007ffe2cdeb020 error 6 in timescaledb-tsl-2.13.1.so[7f82f819e000+80000]


postgres logs:

2024-01-18 11:04:10.174 UTC [3909348] LOG:  server process (PID 3937358) was terminated by signal 11: Segmentation fault
2024-01-18 11:04:10.174 UTC [3909348] DETAIL:  Failed process was running: copy "logs" (<list of columns>) from stdin
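
In the absence of a core dump, the dmesg line above can at least narrow down where the fault happened. The following is a minimal sketch, not part of the original report: it parses that line and subtracts the mapping start from the instruction pointer, giving an offset that a tool such as `addr2line` could resolve if debug symbols for the exact same build are installed. The function name `fault_offset` is hypothetical, and the sketch assumes the bracketed address is the start of the mapping that contains the instruction pointer, which is not necessarily the library's load base.

```python
import re

# The dmesg line quoted in the report, copied verbatim.
DMESG_LINE = (
    "postgres[3937358]: segfault at 140 ip 00007f82f81d6661 "
    "sp 00007ffe2cdeb020 error 6 in timescaledb-tsl-2.13.1.so[7f82f819e000+80000]"
)

# Layout of a Linux x86 segfault message:
#   segfault at <fault addr> ip <instr ptr> sp <stack ptr> error <code> in <object>[<start>+<size>]
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<lib>[^\[\s]+)\[(?P<start>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (object name, offset of the faulting instruction in its mapping)."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a recognised segfault line")
    ip = int(match["ip"], 16)
    start = int(match["start"], 16)
    size = int(match["size"], 16)
    offset = ip - start
    # Sanity check: the instruction pointer must fall inside the reported mapping.
    if not 0 <= offset < size:
        raise ValueError("instruction pointer lies outside the reported mapping")
    return match["lib"], offset


if __name__ == "__main__":
    lib, offset = fault_offset(DMESG_LINE)
    print(f"{lib} + {offset:#x}")
```

The fault address `140` is a small value, which is typical of dereferencing a field through a NULL or near-NULL pointer, but the offset alone does not confirm the cause. A gdb backtrace from a core dump remains the more reliable source.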

How can we reproduce the bug?

I haven't been able to reproduce this locally with a small dataset. I'll try and get a proof-of-concept going, but I suspect it has to do with some weird corruption issue that might be hard/impossible to reproduce.

konskov · 2024-01-19T07:31:35Z

hi @iliastsa, thank you for reaching out. Is it possible to share the schema and hypertable definition for the hypertable that you are getting the segfault with? Is that table compressed?

iliastsa · 2024-01-19T11:11:17Z

Sure, here is the DDL + TimescaleDB / compression settings:

create table logs (
    column1  bigint not null,
    column2  int not null,
    column3  int not null,

    column4  int not null,
    column5  int,
    column6  bool not null,

    column7  bytea not null,
    column8  bytea,
    column9  bytea,
    column10 bytea,
    column11 bytea,
    column12 bytea null,
    primary key (column1, column2, column3)
);

select create_hypertable('logs', 'column1', chunk_time_interval => 300000, create_default_indexes => false);

alter table logs set (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'column7',
  timescaledb.compress_orderby = 'column1 desc, column2 desc, column4 desc'
);

konskov · 2024-01-24T12:19:25Z

Hello @iliastsa, we are trying to reproduce the error by inserting, deleting and COPYing into compressed chunks, but unfortunately we do not have a reproduction case so far.
Have you been able to collect a coredump for this segfault? If so, would it be possible to get the stack trace from the coredump with gdb and share it with us? That could be very useful for debugging.

iliastsa · 2024-01-29T08:09:16Z

Yeah I've also tried to reproduce it locally with inserts/deletes/COPYs but can't get it to crash. I don't have a coredump, I'll try and get one when we encounter the crash again.

SystemParadox · 2024-05-16T10:19:28Z

We are suddenly having what seems to be the same issue. This system has been running for 2 months with no issue and suddenly started getting this:

2024-05-16 10:43:09.371 BST [1] LOG:  database system is ready to accept connections
2024-05-16 10:43:09.374 BST [796] LOG:  TimescaleDB background worker launcher connected to shared catalogs
2024-05-16 10:43:09.884 BST [804] ERROR:  duplicate key value violates unique constraint "1332_828_tag_history_pkey"
2024-05-16 10:43:09.884 BST [804] DETAIL:  Key (tag_id, "time")=(8e2e78ff-46ff-59eb-a00c-019384ecbf15, 2024-05-16 10:43:04.197+01) already exists.
2024-05-16 10:43:09.884 BST [804] CONTEXT:  COPY tag_history, line 2
2024-05-16 10:43:09.884 BST [804] STATEMENT:  COPY wd.tag_history FROM STDIN
2024-05-16 10:43:15.319 BST [1] LOG:  server process (PID 801) was terminated by signal 11: Segmentation fault
2024-05-16 10:43:15.319 BST [1] DETAIL:  Failed process was running: COPY wd.tag_history FROM STDIN
2024-05-16 10:43:15.319 BST [1] LOG:  terminating any other active server processes
2024-05-16 10:43:15.322 BST [1] LOG:  all server processes terminated; reinitializing
2024-05-16 10:43:15.573 BST [808] LOG:  database system was interrupted; last known up at 2024-05-16 10:43:09 BST
2024-05-16 10:43:15.573 BST [809] FATAL:  the database system is in recovery mode

The duplicate key error is semi-expected - the issue is it should not crash postgres!

CREATE TABLE wd.tag_history (
    time TIMESTAMPTZ NOT NULL,
    tag_id UUID NOT NULL,
    quality INT,
    value_int BIGINT,
    value_bool BOOLEAN,
    value_float DOUBLE PRECISION,
    value_str TEXT,
    PRIMARY KEY(tag_id, time)
);
SELECT create_hypertable('wd.tag_history', 'time',
    if_not_exists => true,
    chunk_time_interval => interval '1 day'
);
ALTER TABLE wd.tag_history SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'tag_id'
);

Backtrace from core dump:

#0  0x000055b34ab09481 in tts_buffer_heap_getsomeattrs ()
#1  0x000055b34ab0b31e in slot_getsomeattrs_int ()
#2  0x00007fd0954a9c70 in slot_getsomeattrs (attnum=1, slot=0x7fd0952daa88) at /usr/local/include/postgresql/server/executor/tuptable.h:344
#3  slot_getattr (isnull=<synthetic pointer>, attnum=1, slot=0x7fd0952daa88) at /usr/local/include/postgresql/server/executor/tuptable.h:387
#4  build_scankeys (num_scankeys=<synthetic pointer>, slot=0x7fd0952daa88, null_columns=0x7ffdde145a38, key_columns=<optimized out>, decompressor=..., hypertable_relid=<optimized out>, hypertable_id=<optimized out>)
    at /build/timescaledb/tsl/src/compression/compression.c:1829
#5  decompress_batches_for_insert (cis=<optimized out>, chunk=<optimized out>, slot=0x7fd0952daa88) at /build/timescaledb/tsl/src/compression/compression.c:1974
#6  0x00007fd095590a0e in ts_chunk_dispatch_get_chunk_insert_state (dispatch=0x7fd1921639d0, point=0x7fd0951fe908, slot=0x7fd0952daa88, on_chunk_changed=on_chunk_changed@entry=0x0, data=data@entry=0x0)
    at /build/timescaledb/src/nodes/chunk_dispatch/chunk_dispatch.c:172
#7  0x00007fd09555a630 in TSCopyMultiInsertBufferFlush (miinfo=miinfo@entry=0x7ffdde145da0, buffer=buffer@entry=0x7fd0952fa088) at /build/timescaledb/src/copy.c:324
#8  0x00007fd09555a978 in TSCopyMultiInsertInfoFlush (miinfo=0x7ffdde145da0, cur_cis=0x7fd0953ce378) at /build/timescaledb/src/copy.c:527
#9  0x00007fd09555b155 in copyfrom (ccstate=ccstate@entry=0x7fd192163968, range_table=<optimized out>, ht=ht@entry=0x7fd095519498, callback=<optimized out>, arg=arg@entry=0x7fd0953ff5e8, copycontext=<optimized out>)
    at /build/timescaledb/src/copy.c:1124
#10 0x00007fd09555bc21 in timescaledb_DoCopy (stmt=<optimized out>, queryString=<optimized out>, processed=<optimized out>, ht=<optimized out>) at /build/timescaledb/src/copy.c:1417
#11 0x00007fd09556ba69 in process_copy (args=0x7ffdde145f70) at /build/timescaledb/src/process_utility.c:673
#12 0x00007fd09556bed1 in process_ddl_command_start (args=0x7ffdde145f70) at /build/timescaledb/src/process_utility.c:4224
#13 timescaledb_ddl_command_start (pstmt=0x7fd19217f4b8, query_string=<optimized out>, readonly_tree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=<optimized out>, dest=<optimized out>, 
    completion_tag=<optimized out>) at /build/timescaledb/src/process_utility.c:4467
#14 0x000055b34ac75cdc in PortalRunUtility ()
#15 0x000055b34ac75dfb in PortalRunMulti ()
#16 0x000055b34ac7631b in PortalRun ()
#17 0x000055b34ac72532 in exec_simple_query ()
#18 0x000055b34ac74107 in PostgresMain ()
#19 0x000055b34abf064a in ServerLoop ()
#20 0x000055b34abf1493 in PostmasterMain ()
#21 0x000055b34a974346 in main ()

Postgres version: 14.9
Timescale version: 2.12.2 (based on docker image timescale/timescaledb:2.12.2-pg14)

SystemParadox · 2024-05-16T12:06:44Z

Ok it turns out in our case we were actually still on timescale 2.11.0. Running ALTER EXTENSION to actually update to 2.12.2 fixed this particular issue for us.

SystemParadox · 2024-05-16T13:14:23Z

...except that the issue has reappeared after upgrading to 2.14.2.

akuzm · 2024-05-17T11:55:25Z

Is the stack trace on 2.14.2 different? There was a similar problem that was fixed in 2.12, which is probably what you hit initially: #6117. But if it still fails on 2.14.2, maybe you're hitting something different now.

If you can put together a snippet of data on which it always reproduces, that would be perfect, because I was not able to reproduce it or figure out a possible cause after some experiments.

SystemParadox · 2024-05-19T16:03:54Z

Unfortunately this was all on a live system and I didn't manage to obtain a stack trace for 2.14.2.

@DiAifU

This release contains bug fixes since the 2.15.0 release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#6540 Segmentation fault when backfilling data with COPY into a compressed chunk
* timescale#6858 Before update trigger not working correctly
* timescale#6908 Fix gapfill with timezone behaviour around dst switches
* timescale#6911 Fix dropped chunk metadata removal in update script
* timescale#6940 Fix `pg_upgrade` failure by removing `regprocedure` from catalog table
* timescale#6957 Fix segfault in UNION queries with ordering on compressed chunks

**Thanks**
* @DiAifU, @kiddhombre and @intermittentnrg for reporting issues with gapfill and daylight saving time
* @edgarzamora for reporting issue with update triggers
* @hongquan for reporting an issue with the update script
* @iliastsa and @SystemParadox for reporting an issue with COPY into a compressed chunk

@DiAifU

This release contains performance improvements and bug fixes since the 2.15.0 release. Best practice is to upgrade at the next available opportunity.

**Migrating from self-hosted TimescaleDB v2.14.x and earlier**

After you run `ALTER EXTENSION`, you must run [this SQL script](https://github.com/timescale/timescaledb-extras/blob/master/utils/2.15.X-fix_hypertable_foreign_keys.sql). For more details, see the following pull request [#6797](#6797).

If you are migrating from TimescaleDB v2.15.0, no changes are required.

**Bugfixes**
* #6540: Segmentation fault when you backfill data using COPY into a compressed chunk.
* #6858: `BEFORE UPDATE` trigger not working correctly.
* #6908: Fix `time_bucket_gapfill()` with timezone behaviour around daylight savings time (DST) switches.
* #6911: Fix dropped chunk metadata removal in the update script.
* #6940: Fix `pg_upgrade` failure by removing `regprocedure` from the catalog table.
* #6957: Fix the `segfault` in UNION queries that contain ordering on compressed chunks.

**Thanks**
* @DiAifU, @kiddhombre and @intermittentnrg for reporting the issues with gapfill and daylight saving time.
* @edgarzamora for reporting the issue with update triggers.
* @hongquan for reporting the issue with the update script.
* @iliastsa and @SystemParadox for reporting the issue with COPY into a compressed chunk.

iliastsa added the bug label Jan 18, 2024

konskov added the waiting-for-author label Jan 19, 2024

timescale-automation removed the waiting-for-author label Jan 19, 2024

konskov added the segfault Segmentation fault label Jan 19, 2024

konskov added the waiting-for-author label Jan 24, 2024

timescale-automation removed the waiting-for-author label Jan 29, 2024

jnidzwetzki added the waiting-for-author label Feb 26, 2024

timescale-automation removed the waiting-for-author label May 16, 2024

mkindahl added segfault Segmentation fault and removed segfault Segmentation fault labels May 16, 2024

akuzm mentioned this issue May 17, 2024

Do not decompress batches for COPY with potentially invalid slot #6931

Merged

akuzm closed this as completed in #6931 May 23, 2024

timescale-automation mentioned this issue May 23, 2024

Backport to 2.15.x: #6931: Do not decompress batches for COPY with potentially invalid slot #6946

Merged

akuzm mentioned this issue May 23, 2024

Add missed changelog message for COPY segfault fix #6947

Merged

timescale-automation mentioned this issue May 23, 2024

Backport to 2.15.x: #6947: Add missed changelog message for COPY segfault fix #6950

Merged

fabriziomello mentioned this issue May 27, 2024

Release 2.15.1 #6966

Merged

pallavisontakke mentioned this issue May 28, 2024

Release 2.15.1 #6970

Merged
