[Bug]: Timescale segfaults when backfilling data #6540

Closed
iliastsa opened this issue Jan 18, 2024 · 9 comments · Fixed by #6931 or #6947
Labels
bug segfault Segmentation fault

iliastsa commented Jan 18, 2024

What type of bug is this?

Crash

What subsystems and features are affected?

Data ingestion

What happened?

When backfilling data into a hypertable, we get a segfault and the server goes into recovery mode. We've encountered this multiple times, on all TimescaleDB versions since we started using it (which was at v2.11.1 if I'm not mistaken).

We've also noticed that dropping old chunks before backfilling sometimes lets the backfill progress further.

TimescaleDB version affected

2.13.1

PostgreSQL version used

15.5

What operating system did you use?

Ubuntu 22.04 LTS x64

What installation method did you use?

Deb/Apt

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

dmesg logs:

postgres[3937358]: segfault at 140 ip 00007f82f81d6661 sp 00007ffe2cdeb020 error 6 in timescaledb-tsl-2.13.1.so[7f82f819e000+80000]
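
For readers decoding the dmesg line: the value after `ip` is the faulting instruction pointer, and the bracketed pair after the library name is the start address and size of the mapped region. Subtracting the start address from the instruction pointer gives the offset of the crash inside that mapping. A minimal sketch in PostgreSQL, using only the values from the line above (the column alias is illustrative):

```sql
-- Offset of the faulting instruction inside the mapped region of
-- timescaledb-tsl-2.13.1.so: instruction pointer minus mapping start.
SELECT to_hex(
         x'00007f82f81d6661'::bigint   -- ip from the dmesg line
       - x'00007f82f819e000'::bigint   -- start of the mapping
       ) AS offset_in_mapping;
-- returns 38661
```

The offset 0x38661 is smaller than the region size 0x80000, so the fault lies inside the TSL library's mapping. Resolving it to a function name still needs matching debug symbols or a core dump.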


postgres logs:

2024-01-18 11:04:10.174 UTC [3909348] LOG:  server process (PID 3937358) was terminated by signal 11: Segmentation fault
2024-01-18 11:04:10.174 UTC [3909348] DETAIL:  Failed process was running: copy "logs" (<list of columns>) from stdin

How can we reproduce the bug?

I haven't been able to reproduce this locally with a small dataset. I'll try and get a proof-of-concept going, but I suspect it has to do with some weird corruption issue that might be hard/impossible to reproduce.
@iliastsa iliastsa added the bug label Jan 18, 2024
@konskov
Contributor

konskov commented Jan 19, 2024

Hi @iliastsa, thank you for reaching out. Could you share the schema and hypertable definition for the hypertable that triggers the segfault? Is that table compressed?

@iliastsa
Author

Sure, here is the DDL + TimescaleDB / compression settings:

create table logs (
    column1  bigint not null,
    column2  int not null,
    column3  int not null,

    column4  int not null,
    column5  int,
    column6  bool not null,

    column7  bytea not null,
    column8  bytea,
    column9  bytea,
    column10 bytea,
    column11 bytea,
    column12 bytea null,
    primary key (column1, column2, column3)
);

select create_hypertable('logs', 'column1', chunk_time_interval => 300000, create_default_indexes => false);

alter table logs set (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'column7',
  timescaledb.compress_orderby = 'column1 desc, column2 desc, column4 desc'
);
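
Since the question was whether the table is compressed, it can help to look at compression per chunk, because a backfill only touches the chunks whose range covers the incoming rows. A minimal sketch, assuming the `logs` hypertable above and the `timescaledb_information.chunks` view available in TimescaleDB 2.x (the integer range columns apply because `column1` is a `bigint` dimension):

```sql
-- List each chunk of the hypertable with its integer range and
-- whether it is currently compressed.
SELECT chunk_schema,
       chunk_name,
       range_start_integer,
       range_end_integer,
       is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'logs'
ORDER BY range_start_integer;
```

Rows with `is_compressed = true` whose range overlaps the backfilled `column1` values are the chunks where a COPY has to go through the compressed-insert path.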

@konskov konskov added the segfault Segmentation fault label Jan 19, 2024
@konskov
Contributor

konskov commented Jan 24, 2024

Hello @iliastsa, we are trying to reproduce the error by inserting into, deleting from, and COPYing into compressed chunks, but unfortunately we do not have a reproduction case so far.
Have you been able to collect a coredump for this segfault? If so, would it be possible to get the stack trace from the coredump with gdb and share it with us? That could be very useful for debugging.

@iliastsa
Author

Yeah, I've also tried to reproduce it locally with inserts/deletes/COPYs but can't get it to crash. I don't have a coredump yet; I'll try to get one when we encounter the crash again.

@SystemParadox

SystemParadox commented May 16, 2024

We are suddenly having what seems to be the same issue. This system has been running for 2 months with no issue and suddenly started getting this:

2024-05-16 10:43:09.371 BST [1] LOG:  database system is ready to accept connections
2024-05-16 10:43:09.374 BST [796] LOG:  TimescaleDB background worker launcher connected to shared catalogs
2024-05-16 10:43:09.884 BST [804] ERROR:  duplicate key value violates unique constraint "1332_828_tag_history_pkey"
2024-05-16 10:43:09.884 BST [804] DETAIL:  Key (tag_id, "time")=(8e2e78ff-46ff-59eb-a00c-019384ecbf15, 2024-05-16 10:43:04.197+01) already exists.
2024-05-16 10:43:09.884 BST [804] CONTEXT:  COPY tag_history, line 2
2024-05-16 10:43:09.884 BST [804] STATEMENT:  COPY wd.tag_history FROM STDIN
2024-05-16 10:43:15.319 BST [1] LOG:  server process (PID 801) was terminated by signal 11: Segmentation fault
2024-05-16 10:43:15.319 BST [1] DETAIL:  Failed process was running: COPY wd.tag_history FROM STDIN
2024-05-16 10:43:15.319 BST [1] LOG:  terminating any other active server processes
2024-05-16 10:43:15.322 BST [1] LOG:  all server processes terminated; reinitializing
2024-05-16 10:43:15.573 BST [808] LOG:  database system was interrupted; last known up at 2024-05-16 10:43:09 BST
2024-05-16 10:43:15.573 BST [809] FATAL:  the database system is in recovery mode

The duplicate key error is semi-expected. The issue is that it should not crash Postgres!

CREATE TABLE wd.tag_history (
    time TIMESTAMPTZ NOT NULL,
    tag_id UUID NOT NULL,
    quality INT,
    value_int BIGINT,
    value_bool BOOLEAN,
    value_float DOUBLE PRECISION,
    value_str TEXT,
    PRIMARY KEY(tag_id, time)
);
SELECT create_hypertable('wd.tag_history', 'time',
    if_not_exists => true,
    chunk_time_interval => interval '1 day'
);
ALTER TABLE wd.tag_history SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'tag_id'
);

Backtrace from core dump:

#0  0x000055b34ab09481 in tts_buffer_heap_getsomeattrs ()
#1  0x000055b34ab0b31e in slot_getsomeattrs_int ()
#2  0x00007fd0954a9c70 in slot_getsomeattrs (attnum=1, slot=0x7fd0952daa88) at /usr/local/include/postgresql/server/executor/tuptable.h:344
#3  slot_getattr (isnull=<synthetic pointer>, attnum=1, slot=0x7fd0952daa88) at /usr/local/include/postgresql/server/executor/tuptable.h:387
#4  build_scankeys (num_scankeys=<synthetic pointer>, slot=0x7fd0952daa88, null_columns=0x7ffdde145a38, key_columns=<optimized out>, decompressor=..., hypertable_relid=<optimized out>, hypertable_id=<optimized out>)
    at /build/timescaledb/tsl/src/compression/compression.c:1829
#5  decompress_batches_for_insert (cis=<optimized out>, chunk=<optimized out>, slot=0x7fd0952daa88) at /build/timescaledb/tsl/src/compression/compression.c:1974
#6  0x00007fd095590a0e in ts_chunk_dispatch_get_chunk_insert_state (dispatch=0x7fd1921639d0, point=0x7fd0951fe908, slot=0x7fd0952daa88, on_chunk_changed=on_chunk_changed@entry=0x0, data=data@entry=0x0)
    at /build/timescaledb/src/nodes/chunk_dispatch/chunk_dispatch.c:172
#7  0x00007fd09555a630 in TSCopyMultiInsertBufferFlush (miinfo=miinfo@entry=0x7ffdde145da0, buffer=buffer@entry=0x7fd0952fa088) at /build/timescaledb/src/copy.c:324
#8  0x00007fd09555a978 in TSCopyMultiInsertInfoFlush (miinfo=0x7ffdde145da0, cur_cis=0x7fd0953ce378) at /build/timescaledb/src/copy.c:527
#9  0x00007fd09555b155 in copyfrom (ccstate=ccstate@entry=0x7fd192163968, range_table=<optimized out>, ht=ht@entry=0x7fd095519498, callback=<optimized out>, arg=arg@entry=0x7fd0953ff5e8, copycontext=<optimized out>)
    at /build/timescaledb/src/copy.c:1124
#10 0x00007fd09555bc21 in timescaledb_DoCopy (stmt=<optimized out>, queryString=<optimized out>, processed=<optimized out>, ht=<optimized out>) at /build/timescaledb/src/copy.c:1417
#11 0x00007fd09556ba69 in process_copy (args=0x7ffdde145f70) at /build/timescaledb/src/process_utility.c:673
#12 0x00007fd09556bed1 in process_ddl_command_start (args=0x7ffdde145f70) at /build/timescaledb/src/process_utility.c:4224
#13 timescaledb_ddl_command_start (pstmt=0x7fd19217f4b8, query_string=<optimized out>, readonly_tree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=<optimized out>, dest=<optimized out>, 
    completion_tag=<optimized out>) at /build/timescaledb/src/process_utility.c:4467
#14 0x000055b34ac75cdc in PortalRunUtility ()
#15 0x000055b34ac75dfb in PortalRunMulti ()
#16 0x000055b34ac7631b in PortalRun ()
#17 0x000055b34ac72532 in exec_simple_query ()
#18 0x000055b34ac74107 in PostgresMain ()
#19 0x000055b34abf064a in ServerLoop ()
#20 0x000055b34abf1493 in PostmasterMain ()
#21 0x000055b34a974346 in main ()

Postgres version: 14.9
Timescale version: 2.12.2 (based on docker image timescale/timescaledb:2.12.2-pg14)

@SystemParadox

OK, it turns out that in our case the database was actually still on TimescaleDB 2.11.0. Running ALTER EXTENSION to actually update to 2.12.2 fixed this particular issue for us.
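
The mismatch described here (new Docker image, old extension version inside the database) can be checked directly. A minimal sketch, assuming a standard PostgreSQL catalog and the extension name `timescaledb`; nothing here is specific to this deployment:

```sql
-- Version of the extension actually installed in the current database.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'timescaledb';

-- Version shipped with the binaries on this server (default_version)
-- next to what the database is running (installed_version).
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'timescaledb';

-- If the two differ, update the extension in this database.
-- The TimescaleDB docs recommend running this as the first command
-- in a fresh session (for example one started with psql -X).
ALTER EXTENSION timescaledb UPDATE;
```

Upgrading the package or image only replaces the library files; each database keeps its old extension version until `ALTER EXTENSION ... UPDATE` is run in it.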

@SystemParadox

...except that the issue has reappeared after upgrading to 2.14.2.

@mkindahl mkindahl added segfault Segmentation fault and removed segfault Segmentation fault labels May 16, 2024
@akuzm
Member

akuzm commented May 17, 2024

Is the stack trace on 2.14.2 different? There was a similar problem that was fixed in 2.12, which is probably what you hit initially: #6117. But if it still fails on 2.14.2, you may be hitting something different now.

If you can put together a snippet of data on which it always reproduces, that would be perfect, because I was not able to reproduce it or figure out a possible cause after some experiments.

@SystemParadox

Unfortunately this was all on a live system and I didn't manage to obtain a stack trace for 2.14.2.

fabriziomello added a commit to fabriziomello/timescaledb that referenced this issue May 27, 2024
This release contains bug fixes since the 2.15.0 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#6540 Segmentation fault when backfilling data with COPY into a compressed chunk
* timescale#6858 Before update trigger not working correctly
* timescale#6908 Fix gapfill with timezone behaviour around dst switches
* timescale#6911 Fix dropped chunk metadata removal in update script
* timescale#6940 Fix `pg_upgrade` failure by removing `regprocedure` from catalog table
* timescale#6957 Fix segfault in UNION queries with ordering on compressed chunks

**Thanks**
* @DiAifU, @kiddhombre and @intermittentnrg for reporting issues with gapfill and daylight saving time
* @edgarzamora for reporting issue with update triggers
* @hongquan for reporting an issue with the update script
* @iliastsa and @SystemParadox for reporting an issue with COPY into a compressed chunk
pallavisontakke added a commit to pallavisontakke/timescaledb that referenced this issue May 28, 2024
the 2.14.2 release. We recommend that you upgrade at the next
available opportunity.
pallavisontakke added a commit that referenced this issue May 28, 2024
This release contains performance improvements and bug fixes since
the 2.15.0 release. Best practice is to upgrade at the next 
available opportunity.

**Migrating from self-hosted TimescaleDB v2.14.x and earlier**

After you run `ALTER EXTENSION`, you must run [this SQL script](https://github.com/timescale/timescaledb-extras/blob/master/utils/2.15.X-fix_hypertable_foreign_keys.sql). For more details, see the following pull request [#6797](#6797).

If you are migrating from TimescaleDB v2.15.0, no changes are required.

**Bugfixes**
* #6540: Segmentation fault when you backfill data using COPY into a compressed chunk.
* #6858: `BEFORE UPDATE` trigger not working correctly. 
* #6908: Fix `time_bucket_gapfill()` with timezone behaviour around daylight savings time (DST) switches.
* #6911: Fix dropped chunk metadata removal in the update script. 
* #6940: Fix `pg_upgrade` failure by removing `regprocedure` from the catalog table.
* #6957: Fix the `segfault` in UNION queries that contain ordering on compressed chunks.

**Thanks**
* @DiAifU, @kiddhombre and @intermittentnrg for reporting the issues with gapfill and daylight saving time.
* @edgarzamora for reporting the issue with update triggers.
* @hongquan for reporting the issue with the update script.
* @iliastsa and @SystemParadox for reporting the issue with COPY into a compressed chunk.
fabriziomello added a commit to fabriziomello/timescaledb that referenced this issue May 28, 2024
fabriziomello added a commit that referenced this issue May 29, 2024