[Bug]: reading a chunk ends up in DataFileRead hanging, other chunks work fine #7054

fordfrog · 2024-06-20T17:41:56Z

What type of bug is this?

Data corruption, Performance issue

What subsystems and features are affected?

Command processing, Partitioning

What happened?

i can't read from a single chunk in my database. other chunks work fine. anything (select, vacuum, compress, ...) that i try to do on the chunk ends up in DataFileRead hanging. that process can't then even be killed and the whole database has to be restarted, which does not work without issues because of the hanging process.

TimescaleDB version affected

2.15.2

PostgreSQL version used

16.3

What operating system did you use?

gentoo linux

What installation method did you use?

Source

What platform did you run on?

Not applicable

Relevant log output and stack trace

# grep 2_95 postgresql-*.csv 
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29568,,6671b827.7380,1,,2024-06-18 16:39:03 GMT,34/12760452,246338502,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813381,18) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29567,,6671b827.737f,1,,2024-06-18 16:39:03 GMT,33/4746705,246338503,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813378,22) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29572,,6671b827.7384,1,,2024-06-18 16:39:03 GMT,62/2136623,246338506,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813379,38) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29570,,6671b827.7382,1,,2024-06-18 16:39:03 GMT,53/8301273,246338509,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813380,26) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29569,,6671b827.7381,1,,2024-06-18 16:39:03 GMT,41/8912638,246338507,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813378,46) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29575,,6671b827.7387,1,,2024-06-18 16:39:03 GMT,71/1757346,246338505,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813380,34) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29574,,6671b827.7386,1,,2024-06-18 16:39:03 GMT,69/921512,246338504,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813383,49) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29573,,6671b827.7385,1,,2024-06-18 16:39:03 GMT,66/2498710,246338508,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813378,57) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29566,,6671b827.737e,1,,2024-06-18 16:39:03 GMT,32/10371971,246338511,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813378,27) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 23:55:26.257 GMT,,,13543,,66721ad3.34e7,1,,2024-06-18 23:40:03 GMT,44/107414,0,LOG,00000,"automatic vacuum of table ""fxtrader._timescaledb_internal._hyper_2_95_chunk"": index scans: 1
postgresql-Tue.csv:index ""_hyper_2_95_chunk_ticks_bid_ask_ticker_timestamp_position_idx"": pages: 748048 in total, 0 newly deleted, 0 currently deleted, 0 reusable
postgresql-Tue.csv:index ""_hyper_2_95_chunk_ticks_bid_ask_timestamp_idx"": pages: 133248 in total, 0 newly deleted, 0 currently deleted, 0 reusable
postgresql-Wed.csv:2024-06-19 11:10:20.704 GMT,"fxtrader","fxtrader",10125,"[local]",6671ce47.278d,905,"SELECT",2024-06-18 18:13:27 GMT,24/793,246461755,LOG,00000,"acquiring locks for compressing ""_timescaledb_internal._hyper_2_95_chunk""",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Wed.csv:2024-06-19 11:27:39.741 GMT,"fxtrader","fxtrader",10125,"[local]",6671ce47.278d,906,"SELECT",2024-06-18 18:13:27 GMT,24/793,246461755,ERROR,57014,"canceling statement due to user request",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Wed.csv:2024-06-19 15:53:15.247 GMT,"fordfrog","fxtrader",15284,"[local]",6671cf50.3bb4,434,"SELECT",2024-06-18 18:17:52 GMT,25/1780,246492784,LOG,00000,"acquiring locks for compressing ""_timescaledb_internal._hyper_2_95_chunk""",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Wed.csv:2024-06-19 16:02:05.968 GMT,"fordfrog","fxtrader",15284,"[local]",6671cf50.3bb4,435,"SELECT",2024-06-18 18:17:52 GMT,25/1780,246492784,ERROR,57014,"canceling statement due to user request",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:39:13.084 GMT,"fxtrader","fxtrader",10125,"[local]",6671ce47.278d,909,"SELECT",2024-06-18 18:13:27 GMT,24/798,246641516,LOG,00000,"acquiring locks for compressing ""_timescaledb_internal._hyper_2_95_chunk""",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:45:08.230 GMT,"fxtrader","fxtrader",10125,"[local]",6671ce47.278d,910,"SELECT",2024-06-18 18:13:27 GMT,24/798,246641516,ERROR,57014,"canceling statement due to user request",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:46:31.797 GMT,"fxtrader","fxtrader",10125,"[local]",6671ce47.278d,911,"SELECT",2024-06-18 18:13:27 GMT,24/799,246642480,LOG,00000,"acquiring locks for compressing ""_timescaledb_internal._hyper_2_95_chunk""",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:58:27.132 GMT,,,3982,,6671cdab.f8e,7,,2024-06-18 18:10:51 GMT,,0,LOG,00000,"server process (PID 29899) was terminated by signal 9: Killed","Failed process was running: autovacuum: VACUUM _timescaledb_internal._hyper_2_95_chunk",,,,,,,,"","postmaster",,0
postgresql-Thu.csv:2024-06-20 14:59:53.181 GMT,"fordfrog","fxtrader",2520,"[local]",667443d9.9d8,5,"SELECT",2024-06-20 14:59:37 GMT,4/249,246643595,LOG,00000,"acquiring locks for compressing ""_timescaledb_internal._hyper_2_95_chunk""",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:59:53.182 GMT,"fordfrog","fxtrader",2520,"[local]",667443d9.9d8,6,"SELECT",2024-06-20 14:59:37 GMT,4/249,246643595,LOG,00000,"locks acquired for compressing ""_timescaledb_internal._hyper_2_95_chunk""",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:59:53.200 GMT,"fordfrog","fxtrader",2520,"[local]",667443d9.9d8,7,"SELECT",2024-06-20 14:59:37 GMT,4/249,246643595,LOG,00000,"new compressed chunk ""_timescaledb_internal.compress_hyper_3_174_chunk"" created",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:59:53.200 GMT,"fordfrog","fxtrader",2520,"[local]",667443d9.9d8,8,"SELECT",2024-06-20 14:59:37 GMT,4/249,246643595,LOG,00000,"using tuplesort to scan rows from ""_hyper_2_95_chunk"" for compression",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 16:31:34.905 GMT,,,1294,,667443c3.50e,13,,2024-06-20 14:59:15 GMT,,0,LOG,00000,"server process (PID 2520) was terminated by signal 9: Killed","Failed process was running: select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,,,,,,"","postmaster",,0
postgresql-Thu.csv:2024-06-20 16:48:57.538 GMT,"fordfrog","fxtrader",30720,"[local]",66745c2c.7800,10,"SELECT",2024-06-20 16:43:24 GMT,20/154,0,ERROR,57014,"canceling statement due to user request",,,,,,"select * from _timescaledb_internal._hyper_2_95_chunk limit 1;",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 17:08:08.914 GMT,,,16505,,66745989.4079,7,,2024-06-20 16:32:09 GMT,,0,LOG,00000,"server process (PID 17417) was terminated by signal 9: Killed","Failed process was running: vacuum full verbose analyze _timescaledb_internal._hyper_2_95_chunk ;",,,,,,,,"","postmaster",,0

How can we reproduce the bug?

as other chunks work fine, i suspect some corruption and i have no idea how to replicate it. the log shows what was going on there with the chunk during the last week. i just recall today i noticed a vacuum hanging on the chunk for two days or so, so i killed that. but the chunk was probably already broken at that time.

EDIT: i just found out that dropping the broken chunk and re-inserting the data that belong to the chunk should recreate that chunk.

EDIT2: i managed to get rid of the chunk though the database got stuck again. and i restored the data from backup. still checking the data...

tjcutajar · 2024-06-21T12:49:55Z

We also ran into this DataFileRead hang issue a couple of days ago and had to perform a restore.

@fordfrog this might be unrelated, but did you upgrade to version 2.15.2 recently without running this script as per the release notes? Just wondering if this is somehow related.

fordfrog · 2024-06-22T10:56:24Z

@fordfrog this might be unrelated, but did you upgrade to version 2.15.2 recently without running this script as per the release notes? Just wondering if this is somehow related.

in fact 2.15.2 is the first version that i used (so i didn't use the script). i'm in the process of migrating table with close to 5 billions of ticks to hypertable with 71 chunks as of now, and so far this was the only issue i encountered.

fordfrog · 2024-06-22T14:26:49Z

also, not sure if that is important, but i have the database server replicated to another instance of postgresql and there the cluster was fine. so i used the cluster from the replicated database to recover the records (dump and then insert back to the production database after removing the broken chunk).

nikkhils · 2024-06-24T09:28:52Z

@fordfrog as can be seen in the logs, someone ran a VACUUM FULL on this chunk. VF will take an exclusive lock on the chunk and ANY other operation will not be able to do anything till VF completes.

Failed process was running: vacuum full verbose analyze _timescaledb_internal._hyper_2_95_chunk ;

Why was the VF command initiated? It's generally not a recommended practice to use VF.

fordfrog · 2024-06-25T11:42:30Z

@fordfrog as can be seen in the logs, someone ran a VACUUM FULL on this chunk. VF will take an exclusive lock on the chunk and ANY other operation will not be able to do anything till VF completes.

Failed process was running: vacuum full verbose analyze _timescaledb_internal._hyper_2_95_chunk ;

Why was the VF command initiated? It's generally not a recommended practice to use VF.

it was already after the chunk was broken and it was a try whether complete rewrite of the chunk helps or not... but it freezed, as any other operation on the chunk.

akuzm · 2024-07-15T09:32:22Z

Are you self-hosting? When the process freezes in I/O and cannot be killed, I would think that my disk is dying. This is probably too low-level to be related to the TimescaleDB extension. Anything interesting in dmesg?

fordfrog · 2024-07-15T11:12:07Z

Are you self-hosting? When the process freezes in I/O and cannot be killed, I would think that my disk is dying. This is probably too low-level to be related to the TimescaleDB extension. Anything interesting in dmesg?

i am self-hosting. this issue occured only on that chunk. no issues with the disks otherwise. no reports of issues with disks or filesystem so far.

fabriziomello · 2024-07-27T19:39:13Z

@fordfrog looks like linux OOM killer is killing your expensive processes. Did u checked your kernel logs?

fordfrog · 2024-07-29T08:19:35Z

@fordfrog looks like linux OOM killer is killing your expensive processes. Did u checked your kernel logs?

i was killing myself some hanging processes as things that should last few minutes were lasting hours... after recreating that chunk with the same data there were no issues so i suppose it was not the data but rather some corruption of the chunk, or maybe a filesystem (btrfs) issue with blocked read.

fordfrog added the bug label Jun 20, 2024

fabriziomello closed this as completed Jul 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: reading a chunk ends up in DataFileRead hanging, other chunks work fine #7054

[Bug]: reading a chunk ends up in DataFileRead hanging, other chunks work fine #7054

fordfrog commented Jun 20, 2024 •

edited

Loading

tjcutajar commented Jun 21, 2024

fordfrog commented Jun 22, 2024 •

edited

Loading

fordfrog commented Jun 22, 2024

nikkhils commented Jun 24, 2024 •

edited

Loading

fordfrog commented Jun 25, 2024

akuzm commented Jul 15, 2024

fordfrog commented Jul 15, 2024

fabriziomello commented Jul 27, 2024

fordfrog commented Jul 29, 2024

[Bug]: reading a chunk ends up in DataFileRead hanging, other chunks work fine #7054

[Bug]: reading a chunk ends up in DataFileRead hanging, other chunks work fine #7054

Comments

fordfrog commented Jun 20, 2024 • edited Loading

What type of bug is this?

What subsystems and features are affected?

What happened?

TimescaleDB version affected

PostgreSQL version used

What operating system did you use?

What installation method did you use?

What platform did you run on?

Relevant log output and stack trace

How can we reproduce the bug?

tjcutajar commented Jun 21, 2024

fordfrog commented Jun 22, 2024 • edited Loading

fordfrog commented Jun 22, 2024

nikkhils commented Jun 24, 2024 • edited Loading

fordfrog commented Jun 25, 2024

akuzm commented Jul 15, 2024

fordfrog commented Jul 15, 2024

fabriziomello commented Jul 27, 2024

fordfrog commented Jul 29, 2024

fordfrog commented Jun 20, 2024 •

edited

Loading

fordfrog commented Jun 22, 2024 •

edited

Loading

nikkhils commented Jun 24, 2024 •

edited

Loading