
Sudden CPU spike to 100% and disk space decreasing fast - history purges #13361

Open
roughnecks opened this issue Jul 22, 2022 · 10 comments

Labels
A-Message-Retention-Policies: Issues related to Synapse's support of MSC1763 (message retention policies)
A-Performance: Performance, both client-facing and admin-facing
S-Tolerable: Minor significance, cosmetic issues, low or no impact to users.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@roughnecks

Description

I have "btop" always running on my second monitor..
Unpinned a message in a room (from Element Web) and all of the sudden (not sure if related) server's CPU went crazy to 100% for a minute or so, until when I restarted the ansible playbook.

In this minute my DB has grown by 1GB more or less but I lost more than 2GB of space and I don't know where exactly.
After that I ran the rust-compress utility and did a FULL VACUUM and DB is back to normal now.
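
For reference, a minimal sketch of those reclaim steps, assuming the default matrix-docker-ansible-deploy container and database names (matrix-postgres, with database and user both named synapse); adjust to your own setup:

# Reclaim disk space after a purge (container/database names are assumptions
# for a typical matrix-docker-ansible-deploy install).
# 1. Deduplicate state groups with rust-synapse-compress-state (see its README for usage).
# 2. Run a full vacuum so Postgres returns the freed space to the OS:
docker exec -it matrix-postgres psql -U synapse -d synapse -c 'VACUUM FULL;'

Note that VACUUM FULL takes exclusive locks while it rewrites tables, so it is best run while Synapse is stopped.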

Steps to reproduce

I don't have any steps to reproduce; it happened out of the blue.

The only thing I was doing at that time was removing a pinned message from a room.

Homeserver

woodpeckersnest.space

Synapse Version

1.63.1

Installation Method

Other (please mention below)

Platform

I'm running the matrix-docker-ansible-deploy playbook on a Debian 11 VPS with 2GB RAM and 1 vCPU.

Relevant log output

Jul 21 21:42:11 matrix-synapse[1303919]: 2022-07-21 19:42:11,764 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655840531742 => stream_ordering 155865)

Anything else that would be useful to know?

No response

@roughnecks (Author)

It happened again a few minutes ago, but it didn't last very long.

Jul 23 13:30:51 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:51,176 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983851151 => stream_ordering 161950)
Jul 23 13:30:52 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:52,165 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983851188 => stream_ordering 161950)
Jul 23 13:30:52 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:52,192 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983852167 => stream_ordering 161950)
Jul 23 13:30:52 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:52,220 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983852193 => stream_ordering 161950)

@babolivier (Contributor)

I can see you have Synapse's experimental support for message retention policies enabled; can you see any other log lines including [purge] in your logs?

Also, what level are you logging at? From your report it sounds like WARNING, which isn't really enough to figure out what Synapse is doing at a specific point in time (you usually want INFO).

I would also recommend setting up monitoring with Prometheus and Grafana, as that would give better insight into which part of Synapse is causing this issue, and would help us assist you in case the INFO logs aren't enough to go on.
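
If it helps, a quick way to pull all purge-related lines out of the journal (the matrix-synapse unit name is taken from the log excerpt above and may differ on other installs):

# Show purge-related log lines from the last day; the unit name is an assumption.
journalctl -u matrix-synapse --since "1 day ago" | grep '\[purge\]'
# The log level itself is set in Synapse's logging config (e.g. root: level: INFO).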

@babolivier babolivier added the X-Needs-Info This issue is blocked awaiting information from the reporter label Jul 25, 2022
@roughnecks (Author)

I just enabled INFO logging but I can't see any new "purge" messages; I guess I'll have to wait for message retention policies to kick in.

Also, I can't run Prometheus and Grafana because of the very limited server specs.

I will update this post when I see new "purge" messages.

Thanks

@erikjohnston (Member)

Are you seeing the increased CPU and disk space issues during the period of time the logs are showing [purge] messages?

It looks like the way this is set up is that every N hours (or however long you've set the interval to) we look at every room with a retention policy and run a purge on them. That would probably explain the CPU usage if you have lots of such rooms.

I'm a bit more surprised by the disk space. Is it the DB that's growing? Does it keep growing on subsequent runs of the purge jobs?
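
For context, how often these purge jobs run is controlled by the interval settings under the top-level retention section (purge_jobs) of homeserver.yaml; something like the command below shows what is currently configured (the config path is an assumption based on the ansible playbook's layout):

# Show the retention settings currently in effect; the path is an assumption.
grep -A 10 '^retention:' /matrix/synapse/config/homeserver.yaml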

@roughnecks (Author)

roughnecks commented Jul 29, 2022

Are you seeing the increased CPU and disk space issues during the period of time the logs are showing [purge] messages?

Yes.

I'm a bit more surprised by the disk space. Is it the DB that's growing? Does it keep growing on subsequent runs of the purge jobs?

Yeah, it keeps growing, but most of the growth happened on the first problematic purge; subsequent ones were faster to process and took less disk space. I have disabled message retention for now.

To me it looks like everything is about the same size except ./postgres, and I wouldn't know what other dirs to check:

root@pandora:/matrix# du -h --max-depth=1
24K     ./mautrix-telegram
20K     ./static-files
23M     ./nginx-proxy
8.0K    ./mailer
1.2M    ./ssl
62M     ./mautrix-whatsapp
8.0K    ./coturn
36K     ./client-element
2.1G    ./synapse
667M    ./postgres-backup
4.0K    ./rust-synapse-compress-state
8.0K    ./heisenbridge
292K    ./matrix-reminder-bot
2.3G    ./postgres
5.1G    .

@babolivier (Contributor)

babolivier commented Jul 29, 2022

I think I've seen the database grow a bit in the past when running purge jobs but didn't take the time to investigate. It could be caused by the fact that we queue up events we might want to delete in a temporary table and don't vacuum after dropping it. In general, purge jobs aren't expected to free up disk space on your machine unless you manually vacuum, but they do free up space Synapse will be able to reuse (that is, the space isn't given back to the system, but it's marked as space Synapse can write new data onto). See https://matrix-org.github.io/synapse/latest/message_retention_policies.html#note-on-reclaiming-disk-space for more information.
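
If you want to see where the growth is going inside Postgres, something like the query below lists the largest tables (the container and database names are assumptions for the ansible playbook's defaults); state_groups_state and the events tables are usually the biggest:

# List the ten largest tables in the Synapse database; names are assumptions.
docker exec -it matrix-postgres psql -U synapse -d synapse -c \
  "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size
     FROM pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC
    LIMIT 10;"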

On the CPU side of things, as Erik implied, it's somewhat expected to see an increase when purge jobs run (which is why we recommend they run at most once every several hours). I think there are a few quick improvements we can make to the purge code, though: for example, we check here how many events we need to delete, but we could skip all the way down to dropping the temporary table if that number is 0 (looking at the logs you shared, this could alleviate a lot of the load, given that a fair number of jobs report 0 events to delete). And/or we could avoid inserting events we should not delete into that table in the first place.

@roughnecks (Author)

Thanks for explaining. I did a VACUUM FULL with the tools provided by the ansible playbook.
Hope you can improve the process 👍

@babolivier (Contributor)

For the paper trail, I've opened issues for the two improvement ideas I mentioned in my previous comment: #13417 and #13419

@richvdh (Member)

richvdh commented Jul 29, 2022

I've seen history purges cause a big explosion in state groups in the past: #3364 (comment).
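
If you want a rough sense of whether that is happening here, counting the rows in state_groups_state before and after a purge run should show it (container and database names are assumptions, as above):

# Count state group rows; a large jump after a purge points at state group growth.
docker exec -it matrix-postgres psql -U synapse -d synapse -c \
  "SELECT count(*) FROM state_groups_state;"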

@H-Shay H-Shay added S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. and removed X-Needs-Info This issue is blocked awaiting information from the reporter labels Aug 9, 2022
@MadLittleMods MadLittleMods added the A-Performance Performance, both client-facing and admin-facing label Apr 25, 2023
@MadLittleMods MadLittleMods changed the title Sudden CPU spike to 100% and disk space decreasing fast Sudden CPU spike to 100% and disk space decreasing fast - history purges Apr 25, 2023
@MadLittleMods MadLittleMods added the A-Message-Retention-Policies Issues related to Synapse's support of MSC1763 (message retention policies) label Apr 25, 2023