
Sudden CPU spike to 100% and disk space decreasing fast - history purges #13361

Open
roughnecks opened this issue Jul 22, 2022 · 10 comments

Labels
A-Message-Retention-Policies: Issues related to Synapse's support of MSC1763 (message retention policies)
A-Performance: Performance, both client-facing and admin-facing
S-Tolerable: Minor significance, cosmetic issues, low or no impact to users.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@roughnecks

Description

I have "btop" always running on my second monitor..
Unpinned a message in a room (from Element Web) and all of the sudden (not sure if related) server's CPU went crazy to 100% for a minute or so, until when I restarted the ansible playbook.

In this minute my DB has grown by 1GB more or less but I lost more than 2GB of space and I don't know where exactly.
After that I ran the rust-compress utility and did a FULL VACUUM and DB is back to normal now.
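
For reference, a minimal sketch of those reclaim steps, assuming the default matrix-docker-ansible-deploy container and database names (matrix-postgres, with database and user both named synapse); adjust to your own setup:

# Reclaim disk space after a purge (container/database names are assumptions
# for a typical matrix-docker-ansible-deploy install).
# 1. Deduplicate state groups with rust-synapse-compress-state (see its README for usage).
# 2. Run a full vacuum so Postgres returns the freed space to the OS:
docker exec -it matrix-postgres psql -U synapse -d synapse -c 'VACUUM FULL;'

Note that VACUUM FULL takes exclusive locks while it rewrites tables, so it is best run while Synapse is stopped.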

Steps to reproduce

I don't have any steps to reproduce; it happened out of the blue.

The only thing I was doing at that time was removing a pinned message from a room.

Homeserver

woodpeckersnest.space

Synapse Version

1.63.1

Installation Method

Other (please mention below)

Platform

I'm running the matrix-docker-ansible-deploy playbook on a Debian 11 VPS with 2GB RAM and 1 vCPU.

Relevant log output

Jul 21 21:42:11 matrix-synapse[1303919]: 2022-07-21 19:42:11,764 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655840531742 => stream_ordering 155865)

Anything else that would be useful to know?

No response

@roughnecks (Author)

It happened again a few minutes ago, but it didn't last very long.

Jul 23 13:30:51 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:51,176 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983851151 => stream_ordering 161950)
Jul 23 13:30:52 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:52,165 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983851188 => stream_ordering 161950)
Jul 23 13:30:52 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:52,192 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983852167 => stream_ordering 161950)
Jul 23 13:30:52 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:52,220 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983852193 => stream_ordering 161950)

@babolivier (Contributor)

I can see you have Synapse's experimental support for message retention policies enabled; can you see any other log lines including [purge] in your logs?

Also, what level are you logging at? From your report it sounds like WARNING, which isn't really enough to figure out what Synapse is doing at a specific point in time (you usually want INFO).

I would also recommend setting up monitoring with Prometheus and Grafana, as that would give better insight into which part of Synapse is causing this issue, and would help us assist you in case the INFO logs aren't enough to go on.
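
If it helps, a quick way to pull all purge-related lines out of the journal (the matrix-synapse unit name is taken from the log excerpt above and may differ on other installs):

# Show purge-related log lines from the last day; the unit name is an assumption.
journalctl -u matrix-synapse --since "1 day ago" | grep '\[purge\]'
# The log level itself is set in Synapse's logging config (e.g. root: level: INFO).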

@babolivier babolivier added the X-Needs-Info This issue is blocked awaiting information from the reporter label Jul 25, 2022
@roughnecks (Author)

I just enabled INFO logging but I can't see any new "purge" messages; I guess I'll have to wait for message retention policies to kick in.

Also, I can't run Prometheus and Grafana because of the very limited server specs.

I will update this post when I see new "purge" messages.

Thanks

@erikjohnston (Member)

Are you seeing the increased CPU and disk space issues during the period of time the logs are showing [purge] messages?

It looks like the way this is set up is that every N hours (or however long you've set the interval to) we look at every room with a retention policy and run a purge on them. That would probably explain the CPU usage if you have lots of such rooms.

I'm a bit more surprised by the disk space. Is it the DB that's growing? Does it keep growing on subsequent runs of the purge jobs?
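
For context, how often these purge jobs run is controlled by the interval settings under the top-level retention section (purge_jobs) of homeserver.yaml; something like the command below shows what is currently configured (the config path is an assumption based on the ansible playbook's layout):

# Show the retention settings currently in effect; the path is an assumption.
grep -A 10 '^retention:' /matrix/synapse/config/homeserver.yaml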

@roughnecks (Author)

roughnecks commented Jul 29, 2022

Are you seeing the increased CPU and disk space issues during the period of time the logs are showing [purge] messages?

Yes.

I'm a bit more surprised by the disk space. Is it the DB that's growing? Does it keep growing on subsequent runs of the purge jobs?

Yeah, it keeps growing, but most of the growth happened on the first problematic purge; subsequent ones were faster to process and took less disk space. I have disabled message retention for now.

To me it looks like everything is about the same size except ./postgres, and I wouldn't know what other dirs to check:

root@pandora:/matrix# du -h --max-depth=1
24K     ./mautrix-telegram
20K     ./static-files
23M     ./nginx-proxy
8.0K    ./mailer
1.2M    ./ssl
62M     ./mautrix-whatsapp
8.0K    ./coturn
36K     ./client-element
2.1G    ./synapse
667M    ./postgres-backup
4.0K    ./rust-synapse-compress-state
8.0K    ./heisenbridge
292K    ./matrix-reminder-bot
2.3G    ./postgres
5.1G    .

@babolivier (Contributor)

babolivier commented Jul 29, 2022

I think I've seen the database grow a bit in the past when running purge jobs but didn't take the time to investigate. It could be caused by the fact that we queue up events we might want to delete in a temporary table and don't vacuum after dropping it. In general, purge jobs aren't expected to free up disk space on your machine unless you manually vacuum, but they do free up space Synapse will be able to reuse (that is, the space isn't given back to the system, but it's marked as space Synapse can write new data onto). See https://matrix-org.github.io/synapse/latest/message_retention_policies.html#note-on-reclaiming-disk-space for more information.
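
If you want to see where the growth is going inside Postgres, something like the query below lists the largest tables (the container and database names are assumptions for the ansible playbook's defaults); state_groups_state and the events tables are usually the biggest:

# List the ten largest tables in the Synapse database; names are assumptions.
docker exec -it matrix-postgres psql -U synapse -d synapse -c \
  "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size
     FROM pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC
    LIMIT 10;"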

On the CPU side of things, as Erik implied, it's somewhat expected to see an increase when purge jobs run (which is why we recommend they run at most once every several hours). I think there are a few quick improvements we can make to the purge code, though: for example, we check here how many events we need to delete, but we could skip all the way down to dropping the temporary table if that number is 0 (looking at the logs you shared, this could alleviate a lot of the load, given that a fair number of jobs report 0 events to delete). And/or we could avoid inserting events we should not delete into that table in the first place.

@roughnecks (Author)

Thanks for explaining. I did a VACUUM FULL with the tools provided by the ansible playbook.
Hope you can improve the process 👍

@babolivier (Contributor)

For the paper trail, I've opened issues for the two improvement ideas I mentioned in my previous comment: #13417 and #13419

@richvdh (Member)

richvdh commented Jul 29, 2022

I've seen history purges cause a big explosion in state groups in the past: #3364 (comment).
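
If you want a rough sense of whether that is happening here, counting the rows in state_groups_state before and after a purge run should show it (container and database names are assumptions, as above):

# Count state group rows; a large jump after a purge points at state group growth.
docker exec -it matrix-postgres psql -U synapse -d synapse -c \
  "SELECT count(*) FROM state_groups_state;"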

@H-Shay H-Shay added S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. and removed X-Needs-Info This issue is blocked awaiting information from the reporter labels Aug 9, 2022
@MadLittleMods MadLittleMods added the A-Performance Performance, both client-facing and admin-facing label Apr 25, 2023
@MadLittleMods MadLittleMods changed the title Sudden CPU spike to 100% and disk space decreasing fast Sudden CPU spike to 100% and disk space decreasing fast - history purges Apr 25, 2023
@MadLittleMods MadLittleMods added the A-Message-Retention-Policies Issues related to Synapse's support of MSC1763 (message retention policies) label Apr 25, 2023