
Possible memory leak #45

Open
vrthra opened this issue Nov 17, 2022 · 52 comments

Labels: bug (Something isn't working)

Comments

@vrthra (Contributor) commented Nov 17, 2022

There is a possible memory leak. Here is my memory utilization:

memory

At this point, the machine becomes unresponsive and has to be rebooted. It is an Oracle Cloud instance, and Ktistec is the only application running on it.

$ uname -a
Linux instance-web 5.15.0-1022-oracle #28-Ubuntu SMP Wed Oct 26 09:38:05 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.1 LTS
Release:	22.04
Codename:	jammy
$ free -m
               total        used        free      shared  buff/cache   available
Mem:             964         217         107           1         638         595
Swap:              0           0           0
@toddsundsted (Owner):

how many inbound messages are you typically receiving in a given day? i'm assuming you're operating at about my scale, but i want to check. my vps has 1 gig of memory, and my instance fits comfortably in that. memory usage does increase, but it doesn't grow to the point of a crash. at least for me.

@vrthra (Contributor, Author) commented Nov 17, 2022

Counting the posts I see on the front page, I got just 9 posts yesterday before the server became unresponsive. I am not sure how many messages (pings, etc.) I received. I have started collecting my logs; I will count the number of lines before this happens again and post an update.

@toddsundsted (Owner):

ok, i get 30-40 posts a minute on average. there's usually a pretty steep memory ramp when i first start the server, but then it generally plateaus, and it will run for days or weeks like that. i've had one or two crashes historically, but those might have been the process being killed off for memory issues unrelated to the server itself.

that's not to say there's not a memory leak. i'm keeping an eye on my instance in case i introduced a problem recently.

@vrthra (Contributor, Author) commented Nov 17, 2022

For me, there was a small, steep climb in the first hour, followed by a dip, and then a constant climb over 12 hours. See this image.

memleak

I will post an update on what happens today; so far, it has plateaued.

memleak

@vrthra (Contributor, Author) commented Nov 19, 2022

Here is what I have currently; there is a small but noticeable climb over two days.

memleak

@toddsundsted (Owner):

i'm trying to find an easy way to chart the memory usage/growth on my server. i don't believe i've done anything that would prevent instances from being garbage collected, but i could be wrong, or there could be bugs in the core libraries.

@JayVii (Contributor) commented Nov 21, 2022

I have not monitored my server in detail yet, but ktistec (in Docker) climbed from 600MB to 1600MB over the past 5 days. I have been a bit more active over the weekend, though, so I am not sure whether the growth is actually a problem.

Edit: that's on a VPS with 4GB RAM

@toddsundsted (Owner):

memory usage should generally be flat. the server doesn't, for example, intentionally try to keep an increasing subset of data in memory. so if you're seeing growth like that day after day, then i'm confident you're seeing a memory leak. i'm going to add additional logging to dump gc and memory stats periodically to see if that helps shed light on what's going on.
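
something along these lines, an illustrative sketch only (not the actual ktistec code), using the stats the crystal runtime already exposes through GC.stats:

# illustrative sketch: inside a long-running server, periodically dump GC stats
spawn do
  loop do
    stats = GC.stats
    puts "heap_size=#{stats.heap_size} free_bytes=#{stats.free_bytes} total_bytes=#{stats.total_bytes}"
    sleep 60.seconds
  end
end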

@toddsundsted (Owner):

commit 4ce1d3e introduces some code to capture memory information.

toddsundsted added the bug label Nov 23, 2022
@vrthra (Contributor, Author) commented Nov 23, 2022

Thanks; I have deployed this version. Let me know what to look for and report.

@vrthra (Contributor, Author) commented Nov 23, 2022

Is there any way to see the current commit/version from a deployed instance?

@toddsundsted (Owner):

there is not. i thought the version number would be more useful—i didn't count on all the early adopters building from main.

@toddsundsted (Owner):

@vrthra i've found a pretty good candidate bug for the memory leak. i believe sqlite is caching prepared statements for all queries issued. this wouldn't be a huge problem—queries should be parameterized—except in one query i embedded an id directly in the string instead of passing the value in as a parameter. this is in code that handles inbound activities, so it results in a lot of caching of non-reusable statements. i don't know if this is the whole problem, but in my local tests it's made a big difference in footprint over time.
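
to illustrate the difference, here's a minimal sketch against the objects(id, iri) table using the crystal-db and sqlite3 shards (not the actual ktistec code):

require "db"
require "sqlite3"

iri = "https://example.com/objects/1"

DB.open("sqlite3://./db/ktistec.db") do |db|
  # interpolating the value produces a different SQL string on every call, so a
  # statement cache keyed on the SQL text grows without bound:
  #   db.query_one?("SELECT id FROM objects WHERE iri = '#{iri}'", as: Int64)

  # binding the value keeps the SQL text constant, so one cached prepared
  # statement is reused for every call:
  id = db.query_one?("SELECT id FROM objects WHERE iri = ?", iri, as: Int64)
  puts id
end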

@toddsundsted (Owner):

@vrthra i created a branch with some changes that seem to make things significantly better for me:

https://github.com/toddsundsted/ktistec/tree/memory-problems

i need to make some improvements to the code before i put it in main, but it should improve things for you, as well.

@vrthra (Contributor, Author) commented Nov 25, 2022

After deploying this, I get

docker run -p 80:3000 -v `pwd`/db:/db -v `pwd`/uploads:/uploads  -it social:latest
1
2
GC::Stats(@heap_size=786432, @free_bytes=24576, @unmapped_bytes=0, @bytes_since_gc=749696, @total_bytes=749696)
Unhandled exception: more than one row (DB::Error)
  from /build/lib/db/src/db/query_methods.cr:81:9 in 'query_one:args'
  from /build/src/framework/model.cr:246:9 in 'find:iri:include_deleted:include_undone'
  from /build/src/framework/model.cr:256:9 in 'find?:iri:include_deleted:include_undone'
  from /build/src/models/activity_pub/activity/delete.cr:9:5 in 'object?'
  from /build/src/rules/content_rules.cr:43:3 in 'each_match'
  from /build/lib/school/src/school/domain/domain.cr:95:15 in 'each_match:trace'
  from /build/lib/school/src/school/domain/domain.cr:123:11 in 'run'
  from /build/src/rules/content_rules.cr:87:5 in 'run'
  from /build/src/utils/database.cr:41:15 in 'recreate_timeline_and_notifications'
  from /build/src/utils/database.cr:16:5 in 'recreate_timeline_and_notifications'
  from /build/src/database/migrations/000019-update-timeline-and-notifications.cr:5:1 in '->'
  from /build/src/framework/database.cr:60:41 in 'do_operation'
  from /build/src/framework/framework.cr:136:14 in '__crystal_main'
  from /usr/share/crystal/src/crystal/main.cr:115:5 in 'main_user_code'
  from /usr/share/crystal/src/crystal/main.cr:101:7 in 'main'
  from /usr/share/crystal/src/crystal/main.cr:127:3 in 'main'
  from src/env/__libc_start_main.c:95:2 in 'libc_start_main_stage2'
make: *** [Makefile:4: ktistec] Error 1

Any idea?

@toddsundsted (Owner):

it looks like your database has two activitypub object records with the same id/iri. that migration rebuilds your notifications and timeline. the server is processing a delete activity but it's finding more than one object. strictly speaking you probably don't need to run that migration, but the duplicate object is disturbing. i'm at a loss at the moment as to how it could have been created. something like the following query should return the duplicates. i'd be interested to know how many you have:
select iri from objects group by iri having count(id) > 1;

@vrthra (Contributor, Author) commented Nov 25, 2022

It doesn't seem like that is causing it.

~/ktistec$ sqlite3 db/ktistec.db 'select iri from objects'  | wc -l
5287
~/ktistec$ sqlite3 db/ktistec.db 'select iri from objects group by iri having count(id) > 1'
~/ktistec$ 

@toddsundsted (Owner):

ok. i don't have a better theory about the trace at the moment. thanks for testing!

@vrthra (Contributor, Author) commented Nov 25, 2022

How do I turn off the migration?

@toddsundsted (Owner):

you can either delete src/database/migrations/000019-update-timeline-and-notifications.cr before you build, or give me ~1 more day to push out the changes in this branch. those changes will include a try/catch in that migration that logs the stack trace but continues running.
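
in crystal, that protection looks roughly like the following; this is an illustrative sketch with a stand-in for the real migration step, not the code that actually shipped:

# hypothetical stand-in for the migration step that raised DB::Error above
def recreate_timeline_and_notifications
  raise "more than one row"
end

begin
  recreate_timeline_and_notifications
rescue ex
  # report the stack trace, but let startup continue instead of aborting
  STDERR.puts ex.inspect_with_backtrace
end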

@toddsundsted (Owner):

@vrthra all changes have been merged into main, including some protection against the migration failing—you'll still see the error messages but the server should continue to start.

i'd be interested in how many error messages you get, along with any other logging output. i don't get any errors running that migration, so i want to get a better understanding of what you're seeing.

@vrthra (Contributor, Author) commented Nov 27, 2022

These are the logs from startup. I can also supply my ktistec.db or give you access to my Oracle machine if that helps. I have no private information on that machine, and I am happy to help if you want to debug what is happening.

ktistec.log

@toddsundsted (Owner):

weird weird weird. i'm guessing you're following some server i'm not, and it's sending something i don't expect. maybe it's another aspect of the guppe problems. the one common thing is that it's only happening when processing deletes, so it's probably not noticeable in your timeline or notifications, but if you see anything that looks amiss there, let me know. i've signed up to guppe now, so if it shows up in any way related to that, i should see it myself...

@toddsundsted (Owner):

@vrthra have you seen any improvement in memory usage on your system?

@vrthra (Contributor, Author) commented Nov 28, 2022

Yes, it is much better now! Thank you for the updated version. This was deployed on November 27th; see the memory graph after that point.
memory
I will keep an eye on this for a few more days and let you know.

@vrthra (Contributor, Author) commented Nov 29, 2022

The original huge growth in memory is gone, but there is still a consistent small increase -- 62% to 66% over the first day and 66% to 69% over the last day.

memory

Screen Shot 2022-11-30 at 6 03 23 am

Screen Shot 2022-11-30 at 6 04 27 am

(Three days later; notice the climb.)
Screen Shot 2022-11-30 at 6 04 27 am

@felixkrohn:

I have been running the build that includes the latest fixes for only about 24h in total, but I can confirm the big leak is gone:
image
I can't yet confirm the ever-so-slight increase; I will have to wait some more time to discern a pattern amid the GC-related ups and downs:
image

@toddsundsted (Owner):

thanks! there are still a few more possibilities. one is that there is another instance somewhere of a prepared statement with an interpolated value. sqlite also allocates pages for its own work, and i haven't looked to see how those are managed.

generally, what i want to understand now is whether this plateaus and/or eventually crashes. i'll set some vm limits the next time i start my instance and see what happens when it bumps into them.
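
for anyone who wants to run the same experiment, a docker deployment like the one shown earlier can be capped with docker's memory flags (the 512m value here is arbitrary, and a non-docker deployment could use a systemd MemoryMax= setting instead):

docker run --memory=512m --memory-swap=512m -p 80:3000 -v `pwd`/db:/db -v `pwd`/uploads:/uploads -it social:latest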

@vrthra (Contributor, Author) commented Dec 5, 2022

My memory growth is fairly consistent. Any logs I can get you?

mem

@toddsundsted (Owner):

@vrthra the output from select * from points where chart = "heap-size"; would be useful. that captures the heap size as it appears to the crystal language garbage collector. if that's constant, then the only other source of allocations should be sqlite. this will help me diagnose that case. thanks!
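
following the pattern of the commands you ran earlier (assuming the database is still at db/ktistec.db), something like this captures it to a file:

sqlite3 db/ktistec.db 'select * from points where chart = "heap-size"' > out.log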

@vrthra (Contributor, Author) commented Dec 7, 2022

@toddsundsted please find attached the output of the query

out.log

@toddsundsted (Owner):

@vrthra thanks! it looks like there's more hunting to be done!

@toddsundsted (Owner):

update on this: i've added a chart to the metrics page that shows heap size, to make it easier to track this. i still see heap growth over time, even on my instance, so the problem is still there.

@JayVii (Contributor) commented Dec 23, 2022

Just so there is data from another ktistec deployment in here: memory usage seems to have changed for me since deploying 2.0.0-6.

The first server restart is the update to 2.0.0-5, when collection of heap-size data starts. The second restart is the update to 2.0.0-6.

Also, when upgrading to 2.0.0-6, I deleted duplicated rows from the DB, so maybe that's another factor(?).

Screenshot_20221223-085735~2

@toddsundsted (Owner):

thanks!

some good news on my end. after initial growth, the last 3-4 days on top of 2.0.0-6 have been flat for me, even when i look at the hourly data. i'm going to deploy again today, so i'll lose the trend—i'm going to keep this open a little longer to see if there are any data points refuting the improvements.

image

@vrthra (Contributor, Author) commented Dec 28, 2022

So, I have a Ktistec 2.0.0-6 instance that has been running for some time. My chart is below. I am not entirely sure why the free KB and heap-size increase hand in hand.

image

On the other hand, the machine's memory utilization is as follows:
image

image

That is, while there is a huge improvement, there is still a steady climb in memory usage.

@toddsundsted (Owner):

> I am not entirely sure why the free KB and heap-size increase hand in hand.

i see this, as well. i'm reporting the values returned by the garbage collector, but it's possible i'm interpreting them incorrectly.

fwiw, over the last couple of days i saw a dip in heap for the first time. unfortunately, i'm not monitoring the process size outside of the app, so i don't know whether there was a corresponding reduction there.

image

@vrthra (Contributor, Author) commented Jan 1, 2023

@toddsundsted the larger memory leak is back again after upgrading to 2.0.0-7.

image

image

@vrthra (Contributor, Author) commented Jan 16, 2023

I think you have solved the memory leak issue for now; here is what I have from the first week of January. I will keep monitoring for a few more days.

image

image

@toddsundsted (Owner):

thanks! yes, it's definitely been more consistent. i wish i knew better what fixed it.

@vrthra (Contributor, Author) commented Jan 29, 2023

My current stats after a month of monitoring:
image

image

image

Summary: the main leaks are now gone, but a small leak of ~5% per month still seems to remain. This is running the previous version of Ktistec.

@toddsundsted (Owner):

@vrthra i'd be interested to know if you are running the latest version of ktistec, and if you are still seeing a leak. i've been able to run the latest for extended periods and memory definitely plateaus for me.

image

@vrthra (Contributor, Author) commented Jul 2, 2023

This is what I currently have. For some reason, I keep having to restart my instance because it becomes unresponsive. I am still monitoring it (last restarted 10 days ago).

image

@vrthra (Contributor, Author) commented Jul 18, 2023

The memory is no longer growing:
image
But my ktistec instance crashed again:
image
I have a 365MB log leading up to the crash. Would you be interested in any part of it?

@toddsundsted (Owner):

@vrthra if there's any indication at all toward the end of the log file about what led up to the crash, i'd love to see that. a stack trace or other abnormal error message.
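
even just the tail of the file would help; something along these lines, with an arbitrary line count:

tail -n 5000 ktistec.log > ktistec-tail.log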

@vrthra (Contributor, Author) commented Aug 27, 2023

I have now checked it twice, and both times Ktistec seems to have just exited without an error or a trace. The process just exits. Any idea what I should be looking for in the log, or what the cause could be? The log is 500 MB.

@vrthra (Contributor, Author) commented Aug 27, 2023

Also, my graph now looks like this.

image image

@toddsundsted (Owner):

> I have now checked it twice, and both times Ktistec seems to have just exited without an error or a trace.

in my experience this happens when the OS kills the process. there does appear to be a memory spike over the last couple days. did anything out of the ordinary happen then? what are the two dates when you checked and it had been killed? i'm guessing ~8/27 and ~7/18? or was it twice recently?
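
if it was the kernel's oom killer, the host's kernel log should say so. either of the following (run on the host, not inside the container) should show a "Killed process" entry naming the server:

journalctl -k | grep -i -e oom -e 'killed process'
dmesg -T | grep -i -e oom -e 'killed process'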

@vrthra (Contributor, Author) commented Aug 28, 2023

Nothing that I know of; I do see spikes in memory, disk reads, and load around the last crash/exit, as you mentioned. Yes, as you guessed, the crashes/exits were on 2023-07-18 and 2023-08-27; the OCI dashboards are below.

image image

@vrthra (Contributor, Author) commented Oct 16, 2023

An update: I now see consistent memory exhaustion.

image

The corresponding metrics are below; I could only capture these after the server restart.

image

@toddsundsted (Owner):

thanks! you're seeing what i'm seeing at this point. i'm pretty confident i've fixed the straight-up programming errors in my code that contributed to past problems, but i can't rule out the libraries i use or the behavior of the garbage collector itself.
image
in my case, you'll notice a huge spike a short time ago that doesn't correspond to anything i can determine, traffic-wise, from the logs. my general sense is that the boehm garbage collector just isn't great for my particular use case.
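
for anyone who wants to experiment with that theory: bdwgc (the boehm collector crystal links against) reads tuning knobs from the environment at startup, assuming the bundled build keeps that support enabled (the default). the values below are illustrative, and ./server stands in for however you launch the binary; GC_PRINT_STATS dumps collection stats to stderr, and a larger GC_FREE_SPACE_DIVISOR trades extra gc work for a smaller heap:

GC_PRINT_STATS=1 GC_FREE_SPACE_DIVISOR=10 ./server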

@toddsundsted (Owner):

fwiw, i've also been running the exact same server for the last ~6-7 months. the restarts are due to OS updates that required a reboot.
