
Possible memory leak #45

Open
vrthra opened this issue Nov 17, 2022 · 52 comments

Labels: bug (Something isn't working)

Comments

@vrthra (Contributor) commented Nov 17, 2022

There is a possible memory leak. Here is my memory utilization:

memory

At this point, the machine becomes unresponsive and has to be rebooted. It is an Oracle Cloud instance, and Ktistec is the only application running on it.

$ uname -a
Linux instance-web 5.15.0-1022-oracle #28-Ubuntu SMP Wed Oct 26 09:38:05 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.1 LTS
Release:	22.04
Codename:	jammy
$ free -m
               total        used        free      shared  buff/cache   available
Mem:             964         217         107           1         638         595
Swap:              0           0           0
@toddsundsted (Owner):

how many inbound messages are you typically receiving in a given day? i'm assuming you're operating at about my scale, but i want to check. my vps has 1 gig of memory, and my instance fits comfortably in that. memory usage does increase, but it doesn't grow to the point of a crash. at least for me.

@vrthra (Contributor, Author) commented Nov 17, 2022

Counting the posts I see on the front page, I got just 9 posts yesterday before the server became unresponsive. I am not sure how many messages (pings, etc.) I received. I have started collecting my logs; I will count the number of lines before this happens again and post an update.

@toddsundsted (Owner):

ok, i get 30-40 posts a minute on average. there's usually a pretty steep memory ramp when i first start the server, but then it generally plateaus, and it will run for days or weeks like that. i've had one or two crashes historically, but those might have been the process being killed off for memory issues unrelated to the server itself.

that's not to say there's not a memory leak. i'm keeping an eye on my instance in case i introduced a problem recently.

@vrthra (Contributor, Author) commented Nov 17, 2022

For me, there was a small, steep climb in the first hour, followed by a dip, and then a constant climb over 12 hours. See this image.

memleak

I will post an update on what happens today; so far, it has plateaued.

memleak

@vrthra (Contributor, Author) commented Nov 19, 2022

Here is what I have currently; there is a small but noticeable climb over two days.

memleak

@toddsundsted (Owner):

i'm trying to find an easy way to chart the memory usage/growth on my server. i don't believe i've done anything that would prevent instances from being garbage collected, but i could be wrong, or there could be bugs in the core libraries.

@JayVii (Contributor) commented Nov 21, 2022

I have not monitored my server in detail yet, but ktistec (in Docker) climbed from 600MB to 1600MB over the past 5 days. I have been a bit more active over the weekend, though, so I am not sure whether the growth is actually a problem.

Edit: that's on a VPS with 4GB RAM

@toddsundsted (Owner):

memory usage should generally be flat. the server doesn't, for example, intentionally try to keep an increasing subset of data in memory. so if you're seeing growth like that day after day, then i'm confident you're seeing a memory leak. i'm going to add additional logging to dump gc and memory stats periodically to see if that helps shed light on what's going on.
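
something along these lines, an illustrative sketch only (not the actual ktistec code), using the stats the crystal runtime already exposes through GC.stats:

# illustrative sketch: inside a long-running server, periodically dump GC stats
spawn do
  loop do
    stats = GC.stats
    puts "heap_size=#{stats.heap_size} free_bytes=#{stats.free_bytes} total_bytes=#{stats.total_bytes}"
    sleep 60.seconds
  end
end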

@toddsundsted (Owner):

commit 4ce1d3e introduces some code to capture memory information.

toddsundsted added the bug label Nov 23, 2022
@vrthra (Contributor, Author) commented Nov 23, 2022

Thanks; I have deployed this version. Let me know what to look for and report.

@vrthra (Contributor, Author) commented Nov 23, 2022

Is there any way to see the current commit/version from a deployed instance?

@toddsundsted (Owner):

there is not. i thought the version number would be more useful—i didn't count on all the early adopters building from main.

@toddsundsted (Owner):

@vrthra i've found a pretty good candidate bug for the memory leak. i believe sqlite is caching prepared statements for all queries issued. this wouldn't be a huge problem—queries should be parameterized—except in one query i embedded an id directly in the string instead of passing the value in as a parameter. this is in code that handles inbound activities, so it results in a lot of caching of non-reusable statements. i don't know if this is the whole problem, but in my local tests it's made a big difference in footprint over time.
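
to illustrate the difference, here's a minimal sketch against the objects(id, iri) table using the crystal-db and sqlite3 shards (not the actual ktistec code):

require "db"
require "sqlite3"

iri = "https://example.com/objects/1"

DB.open("sqlite3://./db/ktistec.db") do |db|
  # interpolating the value produces a different SQL string on every call, so a
  # statement cache keyed on the SQL text grows without bound:
  #   db.query_one?("SELECT id FROM objects WHERE iri = '#{iri}'", as: Int64)

  # binding the value keeps the SQL text constant, so one cached prepared
  # statement is reused for every call:
  id = db.query_one?("SELECT id FROM objects WHERE iri = ?", iri, as: Int64)
  puts id
end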

@toddsundsted (Owner):

@vrthra i created a branch with some changes that seem to make things significantly better for me:

https://github.com/toddsundsted/ktistec/tree/memory-problems

i need to make some improvements to the code before i put it in main, but it should improve things for you, as well.

@vrthra (Contributor, Author) commented Nov 25, 2022

After deploying this, I get

docker run -p 80:3000 -v `pwd`/db:/db -v `pwd`/uploads:/uploads  -it social:latest
1
2
GC::Stats(@heap_size=786432, @free_bytes=24576, @unmapped_bytes=0, @bytes_since_gc=749696, @total_bytes=749696)
Unhandled exception: more than one row (DB::Error)
  from /build/lib/db/src/db/query_methods.cr:81:9 in 'query_one:args'
  from /build/src/framework/model.cr:246:9 in 'find:iri:include_deleted:include_undone'
  from /build/src/framework/model.cr:256:9 in 'find?:iri:include_deleted:include_undone'
  from /build/src/models/activity_pub/activity/delete.cr:9:5 in 'object?'
  from /build/src/rules/content_rules.cr:43:3 in 'each_match'
  from /build/lib/school/src/school/domain/domain.cr:95:15 in 'each_match:trace'
  from /build/lib/school/src/school/domain/domain.cr:123:11 in 'run'
  from /build/src/rules/content_rules.cr:87:5 in 'run'
  from /build/src/utils/database.cr:41:15 in 'recreate_timeline_and_notifications'
  from /build/src/utils/database.cr:16:5 in 'recreate_timeline_and_notifications'
  from /build/src/database/migrations/000019-update-timeline-and-notifications.cr:5:1 in '->'
  from /build/src/framework/database.cr:60:41 in 'do_operation'
  from /build/src/framework/framework.cr:136:14 in '__crystal_main'
  from /usr/share/crystal/src/crystal/main.cr:115:5 in 'main_user_code'
  from /usr/share/crystal/src/crystal/main.cr:101:7 in 'main'
  from /usr/share/crystal/src/crystal/main.cr:127:3 in 'main'
  from src/env/__libc_start_main.c:95:2 in 'libc_start_main_stage2'
make: *** [Makefile:4: ktistec] Error 1

Any idea?

@toddsundsted (Owner):

it looks like your database has two activitypub object records with the same id/iri. that migration rebuilds your notifications and timeline. the server is processing a delete activity but it's finding more than one object. strictly speaking you probably don't need to run that migration, but the duplicate object is disturbing. i'm at a loss at the moment as to how it could have been created. something like the following query should return the duplicates. i'd be interested to know how many you have:
select iri from objects group by iri having count(id) > 1;

@vrthra (Contributor, Author) commented Nov 25, 2022

It doesn't seem like that is causing it.

~/ktistec$ sqlite3 db/ktistec.db 'select iri from objects'  | wc -l
5287
~/ktistec$ sqlite3 db/ktistec.db 'select iri from objects group by iri having count(id) > 1'
~/ktistec$ 

@toddsundsted (Owner):

ok. i don't have a better theory about the trace at the moment. thanks for testing!

@vrthra (Contributor, Author) commented Nov 25, 2022

How do I turn off the migration?

@toddsundsted (Owner):

you can either delete src/database/migrations/000019-update-timeline-and-notifications.cr before you build, or give me ~1 more day to push out the changes in this branch. those changes will include a try/catch in that migration that logs the stack trace but continues running.
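
in crystal, that protection looks roughly like the following; this is an illustrative sketch with a stand-in for the real migration step, not the code that actually shipped:

# hypothetical stand-in for the migration step that raised DB::Error above
def recreate_timeline_and_notifications
  raise "more than one row"
end

begin
  recreate_timeline_and_notifications
rescue ex
  # report the stack trace, but let startup continue instead of aborting
  STDERR.puts ex.inspect_with_backtrace
end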

@toddsundsted (Owner):

@vrthra all changes have been merged into main, including some protection against the migration failing—you'll still see the error messages but the server should continue to start.

i'd be interested in how many error messages you get, along with any other logging output. i don't get any errors running that migration, so i want to get a better understanding of what you're seeing.

@vrthra (Contributor, Author) commented Nov 27, 2022

These are the logs from startup. I can also supply my ktistec.db or give you access to my Oracle machine if that helps. I have no private information on that machine, and I am happy to help if you want to debug what is happening.

ktistec.log

@toddsundsted (Owner):

weird weird weird. i'm guessing you're following some server i'm not, and it's sending something i don't expect. maybe it's another aspect of the guppe problems. the one common thing is that it's only happening when processing deletes, so it's probably not noticeable in your timeline or notifications, but if you see anything that looks amiss there, let me know. i've signed up to guppe now, so if it shows up in any way related to that, i should see it myself...

@toddsundsted (Owner):

@vrthra have you seen any improvement in memory usage on your system?

@vrthra (Contributor, Author) commented Nov 28, 2022

Yes, it is much better now! Thank you for the updated version. This was deployed on November 27th; see the memory graph after that point.
memory
I will keep an eye on this for a few more days and let you know.

@vrthra (Contributor, Author) commented Nov 29, 2022

The original huge growth in memory is gone, but there is still a consistent small increase -- 62% to 66% over the first day and 66% to 69% over the last day.

memory

Screen Shot 2022-11-30 at 6 03 23 am

Screen Shot 2022-11-30 at 6 04 27 am

(Three days later; notice the climb.)
Screen Shot 2022-11-30 at 6 04 27 am

@felixkrohn:

I have been running the build that includes the latest fixes for only about 24h in total, but I can confirm the big leak is gone:
image
I can't yet confirm the ever-so-slight increase; I will have to wait some more time to discern a pattern amid the GC-related ups and downs:
image

@toddsundsted (Owner):

thanks! there are still a few more possibilities. one is that there is another instance somewhere of a prepared statement with an interpolated value. sqlite also allocates pages for its own work, and i haven't looked to see how those are managed.

generally, what i want to understand now is whether this plateaus and/or eventually crashes. i'll set some vm limits the next time i start my instance and see what happens when it bumps into them.
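
for anyone who wants to run the same experiment, a docker deployment like the one shown earlier can be capped with docker's memory flags (the 512m value here is arbitrary, and a non-docker deployment could use a systemd MemoryMax= setting instead):

docker run --memory=512m --memory-swap=512m -p 80:3000 -v `pwd`/db:/db -v `pwd`/uploads:/uploads -it social:latest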

@vrthra (Contributor, Author) commented Dec 5, 2022

My memory growth is fairly consistent. Any logs I can get you?

mem

@toddsundsted (Owner):

@vrthra the output from select * from points where chart = "heap-size"; would be useful. that captures the heap size as it appears to the crystal language garbage collector. if that's constant, then the only other source of allocations should be sqlite. this will help me diagnose that case. thanks!
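
following the pattern of the commands you ran earlier (assuming the database is still at db/ktistec.db), something like this captures it to a file:

sqlite3 db/ktistec.db 'select * from points where chart = "heap-size"' > out.log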

@vrthra (Contributor, Author) commented Dec 7, 2022

@toddsundsted please find attached the output of the query

out.log

@toddsundsted (Owner):

@vrthra thanks! it looks like there's more hunting to be done!

@toddsundsted (Owner):

update on this: i've added a chart to the metrics page that shows heap size, to make it easier to track this. i still see heap growth over time, even on my instance, so the problem is still there.

@JayVii (Contributor) commented Dec 23, 2022

Just so there is data from another ktistec deployment in here: memory usage seems to have changed for me since deploying 2.0.0-6.

The first server restart is the update to 2.0.0-5, when collection of heap-size data starts. The second restart is the update to 2.0.0-6.

Also, when upgrading to 2.0.0-6, I deleted duplicated rows from the DB, so maybe that's another factor(?).

Screenshot_20221223-085735~2

@toddsundsted (Owner):

thanks!

some good news on my end. after initial growth, the last 3-4 days on top of 2.0.0-6 have been flat for me, even when i look at the hourly data. i'm going to deploy again today, so i'll lose the trend—i'm going to keep this open a little longer to see if there are any data points refuting the improvements.

image

@vrthra (Contributor, Author) commented Dec 28, 2022

So, I have a Ktistec 2.0.0-6 instance that has been running for some time. My chart is below. I am not entirely sure why the free KB and heap-size increase hand in hand.

image

On the other hand, the machine's memory utilization is as follows:
image

image

That is, while there is a huge improvement, there is still a steady climb in memory usage.

@toddsundsted (Owner):

> I am not entirely sure why the free KB and heap-size increase hand in hand.

i see this, as well. i'm reporting the values returned by the garbage collector, but it's possible i'm interpreting them incorrectly.

fwiw, over the last couple of days i saw a dip in heap for the first time. unfortunately, i'm not monitoring the process size outside of the app, so i don't know whether there was a corresponding reduction there.

image

@vrthra (Contributor, Author) commented Jan 1, 2023

@toddsundsted the larger memory leak is back again after upgrading to 2.0.0-7.

image

image

@vrthra (Contributor, Author) commented Jan 16, 2023

I think you have solved the memory leak issue for now; here is what I have from the first week of January. I will keep monitoring for a few more days.

image

image

@toddsundsted (Owner):

thanks! yes, it's definitely been more consistent. i wish i knew better what fixed it.

@vrthra (Contributor, Author) commented Jan 29, 2023

My current stats after a month of monitoring:
image

image

image

Summary: the main leaks are now gone, but a small leak of ~5% per month still seems to remain. This is running the previous version of Ktistec.

@toddsundsted (Owner):

@vrthra i'd be interested to know if you are running the latest version of ktistec, and if you are still seeing a leak. i've been able to run the latest for extended periods and memory definitely plateaus for me.

image

@vrthra (Contributor, Author) commented Jul 2, 2023

This is what I currently have. For some reason, I keep having to restart my instance because it becomes unresponsive. I am still monitoring it (last restarted 10 days ago).

image

@vrthra (Contributor, Author) commented Jul 18, 2023

The memory is no longer growing:
image
But my ktistec instance crashed again:
image
I have a 365MB log leading up to the crash. Would you be interested in any part of it?

@toddsundsted (Owner):

@vrthra if there's any indication at all toward the end of the log file about what led up to the crash, i'd love to see that. a stack trace or other abnormal error message.
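
even just the tail of the file would help; something along these lines, with an arbitrary line count:

tail -n 5000 ktistec.log > ktistec-tail.log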

@vrthra (Contributor, Author) commented Aug 27, 2023

I have now checked it twice, and both times Ktistec seems to have just exited without an error or a trace. The process just exits. Any idea what I should be looking for in the log, or what the cause could be? The log is 500 MB.

@vrthra (Contributor, Author) commented Aug 27, 2023

Also, my graph now looks like this.

image image

@toddsundsted (Owner):

> I have now checked it twice, and both times Ktistec seems to have just exited without an error or a trace.

in my experience this happens when the OS kills the process. there does appear to be a memory spike over the last couple days. did anything out of the ordinary happen then? what are the two dates when you checked and it had been killed? i'm guessing ~8/27 and ~7/18? or was it twice recently?
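
if it was the kernel's oom killer, the host's kernel log should say so. either of the following (run on the host, not inside the container) should show a "Killed process" entry naming the server:

journalctl -k | grep -i -e oom -e 'killed process'
dmesg -T | grep -i -e oom -e 'killed process'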

@vrthra (Contributor, Author) commented Aug 28, 2023

Nothing that I know of; I do see spikes in memory, disk reads, and load around the last crash/exit, as you mentioned. Yes, as you guessed, the crashes/exits were on 2023-07-18 and 2023-08-27; the OCI dashboards are below.

image image

@vrthra (Contributor, Author) commented Oct 16, 2023

An update: I now see consistent memory exhaustion.

image

The corresponding metrics are below; I could only capture these after the server restart.

image

@toddsundsted (Owner):

thanks! you're seeing what i'm seeing at this point. i'm pretty confident i've fixed the straight-up programming errors in my code that contributed to past problems, but i can't rule out the libraries i use or the behavior of the garbage collector itself.
image
in my case, you'll notice a huge spike a short time ago that doesn't correspond to anything i can determine, traffic-wise, from the logs. my general sense is that the boehm garbage collector just isn't great for my particular use case.
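
for anyone who wants to experiment with that theory: bdwgc (the boehm collector crystal links against) reads tuning knobs from the environment at startup, assuming the bundled build keeps that support enabled (the default). the values below are illustrative, and ./server stands in for however you launch the binary; GC_PRINT_STATS dumps collection stats to stderr, and a larger GC_FREE_SPACE_DIVISOR trades extra gc work for a smaller heap:

GC_PRINT_STATS=1 GC_FREE_SPACE_DIVISOR=10 ./server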

@toddsundsted (Owner):

fwiw, i've also been running the exact same server for the last ~6-7 months. the restarts are due to OS updates that required a reboot.
