Autopurging does not delete unread news items #110
Comments
Similar to this: owncloud-archive/news#1024
The number is added to the maximum number of entries the feed served at some point, to prevent read items reappearing as unread.
Hi, thanks for the fast reply. :) So basically you are saying: when a feed has the 100 most recent items (the number of feed items is always 100, but the items change over time) and the "Maximum read count per feed" is 10, then these two numbers add up? So in the Nextcloud News app there should never be more than 110 items?
Exactly :) You can view the number in the oc_news_feeds table (articles_per_update). The logic itself is covered by an integration test, so I'm pretty sure that it works (though there's still a chance it doesn't xD)
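For reference, checking that value is a one-liner; a minimal sketch, assuming only the table and column named above (the remaining columns may differ between versions):

```sql
-- Show how many entries each feed served at some point (articles_per_update).
SELECT id, title, articles_per_update
FROM oc_news_feeds;
```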
Ok cool. I also investigated the code a little bit; however, my SQL and PHP are a bit rusty. 😅 In my current ownCloud instance I have this problem too. For example, the China Daily feed has an "articles_per_update" of 300 and the maximum read count is configured to 60. However, when I check the database I can find over 3800 items with the feed id of China Daily. The ownCloud News app is a bit outdated though. I also pulled the newest Nextcloud Docker image and installed the News app (see above). If the problem still exists I will report back in a few days :)
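A count along these lines is one way to verify this; a sketch only, since the oc_news_items table is only named later in this thread and the feed_id column is an assumption about the schema:

```sql
-- With the numbers above, this should stay at or below 300 + 60 = 360.
SELECT COUNT(*)
FROM oc_news_items
WHERE feed_id = 42;  -- hypothetical id of the China Daily feed
```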
Hey, I also had a look at the code. What I don't understand is this line: … When I uncomment the latter part, the autopurge process seems to work and the number of China Daily items stays constant at 310. I'm not sure if this is intended or not. But with the …
I have several feeds that get updated often, so they show "999+" after some time. This lags the database quite heavily because the old feed items don't get purged. I logged this using the MySQL slow query log, as mentioned in the Nextcloud developer manual, and here is the output:
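For anyone who wants to repeat the measurement, the slow query log can be enabled at runtime with standard MySQL system variables; nothing here is News-specific:

```sql
-- Log every statement slower than one second to a dedicated file.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';  -- example path
```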
The only app lagging the database is basically the News app; no other queries were logged. Sometimes this results in heavy load, leading to the effect that the feeds are not updated anymore. Only resetting the News app and importing a backup *.opml file solves the problem.
This might be related to #191; the queries are slow because of the way status is handled. So perhaps once that is fixed, this issue becomes a non-issue. I'm trying to wrap my head around what is intended with the purging code here, but that needs a bit more study. It does look like there's a bug in this bit of code indeed, in that it doesn't delete unread items, which then accumulate quickly on high-traffic feeds.
Diving into it a bit deeper, it appears to me unread items are not purged by design, the assumption being that we should never throw away items that the user hasn't seen yet. The "Maximum read count per feed" mentioned above is described in the admin settings for the News app as "Defines the maximum amount of articles that can be read per feed which won't be deleted by the cleanup job". There is no mechanism that purges unread items, I think.

Given the experiences reported in this issue, we probably should introduce a mechanism to purge unread items if they get above a certain (configurable?) threshold. I would propose a new configuration option 'Maximum items per feed'. If a feed has more than that number of items, the excess items are purged. The default setting for 'Maximum items per feed' should be 'disabled', so that unread items are never deleted.

Example: 'Maximum items per feed' is set to 1000 and a feed has 1500 items. Then the oldest 500 items (determined by id) are purged, regardless of whether they are read or unread, with the exception of starred items of course.

Just out of curiosity, how is anyone managing to keep up with feeds that have that many articles?
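To make the example concrete, the proposed purge could boil down to a single statement per feed. This is a sketch under assumptions, in particular the starred column name, since the actual schema tracks item state differently (see #191):

```sql
-- Hypothetical 'Maximum items per feed' = 1000 on a feed holding 1500 items:
-- purge the 500 oldest items by id, read or unread, but never starred ones.
DELETE FROM oc_news_items
WHERE feed_id = 42       -- the feed that exceeded the limit
  AND starred = 0        -- assumed column name; starred items are always kept
ORDER BY id ASC
LIMIT 500;               -- 1500 present minus 1000 allowed
```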
@sonologic Well, I have a bunch of high-volume forum feeds which I like to look through from time to time. The News app is a good place to keep them all together, so I only have to go to one location. Also, being part of Nextcloud, I do not have to run another newsfeed application, which is a big bonus. A limit on items would be very much appreciated, because it is what puts a high load on the database, as I showed above. It gets out of control after some time, and the only way to get rid of the problem is to reset the News app and reimport the *.opml file. Also, for people with less powerful machines, this is an even bigger problem I suppose. The News app shouldn't be a power hog just because a few feeds produce high volumes of new items in a short time. It doesn't have to be a default setting though; just deactivate it by default and let the user decide what to do, and/or put it in the documentation.
I changed this line to … and this works perfectly for me. With this fix the number of feed items is limited to "articles_per_update" + "maximum read count".
@fabolhak while that may be perfect for you, for a lot of other people it is far from perfect. Deleting unread items is unexpected behaviour; it must be a conscious opt-in, not the default. Also, probably with #191 fixed, the load on the database will no longer manifest as it does now.
Any progress on this? I too would like to allow deleting of items that are not marked as 'read'. Today I had trouble updating my Nextcloud instance because my SQLite db had grown to >600 MB, and my VPS was so low on space that the db couldn't be converted. I ended up doing a 'delete from oc_news_items' and a vacuum, and got the db down to 33 MB. Perhaps an option to keep X unread items, or to delete items older than X days, would do? I use the News app to skim through news feeds and wouldn't ever want to go back more than a few weeks. I guess the root cause is that I use my own app on Sailfish to sync from my Nextcloud instance and I never set the read status. Perhaps I should set the read status of all items after I've downloaded them, but still, the db shouldn't grow to an unmanageable size.
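For illustration, the suggested 'delete if older than X days' option would amount to something like this in SQLite; the column names (starred, last_modified) and the timestamp unit are assumptions about the schema, and deleting unread items this way is exactly the opt-in behaviour under discussion:

```sql
-- Drop everything older than 30 days except starred items, then reclaim space.
DELETE FROM oc_news_items
WHERE starred = 0
  AND last_modified < CAST(strftime('%s', 'now', '-30 days') AS INTEGER);
VACUUM;  -- shrink the SQLite file after the mass delete
```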
In my opinion there is a misunderstanding in this function. Although it is named …
A simple change from … to … should do the trick.
Update: More convenient of course would be a new function … I think I will do a PR.
You do not want to delete unread items because you didn't read them yet. Works as intended.
It may work as intended, but this could be considered a feature request. There is definitely a use case for deleting old news, read or not, so as not to exceed disk space, if the user wishes that.
I know the app has been pretty much unmaintained for quite a while, but this ever-growing db is no fun. At the moment my db (lots of feeds and users, though) holds 28 GB of feed items alone, which is 95% of the entire Nextcloud database, with 9,709,932 unread vs. 702,188 read items. Seeing how much of the db is used by unread items, I want to put things in order and purge as many items as possible to prevent the enormous db growth, without being forced to disable the News app altogether.
At one point I thought about normalizing feeds and feed items.
We could have the same benefits:
Just wanted to put my idea in, because in this case it could help reduce the db size. Let me know what you think and whether I should create a separate issue for it, or if it's just random nonsense.
Right, it's a trade-off. The reason I decided against it back in the day was simplicity: you make your data model a lot harder to use and change for a theoretical gain (the assumption being that people have the same feeds), especially when you can have many different versions of a feed. Your assumption requires feed ids to be unique, which doesn't hold up in practice: you can get a different version of a feed based on authentication parameters, full-text feed settings, IP ranges, client headers, etc. You also need to anticipate future changes. Even migrating this stuff could take days for certain installations (e.g. 1000+ users).
No it doesn't. It makes absolutely no sense to keep read items, so the threshold for autopurging was always meant to be the threshold for unread items. Every comment in the code indicates that; there is always talk of …
Config.php:27
itemService.php:250
itemMapper.php:335
I made a PR #506 which removes the …
Autopurge works in the following way (even if the comments are wonky; sorry, but I implemented this stuff):
The "garbage" that you mean is basically stuff that you did not read yet. That's why you have a feed reader. If you (don't) want to read it, mark it as read and it will be cleaned up for you automatically. There isn't really a benefit in deleting unread items since data usage is minimal and the original use case for this feature is to not accumulate more and more data in your table that you are not interested in any more. |
Yes, but that is still a viable benefit... I have had cases of my Nextcloud db failing to upgrade because it had grown so large that I ran out of space on my hosting when the db upgrade occurred, and the space was taken by unread articles. It's easy to say 'mark them as read', but that can mean endlessly scrolling through old stuff I'm never going to read. I'm sure a single option of 'also delete unread items', defaulting to off, would be good; but in the meantime I may try the version from the PR.
@piggz did you see the mark-all-as-read button? And you can trigger the cleanup (aka the cron jobs) manually using the same php command that you use to update stuff, or just wait until it runs as configured.
@BernhardPosselt I hadn't, although I don't use the web app; I use my own mobile app. I presume there is an API call I can make to mark all as read?
Yep
Yes, this is exactly how it works, but say: what is the use in keeping read items after all? If you want to keep them you should star them.
Marking items as read which I do not want to read is a workaround, not a solution. Right now, items I do not want to read persist in the database and create garbage; this is a fact. Many people here complain about an ever-growing table because of this issue, including me. Since my last comment several months have passed, which is the time it took my database to grow so much that I came back to this issue.
This is ridiculous. You are suggesting that I instruct all users of my Nextcloud that they have to mark all items as read, whether they read them or not, in order to keep my database clean. You should admit that at some point items are no longer news and should be cleaned up automatically in order to keep the database sane. My PR does exactly that, and it simplifies the cleanup process as well. Please consider it.
RSS is not limited to news per se, so your pull request would nuke stuff like podcasts and other things. I think I understand your use case now: you basically want an incomplete stream of fresh entries, limited to something like "show me what's relevant for today". That behaviour is probably problematic on a global level: not every feed is strictly out of date, so if at all, this would have to be configured on a per-feed level IMHO. And then you probably want to keep stuff by date.
That is an entirely different issue: what about users importing tons of contacts and files? In general you need to talk to your users and trust them not to abuse it. On bigger installations this is probably an issue, but there you have a lot more resources available; I personally know of a setup at a Finnish university which works just fine. What could be done, though, is to build in quota support, as in: letting users only allocate a specific maximum amount of MBs in your db and otherwise refusing to update feeds. You could also expose that using the feed warning APIs already in place. That way you don't have data loss for older items, and a user should be able to fix it quickly.
How is that? Can you explain, please?
What do you mean by "not every feed is strictly out of date"? As I said: set the items to keep to 1000, which will hold 1000 items per feed. I cannot imagine a use case where anybody would want to read the 1001st item.
Abuse and misuse are two different things. A user has to act manually to import files and contacts; that is deliberate, and the user can be held responsible for it. A user importing 10 feeds, each generating 100 items per day (and there are such feeds), without reading them cannot be held responsible for that, although it fills the Nextcloud database with garbage.
I think you overcomplicate things. Also, you put the responsibility back on the user, which is a bad idea IMO; the user will not understand. I suggest keeping the items in a simple FIFO of a definable length (this already exists). If items disappear too soon, then increase the length. But making a distinction between read and unread items regarding whether they should be kept leads to confusion and misunderstanding (which is already the case). Cheers
Think of feeds containing images, videos, open issues in your customer management system (like GitHub, for instance, where read means: done), error messages from logs, torrents, movie episodes and their availability, music, etc. If you delete those, you basically lose data. Therefore you can also not use a FIFO queue, because you would be deleting data that is not yet marked for deletion. As an analogy: think of a bookmarks application that just deletes your old bookmarks because you went over a global limit.
You don't really know what a safe amount of unread articles is. So people will increase it to something very big, which in turn will put in much more garbage than the current implementation.
You are arguing for deleting user data rather than explaining to the user that they can only keep a limited amount and their data usage is full. Lots of software, like Dropbox and the Files app, does that, and it works just fine.
I see. You are abusing the news aggregator as a content aggregator. Now it makes sense that you do not run into the problems described here. I use it as a replacement for https://feedly.com, which does not save obsolete items.
You are comparing direct data with indirect data. With a bookmarks application I decide whether to store a new bookmark as well as whether to remove an old one. When I do nothing, no new bookmarks are created at all.
I think I said enough about the garbage the current implementation produces.
I cannot follow your argumentation about deleting user data. I think we both have different understandings of what information a news item carries. Anyway, since I seem to be the only one having this issue with this app, I cannot be bothered anymore. I have a patch to the code, and it is easier to apply it after each update than to keep trying to convince you of my perspective :) Cheers
Yes. It really depends on what you are subscribing to, and therefore deleting stuff should be pretty conservative and not surprising. Dealing with high-frequency feeds in a manner like "show only stuff from the last 24h" could be appealing in certain cases, I guess. But it should be configurable on a per-feed basis and use a time span (compared to your quick PR, which is global and based on the number of items).
200 articles might sound like a lot, but they take up almost no space in your database; the comic feed that you mention is probably taking up 200 kB of disk storage at most. Sure, there are people who subscribe to 100 high-frequency feeds, don't read them, and maybe pile up an extreme of 1 GB of text inside their db over a few months. This issue, however, sounds more like a quota thing to me.
Having a per-feed setting like "Automatically delete all entries older than NUMBER days" would be perfect for my use case.
At this moment I have 13 million unread news items vs. 600,000 read ones (about 40 GB of data no one will ever read through). I wrote a small script that marks everything older than X days as read and then cleans up, but people started reporting weird issues where items reappeared, etc. I think, especially in the case of shared instances where people basically don't realize the litter they cause, there should be something that automatically purges things older than a certain age.
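The script itself was not posted; its core would presumably be a single statement along these lines (a sketch; unread, starred, and last_modified are assumed column names):

```sql
-- Mark everything older than 30 days as read so the regular cleanup can purge it.
UPDATE oc_news_items
SET unread = 0
WHERE starred = 0
  AND last_modified < UNIX_TIMESTAMP(NOW() - INTERVAL 30 DAY);
```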
@BernhardPosselt Could you please give me a hint as to which integration test covers purging? (You mentioned it earlier.) Method …

In another comment you suggested it's a quota thing. Could you elaborate on that?

In general, I'm analysing the options we've got to reduce the resource usage of a certain Nextcloud instance:
In order for such a quota mechanism to be safe and useful, it would need to:
Any hints, comments?
@pefimo Are you working on that? I'm interested in such a feature, too. After not reading my feeds for a year, this took down my database. In my opinion a per-feed setting like "purge items after n days" would be great. I'm willing to help implement this feature.
@NilsGriebner Yes I am, but slowly... At the moment I'm analysing the code, setting up a development environment, and so on. I'd be very glad to cooperate; thanks! However, I don't visit GitHub often. Do you think we could move this cooperation somewhere else? You can reach me via email (pfm at untalkative.one) or Jabber (pfm at disroot.org).
@pefimo I'll contact you.
@pefimo and I discussed this over the last few days and agreed on this high-level concept: we'd like to implement a quota system so that an administrator is able to create quotas and attach them to specific users. A quota data structure will consist of:
The quotas will be persisted in an additional db table; the relation between user and quota could be realized with a foreign key in the user table. This architecture will enable us to expire feed items which are too old and/or keep track of disk space. What do you think? Could this be a valid solution? PS: @pefimo thank you for the awesome collaboration. It's a great pleasure working with you 😺 ❤️
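To make the concept tangible, the quota tables could take a shape like this; every name here is hypothetical, nothing is an agreed-on schema:

```sql
-- One quota row can be shared by many users; NULL means "no limit".
CREATE TABLE news_quotas (
    id           INTEGER PRIMARY KEY,
    max_item_age INTEGER,  -- days after which items expire
    max_bytes    INTEGER   -- cap on disk space used by a user's items
);

-- The user-to-quota relation via a foreign key, as described above.
CREATE TABLE news_user_quotas (
    user_id  VARCHAR(64) PRIMARY KEY,
    quota_id INTEGER REFERENCES news_quotas(id)
);
```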
Thank you too, @NilsGriebner! I'm looking forward to the implementation phase. 🙂
@pefimo sounds good
This truly is a blocker. I believe on my instance only two people are using the News app, yet each time I update it, my instance stays in maintenance mode for several hours because it takes so long to update the news table, making Nextcloud unusable during that time for all users.

I cannot understand how the attempts to fix this have been rejected with the argument that they break existing functionality. The current implementation makes Nextcloud completely unusable in some situations. How is some people losing some old news items more important than people not being able to access their files, calendar, passwords, etc. for several hours?

I agree that when I'm subscribed to a blog that makes one or two posts a month, I wouldn't like unread (or even read) items to be removed after one or two months (or ever). But when I'm subscribed to a news website that posts 100 articles a day, it just doesn't make sense to keep the last 100,000 unread items in the database. With the infinite scrolling that the News app uses, it would probably take me a week to even scroll down that far, and my browser would probably crash on the way.

Both as a user and as an administrator, the behaviour I would expect is a limit somewhere between 1,000 and 10,000 news items (read or unread) per feed.
Check the Changelog to find a solution to your problem. Otherwise, make a PR and implement it, or pay someone to do it for you. We are not paid to work on this application; we have other things in our lives that we need to deal with, and we are investing our free time to improve this application. We are not a company that you pay to provide you a service; read the license, it states clearly what this is. Locked.
See the new FAQ entry for this topic.
IMPORTANT
Read and tick the following checkbox after you have created the issue or place an x inside the brackets ;)
Explain the Problem
Old feed items are not getting purged automatically. Since I don't mark all items as read, the database grows over time and becomes very big.
Steps to Reproduce
Clone the Docker image from the official repo, install the News app, and install/configure cron.
Expected Behavior
The number of unread news items should be 10 and should not increase over time.
Actual Behavior
The number of unread news items is higher than 10 and increases when new items are being published.
System Information
Contents of nextcloud/data/nextcloud.log