Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue: download counts dropped between 5.8 to 5.9? #8840

Closed
eunices opened this issue Jul 13, 2022 · 12 comments · Fixed by #8972
Closed

Issue: download counts dropped between 5.8 to 5.9? #8840

eunices opened this issue Jul 13, 2022 · 12 comments · Fixed by #8972
Milestone

Comments

@eunices
Copy link
Contributor

eunices commented Jul 13, 2022

What steps does it take to reproduce the issue?

Seems related to dataset.xhtml (https://github.com/IQSS/dataverse/blob/develop/src/main/webapp/dataverse.xhtml), where guestbookResponseServiceBean.getCountOfAllGuestbookResponses was updated in 5.9 (#7804), to use an estimated value (

public Long getCountOfAllGuestbookResponses() {
)

  • What happens?
    After the upgrade, downloads dropped by about 2,000 (with ~150,000 downloads)

  • To whom does it occur (all users, curators, superusers)?
    All users

  • What did you expect to happen?
    Downloads should not drop after an upgrade.

Which version of Dataverse are you using?
5.10.1

Any related open or closed issues to this bug report?
https://groups.google.com/g/dataverse-community/c/z1EYZHswhhI
#7804, PR: #8143

Some organisations are tracking the precise download stats. We'd just like to highlight this.. It's great if there'll be a potential fix, but understandably it would not be preferred as it comes as the expense of performance. I suppose if we needed precise stats, we could do a select count(o.id) from GuestbookResponse o

@qqmyers
Copy link
Member

qqmyers commented Jul 13, 2022

Interesting. I see a similar (slightly larger %-wise) at QDR right now. FWIW: The estimate is coming from the query in src/main/resources/db/migration/V5.8.0.3__7804-optimizations.sql . There are suggestions on the internet that this is a reasonable fast estimate but as far as I can tell how accurate it is depends on how the table is used and the accuracy can vary/ the count can be out of date since some of the parameters in the query are only calculated periodically.

One thing I just tried (on non-production systems) is to run
vacuum (VERBOSE, ANALYZE) guestbookresponse;
which appears to make the estimate accurate (exact at present). Looking at the postgres config we have at QDR, I don't see autovacuum running, which appears to be recommended. I'll investigate a bit further but if there are others with more db knowledge who know why we should/shouldn't be autovacuuming, it would be good to know.

Assuming vacuuming is a good idea, that done manually or via autovacuum may help keep this estimate accurate. Alternately, if the time to get a count via the select count() query isn't prohibitive, I think sites could alter the query to use it, always or, for example, if the estimate is less than 1M, use the select count(). (Perhaps that could even be a modification contributed back to the community version? My guess is that autovacuum is the better choice overall unless there's reason not to do it.)

@eunices
Copy link
Contributor Author

eunices commented Jul 14, 2022

@qqmyers Thanks for verifying that you were able to replicate the issue on QDR.
I see, thanks for suggesting that 'vacuum' (manual/ automatic) database tuning will help resolve the issue.

@eunices eunices closed this as completed Jul 14, 2022
@qqmyers
Copy link
Member

qqmyers commented Jul 14, 2022

FWIW: After more discussion and investigation, it looks like autovacuum is on by default but the default settings probably don't trigger 'analyze' on the guestbookresponse table often enough to keep the estimate as accurate as we might like. I may look into lowering settings such as autovacuum_analyze_scale_factor (e.g. from 0.1 to 0.01) for that table (not sure yet if that is what is needed). If anyone knows more about postgres and what we could do here to improve the estimate, please add here and/or consider a PR to suggest postgres config settings to add in the guide.

@eunices eunices reopened this Jul 20, 2022
@eunices
Copy link
Contributor Author

eunices commented Jul 20, 2022

@qqmyers did you observe that the download statistic on homepage did not change over a week? We observed that the homepage download stats is does not update over 6 days (using a cached value) while the db query select count(o.id) from GuestbookResponse o had increased by ~1,000.

From the user perspective, it's probably ok that the stats aren't that accurate. But would not be ok if the stats are just not updated even on a weekly basis.

Is the solution to this to trigger analyse on guestbookresponse table on a regular basis manually? If so would you mind sharing the steps/commands involved, so we could adapt it?
It would be good to have a once-off setting like autovacuum_analyze_scale_factor (?) and built-in the application if possible.

@qqmyers
Copy link
Member

qqmyers commented Jul 20, 2022

@eunices - this is definitely something where I think someone has to take the time to get a good answer (I was going to reopen this issue - glad you did).

I haven't yet done a vacuum/analyze at QDR but plan to do so. From the postgres docs, where it says For efficiency reasons, reltuples and relpages are not updated on-the-fly, and so they usually contain somewhat out-of-date values. They are updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX. A VACUUM or ANALYZE operation that does not scan the entire table (which is commonly the case) will incrementally update the reltuples count on the basis of the part of the table it did scan, resulting in an approximate value. In any case, the planner will scale the values it finds in pg_class to match the current physical table size, thus obtaining a closer approximation. it sounds like manual/periodic vacuum/analyze makes sense.

I haven't yet dug deep enough to see if lowering the autovacuum_analyze_scale_factor would be a way/the best way to limit the discrepancy. My naïve guess would be that the estimate can get to be up to ~10% off when the scale factor is 0.1.
The scale factor can be reset per table so I think
alter table guestbookresponse SET (autovacuum_analyze_scale_factor = 0.01);
might be a way to keep the error to 1% and should have minimal (and scalable) impact. (And presumably one could lower it further if desired although the phrase will incrementally update the reltuples count on the basis of the part of the table it did scan makes me wonder how low one can set this without making the estimate worse.)

Again, I plan to try this at QDR (checking with our sys admin there to make sure there are no concerns first). If you/others try this and think it helps, we can make a PR to put that sql command in the docs and/or into flyway. (Anyone who runs this sql update on a test or production system - please report here how it goes. Being able to confirm that the difference between the estimate and real count gets limited to 1% would help confirm that this works.)

@eunices
Copy link
Contributor Author

eunices commented Jul 21, 2022

Thanks for sharing the steps.

Also, I have tried alter table guestbookresponse SET (autovacuum_analyze_scale_factor = 0.01); on our test system.

To verify:

select relname, reloptions 
from pg_class where relname like '%guestbookresponse%';

To remove this setting:

alter table guestbookresponse reset (autovacuum_analyze_scale_factor);   

On the development system:

Before changing autovacuum_analyze_scale_factor
2022-07-21_13-58-52

Downloaded >1% (of ~800 downloads).

After changing autovacuum_analyze_scale_factor
2022-07-21_15-23-49

The numbers seem to have increased.

Production systems are tightly governed on our end, and we won't be able to run the command there, unless it has proven to work on DEV, or recommended by Harvard and/or the developers.

Not sure what other side effects there'll be on the database.

Edit: as of 28 Jul 2022, downloads have been frozen at 941, even after manually downloading >1%.

@qqmyers
Copy link
Member

qqmyers commented Aug 15, 2022

FWIW: At QDR I set alter table guestbookresponse SET (autovacuum_analyze_scale_factor = 0.01); and, after a couple days, that triggered an autoanalyze, as seen with select relname,last_vacuum, last_autovacuum, last_analyze, vacuum_count, autovacuum_count, last_autoanalyze, schemaname from pg_stat_user_tables where relname='guestbookresponse';.
After the autoanalyze, the download counts shown is now quite close to the actual (currently off by 2 counts out of 90K+). Since there have been 100+ download counts since the autoanalyze, I think it means the count is continuing to track.

One thing I noticed is that the threshold for autovacuum is actually the scale factor (i.e. now 1%) plus 50 rows (another param that can be set).

Watching the updates for a bit, it appears that the estimate at QDR is changing in steps of reltuples/relpages which is ~67. It's possible that with fewer counts, a similar lower limit to the estimate's step size could exist, which would mean the estimate will still not change when 1% more downloads occur.

In summary - so far the autovacuum has reduced the estimate's error at QDR from several thousand/many percent to something that is so far <0.1% off. I'll try to continue monitoring to see the maximum before autoanalyze runs again. Assuming it stays smaller, I'd push for some next steps - either recommend this setting (especially since it is reversible), or have a few more sites try it on production/highly used sites (perhaps demo.dataverse.org?). I think this could be added as a flyway script in some future version as well to avoid admins having to make a manual change.

@pdurbin
Copy link
Member

pdurbin commented Sep 13, 2022

I'm talking to "analia-s" in https://chat.dataverse.org this morning and she reported seeing 36 downloads on the homepage but some datasets (such as https://dataverse.unr.edu.ar/dataset.xhtml?persistentId=doi:10.57715/UNR/PTDCEY have files with over 200 downloads). I told her she may be affected by this issue. Here are some screenshots:

Screen Shot 2022-09-13 at 8 20 00 AM
Screen Shot 2022-09-13 at 8 19 38 AM


Update, I just tried https://dataverse.unr.edu.ar/api/info/metrics/downloads and it shows 833 downloads. Why would this be different than the GUI?

@qqmyers
Copy link
Member

qqmyers commented Sep 13, 2022

I don't think the API uses the estimate, so this is probably a good indication that the difference is because of this issue. One could check in the db comparing:
SELECT ((reltuples / relpages) * (pg_relation_size('public.guestbookresponse') / current_setting('block_size')::int))::bigint FROM pg_class WHERE oid = 'public.guestbookresponse'::regclass;
against
SELECT COUNT(id) FROM guestbookresponse;
If they are 36 and 833 then vacuuming should help.

@analia-s
Copy link

@qqmyers thanks!
img1
img2
After to restart postgres and payara there are 739 downloads on the homepage.

@qqmyers
Copy link
Member

qqmyers commented Sep 13, 2022

@analia-s - was the 739 number from before a restart? If it was showing in the UI at the same time the query reported 739, something else is going on. If more downloads happened or the restarts between when you saw 36 and did the query to get 739, postgres could have updated its statistics (which is what the change in this issue/PR will make happen more frequently).

@analia-s
Copy link

@qqmyers I couldn't check if before restarting postgres the number of downloads increased. Today I restarted postgres again but I did not do the query and the number increased again.

@pdurbin pdurbin added this to the 5.13 milestone Dec 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants