
Constant upload and repair failures/alerts #1779

Open
CtrlAltDefeat94 opened this issue Jan 7, 2025 · 6 comments
Labels
bug Something isn't working

Comments

@CtrlAltDefeat94

Current Behavior

I'm currently uploading on two different renterd nodes. Both of them keep getting stuck on uploads via the API every now and then, and both have recurring upload and repair failures. Here's a current list of active alerts from one of them:
https://gist.github.com/CtrlAltDefeat94/7f3a180705390b8b6b97c7521dc21d79
(gist because of size limit)

Expected Behavior

Uploads and repairs complete without constant failures and alerts.

Steps to Reproduce

I'm uploading data via /worker/objects/{path}?bucket=default
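
A minimal sketch of what such an upload request could look like, assuming the worker API listens on localhost:8081 (as in the URLs later in this thread), that object uploads are plain HTTP PUTs, and that the API password is supplied via basic auth; the object name, bucket and password below are placeholders:

```go
// Hedged sketch of an upload to the renterd worker API.
// Assumptions: worker on localhost:8081, HTTP basic auth with the API
// password, and PUT semantics for object uploads. Names are placeholders.
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	body := strings.NewReader("test data") // stand-in for the real file contents

	// PUT the data into the "default" bucket under the given object path.
	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:8081/api/worker/objects/test.bin?bucket=default", body)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("", "apipassword") // placeholder API password

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("upload status:", resp.Status)
}
```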

Version

v1.1.1

What operating system did the problem occur on (e.g. Ubuntu 22.04, macOS 12.0, Windows 11)?

Docker

Autopilot Config

{ "contracts": { "set": "autopilot", "amount": 50, "allowance": "4909550423995516048300000000", "period": 6048, "renewWindow": 2016, "download": 1400000000000, "upload": 1400000000000, "storage": 5000000000000, "prune": true }, "hosts": { "allowRedundantIPs": false, "maxDowntimeHours": 336, "minProtocolVersion": "1.6.0", "maxConsecutiveScanFailures": 10, "scoreOverrides": null } }

Bus Config

These all give a 404 and I can't figure out all the new endpoints. I don't think they are too related to my issue, but I can provide the results of the correct endpoints if need be.

Anything else?

The config on the other node with this issue is identical aside from the 30/60 redundancy with 90 hosts.

Unrelated note: the autopilot config endpoint was updated to http://localhost:9980/api/autopilot/config.

CtrlAltDefeat94 added the bug label on Jan 7, 2025
@ChrisSchinnerl (Member)

I have a few questions.

  1. Are you running the tagged v1.1.1 container or the latest from master? It's a bit odd that you would not be able to reach the bus endpoints for the settings.
  2. Is the node that got 90 contracts uploading with 30/60? That is very likely to fail. You should make sure to form contracts with about 50% more hosts than you need for all your shards, in this case 90, because as soon as one host fails your whole upload is going to fail if you don't have spare hosts. That's why the default is 30 shards total with 50 hosts. (A quick sketch of that rule of thumb follows after this comment.)
  3. How often do you see the migration errors and do they always look the same? (tons of "internal error"s?)
  4. How long has the data been uploaded for?

cc @n8maninger: it seems like migrations fail because all hosts return "internal error" when executing the read sector instruction. Any thoughts on what the renter might do wrong on all these hosts simultaneously to cause that error within the ReadSector instruction? From what I can tell it shouldn't be a missing sector, but at the same time it feels unlikely that all hosts experience disk problems or similar.
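
A quick sketch of the contract-count rule of thumb from point 2 above, assuming the guideline is simply "about 50% more contracts than total shards":

```go
// Sketch of the "50% more hosts than total shards" rule of thumb.
package main

import (
	"fmt"
	"math"
)

// recommendedContracts adds roughly 50% headroom on top of the total shard
// count so a single failing host doesn't stall an upload.
func recommendedContracts(totalShards int) int {
	return int(math.Ceil(float64(totalShards) * 1.5))
}

func main() {
	fmt.Println(recommendedContracts(60)) // the 30/60 setup here -> 90 contracts
	fmt.Println(recommendedContracts(30)) // default 30 total shards -> 45, rounded up to 50 in the default config
}
```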

@CtrlAltDefeat94 (Author)

  1. I'm running the tagged container.
  2. I think we're running into semantics here. I have 30 minimum shards and 60 total shards, thus 2x redundancy.
  3. My script started uploading a file using http://localhost:8081/api/worker/objects/025a7b9709d975305309e95335877b0952cd439779814493ec7b14ba94cef878.bin?bucket=default
    The file is about 1 GB and at some point the upload gets stuck. No bandwidth is used, and the API call doesn't stop/break.
    In the span of an hour, more than 1K lines of:
    {"level":"error","ts":"2025-01-08T14:32:55Z","logger":"worker.worker.accounts","caller":"worker/accounts.go:353","msg":"failed to refill account for host{hostKey 25 0 ed25519:fba406818f558676e117323f45665f3affbb10bb025f74548fafe66df0dd5274} {error 26 0 failed to fund account: failed to fund account with 0 SC; insufficient funds to fund account: 0 SC <= 2 H\n}","stacktrace":"go.sia.tech/renterd/internal/worker.(*AccountMgr).refillAccounts.func1\n\t/renterd/internal/worker/accounts.go:353"}
    I'd have to check how often it happens, but if I had to guess, (at least) 10 times a day.
  4. I started uploading to both nodes on the 2nd. Around 1.5 TB in files has been uploaded (not accounting for redundancy).

I can provide the debug log files in a few hours.

@ChrisSchinnerl (Member)

OK, so the issue template is updated and the curl examples should work now. Logs might still be helpful, but the odd issue here is all the hosts returning "internal error". In theory it's pretty clear what that error means in the context of reading a sector (it's not the renter's fault), but the odd thing is that so many hosts return the same error at the same time. We updated hostd to give us some more information, but it will take a bit of time for your hosts to update to reflect that change.

@CtrlAltDefeat94 (Author)

I just checked how often uploads fail. It seems like there are a lot more failures than I anticipated. This is a little more than a week of constantly (trying to) upload:
https://file.io/4AzhcgIxChHE
Roughly 50K occurrences of upload failures to hosts for various reasons.

@ChrisSchinnerl (Member)

That is very useful. I probably won't get to it today, but I did grep my way through them to see what we can tackle next week (a sketch of that kind of tally is included after this comment).

OK, so 48742 lines of logs in total:

  • 7298 lines of block height gouging -> this happens when either the renter or the host isn't synced; looks like your node was behind the hosts
  • 12679 failed payments -> likely due to some of your contracts being empty and having insufficient balance to pay
  • 5594 failed instructions -> this is an internal host error that we don't get any info about
  • 4628 commit errors -> also host errors; it seems like they have disk/db issues
  • 17317 already finished errors -> these are not necessarily problems, just uploads which were finished by a redundant worker; we can probably reduce the log spam for those
  • 582 no price table found errors -> can happen when a host restarts and forgets about price tables that renters fetched; nothing to worry about, will be gone with RHP4
  • 319 revision number must be greater than X errors -> happens when a contract is revised while trying to use it
  • 136 stream was gracefully closed errors -> similar to the "already finished" errors
  • 68 other internal errors we unfortunately don't have any info about, but probably the same host having other issues
  • 11 context canceled errors, which again are similar to the "already finished" errors
  • A few more errors related to either gouging, insufficient balance, or just a rare issue with dialing a host, which can happen

cc @peterjan
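
A rough sketch of how such a tally could be reproduced from the exported log file. The file name and the match strings are illustrative only and not necessarily the exact wording renterd uses:

```go
// Hedged sketch: count upload-failure log lines by error category.
// The substrings below are illustrative, based on the categories listed
// above, and are not guaranteed to match renterd's exact log messages.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	patterns := []string{
		"gouging",
		"insufficient balance",
		"failed to execute instruction",
		"already finished",
		"no price table found",
		"revision number must be greater",
		"context canceled",
	}

	f, err := os.Open("upload-failures.log") // placeholder file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := make(map[string]int)
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long JSON log lines
	for scanner.Scan() {
		line := scanner.Text()
		for _, p := range patterns {
			if strings.Contains(line, p) {
				counts[p]++
			}
		}
	}
	for _, p := range patterns {
		fmt.Printf("%-35s %d\n", p, counts[p])
	}
}
```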

@CtrlAltDefeat94 (Author)

CtrlAltDefeat94 commented Jan 10, 2025

I would like to emphasize the error I mentioned before.

3. My script started uploading a file using http://localhost:8081/api/worker/objects/025a7b9709d975305309e95335877b0952cd439779814493ec7b14ba94cef878.bin?bucket=default
The file is about 1 GB and at some point the upload gets stuck. No bandwidth is used, and the API call doesn't stop/break.
In the span of an hour, more than 1K lines of:
{"level":"error","ts":"2025-01-08T14:32:55Z","logger":"worker.worker.accounts","caller":"worker/accounts.go:353","msg":"failed to refill account for host{hostKey 25 0 ed25519:fba406818f558676e117323f45665f3affbb10bb025f74548fafe66df0dd5274} {error 26 0 failed to fund account: failed to fund account with 0 SC; insufficient funds to fund account: 0 SC <= 2 H\n}","stacktrace":"go.sia.tech/renterd/internal/worker.(*AccountMgr).refillAccounts.func1\n\t/renterd/internal/worker/accounts.go:353"}
I'd have to check how often it happens, but if I had to guess, (at least) 10 times a day.

An upload can get stuck in this state indefinitely without throwing an alert in the UI, and the API call doesn't get interrupted, even with smaller files and after waiting for over a day.
However, I'm not sure if the log line is related to the actual problem; I've seen the spam happen even when I'm not uploading myself, but maybe it also happens during repairs.
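
A minimal sketch of a client-side guard against this hanging behaviour, assuming the same worker endpoint and basic-auth setup as in the upload sketch above: wrap the request in a context with a timeout so the call fails instead of blocking forever. The timeout, file name and password are placeholders:

```go
// Hedged sketch: abort a stuck upload after a fixed timeout instead of
// letting the HTTP call hang indefinitely. Endpoint, password, file name
// and timeout are placeholders.
package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	f, err := os.Open("test.bin") // placeholder for the ~1 GB test file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Cancel the upload if it hasn't completed within an hour.
	ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPut,
		"http://localhost:8081/api/worker/objects/test.bin?bucket=default", f)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("", "apipassword") // placeholder API password

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("upload aborted:", err) // e.g. context deadline exceeded
		return
	}
	defer resp.Body.Close()
	fmt.Println("upload status:", resp.Status)
}
```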

Here's the debug log from genesis up till now. It might add some context that's missing now.
https://file.io/KCwyA1p9F2AO

EDIT: I could share the log file of the other node as well, though it looks like more of the same.
