
Constant upload and repair failures/alerts #1779

Open
CtrlAltDefeat94 opened this issue Jan 7, 2025 · 6 comments
Labels
bug Something isn't working

Comments

@CtrlAltDefeat94

Current Behavior

I'm currently uploading on two different renterd nodes. Both of them keep getting stuck on uploads via the API every now and then, and both have recurring upload and repair failures. Here's a current list of active alerts from one of them:
https://gist.github.com/CtrlAltDefeat94/7f3a180705390b8b6b97c7521dc21d79
(gist because of size limit)

Expected Behavior

Uploads and repairs complete without constant failures and alerts.

Steps to Reproduce

I'm uploading data via /worker/objects/{path}?bucket=default
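
A minimal sketch of what such an upload request could look like, assuming the worker API listens on localhost:8081 (as in the URLs later in this thread), that object uploads are plain HTTP PUTs, and that the API password is supplied via basic auth; the object name, bucket and password below are placeholders:

```go
// Hedged sketch of an upload to the renterd worker API.
// Assumptions: worker on localhost:8081, HTTP basic auth with the API
// password, and PUT semantics for object uploads. Names are placeholders.
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	body := strings.NewReader("test data") // stand-in for the real file contents

	// PUT the data into the "default" bucket under the given object path.
	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:8081/api/worker/objects/test.bin?bucket=default", body)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("", "apipassword") // placeholder API password

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("upload status:", resp.Status)
}
```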

Version

v1.1.1

What operating system did the problem occur on (e.g. Ubuntu 22.04, macOS 12.0, Windows 11)?

Docker

Autopilot Config

{ "contracts": { "set": "autopilot", "amount": 50, "allowance": "4909550423995516048300000000", "period": 6048, "renewWindow": 2016, "download": 1400000000000, "upload": 1400000000000, "storage": 5000000000000, "prune": true }, "hosts": { "allowRedundantIPs": false, "maxDowntimeHours": 336, "minProtocolVersion": "1.6.0", "maxConsecutiveScanFailures": 10, "scoreOverrides": null } }

Bus Config

These all give a 404 and I can't figure out all the new endpoints. I don't think they are too related to my issue, but I can provide the results of the correct endpoints if need be.

Anything else?

The config on the other node with this issue is identical aside from the 30/60 redundancy with 90 hosts.

Unrelated note: the autopilot config endpoint was updated to http://localhost:9980/api/autopilot/config.

CtrlAltDefeat94 added the bug label on Jan 7, 2025
@ChrisSchinnerl (Member)

I have a few questions.

  1. Are you running the tagged v1.1.1 container or the latest from master? It's a bit odd that you would not be able to reach the bus endpoints for the settings.
  2. Is the node that got 90 contracts uploading with 30/60? That is very likely to fail. You should make sure to form contracts with about 50% more hosts than you need for all your shards, in this case 90, because as soon as one host fails your whole upload is going to fail if you don't have spare hosts. That's why the default is 30 shards total with 50 hosts. (A quick sketch of that rule of thumb follows after this comment.)
  3. How often do you see the migration errors and do they always look the same? (tons of "internal error"s?)
  4. How long has the data been uploaded for?

cc @n8maninger: it seems like migrations fail because all hosts return "internal error" when executing the read sector instruction. Any thoughts on what the renter might do wrong on all these hosts simultaneously to cause that error within the ReadSector instruction? From what I can tell it shouldn't be a missing sector, but at the same time it feels unlikely that all hosts experience disk problems or similar.
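
A quick sketch of the contract-count rule of thumb from point 2 above, assuming the guideline is simply "about 50% more contracts than total shards":

```go
// Sketch of the "50% more hosts than total shards" rule of thumb.
package main

import (
	"fmt"
	"math"
)

// recommendedContracts adds roughly 50% headroom on top of the total shard
// count so a single failing host doesn't stall an upload.
func recommendedContracts(totalShards int) int {
	return int(math.Ceil(float64(totalShards) * 1.5))
}

func main() {
	fmt.Println(recommendedContracts(60)) // the 30/60 setup here -> 90 contracts
	fmt.Println(recommendedContracts(30)) // default 30 total shards -> 45, rounded up to 50 in the default config
}
```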

@CtrlAltDefeat94 (Author)

  1. I'm running the tagged container.
  2. I think we're running into semantics here. I have 30 minimum shards and 60 total shards, thus 2x redundancy.
  3. My script started uploading a file using http://localhost:8081/api/worker/objects/025a7b9709d975305309e95335877b0952cd439779814493ec7b14ba94cef878.bin?bucket=default
    The file is about 1 GB and at some point the upload gets stuck. No bandwidth is used, and the API call doesn't stop/break.
    In the span of an hour, more than 1K lines of:
    {"level":"error","ts":"2025-01-08T14:32:55Z","logger":"worker.worker.accounts","caller":"worker/accounts.go:353","msg":"failed to refill account for host{hostKey 25 0 ed25519:fba406818f558676e117323f45665f3affbb10bb025f74548fafe66df0dd5274} {error 26 0 failed to fund account: failed to fund account with 0 SC; insufficient funds to fund account: 0 SC <= 2 H\n}","stacktrace":"go.sia.tech/renterd/internal/worker.(*AccountMgr).refillAccounts.func1\n\t/renterd/internal/worker/accounts.go:353"}
    I'd have to check how often it happens, but if I had to guess, (at least) 10 times a day.
  4. I started uploading to both nodes on the 2nd. Around 1.5 TB in files has been uploaded (not accounting for redundancy).

I can provide the debug log files in a few hours.

@ChrisSchinnerl (Member)

OK, so the issue template is updated and the curl examples should work now. Logs might still be helpful, but the odd issue here is all the hosts returning "internal error". In theory it's pretty clear what that error means in the context of reading a sector (it's not the renter's fault), but the odd thing is that so many hosts return the same error at the same time. We updated hostd to give us some more information, but it will take a bit of time for your hosts to update to reflect that change.

@CtrlAltDefeat94 (Author)

I just checked how often uploads fail. It seems like there are a lot more failures than I anticipated. This is a little more than a week of constantly (trying to) upload:
https://file.io/4AzhcgIxChHE
Roughly 50K occurrences of upload failures to hosts for various reasons.

@ChrisSchinnerl (Member)

That is very useful. I probably won't get to it today, but I did grep my way through them to see what we can tackle next week (a sketch of that kind of tally is included after this comment).

OK, so 48742 lines of logs in total:

  • 7298 lines of block height gouging -> this happens when either the renter or the host isn't synced; looks like your node was behind the hosts
  • 12679 failed payments -> likely due to some of your contracts being empty and having insufficient balance to pay
  • 5594 failed instructions -> this is an internal host error that we don't get any info about
  • 4628 commit errors -> also host errors; it seems like they have disk/db issues
  • 17317 already finished errors -> these are not necessarily problems, just uploads which were finished by a redundant worker; we can probably reduce the log spam for those
  • 582 no price table found errors -> can happen when a host restarts and forgets about price tables that renters fetched; nothing to worry about, will be gone with RHP4
  • 319 revision number must be greater than X errors -> happens when a contract is revised while trying to use it
  • 136 stream was gracefully closed errors -> similar to the "already finished" errors
  • 68 other internal errors we unfortunately don't have any info about, but probably the same host having other issues
  • 11 context canceled errors, which again are similar to the "already finished" errors
  • A few more errors related to either gouging, insufficient balance, or just a rare issue with dialing a host, which can happen

cc @peterjan
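
A rough sketch of how such a tally could be reproduced from the exported log file. The file name and the match strings are illustrative only and not necessarily the exact wording renterd uses:

```go
// Hedged sketch: count upload-failure log lines by error category.
// The substrings below are illustrative, based on the categories listed
// above, and are not guaranteed to match renterd's exact log messages.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	patterns := []string{
		"gouging",
		"insufficient balance",
		"failed to execute instruction",
		"already finished",
		"no price table found",
		"revision number must be greater",
		"context canceled",
	}

	f, err := os.Open("upload-failures.log") // placeholder file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := make(map[string]int)
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long JSON log lines
	for scanner.Scan() {
		line := scanner.Text()
		for _, p := range patterns {
			if strings.Contains(line, p) {
				counts[p]++
			}
		}
	}
	for _, p := range patterns {
		fmt.Printf("%-35s %d\n", p, counts[p])
	}
}
```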

@CtrlAltDefeat94 (Author)

CtrlAltDefeat94 commented Jan 10, 2025

I would like to emphasize the error I mentioned before.

3. My script started uploading a file using http://localhost:8081/api/worker/objects/025a7b9709d975305309e95335877b0952cd439779814493ec7b14ba94cef878.bin?bucket=default
The file is about 1 GB and at some point the upload gets stuck. No bandwidth is used, and the API call doesn't stop/break.
In the span of an hour, more than 1K lines of:
{"level":"error","ts":"2025-01-08T14:32:55Z","logger":"worker.worker.accounts","caller":"worker/accounts.go:353","msg":"failed to refill account for host{hostKey 25 0 ed25519:fba406818f558676e117323f45665f3affbb10bb025f74548fafe66df0dd5274} {error 26 0 failed to fund account: failed to fund account with 0 SC; insufficient funds to fund account: 0 SC <= 2 H\n}","stacktrace":"go.sia.tech/renterd/internal/worker.(*AccountMgr).refillAccounts.func1\n\t/renterd/internal/worker/accounts.go:353"}
I'd have to check how often it happens, but if I had to guess, (at least) 10 times a day.

An upload can get stuck in this state indefinitely without throwing an alert in the UI, and the API call doesn't get interrupted, even with smaller files and after waiting for over a day.
However, I'm not sure if the log line is related to the actual problem; I've seen the spam happen even when I'm not uploading myself, but maybe it also happens during repairs.
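
A minimal sketch of a client-side guard against this hanging behaviour, assuming the same worker endpoint and basic-auth setup as in the upload sketch above: wrap the request in a context with a timeout so the call fails instead of blocking forever. The timeout, file name and password are placeholders:

```go
// Hedged sketch: abort a stuck upload after a fixed timeout instead of
// letting the HTTP call hang indefinitely. Endpoint, password, file name
// and timeout are placeholders.
package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	f, err := os.Open("test.bin") // placeholder for the ~1 GB test file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Cancel the upload if it hasn't completed within an hour.
	ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPut,
		"http://localhost:8081/api/worker/objects/test.bin?bucket=default", f)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("", "apipassword") // placeholder API password

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("upload aborted:", err) // e.g. context deadline exceeded
		return
	}
	defer resp.Body.Close()
	fmt.Println("upload status:", resp.Status)
}
```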

Here's the debug log from genesis up till now. It might add some context that's missing now.
https://file.io/KCwyA1p9F2AO

EDIT: I could share the log file of the other node as well, though it looks like more of the same.
