Synse v3 - Devices drop out of scan, then reappear. #362

Open
MatthewHink opened this issue Feb 18, 2020 · 13 comments
Comments

@MatthewHink
Contributor

My guess is that when devices are being refreshed, the device collection is cleared and rebuilt, and that this causes the scan to miss devices.

If that is what is happening, I propose rebuilding the device collection in a new area, then flipping the pointer to the rebuilt area once the rebuild is complete.
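
For illustration, a minimal sketch of that build-then-swap idea (hypothetical names; build_device_collection stands in for however the devices are actually gathered), not the actual Synse code:

class DeviceRegistry:
    """Sketch of the build-then-swap idea; illustrative only."""

    def __init__(self):
        self._devices = {}  # the current, fully built collection

    def lookup(self, device_id):
        # Readers always see a complete collection, old or new.
        return self._devices.get(device_id)

    def rebuild(self, build_device_collection):
        # Build the replacement collection off to the side...
        new_devices = build_device_collection()
        # ...then flip the reference once it is complete, so a scan in
        # progress never observes a half-populated collection.
        self._devices = new_devices

Since the swap is a single reference assignment, a concurrent scan sees either the old collection or the new one, never an intermediate state.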

Sample scan with missing vem plc and ups devices:
synse-scan-missing-vem-plc-and-ups.txt

@edaniszewski
Contributor

It looks like the deployment that the scan ran against is using the internal modbus plugin, which is hitting a connection reset bug and causing some failures with that plugin. I'll try to get that resolved tomorrow, and I suspect that once it is, the scans will be fixed.

If that doesn't fix the scan issue, I'll look deeper into it here. The caching is just in-memory (basically a dict) with TTLs on the values, and a device refresh task (which rebuilds the cache) runs more frequently than the TTL by default. When refreshing, the cache isn't cleared and then rebuilt; the cached keys are replaced as we get data for them, so it shouldn't really be empty during a rebuild.
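
To illustrate the replace-in-place refresh described above (an illustrative sketch, not the actual Synse cache; fetch_all_devices is a hypothetical stand-in for the plugin polling):

import time

class TTLCache:
    """Illustrative only; not the actual Synse cache implementation."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, expiry timestamp)

    def set(self, key, value):
        self._entries[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            del self._entries[key]
            return None
        return value

def refresh_devices(cache, fetch_all_devices):
    # Keys are overwritten one at a time as fresh data arrives; the cache is
    # never cleared wholesale, so lookups during a refresh still hit the old
    # entries rather than finding an empty cache.
    for device_id, device in fetch_all_devices():
        cache.set(device_id, device)

Because entries are overwritten key by key rather than the cache being emptied first, a lookup that races with a refresh should still find the previous value.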

@MatthewHink
Contributor Author

Sounds like my idea was wrong. The UPS is SNMP, not modbus, so the linked issue seems different (?)

@edaniszewski
Contributor

Ah, whoops. I thought the UPS was the egauge, which is also running with modbus. If it's SNMP, I believe you are right and it is a different issue. I'll dig through some logs and see what I can come up with.

@MatthewHink
Contributor Author

I think this is a problem on device reads as well. Attaching a log snippet / client-side error; curl is working now.

This looks like it's happening when rebuilding/updating the device cache while looking up a device ID in the cache.

We're serving up a 500 that is being logged at info rather than error, with not much more information than that (no stack trace).

It looks like it failed somewhere after here:

logger.debug(_('looking up device ID in cache'), id=device_id)

synse-500-2020-02-19.txt
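
As an aside, a minimal sketch of the kind of error-level logging with a stack trace that would make this easier to chase, using the standard library logger (the function and call site below are hypothetical, not the actual Synse handler):

import logging

logger = logging.getLogger(__name__)

def lookup_device(cache, device_id):
    # Hypothetical lookup helper, illustrative only.
    logger.debug('looking up device ID in cache: %s', device_id)
    try:
        return cache[device_id]
    except KeyError:
        # logger.exception logs at ERROR level and attaches the traceback,
        # so the 500 leaves more than a bare info line in the logs.
        logger.exception('device lookup failed: %s', device_id)
        raise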

@MatthewHink
Contributor Author

This one is not hard to hit. It is very problematic in my lock testing.

If a repro is useful on an instrumented build with more tracing, we should get that set up and I can hit the issue.

@MatthewHink
Contributor Author

Between this and #363, I normally cannot make it through my lock write test. I'm considering both of these priority 1.

test-stacks.txt

@MatthewHink
Contributor Author

These are happening at roughly a 3-minute interval.

@MatthewHink
Contributor Author

To state it clearly: if an HTTP connection fails due to a networking issue (cable fell out, router blip), that's fine, but I believe the core issue is that synse is not serving the URL from time to time, which is a problem.

@edaniszewski
Contributor

Thanks for the logs/notes on this. I've been sifting through them this morning and am working on adding additional logging. It may take a bit since there is a bunch of stuff going on today, but hopefully by EOD/tomorrow, I'll have something with better log captures.

On first glance through the logs, it does seem like your initial diagnosis is correct: the cache rebuild temporarily causes device lookups to miss.

@edaniszewski
Contributor

Was able to reproduce locally. When I got the 500, I got the following error message:

{
  "http_code":500,
  "description":"an unexpected error occurred",
  "timestamp":"2020-02-20T14:56:59Z",
  "context":"Task <Task pending coro=<Sanic.handle_request() running at \/usr\/local\/lib\/python3.6\/site-packages\/sanic\/app.py:976>> got Future <Future pending> attached to a different loop"
}

So it seems like there is something attached to a different async loop somewhere... not quite what I was expecting to see, but good info to move forward with.
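
For context, that error is the RuntimeError asyncio raises when a coroutine running on one event loop awaits a Future (or a lock/queue that creates one internally) bound to a different loop. A minimal, generic reproduction of the failure mode, unrelated to the Synse code path:

import asyncio

# A Future bound to one event loop...
loop_a = asyncio.new_event_loop()
fut = loop_a.create_future()

async def waiter():
    # ...awaited from a task running on a *different* loop.
    await fut

try:
    asyncio.run(waiter())  # asyncio.run() spins up its own, separate loop
except RuntimeError as e:
    # e.g. "Task <...> got Future <Future pending> attached to a different loop"
    print(e)
finally:
    loop_a.close()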

@edaniszewski
Contributor

Continuing the conversation from #364 (comment):

I haven't yet had enough time to go through the logs closely and get a reproducible setup, but skimming the logs around the device cache rebuild, it looks like a different number of devices is reported at different times, and in one instance the rebuild only polled devices from a single plugin. I'll look more into this later/tomorrow. I've also started adding additional logging into synse around these areas.

@edaniszewski
Contributor

I think I see what is causing the remaining issues here now. I'll spend some time today working through a solution that should close this out.

@edaniszewski
Contributor

Never mind the above comment. I added some additional work to improve various things, but I don't think it will actually affect anything related to this issue. Looking through the logs again, it's not clear why devices dropped out temporarily. More logging was added in the various commits, so if it happens again, it will hopefully be more clear.
