[LoRaWAN] Transmission stops after about a day #27
Comments
Thanx. I'll give it a try.
…On Sat, Jul 28, 2018 at 10:21 AM, SloMusti ***@***.***> wrote:
I have been testing the robustness of the Murata module by using
B-L072Z-LRWAN1 with this core, version 0.0.7, and have encountered an issue
where, about a day later, the transmissions to the gateway stop. This has been
confirmed on all the boards with multiple gateways in two different cities
to exclude other factors.
The code running on the device is attached below; it performs a simple
transmission every 10 s. I have yet to capture the serial log up to the crash,
but the main code loop keeps executing, so it does not appear to be the problem.
Any suggestions or ideas for debugging this are welcome, and it would help if
someone else could test this independently.
loradiscoveryttnworking.txt
<https://github.com/GrumpyOldPizza/ArduinoCore-stm32l0/files/2238116/loradiscoveryttnworking.txt>
|
Furthermore, @s54mtb has reported this issue as well, using firmware unrelated to this repository, so it may well be something STM/Murata related: https://github.com/s54mtb/LoRaDunchy/tree/master/sw |
Tracing possible causes now with serial logging and a power analyzer. One thing is apparent now: the data rate changes due to ADR. I will try to correlate whether that is an issue.
|
Observed the hang now with serial attached: the transmissions stopped when ADR was supposed to change to DR5.
|
Is there a way for you to redirect the output to a UART instead of USB? I'd like to isolate whether it's perhaps a USB issue. Looks like you see this after 65 downlinks. Does this always happen at that point? |
I can do that; however, it does not always appear at this point. I have also disabled ADR and the problem remains, so it may not be directly correlated. |
The logging has been via serial and the fault persists, so it is definitely not related to USB. We have now tested on 4 devices, all behaving exactly the same. @GrumpyOldPizza can you please let me know whether you can replicate the issue? Note we are using the 868 MHz EU band. |
I have not been able to reproduce the issue.
Is it possible that it is gateway related?
|
@GrumpyOldPizza this was tested on 5+ gateways in different cities, running on Raspberry Pi + RAK831 or iC880A or Laird indoor. The common factor is that they all use The Things Network servers. Are you using those, or Loriot, or something else? |
I am using Multitech gateways.
|
@GrumpyOldPizza ok, but with what backend? |
Hi! I had similar issues with Murata modules and the ST LoRaWAN stack. I was running 5 different sensors using the Murata Type ABZ module and the LoRaWAN stack from STMicro. The application hangs after a random time, from a few hours to several days (not more than 3 days). A module that dies after 50 packets sent may, on another run, send more than 1k packets.
The hardware used for testing: http://e.pavlin.si/2018/05/07/lora-module-in-dil-form/
Complete sensor used for the testing: http://e.pavlin.si/2018/07/03/particle-sensor-with-lora/
The latest software was committed here: https://github.com/s54mtb/LoRaDunchy/tree/master/sw/Projects/PM-Sensor
My changes compared to the demo application:
- power down is not used, since the PM sensor consumes quite some power and everything is powered constantly.
- duty cycle is 30 seconds (APP_TX_DUTYCYCLE 30000)
- VCOM is not used
- I2C and UART communication for sensors has been added (no dynamic memory/heap is used)
- a counter has been added that triggers a re-join after half an hour. Without that, none of the modules worked longer than a few hours. Rejoining didn't resolve the issue; it just prolonged the time until data stopped being sent.
When the module hangs, LoraSend() is still being executed, but no signal gets through (TTN receives no data). The MCU is alive, timers are OK, sensor readings are OK. I also tested sending without any sensor interaction (just sending constant numbers instead of actual sensor readouts) and it had no influence on the occurrence of the issue. Gateways and backend are the same as @SloMusti reported above. |
Let me recheck this on my local gateways. My last tests were about a week
long, testing recovery from power outages, but I did not see anything
like this. However, this was US915.
|
After a long field-testing period I got some results:
It seems the major issues were in the peripherals and not in the stack, and were mostly related to proper configuration of the MCU/NVIC. That was not documented properly in the first versions of the STM stack. The latest updated documentation provided by STM is much more detailed, and it helped solve the issues with the NVIC. |
So this is really not related to ArduinoCore-stm32l0. Again, I have not seen those problems here at all. |
@GrumpyOldPizza I was able to observe such a problem with ArduinoCore-stm32l0: the device stops transmitting after a while. Can you please point me to which version of the STM LoRa stack this core is running, and where it would be best to evaluate interrupt priorities, should this really be the cause of the hangups? |
The stack is derived from LoRaMac-node 4.4.1. I doubt that it's the interrupt priorities. RTC-based timeouts and DIO IRQ handling, which drive the stack, are escalated to a PENDSV callback. So are common peripheral callbacks, like "Serial.onReceive()" (which you are unlikely to use). There is of course always the chance of another bug somewhere. But what strikes me as curious is that you are pretty much the only one seeing this issue. |
@GrumpyOldPizza Just checking, did you test most of the nodes in the US or EU bands, should there be anything related to that, which I doubt. |
Obviously, yes.
|
I have performed the experiment in the following configuration: B-L072Z-LRWAN1 board 1: LoraWAN-TTN-OTAA example code. Both crashed after about 4000 messages, almost simultaneously. Repeating the experiment now to validate. |
Did you use "setDutyCycle(false)" or the default? |
Ok, used the LoRaWAN_OTA.ino example with "setDutyCycle(false)". After almost 24 hours and 8500 transmissions, it's still alive on B-L072Z-LRWAN1. This is on EU868. |
Tried a 2nd board with ADR off, hence always DR_0. That one also survived a day without a crash. The first board is now on day 2 1/2 with a message every 10 seconds. Also no crash or anything. Unless there is a good reason to keep this open, I am gonna close the issue. |
I am repeating the same test as you have defined, will need to wait a day or so to see if a crash occurs and then report back. |
Actually, I have just now observed a crash on both devices with the LoRaWAN_OTA.ino example with "setDutyCycle(false)". One had sent 584 messages, the other 364. The next thing to try is turning ADR off to see whether that has an effect. GW config: |
I am not sure what to do. It works fine here with 2 B-L072Z-LRWAN1 boards, as well as all others. I have no other reports from anybody else about sudden crashes after a short period of time. Obviously I am using a different gateway (and am on Linux). What are the last 50 messages printed out via the serial console? Otherwise I'd suggest you contact me via grumpyoldpizza@gmail.com so that you can arrange to send me your hardware (RAK831 gateway and one of the failing B-L072Z-LRWAN1 boards). |
Ok, got a repro after 3 days. I am not positive it's the same issue as you got, but it's possible. Essentially a corrupted frame on RX1 will keep LoRaWAN.busy() set to true (triggered for me by ADR). I tend to believe that a multicast frame not addressed to this node may cause this as well. In general it may be possible that the gateway sends some invalid packet (or a LoRaWAN 1.1 extension to a LoRaWAN 1.0.2 node), which might trip up the LoRaWAN class as well. That will take a few days to sort out. |
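To make the failure mode described above concrete, here is a minimal sketch in C of how a "busy" flag can get stuck when a receive outcome arrives that the handler did not anticipate. It is not the code from LoRaWAN.cpp or LoRaMac-node, and the actual fix is not shown in this thread; every name in it (`rx_status_t`, `node_t`, `on_indication_stuck`, `on_indication_safe`) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical receive outcomes; the real stack reports its own status codes. */
typedef enum { RX_OK, RX_TIMEOUT, RX_CORRUPTED, RX_NOT_FOR_US } rx_status_t;

typedef struct {
    bool busy;           /* true while an uplink transaction is pending */
    unsigned delivered;  /* number of valid downlinks handed to the app */
} node_t;

/* Starting an uplink marks the node as busy. */
void node_send(node_t *n) { n->busy = true; }

/* Failure mode: only the anticipated outcomes end the transaction. */
void on_indication_stuck(node_t *n, rx_status_t status)
{
    switch (status) {
    case RX_OK:      n->delivered++; n->busy = false; break;
    case RX_TIMEOUT: n->busy = false; break;
    default:         break;  /* unanticipated status: busy is never cleared */
    }
}

/* Defensive variant: every outcome ends the transaction. */
void on_indication_safe(node_t *n, rx_status_t status)
{
    if (status == RX_OK) {
        n->delivered++;
    }
    n->busy = false;
}

/* Small self-check showing the difference between the two handlers. */
void busy_flag_demo(void)
{
    node_t n = { false, 0 };

    node_send(&n);
    on_indication_stuck(&n, RX_CORRUPTED);
    assert(n.busy);    /* stuck: the application would wait forever */

    on_indication_safe(&n, RX_CORRUPTED);
    assert(!n.busy);   /* released: the next uplink can go out */
    assert(n.delivered == 0);
}
```

In this sketch the application loop keeps running in both cases, which matches the observation earlier in the thread that the main loop stays alive while transmissions stop.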
Well spotted, thank you for the effort. I believe it would also be good to figure out a watchdog, so that such problems can be recovered from when a device has been deployed somewhere inconvenient. Have you looked into this yet with this core? |
A watchdog will not help there. It's an internal bug where the code waits for a McpsIndication that either never arrives, or arrives with an error that was not documented originally (multicast). Should be halfway simple to fix. But I need to crosscheck all code paths in LoRaMac-node to see whether other errors can pop up (that are not handled properly). My bigger problem is how to test this. Where I am located physically there are no other gateways close by, only some faint US915 ones ... So checking out those boundary conditions is tricky. |
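For the case of an indication that never arrives, the missing piece is essentially a timeout on the pending transaction. The following is a rough sketch in C of such an application-level guard, under the assumption that the application has a free-running millisecond tick (for example from `millis()` on Arduino). It is not part of this core and not the fix that was applied; `tx_guard_t` and the `guard_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Tracks how long an uplink transaction has been pending. */
typedef struct {
    bool     busy;
    uint32_t busy_since_ms;
} tx_guard_t;

/* Call when an uplink is started. */
void guard_on_send(tx_guard_t *g, uint32_t now_ms)
{
    g->busy = true;
    g->busy_since_ms = now_ms;
}

/* Call when the stack reports the transaction as finished (any outcome). */
void guard_on_done(tx_guard_t *g)
{
    g->busy = false;
}

/* True if the transaction has been pending longer than limit_ms.
 * The unsigned subtraction stays correct when the tick counter wraps. */
bool guard_stalled(const tx_guard_t *g, uint32_t now_ms, uint32_t limit_ms)
{
    return g->busy && (uint32_t)(now_ms - g->busy_since_ms) > limit_ms;
}

/* Small self-check, including a tick counter wraparound. */
void guard_demo(void)
{
    tx_guard_t g = { false, 0 };

    guard_on_send(&g, 0xFFFFFF00u);
    assert(!guard_stalled(&g, 0xFFFFFF10u, 10000u)); /* 16 ms elapsed */
    assert(guard_stalled(&g, 0x00002800u, 10000u));  /* 10496 ms elapsed */

    guard_on_done(&g);
    assert(!guard_stalled(&g, 0x00002800u, 10000u));
}
```

What the application does once `guard_stalled()` returns true (rejoin, reset, or just log) is a separate design decision that this sketch leaves open.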
So far I have observed regular crashes at my location, so I am happy to run tests when necessary. Alternatively I can provision an RPi so that you can upload remotely and test. Would that work? |
Since the issue has to do with other LoRaWAN traffic ... doesn't make sense to send me anything. I had assumed a Gateway issue, or a simple hardware issue with B-L072Z-LRWAN1 before. I'll test locally on US915 and see whether the fix I have survives a good chunk of packets (switched to 5 second intervals). The github will be updated in a few hours after the first shakedown. |
I have updated the repository with the proper fix. I will test overnight (and over the next few days) whether it introduces another issue, so there is no updated JSON file yet. Would you mind either installing via GitHub or simply copying the updated LoRaWAN.cpp into the proper place? |
@GrumpyOldPizza I have been testing your code for 2 days now and it still works on two devices. |
@s54mtb reports another problem with frame counters, not using this core but the STM stack directly: the LoRaMac hangs upon reaching the maximal frame counter value 0xffff. This has been reproduced with LoRaMacSetFCntUp() and including "LoRaMacFCntHandler.h". It would be good to test whether the same thing happens with this core. The workaround at the moment is to reset the MCU with NVIC_SystemReset() once the uplink counter reaches 0xffff.
|
I have here 4 boards (1x B-L072Z-LRWAN1 and 3x Grasshopper) doing various
different things, pinging on the same Gateway. No failure so far.
|
I think 1.0 and 1.0.1 allow for 16-bit counters. 1.0.2 is 32-bit clean per
the standard.
However, packet-wise only the lower 16 bits get transmitted.
The code in LoRaMac-node 4.4.1 is ok. The one in 4.4.2 is busted:
// Add difference, consider roll-over
fCntDiff = ( int32_t )macMsg->FHDR.FCnt - ( int32_t )( previousDown & 0x0000FFFF );
You cannot use int32_t arithmetic to get an int16_t rollover via wraparound.
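The arithmetic behind that remark can be checked with a small C sketch. It compares the difference as computed by the quoted line with a difference taken modulo 2^16. The function names are hypothetical, and the 16-bit variant is only one possible way to handle the rollover; it is not claimed to be what LoRaMac-node 4.4.1 or any later release does.

```c
#include <assert.h>
#include <stdint.h>

/* Difference as in the quoted line: both operands are widened to 32 bits
 * first, so a 16-bit rollover shows up as a large negative number. */
int32_t fcnt_diff_32(uint16_t received, uint32_t previous_down)
{
    return (int32_t)received - (int32_t)(previous_down & 0x0000FFFFu);
}

/* Difference taken modulo 2^16, then mapped to the range -32768..32767,
 * so a rollover from 0xFFFE to 0x0001 counts as a step of +3. */
int32_t fcnt_diff_16(uint16_t received, uint32_t previous_down)
{
    uint16_t diff = (uint16_t)(received - (uint16_t)(previous_down & 0xFFFFu));
    return diff < 0x8000u ? (int32_t)diff : (int32_t)diff - 0x10000;
}

/* Rebuild the full 32-bit counter from the 16 bits seen on air. */
uint32_t fcnt_extend(uint16_t received, uint32_t previous_down)
{
    return previous_down + (uint32_t)fcnt_diff_16(received, previous_down);
}

/* Small self-check around the 0xFFFF boundary. */
void fcnt_demo(void)
{
    /* No rollover: both variants agree. */
    assert(fcnt_diff_32(0x0010u, 0x0000000Au) == 6);
    assert(fcnt_diff_16(0x0010u, 0x0000000Au) == 6);

    /* Rollover: previous counter 0xFFFE, frame arrives with FCnt 0x0001. */
    assert(fcnt_diff_32(0x0001u, 0x0000FFFEu) == -65533);
    assert(fcnt_diff_16(0x0001u, 0x0000FFFEu) == 3);
    assert(fcnt_extend(0x0001u, 0x0000FFFEu) == 0x00010001u);
}
```

With the 32-bit subtraction a frame received just after the rollover looks like it is 65533 counts in the past, whereas the modulo-2^16 difference treats it as 3 counts ahead.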
…On Thu, Sep 20, 2018 at 9:48 AM SloMusti ***@***.***> wrote:
@s54mtb <https://github.com/s54mtb> reports another problem, not using
this core but STM stack directly, with frame counters, where the loramac
hangs upon reaching the maximal frame counter value 0xffff. This has been
repeated with
LoRaMacSetFCntUp() and including "LoRaMacFCntHandler.h" Would be good to
test if the same thing happens with this core.
Workaround at the moment is:
uplinkcounter = GetUplinkCounter();
if (uplinkcounter >= 0x0000ffff) {
NVIC_SystemReset(); // Reset everything
}
|
No failures on my side either, currently at 80000+ frames on two devices. |
Ok, closing out. Here it's been alive for a week or so, every 5 seconds ... |