[LoRaWAN] Transmission stops after about a day #27

SloMusti · 2018-07-28T16:21:23Z

I have been testing the robustness of the Murata module using the B-L072Z-LRWAN1 board with this core, version 0.0.7, and have encountered an issue where transmissions to the gateway stop after about a day. This has been confirmed on all boards, with multiple gateways in two different cities, to exclude other factors.

The code running on the device is attached below; it is a simple transmission every 10s. I have yet to capture the serial log up to the crash, but it does not appear to be an issue with the main code loop, which keeps executing.

Any suggestions or ideas for debugging this are welcome, and it would help if someone else could test this independently.
loradiscoveryttnworking.txt <https://github.com/GrumpyOldPizza/ArduinoCore-stm32l0/files/2238116/loradiscoveryttnworking.txt>

GrumpyOldPizza · 2018-07-28T17:18:51Z

Thanx. I'll give it a try.


SloMusti · 2018-07-28T17:28:29Z

Furthermore, this issue has also been reported by @s54mtb using firmware unrelated to this repository, so it may well be something STM/Murata related: https://github.com/s54mtb/LoRaDunchy/tree/master/sw

SloMusti · 2018-07-28T18:16:16Z

Tracing possible causes now with serial logging and a power analyzer. One thing is apparent now: the data rate changes due to ADR. I will try to work out whether that correlates with the issue.

TRANSMIT( TimeOnAir: 74311, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 63, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 75467, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 64, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 76623, NextTxTime: 0, MaxPayloadSize: 242, DR: 5, TxPower: 12.0dbm, UpLinkCounter: 65, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 76675, NextTxTime: 0, MaxPayloadSize: 242, DR: 5, TxPower: 12.0dbm, UpLinkCounter: 66, DownLinkCounter: 0 )

Messages are received by two gateways:
{
  "time": "2018-07-28T18:13:20.698336253Z",
  "frequency": 867.9,
  "modulation": "LORA",
  "data_rate": "SF7BW125",
  "coding_rate": "4/5",
  "gateways": [
    {
      "gtw_id": "XXXX",
      "gtw_trusted": true,
      "timestamp": 929034364,
      "time": "2018-07-28T18:13:20Z",
      "channel": 7,
      "rssi": -109,
      "snr": 6.25,
      "latitude": 46.554905,
      "longitude": 15.635378
    },
    {
      "gtw_id": "YYYY",
      "gtw_trusted": true,
      "timestamp": 4006964724,
      "time": "2018-07-28T18:13:20Z",
      "channel": 7,
      "rssi": -73,
      "snr": 9.75
    }
  ]
}
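
A note for reading the device log and the gateway metadata together: in the EU868 regional parameters, DR0 to DR5 map to SF12 down to SF7 at 125 kHz, which is why the `DR: 5` uplinks show up as `SF7BW125` on the gateway side (and why MaxPayloadSize goes from 51 to 242). A minimal lookup sketch in C follows; the helper name `eu868_dr_name` is made up for illustration and is not part of the core.

```c
#include <stddef.h>

/* EU868 data rate index -> modulation, per the LoRaWAN regional parameters.
 * eu868_dr_name is a hypothetical helper for reading logs, not a core API. */
static const char *const EU868_DR_NAMES[] = {
    "SF12BW125",   /* DR0 */
    "SF11BW125",   /* DR1 */
    "SF10BW125",   /* DR2 */
    "SF9BW125",    /* DR3 */
    "SF8BW125",    /* DR4 */
    "SF7BW125",    /* DR5 */
    "SF7BW250",    /* DR6 */
    "FSK 50 kbps"  /* DR7 */
};

/* Returns the modulation name for a data rate index, or NULL if out of range. */
const char *eu868_dr_name(unsigned dr)
{
    size_t count = sizeof(EU868_DR_NAMES) / sizeof(EU868_DR_NAMES[0]);
    return (dr < count) ? EU868_DR_NAMES[dr] : NULL;
}
```

With this table, `eu868_dr_name(0)` gives the SF12 setting seen in the first two log lines and `eu868_dr_name(5)` the SF7 setting reported by the gateways.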

SloMusti · 2018-07-29T08:15:30Z

Observed the hang now with serial attached. This time the transmissions stopped when ADR was supposed to change to DR5:

TRANSMIT( TimeOnAir: 70843, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 60, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 71999, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 61, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 73155, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 62, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 74311, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 63, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 75467, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 64, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 76623, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 65, DownLinkCounter: 0 )

GrumpyOldPizza · 2018-07-29T11:46:57Z

Is there a way for you to redirect the output to a UART instead of USB? I'd like to isolate whether it's perhaps a USB issue. Looks like you see this after 65 uplinks. Does this always happen at that point?

SloMusti · 2018-07-29T12:35:31Z

I can do that. However, it does not always appear at this point. I have also disabled ADR and the problem remains, so it may not be directly correlated.

SloMusti · 2018-08-02T11:41:50Z

The logging has been done over a UART instead of USB and the fault persists, so USB is definitely not related to the issue.

We have now tested on 4 devices, all behaving exactly the same. @GrumpyOldPizza can you please let me know if you replicate the issue. Note we are using 868MHz EU band.

GrumpyOldPizza · 2018-08-02T12:20:02Z

I have not been able to reproduce the issue. Is it possible that it is gateway related ?


SloMusti · 2018-08-04T19:11:44Z

@GrumpyOldPizza this was tested on 5+ gateways in different cities, running on Raspberry Pi + RAK831 or IC880a, or Laird indoor. The common factor is that they all use TheThingsNetwork servers. Are you using those, or Loriot, or something else?

GrumpyOldPizza · 2018-08-04T19:52:02Z

I am using Multitech gateways.


SloMusti · 2018-08-05T08:11:00Z

@GrumpyOldPizza ok, but with what backend?

s54mtb · 2018-08-05T21:42:27Z

Hi! I had similar issues with Murata modules and the ST LoRaWAN stack.

I was running 5 different sensors using the muRata Type ABZ module and the LoRaWAN stack from STMicro.

The application hangs after a random time, from a few hours to several days (not more than 3 days). A module may hang after only 50 packets on one run, but then send more than 1k packets on the next.

The hardware used for testing:
http://e.pavlin.si/2018/05/07/lora-module-in-dil-form/

Complete sensor used for the testing:
http://e.pavlin.si/2018/07/03/particle-sensor-with-lora/

The latest software was committed here:
https://github.com/s54mtb/LoRaDunchy/tree/master/sw/Projects/PM-Sensor

My changes compared to the demo application:

- Power down is not used, since the PM sensor consumes quite some power and everything is powered constantly.
- Duty cycle is 30 seconds (APP_TX_DUTYCYCLE 30000).
- VCOM is not used.
- I2C and UART communication for the sensors has been added (no dynamic memory/heap is used).
- A counter has been added which triggers a re-join after half an hour. Without that, none of the modules worked longer than a few hours. Rejoining didn't resolve the issue; it just prolonged the time until the data stopped being sent.

When the module hangs, LoraSend() is still being executed, but no signal gets through (TTN receives no data). The MCU is alive, timers are ok, sensor readings are ok.

I also tested sending without any sensor interaction (just sending constant numbers instead of actual sensor readout) and it had no influence on the occurrence of the issue.

Gateways and backend are the same as @SloMusti reported above.

GrumpyOldPizza · 2018-08-06T07:15:48Z

Let me recheck this on my local gateways. My last tests were about a week long with testing recovery from power outages. But I did not see anything like this. However this was US915.


s54mtb · 2018-09-13T06:03:39Z

After a long field testing period I have some results:

- The LoraSend() function included in the STM examples is called from the RTC ISR, which causes trouble with pending IRQs if interrupt-driven UART I/O is called from this function. I solved this by using the UART between LoRa sends (while waiting for the next RTC alarm timeout).
- Added external pull-up resistors for I2C: the internal pull-ups are not adequate.
- Upgraded the ST LoRaWAN stack to version 1.2.0: running the provided examples on the demo board with the STM sensor "shield" worked for 10+ days without an issue.
- Removed all UART code for VCOM/diagnostic output and now use the UART for the HPM (particle) sensor only. The HPM sensor seems to freeze from time to time.
- Added an external transistor for switching power to the HPM sensor. Its main purpose is to reset the HPM sensor when an error is detected during readout, because the HPM sensor has no "reset" command. This improved the reliability of the operation.

It seems the major issues were in the peripherals and not in the stack, and were mostly related to proper configuration of the MCU/NVIC. That was not documented properly in the first versions of the STM stack. The latest documentation provided by STM is much more detailed, and it helped in solving the issues with the NVIC.
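
The first point (doing the send, and interrupt-driven UART I/O, from inside the RTC ISR) is commonly avoided by letting the ISR only record that a send is due and doing the actual work from the main loop. Below is a minimal sketch of that pattern in C. The names `rtc_alarm_isr` and `take_tx_due` are hypothetical, and this is not the code from the ST examples or from the LoRaDunchy firmware.

```c
#include <stdbool.h>

/* Set from interrupt context, consumed from the main loop. */
static volatile bool tx_due = false;

/* Hypothetical RTC alarm handler: record the event and return at once.
 * No UART, I2C, or radio calls are made here, so other interrupts are
 * not held pending behind a long-running handler. */
void rtc_alarm_isr(void)
{
    tx_due = true;
}

/* Called from the main loop. Returns true exactly once per recorded
 * alarm; the caller then performs the send in thread context, where
 * interrupt-driven peripherals can be used. */
bool take_tx_due(void)
{
    if (!tx_due) {
        return false;
    }
    tx_due = false; /* cleared before the send, so an alarm that fires
                       during the send is still seen on the next poll */
    return true;
}
```

In the main loop this would read roughly as `if (take_tx_due()) { /* read sensors, then send */ }`, with the sensor UART traffic also kept outside the ISR.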

GrumpyOldPizza · 2018-09-13T12:08:57Z

So this is really not related to ArduinoCore-stm32l0. Again, I have not seen those problems here at all.

SloMusti · 2018-09-13T16:45:34Z

@GrumpyOldPizza I was able to observe such a problem with ArduinoCore-stm32l0: the device stops transmitting after a while. Can you please point me to which version of the STM LoRa stack this core is running, and where it would be best to evaluate interrupt priorities, in case they really are the cause of the hang-ups?

GrumpyOldPizza · 2018-09-13T17:12:44Z

The stack is derived from LoRaMac-node 4.4.1. I doubt that it's the interrupt priorities. RTC-based timeouts and DIO IRQ handling, which drive the stack, are escalated to a PENDSV callback. So are common peripheral callbacks, like "Serial.onReceive()" (which you are unlikely to use).

There is of course always the chance of another bug somewhere. But what strikes me as curious is that you are pretty much the only one seeing this issue.

SloMusti · 2018-09-13T18:45:56Z

@GrumpyOldPizza Just checking: did you test most of the nodes in the US or EU bands? I doubt it, but there might be something related to that.

GrumpyOldPizza · 2018-09-13T20:20:38Z

Obviously, yes.


SloMusti · 2018-09-14T10:23:36Z

I have performed the experiment in the following configuration:

B-L072Z-LRWAN1 board 1: LoraWAN-TTN-OTAA example code
B-L072Z-LRWAN1 board 2: LoraWAN-TTN-OTAA example code with NO Serial

Both crashed after about 4000 messages, almost simultaneously. Repeating the experiment now to validate.

GrumpyOldPizza · 2018-09-14T11:41:53Z

Did you use "setDutyCycle(false)" or the default ?

GrumpyOldPizza · 2018-09-15T11:54:17Z

Ok, used the LoRaWAN_OTA.ino example with "setDutyCycle(false)". After almost 24 hours and 8500 transmissions, it's still alive on B-L072Z-LRWAN1. This is on EU868.

GrumpyOldPizza · 2018-09-16T15:06:44Z

Tried a 2nd board with ADR off, hence always DR_0. That one also survived a day without a crash. The first board is now on day 2 1/2 with a message every 10 seconds. Also no crash or anything.

Unless there is a good reason to keep this open, I am gonna close the issue.

SloMusti · 2018-09-16T15:20:58Z

I am repeating the same test as you have defined, will need to wait a day or so to see if a crash occurs and then report back.

SloMusti · 2018-09-16T15:23:56Z

Actually, I have just now observed a crash on both devices with the LoRaWAN_OTA.ino example with "setDutyCycle(false)". One had sent 584 messages, the other 364. The next thing to try is ADR off, to see if that has any effect.

GW config:
RAK831 on RPi
Lorix One

GrumpyOldPizza · 2018-09-16T16:36:11Z

I am not sure what to do. It works fine here with 2 B-L072Z-LRWAN1 boards, as well as all others. Nobody else has mentioned sudden crashes after a short period of time.

Obviously I am using a different gateway (and am on Linux).

What are the last 50 messages printed out via serial console ?

Otherwise I'd suggest you contact me via grumpyoldpizza@gmail.com so that you can arrange to send me your hardware (RAK831 gateway and one of the failing B-L072Z-LRWAN1 boards).

GrumpyOldPizza · 2018-09-17T14:35:51Z

Ok, got a repro after 3 days. I am not positive it's the same issue as you got, but it's possible. Essentially a corrupted frame on RX1 will keep LoRaWAN.busy() set to true (triggered for me by ADR). I tend to believe that a multicast frame not addressed to this node may cause this as well.

In general it may be possible that the gateway sends some invalid packet (or LoRaWAN 1.1 extension to a LoRaWAN 1.0.2 node), which might trip up the LoRaWAN class as well.

That will take a few days to sort out.
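
The failure mode described in this comment can be pictured as a busy flag that is released only on the expected outcomes of a receive window. The sketch below is a simplified model in C, not the actual LoRaWAN.cpp or LoRaMac-node code; the status values, struct, and function names are all hypothetical. The point it illustrates is that every indication outcome, including error ones, has to end the pending transaction.

```c
#include <stdbool.h>

/* Hypothetical receive outcomes; a real stack has many more codes. */
typedef enum {
    RX_OK,
    RX_TIMEOUT,
    RX_ERROR,               /* e.g. a corrupted frame in RX1 */
    RX_MULTICAST_NOT_OURS   /* frame addressed to another node */
} rx_status_t;

typedef struct {
    bool     busy;       /* true while an uplink transaction is pending */
    unsigned rx_errors;  /* count of unexpected outcomes, for diagnostics */
} node_t;

/* Starts an uplink; refused while the previous one is still pending. */
bool node_send(node_t *n)
{
    if (n->busy) {
        return false;
    }
    n->busy = true;
    return true;
}

/* Buggy variant: only the expected outcomes release the flag, so an
 * unexpected status leaves the node busy forever. */
void on_indication_buggy(node_t *n, rx_status_t s)
{
    if (s == RX_OK || s == RX_TIMEOUT) {
        n->busy = false;
    }
}

/* Fixed variant: unexpected outcomes are counted, and the flag is
 * released on every path. */
void on_indication_fixed(node_t *n, rx_status_t s)
{
    if (s != RX_OK && s != RX_TIMEOUT) {
        n->rx_errors++;
    }
    n->busy = false;
}
```

In the buggy variant, a single `RX_ERROR` makes every later `node_send()` return false while the main loop keeps running, which matches the symptom reported at the top of the thread.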

SloMusti · 2018-09-17T14:52:20Z

Well spotted, thank you for the effort.

I believe it would also be good to figure out a watchdog, so that the device recovers on its own if such a problem appears after it has been deployed somewhere inconvenient. Did you happen to look into this yet with this core?

GrumpyOldPizza · 2018-09-17T15:13:22Z

A watchdog will not help there. It's an internal bug where the code waits for a McpsIndication that either never arrives, or arrives with an error that was not documented originally (multicast).

Should be half way simple to fix. But I need to crosscheck all code paths in LoRaMac-node to see whether other errors can pop up (that are not handled properly). My bigger problem is how to test this. Where I am located physically there are no other gateways close by, only some faint US915 ones ... So checking out those boundary conditions is tricky.
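
For context on why a plain watchdog does not catch this: the main loop keeps running and would keep kicking it while the stack stays busy. An application-level alternative, which nobody in this thread adopts or tests, is to measure how long the busy state has lasted and treat a long stall as a fault. A minimal sketch in C, with hypothetical names and a caller-supplied millisecond clock:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t busy_since_ms; /* time at which busy was first observed */
    bool     was_busy;      /* busy state seen on the previous update */
    uint32_t limit_ms;      /* longest busy period considered healthy */
} stall_detector_t;

void stall_init(stall_detector_t *d, uint32_t limit_ms)
{
    d->busy_since_ms = 0;
    d->was_busy = false;
    d->limit_ms = limit_ms;
}

/* Call periodically with the current time and the current busy state.
 * Returns true once busy has lasted longer than limit_ms. The unsigned
 * subtraction stays correct across a wrap of the millisecond counter. */
bool stall_update(stall_detector_t *d, uint32_t now_ms, bool busy)
{
    if (!busy) {
        d->was_busy = false;
        return false;
    }
    if (!d->was_busy) {
        d->was_busy = true;
        d->busy_since_ms = now_ms;
        return false;
    }
    return (uint32_t)(now_ms - d->busy_since_ms) > d->limit_ms;
}
```

In a sketch this could be fed with `millis()` and the value of `LoRaWAN.busy()`, with the application deciding what to do (log, rejoin, reset) when it returns true. The limit has to be longer than any legitimate busy period, which is an assumption to check against the duty cycle and receive window timing in use.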

SloMusti · 2018-09-17T17:59:40Z

So far I have observed regular crashes at my location, so I am happy to run tests when necessary. Alternatively I can provision a RPi so that you can upload remotely and test. Would that work?

GrumpyOldPizza · 2018-09-17T18:24:45Z

Since the issue has to do with other LoRaWAN traffic ... doesn't make sense to send me anything. I had assumed a Gateway issue, or a simple hardware issue with B-L072Z-LRWAN1 before.

I'll test locally on US915 and see whether the fix I have survives a good chunk of packets (switched to 5 second intervals).

The github will be updated in a few hours after the first shakedown.

GrumpyOldPizza · 2018-09-18T03:49:07Z

I have updated the repository with the proper fix. Will test over night (and the next few days) whether it does not introduce another issue. So no updated json file yet.

Mind either installing via github, or simply copying the updated LoRaWAN.cpp into the proper place?

SloMusti · 2018-09-20T15:42:13Z

@GrumpyOldPizza I have been testing your code for 2 days now and it still works on two devices.

SloMusti · 2018-09-20T15:45:04Z

@s54mtb reports another problem with frame counters, not using this core but the STM stack directly: the LoRaMac hangs upon reaching the maximal frame counter value 0xffff. This has been reproduced with LoRaMacSetFCntUp() and including "LoRaMacFCntHandler.h". It would be good to test whether the same thing happens with this core.

Workaround at the moment is:

    uplinkcounter = GetUplinkCounter();
    if (uplinkcounter >= 0x0000ffff) {
        NVIC_SystemReset();   // Reset everything
    }

GrumpyOldPizza · 2018-09-20T15:48:23Z

I have here 4 boards (1x B-L072Z-LRWAN1 and 3x Grasshopper) doing various different things, pinging on the same Gateway. No failure so far.


GrumpyOldPizza · 2018-09-20T16:02:55Z

I think 1.0 and 1.0.1 allow for 16-bit counters. 1.0.2 is 32-bit clean per the standard. However, packet-wise only the lower 16 bits get transmitted. The code in LoRaMac-node 4.4.1 is ok. The one in 4.4.2 is busted:

    // Add difference, consider roll-over
    fCntDiff = ( int32_t )macMsg->FHDR.FCnt - ( int32_t )( previousDown & 0x0000FFFF );

You cannot use int32_t arithmetic to get an int16_t rollover via wraparound.
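
To illustrate the wraparound point: since only the lower 16 bits of the counter travel in the frame, the difference to the previous value has to be taken modulo 2^16 for a roll-over to come out as a small positive step. The sketch below contrasts the two calculations in C. The function names are made up for this example and this is not the LoRaMac-node code; `fcnt_diff_32` only mirrors the shape of the quoted expression.

```c
#include <stdint.h>

/* Difference taken in 32 bits, in the shape of the quoted 4.4.2 line. */
int32_t fcnt_diff_32(uint16_t rx_fcnt, uint32_t previous)
{
    return (int32_t)rx_fcnt - (int32_t)(previous & 0x0000FFFFu);
}

/* Difference taken modulo 2^16, then read as a signed 16-bit value.
 * Converting a uint16_t above INT16_MAX to int16_t is
 * implementation-defined in C, but wraps on two's complement targets
 * such as Cortex-M with GCC. */
int16_t fcnt_diff_16(uint16_t rx_fcnt, uint32_t previous)
{
    return (int16_t)(uint16_t)(rx_fcnt - (uint16_t)previous);
}

/* Rebuilds the full 32-bit counter from the 16 bits received. */
uint32_t fcnt_extend(uint16_t rx_fcnt, uint32_t previous)
{
    return previous + (uint32_t)(int32_t)fcnt_diff_16(rx_fcnt, previous);
}
```

With a previous counter of 0x0001FFFE and a received value of 0x0002, the 32-bit difference is -65532, which looks like a frame far in the past, while the 16-bit difference is 4 and the extended counter becomes 0x00020002.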


SloMusti · 2018-09-24T09:17:14Z

No failures on my side either, currently at 80000+ frames on two devices.

GrumpyOldPizza · 2018-09-24T12:32:26Z

Ok, closing out. Here it's been alive for a week or so, every 5 seconds ...

GrumpyOldPizza closed this as completed Sep 24, 2018

romansoft mentioned this issue Sep 27, 2018

Feature request: add setters for uplink and downlink counters #41

Closed

kevin192291 mentioned this issue Jul 2, 2021

Recieve transmissions for 1 day, nothing for 3 days, receive transmissions again. #192

Open
