[LoRaWAN] Transmission stops after about a day #27

Closed
SloMusti opened this issue Jul 28, 2018 · 38 comments

Comments

@SloMusti

I have been testing the robustness of the Murata module using the B-L072Z-LRWAN1 with this core (version 0.0.7) and have encountered an issue where transmissions to the gateway stop after about a day. This has been confirmed on all boards, with multiple gateways in two different cities, to exclude other factors.

The code running on the device is attached below: a simple transmission every 10 s. I have yet to capture the serial log up to the crash, but it does not appear to be an issue with the main code loop, which keeps executing.

Any suggestions or ideas for debugging this are welcome, and it would help if someone else could test this independently.
loradiscoveryttnworking.txt

@GrumpyOldPizza
Owner

GrumpyOldPizza commented Jul 28, 2018 via email

@SloMusti
Author

Furthermore, this issue has been reported by @s54mtb as well, using firmware unrelated to this repository, so there may well be something STM/Murata related: https://github.com/s54mtb/LoRaDunchy/tree/master/sw

@SloMusti
Author

Tracing possible causes now with serial logging and a power analyzer. One thing is apparent now: the data rate changes due to ADR. I will try to correlate whether that is an issue.

TRANSMIT( TimeOnAir: 74311, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 63, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 75467, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 64, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 76623, NextTxTime: 0, MaxPayloadSize: 242, DR: 5, TxPower: 12.0dbm, UpLinkCounter: 65, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 76675, NextTxTime: 0, MaxPayloadSize: 242, DR: 5, TxPower: 12.0dbm, UpLinkCounter: 66, DownLinkCounter: 0 )
Messages are received by two gateways:
{
  "time": "2018-07-28T18:13:20.698336253Z",
  "frequency": 867.9,
  "modulation": "LORA",
  "data_rate": "SF7BW125",
  "coding_rate": "4/5",
  "gateways": [
    {
      "gtw_id": "XXXX",
      "gtw_trusted": true,
      "timestamp": 929034364,
      "time": "2018-07-28T18:13:20Z",
      "channel": 7,
      "rssi": -109,
      "snr": 6.25,
      "latitude": 46.554905,
      "longitude": 15.635378
    },
    {
      "gtw_id": "YYYY",
      "gtw_trusted": true,
      "timestamp": 4006964724,
      "time": "2018-07-28T18:13:20Z",
      "channel": 7,
      "rssi": -73,
      "snr": 9.75
    }
  ]
}

@SloMusti
Author

Observed the hang now with serial attached. This time the transmissions stopped when ADR was supposed to change to DR5:

TRANSMIT( TimeOnAir: 70843, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 60, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 71999, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 61, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 73155, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 62, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 74311, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 63, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 75467, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 64, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 76623, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 65, DownLinkCounter: 0 )
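
To correlate a hang with ADR, TRANSMIT lines like the ones above can be scanned for data-rate changes. The following is a minimal sketch that assumes the log format shown in this thread; `TxRecord`, `readField`, `parseTransmit` and `dataRateChanges` are hypothetical helper names and not part of the core.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Fields of interest from one "TRANSMIT( ... )" log line.
struct TxRecord {
    long dataRate;       // value after "DR:"
    long uplinkCounter;  // value after "UpLinkCounter:"
};

// Reads the integer that follows `key` in `line`; returns false if the key
// is missing or is not followed by a number.
static bool readField(const std::string& line, const char* key, long& out) {
    assert(key != nullptr);
    const std::size_t pos = line.find(key);
    if (pos == std::string::npos) return false;
    return std::sscanf(line.c_str() + pos + std::strlen(key), "%ld", &out) == 1;
}

// Parses one log line; returns false for anything that is not a TRANSMIT line.
bool parseTransmit(const std::string& line, TxRecord& rec) {
    if (line.compare(0, 9, "TRANSMIT(") != 0) return false;
    return readField(line, " DR:", rec.dataRate) &&
           readField(line, "UpLinkCounter:", rec.uplinkCounter);
}

// Returns the uplink counters of all TRANSMIT lines whose DR differs from
// the DR of the previous TRANSMIT line. Other lines are skipped.
std::vector<long> dataRateChanges(const std::vector<std::string>& lines) {
    std::vector<long> changes;
    bool havePrevious = false;
    long previousDr = 0;
    for (const std::string& line : lines) {
        TxRecord rec{};
        if (!parseTransmit(line, rec)) continue;
        if (havePrevious && rec.dataRate != previousDr) {
            changes.push_back(rec.uplinkCounter);
        }
        previousDr = rec.dataRate;
        havePrevious = true;
    }
    return changes;
}
```

Fed the four TRANSMIT lines from the earlier comment (UpLinkCounter 63 to 66), `dataRateChanges` returns a single entry, 65, which is the uplink where DR went from 0 to 5. For the six lines directly above it returns an empty list, since DR stays at 0 throughout.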

@GrumpyOldPizza
Owner

Is there a way for you to redirect the output to a UART instead of USB ? I'd like to isolate whether it's perhaps a USB issue. Looks like you see this after 65 uplinks. Does this always happen at that point ?

@SloMusti
Author

I can do that. However, it does not always appear at this point. I have also disabled ADR and the problem remains, so it may not be directly correlated.

@SloMusti
Author

SloMusti commented Aug 2, 2018

Logging has been via the serial UART and the fault persists, so USB is definitely not related to the issue.

We have now tested on 4 devices, all behaving exactly the same. @GrumpyOldPizza can you please let me know if you can replicate the issue? Note we are using the 868 MHz EU band.

@GrumpyOldPizza
Owner

GrumpyOldPizza commented Aug 2, 2018 via email

@SloMusti
Author

SloMusti commented Aug 4, 2018

@GrumpyOldPizza this was tested on 5+ gateways in different cities, running on Raspberry Pi + RAK831, IC880a, or Laird indoor. The common factor is that they all use TheThingsNetwork servers. Are you using those, or Loriot, or something else?

@GrumpyOldPizza
Owner

GrumpyOldPizza commented Aug 4, 2018 via email

@SloMusti
Author

SloMusti commented Aug 5, 2018

@GrumpyOldPizza ok, but with what backend?

@s54mtb

s54mtb commented Aug 5, 2018

Hi! I had similar issues with Murata modules and the ST LoRaWAN stack.

I was running 5 different sensors using the Murata Type ABZ module and the LoRaWAN stack from STMicro.

The application hangs after a random time, from a few hours to several days (never more than 3 days). A module that hangs after 50 packets in one run may send more than 1k packets in the next.

The hardware used for testing:
http://e.pavlin.si/2018/05/07/lora-module-in-dil-form/

Complete sensor used for the testing:
http://e.pavlin.si/2018/07/03/particle-sensor-with-lora/

The latest software was committed here:
https://github.com/s54mtb/LoRaDunchy/tree/master/sw/Projects/PM-Sensor

My changes compared to the demo application:

  • power down is not being used, since PM sensor consumes quite some power and everything is powered constantly.

  • duty cycle is 30 seconds (APP_TX_DUTYCYCLE 30000)

  • VCOM is not being used

  • I2C and UART communication for sensors has been added (no dynamic memory/ heap is being used)

  • a counter has been added, which triggers a re-join after half an hour. Without it, none of the modules worked for longer than a few hours. Rejoining didn't resolve the issue; it just prolonged the time until the module stopped sending data.

When module hangs, LoraSend() is being executed, but no signal gets through (TTN receives no data). MCU is alive, timers are ok, sensor readings are ok.

I also tested sending without any sensor interaction (just sending constant numbers instead of actual sensor readouts) and it had no influence on the occurrence of the issue.

Gateways and backend are the same as @SloMusti reported above.

@GrumpyOldPizza
Owner

GrumpyOldPizza commented Aug 6, 2018 via email

@s54mtb

s54mtb commented Sep 13, 2018

After a long field-testing period I got some results:

  • LoraSend() function included in STM examples is called from RTC ISR -> trouble with pending IRQ if UART IO using interrupts is called from this function. Solved this by using UART between lora sends (while waiting for the next RTC alarm timeout)

  • Added external pull-up resistors for I2C: the internal pull-ups are not sufficient.

  • Upgraded the ST LoRaWAN stack to version 1.2.0: running the provided examples on the demo board with the STM sensor "shield" worked for 10+ days without an issue.

  • Removed all UART code for VCOM/diagnostic output and use the UART for the HPM (particle) sensor only. The HPM sensor seems to freeze from time to time.

  • Added external transistor for switching power for the HPM sensor. Main purpose of this is to reset the HPM sensor when error is detected during readout, because the HPM sensor has no command for "reset". This improved the reliability of the operation.

It seems the major issues were in the periphery and not in stack and mostly related to proper configuration of the MCU/NVIC. That was not documented properly in the first versions of the STM stack. Latest updated documentation provided by STM is much more detailed and it helped solving issues with NVIC.
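
One common way to avoid this class of problem is to keep the RTC interrupt handler minimal and run the actual send from the main loop, where interrupt-driven UART or I2C transfers can complete normally. The sketch below only illustrates that pattern on a host compiler; it is not the code from the ST examples, and `rtcAlarmIsr`, `doSend` and `mainLoopOnce` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>

// Set by the (simulated) RTC alarm ISR, consumed by the main loop.
static std::atomic<bool> g_sendPending{false};

// Number of times the deferred send actually ran.
static int g_sendCount = 0;

// Hypothetical stand-in for the RTC alarm interrupt handler: it only records
// that a send is due and returns immediately, without doing any I/O.
void rtcAlarmIsr() {
    g_sendPending.store(true);
}

// Hypothetical stand-in for the send routine (sensor reads, then the LoRa
// send). It runs in thread context, not inside the ISR.
static void doSend() {
    ++g_sendCount;
}

// One pass of the main loop: run the send if the ISR requested one.
// Returns true if a send was performed on this pass.
bool mainLoopOnce() {
    // exchange() reads and clears the flag in one step, so a request is
    // neither lost nor handled twice.
    if (!g_sendPending.exchange(false)) {
        return false;
    }
    doSend();
    return true;
}
```

Two alarms that fire before the main loop gets to run collapse into a single send, which is usually what a periodic sensor report wants. On the MCU itself a `volatile` flag cleared inside a short interrupt-disabled section would serve the same purpose as `std::atomic` here; that substitution is an assumption and has not been tested on the STM32L0.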

@GrumpyOldPizza
Owner

So this is really not related to ArduinoCore-stm32l0. Again, I have not seen those problems here at all.

@SloMusti
Author

@GrumpyOldPizza I was able to observe such a problem with ArduinoCore-stm32l0: the device stops transmitting after a while. Can you please point me to which version of the STM LoRa stack this core is running, and where it would be best to evaluate interrupt priorities, should this really be the cause of the hangups?

@GrumpyOldPizza
Owner

The stack is derived from LoRaMac-node 4.4.1. I doubt that it's the interrupt priorities. RTC based timeouts and DIO IRQ handling, which drive the stack are escalated to PENDSV callback. So are common peripheral callbacks, like "Serial.onReceive()" (which you are unlikely to use).

There is of course always the chance of another bug somewhere. But what strikes me as curious is that you are pretty much the only one who sees this issue.

@SloMusti
Author

@GrumpyOldPizza Just checking: did you test most of the nodes in the US or the EU band? I doubt it, but there could be something related to that.

@GrumpyOldPizza
Owner

GrumpyOldPizza commented Sep 13, 2018 via email

@SloMusti
Author

SloMusti commented Sep 14, 2018

I have performed the experiment in the following configuration:

B-L072Z-LRWAN1 board 1: LoraWAN-TTN-OTAA example code
B-L072Z-LRWAN1 board 2: LoraWAN-TTN-OTAA example code with NO Serial

Both crashed after about 4000 messages, almost simultaneously. Repeating the experiment now to validate.

@GrumpyOldPizza
Owner

Did you use "setDutyCycle(false)" or the default ?

@GrumpyOldPizza
Owner

Ok, used the LoRaWAN_OTA.ino example with "setDutyCycle(false)". After almost 24 hours and 8500 transmissions, it's still alive on B-L072Z-LRWAN1. This is on EU868.

@GrumpyOldPizza
Owner

Tried a 2nd board with ADR off, hence always DR_0. That one also survived a day without a crash. The first board is now on day 2 1/2 with a message every 10 seconds. Also no crash or anything.

Unless there is a good reason to keep this open, I am gonna close the issue.

@SloMusti
Author

I am repeating the same test as you have defined, will need to wait a day or so to see if a crash occurs and then report back.

@SloMusti
Author

SloMusti commented Sep 16, 2018

Actually, I have just now observed a crash on both devices with the LoRaWAN_OTA.ino example with "setDutyCycle(false)". One had sent 584 messages, the other 364. The next thing to try is ADR off, to see if that has any effect.

GW config:
RAK831 on RPi
Lorix One

@GrumpyOldPizza
Owner

I am not sure what to do. It works fine here with 2 B-L072Z-LRWAN1 boards, as well as all others. I have no other reports from anybody else about sudden crashes after a short period of time.

Obviously I am using a different gateway (and am on Linux).

What are the last 50 messages printed out via serial console ?

Otherwise I'd suggest you contact me via grumpyoldpizza@gmail.com so that you can arrange to send me your hardware (RAK831 gateway and one of the failing B-L072Z-LRWAN1 boards).

@GrumpyOldPizza
Owner

Ok, got a repro after 3 days. I am not positive it's the same issue as you got, but it's possible. Essentially a corrupted frame on RX1 will keep LoRaWAN.busy() set to true (triggered for me by ADR). I tend to believe that a multicast frame not addressed to this node may cause this as well.

In general it may be possible that the gateway sends some invalid packet (or LoRaWAN 1.1 extension to a LoRaWAN 1.0.2 node), which might trip up the LoRaWAN class as well.

That will take a few days to sort out.

@SloMusti
Author

Well spotted, thank you for the effort.

I believe it would also be good to set up a watchdog, so that the device recovers by itself if such a problem appears after it has been deployed somewhere inconvenient. Did you happen to look into this yet with this core?

@GrumpyOldPizza
Owner

A watchdog will not help there. It's an internal bug where the code waits for a McpsIndication that either never arrives, or arrives with an error that was not documented originally (multicast).

Should be half way simple to fix. But I need to crosscheck all code paths in LoRaMac-node to see whether other errors can pop up (that are not handled properly). My bigger problem is how to test this. Where I am located physically there are no other gateways close by, only some faint US915 ones ... So checking out those boundary conditions is tricky.
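
The failure mode described above can be modelled in a few lines. This is a simplified illustration only, not the code in LoRaWAN.cpp or LoRaMac-node: `Node`, `RxStatus`, `onIndicationBuggy` and `onIndication` are hypothetical names, and the status values are a reduced stand-in for the real indication codes.

```cpp
#include <cassert>

// Simplified outcomes of a receive window (hypothetical, not the stack's enum).
enum class RxStatus { Ok, Error, MulticastNotForUs };

// Minimal model of a node that is "busy" from the start of an uplink until
// the stack reports how the receive windows ended.
class Node {
public:
    bool busy() const { return busy_; }

    // Starts a transaction; only valid while the node is idle.
    void startUplink() {
        assert(!busy_);
        busy_ = true;
    }

    // Buggy variant: only a clean indication ends the transaction, so any
    // other status leaves busy() stuck at true.
    void onIndicationBuggy(RxStatus status) {
        if (status == RxStatus::Ok) {
            busy_ = false;
        }
    }

    // Defensive variant: every indication ends the transaction, whatever
    // its status, so the application can always send again.
    void onIndication(RxStatus /*status*/) {
        busy_ = false;
    }

private:
    bool busy_ = false;
};
```

With the buggy handler, a single `RxStatus::Error` or `RxStatus::MulticastNotForUs` leaves `busy()` true forever while the main loop keeps running, which matches the symptom reported in this thread. The model does not cover the case where no indication arrives at all; that would additionally need a timeout that ends the transaction.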

@SloMusti
Author

So far I have observed regular crashes at my location, so I am happy to run tests when necessary. Alternatively I can provision an RPi so that you can upload and test remotely. Would that work?

@GrumpyOldPizza
Owner

Since the issue has to do with other LoRaWAN traffic ... doesn't make sense to send me anything. I had assumed a Gateway issue, or a simple hardware issue with B-L072Z-LRWAN1 before.

I'll test locally on US915 and see whether the fix I have survives a good chunk of packets (switched to 5 second intervals).

The github will be updated in a few hours after the first shakedown.

@GrumpyOldPizza
Owner

I have updated the repository with the proper fix. Will test over night (and the next few days) whether it does not introduce another issue. So no updated json file yet.

Mind either installing via github, or simply copy the updated LoRaWAN.cpp into the proper place ?

@SloMusti
Author

SloMusti commented Sep 20, 2018

@GrumpyOldPizza I have been testing your code for 2 days now and it still works on two devices.

@SloMusti
Author

@s54mtb reports another problem with frame counters, seen when using the STM stack directly rather than this core: the LoRaMac hangs upon reaching the maximal frame counter value 0xffff. This has been reproduced with LoRaMacSetFCntUp() and by including "LoRaMacFCntHandler.h". It would be good to test whether the same thing happens with this core.

Workaround at the moment is:

    uplinkcounter = GetUplinkCounter();
    if (uplinkcounter >= 0x0000ffff) {
        NVIC_SystemReset();   // Reset everything
    }
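
The limit in the workaround matches the 16-bit FCnt field carried in each LoRaWAN frame header. As a rough sketch of how soon a node gets there, the helpers below count the uplinks and hours remaining before that limit; `reachedFCntLimit`, `uplinksUntilLimit` and `hoursUntilLimit` are hypothetical names and not part of the ST stack or this core.

```cpp
#include <cassert>
#include <cstdint>

// Largest value that fits in the 16-bit FCnt field of a frame header.
constexpr uint32_t kFCnt16Max = 0xFFFFu;

// True once the uplink counter has reached the 16-bit limit. This is the
// same condition the workaround above uses to trigger a reset.
constexpr bool reachedFCntLimit(uint32_t uplinkCounter) {
    return uplinkCounter >= kFCnt16Max;
}

// Number of uplinks left before the counter reaches the limit.
constexpr uint32_t uplinksUntilLimit(uint32_t uplinkCounter) {
    return uplinkCounter >= kFCnt16Max ? 0u : kFCnt16Max - uplinkCounter;
}

// Hours left before the limit when sending one uplink every
// `intervalSeconds` seconds.
double hoursUntilLimit(uint32_t uplinkCounter, uint32_t intervalSeconds) {
    return uplinksUntilLimit(uplinkCounter) *
           static_cast<double>(intervalSeconds) / 3600.0;
}
```

At the 10 second interval used in the tests in this thread, a counter starting at zero reaches 0xffff after roughly 182 hours, or about seven and a half days, so a long soak test is needed to exercise this case.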

@GrumpyOldPizza
Owner

GrumpyOldPizza commented Sep 20, 2018 via email

@GrumpyOldPizza
Owner

GrumpyOldPizza commented Sep 20, 2018 via email

@SloMusti
Author

No failures on my side either, currently at 80000+ frames on two devices.

@GrumpyOldPizza
Owner

Ok, closing out. Here it's been alive for a week or so, every 5 seconds ...
