Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Corruption Issue with latest release #854

Closed
TMRh20 opened this issue Jul 18, 2022 · 40 comments
Closed

Data Corruption Issue with latest release #854

TMRh20 opened this issue Jul 18, 2022 · 40 comments
Assignees
Labels

Comments

@TMRh20
Copy link
Member

TMRh20 commented Jul 18, 2022

Something went wrong in the last release of RF24. I started seeing data corruption issues on my RF24Ethernet nodes, and ended up downgrading all the libraries until I got to RF24. Downgrading to previous release is what fixed the issues.

Messed up:
MessedUPRF24Ethernet
After Downgrading
FixedRF24Ethernet

So I assume this issue will affect all higher layer libraries, kind of a crappy issue. I'm not sure where to start looking, and I don't have too much time today to look for the problem, so we'll see how long it takes to figure out I guess.
Wondering if the clang formatting messed up some logic or something?

I would suggest downgrading to 1.4.2 and waiting until we have a fix for any users that encounter the same or similar issues.

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 18, 2022

TMRh20 added a commit that referenced this issue Jul 18, 2022
@TMRh20 TMRh20 closed this as completed Jul 18, 2022
@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

Did adding those blank lines fix it?

I think that change was part of #846 (see description). To be safe, we should probably use curly brackets for all loops and conditions. The referenced commit doesn't conform to clang-format settings (Linux CI failed).

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 18, 2022

Yup, otherwise its like teh following without the line breaks:

while (condition){ while (condition) {} }

@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

ok, It looks like you committed a trailing whitespace on the blank lines, which is causing the clang-format to fail.

There is another place where I think a blank line would be needed

RF24/RF24.cpp

Lines 344 to 346 in d7ba9c2

while (data_len--) *ptx++ = *current++;
while (blank_len--) *ptx++ = 0;

which affects Linux; you're commit affects Arduino only.

@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

fixed trailing whitespace and added a blank line in 3f786bf

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 18, 2022 via email

@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

damn. I was kinda glad to was that easy. I should've known better.

@2bndy5 2bndy5 reopened this Jul 18, 2022
@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

I think the Arduino lib manager hasn't published v1.4.4 yet, so we might still have 24 hours.

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 18, 2022 via email

@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

I can go through and make sure all single statement loops and conditions use curly brackets (just as a first pass). I'm not at all familiar with setting up RF24Ethernet, so I may have trouble reproducing this on my end...

I'll be focusing on this diff (specifically the RF24.cpp file):
v1.4.2...v1.4.3#diff-d0133d6229b2d23257dc3cd4fb7822b96ae4b8f54d1095ca23abf223582f7968

@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

ok I submitted a patch on the finding-data-corruption branch which adds curly brackets to all single line loops. I didn't find any single line condition statements.

master...finding-data-corruption

I also went throught the release diff v1.4.2...v1.4.3 for the RF24_config.h and RF24.h (after going through RF24.cpp), and all I could see is whitespace differences (mostly related to indentation of C syntax like #if ... #else ... #endif)

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 18, 2022 via email

@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

I'm going to take a look at the diff for nRF24/RF24Mesh@v1.1.6...v1.1.8 because I think there was some single line loops/conditions in that lib also. You may have actually solved the corruption on this level, so I'm keeping an open mind about the other layers also.

@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

Coincidentally, my local clone of RF24Mesh was somehow corrupted; git was saying "claims to have 125 objects while index indicates 188 objects" - whatever that means... I deleted then re-cloned RF24Mesh and git seems happy again - I doubt the git corruption made it into the remote because I wasn't able to commit anything. After re-cloning, I ran

PS path\to\GitHub\RF24Mesh> git fsck --full
Checking object directories: 100% (256/256), done.
Checking objects: 100% (2187/2187), done.

I'm guessing 100% means it is ok.

@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

Nonetheless, I submitted a patch to nRF24/RF24Mesh@master...finding-data-corruption

It was mostly single line condition statements that didn't have curly brackets.

@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

nRF24/RF24Network@v1.0.16...v1.0.17 just has whitespace changes; there's no single line loops or conditions that don't already use curly brackets.

I'm going to start testing... I'm assuming you experienced this on a MCU/Arduino, but I'll run some fragged msg tests on both Linux and Arduino.

@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

Just noticed this in the build log for the RF242Mesh examples:

/home/pi/github/RF24Mesh/examples_RPi/ncurses/RF24Mesh_Ncurses_Master.cpp: In function ‘void drawTopology()’:
/home/pi/github/RF24Mesh/examples_RPi/ncurses/RF24Mesh_Ncurses_Master.cpp:241:25: warning: variable ‘y’ set but not used [-Wunused-but-set-variable]
  241 |                     int y = 0;
      |                         ^
/home/pi/github/RF24Mesh/examples_RPi/ncurses/RF24Mesh_Ncurses_Master.cpp:261:25: warning: variable ‘y’ set but not used [-Wunused-but-set-variable]
  261 |                     int y = 0;
      |                         ^
/home/pi/github/RF24Mesh/examples_RPi/ncurses/RF24Mesh_Ncurses_Master.cpp:281:25: warning: variable ‘y’ set but not used [-Wunused-but-set-variable]
  281 |                     int y = 0;
      |

I can address it on the RF24Mesh finding-data-corruption branch.

@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

I suppose we can use clang-tidy to notify us of these types of warnings...

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 18, 2022

I can address it on the RF24Mesh finding-data-corruption branch.

Sure, may as well clean it up while we're in the process of fixing things

@2bndy5
Copy link
Member

2bndy5 commented Jul 18, 2022

Turns out y is needed as an arg to getyx(topoPad, y, x), but I can suppress the warning... I'm thinking like a bogus if statement where if (y >= 0) since it should never be negative, but I have to check curses coordinate system first.

@2bndy5
Copy link
Member

2bndy5 commented Jul 19, 2022

Well, I haven't tested mesh yet, but the network layer seems to work well with fragged msgs. For the record I'm using pyrf24 on 1 end because it is still pinned to the libs prior to pigpio changes (the benefit of using submodules). On the other end I'm using my trusty QtPy M0 (ATSAMD21) with latest from RF24 finding-data-corruption branch and latest release of RF24Network.

ps - I really like the utility from the pyrf24/examples/general_network_test.py. Its really well suited for this. I wouldn't mind having a C++ variant of it in the nRF24/.github repo (because it isn't specific to only net or only mesh layers).

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 19, 2022

I'm confident that's all it was, just the line breaks needed between the loops with no brackets. I think we should try to get a release out this evening.

@2bndy5
Copy link
Member

2bndy5 commented Jul 19, 2022

I'm onboard with that, at least for RF24. I still need to test the changes in RF24Mesh before we issue a new release for it.

@2bndy5
Copy link
Member

2bndy5 commented Jul 19, 2022

I'll also change clang-format to not allow short loops and conditions. It will automatically change them to use curly brackets and multiple lines... We at www.github.com/cpp-linter org are developing a way to use a pre-commit action that will automatically commit changes made by clang-format (and clang-tidy if we ever started to use that).

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 19, 2022

Hmm, testing some more shows there are still issues. I could have sworn it was working this morning :(
Not sure, but 1.4.4 is still giving me errors 1.4.2 works fine. Testing with the stock RF24Ethernet example and a few different Arduinos (nano, pro mini, duemilanove)

@2bndy5
Copy link
Member

2bndy5 commented Jul 19, 2022

Maybe I should increase the frequency of the fragged TX. I had it delayed 3 seconds to allow the payload to be printed to Serial.

When I ran it with 2 sec delay between TX, some fragged payloads weren't getting received as a single msg... not sure if I worded that right.

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 19, 2022

I'm kind of stumped. Going through the diff between 1.4.2 and 1.4.3 I don't see what would be causing it, unless there is a memory issue or similar higher up on the stack that changes are exposing?

@2bndy5
Copy link
Member

2bndy5 commented Jul 19, 2022

The only notable changes I see to the mesh layer is renewAddress()

@2bndy5 2bndy5 closed this as completed Jul 19, 2022
@2bndy5 2bndy5 reopened this Jul 19, 2022
@TMRh20
Copy link
Member Author

TMRh20 commented Jul 19, 2022

Well I just found something interesting...

I tried downgrading from 1.4.4 to 1.4.2 manually, file by file. Replacing RF24.cpp and RF24.h made no change, but replacing the RF24_config.h did.
If I remove the newly added lines of

#ifndef sprintf_P
    #define sprintf_P sprintf
#endif // sprintf_P

Then it works fine... Does this make any sense?
Also, did you make that change, and why?

@2bndy5
Copy link
Member

2bndy5 commented Jul 19, 2022

It might. I almost forgot about that addition. Now I have to ask, what board are you using?

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 19, 2022

AVRs, nano, pro mini and duemilanove

@2bndy5
Copy link
Member

2bndy5 commented Jul 19, 2022

Geez, I haven't wired up one of those in a while now. I'm looking at the AVR core to see if it should be ifdef'd out. The addition for that function was really only meant for cores that don't have printf support, but it was intended to be used for sending printdets info over MQTT.

tagging @dstroy0

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 19, 2022

But why is it causing data corruption? That doesn't really seem to make sense to me unless its exposing an issue somewhere else.
I tested defining sprintf_P in the arduino sketch instead of the library, and it continues to work fine.

@2bndy5
Copy link
Member

2bndy5 commented Jul 19, 2022

Did the sketch actually have code that calls the function? sprintf() is part of the deprecated-avr-comp/avr/pgmspace.h I'm not sure if deprecated has any real meaning because the AVR core pulls in avr/pgmspace .h. I'm trying the figure out the build flags to make sure the pgmspace.h is even part of an include path.

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 19, 2022

Ahh, the sketch does call sprintf_P

@2bndy5
Copy link
Member

2bndy5 commented Jul 19, 2022

The config code in question was specifically added for Linux because it doesn't use a pgmspace.h to define the sprintf_P alias. We could probably do better to make sure it is only defined for Linux.

@2bndy5
Copy link
Member

2bndy5 commented Jul 19, 2022

ok, I moved the define to just below where utility/includes.h is pulled in. Turns out it was also used for the PicoSDK...

@2bndy5
Copy link
Member

2bndy5 commented Jul 19, 2022

I also moved the include for stdio out of RF24.cpp and into utility/RP2/arch_config.h; all Linux drivers were already pulling in stdio (probably for printf). This should almost conclude revisions about #821 . I've been thinking that we might also want to #ifdef sprintf_P out the sprintf* functions in the lib just to be safe.

@TMRh20
Copy link
Member Author

TMRh20 commented Jul 19, 2022

I've been thinking that we might also want to #ifdef sprintf_P out the sprintf* functions in the lib just to be safe.

We got it working! Lets maybe try to keep the changes minimal.

@2bndy5 2bndy5 closed this as completed Jul 19, 2022
@2bndy5
Copy link
Member

2bndy5 commented Jul 19, 2022

I forgot to bump the version in the properties files before release. I was able to fix it and re-release., but wow what a pain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants