error log level on nominal use case ? #1634

cyril2maq · 2024-08-01T14:09:52Z

In production we did encounter this log :

leshan/leshan-tl-cf-server-coap/src/main/java/org/eclipse/leshan/transport/californium/server/endpoint/CaliforniumServerEndpointsProvider.java

Line 159 in b05c759

    
           LOG.error("Unexpected error: Unable to find observation with token {} for registration {}",

It seems that this can be a nominal use case when lwm2m client uses wrong token or in case of race condition on removing Observation.

ERROR log level is usually a trigger for operators as a service malfunction, which is not necessary the case here, and should not be used on a potential nominal use case.

I think it should be decreased to INFO level.

What do you think ?

sbernard31 · 2024-08-01T15:09:30Z

ERROR log level is usually a trigger for operators as a service malfunction, which is not necessary the case here, and should not be used on a potential nominal use case.

Agree with that.

It seems that this can be a nominal use case when lwm2m client uses wrong token or in case of race condition on removing Observation.

I don't know if this is really a "nominal use case"

This is a case where we get a notification which was accepted by coap layer but when we search the corresponding observation at LWM2M level we didn't find it.

So theoretically, this should be more about race condition than a wrong token used but this should not be so easy to trigger for a device or maybe there is a bug on how we handle this ?

Is it something you face often ? did you identifier clearly what trigger this ?

I think it should be decreased to INFO level.

I understand your point.
But is INFO not too low for something we should not happened.
I mean when it happens there is something wrong at client or server side. (I hope it's at client side)
Maybe a WARN ?

cyril2maq · 2024-08-02T09:38:00Z

This is a case where we get a notification which was accepted by coap layer but when we search the corresponding observation at LWM2M level we didn't find it.

You're right, I missed that the Coap layer did validate the token beforehand.

Is it something you face often ?

Not often, a few times last month.

did you identifier clearly what trigger this ?

On our prod this log happened in 2 different race condition use cases :

the notification is sent few seconds AFTER the registration lifetime elapsed and the ExpirationListener.registrationExpired is raised around the same time, few milliseconds afterward.
the client send the NOTIFY (that will trigger the log afterward) right before a new REGISTRATION. Therefore the time that the notify is handled, the previous registration on which it was based has been removed.

Maybe a WARN ?

In our log policy, we do not raise WARN on expected race conditions, but it can clearly be discussed. Anyway it seems better than ERROR. In that case we would just change the log level trigger for this class.

sbernard31 · 2024-08-02T10:11:08Z

On our prod this log happened in 2 different race condition use cases

I think in both case, there is something not so good as client side. Don't you think ?

First, client doesn't to the registration update in time. (it should do that several minutes before the lifetime expired)
Second, this is less clear but maybe client should not send a notification just before a new registration ? or at least send a CON Message for this notification wait for the ACK then send the new registration ?

In our log policy, we do not raise WARN on expected race conditions, but it can clearly be discussed. Anyway it seems better than ERROR. In that case we would just change the log level trigger for this class.

ERROR is clearly too high.
My "fear" is that maybe a INFO will be too low and no nobody care about it and so will not fix the device behavior.
But maybe I'm wrong and if the message is clear enough, it could maybe do the job.

Currently we haven't really clear guideline for log level usage, see https://github.com/eclipse-leshan/leshan/wiki/Code-&-design-guidelines#code-rules

Maybe the project need something better than that, If you (Orange) could contribute/help to create together a clear Log Level policy for Leshan ?
I don't know if it exists reference practice about it ? (https://stackoverflow.com/questions/5817738/how-to-use-log-levels-in-java)
Ideally this policy should define the log level to use about "client bad behavior"

cyril2maq · 2024-08-02T12:32:56Z

I think in both case, there is something not so good as client side. Don't you think ?

Yes. But race conditions are often tricky to target as implementation error. The client can send its notify with its lifetime still OK, but the the network and SSL handshake can take some time so as the lifetime is now over.

The stackoverfkow thread you mentioned highlight the ambiguity :

WARN: Any message that might warn us of potential problems, e.g. when a user tried to log in with wrong credentials - which might indicate an attack if that happens often or in short periods of time

this definition clearly goes with the WARN choice in our case

WARN: the process might be continued, but take extra caution [...] The application can tolerate warning messages, but they should always be justified and examined.

this definition goes more with the INFO choice in our case. It is consistent with this blog post for instance : https://betterstack.com/community/guides/logging/log-levels-explained/

Clearly, in our project we set with this second definition : WARN logs and above will trigger some human analysis. And I don't want to check what happened every time this race condition occurs 😅.

But if you choose this definition, it means that most libraries should have very few WARN logs; it is usually up to the application level to know there is something unexpected that a human should oversee...

sbernard31 · 2024-08-02T13:02:36Z

Yes. But race conditions are often tricky to target as implementation error. The client can send its notify with its lifetime still OK, but the the network and SSL handshake can take some time so as the lifetime is now over.

I agree but a client should take this into account so this should happens very very rarely.
See :

leshan/leshan-lwm2m-client/src/main/java/org/eclipse/leshan/client/EndpointsManager.java

Lines 46 to 55 in b05c759

    
               /** 
        
                * Return the longest possible communication period based on given lifetime for the given server. 
        
                * <p> 
        
                * EndpointsManager should be aware about the kind of transport layer used to contact this server. So it should be 
        
                * able to calculate the communication period to avoid that registration expires taking into account specific 
        
                * transport layer latency (like DTLS Handshake, CoAP retransmission ...) 
        
                * 
        
                * @return the longest possible communication period based on given lifetime for the given server. 
        
                */ 
        
               long getMaxCommunicationPeriodFor(LwM2mServer server, long lifetimeInSeconds);

sbernard31 · 2024-08-02T13:09:34Z

Once I finish what I'm working on I will move this log to WARN.

Then if wanted, we can try to create a log level usage policy together and then adapt it again (maybe to INFO if needed)

cyril2maq · 2024-08-02T13:11:00Z

Thanks, LGTM.

sbernard31 · 2024-08-09T16:00:07Z

Once I finish what I'm working on I will move this log to WARN.

Done in 3d03a8e

Then if wanted, we can try to create a log level usage policy together and then adapt it again (maybe to INFO if needed)

I created a dedicated issue about that : #1641

I close this issue but feel free to reopen or comment if needed.

sbernard31 mentioned this issue Aug 9, 2024

Define a clear Log Level Policy for Leshan. #1641

Open

sbernard31 added a commit that referenced this issue Aug 9, 2024

GH-1634: change log level from ERROR to WARN in ServerEndpointsProvider

3d03a8e

sbernard31 closed this as completed Aug 9, 2024

slaft mentioned this issue Sep 10, 2024

NullPointer when a Notification is received while the Registration is being cleaned #1649

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

error log level on nominal use case ? #1634

error log level on nominal use case ? #1634

cyril2maq commented Aug 1, 2024

sbernard31 commented Aug 1, 2024

cyril2maq commented Aug 2, 2024

sbernard31 commented Aug 2, 2024

cyril2maq commented Aug 2, 2024

sbernard31 commented Aug 2, 2024

sbernard31 commented Aug 2, 2024

cyril2maq commented Aug 2, 2024

sbernard31 commented Aug 9, 2024

error log level on nominal use case ? #1634

error log level on nominal use case ? #1634

Comments

cyril2maq commented Aug 1, 2024

sbernard31 commented Aug 1, 2024

cyril2maq commented Aug 2, 2024

sbernard31 commented Aug 2, 2024

cyril2maq commented Aug 2, 2024

sbernard31 commented Aug 2, 2024

sbernard31 commented Aug 2, 2024

cyril2maq commented Aug 2, 2024

sbernard31 commented Aug 9, 2024