Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialize vars early to avoid living references #3409

Merged
merged 3 commits into from
Aug 8, 2024

Conversation

sl0thentr0py
Copy link
Member

@sl0thentr0py sl0thentr0py commented Aug 7, 2024

Since we added the recursive option for the EventScrubber in #2755, vars nested inside could be replaced with AnnotatedValue and potentially break userland code.

We really shouldn't be holding references to frame.f_locals throughout our SDK, this has all sorts of breakage potential.

This is a bit hacky, but we'll simply call serialize early on vars. I tried patching deepcopy in #3392 to work with our requirements, but I was simply reimplementing most of this serialize function.

Turns out @sentrivana did the same in #2117

Copy link

codecov bot commented Aug 7, 2024

Codecov Report

Attention: Patch coverage is 68.75000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 79.78%. Comparing base (5529c70) to head (ae7621c).

✅ All tests successful. No failed tests found.

Files Patch % Lines
sentry_sdk/scope.py 33.33% 2 Missing and 2 partials ⚠️
sentry_sdk/serializer.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3409      +/-   ##
==========================================
- Coverage   79.78%   79.78%   -0.01%     
==========================================
  Files         133      133              
  Lines       14418    14418              
  Branches     3036     3036              
==========================================
- Hits        11504    11503       -1     
+ Misses       2083     2082       -1     
- Partials      831      833       +2     
Files Coverage Δ
sentry_sdk/client.py 79.80% <100.00%> (-0.05%) ⬇️
sentry_sdk/integrations/pure_eval.py 66.23% <100.00%> (+0.44%) ⬆️
sentry_sdk/utils.py 83.16% <100.00%> (+0.02%) ⬆️
sentry_sdk/serializer.py 90.16% <75.00%> (+0.69%) ⬆️
sentry_sdk/scope.py 84.94% <33.33%> (-0.46%) ⬇️

... and 1 file with indirect coverage changes

@philipstarkey
Copy link

Hi @sl0thentr0py

(I'm the person who originally reported the bug to Sentry)

I'm worried that this approach is going to severely restrict our ability to scrub PII from the data we send to Sentry. By serialising the frame locals prior to the scrubber, we lose context about the Python type of the variables. This means we can no longer identify sensitive data by object type and scrub it out. For example, with this proposed change, I think a Django model instance would be converted to a string that contained PII before it got to the scrubber (for example, if the model __str__ method returned a string containing PII, which is common for Django applications). This would be very hard to identify in the scrubber - we'd have to fall back on variable/field names which is hard to do in a way that catches 100% of all relevant local variable names that might contain PII, for now and for any future changes in our codebase.

To give some context, we're currently using the Event Scrubber to convert variables with potential PII into something that we can use to identify the content later, without sending that content to Sentry (as often the content of the variable that contains PII is critical to identifying and reproducing an issue).

Am I correct here or is there another way that we can achieve PII scrubbing based on data type?

Copy link
Contributor

@sentrivana sentrivana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, see comment

return None

return False
return is_vars
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't comment on the specific lines because not part of the diff, but:
L123: return type is now always bool
Please also update the docstring

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed method and documented serialize instead.

@sl0thentr0py
Copy link
Member Author

hmm @philipstarkey I see, that's a valid concern.

So the reason I looked at this again was because of a report where we patch the original f_locals and actually potentially break user code. We should generally not be holding the original variable references for longer than necessary.

Further, the scrubber was meant to be a simpler before_send with a simple denylist so that people can remove some vars by name easily from the event payload.

I'm going to go ahead with merging this PR, but at the same time I will expose another before_serialize_vars (naming TBD) that you can hook into to do precisely what you want in a separate PR.

@philipstarkey
Copy link

philipstarkey commented Aug 8, 2024

@sl0thentr0py Thanks! I appreciate the consideration. I've only had a very cursory look at the serialization code but it occurred to me that being able to provide a custom implementation for safe_repr (or rather, being able to customise the function that safe_repr calls so it's not just repr()) might also address my concern? It has a very clear interface (get object, return a string representation of the object) so it feels like something that could easily pluggable by end-users such as myself. I could be completely wrong though!

@sl0thentr0py
Copy link
Member Author

@philipstarkey we actually already have that but we haven't documented it openly since it's a footgun. You can set __sentry_repr__ on your class and we will pick it up.

sentry_repr = getattr(type(obj), "__sentry_repr__", None)

Could you try that and let me know if it works for you?

@sl0thentr0py sl0thentr0py merged commit 7d46709 into master Aug 8, 2024
123 of 124 checks passed
@sl0thentr0py sl0thentr0py deleted the neel/serialize-vars-early branch August 8, 2024 11:45
@philipstarkey
Copy link

@sl0thentr0py I did happen to see that actually! It's where I got the idea from 🙂 Unfortunately __sentry_repr__ doesn't quite do what we want because:

  • we can't easily set this on objects we don't control (numpy arrays, pandas data frames, and Django QuerySets come to mind)
  • most importantly we'd like something that would fail closed on unknown object types

Now that this is merged I'd also be happy to continue the conversation in a new issue if you'd like.

@sl0thentr0py
Copy link
Member Author

aha okay, so we have 2 options, both of which I'm fine with, so I'll let you choose which one you prefer:

  • A before_serialize_vars hook
  • A custom repr function that will be tried in safe_repr first
    I assume you prefer the latter?

@philipstarkey
Copy link

@sl0thentr0py Thanks, yes, I think we prefer option 2 - the option for a custom repr function to be tried in safe_repr (my current understanding of how everything works is that this will both solve our needs and reduce the likelihood of this bug to creeping back in).

Thanks for giving us the choice 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants