Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Improve API/event validation in synapse #8445

Open
richvdh opened this issue Oct 1, 2020 · 8 comments
Open

Improve API/event validation in synapse #8445

richvdh opened this issue Oct 1, 2020 · 8 comments
Labels
A-Validation 500 (mostly) errors due to lack of event/parameter validation T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@richvdh
Copy link
Member

richvdh commented Oct 1, 2020

Background

We've recently encountered a number of bugs in which malformed (or at least, unexpected) data has caused Synapse or clients to misbehave in some way. These bugs stem from the fact that, faced with a given datastructure, you cannot rely on it having the expected format. For example:

Such bugs are disruptive, and in extreme cases could progress beyond "denial of service" to "security threat", and it might be possible to avoid them by validating data at the point of entry to Synapse. We've recently discussed this in some depth within the core Synapse team; this issue serves to record some of our thoughts on the topic, including promising areas for further development.

Introduction

There are actually multiple reasons why it would be useful to improve validation of data within Synapse. These include:

  • As above, avoiding potential bugs
  • Simplifying code by removing isinstance checks everywhere
  • Forcing clients to follow the specification rather than using "whatever works in Synapse", thus improving compatibility with other server implementations which do enforce the spec (or conversely: not forcing other implementations to grandfather in Synapse's misfeatures for compatibility with clients)
  • Making synapse return 400 errors rather than "Internal Server Errors" if clients pass invalid parameters would:
    • Give better feedback to client developers: the 500 error gives the developer little information about their error.)
    • Allow server administrators to distinguish between "my server is having problems" and "somebody passed invalid parameters to my server"
    • Allow load-balancers to distinguish between "this process is having problems" and "somebody passed invalid parameters". Notably, Cloudflare will stop sending requests to an origin which returns too many 500 errors.

There are multiple related, but different, things we mean when we talk about validation. At a high level, these can be broken down into:

  • validating API request parameters (both CS and SS API); and
  • validating events (and other such data types, like EDUs etc).

Let's consider these in turn.

API validation

Given that we already have JSON schema specifications for our APIs, this is theoretically relatively straightforward (see, for example c39941c, which is a proof-of-concept applying this to a single endpoint), though it's certain to uncover a large number of places where clients are relying on non-spec-compliant behaviour. We also have to be wary, especially on the SS API, to ensure backwards compatibility.

Doing this would certainly reduce the number of false 500 errors, which as above brings a number of advantages. It might also reduce the occurrences of bugs due to bad events (e.g. non-string display names) since updated Synapses will correctly reject them on the CS API; however, it will absolutely not fix those bugs, since such malformed data could still be received from buggy or malicious servers over federation.

Event validation

Validating events, particularly those received over federation, is quite hard as:

  1. you need to decide what to do with events that fail validation,
  2. retaining backwards compatibility means it’s hard to reject/drop/decline to handle events that fail strict validation.

For example, an event with malformed m.relates_to data can’t just be dropped as, according to the authentication rules of all room versions to date, it is a valid event, even though its payload (that Synapse still interprets) is invalid, since annotations were added after the current room versions were specced.

The main problem we have currently is that events contain a mix of properties which are validated on receipt (auth_events, prev_events, etc) alongside a bunch of untrusted data that we cannot assume nor assert the types of. Then, when we come to access event data, we need to remember to add checks for any untrusted data. While room versions allow us to add stricter schemas, relying on that approach will always be a case of playing catchup as we’ll want to use new features in older room versions where possible. (See also MSC2801 on this subject.)

Ideally, therefore, we’d try and add some tooling to make it easier to statically assert (or at least tell easily in the code) whether the fields you’re accessing on events (and other data types) have been validated or not. Hopefully it is possibly to do something with mypy here, however it will require some experimentation here.

Conclusions

In summary there are a few paths going forward:

  1. validate schemas of APIs;
  2. validate a basic schema for events (and other data types), potentially using room versions to allow us to enforce stricter schemas while retaining backwards compatibilty; and
  3. experiment with mypy and other tooling (as well as code changes) to try and highlight the difference between validated and unvalidated data in events, and to enforce correct validating when using untrusted data.

Both 1 and 2 would be good things to do, and may reduce the number of occurrences of bugs, but won’t actually fix the class of bugs we’re seeing due to unexpected formats of various keys in events. However, while paths 1 and 2 are something that we know how to do, path 3 has a lot more unknowns attached to it, but ultimately is the only option that will fully prevent the class of bugs we’re seeing.

@ShadowJonathan
Copy link
Contributor

ShadowJonathan commented Oct 2, 2020

Small note about 3: mypy could be a good fit to enforce internal consistency of data access after validation checks have passed, by annotating return types with TypedDict or @attr.s classes, with which other areas of the program can then work guaranteed with validated and enforced data structures. Also, after validation, passed data structure (but not content) should be immutable, only transformed to database-ready types or dicts (for example) for serialisation. (#8376 could be relevant to this point)

Additionally; Should we use 400 when schema validation fails, and 422 (Unprocessable Entity) when event validation fails? Should that be reflected back into the spec?

(422 according to the mozilla documentation: "The HyperText Transfer Protocol (HTTP) 422 Unprocessable Entity response status code indicates that the server understands the content type of the request entity, and the syntax of the request entity is correct, but it was unable to process the contained instructions.")

@clokep clokep added the A-Validation 500 (mostly) errors due to lack of event/parameter validation label Oct 2, 2020
@bmarty
Copy link

bmarty commented Feb 16, 2021

I will not create a dedicated issue, but for instance Synapse CS API will accept if a client send a state event with un-specified value. For instance using "foo" for the "join_rule" field of https://matrix.org/docs/spec/client_server/r0.6.1#m-room-join-rules

This break compatibility on some clients (ex: on Element Android this Event is ignored and not displayed in the timeline), and I'm not sure what happen server side regarding the room properties and what should be the fallback we display to the user in this case.

At least on the CS API I would expect to get an error 400 when trying to send such malformed event.

@ShadowJonathan
Copy link
Contributor

I'd also like to mention pydantic here as an interesting option, as it provides speedy (C-extension-backed) validation of objects.

@erikjohnston erikjohnston added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Jul 26, 2021
@clokep
Copy link
Member

clokep commented Aug 12, 2021

(see, for example c39941c, which is a proof-of-concept applying this to a single endpoint)

I had done some testing with this and found it to be quite a bit slower (from memory about 100x slower). I had put together a benchmark script at https://gist.github.com/clokep/17064b16a9b471d09feb3e61851886af

Results from running this on my 2016 MBP with Python 3.9.6 (times in nanoseconds):

Test Manual JSON Schema
Empty 305 14200
Error 309 39800
Correct 294 22600
Raw results
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (3.38 us) is 24% of the mean (14.2 us)
* the maximum (31.5 us) is 121% greater than the mean (14.2 us)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

json_schema_empty: Mean +- std dev: 14.2 us +- 3.4 us
.....................
manual_empty: Mean +- std dev: 305 ns +- 21 ns
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (5.16 us) is 13% of the mean (39.8 us)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

json_schema_wrong_type: Mean +- std dev: 39.8 us +- 5.2 us
.....................
WARNING: the benchmark result may be unstable
* the maximum (475 ns) is 54% greater than the mean (309 ns)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

manual_wrong_type: Mean +- std dev: 309 ns +- 27 ns
.....................
json_schema_good: Mean +- std dev: 22.6 us +- 1.4 us
.....................
manual_good: Mean +- std dev: 294 ns +- 11 ns

@DMRobertson
Copy link
Contributor

I had done some testing with this and found it to be quite a bit slower (from memory about 100x slower).

@clokep I think your benchmark just does the validation and nothing else? I'd be interested to see how those numbers compare when they're part of request handling as a whole.

@clokep
Copy link
Member

clokep commented Aug 12, 2021

I had done some testing with this and found it to be quite a bit slower (from memory about 100x slower).

@clokep I think your benchmark just does the validation and nothing else? I'd be interested to see how those numbers compare when they're part of request handling as a whole.

Yes, that's what I was attempting to benchmark. 😄 (It also is completely fake -- the schemas of most of our endpoints are much more complicated.) As I had said in chat, I didn't spend too much time on this and was attempting to get a gut of whether it would make sense to drop a lot more time into it or not.

DMRobertson pushed a commit that referenced this issue Aug 20, 2021
* Validate device_keys for C-S /keys/query requests

Closes #10354

A small, not particularly critical fix. I'm interested in seeing if we
can find a more systematic approach though. #8445 is the place for any discussion.
@Kay-kay2019
Copy link

Working it

@jaywink
Copy link
Member

jaywink commented Apr 14, 2022

One case I ran into - Synapse accepts an object into the allow of join rules, when it should be an array of objects.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Validation 500 (mostly) errors due to lack of event/parameter validation T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

No branches or pull requests

8 participants