Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Please clarify the exact sanitization process for messages. #1286

Closed
mikeshardmind opened this issue Dec 29, 2019 · 22 comments
Closed

Please clarify the exact sanitization process for messages. #1286

mikeshardmind opened this issue Dec 29, 2019 · 22 comments

Comments

@mikeshardmind
Copy link
Contributor

mikeshardmind commented Dec 29, 2019

The only current documented guidance on this is the following:

Discord may strip certain characters from message content, like invalid unicode characters or characters which cause unexpected message formatting. If you are passing user-generated strings into message content, consider sanitizing the data to prevent unexpected behavior. For example, to prevent unexpected mentions you could consider stripping @ characters or placing a zero-width space (U-200B) after them.

However, the behavior of collapsing characters has been observed with NFC-Normalized Unicode with all control characters pre-processed out. With this being the case, the above definition about it happening only with invalid Unicode characters, or with characters with unexpected formatting is clearly not correct.

This leaves no ability for a bot to take in web content, such as from RSS feeds without removing all @ characters, which changes the intent of a message. Changing this to insert zero-width spaces is also not universally acceptible, as there may be things in web content which users should be able to copy paste without unintended effects.

If you don't want to provide alternative ways such as message flags or just rejecting "bad" messages (see #1276), then please at least document the behavior in such a way that developers can properly account for it without breaking normal formatting.

@night
Copy link
Member

night commented Jan 14, 2020

We have no current plans to create or maintain documentation on the sanitization techniques we employ with message sending. The existing documentation is broad enough to cover the sanitization as a whole.

We are, however, monitoring feedback/reports related to mention abuse through bots internally and may make API changes down the road if it proves to be problematic. For right now, however, the existing sanitization techniques seem to be manageable and many bots have already patched this behavior.

@night night closed this as completed Jan 14, 2020
@mikeshardmind
Copy link
Contributor Author

Many bots, mine included, took drastic measures which break other things to avoid this issue because of the lack of communication not allowing anything better.

@Kowlin
Copy link
Contributor

Kowlin commented Jan 14, 2020

Would it be possible to cite the exact documentation on this subject? Since not everyone has a existing bot here that has solved the issues of sanitation and have nothing official to reference to, what knowledge is there for new developers getting into the Discord API to understand what discord does in-between from the message being send to Discord, and Discord rendering it on the end user's end?

@Jaesaces
Copy link

Jaesaces commented Jan 14, 2020

what knowledge is there for new developers getting into the Discord API to understand what discord does in-between from the message being send to Discord, and Discord rendering it on the end user's end?

mikeshardmind quotes it a few messages up. It's not particularly helpful.

@mikeshardmind
Copy link
Contributor Author

While I did quote that, I also mentioned how it is insufficient and also incorrect on top of it, and would generally agree with the question asked.

To be explicitly clear this section from the docs:

Discord may strip certain characters from message content, like invalid unicode characters or characters which cause unexpected message formatting.

Is outright incorrect, as I pointed out:

However, the behavior of collapsing characters has been observed with NFC-Normalized Unicode with all control characters pre-processed out.

There's no such thing as "surprising formatting" or invalid unicode when sending something already NFC normalized.

This is entirely misleading, and would be better replaced by "don't send messages containing user content, we're not gonna explain anything"

@eritbh
Copy link
Contributor

eritbh commented Jan 14, 2020

For right now, however, the existing sanitization techniques seem to be manageable...

It seems to me that bot devs have mostly been 'managing' this through manual testing and hearsay from other developers; isn't the point of having an open-source documentation repository that people can contribute things like this to a central source of information? At the very least, failing Discord providing adequate documentation for their own system, shouldn't people be able to list the known properties of the system in the docs?

Moreover, the system was 'manageable' for a long time before the recent uptick of mention abuse, but then it wasn't, because a previously unknown property of the system was discovered. People patched things and bots are mostly working again, but what happens when another anomaly is discovered? It's strange to me that Discord is so reticent to provide this information to developers interested in securing their bots against these incidents (and thereby reducing the likelihood of them occurring in the first place, saving everyone at Discord time and effort in response).

@DiscordLiz
Copy link

We are, however, monitoring feedback/reports related to mention abuse through bots internally and may make API changes down the road if it proves to be problematic. For right now, however, the existing sanitization techniques seem to be manageable and many bots have already patched this behavior.

This is a joke, right? Over the course of a few issues you've closed saying it's not an issue and Discord won't do more, you've had multiple library maintainers, a handful of large bot owners, and a couple of people who build self-hosted solutions all chime in that there needs to be more and how this is a problem. You've been presented with multiple options that developers could work with from just being clear about the documentation, to letting developers set the mentions everyone flag on messages sent, rather than that only being part of a payload from discord.

The larger bots like tatsumaki solved this by removing some functionality of their bots rather than attempt to guess at your sanitization.

@Skillz4Killz
Copy link
Contributor

Although I agree 100% that there should be a solution from Discord on this, I believe it is being blown way out of proportion.

I simply solved this issue in my bot by moving any form of content that I can not control aka User Input, Data from API etc inside of an embed. Embeds will automatically remove mentions.

@MinnDevelopment
Copy link
Contributor

MinnDevelopment commented Jan 15, 2020

Embeds will automatically remove mentions

Embeds can also be disabled on the user and permission levels. I don't see how that's a remotely satisfying solution.

@Gibigbig
Copy link

Gibigbig commented Jan 16, 2020

2020 and we still have problems with sanitation? :pepecoffee:

I agree with night on this one. If the safety of the app wasn't at risk, I would publish it, however, if doing so could mean people would find ways around it, I would keep in in-house.

@mikeshardmind
Copy link
Contributor Author

I agree with night on this one. If the safety of the app wasn't at risk, I would publish it, however, if doing so could mean people would find ways around it, I would keep in in-house.

You're heavily misunderstanding the issue. Because it hasn't been documented, bot devs are being exposed to sanitization issues. This is like if some SQLite interface silently changed your inputs changing the meaning, only in this case only causing message content to change to be heavy annoyances, rather than exploit databses.

@eritbh
Copy link
Contributor

eritbh commented Jan 16, 2020

The root cause of mention-spam exploits are bots not implementing input sanitization correctly. The root cause of that is that there is no documentation about how to do it. Unless the system is documented, it's impossible to be reasonably confident that any given sanitization method covers all its bases. This situation is analogous to web developers trying to prevent XSS attacks without having any information about how HTML escaping works.

@Gibigbig, your notion that "the safety of the app is at risk" is inaccurate. Bot developers are asking about the full behavior so they can implement the proper precautions. Arguably, the safety of the app is at risk because Discord hasn't documented their system well enough, because as long as the system is a black box, attackers could still be discovering flaws in bots' sanitization methods before bot developers can patch them.

@eritbh
Copy link
Contributor

eritbh commented Jan 23, 2020

We are, however, monitoring feedback/reports related to mention abuse through bots internally...

@night It's been around a week, is there any update on the team's feedback-gathering? Given everything that's been said in this issue and the previous ones, I really think this deserves a second look.

@petersvp
Copy link

@Geo1088 Definitely the XSS case. Discord should document theirs santization. It's not that sensitive, it shoule not be a trade secret, it should be something that we can consider how safe it is and how to properly implement in our bots.

@jhgg
Copy link
Contributor

jhgg commented Feb 29, 2020

We have no plans to document our message santization methods, however, we are working on exposing expose a new set of fields you can provide when you create a message or post a webhook that will specify what kinds of mentions will be allowed to be processed in a message. More details forthcoming as we reach code completion on that.

This will let you specify on a per-message level a white-list of which mentions should be processed (by users, roles, and everyone(/here)), and specifically what users/roles should be mentioned if found in the message content. We think this should solve all of the above feedback, while allowing us to keep our sanitization methods (which change more often than you'd think) undocumented, so as an API end-user you don't have to worry about it.

@mikeshardmind
Copy link
Contributor Author

@jhgg
Thank you for the follow-up. I agree that this should be sufficient for most use cases. The only ones which it may not involve certain Unicode control characters and web content using them. (most notably right to left overrides) Can we at least get it documented that text should be pre-processed to remove control characters so that other people do not run into this, or is this detail likely to change?

@jhgg
Copy link
Contributor

jhgg commented Feb 29, 2020

It is likely to change which is why we are not wanting to document it. Our "obscure character" filtering rules change as we discover more obscure characters. Documenting this means having people rely on a specific behavior, which we do not want to have people do.

@mikeshardmind
Copy link
Contributor Author

mikeshardmind commented Feb 29, 2020

Right, I understand you may add more characters, but I meant that it could be documented as a matter of something such as:

To avoid running into issues with our internal sanitization, you may want to preprocess text to remove any control characters. Doing this does not guarantee you will not have the text be further sanitized, but does reduce the likelihood for many cases not based on malicious input

This leaves the door open to changing behavior, while noting a class of character which has legitimate use, but that discord does not allow.

At least in the case of (intentional use of) directional control characters, you can use things like bidi algorithms to remove them while getting the intended text layout.

@jhgg
Copy link
Contributor

jhgg commented Feb 29, 2020

I don't follow. If you want to remove control characters from messages before sending them to us, that's your prerogative, and I think out of scope for whatever documentation we can provide.

@mikeshardmind
Copy link
Contributor Author

mikeshardmind commented Feb 29, 2020

While it's obvious to someone who has run into the issue how to minimize some of the impacts, I figured it might be worth a small additional note that there was a way to minimize issues with the filtering by preprocessing on a bot developer's end. That said, as adding an explicit way to chose which mentions get sent removes all of the abusable edge cases of the lack of documentation, people who run into the other cases should be able to fix it themselves when noticing it without the additional frustrations of users who got mass mentions as a result. It's not a big deal either way with that additional change in the future.

@eritbh
Copy link
Contributor

eritbh commented Feb 29, 2020

This will let you specify on a per-message level a white-list of which mentions should be processed (by users, roles, and everyone(/here)), and specifically what users/roles should be mentioned if found in the message content.

This sounds like a great solution, thanks so much for the update! I'm happy that the level of control will be so fine, it does seem like pretty much every use case will have the tools needed to keep unwanted mentions out of their messages with this feature. Eagerly awaiting further news.

@jhgg
Copy link
Contributor

jhgg commented Mar 3, 2020

And this marks the hopeful conclusion of this issue: #1396

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests