Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Courier: Failing message stalls queue #1598

Closed
spg opened this issue Jul 27, 2021 · 12 comments
Closed

Courier: Failing message stalls queue #1598

spg opened this issue Jul 27, 2021 · 12 comments
Assignees
Milestone

Comments

@spg
Copy link
Contributor

spg commented Jul 27, 2021

Describe the bug

When an outgoing message fails to send, the message queue is stalled.

For example, if AWS SES replies with error 554 Message rejected: Email address is not verified, the oldest message in the queue will be retried forever. Newer messages in the queue are not processed.

A user could also stall the queue by submitting an invalid email address (e.g. me@notld).

Reproducing the bug

Steps to reproduce the behavior:

Using the Selfservice UI Node project, setup the courier config (I tried using AWS SES), and try recovering a password using an unverified email address email1@domain.com. This should cause the courier to fail:

time=2021-07-27T15:06:40Z level=error msg=Unable to send email using SMTP connection. func=github.com/ory/kratos/courier.(*Courier).DispatchMessage file=/home/ory/courier/courier.go:187 audience=application error=map[message:gomail: could not send email 1: 554 Message rejected: Email address is not verified. The following identities failed the check in region US-EAST-1: email1@domain.com trace:stack trace could not be recovered from error type *mail.SendError] message_from=noreply@myapp.com service_name=Ory Kratos service_version=v0.7.0-alpha.1 smtp_server=email-smtp.us-east-1.amazonaws.com:587 smtp_ssl_enabled=false

Try recovering another email address (email2@domain.com). You will see in the logs that the previous email address (email1@domain.com) keeps being repeated in the logs, without mention of the new address (email2@domain.com).
This indicates that the courier queue keeps retrying sending to email1@domain.com without regard for email2@domain.com.

Expected behavior

The courier queue should not stall when a particular message fails to send. Other messages in the queue should be processed, even if a given message fails.

Environment

  • Version: v0.7.1-alpha.1
  • Environment: Docker

Additional context

N/A

@aeneasr
Copy link
Member

aeneasr commented Jul 27, 2021

I assume the problem is here

kratos/courier/courier.go

Lines 180 to 189 in 60d848d

if err := m.Dialer.DialAndSend(ctx, gm); err != nil {
m.d.Logger().
WithError(err).
WithField("smtp_server", fmt.Sprintf("%s:%d", m.Dialer.Host, m.Dialer.Port)).
WithField("smtp_ssl_enabled", m.Dialer.SSL).
// WithField("email_to", msg.Recipient).
WithField("message_from", from).
Error("Unable to send email using SMTP connection.")
return errors.WithStack(err)
}

as we do not expect the SMTP server to reject the message with a validation error. One issue will probably be that every SMTP server will be returning a different error message here, making it difficult to differentiate between:

  • Validation failed
  • TCP connection failed
  • SMTP is down / unavailable

I took a quick glance at the SMTP error codes: https://en.wikipedia.org/wiki/List_of_SMTP_server_return_codes

It appears that 5xx errors should not be retried. So maybe adding a switch here that if it is a 5xx error, we set the status of the message to sent / failed, in which case it shouldn't be retried.

What do you think?

@spg
Copy link
Contributor Author

spg commented Jul 27, 2021

Yes your idea makes sense, this would definitely solve the retrying problem I'm experiencing.

@aeneasr
Copy link
Member

aeneasr commented Jul 27, 2021

Nice, would you be up to make a PR for this? :)

@spg
Copy link
Contributor Author

spg commented Jul 28, 2021

Sure, I don't have experience in Go, but I can try my hand at it!

@zepatrik
Copy link
Member

Actually duplicate of #402
So please also consider the discussion there

@aeneasr aeneasr added this to the v0.9.0-alpha.1 milestone Aug 18, 2021
@xeruf
Copy link

xeruf commented May 5, 2022

Please tackle this! Our infrastructure was severely crippled for a few days as a nonexistent receiver email was stuck in the queue and no further emails were sent.

@aeneasr
Copy link
Member

aeneasr commented May 5, 2022

@xeruf - this is an open source repository. The best thing you can do, if something bothers you, is to contribute the changes needed to get this working as you want.

@varac
Copy link

varac commented May 5, 2022

@aeneasr Sorry, but I disagree. I was witnessing this disruptive behaviour @xeruf described and only responding with "Please create a PR" is not really welcoming. The resonsability for fix severe bugs (which I consider this one is) lies in the maintainers and developers of the project, and it's stakeholders, not by the users who run into this. Sure, contributions are always welcome in open source projects but not as in "if you don't like this bug fix it" attitude. Feature requests are a different thing, for sure.
So also from me as a so far happy kratos admin who don't want to end up in such a situation: Please prioritze this.

@aeneasr
Copy link
Member

aeneasr commented Jul 21, 2022

Maintainers of popular open source projects like this one are burning out en masse while companies and users of those open source projects reap the benefits without either (a) putting money on the table or (b) contributing towards the health of the system. Just demanding "please fix this" is not an acceptable behaviour in open source communities. Please read up on this topic:

  1. https://www.jeffgeerling.com/blog/2022/burden-open-source-maintainer
  2. https://www.jeffgeerling.com/blog/2020/saying-no-burnout-open-source-maintainer
  3. https://www.infoworld.com/article/3563326/what-does-an-open-source-maintainer-do-after-burnout.html
  4. https://opensource.com/article/21/7/burnout-open-source
  5. https://github.blog/2022-06-01-welcome-to-maintainer-month/
  6. https://www.businessinsider.com/open-source-developers-burnout-low-pay-internet-2022-3
  7. https://daverupert.com/2021/12/sustaining-maintaining/

It would be great if the industry could become bit more sensitive to these topics.

In any case, since this is now a priority for our mail sender, we'll tackle this - so we all get the best of all worlds ;)

@xeruf
Copy link

xeruf commented Jul 22, 2022

I am totally with you on that, and I hope our startup will soon be able to support the open-source projects it builds on, but currently we are unfortunately very sparse when it comes to both developer time and finances.
Good to hear this is moving forward :)

@aeneasr
Copy link
Member

aeneasr commented Jul 22, 2022

Thank you for understanding, I appreciate the nice response :))

@varac
Copy link

varac commented Aug 16, 2022

@aeneasr Thanks for fixing ! Is there any planned release which would ship this fix ?

peturgeorgievv pushed a commit to senteca/kratos-fork that referenced this issue Jun 30, 2023
This PR replaces the `courier.message_ttl` configuration option with a `courier.message_retries` option to limit how often the sending of a message is retried before it is marked as `abandoned`. 

BREAKING CHANGES: This is a breaking change, as it removes the `courier.message_ttl` config key and replaces it with a counter `courier.message_retries`.

Closes ory#402
Closes ory#1598
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants