Review MQTT implementation for scalability #198

Open
pral2a opened this issue Jul 8, 2021 · 6 comments

Comments

@pral2a
Member

pral2a commented Jul 8, 2021

Review MQTT gem

Review the current MQTT library in use

Shared Subscription

Consider running multiple mqtt_subscriber.rake subscriber tasks, as originally planned, to balance the ingestion load across several Rails tasks by taking advantage of the EMQX Shared Subscriptions feature.

This could be achieved by adding support for a configuration variable in mqtt_subscriber.rake, so that multiple tasks can be instantiated via docker-compose.yml.

👓 We need to learn more about rake thread / process management. Read here. Otherwise, we could also consider doing it at the Docker level. Is that a crazy idea?
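A minimal sketch of the configuration-variable approach, assuming a hypothetical `MQTT_SHARED_GROUP` environment variable and a placeholder topic filter (neither comes from the current code). With shared subscriptions, every subscriber that subscribes to `$share/<group>/<filter>` joins the same group, and the broker delivers each message to only one member of that group, so starting several identical subscriber containers would spread the ingestion load.

```ruby
# Hypothetical helper for mqtt_subscriber.rake: builds the topic filter to
# subscribe to. MQTT_SHARED_GROUP is an assumed variable name, not an
# existing setting.
def subscription_topic(filter, env = ENV)
  group = env.fetch('MQTT_SHARED_GROUP', '').strip
  # Without a group, fall back to a normal (non-shared) subscription.
  return filter if group.empty?

  # The MQTT 5 spec forbids '/', '+' and '#' in a share name.
  raise ArgumentError, "invalid share name: #{group}" if group.match?(%r{[/+#]})

  "$share/#{group}/#{filter}"
end

# 'device/+/readings' is a placeholder filter, not the platform's real topic.
puts subscription_topic('device/+/readings', 'MQTT_SHARED_GROUP' => 'rails')
```

Doing it at the Docker level would then only mean starting the same service several times with the same group name (for example with docker-compose's `--scale` option), rather than managing threads or processes inside the rake task.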

Broker SSL/TLS with Let's Encrypt

Review the Let's Encrypt implementation on the MQTT broker server to confirm that the renewal and configuration are OK. It currently works well.

MQTT Message persistence

This option is critical: if Rails temporarily fails to ingest messages, the broker persists them for later ingestion. Combined with the new flash-based local storage on the SCK 2.1 (after SAM firmware release 0.9.8), this ensures that data is persisted by the SCK 2.1 if the broker becomes unavailable, and by the broker if the Rails subscription tasks fail.

Message persistence doesn't require any changes on the broker and is defined by the pub/sub clients. Here is an example using mosquitto as a client against our production broker, EMQ X:

$ mosquitto_pub --host mqtt.smartcitizen.me --topic 'foo/bar' -p 80 -u foo -m 'foo' -q 1
$ mosquitto_sub --host mqtt.smartcitizen.me --topic 'foo/+' -p 80 -u foo -q 1 -i bar --disable-clean-session

In principle the current MQTT library supports that feature and can be implemented as follows:

MQTT::Client.connect(host: host, client_id: client_id, clean_session: false) do |client|

However, it was implemented previously and led to some instabilities in production after an mqtt_subscriber.rake crash.

The implementation needs to be reviewed on staging, in particular to take into account the mqtt_subscriber.rake peak load that can occur after a downtime, when the broker has buffered a lot of data.
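One way to keep a post-downtime burst under control is to bound how many messages the subscriber holds in memory at once. The sketch below is not the current mqtt_subscriber.rake: it uses Ruby's standard `SizedQueue`, whose `push` blocks while the queue is full, so the receive loop slows down to the pace of a fixed pool of worker threads instead of accumulating an unbounded backlog in the Rails process. `ingest_with_backpressure` and its parameters are hypothetical names, and the `messages` enumerable stands in for the MQTT receive loop.

```ruby
# Hypothetical sketch: bounded hand-off between a receive loop and workers.
# Error handling and retries are intentionally omitted.
def ingest_with_backpressure(messages, workers: 4, max_in_flight: 100, &handler)
  queue = SizedQueue.new(max_in_flight) # push blocks while the queue is full
  processed = Queue.new                 # thread-safe collector for results

  pool = Array.new(workers) do
    Thread.new do
      # pop returns nil once the queue is closed and drained
      # (so messages themselves must never be nil).
      while (msg = queue.pop)
        processed << handler.call(msg)
      end
    end
  end

  # In mqtt_subscriber.rake this loop would be the MQTT receive loop.
  messages.each { |m| queue.push(m) }
  queue.close
  pool.each(&:join)
  Array.new(processed.size) { processed.pop }
end
```

The `max_in_flight` bound also caps how many database writes can be pending at once, which matters when the broker replays a large buffered backlog.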

@viktorsmari
Collaborator

Also, the current gem was last updated in 2016:
https://github.com/njh/ruby-em-mqtt

@pral2a
Member Author

pral2a commented Oct 25, 2022

Assessing the migration from EventMachine (and ruby-em-mqtt)

  1. EventMachine's release cadence is very slow, and ruby-em-mqtt was last updated in 2016.

  2. There are more efficient and better-maintained asynchronous I/O libraries for implementing scalable network clients in Ruby. All of them are built on nio4r. The best option looks like Async. It might actually soon become part of the Ruby 3.x stdlib; we are on Ruby 2.6.8. Here is an example migration from EventMachine to Async.

  3. However, the open question is how to implement MQTT on top of Async. I couldn't find any properly maintained MQTT implementation using Async. Here is an example of what might be the closest one.

This leaves three potential solutions to evaluate:

  • The MQTT standard allows the use of WebSocket as a network transport, and our production broker fully supports it. That means we should be able to use Async WebSockets, although we might need to implement MQTT parsing manually.
  • Implement the MQTT client in a different language, using Redis as a bus to deliver the messages as we do in other parts of the platform. That would mean breaking up the monolith further, with an impact on testing and maintainability.
  • Continue to use EventMachine and postpone any migration.
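For the WebSocket option, "implementing MQTT parsing manually" mostly means decoding the binary packet format. A minimal sketch for MQTT 3.1.1 PUBLISH packets follows; it covers only decoding, not the WebSocket connection itself (which would come from a library such as async-websocket) or the CONNECT/SUBSCRIBE/PUBACK exchange. Since a single WebSocket frame may carry a partial packet or several packets, the parser works on an accumulated binary buffer and returns `nil` until a whole packet is available. All method names are hypothetical.

```ruby
PUBLISH = 3 # MQTT control packet type for PUBLISH

# Encodes an integer as an MQTT "Remaining Length" variable byte integer.
def encode_remaining_length(value)
  bytes = []
  loop do
    byte = value % 128
    value /= 128
    byte |= 0x80 if value > 0 # continuation bit: more bytes follow
    bytes << byte
    break if value.zero?
  end
  bytes.pack('C*')
end

# Decodes a Remaining Length starting at offset. Returns [value, bytes_used],
# or nil if the buffer ends before the length is complete.
def decode_remaining_length(buffer, offset)
  value = 0
  multiplier = 1
  4.times do |i| # the spec allows at most 4 length bytes
    byte = buffer.getbyte(offset + i)
    return nil if byte.nil?
    value += (byte & 0x7F) * multiplier
    return [value, i + 1] if (byte & 0x80).zero?
    multiplier *= 128
  end
  raise ArgumentError, 'malformed remaining length'
end

# Parses one packet from the front of a binary buffer.
# Returns [packet, rest_of_buffer], or nil if the packet is incomplete.
def parse_packet(buffer)
  first = buffer.getbyte(0)
  return nil if first.nil?
  length_info = decode_remaining_length(buffer, 1)
  return nil if length_info.nil?

  length, used = length_info
  header_size = 1 + used
  total = header_size + length
  return nil if buffer.bytesize < total

  type = first >> 4
  flags = first & 0x0F
  body = buffer.byteslice(header_size, length)
  rest = buffer.byteslice(total, buffer.bytesize - total)
  return [{ type: type, flags: flags, body: body }, rest] unless type == PUBLISH

  qos = (flags >> 1) & 0x03
  topic_length = body.byteslice(0, 2).unpack1('n') # 16-bit big-endian length
  topic = body.byteslice(2, topic_length).force_encoding(Encoding::UTF_8)
  pos = 2 + topic_length
  packet_id = nil
  if qos > 0 # QoS 1 and 2 carry a packet identifier
    packet_id = body.byteslice(pos, 2).unpack1('n')
    pos += 2
  end
  payload = body.byteslice(pos, body.bytesize - pos)
  [{ type: type, qos: qos, topic: topic, packet_id: packet_id, payload: payload }, rest]
end

# Usage: a QoS 1 PUBLISH to 'foo/bar' with packet id 10 and payload 'hello'.
body = [7].pack('n') + 'foo/bar'.b + [10].pack('n') + 'hello'.b
buffer = [0x32].pack('C') + encode_remaining_length(body.bytesize) + body
packet, rest = parse_packet(buffer)
p packet[:topic], packet[:payload], rest.empty?
```

Input validation (UTF-8 checks, reserved flags, malformed bodies) is left out; a real client would also need to send PUBACKs for QoS 1 messages.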

🔍 The research continues...

@oscgonfer oscgonfer mentioned this issue Mar 1, 2023
@oscgonfer oscgonfer added this to the 1223 milestone May 4, 2023
@oscgonfer
Contributor

oscgonfer commented May 10, 2023

MQTT Persistence

Regarding the EMQX broker: we have migrated EMQX to 5.0.1 on staging, and it is running correctly with the following settings (ports and domain are redacted):

docker run -d --name emqx \
    --restart="unless-stopped" \
    --memory="3g" \
    --memory-swap="3g" \
    -p ****:**** -p ****:**** \
    -p ... \
    -e EMQX_MQTT__UPGRADE_QOS="true" \
    -e EMQX_MQTT__MQUEUE_STORE_QOS0="true" \
    -e EMQX_MQTT__SESSION_EXPIRY_INTERVAL="960h" \
    -e EMQX_MQTT__MAX_MQUEUE_LEN=10000000 \
    -e EMQX_NODE__COOKIE="*****" \
    -e EMQX_ALLOW_ANONYMOUS=false \
    -e EMQX_LISTENER__SSL__KEYFILE="/opt/emqx/etc/certs/privkey.pem" \
    -e EMQX_LISTENER__SSL__CERTFILE="/opt/emqx/etc/certs/fullchain.pem" \
    -e EMQX_LISTENER__WSS__KEYFILE="/opt/emqx/etc/certs/privkey.pem" \
    -e EMQX_LISTENER__WSS__CERTFILE="/opt/emqx/etc/certs/fullchain.pem" \
    -e EMQX_DASHBOARD__LISTENERS__HTTP__ENABLE=true \
    -e EMQX_DASHBOARD__LISTENERS__HTTP__BIND="*" \
    -e EMQX_DASHBOARD__LISTENERS__HTTP__MAX_CONNECTIONS=5 \
    -e EMQX_DASHBOARD__LISTENERS__HTTPS__ENABLE=true \
    -e EMQX_DASHBOARD__LISTENERS__HTTPS__BIND="*" \
    -e EMQX_DASHBOARD__LISTENERS__HTTPS__MAX_CONNECTIONS=5 \
    -e EMQX_DASHBOARD__LISTENERS__HTTPS__KEYFILE="/opt/emqx/etc/certs/privkey.pem" \
    -e EMQX_DASHBOARD__LISTENERS__HTTPS__CERTFILE="/opt/emqx/etc/certs/fullchain.pem" \
    -v /etc/letsencrypt/live/<domain>/fullchain.pem:/opt/emqx/etc/certs/fullchain.pem \
    -v /etc/letsencrypt/live/<domain>/privkey.pem:/opt/emqx/etc/certs/privkey.pem \
    -v /root/emqx/etc/acl.conf:/opt/emqx/etc/acl.conf \
    -v /root/emqx/log:/opt/emqx/log \
    emqx/emqx:5.0.11

The broker deployment itself does need changes, as shown above. For persistence to work, the Rails tasks in mqtt_subscriber.rake must connect with clean_session=false. This is configured through environment variables, which were fixed in 3498dec.

Regarding SSL:

SSL is working fine, at least on the dashboard, although there were some (now solved) issues with the user permissions of the certificate files in the Docker volume. In principle, the WSS and SSL listeners should work as well.

The issue was solved by the following steps (check here and here):
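Environment variables always arrive as strings, so a value such as `'false'` is truthy in Ruby unless it is parsed explicitly. A minimal sketch of turning them into connection options follows; the variable names (`MQTT_HOST`, `MQTT_CLIENT_ID`, `MQTT_CLEAN_SESSION`) are assumptions, not necessarily the ones introduced in 3498dec. The resulting hash uses the `host`, `client_id` and `clean_session` options of the `MQTT::Client.connect` call quoted earlier in this issue. Because a persistent session is identified by its client ID, the sketch refuses to disable the clean session without a fixed ID.

```ruby
# Hypothetical variable names; adapt to the ones mqtt_subscriber.rake reads.
TRUTHY = %w[1 true yes on].freeze

# Parses a boolean environment variable, falling back to default when unset.
def env_flag(env, name, default:)
  raw = env[name].to_s.strip.downcase
  return default if raw.empty?
  TRUTHY.include?(raw)
end

def mqtt_connection_options(env = ENV)
  # Default to a persistent session, as required for broker-side buffering.
  clean_session = env_flag(env, 'MQTT_CLEAN_SESSION', default: false)
  client_id = env['MQTT_CLIENT_ID'].to_s.strip
  if !clean_session && client_id.empty?
    raise ArgumentError, 'MQTT_CLIENT_ID must be set when MQTT_CLEAN_SESSION is false'
  end

  options = { host: env.fetch('MQTT_HOST', 'localhost'), clean_session: clean_session }
  options[:client_id] = client_id unless client_id.empty?
  options
end

# e.g. MQTT::Client.connect(mqtt_connection_options) do |client| ... end
```

With shared subscriptions, each subscriber instance would need its own client ID, since two connections with the same ID take over each other's session.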

  1. Adding an emqx user on the host machine (otherwise the user only exists inside the container):
useradd emqx
  2. Changing the permissions and group ownership of the certificates and the archive:
chmod 0755 /etc/letsencrypt/archive
chmod 0755 /etc/letsencrypt/live
chgrp emqx /etc/letsencrypt/live/<domain>/privkey.pem
chgrp emqx /etc/letsencrypt/archive/<domain>/privkey1.pem
chmod 0640 /etc/letsencrypt/live/<domain>/privkey.pem
chmod 0640 /etc/letsencrypt/archive/<domain>/privkey1.pem

chown emqx:emqx /etc/letsencrypt/live/<domain>/*.pem
  3. Mounting each certificate file individually as a Docker volume, so that Docker resolves the symlinks to ../../archive/ created by certbot.

TODO
Check that the automatic certificate renewal doesn't break anything in 90 days.
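For this TODO, a small check could compare the certificate on disk with the one the broker actually serves, since a renewed file does not necessarily mean the running broker has picked it up. A minimal sketch using Ruby's standard `openssl` and `socket` libraries follows; the helper names are hypothetical, and the path, host and port in the usage comments are placeholders.

```ruby
require 'openssl'
require 'socket'

# Whole days until the first certificate in a PEM file expires
# (for fullchain.pem, the first certificate is the server's own).
def days_until_expiry(pem_path, now: Time.now)
  cert = OpenSSL::X509::Certificate.new(File.read(pem_path))
  ((cert.not_after - now) / 86_400).floor
end

# Expiry date of the certificate the server presents on host:port.
# Verification is disabled on purpose: we only want to read the date,
# even from an expired certificate.
def served_certificate_not_after(host, port)
  tcp = TCPSocket.new(host, port)
  context = OpenSSL::SSL::SSLContext.new
  context.verify_mode = OpenSSL::SSL::VERIFY_NONE
  ssl = OpenSSL::SSL::SSLSocket.new(tcp, context)
  ssl.hostname = host # send SNI
  ssl.connect
  ssl.peer_cert.not_after
ensure
  ssl&.close
  tcp&.close
end

# Usage (placeholders):
# days_until_expiry('/etc/letsencrypt/live/<domain>/fullchain.pem')
# served_certificate_not_after('<domain>', <ssl port>)
```

certbot renews by default when fewer than 30 days remain, so a file with fewer days left than that suggests the renewal did not run, while a served expiry date older than the file's suggests EMQX is still using the previous certificate.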

@oscgonfer
Contributor

oscgonfer commented Sep 15, 2023

Comments on the renewal:

  1. certbot needs to use the dns-01 challenge instead of http-01 because of some internal iptables rules. Check the reference here and the renewal configuration below:
# renew_before_expiry = 30 days
version = 0.40.0
archive_dir = /etc/letsencrypt/archive/DOMAIN
cert = /etc/letsencrypt/live/DOMAIN/cert.pem
privkey = /etc/letsencrypt/live/DOMAIN/privkey.pem
chain = /etc/letsencrypt/live/DOMAIN/chain.pem
fullchain = /etc/letsencrypt/live/DOMAIN/fullchain.pem

# Options used in the renewal process
[renewalparams]
account = xxxxxxx
pref_challs = dns-01,
authenticator = manual
manual_auth_hook = /etc/letsencrypt/acme-dns-auth.py
server = https://acme-v02.api.letsencrypt.org/directory
manual_public_ip_logging_ok = True
  2. We have a post-renewal hook in /etc/letsencrypt/renewal-hooks/post:
#!/bin/bash
DOMAIN=<DOMAIN>
EMQX_USER='emqx'
user_exists(){ id "$1" &>/dev/null; } # silent, it just sets the exit code
if user_exists "$EMQX_USER"; then  # test the function's exit code directly
    echo "$EMQX_USER exists. Skipping"
else
    echo 'user not found, creating it' >&2  # error messages should go to stderr
    useradd "$EMQX_USER"
fi

echo 'chmods...'
chmod 0755 /etc/letsencrypt/live
chmod 0755 /etc/letsencrypt/archive
chgrp "$EMQX_USER" "/etc/letsencrypt/live/$DOMAIN/privkey.pem"
chgrp "$EMQX_USER" /etc/letsencrypt/archive/"$DOMAIN"/privkey*.pem
chmod 0640 "/etc/letsencrypt/live/$DOMAIN/privkey.pem"
chmod 0640 /etc/letsencrypt/archive/"$DOMAIN"/privkey*.pem

echo 'chown to EMQX...'
chown "$EMQX_USER:$EMQX_USER" /etc/letsencrypt/live/"$DOMAIN"/*.pem
echo 'Done'
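To confirm after a renewal that the hook left the private key in the expected state, a quick check could look like the sketch below. It uses only Ruby's standard library; `key_permission_problems` is a hypothetical helper, and its expected mode and group simply mirror the `chmod 0640` and `chgrp` commands in the hook. `File.stat` follows symlinks, so pointing it at the `live/` path checks the current file in `archive/`.

```ruby
require 'etc'

# Returns a list of human-readable problems; an empty list means the key
# has the mode and group the post-renewal hook is supposed to set.
def key_permission_problems(path, group: 'emqx', mode: 0o640)
  stat = File.stat(path) # follows the live/ symlink to archive/
  problems = []
  actual_mode = stat.mode & 0o777
  unless actual_mode == mode
    problems << format('mode is %o, expected %o', actual_mode, mode)
  end
  # Accept either a group name or a numeric group id.
  expected_gid = group.is_a?(Integer) ? group : Etc.getgrnam(group).gid
  unless stat.gid == expected_gid
    problems << "group id is #{stat.gid}, expected #{expected_gid}"
  end
  problems
end

# Usage (placeholder path):
# puts key_permission_problems('/etc/letsencrypt/live/<domain>/privkey.pem')
```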

@oscgonfer
Contributor

oscgonfer commented Jan 4, 2024

Documented in docs/mqtt.md in #293. The comments above are not up to date.

@oscgonfer
Contributor

A good place to see how we are handling the MQTT messages is the Slow Subscriptions view of the EMQX broker dashboard: https://mqtt.smartcitizen.me:18084/#/slow-sub

The broker queues and buffers the excess of messages that have not yet been received. Notifications can be enabled via MQTT, so that we can trigger an email or similar.
