Uptrack

April 9, 2026

How we built unlimited self-hosted email on the BEAM

A monitoring tool that can't send alert emails is a dashboard with pretty graphs. Useful. Not enough.

Every email service provider has a free tier. Brevo gives you 300 emails a day. Resend gives you 3,000 a month. That sounds like a lot until your users' production servers start going down simultaneously and you're burning through your monthly quota in an afternoon.

So we self-hosted. What followed was a series of small problems, each with a surprisingly elegant solution — most of them already built into the Erlang runtime we run on.

1. The 3,000 emails/month wall

Free ESP tiers exist to get you hooked, then charge you. 3,000 emails/month sounds generous until you do the math: 1,000 monitors, each checked every minute, with a 0.1% error rate means 1,440 alerts per day before you've even factored in resolution notifications, test alerts, or reminders for ongoing incidents.
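The arithmetic, spelled out:

```elixir
monitors = 1_000
checks_per_day = 24 * 60   # one check per minute per monitor
error_rate = 0.001         # 0.1% of checks fail

alerts_per_day = trunc(monitors * checks_per_day * error_rate)
# 1_000 * 1_440 * 0.001 = 1_440 alerts/day
```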

Paid tiers start at $9/month for 20,000 emails. Not ruinous, but it adds up — and more importantly, it's a dependency on an external service that can go down, change pricing, or decide your sending patterns look like spam.

Solution: Stalwart on our own servers

Stalwart is a mail server written in Rust with a NixOS module (services.stalwart-mail). It handles DKIM signing, SPF verification, DMARC, DANE, and MTA-STS out of the box. It uses ~50MB of RAM. We already had two servers in Nuremberg — we deployed Stalwart on both. Unlimited outbound email, zero marginal cost.

The real work is DNS: SPF records listing your server IPs, DKIM TXT record with your public key, DMARC policy, and a PTR (reverse DNS) record pointing your IP back to your mail domain. Without PTR, Gmail will reject you. With it, you're a legitimate sender.
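As a sketch of those records, using a hypothetical domain and a TEST-NET IP (the DKIM selector and all values are illustrative, not ours):

```
; Forward zone for example.com, server IP 203.0.113.10
example.com.                 TXT  "v=spf1 ip4:203.0.113.10 -all"
stw._domainkey.example.com.  TXT  "v=DKIM1; k=rsa; p=<public key>"
_dmarc.example.com.          TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"

; PTR lives in the reverse zone, usually set via your hosting provider:
10.113.0.203.in-addr.arpa.   PTR  mail.example.com.
```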

2. 50-200ms per email just for connecting

The standard way to send email in Elixir is Swoosh's SMTP adapter, which uses gen_smtp under the hood. By default it opens a new TCP connection for every single email: TCP handshake, TLS negotiation, SMTP greeting, AUTH, then your message, then close. On localhost that's still 20-50ms of overhead per email, before you've transferred a single byte of content.

That's fine at one email per second. It's a problem when a datacenter outage triggers 500 alerts simultaneously.

Solution: gen_smtp's open/deliver/close API

gen_smtp has a lower-level API that most people don't use. :gen_smtp_client.open/1 opens a connection and returns an opaque socket. :gen_smtp_client.deliver/2 sends an email over that socket and returns it ready for reuse. :gen_smtp_client.close/1 ends the session. One connection, many emails. The handshake overhead pays for itself after the first message.
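A minimal sketch of that session (the relay options and the emails variable are illustrative; check gen_smtp's docs for exact return shapes on your version):

```elixir
# One connection, many deliveries.
{:ok, socket} = :gen_smtp_client.open(relay: "localhost", port: 25)

# gen_smtp's email triple is {from, [to], body}, all binaries.
Enum.each(emails, fn {from, to, body} ->
  :gen_smtp_client.deliver(socket, {from, to, body})
end)

:gen_smtp_client.close(socket)
```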

3. A fixed connection pool is a static bottleneck

The obvious next step is a connection pool. Keep N connections open, send emails through them, return connections when done. Libraries like NimblePool make this easy.

But a fixed pool has two failure modes. Too small: during a burst, workers queue up waiting for a checkout. Too large: you're holding open connections to your mail server all day for average load that needs two. And the pool manager itself is a single process — at high checkout rates, it becomes a serialization point.

Solution: A dynamic process fleet

BEAM processes are ~2KB and spawn in about 2 microseconds. Instead of a fixed pool, we have a fleet: a set of worker processes, each holding one persistent gen_smtp connection. The fleet starts with 2 warm workers. When all workers are busy and a new email arrives, a new worker spawns on demand. When a worker has been idle for 60 seconds, it terminates itself. No static ceiling. No wasted connections overnight. The fleet breathes with the load.

This is the thing about the BEAM: "spawn a process per thing" isn't a joke. Processes are the unit of concurrency, not threads. You'd never say "we have a fixed pool of 10 threads to handle all HTTP requests" — the same logic applies here.
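The worker lifecycle can be sketched with a plain GenServer idle timeout (module and function names are ours, and the SMTP socket is stubbed out so the lifecycle logic stands alone):

```elixir
defmodule Uptrack.MailWorker do
  # Sketch of one fleet worker. A real worker would hold a
  # persistent :gen_smtp_client socket in its state.
  use GenServer

  @idle_ms 60_000

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, Keyword.get(opts, :idle_ms, @idle_ms))
  end

  def deliver(pid, email), do: GenServer.call(pid, {:deliver, email})

  @impl true
  def init(idle_ms) do
    # Real worker: {:ok, socket} = :gen_smtp_client.open(...)
    {:ok, %{idle_ms: idle_ms}, idle_ms}
  end

  @impl true
  def handle_call({:deliver, _email}, _from, state) do
    # Real worker: :gen_smtp_client.deliver(socket, email)
    # Returning a timeout resets the idle timer.
    {:reply, :ok, state, state.idle_ms}
  end

  @impl true
  def handle_info(:timeout, state) do
    # Idle for the full window: close the connection, terminate.
    {:stop, :normal, state}
  end
end
```

Spawning on demand is then just `Uptrack.MailWorker.start_link/1` when every existing worker is busy; the idle timeout handles the shrink side.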

4. What stops it from overwhelming Stalwart?

If the fleet spawns on demand with no limit, a sufficiently large burst could spawn thousands of connections and bring Stalwart to its knees. You'd be DDoSing your own mail server.

A hardcoded max_workers solves this — but then you're back to picking a number, which is just moving the static bottleneck from the pool size to the cap. How do you know when the cap is right?

Solution: Let Stalwart say no, then back off

Stalwart already has a connection limit (8,192 by default). When it's overwhelmed, it refuses connections. That refusal is the back-pressure signal — no hardcoded number needed. On top of this, we use fuse, an Erlang circuit breaker. After 5 connection failures in 10 seconds, the circuit opens and spawning stops. Workers drain the remaining queue through surviving connections. After 5 seconds, one probe attempt goes through. If it succeeds, the circuit closes and normal operation resumes. The server tells you when it's overwhelmed; you listen.
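A sketch of the fuse wiring under assumed names (the fuse name is ours, and the connection code is injected as a function so the module stands alone):

```elixir
defmodule Uptrack.SmtpBreaker do
  @fuse :stalwart_connect

  # {:standard, 5, 10_000}: blow after 5 melts within 10s.
  # {:reset, 5_000}: allow one probe attempt after 5s.
  def install do
    :fuse.install(@fuse, {{:standard, 5, 10_000}, {:reset, 5_000}})
  end

  def connect(connect_fun) do
    case :fuse.ask(@fuse, :sync) do
      :ok ->
        case connect_fun.() do
          {:ok, socket} ->
            {:ok, socket}

          {:error, _} = err ->
            # Record one failure; five of these in 10s open the circuit.
            :fuse.melt(@fuse)
            err
        end

      :blown ->
        # Circuit open: stop spawning, drain via surviving workers.
        {:error, :circuit_open}
    end
  end
end
```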

5. What if the mail server on one node dies?

We run two API nodes. Each has its own Stalwart instance. If nbg1's Stalwart crashes, every worker on nbg1 loses its connection. Our Oban job queue would eventually retry the failed jobs on nbg2 — but "eventually" is up to 15 seconds with default backoff. For a monitoring alert, that's not acceptable.

Solution: Fallback host + :pg cross-node routing

Two mechanisms work together. First, each worker tries to connect to localhost Stalwart on init — if that fails, it immediately connects to the other node's Stalwart via its Tailscale IP. Local failure, instant fallback, no delay. Second, workers register themselves in an OTP process group (:pg) when idle. Our dispatcher looks for idle workers across the entire cluster before spawning a new one. If nbg1 is under load and nbg2 has idle workers, the email routes there via Erlang message passing — microseconds, not milliseconds. No Horde, no external dependency. :pg is built into OTP 23+.
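The cluster-wide lookup side can be sketched with :pg alone (group name, module name, and the spawn callback are ours; a real worker would also leave the group while busy):

```elixir
defmodule Uptrack.MailDispatcher do
  # :pg group that idle workers join; membership is cluster-wide.
  @group :idle_mail_workers

  def dispatch(email, spawn_fun) do
    case :pg.get_members(@group) do
      # An idle worker exists somewhere in the cluster:
      # plain Erlang message passing reaches it, local or remote.
      [worker | _] -> send(worker, {:deliver, email})

      # No idle worker anywhere: spawn a fresh local one.
      [] -> spawn_fun.(email)
    end
  end
end
```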

6. A digest batch blocking an incident alert

At 9 AM, the daily digest job fires. It queues 800 emails. Three minutes later, a production service goes down. The incident alert joins the back of the queue. Your users don't get notified for another four minutes while digest emails drain.

Solution: Separate Oban queues by priority

Three queues instead of one. email_critical (concurrency 50) handles incident alerts and resolutions — it runs in parallel with everything else and is never blocked. email_digest (concurrency 10) handles batched notifications. email_system (concurrency 5) handles verifications and welcome emails. Incident alerts never wait behind digest emails. This is a 10-line Oban config change.
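As a sketch (the OTP app name and repo module are assumptions), the queue split really is just config:

```elixir
# config/config.exs — queue name => max concurrent jobs
import Config

config :uptrack, Oban,
  repo: Uptrack.Repo,
  queues: [
    email_critical: 50,  # incident alerts and resolutions
    email_digest: 10,    # batched daily notifications
    email_system: 5      # verifications, welcome emails
  ]
```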

What we ended up with

Each small problem had a small solution. None of them required a new infrastructure component. Most of them were already available in OTP or gen_smtp — we just had to use the right API.

Problem → Solution:

ESP email limits → Stalwart on our own NixOS servers
New TCP connection per email → gen_smtp open/deliver/close session reuse
Fixed pool bottleneck → dynamic process fleet (BEAM processes are free)
Uncapped fleet overwhelming Stalwart → fuse circuit breaker, letting the server say no
Node-level Stalwart failure → fallback host + :pg cross-node worker routing
Digest batch blocking incident alerts → three Oban queues by priority

The result: unlimited transactional email with no monthly cost, instant failover when a Stalwart instance dies, dynamic throughput that scales with burst load, and incident alerts that are never delayed by lower-priority email.

The BEAM keeps showing up. Every time we hit a concurrency problem, the answer is "spawn a process." Every time we need distributed coordination, the answer is "use a built-in OTP primitive." The solutions aren't clever. They're just what the runtime was designed for.

Alerts that actually reach you.

50 free monitors — 10 at 30-second checks, 40 at 1-minute. No credit card required.

Start Monitoring Free