How we built multi-region check consensus on the BEAM
Checks from 3 continents must agree before we wake you up. Zero external dependencies — just OTP primitives that Ericsson battle-tested for 30 years.
April 5, 2026 · 12 min read
The 3am problem
UptimeRobot checks your site from one location. A CDN edge goes down in Frankfurt. Your server in Virginia is fine. Your users in Tokyo see no issues. But the single check from Frankfurt fails, and your phone buzzes at 3am.
This is the false alert problem. Single-region monitoring can't distinguish between "the internet is broken between point A and point B" and "your server is actually down."
The fix sounds simple: check from multiple regions and alert only if most agree. But the implementation is surprisingly hard. You need to coordinate checks across continents, collect results in real time, compute consensus, and avoid duplicate alerts, all without adding latency or single points of failure.
Here's how we did it with zero external dependencies using Erlang/OTP primitives.
The architecture: one process per monitor
Uptrack follows the Discord/WhatsApp pattern: one BEAM process per long-lived entity. Each monitor gets its own GenServer that self-schedules checks via Process.send_after.
For multi-region, the same monitor runs on every node. Three continents, three processes, one monitor:
┌──────────────────────────────────────────────────────────────┐
│               Erlang Cluster (Tailscale mesh)                │
│                                                              │
│  EU (Germany)         Asia (India)         US (Virginia)     │
│ ┌────────────────┐   ┌────────────────┐   ┌────────────────┐ │
│ │ MonitorProcess │   │ MonitorProcess │   │ MonitorProcess │ │
│ │ id: "abc"      │   │ id: "abc"      │   │ id: "abc"      │ │
│ │ Gun → target   │   │ Gun → target   │   │ Gun → target   │ │
│ │ pg group       │◄─►│ pg group       │◄─►│ pg group       │ │
│ └────────────────┘   └────────────────┘   └────────────────┘ │
│                                                              │
│  50K monitors         50K monitors         50K monitors      │
│  each self-           each self-           each self-        │
│  scheduling           scheduling           scheduling        │
└──────────────────────────────────────────────────────────────┘
Each process holds a persistent Gun HTTP connection to the target. The TLS handshake happens once at startup, not on every check. A 30-second check cycle takes ~50ms of actual HTTP time.
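The self-scheduling loop can be sketched as follows. This is a minimal illustration, not Uptrack's actual MonitorProcess: the module name `SelfSchedulingMonitor`, the `:interval_ms` and `:check_fun` options, and the `:run_check` message are assumptions made for the example, and the Gun connection is replaced by an arbitrary function.

```elixir
defmodule SelfSchedulingMonitor do
  use GenServer

  # Hypothetical options: :interval_ms (delay between checks) and
  # :check_fun (a zero-arity function standing in for the HTTP check).
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      interval_ms: Keyword.fetch!(opts, :interval_ms),
      check_fun: Keyword.fetch!(opts, :check_fun),
      checks_run: 0
    }

    # Run the first check right away; later checks are scheduled
    # by the process itself after each run.
    schedule_check(0)
    {:ok, state}
  end

  @impl true
  def handle_info(:run_check, state) do
    state.check_fun.()
    schedule_check(state.interval_ms)
    {:noreply, %{state | checks_run: state.checks_run + 1}}
  end

  # The process sends itself a message after delay_ms milliseconds.
  defp schedule_check(delay_ms) do
    Process.send_after(self(), :run_check, delay_ms)
  end
end

{:ok, _pid} =
  SelfSchedulingMonitor.start_link(
    interval_ms: 30_000,
    check_fun: fn -> IO.puts("checking") end
  )
```

Because each process owns its own timer, there is no central scheduler to become a bottleneck; the trade-off is that check times drift slightly from one node to the next.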
How pg groups connect the continents
pg is an OTP module that's been in Erlang since OTP 23. It manages process groups across distributed nodes. Tested at 5,000 nodes and 150,000 processes by the OTP team.
When a MonitorProcess starts, it joins a pg group named after its monitor ID:
# On init, each MonitorProcess joins the group
:pg.join(:monitor_checks, monitor_id, self())

# After checking, broadcast result to all group members
:pg.get_members(:monitor_checks, monitor_id)
|> Enum.each(&send(&1, {:region_result, @region, result}))

That's it. No message broker, no Redis pub/sub, no database polling. Erlang distribution delivers the messages directly between nodes, so the only delay is network latency: tens to a few hundred milliseconds between continents. When a node dies, pg automatically removes its processes from all groups.
The elegance: each MonitorProcess doesn't know or care how many regions exist. It broadcasts to the group and receives from the group. Add a 4th region, and it just works — the new process joins the pg group, and everyone sees it.
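To try the join-and-broadcast pattern on a single node, a sketch like the one below works in a plain Elixir script. The three spawned processes are stand-ins for the regional MonitorProcesses, and the region atoms and message shapes are assumptions for the demo. One detail the snippet above leaves out: a pg scope has to be started (here with `:pg.start_link/1`) before any process can join a group in it.

```elixir
# Start the pg scope once per node (in a real app, under a supervisor).
{:ok, _scope} = :pg.start_link(:monitor_checks)

parent = self()

# Three local processes stand in for the three regional MonitorProcesses.
pids =
  for region <- [:eu, :asia, :us] do
    spawn(fn ->
      :ok = :pg.join(:monitor_checks, "abc", self())
      send(parent, {:joined, region})

      receive do
        {:region_result, from, result} ->
          send(parent, {:received, region, from, result})
      end
    end)
  end

# Wait until every stand-in has joined the group.
Enum.each(pids, fn _pid ->
  receive do
    {:joined, _region} -> :ok
  end
end)

# Broadcast one result to the whole group, as a MonitorProcess would.
:pg.get_members(:monitor_checks, "abc")
|> Enum.each(&send(&1, {:region_result, :eu, %{status: "up"}}))

# Collect the acknowledgements from all three stand-ins.
received =
  for _pid <- pids do
    receive do
      {:received, region, :eu, %{status: "up"}} -> region
    end
  end

IO.inspect(Enum.sort(received))
```

On a real cluster the same calls apply unchanged; pg keeps group membership in sync between connected nodes that run the same scope.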
The consensus timeline: sub-second across 3 continents
Here's what happens during a single check cycle for a monitor where the site is down in Asia but up everywhere else:
T=0.0s EU checks target ──→ "up" ──→ pg broadcast to Asia, US
T=0.2s Asia checks target ──→ "down" ──→ pg broadcast to EU, US
T=0.5s US checks target ──→ "up" ──→ pg broadcast to EU, Asia
T=0.5s EU has: EU=up, Asia=down, US=up → 2/3 up → ✅ No alert
T=0.5s Asia has: EU=up, Asia=down, US=up → 2/3 up → (silent)
T=0.5s US has: EU=up, Asia=down, US=up → 2/3 up → (silent)
Total consensus time: ~0.5 seconds
CDN blip in Asia? Your phone stays silent.
Now compare with a real outage:
T=0.0s EU checks ──→ "down" ──→ broadcasts
T=0.3s Asia checks ──→ "down" ──→ broadcasts
T=0.5s US checks ──→ "down" ──→ broadcasts
T=0.5s All 3 nodes: 3/3 down → consensus = DOWN
Home node (EU) increments consecutive_failures
After 3 consecutive → 🚨 ALERT
Total: 30s interval × 3 confirmations + 0.5s consensus
= up to ~91 seconds from outage to alert
= confirmed from 3 continents
A false alert would require a majority of regions to fail on three consecutive checks. As long as regional failures are independent of each other, the odds of that happening by chance are vanishingly small.
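A rough back-of-the-envelope calculation shows why. The sketch below assumes each region has the same probability `p` of reporting a spurious failure and that regions fail independently; both are simplifying assumptions, and the 1% figure is purely illustrative, not a measured value.

```elixir
defmodule FalseAlertOdds do
  # Probability that at least 2 of 3 independent regions report a
  # spurious failure, given a per-region false-failure probability p.
  def majority_of_three(p) do
    3 * p * p * (1 - p) + p * p * p
  end

  # Probability that a spurious majority happens on n consecutive checks.
  def consecutive(p, n) do
    :math.pow(majority_of_three(p), n)
  end
end

# Illustrative assumption: each region gives a false "down" 1% of the time.
p = 0.01
IO.inspect(FalseAlertOdds.majority_of_three(p), label: "one check")
IO.inspect(FalseAlertOdds.consecutive(p, 3), label: "three in a row")
```

With these inputs a single spurious majority occurs on roughly 3 checks in 10,000, and three in a row on the order of 1 in 10^10. Correlated failures (a shared upstream provider, for example) would make the real figure higher.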
The home node: who gets to press the button?
All 3 nodes compute the same consensus. But only one should trigger the alert — otherwise you'd get 3 Slack messages for every incident.
We use deterministic hash-based assignment. The "home node" for each monitor is computed from its ID:
defp home_node?(monitor_id) do
  nodes = [node() | Node.list()] |> Enum.sort()
  hash = :erlang.phash2(monitor_id, length(nodes))
  Enum.at(nodes, hash) == node()
end

defp maybe_trigger_alert(state) do
  if home_node?(state.monitor_id) do
    # Only this node sends alerts
    do_alert(state)
  else
    # Other nodes track state but stay silent
    state
  end
end

The hash is deterministic: the same monitor ID always maps to the same node. Monitors are distributed evenly: 900 monitors across 3 nodes = ~300 "home" per node. If a node dies, the hash redistributes to survivors.
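The distribution is easy to check outside a cluster by passing the node list in explicitly. In this sketch the `HomeNode` module, the node names, and the `monitor-N` IDs are made up for illustration; only the `:erlang.phash2/2` selection mirrors the function above.

```elixir
defmodule HomeNode do
  # Same selection rule as home_node?/1, but with the node list passed in
  # so it can be exercised without a running cluster.
  def pick(monitor_id, nodes) do
    sorted = Enum.sort(nodes)
    Enum.at(sorted, :erlang.phash2(monitor_id, length(sorted)))
  end
end

# Hypothetical node names for the three regions.
nodes = [:"uptrack@eu", :"uptrack@asia", :"uptrack@us"]
ids = Enum.map(1..900, &"monitor-#{&1}")

# How many monitors call each node "home" with all three nodes up.
counts = ids |> Enum.map(&HomeNode.pick(&1, nodes)) |> Enum.frequencies()
IO.inspect(counts, label: "3 nodes")

# The same monitors after the Asia node disappears.
survivors = nodes -- [:"uptrack@asia"]
after_loss = ids |> Enum.map(&HomeNode.pick(&1, survivors)) |> Enum.frequencies()
IO.inspect(after_loss, label: "2 nodes")
```

One design consequence worth knowing: because the hash is taken modulo the number of nodes, a change in cluster size can move the home of monitors that were not on the lost node as well. That is harmless here, since every node tracks the same consensus state and only the right to alert moves.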
Edge cases that would break a naive implementation
Slow region
Asia's check takes 8 seconds (congested submarine cable). EU and US finished in 200ms.
Solution: 10-second timeout. If a region doesn't respond, it's excluded from consensus — not counted as up OR down. 2/2 is still valid consensus.
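The exclusion rule can be expressed as a small pure function. This is a sketch under assumptions: the `Consensus` module, the `:timeout` marker for a region that did not answer in time, and the `:no_quorum` return value are all invented for the example. Ties count as "up", matching the strict-majority rule used in the consensus code later in this post.

```elixir
defmodule Consensus do
  # results maps region => "up" | "down" | :timeout.
  # Timed-out regions are dropped before counting, so they vote
  # neither up nor down.
  def decide(results) do
    votes = results |> Map.values() |> Enum.reject(&(&1 == :timeout))
    down = Enum.count(votes, &(&1 == "down"))

    cond do
      votes == [] -> :no_quorum
      down > length(votes) / 2 -> "down"
      true -> "up"
    end
  end
end

# Asia timed out: EU and US decide 2/2.
IO.inspect(Consensus.decide(%{eu: "up", asia: :timeout, us: "up"}))
IO.inspect(Consensus.decide(%{eu: "down", asia: :timeout, us: "down"}))
```

Treating a timeout as "no vote" rather than "down" keeps a congested link from tipping the result toward an alert.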
Node crash
The India node loses power.
pg automatically removes dead processes from groups. EU and US continue with 2-node consensus. When India comes back, its processes rejoin the pg groups on startup. No manual intervention.
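The cleanup behaviour can be observed on a single node by killing a group member. The sketch below simulates a process crash rather than a full node failure, and the `PgCleanupDemo` module and `:cleanup_demo` scope name are assumptions for the demo. The polling helper exists because pg removes dead members asynchronously, shortly after the process exits.

```elixir
defmodule PgCleanupDemo do
  # Poll until the group is empty, giving up after `attempts` tries.
  def wait_until_empty(scope, group, attempts \\ 50)
  def wait_until_empty(_scope, _group, 0), do: :timeout

  def wait_until_empty(scope, group, attempts) do
    case :pg.get_members(scope, group) do
      [] ->
        :ok

      _still_there ->
        Process.sleep(10)
        wait_until_empty(scope, group, attempts - 1)
    end
  end
end

{:ok, _scope} = :pg.start_link(:cleanup_demo)

# A stand-in for a MonitorProcess that simply waits.
pid = spawn(fn -> Process.sleep(:infinity) end)
:ok = :pg.join(:cleanup_demo, "abc", pid)
[^pid] = :pg.get_members(:cleanup_demo, "abc")

# Simulate a crash; no explicit :pg.leave/3 call is made.
Process.exit(pid, :kill)
result = PgCleanupDemo.wait_until_empty(:cleanup_demo, "abc")
IO.inspect(result)
```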
Network partition (netsplit)
The submarine cable between EU and Asia is cut. EU can still talk to US, but Asia is isolated.
EU ←→ US:  connected (2 nodes visible)
   ✗
Asia:      isolated (1 node visible)

EU consensus:    2/2 results → works normally
Asia consensus:  1/1 result → local only, skip consensus

Partition heals → pg reconnects → 3 nodes again
Each side of the partition operates independently. No split-brain alerts because consensus requires a majority of visible nodes.
Duplicate alerts
All 3 nodes compute "DOWN". Without guard rails, all 3 would fire alerts. The home_node? function ensures exactly one node acts on the consensus.
Why not just use the database?
The obvious approach: each region writes its check result to PostgreSQL. A periodic query aggregates results and computes consensus.
We tried this. Three problems:
1. Staleness
Each region checks at slightly different times (jitter). When EU reads Asia's result, it might get the previous check from 30 seconds ago — not the current one. Average staleness: ~15 seconds. That defeats the purpose of multi-region consensus.
2. Advisory lock issues
A singleton aggregator needs leader election. PostgreSQL advisory locks are the standard approach, but they don't work reliably with PgBouncer in transaction pooling mode — connections get swapped, locks get released unexpectedly.
3. The irony
We moved checks OUT of the database (GenServer-per-monitor) to remove the Oban bottleneck and get 100K+ concurrent checks. Putting consensus BACK in the database would reintroduce the exact bottleneck we eliminated.
pg messages solve all three: zero staleness (real-time delivery), no leader election needed (each process computes independently), and no database in the hot path.
Benchmark: 100K checks on a $10/mo server
We benchmarked on a single Netcup RS 1000 (4 AMD EPYC cores, 8GB RAM, $10/mo):
| Concurrent | Checks/sec | P50 | Failures |
|---|---|---|---|
| 10,000 | 2,602 | 1.6s | 0 |
| 50,000 | 2,782 | 8.2s | 0 |
| 100,000 | 2,602 | 16.9s | 0 |
| 110,000 | 2,444 | 19.1s | 0 |
110,000 concurrent HTTP checks, zero failures, on hardware that costs less than a Netflix subscription. The BEAM was designed for this — Ericsson used the same architecture to handle millions of concurrent phone calls.
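These numbers are per node. Every node runs every monitor, so adding regions adds vantage points rather than capacity: the cluster tops out at roughly the single-node figure of ~110K monitors. At 50 monitors per free user, that's about 2,200 users before the nodes need more headroom.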
The code: surprisingly little
The consensus logic in MonitorProcess is ~40 lines:
defmodule Uptrack.Monitoring.MonitorProcess do
  use GenServer

  # Excerpt: init/1, check scheduling, the @region attribute, and the
  # evaluate_result/record_result/maybe_trigger_alert helpers are omitted.

  # After the local check completes, broadcast to the pg group
  def handle_info({:check_result, result}, state) do
    # Send our result to the other regions. get_members/2 includes this
    # process, so skip self() to avoid receiving our own result a second
    # time after the cycle has been reset.
    :pg.get_members(:monitor_checks, state.monitor_id)
    |> Enum.reject(&(&1 == self()))
    |> Enum.each(&send(&1, {:region_result, @region, result}))

    state = %{state |
      region_results: Map.put(state.region_results, @region, result),
      checking: false
    }

    maybe_evaluate_consensus(state)
  end

  # Receive a result from another region
  def handle_info({:region_result, region, result}, state) do
    state = %{state |
      region_results: Map.put(state.region_results, region, result)
    }

    maybe_evaluate_consensus(state)
  end

  defp maybe_evaluate_consensus(state) do
    results = state.region_results
    total = map_size(results)
    expected = length(:pg.get_members(:monitor_checks, state.monitor_id))

    # Wait until every region currently in the group has reported, so a
    # late result cannot leak into the next cycle. The 10-second timeout
    # that drops a slow region is not shown in this excerpt.
    if total >= expected do
      # A strict majority of "down" results means DOWN; ties count as "up"
      down_count = Enum.count(results, fn {_, r} -> r.status == "down" end)
      consensus = if down_count > total / 2, do: "down", else: "up"

      state =
        %{state |
          last_check: %{status: consensus},
          region_results: %{}  # reset for next cycle
        }
        |> evaluate_result()
        |> record_result()
        |> maybe_trigger_alert()

      {:noreply, state}
    else
      # Still waiting for more regions
      {:noreply, state}
    end
  end
end

That's the core of multi-region consensus. No Kafka, no Redis, no external coordinator. Just processes sending messages to each other, which is exactly what the BEAM was built to do.
What we learned
The BEAM's distribution primitives are underused. Most Elixir apps treat nodes as independent units behind a load balancer. But pg, :global, and Erlang distribution enable architectures that would require Kafka + Redis + ZooKeeper in other ecosystems.
Don't put coordination in the database. We tried DB-based consensus first. The staleness problem alone killed it. pg messages are real-time with zero staleness.
The Discord/WhatsApp pattern scales to monitoring. One process per entity with self-scheduling and in-memory state — it works for chat guilds, phone calls, and uptime monitors.
$23/mo can compete with $54/mo. UptimeRobot charges $54/mo for 30-second checks from multiple regions. We do it on three $8-10 VPS nodes with better consensus logic.
Try it yourself
50 free monitors — 10 at 30-second checks, 40 at 1-minute. Multi-region consensus on every plan. No credit card required.