Uptrack

Engineering

How we built multi-region check consensus on the BEAM

Checks from 3 continents must agree before we wake you up. Zero external dependencies — just OTP primitives that Ericsson battle-tested for 30 years.

April 5, 2026 · 12 min read

The 3am problem

UptimeRobot checks your site from one location. A CDN edge goes down in Frankfurt. Your server in Virginia is fine. Your users in Tokyo see no issues. But the single check from Frankfurt fails, and your phone buzzes at 3am.

This is the false alert problem. Single-region monitoring can't distinguish between "the internet is broken between point A and point B" and "your server is actually down."

The fix sounds simple: check from multiple regions, only alert if most agree. But the implementation is surprisingly hard. You need to coordinate checks across continents, collect results in real-time, compute consensus, and avoid duplicate alerts — all without adding latency or single points of failure.

Here's how we did it with zero external dependencies using Erlang/OTP primitives.

The architecture: one process per monitor

Uptrack follows the Discord/WhatsApp pattern: one BEAM process per long-lived entity. Each monitor gets its own GenServer that self-schedules checks via Process.send_after.
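
A minimal sketch of that self-scheduling loop (the module name, interval, and state shape here are illustrative, not the production code):

```elixir
defmodule Uptrack.Monitoring.CheckLoop do
  use GenServer

  # Illustrative interval; the real system supports 30s and 1m plans
  @interval :timer.seconds(30)

  def start_link(monitor_id),
    do: GenServer.start_link(__MODULE__, monitor_id)

  @impl true
  def init(monitor_id) do
    # Arm the first timer at startup; no external scheduler involved
    schedule_check()
    {:ok, %{monitor_id: monitor_id}}
  end

  @impl true
  def handle_info(:check, state) do
    # Run the HTTP check here, then re-arm the timer for the next cycle
    schedule_check()
    {:noreply, state}
  end

  defp schedule_check do
    Process.send_after(self(), :check, @interval)
  end
end
```

Each process owns its own timer, so 50K monitors on a node means 50K independent timers with no central tick loop.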

For multi-region, the same monitor runs on every node. Three continents, three processes, one monitor:

┌────────────────────────────────────────────────────────────┐
│              Erlang Cluster (Tailscale mesh)               │
│                                                            │
│  EU (Germany)       Asia (India)       US (Virginia)       │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │
│  │MonitorProcess│   │MonitorProcess│   │MonitorProcess│    │
│  │ id: "abc"    │   │ id: "abc"    │   │ id: "abc"    │    │
│  │ Gun → target │   │ Gun → target │   │ Gun → target │    │
│  │ pg group     │◄─►│ pg group     │◄─►│ pg group     │    │
│  └──────────────┘   └──────────────┘   └──────────────┘    │
│                                                            │
│  50K monitors       50K monitors       50K monitors        │
│  each self-         each self-         each self-          │
│  scheduling         scheduling         scheduling          │
└────────────────────────────────────────────────────────────┘

Each process holds a persistent Gun HTTP connection to the target. TLS handshake happens once at startup — not on every check. A 30-second check cycle takes ~50ms of actual HTTP time.

How pg groups connect the continents

pg is the process-group module that ships with Erlang/OTP; its current scope-based implementation arrived in OTP 23. It manages process groups across distributed nodes, and the OTP team has tested it at 5,000 nodes and 150,000 processes.

When a MonitorProcess starts, it joins a pg group named after its monitor ID:

# The :monitor_checks scope must be running first
# (e.g. :pg.start_link(:monitor_checks) in the supervision tree)

# On init, each MonitorProcess joins the group
:pg.join(:monitor_checks, monitor_id, self())

# After checking, broadcast result to all group members
:pg.get_members(:monitor_checks, monitor_id)
|> Enum.each(&send(&1, {:region_result, @region, result}))

That's it. No message broker, no Redis pub/sub, no database polling. Erlang distribution delivers messages across continents at raw network latency, in milliseconds rather than polling intervals. When a node dies, pg automatically removes its processes from all groups.

The elegance: each MonitorProcess doesn't know or care how many regions exist. It broadcasts to the group and receives from the group. Add a 4th region, and it just works — the new process joins the pg group, and everyone sees it.
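
You can exercise the same join/broadcast machinery on a single node in an IEx session; the scope and group APIs behave identically whether members are local or remote:

```elixir
# Start the scope (in production this lives in the supervision tree)
{:ok, _} = :pg.start_link(:monitor_checks)

# Join a group named after a monitor ID, then look ourselves up
:ok = :pg.join(:monitor_checks, "abc", self())
:pg.get_members(:monitor_checks, "abc")
# returns a list containing self()
```

Groups are created implicitly on first join and garbage-collected when empty, so there is nothing to provision per monitor.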

The consensus timeline: sub-second across 3 continents

Here's what happens during a single check cycle for a monitor where the site is down in Asia but up everywhere else:

T=0.0s  EU checks target ──→ "up" ──→ pg broadcast to Asia, US
T=0.2s  Asia checks target ──→ "down" ──→ pg broadcast to EU, US
T=0.5s  US checks target ──→ "up" ──→ pg broadcast to EU, Asia

T=0.5s  EU has: EU=up, Asia=down, US=up → 2/3 up → ✅ No alert
T=0.5s  Asia has: EU=up, Asia=down, US=up → 2/3 up → (silent)
T=0.5s  US has: EU=up, Asia=down, US=up → 2/3 up → (silent)

                    Total consensus time: ~0.5 seconds
              CDN blip in Asia? Your phone stays silent.

Now compare with a real outage:

T=0.0s  EU checks ──→ "down" ──→ broadcasts
T=0.3s  Asia checks ──→ "down" ──→ broadcasts
T=0.5s  US checks ──→ "down" ──→ broadcasts

T=0.5s  All 3 nodes: 3/3 down → consensus = DOWN
        Home node (EU) increments consecutive_failures
        After 3 consecutive → 🚨 ALERT

    Total: 30s interval × 3 confirmations + 0.5s consensus
         = ~91 seconds from outage to alert
         = confirmed from 3 continents

A false alert would require a majority of regions to fail multiple consecutive times against a healthy target. The probability of that happening by chance is negligible.

The home node: who gets to press the button?

All 3 nodes compute the same consensus. But only one should trigger the alert — otherwise you'd get 3 Slack messages for every incident.

We use deterministic hash-based assignment. The "home node" for each monitor is computed from its ID:

defp home_node?(monitor_id) do
  nodes = [node() | Node.list()] |> Enum.sort()
  hash = :erlang.phash2(monitor_id, length(nodes))
  Enum.at(nodes, hash) == node()
end

defp maybe_trigger_alert(state) do
  if home_node?(state.monitor_id) do
    # Only this node sends alerts
    do_alert(state)
  else
    # Other nodes track state but stay silent
    state
  end
end

The hash is deterministic: the same monitor ID always maps to the same node, as long as every node sees the same sorted node list. Monitors are distributed evenly: 900 monitors across 3 nodes means ~300 "home" monitors per node. If a node dies, the hash redistributes its monitors to the survivors.
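
The determinism is easy to see in a shell. The node names below are made up; in production the list comes from [node() | Node.list()]:

```elixir
nodes = Enum.sort([:"asia@host", :"eu@host", :"us@host"])

home = fn monitor_id ->
  Enum.at(nodes, :erlang.phash2(monitor_id, length(nodes)))
end

# Same ID always hashes to the same node, on every node that
# computes it; no coordination or leader election required
home.("abc") == home.("abc")
```

Because phash2 takes a range argument, the result is always a valid index into the sorted node list, so every monitor has exactly one home.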

Edge cases that would break a naive implementation

Slow region

Asia's check takes 8 seconds (congested submarine cable). EU and US finished in 200ms.

Solution: 10-second timeout. If a region doesn't respond, it's excluded from consensus — not counted as up OR down. 2/2 is still valid consensus.
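
The exclusion rule falls out of plain map arithmetic. A sketch, assuming Asia's result simply never arrives before the timeout:

```elixir
# Asia timed out, so it is absent from the results map entirely
results = %{eu: %{status: "up"}, us: %{status: "up"}}

total = map_size(results)  # 2 responding regions
down = Enum.count(results, fn {_region, r} -> r.status == "down" end)
consensus = if down > total / 2, do: "down", else: "up"
# consensus == "up": a valid 2/2 verdict; Asia counted neither up nor down
```

Because the majority is computed over responding regions only, a slow region shrinks the denominator instead of poisoning the vote.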

Node crash

The India node loses power.

pg automatically removes dead processes from groups. EU and US continue with 2-node consensus. When India comes back, its processes rejoin the pg groups on startup. No manual intervention.

Network partition (netsplit)

The submarine cable between EU and Asia is cut. EU can still talk to US, but Asia is isolated.

EU ←→ US: connected (2 nodes visible)
   ✗
Asia: isolated (1 node visible)

EU consensus:  2/2 results → works normally
Asia consensus: 1/1 result → local only, skip consensus

Partition heals → pg reconnects → 3 nodes again

Each side of the partition operates independently. No split-brain alerts because consensus requires majority of visible nodes.

Duplicate alerts

All 3 nodes compute "DOWN". Without guard rails, all 3 would fire alerts. The home_node? function ensures exactly one node acts on the consensus.

Why not just use the database?

The obvious approach: each region writes its check result to PostgreSQL. A periodic query aggregates results and computes consensus.

We tried this. Three problems:

1. Staleness

Each region checks at slightly different times (jitter). When EU reads Asia's result, it might get the previous check from 30 seconds ago — not the current one. Average staleness: ~15 seconds. That defeats the purpose of multi-region consensus.

2. Advisory lock issues

A singleton aggregator needs leader election. PostgreSQL advisory locks are the standard approach, but they don't work reliably with PgBouncer in transaction pooling mode — connections get swapped, locks get released unexpectedly.

3. The irony

We moved checks OUT of the database (GenServer-per-monitor) to remove the Oban bottleneck and get 100K+ concurrent checks. Putting consensus BACK in the database would reintroduce the exact bottleneck we eliminated.

pg messages solve all three: zero staleness (real-time delivery), no leader election needed (each process computes independently), and no database in the hot path.

Benchmark: 100K checks on a $10/mo server

We benchmarked on a single Netcup RS 1000 (4 AMD EPYC cores, 8GB RAM, $10/mo):

Concurrent   Checks/sec   P50     Failures
 10,000      2,602         1.6s   0
 50,000      2,782         8.2s   0
100,000      2,602        16.9s   0
110,000      2,444        19.1s   0

110,000 concurrent HTTP checks, zero failures, on hardware that costs less than a Netflix subscription. The BEAM was designed for this — Ericsson used the same architecture to handle millions of concurrent phone calls.

With 4 nodes across 3 continents, multi-region consensus supports ~440K monitors. At 50 monitors per free user, that's 8,800 users before needing a 5th node.

The code: surprisingly little

The consensus logic in MonitorProcess is ~40 lines:

defmodule Uptrack.Monitoring.MonitorProcess do
  use GenServer

  # After local check completes, broadcast to pg group
  def handle_info({:check_result, result}, state) do
    # Broadcast our result to the other regions. We exclude our own
    # pid, since we record the local result directly below; sending
    # to ourselves would re-insert it after the per-cycle reset.
    :pg.get_members(:monitor_checks, state.monitor_id)
    |> Enum.reject(&(&1 == self()))
    |> Enum.each(&send(&1, {:region_result, @region, result}))

    state = %{state |
      region_results: Map.put(state.region_results, @region, result),
      checking: false
    }
    maybe_evaluate_consensus(state)
  end

  # Receive result from another region
  def handle_info({:region_result, region, result}, state) do
    state = %{state |
      region_results: Map.put(state.region_results, region, result)
    }
    maybe_evaluate_consensus(state)
  end

  defp maybe_evaluate_consensus(state) do
    results = state.region_results
    total = map_size(results)
    expected = length(:pg.get_members(:monitor_checks, state.monitor_id))

    if total >= min(expected, 2) do
      # Enough results — compute majority consensus (ties count as up)
      down_count = Enum.count(results, fn {_, r} -> r.status == "down" end)
      consensus = if down_count > total / 2, do: "down", else: "up"

      state = %{state |
        last_check: %{status: consensus},
        region_results: %{}  # reset for next cycle
      }
      |> evaluate_result()
      |> record_result()
      |> maybe_trigger_alert()

      {:noreply, state}
    else
      # Still waiting for more regions
      {:noreply, state}
    end
  end
end

That's the core of multi-region consensus. No Kafka, no Redis, no external coordinator. Just processes sending messages to each other — the thing the BEAM was literally built to do.

What we learned

1. The BEAM's distribution primitives are underused. Most Elixir apps treat nodes as independent units behind a load balancer. But pg, :global, and Erlang distribution enable architectures that would require Kafka + Redis + ZooKeeper in other ecosystems.

2. Don't put coordination in the database. We tried DB-based consensus first. The staleness problem alone killed it. pg messages are real-time with zero staleness.

3. The Discord/WhatsApp pattern scales to monitoring. One process per entity with self-scheduling and in-memory state — it works for chat guilds, phone calls, and uptime monitors.

4. $23/mo can compete with $54/mo. UptimeRobot charges $54/mo for 30-second checks from multiple regions. We do it on three $8-10 VPS nodes with better consensus logic.

Try it yourself

50 free monitors — 10 at 30-second checks, 40 at 1-minute. Multi-region consensus on every plan. No credit card required.

Start Monitoring Free