April 1, 2026
Why we chose BEAM for uptime monitoring
When your monitoring tool goes down, who monitors the monitor?
This isn't a philosophical riddle. It's the most practical question you can ask when choosing a tech stack for monitoring infrastructure. Your monitoring tool is the last line of defense. If it crashes at 3 AM, nobody knows your production app is down until customers start tweeting about it.
We chose a runtime that was designed to never go down. Not "designed to be fast" or "designed to be easy to hire for." Designed to keep running when everything around it is failing. That runtime is the BEAM VM, and it's the foundation of Uptrack.
What is the BEAM VM?
BEAM is the virtual machine that runs Erlang and Elixir. Erlang was created at Ericsson in the mid-1980s for telecom switching systems -- the kind of software that routes your phone calls and needs 99.9999% uptime (about 32 seconds of downtime per year) -- and BEAM became its production VM in the 1990s.
Telecom engineers had a constraint most web developers never face: you cannot restart the system. Ever. Millions of active phone calls depend on it. So they built a runtime where individual components can crash and restart without affecting the rest of the system. Where you can swap out code while the system is running. Where millions of lightweight concurrent processes communicate through message passing instead of sharing memory.
Today, the BEAM powers WhatsApp (2 billion users, ~50 engineers), Discord (200 million users), RabbitMQ, and most telecom infrastructure worldwide. And now it powers Uptrack.
Why BEAM is uniquely suited for monitoring
Every language can make HTTP requests and check status codes. The difference is what happens when things go wrong -- and in monitoring, things go wrong constantly. Targets time out. DNS lookups fail. Database connections drop. The question isn't whether failures happen, it's whether one failure cascades into many.
Fault tolerance through supervisor trees
In most runtimes, an unhandled exception crashes the process. In BEAM, each "process" is a tiny isolated unit (not an OS process -- more like a goroutine, but with its own heap and mailbox). When a BEAM process crashes, its supervisor restarts it automatically. The rest of the system doesn't notice. If a health check for one monitor hits a malformed response and the parsing code crashes, only that single check restarts. The other 10,000 concurrent checks continue uninterrupted. This isn't something we had to build. It's how the runtime works.
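Here's what that looks like in practice -- a minimal, self-contained sketch (module and monitor names like CheckWorker are illustrative, not Uptrack's actual code):

```elixir
# One GenServer per monitor check, registered globally by id.
defmodule CheckWorker do
  use GenServer

  def start_link(monitor_id),
    do: GenServer.start_link(__MODULE__, monitor_id, name: {:global, {:check, monitor_id}})

  @impl true
  def init(monitor_id), do: {:ok, monitor_id}

  # A crash here (e.g. parsing a malformed response) kills only
  # this process; its supervisor restarts it automatically.
  @impl true
  def handle_cast({:run, body}, monitor_id) do
    _status = String.to_integer(body)
    {:noreply, monitor_id}
  end
end

# One supervisor, one child per monitor. :one_for_one means a
# crashing child is restarted alone -- siblings are untouched.
children = for id <- 1..3, do: Supervisor.child_spec({CheckWorker, id}, id: {:check, id})
{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)
```

Send worker 1 a malformed body and it crashes; a moment later a fresh process is registered under the same name, while workers 2 and 3 never notice.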
Hot code upgrades: deploy without downtime
BEAM can load new code modules while the system is running. This was built for telecom switches that couldn't be restarted, and it means we can ship fixes and features without dropping a single health check. No rolling restart window where checks are paused. No brief gap where an outage could go undetected. The old code handles in-flight requests while new requests hit the new code.
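A toy demonstration of the underlying primitive (real zero-downtime deploys use OTP release upgrades; this only shows the VM swapping a module while it runs):

```elixir
# Define and call version 1 of a module.
defmodule Greeter do
  def hello, do: "v1"
end

"v1" = Greeter.hello()

# Suppress the redefinition warning, then compile a new version
# of the same module into the running VM. In-flight calls keep
# executing the old code; new calls hit the new code.
Code.compiler_options(ignore_module_conflict: true)

Code.eval_string("""
defmodule Greeter do
  def hello, do: "v2"
end
""")

"v2" = Greeter.hello()
```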
Lightweight processes for massive concurrency
A BEAM process uses about 2KB of memory. You can run hundreds of thousands of them on a single machine. Each monitor check runs in its own process with its own timeout, its own error handling, and its own lifecycle. This means 100,000 concurrent checks don't require 100,000 OS threads or a complex async/await state machine. They're just processes doing their thing, scheduled preemptively by the BEAM across all available CPU cores.
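You can verify this yourself in a few lines (exact memory numbers vary by VM version and flags):

```elixir
# Spawn 100,000 processes, each parked waiting for a message,
# then round-trip a message through every one of them.
parent = self()

pids =
  for _ <- 1..100_000 do
    spawn(fn ->
      receive do
        :ping -> send(parent, :pong)
      end
    end)
  end

IO.puts("processes: #{length(pids)}")
IO.puts("VM memory for processes: ~#{div(:erlang.memory(:processes), 1_000_000)} MB")

# Message every process and wait for all the replies.
Enum.each(pids, &send(&1, :ping))
for _ <- pids, do: receive(do: (:pong -> :ok))
```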
Built-in distribution for clustering
BEAM nodes can form clusters natively. Processes on different machines communicate with the same message-passing syntax as processes on the same machine. If one node goes down, the other nodes detect it within seconds and can take over its work. We didn't need to add Kubernetes, service mesh, or a separate consensus protocol. Distribution is a language-level primitive.
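Sketched below, assuming two nodes started on the same host with `iex --sname a` and `iex --sname b` (the node names are placeholders, and this won't run standalone without that second node):

```elixir
# On node a: join the cluster. No service mesh, no sidecar.
Node.connect(:"b@myhost")
Node.list()    # the connected peers, e.g. [:"b@myhost"]

# Spawn a process ON THE REMOTE NODE and talk to it with the
# exact same send/receive primitives used for local processes.
pid =
  Node.spawn(:"b@myhost", fn ->
    receive do
      {:check, url, from} -> send(from, {:result, url, :up})
    end
  end)

send(pid, {:check, "https://example.com", self()})

receive do
  {:result, url, status} -> IO.puts("#{url} is #{status}")
end

# Subscribe to cluster membership: if a peer dies, this process
# gets a {:nodedown, node} message within seconds.
:net_kernel.monitor_nodes(true)
```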
Message passing eliminates shared state bugs
BEAM processes don't share memory. They communicate by sending immutable messages. No mutexes, no locks, no race conditions, no "this works fine until you add a second thread." When you're running thousands of concurrent health checks that all need to write results, update incident status, and trigger alerts, the absence of shared-state bugs isn't a nice-to-have. It's the difference between a system that works under load and one that works "most of the time."
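A minimal sketch of the share-nothing pattern: many check processes report to one collector process, with no locks anywhere, because the only way to touch the collector's state is to send it a message:

```elixir
# The collector owns the results map. Nothing else can mutate it.
defmodule Collector do
  def loop(results) do
    receive do
      {:result, id, status} ->
        loop(Map.put(results, id, status))

      {:dump, from} ->
        send(from, {:results, results})
        loop(results)
    end
  end
end

collector = spawn(fn -> Collector.loop(%{}) end)

# 1,000 concurrent "checks" racing to report -- safe by design,
# because messages are copied and processed one at a time.
for id <- 1..1_000 do
  spawn(fn -> send(collector, {:result, id, :up}) end)
end

Process.sleep(200)
send(collector, {:dump, self()})

receive do
  {:results, results} -> IO.puts("collected #{map_size(results)} results")
end
```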
How we put it together
Theory is nice. Here's what Uptrack actually looks like in production.
Dual API nodes on Netcup Germany
Two independent Elixir/Phoenix nodes, each capable of handling the full workload. Cloudflare Tunnel runs on both nodes with the same tunnel token. If one node dies, Cloudflare automatically routes traffic to the surviving node within ~5 seconds. No load balancer to configure or maintain.
Patroni PostgreSQL with automatic failover
PostgreSQL runs in a Patroni cluster with HAProxy routing queries to the current primary. If the primary goes down, Patroni promotes the replica and HAProxy redirects automatically. Our application connects to a stable local address that never changes, regardless of which node is primary.
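On the application side this reduces to pointing Ecto at the local HAProxy frontend. A hedged sketch (hostnames, ports, and app names are illustrative, not our real config):

```elixir
# config/runtime.exs -- illustrative values only. The app always
# dials the local HAProxy frontend; HAProxy tracks whichever
# Patroni node is currently primary, so failover is invisible here.
import Config

config :uptrack, Uptrack.Repo,
  hostname: "127.0.0.1",   # local HAProxy, never a specific PG node
  port: 5000,              # HAProxy frontend bound to Patroni's primary
  database: "uptrack_prod",
  pool_size: 20
```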
Oban for reliable job processing
Health checks, alert delivery, and incident management run as Oban jobs backed by PostgreSQL. If a check fails mid-execution (node crash, OOM kill), Oban automatically retries it on the surviving node. No Redis dependency. No separate job queue to monitor and maintain. The database is the queue, and PostgreSQL is extremely good at being reliable.
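A hedged sketch of what such a worker looks like (module, queue, and argument names are illustrative, and the snippet assumes an app with Oban already configured):

```elixir
# A health check as an Oban job. The job row lives in PostgreSQL,
# so a job interrupted by a node crash is picked up and retried
# by the surviving node.
defmodule Uptrack.Checks.RunCheck do
  use Oban.Worker, queue: :checks, max_attempts: 5

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"monitor_id" => id}}) do
    # Run the HTTP check; raising or returning {:error, _}
    # makes Oban retry with backoff.
    do_check(id)
  end

  defp do_check(_id), do: :ok
end

# Enqueueing is a plain Postgres INSERT -- transactional with any
# other writes happening alongside it.
%{"monitor_id" => 42}
|> Uptrack.Checks.RunCheck.new(schedule_in: 30)
|> Oban.insert()
```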
The entire stack is designed so that any single component can fail without the system going down. The BEAM handles application-level failures. Patroni handles database failover. Cloudflare handles network-level failover. Each layer is independent and self-healing.
What we considered instead
We didn't pick Elixir/BEAM because it was trendy. We evaluated serious alternatives and each fell short for this specific use case.
Go
Go is fast, compiles to a single binary, and has excellent concurrency with goroutines. But goroutines don't have supervisors. A panic in one goroutine can crash the entire process if you're not extremely careful with recovery. Go has no built-in distribution, no hot code upgrades, and error handling is explicit but tedious. For a monitoring service where you need thousands of independent, self-healing tasks, you'd end up reimplementing half of what BEAM gives you for free.
Node.js
Node.js runs all your JavaScript on a single event-loop thread. You can work around this with worker threads or cluster mode, but you're fighting the runtime's fundamental design. An unhandled promise rejection can crash the process. There's no built-in fault isolation between concurrent operations. For a monitoring tool that needs to run 100K concurrent checks with independent failure domains, Node.js requires bolting on infrastructure (PM2, Redis, external job queues) to approximate what BEAM provides natively.
Rust
Rust is an extraordinary language for systems programming. Memory safety without a garbage collector is genuinely impressive. But for a web service that makes HTTP requests and writes to a database, Rust's compile times and borrow checker are friction without proportional benefit. We don't need zero-cost abstractions or manual memory management. We need fault tolerance, concurrency, and rapid iteration. Rust is a great choice for building a database. It's overkill for building a web application on top of one.
The honest tradeoff
We're not going to pretend BEAM has no downsides. The Elixir talent pool is smaller than JavaScript or Python. You won't find Elixir developers on every street corner. The ecosystem, while mature, has fewer libraries than Node.js or Go. Raw computational throughput is lower than Go or Rust (though for I/O-bound work like HTTP monitoring, this is irrelevant).
But here's the thing: we're not building a generic web app. We're building monitoring infrastructure. The tool that needs to be running when everything else is broken. The tool that checks your servers at 3 AM, detects the outage, and wakes you up. For that specific job, the BEAM's tradeoffs are exactly right.
Fewer developers who know the language, but the ones who do understand distributed systems deeply. Fewer libraries, but the standard library covers concurrency, networking, and fault tolerance out of the box. Slower raw computation, but the ability to handle massive concurrent I/O with processes that self-heal when they fail. For monitoring infrastructure specifically, nothing else comes close.
Built on BEAM. Built to stay up.
5 monitors free — all at 30-second checks. No credit card required.
Start Monitoring Free