Use Cases
Monitor your AI inference endpoints — vLLM, Ollama, Replicate, and beyond
Inference is now the majority of AI cloud spend. If your serving layer goes down, your entire AI product is dead. Here is how to actually monitor it.
April 10, 2026
Inference is 55% of AI cloud spend — and it is your single point of failure
According to a16z's 2025 infrastructure report, inference now accounts for 55% of AI-related cloud expenditure, up from 40% the year prior. Training is a one-time cost. Inference is the always-on cost — and it is the layer your users actually touch.
When your vLLM server runs out of GPU memory and silently stops accepting requests, your chatbot returns empty responses. When Ollama's model unloads after an idle timeout, your RAG pipeline hangs. When Replicate's cold start exceeds your client's timeout, your image generation feature returns 502s.
Traditional uptime monitoring was built for web servers that return HTML. AI inference endpoints are different: they serve models, manage GPU memory, handle variable-length requests, and can be "up" at the process level while being completely non-functional at the model level.
You need monitoring that understands these failure modes.
vLLM: the /health endpoint lies to you
vLLM exposes a /health endpoint that returns 200 OK when the HTTP server is running. That is necessary but not sufficient. Here is what vLLM's health check actually tells you versus what you need to know:
What /health confirms
The uvicorn process is alive. The FastAPI app is accepting connections. That is it.
What it does not confirm
Whether the model is loaded into GPU memory. Whether CUDA is functioning. Whether the KV cache is full and requests are being rejected. Whether the tokenizer loaded correctly.
The real health signal is in /v1/models. This endpoint returns a JSON list of loaded models. If vLLM started but failed to load the model (wrong path, insufficient VRAM, corrupted weights), /health returns 200 but /v1/models returns an empty list.
# vLLM is "healthy" but no model is loaded
$ curl http://gpu-server:8000/health
→ 200 OK
$ curl http://gpu-server:8000/v1/models
→ {"object": "list", "data": []}
# Empty — your inference endpoint is useless
# vLLM is truly healthy
$ curl http://gpu-server:8000/v1/models
→ {"object": "list", "data": [{"id": "meta-llama/Llama-3.1-70B-Instruct", ...}]}
# Model name present — now you know it is actually serving

The monitoring strategy: hit /v1/models and use keyword matching to verify the response contains your model name. If "Llama-3.1-70B" disappears from the response body, the model unloaded — that is a real incident even though the server is "up."
You should also monitor the /metrics endpoint if you have Prometheus set up, but for immediate alerting on model availability, keyword matching on /v1/models is the fastest path to catching real failures.
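The keyword check described above can be sketched in a few lines of Python. This is a minimal sketch of the matching logic only — the function name is ours, and the sample payloads are the two curl responses shown earlier; a real monitor would fetch the body over HTTP first.

```python
import json

def model_loaded(models_response: str, expected: str) -> bool:
    """Return True only if the /v1/models payload lists the expected model.

    A 200 from /health is not enough; this inspects the actual model list.
    """
    try:
        data = json.loads(models_response)
    except json.JSONDecodeError:
        return False  # a garbled body counts as a failure, not an unknown
    return any(expected in m.get("id", "") for m in data.get("data", []))

# The two payloads from the curl examples above:
empty = '{"object": "list", "data": []}'
ok = '{"object": "list", "data": [{"id": "meta-llama/Llama-3.1-70B-Instruct"}]}'

print(model_loaded(empty, "Llama-3.1-70B"))  # False: server "up", model gone
print(model_loaded(ok, "Llama-3.1-70B"))     # True
```

Matching on a substring of the model ID ("Llama-3.1-70B") rather than the full path keeps the check resilient to minor naming changes while still catching a silent swap to a different model family or size.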
Ollama in production: zero built-in monitoring
Ollama is the easiest way to run LLMs locally, and teams are increasingly deploying it on GPU servers for internal APIs. The problem: Ollama was designed as a developer tool, not a production serving platform. It has no dedicated health check endpoint, no metrics endpoint, and no built-in alerting.
Ollama's API root (http://host:11434) returns "Ollama is running" as plain text when the process is alive. That is the closest thing to a health check it offers. But like vLLM's /health, it only confirms the process — not the model state.
# Check if Ollama process is running
$ curl http://gpu-server:11434/
→ Ollama is running
# Check which models are currently loaded in memory
$ curl http://gpu-server:11434/api/ps
→ {"models": [{"name": "llama3.1:70b", "size": 39943471104, ...}]}
# If no models loaded (idle unload after 5 minutes by default)
$ curl http://gpu-server:11434/api/ps
→ {"models": []}

Ollama aggressively unloads models from GPU memory after idle periods (default: 5 minutes). This means your inference endpoint can go from "ready to serve in 50ms" to "needs 30 seconds to reload the model" without any visible error. The first user request after an idle period will either time out or take 30x longer than expected.
For production Ollama deployments, monitor two things: the root endpoint for process health (keyword match on "Ollama is running"), and /api/ps for loaded model state. If you rely on a specific model being hot in memory, keyword match on the model name in the /api/ps response.
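Both Ollama checks reduce to simple predicates over the response bodies shown above. A minimal sketch (function names are ours; the payloads mirror the curl examples):

```python
import json

def ollama_process_ok(root_body: str) -> bool:
    # The root endpoint returns plain text, not JSON
    return "Ollama is running" in root_body

def model_hot(ps_body: str, model: str) -> bool:
    # /api/ps lists models currently resident in GPU memory;
    # an empty list means an idle unload has already happened
    loaded = json.loads(ps_body).get("models", [])
    return any(m.get("name") == model for m in loaded)

print(ollama_process_ok("Ollama is running"))               # True: process alive
print(model_hot('{"models": []}', "llama3.1:70b"))          # False: idle unload
print(model_hot('{"models": [{"name": "llama3.1:70b"}]}',
                "llama3.1:70b"))                            # True: model is hot
```

Note the exact-match on the model name here: Ollama tags ("llama3.1:70b" vs "llama3.1:8b") differ only in the suffix, so a substring match could mask a downgrade.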
Managed providers go down too: Replicate, Together.ai, Groq
"We use a managed provider, so we don't need monitoring." This is the AI equivalent of "we use AWS, so we don't need backups." Managed inference providers have outages, and they often fail in ways their own status pages do not reflect.
Replicate
Cold starts can spike from 5 seconds to 3+ minutes during high demand. Your 30-second timeout fires, but Replicate's status page shows green. Monitor by hitting the prediction creation endpoint and checking that the response includes a valid prediction ID within your timeout window.
Together.ai
Rate limits can silently drop your requests when shared GPU clusters are under load. The API returns 429s, but if your application does not surface that status code, you might not notice for hours. Monitor the chat completions endpoint and verify you get a 200 with actual content in the response body.
Groq
Groq's LPU inference is fast when available, but model availability can change without notice. A model you depend on might be temporarily removed or replaced. Monitor the models endpoint and keyword match on your specific model ID.
For all managed providers, the principle is the same: do not trust status pages. Set up your own HTTP endpoint monitors that hit the actual API endpoints your application uses. You will catch degradations and outages 10-30 minutes before the provider's status page updates.
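The provider failure modes above (cold starts, silent 429s, missing models) all feed into one decision: is this probe healthy, degraded, or down? A sketch of that classification, with illustrative thresholds you would tune to your own timeouts:

```python
def classify(status: int, latency_s: float, body: str,
             keyword: str, timeout_s: float = 30.0) -> str:
    """Collapse one probe result into healthy / degraded / down.

    Thresholds are illustrative, not prescriptive; tune them to your SLOs.
    """
    if status != 200 or keyword not in body:
        return "down"       # 429s, 5xx, or a silently swapped model
    if latency_s > timeout_s:
        return "down"       # cold start exceeded the client timeout
    if latency_s > timeout_s / 3:
        return "degraded"   # latency creeping toward the timeout
    return "healthy"

print(classify(200, 2.1, '{"id": "llama-3.1-70b"}', "llama-3.1-70b"))   # healthy
print(classify(429, 0.3, "rate limited", "llama-3.1-70b"))              # down
print(classify(200, 12.0, '{"id": "llama-3.1-70b"}', "llama-3.1-70b"))  # degraded
```

The "degraded" tier is the one status pages never show you: the endpoint still answers, but a Replicate-style cold-start spike is already eating into your client timeout budget.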
Vector databases are part of the inference chain
If you run RAG (retrieval-augmented generation), your inference pipeline is only as reliable as its weakest link. The chain typically looks like: user query, embedding model, vector database query, LLM inference, response. A vector database outage breaks this chain just as effectively as an LLM outage.
Pinecone
Monitor your index's describe_index_stats endpoint. It returns the vector count and index fullness. If vector count drops to zero, your index was corrupted or deleted. If index fullness hits 100%, writes silently fail. Both are production incidents your application will not surface on its own.
Qdrant
Self-hosted Qdrant exposes a /healthz endpoint and a detailed /collections endpoint. Monitor both. A healthy Qdrant node can still have zero collections if the data directory was wiped on restart. Keyword match on your collection name in the collections response.
The embedding model (OpenAI, Cohere, or a local model via vLLM) is another failure point. If you use OpenAI embeddings, monitor the /v1/embeddings endpoint independently. A working LLM and a working vector database are useless if the embedding step between them is down.
Five failure modes unique to AI inference
AI serving infrastructure fails differently from traditional web services. These are the failure modes that standard uptime monitors miss:
1. GPU OOM without process crash
vLLM and TGI (Text Generation Inference) catch CUDA out-of-memory errors and continue running. The process stays alive, the HTTP server responds, but all inference requests return errors. A simple ping check sees 200 OK. You need to verify the response body contains actual model output.
2. Model weight corruption
Disk errors or incomplete downloads can corrupt model weights. The serving framework loads what it can and produces garbage output — or loads nothing and returns empty responses. The HTTP layer is fine. Only checking the response content catches this.
3. Silent model swaps
Managed providers occasionally swap the underlying model without changing the API endpoint. You requested Llama 3.1 70B, but you are getting a quantized 8B variant because the 70B instance hit capacity. Keyword matching on model metadata in the response catches this.
4. Tokenizer desync
The tokenizer and model weights must match. A version mismatch produces syntactically valid but semantically broken output. The endpoint returns 200, the response looks like JSON, but the content is incoherent. This is hard to catch with automated monitoring but response time anomalies (suddenly much faster or slower) can signal it.
5. KV cache exhaustion
vLLM pre-allocates GPU memory for the KV cache. Under heavy load, all cache blocks are consumed and new requests queue indefinitely. The server is "up" but latency goes from 200ms to 60+ seconds. Monitoring response time thresholds catches this before users notice.
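Failure modes 4 and 5 both surface as latency anomalies before they surface as errors. A sketch of a simple alarm that fires on either an absolute ceiling or a sudden jump off the recent baseline (class name and thresholds are ours, purely illustrative):

```python
from collections import deque

class LatencyAlarm:
    """Fires when a check's response time breaches an absolute ceiling or
    a multiple of the recent moving average (both thresholds illustrative)."""

    def __init__(self, window: int = 20, ceiling_s: float = 30.0, ratio: float = 5.0):
        self.samples = deque(maxlen=window)
        self.ceiling_s = ceiling_s
        self.ratio = ratio

    def observe(self, latency_s: float) -> bool:
        baseline = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(latency_s)
        if latency_s > self.ceiling_s:
            return True   # hard ceiling: requests queueing behind a full KV cache
        if baseline is not None and latency_s > self.ratio * baseline:
            return True   # sudden jump off the baseline: possible desync or swap
        return False

alarm = LatencyAlarm()
for t in [0.2, 0.21, 0.19, 0.2]:
    alarm.observe(t)      # normal ~200ms checks build the baseline
print(alarm.observe(60.0))  # True: 200ms -> 60s is exactly the KV cache pattern
```

The ratio rule matters for mode 4: a tokenizer desync can make responses anomalously fast as well as slow, so in practice you would alarm on large deviations in either direction.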
How to monitor AI inference with Uptrack
Uptrack's HTTP endpoint monitoring covers the specific failure modes that AI inference introduces. Here is how to configure monitors for each layer of your inference stack:
Self-hosted vLLM or TGI
Create a monitor pointing at https://your-gpu-server/v1/models. Set the check interval to 30 seconds. Add a keyword match rule: the response body must contain your model name (e.g., "Llama-3.1-70B"). If the model unloads, gets corrupted, or fails to start, the keyword check fails and you get alerted within 90 seconds via Slack or Discord.
Ollama
Two monitors. First: http://your-server:11434/ with keyword match on "Ollama is running" to catch process crashes. Second: http://your-server:11434/api/ps with keyword match on your model name to catch idle unloads. If you set OLLAMA_KEEP_ALIVE=-1 to disable unloading, the second monitor confirms that setting is working.
Managed API providers
For Replicate, Together.ai, Groq, or any OpenAI-compatible API: monitor the models list endpoint. Most providers expose /v1/models or equivalent. Set keyword matching on the model ID you depend on. You will know if a model is removed or renamed before your users do.
Vector databases
For Qdrant: monitor /collections with keyword matching on your collection name. For Pinecone: monitor your index's describe stats endpoint. For Weaviate: monitor /v1/.well-known/ready for cluster readiness.
The full RAG pipeline check
For end-to-end RAG health, you ideally expose a dedicated /healthz endpoint in your application that queries the embedding model, vector database, and LLM in sequence. Point Uptrack at this endpoint. If any component fails, the health check fails, and you get one alert that tells you the pipeline is broken — then check the individual component monitors to isolate which one.
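The composite health check described above can be sketched as a sequential walk of the chain, reporting the first broken component. Everything here is a sketch: the function name, the response shape, and the stub checks standing in for real probes of each component.

```python
def rag_healthz(checks: dict) -> dict:
    """Run embedding -> vector DB -> LLM checks in order; report the first failure.

    `checks` maps a component name to a zero-arg callable returning True/False.
    """
    for name, check in checks.items():
        ok = False
        try:
            ok = check()
        except Exception:
            pass  # a check that raises is a failed check
        if not ok:
            return {"status": "unhealthy", "failed": name}
    return {"status": "ok", "failed": None}

# Stub checks standing in for real probes of each pipeline stage:
print(rag_healthz({
    "embeddings": lambda: True,
    "vector_db": lambda: False,   # e.g. Qdrant collection missing after restart
    "llm": lambda: True,
}))
# → {'status': 'unhealthy', 'failed': 'vector_db'}
```

Returning the failed component name in the body means the one Uptrack alert on this endpoint already tells you where to look, before you open the per-component monitors.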
Why 30-second checks matter for inference
Most free monitoring tools check every 5 minutes. For a web server, that is usually fine — if your site is down for 5 minutes, you have a problem, and knowing 4 minutes earlier does not change your response.
AI inference is different. GPU instances cost $2-8/hour. A vLLM server that crashed and restarted might need 2-3 minutes to reload a 70B model. If you check every 5 minutes, the sequence looks like this:
5-minute interval:
T=0:00 Check passes (model loaded)
T=1:30 vLLM crashes (GPU OOM)
T=2:00 systemd restarts vLLM
T=2:30 Model reloading... (no inference possible)
T=4:30 Model loaded, serving resumes
T=5:00 Next check passes ← you never knew it was down
30-second interval:
T=0:00 Check passes
T=0:30 Check passes
T=1:00 Check passes
T=1:30 vLLM crashes
T=2:00 Check FAILS ← alert fires in 30 seconds
T=2:00 systemd restarts vLLM
T=2:30 Check FAILS (model still loading)
T=3:00 Check FAILS
T=3:30 Check passes ← recovery confirmed
Total aware downtime: 2 minutes
You saw it happen in real time

With 30-second checks, you catch transient failures that 5-minute intervals miss entirely. For GPU infrastructure that costs real money every minute, that visibility is worth it.
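The timelines above reduce to simple arithmetic: an outage is invisible if no check lands inside its window. A sketch (function name is ours; the window is the vLLM crash from the timeline, down from T=1:30 to T=4:30):

```python
def detected(check_interval_s: int, outage_start_s: int, outage_end_s: int) -> bool:
    """True if any check at t = 0, i, 2i, ... lands inside the outage window."""
    t = 0
    while t <= outage_end_s:
        if outage_start_s <= t < outage_end_s:
            return True  # this check fires during the outage and fails
        t += check_interval_s
    return False

# Outage window: 90s to 270s (crash at T=1:30, serving resumes at T=4:30).
print(detected(300, 90, 270))  # False: checks at 0s and 300s both miss it
print(detected(30, 90, 270))   # True: the 120s check fails, alert fires
```

The general rule this illustrates: any outage shorter than your check interval can slip through undetected, so the interval should be comfortably shorter than the outages you care about.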
Start monitoring your inference stack
50 free monitors — 10 at 30-second checks, 40 at 1-minute. Keyword matching for model-aware health checks. Alerts via Slack, Discord, email, and webhooks. No credit card required.
Start Monitoring Free