Logs, Metrics, and Healthchecks for Containers

Posted on February 21, 2018 By Luis Fernandez

I was on call when a tiny container took down a very loud service. The app worked fine on my laptop. In the cluster it went quiet. No log files on disk, fresh pods spinning, users refreshing. We had CPU to spare yet no clue why requests stalled. That night taught me a lesson I keep close: with containers, logs, metrics, and health checks are not optional extras. They are the only steady ground when everything else is moving under your feet.

Logs for containers: write to stdout, structure everything

Traditional apps wrote logs to local files that lived longer than the process. Containers flip that story. Write logs to stdout and stderr. Let the runtime and your platform pick them up. Sidecars like Fluentd or DaemonSets like Filebeat ship them to a central place. Most teams I see are heading to ELK, Stackdriver, or CloudWatch. File writes inside a container risk vanishing with restarts, and rotation scripts become brittle when your pod name changes every few minutes. Keep logs stateless and stream-based.

Structure your events. Plain text is friendly until you need to parse a stack trace at scale. Use JSON logs with a consistent schema across services. Include a request id, user agent, service name, version, region, and a precise timestamp. Make log levels stable and predictable. Debug messages can flood your pipe when autoscaling ramps up, so ship them only when you need them. A workable rule is info for the business flow, warn for unexpected but tolerated paths, and error only when someone should care soon. Keep multi-line output tidy by nesting fields rather than emitting loose lines that parsers will split.
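
As a rough sketch of how that could look in Go, here is a tiny helper that writes one JSON object per line to stdout with a consistent set of fields. The field names, service name, and version string are illustrative, not a prescribed schema.

```go
// One JSON log entry per line on stdout; the runtime and log shipper handle the rest.
package main

import (
	"encoding/json"
	"os"
	"time"
)

type logEntry struct {
	Time      string                 `json:"ts"`
	Level     string                 `json:"level"`
	Service   string                 `json:"service"`
	Version   string                 `json:"version"`
	RequestID string                 `json:"request_id,omitempty"`
	Message   string                 `json:"msg"`
	Fields    map[string]interface{} `json:"fields,omitempty"`
}

func logJSON(level, msg, requestID string, fields map[string]interface{}) {
	entry := logEntry{
		Time:      time.Now().UTC().Format(time.RFC3339Nano),
		Level:     level,
		Service:   "checkout-api", // illustrative service name
		Version:   "1.4.2+abc123", // git SHA or container tag
		RequestID: requestID,
		Message:   msg,
		Fields:    fields,
	}
	json.NewEncoder(os.Stdout).Encode(entry)
}

func main() {
	logJSON("info", "order placed", "req-42", map[string]interface{}{
		"route": "/orders", "status": 201, "duration_ms": 87,
	})
}
```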

Think about search first. Pick a small set of canonical fields to power dashboards and alerts. Duration, route, status code, and outcome are my base four. When your aggregation lives on the server side, shipping everything is cheap but querying everything is not. Decide what you will search most and promote those fields. Give your team example queries and pin them. The time to learn how to query by trace id is not during a Friday outage. Also set retention policies: keep raw logs short and summaries longer. A week of raw and a month of reduced fields is better than losing everything on a busy day.

Finally, connect logs to deployments. Include the git SHA and container tag in each line. When a pod restarts with a new tag you will see the boundary right in the stream. Tie log lines to a service map by adding a parent id if you have tracing. Even if you are not ready for tracing today, adding a correlation id in headers and logs is a gift to your future self. The moment one request crosses five services, your only friend is that id.
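
Wiring a correlation id is cheap. Below is one possible shape for a Go middleware, assuming the common X-Request-ID header convention; the header name, id format, and handler are all illustrative.

```go
// Reuse the caller's request id if present, otherwise mint one, and make it
// available to handlers (and from there to every log line).
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

type ctxKey string

const requestIDKey ctxKey = "request_id"

func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID")
		if id == "" {
			buf := make([]byte, 8)
			if _, err := rand.Read(buf); err == nil {
				id = hex.EncodeToString(buf)
			}
		}
		// Echo the id back and pass it downstream so every service logs the same one.
		w.Header().Set("X-Request-ID", id)
		ctx := context.WithValue(r.Context(), requestIDKey, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "handled request %v\n", r.Context().Value(requestIDKey))
	})
	http.ListenAndServe(":8080", withRequestID(mux))
}
```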

Metrics you can trust: white-box beats guesswork

Logs tell stories. Metrics tell trends. In containers you need both. Export internal counters from your app rather than inferring health from the outside only. Prometheus is becoming the default for many teams, and its pull model fits clusters well. cAdvisor exposes container CPU and memory. Service exporters add HTTP request rates and latencies. Use histograms for latency with buckets that match your SLOs. Percentiles matter more than averages. The 99th percentile is where your impatient users live.
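
For illustration, here is a minimal sketch with the Prometheus Go client (github.com/prometheus/client_golang); the metric name, bucket boundaries, and route are placeholders to tune against your own SLOs.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Latency histogram with buckets chosen to bracket the SLO targets.
var requestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency by route and status class.",
		Buckets: []float64{0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5},
	},
	// Low-cardinality labels only: route and status class, never user or request ids.
	[]string{"route", "status_class"},
)

func instrument(route string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		// A real handler would wrap the ResponseWriter to capture the actual status code.
		requestDuration.WithLabelValues(route, "2xx").Observe(time.Since(start).Seconds())
	}
}

func main() {
	http.HandleFunc("/orders", instrument("/orders", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	// Prometheus scrapes this endpoint on its own schedule (pull model).
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```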

Keep label cardinality in check. It is tempting to label metrics with user id or request id. That path leads to an expensive time series storm. Label by service, route, status class, and region. Avoid anything that grows with traffic. If you need per user insight, push that to logs, not metrics. Build dashboards per service that always answer the same questions: are we up, how fast are we, and where are we spending resources. Dashboards should be boring on a good day and painfully obvious on a bad one.

Resource signals are not the whole picture. You can have low CPU and a very unhappy customer. Track business metrics next to system ones. Signups, publish events, payments attempted. Alert on a drop from a known baseline. Tie alerts to runbooks, not hunches. If a pod restarts more than a small threshold in five minutes, alert the team and link to the playbook. Kubernetes 1.9 makes resource requests and limits straightforward. Set them. Runaway containers are noisy neighbors, and nothing ruins a quiet evening like a pod evicted for being greedy.
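
A business metric can sit right next to the system ones. A small sketch, again assuming the Prometheus Go client; the signups_total name and the signup handler are illustrative.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Counts the business event itself, not just an HTTP 200; alert when the rate
// drops below a known baseline.
var signupsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "signups_total",
	Help: "Successful signups.",
})

func main() {
	http.HandleFunc("/signup", func(w http.ResponseWriter, r *http.Request) {
		// ... create the account ...
		signupsTotal.Inc()
		w.WriteHeader(http.StatusCreated)
	})
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```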

Plan for storage and retention. Prometheus 2.0 brought a new storage engine that is easier on disk. Even then, decide how long you keep high-resolution data. Many teams are moving older data to a remote store. Keep scrape intervals honest. Scraping every second sounds great until ten namespaces multiply that by hundreds of services. Start with a moderate interval and tighten only where it pays off, like user-facing APIs and edges.

Health checks that guard users, not egos

There are two classes of checks in clusters and they are not the same. A failing liveness check says the process is stuck or dead and needs a restart. A failing readiness check says traffic should not hit the pod right now. A pod can be alive yet not ready. For example, it may still be warming a cache or running migrations. Use readiness to keep the load balancer away until it is safe. Send a very fast 200 when ready and a 503 when not. Keep the endpoint side-effect-free and cheap.
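
In Go, that split can be as small as two handlers and a flag. A minimal sketch; /healthz and /readyz are conventions rather than requirements, and the warm-up delay stands in for whatever your service really does before it is safe to take traffic.

```go
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

var ready atomic.Bool // flipped to true once caches are warm, migrations are done, etc.

func main() {
	// Liveness: answers "is this process stuck or dead?" and nothing more.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: answers "should traffic hit this pod right now?"
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	// Stand-in for warm-up work: become ready a few seconds after start.
	go func() {
		time.Sleep(5 * time.Second)
		ready.Store(true)
	}()

	http.ListenAndServe(":8080", nil)
}
```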

Docker has a HEALTHCHECK instruction and most schedulers can wire that into restart policies. Use that only for true stuck states. If your app depends on a database, do not fail liveness just because the database is momentarily out. That sets off an endless restart loop and adds load to the very thing that is hurting. Prefer a shallow dependency check in readiness and a deeper synthetic probe outside the cluster. Fail fast where it protects users and fail slow where it protects the system.
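
Here is roughly what that shallow check can look like, assuming a database handle opened at startup: readiness pings the database with a short timeout, while liveness never touches it.

```go
package main

import (
	"context"
	"database/sql"
	"net/http"
	"time"
)

var db *sql.DB // assumed to be opened at startup; left nil here to keep the sketch self-contained

func readyz(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
	defer cancel()
	// Shallow check: can we reach the database quickly? If not, stop taking
	// traffic, but do not restart the process and pile more load on the thing
	// that is already hurting.
	if db == nil || db.PingContext(ctx) != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func healthz(w http.ResponseWriter, r *http.Request) {
	// Liveness stays dependency-free: the process is up and can answer this.
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/readyz", readyz)
	http.HandleFunc("/healthz", healthz)
	http.ListenAndServe(":8080", nil)
}
```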

Health checks need timeouts and sane thresholds. A single glitch in a dependency should not immediately pull a pod from service. Add a small grace period and a small number of consecutive failures before flipping readiness. Most clouds attach health status to load balancers. GKE and the new AWS EKS preview make this pretty friendly, and Fargate is bringing containers without servers to manage. The idea stays the same across platforms. Keep checks simple, cheap, and tuned to the cost of a wrong answer.

Do not hide failures. Expose a /health endpoint for machines and a /status or /info page for humans. The first returns only what automation needs. The second can show build info, last deploy time, and commit id. That tiny bit of context shortens incidents. Pair that with logs and metrics so you can line up a deployment marker with a dip in readiness and a spike in error logs. Now you can tell causation from coincidence.
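
A sketch of that pairing; the service name and build variables are illustrative, and in practice the SHA and build time would be stamped in at build time, for example with Go's -ldflags -X.

```go
package main

import (
	"encoding/json"
	"net/http"
	"time"
)

var (
	gitSHA    = "dev"     // typically injected at build time via -ldflags "-X main.gitSHA=..."
	buildTime = "unknown" // same idea for the build timestamp
	startedAt = time.Now().UTC()
)

func main() {
	// Machines get a bare yes/no.
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Humans get a little context that shortens incidents.
	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]string{
			"service":    "checkout-api", // illustrative
			"git_sha":    gitSHA,
			"build_time": buildTime,
			"started_at": startedAt.Format(time.RFC3339),
			"uptime":     time.Since(startedAt).String(),
		})
	})

	http.ListenAndServe(":8080", nil)
}
```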

Put it together: one story across logs, metrics, and checks

Observability is a team sport. The strongest pattern I see is consistency. One request id rides the headers, shows up in logs, and tags metrics where it makes sense. That id threads the story from the ingress to the database. A log shipper sidecar takes stdout and forwards it with minimal rewriting. A common dashboard template shows traffic, latency, errors, saturation, and recent deploys. Alerts point to runbooks and carry enough context to start the fix right away.

Make development mirror production. Developers should run a container locally that exposes the same /metrics and /health endpoints and writes to stdout. If something only exists in the cluster, it will be forgotten until the worst moment. Bake these pieces into your app template or generator so new services inherit the wins. A tiny checklist in the pull request helps. Does it log in JSON, export metrics, and pass a real readiness gate? These are boring questions by design.

Plan for failure. Nodes will drain, pods will bounce, and rollouts will surprise you. If every container exports the same signals, the platform can make smart choices on your behalf. Autoscaling needs metrics it can trust. Rollouts need readiness that reflects truth. On the bad days you need logs that arrive even when a pod dies. That is the job of stdout and a shipper running nearby, not a bespoke file inside a container.

Keep the system simple to reason about. Pick one logging pipeline, one metrics stack, and a small set of conventions, then document them. Tools are moving fast. New proxies and tracing systems seem to land every week. The practices in this post will survive tool churn: write structured logs to stdout, export white-box metrics with sane labels, and use liveness and readiness with intent. Your future cluster will thank you, and your future on-call self will sleep better.

Summary

Containers change where state lives. Treat logs as a stream to stdout with a common schema. Treat metrics as first class signals with careful labels and clear dashboards. Treat health checks as contracts that protect users and the system. Tie all three to deployments and request ids. Keep them boring. When things get weird at two in the morning, boring wins.
