Why do deployments still ache when containers were supposed to make them painless? Why does a rollout that worked on your laptop buckle in the cluster? Why does a tiny change to one service trigger pager duty pings across the floor? If your team has been flirting with Docker, sizing up Kubernetes, and peeking at Helm charts while trying not to set production on fire, this is for you.
Let us talk about containers the way we actually ship them.
What did containers promise and what did they actually fix?
Containers gave us a clean box. You build an image, you tag it, you run it. That box removes the “works on my machine” punchline and replaces it with a repeatable artifact. The magic is not Docker itself. The magic is treating your app as immutable once built. From there, an orchestrator schedules those boxes, probes them, restarts them, and wires them to the network. Today the big four still show up in slide decks: Kubernetes with momentum and CNCF gravity, Docker Swarm for a gentle start, Nomad for a lean take, and Mesos plus Marathon for shops that grew there. In the cloud, GKE feels smooth, Azure has AKS, and this week Amazon pushed EKS into the public ring. That is a lot of letters. The question is not which logo you like. The question is which one lets your team deploy predictably.
Predictable beats shiny.
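To make that concrete, the whole artifact story fits in three commands; the registry name and tag here are placeholders.

# build once, tag with something traceable, run that exact artifact anywhere
docker build -t registry.example.com/web:1.2.3-sha.abc1234 .
docker push registry.example.com/web:1.2.3-sha.abc1234
docker run --rm -p 3000:3000 registry.example.com/web:1.2.3-sha.abc1234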
From laptop to cluster without drama
Pick boring defaults. For most teams, that means Kubernetes. Create a cluster, split work with namespaces, turn on RBAC, and keep a simple path from commit to running pod. Put the messy bits in scripts. I like a tiny Makefile that hides docker tags and kubectl calls. Your future self will thank you when it is late and you can still type make deploy without thinking. Also, stop building by hand on some random box. Use a pipeline to build images from a clean environment and tag with the commit SHA. Humans should not invent tags in chat. Let Git do that job.
Scripts are glue and your cluster loves glue.
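Namespaces and RBAC are only a few lines of YAML. A minimal sketch that gives a CI service account just enough rights to roll deployments in staging; the names are placeholders, not a prescription.

# namespace-and-rbac.yaml (illustrative)
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: staging
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployer
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-deployer
    namespace: staging
roleRef:
  kind: Role
  name: deployer
  apiGroup: rbac.authorization.k8s.io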
# Dockerfile
FROM node:8-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
ENV NODE_ENV=production
EXPOSE 3000
CMD ["node", "server.js"]# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: web
spec:
replicas: 3
selector:
matchLabels:
app: web
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
template:
metadata:
labels:
app: web
spec:
containers:
- name: web
image: registry.example.com/web:1.2.3-sha.abc1234
ports:
- containerPort: 3000
readinessProbe:
httpGet:
path: /healthz
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
livenessProbe:
httpGet:
path: /livez
port: 3000
initialDelaySeconds: 15
periodSeconds: 10
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
---
apiVersion: v1
kind: Service
metadata:
name: web
spec:
selector:
app: web
ports:
- port: 80
targetPort: 3000
      protocol: TCP

Those probes are not optional. A liveness probe keeps your service from sitting dead inside a healthy looking container. A readiness probe says do not send traffic yet while the app warms up. Add resource requests so the scheduler knows where to place your pods. Without that, a busy neighbor can starve your app and you will chase ghosts for hours. Also set maxUnavailable to zero if you are strict about dropping nothing during a rollout, at the cost of a longer rollout. It is a trade you can explain to the business in plain words.
Slow rollouts beat fast outages.
Blue green and canary without tears
Blue green in Kubernetes is about labels. You keep two deployments, both ready, and you point the Service at one color. When the new color is healthy, you flip the selector. No magic, no gateways needed. For canary, run a second deployment with one or two pods and give it a label that a slice of traffic will hit. You can route that slice at the ingress level if your stack supports it, or keep it simple and send a small set of users by header or path. The key is to keep the flip reversible. If your change hurts, you do not want to wait for a full rollout to finish while customers wait with you.
Fast rollback is a feature.
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
    color: blue
  ports:
    - port: 80
      targetPort: 3000

# deployment-blue.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
      color: blue
  template:
    metadata:
      labels:
        app: web
        color: blue
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3-sha.abc1234

# deployment-green.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
      color: green
  template:
    metadata:
      labels:
        app: web
        color: green
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.4-sha.def5678

# flip traffic to green when ready
kubectl patch service web -p '{"spec":{"selector":{"app":"web","color":"green"}}}'
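For the canary approach mentioned above, the same label trick is enough when you do not want ingress level routing. A minimal sketch, assuming the blue Service above: the canary pod carries the labels the Service selects on, so it receives roughly its share of traffic (one pod out of four here), and the extra track label only exists so you can find and delete it quickly. The deployment name and the track label are illustrative.

# deployment-canary.yaml (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
      color: blue
      track: canary
  template:
    metadata:
      labels:
        app: web
        color: blue    # matches the Service selector, so the canary pod gets traffic
        track: canary  # lets you target or delete the canary on its own
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.4-sha.def5678

# pull it back the moment the dashboards complain
kubectl delete deployment web-canary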
Helm, with guardrails

Helm makes repeatable deploys easier, but be mindful of Tiller. In many shops, the in cluster Tiller service has more power than it should. If you run Helm today, either scope Tiller to a namespace with tight RBAC or skip Tiller and use helm template to render and then kubectl apply. The best part of Helm is not the server. It is the values files and the way they keep environments tidy. Keep one chart per service, version it, and store your own charts in a private spot. ChartMuseum works, a Git repo also works. Pick one and stick with it.
Templates cut copy paste fatigue.
# values-prod.yaml
image:
  repository: registry.example.com/web
  tag: "1.2.4-sha.def5678"
replicaCount: 5
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"

# render and apply without Tiller
helm template charts/web --values values-prod.yaml | kubectl apply -n prod -f -
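For reference, a sketch of how a chart template might consume those values; the chart layout (charts/web/templates/deployment.yaml) and the label scheme are assumptions, not a prescription.

# charts/web/templates/deployment.yaml (excerpt, illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
        - name: web
          # repository and tag come straight from the values file
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          resources:
{{ toYaml .Values.resources | indent 12 }}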
Secrets and config that do not leak

Keep secrets out of images and out of Git. Kubernetes Secrets and ConfigMaps are your first stop. Mount them as env vars or files. For higher stakes, wire a real secret store. Vault is popular. In the cloud, use KMS and inject at runtime. On AWS, pod IAM with tools like kube2iam lets pods talk to AWS APIs without baking keys. That small change removes a whole class of leaks. Keep an eye on size limits and base64 quirks with Secrets. Also, rotate. A secret that never rotates is not a secret. Write it down, schedule it, and test the rotation in staging with traffic. Your on call person will sleep better.
Treat config as data, not code.
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: web-secrets
type: Opaque
data:
  DB_PASSWORD: c3VwZXItc2VjcmV0 # base64 of super-secret

# deployment snippet
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: web-secrets
        key: DB_PASSWORD
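If hand editing base64 feels error prone, kubectl will do the encoding for you; a small sketch, assuming the same secret name and key as above.

# create the same secret without touching base64 yourself
kubectl create secret generic web-secrets \
  --from-literal=DB_PASSWORD='super-secret' \
  --namespace=staging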
CI and CD that ships on merge

Do not build on laptops. Build in a runner. Tag with the short SHA. Push to your registry. Deploy with the same tag. No floating tags like latest; it reads nice in a manifest and pages hard in production. Keep master shipping to a staging namespace. Promote to production on a tag or a release branch. You can still keep human gates. The point is that the pipeline does the same thing every time. Your job is to write code and review PRs, not to copy YAML by hand on a jump box. One more thing. If you pull by tag, set imagePullPolicy to Always or pin the tag to a unique SHA. Pick one and do not mix them.
Tags are cheap, mistakes are not.
# .gitlab-ci.yml
stages: [build, deploy]

variables:
  IMAGE: registry.example.com/web
  TAG: $CI_COMMIT_SHORT_SHA

build:
  stage: build
  script:
    - docker build -t $IMAGE:$TAG .
    - docker push $IMAGE:$TAG
  only:
    - master
    - tags

deploy_staging:
  stage: deploy
  script:
    - helm upgrade --install web charts/web --namespace=staging --set image.repository=$IMAGE --set image.tag=$TAG
  environment:
    name: staging
    url: https://staging.example.com
  only:
    - master

deploy_prod:
  stage: deploy
  when: manual
  script:
    - helm upgrade --install web charts/web --namespace=prod --set image.repository=$IMAGE --set image.tag=$TAG
  environment:
    name: production
    url: https://example.com
  only:
    - tags
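On that last point, a two line sketch of the container spec; with a tag that is unique per commit, IfNotPresent is safe, and Always is the fallback if you ever deploy a mutable tag.

# deployment snippet: make the pull policy an explicit choice
containers:
  - name: web
    image: registry.example.com/web:1.2.4-sha.def5678  # unique per commit
    imagePullPolicy: IfNotPresent  # use Always instead if the tag can move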
Observability first, not last

You cannot fix what you cannot see. Get metrics, logs, and traces early. Prometheus with Grafana is a strong start and plays well with Kubernetes. Use node exporters, kube state metrics, and app level metrics. For logs, the EFK stack is common, or lean on your cloud logging if it is good enough. Add request duration, error rate, and apdex like signals to your app from day one. Tie alerts to these, not to CPU alone. An HPA is fine on CPU to start, but traffic cares about latency. When you roll out, watch a small set of golden signals. Your eyes will beat any contract test when the world gets weird.
Dark launches still need light.
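Since the HPA came up, here is a starting point sketch for the web deployment above; the replica bounds and the 70 percent target are placeholders to tune, and CPU based scaling only works because the deployment earlier sets resource requests.

# hpa.yaml: scale on CPU to start, revisit once latency data is trustworthy
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70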
War stories and lessons that hold up
I have seen teams ship a tiny change and knock out logins for half an hour because the rollout killed all pods at once. I have seen a sleepy Friday night fix turn into a long pizza dinner because the app never handled SIGTERM and died mid request. I have seen a clean microservice split fall back to a big shared database because of one missing index. These are not rare. They happen when we forget that every deploy is a traffic event. Traffic does not care that your code is clean. Traffic cares that you answer fast and keep answering while you change the engine. So we add small habits that stack up to calm deploys.
Boring habits save launches.
- No latest tag. Tag images with the commit SHA. Keep a friendly tag if you want, but deploy the SHA. You can always map friendly to SHA in a file.
- Two step database changes. Add columns and code that reads both. Backfill. Flip reads and writes. Drop later. Your deploy stays green and your data stays happy.
- Termination grace and preStop. Give pods time to finish in flight requests. In web servers, close keep alives, drain, then exit. In Kubernetes, set terminationGracePeriodSeconds and a preStop hook; see the sketch after this list.
- Backoff and timeouts. Do not let retries slam a slow downstream. Add circuit breakers if your stack supports them. A bit of jitter goes a long way.
- Feature flags. Ship code dark and flip flags for a small slice. You can do that without a mesh. A simple flag service and a header can get you far.
- Limit blast radius. New stack? Start with one service. New cluster? Start with internal apps. Learn without putting checkout at risk.
- Backups you restore. A backup is not a backup until you have restored it in anger. Practice it. Put a date on the last test. Make it visible.
Choose boring over clever every time.
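A small sketch of those termination settings on the web container; the 30 second grace period and the 10 second pause are placeholders for whatever your server needs to drain cleanly.

# deployment snippet: give in flight requests time to finish
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: web
      image: registry.example.com/web:1.2.4-sha.def5678
      lifecycle:
        preStop:
          exec:
            # assumption: a short pause lets endpoint removal propagate
            # before SIGTERM arrives and the server starts draining
            command: ["sh", "-c", "sleep 10"]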
A tiny starter kit you can copy
Here is a tiny set of files that takes you from code to a rolling update with one command. It is not a full platform. It is just enough to get you shipping and to teach new folks how we deploy without tears. Edit the names, wire your registry, and set your namespace. Then run the thing and watch the rollout in your terminal. You will see pods go ready and traffic stay up, and you can sleep a little better tonight.
Small wins stack up.
# Makefile
APP = web
REGISTRY = registry.example.com/$(APP)
SHA := $(shell git rev-parse --short HEAD)
NS ?= staging

.PHONY: build push deploy watch

build:
	docker build -t $(REGISTRY):$(SHA) .

push:
	docker push $(REGISTRY):$(SHA)

deploy:
	kubectl set image deploy/$(APP) $(APP)=$(REGISTRY):$(SHA) -n $(NS) --record

watch:
	kubectl rollout status deploy/$(APP) -n $(NS)

Tip: pair kubectl rollout status with a quick dashboard view of error rate and latency. If both stay flat, call it good and move on.
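Assuming that Makefile, a typical run looks like this; the namespace override is usually the only knob you touch.

# build and push the current commit, roll it out to staging, watch it land
make build push
make deploy watch

# promote the same SHA to production
make deploy watch NS=prod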
Ship calmly, sleep better.