We moved our apps to the cloud for elastic compute and forgot that the wire is not free.
Today the network is product territory and latency is a knob you can design with.
Compute keeps getting cheaper and closer to on demand, but the real story is the path between the servers, the load balancer, the cache, and the user's pocket radio. In shared clouds like AWS, Azure, and the still young Google Compute Engine, you do not own the links and you do not tune the switches. A call between two instances in the same zone feels snappy; then the next hop crosses racks or zones and your request spends a quiet eternity in a queue. Meanwhile, your user is on 3G, where a single round trip can flirt with a quarter second before your app even leaves the driveway. We keep trying to hide this with more threads and bigger instances, but the better move is to treat time as part of the product. Think about latency as a feature, with choices and behaviors that make sense at different speeds. A chatty protocol is a nice idea until a few extra handshakes turn into a visible pause that users feel in their thumbs.
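To make that pause concrete, here is a back-of-the-envelope sketch. The quarter-second round trip comes from the paragraph above; the handshake count is an illustrative assumption, not a measurement of any particular protocol.

```python
# Back-of-the-envelope: round trips dominate on a slow radio link.
RTT_3G_S = 0.25             # roughly a quarter-second round trip, as above
HANDSHAKE_ROUND_TRIPS = 3   # e.g. connect, negotiate, then the first request (illustrative)

time_before_first_byte = HANDSHAKE_ROUND_TRIPS * RTT_3G_S
print(f"{time_before_first_byte:.2f}s spent on handshakes alone")  # -> 0.75s
```

Three quarters of a second is a pause a thumb can feel, and the server has not done any work yet.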
Start by measuring more than averages. The median looks fine, but the tail latency is where customer trust leaks out. Track p90 and p99, and then set a latency budget for each hop in a request path. Put that budget into service contracts so a single slow dependency cannot ransom the whole page. Build clients to prefer async by default, with timeouts, bounded retries, and backoff with jitter to avoid stampedes. Keep TCP connections warm with keep-alive, and batch small writes so Nagle's algorithm does not turn tiny packets into tiny delays. Use queues like SQS or RabbitMQ to absorb bursts while the caller moves on. Place caches where they matter, at the edge with CloudFront and at the app tier, and route users with Route 53 latency-based routing so distance is not a surprise line item on every click.
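As a minimal sketch of the client side of that contract, assume a per-hop budget and a small retry ceiling; the function name, the URL, and the numbers are illustrative, not prescribed values.

```python
import random
import time

import requests


def fetch_within_budget(url, budget_s=0.5, max_retries=3):
    """Honor a per-hop latency budget with bounded retries and jittered backoff."""
    for attempt in range(max_retries + 1):
        try:
            # The timeout caps what this hop may spend, so one slow
            # dependency cannot ransom the caller's whole budget.
            return requests.get(url, timeout=budget_s)
        except requests.RequestException:
            if attempt == max_retries:
                raise
            # Full jitter: sleep a random slice of an exponentially growing
            # window so retrying clients do not stampede in lockstep.
            time.sleep(random.uniform(0, min(2.0, 0.05 * (2 ** attempt))))
```

The same shape works for any dependency call; the point is that the budget and the retry ceiling live in the client, not in hope.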
Once you stop fighting latency, you can ship features that use it. Give users a fast-path answer first and refine with detail as the network cooperates. Serve from cache with a small stale window and refresh in the background, because a slightly old answer beats a spinner in most cases. Queue heavy writes and show progress, not silence, and say what the system is doing in clear language that sets expectations. Prefer push over polling where you can, and when you must poll, do it with a smart interval that backs off when the user is idle. Precompute what you can during calm periods so the hot path is a single call and a quick render. Design your APIs to admit latency explicitly, with hints like Accept Stale or Priority so the caller can trade freshness for speed or pay extra latency for accuracy when it really matters. This is not a trick; it is product care.
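Here is a rough sketch of that serve-stale-and-refresh idea at the app tier; the cache shape, the fetch callback, and the thirty-second window are assumptions for illustration, not a drop-in library.

```python
import threading
import time

CACHE = {}          # key -> (value, fetched_at)
STALE_WINDOW_S = 30  # illustrative: how old an answer we will still serve instantly


def get_fresh_enough(key, fetch):
    """Return a cached value right away and refresh it in the background when it
    is past the stale window, so the user sees an answer instead of a spinner."""
    entry = CACHE.get(key)
    if entry is not None:
        value, fetched_at = entry
        if time.time() - fetched_at > STALE_WINDOW_S:
            # Kick off the refresh without making the caller wait for it.
            threading.Thread(target=_refresh, args=(key, fetch), daemon=True).start()
        return value
    # Nothing cached yet: pay the full cost once on the cold path.
    return _refresh(key, fetch)


def _refresh(key, fetch):
    value = fetch(key)
    CACHE[key] = (value, time.time())
    return value
```

In production you would also cap how stale is acceptable and guard against piling up duplicate refreshes, but the product behavior is the same: answer first, tidy up in the background.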
Treat time as a first class input and the cloud stops feeling random and starts feeling intentional.