The inbox looked quiet, which is never a good sign on a launch night. We had queued a tidy welcome batch and watched connections spin up to Gmail, Yahoo, and Hotmail. Then the logs started to sing. 421 temp fails. Connection timeouts. A sprinkle of greylisting from a few small providers. Marketing was watching the live dashboard. I was watching queue depth climb like a thermometer. The instinct was to punch resend over and over, but email punishes that kind of panic. What saved us was the boring stuff that never gets a keynote: retry, backoff, and bounce handling.
When email works, it looks like magic. When it does not, it is just a queue with some rules. If you get the rules right, you sleep at night. If you do not, you get an IP blocked, complaints spike, and your Tuesday looks like a Friday night.
Know your SMTP no's
SMTP speaks in three-digit codes. The first digit is your compass: 2xx means success, 4xx means try again later, and 5xx means stop trying. That last one matters. If you keep hammering an address that answered 550 user unknown, you look like a bot, and the big providers will remember your name for the wrong reasons.
Common stuff you will see:
- 421 or 451: busy or rate limited. Respect it. This is a classic case for backoff.
- 450: mailbox unavailable, often a greylist nudge. Try again later and it will usually pass.
- 550: user unknown, a hard bounce. Drop it from your list right away.
- 552: quota exceeded. Strictly a 5xx, but a full mailbox is usually treated as a soft bounce. Try later, but keep a limit on your patience.
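The first-digit rule and the list above fit in a few lines. Here is a minimal Python sketch, assuming the mailer gets the basic reply code back as an integer; `classify_reply`, `Outcome`, and the choice to retry 552 are illustrative names and policy for this post, not part of any library.

```python
from enum import Enum


class Outcome(Enum):
    DELIVERED = "delivered"      # 2xx: the server took the message
    RETRY = "retry"              # 4xx, plus 5xx codes we treat as soft: back off
    HARD_BOUNCE = "hard_bounce"  # other 5xx: stop and suppress the address


# Assumption: 552 (mailbox full) is retried, as in the list above.
SOFT_5XX = {552}


def classify_reply(code: int) -> Outcome:
    """Map a final three-digit SMTP reply code to the mailer's next step."""
    first_digit = code // 100
    if first_digit == 2:
        return Outcome.DELIVERED
    if first_digit == 4 or code in SOFT_5XX:
        return Outcome.RETRY
    if first_digit == 5:
        return Outcome.HARD_BOUNCE
    # 1xx and 3xx replies (such as 354 after DATA) are not final outcomes.
    raise ValueError(f"not a final SMTP reply code: {code}")


for code in (250, 421, 450, 451, 550, 552):
    print(code, classify_reply(code).value)
```

Many servers also return an enhanced status code such as 5.1.1 next to the basic one, and when it is present it can refine this decision.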
Backoff is a brake, not a suggestion
Think of exponential backoff with jitter as cruise control for your mailer. Start with a short wait on the first failure, then increase the delay for each retry, and shuffle the exact timing so you do not send a burst. A practical schedule that has treated us well:
- First retry around 1 minute
- Then 5 minutes
- Then 15 minutes
- Then 1 hour
- Then 6 hours
- Then every 12 hours until a final cutoff
Spread those with a bit of randomness so a thousand messages do not wake up in the same second. Set a time-to-live per message; many teams pick two to five days. If a message still has not gone through by then, it probably never will, and your sender reputation matters more than winning a single delivery.
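A sketch of that schedule in Python, assuming the next wait is computed each time an attempt fails. `next_retry_delay` is a hypothetical helper, and the 20 percent jitter and three-day TTL are placeholder values picked from the ranges above.

```python
import random
from datetime import timedelta
from typing import Optional

# Retry waits from the schedule above; after the last one, retry every 12 hours.
SCHEDULE = [
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=15),
    timedelta(hours=1),
    timedelta(hours=6),
]
STEADY_STATE = timedelta(hours=12)
TTL = timedelta(days=3)  # assumption: somewhere in the two-to-five-day range
JITTER = 0.2             # assumption: spread each wait by +/- 20 percent


def next_retry_delay(attempt: int, age: timedelta,
                     rng: Optional[random.Random] = None) -> Optional[timedelta]:
    """Wait before retry number `attempt` (1-based) for a message queued for
    `age`, or None once the next try would land past the TTL."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    base = SCHEDULE[attempt - 1] if attempt <= len(SCHEDULE) else STEADY_STATE
    r = rng if rng is not None else random
    # Jitter: scale the base wait by a random factor so retries do not line up.
    delay = base * r.uniform(1 - JITTER, 1 + JITTER)
    if age + delay > TTL:
        return None  # give up: generate a bounce instead of retrying forever
    return delay


print(next_retry_delay(1, timedelta(0)))                  # between 48 and 72 seconds
print(next_retry_delay(7, timedelta(days=2)))             # between 9.6 and 14.4 hours
print(next_retry_delay(7, timedelta(days=2, hours=20)))   # None: past the TTL
```

Scaling the wait rather than adding a fixed random offset means the spread grows with the delay, which is where bunching hurts most.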
Queues win games
Do not funnel everything through one giant queue. Break it down by recipient domain and by priority. That lets you slow down for a single domain that is rate limiting you without freezing the rest of your traffic. Keep separate per-domain counters for connections, outstanding commands, and error streaks. When a domain starts to defer you, drop your concurrency and lengthen your waits just for that lane.
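One way to sketch a per-domain lane in Python: each domain gets its own connection cap that halves on a deferral and creeps back up by one on success. That additive-increase, multiplicative-decrease policy is swapped in here as a reasonable choice, not a standard. `DomainLane`, `lane_for`, and the ceiling of 20 connections are hypothetical, and a fuller version would also track outstanding commands and priority.

```python
from dataclasses import dataclass, field


@dataclass
class DomainLane:
    """One send lane per recipient domain. Limits are illustrative guesses."""
    domain: str
    max_concurrency: int = 20          # assumption: ceiling for a healthy domain
    concurrency: int = field(init=False, default=0)  # current connection cap
    in_flight: int = 0                 # connections currently open
    error_streak: int = 0              # consecutive deferrals from this domain
    backoff_multiplier: float = 1.0    # stretches this lane's retry waits

    def __post_init__(self) -> None:
        self.concurrency = self.max_concurrency

    def can_open(self) -> bool:
        return self.in_flight < self.concurrency

    def on_deferral(self) -> None:
        # Back off hard: halve the cap (never below 1) and double the waits.
        self.error_streak += 1
        self.concurrency = max(1, self.concurrency // 2)
        self.backoff_multiplier = min(8.0, self.backoff_multiplier * 2)

    def on_success(self) -> None:
        # Recover gently: one extra connection at a time, up to the ceiling.
        self.error_streak = 0
        self.concurrency = min(self.max_concurrency, self.concurrency + 1)
        self.backoff_multiplier = max(1.0, self.backoff_multiplier / 2)


lanes: dict[str, DomainLane] = {}


def lane_for(recipient: str) -> DomainLane:
    domain = recipient.rsplit("@", 1)[-1].lower()
    if domain not in lanes:
        lanes[domain] = DomainLane(domain)
    return lanes[domain]


gmail = lane_for("alice@gmail.com")
gmail.on_deferral()
gmail.on_deferral()
print(gmail.concurrency, gmail.backoff_multiplier)   # 5 4.0
print(lane_for("bob@yahoo.com").concurrency)         # 20, untouched
```

Backing off fast and recovering slowly keeps one unhappy provider from dragging the rest of the traffic down, while still finding its way back to full speed.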
Make each delivery idempotent. If your process crashes between DATA and QUIT, you should be able to retry without sending a duplicate. Store per-message state and the last known server response, and persist the 250 reply to the end of DATA the moment it arrives, so you do not have to guess what happened.
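A minimal sketch of per-message state in Python, using an in-memory SQLite table as a stand-in for whatever durable store you actually run. The table, `record`, `deliver`, and the `send` callable are hypothetical; the point is that the state is committed before and right after the SMTP transaction so a restart can check it.

```python
import sqlite3
from typing import Callable, Optional, Tuple

# Hypothetical per-message state. A real mailer would use durable storage;
# in-memory SQLite keeps the sketch self-contained.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE delivery ("
    " message_id TEXT PRIMARY KEY,"
    " status TEXT NOT NULL,"  # 'sending', 'accepted', 'deferred', 'failed'
    " last_reply TEXT)"
)


def record(message_id: str, status: str, reply: Optional[str] = None) -> None:
    with db:  # commit right away so a crash cannot lose the state change
        db.execute(
            "INSERT OR REPLACE INTO delivery (message_id, status, last_reply) "
            "VALUES (?, ?, ?)",
            (message_id, status, reply),
        )


def should_attempt(message_id: str) -> bool:
    row = db.execute(
        "SELECT status FROM delivery WHERE message_id = ?", (message_id,)
    ).fetchone()
    # Never resend a message the server already accepted, or one that hard failed.
    return row is None or row[0] not in ("accepted", "failed")


def deliver(message_id: str, send: Callable[[], Tuple[int, str]]) -> None:
    """`send` is a hypothetical callable that runs the SMTP transaction and
    returns the server's reply to the end of DATA."""
    if not should_attempt(message_id):
        return
    record(message_id, "sending")
    code, text = send()
    # Persist the reply before anything else. A crash between the server's 250
    # and this write can still cause a duplicate; the goal is a tiny window.
    if 200 <= code < 300:
        status = "accepted"
    elif 400 <= code < 500:
        status = "deferred"
    else:
        status = "failed"  # simplified: ignores the 552 mailbox-full nuance
    record(message_id, status, f"{code} {text}")
```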
Handle bounces like a grown-up
Bounces arrive as DSNs (delivery status notifications). Some are neat and some are a mess. You need a parser that classifies each one as hard or soft. A hard bounce means you remove that address from your list right away. A soft bounce means you keep the address but track a counter. Three to five soft bounces in a row is a good cutoff for pausing the address and asking the user to confirm later with a smaller message.
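Here is a rough Python sketch of DSN handling built on the standard library `email` package, which parses a `message/delivery-status` part into one `Message` per header block. It classifies on the enhanced status code in the `Status` field. `handle_dsn`, the two sets, and the three-bounce limit are hypothetical, and bounces that are not proper DSNs need heuristics this sketch skips.

```python
import email
from collections import Counter

SOFT_BOUNCE_LIMIT = 3      # assumption: pause after three soft bounces in a row
soft_streaks: Counter = Counter()
suppressed: set = set()    # hard bounces: never mail again
paused: set = set()        # too many soft bounces: ask the user to reconfirm


def dsn_results(raw: str):
    """Yield (recipient, status) pairs from a multipart/report DSN."""
    msg = email.message_from_string(raw)
    for part in msg.walk():
        if part.get_content_type() != "message/delivery-status":
            continue
        # Each header block in this part is parsed into its own Message;
        # the per-recipient blocks carry Final-Recipient.
        for block in part.get_payload():
            recipient = block.get("Final-Recipient")
            status = (block.get("Status") or "").split()
            if recipient and status:
                # "rfc822; alice@example.org" -> "alice@example.org"
                address = recipient.split(";", 1)[-1].strip().lower()
                yield address, status[0]


def classify(status: str) -> str:
    # Enhanced status codes: class 5 is permanent, class 4 is transient.
    # 5.2.2 (mailbox full) is treated as soft, as in the code list above.
    if status.startswith("4.") or status == "5.2.2":
        return "soft"
    if status.startswith("5."):
        return "hard"
    return "ignore"  # e.g. 2.x.x success notices


def handle_dsn(raw: str) -> None:
    for address, status in dsn_results(raw):
        kind = classify(status)
        if kind == "hard":
            suppressed.add(address)
        elif kind == "soft":
            soft_streaks[address] += 1
            if soft_streaks[address] >= SOFT_BOUNCE_LIMIT:
                paused.add(address)


def on_delivered(address: str) -> None:
    soft_streaks.pop(address, None)  # a successful delivery resets the streak


SAMPLE_DSN = """\
From: MAILER-DAEMON@mx.example.net
To: bounces@mail.example.com
Subject: Undelivered Mail Returned to Sender
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status; boundary="XYZ"

--XYZ
Content-Type: text/plain

The message could not be delivered.

--XYZ
Content-Type: message/delivery-status

Reporting-MTA: dns; mx.example.net

Final-Recipient: rfc822; alice@example.org
Action: failed
Status: 5.1.1
Diagnostic-Code: smtp; 550 5.1.1 user unknown

--XYZ--
"""

handle_dsn(SAMPLE_DSN)
print(suppressed)  # {'alice@example.org'}
```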
Use VERP (variable envelope return path) so you can map a bounce back to the exact recipient and campaign without guessing. If you send newsletters, hook into the feedback loops run by the big providers, so when someone hits spam you can stop mailing them. Complaints hurt more than bounces.
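VERP works by giving every message its own envelope sender, so the bounce comes back addressed to something that already names the recipient. A Python sketch, assuming a hypothetical `bounces@mail.example.com` mailbox that accepts plus-addressed variants and campaign ids without a `+` in them; the exact encoding is an illustrative choice.

```python
from typing import Optional, Tuple

BOUNCE_LOCAL = "bounces"            # hypothetical bounce mailbox
BOUNCE_DOMAIN = "mail.example.com"  # hypothetical bounce domain you control


def verp_encode(recipient: str, campaign: str) -> str:
    """Build a per-recipient envelope sender (MAIL FROM / Return-Path)."""
    if "+" in campaign:
        raise ValueError("campaign ids must not contain '+'")
    local, domain = recipient.rsplit("@", 1)
    return f"{BOUNCE_LOCAL}+{campaign}+{local}={domain}@{BOUNCE_DOMAIN}"


def verp_decode(address: str) -> Optional[Tuple[str, str]]:
    """Recover (recipient, campaign) from the address a bounce was sent to."""
    local, _, domain = address.rpartition("@")
    prefix = BOUNCE_LOCAL + "+"
    if domain.lower() != BOUNCE_DOMAIN or not local.startswith(prefix):
        return None
    campaign, sep, encoded = local[len(prefix):].partition("+")
    if not sep or "=" not in encoded:
        return None
    # Split on the last '=' because domain names cannot contain one.
    rcpt_local, rcpt_domain = encoded.rsplit("=", 1)
    return f"{rcpt_local}@{rcpt_domain}", campaign


addr = verp_encode("alice+news@example.org", "welcome-42")
print(addr)
print(verp_decode(addr))
```

Splitting the recipient on the last `=` keeps decoding correct even when the original local part contains a `+` or an `=`.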
Authentication is table stakes
Sign with DKIM. Publish SPF. Publish a DMARC policy too; since 2024, Gmail and Yahoo require one for bulk senders. These do not guarantee delivery, but they remove a bunch of easy reasons to say no. Also set up reverse DNS so your PTR record matches your HELO name. Use TLS (STARTTLS) wherever the other side supports it. None of this is fancy. It is the boring foundation you want.
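When you check a test send, the receiving provider's verdicts usually show up in an `Authentication-Results` header (RFC 8601). A small Python sketch that pulls them out with the standard `email` package and a regular expression; the sample header is made up, and `auth_results` is a hypothetical helper. Only the header added by your own mailbox provider (normally the topmost one) should be trusted, since anything below it could have been written by the sender.

```python
import email
import re
from typing import Dict

# Matches method=result pairs such as "dkim=pass" or "spf=softfail".
RESULT_RE = re.compile(r"\b(spf|dkim|dmarc)=([a-z]+)", re.IGNORECASE)


def auth_results(raw_message: str) -> Dict[str, str]:
    """SPF, DKIM and DMARC verdicts from a message's Authentication-Results."""
    msg = email.message_from_string(raw_message)
    verdicts: Dict[str, str] = {}
    # get_all returns headers top to bottom; keep the first verdict per method,
    # which comes from the topmost (most recently added) header.
    for header in msg.get_all("Authentication-Results", []):
        for method, result in RESULT_RE.findall(str(header)):
            verdicts.setdefault(method.lower(), result.lower())
    return verdicts


SAMPLE = """\
Authentication-Results: mx.example.net;
 dkim=pass header.i=@example.com;
 spf=pass smtp.mailfrom=bounces@mail.example.com;
 dmarc=pass header.from=example.com
From: news@example.com
To: you@example.net
Subject: Test send

Hello from the test send.
"""

print(auth_results(SAMPLE))  # {'dkim': 'pass', 'spf': 'pass', 'dmarc': 'pass'}
```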
Warm up and ramp sanely
If you are sending from a new IP, warm it up with small daily increases. Jumping from zero to a big blast looks like a bot. Start with a small, trusted segment, favor engaged recipients, and build a good pattern. Watch replies and opens as well as errors. Mailbox providers' filters weigh engagement too.
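A warm-up ramp can be planned as a simple geometric schedule. The Python sketch below assumes a fixed growth factor per day and holds the cap flat when deferrals rise; `warmup_plan`, `next_cap`, and every number in it are placeholders to tune against what your providers report, not recommended values.

```python
def warmup_plan(start: int, target: int, growth: float = 1.5) -> list[int]:
    """Daily send caps for a new IP, growing by `growth` per day up to `target`."""
    if start <= 0 or target < start or growth <= 1:
        raise ValueError("need 0 < start <= target and growth > 1")
    plan, volume = [], start
    while volume < target:
        plan.append(volume)
        # Always grow by at least one message so small starts cannot stall.
        volume = max(volume + 1, int(volume * growth))
    plan.append(target)
    return plan


def next_cap(today_cap: int, defer_rate: float, growth: float = 1.5,
             max_defer_rate: float = 0.05) -> int:
    """Grow tomorrow's cap only if today's deferrals stayed under the limit."""
    return int(today_cap * growth) if defer_rate <= max_defer_rate else today_cap


print(warmup_plan(100, 1000, 2.0))  # [100, 200, 400, 800, 1000]
```

The fixed plan is the ideal path; `next_cap` is the brake that lets the real ramp fall behind it when a provider pushes back.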
What managers should track
You do not need to read raw SMTP to manage this well. You do need a dashboard with a few simple lines that never lie:
- Delivered rate by domain
- Deferred rate by domain and by hour
- Hard and soft bounces with reasons
- Complaint rate from feedback loops
- Queue age and size trends
- Concurrency per domain and total
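Most of those lines come from counting delivery outcomes per domain. A minimal Python sketch, assuming a hypothetical event stream of `(domain, outcome)` pairs with made-up outcome labels; bucketing by hour is the same idea with the hour added to the key.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

ATTEMPT_OUTCOMES = ("delivered", "deferred", "hard_bounce", "soft_bounce")


def domain_report(events: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Per-domain rates from (domain, outcome) events.

    Outcomes are hypothetical labels: the four in ATTEMPT_OUTCOMES plus
    "complaint" from feedback-loop reports.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for domain, outcome in events:
        counts[domain.lower()][outcome] += 1

    report = {}
    for domain, c in sorted(counts.items()):
        attempts = sum(c[o] for o in ATTEMPT_OUTCOMES)
        delivered = c["delivered"]
        bounces = c["hard_bounce"] + c["soft_bounce"]
        report[domain] = {
            "delivered_rate": delivered / attempts if attempts else 0.0,
            "deferred_rate": c["deferred"] / attempts if attempts else 0.0,
            "bounce_rate": bounces / attempts if attempts else 0.0,
            # Complaints are measured against delivered mail, not attempts.
            "complaint_rate": c["complaint"] / delivered if delivered else 0.0,
        }
    return report


events = (
    [("gmail.com", "delivered")] * 8
    + [("gmail.com", "deferred")] * 2
    + [("yahoo.com", "delivered")] * 4
    + [("yahoo.com", "complaint")]
)
for domain, rates in domain_report(events).items():
    print(domain, rates)
```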
Set alerts on defer spikes and on queue age. A sudden jump can mean a block at one provider or a routing hiccup. When that happens, slow your send to that domain, extend backoff, and tell support what you see so they can answer customers with real info instead of guesswork.
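Both alerts can start as simple threshold checks. A Python sketch, assuming you already have hourly defer rates per domain and the enqueue time of each waiting message; `defer_spike`, `queue_too_old`, the 3x factor, the 5 percent floor, and the two-hour age limit are made-up starting points.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List

DEFER_SPIKE_FACTOR = 3.0            # assumption: alert at 3x the trailing average
DEFER_FLOOR = 0.05                  # assumption: ignore anything under 5 percent
MAX_QUEUE_AGE = timedelta(hours=2)  # assumption: tune to your own schedule


def defer_spike(trailing_rates: List[float], current_rate: float) -> bool:
    """True if this hour's defer rate for a domain jumped well above recent hours."""
    baseline = mean(trailing_rates) if trailing_rates else 0.0
    return current_rate >= DEFER_FLOOR and current_rate > DEFER_SPIKE_FACTOR * baseline


def queue_too_old(enqueued_at: List[datetime], now: datetime) -> bool:
    """True if the oldest message still waiting has been queued too long."""
    return bool(enqueued_at) and now - min(enqueued_at) > MAX_QUEUE_AGE


now = datetime(2024, 1, 1, 12, 0)
print(defer_spike([0.01, 0.02, 0.01], 0.12))              # True
print(queue_too_old([now - timedelta(minutes=30)], now))  # False
```

The floor keeps a jump from 0.1 percent to 0.4 percent from paging anyone, while a real block at one provider clears both bars quickly.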
Pair with the marketing team on list hygiene. Confirmed opt-in is boring, and it works. Trim inactives. Remove hard bounces right away. Keep complaint rates tiny. That earns forgiveness when you hit a rough patch with a provider.
The quiet craft of reliable email
There is a rhythm in good delivery. You respect 4xx. You stop on 5xx. You never flood. You track what you send. You say sorry by slowing down, not by saying sorry. Do that and the mailbox providers treat you like a neighbor, not a stranger.
Your turn this week
Pick one or two of these and make them real:
- Write down your retry schedule with backoff and jitter. Is it written somewhere the team can see?
- Split your send queue by domain. Add per-domain concurrency caps.
- Parse DSNs and separate hard and soft bounces. Remove hard bounces the same day.
- Turn on DKIM and SPF. Check them with a test send to a personal mailbox and look at the message headers.
- Enroll in the major feedback loops and wire complaints to an automatic suppression list.
- Add a queue age graph and an alert for a spike.
- Plan a calm IP warm-up if you are moving providers or opening a new pool.
Reliable delivery is not glamorous. It is a handful of rules and a little patience. Get those right and your fans will see your message, your team will stop firefighting, and your weekends will feel like weekends again.