the bug: the shutdown signal that never arrived

2025 · written 2026

TL;DR

About a quarter of our daily calls were getting stuck. They never reached a final state, so the customer never got the result. This ran for days.
Every call should end somewhere: processed, not picked up, or do not disturb. These calls hung in the middle instead.
The trail led to Kubernetes. Autoscaling shrank the worker pool, and the pods took the full 30-second grace period to die, then got force-killed mid-call.
The real bug was one layer down. A shell script ran as PID 1 and never passed the shutdown signal to the worker. The worker had no idea it was being killed.
The fix was one word: exec, so the real process becomes PID 1 and gets the signal itself. The same template had copied this bug into three or four other products, so the one fix cleared almost all our reliability issues overnight.

Roughly a quarter of our calls were getting stuck, and for two days I could not see why. The logs showed nothing wrong.

What a call is supposed to do

This is the same Voice AI platform from the fair-queue post. It places outbound calls for customer campaigns, and every call moves through a lifecycle:

Initiated ──▶ Dialing ──▶ Ringing ──▶ In Progress ──▶ Connected ──▶ Processed

Processed is the happy ending. The call ran, the notes were generated, the dispositions were written. Not every call gets there, and that is fine. A call has terminal states, and Processed is only one of them. "Not picked up" is another: the campaign retries a lead a set number of times, over a cooldown interval, and if nobody answers, the lead is done. "Do not disturb" is another.

The rule is simple. Every lead has to end somewhere. It reaches a terminal state and it stops.

The symptom

Some calls never reached one.

They got stuck. Some sat in Initiated. Some sat in In Progress. Some ran to the end of the call but never flipped to Processed. They just hung, halfway through their own lifecycle, with nothing moving them forward.

This was about a quarter of daily calls, across every customer and every campaign. For the customer, a stuck call is a lost lead. The work was paid for and never delivered. It was a real business loss, and it was happening every day.

Two days of logs

I logged everything and stared at it. For a couple of days, nothing fit. The calls did not error. The workers did not crash in any way the logs would admit to. Leads just stopped moving, and I could not find the hand that stopped them.

So I stopped looking at the calls and looked at the workers.

The clue: pods that took too long to die

We ran the calling workers on Kubernetes and autoscaled them on queue depth. More leads in the queue, more worker pods. Fewer leads, fewer pods. The scaling itself worked.

What was off was how the pods shut down. A worker should drain and exit in about five seconds. Ten at most, if it had prefetched a message it still had to finish. These were taking the full thirty.

Thirty is the number that mattered. Kubernetes gives a pod a grace period to exit on its own, and the default is thirty seconds. After that, it stops asking and sends SIGKILL. The pod dies on the spot, whatever it was doing.

So every time we scaled down, some pods got force-killed. And a pod that gets force-killed while it is ringing a customer takes that call down with it. The call's state never advances. There were my stuck calls.

The root cause: PID 1 ate the signal

That left one question. Why was a five-second shutdown taking thirty seconds? Because the graceful part never started.

When Kubernetes wants a pod gone, it sends SIGTERM to PID 1, the first process in the container. PID 1 is supposed to catch that signal and begin shutting down.

Our service was booted by a shell script, entrypoint.sh, which launched the app with uvicorn. The shell was PID 1, and uvicorn ran as its child. Kubernetes sent SIGTERM to the shell, and the shell did nothing with it. A shell does not forward signals to its child processes unless you make it. It did not even die from the signal itself, because Linux only delivers to PID 1 the signals it has installed a handler for, and the script had none. So uvicorn, running underneath, never got the signal. It had no idea Kubernetes wanted it gone. It just kept working, ringing a customer, until SIGKILL ended it thirty seconds later.

k8s ──SIGTERM──▶ entrypoint.sh ──╳── worker
                 (PID 1)             (never told)

        ...30s grace period, worker keeps calling...

k8s ──SIGKILL──▶ pod dies            worker killed mid-call
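
A quick way to tell whether a container is set up like this is to look at what is actually sitting at PID 1. Here is a minimal sketch, assuming a Linux container with /proc mounted and readlink available. The function names is_shell_name and pid1_is_shell are made up for illustration, and the list of shell names is not exhaustive.

```shell
#!/bin/sh
# Hypothetical helpers: the names are illustrative, not from the original service.

# Succeeds (exit 0) if the given program name is a common shell.
is_shell_name() {
    case "$1" in
        sh|bash|dash|ash|zsh|ksh|busybox) return 0 ;;
        *) return 1 ;;
    esac
}

# Looks up the binary behind PID 1 through /proc (Linux only).
# Returns 0 if it is a shell, 1 if it is something else,
# and 2 if the link cannot be read (no /proc, or no permission).
pid1_is_shell() {
    exe=$(readlink /proc/1/exe 2>/dev/null) || return 2
    is_shell_name "$(basename "$exe")"
}
```

By hand, running readlink /proc/1/exe inside the pod (for example through kubectl exec) shows the same thing. If it points at a shell rather than the worker's own runtime, the worker is probably not the process receiving SIGTERM.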

The fix

The fix was one word. Make the real process PID 1, so the signal reaches it directly.

The shell ran uvicorn as a normal command, which keeps the shell alive as the parent. Put exec in front, and the shell replaces itself with uvicorn. No more shell in the middle. uvicorn becomes PID 1 and gets the SIGTERM itself. It stops taking new calls, finishes what it is holding, and exits, well inside the grace period.

# entrypoint.sh

# before: the shell stays PID 1, uvicorn is its child and never sees SIGTERM
uvicorn app.main:app --host 0.0.0.0 --port 8000

# after: exec replaces the shell with uvicorn, so uvicorn becomes PID 1
exec uvicorn app.main:app --host 0.0.0.0 --port 8000
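
The difference is easy to see outside a container too. The sketch below is not our entrypoint. It writes two throwaway scripts to a temp directory (the file names worker.sh, plain.sh, and with_exec.sh are invented for the demo), lets a stand-in worker print its own PID, and compares that with the PID of the script that launched it.

```shell
#!/bin/sh
# Illustrative only: throwaway scripts in a temp dir, not the real entrypoint.
dir=$(mktemp -d)

# Stand-in for the worker: prints its own PID.
echo 'echo "$$"' > "$dir/worker.sh"

# Without exec: the shell stays alive and runs the worker as a child process.
# The trailing ":" keeps the worker from being the script's last command,
# since some shells quietly exec the final command on their own.
cat > "$dir/plain.sh" <<EOF
printf '%s ' "\$\$"
sh "$dir/worker.sh"
:
EOF

# With exec: the shell replaces itself with the worker and keeps the same PID.
cat > "$dir/with_exec.sh" <<EOF
printf '%s ' "\$\$"
exec sh "$dir/worker.sh"
EOF

# Each result is "<entrypoint pid> <worker pid>".
plain_pids=$(sh "$dir/plain.sh")
exec_pids=$(sh "$dir/with_exec.sh")
echo "without exec: $plain_pids"
echo "with exec:    $exec_pids"

rm -rf "$dir"
```

The first line should show two different PIDs and the second the same PID twice. In a container, that shared PID is 1, which is the whole point.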

Pods went back to shutting down in seconds. The force-kills stopped. The stuck calls stopped. We recovered the ones already stuck, and customer satisfaction (CSAT) climbed back.

The same bug, four times over

Then it got bigger. We had a shared template for these worker services, and every team had copied it. The template ran the worker without exec. So the same bug was sitting in three or four other products, waiting for a scale-down to set it off.

It explained something nobody had tied back to this. Some of our calls finished, then got handed to another product's service to process. Many never came back processed. That service was being force-killed mid-work too, for the exact same reason.

I took the fix around and told every team the same thing. You are running the worker wrong. Put exec in front of it. Overnight, almost every reliability issue we had went away. One word.

In a container, PID 1 is a job, not an afterthought. Either make PID 1 forward signals, or use exec so your real process is PID 1 and gets them directly.
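
For the first option, here is a rough sketch of what forwarding can look like when the script has to stay alive, for example to clean up after the worker exits. This is not what we shipped. The names run_and_forward and forward_term are made up, and the sketch only handles the first signal it receives.

```shell
#!/bin/sh
# Sketch of the "forward signals" option. run_and_forward and forward_term
# are made-up names, not something from the original entrypoint.

child=""
got_signal=0

# Trap handler: remember that a signal arrived and pass SIGTERM to the child.
forward_term() {
    got_signal=1
    if [ -n "$child" ]; then
        kill -TERM "$child" 2>/dev/null || :
    fi
}

run_and_forward() {
    trap forward_term TERM INT

    "$@" &          # start the real process in the background
    child=$!

    status=0
    wait "$child" || status=$?

    # A trapped signal makes wait return early, before the child has exited.
    # Wait again so the shell outlives the child and reports its real status.
    if [ "$got_signal" -eq 1 ]; then
        status=0
        wait "$child" || status=$?
    fi

    trap - TERM INT
    return "$status"
}
```

In an entrypoint this would wrap the same command, along the lines of run_and_forward uvicorn app.main:app --host 0.0.0.0 --port 8000. It is more code, and more ways to get it wrong, than the single exec. When the script has nothing left to do once the worker starts, exec is the simpler choice.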

The symptom was at the top of the stack, stuck calls sitting in the database. The bug was at the bottom, a signal that never crossed one process boundary. There was no error to grep for. I found it by following the timing, not the logs.