Convin: a fair queue for many tenants
TL;DR
- Convin's Voice AI platform places outbound calls for customer campaigns. I rebuilt the call dispatch behind it and took it from zero to one. It became the flagship product.
- The first design poured every campaign into one shared queue. One customer: fine. Two customers: the second waited five to ten minutes for its first call.
- The cause was head-of-line blocking. One FIFO queue served everyone, so the first campaign drained before the second got a turn.
- The fix was a small scheduler. Every five seconds it takes a batch from each active campaign and feeds the queue round-robin. Nobody waits behind anybody.
- Per-tenant queues were the cleaner fix. I documented them and skipped them. They needed heavy infra work, and we were moving to self-serve.
The platform worked great for one customer. The day we added the second, it stalled.
The platform
Convin sells a Voice AI platform. A customer uploads a campaign: a list of leads to call. The platform dials each lead, and a bot runs the conversation. Speech to text, a model picks the reply, text to speech back. The phone provider was swappable, so dialing was never the limit. Feeding the dialer was.
When I picked this up, the calling was a for loop. It walked the list and placed one call at a time. That works in a demo. In production it falls over: one process, no retries, no backpressure, no way to balance one customer against another.
So I rebuilt it as a producer and a consumer. A campaign's leads go into a RabbitMQ queue. A pool of workers pulls from the queue and places the calls. It scaled fine for one customer. It became the flagship product.
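The producer and consumer split can be sketched in a few lines. This is a minimal in-process stand-in, not the production code: it uses Python's standard `queue.Queue` and threads in place of RabbitMQ and a worker fleet, and `run_campaign`, `place_call`, and `num_workers` are hypothetical names chosen for illustration.

```python
import queue
import threading


def run_campaign(leads, place_call, num_workers=4):
    """Publish every lead to one queue and let a pool of workers drain it.

    queue.Queue stands in for the RabbitMQ queue (an assumption for this
    sketch); place_call is a hypothetical function that dials one lead.
    """
    calling_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            lead = calling_queue.get()
            if lead is None:              # sentinel: no more work for this worker
                calling_queue.task_done()
                return
            outcome = place_call(lead)
            with results_lock:
                results.append((lead, outcome))
            calling_queue.task_done()

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()

    for lead in leads:                    # producer side: enqueue the whole campaign
        calling_queue.put(lead)
    for _ in workers:                     # one sentinel per worker
        calling_queue.put(None)

    calling_queue.join()                  # wait until every item has been processed
    for w in workers:
        w.join()
    return results
```

The point of the split is that the producer no longer waits on each call. It only enqueues, and the number of workers decides how many calls run at once.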
The second customer
The design had one shared queue. Every campaign drained into it, in order.
With one customer, that is invisible. With two, it is not. Customer A starts a campaign with tens of thousands of leads. They all land in the queue. A minute later, customer B starts a campaign. Their leads land behind A's. The queue is FIFO, so the workers finish A before they reach B.
B clicked start and nothing happened. Five, ten minutes of silence before the first call went out. The capacity was there. A was spending all of it. This is head-of-line blocking: one big item at the front of a single queue holds up everything behind it.
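The size of that wait is simple arithmetic: B's first lead sits behind every lead already queued, and the workers clear them at a fixed rate. A rough sketch follows. The 30,000 leads and 50 calls per second are invented for illustration, not measured figures from the platform, and `seconds_until_first_call` is a hypothetical helper.

```python
from collections import deque


def seconds_until_first_call(queue_contents, tenant, calls_per_second):
    """Time until the first lead owned by `tenant` reaches a worker.

    Assumes a strict FIFO queue drained at a constant rate.
    """
    for position, (owner, _lead) in enumerate(queue_contents):
        if owner == tenant:
            return position / calls_per_second
    return None  # the tenant has nothing queued


# Illustrative numbers only: A queues 30,000 leads, then B queues 100.
fifo = deque(("A", i) for i in range(30_000))
fifo.extend(("B", i) for i in range(100))

wait = seconds_until_first_call(fifo, "B", calls_per_second=50)
print(wait)  # 600.0 seconds, i.e. ten minutes
```

Adding workers shortens the wait but does not remove it. B still waits for all of A, just faster.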
A fair queue
The fix was to stop pouring whole campaigns into the queue. I put a small scheduler in front of it.
Every five seconds, the scheduler looks at every active campaign across every tenant. It takes a batch from each one, round-robin, and publishes that batch to the queue. The batch size is a knob.
# scheduler: runs one round-robin pass every 5 seconds
import time

while True:
    campaigns = active_campaigns()           # across all tenants
    for campaign in campaigns:               # one turn each, round-robin
        leads = campaign.take(BATCH_SIZE)    # batch size is tunable
        for lead in leads:
            calling_queue.publish(lead)
    time.sleep(5)
Now the queue always holds an interleaved mix. No campaign can hog the workers, because no campaign adds more than one batch before everyone else gets a turn.
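To see the interleaving, here is a small in-memory simulation of the scheduler passes. It is a sketch under simplifying assumptions: campaigns are plain deques, there is no real queue or sleep, and `schedule_tick` is a hypothetical name for a single pass.

```python
from collections import deque


def schedule_tick(campaigns, batch_size):
    """One scheduler pass: take up to batch_size leads from each campaign."""
    published = []
    for name, leads in campaigns.items():
        for _ in range(min(batch_size, len(leads))):
            published.append((name, leads.popleft()))
    return published


# Illustrative data: A has a larger campaign than B.
campaigns = {
    "A": deque(range(6)),
    "B": deque(range(2)),
}

queue_order = []
while any(campaigns.values()):          # an empty deque is falsy
    queue_order.extend(schedule_tick(campaigns, batch_size=2))

print(queue_order)
# [('A', 0), ('A', 1), ('B', 0), ('B', 1), ('A', 2), ('A', 3), ('A', 4), ('A', 5)]
```

B's first lead lands at position 2 instead of behind all six of A's. However large A's campaign gets, at most one batch of A sits ahead of B in any pass.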
A new customer's first call now goes out in seconds, however large the campaigns ahead of it. This is what unlocked the next level of scale. Onboarding a customer stopped putting every other customer at risk.
One FIFO queue is fair by accident, and only when there is one source of work. Add tenants and fairness becomes a decision you have to make on purpose.
Why not a queue per tenant?
The clean fix is full isolation: one queue and a dedicated pool of workers per tenant. No tenant can touch another tenant's capacity at all. I wrote it up as phase two.
I skipped it on purpose. Per-tenant queues mean provisioning and scaling infrastructure for every customer. That is a lot of automation to build and run. We were moving the product to self-serve, where a customer signs up and starts calling without us touching anything. A scheduler in front of one shared queue bought most of the fairness for none of that cost.
Phase two is still the right next step once customers need hard isolation or per-tenant SLAs. Until then, the scheduler was the right amount of engineering.
The small fix that clears the bottleneck beats the complete fix that needs new infrastructure. Ship the scheduler. Document the queues.
What actually scaled it
This was one piece. The platform went from 30K to 1M calls a day over the year, a 33x jump. That came from a string of changes like this one, plus a lot of reliability work. One reliability fix was for a bug that was quietly failing a quarter of all calls. That is its own post.
What this change did was clear the first ceiling. Before the scheduler, the platform could not safely hold more than one busy customer. After it, adding customers no longer broke it. Every later optimization sat on top of that.