How AI Answers Business Calls, Step by Step
Your phone rings. You're up a ladder, with a patient, or in a meeting. Three seconds later, something answers in a warm, natural voice, greets the caller by your business name, books them in for Thursday at two, and emails you a summary before you've climbed down. No human touched the call. This article explains exactly how that works — the full pipeline, in plain language, including the parts vendors prefer not to talk about.
In short: when a call diverts to an AI phone assistant, speech recognition transcribes the caller in real time, a language model grounded in your business information decides the reply, and lifelike text-to-speech speaks it back, all in under a second. The AI books appointments, answers questions, switches languages, and hands difficult calls to a human.
What happens in the first second of a call?
Nothing about your phone line changes. Your number stays your number. What changes is a setting called call divert (call forwarding), which every phone provider supports: you tell your line to send calls to the AI assistant's number — either for every call, only when you're busy, only when you don't pick up within a few rings, or only outside business hours.
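The four divert conditions can be pictured as a small decision function. This is only a sketch: on a real line the rule is set with your phone provider rather than in code, and the names `DivertMode`, `LineState` and `should_divert`, the four-ring threshold and the 09:00 to 17:00 opening window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import time
from enum import Enum


class DivertMode(Enum):
    ALWAYS = "always"
    WHEN_BUSY = "when_busy"
    NO_ANSWER = "no_answer"
    OUT_OF_HOURS = "out_of_hours"


@dataclass
class LineState:
    busy: bool             # the owner is already on another call
    rings_unanswered: int  # rings so far without a pickup
    now: time              # local time of the incoming call


def should_divert(mode, state, max_rings=4, opens=time(9, 0), closes=time(17, 0)):
    """Return True if the call should go to the assistant (illustrative thresholds)."""
    if mode is DivertMode.ALWAYS:
        return True
    if mode is DivertMode.WHEN_BUSY:
        return state.busy
    if mode is DivertMode.NO_ANSWER:
        return state.rings_unanswered >= max_rings
    # OUT_OF_HOURS: divert whenever the call falls outside the opening window
    return not (opens <= state.now < closes)
```

With these assumed defaults, a call at 20:00 in out-of-hours mode is diverted, while the same call at 10:00 rings your own phone as usual.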
When a diverted call arrives, the AI platform answers it instantly — there's no "please hold" and no ringing into the void. The assistant plays its greeting: your business name, a disclosure that the caller is speaking with a digital assistant (a legal requirement under the EU AI Act's transparency rules from August 2026), and an open question — *"How can I help?"*
From that moment, three systems run continuously and in parallel for the rest of the call. Understanding them is the key to understanding everything an AI receptionist can and cannot do.
How does the AI understand what the caller says?
The first system is speech-to-text (STT, also called automatic speech recognition). It converts the caller's audio into written words — but not the way an old dictation tool did, waiting for you to finish and then transcribing the whole thing. Modern STT is streaming: it produces a rolling, provisional transcript while the caller is still mid-sentence, revising earlier words as more context arrives. "I'd like to book" might first appear as "I'd like a book" and silently correct itself a few hundred milliseconds later.
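A rolling transcript of this kind can be sketched in a few lines. The event shape (a text plus an `is_final` flag) is an assumption modelled loosely on how streaming recognisers tend to report results; `RollingTranscript` is a hypothetical name, not any vendor's API.

```python
from dataclasses import dataclass, field


@dataclass
class RollingTranscript:
    """Holds finalised text plus the latest provisional hypothesis."""
    final_parts: list = field(default_factory=list)
    partial: str = ""

    def on_event(self, text: str, is_final: bool) -> None:
        if is_final:
            # The recogniser has committed to this segment
            self.final_parts.append(text)
            self.partial = ""
        else:
            # A newer provisional guess replaces the older one
            self.partial = text

    def current(self) -> str:
        parts = self.final_parts + ([self.partial] if self.partial else [])
        return " ".join(parts)


transcript = RollingTranscript()
transcript.on_event("I'd like a book", is_final=False)
transcript.on_event("I'd like to book", is_final=False)  # silent correction
print(transcript.current())  # I'd like to book
```

The point of the design is that downstream stages can start working from `current()` straight away, accepting that the provisional tail may still change.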
Two harder problems sit alongside transcription:
- Endpointing — deciding the caller has actually finished speaking. Humans do this instinctively from intonation and rhythm; machines use voice-activity detection plus learned models of how sentences end. Get it wrong in one direction and the AI interrupts the caller; get it wrong in the other and there's an awkward dead pause.
- Robustness — phone audio is narrow-band and messy. Callers ring from vans, building sites and school pickups. Good STT models are trained on exactly this kind of telephone audio, with accents and background noise, rather than on clean studio recordings.
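The endpointing trade-off is easiest to see in the simplest possible detector: wait for speech, then declare the turn finished after a fixed run of quiet frames. This is a deliberately naive sketch. Production systems add learned models on top, and the function name, energy threshold and frame counts here are assumptions.

```python
def find_endpoint(frame_energies, speech_threshold=0.1, silence_frames_needed=25):
    """Return the index of the frame where the turn is judged finished, or None.

    frame_energies: one loudness value per short audio frame (for example 20 ms).
    """
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= speech_threshold:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i
    return None


# Speech, a short mid-sentence pause, more speech, then a long silence
frames = [0.0] * 5 + [0.5] * 10 + [0.0] * 3 + [0.5] * 5 + [0.0] * 30
print(find_endpoint(frames))                           # 47
print(find_endpoint(frames, silence_frames_needed=2))  # 16
```

With a patient setting the detector rides out the three-frame pause and fires at frame 47, well after the caller stopped. With an impatient setting it fires at frame 16, in the middle of the sentence. That is the interrupt-versus-dead-pause tension in miniature.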
This is also where language detection happens. The recogniser identifies within the first phrase or two whether the caller is speaking English, German, French, Italian or Spanish, and the whole pipeline switches accordingly: same number, same call, no menu. For a fuller picture of this layer, see "What is an AI voice agent?".
How does the AI decide what to say?
The transcript flows into the second system: a large language model (LLM) — the same family of technology behind modern AI chat assistants, but constrained for the job. The model doesn't answer from its general knowledge of the internet. It answers from a knowledge base you control: your opening hours, services, prices, policies, directions, and the answers to the questions your callers actually ask. The technique is usually retrieval-based — the system looks up the relevant facts from your knowledge base and instructs the model to answer only from them.
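A toy version of the retrieval step looks like this. Word overlap stands in for whatever ranking a real product uses (often vector similarity), and the knowledge-base entries, function names and prompt wording are illustrative assumptions, not fonea's implementation.

```python
import re

KNOWLEDGE_BASE = [  # illustrative entries
    "Opening hours: Monday to Friday, 09:00 to 17:00.",
    "Parking: free spaces in the rear car park.",
    "New patient check-up: 45 minutes.",
]


def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, facts, top_k=2):
    """Rank facts by word overlap with the question; drop facts with no overlap."""
    q = tokens(question)
    scored = [(len(q & tokens(fact)), fact) for fact in facts]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [fact for _, fact in scored[:top_k]]


def build_prompt(question, facts):
    context = "\n".join(f"- {fact}" for fact in facts) or "- (no matching facts)"
    return (
        "Answer ONLY from the facts below. If they do not cover the question, "
        "say you do not know and offer to take a message.\n"
        f"Facts:\n{context}\n"
        f"Caller: {question}"
    )


print(retrieve("Is there parking nearby?", KNOWLEDGE_BASE))
# ['Parking: free spaces in the rear car park.']
```

A question the knowledge base does not cover retrieves nothing, so the prompt tells the model there are no matching facts instead of leaving it free to improvise.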
The model also holds the conversation state: it remembers that the caller said "Thursday" two turns ago, that they've already given their name, and that the purpose of the call is a booking, not a complaint. And it operates under guardrails — explicit instructions about what it may promise, what it must never invent, and which situations require a human.
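Conversation state can be as plain as a record of what has been collected so far. The sketch below assumes a booking needs a name and a day; the `CallState` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CallState:
    intent: Optional[str] = None         # for example "booking" or "complaint"
    caller_name: Optional[str] = None
    requested_day: Optional[str] = None
    turns: list = field(default_factory=list)

    def update(self, utterance: str, **slots) -> None:
        """Record the turn and keep any slot values already collected."""
        self.turns.append(utterance)
        for key, value in slots.items():
            if value is not None:
                setattr(self, key, value)

    def missing_for_booking(self) -> list:
        needed = {"caller_name": self.caller_name, "requested_day": self.requested_day}
        return [name for name, value in needed.items() if value is None]


state = CallState()
state.update("I'd like to book for Thursday", intent="booking", requested_day="Thursday")
print(state.missing_for_booking())  # ['caller_name']
state.update("It's Mrs Davies", caller_name="Mrs Davies")
print(state.requested_day)          # Thursday
```

Because "Thursday" is stored rather than re-derived, the assistant can ask only for what is still missing instead of starting the booking again.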
Crucially, the model can do more than talk. Through tool calls, it can act mid-sentence: check real availability in a connected calendar, write a confirmed appointment into it, or log a structured message. That's the difference between an assistant that *sounds* helpful and one that *is* helpful.
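In outline, a tool call is the model emitting a small structured request that the platform runs on its behalf. The JSON shape, the tool names and the in-memory calendar below are assumptions for illustration; a real deployment would call the calendar provider's API instead.

```python
import json

# In-memory stand-in for a connected calendar: None means the slot is free
CALENDAR = {"Thursday 14:00": None, "Thursday 15:00": "taken"}


def check_availability(day):
    return [slot for slot, owner in CALENDAR.items()
            if slot.startswith(day) and owner is None]


def book_slot(slot, name):
    # Unknown slots are treated as unavailable rather than created on the fly
    if CALENDAR.get(slot, "taken") is not None:
        return {"ok": False, "reason": "slot unavailable"}
    CALENDAR[slot] = name
    return {"ok": True, "slot": slot, "name": name}


TOOLS = {"check_availability": check_availability, "book_slot": book_slot}


def dispatch(tool_call_json):
    """Run a tool call of the form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    handler = TOOLS.get(call["name"])
    if handler is None:
        return {"ok": False, "reason": "unknown tool"}
    return handler(**call["arguments"])
```

Only tools registered in `TOOLS` can run, which is one simple way to keep the model's actions inside the limits you set.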
How does it speak back — and why does latency matter so much?
The third system is text-to-speech (TTS): the model's reply is converted into natural-sounding audio. Like the STT layer, it streams — the first syllables play while the rest of the sentence is still being generated, because waiting for a complete reply would add noticeable silence.
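One common way to stream is to hand text to the voice as soon as a clause is complete. The generator below is a simplified sketch of that idea; the boundary characters are an assumption, and a real system would handle cases like decimal points more carefully.

```python
def speakable_chunks(token_stream, boundaries=".!?,;"):
    """Yield text as soon as a clause boundary arrives, not when the reply is complete."""
    buffer = ""
    for token in token_stream:
        buffer += token
        stripped = buffer.rstrip()
        if stripped and stripped[-1] in boundaries:
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush whatever is left when the model stops generating
        yield buffer.strip()


model_output = ["Sure", ",", " I", " can", " book", " that", ".",
                " Thursday", " at", " two", "?"]
print(list(speakable_chunks(model_output)))
# ['Sure,', 'I can book that.', 'Thursday at two?']
```

The first chunk, "Sure,", can be spoken after only two tokens, while the rest of the reply is still being written.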
Latency is the make-or-break engineering problem in voice AI, and it's worth understanding why. Linguistic research across ten languages (Stivers et al., published in *PNAS*, 2009) found that the typical gap between turns in human conversation is around 200 milliseconds — faster than conscious thought, because listeners predict when the speaker will finish and prepare their reply in advance. Telephony standards point the same way: ITU-T Recommendation G.114 advises keeping one-way transmission delay below 150 milliseconds for satisfactory conversational quality, with 400 milliseconds as the upper planning limit.
An AI assistant can't hit 200 milliseconds — it has to transcribe, think and synthesise speech in sequence. But the total turn time (caller stops speaking → assistant starts speaking) needs to land well under a second to feel like conversation rather than a walkie-talkie exchange. Production systems get there by streaming every stage, running the three components close together on fast infrastructure, and starting to "think" before the caller has fully finished. Beyond roughly a second and a half of silence, callers assume the line has dropped, start repeating themselves, and trust evaporates.
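A back-of-the-envelope budget shows how the turn time adds up. Every per-stage figure below is an assumed, illustrative number rather than a measurement; only the 200 ms human gap and the roughly 1.5 second breakdown point come from the text above.

```python
# Illustrative per-stage delays in milliseconds; real figures vary by vendor and network
STAGES_MS = {
    "endpoint_detection": 200,
    "stt_finalisation": 100,
    "llm_first_token": 300,
    "tts_first_audio": 150,
    "network_round_trip": 100,
}

HUMAN_TURN_GAP_MS = 200  # typical human gap (Stivers et al., 2009)
BREAKDOWN_MS = 1500      # roughly where callers assume the line has dropped


def turn_latency_ms(stages):
    """Stages that run one after another add up to the total turn time."""
    return sum(stages.values())


total = turn_latency_ms(STAGES_MS)
print(f"{total} ms")  # 850 ms
print(HUMAN_TURN_GAP_MS < total < 1000 < BREAKDOWN_MS)  # True
```

Under these assumptions the turn lands at 850 ms: slower than a human, still under a second. Streaming matters because it shrinks each entry to "time until the first usable piece" instead of "time until the whole stage is done".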
Closely related is barge-in: callers interrupt, because humans do. A good system detects the interruption, stops talking immediately, and treats the new speech as the current turn — rather than ploughing on with its prepared sentence.
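Barge-in handling can be sketched as a playback queue that is thrown away the moment the caller speaks. `Playback` and its method names are hypothetical, and real systems also have to cancel audio that is already buffered on the phone network.

```python
class Playback:
    """Tracks whether the assistant is speaking and what it has left to say."""

    def __init__(self, chunks):
        self.queue = list(chunks)
        self.speaking = bool(self.queue)

    def next_chunk(self):
        """Hand the next piece of audio to the line, or None when nothing is left."""
        if not self.queue:
            self.speaking = False
            return None
        return self.queue.pop(0)

    def on_caller_speech(self):
        """Barge-in: drop queued audio and report whether the assistant was cut off."""
        interrupted = self.speaking
        self.queue.clear()
        self.speaking = False
        return interrupted


reply = Playback(["We have a slot", "on Thursday", "at two."])
print(reply.next_chunk())        # We have a slot
print(reply.on_caller_speech())  # True
print(reply.next_chunk())        # None
```

After the interruption the unspoken chunks are gone, so the assistant cannot plough on with its prepared sentence; the caller's new speech becomes the current turn.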
What can the AI actually do during the call?
With the pipeline in place, the practical capabilities of a modern virtual receptionist look like this:
- Book appointments — check live availability in Google Calendar, Microsoft 365 or a connected booking system, offer concrete slots, and write the confirmed appointment directly into the diary, with the caller's name and number attached.
- Answer FAQs — opening hours, prices, directions, parking, what to bring, whether you handle a particular job — anything in the knowledge base, phrased conversationally rather than read out like a script.
- Take structured messages — not a voicemail blob, but named fields: who called, their number, what they need, how urgent it is. Structured data can be routed, prioritised and acted on.
- Detect and switch languages — reply in the caller's language automatically, on one number. fonea handles English, German, French, Italian and Spanish this way.
- Escalate on your rules — transfer the call to your mobile for emergencies or VIP callers, or capture a prioritised callback request when you're unreachable. You define the triggers; the AI applies them consistently at 2 a.m. as well as 2 p.m.
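Structured messages and escalation rules fit together naturally: once a call is captured as named fields, routing it is a few comparisons. The field names, keywords and VIP number below are hypothetical examples of triggers an owner might define.

```python
from dataclasses import dataclass


@dataclass
class Message:
    caller_name: str
    phone: str
    need: str
    urgency: str  # "low", "normal" or "urgent"


EMERGENCY_KEYWORDS = {"flood", "leak", "emergency"}  # hypothetical triggers
VIP_NUMBERS = {"+441632960001"}                      # hypothetical VIP list


def route(msg: Message, owner_reachable: bool) -> str:
    """Apply owner-defined rules: transfer, prioritised callback, or plain message."""
    words = set(msg.need.lower().split())
    is_priority = (
        msg.urgency == "urgent"
        or msg.phone in VIP_NUMBERS
        or bool(words & EMERGENCY_KEYWORDS)
    )
    if not is_priority:
        return "message"
    return "transfer" if owner_reachable else "priority_callback"


burst_pipe = Message("Mr Patel", "+441632960123", "burst pipe and a leak", "normal")
print(route(burst_pipe, owner_reachable=True))   # transfer
print(route(burst_pipe, owner_reachable=False))  # priority_callback
```

The same rules give the same answer at any hour, which is the consistency the escalation bullet describes.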
What happens after the caller hangs up?
The end of the call is where the assistant quietly earns its keep. Within moments you receive a summary — by email, SMS or in a dashboard — saying who called, what they wanted, and what the assistant did: *"Mrs Davies, new patient, booked Thursday 14:00, asked about parking — told her about the rear car park."* Behind the summary sits the full transcript if you want the detail, and any structured outcomes (the calendar entry, the callback request) have already landed in the right system.
This matters more than it sounds. A receptionist's value was never just answering — it was making sure nothing fell through the cracks afterwards. Summaries and structured handoffs are how the AI version delivers that part of the job.
Where does it fail — and how do good systems handle failure?
Honesty time. Voice AI in 2026 is genuinely good, but it is not flawless, and any vendor claiming otherwise is selling. The real failure modes:
- Background noise and bad lines. A caller on speakerphone in a windy car park will sometimes be mis-transcribed. Good systems ask a natural clarifying question ("Sorry — was that Thursday or Tuesday?") instead of guessing, and confirm critical details like phone numbers back to the caller digit by digit.
- Interruptions and crosstalk. Two people talking near the phone, or a caller who interrupts mid-word, can confuse endpointing. Robust barge-in handling recovers most of these; the rest resolve with a polite "Go ahead — I'm listening."
- Out-of-scope questions. The caller asks something the knowledge base doesn't cover. The single most important design choice in this technology is what happens next. A bad system bluffs — and an invented price or a made-up promise is far worse than no answer. A good system says so plainly, takes a structured message, and flags it for a human — and the question gets added to the knowledge base so it's answered next time.
- Emotionally charged calls. A distressed or angry caller doesn't want a fluent machine; they want a person. Escalation rules should catch these early — on keywords, on tone, or simply on the caller asking for a human — and route them out of the AI immediately.
The pattern across all four: failure handling is the product. When you evaluate any assistant, don't test the happy path — mumble, interrupt, ask something weird, and watch what it does.
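Two of these behaviours, clarifying instead of guessing and reading numbers back digit by digit, are simple enough to sketch. The confidence threshold and the shape of the recogniser's candidate list are assumptions; real recognisers report confidence in their own formats.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def read_back_number(phone: str) -> str:
    """Spell a phone number digit by digit, ignoring spaces and punctuation."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in phone if ch in "0123456789")


def next_action(candidates, min_confidence=0.8):
    """candidates: list of (text, confidence) pairs from the recogniser, best first."""
    best_text, best_conf = candidates[0]
    if best_conf >= min_confidence:
        return f"accept:{best_text}"
    if len(candidates) > 1:
        # Offer the two likeliest readings instead of silently picking one
        return f"Sorry, was that {best_text} or {candidates[1][0]}?"
    return "Sorry, could you say that again?"


print(next_action([("Thursday", 0.55), ("Tuesday", 0.40)]))
# Sorry, was that Thursday or Tuesday?
print(read_back_number("01632 960"))
# zero one six three two nine six zero
```

When you test an assistant, a mumbled "Thursday" is a quick way to see whether it behaves like the first branch or the second.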
How is the call data kept secure and private?
A phone call is personal data — names, numbers, health details, the lot — so the security architecture is not a footnote:
- Encryption in transit and at rest, for audio, transcripts and summaries alike.
- GDPR compliance as a processor: a signed data processing agreement (DPA), a lawful basis, configurable retention, and deletion on request. The UK ICO's guidance sets out the same obligations for UK businesses under the UK GDPR.
- Data residency: ask where calls are processed and stored. Many voice AI products route audio through US infrastructure by default. fonea processes and hosts in the EU, which keeps the data-protection analysis simple for European businesses.
- Transparency: the assistant discloses that it's an AI at the start of the call — good practice today and, under Article 50 of the EU AI Act, a legal obligation from 2 August 2026.
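Configurable retention and deletion on request both reduce to filtering stored call records. The record layout (`id`, `phone`, `ended_at`) and the function names are assumptions for illustration; a real system would also have to purge audio files and backups.

```python
from datetime import datetime, timedelta, timezone


def ids_past_retention(records, retention_days, now):
    """Return the ids of call records older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [r["id"] for r in records if r["ended_at"] < cutoff]


def erase_caller(records, phone):
    """Deletion on request: drop every record tied to one caller's number."""
    return [r for r in records if r["phone"] != phone]


now = datetime(2026, 3, 1, tzinfo=timezone.utc)
records = [
    {"id": "a", "phone": "+441632960001", "ended_at": now - timedelta(days=40)},
    {"id": "b", "phone": "+441632960002", "ended_at": now - timedelta(days=5)},
]
print(ids_past_retention(records, retention_days=30, now=now))  # ['a']
```

Passing `now` in explicitly keeps the retention check easy to test and to audit.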
What does setup actually involve?
Less than the technology above suggests. There's no hardware, no new number, and no developer:
1. Keep your number. You set up call divert from your existing line: a few minutes with any provider, and reversible at any time.
2. Teach it your business. In a dashboard, you enter opening hours, services, FAQs, booking rules and escalation triggers. Most businesses are done in an afternoon.
3. Choose when it answers. Everything, overflow only (when you're engaged or after four rings), or out-of-hours only. Most owners start with overflow and expand once they trust it.
4. Test it yourself. Ring your own number, be an awkward customer, and tune the answers before callers ever hear it.
Hear it for yourself: fonea answers in five languages on one number, books straight into your calendar, and is hosted in the EU under GDPR. From £/€90 per month with 120 minutes included, plus a 30-day money-back guarantee, so the trial is on us.
Key Takeaways
- AI answers calls through a real-time pipeline: streaming speech-to-text → a language model grounded in *your* business knowledge → streaming text-to-speech.
- Latency is the hard part: humans switch turns in ~200 ms (Stivers et al., *PNAS* 2009); good AI assistants respond in well under a second, and beyond ~1.5 seconds conversations break down.
- Mid-call, the AI can book real appointments, answer FAQs, take structured messages, switch languages and escalate — via tool calls, not just talk.
- Failure handling is the product: clarify instead of guess, admit instead of bluff, escalate instead of trap. Test the awkward path before you buy.
- Security is table stakes: encryption, a DPA, GDPR compliance, EU data residency, and AI disclosure at the start of every call.
Frequently Asked Questions
Can callers tell it's an AI?
Yes — and they should: the assistant discloses it at the start of the call, which the EU AI Act makes mandatory from August 2026. After the disclosure, well-engineered voices and sub-second responses make the conversation feel entirely normal. In practice callers care far more that someone answered instantly and solved their problem than about who — or what — did the answering.
What happens if the AI can't answer a question?
It says so, honestly, and falls back on your rules: it takes a structured message with the caller's details and the question, flags it as needing a human, or transfers the call directly if you're available. Nothing is invented and no call is lost. Recurring gaps get added to the knowledge base, so the assistant answers them next time.
Does it work with my existing number?
Yes. You keep your number and set up call divert from your current provider — for all calls, only when you're busy or unreachable, or only outside opening hours. It takes a few minutes, requires no new hardware or contract changes on your line, and can be switched off just as quickly.
Sources
- Stivers et al. (2009) — *Universals and cultural variation in turn-taking in conversation*, PNAS (≈200 ms modal gap between conversational turns across 10 languages)
- ITU-T Recommendation G.114 (2003) — *One-way transmission time* (≤150 ms one-way delay for satisfactory conversational quality; 400 ms planning limit)
- EU AI Act (Regulation 2024/1689), Article 50 — transparency obligation for AI systems interacting with people (applies from 2 August 2026)
- European Commission — *EU General Data Protection Regulation (GDPR)*, Article 28 (processors and data processing agreements)
- UK Information Commissioner's Office (ICO) — *Guide to the UK GDPR*