Private On-Device AI Transcription

A tool that listens to your calls hears everything. Account numbers. A salary figure. The walk-away price in a negotiation. A doctor talking to a patient. Most products that do this quietly ship all of that audio to someone else’s server, and it feels free because you never see the wire.

I build the opposite: private AI that does the listening on your own laptop, so the audio of a call never leaves the room. I chose that model against the default, which is to send your data somewhere else: you speak, the audio goes to a server, and the server sends back text.

The short version: I built a desktop copilot that listens to a live call and suggests what to say next. The speech model runs on the laptop, so the audio becomes text on the machine and never leaves it. Only when I press a key does a slice of that text go out to a hosted model. The trade is real: a small local model mishears and invents filler, so I built a filter to catch it. In return the call stays yours, the listening costs nothing to run, and the words land in about four seconds. The voices stay. Only the words travel.

The copilot hears my voice through the microphone and the other person through the computer’s own speakers. Two streams, both sides of the conversation. Then it suggests a reply in my own words.

The first real decision was where the listening happens, and I put it on the laptop. The speech model runs on the machine. The audio becomes text on the machine. None of the audio leaves. There is no server that hears the call, no backend, no sync, no telemetry. When I want a suggestion, I press a key, and only then does a slice of the text go out to a hosted model. The words travel. The voices stay.

That sounds like a footnote. It is the spine of the product.
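
To make that boundary concrete, here is a minimal Python sketch, not the copilot's actual code. It assumes transcribed segments arrive as plain text tagged by stream. The names Segment, TranscriptBuffer, and on_hotkey, and the 60-second window, are hypothetical. What it illustrates is that the buffer holds only text, and the only thing prepared for the hosted model is the slice built when a key is pressed.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass
class Segment:
    """One piece of locally transcribed text. No audio is stored here."""
    speaker: str      # "me" (microphone) or "them" (system audio)
    text: str
    end_time: float   # seconds since the call started


class TranscriptBuffer:
    """Keeps recent text in memory; nothing is sent from this class."""

    def __init__(self, max_segments: int = 200) -> None:
        # A bounded deque drops the oldest segments automatically.
        self._segments: Deque[Segment] = deque(maxlen=max_segments)

    def add(self, segment: Segment) -> None:
        self._segments.append(segment)

    def slice(self, now: float, window_seconds: float = 60.0) -> str:
        """Return the last `window_seconds` of text, oldest first."""
        cutoff = now - window_seconds
        recent = [s for s in self._segments if s.end_time >= cutoff]
        return "\n".join(f"{s.speaker}: {s.text}" for s in recent)


def on_hotkey(buffer: TranscriptBuffer, now: float) -> str:
    """Build the text payload for the hosted model (assumed to be sent elsewhere)."""
    return buffer.slice(now)
```

In a sketch like this, both streams feed the same buffer through add(), and the network call would sit behind on_hotkey() alone, so a reviewer has exactly one place to check what leaves the machine.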

Why on-device AI is private by design

Think about what a call copilot actually hears. Account numbers. A salary figure. A walk-away price in a negotiation. A doctor and a patient. The moment you ship that audio to someone else’s computer, you have made a promise you cannot keep, because you no longer hold the thing you promised to protect.

Running the speech on the device removes the whole problem. I am not asking anyone to trust a privacy policy. The recording has nowhere to go. That is a different kind of claim: it is true by design, not by good intentions. It is the same rule I apply to sensitive documents, where the safest record is the one nobody keeps. A privacy guarantee you can check beats one you have to believe.

Design question	Cloud default	On-device (this build)
Where the audio goes	To someone else’s server	Nowhere. It stays on the laptop
What you have to trust	A privacy policy	Nothing. The recording has no exit
Cost to listen	A monthly server bill	None. There is no server to rent
Speed	A round trip to another continent	About four seconds, local
The privacy claim	One you have to believe	One you can check

This is the part of local AI most products skip. They put a privacy page on the site and keep sending your audio to a data center. Privacy you can check is rarer than it sounds. Even the padlock in your browser promises less than most people think. I would rather build the guarantee into how the thing works, so there is no page to read and no promise to break.

The real trade with local AI, and how I handle it

Now the honest part. On-device costs you something, and I would rather name it than hide it.

A speech model small enough to run on a laptop in real time, while you are also on a video call, is weaker than the big ones in the cloud. It mishears. And when it hears near-silence or background noise, it does not stay quiet. It fills the gap. These small models learned from huge piles of internet video, so when they are unsure they fall back on what they saw most. They type a thank-you for watching. They type a request to subscribe. They drop a lone word into a silent room. The model is not broken. It is guessing, and its guesses come from the videos it learned on.

So a real part of the work is a filter that sits between the ears and the page. It drops the known junk. It catches the stock phrases the model leaks when it has nothing real to transcribe. It spots the same word repeated ten times, which is what these models do when they spiral. It even checks whether a sound was speech at all before it pays to transcribe it, so a fan or a keyboard does not turn into a sentence.
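
The checks described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the filter that ships in the copilot: the phrase list, the thresholds, and the function names are all hypothetical, and the energy gate is a crude stand-in for a real voice-activity detector (it skips near-silence, but it would not tell a loud keyboard from speech).

```python
import math
import re
from typing import Sequence

# Hypothetical blocklist; a real one would grow from observed junk output.
STOCK_PHRASES = {
    "thank you for watching",
    "thanks for watching",
    "please subscribe",
}


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so variants compare equal."""
    return re.sub(r"[^a-z0-9' ]+", "", text.lower()).strip()


def is_speech(samples: Sequence[float], rms_threshold: float = 0.01) -> bool:
    """Energy gate: skip transcription when the chunk is near-silent."""
    if not samples:
        return False
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return rms >= rms_threshold


def is_repetition_loop(text: str, max_run: int = 4) -> bool:
    """True when one word repeats more than `max_run` times in a row."""
    run, previous = 0, None
    for word in normalize(text).split():
        run = run + 1 if word == previous else 1
        if run > max_run:
            return True
        previous = word
    return False


def keep_transcript(text: str) -> bool:
    """Decide whether a transcribed line should reach the screen."""
    cleaned = normalize(text)
    if not cleaned:
        return False                      # punctuation or whitespace only
    if cleaned in STOCK_PHRASES:
        return False                      # leaked training-data caption
    return not is_repetition_loop(text)   # model spiralling on one word
```

The order matters for cost: is_speech() would run on the raw audio chunk before the model is invoked, while keep_transcript() runs on whatever text the model returns.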

None of that filter would exist if I had shipped the audio to a bigger model in a data center. I traded raw accuracy for privacy, then paid the trade back in code. That is the deal, and you should know the deal before you call something private. This is normal engineering: you pick the constraint that matters, then you do the work the constraint asks for.

What you get when the AI keeps your data on your machine

What you get is worth the bill. The call stays yours. The thing runs with no monthly fee for the listening, because nobody is renting me a server to do it. And it is fast, because the audio is not making a round trip to another continent and back. Words land on the screen about four seconds after they are spoken, and a copilot that tells you what to say is useless if it answers a minute late.

There is a second filter in the product, and it taught me the same lesson from the other side. Once the transcript goes out and a suggestion comes back, that suggestion has to sound like a person. So the reply layer bans the tells. No “great question.” No “I would be happy to.” No restating what the other person just said to prove you were listening. A statement, not a stalling question. The first filter strips the machine out of the ears. The second strips the machine out of the mouth. Same discipline, run twice.
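
A reply-side check along those lines might look like the following Python sketch. It is an assumption-heavy illustration rather than the product's reply layer: the phrase list, the 0.6 overlap threshold, and the names violations and restates are hypothetical, and word overlap is only a rough proxy for "restating what the other person just said."

```python
import re
from typing import List

# Hypothetical list of tells, taken from the examples in the text.
BANNED_PHRASES = ("great question", "i would be happy to", "i'd be happy to")


def _words(text: str) -> set:
    """Lowercased word set, with curly apostrophes normalized."""
    return set(re.findall(r"[a-z']+", text.lower().replace("\u2019", "'")))


def restates(reply: str, last_heard: str, threshold: float = 0.6) -> bool:
    """True when the reply reuses most of the other speaker's words."""
    reply_words, heard_words = _words(reply), _words(last_heard)
    if not reply_words or not heard_words:
        return False
    return len(reply_words & heard_words) / len(heard_words) >= threshold


def violations(reply: str, last_heard: str = "") -> List[str]:
    """Return the reasons a suggested reply sounds machine-written."""
    problems = []
    lowered = reply.lower().replace("\u2019", "'")
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            problems.append(f"banned phrase: {phrase}")
    if reply.rstrip().endswith("?"):
        problems.append("ends on a question instead of a statement")
    if restates(reply, last_heard):
        problems.append("restates the other speaker")
    return problems
```

An empty list means the suggestion passes. Anything else could be used to drop the suggestion or ask the hosted model for another one.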

That is the part people miss about building with AI right now. The model is the easy half. The judgment is deciding what never to send, and what never to say. I make that call on purpose, and I make it the same way every time.

Keep the voices on the machine. Send the words, not the wire.

Frequently asked questions

What is on-device (local) AI?

It is AI that runs on your own hardware, your laptop or phone, instead of on a remote server. In this copilot, the speech-to-text model runs locally, so the audio of a call is turned into text on the machine and is never uploaded.

Is on-device AI actually more private than cloud AI?

Yes, and in a way you can verify rather than take on faith. When the audio never leaves your device, there is no server copy to leak, subpoena, or misuse. The privacy comes from the architecture, not from a policy promising good behavior.

What is the downside of running AI models locally?

A model small enough to run in real time on a laptop is less accurate than a large cloud model. It mishears more, and when it has nothing clear to transcribe it tends to invent filler. You make up for that with extra engineering, mainly a filter that removes the junk before it reaches the screen.

Why do small speech models produce phrases like “thank you for watching”?

Because they were trained on enormous amounts of internet video. When the audio is silent or noisy and the model is unsure, it falls back on the phrases it saw most often in that training data, which is why captions like “thank you for watching” or “please subscribe” leak out of silence.

Is local AI fast enough for real-time use?

It can be faster, because there is no round trip to a distant server. In this build, transcribed words appear about four seconds after they are spoken, which is quick enough for a live call where a late suggestion is a useless one.
