This is an open source, community-driven, native audio turn detection model.
HuggingFace page: pipecat-ai/smart-turn
Turn detection is one of the most important functions of a conversational voice AI technology stack. Turn detection means deciding when a voice agent should respond to human speech.
Most voice agents today use voice activity detection (VAD) as the basis for turn detection. VAD segments audio into “speech” and “non-speech” segments. VAD can’t take into account the actual linguistic or acoustic content of the speech. Humans do turn detection based on grammar, tone and pace of speech, and various other complex audio and semantic cues. We want to build a model that matches human expectations more closely than the VAD-based approach can.
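For comparison, here is a minimal sketch of the VAD-only approach described above, using the webrtcvad package. The frame size, aggressiveness, and 800ms silence threshold are illustrative assumptions, not values from this project; the point is that the end-of-turn decision depends only on the presence or absence of speech energy, never on what was said.

# VAD-only endpointing: declare the turn finished after a fixed run of silence.
import webrtcvad

SAMPLE_RATE = 16000                               # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 30                                     # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM
SILENCE_TIMEOUT_MS = 800                          # assumed "speaker is done" threshold

def vad_end_of_turn(pcm: bytes) -> bool:
    """Return True if the audio ends with enough trailing non-speech that a
    VAD-only agent would decide the user has finished talking."""
    vad = webrtcvad.Vad(2)                        # aggressiveness 0 (least) to 3 (most)
    trailing_silence_ms = 0
    for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[offset:offset + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            trailing_silence_ms = 0
        else:
            trailing_silence_ms += FRAME_MS
    return trailing_silence_ms >= SILENCE_TIMEOUT_MS

A pause mid-thought trips this check just as easily as a finished sentence, which is exactly the gap this model is meant to close.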
This is a truly open model (BSD 2-clause license). Anyone can use, fork, and contribute to this project. This model started its life as a work-in-progress component of the Pipecat ecosystem. Pipecat is an open source, vendor-neutral framework for building voice and multimodal realtime AI agents.
This is an initial proof-of-concept model. It handles a small number of common non-completion scenarios. It supports only English. The training data set is relatively small.
We have experimented with a number of different architectures and approaches to training data, and are releasing this version of the model now because we are confident that performance can be rapidly improved.
We invite you to try it, and to contribute to model development and experimentation.
Set up the environment.
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Run a command-line utility that streams audio from the system microphone, detects segment start/stop using VAD, and sends each segment to the model for a phrase endpoint prediction.
#
# It will take about 30 seconds to start up the first time.
#
# "Vocabulary" is limited. Try:
#
# - "I can't seem to, um ..."
# - "I can't seem to, um, find the return label."
python record_and_predict.py
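Internally, a utility like this amounts to a loop of the following shape. This is a rough sketch, not the actual contents of record_and_predict.py: the sounddevice library and the vad_is_speech and predict_endpoint callables are hypothetical names used for illustration.

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000
CHUNK_MS = 30

def run(vad_is_speech, predict_endpoint):
    segment, in_speech = [], False
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as stream:
        while True:
            chunk, _ = stream.read(SAMPLE_RATE * CHUNK_MS // 1000)
            if vad_is_speech(chunk.tobytes()):
                in_speech = True
                segment.append(chunk)
            elif in_speech:
                # Segment ended (a real implementation would wait for a short
                # run of silence). Ask the model whether the turn is complete.
                audio = np.concatenate(segment).reshape(-1).astype(np.float32) / 32768.0
                print("complete" if predict_endpoint(audio) > 0.5 else "incomplete")
                segment, in_speech = [], False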
The current version of this model is based on Meta AI’s Wav2Vec2-BERT backbone. More on model architecture below.
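As a rough illustration of that kind of architecture, the sketch below puts a binary "turn complete / not complete" head on top of the Wav2Vec2-BERT backbone from Hugging Face Transformers. The facebook/w2v-bert-2.0 checkpoint, the mean-pooling, and the single linear head are assumptions chosen for illustration, not necessarily the exact layers used by smart-turn.

import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

class TurnClassifier(nn.Module):
    def __init__(self, backbone_id: str = "facebook/w2v-bert-2.0"):
        super().__init__()
        self.backbone = Wav2Vec2BertModel.from_pretrained(backbone_id)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_features=input_features).last_hidden_state
        pooled = hidden.mean(dim=1)               # mean-pool over time
        return torch.sigmoid(self.head(pooled))   # P(turn is complete)

# Usage: featurize 16 kHz mono audio, then score it.
# extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
# inputs = extractor(audio_16khz, sampling_rate=16000, return_tensors="pt")
# prob_complete = TurnClassifier()(inputs.input_features)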
The high-level goal of this project is to build a state-of-the-art turn detection model that:
- anyone can use,
- is easy to deploy in production,
- is easy to fine-tune for specific application needs.
Current limitations:
- English only
- Relatively slow inference
  - ~150ms on GPU
  - ~1500ms on CPU
- Training data focused primarily on pause filler words at the end of a segment.
Medium-term goals:
- Support for a wide range of languages
- Inference time <50ms on GPU and <500ms on CPU
- Much wider range of speech patterns captured in the training data
12 Comments
zamalek
As a [diagnosed] HF autistic person, this is unironically something I would go for in an earpiece. How many parameters is the model?
written-beyond
Having reviewed a few turn-based models, your implementation is pretty in line with them. Excited to see how this matures!
remram
Ok what's turn detection?
foundzen
I got most of my answers from the README. Well written. I read most of it.
Can you share what kind of resources (and how much of them) were required to fine tune Wav2Vec2-BERT?
kwindla
A couple of interesting updates today:
– 100ms inference using CoreML: https://x.com/maxxrubin_/status/1897864136698347857
– An LSTM model (1/7th the size) trained on a subset of the data: https://github.com/pipecat-ai/smart-turn/issues/1
cyberbiosecure
forking…
prophesi
I'd love for Vedal to incorporate this in Neuro-sama's model. An osu bot turned AI Vtuber[0].
[0] https://www.youtube.com/shorts/eF6hnDFIKmA
xp84
I'm excited to see this particular technology developing more. From the absolute worst speech systems such as Siri, who will happily interrupt to respond with nonsense at the slightest half-pause, to even ChatGPT voice mode which at least tries, we haven't yet successfully got computers to do a good job of this – and I feel it may be the biggest obstacle in making 'agents' that are competent at completing simple but useful tasks. There are so many situations where humans "just know" when someone hasn't yet completed a thought, but "AI" still struggles, and those errors can just destroy the efficiency of a conversation or worse, lead to severe errors in function.
lostmsu
Does this support multiple speakers?
pzo
I will have a look at this. Played with pipecat before and it's great; switched to sherpa-onnx though since I need something that compiles to native and can run on edge devices.
I'm not sure turn detection can really be solved without a dedicated push-to-talk button, like in a walkie-talkie. I've often tried the Google Translate app, and the problem is that many times when you're speaking a longer sentence you stop or slow down a little to gather your thoughts before continuing (especially if you are not a native speaker). For this reason I avoid conversation mode in cases like Google Translate, and when using the Perplexity app I prefer the push-to-talk button mode instead of the new one.
I think this could be solved, but we would need not only low-latency turn detection but also low-latency speech interruption detection, and also a very fast, low-latency LLM on device. And in case of an interruption, good recovery, so that the system knows we are continuing the last sentence instead of discarding the previous audio and starting anew.
Lots of things can also be improved regarding I/O latency, like using a low-latency audio API, very short audio buffers, a dedicated audio category and mode (on iOS), using wired headsets instead of the built-in speaker, and turning off system processing like the iPhone's audio boosting or polar pattern. And streaming mode for everything: STT, transport (when using a remote LLM), TTS. Not sure if we can have TTS in streaming mode; I think most of the time they split by sentence.
I think push-to-talk is a good solution if well designed: a big button placed where it's easily reached with your thumb, integration with the iPhone Action button, haptic feedback, using an Apple Watch as a big push button, etc.