Ben Schoeggl from AchillesHR #1
Thank you.
Okay, so the first question that we have is:
when you're building with voice stuff, if you could wave a magic wand and make anything better, less difficult, or different, where would you be aiming the magic wand?
I would say just overall conversation.
What's the right word?
Like, fluidness and smoothness.
Like, when you're having a conversation with a human being, it's very obvious.
You know, our brains just kind of, like, automatically figure out when we should talk, when we should not talk.
The current, like, voice pipeline system is not great at that.
So we definitely have some remaining issues there. Right now, for example, we have about a second and a half of on-purpose silence between the person stopping talking and the model starting to talk, so we don't get a bunch of problematic interruptions.
Which just makes the conversation feel a little bit sluggish.
That's a big one.
Honestly, that's probably our biggest source of issues for sure.
Yeah, I've spent a lot of time just messing with settings, trying to figure out what to do to make that feel better.
If I could just wave a wand...
Basically, if you could wave a wand, you'd give it a system prompt and tool calls and what tools to call, and the conversation would just go perfectly every time. That would be my magic wand thing, yeah.
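The deliberate post-speech silence described above is an endpointing delay. A framework-agnostic sketch of that gate (the class name and API are made up for illustration; this is not LiveKit's actual interface):

```python
import time

class EndpointingGate:
    """Decides when the agent may start talking after the caller goes quiet.

    A longer delay means fewer bad interruptions but a more sluggish feel;
    the 1.5 s default mirrors the on-purpose silence described above.
    """

    def __init__(self, delay_s=1.5):
        self.delay_s = delay_s
        self.last_speech_ts = None

    def on_vad(self, is_speech, now=None):
        # called on every VAD frame; records the last moment of caller speech
        if is_speech:
            self.last_speech_ts = time.monotonic() if now is None else now

    def may_respond(self, now=None):
        # the agent only speaks once the caller has been quiet for delay_s
        if self.last_speech_ts is None:
            return False
        now = time.monotonic() if now is None else now
        return (now - self.last_speech_ts) >= self.delay_s

gate = EndpointingGate(delay_s=1.5)
gate.on_vad(True, now=10.0)   # caller talking at t=10.0 s
gate.on_vad(False, now=10.5)  # caller pauses
```

Tuning `delay_s` is exactly the snappiness-versus-interruptions tradeoff: with the values above, `gate.may_respond(now=11.0)` is still `False` (only 1.0 s of quiet), while `gate.may_respond(now=11.6)` is `True`.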
You mentioned that you played around with some stuff to try and fix it.
Can we hear a little bit more about what you tried?
I messed with Silero, like all the Silero VAD settings, a lot, just tuning them up and down. We use LiveKit, and LiveKit has a contextual end-of-utterance thing.
I'm not even entirely sure how it interacts with the Silero VAD, but
I messed around with that.
I've messed around with using different speech to text providers and tuning the speech to text providers.
I've messed around with
timing on stuff, in terms of how much time we introduce between the person stopping talking and the model starting to talk.
I've messed around with, if the model, or the Silero VAD or whatever, doesn't detect speech, how long do you wait before you say something like, hey, sorry, I can't hear you.
Yeah, just all that stuff.
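The knobs being tuned here, a speech-probability threshold and a minimum-silence window, amount to a small state machine over the per-frame probabilities a VAD like Silero emits. A hedged sketch (the function and its defaults are illustrative, not Silero's or LiveKit's real interface):

```python
def detect_speech_segments(probs, frame_ms=32, threshold=0.5, min_silence_ms=550):
    """Turn per-frame speech probabilities into (start_ms, end_ms) segments.

    Raising `threshold` ignores more background noise; raising `min_silence_ms`
    tolerates longer pauses before deciding the caller is done - the same two
    dials described as being tuned up and down.
    """
    segments, start, silence = [], None, 0
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i * frame_ms            # speech onset
            silence = 0
        elif start is not None:
            silence += frame_ms
            if silence >= min_silence_ms:       # enough quiet: close the segment
                segments.append((start, (i + 1) * frame_ms - silence))
                start, silence = None, 0
    if start is not None:                       # audio ended mid-utterance
        segments.append((start, len(probs) * frame_ms - silence))
    return segments

# two bursts of speech separated by a pause long enough to split them
segs = detect_speech_segments(
    [0.1, 0.9, 0.9, 0.1, 0.1, 0.1, 0.9, 0.9],
    frame_ms=10, threshold=0.5, min_silence_ms=30,
)
```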
Did anything make much of a difference?
Yeah.
Adding, so initially I had a really short delay between the person stopping talking and the model starting, because I wanted the conversation to feel snappier, but that caused a bunch of bad interruption problems.
So I actually went the other way and set it above the recommended default.
That really reduced the error states in terms of the model interrupting people when it shouldn't.
It made the conversation feel more sluggish, but it's definitely better.
The other big one is, now, I think it's after like three seconds, so not very much time, if the model doesn't hear anything,
it will say, hey, sorry, I can't hear you. Can you speak up?
Which seems to help.
Yeah, that just kind of like prevents a lot of error states effectively.
And also, we were having issues because a lot of our candidates are in high-background-noise environments; they might be on a manufacturing floor somewhere or something.
And so it prompts them to put the phone in a lower-noise environment without being like, hey, I'm a stupid AI, I can't hear you, can you move your phone somewhere quieter?
If you just say, sorry, I can't hear you,
it's just a nicer way to prompt the candidate to make the noise environment a little better.
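The three-second "can't hear you" fallback is a no-input watchdog. A sketch under assumptions (the timeout and phrasing come from the conversation; the class itself is hypothetical):

```python
class NoInputWatchdog:
    """Reprompts once when the VAD hears nothing for too long.

    The gentle "can you speak up?" phrasing nudges callers in noisy places
    (like a manufacturing floor) to find somewhere quieter without blame.
    """

    def __init__(self, timeout_s=3.0):
        self.timeout_s = timeout_s
        self.quiet_since = 0.0
        self.reprompted = False

    def tick(self, heard_speech, now):
        # call periodically with the VAD's verdict and the current time
        if heard_speech:
            self.quiet_since = now
            self.reprompted = False
            return None
        if not self.reprompted and now - self.quiet_since >= self.timeout_s:
            self.reprompted = True               # only nag once per silence
            return "Sorry, I can't hear you. Can you speak up?"
        return None

w = NoInputWatchdog(timeout_s=3.0)
w.tick(True, now=0.0)              # caller said something at t=0
msg_early = w.tick(False, now=2.0) # still within the grace period
msg_late = w.tick(False, now=3.5)  # 3.5 s of silence: reprompt fires
```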
Yeah, yeah.
That's really helpful.
Yeah, it seems like you tried a lot of stuff.
Tried a lot. It's a good time. Voice is a good time.
Yeah, it's gonna get there.
And so if we were to deliver an experience that didn't have that barging in, that felt smooth,
how would it change your life?
Well, it would have saved me like three to four weeks of pain last month.
But in general, we have it in at least a decent state now where our clients aren't mad.
It's kind of a weird conundrum because in sales calls now, our calls are pretty damn good.
It's not quite as good in the real world with higher noise environments and stuff like that.
We did just get a
client complaint today about
basically like the pickup process being a little janky.
So just, I don't know.
Yeah, if there was a product, I would definitely pay for a product where all that is just kind of like handled and works well.
However, I'm highly skeptical of a one-stop-shop solution, because
our call environment, and maybe this is not good thinking on my part, but our call environment is very different from a call center call environment, which is very different from,
I don't know, maybe it's not that different actually, I don't know.
Yeah,
yeah, I mean, I would definitely,
you know,
I have a ticket on my to-dos at some point. I'm pretty confident that LiveKit is probably the best solution, but I never actually prototyped Vapi or LayerCode, obviously, or
the other big one, Pipecat.
So I kind of have it on my to-do list to just do a prototype call with all those and make sure LiveKit's the best.
But yeah, if there was a solution where I could just wave a wand and get rid of all the conversation flow problems, I would be happy.
Yeah, actually, could I just... yeah.
I mean, I know that's kind of what you guys are starting to do somewhat, because you guys basically just exposed the whole live node, right?
But I would be concerned: the other problem with a one-stop-shop solution, where you guys or someone else handles the entire conversation flow, is that if there is a bug, I'm completely dependent on your release cycle to fix the bug.
Like, if a big client of mine finds a bug, some issue in the conversation flow, and gets really mad about it, I am 100% dependent on you guys' release cycle to keep that client, which I also don't really like.
Yeah,
but I don't know, it just depends.
Yeah, that.
Yeah.
Sorry, I'm kind of rambling now.
All right, this is super, super helpful.
Why did you decide to go with LiveKit, by the way?
Well, so we started off with Bland, which is basically a drag-and-drop UI thing for it, but it basically just sounded like a phone tree,
as in the conversation flow sounded like a phone tree.
And then I was just looking around at different providers.
Pipecat just seemed not super mature compared to LiveKit.
And I mean, LiveKit's like a fairly big company.
So, I mean, there's always some comfort in that.
I had a colleague recommend LiveKit, and then I looked around at Vapi, and I forget, there was some issue with Vapi.
But just generally, LiveKit seemed like the best solution.
And so, I built the call in LiveKit and it worked pretty well.
Then we actually, well, I initially built it with their JavaScript agents framework, and then we discovered that that one
sucks compared to the Python one.
So then I actually rebuilt it in Python.
LiveKit, Python.
LiveKit Python library, yeah.
There were just a bunch of bugs in the JavaScript one that don't exist in the Python one.
So.
Yeah, I guess that's been our development journey.
But yeah, now I'm at a point where, that was sort of getting the product off the ground, so I didn't do quite as much research as I should have.
I guess LiveKit's
probably the best solution.
But yeah, as I said,
I'm gonna take some time to just prototype Vapi and the others.
Yeah, I think they're doing a good job, LiveKit.
And yeah, as I said, this is not about trying to convince you or anything.
So we're
just trying to uncover how we could be useful.
Yeah, no, I'll definitely, I'm definitely interested in what you guys are doing.
I mean, it sounds like what you guys are trying to do is effectively what I'm asking for as a customer, which is handle everything conversation-related.
I'm just, yeah, like I said, a little bit skeptical of,
like the one stop shop and also like, you guys aren't even launched yet, so it's like, I will definitely prototype it when you have maturity.
But
this is your whole business.
So you've
got to be risk averse on making sure that anyone you go with is reliable and has a good record and stuff.
So we understand that.
Yeah, just a question I had, actually, on the agents.
Because my background, I've been dabbling a lot more with the
general AI agent frameworks.
I was using Mastra, the JavaScript one, but I think there's, like, Pydantic AI and
the big one, I forgot, LangChain and stuff like that.
LangChain.
Yeah.
Did you ever look at using any of those more general ones rather than the
LiveKit library for the LLM stuff?
I mean, you can use, I believe you can use LangChain stuff on top of LiveKit.
Really, we're using LiveKit to help manage telephony and all the conversation stuff.
I mean, we control all our own tool calls and system prompts and stuff with LiveKit.
I haven't really looked, so initially I thought we were going to have to do some complex agent setup or something, but we tried a couple of things and what worked the best was just stuff everything into the system prompt.
So far.
And so that's kind of what we've run with.
Eventually we'll definitely get more
advanced on that, but what we have works pretty well for now.
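"Stuff everything into the system prompt" — persona, call flow, and tool guidance in one prompt rather than a multi-agent graph — could look roughly like this (every name and wording below is a hypothetical stand-in, not AchillesHR's actual prompt):

```python
def build_system_prompt(role, steps, tools):
    """Fold persona, call flow, and tool guidance into a single system prompt."""
    lines = [f"You are {role}.", "", "Conduct the call in this order:"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, 1)]
    lines += ["", "Call these tools when appropriate:"]
    lines += [f"- {name}: {desc}" for name, desc in sorted(tools.items())]
    lines += ["", "Keep answers short; this is a spoken phone conversation."]
    return "\n".join(lines)

# hypothetical phone-screen setup, purely for illustration
prompt = build_system_prompt(
    role="a friendly phone screener for a manufacturing role",
    steps=[
        "Confirm the candidate's name",
        "Ask about shift availability",
        "Schedule an on-site interview",
    ],
    tools={
        "schedule_interview": "book a time slot",
        "log_answer": "record a structured answer",
    },
)
```

The appeal is that one readable prompt is easy to iterate on, and tool calls still work normally; the complexity only needs to move into a framework once the flow outgrows a single prompt.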
Okay.
Are you using any frameworks besides the LiveKit one?
On the call side,
Not right now.
Not at the moment, yeah.
We're going to use, I'm planning on implementing Coval, which is like, I don't know, have you heard of them?
Yeah, yeah, yeah.
That's for evals, right?
Yeah, and they interact with some open-source thing called Langfuse or something.
So basically you just have to instrument your call with Langfuse, and then you get a bunch of stuff out of the box.
So planning on using that in the near term.
But yeah.
Yeah.
Okay.
Super, super helpful.
Okay.
This one.
Okay.
Is there anything different about the world now, versus say a year ago or longer, that makes waving a magic wand and having smooth voice more valuable now than it would have been a year or two ago?
Well, for me personally, we understand how to sell this product better.
So I feel like the opportunity for this company is bigger than I thought it was a year ago.
And so, I mean, making things better, it's like, you know,
a multiplier, and one side of the multiplication is solving all the problems, so it's a bigger multiplier now.
As far as I can tell,
from a year ago, I don't think that there are
tectonic changes in any of the models
such that voice is significantly better.
Yeah, I don't think there's any like new breakthrough where it's like,
oh, you know, this conversation will go so much better with GPT-5 or something like that.
Yeah, not that I can think of.
I don't know.
What about same question on your side?
What do you think?
Well, I feel like, I mean, the obvious one is the models are just so much more powerful.
The actual LLMs; the voice ones are maybe not that much different, but the LLMs are more powerful.
And so I feel like they can handle a lot more stuff, and it's easier to write things that can do quite a lot.
I don't know.
I had a mad moment when I first... but I think, yeah, I mean,
I think with ChatGPT, all the memory stuff they've done is very interesting, but that doesn't apply to calls.
I don't know.
Yeah, I mean, because for a given call there's no memory, you have the same, I mean, people have discovered things you can do with the prompts to make things a little bit better, but I don't feel like, in the last year, there's been one new model that really drastically changes things.
Which one are you using actually?
GPT-4o.
GPT-4o.
Okay, yeah, yeah, I think that's a good one.
Yeah, we've been using Gemini 2.5 Flash.
Yeah, I need to look at Gemini Flash actually, because I've heard it's very good.
Yeah, but I don't know. I guess, for me, I was just surprised, because I remember I took like a year or so break from building with this stuff, and I felt like it just became a lot easier: you could just say what you want to happen and give it the tools, and then, with the frameworks, like Mastra, the one I was using, it would just loop through. If you gave it the tools and said what you wanted to do, it would actually just work really well.
I don't know, but these were mostly my
hobby cases.
So I know that.
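The "give it tools and a goal, and it loops through" behavior of frameworks like the ones mentioned reduces to a short loop. The stub model below stands in for a real LLM call, so this shows only the shape, not any particular framework's API:

```python
def run_agent(model, tools, goal, max_turns=5):
    """Minimal agent loop: ask the model, run the tool it requests,
    feed the result back, and stop when it emits a final answer."""
    history = [("goal", goal)]
    for _ in range(max_turns):
        action = model(history)                   # a real LLM call in practice
        if action["type"] == "final":
            return action["text"]
        result = tools[action["tool"]](**action["args"])
        history.append((action["tool"], result))  # loop the tool result back in
    return None                                   # ran out of turns

# stub "LLM": asks for one lookup, then answers with what it learned
def stub_model(history):
    if len(history) == 1:
        return {"type": "tool", "tool": "lookup", "args": {"key": "pay"}}
    return {"type": "final", "text": f"Answer: {history[-1][1]}"}

answer = run_agent(stub_model, {"lookup": lambda key: "$18/hr"}, "what is the pay?")
```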
Yeah, you want it to be a little bit more, like,
deterministic.
Yeah.
Yeah.
So last couple of questions.
Ben, how do you learn new things about voice AI stuff?
Twitter.
Is there anyone?
Pretty good.
The Pipecat guy,
Kwindla, was pretty good.
Yeah.
He did a course a while ago, which I signed up for, and I only went to a couple of the sessions because I was so busy.
Yeah,
I don't have a great... I mean, honestly, I'm just so insanely busy getting stuff done for the company that I don't, at the current moment, have a ton of time to do research on new stuff.
new stuff.
Sometimes I'll like pop my head above water and be able to have some time to read or something, but it doesn't happen very often.
I know that I need to, for example, I think we can make an improvement by using Google speech-to-text for our calls, which I'm going to investigate today.
But it's stuff where I'll just see something, like,
I saw a random post by some guy at Mercor
where they basically did a giant A/B test of a bunch of, they tested,
I think they tested,
a ton of speech-to-text providers.
They kept the LLM the same, and tried speech-to-text and text-to-speech combinations.
They A/B tested a bunch of those combinations,
and Google speech-to-text and Cartesia text-to-speech seemed to be the best.
So we're already on Cartesia
for text-to-speech, but we use Deepgram for speech-to-text.
So I was going to look at that, see if there's any noticeable difference for us.
But, yeah, it's just stuff like that, mostly.
Yeah.
Okay.
What about you guys?
Yeah, I mean, talking to people has been how we've learned so much, to be honest.
Like, just asking people that are banging their heads against it.
Yeah, because it's always, I don't know,
I think in production it's just different.
For sure.
Yeah, one last thing, Ben, hopefully a fun one, is that we're trying to put together some special swag for the tab.
Yeah, one of the ideas
that I just had over the weekend, because I was trying to learn how they do the text to speech stuff and how they actually train it.
I don't know if you've looked into it, but like,
it seems like they use like Mel audiograms.
Anyway, one of the ideas that we had was to basically get your company name as like an audiogram.
That like in the same style they use in like if you like so basically like more or less like a screenshot from like a Jupiter Notebook audiogram of your company and then have that on hoodie or something.
I don't know if you thought it was like yeah, very cool.
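For what it's worth, the Jupyter-notebook-style image in question is a log-mel spectrogram, and it can be computed with plain NumPy. The chirp below just stands in for a recording of someone saying the company name; the parameters are generic defaults, not anything a particular TTS vendor uses:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / (c - l)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / (r - c)
    return fb

def log_mel_spectrogram(y, sr=16000, n_fft=512, hop=128, n_mels=40):
    # frame the signal, window it, FFT to power, then project onto mel filters
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    mel = mel_filterbank(n_mels, n_fft, sr) @ power.T
    return 10.0 * np.log10(mel + 1e-10)     # dB scale, the "image" to print

# a one-second rising chirp stands in for someone saying the company name
sr = 16000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
y = np.sin(2.0 * np.pi * (200.0 + 300.0 * t) * t)
S = log_mel_spectrogram(y, sr=sr)            # shape: (n_mels, n_frames)
```

Rendering `S` with something like `matplotlib.pyplot.imshow(S, origin="lower", aspect="auto")` gives the familiar notebook-screenshot look for the hoodie.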
Do you need it like recorded or something or?
We're actually gonna figure that out. We could do it where we get your voice, actually. Would that be cooler, if it was your voice?
It could be anybody's voice.
Okay.
And I guess it's cooler if it's your company than LayerCode, right?
Yeah, probably.
Yeah.
Okay.
Are you guys gonna put it on sweatshirts or hats or...
What would you prefer?
Probably a hat.
I don't know.
I don't wear like a ton of sweatshirts, so.
Okay.
Probably a hat, but yeah.
I've got to say you're so far in the minority on hats.
Oh, really?
Everybody says sweatshirts.
Everyone says sweatshirts so far, so.
But, you know, there might be a late...
I'll take a sweatshirt.
You might be the start of a late surge on hats.
So, okay.
Yeah, Ben, that was like extremely, extremely helpful.
Thank you so much for your time.
I know you're very busy.
Of course.
So I'll send you an email just to book some follow-on ones, and we'll get in touch about swag and stuff like that, like sizes, and whether we need to get your voice recorded.
Okay.
Yeah.
But yeah, great to meet you, and hopefully we'll chat in about a month.
Cool.
Sounds good.
Can you also, will I get the recording for this as well?
Just so I can...
Yeah, yeah, yeah.
I can send you the recording.
Yeah, that'd be great.
And the notes as well?
I'm using, like, Granola.
Cool.
Cool.
Thank you so much.
Thanks, Ben.
See you later.
Have a good one.
Bye.