Gene Sigalov
Contractor on Upwork.
Sorry, I forgot to do that.
Yeah, it's okay.
So I found a contractor on Upwork who appeared to be a very savvy guy.
He's like a Northern European guy, pretty expensive.
Like his Upwork rate is like 150 bucks an hour or something like this.
And I had talked to him and he had implemented a lot of voice AI agents and he specifically warned me against considering VAPI.
And he was believable, right?
The bona fides were there.
And I can introduce you to him.
And the only reason I invoke him, and suggest that an introduction might be useful, is that he's a guy who implements these technologies in the wild for many smaller clients, or maybe not even so small, right?
But he kind of specializes in this focus area.
So it might be a worthwhile introduction for you.
100%.
Yeah, we'd really appreciate that because, yeah, I think we're just trying to laser in on what technical problems we can, and want to, solve with our platform.
Because I think, broadly, every vertical we're talking to is using voice for something.
The further they get into production, the more specific and nuanced the problems get, and some of them are shared across what people are doing, but some of them aren't at all.
That was one of the reasons for
inviting you to join the technical advisory board that we're running.
I think I explained in the email, but basically we're trying to have regular, pretty focused chats with folks across voice: learn from them about what the headaches are, but also share back with them, hey, we talked to someone else who's struggling with the same thing you were, would you like an intro, or here's something they shared to solve that?
So the goal is that it's a reasonable hook, right?
And the reason that I'm sharing my time here today, right, I could be working on something that's going to bring value to my users instead of speaking with you.
Absolutely.
You know, maybe a Slack or Discord would probably help you get more folks on board, right?
If there's an opportunity to meet other people building in the space, kind of like a little voice AI community thing.
I'm actually part of some WhatsApp groups that are focused on AI-based telephony and voice AI applications.
Terrific.
Yeah, no, I think, you know, we're interested in meeting anyone who's smart and trying to solve these problems, and, yeah, your time's appreciated, and anyone you feel could be a good connection, we'd love to chat with.
But yeah, I think with that kind of founding context on this, I think there's typically a few specific questions we like to kind of use as a springboard.
Jack, do you want to take over and kind of run through a few of those?
Yeah, absolutely.
Yeah.
And by the way, we should say just upfront, we're not trying to sell here.
We're just trying to discover interesting things and things that we should be building.
I understand.
I get it.
You're builders.
You're building a thing.
You're trying to figure out what to build.
Exactly.
Yeah.
So with that, Gene, the first question that we really like to ask is
in the voice AI world, if there's one thing that you could wave a magic wand at and make it better or fix it or change it, what would you wave a magic wand at?
I mean, realism.
And, and that's a catch-all phrase for latency, right?
And also verbal tics, but elegant verbal tics, where it's not the same verbal tic being played again and again.
I'm sure you've heard some agents where they try to do a verbal tic just to make it seem real.
But then after, like, a four-minute conversation, you feel like it's just the same verbal tic, like it's a stuck record, right?
Kind of like laugh lines in Seinfeld, it's always the same laugh lines.
Only it's much more grating.
So, yeah.
Okay.
But the broad umbrella is realism, right?
And realism has a lot of facets, right?
Latency, cadence, tenor of the voice, right?
All the things.
Is there anything within realism that you would wave it at specifically?
Latency. I think latency is the thing that you have control over.
So
are you specifically focused on telephony or, well, you're probably not specifically focused on telephony, but telephony probably comes across a lot and you don't really have a tremendous amount of impact unless you partner with a telephony provider, like a Telnyx, for example, like a CPaaS.
And Telnyx in particular, they're trying to co-locate some of the AI stack as close to them as possible.
I don't know if you've ever talked to them, but that's a thing they're doing.
Yeah.
And that's quite appealing to you.
I mean, yeah, it's so, I mean, we have customers, right?
They are low end customers, right?
Meaning like they're small businesses, which is our focus area, but that's what they complain about.
They're like, there's a delay, right?
And yes, there is.
And then we say, just stick around.
It's going to get better, right?
There's a lot of people, everyone across the stack is working to remedy this problem.
And I don't have good tooling, so how I figure out the latency is: I get the WAV file, and then I use software on my Mac which does transcripts.
It uses an open Whisper model, but it diatizes, what is it called?
Diarizes.
Diarizes.
Yeah, it diarizes down to the millisecond.
And then I just paste the transcript into ChatGPT and ask it to use the diarized transcript to give me the median response time, the average response time, and the longest response time.
So I've had to hack it.
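(For reference, a minimal sketch of that hack in Python, assuming the diarized transcript has already been parsed into per-speaker turns with start and end times in seconds; the "caller"/"agent" labels and the sample turn list are hypothetical.)

```python
# Minimal sketch: compute response-time stats from a diarized transcript.
# Assumes turns are already parsed as (speaker, start_sec, end_sec); the
# "caller"/"agent" labels and the turn list below are hypothetical examples.
from statistics import mean, median

turns = [
    ("caller", 0.00, 3.20),
    ("agent",  4.85, 9.10),
    ("caller", 9.90, 12.40),
    ("agent", 14.05, 18.00),
]

# Gap between the end of a caller turn and the start of the next agent turn.
gaps = [
    nxt[1] - cur[2]
    for cur, nxt in zip(turns, turns[1:])
    if cur[0] == "caller" and nxt[0] == "agent"
]

print(f"median response time:  {median(gaps):.2f}s")
print(f"average response time: {mean(gaps):.2f}s")
print(f"longest response time: {max(gaps):.2f}s")
```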
That's cool.
Yeah.
And yeah, but response times, that's important.
Yeah.
I think that's the number one complaint within realism, but there are other aspects to realism, which makes it feel better.
So some people add kind of like a busy workplace noise in the background.
That seems to be a popular one.
So it sounds like a Philippine BPO office or like Indian BPO office.
Do you know what I'm talking about?
Yeah, yeah, yeah.
Yeah, exactly.
But also it's like the same thing.
It's like the laugh track over and over again.
File.
Yeah.
Yeah.
We've had some funny ones where people have used it for, like, life coaching and, like, personal trainers, but it sounds like a call center and stuff.
That's vibes and stuff, right?
Yeah.
Yeah, yeah.
That's super cool.
If you had better realism, say you had significantly better realism, how would it change your life?
I mean.
I don't know.
I presume that some amount of users.
And this is a presumption because it's difficult to get people on the phone who decline to use your service.
They just go away and never come back.
But from people who stuck it out, we do hear this regularly.
So I have to infer that some amount of people don't convert to being paid users because the latency is too much for them or the realism is just not there.
But it's a pretty good inference.
It's pretty reasonable because we have live users who do stick around with us and they're complaining.
That's super interesting.
Is there anything
that's changed recently that makes realism even more valuable now than it was
a while ago, some unknown amount of time?
Year ago or like two years ago?
I can't think of a thing.
I thought that we'd have kind of like,
not to answer your question, but a thought just occurred to me.
I thought that with AI, there'd be kind of like an uncanny valley just like with robots, that in fact, people would maybe be upset if it sounds too real, meaning they don't want their AI sounding as good as a loved one, right?
To kind of keep the machines at a distance.
But no such compunctions.
They actually just wanted to sound as human as possible.
Yeah.
So better is better, basically.
Better is better.
Yeah.
That's essentially it.
One of the things that we tell people to do with their agents, these are small businesses, is to front-load the expectations.
Like, hello, I am an AI receptionist.
I can handle these four things.
I can do scheduling.
I can answer scheduling questions, or you can leave a message.
So they know what they're going to get.
They're not like-- because I think there's a little bit of a hangover left from the prevalent and widespread use of natural language platforms, which kind of identify the intent and then send you down the right tree.
And those platforms, on average, suck.
Right.
And people still use them.
Right.
Like Apple uses them.
So there's kind of a hangover effect where people don't realize this new generation of technology is really quite different and quite good.
And so I think it's just going to take a little bit, but we'll get there, obviously.
I don't know if you've seen that as well.
Totally.
Yeah.
I think the human expectations of what voice can do are one of the most interesting parts of this.
Because it's slowly changing as this technology rolls out.
Right.
And I think it's super interesting to me.
Yeah.
Do you have any stats on adoption?
I know it's sub 1% for kind of businesses in telephony.
I'm specifically interested in telephony.
But do you have any data to share?
On voice AI generally across industries.
Voice AI across industries.
And so I'm not sure how you categorize voice AI because my mental model is telephony.
You're probably married to a broader concept, which is in-app.
And so I'm broadly interested in telephony adoption, but any data you have is good to have.
No hard numbers, but I think from having spoken with quite a lot of people doing telephony-based voice agents, I would say it's extraordinarily early.
Still.
But I have been, I mean, surprised and not surprised by something like this.
There's a couple of businesses kind of doing what you guys are doing, but in BC, in Canada, like where we are, who we've got connected with.
And I've been quite surprised at, I was initially quite surprised, I should say, at how easy they were finding it to get customers on board to try the product.
But obviously, it's a very clear pitch for someone who's currently not answering the phones for 12 hours of the day.
It's easy to try it.
And I think broadly what you just described, lots of people find an issue with the latency, but some people are still trying it.
That maps with what I've heard locally as well.
We're in a very small town in Victoria, but we have someone in our network who's kind of just selling a solution locally.
And he managed to get, I think, like
four hotels in this town to try it.
And there's not that many hotels in this town.
So like, you know, it's a poor hotel town.
There's probably less than 10 that reasonably he would be hitting up with like a cold approach there.
And that number like stood out to me because it's like this is very early technology.
You know, they're not necessarily tech-enabled businesses.
They're not early adopter businesses.
And yet they're willing to adopt because you're solving a very valuable problem.
And so I think broadly, my impression is still very early in society, but it is really interesting how quickly people are willing to try something that does actually solve a real problem.
But that's not that surprising if you zoom out from it.
But in the context of what people are willing to put up with to try and solve the problem, it's
definitely possible, as you know, and I think you guys have done, like, the best job in this scene of creating a really responsive agent on the phone.
But the stack is still, like, the industry is in beta.
The tooling is still being figured out.
Like, yeah,
it's a good fishing hole.
Yeah, this is a fishing hole which will have the salmon come through.
So sorry to use the analogy.
It's good.
I think, yeah, it's like the stack is evolving quickly.
Right.
And so, I think, yeah, that's my overall impression.
Sorry, we don't have any big stats on it.
Yeah.
Jack, do you have any other questions?
I don't want to cut you short.
I know you're going through a script.
I want to make sure I've answered everything for you.
No, yeah, I think that was my script.
It tends to be those three big questions.
And then the other question is always just, like, if there's anything else, which is very much the same question again.
But I think the only other thing that I could add is that if you add a lot of verbal tics to the conversation and you make them elegant, my subjective feeling is that even two seconds average or median, let's say, response time is quite good.
And I know that a lot of the providers advertise something like 500 milliseconds.
I don't think that's realistic.
I don't really see that in the wild very much.
Yeah, I think two seconds is actually quite a decent experience.
If you could get it down probably to like a second and a half, no one would complain is the truth, right?
I think that that's kind of like my base case.
Tell me if I'm wrong if my-.
That tracks with what I would say was like my impression for sure.
I haven't seen any that resemble 500 milliseconds really.
It's but a dream, but it's nice marketing material as well.
It exists on a few demos, but
I think something that's interesting that we found, and this is less relevant for telephony, but
when you have a visual UI as well,
well, actually, it just doesn't work so much on a phone. A phone conversation is mimicking speaking to a human, right?
So you're kind of stuck in that metaphor.
Whereas I think we found in some places, if you add a receiving sound or something as like a visual, sorry, an audible UI,
it actually feels to the user like the same amount of latency feels faster and gives a sense of like, oh, they heard me, right?
I'll shut up now.
And then you can kind of do turn taking because I think turn taking is obviously another big one, but that's something we noticed from our own experimentation and we've seen a few other places.
But obviously with telephony, that doesn't really work because you're...
Yeah.
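(A sketch of that "audible UI" idea: play a short acknowledgment as soon as the user's turn ends, while the real reply is still being generated, so the same latency feels shorter. The play_ack_sound and generate_reply functions are hypothetical placeholders for whatever audio and LLM plumbing an app already has.)

```python
# Sketch of latency masking with an audible acknowledgment. The placeholders
# below stand in for real audio playback and the STT -> LLM -> TTS round trip.
import asyncio

async def play_ack_sound():
    # e.g. a brief "mm-hm" or receiving chime; stub for illustration
    await asyncio.sleep(0.3)

async def generate_reply(user_text: str) -> str:
    # stand-in for the real model call
    await asyncio.sleep(1.8)
    return f"Reply to: {user_text}"

async def on_end_of_turn(user_text: str) -> str:
    # Fire the acknowledgment immediately; don't wait for the model.
    ack = asyncio.create_task(play_ack_sound())
    reply = await generate_reply(user_text)
    await ack  # make sure the chime finished before speaking over it
    return reply

print(asyncio.run(on_end_of_turn("I'd like to book a room for Friday.")))
```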
So I'm gonna add one more thing that may be useful.
Maybe you already know this.
And I also would like some feedback.
So when we launched with GPT Realtime, it didn't have semantic turn detection.
They had server turn detection and it sucked.
And we were just waiting for them to make it better.
And then they released semantic turn detection and we flipped that on and it was quite good.
And then we recently tested their server turn detection and it seemed to be just as good as the semantic turn detection and a hair faster, my subjective experience.
So we've turned off using semantic turn detection.
We're back to server turn detection.
And it just works.
I don't even know what their server turn detection does.
I don't know how it works.
I presume it just waits for silence, but that's too simple.
It must be doing something else as well.
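(For context, switching between the two looks roughly like this on the OpenAI Realtime API: a session.update event with a turn_detection block. The field names follow the Realtime API's documented turn_detection options as I understand them, the numeric values are purely illustrative, and ws is assumed to be an already-open websocket to the Realtime endpoint.)

```python
# Sketch: toggling turn detection on an OpenAI Realtime session via a
# session.update event. Values are illustrative; `ws` is an assumed open socket.
import json

server_vad = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",          # silence-based end-of-turn
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500,
        }
    },
}

semantic_vad = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",        # model judges whether the turn is done
            "eagerness": "auto",
        }
    },
}

def set_turn_detection(ws, semantic: bool = False) -> None:
    # Send whichever mode you want the session to use.
    ws.send(json.dumps(semantic_vad if semantic else server_vad))
```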
I don't know how familiar you are with the real time product.
I imagine very familiar, right?
You have to really distinguish your product against their product.
Yeah.
Actually, we have been focusing more on the speech to text, text to speech at the moment, and we have not been focusing on Realtime, but it's like, yeah, we're getting it.
Phil, I think there's this feeling internally that it is just so good and it feels like
things are going, like, that way.
And even if not everyone's using it, it's gonna become.
I don't know.
It just feels so much more natural.
I don't know.
Yeah.
So do you mean.
You mean streaming versus just, like, yeah.
So.
So we have not.
So we've been kind of doing, like, creating the pipeline where we hook up with, like, text to speech, speech to text, and, like, you're in the middle and you handle the text and stuff, rather than...
That's how we did it initially.
Yeah.
Yeah.
So.
But it just is.
So, yeah, it's like when you play with it, it's just so, so good.
And I think most of the people that we've spoken to have been doing it the kind of way that we've been focused on, but I think more and more people are moving to it, sounds like what you did, because it's just getting better.
It's probably a little faster.
Yeah, it's gonna get better and better, right?
Yeah.
What else can I glean that's useful or I can share that's useful?
Yeah, one random tip we had the other day that was like, if you've got bad audio, it's like asking the user to speak up.
It's, like, a very small thing, but, like, one of the people we spoke to...
That's a nice tip.
Do you have a tip for feedback?
Once, the feedback was so loud, the AI just ended up talking to itself.
It was just like- Really?
Oh no.
I mean, I guess you could do the same thing, right?
If there's an echo or something or.
Like- yeah, but how do you detect an echo, right?
I guess you have to put it in the prompt, right?
If you think there might be an echo, if you find yourself saying the same thing to yourself, you're probably speaking to an echo.
Yeah,
I wonder if there's, like, a kind of more classical model you could run on the audio that detects echoes or something.
I don't know.
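(One cheap check along those lines, as a sketch: compare each incoming "user" transcript against the agent's own recent utterances; if they're near-identical, the agent is probably hearing its own echo. Standard library only; the similarity threshold and the 3-turn window are guesses, not tuned values.)

```python
# Sketch of a cheap self-echo check over transcripts rather than raw audio.
from collections import deque
from difflib import SequenceMatcher

recent_agent_lines: deque = deque(maxlen=3)

def note_agent_utterance(text: str) -> None:
    # Remember what the agent just said.
    recent_agent_lines.append(text.lower().strip())

def looks_like_echo(user_text: str, threshold: float = 0.85) -> bool:
    # True if the "user" input closely matches something the agent said recently.
    user_text = user_text.lower().strip()
    return any(
        SequenceMatcher(None, user_text, prior).ratio() >= threshold
        for prior in recent_agent_lines
    )

note_agent_utterance("Sure, I can book that appointment for Tuesday at 3 PM.")
print(looks_like_echo("sure i can book that appointment for tuesday at 3 pm"))  # True
print(looks_like_echo("Actually, can we do Wednesday instead?"))                # False
```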
Yeah, but that was a really funny one.
It was just like a 30 second transcript of it itself.
And like the poor guy is like, yeah.
Oh my God.
Sometimes it's hellish, isn't it?
Like I got it to accidentally read out some code and it was just like grating.
It feels like the sort of thing like all the spy agencies could use to torture people.
It's like make them listen to code being read out by an AI.
Yeah.
Are you guys doing the speech-to-text?
So are you competing with Deepgram on that portion?
Right?
Are you doing this?
We're using Deepgram.
Yeah.
So we're kind of more like just stitching all together right now.
But to be honest, we're trying to find the edge of the thing where we can really distinguish ourselves, which is why we're trying to do so much research and stuff and just figure out where the pain points that aren't being addressed are right now.
So
that's why we're talking to you and stuff and, like, you know...
How many people on your team?
There's six of us.
How did you and Aiden get hooked up?
I mean, you're in the UK, he's in Canada.
How did that happen?
Yeah, I don't know if you can still hear Aiden's British accent behind the.
Oh,
I see.
I see.
What are you doing in Canada?
It's freezing.
It's even colder than.
Oh, yeah.
Thankfully, I'm in the most temperate bit.
So I did a previous company with our CEO who's based in the UK.
Our other co-founder is here in Victoria.
So we have a kind of odd dual geography thing going on.
But we actually met Jack because
Jack is a podcaster.
He does the best podcast on scaling DevTools.
It's called scaling DevTools.
It's great.
And so we were huge fans and we met him through that and somehow persuaded him to come join us on this journey.
So, yeah, cool.
Very nice.
If I could be of further help, I mean, let me know if there are other people who are similarly situated that you think would be useful for me to have a conversation with.
That'd be cool.
What else?
You know, Felix's and my background, right?
We did a company called SimpleTexting.
We sold it.
We know a lot of people in kind of like the SMS telephony space.
Right.
We sold to Sinch, which is like a Twilio competitor, more Eurocentric, kind of roughly similarly sized.
Yeah.
So also, yeah, other people in kind of the telephony space as well.
Like, this is the industry we've been in for quite a while.
So, yeah, I feel like you guys know your stuff.
I think I actually...
I know we're at time.
I have one actual last question I really want to ask you, which is,
how are you guys thinking about the data capture piece?
Because that's actually something I thought Boardy does very well, which is, like, take your name and your details over text, and then use the voice for, like, a different part of data capture.
And I think that's something we're really interested in because that's coming up a lot with people: you know, names, personal information, transcriptions are just kind of not as accurate as text.
And so systems that use both, like Boardy, really stick out to me, but I've seen a few others that use SMS for data capture.
But I'm just wondering, you guys obviously thought about this from a lot of angles.
So I'd be curious if you have any thoughts.
Yeah.
So, you know, I think that Deepgram... so, you know, we do a fast transcription with Whisper, right?
Sucks, right?
It's terrible, but we do show that at first.
And then once we get the longer transcript from Deepgram, we replace the transcript.
Deepgram doesn't support as many languages as Whisper.
Whisper supports like 98 or something.
Deepgram probably supports 30, 33, something like that.
And so Deepgram is, I think, roughly 90% accurate, not 100%, but roughly 90.
And it's getting better.
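(A sketch of that fast-then-better pattern: show a quick local Whisper draft first, then swap in Deepgram's result when it comes back. The Deepgram call here goes through the plain REST endpoint; the model and option values are assumptions, and publish_transcript is a hypothetical stand-in for however the app updates the displayed transcript.)

```python
# Sketch of the two-pass transcript pattern: Whisper draft first, Deepgram final.
import os
import requests
import whisper  # pip install openai-whisper

def publish_transcript(call_id: str, text: str, source: str) -> None:
    # Hypothetical placeholder for updating the UI or call record.
    print(f"[{call_id}] ({source}) {text[:80]}...")

def transcribe_call(call_id: str, wav_path: str) -> None:
    # Pass 1: quick local Whisper draft, shown immediately.
    draft = whisper.load_model("base").transcribe(wav_path)
    publish_transcript(call_id, draft["text"], source="whisper-draft")

    # Pass 2: Deepgram transcription; replaces the draft when it arrives.
    with open(wav_path, "rb") as f:
        resp = requests.post(
            "https://api.deepgram.com/v1/listen",
            params={"model": "nova-2", "smart_format": "true"},  # assumed options
            headers={
                "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
                "Content-Type": "audio/wav",
            },
            data=f,
        )
    resp.raise_for_status()
    final = resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
    publish_transcript(call_id, final, source="deepgram-final")
```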
Part of what we're trying to do is move; we're currently using Nexmo in our stack.
I think fidelity has something to do with it, right?
Like voice fidelity, right?
Depending on whether or not you're hopping on, like, an HD codec, right?
If you're getting HD voice, right, probably transcription is going to be better.
If you're getting, like, you know, cruddy voice, like 8 kHz PCM, it's not so nice.
So we're trying to improve audio quality, which will hopefully include transcript quality, which will include data collection quality.
But, you know, the real answer is that we're relying on people to check the transcript, to check the audio.
If something doesn't smell right.
And that seems to be okay for the time being.
Like, no, we've not had any complaints that, like, hey, the data capture was bad.
Just not a thing to be quite honest.
I agree that data capture is important, probably even more so in kind of critical applications, maybe like medical scribe applications.
Right.
Like, the wrong medication can kill someone.
Right.
So, like, if the doctor prescribed something that sounded benign and then the guy's getting morphine, right?
That's a problem.
So I've not seen the mission critical stuff and no one is complaining is the short of it.
And we're relying on people to just listen to the audio if it doesn't look like the right thing.
We do have a robust data collection feature.
That's one of the things that people use a lot, right?
So if you remember our platform, there's the transcript and there's the summary above the transcript.
And then there's, like, the fields that you wanted to suck out, like the specifics, you know, the address, the first name, are you a current client, how much money do you make a year, whatever it is, kind of like the lead qualification, if you will, the data collection portion.
So we suck that out and we put it above the actual summary, right?
Like first name, last name, whatever it is, whatever data it is you're collecting.
And that's just not a piece of feedback we've had.
Latency, we have people complain.
Everything comes back to latency.
Yeah, yeah, yeah.
Latency and really realism, right?
The reason I use realism is because yes, you can use some tricks, right?
That kind of like, you know, takes some of the edge off latency issues.
Yeah, that's a really, really good data point.
Thank you.
I think that's insightful.
Jack, did you have anything else before we wrap?
Sorry, I know we've gone slightly over time.
Gene, really appreciate it.
No, it's okay.
I'm enjoying chatting with you guys.
Yeah, yeah, no, I was just gonna say thank you, Gene.
It's
incredibly interesting and, like, yeah, as Aiden said, you know your stuff, like, it's so cool.
Yeah, yeah, appreciate it.
Appreciate the call.
If you can think of anyone to introduce me to, or if you're going to create, like, a little networking group, which will probably be useful for you because it's people whose feedback you need, you know, please add me.
I'd be very appreciative.
If you're ever in Florida, ring me up.
Get a beer or something or some lunch.
It's also a nice place to go when it's cold in Canada.
I don't know what temperate area in Canada you're talking about because everything is going to freeze over pretty shortly.
Yeah, it's a lot warmer down where you are.
We might take you up on that.
Jack, did we, should we send an email to book the next one of these?
Yeah, Gene, would it be okay to do any follow-ups on this, to ask you, you know?
Yeah, no problem.
Yeah.
Okay, awesome.
I'll send you an email about that, if that's okay.
Sure.
I'll... I'll send you the transcript, if my Notion was...
Oh, yeah.
That'll be amazing.
Thank you.
Thanks.
All right, take care, guys.
Bye.
Bye.
Bye.