Anthony Rego from Aide #1
Yeah, I do have a little bit of background, but just before I ask you some of the questions that we've got, I just wanted to get a little bit more just on what you're working on and stuff, just so that I can frame questions.
Sure.
So the company we started about two and a half years ago, we were actually just doing normal, like,
chat-based assistance and things like that.
We've pivoted probably a million times at this point and eventually landed on voice agents.
And we started as a very generic platform and we've been going into more niches.
We've actually been getting into construction.
So on-the-job reports that usually would have been made by someone going to their computer, opening it up, and filling out a form.
Now they can just call somebody, call an AI, and then have it done right on the phone.
And we do things like when you submit the report, it'll send you a text message with the link so you can open it up and confirm all the details and then send it off, stuff like that.
And we've also expanded now, not just voice, but WhatsApp and text message, and it's all one agent.
So you can change mediums and all the context stays the same.
Or you can even text the AI and be like, hey, give me a call in like a half hour or something.
And then it'll call you then, or call you right away.
That's really cool.
So right now we're still kind of experimenting on which niche to really get into.
Like we have like pilot programs with this construction business.
We tried plumbers and other trades and things like that, but there just really wasn't enough pull from them.
And also, I feel like plumbers are just very price-conscious.
And as with all AI at this point, everything's a race to the bottom.
So we just kind of thought, oh, maybe we should just go for bigger contracts.
Interesting.
That's very interesting.
And I'm really excited about getting into the voice space because I just love it.
I kind of wanted to get into it from the beginning.
I have a very deep interest in making my own voice assistants at home because I want it all to be local.
Like, have it on my 4090 back there.
Dude.
Oh, yeah, yeah.
And I wanted to make sure the Go service that we were building for our voice product was able to run with any transcription, any kind of text-to-speech, any kind of LLM, so I can just use open-source stuff locally and have it be completely self-contained here.
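[For illustration only: the service described here is written in Go, but a rough Python sketch of the kind of provider-agnostic design being described might look like the following. All names (Transcriber, Responder, Synthesizer, VoicePipeline) are hypothetical, not the actual code.]

# Hypothetical sketch: swap any STT, LLM, or TTS backend (hosted or local)
# behind small interfaces so the pipeline never depends on one provider.
from typing import Iterable, Protocol


class Transcriber(Protocol):
    def transcribe(self, audio: bytes) -> str:
        """Turn an audio chunk into text (e.g. Parakeet locally, Deepgram hosted)."""
        ...


class Responder(Protocol):
    def respond(self, transcript: str) -> str:
        """Generate the agent's reply text (any hosted or local LLM)."""
        ...


class Synthesizer(Protocol):
    def synthesize(self, text: str) -> Iterable[bytes]:
        """Stream audio for the reply (any text-to-speech backend)."""
        ...


class VoicePipeline:
    """Glue code that depends only on the three interfaces above, so
    open-source backends can be dropped in for a fully local setup."""

    def __init__(self, stt: Transcriber, llm: Responder, tts: Synthesizer):
        self.stt, self.llm, self.tts = stt, llm, tts

    def handle_turn(self, audio: bytes) -> Iterable[bytes]:
        text = self.stt.transcribe(audio)
        reply = self.llm.respond(text)
        return self.tts.synthesize(reply)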
And you really can.
Parakeet from NVIDIA is a really good speech-to-text.
Well, it's pretty decent anyway, at least in terms of speed.
It's incredible.
And you definitely have to make some tweaks to it.
And also their newest version is multilingual with 25 European languages, which is really cool.
But for production, we have it switching between Deepgram and ElevenLabs for speech-to-text.
And we tried a few others.
Resemble had one, and AssemblyAI.
We tried using theirs for a bit, but we found the latency changes day to day, it seems.
In terms of quality, I liked Deepgram the most.
But the latency was not great.
And I don't have access to something like Krisp.
You know, there was one company I remember that had some sort of special deal where they could run Krisp on their own infrastructure.
But I'm not sure if that was a special deal they had to make; I couldn't figure out how to do that without having to talk to the company itself.
But to the point, though: I think the most important part is getting transcription right, in terms of accuracy and speed.
That's so important.
Honestly, the most important part of voice services, I feel, is balancing transcription accuracy and speed. That's the biggest driver here.
In my opinion, anyway.
Text-to-speech is not such a big deal.
I mean, it matters if you want the quality to be really high, if you want it to sound natural.
But in terms of speed, I'm not worried about that.
Because especially with the system that we have set up, I'm constantly generating responses as the transcription is coming in.
So it's like giving response candidates.
And then once we've confirmed that the user has stopped speaking, that's when we're like, okay, we have all these response candidates; we probably only take the last couple of them, do a little comparison, and then you can kick it off to the text-to-speech.
But to speed that up even further, for every response candidate you can also just start spinning up and buffering all the text-to-speech, so that once you've figured out the right response candidate, you can start streaming it back almost immediately.
So the response time on text-to-speech is not hugely important, not as important as the transcription.
It's because you don't want people waiting for a response.
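[A rough Python sketch of the response-candidate idea described above: generate a candidate reply for each partial transcript, pre-buffer its text-to-speech, and once end-of-turn is confirmed pick from the last few candidates. Names and the comparison logic here are illustrative assumptions, not the production Go implementation.]

# Hypothetical sketch of speculative response candidates with pre-buffered TTS.
import threading


class CandidateBuffer:
    def __init__(self, llm, tts):
        self.llm = llm        # callable: partial transcript -> reply text
        self.tts = tts        # callable: reply text -> list of audio chunks
        self.candidates = []  # (partial_transcript, reply_text, audio_chunks)
        self.lock = threading.Lock()

    def on_partial_transcript(self, partial: str) -> None:
        """Called each time the transcriber emits an updated partial result."""
        reply = self.llm(partial)
        audio = self.tts(reply)  # start synthesizing/buffering audio up front
        with self.lock:
            self.candidates.append((partial, reply, audio))

    def on_end_of_turn(self, final_transcript: str):
        """Once we're confident the caller stopped speaking, compare the last
        few candidates against the final transcript and stream the best one."""
        with self.lock:
            recent = self.candidates[-3:]
            self.candidates.clear()
        for partial, reply, audio in reversed(recent):
            if partial.strip() == final_transcript.strip():
                return audio  # already buffered: can stream back immediately
        # No candidate matched the final transcript; fall back to a fresh pass.
        return self.tts(self.llm(final_transcript))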
Totally.
One thing I found is that people are more okay with waiting if you have some sort of sound playing while it's thinking, like a little boop-boop-boop or something like that.
People tend to be more willing to wait around for that.
But if you don't, if it's silence for more than a second and a half, people just bail or
they start interrupting.
Like, hello, are you still there?
It's just a bad experience in general.
I'm not sure if you've seen this yourselves, but yeah, yeah, yeah.
Some of this is, like... well, anyway, I'm not going to go into what we're thinking at the moment, just to hold off for a sec so we can get your pure thoughts.
But actually, did I explain to you the technical advisory board?
I don't think... oh, actually, I don't think I did. Sorry, I think I went right into it.
I was too excited.
Okay, cool.
Yeah, yeah.
So basically, we're just looking to get like a small group of people who we chat with once a month for half an hour for six months.
And really it's just a chance.
Hopefully it feels a little bit like therapy.
Just, like, unload everything you're feeling about voice, and lots of open-ended questions and stuff.
So we can talk about anything and then it will shape our roadmap of what we build and we'll share everything back with you and we're going to create some cool swag.
And also, if you would like to, we can try and introduce you to other people in the technical advisory board, if you want to compare notes with other people that are building and stuff as well.
Yeah, that's awesome.
Amazing.
Clearly, I am kind of the most technical person in my company, and I've been dying to talk about this stuff with somebody that can relate.
And I can't talk to my wife about it because she's sick of hearing about it.
Yeah, I understand.
And actually, I have more time.
I really want to dive into the whole, like, how you're running it locally and stuff.
I actually have been thinking a lot about that myself just as a personal curiosity as well.
I might try it.
So I'm just going to ask you a few questions.
Sure.
So the first question is: you're building with voice AI, in general, everything that you're doing around that.
If you could wave a magic wand at anything to make it better, easier, whatever, where would you wave it?
I mean, transcription.
If you could get that perfect and fast, the rest would be easy.
Like this.
Yeah.
It'd be simple.
So that's just, yeah, straight up, like.
Yeah.
And how would it change your life if you had that perfect and fast transcription?
I mean, you could just do so much more in that time that you have.
You wouldn't have to... I wouldn't have to worry about, like...
I have this whole separate thread that just does healing of the transcription, because there are so many mistakes that can be made along the way.
Yeah, I mean, that consumes so much time and resources that could be dedicated towards generating better responses.
Or we wouldn't have to do as many response candidates, because things like turn-taking would be a lot easier.
It makes the whole process afterwards just so much simpler if we didn't have to worry about it.
And healing, you mean like, oh, so.
The way I do the transcriptions, too, is that it's in its own thread, and it's doing it at different intervals, constantly.
So basically there are different segments at different time intervals.
And then it's basically giving that to an LLM and being like, hey, here are all the time intervals.
Here's what the transcription said for each of these.
And here's the transcription of what we have so far.
What do you think is actually being said here?
And actually, that does a really good job of putting it together.
And it can sometimes read between the lines of what's basically being said.
That's what I found to be really good.
But I think this feels like kind of a workaround for the speed and inaccuracy of transcription models as they are.
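[A minimal sketch of the transcription "healing" step as described: hand the overlapping timed segments plus the running transcript to an LLM and ask what was actually said. The prompt wording and the `complete` helper are assumptions for illustration, not the real implementation.]

# Hypothetical sketch: reconcile overlapping transcription segments with an LLM.
def heal_transcript(segments: list[tuple[float, float, str]],
                    running_transcript: str,
                    complete) -> str:
    """segments: (start_sec, end_sec, text) from transcription passes taken
    over different, overlapping time intervals; complete: any LLM text call."""
    lines = [f"[{start:.1f}s - {end:.1f}s] {text}" for start, end, text in segments]
    prompt = (
        "Here are transcription attempts over different time intervals:\n"
        + "\n".join(lines)
        + "\n\nHere is the transcript we have so far:\n"
        + running_transcript
        + "\n\nWhat do you think is actually being said? "
          "Return only the corrected transcript."
    )
    return complete(prompt)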
Yep, that makes sense.
And is there anything that's different now that makes that more valuable versus like a year ago?
Well, I mean.
What do you mean exactly?
So I guess it's like, if you had that transcription that was perfect, are there any kinds of opportunities or things that are different now versus a year ago that make it more valuable to have that perfect and fast transcription?
Yeah, I mean, I think from just a year ago, like since we have faster LLMs that are higher quality, you know, like Gemini Flash is like fast and really good.
I mean, it does a decent job for its speed.
And that's not really something we had like a year ago.
Like even if you had fast transcription a year ago, it would have still been a pretty rough experience, I think.
So I think the rest of the tech revolving around voice is coming together to a point where, you know, a year from now we'll be in a better spot.
I still think turn taking has got a lot to go.
And I know there's some people that are into, like, models for turn taking.
But.
Those are still, I think, not quite there yet.
That's a whole, that's a whole separate conversation we can get into.
Well, yeah, perhaps.
I kind of wanted to, like, ask the question again: is there, like, a number two thing that you would wave the magic wand at?
Turn taking.
Because that is the biggest pain.
I guess that you can lump interruptions with that as well.
Because that is another hard point to get.
Right now we actually use ElevenLabs as our primary voice agent and our own service as a backup, because I don't feel our interruptions are quite there yet.
So their whole end-to-end thing... you mean you use their kind of end-to-end conversational AI?
Yeah, yeah, as a primary.
But since they go down all the time, we're flipping back to ours.
It is quite an unreliable service.
And actually, there's also a lot of things I don't like about ElevenLabs.
One, they're really bad at short responses, like a yes or no.
It sometimes just doesn't pick them up at all.
Whereas the way I built our service, it's very good at picking those up, with the way we do those transcription segments.
So there's a bit of give and take there, but I would still rather use our own. But, you know, I'm still a bit of a perfectionist on that.
So I want to make sure I have the turn taking and interruptions absolutely perfect for going fully live.
Yeah.
And how would having better turn taking change your life?
Oh, oh my God.
You know, really what it is, is we would be able to cut down on the silence between a user finishing speaking and when we start speaking.
'Cause right now I'm a little too conservative on that, I think.
That's why we started putting in a little musical tone while we're waiting to be really sure that the user has finished speaking.
Yeah, that makes sense actually.
It's all kind of downstream of turn taking.
Yeah.
Which again is all downstream of...
Transcription.
Okay, let me just ask one more time.
Okay, so we've got the biggest thing as transcription, and the second thing is turn-taking.
Are there any number three things that you would wave a magic wand at?
You know, I think,
and this is more of a personal thing for me.
I would say
some sort of knowledge base technology, like RAG.
But I've been messing around with knowledge graphs, and if I could get that fast and have it do multi-turn thinking very quickly in an efficient manner, then you could just do so much more with that.
You could have much more flexible agents that could follow instructions better without having a gigantic prompt with everything.
And even with a gigantic prompt, that's not good for, one, latency, and two, keeping to task.
'Cause things get lost; you can't really do too much there.
Yeah, so I think that would be a big thing.
A lot of stuff we kind of... we do a lot of hacks to get around this.
When the transcription gets too large, if it's been a long call, we do things like compressing it down: all right, let's do summarizations of this section of the call, and then we're keeping that prompt as short as possible.
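[A hedged sketch of the compression hack described here: when a call runs long, summarize the older sections and keep only the recent turns verbatim. The thresholds and the `complete` LLM helper are made-up illustrations.]

# Hypothetical sketch: keep the prompt short by summarizing older conversation.
def compress_history(turns: list[str], complete,
                     keep_verbatim: int = 10, trigger: int = 30) -> list[str]:
    if len(turns) <= trigger:
        return turns  # short call: keep everything as-is
    older, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
    summary = complete(
        "Summarize this earlier part of the call in a few sentences, keeping "
        "names, numbers, and decisions:\n" + "\n".join(older)
    )
    return ["[Summary of earlier conversation] " + summary] + recent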
I guess another way of saying it is that I would love to wave the wand at an LLM that could actually have a long prompt without it affecting its performance.
I think that's probably asking too much at this point.
The way I like to think about it is that humans can only keep about seven things in their head at one time anyway, so it makes sense that an LLM is probably somewhat similar. I think it can do a little bit more than that, but I don't like asking it to do too much in one prompt, because it won't do something very well.
But if you ask it to do one or two things, it'll, it usually nails it.
Yeah, it's interesting how it kind of somewhat goes to human behavior.
Well, I mean, it's based off of human language, which is an externalization of, you know, the network dynamics of our brain.
Sure.
You can make the... the argument can be made.
It's so true.
That's so true.
Do you use tool calls and stuff like that?
Do you do much?
Yeah.
Although I don't use, like, any of the official LLM tool-calling APIs.
We're just implementing it ourselves.
I prefer that way.
I don't want to get locked into, you know, I know like tool calling is fairly common between LLMs now, but I don't want to be stuck with any particular LLM.
I want to be able to just switch them in and out and not have to worry about porting code over for tool calling.
So we do everything ourselves.
It's all XML, yada, yada, yada.
Yeah, yeah.
And that works great for me.
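[An illustrative sketch of the roll-your-own, provider-agnostic tool calling described here, parsing plain XML-style tags out of the model's text so any LLM can be swapped in. The tag schema and the tool registry are hypothetical, not the actual format used.]

# Hypothetical sketch: XML-style tool calls parsed from raw model output.
import re

TOOLS = {
    "send_sms": lambda to, body: f"sent '{body}' to {to}",  # made-up example tool
}

TOOL_RE = re.compile(r'<tool name="(\w+)">(.*?)</tool>', re.DOTALL)
ARG_RE = re.compile(r'<arg name="(\w+)">(.*?)</arg>', re.DOTALL)

def run_tool_calls(model_output: str) -> list[str]:
    """Scan the LLM's text for <tool> blocks and execute each registered tool,
    so no provider-specific tool-calling API is needed."""
    results = []
    for name, body in TOOL_RE.findall(model_output):
        kwargs = {k: v.strip() for k, v in ARG_RE.findall(body)}
        if name in TOOLS:
            results.append(TOOLS[name](**kwargs))
    return results

# Example of output the parser would handle:
# <tool name="send_sms"><arg name="to">+15550123</arg><arg name="body">Report link</arg></tool>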
Okay.
That's, yeah, I mean, a lot of them are pretty lightweight anyway, aren't they?
They are, they are.
But I think it's my ethos kind of thing.
No, no, no.
It's not like you're trying to, you know, write your own FFmpeg library or something like that.
Yeah, exactly.
It's like, yeah,
yeah, that's, that's awesome.
Okay, those are the main questions that I had and it was extremely helpful.
Yeah.
Just one very broad question: for swag, what kind of swag would you actually like and use, or wear, if we got some?
Oh.
I don't know.
I mean, a t-shirt that would go hard.
Okay, cool.
Cool.
Yeah, we're trying to think of something.
Our biggest delay in creating it so far is that we're trying to come up with something clever.
I don't know if you've seen some of the cool ones some companies have, like "GPU poor" or whatever, these kind of very fun ones.
Yeah, I have not noticed.
I think being so isolated in New York, I've been out of the swag game.
We'll bring you back in.
Yeah, that's really cool.
Anthony, those were... that's all we had.
I'm going to send you, if it's okay, another invite that's got, like, five slots, kind of recurring, so you can just pick a slot that works.
And then obviously if you can't do any of them or you want to drop out, like totally understand because it's, you know, starting a company gets, you know, busy.
But yeah, we'll try to make it useful as well.
And we're going to share what we find as we go through the conversations, so the patterns and stuff like that, and anything else we learn.
Yeah.
Do you have any questions for me, or...?
No, no questions.
I mean, I would love to, in further conversations, just, like, compare notes as much as possible.
Yeah, it's fun just talking about it.
Yeah,
yeah.
It's like so clear how passionate you are about it.
It's really cool actually.
I think it's so fun.
I love media stuff in general.
I think working with media just feels much more unfigured-out versus, like, text and stuff, I don't know.
Oh yeah.
I mean, I love... this reminds me, I feel like I've had this kind of energy since I was in college.
And I used to, like, I loved making my own video game engines and getting really in the nitty gritty with GPUs and figuring out how to make things as fast and efficient as possible.
And I feel like I'm getting that same kind of rush with the voice agents.
Yeah.
That's why I've been getting super deep into the, the multi-threading of it.
Have you ever thought about, because something that I really wanted to do, just so I could learn how it works, is, like, building your own model, even just a toy one? Just for... yeah, you have.
Yeah.
Actually, so, like a year before I started this company... I'd left Deliveroo in, like, 2021, and then we started this in 2022.
But in that year, I spent just time making my own models and really just learning as much as I could.
Because I knew I wanted to get into AI.
I left Deliveroo just with that in mind, because over the years I just always wanted to.
I was like, okay, let me, I started with the video models actually.
One of my closest friends, he's in visual effects here in New York.
So I've been collaborating with him over the years, and I thought, oh, I really think it'd be cool to get into VFX and see what AI can do with that.
And I just started; I made my own models around cleaning up video.
I was getting annoyed because sometimes there would be... it's a really weird thing to bring up, but...
I was watching the show Grand Designs, and I was into videography and photography, and I noticed in some shots they had dust on the lens and you could see it in the shot.
And I was like, come on, you guys should be able to just filter this out or clean this up.
So I built a model that was called Dust Remover as just a way to learn how these models work.
And I would just clean up all the episodes of Grand Designs that I had downloaded, so that I could stop being annoyed by it whenever I watched.
It worked?
Yeah.
Oh, yeah, it worked.
How did you train it? How did you teach it?
The training data set, I actually had that all generated in Blender.
So what I would do is take samples from the episodes of Grand Designs.
And then I had a lot of photos from my own photography where I'd noticed dust on the lens, so I had examples there.
But then I wrote a little Python script in Blender that would automatically add this fake dust into existing videos.
And then I'd be like, okay, I have the dataset of it clean, and I have it with the dirtied ones I made.
And then that's how I built the data set.
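[The actual pipeline was a Python script inside Blender; as a simplified illustration of the same idea, here is a numpy/Pillow sketch that composites random dust specks onto clean frames, so every dirtied frame has a known-clean target for training.]

# Hypothetical sketch: generate (dirty, clean) pairs by adding synthetic dust.
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFilter


def add_fake_dust(clean: Image.Image, num_specks: int = 8) -> Image.Image:
    clean = clean.convert("RGB")
    w, h = clean.size
    overlay = Image.new("L", (w, h), 0)
    draw = ImageDraw.Draw(overlay)
    for _ in range(num_specks):
        x, y = random.randint(0, w - 1), random.randint(0, h - 1)
        r = random.randint(2, 12)  # random speck position and size
        draw.ellipse([x - r, y - r, x + r, y + r], fill=random.randint(40, 120))
    overlay = overlay.filter(ImageFilter.GaussianBlur(2))  # soft, out-of-focus look
    frame = np.array(clean, dtype=np.float32)
    alpha = np.array(overlay, dtype=np.float32)[..., None] / 255.0
    dirty = frame * (1.0 - 0.6 * alpha)  # darken the image under each speck
    return Image.fromarray(dirty.astype(np.uint8))

# Training pairs: (add_fake_dust(frame), frame) for each clean extracted frame.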
That's absolutely genius.
And it would randomly generate which frames, which part of the frame, and the size of it.
Because dust, I guess, is simple enough.
It's very simple.
That's actually, yeah, that's so smart.
That's so smart.
Because I tried to, this was a few years ago now, I wanted to try to build an AI.
I used to play this game RuneScape back in the day.
And I was trying to get it to recognize these stupid cows and just click on a cow.
And I think I got stuck on the data part, just how much labeled data I'd have to have.
And I didn't get very far with it, but it was like, that is the toughest part.
That's why I'm all for synthetic data for training.
Microsoft is really doing a great job on that.
Like, I had done that before they came out with a very important paper on Phi-4 that was related to how they did synthetic data for all of their training.
And I was like, yes, I'm not the only one that thought synthetic data rules.
I need to look into what Microsoft's doing there.
I didn't know they were doing that.
They have a really good text-to-voice model they just came out with, like, a week or two ago.
Oh, really?
Yeah, yeah.
I would say it's very close to, like, Chatterbox from Resemble.
But I think it's a lot better than that, from what I've seen anyway.
Okay, awesome.
That's a weird name.
It's like a very generic sounding name.
Azure AI speech?
No.
Oh, no, no, no, not that.
Oh, VibeVoice.
VibeVoice, yes.
Oh, yeah.
That's very silly.
That doesn't matter, I guess.
Sorry, Anthony, I realize we're over.
Oh, yes.
Sorry.
Sorry.
Thank you so much for your time.
Really appreciate it.
And hopefully chat in about a month.
I'll send you a message.
Sure.
Yeah, I'd love to keep chatting.
Amazing.
Thank you.
See you later.
See you.