
Giacomo Miceli TAB 2 (OpenAI Realtime)

Before I start elaborating, can you say briefly whether those pain points are exclusively from people using a combination of LLMs with speech-to-text and text-to-speech as a pipeline?

Or have you also recorded those top pains from people who use speech-to-speech models directly?

That's a real question.

Yeah,

I think everyone we've spoken to has done speech-to-text, text-to-speech.

Yeah.

Yeah.

Well, I'll tell you my opinion, but remember that we are trying to pull this off with OpenAI Realtime.

And if it's not OpenAI Realtime, we would steer towards Gemini real-time.

I forget what it's called.

Or one of the competitors; basically, models that take input audio directly and output speech without intermediation, without having to go through a pipeline.

Why?

Why are we doing that?

Because there is important metadata that gets discarded, emotional metadata.

The way you're uttering certain phrases is important for what we do because we're dealing with emotional material.

I mean, it's not like we're doing psychotherapy or anything, but we're analyzing literature, we're talking about feelings, and people will ask questions about a certain novel and the way the question is asked carries metadata.

And this metadata is destroyed if you pass only the words, in letters, through a large language model and throw away the way they have been pronounced.

So transcription for us is not as important because

most of the time the model understands what we're talking about.

Your second point, it's hard to reliably know how well conversations are doing and what difference your changes make.

That's their problem.

They haven't set up their evals correctly, I think.

That is a startup idea by itself.

Just helping companies that are working with AI set up their evals (evals, however you want to pronounce it) is crucial.

And not everyone is capable of doing it.

Not everyone is focused enough on that.

So I understand that the pain point exists.

But, you know, I think solutions for that exist that are pretty clear, if your goal is clear.

Onboarding: do you mean onboarding onto your specific platform, or do you mean onboarding in the sense of working with models?

Yes.

I should elaborate that.

Onboarding new customers.

So a lot of people have had that when a new customer joins, I think, especially in B2B use cases.

There's often lots of integrations, which I guess doesn't really apply here, but maybe there's an analogous one of getting users set up and knowing how to use the platform and stuff like that.

Yeah, that makes sense.

Well, I cannot say specifically because I haven't been onboarded, so I cannot give you, you know, specifics.

I guess I can.

More for your platform.

So, like, if there were challenges around getting your users to get on board successfully.

Oh man, we wish we were already there.

We haven't launched.

I think if we're lucky, we're gonna launch in three to four weeks.

That's my estimate.

Nice.

And the onboarding, ours, it's B2C.

So we're dealing with random people that have absolutely no idea of how AI works.

The only thing that they need to be able to do is click on the big fat button that says talk.

And so we're confident that they will be able to do that.

What is less clear is how flexibly they will use the tools that we make available to them.

Your pain point number four is the reason we went for what I keep calling speech-to-speech; it's not really a solidified name, but you know what I mean: those models that just take input speech.

Yeah.

So the turn-taking, the, you know, less-than-ideal behavior, and the latency are what brought us to this solution.

And, you know, OpenAI two weeks ago released the official version of their Realtime API, so it's no longer beta.

It works better, it works.

Yeah, it does seem smarter, and I haven't tested this yet, but it seems like it contains a voice activity detection and turn-taking mechanism that is semantically aware, meaning it's not just dumbly listening for a certain number of seconds; it also checks whether you're probably done with your sentence given what you just said.

So I think that should resolve a lot of those problems by offloading them to the model.

But again, I haven't tested that yet, so I don't know if it's really, you know, a panacea.

I don't know if it really resolves all those problems.
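As a rough sketch of what enabling that semantic turn detection might look like (assuming the Realtime session config exposes a turn_detection block along these lines; the exact field values are guesses, not a tested configuration):

```python
import json

# Hedged sketch: a session.update payload enabling semantic turn detection
# instead of plain server-side VAD. "eagerness" is the knob that trades off
# how quickly the model decides you are done speaking.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",   # instead of "server_vad"
            "eagerness": "medium",    # "low" waits longer, "high" jumps in sooner
        },
    },
}

# Sent over the already-open Realtime websocket, e.g.:
# await ws.send(json.dumps(session_update))
print(json.dumps(session_update, indent=2))
```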

Yeah.

But yeah,

We're going ahead with that one.

And if you recall, during our last call I was describing to you how we wanted to use Pipecat Flows, which is a plugin for the Pipecat framework.

And the reason we wanted to do that was that there are different conversation modes, and we want our backend to switch dynamically between those conversation flows.

That was a month and a half ago.

Then shortly after, we discovered that we couldn't, because it is not compatible.

Pipecat Flows is still, as of today, not compatible with real-time models.

Why is it not compatible?

Basically, what Pipecat Flows solves is the problem of adapting the context and the tools that the model gets as input dynamically as you progress through the conversation.

That is crucial to us because, suppose we have 50 tools: we can't just dump 50 tools on the model and say, hey, figure out which one is best, because it just won't work.

Models are still not that smart, and arguably there's just too much noise in passing dozens, if not hundreds, of tools.

This would have been our solution, Pipecat Flows.

The good news is that, as of I think two weeks ago, the new Realtime API that was officially published by OpenAI should offer the possibility to make changes to the system prompt and other variables, including the tools, in real time.

As the conversation progresses, without having to turn it off and on again, which obviously is not ideal.
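For illustration, a hedged sketch of what that mid-conversation reconfiguration could look like, reusing session.update to swap instructions and tools when the conversation mode changes; the two modes and the tool schema here are made up for the example, not the product's actual ones:

```python
import json

# Hedged sketch: swapping the active tool set and instructions mid-conversation
# by re-issuing session.update on the open Realtime connection.
MODES = {
    "playback": {
        "instructions": "You control audiobook playback. Keep replies short.",
        "tools": [
            {
                "type": "function",
                "name": "skip_ahead",
                "description": "Skip forward by a number of seconds.",
                "parameters": {
                    "type": "object",
                    "properties": {"seconds": {"type": "number"}},
                    "required": ["seconds"],
                },
            }
        ],
    },
    "discussion": {
        "instructions": "You discuss the novel's themes and characters.",
        "tools": [],  # no playback tools while analysing the text
    },
}

def mode_update(mode: str) -> str:
    """Build the session.update event that switches the conversation mode."""
    cfg = MODES[mode]
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": cfg["instructions"], "tools": cfg["tools"]},
    })

# e.g. await ws.send(mode_update("playback"))
print(mode_update("discussion"))
```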

Long story short, this is my biggest pain point right now.

How do I solve this without having to backtrack onto the speech-to-text, LLM, text-to-speech bandwagon, as we could have done a long time ago?

I prefer to remain with the current pipeline because we just love how snappy and thoughtful the responses are, and how they take into account the context of the voice and the way things are pronounced.

We love the way the voice also adapts to particular vernaculars.

If you speak with a certain accent, you can ask the voice to speak with that accent, or you can switch language during the conversation, even in the middle of it, and the model does the same.

All of that stuff.

Very niche.

I agree with you that it's probably not our core market, but it's so cool.

It's a fun part.

Magic.

Yeah.

It does feel like magic.

Yeah.

Okay, so this is the biggest problem for us.

Yeah.

Yeah.

And it feels like it's just a matter of time until OpenAI solves this.

Do you think so?

I'm not sure that they already solved it.

I think they might have solved it.

For sure, Pipecat has not solved it.

And it's one thing that I check basically every morning with my morning coffee.

I'm sipping coffee and seeing if there are any updates on GitHub.

And that hasn't happened yet.

And so, short of building my own solution at a lower level, which is a big distraction, it seems like we're gonna launch our product with a smaller set of tools than the one we originally wanted to launch with.

So specifically, we wanted to have skip ahead, go backwards, return to the beginning of the chapter, stop playback, resume playback, all this kind of stuff.

Those are all different commands, different tools at our disposal, you know.

And then there are, you know, like tools that are more semantic.

For instance, go back to the part when he kills the woman, you know, just like a random thing.

Like maybe you forgot exactly what happened.

And in a book, in a physical book, it would be obvious: you just go back a couple of pages, you read the scene, and then you come back, because maybe you missed something.

Doing that with audiobooks is so much more awkward, but not anymore in theory.

In theory, this problem is solvable and perhaps already solved by us.
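Purely as an illustration of such a semantic tool, here is a hedged sketch of a function-tool schema plus a toy resolver; the tool name, the scene index, and the string-matching shortcut are all stand-ins (a real version would more likely search embeddings over chapter segments):

```python
from difflib import SequenceMatcher

# Hedged sketch: a "semantic seek" function tool and a naive resolver.
SEEK_TO_SCENE_TOOL = {
    "type": "function",
    "name": "seek_to_scene",
    "description": "Jump playback to the scene the listener describes in their own words.",
    "parameters": {
        "type": "object",
        "properties": {
            "scene_description": {
                "type": "string",
                "description": "Free-form description, e.g. 'the part where he kills the woman'.",
            }
        },
        "required": ["scene_description"],
    },
}

# Toy index: (summary of a passage, start time in seconds). Invented entries.
SCENE_INDEX = [
    ("the narrator arrives at the villa and meets the gardener", 120.0),
    ("he confronts and kills the woman in the greenhouse", 1840.5),
    ("the funeral and the reading of the will", 2610.0),
]

def handle_seek_to_scene(scene_description: str) -> float:
    """Return the timestamp of the indexed passage most similar to the description."""
    best = max(
        SCENE_INDEX,
        key=lambda item: SequenceMatcher(None, scene_description.lower(), item[0]).ratio(),
    )
    return best[1]

print(handle_seek_to_scene("go back to the part where he kills the woman"))
```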

But that selection of the right tool when you have dozens of tools is what is challenging us right now, because we have too many.

So we have to restrict the number of features that we will launch with, in the hope that future versions of the frameworks we're using will support a more intelligent way of partitioning the context.

Have you looked at having, like, a secondary model that just has those tools, that isn't in the conversation, that's just listening?

And then when you say, like, skip ahead, it just, like, can skip ahead.

And, for instance, you mean, like, a filter that is triggered before we pass it to the other model, to the general-purpose model?

If you had the conversational model, but you were also tracking the audio and feeding it into another model that was just listening to see if it needs to do anything, and then if it heard skip ahead or something, it could skip ahead, and you'd somehow have in the prompt that if the user asks to skip ahead, there is another model that will handle that for you, and you can assume it happens, or something.

I think it could be an interesting approach, but I would have to think a little bit more about it.

But I think it might just be kicking the can down the road because then the original model still doesn't know that that has happened.

Yeah.

So unless you have, well, I guess that the other model could have the privilege of, you know, turning off the entire conversation.

But it sounds potentially over engineered.

Yeah, short answer is we haven't experimented with that, although I know that it is possible to use the same audio and do other things with it, because we're doing that.

We're saving the audio, reprocessing it, analyzing it, et cetera, et cetera.

So it would theoretically be very feasible, but, yeah, we prefer to stick with the current approach.
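As a sketch of the idea discussed above, here is a much simplified version of that side listener: the version described in the conversation would use another model, but a cheap keyword pass over the transcript shows the routing shape; every name and pattern here is illustrative:

```python
import re

# Hedged sketch of the "secondary listener" idea: a cheap keyword/regex pass
# over the incoming transcript that fires playback commands directly, leaving
# everything else to the main conversational model.
COMMAND_PATTERNS = {
    "skip_ahead": re.compile(r"\bskip (ahead|forward)\b", re.I),
    "go_back": re.compile(r"\bgo back\b", re.I),
    "pause": re.compile(r"\b(stop|pause) (the )?playback\b", re.I),
    "resume": re.compile(r"\bresume( playback)?\b", re.I),
}

def route_utterance(transcript: str):
    """Return a playback command name if one matches, else None (defer to the model)."""
    for command, pattern in COMMAND_PATTERNS.items():
        if pattern.search(transcript):
            return command
    return None

print(route_utterance("can you skip ahead a bit please"))  # -> skip_ahead
print(route_utterance("why did he kill the woman?"))       # -> None
```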

Yeah.

Yeah.

Yeah.

Is there anything else?

Like, obviously this is number one.

Is there a number two for you that's like really jumping up?

Because I know the ones that we had didn't really hit your problems, your pains.

Specifically about audio models.

Well, just, like, sorry, if you're just, like, building; like, we've got this list here for you.

The big pain, you know, your number one, is going to be the tools with real-time voice, right?

I guess.

Yeah.

And then, yeah.

What would you say is your number two?

Number two, I think, is dealing with, for instance, low-bandwidth situations or noisy environments in which the model erroneously thinks that it has detected a certain utterance.

And so it starts responding to an imaginary question, whereas you haven't even started asking yours.

Maybe you just press the button and then, after two seconds, perhaps while you're starting to talk, which is even more annoying, you get an answer to a hallucinated question.

So there might be solutions that are heavy-handed, like just turning the model on only after a few seconds, which would probably already be better, and then passing the cached audio to it.

But yeah, it's a pain point and we're still living with it.
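A minimal sketch of that heavy-handed workaround, holding back the first couple of seconds of audio and then flushing the cache before streaming live; the chunk duration and the transport callbacks are assumptions:

```python
import asyncio

# Hedged sketch: hold back the first couple of seconds of microphone audio,
# then flush the cached chunks before streaming live. `mic_chunks` and
# `send_audio` are stand-ins for the real audio source and transport.
WARMUP_SECONDS = 2.0
CHUNK_SECONDS = 0.1  # assumed duration of each audio chunk

async def stream_with_warmup(mic_chunks, send_audio):
    """Buffer audio during the warm-up window, then flush it and go live."""
    cache: list[bytes] = []
    buffered = 0.0
    async for chunk in mic_chunks:
        if buffered < WARMUP_SECONDS:
            cache.append(chunk)           # don't wake the model up yet
            buffered += CHUNK_SECONDS
            continue
        if cache:
            for cached in cache:          # flush the cached audio first
                await send_audio(cached)
            cache.clear()
        await send_audio(chunk)           # then stream live
```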

Yeah, that's definitely one. I think a lot of the models have not been trained on low-quality audio.

Yeah, low quality, and low bandwidth is just another one.

We imagine a lot of our customers will be using their mobile phones; they're on the go, maybe they're driving.

There's a lot of background noise, and the connection can be less than ideal.

So that's, again, probably more something that can be resolved upstream, in the sense that if the model also has worse scenarios in its training data, maybe it can learn how to deal with those situations.

And I think the smartest thing a model could do in that situation is just say, please, can you repeat that, because I didn't understand everything you said.

So having a confidence threshold would be a nice one, but it's not among the parameters you can set at this time.
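As a sketch of what such a confidence gate could look like if bolted on outside the model, using Whisper-style segment scores (avg_logprob, no_speech_prob); the thresholds are guesses and the field names may differ depending on the transcriber:

```python
# Hedged sketch: run the cached audio through a separate transcription pass and
# only forward the utterance if the transcriber looks sure about it.
NO_SPEECH_MAX = 0.5     # above this, it's probably not speech at all
AVG_LOGPROB_MIN = -1.0  # below this, the transcript is probably garbage

def confident_enough(segments: list[dict]) -> bool:
    """Return True if every segment looks like real, well-recognised speech."""
    if not segments:
        return False
    return all(
        seg.get("no_speech_prob", 1.0) < NO_SPEECH_MAX
        and seg.get("avg_logprob", -10.0) > AVG_LOGPROB_MIN
        for seg in segments
    )

def decide(segments: list[dict]) -> str:
    # If confidence is low, answer with a fixed clarification prompt instead of
    # letting the model respond to a possibly hallucinated question.
    return "forward_to_model" if confident_enough(segments) else "ask_to_repeat"

print(decide([{"no_speech_prob": 0.1, "avg_logprob": -0.3}]))  # forward_to_model
print(decide([{"no_speech_prob": 0.8, "avg_logprob": -2.5}]))  # ask_to_repeat
```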

Yeah, yeah, true.

That would be, I guess, some of the trade-off between the control and, like, the quality and power of the real-time ones.

Yeah.

It's very exciting.

I feel like it's definitely going that way.

It just makes it, I guess, so much harder to insert stuff into; it feels like more of a black box, I guess.

Oh man, join the club, the pain club of black boxes.

Everything has become a black box with AI in its current, you know, phase; everything is just a big lump of parameters, billions and billions of parameters.

And that's weird.

It's just like no one really knows.

Yeah.

So are you guys looking into, like, open-weights models and how to integrate them, or are you not interested in that stuff?

I personally am, like, very interested. I kind of wanted to, like, try and train my own, like, crappy model.

I started to look at doing that, just for fun, to understand how it works, because it's, like, so weird.

And also it feels like people got quite far.

Like, I don't know, ElevenLabs launched their first version, like, over a weekend or something, which sounds insane.

So it feels like you can get quite far with not that much data.

And then, running the models, I think we would like to at some point, but then either you run it on another provider, and then it's kind of the same thing, or you have to manage and be responsible for all of the uptime and stuff, which is, like, challenging for a small team.

Yeah.

And once again, focus, right?

That's the most important thing.

And like just understanding what's the most important thing for you.

Yeah.

I mean, I think that it could be interesting to do like a really low quality audio model, but like, I don't know if that would actually be better in most cases.

I don't know.

Yeah, well, you know, like something that already would be great for us would be just a way to clone a voice.

Let me tell you what we have in mind.

Like, right now we have an entire book being synthesized by a machine, and instead we would like to have a human actor do the reading, because the reading is done once and listened to, you know, thousands, tens of thousands of times by different people.

And instead, if you have, you know, I want to say a world-class actor, like somebody who's really, really great at audiobook reading, the experience is much heightened.

But then you would want that same actor to basically sell you their voice, and then you would use that very same voice to train or fine-tune a model that then gives the answers to whatever the final user is asking, answering them with the same voice, so there is no jarring change of voice.

Basically what we're doing right now, but with any arbitrary voice and making sure that we can deliver higher quality for the reading.

So that's why it's important for us.

That's why we're looking into it.

Well, I'm not saying we'd do it today, but I would reckon that in a couple of years it should be feasible also for smaller companies.

I'll be surprised if it isn't.

Yeah, I mean, have you seen, I guess you've seen, like, ElevenLabs?

They have the custom voices.

I've cloned mine and it was quite good.

But I guess you'd need to connect it in with OpenAI Realtime.

And I don't know if you'd be able to, would you?

I think you would be able to.
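One way to wire that up, as a hedged sketch: have the Realtime session return text for the reply and then render that text with the cloned voice through ElevenLabs' text-to-speech REST endpoint. The endpoint path, header, body fields, and environment variable follow ElevenLabs' public docs as I recall them, so treat them as assumptions; the extra hop also re-adds some latency, which is part of the trade-off discussed next.

```python
import os
import requests

# Hedged sketch: re-voicing a text reply with a cloned ElevenLabs voice.
# Double-check the endpoint, header name, and body fields against the current
# ElevenLabs docs before relying on this.
ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]
CLONED_VOICE_ID = "your-cloned-voice-id"  # placeholder

def speak_with_cloned_voice(text: str) -> bytes:
    """Return MP3 bytes of `text` rendered in the cloned voice."""
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{CLONED_VOICE_ID}",
        headers={"xi-api-key": ELEVENLABS_API_KEY},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content

# e.g. audio = speak_with_cloned_voice("Chapter three begins on a rainy morning.")
```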

But it always comes back to that obsession of ours with speech-to-speech models.

One of those types of models that just, you know, does it all and has a specific voice.

And so, you know, takes better care of turn detection, etc., etc.

And doesn't have as much delay in giving you an answer, but also one that you can coat with whatever layer of voice you want at the very end.

And that would be super cool.

That doesn't exist.

And again, I'd be surprised if we don't have the tools to do that in a couple of years down the road.

But, yeah, I was just interested in whether you guys were looking in that direction at all.

We haven't, but it's very interesting.

And I kind of want to look into that a little bit.

To be honest, I have, like, quite a limited understanding of the real-time voice stuff so far compared to speech-to-text, text-to-speech, just because that's what we've been doing a lot more of.

And my main experience has been just as a consumer with, like, the OpenAI stuff, which I really like.

Yeah.

I'm also a fan of the OpenAI school.

I mean, I think Anthropic was very close and, you know, Google also was very close, and they're excellent competitors.

But then, you know, GPT-5.

I mean, it got a lot of bad press, but honestly, I think it just blows everything out of the water in terms of smarts.

I agree.

It's so good.

Yeah, it's just incredible.

Yeah.

Guys, gotta go and pick up the one kid.

Thank you.

Yeah, let's do this again sometime in the future if you guys want.

I would like to know what you guys are building and, you know, keep in touch.

Amazing.

Thank you so much, gentlemen.

You're welcome.

Until next time.

Bye bye.

See you later.

Have a good day.