Some Straight Talk About The Role Of AI In Podcasting
And how, exactly, this technology was used in the Dear Media series Believable: The Coco Berthmann Story
It hasn’t taken long for me to become cynical about the AI hype. The endless articles, the AI versions of film trailers mimicking my favourite filmmaker, the doomsday reports.
I’ve been riding the waves of various technologies for a long time in my career. Take for example when I made a film for the iPad…10 years before the iPad was invented.
So when Believable: The Coco Berthmann Story came up in my listening queue, and the producers introduced an AI-cloned voice of Coco (who does not otherwise appear in the series in person), I sat up straight to listen. And take notes. There was no mistaking AI Coco for real Coco; the Pulitzer Prize-winning host Sara Ganim took pains to reiterate that it was an AI clone each time it appeared.
This was the first time I had come across the use of an AI clone voice in a podcast. At first, it was a bit shocking…they actually did this, I thought. But then my next thought was: of course they did this…it seems like a sensible production choice to make in this moment.
And then my next thought was that I would need to dive into this topic ASAP.
I began some extensive Googling
To look for other examples of AI voice clones in the podcast world. I’m aware of these annoying fake-sounding voice clones on TikTok…there’s a terrible-sounding narrator who will “read” the words of my Medium articles.
But what about the use of AI as a “character,” or a person answering questions in an interview? I was not surprised to learn that there are podcasts that use stock voices and generative AI…but this is not where I spend my time listening. I mostly listen to longform, original, documentary, and serialized content…a much harder venue in which to pass off an AI-cloned voice or character.
Perhaps the most intriguing example I found was the Joe Rogan AI Experiment. It’s a totally fake YouTube podcast, complete with a legal disclaimer in the intro notes, where the creator claims that every portion of the fictitious interview between Joe Rogan and Sam Altman, the CEO of OpenAI, was (fittingly) created by ChatGPT. It rose to notoriety when Joe Rogan himself tweeted an episode.
I hate admitting that Joe Rogan is bigger than podcasting itself. His bro-itization of the term “podcasting” has made so many others run from it. But his show is so far ahead of the rest of the leaderboard (by which I mean things like The Daily and Armchair Expert…not the sorts of shows I cover in this newsletter) that it’s a bit nuts.
Since I’m not a Joe Rogan listener, I had to go looking for it…which led me to realize it’s not even on Apple any longer…which then led me to realize it’s now a Spotify exclusive…and then I went there to listen, but realized that I was watching it…yes, Spotify has gone head-to-head with YouTube and now offers video podcasts (maybe I’m a purist, but I was aghast).
This was all a bit of a detour to confirm that the real Joe Rogan and AI Clone Joe Rogan are basically indistinguishable, aside from the fact that I like the Joe Rogan AI Clone better; he seems to have more values.
Out of morbid interest, I listened to some of the two-plus hours of this show, which of course mirrors the incredibly long shows that Joe Rogan does himself. It was shocking to hear in action how the AI clone could quite believably mimic speech patterns, vocal fry, and a certain number of filler words inserted to make the voice sound real. Now I don’t believe for a second that it came off the shelf sounding this good: it’s a deepfake of a deepfake.
Nevertheless, in a moment when the WGA is on strike partly over this very issue, hearing how well this can be done was shocking. From artistic to practical to moral reasons…there’s a lot to dig into here.
And then I took a deep breath
Technology is actually slower than we think it is, even when it feels like it came out of nowhere (ChatGPT) and then purported to steal jobs and eliminate careers (the WGA strike). The transition from conceptual idea to actual seamless integration is a long road. And yes, we should be astute and attentive; but let’s also put it into context.
AI is a vast branch of computer science, and painting it all with one brush is no longer reasonable; it’s plainly unrealistic to say you won’t support any use of AI in your work. Odds are you’ve already interacted with (and benefitted from) several of its branches: Google Translate, Siri, chatbots, social media algorithms…the list is extensive. Small bits of AI have been invisibly woven into our business and digital lives.
And then there’s the great irony about AI that specifically hits the podcasting industry: We have helped to make it all happen. I wrote about it here: We Should Have Seen ChatGPT Coming: We’ve been contributing to it for years.
How, you might ask? Transcription software
Maybe you use Otter.ai, or Google Cloud, or Descript, or Sonix…there are many to choose from. All those places where you have uploaded your mp3 files in order to get a text transcription in exchange. In the early days, you likely did this for free.
What you were contributing to, with this mutualistic interaction, was the massive growth of Machine Learning...which feeds directly into Natural Language Processing (NLP), the computer-science term for how computers learn to detect our speech patterns. They can now decipher punctuation and even our regional accents (“route” pronounced r-oot or r-out, for example).
The more hours, the more power. The more power, the better the transcription. You win; they win.
In just about six or seven years, we’ve gone from illiterate, dumb machines to machines that can hear and detect nuance, inflection, regional accents, and sometimes even punctuation. So when ChatGPT slammed onto the world stage and took everyone by surprise, I was less surprised than most.
Machine Learning has progressed to the point that AI is helpful, but is not replacing humans. It currently sits at about 95% accuracy, which is about the limit of perfection from a computer before human intuition and inference are required. Or, the next level of AI comes along (which I don’t discount, but also don’t believe we can mark that date on the calendar just yet).
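For context, that accuracy figure is typically scored as word error rate (WER): the number of word-level substitutions, insertions, and deletions needed to turn the machine’s transcript into a human reference, divided by the reference length. Here is a minimal sketch of how that score is computed; the function name and examples are my own, not taken from any particular transcription vendor:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein (edit) distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# One wrong word out of twenty -> WER of 0.05, i.e. "95% accurate"
```

That last 5% is exactly the residue of nuance and inference the paragraph above describes: the part where human ears still beat the machine.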
If the beginning was speech-to-text, which ends in a text file, then we are now at Level 2, where the equation is flipped.
Now we want text-to-speech, which ends in an audio file.
Or do we?
This is where the waters get a bit murky…because as the Joe Rogan AI Experiment taught me, Machine Learning can be quite accurate (there are multiple hours of his voice to “feed” the machine in order to “learn” his voice print…voice clones aren’t something that can be accurately plucked from thin air).
As I listened to Believable: The Coco Berthmann series, the producers made it exceptionally clear that they were using a voice clone—because this is a story about a woman who does not appear in the series, their journalistic standards were very clear. You can easily imagine how the machine “learned” her voice because she was an “Influencer” with oodles of online content to work with.
To be clear, it’s not a narrator-sized voice clone. It’s small bits here and there. So here’s the first lesson I gleaned from the use of an AI voice clone in a narrative series:
Use them very sparingly
Exercise great caution when you do.
I reached out to one of the producers of the series, Karen Given, to ask her a few questions about it…which, it turns out, is still only halfway released. The remaining five episodes will come out over the coming six weeks.
The following is a transcript from various email exchanges. The content has been edited for brevity and context.
[Samantha Hodder]: Where did you first come across the use of AI in audio? If you can give a link great, or just describe where you first discovered it.
[Karen Given]: Hmm...well, actually, I'm not aware of anyone having used this kind of AI in audio, which is part of why I was eager to try it. Sara and I were both aware of the controversy surrounding the use of an AI clone in Roadrunner, a documentary about Anthony Bourdain. And that was hugely controversial, in part because the filmmakers didn't tell the audience that they were hearing AI. But we also felt that it was an intriguing storytelling tool.
So we wanted to see what it would sound like to use it, while being completely upfront about what we were doing.
I got curious about Roadrunner, so I did some more digging on this point. Anthony Bourdain was undoubtedly a huge public figure before (and still after) his untimely death.
What made this film controversial were three clips that director Morgan Neville allowed to be narrated by an AI clone of Bourdain’s voice. But even that wasn’t the exact problem. The real issue is that the film did not identify it as an AI clone…say, with a superscript or a subtitle. He just let it roll and flow.
In some ways, this was filmmaker hubris. They weren’t short on material; it just wasn’t perfect. Neville wanted the money shot, these perfect clips of AI Bourdain reading something which he had written, but not recorded. It was as though he’d found a buried treasure…something of Bourdain’s that he alone held. The Internet pounced on him.
There are a few ways to analyze this. On the one hand, documentary film has been pushing the creative boundaries of reality, and non-fiction, forever. Reenactments, animations, and visual effects recreating events for which there was no footage have all become acceptable.
So from a strictly creative point of view, I do not see this as an egregious act. But the implication of what this does for the future is where it travels to a different camp.
Neville’s lack of honesty is where it all fell apart. It might have been thought trite or lame to use an AI-cloned Bourdain voice. It might have been thought lazy, or maybe just a loosey-goosey production choice. But posing as having unique material of Bourdain’s from beyond the grave is what the internet seized on, fueled by those close to Bourdain.
Back to our email interview now…
[Samantha Hodder]: When did it occur to you to use AI to create the voice of a character? What options did you consider before choosing to go with AI?
[Karen Given]: It was one of those things that comes to you in the middle of the night, when you should be sleeping. I had the idea, and I mentioned it to Sara [Ganim, host and reporter] the next morning. She was immediately on board.
Certainly, we could have hired a voice actor to read Coco's words, and I was already considering voice actors who might be a good fit. I've worked with many of them in the past. But we felt like AI fit the story better.
Coco's public image was digitally manufactured. She photoshopped her photos, faked some of her singing and dancing videos, and pretended to be something that she wasn't.
It felt fitting that her voice would also be digitally curated.
[SH]: If you had hired an actor, this would have cost some production money. How much did it cost, approximately, to go this route?
[KG]: This was not a choice that was made to save money. In fact, we've probably spent more on using AI than we would have spent on hiring an actor to read Coco's words.
For us, it was a matter of choosing the method that best fit the story, and we felt like this was the best method.
[SH]: You mentioned that it was more expensive than it would have been to hire an actor…is that because the software is expensive, or because it took a lot of your time to get it right?
[KG]: But it's not really the time that it takes to generate the audio. Except for yesterday's hiccup, the audio generation itself has probably been relatively equal to hiring an actor in both time and money costs.
But—like I said—to our knowledge, no one had done this in audio before. So there was a lot to figure out. I spent hours testing out programs, figuring out which would work best.
And, because Coco is not participating in this project, Dear Media consulted with lawyers and industry experts to make sure we were on solid legal footing before we cloned her voice.
[SH]: Which software platform did you use for this project? There are lots available…I was curious which one, or whether it was Descript, which you might already use in your workflow anyway?
[KG]: I'm using ElevenLabs for the voice clones—which are just Coco, Sara and me.
All of the other AI voices are stock voices altered with speech-to-speech AI, using Respeecher. But that's the program I suddenly started having problems with yesterday. So I might end up switching over to something else.
And yeah, we are using Descript for our assemblies. Then we mix in Pro Tools. But I didn't even consider using it for the AI voices. It didn't come up in any of my searches, so I actually didn't know it had that feature!!
[ED Note: They have a feature called Voice Clone, and yes, it does this, although I haven’t tested the limits of this capacity]
[SH]: What are the downsides, or the things that you worried about, when it came to using AI in this project?
[KG]: We spent a lot of time considering the ethical aspects of using AI. If you listen to the podcast, you will hear that we are incredibly—and sometimes annoyingly—transparent about when we're using AI and when we're using Coco's real voice.
And that's because Dear Media's lawyers wanted us to make it crystal clear that Coco was not participating in this podcast. And Sara and I agreed with that stipulation. We did not want anyone to be confused.
Also, I should explain that there are two methods for AI voices. We use both in this podcast.
Speech-to-speech is what was used to create Anthony Bourdain's voice in Roadrunner. So, you hire an actor to mimic the person's cadence and voice patterns. And then you run that audio recording through AI to get the "characteristic" of the original voice. (AI can mimic pitch, tone...even vocal fry).
Speech-to-speech would have gotten us much closer to Coco's real voice. We could have hired an actor who could mimic her pauses and her very slight German accent. It would be virtually indistinguishable from Coco's real voice. But after the controversy over Roadrunner, the companies that offer speech-to-speech AI started requiring permission from the person whose voice was being cloned. So that wasn't really an option for us.
And also...our goal wasn't to create an AI voice that was indistinguishable from Coco's real voice. We wanted the listener to know that they were hearing AI.
We used a text-to-speech program to create Coco's voice. So, I literally type words into the program, and the program decides what inflection to use. This means that sometimes I have to "generate" the text a dozen times—or more—before I have something that conveys the emotion I'm looking for. That's a downside.
And the AI is shockingly bad at conveying happiness or excitement.
Luckily, this story is pretty dark, so that's not really been a limiting factor.
We also use AI to represent some other voices in the show. In these cases, it's very much like hiring an actor. The voice you hear is not a clone. It's a producer reading the script (or in one case, the audio of an interview with someone who wanted to stay anonymous.)
We run that recording through an AI filter, using one of the many stock voices the program has available. These voices are not meant to mimic anyone in particular. The only goal is for them to sound different from the original.
Normally, I'd just grab one of the producers on the project and ask them to read the script. But on this project, our producers are also characters in the story. So it would be really confusing to hear my voice in a debrief with Sara and then hear my voice again reading someone's social media posts. So I read all of the social media posts, and then we run that audio through an AI filter, using stock (generic) voices, just to make it less confusing.
(And yeah... I thought this would be less confusing. But people are sooo confused by it. So maybe I was wrong about that "less confusing" part).
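[ED Note: For the technically curious, here is roughly what that type-words-in, get-audio-out loop looks like against ElevenLabs’ text-to-speech REST endpoint. This is my own illustrative sketch, not the show’s actual production setup; the voice ID, API key, and voice-settings numbers below are placeholders.]

```python
import json

# ElevenLabs' public text-to-speech endpoint (one POST per generation)
TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def build_tts_request(text: str, voice_id: str, api_key: str):
    """Assemble one text-to-speech call for a cloned voice.

    voice_id refers to a clone trained on a few minutes of the
    subject's audio; the field names follow ElevenLabs' REST API.
    """
    url = TTS_URL.format(voice_id=voice_id)
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    payload = {
        "text": text,
        # Lower stability means more expressive output, but also more
        # variation between takes; that variability is why the same
        # line can need a dozen generations to land the right inflection.
        "voice_settings": {"stability": 0.4, "similarity_boost": 0.8},
    }
    return url, headers, json.dumps(payload)

# A producer would POST this (e.g. requests.post(url, headers=headers,
# data=body)), save response.content as audio, listen, and regenerate
# until the read sounds right.
```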
[SH]: What are the advantages of using AI?
[KG]: The first big advantage—it fits the themes of this podcast.
This is really a story about a scam that takes place over social media. The scam worked, in part, because Coco was able to carefully curate her image. So having a little bit of digital artifice in her voice just seemed to be appropriate.
It's also a bit more nimble. We knew we'd be needing to make changes to this podcast up to the last minute, as more information came in. And it's much easier to use AI for a quick turnaround on script changes than it would be to get an actor into a studio.
[SH]: Have you run into, or encountered, anything that you didn’t expect to now that you’ve put this project out into the world?
[KG]: 100%. Our biggest complaint, by far, is that we're being too transparent about our use of AI. Listeners keep telling us that they don't care that it's AI. They want us to shut up and stop telling them about it.
Yeah...we did not see that one coming.
[SH]: How did the “machine learning” of Coco’s voice happen? What, or how much audio, did you “feed” into the tool in order to learn her voice?
[KG]: I fed about five minutes of audio into the AI. It is possible to use more, but the program says that using more than five minutes does not improve the quality.
But there was definitely a learning curve for the AI. In the beginning, I had to generate the same section of text a dozen or more times before I got one that actually sounded like Coco.
It was incredibly tedious and time-consuming. But, as time has gone on, the AI has gotten better at generating Coco's voice. I still run into difficult passages, but it is generally much better.
[SH]: Can the AI tool that you used create audio without a script? Or does it need to be told, via a written script, what to say? I guess this question goes to the limits of what AI can currently do, or at least how you used it.
[KG]: No.
The AI is not controlling the narrative.
It does not put words in Coco's mouth.
I type Coco's exact words into the program, and the AI generates audio of those exact words.
It does not ad-lib.
It does not fix Coco's grammar.
It does not think for itself.
Don't get me wrong. There are still ways in which this technology could be used unethically. But humans are behind the wheel, not machines.
Next month we look forward to featuring a Q+A with the producing team of Karen Given and Sara Ganim, all about Coco. Put the show in your listening queue now!