Then, the front door flies open. Your brother barges in, flashing you an innocent wave. You feel the panic wash away and the feeling creep back into your numb fingertips. “Hello?” beams the imposter. Realizing you are being played, you quickly hang up the phone.
This chilling anecdote may sound like it was lifted from a sci-fi thriller, but it is a glimpse of an emerging reality. Welcome to the dawn of generative AI. The world is collectively becoming familiar with generative text, a trend exemplified by the mass adoption of ChatGPT. But what about the scam call you just received from the imposter? That was a generative audio model, fine-tuned on the voice of your sibling. According to Google’s AudioLM paper, the best models need a mere three seconds of audio to produce consistent continuations of a speaker’s voice. Suddenly, a forgotten YouTube video from high school re-emerges as a potential attack vector.
Generative audio recently took the world by storm with the release of “Heart on My Sleeve,” an entirely AI-generated song featuring Drake and The Weeknd. It was created by a ghostwriter using singing voice conversion (SVC) AI. In digital music production, a MIDI keyboard is often used as a universal input for playing any instrument. Similarly, SVC enables anyone (even the most tone-deaf among us) to sing into the mic and have their voice synthesized into buttery R&B vocals. The song skyrocketed in popularity not just because of the novelty of the technology, but also because it featured star-powered entertainers instead of the political figures typically used in demos. Furthermore, the lyrics are provocative and loaded with innuendo, alluding to each performer’s current and past love interests. Most importantly, the featured artists notoriously don’t get along, and this is the first “collaboration” between them in over a decade.
Generative audio, like many of our technologies, was conceptualized decades ago by ever-prescient sci-fi writers. For example, in The Moon Is a Harsh Mistress, the AI protagonist, Mike, is able to perfectly impersonate voices. Mike uses this ability to great effect, giving his band of ‘Loonie’ revolutionaries enough time to declare their independence from the oppressive Earthlings. While sci-fi authors must be commended for their expansive imaginations, they often fall victim to two predictive traps that don’t reflect the reality we are heading into: first, assuming the rules of the world are universally understood; second, as in Heinlein’s universe, assuming knowledge and use of disruptive technology remain limited to a select group.
To the first point, we already live in a world where the so-called ‘rules’ of technology are not only imperfectly understood but generationally defined. For instance, older generations, especially the 60+ age group, are more susceptible to fake news and misinformation than younger people. In contrast, millennials and Gen Z spent their formative years surrounded by Internet technologies, thereby acquiring a sort of ‘digital literacy.’ In the future, generative AI will be able to put on a friendly face, speak in a familiar voice, or show a comforting image, and it is the ‘digitally illiterate’ who will be most vulnerable to its influence.
Secondly, looking ahead, I foresee a future where generative AI becomes more democratized. Facebook has demonstrated as much with their Llama and Lima models. The future of generative AI is not thirty-billion-dollar GPU farms. Instead, it lies in architectural improvements that make these intelligences ever cheaper to train and run. While the past five years of AI advancement have been marked by massive scale, Llama has a comparatively tiny resource footprint and can run inference acceptably on consumer hardware. Although Llama is a generative text model, I expect similar developments in the audio space. This is not such a bold prediction when you consider that the state of the art in generated imagery is both open-source (source code is publicly available) and open-weight (pre-trained model weights are available). So, while Mike the AI’s abilities were kept secret by the Loonie revolutionaries, in reality, any enterprising user will have access to powerful generative models.
Currently, tools like GPTZero can identify text generated by a GPT model, while SVC models still have subtle artifacts that a trained ear can pick up. But the writing’s on the wall. Soon, Silicon Valley will produce what I term perfect generators: models whose output is indistinguishable from reality. For now, there is a game of cat-and-mouse between generated content and the detection of it, but it is a game that is becoming increasingly less competitive. For example, GPTZero cannot be confidently relied on due to its low precision and high recall: it flags most AI-generated text, but it also misfires on plenty of human writing.
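To make that concrete, here is a toy confusion-matrix calculation. The counts below are made up for illustration, not GPTZero’s published numbers, but they show why a high-recall, low-precision detector cannot be trusted when it raises a flag.

```python
# Hypothetical detector results: 100 AI-written and 200 human-written samples.
tp = 90   # AI text correctly flagged
fn = 10   # AI text that slipped through
fp = 60   # human text wrongly flagged

precision = tp / (tp + fp)  # of everything flagged, how much was really AI?
recall = tp / (tp + fn)     # of all AI text, how much was caught?

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.60, recall = 0.90
# The detector catches 90% of AI text, but 4 in 10 flags land on human writing.
```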
Let’s consider the combination of generative audio and generative images. First, images can be treated as individual frames and stitched together. Afterwards, audio can be overlaid, together creating an indistinguishable deepfake video. The potential problems of this technology, I assume, are evident to most readers. In the past several years, video has been the impetus for social unrest. But what will happen when we enter a world where every aspect of a video, from the imagery to the voices to the content itself, can be generated from a simple prompt? Pair that with a large segment of society that is digitally illiterate, and we have a recipe for disaster on our hands.
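To illustrate how low the assembly barrier already is, here is a minimal sketch of the stitching step. It assumes a folder of generated frames (frame_0001.png, frame_0002.png, …) and a synthesized voice track (voice.wav), both hypothetical filenames, plus a local ffmpeg install.

```python
# Stitch generated frames into a video and overlay a generated audio track.
# Requires ffmpeg on the PATH; the input filenames are placeholders.
import subprocess

subprocess.run([
    "ffmpeg",
    "-framerate", "30",       # treat the stills as 30 frames per second
    "-i", "frame_%04d.png",   # numbered generated frames
    "-i", "voice.wav",        # generated audio track
    "-c:v", "libx264",        # encode the stitched frames as H.264 video
    "-c:a", "aac",            # encode the overlaid audio as AAC
    "-shortest",              # stop at the end of the shorter stream
    "output.mp4",
], check=True)
```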
A bad actor who wants to see the world burn could circulate entirely generated videos of police brutality. To add a flair of authenticity, he could first fine-tune the model on images and voices of officers in a specific department. And the threat runs in the other direction, too: legitimate videos of police brutality could be dismissed by the offending department as deepfakes. Therefore, it is crucially important that we start developing countermeasures before perfect generators materialize. Any device that captures reality, whether images or audio, needs to be verifiable.
It might become a necessity for video cameras to ship with tamper-proof hardware security modules (HSMs). Think of an HSM like a bank safe deposit box. Instead of protecting your jewelry or your house deed, it serves as a barrier to accessing cryptographic signing keys. These keys would be used to append cryptographic signatures to individual frames, attesting that those frames existed in the real world and were not generated by an AI. We will call such signatures attestations of non-generativity. Video playback clients would then be able to verify the signatures and indicate in the UI whether the content was generated.
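Here is a minimal sketch of the idea, assuming an Ed25519 keypair stands in for the key locked inside the HSM (in a real camera, the private key would never leave the module).

```python
# Per-frame "attestations of non-generativity": the camera signs each raw
# frame, and any playback client can verify it with the camera's public key.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

hsm_key = Ed25519PrivateKey.generate()  # stand-in for the key inside the HSM
public_key = hsm_key.public_key()       # published by the manufacturer

def attest(frame: bytes) -> bytes:
    """Camera side: sign the raw frame bytes as they come off the sensor."""
    return hsm_key.sign(frame)

def verify(frame: bytes, signature: bytes) -> bool:
    """Playback side: accept the frame only if the signature checks out."""
    try:
        public_key.verify(signature, frame)
        return True
    except InvalidSignature:
        return False

frame = b"...raw frame bytes..."         # placeholder frame data
sig = attest(frame)
print(verify(frame, sig))                # True: frame is attested
print(verify(frame + b"tampered", sig))  # False: any edit breaks the signature
```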
But there’s a challenge. If these keys are ever stolen or copied off the camera, they could be used to make fake videos appear real. There is a saying in infosec: “physical access is total access.” In this context, if I sell a camera equipped with a non-generative signing key, it should be assumed that a determined hacker will eventually exfiltrate it. The entertainment industry has learned this lesson time and time again as DRM is repeatedly cracked.
So, it is safe to assume keys will eventually be extracted. To address this, keys would need to be made revocable. For those familiar with the HTTPS public-key infrastructure, a system could be designed in a similar fashion. The camera company would generate a key for each device it sells. If the device detects foul play, it phones home, indicating that the key should be revoked; subsequent attestations of non-generativity will fail verification following revocation (for the security-inclined, this is closely related to what is formally called traitor tracing). And while far from perfect, at the very least, the cost of launching an attack is raised to the price of a new recording apparatus.
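Building on the earlier signing sketch, here is a self-contained illustration of revocation. DEVICE_KEYS and REVOKED are hypothetical stand-ins for the key registry and revocation list a manufacturer would publish.

```python
# Revocable per-device attestation keys. Once a device's key is revoked,
# every signature it produces fails verification.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

camera_key = Ed25519PrivateKey.generate()               # provisioned at factory
DEVICE_KEYS = {"camera-1234": camera_key.public_key()}  # manufacturer registry
REVOKED: set[str] = set()                               # published revocations

def verify_attestation(device_id: str, frame: bytes, sig: bytes) -> bool:
    """A frame counts as real only if its device is known and not revoked."""
    if device_id in REVOKED or device_id not in DEVICE_KEYS:
        return False
    try:
        DEVICE_KEYS[device_id].verify(sig, frame)
        return True
    except InvalidSignature:
        return False

frame = b"raw sensor bytes"
sig = camera_key.sign(frame)
print(verify_attestation("camera-1234", frame, sig))  # True
REVOKED.add("camera-1234")                            # camera phoned home
print(verify_attestation("camera-1234", frame, sig))  # False after revocation
```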
For now, dear reader, I can only leave you with mitigations. If you receive a call from someone who sounds exactly like a loved one, be aware that fraudsters can easily spoof a number on the cell network. Caller ID cannot be trusted, and soon, neither can the voice on the other end of the line. One option is to establish a protocol with family members whereby each side says a secret word on the phone before discussing confidential information.
It is vitally important that engineers develop methods to verify the authenticity of digital content in preparation for a world with perfect generators. If we are lucky, some unique aspects of reality can be encoded into capturing devices. Additionally, social media sites will need to build in verifiers so that content is clearly distinguishable as generated or real before it goes viral. Otherwise, the digital landscape will become a place where forgers abound and honest users are overwhelmed by doubt and insecurity. Perfect generators are on the horizon, so countermeasures must be built preemptively rather than reactively.
Now, let's return to our opening anecdote. You receive the same call. But this time, right as the imposter begins to speak as your brother, your phone vibrates. You look down at the screen: WARNING: AI DETECTED. You laugh, then quickly hang up the phone. This is the future we need to build.