Saturday, May 27, 2023

Consent in a generated future

In my last post, we discussed generative technologies. If you aren't familiar with the present capabilities of generative AI, I recommend at least skimming the introduction.

Two artists, Drake and The Weeknd, refused to collaborate on a song for over a decade despite massive demand from fans. However, a new AI-generated song, "Heart on My Sleeve," forced a virtual collaboration of the two against their will. The “Heart on My Sleeve” creator generated new lyrics with ChatGPT based on the existing corpus of Drake and The Weeknd songs, then employed singing voice conversion AI to mimic the artists’ true voices with near-perfect accuracy.

Fast-forward to 2030. By this time, generative AI has become a household term, with its influence felt in many aspects of everyday life. One product that everyone seems most excited about is “magic lenses.” The devices, powered by generative image technology, alter the wearer’s perception of reality in real-time as it unfolds, operating much like a perpetual Snapchat filter.

Snapchat filter that applies a "Pixar" effect. Image Source

They function by quickly training a generative image model of each person in the visual field — all in just a fraction of a second (1/60th to be exact). Because the training time is so short, from the perspective of the viewer, there is no lapse in visual input.

The key feature driving adoption of the lenses is that they are globally networked. Users may instruct the network to modify how they are perceived by others. For instance, instead of spending an hour on hair and makeup, many women have decided to create an array of palettes suited to their tastes. In the morning, a user chooses a style preset and broadcasts her updated settings to the network. When seen in public, she is perceived as she otherwise would be, but with her hair and cosmetics done exquisitely. If someone removed the lenses, they would perceive her as she actually exists in the world, bare-faced and hair undone. However, no one does this.

Enterprising users began to leverage the technology to make more permanent modifications. One user stated that he had an unsightly, slight bump on his nose that drove him mad whenever he looked in a mirror. Unfortunately, plastic surgeons quoted him $30,000 for the job. So, instead of paying that out of pocket, he prompts the AI to produce variations on his nose with the bump filed down, choosing his favorite each morning. This reimagined version of his face is then broadcast to all lens users.

The march of innovation did not stop. New versions of magic lenses transcended the realm of visual perception and were equipped with a small earpiece. The earpiece allowed users to modify not only physical traits but auditory perceptions. Voices could be refined to resonate at a more pleasing tone and with less abrasiveness. Oratory tics could be subtly removed by first converting audio data to acoustic tokens, then feeding those into a language model fine-tuned on the speaker. It is said that Californians found it helpful to delete excessive “likes” from their vocal stream, having them replaced by more polished transition words. Some users began to tinker with paralinguistic cues and radiated an aura of confidence at all times. The world was no longer just seen through the lenses; it was experienced.

While many reveled in the seemingly limitless potential of magic lenses, the advent of their counterpart, “evil lenses,” opened a Pandora’s Box of ethical dilemmas. Designed and manufactured by a group of rogue engineers, these lenses extended the abilities of magic lenses. Where magic lenses only permitted individuals to manipulate their own appearances, evil lenses granted users the additional capability to alter how they perceived others, all without the other party’s knowledge or consent. Evil lenses look and feel like magic lenses, and the developers went as far as reverse engineering the magic lens protocol, rendering them entirely undetectable.

Despite their ominous name, they were not designed with harmful intentions. In fact, the name was coined as a joke, although they later came to embody it. They were originally developed to address the inherent subjectivity of perception. For example, your idea of a pleasing voice may not be the same as mine. Perhaps I find your real voice nasal, and I might wish to generate a voice based on yours, but with the frequencies I find grating subtracted. But shortly after, the uses progressed from practical to impish. One user manipulated his lenses to make all politicians appear to don clown suits, replete with face paint and a red nose. What followed, though, wasn't quite as harmless.

Another user published a module that downgraded designer clothes to the Ross discount aisle equivalent. Jealousy within the workplace caused some to depict work rivals as slovenly and unhygienic, and other coworkers were encouraged to do the same.

Just as magic lens users did, evil lens users began to toy with the non-physical. One user, seeking to boost his ego, altered the speech of individuals he deemed competent, transforming their delivery to be simplistic and gauche. Inspired by this, another boasted of having work rivals appear nervous and uncertain during presentations, claiming it empowered him to perform better. Evil lenses ultimately became an avenue for narcissists to silently project their twisted fantasies onto the lenses, thus redefining their experiential perceptions.

Let’s now return to the present to address potential concerns a skeptical reader may have. He might argue that such an immersive, hyper-personalized reality is far-fetched. However, the foundational technologies powering magic lenses have already been developed. First, the generated images would be produced by a version of Midjourney solely focused on hyper-realistic reality mirroring. Second, Google’s AudioLM can reproduce your voice exactly (I encourage any skeptic to follow this link and listen to how uncannily good it is). Lastly, a team of researchers used deep recurrent neural networks to upscale low-quality video via generative super-resolution.

Although the current research is focused on practical applications, what if, in addition to super-resolution, the content of the video could also be tailored to an individual's preferences? Now — advancements in nanotechnology notwithstanding — we aren’t so far off from a pair of magic lenses. The remaining hurdles are largely engineering problems: iterative improvements in output quality, reducing resource requirements, and optimizing training and inference runtimes. These problems are actively being worked on, evidenced by ChatGPT’s new ‘turbo mode,’ which can output tokens faster than the eye can process them.

Just a few days ago, rapper Ice Cube commented on “Heart on My Sleeve,” describing AI cloning as demonic. Is that simply a visceral reaction to new and confusing technology, or is there something more there? I would argue the latter. We like to think of our voice, our physical appearance, and our likeness as God-given or, what I call, sub-essences of the self. These sub-essences are the unique result of hundreds of thousands of years of DNA recombinations and mutations. Voice is not a physical characteristic, but captures the emotion, personality, and ideas of the speaker. The life of the speaker, his parents, and all of his ancestors are encoded in those frequencies.

It is the unique amalgamation of these sub-essences that defines one's identity and creates one’s essence, or sense of self. For that same reason, we do not feel the same level of disquiet when, for example, an Elvis cover artist performs with remarkable accuracy. (I explore this idea in more detail in this post).

Drawing on the philosophy of Kant, the principle of universalizability applies nicely under these circumstances. For a brief primer, a rule is not morally permissible if it fails the exception test. In this thought experiment, one imagines a world where the rule is universally adopted, and such adoption should neither inflict harm nor cause chaos. For example, to test the rule “cutting in line is permissible,” we imagine a world where everyone cuts the line. Now, there is no longer such a thing as lines; there are only mobs, and the test fails. Similarly, if we have the rule “you may generate a sub-essence of another person,” we can conceive of a world in which everyone does; if everyone can speak in your voice, in addition to their own, a sub-essence of yours has been unfairly stolen. Now shared by all, you are reduced to less of a person than you were before.

Additionally, when cloning the essence of another, as magic lenses do, it is for the sole purpose of generating variations. Even in practical use cases, such as audio denoising, the model is generating new data that did not previously exist. Magic lenses are less nefarious since the individual retains control over how he is perceived, although they still enable the duplication of another person’s essence. When we conjure up a negative perception about someone in our mind, a barrier exists between imagination and reality, forcing us, at some level, to confront the truth. However, with evil lenses, this boundary blurs as the wearer manipulates the perceived individual's essence without consent. The perceived is denied his right to autonomy and is thus stripped of his self-governance.

Ethical norms and legal regulations prohibit such impersonations in the real world. In New York, for example, PL 190.25(1) clearly states that it is a crime to assume another person’s character with the intent to obtain benefit. In the same vein, unauthorized manipulation of a person’s likeness, voice, or behavior in the digital realm is equally unacceptable.

Regrettably, I don’t come bearing solutions to these dilemmas. It is likely that, in the future, generative models will run performantly on consumer hardware (e.g., cell phones, tablets). The models and weights will be open-sourced. And if not released deliberately, eventually the weights of powerful models will leak to the public, as we witnessed with Facebook’s Llama model. In the nineties, the US government tried to regulate cryptography as munitions and failed spectacularly. Cypherpunks cleverly exchanged cryptographic algorithms in book form, asserting that restricting them would violate their First Amendment rights. How can we really expect to regulate matrix multiplication?

Alternatively, one could argue that personal data ought to be closely guarded. While I agree with the sentiment, I worry about the practicality of doing so. Google’s AudioLM paper claims to produce consistent continuations of a speaker’s voice with only three seconds of audio. With the amount of data that is currently online, can you be absolutely sure that three seconds of your voice isn’t floating around the Internet somewhere?

However, this doesn’t entirely leave the fate of generative AI unchecked. Public opinion will play a significant role in shaping its acceptance and defining the ethical boundaries surrounding its use. Perhaps the technology will be eschewed by most and effectively self-regulate through market dynamics. I remember, quite fondly, when an overzealous early adopter of Google Glass was promptly thrown out of a bar. Conversely, indifference or blind acceptance of such practices early on will pave the way for an escalation of unethical use.

Tuesday, May 23, 2023

Preparing for a generated future

It seemed like an ordinary Tuesday until you got a call, out of the blue, from your little brother. When you answer, he sounds panicked and short of breath. You catch fragments: a DUI, and he’s being held at the Sheriff's station. “Please hurry! I’m running out of phone time!” he pleads. “I need you to wire money to this bank account.” Now you start to feel a knot of dread tighten in your stomach. Is he really so foolish as to drive drunk? There’s no time to think. His name is on the caller ID; the voice is unmistakably his! You grab a crumpled receipt beside you and quickly jot down the account number.

Then, the front door flies open. Your brother barges in, flashing you an innocent wave. You feel the panic wash away, and your fingertips start to numb. “Hello?” beams the imposter. Realizing you are being played, you quickly hang up the phone.

This chilling anecdote may sound like it was lifted from a sci-fi thriller, but it is a glimpse of an emerging reality. Welcome to the dawn of generative AI. The world is collectively becoming familiar with generative text, a trend exemplified by the mass adoption of ChatGPT. But what about the scam call you just received from the imposter? That is a generative audio model, fine-tuned on the voice of your sibling. According to Google’s AudioLM paper, the best models need a mere three seconds of audio to produce consistent continuations of that speaker’s voice. Suddenly, a forgotten YouTube video from high school re-emerges as a potential attack vector.

Generative audio has recently taken the world by storm with the release of “Heart on My Sleeve,” an entirely AI-generated song featuring Drake and The Weeknd. It was created by a ghostwriter who used singing voice conversion (SVC) AI. In digital music production, a MIDI keyboard is often used as a universal input for playing any instrument. Similarly, SVC enables one (even the most tone-deaf among us) to sing into the mic and have his voice synthesized into buttery R&B vocals. The song skyrocketed in popularity not just because of the novelty of the technology, but also because it featured star-powered entertainers instead of the political figures typically used in demos. Furthermore, the lyrics are provocative and loaded with innuendo, alluding to each performer's current and past love interests. Most importantly, the featured artists notoriously don’t get along, and this is the first “collaboration” they’ve done in over a decade.

Generative audio, like many of our technologies, was conceptualized in decades past by ever-prescient sci-fi writers. For example, in The Moon Is a Harsh Mistress, the AI protagonist, Mike, is able to perfectly impersonate voices. Mike uses this ability to great effect, giving his band of ‘Loonie’ revolutionaries enough time to declare their independence from the oppressive Earthlings. While sci-fi authors must be commended for their expansive imaginations, they often fall victim to two predictive traps that don’t reflect the reality we are heading into: first, the rules of the world are universally understood; second, as is the case in Heinlein’s universe, knowledge and use of disruptive technology is limited to a select group.

To the first point, we already exist in a world where the so-called ‘rules’ of technology not only are imperfectly understood but are generationally defined. For instance, older generations, especially in the 60+ age group, are more susceptible to fake news and misinformation than younger people. In contrast, millennials and Gen Z spent their formative years surrounded by Internet technologies, thereby acquiring a sort of ‘digital literacy.’ Mirroring the future, generative AI can put on a friendly face, or a familiar voice, or show a comforting image, and it is the 'digitally illiterate' who will be the most vulnerable to its influence.

Secondly, looking ahead, I foresee a future where generative AI becomes more democratized. Facebook has demonstrated as much with their Llama and LIMA models. The future of generative AI is not thirty-billion-dollar GPU farms. Instead, it lies in architectural improvements that make these intelligences cheaper and cheaper to train and run. While the past five years of AI advancement have been marked by massive scale, Llama has a comparatively tiny resource footprint, and can run inference acceptably on consumer hardware. Although Llama is a generative text model, I expect similar developments to occur in the audio space. This is not such a bold prediction when considering that the state-of-the-art in generated imagery is both open-source (source code is publicly available) and open-weight (pre-trained model weights are available). So, while Mike the AI’s abilities were kept secret by the Loonie revolutionaries, in reality, any enterprising user will have access to powerful generative models.

Currently, tools like GPTZero can identify text generated by a GPT model, while SVC models still have subtle artifacts that a trained ear can pick up. But the writing's on the wall. Soon, Silicon Valley will produce what I term perfect generators: models whose output is indistinguishable from reality. For now, there is a game of cat-and-mouse between generated images and the detection of them, but it is a game that is becoming increasingly one-sided. For example, GPTZero cannot be confidently relied on because, although its recall is high, its precision is low, meaning that many of the texts it flags were in fact written by people.
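To make the precision and recall point concrete, here is a small worked example. The counts are made up purely for illustration and are not measurements of GPTZero or any other real tool. The sketch shows how a detector can catch nearly every generated essay and still be wrong most of the times it raises a flag.

```python
# Hypothetical confusion-matrix counts for an AI-text detector.
# These numbers are illustrative assumptions, not measurements of any real tool.

def precision(true_pos: int, false_pos: int) -> float:
    """Of everything flagged as generated, what fraction really was?"""
    return true_pos / (true_pos + false_pos)

def recall(true_pos: int, false_neg: int) -> float:
    """Of everything that really was generated, what fraction was flagged?"""
    return true_pos / (true_pos + false_neg)

# Suppose 1,000 essays are checked: 100 generated, 900 written by people.
true_pos = 95    # generated essays correctly flagged
false_neg = 5    # generated essays that slipped through
false_pos = 180  # human-written essays wrongly flagged

p = precision(true_pos, false_pos)
r = recall(true_pos, false_neg)
print(f"precision = {p:.2f}, recall = {r:.2f}")
# prints "precision = 0.35, recall = 0.95"
```

Under these assumed counts, roughly two out of every three flags land on a human author, which is why a high-recall detector alone cannot settle whether a given text was generated.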

Let’s consider the combination of generative audio and generative images. First, images can be treated as individual frames that are stitched together. Afterwards, audio can be overlaid, in sum creating an indistinguishable deepfake video. The potential problems of this technology, I assume, are evident to most readers. In the past several years, video has been the impetus for social unrest. But what will happen when we enter a world where every aspect of a video — from the imagery to the voices, and even the content — can all be generated from a simple prompt? Pair that with a large segment of society that is digitally illiterate, and now we have a recipe for disaster on our hands.

A bad actor who wants to see the world burn could circulate entirely generated videos of police brutality. To add a flair of authenticity, he could first fine-tune the model on images and voices of officers in a specific department. And this also goes the other direction. Legitimate videos of police brutality could be denied by the offending department as deepfakes. Therefore, it is of crucial importance that we start developing countermeasures before perfect generators materialize. Any device that captures reality, whether it’s images or audio, needs to be verifiable.

It might become a necessity for video cameras to ship with tamper-proof hardware security modules (HSM). Think of an HSM like a bank safe deposit box. Instead of protecting your jewelry or your house deed, it serves as a barrier to accessing cryptographic signing keys. These keys will be used to append cryptographic signatures to individual frames, attesting that those frames did exist in the real world, and were not generated by an AI. We will call such signatures attestments of non-generativity. Video playback clients would then be able to verify signatures and indicate in the UI if the content was generated or not.
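As a rough sketch of what per-frame attestation could look like, the snippet below tags each frame with a keyed hash and then checks the tags on playback. Everything named here (`DEVICE_KEY`, `attest_frame`, `verify_frame`) is hypothetical. One important simplification: a real camera would presumably use an asymmetric signature scheme inside the HSM so that playback clients hold only a public key. HMAC-SHA256 stands in for it here only to keep the example within Python's standard library.

```python
import hashlib
import hmac
import secrets

# Stand-in for a key sealed inside the camera's HSM (hypothetical).
# A real design would use an asymmetric signature so that verifiers
# only ever hold a public key; HMAC is used here to stay stdlib-only.
DEVICE_KEY = secrets.token_bytes(32)

def attest_frame(key: bytes, frame_index: int, frame: bytes) -> bytes:
    """Return a tag binding the frame's bytes to its position in the video."""
    message = frame_index.to_bytes(8, "big") + hashlib.sha256(frame).digest()
    return hmac.new(key, message, hashlib.sha256).digest()

def verify_frame(key: bytes, frame_index: int, frame: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = attest_frame(key, frame_index, frame)
    return hmac.compare_digest(expected, tag)

# Three fake "frames" standing in for raw sensor data.
frames = [b"frame-0 pixels", b"frame-1 pixels", b"frame-2 pixels"]
tags = [attest_frame(DEVICE_KEY, i, f) for i, f in enumerate(frames)]

# Untouched footage verifies.
all_ok = all(
    verify_frame(DEVICE_KEY, i, f, t)
    for i, (f, t) in enumerate(zip(frames, tags))
)

# A swapped-in (for example, generated) frame does not.
tampered_ok = verify_frame(DEVICE_KEY, 1, b"generated pixels", tags[1])

print(all_ok, tampered_ok)  # prints "True False"
```

Including the frame index in the tagged message is a design choice of this sketch: it means a genuine frame cannot be silently moved to a different position in the video.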

But there’s a challenge. If these keys are ever stolen or copied off the camera, they could be used to make fake videos appear real. There is a saying in infosec that goes, “physical access is total access.” In this context, if I sell a camera equipped with a non-generative signing key, it should be assumed that a determined hacker will eventually exfiltrate it. The entertainment industry has learned this lesson time and time again as DRM is repeatedly cracked.

So, it is safe to assume keys will eventually be extracted. To address this issue, keys would need to be made revocable. For those familiar with the certificate infrastructure behind HTTPS, a system could be designed in a similar fashion. The camera company would generate a key for each device it sells. If the device detects foul play, it will phone home, indicating that the key should be revoked; subsequent attestments of non-generativity will fail verification following revocation (for the security inclined, this is closely related to what is formally called traitor tracing). And while far from perfect, at the very least, the cost of launching an attack is raised to the price of a new recording apparatus.
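The revocation idea can be sketched in a few lines. The `KeyRegistry` class and its methods below are hypothetical names invented for illustration, and HMAC-SHA256 again stands in for a real asymmetric signature so that the example needs only the standard library. This is a minimal sketch under those assumptions, not a description of any deployed system.

```python
import hashlib
import hmac
import secrets

class KeyRegistry:
    """Hypothetical manufacturer-side registry of per-device keys."""

    def __init__(self):
        self._keys = {}        # device_id -> key
        self._revoked = set()  # device_ids whose keys are no longer trusted

    def provision(self, device_id: str) -> bytes:
        """Generate a fresh key for a newly sold device."""
        key = secrets.token_bytes(32)
        self._keys[device_id] = key
        return key

    def revoke(self, device_id: str) -> None:
        """Called when a device phones home to report tampering."""
        self._revoked.add(device_id)

    def verify(self, device_id: str, frame: bytes, tag: bytes) -> bool:
        """Reject unknown or revoked devices before checking the tag."""
        if device_id in self._revoked or device_id not in self._keys:
            return False
        expected = hmac.new(self._keys[device_id], frame, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)

registry = KeyRegistry()
camera_key = registry.provision("camera-001")

frame = b"raw sensor data"
tag = hmac.new(camera_key, frame, hashlib.sha256).digest()

before = registry.verify("camera-001", frame, tag)
registry.revoke("camera-001")  # the device reported foul play
after = registry.verify("camera-001", frame, tag)

print(before, after)  # prints "True False"
```

Note that this sketch rejects everything from a revoked device, including footage captured before the compromise. A fuller design would likely need trusted timestamps so that only attestments made after revocation fail.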

For now, dear reader, I can only leave you with mitigations. If you receive a call from someone that sounds exactly like a loved one, you should be aware that fraudsters can easily spoof a number on the cell network. Caller ID cannot be trusted, and soon, neither can the voice on the other side of the line. Creating a protocol with family members, whereby each side says a secret word on the phone before discussing confidential information, is a possible option.

It is vitally important for engineers to develop methods to verify the authenticity of digital content in preparation for a world with perfect generators. If we are lucky, some unique aspects of reality can be encoded into capturing devices. Additionally, social media sites will need to build in verifiers so that content is made easily distinguishable as generated or real before going viral. Otherwise, the digital landscape will become a place where forgers abound, and honest users are overwhelmed by feelings of doubt and insecurity. Perfect generators are on the horizon, so countermeasures must be built preemptively rather than reactively.

Now, let's return to our opening anecdote. You receive the same call. But this time, right as the imposter begins to speak as your brother, your phone vibrates. You look down at the screen: WARNING: AI DETECTED. You laugh, then quickly hang up the phone. This is the future we need to build.

Friday, May 12, 2023

Baudelaire's Transformer

Charles Babbage sought to eliminate manual computation, which he saw as a gross misuse of human talent. He regarded the 'human computers' of his day, those who spent their days approximating functions, as victims of “intolerable labor.” Through the development of his Analytical Engine, he hoped to finally put an end to the fatiguing monotony of calculation. Although the engine wasn't completed in his lifetime, the ideas he pioneered materialized as diamond-cut silicon wafers just over a century later.


The technology evolved rapidly in the decades that followed. Mainframes shrank into personal computers; the Internet connected the world; now, artificial intelligence is the final frontier. This relentless march of progress is epitomized by the Transformer, the algorithmic Leviathan.


It is during this period that the mob of the talentless demanded an ideal worthy of itself. In response to their prayers, the tech industry obliged. Industry delivered the mob the Transformer, an oracle that generates poetry, essays, images, and music. But in doing so, it threatened to ruin whatever might remain divine in the mind of man. 


With the advent of the Transformer, the mob turned into a mob of Prompters. Prompters are ready servants of the Transformer, eager to apply this newfound tool whenever and wherever possible. They have been deceived by industry into believing that idle ideation embodies the human spirit. The Transformer is their Messiah. So, the idolatrous Prompter yearns to design a world in his own image: a world in which vague ideas can be quickly converted into a fully-realized piece of work. To him, the execution of ideas has been rendered insignificant. As a result, the ideology of the Prompter stretches the boundaries of Babbage’s original vision at the expense of creativity, agency, and even individuality.


Creativity is undermined as the Prompter fails to recognize a crucial aspect of art: the inseparable bond between the appeal of art and the labor of its creator. This truth, however, resonates deeply with the connoisseur, who is as captivated by Van Gogh's self-mutilation of his left ear as by his Sunflowers.


It must be conceded that the Transformer itself has suffered as much as any great artist, spending millions of GPU hours alone, in a dark datacenter, in a ceaseless cycle of prediction and back-propagation. The same cannot be said of the Prompter. 


The Prompter draws upon the struggles of artists of the distant past to stand in for a creative spirit he himself lacks. For that reason alone, it is a grave mistake to allow the Transformer to encroach upon the domain of the impalpable and the imaginary, upon any creation that once drew its value from the addition of a man’s soul. Much like Parolles who, in his self-interest, hastily betrays Bertram and his Florentine comrades, the Prompter betrays his artistic inspirations by directing a machine to duplicate and fine-tune their work. Mirroring the character of Parolles, this behavior is marked by both arrogance and cowardice.


It is cowardly because the Prompter veils a fear of labor behind the pretense that he is too busy and too important to be bogged down with detail. He doesn’t have time to craft an elegant computer program; he doesn’t have the energy to compose a heartfelt message; the entire world must be understood as abstractions, and implementation details are to be delegated to the Transformer. So consumed by his own importance, he has time only to dictate. Thus, the prompting industry will become a refuge for every would-be creator: writers too lazy to read the classics, programmers wholly reliant on libraries, and artists too ill-endowed to complete their studies. 


While it's easy to criticize a Prompter for his laziness, one might argue he is acting in his own self-interest. Yet, by sacrificing inventiveness at the altar of efficiency, the Prompter willingly neuters himself. He is demoted from thinker to agent, from innovator to facilitator, and worst of all, from creator to curator. The starry-eyed Prompter may pontificate, “Without my input, the Transformer sits quietly, awaiting my direction. It serves as my idea generator, writer, editor-in-chief, and analyst. It is my assistant!”


His infatuation with the Transformer is not only marked by arrogance but contains an air of vengeance. With a smirk, he declares, "Your creations belong to me now!" Indeed, what was once out of his reach has become possible in an instant. Tragically, the Prompter falls prey to the useful fantasy that he is a valuable component in the system. In truth, the Transformer's architect eagerly awaits the day the Prompter can be discarded. Alas, for now, the Transformer has awakened, forgiving the sin of incompetence. All may drink from its river of generative mediocrity.


Even though 'generative art,' in its infinite composability, may appeal to the senses, it offends the subconscious mind: first, because of its near-instantaneous inception; second, because its creations are disfigured, being stitched together in a manner that would amaze even Mary Shelley. Thousands of years of human emotion are compressed into bits, then reassembled into a single unholy piece.


While art, our most abstract form of communication, shall increasingly be dominated by the Transformer, we also see glimpses of its influence extending further, with direct human-to-human communication being mediated by it. In the future, the role of this interloping third party will only grow in popularity as more tools are developed specifically for this purpose. Indeed, Prompters have begun to boast that the Transformer can draft near-perfect responses to emails. But who is to say that the original message was not also penned by the Transformer? In his greedy pursuit of timesaving, the Prompter abdicates his ability to express his own raw emotions and thoughts. Instead, he surrenders himself entirely to a machine.


Thus, communication between two Prompters degenerates into a feedback loop — they become mere conduits for the Transformer to respond to itself. While the Transformer can mimic the distinctive cadence of human writing, individual voices are drowned out by its homogenizing echo. It is an echo that, at first, sounds convincing, yet each response serves to amplify the Prompter's isolation.

A unified file streaming API for local and remote storage

Oftentimes, we want a simple API for streaming IO that works seamlessly across multiple sources. I am looking for an interface that not only...