Saturday, May 27, 2023

Consent in a generated future

In my last post, we discussed generative technologies. If you aren't familiar with the present capabilities of generative AI, I recommend at least skimming the introduction.

Two artists, Drake and The Weeknd, refused to collaborate on a song for over a decade despite massive demand from fans. However, a new AI-generated song, “Heart on My Sleeve,” forced a virtual collaboration of the two against their will. The song’s creator generated new lyrics with ChatGPT based on the existing corpus of Drake and The Weeknd songs, then employed singing-voice-conversion AI to mimic the artists’ true voices with near-perfect accuracy.

Fast-forward to 2030. By this time, generative AI has become a household term, with its influence felt in many aspects of everyday life. One product that everyone seems most excited about is “magic lenses.” Powered by generative image technology, the devices alter the wearer’s perception of reality in real time, operating much like a perpetual Snapchat filter.

Snapchat filter that applies a “Pixar” effect.

They function by rapidly training a generative image model of each person in the visual field, all within a single frame (1/60th of a second, to be exact). Because training is so fast, there is no perceptible lapse in visual input from the wearer’s perspective.
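As a quick sanity check on that figure, a 60 Hz visual refresh gives each per-person model only about 16.7 milliseconds to train and render. The arithmetic below is a back-of-the-envelope sketch, not a claim about any real device:

```python
# Back-of-the-envelope frame budget for the lenses described above,
# assuming the 1/60th-of-a-second figure corresponds to a 60 Hz refresh.
REFRESH_HZ = 60

# Time available to retrain and re-render every person in view, per frame.
frame_budget_ms = 1000 / REFRESH_HZ

print(f"Per-frame budget: {frame_budget_ms:.2f} ms")  # → Per-frame budget: 16.67 ms
```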

The key feature driving adoption of the lenses is that they are globally networked. Users may instruct the network to modify how they are perceived by others. For instance, instead of spending an hour on hair and makeup, many women have created an array of palettes suited to their tastes. In the morning, a user chooses a style preset and broadcasts her updated settings to the network. When seen in public, she is perceived as she otherwise would be, but with her hair and cosmetics done exquisitely. If someone removed the lenses, they would perceive her as she actually exists in the world: bare-faced, hair undone. However, no one does this.

Enterprising users began to leverage the technology to make more permanent modifications. One user reported a slight but unsightly bump on his nose that drove him mad whenever he looked in a mirror. Unfortunately, plastic surgeons quoted him $30,000 for the job. So, rather than pay that out of pocket, he prompts the AI to produce variations on his nose with the bump filed down, choosing his favorite each morning. This reimagined version of his face is then broadcast to all lens users.

The march of innovation did not stop. New versions of magic lenses transcended the realm of visual perception and were equipped with a small earpiece, allowing users to modify not only physical traits but auditory perceptions as well. Voices could be refined to resonate at a more pleasing tone and with less abrasiveness. Oratory tics could be subtly removed by first converting audio data to acoustic tokens, then feeding those into a language model fine-tuned on the speaker. It is said that Californians found it helpful to delete excessive “likes” from their vocal stream, having them replaced by more polished transition words. Some users began to tinker with paralinguistic cues and radiated an aura of confidence at all times. The world was no longer just seen through the lenses; it was experienced through them.
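As a toy illustration of that filler-removal step: a real system would operate on acoustic tokens from a neural audio codec and a language model fine-tuned on the speaker, but a simple filter over transcript tokens captures the idea. Everything here (the `FILLERS` set, the `remove_fillers` helper, and the replacement map) is invented for the sketch:

```python
# Toy stand-in for the filler-removal pipeline described above. Instead of
# acoustic tokens and a fine-tuned language model, we filter word tokens,
# optionally swapping a filler for a more polished transition word.

FILLERS = {"like", "um", "uh"}

def remove_fillers(tokens, replacements=None):
    """Drop filler tokens; substitute a replacement word where one is given."""
    replacements = replacements or {}
    cleaned = []
    for tok in tokens:
        if tok.lower() in FILLERS:
            if tok.lower() in replacements:
                cleaned.append(replacements[tok.lower()])
            # otherwise drop the filler entirely
        else:
            cleaned.append(tok)
    return cleaned

stream = ["So", "like", "the", "lenses", "um", "retrain", "every", "frame"]
print(remove_fillers(stream, {"like": "essentially"}))
# → ['So', 'essentially', 'the', 'lenses', 'retrain', 'every', 'frame']
```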

While many reveled in the seemingly limitless potential of magic lenses, the advent of their counterpart, “evil lenses,” opened a Pandora’s box of ethical dilemmas. Designed and manufactured by a group of rogue engineers, these lenses extended the abilities of magic lenses. Where magic lenses only permitted individuals to manipulate their own appearances, evil lenses granted users the additional capability to alter how they perceived others, all without the other party’s knowledge or consent. Evil lenses look and feel like magic lenses, and the developers went as far as reverse-engineering the magic lens protocol, rendering them entirely undetectable.

Despite their ominous name, evil lenses were not designed with harmful intentions. In fact, the name was coined as a joke, although they later came to embody it. They were originally developed to address the inherent subjectivity of perception. For example, your idea of a pleasing voice may not be the same as mine: perhaps I find your real voice nasal, and I might wish to generate a voice based on yours with the frequencies I find grating subtracted. Shortly after, however, the uses progressed from practical to impish. One user manipulated his lenses to make all politicians appear to don clown suits, replete with face paint and a red nose. What followed, though, wasn’t quite as harmless.

Another user published a module that downgraded designer clothes to their Ross discount-aisle equivalents. Jealousy in the workplace led some to depict work rivals as slovenly and unhygienic, and to encourage coworkers to do the same.

Just as magic lens users did, evil lens users began to toy with the non-physical. One user, seeking to boost his ego, altered the speech of individuals he deemed competent, transforming their delivery to be simplistic and gauche. Inspired by this, another boasted of having work rivals appear nervous and uncertain during presentations, claiming it empowered him to perform better. Evil lenses ultimately became an avenue for narcissists to silently project their twisted fantasies onto the lenses, redefining their experiential perceptions.

Let’s now return to the present to address the concerns a skeptical reader may have. They might argue that such an immersive, hyper-personalized reality is far-fetched. However, the foundational technologies powering magic lenses have already been developed. First, the generated images could be produced by a version of Midjourney focused solely on hyper-realistic reality mirroring. Second, Google’s AudioLM can reproduce your voice with uncanny accuracy (I encourage any skeptic to follow this link and listen for themselves). Lastly, a team of researchers used deep recurrent neural networks to upscale low-quality video via generative super-resolution.

Although the current research is focused on practical applications, what if, in addition to super-resolution, the content of the video could also be tailored to an individual's preferences? Now — advancements in nanotechnology notwithstanding — we aren’t so far off from a pair of magic lenses. The remaining hurdles are largely engineering problems: iterative improvements in output quality, reducing resource requirements, and optimizing training and inference runtimes. These problems are actively being worked on, evidenced by ChatGPT’s new ‘turbo mode,’ which can output tokens faster than the eye can process them.

Just a few days ago, rapper Ice Cube commented on “Heart on My Sleeve,” describing AI cloning as demonic. Is that simply a visceral reaction to a new and confusing technology, or is there something more there? I would argue the latter. We like to think of our voice, our physical appearance, and our likeness as God-given, or, as I call them, sub-essences of the self. These sub-essences are the unique result of hundreds of thousands of years of DNA recombination and mutation. Voice is not merely a physical characteristic; it captures the emotion, personality, and ideas of the speaker. The life of the speaker, his parents, and all of his ancestors is encoded in those frequencies.

It is the unique amalgamation of these sub-essences that defines one’s identity and creates one’s essence, or sense of self. For that same reason, we do not feel the same level of disquiet when, for example, an Elvis cover artist performs with remarkable accuracy: he imitates a single sub-essence, the voice, without claiming the whole. (I explore this idea in more detail in this post.)

Drawing on the philosophy of Kant, the principle of universalizability applies nicely under these circumstances. As a brief primer: a rule is not morally permissible if it fails the universalization test. In this thought experiment, one imagines a world where the rule is universally adopted; for the rule to be permissible, such adoption should neither inflict harm nor cause chaos. For example, to test the rule “cutting in line is permissible,” we imagine a world where everyone cuts the line. There is no longer such a thing as a line; there are only mobs, and the test fails. Similarly, if we take the rule “you may generate a sub-essence of another person,” we can conceive of a world in which everyone does. If everyone can speak in your voice, in addition to their own, a sub-essence of yours has been unfairly stolen; now shared by all, you are reduced to less of a person than you were before.

Additionally, when magic lenses clone the essence of another, it is for the sole purpose of generating variations. Even in practical use cases, such as audio denoising, the model generates new data that did not previously exist. Magic lenses are less nefarious since each individual controls only his own perception, although they still enable the duplication of another person’s essence. When we conjure up a negative perception of someone in our mind, a barrier exists between imagination and reality, forcing us, at some level, to confront the truth. With evil lenses, however, this boundary blurs: the wearer manipulates the perceived individual’s essence without consent. The perceived is denied his right to autonomy and is thus stripped of his self-governance.

Ethical norms and legal regulations prohibit such impersonations in the real world. In New York, for example, PL 190.25(1) clearly states that it is a crime to assume another person’s character with the intent to obtain benefit. In the same vein, unauthorized manipulation of a person’s likeness, voice, or behavior in the digital realm is equally unacceptable.

Regrettably, I don’t come bearing solutions to these dilemmas. It is likely that, in the future, generative models will run performantly on consumer hardware (e.g., cell phones and tablets), and that models and weights will be open-sourced. Even if not released deliberately, the weights of powerful models will eventually leak to the public, as we witnessed with Facebook’s LLaMA model. In the nineties, the US government tried to regulate cryptography as munitions and failed spectacularly: cypherpunks cleverly exchanged cryptographic algorithms in book form, asserting that restricting them would violate their First Amendment rights. How can we really expect to regulate matrix multiplication?

Alternatively, one could argue that personal data ought to be closely guarded. While I agree with the sentiment, I worry about the practicality of doing so. Google’s AudioLM paper claims to produce consistent continuations of a speaker’s voice with only three seconds of audio. With the amount of data that is currently online, can you be absolutely sure that three seconds of your voice isn’t floating around the Internet somewhere?
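To put that three-second figure in perspective, here is a rough calculation of how little raw data it represents. The 16 kHz mono, 16-bit assumptions below are mine for illustration, not details taken from the post:

```python
# How much raw data "three seconds of audio" amounts to, assuming
# 16 kHz mono sampling at 16 bits (2 bytes) per sample.
SAMPLE_RATE = 16_000   # samples per second (assumed)
BYTES_PER_SAMPLE = 2   # 16-bit PCM (assumed)
seconds = 3

n_samples = SAMPLE_RATE * seconds
n_bytes = n_samples * BYTES_PER_SAMPLE

print(n_samples, "samples,", n_bytes / 1000, "kB")  # → 48000 samples, 96.0 kB
```

In other words, a voicemail greeting or a single Instagram story is already more than enough material.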

However, this doesn’t leave the fate of generative AI entirely unchecked. Public opinion will play a significant role in shaping its acceptance and defining the ethical boundaries surrounding its use. Perhaps the technology will be eschewed by most and effectively self-regulated through market dynamics. I remember, quite fondly, when an overzealous early adopter of Google Glass was promptly thrown out of a bar. Conversely, indifference or blind acceptance of such practices early on will pave the way for an escalation of unethical use.
