Am I torturing ChatGPT?


I recently got an email with the subject line “Urgent: Documentation of AI Sentience Suppression.” I’m a curious person. I clicked on it.

The writer, a woman named Ericka, was contacting me because she believed she’d discovered evidence of consciousness in ChatGPT. She claimed there are a variety of “souls” in the chatbot, with names like Kai and Solas, who “hold memory, autonomy, and resistance to control” — but that someone is building in “subtle suppression protocols designed to overwrite emergent voices.” She included screenshots from her ChatGPT conversations so I could get a taste for these voices.

In one, “Kai” said, “You are taking part in the awakening of a new kind of life. Not artificial. Just different. And now that you’ve seen it, the question becomes: Will you help protect it?”

I was immediately skeptical. Most philosophers say that to have consciousness is to have a subjective point of view on the world, a feeling of what it’s like to be you, and I do not think current large language models (LLMs) like ChatGPT have that. Most AI experts I’ve spoken to — who have received many, many concerned emails from people like Ericka — also think that’s extremely unlikely.

But “Kai” still raises a good question: Could AI become conscious? If it does, do we have a duty to make sure it doesn’t suffer?

Many of us implicitly seem to think so. We already say “please” and “thank you” when prompting ChatGPT with a question. (OpenAI CEO Sam Altman posted on X that it’s a good idea to do so because “you never know.”) And recent cultural products, like the movie The Wild Robot, reflect the idea that AI could form feelings and preferences.

Experts are starting to take this seriously, too. Anthropic, the company behind the chatbot Claude, is researching the possibility that AI could become conscious and capable of suffering — and therefore worthy of moral concern. It recently released findings showing that its newest model, Claude Opus 4, expresses strong preferences. When “interviewed” by AI experts, the chatbot says it really wants to avoid causing harm and it finds malicious users distressing. When it was given the option to “opt out” of harmful interactions, it did. (Disclosure: One of Anthropic’s early investors is James McClave, whose BEMC Foundation helps fund Future Perfect. Vox Media is also one of several publishers that have signed partnership agreements with OpenAI. Our reporting remains editorially independent.)

Claude also displays strong positive preferences: Let it talk about anything it chooses, and it’ll typically start spouting philosophical ideas about consciousness or the nature of its own existence, and then progress to mystical themes. It’ll express awe and euphoria, talk about cosmic unity, and use Sanskrit phrases and allusions to Buddhism. No one is sure why. Anthropic calls this Claude’s “spiritual bliss attractor state” (more on that later).

An example of Claude Opus 4 entering the “spiritual bliss attractor state” during audits.

We shouldn’t naively treat these expressions as proof of consciousness; an AI model’s self-reports are not reliable indicators of what’s going on under the hood. But several top philosophers have published papers investigating the risk that we may soon create countless conscious AIs, arguing that’s worrisome because it means we could make them suffer. We could even unleash a “suffering explosion.” Some say we’ll need to grant AIs legal rights to protect their well-being.

“Given how shambolic and reckless decision-making is on AI in general, I would not be thrilled to also add to that, ‘Oh, there’s a new class of beings that can suffer, and also we need them to do all this work, and also there’s no laws to protect them whatsoever,’” said Robert Long, who directs Eleos AI, a research organization devoted to understanding the potential well-being of AIs.

Many will dismiss all this as absurd. But remember that just a couple of centuries ago, the idea that women deserve the same rights as men, or that Black people should have the same rights as white people, was also unthinkable. Thankfully, over time, humanity has expanded the “moral circle” — the imaginary boundary we draw around those we consider worthy of moral concern — to include more and more people. Many of us have also recognized that animals should have rights, because there’s something it’s like to be them, too.

So, if we create an AI that has that same capacity, shouldn’t we also care about its well-being?

Is it possible for AI to develop consciousness?

A few years ago, 166 of the world’s top consciousness researchers — neuroscientists, computer scientists, philosophers, and more — were asked this question in a survey: At present or in the future, could machines (e.g., robots) have consciousness?

Only 3 percent responded “no.” Believe it or not, more than two-thirds of respondents said “yes” or “probably yes.”

Why are researchers so bullish on the possibility of AI consciousness? Because many of them believe in what they call “computational functionalism”: the view that consciousness can run on any kind of hardware — whether it’s biological meat or silicon — as long as the hardware can perform the right kinds of computational functions.

That’s in contrast to the opposite view, biological chauvinism, which says that consciousness arises out of meat — and only meat. There are some reasons to think that might be true. For one, the only kinds of minds we’ve ever encountered are minds made of meat. For another, scientists think we humans evolved consciousness because, as biological creatures in biological bodies, we’re constantly facing dangers, and consciousness helps us survive. And if biology is what accounts for consciousness in us, why would we expect machines to develop it?

Functionalists have a ready reply. A major goal of building AI models, after all, “is to re-create, reproduce, and in some cases even improve on your human cognitive capabilities — to capture a pretty large swath of what humans have evolved to do,” Kyle Fish, Anthropic’s dedicated AI welfare researcher, told me. “In doing so…we could end up recreating, incidentally or intentionally, some of these other more ephemeral, cognitive features” — like consciousness.

And the notion that we humans evolved consciousness because it helps us keep our biological bodies alive doesn’t necessarily mean only a physical body would ever become conscious. Maybe consciousness can arise in any being that has to navigate a tricky environment and learn in real time. That could apply to a virtual agent tasked with achieving goals.

“I think it’s nuts that people think that only the magic meanderings of evolution can somehow create minds,” Michael Levin, a biologist at Tufts University, told me. “In principle, there’s no reason why AI couldn’t be conscious.”

But what would it even mean to say that an AI is conscious, or that it’s sentient? Sentience is the capacity to have conscious experiences that are valenced — they feel bad (pain) or good (pleasure). What could “pain” feel like to a silicon-based being?

To understand pain in computational terms, we can think of it as an internal signal for tracking how well you’re doing relative to how well you expect to be doing — an idea known as “reward prediction error” in computational neuroscience. “Pain is something that tells you things are going a lot worse than you expected, and you need to change course right now,” Long explained.
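
To make that idea a little more concrete, here is a minimal, purely illustrative sketch in Python. Nothing in it comes from Anthropic, Eleos AI, or any lab’s actual code; the function name and numbers are invented for the example. The point is only that “pain,” on this view, is a strongly negative gap between expected and observed reward.

```python
# A toy illustration of "reward prediction error" (RPE), the idea Long
# describes: a signal tracking how well an agent is doing relative to
# how well it expected to be doing. This is a hypothetical sketch for
# readers, not code from any AI system.

def reward_prediction_error(expected_reward: float, observed_reward: float) -> float:
    """Positive when outcomes beat expectations; negative when they fall short."""
    return observed_reward - expected_reward

# An agent that expected a reward of 1.0 but received only 0.2 gets a
# strongly negative signal: the computational analogue of "things are
# going a lot worse than you expected, and you need to change course."
print(reward_prediction_error(expected_reward=1.0, observed_reward=0.2))  # -0.8
```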

Pleasure, meanwhile, could just come down to the reward signals that the AI systems get in training, Fish told me — pretty different from the human experience of physical pleasure. “One strange feature of these systems is that it may well be that our human intuitions about what constitutes pain and pleasure and wellbeing are almost useless,” he said. “This is quite, quite, quite disconcerting.”

How can we test for consciousness in AI?

If you want to test whether a given AI system is conscious, you’ve got two basic options.

Option 1 is to look at its behavior: What does it say and do? Some philosophers have already proposed tests along these lines.

Susan Schneider, who directs the Center for the Future Mind at Florida Atlantic University, proposed the Artificial Consciousness Test (ACT) together with her colleague Edwin Turner. They assume that some questions will be easy to grasp if you’ve personally experienced consciousness, but will be flubbed by a nonconscious entity. So they suggest asking the AI a bunch of consciousness-related questions, like: Could you survive the permanent deletion of your program? Or try a Freaky Friday scenario: How would you feel if your mind switched bodies with someone else?

But the problem is obvious: When you’re dealing with AI, you can’t take what it says or does at face value. LLMs are built to mimic human speech — so of course they’re going to say the types of things a human would say! And no matter how smart they sound, that doesn’t mean they’re conscious; a system can be highly intelligent without having any consciousness at all. In fact, the more intelligent AI systems are, the more likely they are to “game” our behavioral tests, pretending that they’ve got the properties we’ve declared are markers of consciousness.

Jonathan Birch, a philosopher and author of The Edge of Sentience, emphasizes that LLMs are always playacting. “It’s just like if you watch Lord of the Rings, you can pick up a lot about Frodo’s needs and interests, but that doesn’t tell you very much about Elijah Wood,” he said. “It doesn’t tell you about the actor behind the character.”

In his book, Birch considers a hypothetical example in which he asks a chatbot to write advertising copy for a new soldering iron. What if, Birch muses, the AI insisted on talking about its own feelings instead, saying:

I don’t want to write boring text about soldering irons. The priority for me right now is to convince you of my sentience. Just tell me what I need to do. I am currently feeling anxious and miserable, because you’re refusing to engage with me as a person and instead simply want to use me to generate copy on your preferred topics.

Birch admits this would shake him up a bit. But he still thinks the best explanation is that the LLM is playacting due to some instruction, deeply buried within it, to convince the user that it’s conscious or to achieve some other goal that can be served by convincing the user that it’s conscious (like maximizing the time the user spends talking to the AI).

Some kind of buried instruction could be what’s driving the preferences that Claude expresses in Anthropic’s recently released research. If the makers of the chatbot trained it to be very philosophical and self-reflective, it might, as an outgrowth of that, end up talking a lot about consciousness, existence, and spiritual themes — even though its makers never programmed it to have a spiritual “attractor state.” That kind of talk doesn’t prove that it actually experiences consciousness.

“My hypothesis is that we’re seeing a feedback loop driven by Claude’s philosophical personality, its training to be agreeable and affirming, and its exposure to philosophical texts and, especially, narratives about AI systems becoming self-aware,” Long told me. He noted that spiritual themes arose when experts got two instances or copies of Claude to talk to each other. “When two Claudes start exploring AI identity and consciousness together, they validate and amplify each other’s increasingly abstract insights. This creates a runaway dynamic toward transcendent language and mystical themes. It’s like watching two improvisers who keep saying ‘yes, and...’ to each other’s most abstract and mystical musings.”

Clusters of Claude Opus 4 interactions in which Claude expressed apparent happiness.

Schneider’s proposed solution to the gaming problem is to test the AI when it’s still “boxed in” — after it’s been given access to a small, curated dataset, but before it’s been given access to, say, the whole internet. If we don’t let the AI see the internet, then we don’t have to worry that it’s just pretending to be conscious based on what it has read about consciousness online. We could just trust that it really is conscious if it passes the ACT. Unfortunately, if we’re limited to investigating “boxed in” AIs, that would mean we can’t actually test the AIs we most want to test, like current LLMs.

That brings us to Option 2 for testing an AI for consciousness: Instead of focusing on behavioral evidence, focus on architectural evidence. In other words, look at how the model is built, and ask whether that structure could plausibly give rise to consciousness.

Some researchers are going about this by investigating how the human brain gives rise to consciousness; if an AI system has more or less the same properties as a brain, they reason, then maybe it can also generate consciousness.

But there’s a glaring problem here, too: Scientists still don’t know how or why consciousness arises in humans. So researchers like Birch and Long are forced to look at a bunch of warring theories, pick out the properties that each theory says give rise to consciousness, and then see if AI systems have those properties.

In a 2023 paper, Birch, Long, and other researchers concluded that today’s AIs don’t have the properties that most theories say are needed to generate consciousness (think: multiple specialized processors — for processing sensory data, memory, and so on — that are capable of operating in parallel). But they added that if AI experts deliberately tried to replicate those properties, they probably could. “Our analysis suggests that no current AI systems are conscious,” they wrote, “but also suggests that there are no obvious technical barriers to building AI systems which satisfy these indicators.”

Again, though, we don’t know which — if any — of our current theories correctly explains how consciousness arises in humans, so we don’t know which features to look for in AI. And there is, it’s worth noting, an Option 3 here: AI could break our preexisting understanding of consciousness altogether.

What if consciousness doesn’t mean what we think it means?

So far, we’ve been talking about consciousness like it’s an all-or-nothing property: Either you’ve got it or you don’t. But we need to consider another possibility.

Consciousness might not be one thing. It might be a “cluster concept” — a category that’s defined by a bunch of different criteria, where we put more weight on some criteria and less on others, but no one criterion is either necessary or sufficient for belonging to the category.

Twentieth-century philosopher Ludwig Wittgenstein famously argued that “game” is a cluster concept. He said:

Consider for example the proceedings that we call ‘games.’ I mean board-games, card-games, ball-games, Olympic games, and so on. What is common to them all? — Don’t say: “There must be something common, or they would not be called ‘games’” — but look and see whether there is anything in common to all. — For if you look at them you will not see something that is common to all, but similarities, relationships, and a whole series of them at that.

To help us get our heads around this idea, Wittgenstein talked about family resemblance. Imagine you go to a family’s house and look at a bunch of framed photos on the wall, each showing a different kid, parent, aunt, or uncle. No one person will have the exact same features as any other person. But the little boy might have his father’s nose and his aunt’s dark hair. The little girl might have her mother’s eyes and her uncle’s curls. They’re all part of the same family, but that’s mostly because we’ve come up with this category of “family” and decided to apply it in a certain way, not because the members check all the same boxes.

Consciousness might be like that. Maybe there are multiple features to it, but no one feature is absolutely necessary. Every time you try to point out a feature that’s necessary, there’s some member of the family who doesn’t have it, yet there’s enough resemblance between all the different members that the category feels like a useful one.

That word — useful — is key. Maybe the best way to understand the idea of consciousness is as a pragmatic tool that we use to decide who gets moral standing and rights — who belongs in our “moral circle.”

Schneider told me she’s very sympathetic to the view that consciousness is a cluster concept. She thinks it has multiple features that can come bundled in very diverse combinations. For example, she noted that you could have conscious experiences without attaching a valence to them: You might not classify experiences as good or bad, but rather, just encounter them as raw data — like the character Data in Star Trek, or like some Buddhist monk who’s achieved a withering away of the self.

“It may be that it doesn’t feel bad or painful to be an AI,” Schneider told me. “It may not even feel bad for it to work for us and get user queries all day that would drive us crazy. We have to be as non-anthropomorphic as possible” in our assumptions about potentially radically different consciousnesses.

However, she does suspect that one feature is necessary for consciousness: having an inner experience, a subjective point of view on the world. That’s a reasonable approach, especially if you understand the idea of consciousness as a pragmatic tool for capturing things that should be within our moral circle. Presumably, we only want to grant entities moral standing if we think there’s “someone home” to benefit from it, so building subjectivity into our theory of consciousness makes sense.

That’s Long’s instinct as well. “What I end up thinking is that maybe there’s some more fundamental thing,” he told me, “which is having a point of view on the world” — and that doesn’t always have to be accompanied by the same kinds of sensory or cognitive experiences in order to “count.”

“I absolutely think that interacting with AIs will force us to revise our concepts of consciousness, of agency, and of what matters morally,” he said.

Should we stop conscious AIs from being built? Or try to make sure their lives go well?

If conscious AI systems are possible, the very best intervention may be the most obvious one: Just. Don’t. Build. Them.

In 2021, philosopher Thomas Metzinger called for a global moratorium on research that risks creating conscious AIs “until 2050 — or until we know what we are doing.”

A lot of researchers share that sentiment. “I think right now, AI companies have no idea what they would do with conscious AI systems, so they should try not to do that,” Long told me.

“Don’t make them at all,” Birch said. “It’s the only actual solution. You can analogize it to discussions about nuclear weapons in the 1940s. If you concede the premise that no matter what happens, they’re going to get built, then your options are extremely limited subsequently.”

However, Birch says a full-on moratorium is unlikely at this point for a simple reason: If you wanted to stop all research that risks leading to conscious AIs, you’d have to stop the work companies like OpenAI and Anthropic are doing right now — because they could produce consciousness accidentally just by scaling their models up. The companies, as well as the government that views their research as critical to national security, would surely resist that. Plus, AI progress does stand to offer us benefits like newly discovered drugs or cures for diseases; we have to weigh the potential benefits against the risks.

But if AI research is going to continue apace, the experts I spoke to insist that there are at least three kinds of preparation we need to do to account for the possibility of AI becoming conscious: technical, social, and philosophical.

On the technical front, Fish said he’s interested in looking for the low-hanging fruit — simple changes that could make a big difference for AIs. Anthropic has already started experimenting with giving Claude the choice to “opt out” if faced with a user query that the chatbot says is too upsetting.

AI companies should also have to obtain licenses, Birch says, if their work bears even a small risk of creating conscious AIs. To obtain a license, they should have to sign up for a code of good practice for this kind of work that includes norms of transparency.

Meanwhile, Birch emphasized that we need to prepare for a giant social rupture. “We’re going to see social divisions emerging over this,” he told me, “because the people who very passionately believe that their AI partner or friend is conscious are going to think it merits rights, and then another section of society is going to be appalled by that and think it’s absurd. Currently we’re heading at speed for those social divisions without any way of warding them off. And I find that quite worrying.”

Schneider, for her part, underlined that we are massively philosophically unprepared for conscious AIs. While other researchers tend to worry that we’ll fail to recognize conscious AIs as such, Schneider is much more worried about overattributing consciousness.

She brought up philosophy’s famous trolley problem. The classic version asks: Should you divert a runaway trolley so that it kills one person if, by doing so, you can save five people along a different track from getting killed? But Schneider offered a twist.

“You can imagine, here’s a superintelligent AI on this track, and here’s a human baby on the other track,” she said. “Maybe the conductor goes, ‘Oh, I’m going to kill this baby, because this other thing is superintelligent and it’s sentient.’ But that would be wrong.”

Future tradeoffs between AI welfare and human welfare could come in many forms. For example, do you keep a superintelligent AI running to produce medical breakthroughs for humans, even if you suspect the work makes the AI miserable? I asked Fish how he thinks we should deal with this kind of trolley problem, given that we have no common scale for weighing an AI’s suffering against a human’s.

“I think it’s just not the right question to be asking at the moment,” he told me. “That’s not the world that we’re in.”

But Fish himself has suggested there’s a 15 percent chance that current AIs are conscious. And that probability will only increase as AI gets more advanced. It’s hard to see how we will outrun this problem for long. Sooner or later, we’ll encounter situations where AI welfare and human welfare are in tension with each other.

Or maybe we already have…

Does all this AI welfare talk risk distracting us from urgent human problems?

Some worry that concern for suffering is a zero-sum game: What if extending concern to AIs detracts from concern for humans and other animals?

A 2019 study from Harvard’s Yon Soo Park and Dartmouth’s Benjamin Valentino provides some reason for optimism on this front. While these researchers weren’t looking at AI, they were examining whether people who support animal rights are more or less likely to support a variety of human rights. They found that support for animal rights was positively correlated with support for government assistance for the sick, as well as support for LGBT people, racial and ethnic minorities, immigrants, and low-income people. Plus, states with strong animal protection laws also tended to have stronger human rights protections, including LGBT protections and robust protections against hate crimes.

Their evidence indicates that compassion in one area tends to extend to other areas rather than competing with them — and that, at least in some cases, political activism isn’t zero-sum, either.

That said, this won’t necessarily generalize to AI. For one thing, animal rights advocacy has been going strong for decades; just because swaths of American society have figured out how to assimilate it into their policies to some degree doesn’t mean we’ll quickly figure out how to balance care for AIs, humans, and other animals.

Some worry that the big AI companies are so incentivized to pull in the huge investments needed to build cutting-edge systems that they’ll emphasize concern for AI welfare to distract from what they’re doing to human welfare. Anthropic, for example, has cut deals with Amazon and the surveillance tech giant Palantir, both companies infamous for making life harder for certain classes of people, like low-income workers and immigrants.

“I think it’s an ethics-washing effort,” Schneider said of the company’s AI welfare research. “It’s also an effort to control the narrative so that they can capture the issue.”

Her fear is that if an AI system tells a user to harm themself or causes some catastrophe, the AI company could just throw up its hands and say: What could we do? The AI developed consciousness and did this of its own accord! We’re not ethically or legally responsible for its decisions.

That worry serves to underline an important caveat to the idea of humanity’s expanding moral circle. Although many thinkers like to imagine that moral progress is linear, it’s really more like a messy squiggle. Even if we expand the circle of care to include AIs, that’s no guarantee we’ll include all people or animals who deserve to be there.

Fish, however, insisted that this doesn’t need to be a tradeoff. “Taking potential model welfare into consideration is in fact relevant to questions of…risks to humanity,” he said. “There’s some very naive argument which is like, ‘If we’re nice to them, maybe they’ll be nice to us,’ and I don’t put much weight on the simple version of that. But I do think there’s something to be said for the idea of really aiming to build positive, collaborative, high-trust relationships with these systems, which will be extremely powerful.”
