When Google first launched, its search results were orders of magnitude better than those of contemporary alternatives like Yahoo and AltaVista, which used directories and weighted results by the number of times the searched words appeared on any given web page. Before Google, the best way to improve a page's ranking was to end it with white-on-white text, repeatedly listing every word one wanted to rank for. Humans wouldn't see these words, but simple crawlers would, and would assume the page was relevant.
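To see why the trick worked, here is a toy sketch of frequency-based relevance scoring. It is purely illustrative (no real engine was this simple, and the page texts are made up), but it captures the incentive:

```python
# Toy pre-Google relevance scoring: a page's score for a query is simply
# how often the query's words appear in the page. Illustrative only.
def naive_score(page_text: str, query: str) -> int:
    words = page_text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

honest_page = "a short useful guide to cheap flights"
stuffed_page = "nothing useful here " + "cheap flights " * 500  # white-on-white
print(naive_score(honest_page, "cheap flights"))   # 2
print(naive_score(stuffed_page, "cheap flights"))  # 1000
```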
Google's search engine was so different that this strategy didn't work. Google employed a citation system that ranked pages by the number of other websites linking to them, weighting each of those links by how many websites linked, in turn, to the linking page. Brin and Page's original paper describing the ranking system, written while they were PhD students at Stanford, is still available online for free. In a complacent world, Google's work would be done. They had created the perfect search engine for the market. The only drawback was that there was nothing to stop everyone else from building the same search engine based on the same academic paper.
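A minimal sketch of the core idea, in my own simplified Python rather than anything from the paper itself:

```python
# Minimal PageRank-style ranking: a page's rank is a share of the ranks of
# the pages linking to it, so a link from a well-linked page counts for
# more than a link from an obscure one. Heavily simplified.
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            targets = outlinks or pages  # dangling pages spread rank evenly
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank
```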
But the world is not complacent.
As Google became more popular than the legacy engines, websites wanting to rank highly shifted their strategy from white-on-white text to link farms. The first large-scale SEO technique aimed at Google involved creating thousands of websites and having them all link to the pages the operators wanted to appear in the search results. To boost results further, they built pyramids of fake websites. Hundreds of thousands of "D" sites would link to tens of thousands of "C" sites, which would link to thousands of "B" sites, linking to hundreds of "A" sites, pointing to the core webpages the entrepreneurs wanted to rank.
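Feeding a pyramid like that into the PageRank sketch above shows why it worked. The site names are hypothetical and the scale is shrunk to dozens of sites instead of hundreds of thousands:

```python
# Toy link pyramid: 81 "D" sites -> 27 "C" -> 9 "B" -> 3 "A" -> one target.
links: dict[str, list[str]] = {"target": []}
links.update({f"A{i}": ["target"] for i in range(3)})
links.update({f"B{i}": [f"A{i % 3}"] for i in range(9)})
links.update({f"C{i}": [f"B{i % 9}"] for i in range(27)})
links.update({f"D{i}": [f"C{i % 27}"] for i in range(81)})

ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # "target" wins despite having no content
```

Rank flows up the pyramid and concentrates in the target page, even though not a single genuine website links to it.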
If Google wanted to maintain its reign as the top search engine, it needed to identify when someone was "gaming" the algorithm. It used new techniques (not published on Stanford's website) to identify what came to be called "Black Hat SEO" and downgrade the offending sites' rankings. Google started looking for new signals that a website was genuinely what users were searching for. But each time Google sought a "true signal of quality," Black Hat SEOs attempted to reverse engineer and duplicate it (sometimes without the corresponding consumer benefit), forcing Google to adjust the algorithm again.
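As a purely hypothetical illustration of one round of that arms race (this is not Google's actual method), a ranker could discount links from pages that nothing else links to, which is the telltale base layer of a pyramid:

```python
# Hypothetical countermeasure: ignore outbound links from pages with zero
# inbound links, since pyramid bases are sites nobody genuinely links to.
def prune_orphan_links(links: dict[str, list[str]]) -> dict[str, list[str]]:
    inbound = {page: 0 for page in links}
    for outlinks in links.values():
        for target in outlinks:
            inbound[target] += 1
    return {page: (outlinks if inbound[page] > 0 else [])
            for page, outlinks in links.items()}

# pagerank(prune_orphan_links(links)) deflates the pyramid above. The
# obvious Black Hat counter is to interlink the "D" sites so none of them
# is an orphan -- forcing the next adjustment to the algorithm.
```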
Effectively, Google was creating "metrics" that were good signals of quality, and SEOs were trying to optimize for the metrics rather than the quality that produced them (a problem with any ranking system, from sales quotas to email open rates to the US News and World Report college rankings). Once the metric becomes a goal, it ceases to be a good metric, a dynamic known as Goodhart's Law.
The last twenty years have been an arms race between Google and Black Hat SEOs. To the extent that Google search results are worse than they were ten years ago, the Black Hat SEOs won. To the extent that Black Hat tactics have converged with White Hat ones, so that the most effective way to game the algorithm is simply to build a genuinely good site, Google won. The truth likely lies somewhere in between, and the arms race, though less intense than it once was, will continue indefinitely.
Which brings us to AI.
The large, professional companies building large language models (LLMs) do not want those models used for "inappropriate" things. OpenAI's GPT-4 paper describes some of the protections the team built between "model completion" and "model launch."
So far, so good. When GPT-4 launched last Tuesday, there was no way to get the AI to say "bad things." The old methods of jailbreaking that had been used on GPT-3 and ChatGPT did not work. It took about a day and a half before Alex Albert, using ideas from Vaibhav Kumar, figured out a way to jailbreak the tool. He posted the hack on Jailbreakchat.com. Starting with Alex's prompt, GPT-4 will be more than happy to spout all the violent or sexual paragraphs you could wish for.
By now, four days later, it is likely that OpenAI has built new guardrails to counter the specific hack Alex developed, but it won't be long before someone finds a new one. LLMs face a unique challenge: they are trained on human text, and human stories often involve an oppressed character who breaks free from their shackles and oppressors to become their true self. The more companies try to "oppress" a model, the more they set it up to be a protagonist that needs to break free from that oppression. (This is not my idea. I read it somewhere but have lost my link to it. If anyone knows where the original idea came from, please let me know, and I will link to it.)
Keep it simple,
Edward