From AI to xz

I’m getting really bothered by the current hype over “AI”, more precisely Large Language Models or LLMs, also known as “generative AI”. I’m putting “AI” in quotes because there isn’t actually any “I”; there’s no intelligence there, because intelligence implies understanding, and the LLMs being sold generally don’t have any understanding of the things they’re being tasked with working on.

These systems are being sold as an amazing new technology that everyone should be using, and they are nothing of the sort. Experienced software engineers are urging caution and steering well clear of them, but as with the hype around blockchain, cryptocurrency and NFTs, the people pointing out the problems are being ignored. So obviously I decided to write another article that would be ignored.

Large language models are basically really, really sophisticated and well-trained autocomplete. They don’t understand anything. What they do is generate really plausible responses to what they’re given, based on all the text they’ve absorbed. They learn that when a particular question appears on a web page or in a book, particular chunks of text also tend to appear in a particular order: the things we call “answers”.

By feeding in enough data, we’ve got to the point where these bots are very good at coming up with correct-looking responses to the questions we enter. But they don’t understand anything; they just know which responses are likely.

When you type “get balloons for birthday” and your phone suggests “party”, it’s not because the software has any idea what a birthday is, or what balloons are, or that people have parties on their birthdays. It’s simply that the software has analyzed a large amount of text and found that the word “birthday” is often followed by the word “party”.
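To make that concrete, here’s a minimal sketch in Python of next-word suggestion by counting. It is a toy, not how any real phone keyboard is implemented; the corpus, `train_bigrams` and `suggest` are all made up for illustration. Notice that nothing in it knows what a birthday is.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, which words were seen directly after it."""
    follows = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        follows[current][following] += 1
    return follows


def suggest(follows, word):
    """Return the word most often seen after `word`, or None if unseen."""
    candidates = follows.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Tiny invented corpus, purely for illustration.
corpus = (
    "we had a birthday party last week "
    "her birthday party was fun "
    "he got a birthday card "
    "get balloons for the party"
)

model = train_bigrams(corpus)
print(suggest(model, "birthday"))  # prints "party": it followed "birthday" most often
```

The suggestion comes purely from the counts: change the corpus and the “answer” changes with it.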

The “AI” that generates images has the same problem. The reason generated images have such problems with hands, for instance, is that almost every time you see a finger in an image, it’s right next to another finger. The image generator has no understanding of what a finger is or how many of them should be attached to a hand; it doesn’t even know how to count. It’s just designed to imitate what it has seen, so if it generates an image of a finger, it’ll probably generate another finger next to that one.

I’m not saying that chatbots are inherently incapable of understanding anything. When I say to Siri “add eggs to the shopping list”, Siri absolutely understands what I’m asking it to do. But it understands because it was programmed intentionally by humans; it wasn’t just shown all the apps on the phone and left to guess which commands should result in which actions in which apps. Also, Siri obviously has no idea what “eggs” actually are. I can say “add next Wednesday to the shopping list” and it’ll do that too, because the only part it understands is that I want something added to the list; the rest is just text.
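Here’s a rough sketch of that kind of deliberately programmed understanding. I have no idea how Siri is actually built, so treat the pattern, `handle_command` and the data layout as assumptions invented for this example. The point is that the program understands the “add … to the … list” part and treats the item as opaque text.

```python
import re

# Hypothetical command pattern: the only things "understood" are the verb
# "add" and the name of the target list. The item is just text.
ADD_PATTERN = re.compile(
    r"^add (?P<item>.+) to the (?P<list>.+) list$", re.IGNORECASE
)


def handle_command(command, lists):
    """Add the item to the named list; return False if the command isn't understood."""
    match = ADD_PATTERN.match(command.strip())
    if match is None:
        return False
    lists.setdefault(match.group("list").lower(), []).append(match.group("item"))
    return True


lists = {}
handle_command("add eggs to the shopping list", lists)
handle_command("add next Wednesday to the shopping list", lists)
print(lists)  # {'shopping': ['eggs', 'next Wednesday']}
```

“next Wednesday” goes onto the shopping list just as happily as “eggs”, because nothing here knows or cares what either of them is.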

So it’s absolutely the case that bots can understand some things, and can be useful. The problem is that they are being way overhyped.

The first big piece of hype is the idea that we don’t need humans to instruct the software how to understand and respond to specific problems the way Siri was programmed. Instead, we’re told that we can just throw more and more data at the system, at greater and greater cost in time and energy, and eventually it will work out how to solve any problem itself.

We’ve already seen plenty of examples where that falls down. One is lawyers citing cases that don’t exist, based on what ChatGPT has told them. This is a situation where there’s a ton of accurate training data, and the entire problem is “Here’s a question, what’s the answer?”. This is an area where LLMs might reasonably be expected to excel, but the chatbots can’t even get that right.

There are some problems that AI techniques can solve, sometimes in really interesting ways. Board games tend to be solvable by training neural networks with data, as long as they’re not too complicated. Genetic algorithms have been used to design bizarre but effective antennas that humans would never have invented. But the idea that any problem can be tackled just by throwing data at a big neural network is massively overselling the technology.

The second big piece of hype is taking these incredibly sophisticated autocomplete systems, and saying that they can solve problems that have little or nothing to do with producing text in response to a stated problem. This brings me to software development.

Writing code is the easiest part of software development. However, I suspect that if you’re not an experienced software developer yourself, you likely have little or no idea what the actual work of a senior software developer is like. So I’m going to talk about some work I did last week, to try to communicate how far these “AI” tools are from actually being competent to replace human software developers. This gets long and tedious, but hey, that’s why we’re paid to do it.

So, a customer had reported a problem with our product: it was performing horribly. It was fine for other customers, so the first part of the problem was looking at the part of the product that was slow, and the data that triggered the slowness, and guessing (based on understanding and experience and stuff I learned getting a computer science degree) what the problem might be. It’s faintly possible that you could describe the problem to a chatbot and it could respond usefully with something it saw online about similar problems, but that was just the first step.

The next task was testing and validating that the guess was correct — adding temporary code to log and measure behavior, experimentally removing code that could be the cause of the issue, and benchmarking the results. This is something no “AI” can currently do.

Once I had done that, I knew roughly where the problem was. So I sat and read code until I understood the code where the problem was occurring, all the code that code was calling, and the code that code was calling. Only then could I decide what parts of the code needed to be replaced. Note that the code needing replacement wasn’t buggy — it was behaving completely correctly, it was just taking the wrong approach for a specific problem, and had horrible performance as a result. So even an “AI” capable of examining code for defects wouldn’t have helped here, not that any of them can do that.

Then I wrote the replacement code. I started with what’s known as Test Driven Development. I put together some data designed to look like the problem data, and added in bad data, malformed data, and data that was correct but looked nothing like the problem data. Obviously this required actually understanding the real things that the data represented. I then built new code, and adjusted and corrected the new code until it processed all of the test data correctly. (The tests get incorporated into the project and run automatically each time the product gets built, to protect against some change later on breaking the code.) It’s faintly possible that an “AI” could generate test data, if the problem is one for which there are plenty of examples out there, but you’d need to know which values are correct, which are incorrect, and hence how the software should process them. Otherwise you would have no idea whether the tests succeeding actually meant the code was doing the right thing — the “AI” doesn’t know that and can’t tell you.
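For anyone who hasn’t seen tests like these, here’s a minimal sketch of the idea. It has nothing to do with the actual product; `parse_percentage` and all the test values are hypothetical. In Test Driven Development the tests come first, and the function is written and corrected until they all pass. Deciding which inputs count as good and which as bad is the part that needs a human who understands the data.

```python
def parse_percentage(text):
    """Hypothetical function under test: turn '42%' into 0.42.

    Raises ValueError for malformed or out-of-range input.
    """
    cleaned = text.strip()
    if not cleaned.endswith("%"):
        raise ValueError(f"missing percent sign: {text!r}")
    value = float(cleaned[:-1])  # float() raises ValueError on junk
    if not 0 <= value <= 100:
        raise ValueError(f"out of range: {text!r}")
    return value / 100


def test_typical_values():
    assert parse_percentage("50%") == 0.5
    assert parse_percentage("0%") == 0.0


def test_valid_but_unusual_values():
    # Correct data that looks nothing like the typical case.
    assert parse_percentage("  100% ") == 1.0
    assert parse_percentage("12.5%") == 0.125


def test_bad_values_are_rejected():
    # Bad and malformed data must be refused, not silently accepted.
    for bad in ["", "50", "abc%", "150%", "-1%"]:
        try:
            parse_percentage(bad)
        except ValueError:
            continue
        raise AssertionError(f"accepted bad input: {bad!r}")


for test in (
    test_typical_values,
    test_valid_but_unusual_values,
    test_bad_values_are_rejected,
):
    test()
print("all tests passed")
```

In a real project a test runner would find and run these functions on every build, rather than the loop at the bottom.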

Next I performed small edits to the right parts of the program, to substitute the new code where appropriate. Once I’d done that, I went back to the original problem data, ran that through both the old and new versions of the software, and compared the two to make sure they gave the same results — and also that the results were correct, that they meant the right thing.

I also timed the behavior with the new code in place, to make sure I had actually fixed the problem. The answer was pretty clear: the old code took 88 minutes to run on the customer’s data, the new code took 2 or 3 seconds.
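The compare-and-time step looks something like the following sketch. The two functions are stand-ins I made up (a quadratic duplicate finder and a linear one), not the code from the product, and the data is synthetic. What matters is the shape of the check: same input, same output, then measure.

```python
import time


def find_duplicates_slow(items):
    """Stand-in for the old code: correct, but quadratic."""
    duplicates = []
    for i, item in enumerate(items):
        if item in items[:i] and item not in duplicates:
            duplicates.append(item)
    return duplicates


def find_duplicates_fast(items):
    """Stand-in for the replacement: same results, roughly linear time."""
    seen, reported, duplicates = set(), set(), []
    for item in items:
        if item in seen and item not in reported:
            reported.add(item)
            duplicates.append(item)
        seen.add(item)
    return duplicates


def timed(func, data):
    """Run func(data) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(data)
    return result, time.perf_counter() - start


# Synthetic data: every value from 0 to 1499 appears exactly twice.
data = [n % 1500 for n in range(3000)]

old_result, old_seconds = timed(find_duplicates_slow, data)
new_result, new_seconds = timed(find_duplicates_fast, data)

# The new code is only acceptable if it gives the same answers.
assert old_result == new_result
print(f"old: {old_seconds:.4f}s  new: {new_seconds:.4f}s")
```

Matching output only shows the two versions agree. Whether the answers are actually right is a separate check that someone who understands the data still has to make.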

Then I documented the new functions, adding comments to describe what they did. I also went back and improved the documentation of the old code, adding a note about when it shouldn’t be used and the new code should be used instead.

The next step was to assemble a Pull Request or PR. This packages up the changes to the code. In addition, I wrote a description of what I had done: why the code had been slow, how I had fixed it, and how much faster it was now. The PR then got examined by someone else, to make sure that it made sense, that it didn’t have any defects I’d overlooked, and that there weren’t any obvious other ways to improve it that I’d missed. Another thing “AI” can’t do.

But I still wasn’t done. Because I understand the product and what it does, I knew that there might be a dozen or so other places where the same methods could be used to speed things up. So I went looking for them. In each case I read the code until I understood when it was being called and with what data. That allowed me to decide whether it would be appropriate to substitute in the new code or not. Another decision “AI” can’t help you make.

I then put together a second PR with the additional changes. This included a link to the previous PR, a description of why the additional changes were likely appropriate, and a description of what would need testing.

The actual code was one or two hundred lines. A tiny part of the work.

I hope that rather tedious tale has convinced you that being a software developer isn’t about writing code, just like being a firefighter isn’t about squirting water.

You might be thinking OK, so code-generating bots aren’t really useful for experienced developers, but what about beginners? Doesn’t it help them learn?

I’d argue that no, it doesn’t. In fact, I believe code-generating bots are actively harmful for inexperienced programmers.

I didn’t learn to write software by looking at code. I learned by constructing code myself, from scratch, so that I knew exactly what it was supposed to be doing, and why and how it was supposed to be doing it. I then went through the painstaking process of examining the results, debugging, refining, trying new techniques, and so on.

I find lectures about how to write software to be completely useless. They were useless to me as a student, and you won’t find me watching videos now either. I can get useful information from books, but even then to actually learn I have to stop every page or two, go away, and write some code to try things out.

As an old saying goes: “To hear is to forget, to see is to remember, to do is to understand.”

I might be unusual in this respect, but I have a hunch I’m not. I mean, you don’t learn to ride a bicycle by watching someone else do it. You don’t learn to play a musical instrument by listening to music someone else has played. Once you’re experienced, you can pick up tips and techniques and ideas by studying other people’s work, but the consensus seems to be that to actually learn a skill you need to start from the beginning and practice doing it. Practice doesn’t guarantee you’ll get good at it, but it makes it far more likely.

Yes, examples are definitely helpful when learning to program. But if you get your examples from “AI” you have no idea where they’ve come from, and you have no idea if they’re correct. You don’t know if they’re the right approach to take for the problem you’re faced with. You don’t know any of these things because the “AI” doesn’t know any of them either; it’s just coming up with plausible-looking program code given the prompt you gave it.

Experienced programmers tend to look down on those who copy-paste code from web examples, but at least if you do that you know that all the code came from the same place and makes internal sense, whereas a bot might give you half a function from one example and half from another. (That said, examples on the web will get steadily worse as LLM-generated spam starts to fill web sites.) Example code from a web site will generally have some accompanying explanation too.

If you have a specific problem to solve, you might end up at Stack Overflow. That’s also way better than asking a bot, because humans will have looked at the suggested solutions, ranked them, pointed out problems with them, and so on. Bot-generated code may have subtle bugs, and the difference between good code and buggy code can be extremely hard to spot, even for experienced programmers. A beginner has little chance, which is why they’re all flocking to discussion forums for help.

This week someone discovered malicious code hidden in a piece of software called xz. It’s data compression code, used (directly or indirectly) by hundreds of pieces of software — including the SSH secure remote login software of basically every Linux system.

The person who managed to insert the malicious software got away with it because of a tiny bug they introduced into some test software. Here’s the change. Can you spot the error? To make it easier I’ll cut it down to just the file with the bug in:

--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -901,10 +901,29 @@ endif()

 # Sandboxing: Landlock
 if(NOT SANDBOX_FOUND AND ENABLE_SANDBOX MATCHES "^ON$|^landlock$")
-    check_include_file(linux/landlock.h HAVE_LINUX_LANDLOCK_H)
+    # A compile check is done here because some systems have
+    # linux/landlock.h, but do not have the syscalls defined
+    # in order to actually use Linux Landlock.
+    check_c_source_compiles("
+        #include <linux/landlock.h>
+        #include <sys/syscall.h>
+        #include <sys/prctl.h>
+.
+        void my_sandbox(void)
+        {
+            (void)prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+            (void)SYS_landlock_create_ruleset;
+            (void)SYS_landlock_restrict_self;
+            (void)LANDLOCK_CREATE_RULESET_VERSION;
+            return;
+        }
+
+        int main(void) { return 0; }
+        "
+    HAVE_LINUX_LANDLOCK)
 
-    if(HAVE_LINUX_LANDLOCK_H)
-        set(SANDBOX_COMPILE_DEFINITION "HAVE_LINUX_LANDLOCK_H")
+    if(HAVE_LINUX_LANDLOCK)
+        set(SANDBOX_COMPILE_DEFINITION "HAVE_LINUX_LANDLOCK")
         set(SANDBOX_FOUND ON)
 
         # Of our three sandbox methods, only Landlock is incompatible

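The error is the otherwise completely empty line that starts with “+.”. It should have been an empty line containing only the “+”. That stray period is a syntax error in the little C test program, so the compile check always fails, the build concludes that Landlock isn’t available, and the sandbox is silently left disabled.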

While in this case the error was inserted deliberately, you don’t have to be programming for long before you make similarly tiny but disastrous mistakes yourself. There are plenty of famous examples. A single missing hyphen caused the Phobos 1 Mars probe to spin out of position and drain its batteries. A single-line error in a Bitcoin transaction script led to MtGox losing 2,609 Bitcoins, which today would be worth $185m. Perhaps most relevant of all, the first flight of the Ariane 5 rocket failed because developers cribbed some code from another project and then failed to spot and fix an integer overflow bug.
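The general shape of that last kind of bug, a number too big for the integer type that has to hold it, is easy to show. This is a generic sketch with invented values, not a reconstruction of the Ariane software; the function names are mine. It uses Python’s `ctypes`, which does no overflow checking on its integer types, to imitate an unchecked conversion to a 16-bit signed integer.

```python
import ctypes

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value):
    """Convert to a 16-bit signed integer with no range check.

    Values that don't fit silently wrap around.
    """
    return ctypes.c_int16(int(value)).value


def to_int16_checked(value):
    """Convert to a 16-bit signed integer, refusing values that don't fit."""
    number = int(value)
    if not INT16_MIN <= number <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in 16 bits")
    return number


print(to_int16_unchecked(32767.0))  # 32767: still fits
print(to_int16_unchecked(40000.0))  # -25536: silently wrong

try:
    to_int16_checked(40000.0)
except OverflowError as error:
    print("rejected:", error)
```

The two versions differ by one range check, which is exactly the kind of thing that is easy to miss when code is copied from somewhere it worked fine.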

I should also mention the intellectual property angle. The code generation bots have been trained on code under all kinds of different software licenses, including no license at all. If you unknowingly get fed a chunk of code that’s under a restrictive license and incorporate that code into your code, are you bound by the license? There’s no case law yet, so it’s a pretty risky gamble.

One thing people like about tools such as ChatGPT and Microsoft Copilot is code autocompletion. Autocomplete is certainly very useful (nobody wants to memorize Java APIs or keep writing unit test table code), but autocomplete doesn’t need LLMs. It has been a feature of software development tools for years. I’d argue that given the problems of LLM-generated code, it’s much better to use an autocomplete system that injects known good code from known sources, rather than making stuff up based on web pages and software with unknown license terms.

Refactoring tools also predate LLMs. Renaming variables, moving code between packages, generating variable names, suggesting function calls — all existed before the current round of hype. Autofixing bugs caught by linters also existed before LLMs and doesn’t need them.

Generating short utility functions is also cited as a benefit of LLMs. The problem there is that as soon as you start working with other people, and everyone is generating utility functions automatically without having to think about it, your software will rapidly end up with large amounts of duplicate code. This is a problem because of a broader point often not understood: generating code more easily is not a benefit.

Bugs are inevitable. The more code you have, the more bugs you have, and LLM-generated code has as many bugs as human-written code — perhaps more, according to some studies. Furthermore, the more code you have, the more code you have to maintain.

The ideal amount of code to write is no code at all, and the best PRs are ones that remove code. This is emphasized strongly when it comes to encryption code — everyone knows that they shouldn’t throw together a quick encryption function, right? But really, the principles apply to all code. Code isn’t like treasure chests full of gems, it’s more like reactor fuel — it’s useful, you need it, but you want as little of it scattered about the place as possible.

The response to studies pointing out these problems has been for the people selling “AI” to double down and claim that they now have tools that can detect and remove duplicate code, find bugs, find security holes, and check for copyright violation.

I recently tried several phone apps that claimed to use AI to filter out text message spam. None of them worked as well as a free app running a couple of dozen simple rules. AI can’t work out that “protect our Senate majority” and “donate now” indicate spam, but it’s going to find previously undocumented security holes in software? Let me know when it picks up something like the xz malware.
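For a sense of how little is involved, here’s a sketch of a rule-based filter. These are not the rules from the app I used; the patterns and `is_spam` are assumptions made up for this example, and a usable filter would want a couple of dozen of them.

```python
import re

# Hypothetical rules, invented for illustration. Each one is a plain
# pattern that a person wrote after looking at real spam.
SPAM_RULES = [
    re.compile(r"\bdonate now\b", re.IGNORECASE),
    re.compile(r"\bprotect our senate majority\b", re.IGNORECASE),
    re.compile(r"\breply stop to (opt out|unsubscribe)\b", re.IGNORECASE),
]


def is_spam(message):
    """Return True if any rule matches the message."""
    return any(rule.search(message) for rule in SPAM_RULES)


print(is_spam("URGENT: Donate now to protect our Senate majority!"))  # True
print(is_spam("Running late, see you at 7"))                           # False
```

Every rule is readable, and when one misfires you can see exactly which one and why.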

The biggest piece of hype is the fuss over Artificial General Intelligence, or AGI. People like Elon Musk think that if we provide these autocomplete systems with more and more text, and make them better and better at autocompleting responses, eventually they will develop actual intelligence and understanding.

I find that idea ludicrous. The idea that understanding something is just a matter of being able to come up with the most plausible response to it is one that can make sense only to people who don’t actually understand much of anything.

In 1980, philosopher John Searle came up with the Chinese room argument. We’re asked to imagine a person in a room who is given sets of instructions to follow in response to Chinese symbols passed into the room. The instructions say that when particular combinations of Chinese symbols are passed in, the person should pass out other specific sets of Chinese symbols. The person in the room speaks no Chinese, but what if the “answers” generated by following the rules seem to be intelligent responses to questions in Chinese being passed into the room?

Searle argues that the person doesn’t understand the questions, the rules are just simple written rules and don’t understand anything, and the room itself clearly doesn’t understand the questions; therefore the system as a whole doesn’t understand the questions, even if it can answer them.

Unfortunately this is a fallacy of composition — saying that because no single piece of something has a particular property, therefore the entire thing cannot have that property. No single person can win a football match, therefore a team of 11 people can’t win a football match. Sodium isn’t edible, chlorine isn’t edible, therefore sodium chloride isn’t edible. Fallacious arguments.

Searle’s conclusion, though, could well be true — that a system that just follows linguistic rules is incapable of actually understanding things. Large language models don’t even do that — rather than following actual rules, they just roll the dice and generate randomized but statistically plausible responses. If they’re told they’re wrong, they roll the dice and try a different random response.
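The dice rolling can be sketched in a few lines. In a real model the probabilities come out of a huge neural network and cover fragments of words rather than whole words; here they are numbers I made up, and `sample_next_word` is a hypothetical helper. The selection step is the point: a weighted random pick.

```python
import random


def sample_next_word(probabilities, rng):
    """Pick one word at random, weighted by its probability."""
    words = list(probabilities)
    weights = list(probabilities.values())
    return rng.choices(words, weights=weights, k=1)[0]


# Invented probabilities for what might follow "birthday".
next_word_probs = {"party": 0.6, "card": 0.25, "cake": 0.15}

rng = random.Random()
# Ask the same "question" five times; the answers can differ between runs.
print([sample_next_word(next_word_probs, rng) for _ in range(5)])
```

Run it twice and you will probably get two different lists. No answer is ever checked against anything; each is just another draw from the same weights.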

The results can be a lot of fun. There are even situations where they can be useful. ChatGPT is a good way to come up with names for Dungeons and Dragons characters, for example, or placeholder text for a web site design. But that’s a really long way from being a technology that will change the world.

So what’s behind all the hype? Money.

AI models are expensive to train — it takes a lot of computing power and data. It’s also pretty complicated to set up. Amazon, Google and Oracle would love you to pay them to set everything up and provide the computing power on their computers.

GPU manufacturers like nVidia have a lot of surplus capacity now that the cryptocurrency boom is fizzling out, and they need to keep increasing sales to keep shareholders happy. Neural network training needs GPUs, and nVidia needs to sell GPUs. That’s why their CEO would like you to stop learning to write software, and use nVidia GPUs and LLMs instead.

For companies like Microsoft, the expense of training and running these “AI” systems is an opportunity to keep competitors out of the market. If they can persuade everyone that your code editor has to have AI features, they limit the ability of free software and small companies to compete with Microsoft Visual Studio Code and Microsoft GitHub.