Through work, I have a paid license for windsurf
(recently renamed to "devin"), an application for LLM-based (aka,
"Agentic") development.
I hadn't been using it that much, but in an effort to more clearly
understand how this whole AI development thing works, I decided to give
it a closer look recently.
My conclusions:
In its current form, this whole LLM wave is problematic for multiple
reasons. But ignoring that, and looking at the technology only, I can
say that:
- it is a paradigm shift;
- it is, at the technological level, a positive evolution;
- and it is a threat to free software.
Problems
Lest someone (incorrectly) assume that I am arguing in favour of the
current state of affairs with regards to LLMs, let me state this first.
The way LLMs are built today is highly parasitic. Websites are
downloaded in whole, at unsustainable rates, regardless of the consent
of the people who made the original content. The result is predictable:
servers get overloaded, server administrators attempt to implement
various mitigations. Some of these mitigations work; some do, for a
while; some are entirely useless. In actual fact, the mitigations are an
arms race -- if too many people implement the same mitigation, then the
people who try to build yet another LLM so they can extract rent will
just try to work around the mitigation; eventually they will succeed,
and you'll just have to come up with another mitigation. It's a bit like
spam; you introduce regex-based spam filters, they introduce spelling
mistakes, you introduce bayesian filters, they add a large batch of
markov chain-generated semi-nonsense words made invisible by markup, you
add filters to block emails with such markup, they move the text into an
image. We have working mitigations today, but eventually we'll run out
of ideas.
LLMs gobble up everything they can while ignoring the license of the
source material. The people who push those LLMs claim that pushing the
source material through the machine learning algorithms makes the output
of the algorithm distinct enough from the source material that the
license no longer applies; I'm not so sure that this is true. I guess
the New York Times v OpenAI lawsuit will teach us some of the answer to
that question, but even so
the ethical questions about "is it OK to bring down another server just
so we can download the internet for another for-pay LLM" are still
open. And regardless of what the law states, my opinion on "you're using
my copyleft code to generate code under a different license" is not
something you might like if you agree with the rent seekers' opinion on
the subject.
That all being said and true, the technology works. You can have a
"conversation" with an LLM that resembles a human one. If you pass it
some data, you can use plain English to ask it questions about that
data, which is a lot easier than querying the data in a formal way.
You can request it to generate some code, and it will generate something
that looks like what you need and that will be mostly correct about
95% of the time.
Now, yes, 95% of the time is not 100% of the time, and no, you can't ask
it to "write me a piece of software that implements this 300-page
requirements document and get back to me when you're done", because it
will fail, and you won't know where it has failed, and you'll take it
into production and expect everything to be fine, but it won't be, and
this one minor logic bug will cause half your servers to spin and
consume credits with your infrastructure provider with nothing to show
for it.
But that doesn't mean you can't use an LLM to build a large piece of
software. It just means you have to understand the LLM's limitations and
strengths, and use them correctly.
Here's what an LLM is good at:
- Generating plausible text
- Interpreting text to figure out what a plausible meaning or summary of
that text is
- Giving vague indications as to what the probable context of a given
body of text is.
It turns out that that's enough to use the LLM to build a reliable piece
of software, provided you do it right.
Paradigm shift
An LLM can generate text by the truckful. The generated text could be
code. Given a good enough LLM, the generated text might even run and do
something useful.
You can try to blindly run the code, and if it doesn't run correctly,
you can paste the error message to the LLM, and it can tell you what
went wrong and how you could possibly fix it. This creates a feedback
loop: you ask it for an amount of code, you run the code, you receive an
error, you tell it that the code is problematic and give it the error
message, it makes changes to the code, now you have something that at
least no longer fails at startup.
If you ask it to add tests to make sure that your code acts as per your
specification, now you get an error if and when the code doesn't act
as per your specification. Or, well, at least not as per the part of the
specification that was correctly turned into a unit test by the LLM.
LLMs have a context window, so if the error message is pasted in the
same conversation as where the code was generated, it is able to reuse
the earlier prompts to refine how it should interpret the error message
that you received.
You can't really paste the source code of an entire application into the
prompt of your LLM; that would quickly overrun its context window. But
LLMs also allow you to provide some form of background information --
a document, say -- on which you ask it to reason. It will interpret that
document, but doing so uses less of the LLM's context window. So
providing the LLM with your application's source code as background
information can help it understand better how your code interacts. This
is especially helpful if you only provide the LLM the background
information relevant to the actual question.
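As a rough illustration of "only provide the relevant background", here is a minimal sketch that picks the files sharing the most words with a question, within a size budget. The function names, the word-overlap scoring, and the character budget are all assumptions made for this example; real tools use more sophisticated selection, and this is not how windsurf does it.

```python
import re


def words(text: str) -> set[str]:
    """Lower-cased identifier-like words of three or more characters."""
    return set(re.findall(r"[a-z_][a-z0-9_]{2,}", text.lower()))


def select_context(question: str, files: dict[str, str], budget_chars: int) -> list[str]:
    """Pick the files that share the most words with the question.

    `files` maps a file name to its contents. Files with no overlap are
    skipped, and so is any file that would push the total past the budget.
    """
    wanted = words(question)
    scored = sorted(
        ((len(wanted & words(body)), name) for name, body in files.items()),
        key=lambda pair: (-pair[0], pair[1]),  # best score first, then by name
    )
    chosen, used = [], 0
    for score, name in scored:
        size = len(files[name])
        if score == 0 or used + size > budget_chars:
            continue
        chosen.append(name)
        used += size
    return chosen
```

The budget stands in for the context window: whatever is selected still has to fit alongside the prompt itself.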
So now if you are able to:
- Create background context with your application's source code
- Have the LLM generate a first draft of your requested change, plus the
tests to make sure it works
- Compile (if applicable) the generated code (and tests) and run said
tests
- Return any error messages to the LLM with a request to correct the
error
Then the combination of "getting it 95% right off the bat" and the above
feedback loop means you can generate syntactically correct code, that
probably does what you need, in minutes.
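The loop above can be sketched in a few lines. This is a minimal sketch under stated assumptions: `generate` is a hypothetical callable standing in for whatever LLM you use (it receives the request and the previous error output, and returns Python source that includes its own tests), and "running the tests" is simplified to executing that source as a script.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple

# Hypothetical interface: (request, previous error output) -> Python source.
Generator = Callable[[str, str], str]


def run_candidate(source: str) -> Tuple[bool, str]:
    """Write the candidate to a temporary file, run it, return (ok, output)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(source)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True, text=True, timeout=60,
            )
        except subprocess.TimeoutExpired:
            return False, "candidate timed out"
    return proc.returncode == 0, proc.stdout + proc.stderr


def feedback_loop(request: str, generate: Generator,
                  max_rounds: int = 5) -> Optional[str]:
    """Ask for code, run it, feed errors back; stop when it passes or we give up."""
    errors = ""
    for _ in range(max_rounds):
        source = generate(request, errors)
        ok, output = run_candidate(source)
        if ok:
            return source  # passing the tests is not the same as being reviewed
        errors = output    # the next round sees what went wrong
    return None
```

The `max_rounds` cap is a deliberate choice: without it, a model that keeps producing failing code would keep consuming credits.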
I say "probably" for a reason. There are going to be cases where you
specify a request without a number of details (because they are
implied), and the LLM will get most of those details right but just not
implement the one bit because it's an automaton and it doesn't think. Or
you will ask it to make sure that two bits of the application look
exactly the same, without specifying that they must act the same, now
and in the future, and it will just generate the same block of code
twice and then in a future change it will change one but not the other.
But if you review the changes, and you have experience as a
programmer, you will be able to spot most cases where the LLM got it
wrong. And so it's possible, if not necessarily easy at first, to
use an LLM to generate mostly correct code.
There are certain places where "mostly correct" code is not desirable.
But equally, there are also cases where "mostly correct" is good
enough.
After all, most of the software you run today -- the bits of it that
weren't, yet, generated by an LLM -- is only "mostly correct", too,
because to err is human and we all make mistakes. If not, there wouldn't
be any CVEs and your software would never do anything wrong.
Now, doing the feedback loop described above is certainly something you
could do manually. You could open an account on one of the LLM websites,
upload the source code of your application, ask it to generate some new
feature, download the newly generated feature, run it, and then
copy/paste any error messages back into the LLM.
But that's a lot of manual work of the type that computers are pretty
good at. So that's what the "windsurf" tool helps you with: you run it
inside your IDE -- either a VSCode-based tool that you download from
their website which comes with their product preinstalled, or a separate
JetBrains plugin that you can install. You can then open your entire
relevant codebase in a workspace in your IDE. You then ask the LLM,
through the IDE, to generate a new feature in your codebase, and to also
generate the test while it's at it. It will use a mixture of LLM
interpretation and non-LLM functionality to scoop out the relevant bits
of your codebase to send to the LLM as background information, will send
it your prompt, will download the generated code and patch or create
files, will compile (if required) and run the newly generated code and
tests, and will refine the generated code if the tests produce any
errors. All mostly automatic; by default, running anything requires
explicit confirmation. You can turn that off completely (probably not a
good idea), or you can give it a whitelist of things that you don't want
to confirm (perhaps OK), and the tool also passes standing instructions
to the LLM to never generate any command that deletes a file (which,
like with any LLM, can be overridden, but it requires you to be very
stubborn and to use more credits than you'd probably like).
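To make the confirmation idea concrete, here is a small sketch of a gate that decides whether a proposed command runs directly, needs confirmation, or is refused. The allowlist contents, the set of "destructive" program names, and the function name are assumptions for illustration only. Note that windsurf, as described above, handles deletion through standing instructions to the LLM; this sketch is a different, simpler technique and not its implementation.

```python
import shlex

ALLOWED = {"pytest", "make"}                      # hypothetical allowlist
DESTRUCTIVE = {"rm", "rmdir", "shred", "unlink"}  # hypothetical deny list
SHELL_META = set(";|&$`<>()")                     # chaining, pipes, substitution


def decide(command: str, allowlist=ALLOWED) -> str:
    """Return 'run', 'confirm', or 'refuse' for a proposed shell command."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return "confirm"  # unparseable quoting: let a human look at it
    if not argv:
        return "confirm"
    # Compare on the program name only, so /bin/rm is treated like rm.
    names = [token.rsplit("/", 1)[-1] for token in argv]
    if any(name in DESTRUCTIVE for name in names):
        return "refuse"
    if SHELL_META & set(command):
        return "confirm"  # compound commands are never auto-approved
    if names[0] in allowlist:
        return "run"
    return "confirm"
```

This is a convenience filter, not a security boundary: an allowed program can still delete files (`make clean`, say), which is one reason turning confirmation off completely is probably not a good idea.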
All this put together means you can build something without writing any
piece of code, provided you do it right.
A technically positive evolution
Don't go and say, "here's a 300-page document, read it and write
whatever the document says". It will get it wrong, it will write a
massive test suite that it will only run at the end, it will choke
itself up trying to interpret the massive amount of failures it
encounters, it will fill up its context window and it will start to
forget some of the requirements. That won't work.
But what you can do -- what I did, in fact -- is this.
First, create an empty workspace. Don't put any code in it.
Then, tell the LLM to generate a backend framework using technology X
and a frontend framework using technology Y that initially only says
"hello, world". Also add tests to it, and run the tests.
It will do that. You'll not get much, but it will work.
Then, ask it to add some UI elements. A login page, perhaps. A
navigation bar. Small things. Most of it doesn't have to be functional
-- but tests must be there for the bits that are, and have it run the
tests and evaluate the results.
Rinse, repeat, until you have a working application.
Importantly, in between the steps, you should also run the application
yourself and see if the change was implemented correctly. Sometimes it
won't be. Sometimes there will be a subtle bug -- I at one point had
the application hang after a few minutes. Sometimes you tell it that
there's a subtle bug, and it will discover it more quickly than you
could, and it will fix it, and in implementing the fix it will uncover
another bug, and then you have to fix that one -- the fix it came up
with for the hang was to move something to an async process on the
server, which caused the application to start spinning while trying to
create hundreds of async jobs (this is when I realized that the hang was
a deadlock due to some part of the codebase doing something that
indirectly triggered itself). Sometimes it will try to fix the bug you
tell it about, and you'll see that it's going off on a tangent that has
nothing to do with what you're seeing. It's important to keep an eye on
what it's doing, so you can guide it back on track when that happens --
when I told it about the hang, it started investigating the part of the
code which sends out emails, thinking that it could hang while waiting
for sendmail to finish, but the hang was happening when the
application was idle, not when it was sending out emails, and only
when I told it about it happening when it was idle did it find the
deadlock.
So it's not a fully automatic process, and it needs to be guided by
someone who knows what they're doing. But if that is the case, you can
come up with something that works. I spent evenings and breaks for about
a week, and I managed to create a working application which, had I
written it by hand, would have taken me a few months of full-time work
to come up with. And I now have a side project, fully complete and
working, that I had been thinking about doing for more than a decade
but never got around to, because of all the work involved and because
I just didn't see myself having the time for it.
It's not perfect code. But it's mostly good enough, and it will perform
the job it needs to. And it looks far slicker than most of the side
projects I've done in the past, because in the past I would prioritize
between implementing new features or making something look slick, and I
would decide that the new feature was more important because it's only
for me and there's only me and nobody cares if it looks good or not and
I don't have three weeks to come up with something that looks better.
But here, I found myself sometimes spending 10 minutes writing a prompt
with instructions on making things look better. Because what's 10
minutes when you just spent an hour writing down and refining
specifications for functionality and tests?
There are a number of other things in which an LLM can help a
programmer.
For instance:
- I received a bug report recently, in a project I'm paid to maintain,
that I couldn't make heads or tails of. I opened the source code in my
windsurf IDE, pasted the bug report in the prompt, and then requested
the tool to analyze the source code and the associated logs and tell me
how the described behavior could be happening. It turned out that I had
overlooked something, but with the help of the tool, I found the bug in
minutes.
- I was trying to understand a particular part of a large codebase that
I didn't really grasp very well. I loaded the codebase in the tool, and
asked it to explain to me how a particular action is performed by the
code. I requested specific functions and line numbers. I now have a far
better understanding of how the code works, and will be able to write
that patch that I've been wanting to write for years -- without using
the LLM.
- I have been struggling for, literally, years with understanding why
another tool that I maintain was misbehaving in a particular way, but
only in Firefox. I opened the codebase in windsurf, explained the buggy
behavior in plain English, and asked it to explain how this could be
happening. It picked up some obscure corner case behavior of ffmpeg and
mp4 containers that I was not aware of and that perfectly explained why
things were misbehaving in the way that they were.
At the same time, there are limitations. Giving an LLM a codebase that
was originally generated by an LLM (either the same one or another one)
seems to work well. Giving it a codebase that was written by a human and
expecting it to correctly update it seems to be more error-prone. I did
one or two of those as a trial, and it is more problematic than
anything.
An LLM is also not intelligent, notwithstanding the popular term of
"Artificial Intelligence". On multiple occasions, I've asked it to write
a test case for some code that was not set up to do so; and rather than
suggesting a refactor is required, it would instead copy the code that
needed to be tested and then test the copy, rather than the original.
The tool has made multiple similar errors. I have sometimes heard people
describe agentic coding as "similar to interacting with junior
programmers", but that is not the case. A junior programmer will either
fill in the gaps in your specifications, or ask for clarification when
something seems off. The LLM will not do that; it will do what you ask,
exactly that and nothing more. If you missed a corner case in your
specification, then all bets are off.
I remember learning about programming language generations in college.
A first-generation language is "machine code", a second-generation
language is "assembler", a third-generation language is any high-level
language such as C, Perl, or Pascal. I've forgotten what set a
3rd-generation language apart from a 4th-generation language. But I
remember the definition they gave me for a 5th-generation language: "you
tell the computer what to do, and it will do it". At the time, I thought
it was ridiculous. Nobody could ever write something like that.
But it's here.
And it's a threat to free software.
A threat to free software?
Yes.
There is the obvious part where most of the well-known LLMs are non-free
software. I mean, there are some "open source" LLM models. The windsurf
tool that I used doesn't allow you to use them (directly), but they're
there. There are also open source applications that implement what the
windsurf editor does. So it's definitely possible to work like this
without resorting to non-free software and non-free services, even
though the non-free LLMs might be a bit ahead of the curve of the free
ones. But that's not what I mean.
And there is also the obvious thing which I mentioned earlier in this
post, which is that the people who try to build LLMs are doing it in
unethical, disgusting ways, causing downtimes and disregarding licenses
for whatever they can get their grubby hands on. Ideally we wouldn't be
in that situation, and ideally this wouldn't be a problem, but we are
where we are.
And there's the obvious thing where the OSI sold itself out and declared
that a machine learning program can be open source even when the very
thing it was built from -- the training data -- is not available.
That's a major issue that the free software community needs to fight
against, but it is not really a threat to free software in itself. You
just build your own, free software, LLM, and you're done.
The actual threat is in funding and developer support.
Most large businesses do not care about free-as-in-freedom software.
They like the free-as-in-beer part, and they appreciate that the
free-as-in-freedom bits can make the software more customizable. They
are (mostly) happy to do sponsorships of the free-as-in-freedom projects
that they use if that means their free-as-in-beer usage of the software
gets improved.
But why would you care about all that when you can just generate the
code you need, rather than interacting with an open source community
that may or may not care about your business's interests?
Where to go from here
Although I think the moral and environmental issues with LLMs are real
and problematic, given the experiments I did I am not convinced that the
concept of interacting with a computer system in natural language and
using it to generate code is necessarily deficient. There are pitfalls,
but they can be managed. It is possible to use such a system to create
throwaway, proof-of-concept type "good enough" code bases. It can be
used to interpret code bases and to understand bug reports.
I believe that the major issue with LLMs has to do with that saying
about hammers and nails:
If all you have is a hammer, then everything looks like a nail.
LLMs are an outgrowth of machine learning, pushed by large corporations.
These large corporations have a lot of money. If all you have is money,
then every problem can be fixed by throwing more money at it. The
initial language models were promising but not (yet) good enough, and it
seemed that one way in which they could be improved was to increase the
scale of the statistics: throw more hardware (and thus money) at it, and
rather than improving the efficiency of the models, just scale up.
Scaling up is something that megacorporations are very good at. It's
only a money problem, after all. Does that mean that "scaling up"
is the only way to improve the models, though? I'm not convinced.
Some hardware, such as most modern Apple and Samsung devices, ships with
accelerator hardware for machine learning algorithms. There are some
models that are small enough to be able to run on these devices. I don't
see why it should not be possible to create a small(er) language model
that can do some useful part of the above-described use cases; if not
locally, then at least on a server that one can run on-prem rather than
requiring that you pay rent to one of the LLM companies.
The Software Freedom Conservancy has published an aspirational statement
on machine learning-assisted programming that, I think, gets a lot
right. It's not quite a definition, but it's something to keep in mind.
Perhaps that's the way forward?
More questions than answers at this point, anyway.