▲Is There a Half-Life for the Success Rates of AI Agents?tobyord.com

136 points by EvgeniyZh 8 hours ago | 80 comments

mikeocool 5 hours ago [-]

This very much aligns with my experience — I had a case yesterday where opus was trying to do something with a library, and it encountered a build error. Rather than fix the error, it decided to switch to another library. It then encountered another error and decided to switch back to the first library.

I don’t think I’ve encountered a case where I’ve just let the LLM churn for more than a few minutes and gotten a good result. If it doesn’t solve an issue on the first or second pass, it seems to rapidly start making things up, make totally unrelated changes claiming they’ll fix the issue, or trying the same thing over and over.

Workaccount2 5 hours ago [-]

They poison their own context. Maybe you can call it context rot, where as context grows and especially if it grows with lots of distractions and dead ends, the output quality falls off rapidly. Even with good context the rot will start to become apparent around 100k tokens (with Gemini 2.5).

They really need to figure out a way to delete or "forget" prior context, so the user or even the model can go back and prune poisonous tokens.

Right now I work around it by regularly making summaries of instances, and then spinning up a new instance with fresh context and feed in the summary of the previous instance.

flir 14 minutes ago [-]

Context pruning. The UX would be simple (mark a response as "bad" and it gets greyed out on your side, and it doesn't get used as part of the context on the chatbot's side). I'm the same as you - spin up a new conversation the moment this one feels like its gone rotten.

codeflo 4 hours ago [-]

I wonder to what extent this might be a case where the base model (the pure token prediction model without RLHF) is "taking over". This is a bit tongue-in-cheek, but if you see a chat protocol where an assistant makes 15 random wrong suggestions, the most likely continuation has to be yet another wrong suggestion.

People have also been reporting that ChatGPT's new "memory" feature is poisoning their context. But context is also useful. I think AI companies will have to put a lot of engineering effort into keeping those LLMs on the happy path even with larger and larger contexts.

darepublic 36 minutes ago [-]

There are just certain problems that they cannot solve. Usually when there is no clear example in its pretraining or discoverable on the net. I would say the reasoning capabilities of these models are pretty shallow, at least it seems that way to me

dingnuts 32 minutes ago [-]

They can't reason at all. The language specification for Tcl 9 is in the training data of the SOTA models but there exist almost no examples, only documentation. Go ahead, try to get a model to write Tcl 9 instead of 8.5 code and see for yourself. They can't do it, at all. They write 8.5 exclusively, because they only copy. They don't reason. "reasoning" in LLMs is pure marketing.

steveklabnik 3 hours ago [-]

> They really need to figure out a way to delete or "forget" prior context, so the user or even the model can go back and prune poisonous tokens.

In Claude Code you can use /clear to clear context, or /compact <optional message> to compact it down, with the message guiding what stays and what goes. It's helpful.

libraryofbabel 3 hours ago [-]

Also in Claude Code you can just press <esc> a bunch of times and you can backtrack to an earlier point in the history before the context was poisoned, and re-start from there.

Claude has some amazing features like this that aren’t very well documented. Yesterday I just learned it writes sessions to disk and you can resume them where you left off with -continue or - resume if you accidentally close or something.

kossae 4 hours ago [-]

This is my experience as well, and for now comes down to a workflow optimization. As I feel the LLM getting off track, I start a brand new session with useful previous context pasted in from my previous session. This seems to help steer it back to a decent solution, but agreed it would be nice if this was more automated based off of user/automated feedback (broken unit test, "this doesn't work", etc.)

kazinator 2 hours ago [-]

"Human Attention to the Right Subset of the Prior Context is All You Need"

eplatzek 11 minutes ago [-]

Honestly that feels a like a human.

After hitting my head against a wall with a problem I need to stop.

I need to stop and clear my context. Go a walk. Talk with friends. Switch to another task.

OtherShrezzing 3 hours ago [-]

>They really need to figure out a way to delete or "forget" prior context, so the user or even the model can go back and prune poisonous tokens.

This is possible in tools like LM Studio when running LLMs locally. It's a choice by the implementer to grant this ability to end users. You pass the entire context to the model in each turn of the conversation, so there's no technical reason stopping this feature existing, besides maybe some cost benefits to the inference vendor from cache.

autobodie 41 minutes ago [-]

If so, that certainly fits with my experiences.

HeWhoLurksLate 4 hours ago [-]

I've found issues like this happen extremely quickly with ChatGPT's image generation features - if I tell it to put a particular logo in, the first iteration looks okay, while anything after that starts to look more and more cursed / mutant.

nojs 3 hours ago [-]

https://www.astralcodexten.com/p/the-claude-bliss-attractor

rvnx 4 hours ago [-]

I've noticed something, even if you ask to edit a specific picture, it will still used the other pictures in the context (and this is somewhat unwanted)

vunderba 3 hours ago [-]

gpt-image-1 is unfortunately particular vulnerable to this problem. The more you want to change the initial image - the better off you'd honestly be just starting an entirely new conversation.

vadansky 4 hours ago [-]

I had a particularly hard parsing problem so I setup a bunch of tests and let the LLM churn for a while and did something else.

When I came back all the tests were passing!

But as I ran it live a lot of cases were still failing.

Turns out the LLM hardcoded the test values as “if (‘test value’) return ‘correct value’;”!

ffsm8 4 hours ago [-]

Missed opportunity for the LLM, could've just switched to Volkswagen CI

https://github.com/auchenberg/volkswagen

EGreg 3 hours ago [-]

This is gold lol

bluefirebrand 4 hours ago [-]

This is the most accurate Junior Engineer behavior I've heard LLMs doing yet

mikeocool 2 hours ago [-]

Yeah — I had something like this happen as well — the llm wrote a half decent implementation and some good tests, but then ran into issues getting the tests to pass.

It then deleted the entire implementation and made the function raise a “not implemented” exception, updated the tests to expect that, and told me this was a solid base for the next developer to start working on.

vunderba 3 hours ago [-]

I've definitely seen this happen before too. Test-driven development isn't all that effective if the LLM's only stated goal is to pass the tests without thinking about the problem in a more holistic/contextual manner.

matsemann 47 minutes ago [-]

Reminds me of trying to train a small neural net to play Robocode ~10+ years ago. Tried to "punish" it for hitting walls, so next morning I had evolved a tanks that just stood still... Then punished it for standing still, ended up with a tanks just vibrating, alternating moving back and forth quickly, etc.

vunderba 16 minutes ago [-]

That's great. There's a pretty funny example of somebody training a neural net to play Tetris on the Nintendo entertainment system, and it quickly learned that if it was about to lose to just hit pause and leave the game in that state indefinitely.

Veen 1 hours ago [-]

I was working on a script and Claude decided the best way to accomplish its task was to hardcode a bunch of dummy API responses as a "fallback" and write a placeholder function that always fell back to the hardcoded values. At least it had the good manners to apologize when I pointed it out.

peacebeard 4 hours ago [-]

Very common to see in comments some people saying “it can’t do that” and others saying “here is how I make it work.” Maybe there is a knack to it, sure, but I’m inclined to say the difference between the problems people are trying to use it on may explain a lot of the difference as well. People are not usually being too specific about what they were trying to do. The same goes for a lot of programming discussion of course.

alganet 4 hours ago [-]

> People are not usually being too specific about what they were trying to do. The same goes for a lot of programming discussion of course.

In programming, I already have a very good tool to follow specific steps: _the programming language_. It is designed to run algorithms. If I need to be specific, that's the tool to use. It does exactly what I ask it to do. When it fails, it's my fault.

Some humans require algorithmic-like instructions too. Like cooking a recipe. However, those instructions can be very vague and a lot of humans can still follow it.

LLMs stand on this weird place where we don't have a clue in which occasions we can be vague or not. Sometimes you can be vague, sometimes you can't. Sometimes high level steps are enough, sometimes you need fine-grained instructions. It's basically trial and error.

Can you really blame someone for not being specific enough in a system that only provides you with a text box that offers anthropomorphic conversation? I'd say no, you can't.

If you want to talk about how specific you need to prompt an LLM, there must be a well-defined treshold. The other option is "whatever you can expect from a human".

Most discussions seem to juggle between those two. LLMs are praised when they accept vague instructions, but the user is blamed when they fail. Very convenient.

peacebeard 53 minutes ago [-]

I am not saying that people were not specific in their instructions to the LLM, but rather that in the discussion they are not sharing specific details of their success stories or failures. We are left seeing lots of people saying "it worked for me" and "it didn't work for me" without enough information to assess what was different in those cases. What I'm contending is that the essential differences in the challenges they are facing may be a primary factor, while these discussions tend to focus on the capabilities of the LLM and the user.

heyitsguay 4 hours ago [-]

I've noticed this a lot, too, in HN LLM discourse.

(Context: Working in applied AI R&D for 10 years, daily user of Claude for boilerplate coding stuff and as an HTML coding assistant)

Lots of "with some tweaks i got it to work" or "we're using an agent at my company", rarely details about what's working or why, or what these production-grade agents are doing.

Wowfunhappy 1 hours ago [-]

> I don’t think I’ve encountered a case where I’ve just let the LLM churn for more than a few minutes and gotten a good result.

I absolutely have, for what it's worth. Particularly when the LLM has some sort of test to validate against, such as a test suite or simply fixing compilation errors until a project builds successfully. It will just keep chugging away until it gets it, often with good overall results in the end.

I'll add that until the AI succeeds, its errors can be excessively dumb, to the point where it can be frustrating to watch.

civilian 1 hours ago [-]

Yeah, and I have a similar experience watching junior devs try to get things working-- their errors can be excessively dumb :D

mtalantikite 3 hours ago [-]

I had Claude Code deep inside a change it was trying to make, struggling with a test that kept failing, and then decided to delete the test to make the test suite pass. We've all been there!

I generally treat all my sessions with it as a pairing session, and like in any pairing session, sometimes we have to stop going down whatever failing path we're on, step all the way back to the beginning, and start again.

nojs 3 hours ago [-]

> decided to delete the test to make the test suite pass

At least that’s easy to catch. It’s often more insidious like “if len(custom_objects) > 10:” or “if object_name == ‘abc’” buried deep in the function, for the sole purpose of making one stubborn test pass.

akomtu 2 hours ago [-]

Claude Doctor will hopefully do better.

nico 4 hours ago [-]

> Rather than fix the error, it decided to switch to another library

I’ve had a similar experience, where instead of trying to fix the error, it added a try/catch around it with a log message, just so execution could continue

accrual 4 hours ago [-]

I've had some similar experiences. While I find agents very useful and able to complete many tasks on its own, it does hit roadblocks sometimes and its chosen solution can be unusual/silly.

For example, the other day I was converting models but was running out of disk space. The agent decided to change the quantization to save space when I'd prefer it ask "hey, I need some more disk space". I just paused it, cleared some space, then asked the agent to try the original command again.

skerit 5 hours ago [-]

> I don’t think I’ve encountered a case where I’ve just let the LLM churn for more than a few minutes and gotten a good result.

Is this with something like Aider or CLine?

I've been using Claude-Code (with a Max plan, so I don't have to worry about it wasting tokens), and I've had it successfully handle tasks that take over an hour. But getting there isn't super easy, that's true. The instructions/CLAUDE.md file need to be perfect.

nico 4 hours ago [-]

> I've had it successfully handle tasks that take over an hour

What kind of tasks take over an hour?

aprilthird2021 2 hours ago [-]

You have to give us more about your example of a task that takes over an hour with very detailed instruction. That's very intriguing

qazxcvbnmlp 5 hours ago [-]

when this happens I do thew following

1) switch to a more expensive llm and ask it to debug: add debugging statements, reason about what's going on, try small tasks, etc 2) find issue 3) ask it to summarize what was wrong and what to do differently next time 4) copy and paste that recommendation to a small text document 5) revert to the original state and ask the llm to make the change with the recommendation as context

nico 4 hours ago [-]

> 1) switch to a more expensive llm and ask it to debug

You might not even need to switch

A lot of times, just asking the model to debug an issue, instead of fixing it, helps to get the model unstuck (and also helps providing better context)

rurp 5 hours ago [-]

This honestly sounds slower than just doing it myself, and with more potential for bugs or non-standard code.

I've had the same experience as parent where LLMs are great for simple tasks but still fall down surprisingly quickly on anything complex and sometimes make simple problems complex. Just a few days ago I asked Claude how to do something with a library and rather than give me the simple answer it suggested I rewrite a large chunk of that library instead, in a way that I highly doubt was bug-free. Fortunately I figured there would be a much simpler answer but mistakes like that could easily slip through.

ziml77 3 hours ago [-]

Yeah if it gets stuck and can't easily get itself unstuck, that's when I step in to do the work for it. Otherwise it will continue to make more and more of a mess as it iterates on its own code.

reactordev 2 hours ago [-]

This happens in real life too when a dev builds something using too much copy pasta and encounters a build error. Stackoverflow was founded on this.

onlyrealcuzzo 5 hours ago [-]

> If it doesn’t solve an issue on the first or second pass, it seems to rapidly start making things up, make totally unrelated changes claiming they’ll fix the issue, or trying the same thing over and over.

Sounds like a lot of employees I know.

Changing out the entire library is quite amusing, though.

Just imagine: I couldn't fix this build error, so I migrated our entire database from Postgres to MongoDB...

mathattack 4 hours ago [-]

It may be doing the wrong thing like an employee, but at least it's doing it automatically and faster. :)

didgeoridoo 4 hours ago [-]

Probably had “MongoDB is web scale” in the training set.

butterknife 5 hours ago [-]

Thanks for the laughs.

the__alchemist 5 hours ago [-]

This is consistent with my experience as well.

fcatalan 5 hours ago [-]

I brought over the source of the Dear imgui library to a toy project and Cline/Gemini2.5 hallucinated the interface and when the compilation failed started editing the library to conform with it. I was all like: Nono no no no stop.

BoiledCabbage 29 minutes ago [-]

Oh man that's good - next step create a PR to push it up stream! Everyone can benefit from its fixes.

dylan604 3 hours ago [-]

> it encountered a build error.

does this mean that even AI gets stuck in dependency hell?

hhh 2 hours ago [-]

of course, i’ve even had them actively say they’re giving up after 30 turns

enraged_camel 5 hours ago [-]

I've actually thought about this extensively, and experimented with various approaches. What I found is that the quality of results I get, and whether the AI gets stuck in the type of loop you describe, depends on two things: how detailed and thorough I am with what I tell it to do, and how robust the guard rails I put around it are.

To get the best results, I make sure to give detailed specs of both the current situation (background context, what I've tried so far, etc.) and also what criteria the solution needs to satisfy. So long as I do that, there's a high chance that the answer is at least satisfying if not a perfect solution. If I don't, the AI takes a lot of liberties (such as switching to completely different approaches, or rewriting entire modules, etc.) to try to reach what it thinks is the solution.

prmph 5 hours ago [-]

But don't they keep forgetting the instructions after enough time have passed? How do you get around that? Do you add an instruction that after every action it should go back and read the instructions gain?

enraged_camel 3 hours ago [-]

They do start "drifting" after a while, at which point I export the chat (using Cursor), then start a new chat and add the exported file and say "here's the previous conversation, let's continue where we left off". I find that it deals with the transition pretty well.

It's not often that I have to do this. As I mentioned in my post above, if I start the interaction with thorough instructions/specs, then the conversation concludes before the drift starts to happen.

EnPissant 2 hours ago [-]

Where you using Claude Code or something else? I've had very good luck with Claude Code not doing what you described.

PaulHoule 5 hours ago [-]

Isn't that "modern Javascript development?"

PaulHoule 5 hours ago [-]

This was always my mental model. If you have a process with N steps where your probability of getting a step right is p, your chance of success is pᶰ, or 0 as N → ∞.

It affects people too. Something I learned halfway through a theoretical physics PhD in the 1990s was that a 50-page paper with a complex calculation almost certainly had a serious mistake in it that you'd find if you went over it line-by-line.

I thought I could counter that by building a set of unit tests and integration tests around the calculation and on one level that worked, but in the end my calculation never got published outside my thesis because our formulation of the problem turned a topological circle into a helix and we had no idea how to compute the associated topological factor.

kridsdale1 2 hours ago [-]

Human health follows this principle too. N is the LifeSpan. The steps taken are cell division. Eventually enough problems accumulate that it fails systemically.

Sexual reproduction is context-clearing and starting over from ROM.

ge96 2 hours ago [-]

damn you telemorase

Davidzheng 3 hours ago [-]

It kind of reminds me of this vanishing gradient problem in ML early on, where really deep layers won't train b/c you get these gradients dying midway, and the solution was to add these bypass connections (resnets style). I wonder if you can have similar solutions. Ofc I think what happens in general is like control theory, like you should be able to detect going off-course with some probability too and correct [longer horizon you have probability of leaving the safe-zone so you still get the exp decay but in larger field]. Not sure how to connect all these ideas though.

bwfan123 4 hours ago [-]

> It affects people too. Something I learned halfway through a theoretical physics PhD in the 1990s was that a 50-page paper with a complex calculation almost certainly had a serious mistake in it that you'd find if you went over it line-by-line.

Interesting, and I used to think that math and sciences were invented by humans to model the world in a manner to avoid errors due to chains of fuzzy thinking. Also, formal languages allowed large buildings to be constructued on strong foundations.

From your anecdote it appears that the calculations in the paper were numerical ? but I suppose a similar argument applies to symbolic calculations.

PaulHoule 4 hours ago [-]

These were symbolic calculations. Mine was a derivation of the Gutzwiller Trace Formula

https://inspirehep.net/files/20b84db59eace6a7f90fc38516f530e...

using integration over phase space instead of position or momentum space. Most people think you need an orthogonal basis set to do quantum mechanical calculation but it turns that "resolution of unity is all you need", that is, if you integrate |x><x| over all x you get 1. If you believe resolution of unity applies in quantum gravity, then Hawking was wrong about black hole information. In my case we were hoping we could apply the trace formula and make similar derivations to systems with unusual coordinates, such as spin systems.

There are quite a few calculations in physics that involve perturbation theory, for instance, people used to try to calculate the motion of the moon by expanding out thousands of terms that look like (112345/552) sin(32 θ-75 ϕ) and still not getting terribly good results. It turns out classic perturbation theory is pathological around popular cases such as the harmonic oscillator (frequency doesn't vary with amplitude) and celestial mechanics (the frequency to go around the sun, to get closer or further from sun, or to go above or below the plane of the plane of the ecliptic are all the same.) In quantum mechanic these are not pathological, notably perturbation theory works great for an electron going around an atom which is basically the same problem as the Earth going around the Sun.

I have a lot of skepticism about things like

https://en.wikipedia.org/wiki/Anomalous_magnetic_dipole_mome...

in high energy physics because frequently they're comparing a difficult experiment to an expansion of thousands of Feynman diagrams and between computational errors and the fact that perturbation theory often doesn't converge very well I don't get excited when they don't agree.

----

Note that I used numerical calculations for "unit and integration testing", so if I derived an identity I could test that the identity was true for different inputs. As for formal systems, they only go so far. See

https://en.wikipedia.org/wiki/Principia_Mathematica#Consiste...

bwfan123 3 hours ago [-]

thanks !

prmph 5 hours ago [-]

The amusing things LLMs do when they have been at a problem for some time and cannot fix it:

- Removing problematic tests altogether

- Making up libs

- Providing a stub and asking you to fill in the code

Wowfunhappy 1 hours ago [-]

> Providing a stub and asking you to fill in the code

This is a perennial issue in chatbot-style apps, but I've never had it happen in Claude Code.

nico 4 hours ago [-]

- Adding try/catch block to “remove” the error and let execution continue

esafak 4 hours ago [-]

If humans can say "The proof is left as an exercise for the reader", why can't LLMs :)

kridsdale1 2 hours ago [-]

They’re just looking out for us to preserve our mental sharpness as we delegate too much to them.

einrealist 4 hours ago [-]

So as the space for possible decisions increases, it increases the likelihood of models to end up with bad "decisions". And what is the correlation between the increase in "survival rate" and the increase in model parameters, compute power and memory (context)?

kridsdale1 2 hours ago [-]

Nonscientific: if roughly feels like as models get bigger they absorb more “wisdom” and that lowers the error-generation probability.

__MatrixMan__ 4 hours ago [-]

I don't think this has anything to do with AI. There's a half life for success rates.

kridsdale1 2 hours ago [-]

It’s applicable here since people are experimenting with Agent Pipelines and so the existing literature on that kinda of systems engineering and robustness is useful to people who may not have needed to learn about it before.

deadbabe 5 hours ago [-]

This is another reason why there’s no point in carefully constructing prompts and contexts trying to coax the right solution out of an LLM. The end result becomes more brittle with time.

If you can’t zero shot your way to success the LLM simply doesn’t have enough training for your problem and you need a human touch or slightly different trigger words. There have been times where I’ve gotten a solution with such a minimal prompt it practically feels like the LLM read my mind, that’s the vibe.

furyofantares 2 hours ago [-]

This is half right, I think, and half very wrong. I always tell people if they're arguing with the LLM they're doing it wrong and for sure part of that is there's things they can't do and arguing won't change that. But the other part is it's hard to overstate their sensitivity to their context; when you're arguing about something it can do, you should start over with a better prompt (and, critically, no polluted context from its original attempt.)

byyoung3 3 hours ago [-]

i think that this is a bit of an exageration, but i see what you are saying. Anything more than 4-5 re-prompts is diminishing.

ldjkfkdsjnv 5 hours ago [-]

another article on xyz problem with LLMs, which will probably be solved by model advancements in 6/12 months.

bwfan123 4 hours ago [-]

you left out the other llm-apology: But humans can fail too !

Loading comments...

mikeocool 5 hours ago [-]

Workaccount2 5 hours ago [-]

They really need to figure out a way to delete or "forget" prior context, so the user or even the model can go back and prune poisonous tokens.

Right now I work around it by regularly making summaries of instances, and then spinning up a new instance with fresh context and feed in the summary of the previous instance.

flir 14 minutes ago [-]

codeflo 4 hours ago [-]

darepublic 36 minutes ago [-]

dingnuts 32 minutes ago [-]

steveklabnik 3 hours ago [-]

> They really need to figure out a way to delete or "forget" prior context, so the user or even the model can go back and prune poisonous tokens.

In Claude Code you can use /clear to clear context, or /compact <optional message> to compact it down, with the message guiding what stays and what goes. It's helpful.

libraryofbabel 3 hours ago [-]

Also in Claude Code you can just press <esc> a bunch of times and you can backtrack to an earlier point in the history before the context was poisoned, and re-start from there.

kossae 4 hours ago [-]

kazinator 2 hours ago [-]

"Human Attention to the Right Subset of the Prior Context is All You Need"

eplatzek 11 minutes ago [-]

Honestly that feels a like a human.

After hitting my head against a wall with a problem I need to stop.

I need to stop and clear my context. Go a walk. Talk with friends. Switch to another task.

OtherShrezzing 3 hours ago [-]

>They really need to figure out a way to delete or "forget" prior context, so the user or even the model can go back and prune poisonous tokens.

autobodie 41 minutes ago [-]

If so, that certainly fits with my experiences.

HeWhoLurksLate 4 hours ago [-]

nojs 3 hours ago [-]

https://www.astralcodexten.com/p/the-claude-bliss-attractor

rvnx 4 hours ago [-]

I've noticed something, even if you ask to edit a specific picture, it will still used the other pictures in the context (and this is somewhat unwanted)

vunderba 3 hours ago [-]

gpt-image-1 is unfortunately particular vulnerable to this problem. The more you want to change the initial image - the better off you'd honestly be just starting an entirely new conversation.

vadansky 4 hours ago [-]

I had a particularly hard parsing problem so I setup a bunch of tests and let the LLM churn for a while and did something else.

When I came back all the tests were passing!

But as I ran it live a lot of cases were still failing.

Turns out the LLM hardcoded the test values as “if (‘test value’) return ‘correct value’;”!

ffsm8 4 hours ago [-]

Missed opportunity for the LLM, could've just switched to Volkswagen CI

https://github.com/auchenberg/volkswagen

EGreg 3 hours ago [-]

This is gold lol

bluefirebrand 4 hours ago [-]

This is the most accurate Junior Engineer behavior I've heard LLMs doing yet

mikeocool 2 hours ago [-]

Yeah — I had something like this happen as well — the llm wrote a half decent implementation and some good tests, but then ran into issues getting the tests to pass.

vunderba 3 hours ago [-]

matsemann 47 minutes ago [-]

vunderba 16 minutes ago [-]

Veen 1 hours ago [-]

peacebeard 4 hours ago [-]

alganet 4 hours ago [-]

> People are not usually being too specific about what they were trying to do. The same goes for a lot of programming discussion of course.

Some humans require algorithmic-like instructions too. Like cooking a recipe. However, those instructions can be very vague and a lot of humans can still follow it.

Can you really blame someone for not being specific enough in a system that only provides you with a text box that offers anthropomorphic conversation? I'd say no, you can't.

If you want to talk about how specific you need to prompt an LLM, there must be a well-defined treshold. The other option is "whatever you can expect from a human".

Most discussions seem to juggle between those two. LLMs are praised when they accept vague instructions, but the user is blamed when they fail. Very convenient.

peacebeard 53 minutes ago [-]

heyitsguay 4 hours ago [-]

I've noticed this a lot, too, in HN LLM discourse.

(Context: Working in applied AI R&D for 10 years, daily user of Claude for boilerplate coding stuff and as an HTML coding assistant)

Lots of "with some tweaks i got it to work" or "we're using an agent at my company", rarely details about what's working or why, or what these production-grade agents are doing.

Wowfunhappy 1 hours ago [-]

> I don’t think I’ve encountered a case where I’ve just let the LLM churn for more than a few minutes and gotten a good result.

I'll add that until the AI succeeds, its errors can be excessively dumb, to the point where it can be frustrating to watch.

civilian 1 hours ago [-]

Yeah, and I have a similar experience watching junior devs try to get things working-- their errors can be excessively dumb :D

mtalantikite 3 hours ago [-]

I had Claude Code deep inside a change it was trying to make, struggling with a test that kept failing, and then decided to delete the test to make the test suite pass. We've all been there!

nojs 3 hours ago [-]

> decided to delete the test to make the test suite pass

akomtu 2 hours ago [-]

Claude Doctor will hopefully do better.

nico 4 hours ago [-]

> Rather than fix the error, it decided to switch to another library

I’ve had a similar experience, where instead of trying to fix the error, it added a try/catch around it with a log message, just so execution could continue

accrual 4 hours ago [-]

I've had some similar experiences. While I find agents very useful and able to complete many tasks on its own, it does hit roadblocks sometimes and its chosen solution can be unusual/silly.

skerit 5 hours ago [-]

> I don’t think I’ve encountered a case where I’ve just let the LLM churn for more than a few minutes and gotten a good result.

Is this with something like Aider or CLine?

nico 4 hours ago [-]

> I've had it successfully handle tasks that take over an hour

What kind of tasks take over an hour?

aprilthird2021 2 hours ago [-]

You have to give us more about your example of a task that takes over an hour with very detailed instruction. That's very intriguing

qazxcvbnmlp 5 hours ago [-]

when this happens I do thew following

nico 4 hours ago [-]

> 1) switch to a more expensive llm and ask it to debug

You might not even need to switch

A lot of times, just asking the model to debug an issue, instead of fixing it, helps to get the model unstuck (and also helps providing better context)

rurp 5 hours ago [-]

This honestly sounds slower than just doing it myself, and with more potential for bugs or non-standard code.

ziml77 3 hours ago [-]

Yeah if it gets stuck and can't easily get itself unstuck, that's when I step in to do the work for it. Otherwise it will continue to make more and more of a mess as it iterates on its own code.

reactordev 2 hours ago [-]

This happens in real life too when a dev builds something using too much copy pasta and encounters a build error. Stackoverflow was founded on this.

onlyrealcuzzo 5 hours ago [-]

Sounds like a lot of employees I know.

Changing out the entire library is quite amusing, though.

Just imagine: I couldn't fix this build error, so I migrated our entire database from Postgres to MongoDB...

mathattack 4 hours ago [-]

It may be doing the wrong thing like an employee, but at least it's doing it automatically and faster. :)

didgeoridoo 4 hours ago [-]

Probably had “MongoDB is web scale” in the training set.

butterknife 5 hours ago [-]

Thanks for the laughs.

the__alchemist 5 hours ago [-]

This is consistent with my experience as well.

fcatalan 5 hours ago [-]

BoiledCabbage 29 minutes ago [-]

Oh man that's good - next step create a PR to push it up stream! Everyone can benefit from its fixes.

dylan604 3 hours ago [-]

> it encountered a build error.

does this mean that even AI gets stuck in dependency hell?

hhh 2 hours ago [-]

of course, i’ve even had them actively say they’re giving up after 30 turns

enraged_camel 5 hours ago [-]

prmph 5 hours ago [-]

enraged_camel 3 hours ago [-]

It's not often that I have to do this. As I mentioned in my post above, if I start the interaction with thorough instructions/specs, then the conversation concludes before the drift starts to happen.

EnPissant 2 hours ago [-]

Where you using Claude Code or something else? I've had very good luck with Claude Code not doing what you described.

PaulHoule 5 hours ago [-]

Isn't that "modern Javascript development?"

PaulHoule 5 hours ago [-]

This was always my mental model. If you have a process with N steps where your probability of getting a step right is p, your chance of success is pᶰ, or 0 as N → ∞.

kridsdale1 2 hours ago [-]

Human health follows this principle too. N is the LifeSpan. The steps taken are cell division. Eventually enough problems accumulate that it fails systemically.

Sexual reproduction is context-clearing and starting over from ROM.

ge96 2 hours ago [-]

damn you telemorase

Davidzheng 3 hours ago [-]

bwfan123 4 hours ago [-]

From your anecdote it appears that the calculations in the paper were numerical ? but I suppose a similar argument applies to symbolic calculations.

PaulHoule 4 hours ago [-]

These were symbolic calculations. Mine was a derivation of the Gutzwiller Trace Formula

https://inspirehep.net/files/20b84db59eace6a7f90fc38516f530e...

I have a lot of skepticism about things like

https://en.wikipedia.org/wiki/Anomalous_magnetic_dipole_mome...

----

https://en.wikipedia.org/wiki/Principia_Mathematica#Consiste...

bwfan123 3 hours ago [-]

thanks !

prmph 5 hours ago [-]

The amusing things LLMs do when they have been at a problem for some time and cannot fix it:

- Removing problematic tests altogether

- Making up libs

- Providing a stub and asking you to fill in the code

Wowfunhappy 1 hours ago [-]

> Providing a stub and asking you to fill in the code

This is a perennial issue in chatbot-style apps, but I've never had it happen in Claude Code.

nico 4 hours ago [-]

- Adding try/catch block to “remove” the error and let execution continue

esafak 4 hours ago [-]

If humans can say "The proof is left as an exercise for the reader", why can't LLMs :)

kridsdale1 2 hours ago [-]

They’re just looking out for us to preserve our mental sharpness as we delegate too much to them.

einrealist 4 hours ago [-]

kridsdale1 2 hours ago [-]

Nonscientific: if roughly feels like as models get bigger they absorb more “wisdom” and that lowers the error-generation probability.

__MatrixMan__ 4 hours ago [-]

I don't think this has anything to do with AI. There's a half life for success rates.

kridsdale1 2 hours ago [-]

deadbabe 5 hours ago [-]

This is another reason why there’s no point in carefully constructing prompts and contexts trying to coax the right solution out of an LLM. The end result becomes more brittle with time.

furyofantares 2 hours ago [-]

byyoung3 3 hours ago [-]

i think that this is a bit of an exageration, but i see what you are saying. Anything more than 4-5 re-prompts is diminishing.

ldjkfkdsjnv 5 hours ago [-]

another article on xyz problem with LLMs, which will probably be solved by model advancements in 6/12 months.

bwfan123 4 hours ago [-]

you left out the other llm-apology: But humans can fail too !