
Personally (anecdata), I haven't experienced any practical progress from these models in my day-to-day tasks for a long time, no matter how good they've become at gaming the benchmarks.

They keep being impressive at what they're good at (aggregating sources to solve a very well-known problem) and terrible at what they're bad at (actually thinking through novel problems, or old problems with few sources).

E.g., ChatGPT, Claude, and Gemini were all absolutely terrible at generating Liquidsoap[0] scripts. It's not even that complex, but there's very little information to ingest about the problem space, so you can actually tell they are not "thinking".

[0] https://www.liquidsoap.info/



Absolutely. All models are terrible with Objective-C and Swift compared to, say, JS/HTML/Python.

However, I've realized that Claude Code is extremely useful for generating fairly simple landing pages for some of my projects. It spits out static HTML+JS, which is easy to host, with a reasonably good-looking design.

The code isn't the best, and some of it isn't maintainable by a human at all, but it gets the job done.


I've gotten zero production-usable Python out of any LLM. A small script to do something trivial, sure. Anything I'm going to have to maintain or debug in the future, not even close. I think there is a _lot_ of terrible Python code out there training LLMs, so being a more popular language is not helpful. This era is making transparent how low standards really are.


> I've gotten zero production-usable Python out of any LLM

Fascinating, I wonder how you use it, because once I decompose code into modules and function signatures, Claude[0] is pretty good at implementing Python functions (see the sketch below). I'd say it one-shots 60% of the time, I have to tweak the prompt or adjust the proposed diffs 30% of the time, and the remaining 10% is unusable code that I end up writing by hand. Other things Claude is even better at: writing tests, simple refactors within a module, authoring first-draft docstrings, and adding context-appropriate type hints.

0. Local LLMs like Gemma3 and Qwen-coder seem to be in the same ballpark in terms of capabilities; they're just much slower on my hardware. Except for the 30B Qwen3 MoE that was released a day ago, that one is freakin' fast.
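
To make that concrete, here's a minimal sketch of what I hand the model (the function name and behavior are made up for illustration): I write the signature, types, and docstring myself, then ask for the body plus a matching test.

    # Stub handed to the model; it fills in the body and writes tests.
    def normalize_phone(raw: str, default_region: str = "US") -> str:
        """Return `raw` as an E.164 phone number string, e.g. "+14155550123".

        Raises ValueError if `raw` cannot be parsed as a phone number.
        """
        ...  # left for the model to implement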


I agree: you have to treat them like juniors and provide the same context you would give someone who is still learning. You can't assume the output is correct, but where that doesn't matter, it's a productivity improvement. The vast majority of the code I write doesn't even go into production, so it's fantastic for my usage.


What happens to the vast majority of the code you write?


Different experience here. Production code in banking and finance for backend data analysis and reporting. Sure, the code isn't perfect, but it doesn't need to be. It's saving >50% of the effort, and the analysis results and reporting are of at least as good a standard as human-developed alternatives.


Try o4-mini-high. It’s getting there.


Maybe with the next GPT version, gpt-4.003741


Interesting, I'll have to try that. All the "static" page generators I've tried require React....


Building a basic static HTML landing page is ridiculously easy, though. What JS is even needed? If it's just an HTML file and maybe a stylesheet, of course it's easy to host. You can apply 20 lines of CSS and have a decent-looking page.

These aren't hard problems.


A big part of my job is building proofs of concept for some technologies, and that usually means some webpage to visualize that the underlying tech is working as expected. It's not hard, doesn't have to look good at all, and will never be maintained; I throw it away a few weeks later.

It used to take me an hour or two to get it all done up properly. Now it's literally seconds. It's a handy tool.


> These aren’t hard problems.

Honestly, that's the best use case for AI currently: simple but laborious problems.


Laziness, mostly: no need to think about design, icons, and layout (responsiveness and all that stuff).

These are not hard problems, obviously, but getting to 80%-90% is faster than doing it by hand, and in my case that was more than enough.

That said, AI failed on the remaining 10%-20%, with various small visual issues.


> These aren't hard problems.

So why do so many LLMs fail at them?


And so do many humans.


I like using Vercel v0 for frontend


Absolutely. As soon as they hit the point where things get really specialized, they start failing a lot. They generalize pretty well over well-documented areas. I only use them for getting a second opinion, since they can search through a lot of documents quickly and find me alternatives.


They have broad knowledge, a lot of it, and they work fast. That should be a useful combination.

And indeed it is. Essentially every time I buy something these days, I use Deep Research (Gemini 2.5) to first make a shortlist of options. It’s great at that, and often it also points out issues I wouldn’t have thought about.

Leave the final decisions to a super slow but smart intelligence (a human), by all means, but as for people who claim that LLMs are useless, I can only conclude that they haven't tried very hard.


Yes, similar experience querying GPT about lesser-known frameworks. I had o1 stone-cold hallucinate some nonexistent methods I could find no trace of from Googling. It would not budge on the matter either. Basically, you have to provide the key insight yourself in these cases to get it unstuck, or just figure it out yourself. After it's dug into a problem to some degree, you get a feel for whether continued prompting on the subject is going to be helpful or just more churn.


I'm curious what kind of prompting or context you are providing before asking for a Liquidsoap script, or whether you've tried using Cursor and providing a bunch of context with Liquidsoap documentation as part of it. My guess is these kinds of things get the models to perform much better. I have seen this work with internal APIs / best practices / patterns.


Yes, I used Cursor and tried providing both the whole Liquidsoap book and the URL to the online reference, in case the book was too large for the context or was triggering some sort of RAG.

Not successful.

It's not that it didn't do what I wanted: most of the time it didn't even run. Iterating on the error messages just led to progressively dumber non-solutions and running in circles.


Oh man, that's disappointing.


What model?


I'm on a two-week Pro trial, so I tried a mix of mainstream premium models (including reasoning ones), plus letting Cursor route me to the "best" model or whatever they call it.


This problem is always going to exist in these models; they are hungry for good data.

If there is a focus on improving a model at something, the method to do it is known; it's just a matter of priority.



