It’s basically called “reinforced learning” and it’s a common technique for machine learning.
You provide a goal as a big reward (eg test passing), and smaller rewards for any particular behaviours you want to encourage, and then leave the machine to figure out the best way to achieve those rewards through trial and error.
After a few million attempts, you generally either have a decent result, or more data around additional weights you need to apply before reiterating on the training.
Defining the goal is the easy part: as I said in my OP, the goal is unit tests passing.
It’s the other weights that are harder. You might want execution speed to be one metric. But how do you add weights to prevent cheating (eg hardcoding the results)? Or use of anti-patterns like global variables? (For example. Though one could argue that scoped variables aren’t something an AI-first language would need)
This is where the human feedback part comes into play.
It’s definitely not an easy problem. But it’s still more pragmatic than having a human curate the corpus. Particularly considering the end goal (no pun intended) is having an AI-first programming language.
I should close off by saying that I’m very skeptical that there’s any real value in an AI-first PL. so all of this is just a thought experiment rather than something I’d advocate.
With such learning your model needs to be able to provide some kind of solution or at least approximate it right off the bat. Otherwise it will keep producing random sequences of tokens and will not learn anything ever because there will be nothing in its output to reward, so no guidance.
I don’t agree it needs to provide a solution off the bat. But I do agree there is some initial weights you need to define.
With a AI-first language, I suspect the primitives to be more similar to assembly or WASM rather than something human readable like Rust or Python. So the amount of pre-training preparation would’ve a little easier since syntax errors due to parser constraints.
I’m not suggesting this would be easy though haha. I think it’s a solvable problem but that doesn’t mean it’s easy.
You provide a goal as a big reward (eg test passing), and smaller rewards for any particular behaviours you want to encourage, and then leave the machine to figure out the best way to achieve those rewards through trial and error.
After a few million attempts, you generally either have a decent result, or more data around additional weights you need to apply before reiterating on the training.