If you haven't tried the research -> plan -> implementation approach here, you are missing out on how good LLMs are. It completely changed my perspective.
the key part was really just explicitly thinking about different levels of abstraction at different stages of vibecoding. I was doing it before, but not explicitly in discrete steps, and that was where I got into messes. The prior approach made checkpointing / reverting very difficult.
When I think of everything in phases, I do similar stuff w/ my git commits at "phase" level, which makes design decisions easier to make.
I also spend ~4-5 hours cleaning up the code at the very end once everything works. But it's still way faster than writing hard features myself.
tbh I think the thing that's making this new approach so hard to adopt for many people is the word "vibecoding"
Like yes, vibecoding in the Lovable-esque "give me an app that does XYZ" manner is obviously ridiculous and wrong, and will result in slop. Building any serious app based on "vibes" is stupid.
But if you're doing this right, you are not "coding" in any traditional sense of the word, and you are *definitely* not relying on vibes
I'm sticking to the original definition of "vibe coding", which is AI-generated code that you don't review.
If you're properly reviewing the code, you're programming.
The challenge is finding a good term for code that's responsibly written with AI assistance. I've been calling it "AI-assisted programming" but that's WAY too long.
> but not explicitly in discrete steps and that was where i got into messes.
I've said this repeatedly: I mostly use it for boilerplate code, or when I'm having a brain fart of sorts. I still love to solve things for myself, but AI can take me from "I know I want x, y, z" to "oh look, I got to x, y, z" in under 30 minutes, when it could have taken hours. For side projects this is fine.
I think if you do it piecemeal it should almost always be fine. When you try to tell it to do too much, you and the model both don't consider edge cases (ask it for those too!) and are more prone to a rude awakening eventually.
Have you tried techniques that don't require modifying the LLM or the sampling strategy for structured outputs? For example, schema-aligned parsing, where you build error tolerance into the parser instead of coercing to a grammar.
It looks really slick. For us, the reason we haven't adopted it yet is that it brings more tooling and configuration that overlaps with our existing system for prompt templates, schema definitions, etc. In the component where we couldn't rely on OpenAI structured outputs, we experimented with TOML-formatted output, and that ended up being reliable enough to solve the problem across many models without any new dependencies. I do think we'll revisit at some point, as Boundary also provides incremental parsing of streaming outputs and may allow some cost optimization that isn't easy right now.
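In case it's useful, here's a minimal sketch of the TOML fallback idea (simplified and hypothetical: the prompt just asks for TOML instead of JSON, and the parser tolerates an optional markdown fence):

```python
import re
import tomllib  # Python 3.11+; use the third-party "tomli" package on older versions

def parse_model_output(raw: str) -> dict:
    """Parse a TOML answer, tolerating an optional markdown fence around it."""
    match = re.search(r"```(?:toml)?\s*(.*?)```", raw, re.DOTALL)
    body = match.group(1) if match else raw
    return tomllib.loads(body)

raw = '```toml\nname = "Grave Digger"\n\n[[ingredients]]\ntext = "coffee liqueur"\n```'
print(parse_model_output(raw))
# {'name': 'Grave Digger', 'ingredients': [{'text': 'coffee liqueur'}]}
```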
It's a bit more nuanced than applicative lifting. Part of SAP is that, but there's also supporting strings that don't have quotation marks, supporting recursive types, supporting unescaped quotes like `"hi i wanted to say "hi""`, supporting markdown blocks inside of things that look like "json", etc.
but applicative lifting is a big part of it as well!
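As a toy illustration of the parser-side error tolerance (not our actual implementation, just the shape of the idea in Python):

```python
import json
import re

def tolerant_json(raw: str) -> dict:
    """Toy error-tolerant parse: strip a markdown fence and trailing commas,
    then load. (Illustrative only; SAP handles far more cases than this.)"""
    body = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    body = re.sub(r",\s*([}\]])", r"\1", body)  # drop trailing commas
    return json.loads(body)

print(tolerant_json('```json\n{"Name": "Grave Digger", "Garnishes": ["whipped cream",],}\n```'))
# {'Name': 'Grave Digger', 'Garnishes': ['whipped cream']}
```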
Client: Ollama (phi4) - 90164ms. StopReason: stop. Tokens(in/out): 365/396
---PROMPT---
user: Extract from this content:
Grave Digger:
Ingredients
- 1 1/2 ounces vanilla-infused brandy*
- 3/4 ounce coffee liqueur
- 1/2 ounce Grand Marnier
- 1 ounce espresso, freshly brewed
- Garnish: whipped cream
- Garnish: oreo cookies, crushed
Steps
1. Add all ingredients into a shaker with ice and shake until
well-chilled.
2. Strain into a coupe.
3. Top with whipped cream and crushed Oreo cookies (discarding cream in
center).
*Vanilla-infused brandy: Cut 2 fresh vanilla pods lengthwise and place
into a 750 mL bottle of brandy. Let sit for 3 to 5 days, shaking
occasionally. Strain out pods and store.
IngredientType
----
- Alcohol
- Sweetener
- Sour
- Aromatic
- Bittering_agent
- Food
- Dilution
LiquorType
----
- Gin
- Vodka
- Rum
- Whiskey
- Tequila
- Mezcal
- Brandy
- Cognac
- Liqueur
- Wine
- Absinthe
Answer in JSON using this schema:
{
  Name: string,
  Ingredients: [
    {
      Text: string,
      Type: IngredientType,
      Liquor_type: LiquorType or null,
      Name_brand: string or null,
      Unit_of_measure: string,
      Measurement_or_unit_count: string,
    }
  ],
  Steps: [
    {
      Number: int,
      Instruction: string,
    }
  ],
  Garnishes: string[],
}
---LLM REPLY---
```json
{
  "Name": "Grave Digger",
  "Ingredients": [
    {
      "Text": "vanilla-infused brandy*",
      "Type": "Alcohol",
      "Liquor_type": "Brandy",
      "Name_brand": null,
      "Unit_of_measure": "ounces",
      "Measurement_or_unit_count": "1 1/2"
    },
    {
      "Text": "coffee liqueur",
      "Type": "Liqueur",
      "Liquor_type": "Liqueur",
      "Name_brand": null,
      "Unit_of_measure": "ounce",
      "Measurement_or_unit_count": "3/4"
    },
    {
      "Text": "Grand Marnier",
      "Type": "Liqueur",
      "Liquor_type": "Liqueur",
      "Name_brand": "Grand Marnier",
      "Unit_of_measure": "ounce",
      "Measurement_or_unit_count": "1/2"
    },
    {
      "Text": "espresso, freshly brewed",
      "Type": "Bittering_agent",
      "Liquor_type": null,
      "Name_brand": null,
      "Unit_of_measure": "ounces",
      "Measurement_or_unit_count": "1"
    }
  ],
  "Steps": [
    {
      "Number": 1,
      "Instruction": "Add all ingredients into a shaker with ice and shake until well-chilled."
    },
    {
      "Number": 2,
      "Instruction": "Strain into a coupe."
    },
    {
      "Number": 3,
      "Instruction": "Top with whipped cream and crushed Oreo cookies (discarding cream in center)."
    }
  ],
  "Garnishes": [
    "whipped cream",
    "oreo cookies, crushed"
  ]
}
```
---Parsed Response (class Recipe)---
{
  "Name": "Grave Digger",
  "Ingredients": [
    {
      "Text": "vanilla-infused brandy*",
      "Type": "Alcohol",
      "Liquor_type": "Brandy",
      "Name_brand": null,
      "Unit_of_measure": "ounces",
      "Measurement_or_unit_count": "1 1/2"
    },
    {
      "Text": "espresso, freshly brewed",
      "Type": "Bittering_agent",
      "Liquor_type": null,
      "Name_brand": null,
      "Unit_of_measure": "ounces",
      "Measurement_or_unit_count": "1"
    }
  ],
  "Steps": [
    {
      "Number": 1,
      "Instruction": "Add all ingredients into a shaker with ice and shake until well-chilled."
    },
    {
      "Number": 2,
      "Instruction": "Strain into a coupe."
    },
    {
      "Number": 3,
      "Instruction": "Top with whipped cream and crushed Oreo cookies (discarding cream in center)."
    }
  ],
  "Garnishes": [
    "whipped cream",
    "oreo cookies, crushed"
  ]
}
Processed Recipe: {
  Name: 'Grave Digger',
  Ingredients: [
    {
      Text: 'vanilla-infused brandy*',
      Type: 'Alcohol',
      Liquor_type: 'Brandy',
      Name_brand: null,
      Unit_of_measure: 'ounces',
      Measurement_or_unit_count: '1 1/2'
    },
    {
      Text: 'espresso, freshly brewed',
      Type: 'Bittering_agent',
      Liquor_type: null,
      Name_brand: null,
      Unit_of_measure: 'ounces',
      Measurement_or_unit_count: '1'
    }
  ],
  Steps: [
    {
      Number: 1,
      Instruction: 'Add all ingredients into a shaker with ice and shake until well-chilled.'
    },
    { Number: 2, Instruction: 'Strain into a coupe.' },
    {
      Number: 3,
      Instruction: 'Top with whipped cream and crushed Oreo cookies (discarding cream in center).'
    }
  ],
  Garnishes: [ 'whipped cream', 'oreo cookies, crushed' ]
}
So, yeah, the main issue is that it dropped some ingredients that were present in the original LLM reply. Separately, the original LLM reply misclassified the `Type` field of `coffee liqueur`, which should have been `Alcohol`.
but you expected a
{
  Text: string,
  Type: IngredientType,
  Liquor_type: LiquorType or null,
  Name_brand: string or null,
  Unit_of_measure: string,
  Measurement_or_unit_count: string,
}
there's no way to cast `Liqueur` -> `IngredientType`. But since the data model is an `Ingredient[]`, we attempted to give you as many ingredients as possible.
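To make the mismatch concrete with plain Python enums (illustration only, not our internals):

```python
from enum import Enum

class IngredientType(Enum):
    ALCOHOL = "Alcohol"
    SWEETENER = "Sweetener"
    SOUR = "Sour"
    AROMATIC = "Aromatic"
    BITTERING_AGENT = "Bittering_agent"
    FOOD = "Food"
    DILUTION = "Dilution"

# "Liqueur" is a LiquorType value, not an IngredientType value, so coercion fails:
IngredientType("Liqueur")  # raises ValueError
# Rather than failing the whole Ingredient[] list, the parser keeps the items
# that do validate, which is why those two ingredients were dropped.
```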
The model itself being wrong isn't something we can do much about. That depends on two things: the capabilities of the model, and the prompt you pass in.
If you wanted to capture all of the items with more rigor, you could write it this way:
class Recipe {
  name string
  ingredients Ingredient[]
  num_ingredients int
  ...

  // add a constraint on the type
  @@assert(counts_match, {{ this.ingredients|length == this.num_ingredients }})
}
And then if you want to be very wild, put this in your prompt:
Then in your code you can easily pass that to the "calculator" function and get the result, then hand the result back to the model, making it feel like the model can "call" an external function.
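A rough sketch of that loop in Python (hypothetical names: `call_llm`, the "calculate" action, and the prompt wording are all made up for illustration):

```python
import ast
import operator

def calculator(expression: str) -> float:
    """Tiny, safe arithmetic evaluator that the model's request gets routed to."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")

    return ev(ast.parse(expression, mode="eval").body)

# Hypothetical loop: if the parsed reply is a "calculate" request rather than a
# final answer, run the tool and hand the result back to the model.
def run_with_calculator(call_llm, user_message: str) -> str:
    reply = call_llm(user_message)  # assumed to return a parsed dict
    while reply.get("action") == "calculate":
        result = calculator(reply["expression"])
        reply = call_llm(f"The calculator returned {result}. Continue.")
    return reply["answer"]
```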
I 100% agree. People get so caught up on trying to do everything 90% right with AI, but they forget there's a reason most websites offer at least 2 9's of uptime.
I'm not really sure what your stance is here, because you say you agree with the GP but then throw out some figures that clearly disagree with the author's point (99% uptime is vastly greater than 90% accuracy).
We have some preliminary data with llama3.1 and we find that the smaller model gets to around 70% with BAML (+20% from base), but we'll update this dashboard with llama3.1 by end of week!
# Assumed imports for the generated BAML Python client; exact module paths may
# differ depending on your generator configuration.
from baml_client.async_client import b
from baml_client.type_builder import TypeBuilder

tb = TypeBuilder()
tb.Person.add_property("last_name", tb.string().list())
tb.Person.add_property("height", tb.float().optional()).description(
    "Height in meters"
)
tb.Hobby.add_value("chess")
for name, val in tb.Hobby.list_values():
    val.alias(name.lower())
tb.Person.add_property("hobbies", tb.Hobby.type().list()).description(
    "Some suggested hobbies they might be good at"
)

# (inside an async function)
# no_tb_res = await b.ExtractPeople("My name is Harrison. My hair is black and I'm 6 feet tall.")
tb_res = await b.ExtractPeople(
    "My name is Harrison. My hair is black and I'm 6 feet tall. I'm pretty good around the hoop.",
    {"tb": tb},
)

assert len(tb_res) > 0, "Expected non-empty result but got empty."

for r in tb_res:
    print(r.model_dump())
Neat, thanks! I'm still pondering whether I should be using this, since most of the retries I have to do are because of the LLM itself not understanding the schema asked for (e.g. output with missing fields, or using a value not present in `Literal[]`), with certain models being especially bad with deeply nested schemas and outputting gibberish. Anything on your end that can help with that?
Or if you're open to sharing your prompt / data model, I can send over my best guess at a good prompt! We've found these models work decently well even with 50+ fields, nesting, and whatnot.
I might share it with you later on your discord server.
> I can send over my best guess of a good prompt!
Now, if you could automate the above process by "fitting" a first draft prompt to a wanted schema, i.e. where your library makes a few adjustments, if some assertions do not pass, by having a chat of its own with the LLM, that would be super useful! Heck, I might just implement it myself.
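Something like this, maybe (all names hypothetical, with `call_llm` standing in for whatever client you use):

```python
def fit_prompt(call_llm, draft_prompt: str, assertions, max_rounds: int = 3) -> str:
    """Iteratively revise a prompt until its output passes all assertions.

    `assertions` is a list of (check_fn, failure_message) pairs.
    """
    prompt = draft_prompt
    for _ in range(max_rounds):
        output = call_llm(prompt)
        failures = [msg for check, msg in assertions if not check(output)]
        if not failures:
            return prompt
        # Let the LLM revise its own prompt based on the failed assertions.
        prompt = call_llm(
            "Rewrite the prompt below so the model's output fixes these problems:\n"
            + "\n".join(failures)
            + "\n\nPrompt:\n" + prompt
        )
    return prompt
```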
[Another BAML creator here] I agree this is an interesting direction! We have a "chat" feature on our roadmap to do this right in the VSCode playground, where an AI agent will have context on your prompt, schema, BAML test results, etc., and help you iterate on the prompt automatically. We've done this before and have been surprised by how good the LLM feedback can be.
We just need a bit better testing flow within BAML since we do not support adding assertions just yet.