The world desperately needs a replacement for YAML.
TOML is fine for configuration, but not an adequate solution for representing arbitrary data.
JSON is a fine data exchange format, but is not particularly human-friendly, and is especially poor for editable content: it lacks comments and multi-line strings, is far too strict about unimportant syntax, etc.
Jsonnet (a derivative of Google's internal configuration language) is very good, but has failed to reach widespread adoption.
Cue is a newer Jsonnet-inspired language that ticks a lot of boxes for me (strict, schema support, human-readable, compact), but has not seen wide adoption.
Protobuf has a JSON-like text format that's friendlier, but I don't think it's widely adopted, and as I recall, it inherits a lot of Protobufisms.
Dhall is interesting, but a bit too complex to replace YAML.
Starlark is a neat language, but has the same problem as Dhall. It's essentially a stripped-down Python.
Amazon Ion [1] is neat, but I've not seen any adoption outside of AWS.
NestedText [2] looks promising, but it's just a Python library.
StrictYAML [3] is a nice attempt at cleaning up YAML. But we need a new language with wide adoption across many popular languages, and this is Python only.
Seems you're missing my personal favorite, extensible data notation - EDN (https://github.com/edn-format/edn). I'm probably a bit biased coming from Clojure, where it's widely used, but I haven't really found a format that comes close to EDN when it comes to succinctness and features.
Some of the neat features: custom literals / tagged elements whose support can be added at runtime or compile time (dates can be represented, parsed and turned into proper dates in your language). Also, being able to namespace data inside of it makes things a bit easier to manage without having to resort to nesting or other hacks. Very human friendly, plus machine friendly.
Biggest drawback so far seems to be parsing performance, although I'm not sure if that's actually about the format itself, or about the small adoption of the format, which means not many parsers focusing on speed have been written.
Your list is like a graveyard of my dreams and hopes. Anything that doesn't validate the format of the underlying data is pretty much dead to me...
The problem with most of these is they're useless to describe the data. Honestly, it is completely not useful to have the following to describe data:
email => string
name => string
dob => string
IMHO, it is akin to having a dictionary (like Oxford English) read like:
email - noun
name - noun
birthday - noun
It says next to nothing except, yes, they are nouns. All too often I waste time fighting nils and bullshit in fields or duplicating validation logic all over the place.
"Oh wow, this field... is a string..? That's great... smiles gently except... THERE SHOULD NOT BE EMOJI IN MY FUCKING UUID, SCHEMA-CHUD. GET THE FUCK OFF MY LAWN!"
My experience is that validation quickly becomes surprisingly complex, to the point of being infeasible to express in a message format.
Not only are the constraints very hard to express (remember that one 2000 char regexp that really validates email addresses?), they are also contextual: the correct validation in an Android client is not the same as on the server side. Eg you might want to check uniqueness or foreign key constraints that you cannot check on the client. Sometimes you want to store and transmit invalid messages (eg partially completed user input). And then you have evolving validation requirements: what do you do with the messages from three years ago that don't have field X yet?
Unfortunately I don't think you can express what you need in a declarative format. Even minimal features such as regexp validation or enums have pitfalls.
I think it's better to bite the bullet and implement the contextually required validation on each system boundary, for any message crossing boundaries.
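To make "validate at each boundary" a bit more concrete, here's a rough Python sketch. The message shape, the field names and both validator functions are made up for illustration; the point is only that the client and the server run different checks over the same message, and that a UUID check can be stricter than "it's a string".

```python
import uuid
from datetime import date


def parse_uuid(value):
    """Return a uuid.UUID, or raise ValueError for anything that is not one."""
    if not isinstance(value, str):
        raise ValueError("id must be a string")
    return uuid.UUID(value)  # raises ValueError on malformed input


def validate_on_client(msg):
    """Checks a client can do locally: shape and format only."""
    errors = []
    try:
        parse_uuid(msg.get("id"))
    except ValueError:
        errors.append("id: not a UUID")
    try:
        date.fromisoformat(msg.get("dob"))
    except (TypeError, ValueError):
        errors.append("dob: not an ISO date")
    return errors


def validate_on_server(msg, existing_emails):
    """Same format checks, plus a uniqueness check only the server can do."""
    errors = validate_on_client(msg)
    if msg.get("email") in existing_emails:
        errors.append("email: already registered")
    return errors
```

Neither function lives in the message format itself, so each boundary can evolve its rules (or accept partially completed input) without touching the schema.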
If you want automatic built-in string validation, one option that seems particularly interesting is to use a variant of Lua patterns, which are weaker and easier to understand than regular expressions, but still provide a significant degree of "sanity" for something like an email. The original version works on bytes and not runes, but you could simply write a parser that works on runes instead, and the pattern-matching code is just 400 old and battle-tested lines of C89. You might want to add one extension: allow for escape sequences to be treated as a single character (hence included in repetition operators and adding the capability to match quoted strings); with this extension, I think you could implement full email address validation:
XML and XML Schema solved this more than 20 years ago. It had to be replaced with JSON by the web developers though, so they could just “eval() it” to get their data.
Because it offered all the things the parent asked for, but that made it too complex.
You either provide a schema and get the conveniences that come with describing your data, or you don't.
I had a chance to use SOAP at one point. It was an F5 device and I used a Python library. What I really liked is that when the library connected to the device, it downloaded the device's schema and then used that to generate an object. At that point you just communicated with the device like you would with any object in Python.
We abandoned it for inferior technologies like REST and JSON, because SOAP and XML were harder to use from JS, as parent mentioned.
Parent didn't say it was harder to use from JS. Parent said "It had to be replaced with JSON by the web developers though, so they could just “eval() it” to get their data."
First of all, I was there 20 years ago. I had to deal with XML, XSLT, one kind of Java XML parsers that didn't fully do what I needed, another kind of Java XML parsers that didn't fully do what I needed. And oh boy was it a pain. I just wanted to get a few properties of a bunch of entities in a bigger XML document, that's all. Big fail.
Second, JSON always had a parser in JS, so I don't know where that eval nonsense is coming from.
Third, JS actually had the best dev UX for XML of all languages 20 years ago. Maybe you know JavaScript from Node.js, but 20 years ago it used to run exclusively in web browsers, which even then were pretty good at parsing XML documents. The browser of course had a JS DOM traversal API known to every single JS developer, and very soon (although TBH I can't remember if before or after JSON) it also had XPath querying functions, all built in.
XML was so bad that its replacement came from the language where it was actually easiest to use. Think about that for a second.
So the answer to the question "Why was XML replaced?" is not "Because webdevs lol".
I suspect it was because it has both content and attributes, which all but guarantees it's impossible to create a bunch of simple, common data structures from it (like JSON does).
> Second, JSON always had a parser in JS, so I don't know where that eval nonsense is coming from.
Firstly, it sounds like XML ran over your dog or something. Sorry to hear about that. It wasn’t particularly hard to use at all, and if you’re dealing with the possibility of emojis in your JSON UUIDs in 2021, one might even say it’s easier to use.
If you’re referring to JSON.parse() in “had a parser” above, then you have a temporal problem. Regarding eval(), it’s suggested right in the original RFC for JSON. Check it out. Web developers at the time were following that advice.
> The world desperately needs a replacement for YAML.
The world desperately needs support for YAML 1.2, which solves the problems the article addresses fairly completely (largely in the “default” Core schema[0], but more completely with the support for schemas in general), plus a bunch of others, and has for more than a decade. But YAML 1.2 libraries aren’t available for most languages.
[0] not actually an official default, but reflects a cleanup of the YAML 1.1 behavior without optional types, so it's defaultish. Back when it looked like YAML 1.3 might happen in some reasonably-near future, it was actually indicated by team members that the JSON Schema for YAML (not to be confused with the JSON Schema spec) would be the explicit default YAML Schema in 1.3, which has a lot to recommend it.
Nope nope nope. YAML is awful and needs to die. The more you look at it the worse it gets. The basic functionality is elegant (at least until you consider stuff like The Norway Problem), but the advanced parts of YAML are batshit insane.
The article is simply, factually wrong; there is no “YAML 2.0 specification” [0], and everything they point to is YAML 1.1, and addressed in YAML 1.2 (the most recent YAML spec, from 2009.)
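For anyone who hasn't run into the Norway Problem, here's a small Python sketch of the difference between the two spec versions. It does not use a YAML library; the two patterns are my transcription of the boolean spellings in the YAML 1.1 bool type and in the YAML 1.2 Core schema, and `resolves_to_bool` is a made-up helper. Real parsers don't all follow the 1.1 list to the letter, so treat this as an illustration of the specs, not of any particular implementation.

```python
import re

# Boolean spellings from the YAML 1.1 "bool" type definition.
YAML_1_1_BOOL = re.compile(
    r"y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF"
)

# Boolean spellings in the YAML 1.2 Core schema.
YAML_1_2_CORE_BOOL = re.compile(r"true|True|TRUE|false|False|FALSE")


def resolves_to_bool(scalar, version):
    """Would an unquoted scalar be read as a boolean under this spec version?"""
    pattern = YAML_1_1_BOOL if version == "1.1" else YAML_1_2_CORE_BOOL
    return pattern.fullmatch(scalar) is not None


for scalar in ["NO", "off", "true", "Norway"]:
    print(scalar, resolves_to_bool(scalar, "1.1"), resolves_to_bool(scalar, "1.2"))
```

Under the 1.1 list an unquoted country code NO is a boolean; under the 1.2 Core schema it stays a string.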
TOML quickly breaks down with lots of nested arrays of objects. For example:
a:
  b:
    - c: 1
    - d:
        - e: 2
        - f:
            g: 3
Turns into this, which is unreadable:
[[a.b]]
c = 1
[[a.b]]
[[a.b.d]]
e = 2
[[a.b.d]]
[a.b.d.f]
g = 3
TOML also has a few restrictions, such as not allowing arrays at the root of the data; versions before TOML 1.0 also rejected mixed-type arrays like [1, "hello", true]. JSON can represent any TOML value (as far as I know), but TOML cannot represent any JSON value.
At my company we use YAML a lot for table-driven tests (e.g. [1]), and this not only means lots of nested arrays, but also having to represent pure data (i.e. the expected output of a test), which requires a format that supports encoding arbitrary "pure" data structures of arrays, numbers, strings, booleans, and objects.
Also many (most? all?) serializers don't let you control which fields are serialized inline vs not. So if you have a program that generates configuration, you're going to end up with the original unreadable form anyway.
Apropos of this, in Clojure-land the idiomatic serialization is EDN [1], which is pretty ergonomic to work with IMO, since in most cases it is the same as a data literal in Clojure.
My feeling is that :keywords reduce the need and temptation to conflate strings and boolean/enumerations that occurs when there's no clear way to convey or distinguish between a string of data and a unique named 'symbol'. I miss them when I'm in Pythonland.
> S-expressions inherits all trouble with data types from json (dates, times, booleans, integer size, number vs numeric string).
Hm, not sure that's true. S-expressions only define the "shape" of how you're defining something, not the semantics. EDN (https://github.com/edn-format/edn) is for all intents and purposes S-expressions, and has support for custom literals and more, to avoid "the trouble with data types from JSON".
Yes, EDN is S-expressions plus a bunch of semantic rules.
Parsing EDN is quite a bit more complex than just parsing S-expressions, because you need to support a bunch of built-in types, as well as arbitrary extensions through 'tags'.
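The tag part is less scary than it sounds once the reader has produced a (tag, value) pair. Here's a minimal Python sketch of just that dispatch step, not a parser. `#inst` and `#uuid` are the two tags the EDN spec defines out of the box; `myapp/celsius`, `TAG_READERS` and `read_tagged` are names I made up for the example.

```python
import uuid
from datetime import datetime

# Registry mapping tag names to reader functions.
# "inst" and "uuid" are EDN's built-in tags; "myapp/celsius" is hypothetical.
TAG_READERS = {
    "inst": lambda s: datetime.fromisoformat(s.replace("Z", "+00:00")),
    "uuid": uuid.UUID,
    "myapp/celsius": float,
}


def read_tagged(tag, value, readers=TAG_READERS):
    """Turn an already-parsed (tag, value) pair into a native object.

    Unknown tags come back unchanged as a (tag, value) tuple, so the
    caller can decide whether that is an error.
    """
    reader = readers.get(tag)
    if reader is None:
        return (tag, value)
    return reader(value)


# What a reader would hand over for: #inst "1985-04-12T23:20:50.520Z"
print(read_tagged("inst", "1985-04-12T23:20:50.520Z"))
print(read_tagged("unknown/tag", 42))
```

Passing a different `readers` dict per call is one way to get the "support added at runtime" behavior without global state.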
I’ve used most of the technologies you listed. Cue is the best, and the only one with strong theoretical foundations. I’ve been using it for some time now and won’t go back to the others.
> The world desperately needs a replacement for YAML.
For situations like TFA you really want a configuration language that behaves exactly like you think it will, and since you don't have to interop with other organizations you don't really need a global standard.
Moreover, broadly used config languages can be somewhat counterproductive to that goal. Take JSON as an example; idiomatic JSON serdes in multiple programming languages has discrepancies in minint, maxfloat, datetime, timezone, round-tripping, max depth, and all kinds of other nuanced issues. Existing tooling is nice when it does what you expect, but for a no-frills, no-surprises configuration language I would almost always just prefer to use the programming language itself or otherwise write a parser if that doesn't suffice (e.g., in multilingual projects).
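A quick illustration of the kind of discrepancy I mean, using only Python's standard `json` module. The "stores every number as a double" case is simulated with `float()`; it stands in for any parser that behaves the way JavaScript does, and is not a claim about a specific library.

```python
import json

big = 2**53 + 1  # one past the largest range a double represents exactly

# Python's json module round-trips the integer exactly...
round_tripped = json.loads(json.dumps(big))

# ...but a parser that stores every number as a double would silently
# land on a neighbouring value.
as_double = int(float(big))
print(big, round_tripped, as_double)

# Python also emits NaN by default, which is not valid JSON per RFC 8259.
nan_text = json.dumps(float("nan"))
print(nan_text)

# Strict mode refuses instead of producing a non-standard document.
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError as exc:
    print("strict mode refuses:", exc)
```

Same document, same "standard" format, and two well-behaved implementations can still disagree about what the value is.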
Mildly off-topic: The problem here, more or less, was that the configuration change didn't have the desired effect on an in-memory representation of that configuration. We can mitigate that at the language level, but as a sanity check it's also a good idea to just diff the in-memory objects and make sure the change looks kind of like what you'd expect.
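The diffing doesn't need to be fancy. Here's a small Python sketch that walks two nested dicts and reports every leaf that changed; `diff_config`, the `<missing>` marker and the example config keys are all made up for illustration, and lists are compared wholesale to keep it short.

```python
_MISSING = "<missing>"  # marker for keys present on only one side


def diff_config(old, new, path=""):
    """Yield (path, old_value, new_value) for every leaf that differs."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(set(old) | set(new), key=str):
            yield from diff_config(
                old.get(key, _MISSING), new.get(key, _MISSING), f"{path}/{key}"
            )
    elif old != new or type(old) is not type(new):
        # The type check catches changes such as True -> 1 or "NO" -> False,
        # which are easy to miss when eyeballing the config text.
        yield (path or "/", old, new)


before = {"routes": {"backbone": {"enabled": True, "peers": 12}}, "region": "NO"}
after = {"routes": {"backbone": {"enabled": False, "peers": 12}}, "region": "NO"}

changes = list(diff_config(before, after))
for change in changes:
    print(change)
```

If the list of changes is longer (or shorter) than the edit you thought you made, that's the cue to stop before rolling it out.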
You don't need wide adoption for internal projects in an organization, but you do want great toolchain support.
For example, the fact that NestedText is a Python library means a Python team could use it, but it's a poor fit for an organization whose other teams use Go and JavaScript/TypeScript.
We use YAML for much more than configuration, by the way. I feel like YAML hits a nice sweet spot where it's usable for almost everything.
I don't think YAML is going anywhere, largely because it was the first format to prioritize readability and conciseness, and has used that advantage to achieve critical mass.
It's far more productive to push for incremental changes to the YAML spec (or even a fork of it) to make it more sane and better defined. Things like a StrictYAML subset mode for parsers in other popular languages.
> It's far more productive to push for incremental changes to the YAML spec
The problems this article raises, and that StrictYAML purports to address, were already addressed in YAML 1.2, which Python supports via ruamel.yaml. YAML 1.2 addresses much of this in the Core schema, which is the closest successor to the default behavior of earlier spec versions, and does so more completely through its support for schemas in general. A schema defines both the supported “built-in” tags (roughly, types) and how they are matched from the low-level representation, which consists only of strings, sequences, and maps. (Those, incidentally, are the only three tags of the “Failsafe” schema; there’s also a “JSON” schema between Failsafe and Core, which has tags corresponding to the types supported by JSON.)
JSON5 is better than JSON on my points, but it has downsides compared to YAML. For example, YAML is very good at multiline strings that don't require any sort of quoting, and knows to remove preceding indentation:
foo: |
  "This is a string that goes across
  multiple lines," he wrote.
In JSON5, you'd have to write:
{
  foo: "\"This is a string that goes across \
multiple lines,\" he wrote."
}
This sort of ergonomic approach is why YAML is so well-liked, I think. (Granted, YAML's use of obscure Perl-like sigils to indicate whitespace mode is annoying, but it does cover a lot of situations.)
YAML is also great at arrays, mimicking how you'd write a list in plaintext:
I will keep using YAML because I don't want to learn the pitfalls of your alternatives. With YAML everyone is complaining about the pitfalls, and therefore everyone is aware of them. A random replacement may not have this particular problem, but it may have other problems that remain unknown.
Any others?
[1] https://amzn.github.io/ion-docs/
[2] https://nestedtext.org/
[3] https://github.com/crdoconnor/strictyaml/