YAML seems like a really neat idea, but over time, I have come to regard ...

Arnavion · on April 3, 2021

They all have their downsides.

JSON:

- no comments, unless you fake them with fake properties, unless your configuration has a schema that doesn't allow extra fake properties

- no trailing commas; makes editing more annoying

- no raw strings

YAML:

- the automatic type coercion

- the many ways to encode strings ( https://yaml-multiline.info/ )

- the roulette wheel of whether this particular parser is anal about two-space indentation or accepts anything as long as it's used consistently

- the roulette wheel of whether this particular parser supports uncommon features like anchors

TOML:

- runtime footguns in automated serialization ( https://news.ycombinator.com/item?id=24853386 )

- hard to represent deeply-nested structures, unless you switch to inline tables which are like JSON but just different enough to be annoying
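
To make the JSON complaints above concrete, here is a minimal sketch using Python's standard `json` module. Python and the helper name `parses` are illustrative assumptions, not something from the thread; other strict JSON parsers should behave similarly, but that is not checked here.

```python
import json


def parses(text: str) -> bool:
    """Return True if Python's strict JSON parser accepts `text`."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical configuration snippets, used only for illustration.
plain = '{"host": "example.com", "port": 8080}'
with_comment = '{\n  // listen address\n  "host": "example.com"\n}'
with_trailing_comma = '{"host": "example.com", "port": 8080,}'

print(parses(plain))                # True
print(parses(with_comment))         # False: comments are not part of JSON
print(parses(with_trailing_comma))  # False: trailing commas are rejected
```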

perlgeek · on April 3, 2021

For hand-writing I love jsonnet, which produces JSON, is much more convenient to write, and has some templating, functions etc. https://jsonnet.org/

You wouldn't serialize data structures to jsonnet though, you'd just generate JSON.

anticristi · on April 3, 2021

This makes me sad. It's 2021 and we still haven't figured out how to serialize configuration in a format that is easy to edit and predictable.

kstenerud · on April 3, 2021

This is the problem space I'm targeting with https://concise-encoding.org/

* Text AND binary so that humans can edit easily, and machines can transmit energy- and bandwidth-efficiently.

* Carefully designed spec to avoid ambiguities (and their security implications).

* Strong type support so you're not using all kinds of incompatible hacks to serialize your data.

* Versioned, because there's no such thing as the perfect format.

* Also, the website is 32k bytes ;-)

yyyk · on April 3, 2021

+ Has binary format.

+ Avoids ambiguities.

- The format seems to feel the need to support everything, including things I am not sure are actual use cases (what's the point of the Markup element, for example? What does Metadata save us compared to just including it in the document, given that parsers must parse it anyway?). This must make implementation more complex and costly, and makes reading the text format more difficult.

- Not a fan of octal notation. At 3am I'm not sure I won't confuse 0 and o in certain fonts. Does anyone even use it these days?

- Unquoted strings were discussed in the thread. I'd like to point out that it's very easy to make an unquoted string not "text-safe" (according to the spec) without noticing it, at which point the document is invalid.

Just add white-space (maybe a user pasted a string from somewhere without noticing whitespace at the end or forgot the rules), a dot, an exclamation or a question mark. Having surprises like that is IMHO worse than a consistent quoting method.

Basically all the things I don't like are about the format supporting a bit too much. YAML 1.1 should teach us more is sometimes less.

kstenerud · on April 3, 2021

Alright that's two votes against unquoted strings so far (plus my wife agrees so that's three against!)

I put in octal because it was trivial to implement after the others. The canonical format when it's stored or being sent is binary, and a decoder shouldn't be presenting integers in octal (that would just be weird). But a human might want octal when inputting data that will be converted to the binary format.

Markup is for presentation data, UI layouts, etc, but with full type support rather than all the hacky XML+whatever solutions that many UI toolkits are adopting. Also, having presentation data in binary form is nice to have.

yyyk · on April 3, 2021

Well, unquoted strings work when a format is built for that. If the default was "it's text unless we see the special sequences" it would be better for unquoted strings. But even then there are too many special characters in this format IMHO.

I saw there's a 'Media' type in the spec. It seems the type is actually for serializing files. But there's no "name" (or we can call it "description") field. Of course we could accomplish this with a separate field, but then again the entire type's functionality could be accomplished with a u8x array and a string field. So if you're specifying this type at all, you might as well add a name field to make it useful.

kstenerud · on April 4, 2021

The media object is for embedding media within a document (an image, a sound, an animation, some bytecode to execute in a sandbox, or whatever). It's not intended to be used as an archive format for storing files (which, as you said, could be trivially accomplished with a byte array for the data, a string for the file name, and some metadata like permissions etc). A file is just one way among many to store media (in this case as an entry in a hierarchical database - the filesystem - keyed by filename). CE is only interested in the media itself, not the database technology.

The media object is a way to embed media data directly into a document such that the receiving end will have some idea of how to deal with it (from its media type). It won't have or need a "file name" because it's not intended to be stored in a filesystem, but rather to be used directly by an application. Yes, it could be built up from the primitives, but then you lose the canonical "media" type, and everyone invents their own incompatible compound types (much like what happened with dates in JSON and XML).

kstenerud · on April 4, 2021

OK, after more discussion and thought:

- I'm removing the metadata type. You're right that it's not really gaining us anything.

- I'm changing strings so they always must be quoted. This actually simplifies a lot of things.

Thanks for the critique!

chousuke · on April 3, 2021

I'm skimming through the human readable spec, and it seems decent, but I noticed the spec allows unquoted strings. What's the reasoning for this? In my experience unquoted strings cause nothing but trouble, and are confusing to humans who may interpret them as keywords.

Any reason for not using RFC2119 keywords in the spec? Using them should make the spec easier to read.

kstenerud · on April 3, 2021

> I noticed the spec allows unquoted strings. What's the reasoning for this? In my experience unquoted strings cause nothing but trouble, and are confusing to humans who may interpret them as keywords.

Unquoted strings are much nicer for humans to work with. All special keywords and object encodings are prefixed with sigils (@, &, $, #, etc), so any bare text starting with a letter is either a string or an invalid document, and any bare text starting with a numeral is either a number or an invalid document.

> Any reason for not using RFC2119 keywords in the spec? Using them should make the spec easier to read.

I use a superset of those keywords to give more precision in meaning: https://github.com/kstenerud/concise-encoding/blob/master/ce...

chousuke · on April 3, 2021

If strings are always unambiguously detectable, why allow quoting them at all? Having two representations for the same data means you can't normalize a document unambiguously. I can understand that having barewords seems cleaner for things like map keys, but I am not convinced that it's a worthwhile tradeoff.

An important feature of RFC2119 keywords is that they're always capitalized (ie. the keyword is "MUST", not "Must", or "must"). This makes requirements and recommendations stand out amid explanatory text, improving legibility. For example, RFC2119 itself uses MUST and must with different meanings.

kstenerud · on April 3, 2021

> If strings are always unambiguously detectable, why allow quoting them at all?

Because strings can contain whitespace and other structural characters that would confuse a parser.

> Having two representations for the same data means you can't normalize a document unambiguously.

The document will always be normalized unambiguously in binary format. The text format is a bit more lenient because humans are involved.

The idea is that the binary format is the source of truth, and is what is used in 90% of situations. The text format is only needed as a conduit for human input, or as a human readable representation of the binary data when you need to see what's going on.
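
The "lenient text, single canonical form" idea can be sketched with JSON as a stand-in. This is not Concise Encoding and does not use its binary format; the helper `canonical` and the sample documents are hypothetical. Two different spellings of the same data compare unequal as text, but become identical once decoded and re-encoded in one fixed form:

```python
import json


def canonical(text: str) -> str:
    """Decode JSON text, then re-encode it with sorted keys and no spaces."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


# Same data, written two ways: escaped vs. literal "A", different key order
# and spacing. Both documents are made up for this example.
hand_written = r'{ "name": "\u0041", "n": 1 }'
machine_made = '{"n":1,"name":"A"}'

print(hand_written == machine_made)                        # False
print(canonical(hand_written) == canonical(machine_made))  # True
```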

> An important feature of RFC2119 keywords is that they're always capitalized (ie. the keyword is "MUST", not "Must", or "must").

Hmm good point. I'll add that.

kstenerud · on April 4, 2021

Update: I'm removing unquoted strings. Thanks for the critique!

anticristi · on April 3, 2021

Nice! I like some concepts that this format proposes, but the `@` and `|` modifiers feel a bit too "loaded".

kstenerud · on April 3, 2021

It's a compromise; there are only so many letters, numbers, and symbols available in a single keystroke on all keyboards, and I don't want there to be any ambiguity with numbers and unquoted strings (e.g. interpreting the unquoted string value true as the boolean value true).

So everything else needs some kind of initiator and/or container syntax to logically separate it from the other objects when interpreted by a human or machine.

imhoguy · on April 3, 2021

We had such: XML. With proper editor support it is easy. I guess it needs rediscovery /s ;)

anticristi · on April 3, 2021

I used XML and didn't like it:

- A proper editor was never around.

- Closing tags were verbose.

- Attributes vs tags was confusing.

- It didn't map "naturally" to common data types, like lists, maps, integers, float, etc.

mattmanser · on April 3, 2021

Don't forget about namespaces, another fiddly bit of XML that caused all sorts of problems and headaches.

sergeykish · on April 3, 2021

You've just used XML tech as it was designed to post this comment.

XML is serialization. I doubt you were concerned about serialization while posting the comment, or thought about the attributes-vs-tags distinction.

This page uses requests to a server for multi-user editing. But it is easy to build a truly serverless (like a file) document with the same interface:

    data:text/html,<html><ul>Host: <span class=host contenteditable>example.com

Change it, save it, done. Web handles input of lists, maps, integers, float and much more.

anticristi · on April 3, 2021

You are right. XML is great for encoding the DOM. However, I didn't find it practical for interfacing with humans, due to the concerns I raised.

sergeykish · on April 3, 2021

It is not practical to edit plain text in binary:

    636f 756e 7472 6965 733a 0a2d 2047 420a
    2d20 4945 0a2d 2046 520a 2d20 4445 0a2d

It is not practical to edit Excel documents in plain text:

    <?xml version="1.0"?>
    <Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:o="urn:schemas-microsoft-com:office:office"
      xmlns:x="urn:schemas-microsoft-com:office:excel"
      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:html="http://www.w3.org/TR/REC-html40">
      <Worksheet ss:Name="Sheet1">
        <Table>
          <Row>
            <Cell><Data ss:Type="String">ID</Data></Cell>

Tim Berners-Lee's browser was a browser-editor. Can't you see the parallels?

trhway · on April 3, 2021

XML with convenient UI tools for editing should have fit the bill. Yet, for whatever reason, a convenient UI tool would never happen to be there when needed, and thus, scared and tired of manually editing XML, the world has embraced YAML.

masklinn · on April 3, 2021

> XML with convenient UI tools for editing should have fit the bill.

"You need this special tool to work" immediately and instantly rules out "easy to edit". Or makes the debate irrelevant: every format is easy to edit if you have "a convenient UI" to do it for you.

sergeykish · on April 3, 2021

The fault was in XML editing; pure data authoring is hard. We have a convenient UI: the web browser. Think of it as literate programming, a way to merge a man page and a configuration file.

And a plain text editor is a "widely deployed special tool to work". Actual data is

    countries:\n- GB\n- IE\n- FR\n- DE\n- NO

Or

    636f 756e 7472 6965 733a 0a2d 2047 420a
    2d20 4945 0a2d 2046 520a 2d20 4445 0a2d
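
As a small illustration of the point that a text editor is itself a decoding tool, the hex dump above can be turned back into text with a few lines of Python (the variable names are illustrative). Decoding shows that the dump stops right after the dash of the last list entry, so it is slightly shorter than the text version:

```python
# The two hex lines quoted above, copied verbatim.
hex_dump = """
636f 756e 7472 6965 733a 0a2d 2047 420a
2d20 4945 0a2d 2046 520a 2d20 4445 0a2d
"""

# Remove all whitespace, convert hex digit pairs to bytes, decode as ASCII.
raw = bytes.fromhex("".join(hex_dump.split()))
text = raw.decode("ascii")

print(len(raw))    # 32
print(repr(text))  # 'countries:\n- GB\n- IE\n- FR\n- DE\n-'
```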

anticristi · on April 3, 2021

Opening XMLs in ZIP containers is easy! Just spin up Word. :)

_pvxk · on April 3, 2021

https://dhall-lang.org/ ?

tgv · on April 3, 2021

> - the automatic type coercion

Only when you "unmarshal" to an untyped data structure and then make assumptions about the type. I've used YAML with a Go application, and it can't interpret NO as a bool when the field is a string.
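
A rough sketch of that distinction, in Python rather than Go. This is not a YAML parser and not how any particular library is implemented; `load_untyped` and `load_typed` are hypothetical helpers, and the patterns cover only a subset of the boolean spellings listed in YAML 1.1. The point is that guessing a type from the text alone turns `NO` into a boolean, while letting the destination type decide keeps it a string:

```python
import re

# A subset of the spellings YAML 1.1 resolves to booleans for plain scalars.
YAML11_TRUE = re.compile(r"yes|Yes|YES|true|True|TRUE|on|On|ON")
YAML11_FALSE = re.compile(r"no|No|NO|false|False|FALSE|off|Off|OFF")


def load_untyped(scalar: str):
    """Guess a type from the text alone, as a schema-less loader would."""
    if YAML11_TRUE.fullmatch(scalar):
        return True
    if YAML11_FALSE.fullmatch(scalar):
        return False
    return scalar


def load_typed(scalar: str, target: type):
    """Let the destination field's type decide; a str field keeps the text."""
    if target is str:
        return scalar
    if target is bool:
        value = load_untyped(scalar)
        if isinstance(value, bool):
            return value
        raise ValueError(f"{scalar!r} is not a boolean")
    raise TypeError(f"unsupported target type: {target!r}")


countries = ["GB", "IE", "FR", "DE", "NO"]
print([load_untyped(c) for c in countries])     # ['GB', 'IE', 'FR', 'DE', False]
print([load_typed(c, str) for c in countries])  # ['GB', 'IE', 'FR', 'DE', 'NO']
```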

Arnavion · on April 3, 2021

Correct, like TFA.