
If base64 decoding speed makes a difference in your application, you should be considering why you are transmitting data in base64, a neat hack from the 1970s, which is non-human-readable, wastes 30% of the network bandwidth, kills compression, wastes RAM, is typically hard to random-access, and is generally a bad idea all round.


I agree. IMHO the biggest design fail and annoyance of JSON is that it has no good way to contain binary data.

The problems start with knowing the encoding (base64; for small sizes sometimes hex; for one or two bytes sometimes arrays, with all their ambiguities).

Then you can't easily differentiate between a string and a bytearray anymore (though you often shouldn't need to).

Then it becomes noticeably bigger, which is a problem especially for large blobs.

Then you always have to scan over this now-larger chunk to find its end (instead of just skipping ahead).

Then you have to encode/decode it, which might require additional annotations depending on the serialization library and can imply other annoyances.

The latter point can also, in some languages, lead to you accidentally using a base64 string as bytes or the other way around.

Well I guess that was enough ranting ;=)
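The round trip being complained about looks like this in Python (the `"payload"` key is just an illustrative name):

```python
import base64
import json

# raw bytes that are not valid UTF-8, so they can't go into a JSON string directly
blob = bytes([0x00, 0x9F, 0x92, 0x96])
doc = json.dumps({"payload": base64.b64encode(blob).decode("ascii")})

# the receiver has to know, out of band, that "payload" holds base64-encoded
# bytes; nothing in the JSON itself distinguishes it from an ordinary string
restored = base64.b64decode(json.loads(doc)["payload"])
```

Note that `json.loads` hands back a plain `str` either way; the bytes-vs-string distinction lives entirely in the application's head.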


If efficiency and compactness are your problem, JSON is not the best idea anyway; look at protobufs and their ilk.

JSON works where you need simplicity on the verge of being dumb, and human-readability.


JSON being human-readable is extremely nice. FWIW, I’ve found some binary formats like CBOR and MsgPack to be almost human-readable even in binary form once you are even slightly familiar with the format, or they convert almost instantly to readable JSON.

When I was developing a three component system, protos were not working for me. Too many schema changes and implementation differences to make the reduced binary size worth it. This was partly because the best protobuffer implementation in C (nanoPB) is just the hobby project of some dude, but mostly because coordinating schemas was annoying.


On the contrary, adding binary would have been terrible. JSON is great because it is simple!

Have you tried BSON or Protobufs to solve your annoyances? How did it go?


Protobuffs require sharing of a schema, then code generation for each language. Not ideal imo unless you really need the speed and the protos can be shared ahead of time.

Outside of Mongo, I haven’t seen BSON used anywhere.

CBOR however, is up and coming. Starting to look like it may be a first class citizen in AWS someday. On the IoT side they are already preferring it over JSON for Device Defender.


Base64 comes from the development of MIME in the early 1990s (see https://tools.ietf.org/html/rfc1341 section 5.2) though there are other similar encodings such as uuencode which predate it.


I agree with the spirit and most of the content of this comment, except for the hard to random access part. It's not really hard to random access base 64.


Theoretically, yes it's easy.

Practically, base64 encode a video file and tell me how exactly you're going to allow the user to seek to any place they like within that video using only common libraries on common platforms... Theoretically easy, practically hard enough nobody does it.


Practically it's easy too. Here's a piece of pseudocode I just wrote:

   chunkOffset = (byteIndex / 3) * 4      // integer division; every 3 input bytes become 4 base64 chars
   base64decode(base64data[chunkOffset : chunkOffset + 4])[byteIndex % 3]
This gets you the byte located at an arbitrary index of the original buffer while operating on the base64 data.
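To make that concrete, here's a runnable Python version of the same pseudocode (the helper name `byte_at` is mine, and this assumes standard-alphabet base64 without line breaks):

```python
import base64

def byte_at(b64data: str, byte_index: int) -> int:
    """Return the byte at byte_index of the original buffer,
    decoding only the one 4-char base64 group that contains it."""
    group = (byte_index // 3) * 4          # 3 input bytes -> 4 base64 chars
    chunk = b64data[group:group + 4]
    chunk += "=" * (-len(chunk) % 4)       # re-pad if we hit a trailing partial group
    return base64.b64decode(chunk)[byte_index % 3]

data = b"seekable binary payload"
encoded = base64.b64encode(data).decode()
print(byte_at(encoded, 9) == data[9])      # True
```

For a real seekable video you'd read a byte *range* the same way (round the range out to whole 4-char groups before decoding), which is why this works with plain file or HTTP range reads.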


1. Decode to bytes.

2. Do whatever you said.


If the data being base64 encoded will also then be compressed, it is better to base16 encode (hex) and then apply compression. Base16 is faster to encode and compresses better (probably compresses faster, didn't test that).

    2537012 Nov  6 10:10 test.tar.hex.gz
    2954608 Nov  6 10:03 test.tar.b64.gz
All the caveats, test with your data, etc.
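The effect can be reproduced without the tarball. This sketch uses a synthetic payload whose repeated block lands at a different 3-byte phase each time (so base64 encodes each copy differently, while hex encodes them all identically):

```python
import base64
import binascii
import random
import zlib

random.seed(0)
block = bytes(random.randrange(256) for _ in range(1000))
# a 1-byte separator shifts each successive copy to a different base64 phase
data = b"".join(block + b"!" for _ in range(30))

hex_gz = len(zlib.compress(binascii.hexlify(data), 9))
b64_gz = len(zlib.compress(base64.b64encode(data), 9))
print(hex_gz, b64_gz)   # hex is larger before compression but smaller after
```

Deflate can only match the base64 copies that happen to share a phase, but every hex copy is byte-identical, so the hex stream compresses smaller despite being 2x instead of 1.33x before compression.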

What would be nice is an email/HTTP-resilient compressed format that could be stuck inside of JSON strings and easily recovered.


That being true, it might be worthwhile for compression algorithms to detect base64 encoding (or any other base for that matter) and reinterpret it as base16 in order to improve compression rates.


On the Internet with neural networks! #priorart


Many people have to deal with base64 codes, and these people order their CPUs by the cubic meter. Think Gmail etc.


Because of legacy protocols and format limitations: SMTP, POP3, embedded binaries in XML, HTML (data URLs).


Pretty much on point, save for the compression part... although the most-used 16-bit deflate would stumble big time.


what's a better alternative, assuming you need to encode binary data in a text stream?


Don't use a "text stream", use a binary protocol.


So just deny the reality of the original constraint? Boy, if only I could do that more often.


Your suggestion isn't helpful if I'm functioning within a system that doesn't support this and can't be upgraded.


However, my underlying channel isn’t 8-bit clean.


How does it kill compression?


Try it...

    wget http://mattmahoney.net/dc/enwik8.zip -O - | gunzip | head -c 1000000 | gzip -9 | wc -c
    355350
    wget http://mattmahoney.net/dc/enwik8.zip -O - | gunzip | head -c 1000000 | base64 | gzip -9 | wc -c
    528781
base64 is 48.8% larger after compression on English text, whereas it would only be 33.3% larger without compression. The reason is that compression finds and eliminates repeating patterns, but base64 can make the same input data look totally different depending on which of the 3 input alignments it has.
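You can see the alignment problem directly: prefixing a single byte changes every character of the base64 output, so identical payloads at different offsets share no encoded substrings.

```python
import base64

payload = b"hello world"
print(base64.b64encode(payload))         # b'aGVsbG8gd29ybGQ='
print(base64.b64encode(b"x" + payload))  # b'eGhlbGxvIHdvcmxk'
# the 11 shared payload bytes produce completely different characters
```

A match-based compressor looking at those two strings finds nothing to reuse, even though 11 of the 12 underlying bytes are identical.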


oddly enough lz/deflate are as ancient as base64. 16bit deflate is quite poor and slow, yet predominant.


Most compression algos operate on single-byte chunks of data and base64 encoding messes up the original byte alignment making the input appear more random than it is.


For one thing, it removes opportunities for more efficient content-aware compression elsewhere in the data path.


If we're going the content-aware route, can't the compression scheme just be Base64-aware?


How exactly do you make a base64-aware image compressor, though?


I don't follow. An image compressor is guaranteed not to be given Base64 input.

A general-purpose compression scheme can easily detect when its input is Base64, so I don't see why Base64 should be particularly hard to compress, in principle at least.
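As a toy sketch of that idea (the one-byte tag framing and function names are invented here, not any real compressor's format): detect input that parses as base64 and compress the decoded bytes instead.

```python
import base64
import binascii
import zlib

def smart_compress(data: bytes) -> bytes:
    """If the input parses as (canonical) base64, compress the decoded
    bytes and tag the result; otherwise compress the input as-is."""
    if data and len(data) % 4 == 0:
        try:
            decoded = base64.b64decode(data, validate=True)
            return b"\x01" + zlib.compress(decoded, 9)
        except (binascii.Error, ValueError):
            pass
    return b"\x00" + zlib.compress(data, 9)

def smart_decompress(blob: bytes) -> bytes:
    raw = zlib.decompress(blob[1:])
    return base64.b64encode(raw) if blob[:1] == b"\x01" else raw
```

One caveat a real implementation would have to handle: non-canonical base64 (odd padding bits, line breaks) doesn't survive a decode/re-encode round trip byte-for-byte, so the detector would need to be stricter or store a small repair record.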


Sometimes you have no choice but to transmit data in URLs.


What are good alternatives?



