
If base64 decoding speed makes a difference in your application, you should be considering why you are transmitting data in base64, a neat hack from the 1970s, which is non-human-readable, wastes 30% of the network bandwidth, kills compression, wastes RAM, is typically hard to random-access, and is generally a bad idea all round.


I agree. IMHO the biggest design fail and annoyance of JSON is that it has no good way to contain binary data.

The problems start with knowing the encoding (base64; for small sizes sometimes hex; for one or two bytes sometimes arrays, with all their ambiguities).

Then you can't easily differentiate between a string and a bytearray anymore (though you often shouldn't need to).

Then it becomes noticeably bigger, which is a problem especially for large blobs.

Then you always have to scan over this now-larger chunk to find its end (instead of just skipping ahead).

Then you have to encode/decode it, which might require additional annotations depending on the serialization library and can imply other annoyances.

The latter point can also, in some languages, lead to you accidentally using a base64 string as bytes or the other way around.

Well I guess that was enough ranting ;=)
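The round trip being complained about looks like this in Python (the `"payload"` key is just an illustrative name):

```python
import base64
import json

# raw bytes that are not valid UTF-8, so they can't go into a JSON string directly
blob = bytes([0x00, 0x9F, 0x92, 0x96])
doc = json.dumps({"payload": base64.b64encode(blob).decode("ascii")})

# the receiver has to know, out of band, that "payload" holds base64-encoded
# bytes; nothing in the JSON itself distinguishes it from an ordinary string
restored = base64.b64decode(json.loads(doc)["payload"])
```

Note that `json.loads` hands back a plain `str` either way; the bytes-vs-string distinction lives entirely in the application's head.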


If efficiency and compactness are your problem, JSON is not the best idea anyway; look at protobufs and their ilk.

JSON works where you need simplicity on the verge of being dumb, and human-readability.


JSON being human-readable is extremely nice. FWIW, I’ve found some binary formats like CBOR and MsgPack to be almost human-readable even in binary form once you are even slightly familiar with the format, or they convert almost instantly to readable JSON.

When I was developing a three component system, protos were not working for me. Too many schema changes and implementation differences to make the reduced binary size worth it. This was partly because the best protobuffer implementation in C (nanoPB) is just the hobby project of some dude, but mostly because coordinating schemas was annoying.


On the contrary, adding binary would have been terrible. JSON is great because it is simple!

Have you tried BSON or Protobufs to solve your annoyances? How did it go?


Protobuffs require sharing of a schema, then code generation for each language. Not ideal imo unless you really need the speed and the protos can be shared ahead of time.

Outside of Mongo, I haven’t seen BSON used anywhere.

CBOR however, is up and coming. Starting to look like it may be a first class citizen in AWS someday. On the IoT side they are already preferring it over JSON for Device Defender.


Base64 comes from the development of MIME in the early 1990s (see https://tools.ietf.org/html/rfc1341 section 5.2) though there are other similar encodings such as uuencode which predate it.


I agree with the spirit and most of the content of this comment, except for the hard to random access part. It's not really hard to random access base 64.


Theoretically, yes it's easy.

Practically, base64 encode a video file and tell me how exactly you're going to allow the user to seek to any place they like within that video using only common libraries on common platforms... Theoretically easy, practically hard enough nobody does it.


Practically it's easy too. Here's a piece of pseudocode I just wrote:

   chunkOffset = (byteIndex / 3) * 4      // integer division; every 3 input bytes become 4 base64 chars
   base64decode(base64data[chunkOffset : chunkOffset + 4])[byteIndex % 3]
This gets you the byte located at an arbitrary index of the original buffer while operating on the base64 data.
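To make that concrete, here's a runnable Python version of the same pseudocode (the helper name `byte_at` is mine, and this assumes standard-alphabet base64 without line breaks):

```python
import base64

def byte_at(b64data: str, byte_index: int) -> int:
    """Return the byte at byte_index of the original buffer,
    decoding only the one 4-char base64 group that contains it."""
    group = (byte_index // 3) * 4          # 3 input bytes -> 4 base64 chars
    chunk = b64data[group:group + 4]
    chunk += "=" * (-len(chunk) % 4)       # re-pad if we hit a trailing partial group
    return base64.b64decode(chunk)[byte_index % 3]

data = b"seekable binary payload"
encoded = base64.b64encode(data).decode()
print(byte_at(encoded, 9) == data[9])      # True
```

For a real seekable video you'd read a byte *range* the same way (round the range out to whole 4-char groups before decoding), which is why this works with plain file or HTTP range reads.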


1. Decode to bytes.

2. Do whatever you said.


If the data being base64 encoded will also then be compressed, it is better to base16 encode (hex) and then apply compression. Base16 is faster to encode and compresses better (probably compresses faster, didn't test that).

    2537012 Nov  6 10:10 test.tar.hex.gz
    2954608 Nov  6 10:03 test.tar.b64.gz
All the caveats, test with your data, etc.
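The effect can be reproduced without the tarball. This sketch uses a synthetic payload whose repeated block lands at a different 3-byte phase each time (so base64 encodes each copy differently, while hex encodes them all identically):

```python
import base64
import binascii
import random
import zlib

random.seed(0)
block = bytes(random.randrange(256) for _ in range(1000))
# a 1-byte separator shifts each successive copy to a different base64 phase
data = b"".join(block + b"!" for _ in range(30))

hex_gz = len(zlib.compress(binascii.hexlify(data), 9))
b64_gz = len(zlib.compress(base64.b64encode(data), 9))
print(hex_gz, b64_gz)   # hex is larger before compression but smaller after
```

Deflate can only match the base64 copies that happen to share a phase, but every hex copy is byte-identical, so the hex stream compresses smaller despite being 2x instead of 1.33x before compression.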

What would be nice is an email/HTTP-resilient compressed format that could be stuck inside of JSON strings and easily recovered.


That being true, it might be worthwhile for compression algorithms to detect base64 encoding (or any other base for that matter) and reinterpret it as base16 in order to improve compression rates.


On the Internet with neural networks! #priorart


Many people have to deal with base64 codes, and these people order their CPUs by the cubic meter. Think Gmail etc.


Because of legacy protocols and format limitations: SMTP, POP3, embedded binaries in XML, HTML (data URLs).


Pretty much on point, save for the compression part... although the most-used 16-bit deflate would stumble big time.


what's a better alternative, assuming you need to encode binary data in a text stream?


Don't use a "text stream", use a binary protocol.


So just deny the reality of the original constraint? Boy, if only I could do that more often.


Your suggestion isn't helpful if I'm functioning within a system that doesn't support this and can't be upgraded.


However, my underlying channel isn’t 8-bit clean.


How does it kill compression?


Try it...

    wget http://mattmahoney.net/dc/enwik8.zip -O - | gunzip | head -c 1000000 | gzip -9 | wc -c
    355350
    wget http://mattmahoney.net/dc/enwik8.zip -O - | gunzip | head -c 1000000 | base64 | gzip -9 | wc -c
    528781
base64 is 48.8% larger after compression on English text, whereas it would only be 33.3% larger without compression. The reason is that compression finds and eliminates repeating patterns, but base64 can make the same input data look totally different depending on which of the 3 input alignments it has.
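You can see the alignment problem directly: prefixing a single byte changes every character of the base64 output, so identical payloads at different offsets share no encoded substrings.

```python
import base64

payload = b"hello world"
print(base64.b64encode(payload))         # b'aGVsbG8gd29ybGQ='
print(base64.b64encode(b"x" + payload))  # b'eGhlbGxvIHdvcmxk'
# the 11 shared payload bytes produce completely different characters
```

A match-based compressor looking at those two strings finds nothing to reuse, even though 11 of the 12 underlying bytes are identical.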


oddly enough lz/deflate are as ancient as base64. 16bit deflate is quite poor and slow, yet predominant.


Most compression algos operate on single-byte chunks of data and base64 encoding messes up the original byte alignment making the input appear more random than it is.


For one thing, it removes opportunities for more efficient content-aware compression elsewhere in the data path.


If we're going the content-aware route, can't the compression scheme just be Base64-aware?


How exactly do you make a base64-aware image compressor, though?


I don't follow. An image compressor is guaranteed not to be given Base64 input.

A general-purpose compression scheme can easily detect when its input is Base64, so I don't see why Base64 should be particularly hard to compress, in principle at least.
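As a toy sketch of that idea (the one-byte tag framing and function names are invented here, not any real compressor's format): detect input that parses as base64 and compress the decoded bytes instead.

```python
import base64
import binascii
import zlib

def smart_compress(data: bytes) -> bytes:
    """If the input parses as (canonical) base64, compress the decoded
    bytes and tag the result; otherwise compress the input as-is."""
    if data and len(data) % 4 == 0:
        try:
            decoded = base64.b64decode(data, validate=True)
            return b"\x01" + zlib.compress(decoded, 9)
        except (binascii.Error, ValueError):
            pass
    return b"\x00" + zlib.compress(data, 9)

def smart_decompress(blob: bytes) -> bytes:
    raw = zlib.decompress(blob[1:])
    return base64.b64encode(raw) if blob[:1] == b"\x01" else raw
```

One caveat a real implementation would have to handle: non-canonical base64 (odd padding bits, line breaks) doesn't survive a decode/re-encode round trip byte-for-byte, so the detector would need to be stricter or store a small repair record.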


Sometimes you have no choice but to transmit data in URLs.


What are good alternatives?



