Hacker News

What makes tar files better than zip files? Why should anyone prefer .tar.xz to an even more commonly used format that has even broader support?


There’s a very good reason to prefer .tar.gz (or xz or whatever) to .zip: tar.gz files deliver better compression (ranging from “marginally better” to “significantly better” depending on what you’re compressing).

In a .zip, the files are each compressed individually using DEFLATE and then concatenated to create an archive, whereas in a .tar.gz the files are first concatenated into a .tar archive and then DEFLATEd all together.

Because of this, a .tar.gz often achieves much better compression on archives containing many small files, because the compression algorithm can eliminate redundancies across files. The downside is that you can't decompress an individual file without decompressing every preceding file in the stream, because DEFLATE does not support random access. (So tar's lack of an index is no real loss here; an index wouldn't help when you can't seek anyway.)

This is why e.g. open source software downloads often use .tar.gz. A source code archive has hundreds or thousands of tiny text files with a ton of redundancy between files in variable & function names, keywords, common code snippets/patterns, etc., so tar.gz delivers significantly better compression than zip. And there’s little use for random access of individual files, since all the files need to be extracted in order to compile the program anyway.
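A quick sketch of that size difference, using Python's stdlib and synthetic "source files" (the filenames and file count are made up for illustration; exact numbers will vary):

```python
# Compare zip vs tar.gz on many small, redundant files. zip DEFLATEs each
# file separately; tar.gz DEFLATEs the concatenated stream, so redundancy
# shared across files gets eliminated.
import io, tarfile, zipfile

# 200 small "source files" with heavy cross-file redundancy
files = {f"src_{i}.py": b"def handler(request):\n    return render(request)\n" * 5
         for i in range(200)}

zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)  # each entry compressed on its own

tgz_buf = io.BytesIO()
with tarfile.open(fileobj=tgz_buf, mode="w:gz") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))  # concatenated, then gzipped once

print(len(zip_buf.getvalue()), len(tgz_buf.getvalue()))
```

On inputs like this the .tar.gz comes out dramatically smaller, because gzip sees the near-identical files back to back.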

The name (short for "tape archive") may be anachronistic nowadays, but the performance characteristics of a tape drive (namely, fast sequential access but absolutely awful random access) coincide with the performance characteristics of compression algorithms. So an archive format designed for packaging files up to be stored on a tape is perfect for packaging files up to be compressed.


The trade-off is that it takes a long time to extract a single file from a solid archive. The 7z format supports a "solid block size" parameter for this reason (for all supported compression algorithms, AFAICT), which can be set to anything from "compress all files individually" to "size of the whole archive".


I wasn't saying tar is better than zip (though if you want to read about issues with the zip file format, have a look at this post: https://games.greggman.com/game/zip-rant/ -- HN discussion: https://news.ycombinator.com/item?id=27925393). I'm just saying there's nothing about the tar file format itself that makes you loathe it, you just loathe that Windows doesn't support it. That's not a problem with tar.

I don't even know if it's true that tar is less common than zip. I know zip is incredibly common, but truly, so is tar. In the UNIX world, _everything_ happens through tar. And as someone who almost exclusively operates in the UNIX world, I interact with zip files very rarely, while I work with tarballs all the time.

Just don't blame the format for a deficiency in your operating system, that's all I'm saying.


To be fair, I almost never come across tar files. Most cross-platform software provides .tar.gz for Linux/macOS and .zip for Windows.

Should Windows have native support for .tar.gz files? Maybe! Maybe not. I dunno. But when I come across something using that format for Windows, what it really signals is half-assed Windows support. Which isn't the end of the world. But it's rarely a good sign.


Ah, yeah, that's completely fair. If someone is making an archive _for Windows users_, that archive should absolutely be a zip. A tarball definitely sends the signal that Windows users aren't the primary audience. Sometimes that's okay, sometimes it's a sign of a really shoddy port.


Sort of. I wouldn't make a separate source archive for Windows users - anyone who can compile stuff will manage to install 7-Zip or another archiver that handles .tar, and solid compression wastes less space and bandwidth. For Windows-specific archives, .zip is a no-brainer though.


7-Zip for Windows is always something I go for on a fresh install. Then I also have rar support. But screw rar.

Also on a fresh install I install WSL, so tar is always available that way too.


.tar.gz is a tar file, it’s just gzipped afterwards.
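You can see this layering directly with Python's stdlib (a sketch; the filename `a.txt` is just an example):

```python
# A .tar.gz is just a tar stream run through gzip: ungzipping it yields
# a byte-for-byte ordinary .tar archive.
import gzip, io, tarfile

# build a plain tar in memory
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w") as tf:
    data = b"hello\n"
    info = tarfile.TarInfo("a.txt")
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))

# gzip it "afterwards" -- that's the whole .tar.gz format
targz = gzip.compress(tar_buf.getvalue())

# gunzip gives back the exact tar bytes
assert gzip.decompress(targz) == tar_buf.getvalue()

# tarfile's "r:gz" mode just performs the same two steps internally
with tarfile.open(fileobj=io.BytesIO(targz), mode="r:gz") as tf:
    content = tf.extractfile("a.txt").read()
print(content)
```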


Technically yes, but if you care about user experience you should treat it as a single compressed archive, not as one archive nested inside another the way e.g. 7-Zip does.


For unixy systems, zip doesn't carry sufficient metadata (permissions, owners, symlinks). I suspect that is almost the entire reason zip isn't used more on such systems. It is still used quite a bit when that doesn't matter.
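For illustration, tar entries carry that metadata as first-class fields. A sketch with Python's `tarfile` (the paths and owner names are made up):

```python
# Tar headers store mode/uid/gid/uname/gname, and symlinks are their own
# entry type rather than a copied file.
import io, tarfile

info = tarfile.TarInfo("bin/tool")
info.mode = 0o755                  # executable bit preserved
info.uid, info.gid = 1000, 1000
info.uname = info.gname = "alice"

link = tarfile.TarInfo("bin/alias")
link.type = tarfile.SYMTYPE        # a real symlink entry
link.linkname = "tool"

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    tf.addfile(info, io.BytesIO(b""))
    tf.addfile(link)

# round-trip: the metadata survives in the archive
buf.seek(0)
with tarfile.open(fileobj=buf) as tf:
    tool = tf.getmember("bin/tool")
    alias = tf.getmember("bin/alias")
print(oct(tool.mode), tool.uname, alias.issym(), alias.linkname)
```

Classic zip has only a DOS attribute field plus host-dependent "external attributes", which is why unix permissions and symlinks so often get mangled in zips.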


Why should you couple your archive format to a compression algorithm?


Well, there are reasons. If your archive format handles compression, it can be designed in such a way that you can seek and extract only parts of the archive. If the archive format doesn't handle compression, you're dependent on reading through the archive sequentially from start to finish.

That's not to say tar is wrong to not have native compression, it's just one reason why it's not crazy for archive formats to natively support compression.
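Zip is the classic example: it keeps a central directory at the end of the file, so a reader can jump straight to one member. A sketch with Python's `zipfile` (synthetic filenames):

```python
# Because each zip member is compressed independently and indexed by the
# central directory, reading one file never touches the others.
import io, zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for i in range(1000):
        zf.writestr(f"file_{i}.txt", f"contents of file {i}\n" * 100)

with zipfile.ZipFile(buf) as zf:
    # seeks directly to this entry; the other 999 are never decompressed
    data = zf.read("file_500.txt")
print(len(data))
```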


I’m semi-sure that this is possible with .tar.gz files already. I’ve used vim to view a text file within a few different rather large archives without noticing the machine choke up on extracting several gigs before showing content. Certainly nothing was written to disk in those cases.


.tar.gz files can only be read sequentially, but there are optimizations in place on common tools that make this surprisingly fast as long as there's enough memory available to essentially mmap the decompressed form. The problem is bigger with archives in the tens of GB (actually pretty common for tarballs since it's popular as a lowest-common-denominator backup format) or resource-constrained systems where the swapping becomes untenable.
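Python's `tarfile` makes the sequential nature visible with its streaming mode. A sketch (file names and counts are invented):

```python
# "r|gz" is a pure forward stream: members arrive in archive order, and
# reaching the one you want means decompressing everything before it.
import io, tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tf:
    for i in range(500):
        data = f"file number {i}\n".encode() * 50
        info = tarfile.TarInfo(f"file_{i}.txt")
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

buf.seek(0)
wanted = None
with tarfile.open(fileobj=buf, mode="r|gz") as tf:
    for member in tf:              # scans (and decompresses) sequentially
        if member.name == "file_400.txt":
            wanted = tf.extractfile(member).read()
            break                  # files 0-399 were still decompressed

print(len(wanted))
```

The "r|" stream modes never seek, which is exactly why they also work on pipes and tape drives.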


There are extensions for gzip that can make it coarsely seekable, I wouldn't be surprised if some archive tools used that.



