The Internet Archive is a wonderful thing. I recovered much of my web site from it when my server burned in a fire; that was cheaper than the $2400 it would have cost to try to pull the data off the melted hard drives. It has also provided fodder for a ton of lawsuits, of the patent/IP/he-said-vs-she-said varieties.
Given the latter use, and the subsequent 'retro-takedowns' that have occurred on the archive, I wonder if there is a market for 'a copy of the archive right now' that would be hard to retroactively modify. And I wonder what the legal theory would be around having a tape archive of something that was 'clear' at the time you took it but was later 'redacted'. Could you use your copy of the unredacted information?
I wonder whether archive.org actually deletes things when they are taken down, or just makes them inaccessible. Likewise when archive.org takes content down because of a new robots.txt file that didn't exist on the original site, as often happens with domain squatters.
They only disable access, not delete. There was once a court case where the defendant got the court to compel the plaintiff to alter their robots.txt in order to allow the defendant to gather evidence from the archive. (Apparently, the Archive managed to convince the court that manually producing the info hidden in the archive would be too onerous.)
I think they make items "dark" for, say, 70 years, so it's inaccessible to the public. The thought of actual deletions is kind of alien to their goals.
I do wonder what the best form of compression would be and, given that they're web pages, whether some form of custom compression optimised for HTML would be useful (a rough sketch of the idea follows below).
Then again, with that volume of data, what would CERN do for storage and access of a data pool that size while keeping it usable?
The reason being, if you wanted to back that lot up and ship copies around for research purposes, then with today's technology even the biggest humble memory stick would fail to handle just the file index at this scale. Scary amount of data. But certainly a data set many of us would like to play with and try things out on, being the geeks we are.
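To make the HTML-specific compression idea concrete, here is a minimal sketch using zlib's preset-dictionary feature. The boilerplate dictionary and the sample page below are made up for illustration; this is not how the Archive actually stores pages.

    import zlib

    # A preset dictionary of boilerplate that appears in almost every HTML page.
    # (Purely illustrative -- a real dictionary would be trained on a large corpus.)
    html_dict = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
                 b'<title></title></head><body><div class=""><p></p></div>'
                 b'</body></html>')

    page = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
            b'<title>Example</title></head><body>'
            b'<div class="content"><p>Hello, archive!</p></div>'
            b'</body></html>')

    plain = zlib.compress(page, 9)

    co = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                          zlib.DEF_MEM_LEVEL, zlib.Z_DEFAULT_STRATEGY,
                          zdict=html_dict)
    with_dict = co.compress(page) + co.flush()

    print(len(page), len(plain), len(with_dict))  # dictionary version is usually smallest

    # Decompression needs the same dictionary:
    do = zlib.decompressobj(zdict=html_dict)
    assert do.decompress(with_dict) == page

Shared-dictionary schemes along these lines are roughly how Brotli's built-in dictionary squeezes extra savings out of small web pages.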
CERN has a variety of techniques for accessing, storing, and backing up their data. I know from experience that some experiments, or at least some parts of experiments, use layers and layers of abstraction, like Scalla/xrootd, which operates on clusters of servers directly or with NFS.
In addition to this, there are levels of processed data. For example, raw data is usually level 0, basic processed data is level 1, and processed and calibrated data is usually level 2, etc., but experiments often have different definitions for each level. Reprocessing of any level can happen at any time, although level 1 reprocessing is usually an extremely intensive operation because it operates on the largest amount of data.
Level 0 data is usually heavily compressed when it's left on disk, because it's typically the largest amount of data, but also least touched.
Most scientists will use level 2 or level 1 data. This data will be on low latency clusters.
So, while CERN has petabytes of data, typically a fraction of that will be accessible.
In the past, level 0 data was often left on tape. While the raw data is still backed up to tape (I know this is the case for ATLAS), many experiments with large amounts of data might leave it on lower-cost HDDs in simple RAID arrays for redundancy and not worry about performance so much. The BaBar experiment has done this for its long-term data analysis.
In addition to all of this, it's still occasionally easier to transfer large amounts of data via tape instead of the internet.
Back when I was involved, the compression was gzip -- each archived page (HTTP headers + body) was separately gzipped and the members concatenated into a ~100M file, though I believe they're now doing 1G files. (This cleverly allows retrieving an individual document by seeking to its offset and doing gzip decompression from there.)
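A toy sketch of that seek-and-gunzip trick (made-up records, not the Archive's actual ARC/WARC tooling): because every record is a self-contained gzip member, you can start decompressing at any record's offset and stop at the end of that member.

    import gzip, zlib

    # Build a toy "archive file": each record (headers + body) is gzipped
    # separately and the members are simply concatenated. We remember each
    # record's byte offset as we append it.
    records = [b"HTTP/1.1 200 OK\r\n\r\n<html>page one</html>",
               b"HTTP/1.1 200 OK\r\n\r\n<html>page two</html>"]

    archive = bytearray()
    offsets = []
    for rec in records:
        offsets.append(len(archive))     # where this member starts in the file
        archive += gzip.compress(rec)    # one self-contained gzip member

    # To fetch record #0, seek to its offset and inflate just that member.
    # A decompressobj in gzip mode stops at the end of the first member it
    # sees, so nothing after it needs to be read or decompressed.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    first = d.decompress(bytes(archive[offsets[0]:]))
    print(first)   # only record #0; the bytes after it land in d.unused_data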
They have significant amounts of inline images and other files in the web archive, which of course typically are already compressed.
Google does something interesting when you use google.com to convert that number to, say, gigabytes, with a search like: 10,000,000,000,000,000 bytes to gigabytes
The result is 9.31323e6.
Notice the 'e': because it's in the same font, your eye might miss it (mine did), and then you'd say to yourself that about 9 gigabytes is so small, who cares. But if you do the same search again, this time converting to petabytes, you'll realize it was an 'e' in the gigabyte number...
It's exactly 10 petabytes. Giga/peta are SI prefixes and hence are in base 10.
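A quick sanity check of those numbers in Python (just arithmetic; reading Google's 9.31323e6 figure as binary gigabytes is my inference):

    n = 10_000_000_000_000_000   # the 10^16 bytes from the announcement

    print(n / 10**15)   # 10.0            -> exactly 10 petabytes (SI, base 10)
    print(n / 10**9)    # 10000000.0      -> 10^7 decimal gigabytes (GB)
    print(n / 2**30)    # 9313225.746...  -> ~9.31323e6 binary gigabytes (GiB),
                        #                    matching the figure quoted above
    print(n / 2**50)    # 8.881784...     -> ~8.88 pebibytes (PiB)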
BTW, take Google's conversions with a grain of salt. As I understand it, Google's conversion feature was just a minor project and not something that is maintained to a high standard. For these things, Wolfram Alpha is almost always better, because it's exactly the kind of thing Alpha was designed to do.
That number has a lot of zeros at the end, but really, it's just barely getting started. This may be the greatest electronic civilization archival project in human history, but it is also the smallest and most impoverished. It has a lot of growing left to do.
> On Thursday, 25 October, hundreds of Internet Archive supporters, volunteers, and staff celebrated addition of the 10,000,000,000,000,000th byte to the Archive’s massive collections.
So, I bet everyone is dying to know, what was the 10,000,000,000,000,000th byte then?
Interesting note at the bottom for those who may have missed it (Donald Knuth on the organ is very distracting):
>The only thing missing was electricity; the building lost all power just as the presentation was to begin. Thanks to the creativity of the Archive’s engineers and a couple of ridiculously long extension cords that reached a nearby house, the show went on.
I think by now we have more or less standardized on 10^15 bytes as a petabyte, as this is consistent with the SI system.
For the powers-of-2-based approach, the -bi- prefixes are now well established. So: 1 KiB = 2^10 bytes, 1 MiB = 2^20 bytes, 1 GiB = 2^30 bytes, ..., 1 PiB = 2^50 bytes.
> Also, you have a typo, 1 Petabyte = 10^15 bytes I believe.
Thanks!
The power-of-2 system was quite popular for a long time when applied to computers, but it always clashed with the much older (and well-thought-out) SI system, in which everything is defined in powers of 10 and which is used everywhere else (think physics, ...). I think hard drive manufacturers were the first to switch their definitions to the decimal-based system (for the obvious reason that their disks would then seem bigger to people accustomed to the common power-of-2 convention...). A decade or so ago the kibi, mebi, ... prefixes were introduced, and at some point people more or less switched to those. There is an even more detailed article about this at https://en.wikipedia.org/wiki/Binary_prefix .
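For illustration, a little helper of my own (not from any particular library) that formats a byte count with either SI or binary prefixes:

    # Format a byte count using SI (base-10) or binary (base-2) prefixes.
    SI_UNITS  = ["B", "kB", "MB", "GB", "TB", "PB"]
    BIN_UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]

    def human_size(n, binary=False):
        base = 1024 if binary else 1000
        units = BIN_UNITS if binary else SI_UNITS
        x = float(n)
        i = 0
        while x >= base and i < len(units) - 1:
            x /= base
            i += 1
        return f"{x:.2f} {units[i]}"

    n = 10**16
    print(human_size(n))               # 10.00 PB  (SI)
    print(human_size(n, binary=True))  # 8.88 PiB  (binary)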