Hacker News

FYI - there’s an official standard (MHTML) for doing this that has existed for 20+ years and exists natively in browsers.

https://en.m.wikipedia.org/wiki/MHTML



> FYI

The alternative format (used by the Internet Archive and its Wayback Machine) is WARC. It's also a single file, but it preserves the HTTP headers as well, so it is aimed specifically at archival use. [1] The "wget" tool, which is co-maintained by the Web Archive people, also supports it via CLI flags. [2]

Though when it comes to mobile browser support I'd recommend using MHTML, because WebKit and Chromium both support it upstream.

[1] http://iipc.github.io/warc-specifications/

[2] https://www.gnu.org/software/wget/wget.html
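
To make the "preserves the HTTP headers" point concrete, here is a minimal Python sketch of what a single WARC "response" record looks like on disk. It is hand-rolled for illustration only and is not how wget or the Internet Archive write their files; the function name `build_warc_response_record` and the example URL are made up, and a real archive would also carry `warcinfo` and `request` records plus digests.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Serialize one WARC 'response' record wrapping a raw HTTP response.

    Hypothetical helper for illustration; only the core header fields are set.
    """
    warc_headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_response))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in warc_headers)
    # Layout: WARC headers, blank line, the untouched HTTP response
    # (status line and HTTP headers included), then two CRLFs to end the record.
    return head.encode("utf-8") + b"\r\n" + http_response + b"\r\n\r\n"


# Assumed example input: a tiny HTTP response as it came off the wire.
raw_http = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>hello</body></html>"
)
record = build_warc_response_record("http://example.com/", raw_http)
print(record.decode("utf-8"))
```

The thing to notice is that the HTTP status line and headers sit inside the record body untouched, which is what MHTML and plain "save as HTML" throw away.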


WARC is also used by the Webrecorder project. They made an app called Wabac which does entirely client-side WARC or HAR replays using service workers and it seems to have pretty good browser support, but I haven't really dug into the specifics.

https://github.com/webrecorder/wabac.js-1.0


There is a project that uses a headless browser to implement HAR.

https://github.com/wabarc/screenshot


Is there any objection to adding WARC support to webkit/chromium? Seems like a not-so-complex project...


I know that WebKit relies on either libsoup [1] (on Linux/Unices) or curl [2] (legacy Windows and maybe WPE(?)) as a network adapter, so the header handling and parsing mechanisms would have to be implemented in there.

Though, on macOS, WebKit tries to migrate most APIs to the Core Foundation framework, which makes it kind of impossible to implement as a non-Apple employee because it's basically a dump-it-and-never-care open source approach. [3]

Don't know about chromium (my knowledge is ~2012ish about their architecture, and pre-Blink).

[1] https://github.com/WebKit/WebKit/tree/main/Source/WebKit/Net...

[2] https://github.com/WebKit/WebKit/tree/main/Source/WebKit/Net...

[3] https://github.com/opensource-apple/CF


GTK/WPE use libsoup. PlayStation and Windows use curl. And yes, Apple's networking is proprietary.


I wasn't sure about WPE in regards to libsoup due to the glib dependencies and all the InjectedBundle hacks that I thought they wanted to avoid.

I mean, in principle curl would run on the other platforms, too... but as far as I can tell there's an initiative to move as much as possible to the CF framework (strings, memory allocation, HTTPS and TLS, sockets, etc.) and away from the cross-platform implementations.


Over a decade ago I had a laptop but no internet at home. One of the ways I taught myself programming (and also downloaded dozens of manga) was by using Internet Explorer at a cafe. It had an option to save pages as MHTML, which gave me one self-contained file per page. I legit owe a portion of my success to this. I still have some of these files: old, crusty hello-world C++ tutorials, etc.


I have fantastic internet, and I still do something similar. Local docs just load so much faster, and if something happens (which it still does, even on Fiber in the US), I have docs and can program.

Lemme see if I can pull up the command I use to mirror doc sites.

    wget \
      --recursive \
      --level=5 \
      --convert-links \
      --page-requisites \
      --wait=1 \
      --random-wait \
      --timestamping \
      --no-parent \
      "$1"


For people who cannot afford internet access now, and for perhaps more in the future if times get more difficult, I believe this is a very important use-case.


The Chrome engineer who maintains the MHTML work wrote up a comprehensive doc on the modifications to the MHTML spec (RFC 2557) that are implemented: https://docs.google.com/document/d/1FvmYUC0S0BkdkR7wZsg0hLdK... Might be useful for you, gildas.


Thank you Paul! I had read this document some time ago, especially to see how the shadow DOM was serialized.


The browser compatibility section suggests MHTML is unsupported in current versions of Firefox and Safari.


I don't think it was ever native in Firefox. There is/was the excellent unMHT extension, which was broken by Quantum/WebExtensions and The Great XUL Silliness. Shame.

I have Waterfox-Classic and unMHT (fished out of the Classic Addons Archive, just remember to turn off Waterfox's multiprocess feature) since I occasionally need to archive web pages - and more importantly, reopen them later.

MHTML is just MIME: literally every discrete URL as a MIME part with its origin in a Content-Location header, all wrapped in a multipart container. I don't understand why it's not a default format.
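
To illustrate the "it's just MIME" point, here is a rough Python sketch that builds and reads back an MHTML-style file using nothing but the standard `email` package. The helper names `build_mhtml` and `list_parts`, the URLs, and the placeholder image bytes are all invented for the example; files saved by real browsers carry extra top-level headers and typically use different transfer encodings.

```python
from email import message_from_bytes, policy
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(resources):
    """Wrap (url, MIME part) pairs in one multipart/related container."""
    container = MIMEMultipart("related", type="text/html")
    for url, part in resources:
        # Each part records the URL it was fetched from.
        part["Content-Location"] = url
        container.attach(part)
    return container.as_bytes()


def list_parts(mhtml_bytes):
    """Return (Content-Location, content type) for every part in the file."""
    msg = message_from_bytes(mhtml_bytes, policy=policy.default)
    return [
        (str(part["Content-Location"]), part.get_content_type())
        for part in msg.iter_parts()
    ]


# Assumed example resources: one HTML page and one (fake) PNG.
page = MIMEText("<html><body><img src='logo.png'></body></html>", "html", "utf-8")
logo = MIMEImage(b"\x89PNG placeholder bytes", _subtype="png")

mhtml = build_mhtml([
    ("http://example.com/", page),
    ("http://example.com/logo.png", logo),
])

for location, content_type in list_parts(mhtml):
    print(location, content_type)
```

Any generic MIME parser can open the result, which is a large part of why the format has aged reasonably well.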


I can see WebExtensions breaking it (as it's a completely new set of APIs for extensions, and the losses do definitely still hurt)... but quantum/xul? How is that related, aside from "it happened around the same time"?


IANA firefox dev: XUL/XPCOM = old APIs, WebExtensions = new (multi-browser) API

Quantum was the project name for re-engineering Firefox internals, with lots of design changes, not just plugins. XUL/XPCOM APIs were dropped. As an occasional programmer I understand why, but "Quantum broke my plugins" is a reasonable first approximation for most users.


Safari supports webarchive, which does basically the same thing


The problem is that it is a proprietary format. The advantage of the format produced by SingleFile (HTML) is that as long as your browser is capable of interpreting HTML, you will be able to read your archives without worries.


Not so proprietary. It's really just a plist file, whose format is known and even open-sourced by Apple [1]. Really it's only proprietary in that no other platforms have implemented it.

[1]: https://opensource.apple.com/source/CF/CF-550/CFBinaryPList....
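
As a rough illustration of the "just a plist" claim, Python's standard `plistlib` can read binary plists directly. The sketch below builds a tiny stand-in archive in memory and reads it back. The key names (`WebMainResource`, `WebSubresources`, `WebResourceURL`, `WebResourceMIMEType`, `WebResourceData`) are what Safari-generated files are commonly reported to use, so treat them as an assumption rather than a spec; `summarize_webarchive` is a made-up helper.

```python
import plistlib


def summarize_webarchive(data: bytes):
    """Return the main resource URL and the URLs of all subresources.

    Key names are assumed from commonly observed Safari output.
    """
    archive = plistlib.loads(data)  # auto-detects binary vs. XML plists
    main = archive["WebMainResource"]
    subresources = archive.get("WebSubresources", [])
    return main["WebResourceURL"], [s["WebResourceURL"] for s in subresources]


# Stand-in for a real .webarchive file, built in memory for the example.
sample = plistlib.dumps(
    {
        "WebMainResource": {
            "WebResourceURL": "http://example.com/",
            "WebResourceMIMEType": "text/html",
            "WebResourceData": b"<html><body>hello</body></html>",
        },
        "WebSubresources": [
            {
                "WebResourceURL": "http://example.com/style.css",
                "WebResourceMIMEType": "text/css",
                "WebResourceData": b"body { margin: 0 }",
            }
        ],
    },
    fmt=plistlib.FMT_BINARY,
)

main_url, sub_urls = summarize_webarchive(sample)
print(main_url, sub_urls)
```

With a real file you would pass the bytes of the `.webarchive` instead of `sample`.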


For anyone else that didn't read the README, MHTML is mentioned in the comparison section https://github.com/gildas-lormeau/SingleFile#file-format-com...


Take the comparison with a grain of salt. Not including WARC is like excluding water from a comparison of beverages; it is the baseline standard.


> MHTML, (...) is a web page archive format used to combine, in a single computer file, the HTML code and its companion resources (such as images, Flash animations, Java applets, (...)

Well that goes to show its longevity I guess.


Does anyone else get two security warnings whenever you try to save an MHTML page using a Chrome extension? I have to click on one warning's button to confirm that I indeed want to save the "dangerous" file and another to confirm I'm really sure. It's gotten very annoying. I've looked all over for an option to disable this behavior but haven't been able to.


I’ve looked into this extensively, as I can’t find a good, light, easy backup option that isn’t extreme overkill.

I thought MHTML was NOT standardized, which is why it wasn’t supported across all browsers yet. From what I remember, every company was doing its own implementation of it. Maybe it’s gotten more standardized in the last few years though.


I've always thought the "M" stood for "Microsoft" -- wasn't even aware any browsers other than IE supported it.


There is also CHM, which actually is a Microsoft-only file format for "Compiled HTML Help" files.


I love this format. Very fast and compact. Entire Visual Studio help was in it once. Worked VERY well. And there's a KDE/Qt reader.


And it generally does not do a good job


What are the issues?


The big one in my experience is that it doesn't play well at all with JavaScript. SingleFile, to my knowledge (I experimented with it briefly), lets all the JS load on the page and can then embed the loaded media as base64. I think it also has heuristics to embed relevant JS as well. It still only gets you 90% of the way there, and I came to the conclusion that unless you are doing web-archive-type work or need audio/video, a composite image works well.
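
For anyone wondering what "embed loaded media as base64" means in practice, here is a toy Python sketch of the general data-URI technique. It is not SingleFile's actual code (which is JavaScript and far more thorough); the `inline_images` helper and the regex-based matching are simplifications assumed for the example, and a real tool would walk the DOM instead.

```python
import base64
import re


def inline_images(html: str, resources: dict) -> str:
    """Replace <img src="..."> URLs with base64 data URIs.

    `resources` maps a URL to (mime_type, raw_bytes). URLs that are not in
    the map are left untouched. Toy regex, only handles double-quoted src.
    """
    def replace(match: re.Match) -> str:
        url = match.group(2)
        if url not in resources:
            return match.group(0)
        mime_type, data = resources[url]
        encoded = base64.b64encode(data).decode("ascii")
        return f"{match.group(1)}data:{mime_type};base64,{encoded}{match.group(3)}"

    return re.sub(r'(<img\b[^>]*\bsrc=")([^"]*)(")', replace, html)


# Assumed example page with one known and one unknown image.
html = '<p><img src="a.png"><img src="missing.png"></p>'
result = inline_images(html, {"a.png": ("image/png", b"hi")})
print(result)
```

The output is still ordinary HTML, so any browser can open it, at the cost of roughly a third more bytes per embedded resource.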


From my experience: wrong layout, missing pictures.


Unfortunately mhtml is not widely supported.


I remember saving webpages in MHTML when I was using dial-up so that I could read them offline later.

I would also download entire websites, using software whose name I forgot, to read them offline. Back when websites fit on a single floppy disk.

Good times!


I remember using HTTrack for this a while back. Still have a few of those sites lying around, I think.


I was gonna say Opera (the old, good one) had this. When saving a page there were some options and one was a single file IIRC.


I use this Chrome extension to save web pages as MHTML: https://chrome.google.com/webstore/detail/save-webpages-offl...


IIRC, back in the day MHTML wouldn’t save Java applets.


Are any sites still using applets these days?


80% of server IPMI web control panels. But who would want to save those anyway? :)


A lot of those are getting HTML5/Canvas based implementations and most of the old AST BMCs can get it through upgraded firmware.


None of my machines had any such upgrades and never will :(



