Hi! I'm one of the programmers at Gutenberg.
We've been improving the site a lot over the past few months (and more is coming!).
If you haven't visited the page recently, it's worth checking out again: https://www.gutenberg.org/
Have you considered having a detailed version history for each book (etext)? The process of submitting fixes to typos etc in books involves sending an email (https://www.gutenberg.org/help/errata.html) and although the last time I did this (2011) the fixes did get applied reasonably quickly (couple of days), it all felt a bit opaque. The version history could also include the project (usually PGDP correct?) the etext originated from; that way one would be able to compare against the actual page scans.
I have very mixed feelings about Standard Ebooks and would much prefer being able to use Project Gutenberg directly, but one good thing Standard Ebooks does is that every book has an associated git repository (on GitHub), so it's (in principle) possible to see a history of fixes to the text over time.
We're using git repos internally to keep history for each book. They existed on GitHub for a while, but our implementation was awkward, and too big a project for the volunteer dev team. But it's likely that we'll evolve towards that.
I was hoping to reply to this in detail but as I never got around to it, I'll keep it short: mostly it's about the editorial changes they make to the text, modernizing spelling etc. Many of the changes are unjustified IMO, and often detract from the charm of the original, and I'm uncomfortable reading a text I know has been tampered with in this way. Of course it's their project and they can do whatever they want, and they clearly love books, so with strong opinions there will be some that I may disagree with. I'd much rather read books from Project Gutenberg or Wikisource, both of which don't even correct obvious typos without marking up in some way that they've done so.
I also have many positive things to say about Standard Ebooks, but I don't think you were asking about those. :)
----
Edit: Without going into what I think are the most egregious sort of changes they introduce (which I think will require a longer post) and limiting myself to ones easy to find immediately:
See the earlier discussion (linked in a sibling comment here) where the editor-in-chief says it's ok to change punctuation because "The sounds out of his mouth do not include an apostrophe whether it's there in the spelling or not." (a very American view IMO): https://news.ycombinator.com/item?id=16956931
And looking at a recent commit on one of their books, here's a recent (https://github.com/standardebooks/agatha-christie_the-secret...) revert of one of their aggressive "modernizations" from 2024 (https://github.com/standardebooks/agatha-christie_the-secret...), that had, in line with their usual practice, changed "every one" to "everyone" (in one place even when referring to "a good many risks"), and the same commit made other changes (including one still present) like "they ought to have it lithographed. It must be a frightful nuisance doing every one separately." having the last four words turned into "doing everyone separately."!
On the “every one” example, that’s a definite mistake that shouldn’t have made its way into the book in the first place. The production process has a specific step for “every one” (https://standardebooks.org/contribute/producing-an-ebook-ste...) that guides producers through making the correct choice when modern usage offers two different possibilities. It shouldn’t have happened, but at least it’s a mistake that was fixed.
Your comment makes it sound as though the mistake was introduced by an inexperienced contributor who did not read the guide, when in fact it was introduced by the founder/editor-in-chief of the project. :) And in case it wasn't clear, only one of the mistakes was reverted, and the other one I quoted is still present in the book even as of this moment.
More broadly, the position of Standard Ebooks is that a modern reader would be distracted by spellings like "some one" and "every thing", and by time written like "2.30" instead of "2:30", and that books in British quotation style must be converted to American quotation style. I think most readers can in fact tolerate such small differences, and this position is frankly insulting — the punctuation and spelling of works are part of their character, and if anything, I'm more distracted by such anachronisms in style introduced as part of the Standard Ebooks process.
And to be honest, that position is totally reasonable, and the good thing is that you have the option of Gutenberg, Faded Page, and a bunch of other archival sites, also for free, if you don’t want that.
But nearly all print publishers also do what SE does. Why do you think they do, when it costs additional money and time to do that? A reasonable answer is that some, or a majority of, people prefer it.
> But nearly all print publishers also do what SE does.
Do they? To check, I tried to find a recent publication of Agatha Christie, and found the collection “Country Christie: Twelve Devonshire Mysteries” which says “First published by HarperCollins Publishers Ltd 2025”. It still has British-style punctuation (throughout the book), and times like “1.30”, “9.30”, “11.30”, “7.30 a.m.”, “12.30 p.m.”, and “8.30”. I checked a couple of other recent publications and admittedly they do modernize (though not in phrases like “every one of you”), but again I found the collection “The Last Seance: Haunting Tales from the Queen of Mystery” (2019) which does not. So it seems mixed.
In any case, I think it's fine to do what Standard Ebooks does, and if it were instead called something like “Modernized Ebooks with American punctuation”—if readers would know before picking one up—it would be totally unobjectionable. The name “Standard” gives the wrong impression. It's a bit like colorizing old black-and-white movies (or dubbing foreign-language movies instead of subtitling them): yes possibly even a majority of people may prefer it, but IMO it would be good to be more explicit what has been done.
It splits the community and the number of possible volunteer hours, for one. It also splits the canon into different versions. More projects fight for the attention (and possibly donations) of the audience.
There are lots of reasons it could be preferable to centralize. OTOH their mission is limited and some competition is healthy, if only to explore alternative ways to do things.
PG focuses on an accurate digital translation of the source material, sometimes hosting multiple different versions of the same text, and doing things like putting work into recreating the adverts at the back of some novels.
SE focuses less on preservation and more on making readers’ versions of the texts, like other publishing imprints. So there’s typography standardisation, a light-touch modernisation of hyphenation and soundalike spelling, and things like author-wide collections of short fiction and poetry even if they didn’t previously exist.
Both are valuable, but they serve different segments.
Not the GP, but I also have mixed feelings about Standard Ebooks. They modernise texts for American readers. This means changing the punctuation, merging some words, altering the syntax, etc.
When I read an old novel, written two centuries ago in England, the little differences to modern English are part of the charm, and I certainly don't want any Americanism mixed in. For one of my favorite novels, The Forsyte saga, the author deliberately used some rare forms of words, which SE replaced with the mainstream forms.
SE editor in chief here. What you describe is incorrect. The only thing we do is very light sound-alike spelling modernization, like "to-night" -> "tonight". We do not do things like change from en-GB to en-US, replace old words with different modern words, or change text for "American readers", whatever that means. I have no idea where you got that impression.
I personally worked on the Forsyte saga. If you think something was done in error, please let us know and we'll be happy to fix it.
You may already be aware, but SE marks all commits making those kinds of changes as '[Editorial]', so it is generally trivial to use their tooling to build your own high-quality ebook without any of the editorial changes.
When I tried this in the past, it was non-trivial because the editorial changes are mixed with the technical changes. Reverting the editorial changes broke the technical changes.
Not parent, but while I can appreciate your viewpoint, I would like to point out that many many many books have abridged, reworded, simplified, or disambiguated versions for different audiences.
The Bible is I daresay the most famous of these. Translations aside, even the English versions have had significant alterations done to wording, spelling, and meaning depending on the version.
There's also the Great Illustrated Classics imprint for certain classic novels like H.G. Wells's The Invisible Man. (I read that one like 10 times as a kid and it's what got me into sci-fi as a whole I'd argue. Haha.)
Whether these alternate versions are good or bad is obviously up for debate and depends on the person, but I'm just saying that what SE does is hardly new in the publishing world.
When I thought about Project Gutenberg I remembered that original brutalist non-design. The current site has been very tastefully updated but looks like it's still very accessible if you turn styles off. Great job!
I like the design but liked the previous design as well, it was unique and Craigslistish, you knew what website you were visiting just by looking at it.
>When I thought about Project Gutenberg I remembered that original brutalist non-design.
I suppose a printed book, black ink on paper, is "brutalist" and unpleasant to look at?
The text of a book shouldn't be encrusted with format, your reader or browser should contain the presentation that you want to see, find appealing, or need (accessibility).
Huh that's interesting: 4.5 seconds for the TCP handshake and an additional 9.2 seconds for the TLS handshake. Is this some kind of captcha, since most bots would disconnect before that, so if you complete it once then it knows you're good? (Until the bots catch on of course, but so long as it works it's relatively unintrusive and not discriminatory against uncommon client software (that is, non-Chrome/ium).) The rest of the requests were lightning fast
Edit: welcome to your first comment after 9 years on HN btw, nice to have you here!
I think their site is just slow, potentially because more people than they are used to are trying to view it.
I was unable to load it initially (got an error from firefox) and had to re-attempt. Still slow if one forces a reload (shift-r, etc, to not use local cache).
we are having occasional lows in page speed performance due to LARGE amounts of bot traffic. full disclosure - we've not really been able to resolve this fully/well. Let us know if you have a good idea for how to deal with it
How do you currently host everything? Your main web server should not be responsible for hosting content. All books should be hosted on mirrors, and clicking download should automatically select a mirror to download it from.
Furthermore:
* Make sure that all books are downloadable in bulk as torrents.
* Every day, generate a CSV file of all available books and their metadata. Distribute this so that bots and user clients can run queries locally, instead of using your search engine.
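To illustrate the local-query idea in the last bullet, here is a minimal sketch. It assumes a daily CSV dump with hypothetical columns `id`, `title`, `author`, and `language`; the real file's column names may well differ, and the sample rows are made up.

```python
import csv
import io

# Stand-in for a downloaded catalog file; columns and rows are assumptions.
SAMPLE_CATALOG = """id,title,author,language
1,Example Novel,"Doe, Jane",en
2,Beispielroman,"Muster, Max",de
3,Another Example,"Doe, Jane",en
"""

def search_catalog(csv_text, **filters):
    """Return rows whose fields contain every filter value (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if all(value.lower() in row[field].lower()
               for field, value in filters.items())
    ]

# Runs entirely on the client, so no request hits the site's search engine.
matches = search_catalog(SAMPLE_CATALOG, author="doe", language="en")
print([row["title"] for row in matches])
```

A bot or client that downloads the file once a day can run any number of queries like this without touching the web server again.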
anubis only works against lazy scrapers, and at a cost to your users. I'd prefer people not use it.
Bot traffic comes from machines that usually have a lot of idle cpu (since they're largely blocked on network IO as they scrape a bunch of sites in parallel), so they can trivially solve the anubis "proof of work" challenge, save the cookie, and then not solve it again for that site.
The only reason scrapers don't solve it is if the developers were too lazy to implement it... and modern scrapers also do, codeberg stopped using anubis because modern scrapers were updated to solve it.
The "proof of work" has to be easy or else people on old cell phones couldn't access your site (since an old android phone would start to overheat and throttle trying to solve a challenge that would take a modern server even several seconds), and it also consumes your cell-phone user's batteries, which is a really precious resource for them compared to the idle cpu on a server.
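For anyone unfamiliar with how this kind of challenge works, here is a generic hashcash-style sketch. It is not Anubis's actual scheme; the challenge string, the difficulty value, and the function names are all assumptions for illustration.

```python
import hashlib
from itertools import count

def solve(challenge: str, difficulty_bits: int) -> int:
    """Find a nonce so SHA-256(challenge + nonce) has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Checking costs a single hash, no matter how long solving took."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

# 12 bits means roughly 4096 hashes on average: cheap enough for an old phone,
# and therefore also cheap for a scraper with idle CPU.
nonce = solve("example-challenge", 12)
print(verify("example-challenge", nonce, 12))
```

Each extra bit of difficulty doubles the expected work for everyone, so there is no setting that is hard for a server and easy for a phone.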
Just to add to the two negative replies, I find Anubis to be the only system that doesn't ever get in the way. My browsers have Javascript enabled and, so far, it never took more than a fraction of a second to complete the checks
Every other system I've run into has constant false positives, e.g. Google captchas will sometimes say I've failed and make me do the hardest level (if it wasn't giving me that already), Cloudflare regularly thinks I'm a bot, Codeberg blocked me before, Github signup captchas used to take ~15 minutes to complete and then still said "well you failed, try again", Github's general rate limiting has false positives (some days I browse a lot, other days little, and on the little days it'll sometimes go "slow down" with no recourse whatsoever, you're just blocked for an indeterminate amount of time), OpenStreetMap blocks my browser at work because I'm using Firefox ESR instead of latest stable and it finds that user agent string to be implausible, whatever the german railway operator uses since a few days is triggering on me constantly, etc.,
etc., etc. Constant blocks everywhere.
With Anubis, my understanding is that you do the proof of work (with whatever implementation you like, it doesn't have to be the Javascript one that they provide) and you can move on without ever doing any task yourself. The power consumption is a shame, but so long as attackers aren't even doing this much, the couple Joules it takes doesn't seem to be an issue
Of course, the attackers will evolve, but for now...
Please no. I'm a non-bot who gets stopped and turned away all the time by that menace. Anubis doesn't work without JS.
One of the things I give duckduckgo a lot of credit for is that while they're quick to interrupt me for a bot check (sometimes multiple times in a span of minutes) they'll let me identify ducks even on the most locked down browsers I use.
I'm only a small-scale sysadmin but the way that I understand the internet is that you send abuse notifications to the IP address block owner and, if it doesn't get resolved, you block. The whois/rdap database reveals which IPs all belong to the same hosting provider or ISP, so you can summarize that all to one list of IP addrs + timestamps per some time period
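A rough sketch of that summarizing step, assuming you already have a prefix-to-owner table built from whois/RDAP lookups. The table, the owner names, and the sample events below are hypothetical, and the addresses come from the reserved documentation ranges.

```python
import ipaddress
from collections import defaultdict

# Hypothetical prefix -> owner table; in practice filled from whois/RDAP data.
NETWORKS = {
    ipaddress.ip_network("192.0.2.0/24"): "ExampleHost",
    ipaddress.ip_network("198.51.100.0/24"): "ExampleISP",
}

def owner_of(ip: str) -> str:
    """Look up which known network an address belongs to."""
    addr = ipaddress.ip_address(ip)
    for network, owner in NETWORKS.items():
        if addr in network:
            return owner
    return "unknown"

def summarize(events):
    """Group (ip, timestamp) abuse events into one list per network owner."""
    report = defaultdict(list)
    for ip, timestamp in events:
        report[owner_of(ip)].append((ip, timestamp))
    return dict(report)

events = [
    ("192.0.2.10", "2025-01-01T00:00:05Z"),
    ("198.51.100.7", "2025-01-01T00:00:09Z"),
    ("192.0.2.44", "2025-01-01T00:01:30Z"),
]
for owner, hits in summarize(events).items():
    print(owner, hits)
```

Each per-owner list is what would go into a single abuse notification for that time period.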
The ISP actually knows which subscriber is on that line, can send them notices, block them, terminate them... loads of things that you simply cannot do because you have no relation to this person. And frankly I wouldn't want to need to have a personal relation with every website that I visit; my ISP can reach me if there is anything relevant to continued use of the internet. From personal experience, when I was a teenager, the ISP cutting our household off after an abuse report was an effective way of stopping what I was doing
It’s effective against teenagers maybe. Not so much against Amazon, Meta or wherever botnet/crawler is coming out of China these days from up-and-coming AI companies.
Then block all of Amazon, Meta, or wherever botnet/crawling traffic is coming from that doesn't honor robots.txt, sends DDoS reflection traffic, submits SMTP messages (in large volumes, not just probing) for domains they're not authorized for with SPF, or whatever else applies to the protocol you're using
If they can't keep their ranges clean to a reasonable degree, their customers will need to move if they want to access your part of the internet. New sign-ups will always be hard, so some amount of abuse is expected, but if it's the same abuse traffic for weeks after you've notified them, well, it stops being your problem at some point
This is the post I’m talking about. Make sure you understand how it would not be productive to go after each ISP individually when the traffic is from all of them.
wouldn't help, much of the traffic we've observed looks closer to ddos patterns - IPs from all over the world, many different networks, each IP makes one request only, doesn't come back. highly distributed, no form of blocking would be effective except maybe captcha or proof of work.
The problem with this approach is that modern scrapers use hordes of residential proxies and quickly rotate through IP addresses which belong to ASes you get a lot of real traffic from. There's nothing you can do if the ISP won't take any action against the customer.
Worse than that - even if they would take action, you can't possibly orchestrate filing all of the complaints. It's a drown-in-quicksand problem, you can't fight quicksand one grain at a time.
> you can't possibly orchestrate filing all of the complaints
To the ISPs? Each IP range has an abuse email address registered and this is specifically exempt from rate limiting at RIPE's WHOIS server. Not sure how it is in other RIRs but I just happen to know of this policy
You can automate the whole thing, provided that you have a reliable way of identifying the undesired traffic which you need anyway for being able to block it by any means. The trouble is in user identification (they'll just use a new IP address from that ISP or hosting provider if you don't tell the provider about the problematic user)
See what I wrote above (and let me say I am talking about Project Gutenberg and Distributed Proofreaders here, I am one of the admins on both). A large amount of the hassle traffic we've seen is as I wrote above, the IPs come from everywhere and in many cases, each IP makes a single request and doesn't come back. They change user-agent dynamically, etc, to masquerade as regular traffic. They come from residential, cloud/hyperscale, corporate, educational, government, all the networks, on every continent. This is many thousands of "open a ticket with someone" events per hour territory. It's as difficult to fight as DDoS itself for the same reasons (presumably the harvesting parties know that and that's exactly why this approach is used).
Others online have been writing about their own experience with the same stuff; it's not unique to PG at all, it's everywhere. Talk to anyone that runs a web server and they'll have these stories...
I'm aware, I also host various websites that see an IP do a single request to the most unlikely of deep pages. Usually not hard to correlate with similar surprising requests from the same ISP, though, and that's exactly why it would be useful to talk to them: they know who used that IP address at the given timestamp. If they get a hundred complaints from different websites, the ISP is in the unique position to correlate that and find the subscriber(s) that are problematic
You also don't have to send out 1k support requests per hour. Could trial it with some hosting provider that you expect is responsive and see how it works out
edit: like, I just don't see another solution short of banning being anonymous online. Each site would have to know who you are. Someone has to be able to track it back to a person that is doing the abuse or there can't be any rules that we can apply. Imo it's better if that's the ISP (or VPN provider, say) who already has this information anyway
I know. All the more reason to do it, right? If an ISP can't keep its network clean, then allowing them to send traffic onto the web is just asking for the problem to continue
Show people a useful error, such as "You are using [ISP name] which sends large volumes of abusive traffic (think of spam and DDoS). They allow the attackers to hop around points across their entire network so we cannot block the abusers more selectively. Despite our attempts to contact them, the abuse continues in volumes which we do not see from other ISPs. To access our corner of the internet, use a different ISP. You could try mobile data instead of Wi-Fi or vice versa.", and they can make their own choices about staying with this ISP if more and more websites show this sort of error
If everyone tries to identify people piecemeal, we all need to implement ~200 different identification systems (assuming each country has a central system that everyone is signed up to in the first place), or rely on algorithms to tell who is a bot (I'm currently being misidentified on a daily basis and I'm, eh, not a bot. Trying to buy public transport tickets is currently difficult, for example, because the monopolist in my country blocks me after a few route queries when using a Google browser, and 0 queries from Firefox)
Occasionally, you misclassify a real user as a bot, and then your reputation is ruined forever.
The official Polish train schedules website did this recently, feeding incorrect departure and arrival times to IP addresses known for aggressive scraping, without taking CGNAT into account. People... have noticed[1].
traffic yesterday ~20% more than recent average.
4971601 sessions
177 robots
863462 robot files
3390115 user files
20.30% robot files
(robots id'd based on requests/ip address)
5 apache servers for static content, 1 CherryPy server for dynamic content
hosted at iBiblio.
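For reference, the 20.30% figure follows directly from the two file counts above. The requests-per-IP identification might look roughly like the second half of this sketch; the threshold and the sample addresses are assumptions, not the site's actual rule.

```python
from collections import Counter

# Figures quoted in the comment above.
robot_files = 863462
user_files = 3390115
robot_share = robot_files / (robot_files + user_files)
print(f"{robot_share:.2%}")  # 20.30%

def flag_robots(request_ips, threshold):
    """Return IPs that made more than `threshold` requests (assumed cutoff)."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > threshold}

# Made-up request log: one busy address and two quiet ones.
sample = ["192.0.2.1"] * 5 + ["192.0.2.2"] + ["192.0.2.3"] * 2
print(flag_robots(sample, threshold=3))
```

Note that a per-IP count like this cannot catch the one-request-per-IP traffic described elsewhere in the thread, which may be why only 177 robots show up.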
The ebook editions are very good for this. Most of the e-reader software provides all the amenities (bookmarks, highlighting, notes, control of margins, etc).
A while back I attempted to extract the FF reader code to make it a front end to various non-web clients (email with pine key bindings etc)
I got it to a prototype level but then shelved it after having difficulty getting good results with various test datasets. Probably would make a fantastic ereader though
Hi, for the past 20 years I have known about Project Gutenberg and I used to read a lot from it. One of the obstacles that I face is that there is no way to arrange the books in the order of their original publication.
Do you know of any such way?
Surely we can arrange the books by their release date on Gutenberg but it has long baffled me as it feels to me the most useless way of sorting the books.
Thank you for Project Gutenberg.
only 20% of our books have original publication data in the db. We have a project to add another 40% or so from another database, let us know if you want to help.
As long as you're taking suggestions, since many of the books are quite old, adding a publication date or date range to the search functionality might be nice. I personally would find it very useful since I have a tendency to look for things that are older than year _x_ when researching various things.
only 20% of our books have original publication data in the db. We have a project to add another 40% or so from another database, let us know if you want to help.
I have the same problem on catholiclibrary.org, but insist on having something as the book date for every work. My solution is to temporarily default to the author dates until the book date can be refined. If there is no known author date I at least have a date range, hopefully to century or better.
Author dates are a much smaller data set, can be generally supplemented from public marc records (viaf, loc, etc - I don't do that, but it's an option) and at least provide basic filtering / sorting.
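A small sketch of that fallback, under some assumptions: the field names are hypothetical, and using the midpoint of the author's life is just one possible choice of "author date", not necessarily what catholiclibrary.org does.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Work:
    title: str
    published: Optional[int] = None    # original publication year, if known
    author_born: Optional[int] = None
    author_died: Optional[int] = None

def sort_year(work: Work) -> int:
    """Best available year: publication date, else a date from the author's life."""
    if work.published is not None:
        return work.published
    if work.author_born is not None and work.author_died is not None:
        return (work.author_born + work.author_died) // 2  # assumed midpoint rule
    if work.author_died is not None:
        return work.author_died
    if work.author_born is not None:
        return work.author_born
    return 9999  # nothing known: sort last

works = [
    Work("Dated work", published=1850),
    Work("Undated work", author_born=1700, author_died=1760),
    Work("No dates at all"),
]
print([w.title for w in sorted(works, key=sort_year)])
```

Every work gets a usable sort key right away, and the key improves on its own as real publication dates are filled in.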
FWIW I absolutely love how 'no-frills' PG is compared to so much of the bloated, over-engineered, script-riddled web these days. Please don't ever change that!
Thanks for the free work! Project Gutenberg is nice to have :).
On the site I noticed the library boxes have roughly a single extra line causing a scrollbar to appear and the last line to be chopped off https://i.imgur.com/PQ8T0qc.png is there an issues/bug portal to properly submit these kinds of things?
OCR has improved a lot since then, but OCR is just step 1 of reading in text. They make a lot of errors (even now, especially on old worn out paper pages) and even if they didn't, one has to format the book, deal with footnotes, sidenotes, illustrations, etc. DP is very active, we will welcome you back with open arms :)
I uploaded a PDF to archive.org that auto-OCRs with plenty of mistakes. I have found no way of updating the entire stack of documents produced. I wonder if Project Gutenberg is similar
I don't know what the status of this is today, but a number of years ago my biggest complaint about Gutenberg is that a lot of books had images added back when low resolution images were the standard, so you have a ton of books with image resolutions from the year 2000.
Great project. Are many of the books in a format that can easily be converted into audio? Is there a way to search for them, and information on what software your readers find useful for this purpose?
(Note: A lot of print media these days has switched to far-too-small font sizes. Less of a problem for (zoomable) digital media, but for many that's still a barrier.)
Gutenberg is nearly all books that have lapsed into the US public domain by dint of being published 95+ years in the past. Which broadly explains why you hit nothing for 3d printing.
As another commenter said PG is almost all books from 95+ years in the past due to copyright law in the US. We partner with a sister organization, the World Library Foundation, who have a self-publishing portal for modern works by authors who wish to put their own work in the public domain. You might want to look there for more modern material. https://self.gutenberg.org
Perhaps you can find the information you are looking for there.
However, if you plan on scraping or otherwise hitting them with a ton of traffic, consider at least donating a good amount for the traffic you cause them. It ain't free after all.
> All Project Gutenberg metadata are available digitally in the XML/RDF format. This is updated daily (other than the legacy format mentioned below). Please use one of these files as input to a database or other tools you may be developing, instead of crawling or roboting the website.
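As a rough idea of what "use these files as input" can look like, here is a sketch that reads RDF/XML with the standard library. The embedded sample is a simplified stand-in that I made up; the real files have a richer structure and different element layout, so treat the element choices here as assumptions.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}

# Simplified, made-up record; not the actual schema of the published files.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="ebooks/0">
    <dcterms:title>Example Title</dcterms:title>
    <dcterms:language>en</dcterms:language>
  </rdf:Description>
</rdf:RDF>"""

def read_records(rdf_text):
    """Extract a few fields from each rdf:Description element."""
    root = ET.fromstring(rdf_text)
    records = []
    for node in root.findall("rdf:Description", NS):
        records.append({
            "about": node.get(f"{{{NS['rdf']}}}about"),
            "title": node.findtext("dcterms:title", default="", namespaces=NS),
            "language": node.findtext("dcterms:language", default="", namespaces=NS),
        })
    return records

print(read_records(SAMPLE_RDF))
```

Records in this shape can be loaded into a local database once, after which no crawling is needed for metadata lookups.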