Hi! I'm one of the programmers at Gutenberg. We've been improving the site a lot...

svat · 2026-05-15T20:14:20 1778876060

Have you considered having a detailed version history for each book (etext)? The process of submitting fixes to typos etc in books involves sending an email (https://www.gutenberg.org/help/errata.html) and although the last time I did this (2011) the fixes did get applied reasonably quickly (couple of days), it all felt a bit opaque. The version history could also include the project (usually PGDP correct?) the etext originated from; that way one would be able to compare against the actual page scans.

I have very mixed feelings about Standard Ebooks and would much prefer being able to use Project Gutenberg directly, but one good thing Standard Ebooks does is that every book has an associated git repository (on GitHub), so it's (in principle) possible to see a history of fixes to the text over time.

gluejar · 2026-05-15T21:26:59 1778880419

We're using git repos internally to keep history for each book. They existed on GitHub for a while, but our implementation was awkward, and too big a project for the volunteer dev team. But it's likely that we'll evolve towards that.

marcprux · 2026-05-16T00:16:18 1778890578

> I have very mixed feelings about Standard Ebooks[…]

Why?

svat · 2026-05-21T00:15:21 1779322521

I was hoping to reply to this in detail but as I never got around to it, I'll keep it short: mostly it's about the editorial changes they make to the text, modernizing spelling etc. Many of the changes are unjustified IMO, and often detract from the charm of the original, and I'm uncomfortable reading a text I know has been tampered with in this way. Of course it's their project and they can do whatever they want, and they clearly love books, so with strong opinions there will be some that I may disagree with. I'd much rather read books from Project Gutenberg or Wikisource, both of which don't even correct obvious typos without marking up in some way that they've done so.

I also have many positive things to say about Standard Ebooks, but I don't think you were asking about those. :)

----

Edit: Without going into what I think are the most egregious sort of changes they introduce (which I think will require a longer post) and limiting myself to ones easy to find immediately:

See the earlier discussion (linked in a sibling comment here) where the editor-in-chief says it's ok to change punctuation because "The sounds out of his mouth do not include an apostrophe whether it's there in the spelling or not." (a very American view IMO): https://news.ycombinator.com/item?id=16956931

And looking at a recent commit on one of their books, here's a recent (https://github.com/standardebooks/agatha-christie_the-secret...) revert of one of their aggressive "modernizations" from 2024 (https://github.com/standardebooks/agatha-christie_the-secret...), that had, in line with their usual practice, changed "every one" to "everyone" (in one place even when referring to "a good many risks"), and the same commit made other changes (including one still present) like "they ought to have it lithographed. It must be a frightful nuisance doing every one separately." having the last four words turned into "doing everyone separately."!

robin_reala · 2026-05-23T06:02:44 1779516164

On the “every one” example, that’s a definite mistake that shouldn’t have made its way in to the book in the first place. The production process has a specific step for “every one” (https://standardebooks.org/contribute/producing-an-ebook-ste...) that guides producers through making the correct choices when modern usage has two different possible choices. It shouldn’t have happened, but it’s a mistake that was fixed at least.

svat · 2026-05-23T13:45:51 1779543951

Your comment makes it sound as though the mistake was introduced by an inexperienced contributor who did not read the guide, when in fact it was introduced by the founder/editor-in-chief of the project. :) And in case it wasn't clear, only one of the mistakes was reverted, and the other one I quoted is still present in the book even as of this moment.

More broadly, the position of Standard Ebooks is that a modern reader would be distracted by spellings like "some one" and "every thing", and by time written like "2.30" instead of "2:30", and that books in British quotation style must be converted to American quotation style. I think most readers can in fact tolerate such small differences, and this position is frankly insulting — the punctuation and spelling of works are part of their character, and if anything, I'm more distracted by such anachronisms in style introduced as part of the Standard Ebooks process.

robin_reala · 2026-05-23T13:56:18 1779544578

And to be honest, that position is totally reasonable, and the good thing is that you have the option of Gutenberg, Faded Page, and a bunch of other archival sites, also for free, if you don’t want that.

But nearly all print publishers also do what SE does. Why do you think they do, when it costs additional money and time to do that? A reasonable answer is that some, or a majority of, people prefer it.

svat · 2026-05-23T16:42:06 1779554526

> But nearly all print publishers also do what SE does.

Do they? To check, I tried to find a recent publication of Agatha Christie, and found the collection “Country Christie: Twelve Devonshire Mysteries” which says “First published by HarperCollins Publishers Ltd 2025”. It still has British-style punctuation (throughout the book), and times like “1.30”, “9.30”, “11.30”, “7.30 a.m.”, “12.30 p.m.”, and “8.30”. I checked a couple of other recent publications and admittedly they do modernize (though not in phrases like “every one of you”), but again I found the collection “The Last Seance: Haunting Tales from the Queen of Mystery” (2019) which does not. So it seems mixed.

In any case, I think it's fine to do what Standard Ebooks does, and if it were instead called something like “Modernized Ebooks with American punctuation”—if readers would know before picking one up—it would be totally unobjectionable. The name “Standard” gives the wrong impression. It's a bit like colorizing old black-and-white movies (or dubbing foreign-language movies instead of subtitling them): yes possibly even a majority of people may prefer it, but IMO it would be good to be more explicit what has been done.

a2800276 · 2026-05-16T07:25:52 1778916352

It splits the community and number of possible volunteer hours for one. It also splits the canon into different versions. More projects fight for the attention (and possibly donations) of the audience.

There are lots of reasons it could be preferable to centralize. OTOH their mission is limited and some competition is healthy, if only to explore alternative ways to do things.

robin_reala · 2026-05-16T09:02:06 1778922126

It’s a different mission.

PG focuses on an accurate digital translation of the source material, sometimes hosting multiple different versions of the same text, and doing things like putting work into recreating the adverts at the back of some novels.

SE focuses less on preservation and more on making readers’ versions of the texts, like other publishing imprints. So there’s typography standardisation, a light-touch modernisation of hyphenation and soundalike spelling, and things like author-wide collections of short fiction and poetry even if they didn’t previously exist.

Both are valuable, but they serve different segments.

idoubtit · 2026-05-16T09:10:24 1778922624

Not the GP, but I also have mixed feelings about Standard Ebooks. They modernise texts for American readers. This means changing the punctuation, merging some words, altering the syntax, etc.

When I read an old novel, written two centuries ago in England, the little differences to modern English are part of the charm, and I certainly don't want any Americanism mixed in. For one of my favorite novels, The Forsyte saga, the author deliberately used some rare forms of words, which SE replaced with the mainstream forms.

acabal · 2026-05-16T14:07:15 1778940435

SE editor in chief here. What you describe is incorrect. The only thing we do is very light sound-alike spelling modernization, like "to-night" -> "tonight". We do not do things like change from en-GB to en-US, replace old words with different modern words, or change text for "American readers", whatever that means. I have no idea where you got that impression.

I personally worked on the Forsyte saga. If you think something was done in error, please let us know and we'll be happy to fix it.

mrob · 2026-05-16T22:40:44 1778971244

I commented on this kind of editing several years ago:

https://news.ycombinator.com/item?id=16957359

The edit is still in place, and I still maintain that changing 'phone to phone in dialogue changes the meaning.

jeltz · 2026-05-19T10:34:15 1779186855

Yeah, that edit clearly changes the meaning of the text.

natex · 2026-05-16T15:52:49 1778946769

> The only thing we do is very light sound-alike spelling modernization, like "to-night" -> "tonight".

Curious. Why even bother?

bell-cot · 2026-05-16T21:39:45 1778967585

Guess: screen readers and such.

tangledhelix · 2026-05-16T19:41:55 1778960515

One could argue that this falls into the previous poster's thought about "the little differences to modern English are part of the charm" ...

jcurtis · 2026-05-16T10:37:43 1778927863

You may already be aware, but SE marks all commits making those kinds of changes as '[Editorial]', so it is generally trivial to use their tooling to build your own high-quality ebook without any of the editorial changes.

mrob · 2026-05-16T22:42:56 1778971376

When I tried this in the past, it was non-trivial because the editorial changes are mixed with the technical changes. Reverting the editorial changes broke the technical changes.

AdamN · 2026-05-16T09:15:49 1778922949

SE sounds truly, truly awful. Thanks for making me aware of its existence so I can avoid it.

phaedrix · 2026-05-16T15:25:47 1778945147

They're providing beautifully made ebooks for free...

The only thing they are is truly, truly wonderful.

AdamN · 2026-05-18T11:20:38 1779103238

But why not be true to the original author's text? What's the need to modify it?

encrypted_bird · 2026-05-22T21:50:41 1779486641

Not parent, but while I can appreciate your viewpoint, I would like to point out that many many many books have abridged, reworded, simplified, or disambiguated versions for different audiences.

The Bible is I daresay the most famous of these. Translations aside, even the English versions have had significant alterations done to wording, spelling, and meaning depending on the version.

There's also the Great Illustrated Classics imprint for certain classic novels like H.G. Wells's The Invisible Man. (I read that one like 10 times as a kid and it's what got me into sci-fi as a whole I'd argue. Haha.)

Whether these alternate versions are good or bad is obviously up for debate and depends on the person, but I'm just saying that what SE does is hardly new in the publishing world.

condwanaland · 2026-05-17T05:13:55 1778994835

SE is an amazing and wonderful resource

JSeiko · 2026-05-15T20:45:54 1778877954

I believe our new-ish CEO Eric Hellman actually did some work on something very similar

JSeiko · 2026-05-15T20:24:29 1778876669

That's an interesting idea. not a small feat to accomplish though ...

jefurii · 2026-05-15T17:59:28 1778867968

When I thought about Project Gutenberg I remembered that original brutalist non-design. The current site has been very tastefully updated but looks like it's still very accessible if you turn styles off. Great job!

JSeiko · 2026-05-15T18:08:09 1778868489

sadly HN doesn't have a "heart" emoji I could use :D

ricardonunez · 2026-05-15T23:41:32 1778888492

I like the design but liked the previous design as well, it was unique and Craigslistish, you knew what website you were visiting just by looking at it.

Wistar · 2026-05-15T18:16:12 1778868972

ok_dad · 2026-05-15T21:27:50 1778880470

<3

Less than three is a classic!

agys · 2026-05-16T08:26:29 1778919989

Ess two is less than less than three, but also a classic.

s2 < <3

fsckboy · 2026-05-16T20:24:32 1778963072

>When I thought about Project Gutenberg I remembered that original brutalist non-design.

I suppose a printed book, black ink on paper, is "brutalist" and unpleasant to look at?

The text of a book shouldn't be encrusted with format, your reader or browser should contain the presentation that you want to see, find appealing, or need (accessibility).

lucb1e · 2026-05-15T20:00:04 1778875204

Huh that's interesting: 4.5 seconds for the TCP handshake and an additional 9.2 seconds for the TLS handshake. Is this some kind of captcha, since most bots would disconnect before that, so if you complete it once then it knows you're good? (Until the bots catch on of course, but so long as it works it's relatively unintrusive and not discriminatory against uncommon client software (that is, non-Chrome/ium).) The rest of the requests were lightning fast

Edit: welcome to your first comment after 9 years on HN btw, nice to have you here!

codys · 2026-05-15T20:10:13 1778875813

I think their site is just slow, potentially because more people than they are used to are trying to view it.

I was unable to load it initially (got an error from firefox) and had to re-attempt. Still slow if one forces a reload (shift-r, etc, to not use local cache).

JSeiko · 2026-05-15T20:23:26 1778876606

we are having occasional lows in page speed performance due to LARGE amounts of bot traffic. full disclosure - we've not really been able to resolve this fully/well. Let us know if you have a good idea for how to deal with it

uyzstvqs · 2026-05-16T15:30:27 1778945427

How do you currently host everything? Your main web server should not be responsible for hosting content. All books should be hosted on mirrors, and clicking download should automatically select a mirror to download it from.

Furthermore:

* Make sure that all books are downloadable in bulk as torrents.

* Every day, generate a CSV file of all available books and their metadata. Distribute this so that bots and user clients can run queries locally, instead of using your search engine.
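A minimal sketch of the daily-CSV idea, assuming a made-up four-column layout (`id`, `title`, `author`, `language`) and placeholder rows; none of this reflects Project Gutenberg's actual schema or tooling. It only shows that a client holding the file can answer simple queries without touching the site's search engine.

```python
import csv
import io

# Hypothetical catalog rows; the column names are illustrative only.
FIELDS = ["id", "title", "author", "language"]
BOOKS = [
    {"id": 1, "title": "Example Title A", "author": "Author One", "language": "en"},
    {"id": 2, "title": "Example Title B", "author": "Author Two", "language": "fr"},
    {"id": 3, "title": "Example Title C", "author": "Author Three", "language": "en"},
]


def write_catalog(books, out):
    """Server side: dump the whole catalog as CSV into a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(books)


def search_catalog(csv_text, author_substring):
    """Client side: case-insensitive author search over a downloaded CSV."""
    needle = author_substring.lower()
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if needle in row["author"].lower()]


buffer = io.StringIO()
write_catalog(BOOKS, buffer)
catalog_text = buffer.getvalue()

# The query runs entirely on the client's copy of the file.
matches = search_catalog(catalog_text, "two")
```

Note that `csv.DictReader` returns every field as a string, so a real client would convert `id` back to an integer if it needs one.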

gropo · 2026-05-15T21:14:05 1778879645

Do you host a torrent?

I have about 50k of the books, I would have used a torrent of just the txt files if it was prominent.

gluejar · 2026-05-16T18:56:32 1778957792

we have a tarball of all text files - link posted somewhere here

dimava · 2026-05-16T00:20:10 1778890810

If it's purely bot traffic, then Anubis could help

You could have seen it on some websites already

https://anubis.techaro.lol/

TheDong · 2026-05-16T03:42:11 1778902931

anubis only works against lazy scrapers, and at a cost to your users. I'd prefer people not use it.

Bot traffic comes from machines that usually have a lot of idle cpu (since they're largely blocked on network IO as they scrape a bunch of sites in parallel), so they can trivially solve the anubis "proof of work" challenge, save the cookie, and then not solve it again for that site.

The only reason scrapers don't solve it is if the developers were too lazy to implement it... and modern scrapers also do, codeberg stopped using anubis because modern scrapers were updated to solve it.

The "proof of work" has to be easy or else people on old cell phones couldn't access your site (since an old android phone would start to overheat and throttle trying to solve a challenge that would take a modern server even several seconds), and it also consumes your cell-phone user's batteries, which is a really precious resource for them compared to the idle cpu on a server.
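To make the cost argument concrete, here is a generic hashcash-style proof of work in Python. It is an illustration of the general technique only and is not Anubis's actual challenge format; the challenge string, the difficulty value, and the function names are all assumptions. The client searches for a nonce whose SHA-256 digest starts with a given number of zero hex digits, and the server verifies it with a single hash.

```python
import hashlib
from itertools import count


def is_valid(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server-side check: one hash, regardless of how hard the search was."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def solve(challenge: str, difficulty: int) -> int:
    """Client-side search: try nonces until the digest has enough leading zeros."""
    for nonce in count():
        if is_valid(challenge, nonce, difficulty):
            return nonce


# Each extra hex digit of difficulty multiplies the expected work by 16,
# so 3 digits means roughly 16**3 = 4096 hashes on average.
nonce = solve("example-challenge", 3)
```

The asymmetry is the whole trade-off discussed above: the difficulty has to be low enough for a slow phone, which also makes it cheap for a scraper with idle CPU that solves it once and reuses the cookie.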

lucb1e · 2026-05-16T19:53:10 1778961190

Just to add to the two negative replies, I find Anubis to be the only system that doesn't ever get in the way. My browsers have Javascript enabled and, so far, it never took more than a fraction of a second to complete the checks

Every other system I've run into has constant false positives, e.g. Google captchas will sometimes say I've failed and make me do the hardest level (if it wasn't giving me that already), Cloudflare regularly thinks I'm a bot, Codeberg blocked me before, Github signup captchas used to take ~15 minutes to complete and then still said "well you failed, try again", Github's general rate limiting has false positives (some days I browse a lot, other days little, and on the little days it'll sometimes go "slow down" with no recourse whatsoever, you're just blocked for an indeterminate amount of time), OpenStreetMap blocks my browser at work because I'm using Firefox ESR instead of latest stable and it finds that user agent string to be implausible, whatever the german railway operator uses since a few days is triggering on me constantly, etc.,

etc.,

etc. Constant blocks everywhere.

With Anubis, my understanding is that you do the proof of work (with whatever implementation you like, it doesn't have to be the Javascript one that they provide) and you can move on without ever doing any task yourself. The power consumption is a shame, but so long as attackers aren't even doing this much, the couple Joules it takes doesn't seem to be an issue

Of course, the attackers will evolve, but for now...

autoexec · 2026-05-16T07:12:58 1778915578

Please no. I'm a non-bot who gets stopped and turned away all the time by that menace. Anubis doesn't work without JS.

One of the things I give duckduckgo a lot of credit for is that while they're quick to interrupt me for a bot check (sometimes multiple times in a span of minutes) they'll let me identify ducks even on the most locked down browsers I use.

lucb1e · 2026-05-15T20:40:05 1778877605

I'm only a small-scale sysadmin but the way that I understand the internet is that you send abuse notifications to the IP address block owner and, if it doesn't get resolved, you block. The whois/rdap database reveals which IPs all belong to the same hosting provider or ISP, so you can summarize that all to one list of IP addrs + timestamps per some time period

The ISP actually knows which subscriber is on that line, can send them notices, block them, terminate them... loads of things that you simply cannot do because you have no relation to this person. And frankly I wouldn't want to need to have a personal relation with every website that I visit; my ISP can reach me if there is anything relevant to continued use of the internet. From personal experience, when I was a teenager, the ISP cutting our household off after an abuse report was an effective way of stopping what I was doing
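A rough sketch of the "one list of IP addrs + timestamps per provider" step, assuming you already have a table mapping address blocks to abuse contacts. Here that table is hard-coded with documentation-only prefixes and made-up contact addresses; in practice it would come from whois/RDAP lookups, which this sketch does not perform. The sample events are invented as well.

```python
import ipaddress
from collections import defaultdict

# Hypothetical prefix -> abuse contact table (documentation ranges, fake contacts).
NETWORKS = {
    ipaddress.ip_network("192.0.2.0/24"): "abuse@isp-a.example",
    ipaddress.ip_network("198.51.100.0/24"): "abuse@isp-b.example",
}


def abuse_contact(ip: str):
    """Return the abuse contact for the block containing ip, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for network, contact in NETWORKS.items():
        if addr in network:
            return contact
    return None


def build_reports(events):
    """Group (ip, timestamp) pairs into one list per abuse contact."""
    reports = defaultdict(list)
    for ip, timestamp in events:
        contact = abuse_contact(ip)
        if contact is not None:
            reports[contact].append((ip, timestamp))
    return dict(reports)


# Invented sample of requests already classified as abusive.
EVENTS = [
    ("192.0.2.10", "2026-05-15T20:00:04Z"),
    ("192.0.2.77", "2026-05-15T20:00:09Z"),
    ("198.51.100.5", "2026-05-15T20:01:30Z"),
    ("203.0.113.9", "2026-05-15T20:02:00Z"),  # no known owner, so skipped
]
reports = build_reports(EVENTS)
```

Each entry in `reports` is then the body of one notification: the provider gets every address and timestamp from its own ranges in a single message.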

Jolter · 2026-05-15T22:22:56 1778883776

It’s effective against teenagers maybe. Not so much against Amazon, Meta or wherever botnet/crawler is coming out of China these days from up-and-coming AI companies.

lucb1e · 2026-05-16T19:39:50 1778960390

Then block all of Amazon, Meta, or wherever botnet/crawling traffic is coming from that doesn't honor robots.txt, sends DDoS reflection traffic, submits SMTP messages (in large volumes, not just probing) for domains they're not authorized for with SPF, or whatever else applies to the protocol you're using

If they can't keep their ranges clean to a reasonable degree, their customers will need to move if they want to access your part of the internet. New sign-ups will always be hard, so some amount of abuse is expected, but if it's the same abuse traffic for weeks after you've notified them, well, it stops being your problem at some point

Jolter · 2026-05-16T19:41:57 1778960517

See the other comments in this thread. The perpetrators are unknown and are jumping between residential IPs. Possibly botnets?

lucb1e · 2026-05-16T19:43:40 1778960620

Then see my other replies in the thread where I've specifically addressed residential IPs, e.g.: https://news.ycombinator.com/item?id=48163060

Jolter · 2026-05-17T10:31:01 1779013861

This is the post I’m talking about. Make sure you understand how it would not be productive to go after each ISP individually when the traffic is from all of them.

https://news.ycombinator.com/item?id=48155512

tonetegeatinst · 2026-05-15T23:38:59 1778888339

I mean you could block entire AS numbers that relate to amazon or big tech datacenters

tangledhelix · 2026-05-16T00:07:02 1778890022

wouldn't help, much of the traffic we've observed looks closer to ddos patterns - IPs from all over the world, many different networks, each IP makes one request only, doesn't come back. highly distributed, no form of blocking would be effective except maybe captcha or proof of work.

miki123211 · 2026-05-16T12:00:54 1778932854

The problem with this approach is that modern scrapers use hordes of residential proxies and quickly rotate through IP addresses which belong to ASes you get a lot of real traffic from. There's nothing you can do if the ISP won't take any action against the customer.

tangledhelix · 2026-05-16T15:37:08 1778945828

Worse than that - even if they would take action, you can't possibly orchestrate filing all of the complaints. It's a drown-in-quicksand problem, you can't fight quicksand one grain at a time.

lucb1e · 2026-05-16T19:35:08 1778960108

> you can't possibly orchestrate filing all of the complaints

To the ISPs? Each IP range has an abuse email address registered and this is specifically exempt from rate limiting at RIPE's WHOIS server. Not sure how it is in other RIRs but I just happen to know of this policy

You can automate the whole thing, provided that you have a reliable way of identifying the undesired traffic which you need anyway for being able to block it by any means. The trouble is in user identification (they'll just use a new IP address from that ISP or hosting provider if you don't tell the provider about the problematic user)

tangledhelix · 2026-05-16T19:50:14 1778961014

See what I wrote above (and let me say I am talking about Project Gutenberg and Distributed Proofreaders here, I am one of the admins on both). A large amount of the hassle traffic we've seen is as I wrote above, the IPs come from everywhere and in many cases, each IP makes a single request and doesn't come back. They change user-agent dynamically, etc, to masquerade as regular traffic. They come from residential, cloud/hyperscale, corporate, educational, government, all the networks, on every continent. This is many thousands of "open a ticket with someone" events per hour territory. It's as difficult to fight as DDoS itself for the same reasons (presumably the harvesting parties know that and that's exactly why this approach is used).

Others online have been writing about their own experience with the same stuff; it's not unique to PG at all, it's everywhere. Talk to anyone that runs a web server and they'll have these stories...

lucb1e · 2026-05-16T20:06:39 1778961999

I'm aware, I also host various websites that see an IP do a single request to the most unlikely of deep pages. Usually not hard to correlate with similar surprising requests from the same ISP, though, and that's exactly why it would be useful to talk to them: they know who used that IP address at the given timestamp. If they get a hundred complaints from different websites, the ISP is in the unique position to correlate that and find the subscriber(s) that are problematic

You also don't have to send out 1k support requests per hour. Could trial it with some hosting provider that you expect is responsive and see how it works out

edit: like, I just don't see another solution short of banning being anonymous online. Each site would have to know who you are. Someone has to be able to track it back to a person that is doing the abuse or there can't be any rules that we can apply. Imo it's better if that's the ISP (or VPN provider, say) who already has this information anyway

lucb1e · 2026-05-16T19:34:17 1778960057

I know. All the more reason to do it, right? If an ISP can't keep its network clean, then allowing them to send traffic onto the web is just asking for the problem to continue

Show people a useful error, such as "You are using [ISP name] which sends large volumes of abusive traffic (think of spam and DDoS). They allow the attackers to hop around points across their entire network so we cannot block the abusers more selectively. Despite our attempts to contact them, the abuse continues in volumes which we do not see from other ISPs. To access our corner of the internet, use a different ISP. You could try mobile data instead of Wi-Fi or vice versa.", and they can make their own choices about staying with this ISP if more and more websites show this sort of error

If everyone tries to identify people piecemeal, we all need to implement ~200 different identification systems (assuming each country has a central system that everyone is signed up to in the first place), or rely on algorithms to tell who is a bot (I'm currently being misidentified on a daily basis and I'm, eh, not a bot. Trying to buy public transport tickets is currently difficult, for example, because the monopolist in my country blocks me after a few route queries when using a Google browser, and 0 queries from Firefox)

TurdF3rguson · 2026-05-15T20:32:27 1778877147

CF cache?

jimnotgym · 2026-05-16T10:05:11 1778925911

I would love it if you could detect AI scraper bots, and feed them AI generated bs instead of the real books...

tangledhelix · 2026-05-16T19:51:13 1778961073

Cloudflare sells that as a product, they call it Labyrinth IIRC.

miki123211 · 2026-05-16T11:59:10 1778932750

This is very, very, very dangerous.

Occasionally, you misclassify a real user as a bot, and then your reputation is ruined forever.

The official Polish train schedules website did this recently, feeding incorrect departure and arrival times to IP addresses known for aggressive scraping, without taking CGNAT into account. People... have noticed[1].

[1] (Polish) https://zaufanatrzeciastrona.pl/post/kto-i-dlaczego-losuje-w...

gluejar · 2026-05-16T18:54:12 1778957652

traffic yesterday ~20% more than recent average.

* 4971601 sessions

* 177 robots

* 863462 robot files

* 3390115 user files

* 20.30% robot files (robots id'd based on requests/ip address)

5 apache servers for static content, 1 CherryPy server for dynamic content hosted at iBiblio.
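For anyone checking the arithmetic: the 20.30% figure is the robot files divided by robot plus user files. The sketch below reproduces that from the two counts above, and adds a toy version of identifying robots by requests per IP address. The threshold and the sample IPs are assumptions for illustration; the actual rule used on the servers is not stated here.

```python
from collections import Counter

# File counts quoted in the comment above.
ROBOT_FILES = 863_462
USER_FILES = 3_390_115


def robot_share(robot_files: int, user_files: int) -> float:
    """Percentage of served files that went to clients classified as robots."""
    return 100 * robot_files / (robot_files + user_files)


def flag_robots(request_ips, threshold: int):
    """Flag any IP with more than `threshold` requests (hypothetical rule)."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > threshold}


share = robot_share(ROBOT_FILES, USER_FILES)

# Invented request log: one address makes five requests, the others one or two.
sample_ips = ["192.0.2.1"] * 5 + ["192.0.2.2"] + ["192.0.2.3"] * 2
robots = flag_robots(sample_ips, threshold=3)
```

A per-IP count like this only catches repeat visitors, so it would miss the one-request-per-address traffic described elsewhere in the thread.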

eulerpoolapi · 2026-05-16T13:37:20 1778938640

The biggest lever: make the reading experience great. https://www.gutenberg.org/cache/epub/245/pg245-images.html is still hard to read: lines are tooo long (macbook), no great way for pagination/remembering where I was, notes

tangledhelix · 2026-05-16T15:31:20 1778945480

The ebook editions are very good for this. Most of the e-reader software provides all the amenities (bookmarks, highlighting, notes, control of margins, etc).

SwampertX · 2026-05-16T14:02:16 1778940136

Firefox's reader mode works amazingly for these situations.

drzaiusx11 · 2026-05-16T15:42:36 1778946156

A while back I attempted to extract the FF reader code to make it a front end to various non-web clients (email with pine key bindings etc)

I got it to a prototype level but then shelved it after having difficulty getting good results with various test datasets. Probably would make a fantastic ereader though

elch · 2026-05-16T16:48:19 1778950099

Lines aren't too long. They look great on all my devices.

Use ⌘ + + until you get the line length you like.

Guestmodinfo · 2026-05-16T01:36:11 1778895371

Hi, for the past 20 years I have known about Project Gutenberg and I used to read a lot from it. One of the obstacles that I face is that there is no way to arrange the books in the order of their original publication. Do you know of any such way? Surely we can arrange the books by their release date on Gutenberg, but it has long baffled me as it feels to me the most useless way of sorting the books. Thank you for Project Gutenberg.

gluejar · 2026-05-16T19:00:30 1778958030

only 20% of our books have original publication data in the db. We have a project to add another 40% or so from another database, let us know if you want to help.

Guestmodinfo · 2026-05-16T19:33:42 1778960022

Yes I am willing to help. Plz include me in your efforts. Thank you for this

0x0203 · 2026-05-15T22:15:30 1778883330

As long as you're taking suggestions, since many of the books are quite old, adding a publication date or date range to the search functionality might be nice. I personally would find it very useful since I have a tendency to look for things that are older than year _x_ when researching various things.

Thanks for all the effort put into the site!

gluejar · 2026-05-16T18:59:30 1778957970

only 20% of our books have original publication data in the db. We have a project to add another 40% or so from another database, let us know if you want to help.

sgc · 2026-05-17T13:58:47 1779026327

I have the same problem on catholiclibrary.org, but insist on having something as the book date for every work. My solution is to temporarily default to the author dates until the book date can be refined. If there is no known author date I at least have a date range, hopefully to century or better.

Author dates are a much smaller data set, can be generally supplemented from public marc records (viaf, loc, etc - I don't do that, but it's an option) and at least provide basic filtering / sorting.
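The fallback described above might look something like this in Python. It is a sketch under assumptions, not the code behind catholiclibrary.org: the `Work` fields are hypothetical, and using the start of the author's dates (or of the coarse range) as the stand-in year is an arbitrary choice made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Work:
    title: str
    book_year: Optional[int] = None                   # refined publication year
    author_years: Optional[Tuple[int, int]] = None    # (birth, death)
    fallback_range: Optional[Tuple[int, int]] = None  # e.g. a century


def sort_year(work: Work) -> int:
    """Best available year: book date, else author dates, else coarse range."""
    if work.book_year is not None:
        return work.book_year
    if work.author_years is not None:
        return work.author_years[0]  # assumption: use the start of the range
    if work.fallback_range is not None:
        return work.fallback_range[0]
    raise ValueError(f"no date information for {work.title!r}")


# Placeholder works with invented dates.
works = [
    Work("A", book_year=1850),
    Work("B", author_years=(1800, 1870)),
    Work("C", fallback_range=(1600, 1699)),
]
ordered = sorted(works, key=sort_year)
```

Because every work resolves to some year, sorting and filtering always work, and the value simply gets more precise as book dates are filled in.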

Falimonda · 2026-05-15T16:27:11 1778862431

The book list elements on front page render as both horizontally and vertically scrollable divs on mobile - seems like an opportunity for improvement.

Keep up the good work!

JSeiko · 2026-05-15T16:33:04 1778862784

good feedback thanks! Doing an iteration on the homepage design is actually pretty high on the priority list. will keep your feedback in mind!

Falimonda · 2026-05-15T22:05:53 1778882753

Any interest in offering PG as a multi-lingual web e-reader in any language?

I've since discontinued hosting it, but happy to add you all and merge into an official PG offering: https://www.reddit.com/r/SideProject/s/VtYKxjrMme

Falimonda · 2026-05-15T22:07:10 1778882830

More content visible on various videos I took and posted to X

https://x.com/abal_ai

xrd · 2026-05-15T16:49:17 1778863757

Thank you for your work. This site is an international treasure.

excitednumber · 2026-05-15T16:59:16 1778864356

Thank you for being one of the best places on the internet

windowliker · 2026-05-16T14:54:59 1778943299

FWIW I absolutely love how 'no-frills' PG is compared to so much of the bloated, over-engineered, script-riddled web these days. Please don't ever change that!

zamadatix · 2026-05-15T19:30:54 1778873454

Thanks for the free work! Project Gutenberg is nice to have :).

On the site I noticed the library boxes have roughly a single extra line causing a scrollbar to appear and the last line to be chopped off https://i.imgur.com/PQ8T0qc.png is there an issues/bug portal to properly submit these kinds of things?

JSeiko · 2026-05-15T20:27:33 1778876853

you can open an Issue at https://github.com/gutenbergtools/gutenbergsite

smallnix · 2026-05-15T17:04:32 1778864672

There's a minor bug with chrome in android where the menu will not close when you tap outside the menu or on the menu link/button

JSeiko · 2026-05-15T17:29:57 1778866197

I've messaged the guy who's best suited to fixing this. He'll be on it this weekend

smallnix · 2026-05-16T13:52:28 1778939548

Oh no. I did not want to cause someone to work on the weekend. I hope it's his hobby!

JSeiko · 2026-05-15T17:11:39 1778865099

will open an "Issue" for it

ExtremisAndy · 2026-05-15T16:56:32 1778864192

Oh, my! This does look nice. Thank you for your hard work!

JSeiko · 2026-05-15T17:00:27 1778864427

Thanks! We're currently working on a design update of the page of any specific book. Should be online soon (next 1-2 weeks or so)

freedomben · 2026-05-15T18:44:24 1778870664

I can't say for Project Gutenberg specifically, but in general a huge issue I see is OCR errors. What do you all do to address OCR?

gluejar · 2026-05-15T19:11:20 1778872280

Check out Distributed Proofreaders: https://pgdp.net

jfengel · 2026-05-15T21:27:50 1778880470

I didn't realize DP was still around. I used to do it quite a bit, 15 years ago, but OCR has improved considerably since then.

tangledhelix · 2026-05-16T15:45:00 1778946300

OCR has improved a lot since then, but OCR is just step 1 of reading in text. OCR engines still make a lot of errors (even now, especially on old, worn-out paper pages), and even if they didn't, one has to format the book, deal with footnotes, sidenotes, illustrations, etc. DP is very active, and we will welcome you back with open arms :)

lapetitejort · 2026-05-15T18:47:41 1778870861

I uploaded a PDF to archive.org that auto-OCRs with plenty of mistakes. I have found no way of updating the entire stack of documents produced. I wonder if Project Gutenberg is similar

shuvrojit · 2026-05-15T17:03:03 1778864583

Great Work. Thank you. I'm also a programmer. If you are ever short on help, let me know. I would love to contribute.

JSeiko · 2026-05-15T17:39:24 1778866764

https://github.com/gutenbergtools

autocat3 and gutenbergsite are repos responsible for generating gutenberg.org

Jiro · 2026-05-16T22:48:56 1778971736

I don't know what the status of this is today, but a number of years ago my biggest complaint about Gutenberg was that a lot of books had images added back when low-resolution images were the standard, so you have a ton of books with image resolutions from the year 2000.

TimorousBestie · 2026-05-15T17:55:01 1778867701

Wanna let you know you’re doing great work and you have my dream job, thanks to the team for everything!

JSeiko · 2026-05-15T18:09:14 1778868554

it's not my day job. PG is open-source. I'm "just" a contributor

TimorousBestie · 2026-05-15T18:16:01 1778868961

Oh, right. That makes sense.

8bitsrule · 2026-05-16T04:12:04 1778904724

Great project. Are many of the books in a format that can easily be converted into audio? Is there a way to search for them, and information on what software your readers find useful for this purpose?

(Note: A lot of print media these days has switched to far-too-small font sizes. Less of a problem for (zoomable) digital media, but for many that's still a barrier.)

tangledhelix · 2026-05-16T15:43:28 1778946208

There are many books available as audio, some are human-read, some were automated. You can see lists here:

human-read: https://www.gutenberg.org/browse/categories/1

computer-generated: https://www.gutenberg.org/browse/categories/2

IIRC many of the human-generated ones come from LibriVox, many of the computer-generated ones came from a collaboration with Microsoft.

OfflineSergio · 2026-05-16T13:57:55 1778939875

For the Audio part, I suggest https://desktop.with.audio

8bitsrule · 2026-05-17T02:08:07 1778983687

IMO, most audio read by humans (esp. voice actors) is far preferable to machine readings. Also, I found no demos on that page.

BiraIgnacio · 2026-05-15T17:23:50 1778865830

Thanks so much for the work you and your team do!

samwho · 2026-05-16T09:39:57 1778924397

Looking really good! Great work.

shevy-java · 2026-05-16T11:34:10 1778931250

There should be more books at Gutenberg.

Also by the way I just searched for 3d printing and found nothing. Either there are no books, or the search query makes things too complicated, IMO.

robin_reala · 2026-05-16T18:34:36 1778956476

Gutenberg is nearly all books that have lapsed into the US public domain by dint of being published 95+ years in the past. Which broadly explains why you hit nothing for 3d printing.

tangledhelix · 2026-05-16T19:56:50 1778961410

As another commenter said, PG is almost all books from 95+ years in the past due to copyright law in the US. We partner with a sister organization, the World Library Foundation, who have a self-publishing portal for modern works by authors who wish to put their own work in the public domain. You might want to look there for more modern material. https://self.gutenberg.org

samcollins · 2026-05-15T16:30:46 1778862646

Very cool! Do you have a recommended way for an agent to see an index of the books and epub links?

(I can’t quite tell if that’s an egregious abuse of the site or you’re perfectly fine to share without human eye balls hitting your www?)

jzs · 2026-05-15T16:40:01 1778863201

Now, I'm not associated with Gutenberg in any form, but they do have a page for offline consumption:

https://www.gutenberg.org/ebooks/offline_catalogs.html

Perhaps you can find the information you are looking for there.

However, if you plan on scraping or otherwise hitting them with a ton of traffic, at least consider donating a good amount for the traffic you cause them. It ain't free after all.

JSeiko · 2026-05-15T16:42:10 1778863330

Donations are always appreciated ;)

jimnotgym · 2026-05-16T10:11:05 1778926265

Presumably if you paid them enough money they would give you the books without you having to pay to scrape at all?

samcollins · 2026-05-15T17:10:09 1778865009

Thanks for the answers! Found it:

> All Project Gutenberg metadata are available digitally in the XML/RDF format. This is updated daily (other than the legacy format mentioned below). Please use one of these files as input to a database or other tools you may be developing, instead of crawling or roboting the website.

And strongly consider a donation! (My addition)

https://www.gutenberg.org/ebooks/offline_catalogs.html#the-p...
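For anyone who wants to turn those XML/RDF records into database rows, here is a minimal sketch using only the Python standard library. The namespaces and element layout in `SAMPLE` are my assumption about what a record looks like, heavily simplified and written by hand rather than copied from a real file, so check them against an actual download before relying on this. `epub_links` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces for the catalog records; verify against a real file.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Hand-written, simplified stand-in for one record (not real catalog data).
SAMPLE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/1342">
    <dcterms:title>Pride and Prejudice</dcterms:title>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="https://www.gutenberg.org/ebooks/1342.epub3.images">
        <dcterms:format><rdf:Description>
          <rdf:value>application/epub+zip</rdf:value>
        </rdf:Description></dcterms:format>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="https://www.gutenberg.org/ebooks/1342.txt.utf-8">
        <dcterms:format><rdf:Description>
          <rdf:value>text/plain; charset=utf-8</rdf:value>
        </rdf:Description></dcterms:format>
      </pgterms:file>
    </dcterms:hasFormat>
  </pgterms:ebook>
</rdf:RDF>
""".strip()


def epub_links(rdf_xml: str) -> list[tuple[str, str, str]]:
    """Return (book id, title, url) for every EPUB file listed in a record."""
    root = ET.fromstring(rdf_xml)
    about = f"{{{NS['rdf']}}}about"  # fully qualified rdf:about attribute name
    rows = []
    for ebook in root.findall("pgterms:ebook", NS):
        book_id = ebook.get(about, "").rsplit("/", 1)[-1]
        title = ebook.findtext("dcterms:title", default="", namespaces=NS)
        for file_el in ebook.findall("dcterms:hasFormat/pgterms:file", NS):
            mime = file_el.findtext(
                "dcterms:format/rdf:Description/rdf:value",
                default="",
                namespaces=NS,
            )
            # Match on the declared media type rather than guessing from the URL.
            if mime.startswith("application/epub+zip"):
                rows.append((book_id, title, file_el.get(about, "")))
    return rows


if __name__ == "__main__":
    for row in epub_links(SAMPLE):
        print(row)
```

Filtering on the declared media type instead of the URL text is deliberate, since a URL path can contain "epub" without the file being an EPUB.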

kay_o · 2026-05-15T16:34:46 1778862886

Check out https://www.gutenberg.org/ebooks/offline_catalogs.html

Don't hit the site with an agent. The section nearest the bottom of that page covers the machine-readable formats.

gluejar · 2026-05-15T17:57:53 1778867873

if what you want is all the text, please use the tarball or data files at https://www.gutenberg.org/cache/epub/feeds
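A rough sketch of reading such an archive without unpacking it to disk, assuming you have already downloaded one of the tar archives from that directory. The filename in the usage comment is a placeholder, not a real feed name, and `iter_rdf_records` is a hypothetical helper; the `.rdf` suffix filter is also an assumption about what the archive contains.

```python
import io
import tarfile
from typing import BinaryIO, Iterator


def iter_rdf_records(archive: BinaryIO) -> Iterator[tuple[str, bytes]]:
    """Yield (member name, raw bytes) for each .rdf file in a tar archive."""
    # "r:*" lets tarfile detect gzip/bzip2/xz compression on its own.
    with tarfile.open(fileobj=archive, mode="r:*") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".rdf")):
                continue
            extracted = tar.extractfile(member)
            if extracted is not None:
                yield member.name, extracted.read()


# Usage, with a placeholder path for a locally downloaded archive:
#
#     with open("downloaded-feed.tar.bz2", "rb") as fh:
#         for name, data in iter_rdf_records(fh):
#             print(name, len(data))
```

Downloading the archive once and iterating locally keeps all the load off the website, which is the point of the request above.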

JSeiko · 2026-05-15T16:35:11 1778862911

Not yet, but that's not a bad idea imo. Dealing with AI crawler traffic is definitely a challenge, if that's what you were referring to.

dredmorbius · 2026-05-15T23:40:12 1778888412

Possibly ZIMs are of interest: <https://ebookfoundation.org/openzim.html> (via: <https://news.ycombinator.com/item?id=48152200>).

ancientcatz · 2026-05-15T16:39:04 1778863144

OPDS?

gluejar · 2026-05-15T17:33:32 1778866412

OPDS 2.0 coming RSN. Email us if you want to test. OPDS 0.x is currently available (not recommended) by adding .opds to the end of a URL.