The Hard Economics of Selling Web Data (databoutique.com)
89 points by gsky on Oct 11, 2023 | hide | past | favorite | 33 comments


Also, purchased data can be low quality, of unknown correctness and completeness, and not include everything you need in its schema.

For one startup, which addressed brand-protection problems like counterfeiting and gray-market diversion, we needed to monitor online marketplaces over time.

Off-the-shelf scraping products we tried, which supposedly targeted one of the biggest marketplaces, were very poor, even when operated in their manually invoked modes (never mind scheduled).

I ended up building a scraper that was correct and reliable, and captured all the data that we needed. And cost around $5/month to run.

(This superior capability was not only useful in general for our services, but I understand it helped with cold enterprise sales approaches, from a tiny startup, e.g., maybe something like: "Here's an analysis of impact of X to your brand, based on data we monitor with some of our technology. Would love to discuss with you.")

Controlling and understanding the software also gave us the flexibility to add capability quickly, as soon as we had a need. For example, since the problems we were addressing were global, when we needed to see what each geopolitical instance of that marketplace was showing to shoppers in its own region, I was able to implement scraping for that right away.
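To give a sense of how lightweight such a scraper can be, here's a stripped-down sketch of the shape of the thing. Everything specific in it, the marketplace URL, the query parameter, and the CSS class names, is invented for illustration; it's not the actual system.

```python
from html.parser import HTMLParser
import urllib.parse
import urllib.request


class ListingParser(HTMLParser):
    """Collect text from elements whose class list includes 'listing-title'."""

    def __init__(self):
        super().__init__()
        self._capture = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if "listing-title" in classes.split():
            self._capture = True

    def handle_data(self, data):
        if self._capture and data.strip():
            self.titles.append(data.strip())
            self._capture = False


def fetch_listings(query, region="de"):
    # The geo-specific view is hinted at via Accept-Language here; in
    # practice you usually also need a proxy located in the target region.
    url = "https://marketplace.example/search?q=" + urllib.parse.quote(query)
    req = urllib.request.Request(url, headers={"Accept-Language": region})
    with urllib.request.urlopen(req) as resp:
        parser = ListingParser()
        parser.feed(resp.read().decode("utf-8", errors="replace"))
    return parser.titles
```

The stdlib-only approach is part of why running costs can stay in single-digit dollars per month: no headless browser fleet, just scheduled HTTP requests and a parser.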

This was remarkably stable/resilient over time, as well. The biggest risk I perceived was something I warned the team about: "The first rule of Scrape Club is: Don't Talk About Scrape Club." I didn't want to get on the radar of the marketplace's lawyers, given that they were profiting from counterfeits and gray market, and possibly have public information turn into something that burns up the remainder of our meager Covid-time runway on lawyer billable hours. (Having data get cut off due to lawyers also seems a risk if using a vendor for the data.)

Of course, there are often good reasons to buy data dumps/feeds. Just giving an example of when it wasn't the best choice.


Quality issues are to be expected, given the most effective countermeasure when you detect scraping is not to block the bot or dispatch the lawyers, but to poison the well with subtly incorrect responses.

If a stream of data is central to your business, you might catch on eventually, but if you're just selling it on, it will most likely take forever to detect this type of quality problem.


What's the point of poisoning the well, unless you plan on monetizing a clean flow?


If you block a scraper, they will find a way around it, and it's very obvious in the scraper logs that something is wrong. It's barely even an inconvenience: they're back in 15 minutes with a new set of IPs and a new configuration for their browsers, and you can even automate this stuff.
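The rotate-and-come-back loop really is that mechanical. A minimal sketch with just the stdlib (the proxy addresses are placeholders from the documentation IP range; a real setup would pull them from a rotating-proxy service):

```python
import itertools
import urllib.request

# Placeholder proxies (203.0.113.0/24 is reserved for documentation).
PROXIES = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]


def make_openers(proxies):
    """Build one urllib opener per proxy and cycle through them forever,
    so each request goes out through the next proxy in the pool."""
    openers = [
        urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": p, "https": p}))
        for p in proxies
    ]
    return itertools.cycle(openers)
```

Each call to `next()` on the returned iterator hands back the opener for the next proxy, wrapping around when the pool is exhausted; swapping in a fresh pool after a block is a one-line change.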

Poisoning the data is much more insidious, since it doesn't let the scrapers know that you know. It means you can work to undermine whatever purpose they have. If they try to sell your data, you'll ruin their reputation with their customers. If they use your data for decision making, you trick them into making bad calls.
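A minimal sketch of what the poisoning side can look like. The flagging logic is assumed to exist elsewhere; the deterministic-jitter scheme is just one plausible design, chosen so a flagged client that re-requests the same product gets the same wrong number and can't spot the tampering by comparing responses:

```python
import hashlib


def served_price(true_price_cents, product_id, client_id, flagged):
    """Return the price to serve. Flagged clients get a small,
    deterministic perturbation (within roughly +/-5%) keyed on the
    product and client, so the poisoned value is stable across requests."""
    if not flagged:
        return true_price_cents
    digest = hashlib.sha256(f"{product_id}:{client_id}".encode()).digest()
    jitter = (digest[0] / 255 - 0.5) * 0.10  # in [-0.05, +0.05]
    return round(true_price_cents * (1 + jitter))
```

Honest clients always see the true price; a flagged client sees a plausible-looking figure that quietly corrupts any dataset built from it.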


I get the effect of it, I just don't see the value in poisoning the data unless you plan on monetizing it in the near term. Maybe the scraping is imposing real operational costs on you, but then poisoning the data doesn't help.


A fairly common use case for scraping is to keep track of your competition. This creates a fog of war if you will.


  The biggest risk I perceived was something I warned the team about: "The first rule of Scrape Club is: Don't Talk About Scrape Club."
It seems someone broke this rule: https://www.google.com/search?q=the+web+scraping+club


My experience is that many people want to buy scraped data whose collection is against a large website's TOS (Google, Amazon, or Craigslist, say), and the act of selling that data directly in a marketplace is likely to draw legal action from those companies, as it's someone else's copyrighted data. Most companies that are happy for you to have their data provide an API, or would at least be willing to sell you the data or access to an API.

That's why people simply hire staff, rent proxies, and store the data in their own internal databases: the target companies then have to figure out who is behind the scraping and prove it, which makes the scrapers an unattractive legal target.

That being said, there are clear examples of companies with successful, lucrative business models selling processed data that's been worked over by data analysts rather than raw scraped data.


> My experience is that many people want to buy scraped data that is against a large website's TOS such as Google, Amazon or Craigslist and the act of selling that data directly in a marketplace is likely to draw legal action from these companies as it's someone else's copyrighted data.

I came here to make substantively the same point. Scraping for your own personal use, and re-use as a derived product, is reasonably cool but can edge into IPR concerns. Scraping to re-sell as a product, that's really stepping over a line.

This has been a problem since the days of the yellow-pages phone book. There are rules about when a catalog is in the public domain, but I don't think page contents with data cross into "it's OK to just copy and resell" territory.

I scrape several sources. World population and GDP stats are highly useful, but finding good canonical sources was hard for a while there. I now use a Python API provided by the World Bank for most of them. Now I just have the problem that, by economy, the inputs span 2019-2022, and I can't get a single authority for worldwide population estimates right "now". But scraping to resell that? I think it would be wrong.
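For anyone in the same spot: the World Bank data is also reachable without any package, straight from their public v2 REST API. A sketch, assuming the API's usual response shape of `[paging-metadata, records]` with records newest-first (`SP.POP.TOTL` is the total-population indicator; the `fetch` hook exists only so the parsing can be exercised offline):

```python
import json
import urllib.request

WB_URL = ("https://api.worldbank.org/v2/country/{country}"
          "/indicator/SP.POP.TOTL?format=json&per_page=100")


def latest_population(country_code, fetch=None):
    """Return (year, population) for the most recent non-null estimate."""
    fetch = fetch or (lambda url: urllib.request.urlopen(url).read())
    payload = json.loads(fetch(WB_URL.format(country=country_code)))
    # payload[0] is paging metadata; payload[1] is the record list,
    # ordered newest year first.
    for rec in payload[1]:
        if rec["value"] is not None:
            return int(rec["date"]), rec["value"]
    raise LookupError(f"no population data for {country_code!r}")
```

Skipping null values is what handles the "inputs span 2019-2022" problem: you get the freshest figure each economy actually has, at the cost of the years not lining up across countries.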


Scraping internet data is perfectly fine and legal. It may be subject to copyright or ToS (which are not the same) but as long as you don't redistribute copyrighted content, or affirmatively accept any ToS, you're doing nothing wrong. So the potential risk lies entirely with the seller, not the buyer, and could potentially be reduced or circumvented in various ways. Just saying


Yes, but the article was almost entirely about being that seller. The premise was "why don't more people scrape the web and sell it?" And if I were an IPR owner and detected a scraper, then tracked back from their origin IP robot to their web service, I'd be sending my legals round.

Offering a brokering service to "share" access to the investment in scraping has to be done carefully. If, for example, it's brokering access to the page-specific structure in some meta notation, and you as a consumer scrape directly? That's actually quite cool. If it's running a cache of the data re-presented in JSON form for you to consume, or monetising an Elastic stack of the data that you read from them? I think it's getting iffy to be that seller.


> Scraping internet data is perfectly fine and legal.

I doubt that is true for all jurisdictions


It all comes down to what data: PII (personally identifiable information) and IP (intellectual property) are strongly protected. There is a lot of useful info, neither PII nor IP, that can be collected, used, and sold, like prices (the Ryanair case against Expedia being a crucial one).


I'm somewhat ignorant on this matter, but I was under the impression that in some jurisdictions (e.g. the UK) scraping as a wholesale concept is a pretty grey area at best, not just on PII & IP.


Regarding the legality of scraping factual data...

In the EU there's a 'database right' that covers specific collections of facts. The US has categorically refused to issue any kind of "IP"[0] on facts, and doesn't have 'database rights', but in practice you can sprinkle a few fictitious bits of information in your dataset, retain copyright on the lies, and sue people who don't independently verify the factuality of every bit of information in the dataset. So the US has database rights.

No clue if people are sticking fictional data on their websites to sue scrapers with, but I wouldn't be surprised if they were.

[0] Intellectual property is the right to dictate to competitors the conditions upon which they are allowed to compete, if any.
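The "sprinkle a few fictitious bits" trick is easy to operationalize: seed a handful of invented records into the published dataset, then check any suspect dump for them. Everything below (the names and SKUs) is invented for illustration:

```python
# Fabricated records planted in the published dataset. Any dump that
# contains one of these could only have been copied, not independently
# collected.
CANARIES = {
    ("Example Widget 9000", "sku-zz-canary-1"),
    ("Nonexistent Gadget X", "sku-zz-canary-2"),
}


def contains_canaries(dump_rows):
    """dump_rows: iterable of (title, sku) tuples from a suspect dataset.
    Returns the planted records found in it, sorted for stable output."""
    rows = set(dump_rows)
    return sorted(c for c in CANARIES if c in rows)
```

This is the database equivalent of the cartographers' "trap streets": the facts themselves aren't protectable, but the fabrications travel with any wholesale copy.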


Doesn't Section 230 say that most internet orgs are just aggregators and not publishers? The copyright in this case lies with the content creators and not the orgs. I can see individuals might have a case to sue for copyright, but do these internet orgs have any legal leg to stand on?


Maybe copyright is not the correct law to cite, but Craigslist sued companies that scraped its data and provided it to a wider audience, like HotPads, and won.

I'm not a copyright lawyer and I don't know what the actual law against scraping is, but just pointing out that previous companies offering raw scraped data from a large platform have been sued out of business.


Not all content is covered by IP (intellectual property). Long texts (like an article), images, or videos are. But factual information (such as prices, a hot one in web scraping) is not.


> Long texts (like an article)

Short ones, too.


On the legal side, legislation in the US, UK, and EU is clear on PII (personally identifiable information) and IP (intellectual property). Web scraping can be perfectly legal when these elements are considered. Here's my article: https://blog.databoutique.com/p/web-scraping-legal-context


I've been working on a startup idea that involves a lot of scraping, and I've thought about a general data market before, but I came to the same conclusion: it is way too hard. I just use one of the many captcha-solving services available and then build a convincing product around it that can be sold using "value-based pricing".

All the big players in the "information" economy found other ways to make money than selling data directly, for example Google's attention market for online ads, or payment for order flow in trading.


The issue with general data markets is that they are too... generic. They can't address the quality-assurance problem, so as a buyer you'd end up doing your own due diligence for every merchant on the marketplace, as if you were to site-visit each Airbnb before booking a night there. A marketplace dedicated to web-scraped data might address this; a generic one surely can't.


Data can be a tricky business. One of our customers was really upset at Bloomberg because they offer bundles (like cable TV) when you might be interested in something more specific, and the data endpoint does not offer some columns that are valuable to them.

On the other hand, extracting data from a couple of web pages was more accurate, more reliable, and less costly for them, assuming you can nail down the scraping/storage/filtering part. That latter part is a real challenge when people need data they actually rely on in a business process to make real-world decisions, as opposed to "trends" or "insights" that are merely informative.


Similarweb do this really well. You should check them out.


So I have a niche market with messy data all around, and I was thinking of cleaning and packaging that data from multiple sources into one, then selling it to the sales managers or BI teams of the companies within the industry. But it’s just been an idea in my head.

Can anyone provide advice on how I should package and sell it, or on adding extra value?


Is it proprietary data or web-scraped? This might change the strategy. If proprietary, with enough history, and relevant for the market you cover (and this market has listed companies operating in it), you should consider selling it to hedge funds. There is an entire ecosystem for so-called "alternative data" you should look at. But the big money is in 1) proprietary data and 2) long historical trends.


I’m interested in hearing more about this. It might not pertain to the data I’m thinking of specifically, as it would be scraped, but finding out where the holes are and creating proprietary data around them would be a consideration. However, I’m interested in diving deeper into alternative data if you have some solid sources I can brush up on.


It all depends on where/who you are, but I have thought about trying to get things listed on the Snowflake Marketplace https://other-docs.snowflake.com/en/collaboration/provider-b...


Thank you for this!


Very hard to give any advice here without knowing what this data is. Is it customer lists of technologies or products? Is it contact data? (Before you say you’re afraid to give your secret sauce away: nobody cares or has the time to implement your idea.)


It’s less about the data specifically and more about common add-ons or ways to add value, knowing my target audience would be sales managers or key decision makers on the sales side of the industry. Obviously fancy dashboards are nice, and historical data is a want, but are there any other general add-ons one could offer?


I was expecting company data but e-commerce data also makes sense.

Do you have a sense of how people store, transform, organise, and model the data once they have bought it? Is it typically a data warehouse and dbt, straight into a data-science pricing algorithm, or something else entirely?




