Busting myths about BigQuery

tzury · on Dec 23, 2017

We at Reblaze [0] push to BQ nearly 3.5 Billion records every day.

We use it as our infrastructure for big-data analysis and we get the results at the speed of light.

Every 60 seconds, there are nearly 1K queries that analyze users and IPs behavior.

Our entire ML is based on it, and there are half a dozen of applications in our platform that make use of BQ - from detecting scrapers, analyzing user input, defeating DDoS and more.

The built-in (with legacy sql variant [1]) Math and Window functions are super handy, the easy DataLab [2] integration, and the last but not least, Google DataStudio[3] that let us generate interactive reports in literally minutes are all making our choice (3+ years a great one).

BQ replaced 2,160,000 monthly cores (250 instances * 12 cores each always on) and petabytes of storage and cost us several grands per month.

This perhaps one of the greatest hidden gems available today in the cloud sphere and I recommend everyone to give it a try.

A very good friend of mine replaced MixPanel with BigQuery and saved nearly a quarter of a million dollars a year since [4].

--

[0] https://blog.reblaze.com/how-to-stop-a-ddos-attack-in-under-...

[1] https://cloud.google.com/bigquery/docs/reference/legacy-sql

[2] https://cloud.google.com/datalab/

[3] https://datastudio.google.com

[4] https://blog.doit-intl.com/replacing-mixpanel-with-bigquery-...

pentium10 · on Dec 23, 2017

We are using BQ for 4 years now. Since the beginning with dual-writing to diferent engines. We did an architecture that routes data to Redis, Riak, Elasticsearch, Mysql, BigQuery. We often have 500+ lines of queries in BigQuery writen in standard SQL that runs our reporting system. The average query time we have is 10 seconds usually scanning at least a terabyte. We truly love the ability to write in Javascript the UDFs, and our marketing team loves to combine their Google Sheets (mapping data) with a real BQ query. We heavily use the 'dryrun" feature of BQ to see without running the query how much resources and costs will generate. We even implemented a quota for users, based on this feature, wich works great, and only when they do a mistake in a query targetting alerts them at a certain treshold. Afer a couple of years we naturally started to consolidate all our data into BigQuery. Since it's Standard SQL all our team members can start using and reporting + tooling works quite well. DataStudio is another service which we started to use more and more as well.

indogooner · on Dec 23, 2017

How do you handle updates in BigQuery in case your queries require latest version of records? Do you use views? If yes does that affect the query latency too much?

pentium10 · on Dec 23, 2017

During the 4 years we at REEA.net tried a lot of approaches, in different projects. 1. We stream every upate as a new row directly to the table. We have for every table a view that has a naming convention "table_latest" and it targets latest version of the record. We didn't see a big latency increase. Deleted rows have a 'deleted' flag, we use EXCEPT in the query. 2. Since DML is available we started on newer projects dual aproach, having a dedicated streaming table, and rematerializing to a flat table. It adds developer complexity but it's a win, as the materialized table is easier to be used by our marketing team. We added to the reports "Based on data 10 minutes ago" label, and a CTA "re-run using live data", which in background calls the script to rematerialize the table and rerun the report. Our markting team loves this, and they don't abuse this functionality. On some projcts we have remateralization every 1h, on others daily. 3. regarding partitioned tables where you can stream only 7 days of window into their respective partitions we use also a complementary load job. We stream in the 7 day window. And we have a heavy "future date" use case, where we use "load jobs" to a temp table, then query and write to their respective partitions.

bzillins · on Dec 23, 2017

As long as you turn a PG table into a BQ dataset with enough partitions (BQ tables) such that queries do not touch a disproportionate amount of unnecessary data, you utilize date partitioning, and you can deal with a minimum 1.5 second query time the service is truly amazing. My only frustrations are that the majority of streaming service degradations are never noted on the Status dashboard and that you can get really odd error states like headers stating that you should be getting json while the body contains a html error message.

scrollaway · on Dec 23, 2017

Tangential: Anyone here used both BigQuery and Redshift?

Here's the primary reason that is currently keeping me on Redshift:

Redshift is basically Postgres at the query layer which is insanely cool: All the Postgres tooling works with it. All the Postgres expertise I have carries over. It feels like a lot less of a "lock in".

Anyone care to chime in?

nothrabannosir · on Dec 23, 2017

I have used both. My biggest aha moment with BQ was realising it isn’t “sql to query a database”, but “map reduce as a sql”. Redshift really feels like a DB: I’m always worrying about the number of the runtime complexity of my sql, the number of nodes in my fleet, how my joins work, it just feels like a big big DB. BQ is less like that. I’m thinking in terms of memory complexity of my operation : N, log N, etc. Always having as many nodes available as it can be parallelised to. You’re writing a map reduce job on your data, and it’s... it feels like the more appropriate boundary for abstraction.

This is a bit hand wavy, but there is really no better way to get a feel for it than to try it. And think of it less as a db, and more as a map reduce cluster :) if that helps.

barrkel · on Dec 23, 2017

Your comment is a little bit perplexing to me because I don't really recognize much difference between composing relational operations and map / reduce operations. SQL `select` is map, `group by` is reduce, `where` is a flatMap to zero or one items, and `join` is a kind of specialized flatMap to zero or more items. You have to consider IO quantity and throughput (both disk and memory) to get decent performance either way.

For example, if you're writing straight-up SQL, the fewer columns you project out before you start sorting the better off you are, it's better to join stuff after you've done your sort than the reverse because it means physically less data shuffling.

nothrabannosir · on Dec 23, 2017

In redshift, I control the size of nodes. That means I can brute force my way through an expensive operation by increasing the memory capacity of the nodes. In BigQuery, you play by the rules or you’re out. It forces you to think in map reduce, because there really is nothing else under the hood.

Also, you get immediate parallelisation across O(n) nodes. Again; not in redshift.

These are crucially different to regular DBs. They’re both semantically sql, nobody denies that, but they describe different underlying models.

E.g.: in BigQuery, you can’t sort your entire column, even if you do other stuff afterwards. That makes sense in a map reduce system, but not in a “normal” DB.

barrkel · on Dec 23, 2017

Thanks for the clarification.

vgt · on Dec 23, 2017

Here's a good presentation by Jordan, the BigQuery Eng lead that describes what you're talking about:

https://www.youtube.com/watch?v=UueWySREWvk

jacksmith21006 · on Dec 24, 2017

Thanks for sharing a link to the presentation. Is is very good.

RobGoretsky · on Dec 23, 2017

I'd argue that AWS's more directly comparable service to BQ is Athena.. Athena is AWS's managed Presto service, but it's positioned as 'serverless' as you don't need to plan capacity in advance, and very 'coincidentally' priced at the exact same $5/TB that BQ is priced at. Has anyone compared Athena vs BQ performance? I work at an AWS shop so we've been kicking the tires with Athena and thus far are pretty impressed, but I do wonder if the exact same workload would perform better on BQ, which has been around much longer and is truly built from scratch for running as a 'serverless' scalable query engine.

sturgill · on Dec 23, 2017

We dual write event data to BigQuery and Redshift. BigQuery is incredibly cheap at the storage level, but can be pricey to run queries against large datasets. Personally I am not a fan of paying per-query.

Historically BigQuery data was immutable. You could delete a table, but you couldn't modify or delete a single row. That support was added recently, but there are daily limits imposed on those actions.

I'm a big fan of immutable data, up until you find that someone made a mistake somewhere. Naturally you can make modifications in ETL, but if you do that long enough your ETL job is a mess of time-based if statements.

We originally moved to also support Redshift because of BigQuery not supporting row level updates at that time. Someone on the content site would forget to put an & between URL params and our source data would be busted. Like I said, immutable data is great as long as you never make mistakes...

I'm also not a huge fan of nested records. If you use BigQuery, do yourself a favor and make sure you have a uuid field (it's a good practice any way). When you get data back from the client it can be a pain to piece records back together (nested records come back as single rows for the most part). This makes certain queries really fast, but it can be a pain to work with in aggregate.

My preferred setup these days is to dual write to Redshift and BigQuery. We run almost all analysis off of Redshift, but have a natural backup of data stored elsewhere. And like I said, BigQuery storage rates are dirt cheap. And if our Redshift cluster goes down for any reason, we can hop over to BigQuery to find what we need (and backfill any missing data points).

scrollaway · on Dec 23, 2017

Wait, what's the advantage of dual-writing? My understanding is that there's a lot of cases where BigQuery is significantly cheaper, but if you're paying for Redshift, there's really nothing Redshift can't do that BigQuery can.

Lutin · on Dec 23, 2017

We dual-write mostly to be able to offload certain query patterns, like joining large tables on non distkeys. Redshift does much better than BQ joins as long as they're on the same DISTKEY or if you can afford to make the table DISTSTYLE ALL, but if you can't, having to shuffle really impacts not only the runtime of that specific query, but also any other running queries. We have a job that took 1 hour on a 28 node dc2.8xlarge cluster, but only 4 minutes on BigQuery with basically no rewriting of the query, and saw a reduction in runtimes of concurrent jobs as well.

sturgill · on Dec 23, 2017

Redundancy across providers. BigQuery storage costs very little. It was implemented first and we introduced Redshift later to deal with data issues we didn't want to handle long-term in ETL. It didn't make sense to kill off the bigquery integration and I've decided for $20/mo or so that the backup storage on a separate provider is worth it.

Redshift and BigQuery are both solid services. And BigQuery has gotten better with time. But I still have some reservations with it as a pure Redshift replacement.

scrollaway · on Dec 23, 2017

Thanks for the details! Could I ask you about your query patterns and the kind of data loads you're pushing into those?

sturgill · on Dec 24, 2017

Most of our use case is storing raw event data: user clicked on X; interacted with feature Y; was bucketed into test group Z; audit logs for subscription data points (joined, upgraded, downgraded, paid, payment denied, etc). Basically everything we care about, but that doesn't matter as a core concern of providing the service.

The main app needs to know if a user has access to entitlement A, but the history of that process is outside the concern of the app itself.

We throw everything into BQ/Redshift. Then when we find a need for it, we bring it into a separate reporting dataset in Redshift via ETL.

The nice part about this is you can join your sanitized ETL data against raw event data if you need to run a once off query like "for everyone put into variant A of experiment B, how many converted from a free trial to a paid subscription?" I haven't found a great way to programmatically recreate ETL data into BQ without recreating a new dataset every day (because of the history of immutable data). It's not that it's impossible, it's just not as great (imho) as an upsert type workflow.

Occasionally we'll have a once off query we access by hand, but that's relatively rare at this point. Jobs run around 1am and everything is ready to go when the team gets in in the morning.

You have the odd job where you have to manage distkeys, and that's rarely fun, but it's a trade off we're comfortable making.

And, while I'm pimping other people's services, Redshift + Looker is incredible.

manigandham · on Dec 23, 2017

It’s still SQL at the end of the day so not a big deal, BigQuery has a newer Standard SQL dialect now too.

It is nice to be able to reuse existing tooling for redshift but we’re willing to make the trade for bigquery since the no-ops model is so much nicer. It’s basically magical how well it works.

elvinyung · on Dec 23, 2017

I don't know why more things aren't PostgreSQL protocol compatible. It seems like a relatively easy way to make a system accessible to a huge ecosystem.

CockroachDB is the only other well-known project I know of that does this (but of course, it's not really an analytic database right now).

zerop · on Dec 23, 2017

Not sure if this relevant to what you want:

Analyzing AWS Detailed Billing Reports Using BigQuery and ReDash

https://blog.powerupcloud.com/analyzing-aws-detailed-billing...

nerdponx · on Dec 23, 2017

Is Azure SQL Data Warehouse comparable to either of these? It's distributed on the backend, and more or less standard T-SQL on the front end.

IU84 · on Dec 24, 2017

I guess - if you consider 30 year old legacy technologies insanely cool... Redshift is a shared-nothing architecture, so you'll always face scalability and concurrency issues. Snowflake is a brand new MPP shared-data architecture, and it can scale compute and storage separately. Blows Redshift away.

djhworld · on Dec 23, 2017

If you rely on JDBC support, BQ has a JDBC driver but it's written by a third party

lima · on Dec 23, 2017

I recently discovered ClickHouse[1], Yandex's recently open sourced BigQuery equivalent (someone on HN pointed me to it!).

If you're looking for a OLAP database running on your own infrastructure, make sure to give it a try.

[1]: https://clickhouse.yandex/

nemothekid · on Dec 23, 2017

Anything you run on your own infrastrucutre I wouldn't really call a BigQuery equivalent. As far as performance goes, if you attacked the problem with the same amount of CPU cores, I think you could get similar performance with Presto/Dremel on top of any fast data store.

What I think is different about BigQuery is just the on-demand parallelism. This means you pay per query instead of per instance which is confusing at first, but then a lot of the operational knobs simply go away (for better or for worse).

tzury · on Dec 23, 2017

Thanks for the link. Seems interesting. BQ is a serverless though.

lima · on Dec 24, 2017

Serverless is just another word for "third party service".

vgt · on Dec 24, 2017

I respectfully disagree. While what "serverless" truly means varies depending on who you ask, in this context I think it's:

- Multi-tenant architecture.

- Pay-per-job/query.

- Complete abstraction away of underlying resources.

- High availability out of the box.

- Automatic seamless scaling on-demand.

So services like BigQuery qualify, and Redshift don't.

(work at G)

hbarka · on Dec 24, 2017

Inside Google, isn’t F1 and Plx preferred over BQ?

ddorian43 · on Dec 24, 2017

See "oltp vs olap"

elvinyung · on Dec 26, 2017

F1 isn't strictly OLTP though. Even the paper says it:

> F1 also provides a full-fledged SQL interface, which is used for low-latency OLTP queries, large OLAP queries, and everything in between.

ddorian43 · on Dec 26, 2017

ok see "row store vs columnar store"

elvinyung · on Dec 26, 2017

F1 does not itself have a storage engine, and can use a variety of them. Spanner is actually now columnar [1] in a way that's similar to Parquet:

> Ressi stores a database as an LSM tree, whose layers are periodically compacted. Within each layer, Ressi organizes data into blocks in row-major order, but lays out the data within a block in column-major order (essentially, the PAX layout).

[1] https://static.googleusercontent.com/media/research.google.c...

jnordwick · on Dec 23, 2017

So Google ads are now HN material?

This is literally just Google putting out a press release touting one of their products. It isn't unbiased or doesn't attempt to be complete, but people here seem to think Google's opinion on their own technology is beyond reproach.

Think of something like this being on Microsoft or Oracle website. No way it would be voted up.

bloudermilk · on Dec 23, 2017

It’s a shame you’re getting downvoted, since you’re right about the content. The list of “truths” isn’t anything more than a feature list. I got to #4 before coming to the comments, where I found the top three comments were all well-written positive reviews of the product. The whole submission screams of a well-executed marketing play.

fhoffa · on Dec 24, 2017

Oh hi! I work for Google. I browse Hacker News in my free time - like now, from a plane.

Conspiracy fodder: how smart (or not) would a team need to be to plan a "well-executed marketing play" for an enterprise database warehouse -- in the morning of freakin Saturday December 23rd?

http://images5.fanpop.com/image/photos/31100000/Classic-Patr...

bloudermilk · on Dec 29, 2017

Respectfully, I’m not seeing how your employment at Google or your being on a plane is relevant to my comment. If you had the experience I do working at marketing agencies for brands such as Google, I suspect you wouldn’t find my theory near conspiracy territory. I concede that such a play likely wouldn’t be considered well-executed given the timing, but it’s entirely within the realm of possibility.

NicoJuicy · on Dec 24, 2017

Screaming that it's marketing and adding nothing ontopic is a good way to get downvoted.

bdcravens · on Dec 24, 2017

How is that any different than a blog post from Jeff Barr?

jwswj · on Dec 24, 2017

Totally ageee.