It turns out that with proper indexing and partitioning, it can, even when the index is too big to fit in memory (RAM). To test this, I started with a BigQuery table with 2B rows, 172 columns, 26 partitions, and 5 TB logical size. This table was copied to Postgres using Dataflow; the copy job itself took 19+ hrs. Here are the numbers comparing Postgres query performance to BigQuery. I used Postgres 14 running on Cloud SQL in GCP (8 vCPU, 32 GB RAM):
| | BigQuery | Postgres (full table, all 172 cols) | Postgres (slimmed table, 7 cols) |
| --- | --- | --- | --- |
| Partitioned but not clustered or indexed | 850 ms | 94,141 ms (1 min 34 s) | 13,950 ms |
| Partitioned + clustered (BQ) or indexed (Pg) | 560 ms | 98 ms | 101 ms |
Take a moment to appreciate these results and let the numbers sink in. With partitioning alone, we improve performance by approximately 26x, which equals the number of partitions: the query only has to look at data inside the relevant partition. With proper indexing on top of partitioning, we speed up the query (middle column) by roughly 1000x (94,141 ms down to 98 ms), three orders of magnitude. The total improvement is about 26 × 1,000 = 26,000x.
Of course, all this was possible only because we were able to create an appropriate index for the query being performed here. I won't dwell on the query itself, since the only relevant thing is that it could be served from the index. Here it is anyway (it is basically a lookup query):
```sql
SELECT
  SUM(sales) AS total_sales
FROM sales_table
WHERE dealer = 'michael''s toyota of seattle'
  AND product = 'toyota corolla'
  AND quarter = 'Q3FY2019';
```
The sales_table is partitioned on quarter and indexed on (quarter, dealer, product).
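For completeness, here is a minimal sketch of what that setup could look like. The actual schema isn't shown in this post, so the column types, the list-partitioning choice, and the partition/index names below are all assumptions:

```sql
-- Hypothetical DDL; assumes list partitioning on the quarter column.
CREATE TABLE sales_table (
    quarter text NOT NULL,
    dealer  text,
    product text,
    sales   numeric
    -- ... plus the remaining columns ...
) PARTITION BY LIST (quarter);

-- One partition per quarter (26 of them in the experiment above).
CREATE TABLE sales_table_q3fy2019
    PARTITION OF sales_table FOR VALUES IN ('Q3FY2019');

-- Since Postgres 11, an index created on the partitioned parent
-- is automatically created on every partition.
CREATE INDEX sales_qdp_idx ON sales_table (quarter, dealer, product);
```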
In some ways, by creating an index, we precompute the answer to a query. (The values of the query parameters can change from time to time, like the arguments to a function, but the shape of the query is fixed; the more accurate statement is that we create a data structure for fast lookups.) But this is possible only if you know what query you will be hit with. If you don't know a priori the query that's going to come your way (I take this as the definition of an ad-hoc query), then you are out of luck on Postgres, since you don't know what index to create, and that's why we have databases like BigQuery.
There are no indexes in BigQuery for good reason: it is designed for ad-hoc queries. The only "index" is the order in which data is stored in the database, a.k.a. clustering. BigQuery achieves its performance via massive parallelism. Although the difference of 850 ms vs. 560 ms above might make it appear that BigQuery does not benefit much from clustering, that conclusion would be wrong. Behind the scenes, BigQuery parallelized the job across 1,000 workers when the data was not clustered vs. 387 when it was clustered, and it had to process 4,805,618,480 bytes in the unclustered case vs. 14,828,424 bytes when the table was clustered.
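For reference, clustering in BigQuery is declared when the table is created. A minimal sketch follows, with the caveat that BigQuery only partitions on date/timestamp/integer columns, so I am assuming a hypothetical sale_date column rather than the quarter string used above; the dataset and source table names are stand-ins too:

```sql
-- Hypothetical BigQuery DDL; not the actual tables from the experiment.
CREATE TABLE mydataset.sales_table
PARTITION BY DATE_TRUNC(sale_date, MONTH)
CLUSTER BY dealer, product
AS
SELECT * FROM mydataset.sales_table_raw;
```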
Before you get excited about indexing in Postgres, note that indexes don't come for free. They are additional data structures (B-trees in this case) that have to be updated every time records are inserted, updated, or deleted. So be very careful not to go overboard with index creation. What is a good rule of thumb for the maximum number of indexes you should have? I don't know, but I feel it should definitely be in the single digits.
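If you are worried you already have too many, Postgres tracks index usage in the pg_stat_user_indexes view. A quick way to spot large indexes that are never scanned (the aliases are mine):

```sql
-- Large indexes with zero scans are candidates for dropping:
-- they cost disk space and write overhead but serve no queries.
SELECT relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
```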
What about the effect of the number of columns on Postgres query performance? For a column store like BigQuery, it doesn't matter how many columns the table has. What about Postgres? The results are summarized below:
| | Full table (172 columns) | Slimmed table (7 columns) |
| --- | --- | --- |
| Time to load from BigQuery to Postgres | 19 hr 42 min | not available |
| Table size w/o index | 1,462 GB | 152 GB |
| Query time w/o index | 94,141 ms | 13,950 ms |
| Time to build index | 3 hr 1 min 37 s | 2 hr 1 min 22 s |
| Index size | 93 GB | 93 GB |
| Query time w/ index | 98 ms | 101 ms |
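For reference, the slimmed table could be built with something like the following sketch. Only four of the seven columns are named in this post, so the rest stays elided, and the slim table would presumably be partitioned and indexed the same way as the full one:

```sql
-- Hypothetical; keeps only the columns the lookup query touches.
CREATE TABLE sales_table_slim AS
SELECT quarter, dealer, product, sales  -- ... plus 3 more columns ...
FROM sales_table;
```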
These results were a bit unexpected for me, as I was hoping to see a big difference in the last row as well. What is reassuring is that even when the index size (93 GB) was clearly beyond the machine's RAM (32 GB), indexing still helped: it is a B-tree on disk, and a lookup only has to read the handful of pages on the path from the root to a leaf, not the whole index.
To wrap up: when Postgres is given a query, the execution planner checks whether the query can be served from an index (a B-tree is the most common type). If yes, it uses the index, which gives O(log n) performance, where n is the number of rows in the table. Think of using the index as doing the binary search you learned in school: you can do it only if the data is sorted according to the key you are searching for. If the data is sorted by date of birth but you are querying by name, it does not help. If no suitable index is found, the database has to do a full table scan, which gives O(n) performance.
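You can watch the planner make this choice with EXPLAIN. A sketch against the table above; with the (quarter, dealer, product) index in place, the plan should show an index scan on a single partition rather than a sequential scan over everything:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM(sales)
FROM sales_table
WHERE quarter = 'Q3FY2019'
  AND dealer  = 'michael''s toyota of seattle'
  AND product = 'toyota corolla';
```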
What about MySQL? Try it out and let me know.