Can Postgres scale to billions of rows and TB of data?

It turns out that with proper indexing and partitioning, it can! Even if the index is so big that it cannot fit in memory (RAM). To test this, I started with a table with 2B rows, 172 columns, 26 partitions and 5 TB logical size in BigQuery. This table was copied to Postgres using Dataflow; the copy job itself took 19+ hrs. Here are the numbers comparing Postgres query performance to BigQuery. I used Postgres 14 running as Cloud SQL in GCP (8 vCPU, 32 GB RAM):

| | BigQuery | Postgres (full table w/ all 172 cols) | Postgres (slimmed table w/ 7 cols) |
|---|---|---|---|
| Partitioned but not clustered or indexed | 850 ms | 94,141 ms (1 min 34 s) | 13,950 ms |
| Partitioned + clustered (BQ) or indexed (Pg) | 560 ms | 98 ms | 101 ms |

Take a moment to appreciate these results and let the numbers sink in. With partitioning, we are able to improve performance by approx. 26x (equal to the # of partitions, since the query only has to look at data inside the relevant partition now). With proper indexing on top of partitioning, we have been able to speed up the query (middle column) by roughly 1,000x (94,141 ms ÷ 98 ms ≈ 960) – 3 orders of magnitude. The total improvement is about 26,000x.

Of course, all this was possible only because we were able to create an appropriate index that can handle the query being performed here. I don’t show the query because it is not all that relevant; the only thing that matters is that the query could be served from the index described below. Here it is anyway (it is basically a lookup query):

SELECT
  SUM(sales) AS total_sales
FROM sales_table
WHERE dealer = 'michael''s toyota of seattle'
  AND product = 'toyota corolla'
  AND quarter = 'Q3FY2019';

The sales_table is partitioned on quarter and indexed on (quarter, dealer, product); a sketch of the corresponding DDL is below.
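For concreteness, here is a minimal sketch of how such a table could be declared in Postgres. The column types, the choice of list partitioning, and the partition names are my assumptions; only the partition key and index columns come from the description above:

CREATE TABLE sales_table (
    quarter text NOT NULL,
    dealer  text,
    product text,
    sales   numeric
    -- ... plus the remaining columns of the 172-column table
) PARTITION BY LIST (quarter);

-- One partition per quarter (26 in total); partition names are hypothetical:
CREATE TABLE sales_table_q3fy2019
    PARTITION OF sales_table FOR VALUES IN ('Q3FY2019');

-- Since Postgres 11, creating an index on the parent table automatically
-- creates a matching index on every partition:
CREATE INDEX ON sales_table (quarter, dealer, product);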

In some ways, by creating an index, we precompute the answer to a query (the values of the query parameters can change from time to time, like the arguments to a function, but the query shape is fixed; the more accurate statement is that we create a data structure for fast lookups). But this is possible only if you know what query you will be hit with. If you don’t know a priori the query that’s going to come your way (I take this as the definition of an ad-hoc query), then you will be out of luck on Postgres (since you don’t know what index to create), and that’s why we have databases like BigQuery.
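To illustrate, here is a hypothetical ad-hoc query that the (quarter, dealer, product) index cannot serve (the region column is made up for illustration):

SELECT
  SUM(sales) AS total_sales
FROM sales_table
WHERE region = 'pacific northwest';  -- no index covers region: full table scan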

There are no indexes in BigQuery, for good reason: it is designed for ad-hoc queries. There is only one “index” – the order in which data is stored in the database – a.k.a. clustering. BigQuery achieves its performance via massive parallelism. Although the difference of 850 ms vs. 560 ms above might make it appear that BigQuery does not benefit much from clustering, that conclusion would be wrong. Behind the scenes, BigQuery parallelized the job across 1,000 workers when the data was not clustered vs. 387 when it was. And it had to process 4,805,618,480 bytes of data in the unclustered case vs. 14,828,424 bytes when the table was clustered.
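In BigQuery, clustering is declared when the table is created. A hedged sketch, not the actual DDL used here (BigQuery partitions on a date/timestamp/integer column, so the sale_date column and dataset names below are assumptions):

CREATE TABLE my_dataset.sales_table
PARTITION BY sale_date           -- daily partitions on a hypothetical DATE column
CLUSTER BY dealer, product       -- physically sort the data within each partition
AS SELECT * FROM my_dataset.sales_table_raw;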

Before you get excited about indexing in Postgres, note that indexes don’t come for free. They are additional data structures (B-tree in this case) that have to be constantly updated when new records are inserted or old ones are updated or deleted. So be very careful not to go overboard with index creation. What is a good rule of thumb on max. # of indexes you should have? I don’t know. But I feel it should definitely be in single digits.
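As a sanity check, you can see how many indexes a table has already accumulated via the pg_indexes system view (table name here matches the example above):

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'sales_table';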

What about the effect of the # of columns on Postgres query performance? For a column store like BigQuery, it doesn’t matter how many columns there are in the table. What about Postgres, a row store? The results are summarized below:

| | Full table with 172 columns | Slimmed table with 7 columns |
|---|---|---|
| Time to load from BigQuery to Postgres | 19 hr 42 min | not available |
| Table size w/o index | 1,462 GB | 152 GB |
| Query time w/o index | 94,141 ms | 13,950 ms |
| Time to build index | 03:01:37 (~3 hr) | 02:01:22 (~2 hr) |
| Index size | 93 GB | 93 GB |
| Query time w/ index | 98 ms | 101 ms |

These results were a bit unexpected for me, as I was hoping to see a big difference in the last row as well. What is reassuring is that even when the index size (93 GB) was clearly out of bounds of the machine’s RAM (32 GB), indexing still helped. The index is a B-tree on disk, and a B-tree over 2B rows is only a handful of levels deep (with a fanout of several hundred keys per page, roughly 4 levels), so a lookup costs only a few page reads no matter how large the index file is.

To wrap up: when Postgres is given a query, the execution planner checks if the query can be served from an index (a B-tree is the most common type of index). If yes, it uses the index, which gives O(log n) performance (n is the # of rows in the table). Think of using the index as doing the binary search you learned in school: you can do it only if the data is sorted according to the key you are searching for – if the data is sorted by date of birth but you are querying by name, it does not help. If no suitable index is found, the database has to do a full table scan, which gives O(n) performance.
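You can see which path the planner picks with EXPLAIN. A sketch against the table above (the exact plan output will vary):

EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM(sales)
FROM sales_table
WHERE dealer = 'michael''s toyota of seattle'
  AND product = 'toyota corolla'
  AND quarter = 'Q3FY2019';
-- With the (quarter, dealer, product) index in place, the plan shows an
-- Index Scan (or Bitmap Index Scan) on the matching partition; without it,
-- a (Parallel) Seq Scan over the whole partition.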

What about MySQL? Try it out and let me know.
