Can Postgres scale to billions of rows and TB of data?

It turns out that with proper indexing and partitioning, it can! Even if the index is so big that it cannot fit in memory (RAM). To test this, I started with a table with 2B rows, 172 columns, 26 partitions and 5 TB logical size in BigQuery. This table was copied to Postgres using Dataflow; the copy job itself took 19+ hrs. Here are the numbers comparing Postgres query performance to BigQuery. I used Postgres 14 running as Cloud SQL in GCP (8 vCPU, 32 GB RAM):

| | BigQuery | Postgres (full table w/ all 172 cols) | Postgres (slimmed table w/ 7 cols) |
|---|---|---|---|
| Partitioned but not clustered or indexed | 850 ms | 94,141 ms (1 min 34 s) | 13,950 ms |
| Partitioned + clustered (BQ) or indexed (Pg) | 560 ms | 98 ms | 101 ms |

Take a moment to appreciate these results and let the numbers sink in. With partitioning, we are able to improve performance by approx. 26x (equal to the # of partitions, since the query only has to look at data inside the relevant partition now). With proper indexing on top of partitioning, we have been able to speed up the query (middle column) by roughly 1,000x (94,141 ms ÷ 98 ms ≈ 960) – 3 orders of magnitude. The total improvement is about 26,000x.

Of course, all this was possible only because we were able to create an appropriate index that can handle the query being performed here. I don’t show the query because it is not all that relevant; the only thing that matters is that the query could be served from the index described below. Here it is anyway (it is basically a lookup query):

SELECT
  SUM(sales) AS total_sales
FROM sales_table
WHERE dealer = 'michael''s toyota of seattle'
  AND product = 'toyota corolla'
  AND quarter = 'Q3FY2019';

The sales_table is partitioned on quarter and indexed on (quarter, dealer, product); a sketch of the corresponding DDL is below.
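For concreteness, here is a minimal sketch of how such a table could be declared in Postgres. The column types, the choice of list partitioning, and the partition names are my assumptions; only the partition key and index columns come from the description above:

CREATE TABLE sales_table (
    quarter text NOT NULL,
    dealer  text,
    product text,
    sales   numeric
    -- ... plus the remaining columns of the 172-column table
) PARTITION BY LIST (quarter);

-- One partition per quarter (26 in total); partition names are hypothetical:
CREATE TABLE sales_table_q3fy2019
    PARTITION OF sales_table FOR VALUES IN ('Q3FY2019');

-- Since Postgres 11, creating an index on the parent table automatically
-- creates a matching index on every partition:
CREATE INDEX ON sales_table (quarter, dealer, product);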

In some ways, by creating an index, we precompute the answer to a query (the values of the query parameters can change from time to time, like the arguments to a function, but the query shape is fixed; the more accurate statement is that we create a data structure for fast lookups). But this is possible only if you know what query you will be hit with. If you don’t know a priori the query that’s going to come your way (I take this as the definition of an ad-hoc query), then you will be out of luck on Postgres (since you don’t know what index to create), and that’s why we have databases like BigQuery.
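To illustrate, here is a hypothetical ad-hoc query that the (quarter, dealer, product) index cannot serve (the region column is made up for illustration):

SELECT
  SUM(sales) AS total_sales
FROM sales_table
WHERE region = 'pacific northwest';  -- no index covers region: full table scan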

There are no indexes in BigQuery, for good reason: it is designed for ad-hoc queries. There is only one “index” – the order in which data is stored in the database – a.k.a. clustering. BigQuery achieves its performance via massive parallelism. Although the difference of 850 ms vs. 560 ms above might make it appear that BigQuery does not benefit much from clustering, that conclusion would be wrong. Behind the scenes, BigQuery parallelized the job across 1,000 workers when the data was not clustered vs. 387 when it was. And it had to process 4,805,618,480 bytes of data in the unclustered case vs. 14,828,424 bytes when the table was clustered.
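In BigQuery, clustering is declared when the table is created. A hedged sketch, not the actual DDL used here (BigQuery partitions on a date/timestamp/integer column, so the sale_date column and dataset names below are assumptions):

CREATE TABLE my_dataset.sales_table
PARTITION BY sale_date           -- daily partitions on a hypothetical DATE column
CLUSTER BY dealer, product       -- physically sort the data within each partition
AS SELECT * FROM my_dataset.sales_table_raw;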

Before you get excited about indexing in Postgres, note that indexes don’t come for free. They are additional data structures (B-tree in this case) that have to be constantly updated when new records are inserted or old ones are updated or deleted. So be very careful not to go overboard with index creation. What is a good rule of thumb on max. # of indexes you should have? I don’t know. But I feel it should definitely be in single digits.
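As a sanity check, you can see how many indexes a table has already accumulated via the pg_indexes system view (table name here matches the example above):

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'sales_table';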

What about the effect of the # of columns on Postgres query performance? For a column store like BigQuery, it doesn’t matter how many columns there are in the table. What about Postgres, a row store? The results are summarized below:

| | Full table with 172 columns | Slimmed table with 7 columns |
|---|---|---|
| Time to load from BigQuery to Postgres | 19 hr 42 min | not available |
| Table size w/o index | 1,462 GB | 152 GB |
| Query time w/o index | 94,141 ms | 13,950 ms |
| Time to build index | 03:01:37 (~3 hr) | 02:01:22 (~2 hr) |
| Index size | 93 GB | 93 GB |
| Query time w/ index | 98 ms | 101 ms |

These results were a bit unexpected for me, as I was hoping to see a big difference in the last row as well. What is reassuring is that even when the index size (93 GB) was clearly out of bounds of the machine’s RAM (32 GB), indexing still helped. The index is a B-tree on disk, and a B-tree over 2B rows is only a handful of levels deep (with a fanout of several hundred keys per page, roughly 4 levels), so a lookup costs only a few page reads no matter how large the index file is.

To wrap up: when Postgres is given a query, the execution planner checks if the query can be served from an index (a B-tree is the most common type of index). If yes, it uses the index, which gives O(log n) performance (n is the # of rows in the table). Think of using the index as doing the binary search you learned in school: you can do it only if the data is sorted according to the key you are searching for – if the data is sorted by date of birth but you are querying by name, it does not help. If no suitable index is found, the database has to do a full table scan, which gives O(n) performance.
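You can see which path the planner picks with EXPLAIN. A sketch against the table above (the exact plan output will vary):

EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM(sales)
FROM sales_table
WHERE dealer = 'michael''s toyota of seattle'
  AND product = 'toyota corolla'
  AND quarter = 'Q3FY2019';
-- With the (quarter, dealer, product) index in place, the plan shows an
-- Index Scan (or Bitmap Index Scan) on the matching partition; without it,
-- a (Parallel) Seq Scan over the whole partition.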

What about MySQL? Try it out and let me know.
