My fun journey of managing a large PostgreSQL table (2024)

Recently, I have had the chance to work on a very large data table. By very large, I mean 1.5 billion rows so far, growing by about 300,000 rows every single minute.

The table takes in a constant stream of data from IoT Edge devices, and I periodically run a data consolidation script (let’s call it the consolidator) to aggregate the stream into per-minute data points.

The consolidator, at first, runs reasonably fast for the table’s size. It takes about a minute to perform a SELECT/GROUP BY/ORDER BY query. One might be surprised that a minute is relatively fast given the number of rows in the table. The reason is that I use an index to keep track of the data. Let me take you through my journey into the data wonderland.

For the reader who just wants the TL;DR

  1. Index the columns that are used in conditional (WHERE) clauses.
  2. Think about the nature of your data. If it has an intrinsic order, a BRIN index might help.
  3. If you are using GROUP BY DATE_TRUNC or a similar expression, you might as well index the expression’s result.
  4. Every time rows are inserted, the indexes are updated as well.
  5. SQL is declarative, so the end result of GROUP BY is order-independent. But index usage is not, and you might need to start the GROUP BY with the indexed column.

For anyone who works with a big data table, indexing is the first go-to solution. An index rearranges your sequential data into a data structure that is far friendlier to search in terms of time complexity. Postgres automatically creates indexes for unique columns, including the primary key. The default index data structure is the B-tree: a balanced tree stored in 8 KB pages, which keeps the depth to just three or four levels even when the table holds billions of records!
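
As a minimal sketch (the index name is hypothetical; data_table and data_timestamp come from my query further below), creating such a B-tree index looks like this:

-- B-tree is the default index type, so USING btree could be omitted
CREATE INDEX data_table_ts_idx ON data_table USING btree (data_timestamp);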

If you have some spare time, I found this to be a great resource explaining indexing.

Since there is no free lunch, we sacrifice space in this time-complexity trade-off: the space used for indexing comes to about 11 GB per index!
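
If you want to check this on your own table, pg_relation_size() reports an index’s on-disk size (using the hypothetical index name from above):

SELECT pg_size_pretty(pg_relation_size('data_table_ts_idx'));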

Luckily, the data is a time series. This opens up another indexing option: the Block Range Index (aka BRIN). It requires minimal space, just over 4 kB.
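
Creating one is a one-liner (again, the index name is hypothetical; pages_per_range is shown at its default of 128 just to make the knob visible):

CREATE INDEX data_table_ts_brin ON data_table USING brin (data_timestamp) WITH (pages_per_range = 128);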

This index type is based on the assumption that the indexed column’s values are well-correlated with the physical order of the table rows. In my specific case, the order of the data is well-correlated (the timestamp of each measurement). And what kind of order is NOT well-correlated, you might ask? Order that is artificially created. For example, user IDs 0001, 0002, 0003, … tell you nothing besides their order; they don’t naturally correlate with the other columns’ values.
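
You don’t have to guess about this correlation: after an ANALYZE, the pg_stats view exposes it as a number between -1 and 1 per column, where values near ±1 mean the column closely follows the physical row order:

SELECT correlation
FROM pg_stats
WHERE tablename = 'data_table' AND attname = 'data_timestamp';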

So, all I need is to go ahead and create the BRIN index, then go enjoy life in the lavender field? No, not so fast…

My SQL query looks like this:

-- (AVG() here is an example aggregate for data1)
SELECT AVG(data1), DATE_TRUNC('minute', data_timestamp) AS minute_ts, data2
FROM data_table
WHERE data_timestamp
BETWEEN DATE_TRUNC('minute', '2022-09-17 07:15+7'::timestamptz)
AND DATE_TRUNC('minute', '2022-09-17 07:30+7'::timestamptz)
GROUP BY data2, DATE_TRUNC('minute', data_timestamp);

And this query took about 14 minutes. Not ideal, but compare it to this one:

SELECT AVG(data1), DATE_TRUNC('minute', data_timestamp) AS minute_ts, data2
FROM data_table
WHERE data_timestamp
BETWEEN DATE_TRUNC('minute', '2022-09-17 07:15+7'::timestamptz)
AND DATE_TRUNC('minute', '2022-09-17 07:30+7'::timestamptz)
GROUP BY DATE_TRUNC('minute', data_timestamp), data2;

That one ran for more than 12 hours before I realized that, perhaps, I would rather give up waiting.

So, what’s the difference here? What makes the same query (from a declarative perspective) take so much longer just because the GROUP BY columns are in a different order?

  • Aggregation is expensive.
  • You need to be careful about your indexes when you want to optimize an aggregation query (you can inspect this yourself with the EXPLAIN sketch after this list).
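
To see what the planner actually does in each case, prefix the query with EXPLAIN. EXPLAIN (ANALYZE, BUFFERS) executes the query for real and reports per-node timings and I/O, so be careful running it on the 12-hour variant:

EXPLAIN (ANALYZE, BUFFERS)
SELECT AVG(data1), DATE_TRUNC('minute', data_timestamp) AS minute_ts, data2
FROM data_table
WHERE data_timestamp
BETWEEN DATE_TRUNC('minute', '2022-09-17 07:15+7'::timestamptz)
AND DATE_TRUNC('minute', '2022-09-17 07:30+7'::timestamptz)
GROUP BY data2, DATE_TRUNC('minute', data_timestamp);

And here is a sketch of an expression index whose column order matches the faster GROUP BY (TL;DR point 3). One caveat: with a timestamptz column, the two-argument DATE_TRUNC is only STABLE, not IMMUTABLE, and PostgreSQL rejects it in an index expression, so this sketch assumes a plain timestamp column:

-- hypothetical index; assumes data_timestamp is timestamp WITHOUT time zone
CREATE INDEX data_table_minute_idx
ON data_table (data2, DATE_TRUNC('minute', data_timestamp));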

Remember that the table I am trying to optimize takes in a constant stream of data from IoT devices? That implies the database engine will, most of the time, be performing insert operations.

This is taken from the official PostgreSQL documentation:

Once an index is created, no further intervention is required: the system will update the index when the table is modified, and it will use the index in queries when it thinks doing so would be more efficient than a sequential table scan. But you might have to run the ANALYZE command regularly to update statistics to allow the query planner to make educated decisions.

The keyword here is “update”! Meaning the index needs to be rearranged every time new data is inserted. That doesn’t sound too bad for the BRIN index type. Note the second half of the quote, though: the query planner only makes educated decisions if its statistics are fresh, so either the system runs ANALYZE for you (via autovacuum) or you have to run it periodically.
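
Concretely, a manual round of maintenance looks like this (autovacuum normally handles both; the BRIN index name is the hypothetical one from my earlier sketch). One BRIN-specific wrinkle: page ranges added by new inserts stay unsummarized until a vacuum or an explicit summarization pass, and the index cannot narrow scans over unsummarized ranges:

ANALYZE data_table;
-- BRIN only: summarize block ranges added since the last (auto)vacuum
SELECT brin_summarize_new_values('data_table_ts_brin'::regclass);

On PostgreSQL 10 and later, you can also create the BRIN index WITH (autosummarize = on) so autovacuum takes care of summarization for you.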

In conclusion, if you are working on big data with PostgreSQL, you might very well need to seriously consider the nature of your application. Choosing the right indexing can help speed things up, and a BRIN index can help keep index size in check. However, you should double-check with the query planner that it does what you expect.

In the next article: how I managed to break the large index (and table) down into partitions using pg_partman.
