Preston St. Pierre - Medium Data Discoveries on PostgreSQL

Here at 27 Volt Labs we recently saw our database grow by several orders of magnitude after launching our new service, carb.io. While it may have been considered “Big Data” ten years ago, today it falls into the lesser known category of “Medium Data”.

What exactly is Medium Data, anyway?

Opinions vary, as they normally do. My favorite is this dynamic definition [Link Updated] which states, in general terms, that:

Small data fits in the memory of a single machine.
Medium data fits on the disk or disks of a single machine.
Big data requires a distributed storage/access system.

Obviously this isn’t an exact science. There is no one point where your data crosses the line; hardware capabilities are always changing. Still, we can say in general terms that our own database has stopped fitting in memory on most average machines today. This is a marked difference from the <256MB of data we had before launching carb.io.

Enough with the backstory. What did you learn?

If at all possible, use a subset of the database for testing. Sometimes we have to test with the entire database, but usually either fake data or a subset of the real database will do the trick. It’s much faster.

Test your queries on the entire database. Yes, I know what I just said. Most of the time you can test on a subset. But you absolutely must test your queries on realistic amounts of data to see how they perform. In the small data days a slow-running query would take a few seconds. Now the same problem query takes three hours.

Solid state drives are your friend. Did you know that doing a simple alter table add column in PostgreSQL can take hours on a 7200RPM hard disk? It can, if your table is big enough. Restoring a copy of the production database is even worse. I’ve never found my big, slow hard disks lacking before. Now I run the dev database on a Samsung 840 Pro Series SSD.

Check your indexes. This should be obvious. Hey, I’ve always indexed my tables. Yes, by every column I want to query against on a regular basis. How could I forget any of them? The truth is, I did forget some, and we never noticed back when the entire database fit into memory.

Your backup strategy might have to change. We used to dump our database and pull offsite backups hourly (in addition to the slave server running a continuous sync). Now it takes the better part of an hour just to dump the database and is a burden on the server to create. We’ve started dumping from the slave server instead, and only every six hours.

Increase the Write-Ahead Log buffer size and/or number of files. Large numbers of queries in a short period of time can push your slave server out of sync. Then it just sits there like a crying child, logging its woes pitifully until you rsync the database manually.

Tune your database/OS. There is more to this than I have time for here, but suffice to say that the defaults are not your friend. We ended up increasing the Linux shmmax/shmall and the PostgreSQL effective_cache_size and shared_buffers settings. For help tuning your own OS install for PostgreSQL look to the manual.

There you have it folks. You get the lessons learned from our recent increase in database size and I make my promised blog post for the month. I hope you found it useful!