
The Art of PostgreSQL: The Transcript, part I

This article is a transcript of the conference talk I gave at Postgres Open 2019, titled the same as the book: The Art of PostgreSQL. It’s available as a video online at YouTube if you want to watch the slides and listen to it, and it even has subtitles!

Some people still prefer to read the text, so here it is.


Hi everybody. So we’re going to talk about The Art of PostgreSQL. This presentation is mostly oriented towards application developers.

I’ve been a contributor to Postgres for a long time, I started last century. I work at Citus Data and we’ve been acquired by Microsoft. So nowadays it is Azure Database for PostgreSQL HyperScale (Citus), or something like that.

One of the projects that I’m working on currently is named pg_auto_failover. The idea is just exactly as the name says. In PostgreSQL we tend to like boring names, so that you read the name and you know what it is about, usually.

So it’s business continuity, it’s automating the failover. That’s all there is to it. It’s on GitHub, it’s completely Open Source, you can open issues, you can send bug fixes if you want to, even new features if you fancy that. So go have a look, it’s a project we did to simplify setting up HA for PostgreSQL.

Another project I’ve been working on a lot in the past years is a migration project for going from something else to PostgreSQL. The idea is that you don’t have an excuse anymore to be using, for example, MySQL. Just use PostgreSQL instead, it’s much better. But it’s not always easy to make that move. So with pgloader, in a single command line, you give it a source connection string and a target connection string; the target should be PostgreSQL. And then it’s going to read all the catalogs from the source database, work out the tables, the columns, the attributes, the types, do the type mapping for you, load the data, and then create the indexes in parallel, etc. So it’s one command line and then your database is running on PostgreSQL. So no excuse, just do it. It supports MySQL, SQLite, SQL Server and some other kinds of input.

Another project I’ve been working on is this book, The Art of PostgreSQL. We have some copies left, maybe the last one or something like that, at the booth. So if you’re interested show up there later. And the slides that we’re going to go through are mostly extracted from the book. So it’s kind of the same content.

So let’s get started now.

The first thing that is important for me, as an application developer, is: why are you using PostgreSQL? I used to be a consultant, and often when you ask that question, most developers don’t really know why. You know, it’s in the stack, it’s been deployed already, they have joined the project and they have to use it.

Some of them are like “Oh, I know why, it’s because it’s solving this problem that is quite hard to solve for the application and we are using PostgreSQL to do that.” But often enough I heard that PostgreSQL is used to solve storage. Which is surprising as an answer because it’s so wrong. It’s not true. Storage in the 60s was easy because at the time, with the compute we had, if you unplugged the machine from the power socket, then anything that was in memory would stay exactly the way it was. And you could plug it back in a couple of weeks later and it would be as it was two weeks before. And in the 70s we switched to other technology where that was not true anymore, but being able to serialize something that you had in memory to disk has never been much of a problem in computer science. It’s easy to do, everybody knows how to do it, you don’t need PostgreSQL to do that.

If you are a Java shop, you can serialize your objects in XML and read them back and that’s it. So if storage is the problem, you go use, for example in the cloud, blob storage from Azure or maybe S3 from AWS or something else. So that’s storage. PostgreSQL is not about storage.

PostgreSQL is about concurrency and isolation. The question is: what happens when you have more than one person trying to do the same thing, like two updates concurrently? And the image is obviously the difference between theory and practice: in theory it’s the same, but not in practice.

The main thing around concurrency and isolation within the context of the database (Relational Database Management System, RDBMS) is that we provide ACID guarantees. I guess everybody knows what ACID is?

Atomic basically means that if you have many things to do in the same transaction and something goes wrong you can rollback. If you did two inserts and one update and then you rollback then everything is cancelled. You don’t have the situation where one of the inserts went through but not the other one and now your database doesn’t make sense anymore. So that’s pretty cool.
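
To make the two-inserts-and-one-update case concrete, here’s a minimal sketch you could type into psql. The accounts table and its columns are made up for the illustration; they are not from the talk.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE accounts (id integer PRIMARY KEY, balance numeric NOT NULL);

BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 100);
INSERT INTO accounts (id, balance) VALUES (2, 50);
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
ROLLBACK;

-- The table exists (it was created before BEGIN) but holds no rows.
SELECT count(*) FROM accounts;
```

The final SELECT returns 0: the rollback cancelled all three statements together, not just the last one.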

Usually we don’t type in rollback. Sometimes we do when testing interactively, but in the application, have you ever implemented a transaction that would do a rollback in your transaction? Maybe not.

What happens is… file system is full. I know it’s 2019 but we still have that problem in production sometimes. So file system is full, what’s next? Well with an atomic system, the transaction is rolled back and never happened. That’s it. So you’re safe.

Well, PostgreSQL does something that almost no other system is able to do: it supports transactions for DDL. So if you have an application script to migrate from one version of the schema to the next, you add a new column, a new table, maybe a new index, something like that, then what happens if the file system is full in the middle of the script?

If you’re not using PostgreSQL, and you had version 1 in production, and the script to go from version 1 to version 2 failed in the middle, then you now have a version that nobody ever saw anywhere. No developer ever saw that version which is now in production… if you don’t have transactions for DDL.

With PostgreSQL “file system is full” implies a rollback, you still have version 1, don’t deploy the app yet, that’s it. Simple, done. So that’s one of the reasons why you use PostgreSQL of course.
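
As a sketch of what such a migration script can look like, here is a version 1 to version 2 change wrapped in a single transaction. The customer schema is hypothetical; only the BEGIN and COMMIT around the DDL are the point.

```sql
-- Version 1 of a hypothetical schema, already in production.
CREATE TABLE customer (id integer PRIMARY KEY, name text NOT NULL);

-- Migration to version 2: a new column, a new table, a new index,
-- applied all together or not at all.
BEGIN;
ALTER TABLE customer ADD COLUMN email text;
CREATE TABLE customer_note (
    customer_id integer NOT NULL REFERENCES customer (id),
    note        text NOT NULL
);
CREATE INDEX customer_email_idx ON customer (email);
COMMIT;
```

If any statement fails, say because the file system is full while building the index, PostgreSQL aborts the transaction, the final COMMIT answers ROLLBACK, and the schema is still version 1. One caveat: CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so it has to stay out of a script like this.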

The C of ACID, it means consistency. Consistent means there are some business rules that you know about and that you can share with PostgreSQL, can explain to PostgreSQL, here’s what’s important for me to keep in mind for the whole data set that we are going to manage; and you can have PostgreSQL implement those guarantees for you.

So the first step for consistency is the schema, and the data types. Here we have a very simple table with two columns, and anything that goes into those columns has to match their types; here ID is an integer. If you have MySQL and you have an integer column and you insert into it the string “banana”, then it will happily take it, and if you SELECT from it then it’s going to say zero. But no errors whatsoever. It’s happy to work with that.

With PostgreSQL we don’t do that. So if you try to insert a “banana” into an integer column, PostgreSQL will tell you “hey I don’t know what that is, but it does not fit your model, please be careful”. And then we have constraints like CHECK, NOT NULL, FOREIGN KEY, PRIMARY KEY… relations.
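
Here is a minimal sketch of both ideas: the type check and the constraints. The table foo follows the two-column table from the talk, but the column name f1, the second table bar and all the values are assumptions made for the example.

```sql
-- Two-column table in the spirit of the talk; names are assumptions.
CREATE TABLE foo (id integer PRIMARY KEY, f1 text NOT NULL);

-- Rejected with an "invalid input syntax" error:
-- 'banana' is not valid input for an integer column.
INSERT INTO foo (id, f1) VALUES ('banana', 'yellow');

-- Accepted.
INSERT INTO foo (id, f1) VALUES (1, 'apple');

-- Constraints let you share more business rules with PostgreSQL.
CREATE TABLE bar (
    id     integer PRIMARY KEY,
    foo_id integer NOT NULL REFERENCES foo (id),
    qty    integer NOT NULL CHECK (qty > 0)
);

-- Rejected by the CHECK constraint: qty must be positive.
INSERT INTO bar (id, foo_id, qty) VALUES (1, 1, 0);

-- Rejected by the FOREIGN KEY: there is no foo row with id 42.
INSERT INTO bar (id, foo_id, qty) VALUES (2, 42, 3);
```

Each rejected statement raises an error instead of silently storing something else, so the application hears about the problem at write time.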

We’ll get back to that, relations are the central concept of SQL basically. And some people think it’s because we have foreign keys but it’s not true. A relation is just a mathematical concept where you have a set of elements that all share the same properties. It’s called attribute domains in the relational jargon and it means that it all looks the same. Here it’s an integer column and a text column, and anything that is in this table foo is going to have an integer and a text, that’s it. All of them are the same. That’s what a relation is.

So consistency is pretty important.

Then the I for ACID is isolation. It’s the other side of atomicity. It’s a little bit more complex to understand sometimes. Isolation means that while you are doing your queries, are you allowed to see what is going on concurrently in the rest of the system?

So if you want to take a consistent backup for example, you need to make it so that even if pg_dump is going to run for several hours because you have terabytes of data, it needs to be a consistent snapshot of the production. If during the backup someone else is doing inserts and updates and something else, you don’t want those to be in the backup, because you want something consistent. You want a snapshot that doesn’t move. You don’t want to see everything that’s new. So pg_dump will typically use an isolation mode where you don’t see the changes from the other transactions.

You can also do that in your application, and maybe it could be the default: REPEATABLE READ. Or even SERIALIZABLE, but that one is different. REPEATABLE READ might be what you expect from the database but it’s not the default. The default is READ COMMITTED. So maybe you want to look into that.

Anyway, every transaction in Postgres can have a different isolation level. pg_dump will be REPEATABLE READ (or SERIALIZABLE, with its --serializable-deferrable option) while the rest of the system is READ COMMITTED or whatever each application picked. So that’s isolation. So you see that’s very important, and that’s very hard to implement at the application level and so maybe that’s why you’re using PostgreSQL actually.
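
As a sketch of how an application can pick its isolation level, here are the per-transaction and per-session forms. The table foo is only a placeholder for any table that other sessions write to.

```sql
-- Choose the level for one transaction only.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM foo;  -- the snapshot is taken at this first query
-- ... other sessions may insert into foo and commit in the meantime ...
SELECT count(*) FROM foo;  -- same count as above, same snapshot
COMMIT;

-- See what new transactions in this session get by default
-- (read committed, unless someone changed it).
SHOW default_transaction_isolation;

-- Change that default for the rest of the session.
SET default_transaction_isolation = 'repeatable read';
```

Under READ COMMITTED the second count could differ from the first, because each statement takes a fresh snapshot. Under REPEATABLE READ the whole transaction keeps seeing the data as of its first query.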

And then of course it’s durable.

Do you know the little test to do with the power socket plug? Basically you write a little client application that will only do INSERTs for example. And you count how many times you got the COMMIT message back from Postgres.

Remember that when you say COMMIT, maybe the answer is going to be ROLLBACK. Because there was a problem, Postgres was not in a position where it could actually implement the COMMIT. “File system is full” is the easiest example to have in mind. So you say COMMIT, maybe it’s ROLLBACK. So you count how many times when you said COMMIT it was committed actually.

And then while the test is running you unplug the power socket from the server. In the middle of the test. Then you plug again and you count what you have on the server and what you have on the client. If it’s not the same, there’s a bug somewhere. It’s not durable.

Durability means that anything that has been known to be committed by the client should still be there when you do that. If it’s not, maybe the hardware is faulty, maybe the BIOS configuration or maybe the kernel or OS configuration is wrong. Maybe you did fsync = off in PostgreSQL, or maybe you’re not using PostgreSQL. And then… yeah, don’t do that.

So that’s the basics of why you would use PostgreSQL. So to recap: because you have transactions. And transaction is a short way to say you are compliant with ACID. But be careful, because some systems call themselves databases nowadays, the NoSQL systems in particular, and as a developer if you think about them as a database, you might be in trouble because they are not ACID-compliant.

All of the NoSQL systems that you will find are going to implement some trade-offs. The only one that is obvious is that they are not implementing SQL, it’s No SQL. Okay. But also they don’t implement ACID usually. Take MongoDB for example. It’s schemaless, that’s a feature. It means that you don’t have consistency, so you lose the C of ACID. It doesn’t have transactions, so you don’t have the A nor the I of ACID. No atomicity, no isolation. That leaves the D of ACID, durability. It used to not implement that. Apparently they’ve fixed it nowadays, but for a long time you wouldn’t have the D of ACID.

So maybe it’s fine to use it anyway in your application because it fits your use-case. But as a developer if you think of a database as something that is ACID-compliant, because that’s how we are taught about databases usually, and the system you use is actually not ACID-compliant, it means that for all those guarantees that you don’t have, either you don’t need them, that’s cool, or if you need them, then you need to implement them yourself.

So that’s the main kicker of using PostgreSQL, is that you get everything for free and it just works and it’s available and you can just, you know, just care about the application.

And other good reasons to use it are written there and we’re going to look at them. We’re going to see why I say it’s object-oriented. We have extensions in PostgreSQL, we’re going to see a couple of examples. Rich datatypes. You can actually do data processing in SQL and we’re going to see what I mean by that. Etc etc.


That’s it for the first part of this presentation. We covered about 15 mins of the 50 mins of this talk. I will publish the transcript for part II and part III later next week, so stay tuned to this blog if you like this content!

Also, as the content comes from my book anyway, you could also subscribe below to get the free sample, or just go buy the book at the main home page of this website: The Art of PostgreSQL.
