Dave Sawyer – SQLite: Gotchas and Gimmes – PyCon 2016

(moderator) Let’s get started with the first talk of this session Let’s give a round of applause to Dave Sawyer [applause] (Dave Sawyer) Thank you, David We have, like, the Davids club going here Thanks everybody for coming out here and venturing out to the hinterlands of the conference rooms Hopefully I’ve got some good stuff for you today about Python and SQLite First off, let’s talk about, what is SQLite? As a database engine, it’s an in-process SQLite database engine It has no configuration necessary, which if you’ve used Postgres and other things, can be tough to set up It’s free for any use, it’s compact so you can use it on mobile and a lot of other small-space areas, and it’s fast, it’s very fast Think about, like, 10,000 statements per second And that’s kind of pessimistic I tried it out; I was getting 100,000 reads per second, which was fast enough for my use It’s very reliable; the tests are outstanding It has 100% branch coverage I know, like, if you’re writing your code, you’re saying, “My god, it took me longer to write the test code “than it took me to fix the bug.” In test code, there’s 787 times more lines of test code than there is production code for SQLite, it’s just an amazing amount of test code So, why use SQLite? I mean, we already have something built into Python for serializing our data We’ve got pickle, and isn’t pickle good enough? Well, one of the things we could use SQLite for is handling large data sets I mean, once your program starts to become real, you’re going to want to be able to have bigger amounts than you can store with pickle We can search and sort efficiently with SQLite, we have the ability to do transactions with rollbacks so we can our keep data copacetic It’s a safe format Something like pickle says specifically you cannot trust it if it comes from outside With SQLite, the database, you can email that around, you can read it in, you’re not going to get hacked by that And it’s 32- and 64-bit agnostic; it’s big/little endian agnostic It’s great for just using it across different applications and different platforms And of course, we get concurrency We can have a lot of readers mixed in with writers, so this can work even on your websites and stuff like that Okay, so, this is basically copied directly from the Python documentation on SQLite, so this is how we get started with SQLite Import it, then we can connect to a database I called it ‘example.db’, you can call it anything you want, that .db is just my convention, but you could make your own file format, zap or whatever We make a cursor, which allows us to read and write to this thing, and on that cursor we’re going to execute a CREATE TABLE, make a table with the stock, the text, the symbol, quantity, and price And we’ll insert a row, which takes from the same example, and then we’re going to commit that change and then we can close the connection, which we don’t have to, but if we’re done we can close the connection It’s pretty simple, right? Or is it? There’s actually a lot of little gotchas that are right here in the very first sample of how to use SQLite The last one — let’s start from the bottom where it says, “Don’t forget to commit.” This is something that catches a lot of people If I don’t commit, then that’s going to sit there, and when my program exits, the changes won’t happen People will say, “What happened to my changes in my database — “nothing went in?” You have to remember to commit But also think about this: If this were to raise and you continue on in your rest of your program, that change is still sitting there, and then when you make some other change and commit it, both of those are going to go through That’s probably not what you expected So what we really want to do is say something like, “This should all be on a try, finally, with a commit.” And if it raise an exception, what do we want to do? We want to roll back everything So we want to say, “If there was any exception, “let’s roll it back “If there’s no exception, let’s commit it.” And we want to always do that at the end The other thing that this sort of lets out is, we’re doing two executes in a row here And the default mode in SQLite is to autocommit, so when we start the second write, it’s going to automatically commit the first one, which is also probably not what you expected You know, if you forgot this thing, you’d find out that one of your commits happened and the other one didn’t So we might want to change our transaction model so we can control when we do our commits and when we don’t So the first thing we want to do for this to control our things is, there’s a context manager that SQLite provides for us And this is great; this should be, like, headlines: “Contains context manager.” So the context manager you’ve seen in other talks, probably, you write a bit of code and you say, “At the end, I want to always do this, and if there’s an exception, “I want to do something else.” So the context manager — the connection that you get from connect can serve as a context manager, and so the code that we write inside with connection, if there’s no exceptions, it will commit it, and if there is any exception, it will roll it back So in just one line we’ve made it cleaner and nicer and it’s doing a lot more stuff for us automatically And then the other thing that we have is,

we created a cursor and then we can close it when we’re done with it and execute stuff on it If you say connection.execute, it will manufacture a temporary cursor for you, then execute that stuff on you Now, what it doesn’t tell you in the documentation is it returns you this cursor so that you could go ahead and use it So this cursor convenience, I call it more of a convenience in quotes, because you really want to close that cursor as soon as you’re done with it When you have open cursors, if they’re cursors in writing, nobody else can read until that cursor is closed Or if it has been reading, the writers all have to wait ’til the readers are done, so you really want to close it off OK, so, that sample code we want to sort of get real, and that means I want to make, like, a real class that does updates, lookups, and stuff that we can use, not just some code that we see like in our doc screens So, the first one I did here was just a sql_value, because you notice when we wrote here the values ‘RHAT’,100, we can print those things out but we have to quote the strings That’s the only difference on that first part And on the stock I have the symbol, quantity, and price, and when we get a value back from SQLite, it will come back as a tuple, so I made a little from row thing to create a stock, but this is just a little utility thing And now what we can do is we can make a stock class and we can hide that connection inside it We’ll just connect to the thing and make that our first thing to create a database Our CREATE TABLE, now what we can use is closing, because remember when we get this cursor, we want to close it as soon as we’re done From context lib you can use closing, and we’ll call inside that with, it’ll call the code and then no matter what happens, we’ll close it afterwards It’s just like the way you’d use files now, nobody calls file.close at the end inside of a finally, you say “with file” and then you read it, and then it always closes at the end That’s closing for you OK, and then the insert, we’re basically creating something that was similar to the inserts that we saw before, but because dictionaries — I’m iterating out the values of the dictionary — the dictionary can come out in any order; it’s not guaranteed And so what I did is I put the keys and then the values, which is a legal SQL statement And then this looks like the sample code that we got, but I put “Unsafe!” here Does anybody have an idea why I put “Unsafe!”? I mean, I’m just thinking back to the example code that we saw which said, “Insert in a stock’s values “and here are the values.” So what I’m doing is I am taking user data and I’m going to execute that Does that sound bad? That’s very bad Because if you imagine things like, you know, we did a lookup, and we’re looking for something named Robert, and we find out something like our little Bobby tables thing So it closes that off This is pretty clever, they have a dash dash at the end of his name so as to make the rest of that statement a comment so it doesn’t cause an error So we don’t want to do what we saw in the code And to be legit to the documentation, it says, “You know that code “that we just showed you? “Don’t ever do that.” Instead, what we want to do is we want to use placeholders, so these three question marks, and then we can pass the data in a tuple afterwards So here’s my take three on that example code We’ll just say “INSERT INTO stocks” now that we’re executing the part that’s in the quotes, insert, and there’s no user data in that thing that we’re executing, and the values that we’re passing in on the right are just — or can be user data And then the other cool thing is we don’t have to change it to unicode or do any byting coding of the strings, we can pass our raw types through to SQLite OK, so, here’s the fixed insert call, and now I have a place where I’ve joined a bunch of question marks together for however long the keys are in our dictionary, and I insert them in It’s a little bit easier to see this on the lookup, where you can say “WHERE symbol= ?”, and then I have to have — it’s kind of odd in Python to get a one tuple, or I don’t know what you call it, a muple, you just have to have the parens and then a comma and then a closed paren And then that’s lookup And one other thing I’d like to expose is the transaction, because it would be nice for users of this to say, “with transaction”, and then do a bunch of stuff So I’m making a connection sort of public here, and then update is not so interesting so I’ll just kind of skip down past that OK, so there’s our class Now, we said we don’t want to get it to do the autocommit thing, we want to come up with something where we control our own transactions, so what have we got if we don’t use that? We have three modes called: deferred, immediate, and exclusive And when I saw those, I’m like, “I have no idea what that means.” They’re not too obvious

They’re named that for a technical term, which once you totally understand it, then you know it But if you don’t understand it, the easiest ways to think about it is they’re really for — deferred is for having multiple readers and multiple writers Immediate is if you want multiple readers, but only one writer And exclusive means you just have one guy, he can write, he can read, and nobody else has any access to the database So where we’re going on this talk is just talk about deferred, because if we can implement deferred nicely then the rest are very easy And we can get decent performance out of deferred, we don’t really need the other ones So, how do we control these guys? In SQLite, this is all controlled by locks So when there is a reader, a reader will get a shared lock, and we can have multiple readers all reading out of the same database at the same time And when somebody wants to come along for a write, they’ll get a reserved lock And there’s only one of those, and during that time they’re going to start setting up the journal, the rollback journal So during this time it’s actually still okay for people who are reading, they haven’t started writing yet, so readers are still allowed in, in a hurry here to get their reading done, but then the writer will take the pending lock, and when the pending lock happens, the readers — we’re basically waiting for the readers to get out of the pool “All you readers, finish up what you’ve done, “because I’m going to write.” So no more read locks are given out Anybody who wants to start reading, they’re blocked And we are just waiting out for the reads This is where sometimes people find they get “the database is locked” errors, because there’s some reader that’s poorly written and it’s not finishing; it still has a cursor open It’s important to, like, close your cursors as fast as you can so you can let the writers in And then once you’ve got the pending lock, then you’ll get the exclusive lock, and then during that time, no other locks are given out When a writer is finished, it’ll release those things Then readers and writers can get back in and use the database So the readers are never reading, like, a database that’s being written to And our database is always just one file This is really actually convenient You can just take a file and if you need to back up your database, just back up the file — that’s so easy And you can, like, email files around This is a great way to just — it’s really handy to administer a database when it’s just one file OK, so now if we want to do reading and writing, we’re going to have to do some changes to our initial connection So, I added here — instead of just connecting to the example.db, we’re going to do two other things We’re going to say check_same_thread=False, and I’m going to change my isolation mode, isolation level to deferred from the autocommit So check_same_thread, if you look it up in the documentation, it’s strangely missing I put in a patch for this, so hopefully it will be changed soon So, what check_same_thread is, you need to have SQL and SQLite 331, which is over ten years old; they allowed you to use same connection on multiple threads, so before that you had to always create a new connection on that every single thread, which is a pain It’s actually nice to just open one connection and then share it amongst all of your threads, and so then that makes your code a little bit simpler So we’ll do that Or the other way you can do it is — well, I’ll show you the other way, too OK, and the other thing I did here is I used WAL mode I love WAL mode, WAL mode is awesome Why didn’t anybody tell me about WAL mode when I was younger? I don’t know So WAL mode, you need SQLite 3.7 That’s probably why, because it was only in 2010 But WAL mode is awesome So what is WAL mode? WAL mode is write-ahead logging Write-ahead logging means that we’re going to be able to read and write at the same time Which sounds impossible, but we can do it So the readers are going to read that green block and a writer will append its changes to the end, as, like, kind of a delta So while the writer is writing, the readers aren’t bothered They’re just still reading that same file that they had been reading before And when the writer is finished, the next writer can start, and then the readers will now read that same green block, and then after they’ve got their results, they will do this little delta, which is very fast, and the writers are not bothered So now, the readers can read and the writers can write at the same time And eventually you can see all these little deltas will pile up And so there will eventually be a checkpoint where we’ll stop all reading and writing, and then it will be combined together and then we can start again with one new delta added to the end So, how do I get these changes? So, in the — to get something like WAL mode is only available in 3.7 In Windows you’ll have — SQLite is not part of Windows, so Python being batteries-included will install SQLite for you as part of Python And then when you install your application, you’re probably already copying over python.dll and you’ll also have to copy over the SQLite.dll So, how do we upgrade? I was thinking this is going to be a major pain, I’m going to recompile Python or something like that — nope All you need to do is grab the latest dll and drop it in, and it works totally compatible forward and backwards

with all the database files, so it’s super easy to do that And on Mac you already get Python — SQLite for free, it’s already built in there, so it depends on which version you’re at, but if you want, you can also upgrade that guy So I put in here — oh wait, let’s — so you can check your version, and then there’s sqlite3.version, which you might think would tell you the version of SQLite And right in the box it says, “This is not the version “of the SQLite library.” It’s the module of the version that’s — of the module that’s running SQLite So, instead what we want to do is we want to check SQLite._version, and that will print out that version number on the right-hand side And on Linux, if anybody knows, please tell me; I’m not sure how to check the versions or what versions are shipping with which versions of Linux I guess there’s too many flavors there OK, and then to upgrade, just drop in a new version of SQLite All right, and finally our last step is going to be to do the writing We want to have — in SQLite we can only have one set of changes per connection So the traditional way is to say, “Well, for every thread, “let’s make a new connection.” And this is the first way that I show here So we’d say something like — On a property of a connection, we want to say, we’ll get the thread identifier, and then we’ll just basically put those, all the thread connections in a dictionary So whenever you just get the connection, you’ll look it up in the dictionary from your thread ID, and if there’s nothing there, then we’ll make a new connection from that thread and then return that and add it to our dictionary Now, the problem with this is that when your thread goes away you want to clean it up from the dictionary I mean, you don’t have to because if you — but it will eventually clean up, at the end because it’s not like a cursor, but it’s better to do the cleanup, so instead you’ll also want to subclass or monkey-patch the thread, so the run routine so you can close it off I found this was useful to do especially for things like unittest where you’re creating and destroying connections a lot You don’t want to have a whole bunch piled up or get blocked on the database The other way you can do it is you can use a lock while writing So this is when we have one connection and we’re gonna share it amongst all the threads And we’re sort of doing — this seems a little bit dirty because we’re doing the same sort of work that SQLite is already doing with its locks, but it’s the same efficiency because really only one connection can write at a time to the database So, adding another lock on top of that doesn’t cause any performance problems, and then we can just share this thing together And we don’t want to say, we don’t want to have one thread writing “a”, and another thread writing “b”, and then the first thread come and say commit If we say with transaction here, the first thread will get the lock and then the second thread will wait for our Python lock to be released before it can write And then this is basically the same code that you would see in SQLite to do the context manager for the connection We’ve just unwound it here, added in our lock, and so now we have the — if it succeeded we commit, and if it raises then we’ll roll back and then re-raise that exception so you can catch it outside OK, I finished a little bit early, which I guess is good because we were running a little late to start off with, but I can take some questions from the audience if we got some Yeah Yeah, please go to the microphones (moderator) Anybody who has questions, come here to the bmic (Dave Sawyer) Thank you, Alex (audience member 1) Hi, on the very last slide, when you’re talking of making a connection per thread, why not use thread-local storage? (Dave Sawyer) Why not use? — sorry (audience member 1) I’m very scared about different threads adding to the same non-TLS dictionary because that will eventually crash your program unless you put a lock around it, and why bother when you get thread-local storage with that? (Dave Sawyer) So the question is, why not use thread-local storage for the dictionary for each thread? I found it easier with one dictionary And our dictionary operations are already guaranteed thread-safe There’s a number of operations that are atomic in Python until the next — (audience member 1) Two dictionaries adding different keys in Python 2.7 is not thread-safe (Dave Sawyer) If you add to a dictionary, that — adding to one dictionary is an atomic operation, so I can do that one safely I guess I could put another lock in there to be thorough and say, like, perhaps in the future if this is JITted it won’t be atomic, but a lot of things like list.add are already atomic in Python, the GIL will prevent two threads from doing it at the same time

(audience member 2) So, unless you need DB-API2.0 compatibility because you’re planning on switching out the back end for Postgres or something, in which case be careful, why wouldn’t you use APSW instead of SQLite3? (Dave Sawyer) APSW, another Python — (audience member 2) Roger Binns’s lower-level wrapper (Dave Sawyer) Yeah, what I wanted to do in this talk is show that we can still use the built-in Python libraries without adding in a third-party library That library’s got the philosophy, like, “I’ve already seen “what’s in the SQLite library and I’m going to throw it out and do it again.” And I think that the nice thing about that library, though, is that they try to keep up with everything that’s going on in SQLite, so things like WAL-mode are built-in, as opposed to just causing — calling the pragmas that are in the lang — that are already here But in this kind of code, you don’t need to pip install or have any additional packages to do it And you can still see the same model that SQLite was originally written for in Python 2.7 is still working in 3.5 with the existing library (audience member 2) So package management, effectively? (Dave Sawyer) Yeah, you can use — are you saying you could use a package manager if you ever wanted a different one? (audience member 2) No, no, I mean the reason is not introducing an external dependency (Dave Sawyer) You’re right, not adding another external dependency Now, another question that sometimes people ask me about, why wouldn’t I use an ORM to keep everything from me so I don’t need to know all the details about how — when I’m making a query or when I’m writing, and what I found with the ORMs is that generally you need to know enough that you don’t need the ORM to really benefit from it If you go in kind of ignorant and write something, you’ll end up with something that — I looked at one project and they’re saying, “It’s so slow, we’re not sure why.” It’s because, did you realize you’re generating thousands of SQLite calls? Because you’re doing something that looks like a.b=b.c in a while loop or something like that, and it looks like harmless code, but it unwinds into a bunch of SQL select statements So I’d kind of rather say, just as — for most database use, it’s fairly simple, we can just write the code directly and then we’ll know exactly how many calls we’re making and how many reads and writes in the database we’re making (audience member 3) With WAL-mode, as you had showed us, what is the advantage of not having the autocommit? (Dave Sawyer) So the main advantage with not having the autocommit is a lot of times, you want to do something where you say, “I want to decrement this table and increment this table, “and I want it all in one transaction, I want it all or nothing.” So you don’t want to have one write statement end the previous write statement You want to say, “I want to write, write, write, and then now that — “if I got all of that, then I want to commit it as a whole.” And so we don’t want — you can get a bug where you might have half of a transaction went through if you were surprised by an autocommit (audience member 3) Hmm, and that’s applicable to everything, say PHP or Java in Android? (Dave Sawyer) The autocommit mode is the default in SQLite So, yeah, usually I immediately switch it over to deferred And some people recommend immediate, but I think you can still use deferred because immediate will only let you be the writer and nobody else can be the writer (audience member 3) Thank you (audience member 4) I once used a SQLite in a NFS system, and it became extremely slow I look up online and it said something about it’s probably because of locking system So do you have any comments about that? (Dave Sawyer) I’m not sure what made yours run really slow due to the locking, but it can be slow if something is reading and it’s not letting go of a cursor Typically it might be something like maybe you’re reading through a long table and for every row, you’re doing something that takes time, or it’s like a synchronous call to the operating system, and that would make — slow you down because while you’re reading, nobody else can write to it They can still read, but they won’t be able to write So if you’re trying to write in there then you might be blocked (audience member 4) I’m just trying to create a new table, just inserting a bunch of things (Dave Sawyer) Creating a new table should be super fast Yeah, another thing you might try, there’s — SQLite supports an in-memory model So if you open the database and it’s named :memory:, then it’s in memory And actually, this is a pretty good use, say in like a sorted dictionary or other kind of sorted collection because you can then use your SQL statements on it and it never actually ends up writing anything to disk (moderator) That’s all the Q and A we have The next talk starts in this room at 2:35

Let’s give Dave another round of applause (Dave Sawyer) Thank you all [applause]