PostgreSQL Service At Scale – Sean Chittenden

So, Postgres at scale — and I'm not talking about it in quite the same context as, say, Amazon RDS, which has a different type of scale. We have a type of scale we need to deal with internally, and it comes in the form of lots of teams. A quick background on Groupon and what I mean when I say large: we process a large number of transactions — you can see some of the highlights there in terms of size and scope — and we process a fair amount of revenue, and all of it hits databases. One of the things we've run into because of this proliferation of services is that internally a lot of teams have moved to SOA-based architectures. We've acquired teams; there's massive fragmentation of application stacks. We're just like everybody else: teams and companies have recently gone from monolithic infrastructures and spread out horizontally into service-oriented infrastructures. But as a result, lots of different applications are talking to different types of databases, with different queries and workloads. So we have large numbers of teams, databases, instances, applications, and services, and in order to provide a foundation that can satisfy the demand of the applications backing these services, we needed to figure out how to build database-as-a-service — and we've done that internally. What I'm here to talk about is how you do that, and what some of the considerations are. Feel free to raise your hand and ask questions along the way. Some of this is light on detail; some of it I have covered in other publications, specifically on the OS side of things and what we do regarding storage, but I'll come back to some of those things. So: databases. Databases, for better or worse, are not just a
single piece of software that you plop into a data center, or a cloud, or wherever. A database is a collection of things, and just like ogres, there are layers — postgres has many, many layers that have to work together to make a successful database function inside an organization. It's not just about the postgres process; a database is a constellation of things that need to be layered together in order to provide database services, and they don't operate in a vacuum. I challenge all of you to think about your own environment: when you say you have a database and you're providing database services, there is more than just the postgres process running to support that. You have backups; you have all kinds of things that go into supporting that infrastructure. Backing up a bit: a database sits in a really interesting place in an organization's infrastructure. If you've got your typical software stack, which goes from the browser all the way down to the database, there's nothing else after the database — that's it. You have a suite of different services and they all pass things through: they're doing some kind of caching or proxying, some kind of processing or forwarding, or some other reasonably low-latency thing — but then they get to the database, and that's why we're all here. I'm sure all of us have wished there were something else beneath the database, because then that would be the last thing in the chain, but it isn't that way. At the macro level, a database is the union of all the things a database needs to do — it's got a query planner, and a whole host of other macro components — and this is unique in the software stack, because most everything else, like application servers, has CPU and maybe some network, and that's about it. The database has all of these macro components. And if you think through this stuff, some of these are database-specific and some are OS-specific: CPU is kind of a hardware thing, disk I/O is largely a hardware thing, capacity is the amount of available resources — but at the same time you have to deal with set theory and understand how databases work inside the postgres process itself. That's fantastic if you like a challenge, right? When somebody comes to you and
says "something is slow" — which is the most interesting thing somebody can say, because it also means you have no idea what rathole you're going to go chasing down today. Somebody's going to say it's the CPU, and you have the entire stack to go look at. Internally we use FreeBSD to host our databases. Brendan Gregg has done a fantastic job of decomposing the actual box itself — the OS and all of its macro subsystems. It's unfortunate that the URL at the bottom is cut off; when I post these slides there's a URL where you can pick this up off of Brendan Gregg's blog. You have to figure out how and where to go dig, and it's partially OS. A lot of the time, the reason the database was slow and some application engineer was complaining is that you ran into a system limitation: the CPU "wasn't fast enough," which typically means the query is touching too many rows, not that the CPU was too slow; or a large bulk insert came through, you hit some regression, and things backed up; or you spent a lot of time doing disk I/O. These are all things you have to chase down that really didn't have anything to do with postgres. One of the big things we've noticed is that, by and large, postgres just works. That was really interesting walking around this conference — you don't hear people talking about "how do we move postgres forward in the industry"; everybody kind of understands now that it just works. And it's not about postgres entirely, because postgres is just one part of the whole thing — it's all of the other things you have to layer on top of it. So, on the OS side of things, you have to understand all of this. If you begin to decompose
what a database is, these are all the things you're responsible for — that are in your domain — when you're offering database-as-a-service to your organization. You have the query engine; behind that you have the serialization layer, and that's a really important value proposition and service that postgres, or any database, provides. It does caching for you, and you have to deal with that caching both inside postgres and inside the OS and filesystem layers. You're also in the storage business: postgres just added page checksums, for instance, so there's no way any of us can deny that, as somebody who provides databases, you're in the storage business. You have to be aware of this — there's page tearing and all kinds of awful things that can happen that will result in a really miserable day for all of us. And then there's one thing that's really interesting, especially with postgres: if you had asked me five or ten years ago whether postgres was going to be a proxy, I would have said no way. (Oops — my slides are out of order, so I'll come back to this slide; let's get to what I think is the fun part.) These are all the things you have to have some understanding of: you have to understand SQL; you have to understand how MVCC works; you have to understand how storage works, with pages and bit rot and checksums and things of that nature; shared buffers, the OS versus postgres; and foreign data wrappers, which are now this really interesting proxy — either a proxy to a local process or to a remote process, in the case of cstore_fdw, or say postgres_fdw or mysql_fdw. That's your domain; that's what you need to go and satisfy.
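Since foreign data wrappers are what turn postgres into that proxy, here is a minimal sketch of the idea being alluded to — the server name, host, table, and credentials are placeholders, not anything from the talk:

```sql
-- Hedged sketch: postgres acting as a proxy via a foreign data wrapper.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER remote_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'appdb');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_pg
    OPTIONS (user 'app', password 'secret');

-- Queries against this table are forwarded to the remote postgres:
CREATE FOREIGN TABLE remote_orders (
    id    bigint,
    total numeric
) SERVER remote_pg OPTIONS (schema_name 'public', table_name 'orders');
```

The same pattern with mysql_fdw or cstore_fdw is what makes postgres "proxy to a remote process or a local process" in the sense described above.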
And then somewhere along the line you have to go and address all of the risk-management parts that are attached to this — regulatory concerns and a whole host of other things. This is what it means to be the database: the last link in the chain of responsibility for being able to service something. So, since it is a Friday, I want to have a little bit of a discussion about risk. We are in the business of managing risk, effectively. We're the keepers of the data, and that is one of the largest pieces of responsibility there is; you have to treat it with some degree of care. So: is this risky? And there it is — yes and no. I'm trying to provoke a conversation here. It is risky, because if something happens, you're going to lose data between when that commit returns and when the data shows up on disk, and this is particularly dangerous on block-oriented filesystems. You really do want synchronous commit on if you need to guarantee your data. So how about this one — is this risky? There's an extra little bit here.
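The knob being discussed here is postgres's synchronous_commit setting; a hedged sketch of the trade-off, with the comments summarizing the behavior described in the talk:

```ini
# synchronous_commit = on   -> COMMIT does not return until the WAL record
#                              is flushed to disk; a crash loses no
#                              acknowledged transactions.
# synchronous_commit = off  -> COMMIT returns before the WAL flush; a crash
#                              can lose the last few moments of committed
#                              transactions, but commits no longer wait on
#                              fsync, which helps throughput considerably.
synchronous_commit = off    # whether this is "risky" depends on the data
```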

Let me see if I can — all right. This little bit here: if you don't know what ZFS is, this is really interesting, and the answer is going to be different for different organizations. In our case we do this by default as an organization — and I really want to provoke discussion here. ZFS is a filesystem, but it's a log-structured filesystem; think Btrfs, if you've heard of that. sync=disabled means that when an application fsyncs to the filesystem, the filesystem effectively ignores it. But here's why this is potentially not that risky on ZFS — and to be completely clear, there's that mandatory disclosure, which is kind of cut off: we do not do this on all of our databases. We do it by default on our non-core, non-critical databases, because it does good things for performance. What it means on ZFS is that all these writes coming in are accepted in order and then written down to disk in a ZFS transaction group every five seconds. So, assuming synchronous commit, that information gets flushed to the OS in the form of a write, and then every five seconds it ends up on disk in a filesystem-consistent way. If we lost power on any of our boxes, we'd have five seconds' worth of exposure, plus or minus a second or two — that's it. And there's no filesystem check in ZFS, by the way: if the box bounces, it comes back, goes back to the last successful transaction group, and you're off to the races. We've bounced hundred-something-terabyte boxes and it takes maybe 30 seconds to get through the filesystem part. So is this risky? Well, I'm not risking anything from a filesystem perspective, because I'm sitting on this filesystem, which is fantastic; and from a database
perspective, yeah, I'm risking five seconds' worth of data. Okay — but we're all in the business of risk management as database administrators and people who provide database services, so we make this cost justification. We need to trade off: how much outage have people suffered because of a performance problem where the database became unbearably slow, provided back-pressure into the application, backed things up, and caused a cascading failure? The flip side is that you alleviate that pressure by letting things roll through, and you accept this window of loss. If you go do that calculus for, say, some marketing service that's churning away, beating on the database really aggressively, but whose data is just not that important — it's absolutely worth it. fsync is an amazingly expensive call. It's also a really important call: it says that when an fsync happens, the data gets flushed through the OS down to disk, and everything is in a durable state. But if you've got 20 or 30 disks in a system, that fsync means every single one of those disks has to reach a known consistent state, and that's really hard to do in a performant way. So yes, I argue this is not that risky once you cost-justify it and weigh it. And it's interesting, because as database administrators — myself included — everybody has this loss aversion: you want to make sure that, down to the last transaction, the database will survive a power-pull test without losing any data. But that means those last five seconds' worth of transactions are really, really expensive — from a hardware perspective, and from a horizontal-scaling perspective: really difficult to scale horizontally, because you're
now paying this performance penalty, and it's really large. ZFS also lets you take random writes and convert them on disk to sequential I/O, which is also a really awesome thing. So that's something I want to challenge you on — feel free to ping me offline. Think about it from a loss-aversion perspective for your organization: is it better to have back-pressure and an outage because the database got slow during periods of high write activity, or is it better to make sure there's no data loss during that five-second window of time? [In response to a question:] Good — yes. Just this week we had some really interesting benchmarks, and I can share some of that information; this was something that surprised everybody, by and large.
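The setup being described can be sketched as follows — the pool/dataset name is an assumption, and the sysctl shown is the FreeBSD one from roughly this era:

```
# Ignore application fsync()s on the postgres dataset; writes still reach
# disk, consistently, with every ZFS transaction group commit (~5s default):
zfs set sync=disabled tank/pgdata

# The exposure window is governed by the transaction group interval:
sysctl vfs.zfs.txg.timeout=5
```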

If you have sync disabled on ZFS and you do an fsync, it actually does still force information into the ZFS — I'm going to call it a database, because it really is a filesystem database — and it forces some degree of consistency in the way that information gets serialized into the transaction buffer. So fsync is not entirely free with sync disabled; it just means ZFS does a whole lot less — it doesn't rebuild the überblock tree, and there are a lot of really good talks on that. But yes, there is still a cost associated with fsyncing against ZFS with sync disabled. Okay, let me skip back. So, when you're building a database and creating it as a service, you start out with a single instance: okay, great, I've got a database. But like I said, nobody runs just this. You typically have a slave, so now you've got replication — master/slave, whatever. But then you need some kind of failover mechanism, so you're now adding some form of load balancer: doing IP sharing, or using a load balancer doing TCP proxying, or VIP management — because somehow you need to get to the master, since by and large most of us are probably relying on binary streaming replication. But that's still not enough, because a slave is not a backup: you need some kind of point-in-time recovery, so now you have to build that out. That's not good enough either, because you also need connection management on top of it. And even still, you don't have just one data center — or two servers in a single rack, or adjacent racks, or two databases on the same floor — you have to ship this stuff off somewhere. You also have backups, because you want to off-site some of this stuff.
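The point-in-time-recovery building block mentioned above is WAL archiving; a minimal hedged sketch of the postgresql.conf side, with a placeholder archive destination (a real deployment would ship WAL to off-site storage, not a local cp):

```ini
wal_level = replica          # 'hot_standby' on the 9.x releases of this era
archive_mode = on            # requires a restart -- a write-once parameter
archive_command = 'cp %p /mnt/wal_archive/%f'   # placeholder destination
```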
And then maybe you get a little more sophisticated and set up load balancing to your PgBouncers, so that you can toggle back and forth in a slightly more hitless way when you want to do maintenance. These are now all the components you're responsible for when you're building some kind of database-as-a-service. We have long since exited the single-postgres-process world — or single collection of processes — but these are all part of this family of things you have to build and take into account. And when you get to the provisioning side of things, you have no fewer than five discrete components here, times however many combinations, and this becomes an interesting management challenge: how do you go and systematically, programmatically flesh this out? We've got configuration-management tooling that does this — and, like I said, I'll be giving a talk in a couple of weeks with more on the provisioning aspect — but the permutations in the ways you need to take these building blocks and deploy them in the environment kind of explode. When you're talking about hundreds or thousands of what I'll call timelines that you're managing, this can get kind of ugly; I won't call it a mess, but there's a lot going on there. So how do you manage that complexity? You have to build checklists, because even though you're doing this for different workloads and different teams, there are standardized things you have to do, because they're part of your service definition. That's true if you're building any kind of database service — postgres, even MySQL; this is a postgres conference, but we do
offer this exact same platform internally for MySQL users — and the checklist is by and large the same. These are just general things you have to do as a service provider, but they're now your specialty, because you own the service that's wrapped around postgres. So here we are: ten steps, and that's great. But then you go look at the combinations here, and it becomes a really big number: the number of VIPs, the different back-end targets — it starts to explode on you. There's a large number of combinations, and then you get into this and go, "Hold on a second, I can't rubber-stamp this config out," because my config is going to be different for a team that is ninety percent writes by volume, fire-hosing data in, with pg_partman or something like that doing partition management, versus some other application doing a very high rate of queries with very small transactions where everything is reasonably well tuned — totally different workloads. So you have these per-database parameters that you now have to account for, times n instances. And then regulatory requirements are a really interesting one: we're in a bunch of different countries, and the Venn diagram of what you have to
do in order to provide services for teams that span different countries and regions is really "interesting" — I think that's the word I'm going to use. So, tips on the provisioning side. First, find something automated: you have to build a coherent model of what you're doing on a per-team basis, pull down each individual component, and figure out how to plug it into the framework you've designed internally. We use Ansible, and that's worked out very well for us, because we can provide a coherent framework for an entire team, or an entire deployment. These other ones are interesting things that have bitten us — lessons learned that we can pass along. Idempotent execution: say you have shared_buffers in a config file, you change that value, and your tooling automatically restarts postgres because you changed a restart-requiring parameter. That sounds fine and dandy until someday it happens in the middle of the day. You really don't want that. So when you deploy a postgres instance with config management, make the parameters that require a restart write-once: you never go back and revisit them. If you want to change, say, shared_buffers on a postgres instance, you should go spin up a new slave someplace else and then promote it, because that's a controlled change: you know that, off hours, you've spun it up, it's replicated, and you can make that kind of change in an automated way.
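The write-once policy can be made concrete in the rendered config — a sketch, with illustrative values:

```ini
# Rendered once at provision time; automation never rewrites these, because
# changing them requires a postmaster restart (i.e. a controlled failover
# to a freshly provisioned replica instead):
shared_buffers  = 8GB        # restart required -- write-once
max_connections = 400        # restart required -- write-once

# Reload-safe parameters may still be managed continuously:
work_mem = 64MB              # a SIGHUP / pg_ctl reload picks this up
```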
But if you give every team a foot gun that says "I'm going to let you restart postgres," you're going to suffer the consequences of downtime constantly, because teams will do this in a reasonably uncontrolled way. So provide an anti-foot-shooting mechanism where, by policy, you say: we're going to deal with this potential for downtime by not letting you restart postgres — we'll handle it for you. If you really do need it — and there are teams and databases where it's completely unreasonable to enforce this constraint — then you open a ticket and have a DBA actually do it by hand. But from an automation perspective: absolutely not. That's really dangerous. Questions so far on the provisioning side of things? Okay, fine — you're all really sleepy right now; let me see if I can liven this up again. Efficacy versus efficiency: one of the interesting things to watch in this DevOps culture is "automate everything." Great — however, watch the efficacy of the automation. I've watched people, teams, and companies blow tons of time and many, many man-hours automating something that happens once every 12 months and takes five or ten minutes to do by hand. Who cares? The ROI means the task would have to recur every 30 minutes for the next two years after you complete the automation before you break even — and that break-even never really happens. From a repeatability perspective, sure: document it, go train people on it. But don't get lost in the weeds on automation; there is a trade-off here. "Automate all the things" — we definitely want to get to a hundred percent automation, but pay attention to the ROI. So, connection management. You've got this large fleet of databases; teams really like to disconnect and reconnect, and people just assume stuff's going to work. They don't understand the full stack, and
you can't fault them for that — it's outside their specialty. But this is an FAQ at this point: "I have a connection string, I've set it to 2,000, and I expect it to be faster because I have two thousand connections." That's absolutely, positively wrong; it doesn't work that way. So connection management is a big deal. We rely on PgBouncer very heavily for this — to the point that, taking a page from Ford, you can have any color car you want as long as it's black: on port 5432 you're using PgBouncer in transaction mode, done. (If you're on Java, we'll deal with that in a second.) This has saved us an enormous amount of grief, because we have throttled the throughput and the amount of abuse the postgres process in the back end will actually suffer. For us, this has mitigated most of the situations where we'd see a load average of two thousand because two thousand connections were all contending for something — that's just gone. Our load averages still get busy — a busy box will hit a load average of 40 or 60 — but then it comes back down to something that's reasonably well managed.
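A hedged sketch of that "any color as long as it's black" layout in pgbouncer.ini — the ports follow the scheme described in the talk, but the pool sizes are illustrative:

```ini
[databases]
appdb = host=127.0.0.1 port=15432 dbname=appdb  ; postgres itself hides here

[pgbouncer]
listen_port = 5432           ; the default port everyone gets
pool_mode = transaction      ; the governor: server connections are capped
default_pool_size = 32       ; and multiplexed across all clients
max_client_conn = 2000       ; clients queue here instead of in postgres
```

A second PgBouncer listener in pool_mode = session on a non-standard port covers the teams that genuinely need session semantics.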

It always means we're no longer in a situation where everything slows and comes to a grinding halt, because we've got this governor in place — and that's really what PgBouncer is: a governor that makes sure the postgres process does not back up. It acts as a short circuit to prevent excessive load from basically slaughtering us. Postgres as a scheduler will happily accept all work; PgBouncer, because it has this connection limit, allows us to mitigate that. When you're dealing with large numbers of teams, you don't know who's going to do what at what time of day, and this is a really, really effective control. Building these kinds of controls into the environment — again, any color car you want, as long as it's black — works really well. We do offer session mode; we put it on a different, non-standard port, so it's an extra step to go do something that's potentially abusive. But the default mode of operation is: we're not going to let you be abusive of the system being provided. Our rule of thumb is a formula we've settled on: you allow n connections to the back end, derived from the number of cores on the box times some factor K — which accounts for the in-flight transactions that are off-CPU, scheduled against some external resource, and that almost always means disk. We find it's typically two to one, somewhere in that neighborhood — one and a half to one, depending on the speed of the disks. (It might actually be interesting to go back and see whether that correlates with random_page_cost.) So: JDBC. I don't know what to say — it is the bane of my existence, and it's nothing against the language; it is entirely this one piece of software, the specification and driver for the way applications talk to the database.
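The rule of thumb above can be written down directly; the factor K and the core counts here are illustrative, not Groupon's actual numbers:

```python
def max_backend_connections(cores: int, k: float = 2.0) -> int:
    """Back-end (server-side) connection cap: cores times a factor K.

    K accounts for in-flight transactions that are off-CPU, waiting on an
    external resource -- almost always disk. The talk suggests K somewhere
    between 1.5 and 2, depending on disk speed.
    """
    return int(cores * k)

# e.g. a 16-core box with K=2 would get a pool of 32 server connections,
# regardless of how many client connections pile up in PgBouncer.
```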
So — I just learned at this conference that apparently in 1.6, PgBouncer will now handle prepares such that transaction mode just kind of works transparently. Somebody said 1.6; I haven't dug into it, I just learned it here — that's the benefit of being in a conference room. So starting with 1.6 this may be significantly less painful, because it will transparently just work without you having to tell your JDBC connections that they need prepareThreshold set to 0. Oddly enough, though, that setting does result in a performance boost, because it reduces the number of round trips between the application and the server — and it's been amusing to watch teams discover that and react: "but why is it faster?" Well, you've reduced the number of trips. So, that's covered. Backups. You've got this big system, and slaves are not backups. This is probably the most common misconception I have seen over many years — not just at Groupon: people think of a slave as a backup, and it's just not. Moving on — what is potentially debatable is replication plus snapshots. Is that a backup? If the retention period is long enough, you've off-sited it to a remote location, and you can go query it at any time, can you use an off-site snapshot as a backup? Well, that assumes ZFS doesn't have bugs — exactly. You do need some separation of systems there. Depending on what you're actually trying to protect against, it can act as a reasonable backup, but it's not bulletproof — I'm not saying you can get away with it; it's debatable. Anyway: we see this constantly — large numbers of teams doing things in the middle of the night, deadlines, whatever else.
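For the pre-1.6 PgBouncer setups being described, the JDBC-side workaround is the pgjdbc prepareThreshold URL parameter; the host and database names here are placeholders:

```
jdbc:postgresql://db.example.com:5432/appdb?prepareThreshold=0
```

prepareThreshold=0 tells the driver never to switch to named server-side prepared statements, which is what trips up transaction-mode pooling.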
Automated schema processes are really famous for doing this. I don't entirely understand it, but it comes back to the loss-aversion aspect of things: "measure twice, cut once" doesn't happen through automation. Going back to that idempotency comment earlier regarding provisioning: when you go look in a directory and something's gone that you very much expected to be there — because you're creating filesystems and mounting filesystems, and the data under that filesystem just kind of disappeared for some reason — automation is both good and bad. Most of the incidents we've had — not outages, but incidents where we needed to go back to some kind of point-in-time recovery — have been a result of automation. There's definitely a cost to some of this stuff.

So how do you mitigate this — how do you prevent the foot-shooting? You have to design this big system, or series of systems, to accommodate ninety-five percent of everybody's needs out of the box. How do you do that? You give everybody their own database, but you don't make them the owner of it, because then you're just inviting trouble — those commands will actually succeed. DROP DATABASE in particular: DROP TABLE is a partial outage or partial failure, but DROP DATABASE is of a rather higher magnitude; maybe the application won't continue at all. We use pg_hba.conf very successfully, together with connection limits, so that at any one time you can have only two connections to your database instance with DDL rights. That means your schema upgrades happen from a couple of very specific hosts, which are hopefully different from your application hosts; and then there's the firewall side of things. Going through what we've found to work reasonably well: I don't want to say there was open revolt, but it took a while for everybody to move into this mindset where you cannot, from any arbitrary host, go and perform some kind of DDL upgrade. Enforcing release-engineering standards and process at the database level was an interesting series of discussions organizationally; that change was not free, but it has produced a large amount of value for the company and for the people using the service. So what we do is offer a DBA account, and we have postgres itself listening only on port 15432 — not the defaults of 5432 or 6432; both of those are reserved for PgBouncer. The reason we do that is so that these connection limits actually mean something.
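A hedged sketch of that access-control scheme — the role name, database name, and address range are placeholders:

```sql
-- Cap the DDL-capable account at two concurrent connections:
ALTER ROLE appdb_dba CONNECTION LIMIT 2;

-- And in pg_hba.conf, restrict where that role may connect from
-- (only the release-engineering hosts, not the application fleet):
--   host  appdb  appdb_dba  10.0.5.0/28  md5
```

Because the DBA role connects straight to postgres on 15432 rather than through PgBouncer, the CONNECTION LIMIT is enforced against real backend connections.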
to access the database is through PG bouncer lpg bouncer acts to this like you know m2n you know proxies so you have no idea how many actual DB connections or DBA account connections are actually present on the system so if you limit this stuff fantastic right so now you’ve got to at any one time a max of two concurrent connections to go that can go and make ddl changes and for application teams and this force is rigor on their part everybody by and large benefits fair amount of grumbling until that kind of gets automated on their end for better or worse but it the amount of incidence where we have seen drop table or drop database or you know application host a talking to the prod database or application hosts in staging talking to like the wrong those have gone like largely gone 20 now so fantastic like we’ve addressed this classic problem and then similarly the connection limits for the application you know that we pass out two teams we set the connection limit reasonably low to begin with we keep the max connection inside a postgres reasonably high so that any one point in time we can go in and you know bump this connection it on the fly without having to restart a postgres because again if you do that you should go in provision a new instance and then fail over to it and you know move on with life but this work has worked out pretty well for us so questions so far so incident response right you’ve got all these systems in place and you have teams that are going off and you know really aggressively pounding on the system what happens when somebody says something slow something’s broken something’s down right there’s all kinds of different scenarios that you need to go and chase and you know in the operations lifecycle there you’ve got different things well as the database as a service provider internally there’s lots of things that you can do to show up in these response elements where where you’re able to you know more quickly remediate the problem like 
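Going back to the connection-limit scheme above, it can be sketched roughly like this — the role names, limits, and paths are illustrative guesses, not the actual production values:

```sql
-- postgresql.conf (illustrative): postgres itself listens on a non-default port,
--     port = 15432
-- while PgBouncer owns the well-known ports (5432/6432), so applications can
-- only ever reach the pooler.

-- Cap the schema-migration account at two concurrent sessions; DDL connections
-- bypass the pooler's m:n multiplexing, so this limit is actually enforceable.
ALTER ROLE prod_dba CONNECTION LIMIT 2;

-- Start application roles low; ALTER ROLE takes effect without a restart,
-- so the limit can be bumped on the fly. Keep max_connections itself high,
-- because changing *that* does require a restart.
ALTER ROLE app_user CONNECTION LIMIT 20;
```

The per-role limit only means something when every application path is forced through PgBouncer and the DBA port is firewalled to a couple of release hosts, which is the point of the non-default port.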
ninety-nine percent of the time it's not Postgres, but it is some interaction between the application and the environment that results in some form of degradation. on Tuesday of this week I spent the better part of a day hunting down a TP99 query-response problem that had something to do with an fsync on the master — and the query-response problem was happening on the slave. it wasn't necessarily the database; the database wasn't broken, but this interaction of things and events and circumstances resulted in this problem where, for ten seconds in the middle of the night at o'dark-thirty, we saw query response times from a particular process go from 50 milliseconds to one and a half seconds. it only lasted ten seconds, but it kept kind of happening — it wasn't a batch job or anything like that, it kind of moved around in time. you have to show up for these kinds of response events, have something in your playbook, and come prepared — and this preparedness is largely a result of doing work in advance, understanding the probable forms of outage or incidents that you're likely to deal with.

the most common one is probably locking. have scripts ready, so that when somebody calls and says "hey, something's on fire, I need some help — I'm trying to run a truncate and it's not working," and there's a vacuum running that the user had no idea about, you have these types of canned queries that you can just pull out, plop on the machine, done. being able to answer the question "which process is blocking which process, and why?" is really useful, because now you've got this thing you can execute on, and you can automate it to figure out, hey, what's the health of this database — oh, it's too small. index bloat — Josh wrote this very fantastic query; I keep cribbing it, using it over and over and over. the reason I have these in here is not because I want you to check your index quality, but because this is a CTE, a very large query. a lot of these queries that we have developed are large and expensive and insanely useful, because as an administrative tool this is something that's pretty unique to postgres: you can query the system catalog in very intelligent ways. doing a SELECT * FROM pg_stat_activity is great, but it's pretty high level; if you have some theory about what's going on inside of the database, you can develop these really sophisticated queries to ask really sophisticated questions of the database and get that information back pretty quickly. it gets a little ugly, or long, but it's very, very useful. the battery of these things that it's advantageous to develop in advance — "should" is probably the wrong word — is legion. it's the gift that keeps on giving, because you can now pass this on. you've got this big
distributed environment, providing services to large numbers of teams, and now you have this training element — because at some point everybody's going to come and go, there's turnover and whatever else, and you need to be able to train the next guy. so now you've got some prescriptive way to go and do things; I'll come back to that point in a minute.

duplicate indexes are a really common one. things are slow — why? well, there was a release. okay, what happened? well, it turns out that you not only have this really expensive index with really terrible cardinality, you have two or three or four of them, just renamed, because different teams added conflicting indexes to the same set of databases — or it's just been there for two years and it finally becomes a problem because a query plan changed; the stats are such that it actually decided to use one of these, or something. some of the top used queries that we have internally are these guys here.

so I actually moved through that a lot faster than expected — I cut out a handful of things — so I'm going to go back a little. are there questions so far? [audience question] yeah — these are kind of traded on the black market, like currency. if I had to describe why these aren't distributed as a package, I would say it's because these types of queries are largely traded on the black market: you accumulate them, you go "oh, that's really useful," and once you get into the mindset of developing these techniques it keeps paying off. the system-administration aspect of this would be the equivalent of knowing how to put together a shell pipeline — grep this, pipe it to awk, go do something — and you get something useful that does some analysis for you and tells you something. this is effectively what that is, but done in SQL. on the system-administration side, for instance, there's not a whole lot that will tell you that — it would be really interesting if this kind of thing existed, but for the database, using queries. [audience suggestion: a patch to make these views that ship with the system] that would be a great suggestion. and no, I have not; I take a lot of these kinds of scripts and run them as a fire hose — the equivalent, on the networking side of things, of a show tech-support: they just gather all this information, and then I go and look at it. that's effectively what I do. some of these queries I've had for like ten years, and then a column name changes one day from one release to the next.

[audience question] yeah — so one of the most common problems that we have, going back to the way we provision accounts: it is not free to provide this kind of constrained environment where you hand a non-superuser account to teams. creating extensions — oh my goodness, we get that constantly, so there's that. the other one that's really interesting: you can't select from pg_locks if you're a customer in the database. there are perfectly talented engineers who could go and query the locks table, but it's not available, and as a DBA you can't just delegate that to somebody. you can create a stored procedure that will do it, potentially, but you have to go do work to do that — it's not something that's out of the box. [audience comment] yeah, that would be really fantastic — and there's no inheritance on users. what we really want is an instance owner: you can't drop your own instance, the same way you can't remove the current directory that you're in, to use a filesystem metaphor. you want a DBA account that has basically control of its own little container, and then you want it to be able to see everything inside of that container. that just doesn't exist right now. so, doing it as a view, through some function that gives elevated privileges — SECURITY DEFINER — yes, that's exactly right. so now you have this company-specific-ism that needs to be explained.
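That SECURITY DEFINER workaround can be sketched roughly like this — the function and role names are illustrative, not from the talk:

```sql
-- Created by a privileged user (e.g. the monitoring owner); SECURITY DEFINER
-- makes the body run with the *definer's* rights, not the caller's.
CREATE OR REPLACE FUNCTION show_locks()
RETURNS SETOF pg_locks
LANGUAGE sql
SECURITY DEFINER
-- Pin the search path so callers can't hijack name resolution.
SET search_path = pg_catalog
AS $$
  SELECT * FROM pg_locks;
$$;

-- Lock the function down, then grant it to the team's non-superuser role.
REVOKE ALL ON FUNCTION show_locks() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION show_locks() TO app_team;
```

Customers then run `SELECT * FROM show_locks();` instead of touching the catalog directly — which works, but it's exactly the kind of company-specific-ism that then has to be explained to every team.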
now, this is just one of the things that we have to deal with, and it's a source of friction — not that it's invalid or doesn't work, but it's friction. the other one is standardizing on little things like this, where the username says explicitly prod_dba. there's no mistaking that. "I thought that was the staging user, I thought that was the UAT user" — no, it's in the username, it's in the hostname. you put these warning flags up there so that people get a last-minute reminder. it's not the same username across the prod, staging, and dev databases; you make them deliberately unique, which either prevents the mistake at login or gives people a visual cue that says "oh, I copy-pasted the prod thing." people can be tired — they're not that tired. mistakes happen, but that's an easy one, and that ounce of prevention has actually gone a long way for us.

[audience question] yes — the reason we haven't done something like that is because we don't allow people local access to the boxes; we give them remote access and that's it. but the .psqlrc files are all client-side, and I have no idea what host they're going to come from — I would have to generate one of those for each host that's connecting through here. that's a good idea, though. I'm going to borrow that; I will post it and get it out. I like that a lot.

so — other questions? like I said, I went through some of this really quickly, and I actually pared back some of my slides because I thought I was going to run over. [question from Josh] we're using Ansible — that's the biggie. right now we do that just because it's multi-host, it executes in serial, it's easy to debug, and we compile out plain-text file artifacts, because then we can inspect them. one of the things that's really important — I hinted at this a second ago — is that I don't care how slow something runs. unlike, say, dynamic container environments, where you potentially want to spin things up and down and treat them as a commodity, databases are kind of special in this world because they have mass: they have important information, and it's stuck on this machine, or series of machines. you don't want to blow it away, you don't want to recreate it, and you don't want to give people the opportunity to foot-gun and cause some kind of a problem. you don't provision that often, and we can run through this stuff in a reasonable amount of time — it's not days, it's not milliseconds, it happens in a handful of minutes, and that's acceptable. so Ansible's the biggie; we looked at all the alternatives, but that's what we ended up settling on. the talk that I'm going to be giving soon has something to do with Consul for clustering, in order to orchestrate. so — other tools questions? no? it's friday — I will give you guys back three minutes of time, then.
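The Ansible approach described above — serial execution, templated plain-text artifacts you can inspect — might look something like this minimal sketch; the hosts, paths, and template names are illustrative assumptions, not the actual playbooks:

```yaml
# provision-postgres.yml -- illustrative sketch, not the real playbook
- hosts: postgres_servers
  become: true
  serial: 1                       # one host at a time; slow is fine for databases
  tasks:
    - name: Render postgresql.conf from a template (inspectable plain-text artifact)
      template:
        src: postgresql.conf.j2   # hypothetical template; would set port = 15432 etc.
        dest: /etc/postgresql/postgresql.conf
        owner: postgres
        mode: "0600"
      notify: reload postgres

    - name: Render pg_hba.conf so only known hosts and roles may connect
      template:
        src: pg_hba.conf.j2
        dest: /etc/postgresql/pg_hba.conf
        owner: postgres
        mode: "0600"
      notify: reload postgres

  handlers:
    - name: reload postgres
      service:
        name: postgresql
        state: reloaded
```

Because provisioning is rare and the data has "mass," optimizing this for speed buys little; optimizing for debuggability and inspectable output is the design choice the talk is describing.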