Scaling Puppet Usage to a Global Organization

OK, let's go then. Hi, my name is Talky Frandsen, and I will present an example of how Puppet is used in a very large and distributed organization. It's a case study that a current customer has allowed me to present here, provided that I don't mention their name, unfortunately, so I will leave it for you to guess which one it is.

But first: who am I, and what's my background? I've been working with Linux and open source software for about 13 years, building systems and, since I started working for Redbridge a few years ago, building systems that manage systems and training people how to use them. Redbridge is a company that was started with the idea of building stuff with open source, and by "stuff" I mean all kinds of boring administrative business systems or, even more unsexy, the infrastructure for boring administrative systems. Maybe that's why there are no women here at all. Sorry, that's why there's only one here, and she's paid to be here. Even though the things we work with are pretty boring, at least I think the technology we work with is sexy: we work with cloud stuff, with the stacks, the complex systems that make things like Amazon and those kinds of really big systems tick over, and one of those systems is Puppet. So that's about me, or rather more about Redbridge; I'll keep it short.

I've divided this presentation up. It's going to be PowerPoint poisoning, but there is some kind of structure here: it's three parts. In the first part I talk about the customer who is not to be named, and I will also say what the actual problem is and try to pin that down. In part two I will focus a little more on the soft issues, the way of working, which has been the greatest challenge in this. And, just to keep you to the end, I'm
going to talk a little about the technical platform in part three.

So, the case. As I said, I'm not allowed to say which company I'm talking about. It's not that it's anything secret; it's a big telecommunications company, Swedish [inaudible]. The thing is that we don't want to bother upper management, the press officer and so on just to get this presentation through, because then I would have had to start about six months ago just to get these thirty slides acknowledged.

They're interesting in that this customer has around 10 main sites around the world. I checked the time zones, and they're spread over 16 different time zones. Someone else at this company has looked at this and said: we want to use follow-the-sun, at least for 16 hours, and then for eight hours we sleep. We have 10 sites around the world; on those sites there are thousands upon thousands of users, and there are thousands, at least, of systems.

Something that struck me while I was listening to the other guys' presentations was that they said: OK, we are a Debian shop, and there is a lot to gain from being able to run just one or two operating systems. But this company can't do that. They have to use all of these different operating systems, and they have to have servers running all of them. There are both virtual and metal servers, the virtual servers using different hypervisors, and on those servers they run SUSE, at least three different versions of SUSE, every one also in 32 and 64-bit, so that makes more combinations. They use Red Hat and CentOS, about 5 and 6, also 32 and 64-bit, so that's a few more combinations. They also use Ubuntu, of course, and Solaris 8, 9 and 11. So that poses a challenge, which I will maybe go into a little more later.

On these servers, or at least for the things this project is about, there are no really special applications, and they don't run very many in-house applications; mostly they have a stack that is based on open source and closed source third-party software. The users are in different divisions, and they choose from predefined system types. They say: OK, we want to use this kind of server, or these kinds of servers, on these operating systems, but they choose more or less from a collection of different things. For some systems there are unique applications, where they say: in addition to this, we also want to run this application from this vendor on it.

For supporting all of these kinds of servers, apart from obviously Puppet, there is some supporting infrastructure as well. Of course, since they have Red Hat, they have a few Satellite installations, and the same
thing for Solaris, I think it's called Ops Center, and the same for the other operating systems. So as not to have to push every application onto every platform, they also use a big network file system called [inaudible]. So now you probably know which customer I am talking about.

So the challenge was a pretty big one. About a decade ago, I don't know exactly when, they outsourced everything. They said: this big mess is too big and too messy, let's give it to someone else and let it be their headache. But now the pendulum has swung back, and they say: OK, we can probably do this a little bit better and cheaper, because it shouldn't take 18 weeks to get a virtual server with some software on it. Well, that sets the bar pretty low; that's not what we're aiming for. They also wanted to increase cost efficiency, because they knew what it was like before they outsourced everything; they knew how much it would cost: if they just went back to that and scaled the organization up, it would be very expensive for them.

So our goal was that the administrators should only solve each problem once, globally, for all sites, and then share this solution. That's about all I want to say about that.

There are also some additional requirements, and these are a little more of a people thing. All of these sites have their own system administrators, who are experts on different systems, and we don't want to gather all the system administrators from all the time zones and have them work under a central organization, because that would just be inefficient, and then it would probably take 18 weeks to get a virtual server. So we should leverage the existing expertise: if there's an expert somewhere, we're going to keep him there and we're going to use him. We also know that the users are close to their site; each site has its own users, so it's better to have administrators working with the systems close to their users. Be close to your market. So we should let them solve their own problems, because they are the best at solving them, but we should try to get the solution back and share it.

Also, when you decide on a change in this company, they use ITIL, with the addition that you ask the customer who uses the server, "Please, may I reboot the server?", and they can click no, and then you can ask them again in a week: "May I at least reboot this server?" So it's difficult enough as it is to get changes done, and we want to keep that freedom as well.

So where does Puppet come into all this? Well, it's a tool for system administrators in this case, and it's only part of the solution that they are rolling out worldwide, but it's a very important focal point, as all of the services that they use will
need to be Puppetized somehow. If an application or service isn't properly architected or documented, then we will notice it; if not sooner, we'll notice it when we have to write the Puppet module or Puppet manifest to roll it out. So that makes it like a cutting edge.

That leads me on to part two. Given that problem, what is the solution? Well, the solution is partly technical, but mostly it is about finding a way of working. In this case, when I say code, I mean Puppet code. Why and when is code developed, by whom, and how do we share it? That's our way of working, and the facilitator here is a global team, which is supposed to coordinate this development. I was forced to put this image here, because you should have fun images. Anyway, this global team is a little bit like this guy: we have to know all of the services, and we have to know everybody everywhere, to be able to help them make Puppet code. But we are not the ones who develop the code, because, remember the slide, we should leverage existing expertise. So if we have an expert somewhere else, he should develop the code. What we should do as a global team is really find out who knows what about what, and make sure that this knowledge is shared. Since we need to share the Puppet code, we have to have some kind of code standard. The global team will be the keeper of this code standard, and we will also develop and support the Puppet architecture.

So this is an example of how Puppet code development happens. Site users on one of the ten sites need a service configured. They say: we want hosted Jenkins, and we need you administrators to set it up. So the site team, in despair, calls out to the global team for Puppet code and says: we need Puppet code for this thing, we don't really know what it is, do you guys know anyone who knows this? The global team then asks around a little and finds some kind of Jenkins expert. He may not be the best at writing Puppet code, but he is very good at Jenkins. So we from the global team help him write a module that will hopefully work for the ones who asked for it and also globally, and then we deliver it to the site team and users, and everybody's happy.

Here's another example. Site users again need some kind of service, but this time they have an expert locally on the site. So we're going to leverage him; he leverages himself. He just writes a Puppet module and says: by the way, guys, I wrote a Puppet module for this; if you need it, there it is. Then the global team takes this knowledge and keeps it, and the next time someone asks for a hosted Tomcat Puppet module, we take out this code, make sure it is still OK, and give it to them.

How many of you are familiar with using flowcharts? Flowcharts work best for programmers, and programmers who write
the old-fashioned kind of programs. I found that none of the management of this project really understands flowcharts. To me it's really, really simple. Someone needs code. Do we have global code? Yes, we do. Well, get the code. Is it OK? Yes. Use it, done. Then we can go in different directions here. Need code? No, there is no code. So then the global team, as I showed in the previous examples, asks around: does anyone have any code for this? If the answer is no, then we need to find this expert and develop the code and test it and so on, but finally we get back down here to deploy, where we want to be.

So how do we do this code sharing technically? Well, there are global Git repositories, a lot of them. They have a Gerrit installation which everyone in the company can use. We have this global repository, and everyone knows where it is, but before you deploy code, you first take it from there, pull it down locally, and deploy it from your local copy, because, remember, we don't want to be centralized.

This really is about trust: trusting someone you don't know, who works in an office in a different time zone somewhere. I've really seen people worried about this. They say: but what if someone commits some code that doesn't work? And I say: that's OK, because he will commit to his local repository; you won't even see the code. And once you have gotten the code to your local site, it's tested locally before they deploy it. They are still responsible; we aren't responsible for any of the code. It may set your server on fire, or not. Then they do an ITIL change and just deploy the code. When they make any changes to the code, they are supposed to notify the global team and say: by the way, that code didn't have support for Solaris 8 32-bit, and we added that. And the global team says: oh great, this is something we really need. So we just pull that code from their repository back up to the global repository.

So, the code standard. The previous presenters have already spoken about how you structure the code and so on, so it's not so interesting to repeat. One thing, though, is a unit test that we have added to the code standard; I will go into why we have that a little later. This is mostly a repeat of Craig's, with the roles and stuff; we do pretty much the same.

One thing to add about the README. When you document all the parameters to a module, you can do it in the README, but that didn't get updated, so we stopped doing that. Now we just do it in init.pp, where the parameters have to be. You have to write them in there, and if you change the name of a variable you have to change it in there, so just put the documentation there as well.

Another thing about the code standard that we need to worry about is that we have to have nice results, since we have to support 10 different sites with a number of
environments each, and there are so many different operating systems; the code can set your server on fire if we're unlucky. So don't just assume something: if there is a choice somewhere that you don't know about, then fail and notify the administrator: sorry, we haven't tested this operating system yet.

The code review is again a way to tackle that fear of "what if someone commits something that breaks things?" We put code review in place, which means that the global team both accepts code and sends code for review. Because we know everyone on all the sites, we can send it out to a guy in China and say: please, can you look at this code and test it on your site? We need an extra set of eyes on this code. He puts it on his test environment and tests it. What we really want to be sure of is that it doesn't break anything horribly, and that is our success criterion. So if you can't read the code, that's a fail, and if it breaks other modules, that's also a fail. And how do we know that it has broken other modules? Because there is a small unit test for each module that says: OK, this module did its job, it worked.

Having a standard is fine, but there is also the small matter of adhering to the standard. We're very far from being DevOps here. I don't think we have a single developer in this project; they're all system administrators. System administrators aren't familiar with peer review, they don't know Scrum, they don't know Extreme Programming, they may have heard about unit tests, which they rarely use; well, they probably use Git, maybe. So this is also something to take into consideration: we can't build fancy structures that are difficult to work with. It has to be pretty basic. To get going, you shouldn't be assumed to have any tools other than vi, and I don't mean Vim, I mean vi. All the other gaps the global team has to take care of and fill. If there's a guy who is very, very good at Jenkins but doesn't know anything about Git (like that's ever going to happen), then the global team has to go in there and help him and teach him: this is the way you use it. And we know nothing about Jenkins, so we'll have to learn that as well in the process.

To further lower the bar for code contribution, we've written a boilerplate module. It's a module that contains everything you need: the structure, the README and everything, and it's commented: here you put your parameters, you document your parameters by writing so and so. So what they have to do is just pull this module, rename it, add their code to it, and then they can push it back.

I'm not sure I want to go here; this is a very complicated thing which we just started discussing a few days ago, and that is how we are going to do it. I mean, touching the Puppet
manifests is almost like touching anything else: you have to choose whether you want to stay with the mainline or keep what you have that is stable, and then, once you need a certain feature, you take on the extra work of testing it. Because you have to test it. Well, everything is possible, but we have ten sites, and we have seven different major OSes, of which there are two or three versions each; it creates so many combinations. So you will have to have an automated system to test it, and you will have to have an automated system at each of these ten sites. Maybe we'll build it, we don't know. You'd have something like Jenkins starting up a lot of virtual machines and then doing operations on them. My main point is that any regression testing must also be distributed, because we can't do it centrally; we don't have access to all the sites and environments.

That brings me on to the third part, which may or may not be the most interesting for you, depending on who you are. As I said, we have to serve thousands of clients, and since we are also responsible for helping with the deployment of new Puppet installations, we said: we're not going to do the installation by hand, we're going to have Puppet manifests for deploying Puppet. And of course the platform has to support the way of working, and that's the same slide again.

So how do we deploy Puppet? First of all, there's a global network file system, which is based on NFS, with rsync keeping things moving around. On there we have copies of these Git repositories and also all of the packages that are needed for this. Because, as I said, although they have Satellite, they may have several different Satellites, so you still have to load your Satellite: if you have a lot of Red Hat packages, you have to load your Satellite with packages from somewhere, so they are in this global file system. So what you do is: you put your packages into Satellite, you clone the code from the Git repo to your local site, and you test it. Then you change parameters, and then you bootstrap your first Puppet master with puppet apply.

From there on, they use load balancing. By the way, I'm not sure if this is needed; the Spotify guy, you had a lot of nodes per server. Here we have beefed it up, really. We start out by saying: OK, we'll start out by having three, with load balancing, and the certificates in the CA will allow for up to ten Puppet masters which we can load balance between. We may have overdone it, I don't know. But adding another master is really easy if we should need to: you add a round-robin record to the DNS, you add the server, obviously, you mount the shared storage where all the Puppet manifests are, and then you bootstrap this server from an existing server by using puppet agent with --server and --ca_server. So it's good to have Puppet being Puppet-deployable.

So, as I said, we start out with three masters and go up to 10, plus one CA server, which is only doing that. We have these up to 11 servers, and to have
them share manifests, we use an NFS file system there. Obviously we are using Passenger. Is there another one in widespread use? I don't know, Unicorn. And, as I said, round-robin DNS records.

In this platform we also have Foreman. It's great to have when you have a lot of nodes, because there can be a server forgotten somewhere; we've all heard the stories of a server forgotten somewhere behind the drywall because they rebuilt the data center. So when you have a lot of servers, it's good to be able to see if any one of them has a problem. We do this by having the agents send their reports back to the Puppet master, which stores them on the same network file system where the manifests are, which is shared by all of them, including the Foreman server. So it's polled every five minutes or something. There are other ways of doing it, but the good thing here is that the Foreman server, since we have only one, can be offline for a week or something and we will still not lose the reports. Once we start it up, the cron job will spool them into its MySQL database and we can see them there.

The agents. Well, obviously we have several different ways of deploying operating systems. We're not using Razor; the main ones are Kickstart and JumpStart. On the agents we have standardized on a version: we say we're using 2.7.14, and we'll compile it for you. Since there are some who actually use Puppet for other things, which I won't go into, they can use the Puppet that came with the operating system. We install our Puppet in a directory under /opt, and it will be there even if you uninstall everything and break the server horribly; our installation under /opt will be there to bring it back to life. We run it by cron.

Here's a controversy. This is really a code standard issue, but this is where the Puppet files come in. Many liked the idea that, great, now we have a way of distributing files universally with the Puppet masters. But then I said: no, sorry, you can't use it. Why? Well, I know it's a little late for showing code, but this really is the reason why we don't. The problem is that Solaris can't use anything but the file system to install packages from, until version 10, where you can pull the package from HTTP, but the admin file and the response file you can't. So you have to have things mounted in the file system to be able to install packages, and that breaks everything about using something other than NFS, really. The reason is that if you want Puppet to transfer the package, you could transfer it to the local file system and then have pkgadd pull it from there, but then you will have to write some code for transferring the file. So this is all the code for installing a package that is stored on an NFS volume, and this would be the code if you also had to transfer it, and this is not counting the custom Ruby, the custom fact code, that you would have to have to determine whether you should transfer this file or not. So we're going with this; it's easier.

Orchestration: we don't have it. I think the biggest reason for this is that we have so
big change windows. We can say: OK, sometime during the next weekend there will be a change. So why would you need to specify exactly what second we do the change? Also, in this organization, even though we stay at the same sites, there are complicated networks, firewalls, security guys, and a lot of red tape to open a port somewhere, which makes even having Puppet run a little bit difficult. Imagine that you have to reach all your servers on a port somewhere; they would have to open something like a hundred firewalls.

OK, almost right on time. What did we learn? What have you learned from this? Well, the first thing is something that we didn't learn here, actually; it was learned elsewhere, but I'm really glad that we brought it with us: we try to avoid having modules depend on modules. As I said, we have each module in a different Git repo, and all of the sites can pick and mix their modules as they see fit. So they say: OK, there's a new feature in this module, I'll take this module, but only this module; I won't upgrade all my other modules. So, to avoid a lot of regression testing, we have modules where you can take just one module and use it, and it will be fine as long as that module works.

Another thing, and this can be difficult: I used to say it's never too late to give up. When you have invested a lot of time writing a very complicated Puppet module, a big chain of execs to install something nasty with, like, a Java installer which really is a GUI, something like that: give up. Puppet is not for everything. What we can do is have a guy who is good at this thing write a shell script, which you can ship out as a template, and let the shell script install it, because shell scripts are much, much better at doing sequential, deterministic execution. The packagers have understood this; that's why packages have pre- and post-installation and uninstallation scripts in them. RPM has them, Debian has them, pkg has them. So if it's too complicated, if you have had five different people look at the same module trying to rewrite it and it still doesn't work, give up and do it with a shell script instead.

But one thing we have really learned is that the biggest issues are with people. When you come to a big organization like this, what they have a lot of is people, and all these people will have different expectations of what we're trying to do. I explain to them what Puppet is, and they say: oh, great. And then they think that once you have created a Puppet module for something, it is a turnkey solution which they can just give to anyone, maybe not even a sysadmin, maybe a guy from the cleaning department: hand the Puppet module to him and he'll deploy the system. That's not it; that's not what we built. Puppet is a tool for system administrators. You can't really replace them; you can, but only under very controlled circumstances and with a lot of testing. And even within the system administration community at this organization there are endless discussions of everything, from the way of working to what variables should be
named: should we use uppercase, lowercase, underscores, things like that. Be prepared for this if you ever try to do something like this. I think that's it. Questions?
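
Editor's sketch (not from the talk's slides): two of the code-standard points above, failing loudly on an untested operating system and installing Solaris packages from a mounted file system, might look roughly like this in Puppet 2.7 syntax. The module name `example_service`, the package name, and the `/net/pkgshare` paths are hypothetical placeholders; the actual manifests shown in the talk are not part of this transcript.

```puppet
# init.pp for a hypothetical module; every name below is illustrative.
#
# Parameters (documented here rather than in the README, per the code standard):
#   $package_name - name of the package to install
class example_service (
  $package_name = 'example-service'
) {
  case $::operatingsystem {
    'RedHat', 'CentOS', 'SLES': {
      # Platforms assumed (for this sketch) to have been tested.
      package { $package_name:
        ensure => installed,
      }
    }
    'Solaris': {
      # Package stream and admin file are read from an NFS-mounted path,
      # so nothing has to be transferred to the node first.
      package { $package_name:
        ensure    => installed,
        provider  => 'sun',
        source    => "/net/pkgshare/solaris/${package_name}.pkg",
        adminfile => '/net/pkgshare/solaris/admin',
      }
    }
    default: {
      # Don't assume: stop the catalog and tell the administrator.
      fail("example_service has not been tested on ${::operatingsystem}")
    }
  }
}
```

The `default` branch is the point of the sketch: a site that picks up the module on a platform nobody has tested gets a clear failure instead of a guess.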