Datadog – Using Metrics to Measure and Understand Your AWS Environment Performance

All right, so we're at a session called Using Metrics to Measure and Understand Your AWS Environment Performance. Really long title. Basically the gist of this is that I want to make sure you've got all the information you need to start working with metrics and to understand monitoring, especially in an AWS environment. My name is Matt Williams, and I'm the evangelist at Datadog. Just kind of curious, a show of hands: how many of you have actually heard of Datadog? Wow, awesome. How many of you are using Datadog? OK, that's OK, you will be soon. If you want to reach me, I'm mattw at datadog; I think mattw@datadog.com also works. We're at datadoghq.com, or datadog.com; I'm not sure why we threw that HQ on there. And on Twitter I'm Technovangelist, which is a pretty cool handle to have for an evangelist. Out of the people here who have heard of Datadog, how many instances do most of you have? Let's say more than 50? Great, cool. How about more than 100? OK. More than a thousand? Cool.

In this session we're going to talk about six main things. A little bit of an intro: why are we dealing with this at all. A little bit about who I talked to: I talk to a lot of customers, and this presentation comes from information I collected from a lot of those customers about what they're doing and how they're monitoring their AWS environments. We'll talk a little about why, and how you're going to deal with the issues you need to deal with, then what kind of stuff we need to monitor, and finally the details: which metrics we recommend you look at, at a very minimum. Now, whenever we talk about specific metrics, it's hard to recommend "look at these three things and nothing else," because there are far more than three things you need to look at. Or five things, or fifty things; it depends on your environment. So one thing I'm going to mention a few times throughout this session is that there's no way to come up with one specific set of recommendations, because it depends on your business, your environment, your goals. You can't just rely on what you see in a presentation or a book or a blog post or a website; you need to put a little more thought into what's required.

So, a little bit of intro. Before I get too far into this, I want to let you know that getting started with monitoring on AWS, or whatever your platform happens to be, is pretty easy. At Datadog we make it really easy: Datadog is a SaaS-based monitoring platform, and we've got agents that you run on each of your hosts, and we try to make it super easy to get those agents up and running. So getting started is easy. Getting good? That's not so easy. It takes time. It takes time to understand what your environment looks like and what normal is. Normal for me, for the environment that runs Datadog, is going to be very different from what normal looks like for you. And it's not exactly quick: we'll display the metrics right away, but for you to understand what's good and what's bad, what's normal, takes a long time. It takes whatever the cycles in your business are. If you cycle through big bursts at the end of the month, it's probably going to take a couple of months to figure out what normal is; if your cycle is more day-to-day, it'll take at least a few days. There are always cycles, and it takes a while to settle on a normal. And it's OK to come up with a normal today and another normal tomorrow; as you progress you get better and better, closer and closer to what normal should be. The reason I keep talking about what is normal is that we eventually want to come up with

some alerts that tell you when there's a problem. And to be alerted when there's a problem, you need to know what normal is. We're a monitoring company, and one of our customers, a company called AdRoll (I think they were mentioned this morning), has a guy named Brian Troutwine who talks in a lot of his presentations about monitoring having three parts, and I thought it was a really great way of explaining it. The three parts are visualization, alerting, and analysis, and each one breaks down like this. A visualization tells you how things look, but not why. If you look at enough visualizations, maybe you can figure out the why, but on its own a visualization is not really going to tell you why. It's going to show you stuff, make it look pretty (we try to make it look pretty), and give you a lot of information in a condensed area so you can see how things look. But again, not necessarily why. Alerting tells you that one particular thing happened: a value went over a certain threshold, went below a certain threshold, or there was a sudden change. But it's not going to tell you why; it tells you something happened, not why. It's the analysis that tells you why, but only if you know how to ask for it. So those three things combine to make up what we think of as monitoring. Now, notice I didn't talk about actually doing anything to fix the problem. That's a little outside the scope of monitoring, which is where a monitoring company ends. We've got 100-plus integrations to make it really easy to monitor whatever your environment looks like: AWS, and on top of AWS, Postgres or Redis or Cassandra or nginx or Apache, whatever it happens to be. We try to make all three steps of monitoring as easy as possible.

So who did I collect this information from? For this presentation I looked at three main customers, and of course us; we're a user of Datadog too. I talked to the folks at AdRoll, who I heard were mentioned this morning as a case study. Real-time ad bidding: they do two million bids per second, an incredible amount of stuff they have to process through their systems. Another one you maybe haven't heard of is called Team Internet, and they've got a part of their business called DNTX. They're in the business of domain parking, and when I first heard that, I thought, well, that doesn't sound like ethical stuff: you're buying a bunch of domains and sitting on them for a long time. But what was really interesting was this parked-domain traffic exchange; I'd never heard of such a thing. The idea is that you buy some domain, say myfavoritemonitoringtools.com, and you don't have anything to put on that site; you haven't invested the time to build it out. So you can sell the traffic of people who visit that domain on this exchange, and other people who want traffic can buy it. When a person visits the domain, within some hundredths of a second they get redirected to the buyer's website, because that traffic was purchased on DNTX. Of all the parked domains and domain-parking sites in the world, about ninety-five percent of the traffic goes through them. Pretty cool stuff. And another customer I talked to was SimpleReach. SimpleReach is all about content measurement. A company like a big publisher produces all sorts of content every day, and they want to understand which articles are best, which article they should really push up to the top. And it's not just about

solution. And finally, I work for Datadog, so I was able to talk to some of the people who manage Datadog internally about the most important metrics people should deal with. OK, so that's a little bit about who helped me build out this session.

So let's talk about why. Why do you want to measure? Why is it important? Why are you here? In order to improve, you've got to know where you came from. You need to understand what performance was like yesterday, or last week, or last month, so that you can go forward and make sure you're always better, or at least the same. See where you came from; see how you're doing right now, especially compared to where you were yesterday or last month; understand that difference; and then see how you can improve. Just seeing that you're doing better than you were before is nice, but you want to see how you can improve on what you were doing. You want to be able to come up with experiments that test new ways of doing things faster and better. By having a bunch of metrics that are recorded every second (and we keep that data for a year), and by trying out a few experiments, you're going to be able to see whether things have improved, and whether they haven't. This was one of the main things the folks at AdRoll talked about: if they come up with an improvement to some portion of their business, they'll roll it out to maybe ten percent of their servers, like Netflix and a few other companies do. And they've got a dashboard in Datadog that looks at the existing machines, the ninety percent of machines that are already out there, next to the ten percent of machines that have been updated, comparing the exact same metrics: has performance improved, or is it a little bit slower? When you work with Datadog dashboards, you can say, maybe I have one line that represents some performance metric for all my old machines, and create another line on that same chart that just looks at a different tag, and that tag could be new-version or new-boxes or something like that. So the reason you measure is to see where you came from, what you're doing now, and how you can improve.

OK, that's easy; anybody could do that. What makes this complicated is this idea of elasticity. Elasticity is what makes this really, really complicated, and it's become the new normal. Ten or fifteen years ago, typically you had five boxes, ten boxes, a hundred boxes, whatever happened to be in your server room, and the machines you had in your server room one day happened to be exactly the machines you had the next day, and the next day, and the next day. More recently, the number of instances fluctuates a lot more: instances might have a lifetime of maybe days or weeks or months. And now we're seeing things like Docker and other container technologies, and these containers have much shorter lifetimes. One of the companies we work with does amateur sports scores: recording sports statistics for your kid's baseball game, including play-by-play of what's going on. There's a scorekeeper, and they can record that little Jimmy went to first base, click, and all this play-by-play data is stored up on their servers on AWS. The number of games during the week is pretty minimal; they need just a few instances running. But on the weekend, the number of games they're tracking is somewhere around fifty to sixty thousand, and of course with each game there are however many players on each team, and all these different plays that happen, and there's this enormous amount of data. So to go from zero to thirty thousand games per day, they need to scale up incredibly, and this elasticity is what makes this a lot more complicated, because they can't

rely on just a few machine names in your monitoring. You need to come up with a different way to monitor and manage large groups of machines, and these groups of machines are changing all the time. So, some scenarios that require elasticity: amateur sports (that was a pretty interesting story when I talked to those guys), shopping events, concerts, marketing campaigns (definitely something AdRoll deals with), and a few others.

OK, so that's a little bit about why. Now, how? One of the ways we try to manage that complexity is through the idea of tags, and I've got a few screenshots here of tags you're probably familiar with. The one on the top left is my Gmail mailbox. Gmail uses tags; they call them labels. For each message I can assign several labels, and in this case I've assigned a label of support and a label of dev. If I click on dev, which looks like a folder, I'll see all my dev-related messages, and if I click on support, I'll see all my support-related messages. Except they're really just tags; they're not folders. Over on the right side you've got your EC2 console, the AWS console, and here I've got a list of machines. I'm only showing the top one, but I've assigned tags to each of those machines: Name is a tag, and then I've created another one called demo-tag and given it a value of matt-demo. The idea being that I would create a bunch of demo instances and assign different values to that tag: matt-demo, chad-demo, different demos depending on who's running that particular demo. And down on the bottom I've got my Mac OS X Finder; the Finder on the Mac, as well as Windows, has supported tagging files for a long time (OK, the Mac not so long, but Windows for a really long time). So tags make this easier, and we've found this in Datadog as well.

This happens to be our host map. In the host map we can see our infrastructure; we're looking at the part of Datadog that's in us-east, across its availability zones, and I can start breaking this down, finding the different parts that interest me, based on tags. So I'll go ahead and do that right now. Let's see, I want to group by image, and now I've broken this down by availability zone and, inside that, by which AMIs are being used for each of these hosts. Each of these hexagons is a host. I can zoom in on one of these availability zones, and at this point, hmm, OK, this looks interesting. Using a tool like this, I can see maybe I've got two different versions of the same image: somebody has come up with a next version that upgrades a few pieces of software inside it, and I can see the different versions of the same thing side by side, and notice, oh, it looks like on the newer version all my instances are slower for some reason, and the only difference seems to be the AMI. So maybe that's an AMI we should avoid. I could also group this by anything else that has a tag: a role (we have roles associated with all of our instances), the kernel associated with that instance, location, all sorts of other things. Let's go with role, and I'll zoom in on one of these sections, and maybe also look at instance type. Now I'm looking down at the c3.2xlarge or r3.xlarge, and I keep zooming in to find just the groups of machines I care about. I didn't have to hunt for a particular instance name; of course, nobody remembers what these instance names are, so you need some other way of finding the machines you care about. Right now I want to learn about all the r3.xlarges that happen to be using that particular role and that particular image, and this host map is a great way of finding that data. OK, so that's pretty cool. Tags also allow for a kind of ad-hoc

aggregation, an ad-hoc query. In this case I've got an example of something you might look for when managing Docker containers. You might have a query that says: monitor all the Docker containers running the image web, in region us-west-2, across all the availability zones, and make sure the resident set size is less than a gigabyte, on instance type c3.xlarge. The parts that would be tags I've shown in bold: image is one of my tags, so I'd assign an image tag of web, a region tag of us-west-2, an availability-zone tag (in this case I'm looking at all of them), and an instance-type tag of c3.xlarge. That makes the query a lot easier to deal with. I don't have to write a query that looks for specific machines; I don't have to do complicated list and group management. I just have these tags, and I can change the key predicate in the middle, which is that the resident set size is less than a gig. I could easily change it to: resident set size greater than one and a half times the average resident set size across all the web images on c3.xlarge, in us-west-2, across all availability zones. So tags make this really easy to deal with, really easy to manage. And this could be all your Docker containers: we're seeing customers with tens of thousands of Docker containers across hundreds of instances on AWS. Dealing with tens of thousands of container names is really difficult; dealing with a few tags is a lot easier.

So what about context? Here I'm looking at one of the dashboards we provide for Postgres, over some two-day period back in March. One thing we try to do is provide context. A metric on its own doesn't really tell you a whole lot about what's going on; you need context to understand the full story. One of the ways we provide context is by having a lot of graphs on the same page. These are probably smaller than you might actually use, just to fit them all on screen at 1280 by 720; if you have a much bigger monitor, you'll have much bigger graphs. But at least I get a little more context about what's going on right here. Oh, and before you think, wait, none of the labels at the bottom make any sense: that's intentional. This is actual, real data from us, so I've got a little Tampermonkey script that runs and changes all the fonts to something that's not readable by any of you in the audience. It doesn't normally look like that. As I drag my mouse along, I see this vertical bar that shows me what's going on across all the other graphs I'm looking at, and that provides a little more context about what's happening in this environment. If I want even more context, I can pull in other sources, say from GitHub, and now I'm looking at issues and pull requests and other kinds of information from GitHub overlaid on top of all my dashboard graphs. I don't know if you can see it, but there's this little vertical line moving along as I drag, and I can see the particular issue, in this case a pull request, which gives me a little more context about what's going on. So now, because I'm using tags, because I'm using other sources of context, I'm able to get a much better idea of what actually caused a problem: if there's a huge spike in something, what caused it? Being able to find different ways of looking around your environment to understand that context becomes really important, and tags are what allow us to do this. In this case, the tag is sources and the value is github.

OK, so what? What metrics do you need to look at? This is the hard part. Understanding why you need to do this, that's easy. Knowing which metrics to look at is a lot harder, because there's not really one source of expertise. We're trying to become one of those sources; we're starting to build out more and

more blogs about which metrics you need to monitor. But there's not one place you can go that has exactly the right information for you, and this, again, is the hard part. There is a lot of guidance, though, a lot of books you can grab. We happen to really like the Systems Performance book by Brendan Gregg, not so much because it recommends specific metrics to look at, but because it recommends a way of looking at metrics in general, a way of classifying metrics. I also really like Effective Monitoring and Alerting, and again, not because it recommends specific metrics, but because it looks at patterns: how to analyze the dashboards you're looking at, how to read them, how to identify anomalies, and what anomalies even are. And then I added some random sysadmin book. I'm not saying I recommend that one in particular; choose your flavor of system administration book. There are lots of books and blog posts and websites that say "here are the five metrics to look at." Visit datadoghq.com and go to our blog; we've got several posts that say here are the five metrics to look at. Those five metrics are important, but they're not necessarily the five metrics you need to look at, because your business is different. I wish there were an easy way, but usually the best way is to find, hire, or train some sort of expert in your environment: somebody who understands what your needs are, what your goals are, what is good for you. Is it that customers never experience any delay adding something to a shopping cart? Is it that you never lose a transaction? What's the key bit of information, the key metric that's super important, and how do you translate that back to AWS EC2, or to Postgres, or to Redis, or to whatever applications you're dealing with? So this is, yeah, the hard part, because your context changes the meaning of the metrics you're recording.

The Brendan Gregg book talks about three categories of metrics. There's utilization: percent over time, so CPU utilization or disk utilization or how utilized whatever other resource is. There's saturation: wait-queue length, so how long is the queue in front of whatever is doing the processing? That queue length is about saturation, because if the queue starts growing, obviously your resource is over-saturated. And then there are errors: just an error count, how many bad things happened in X period of time.

OK, that didn't show up too well, but to help understand what is good and what is bad and what your metrics are going to look like, it helps to look at some patterns, the patterns you're typically going to see in your dashboards. I've got a bunch of these patterns that I grabbed mostly from Datadog dashboards, and I've removed the scales, although the scales are super important, because with pretty much any one of these patterns you can see the same shape in the same metric just by zooming in or out enough. So try to think about these at whatever the normal level is. We might have the spiky pattern: a lot of bursts up in value. This might be CPU utilization on a machine that's really busy all the time: constantly spiking up, stopping for maybe a second, then spiking up again. There's the steadier view, where things tend to stay about the same. Maybe the machine's not properly utilized and it's a low steady value, or it might be a really high steady value, in which case it's probably utilized pretty well. There's the counter, which is constantly rising. It could be the number of times somebody viewed a page in a day, and that constantly goes up, or the number of errors you experience from some plugin on your system; it constantly goes up until you reset the value. And there's the bursty pattern, where things tend to stay around a low value, or maybe a really high value, and then all of a sudden spike up or down.

It's the pattern where things generally hover around a normal level and then spike up every now and then. There's the binary idea: this is not exactly a binary view, but the idea is an on-off pattern. Maybe it's a container that comes up, runs for 30 seconds or 10 minutes, and then shuts down; that's the idea of a binary view. There's also the classic sawtooth; we see this across all sorts of dashboards all the time. Maybe a cache is being loaded up, or a queue is being loaded up, constantly getting more and more items, and then all of a sudden whatever is processing it has this burst of activity and the queue drops, or the cache is flushed: the value drops down to nothing. We see sawtooth all the time, in a lot of graphs. Then there's the idea of cyclic graphs, where we've got some pattern that keeps happening over and over again. This could be daily: maybe this chart is looking at two days, and the trough is in the morning before anybody wakes up, the peak is maybe at lunchtime, and the trough comes again in the nighttime. It could be that, or it could be some other interval; of course, I didn't show you the scale, so it's kind of hard to know exactly, and that was intentional. And then there's a kind of stair-step view ("stairy" is not a word, but you get the idea). If this were going up, stairs going up, maybe it's the number of connections to a particular database: you spike up to some value, stay at that value for a while, then all of a sudden need a bunch more connections, open them up, chug along at that level, then grab another batch of connections. In this case it's going down and then holding steady; we see that kind of pattern too, though not quite as often as, say, the sawtooth, which you see all the time.

And once you've looked at some of these patterns, you start looking around for anomalies. This one looks like the average of CPU stolen, by instance size, over a long, long period of time. If this were a really short period of time, maybe those anomalies would be really important: oh, there's this anomaly right here, is that bad, is that good, or is it just not important? This happens to be one spike of not really that much over the course of 12 months; probably not all that important, not important for me to focus on. Unless, of course, it's December 30th and it just happened just now, in which case I probably do want to know about it, and I want to try to see whether it's a problem across machines. And because I've got these tags assigned to all my different metrics and hosts and instances, I can use them to understand what else is being affected by this particular spike. So are the anomalies the focus, or should they actually be ignored? It depends on your environment, on what you're looking at, on so many things.

So you want to figure out your cycles. Here I'm looking at one of our integrations, with Desk.com; we use Desk.com for tech support. We can see the fluctuations through the day, looking at four days of activity. In the morning things are pretty quiet, customers aren't sending in any tickets, and then somewhere around 10 o'clock (maybe it's 8, maybe it's 10, whenever people wake up) things spike up. Then around lunchtime everybody would rather eat lunch, and it spikes up again in the afternoon and dies off toward the evening, with fewer people filing tickets in the evenings. Then that same pattern repeats, and people go home for the night. I don't know what happened on that fourth day; obviously something not so good, because we had a lot of tickets then. But we see the pattern that happens: mornings tend to be pretty low and afternoons tend to be pretty high, and on that last day obviously something big happened. So it would probably be a really good idea to have an alert based on a value closer to that level. But you wouldn't know that unless you had figured out your cycles; you wouldn't know what normal is. You might

think normal is if you start looking you know somewhere around here you might think normal is right here it’s not until you understand those patterns of what happens each day that you know ok that’s actually not so bad it’s this stuff that’s bad here I’m looking at a month of cloud trail cloud trail related metrics and I happen to see that this is about month I can see there’s one week to week three four okay five no two halves so four weeks of cloud trail data and you’ve got like leads five fingers of spikes and each of those fingers represents a different day of the week and not as much cloud trail related activity happening on the weekends for this particular host and so understanding what a pattern you know understanding the pattern understanding the cycles helps me know where to set the alert thresholds I throw this one in kind of fits just getting the scales right sometimes it’s easy to get put two metrics on the same dashboard that don’t really relate to each other don’t really fit on the same scale and so you might have you know some value that’s stays around zero and then some other value that stays around 80 or 800 or eight million but it’s because the scales just aren’t right you want to get the scales right or just so that things you can compare things more accurately combining sketch time scales in this case I’m looking at docker as comes from one of the customers looking at docker for some period of time and being able to look at docker performance per hour per day per month in this case they weren’t using it for that long in the month but understanding how these different scales fit together help you to identify the different patterns that you might see here’s another example of these these patterns that you see in days so this is ZFS logical reads not sure what software this is coming from but I’m looking at it kind of spikes up around 8 o’clock in the morning a little trough around lunch and then dies off towards the end of the day looking at a 
Looking at a week of data, we see that spike every day, and looking at a month of data there are the five fingers of spikiness and then two days down in the trough on the weekends. So it helps to combine different scales and different values to find the patterns. In this case I'm looking at one particular metric, and one of the functions in Datadog is the idea of timeshift: you can shift things back a day or back a month, and that way I can look at one chart that shows me information from multiple days, multiple weeks, whatever I want, using that magical timeshift function.

OK, so, details. Enough of the generalities; let's talk about some actual details, actual metrics that these customers, as well as Datadog, have found to be really important. And again, it's impossible to come up with "the best metrics" for you, because your environment is different. What's important to Datadog, what's important to AdRoll, what's important to SimpleReach is different from what's important to you and your environment; it all depends on your workload.

On Amazon EC2: this was, I think... AdRoll? Actually, no, the folks at Team Internet. They bring up instances, and most of the instances they bring up are fine, but every now and then one or two show up where CPU steal goes kind of wacky. A VM that gets loaded onto one particular host happens to have really high CPU steal, and when you look at that combined with CPU idle you can see a correlation between the two. Once they see in Datadog that that's happening, they quickly kill that instance and bring it up again, and chances are it lands on a different box and everything is fine.
Again, this is one of those things they look for every time they bring up a machine or a set of instances: look at CPU steal, and especially alert on it, and if they see that alert fire, they know to shut the instance down and bring it up again, and they should be good.

system.load.norm.5, the system load normalized over five minutes, is a great way of understanding that, and along with memory percent usable and disk percent usable it's a great way of understanding whether the instance type you're using is big enough, or too big. These types of measurements help you see, OK, this one obviously needs to be upgraded to the next instance-type size.

Another set is the AWS EC2 network in/out and disk read/write ops. One thing that's different between these metrics: the system CPU metrics, all the system-related metrics, come from the agent, which is polling the actual machine, the host operating system, for those values. The AWS EC2 metrics come from CloudWatch, and CloudWatch is updated, depending on how much you're paying, every minute or every five minutes, whereas the metrics the agent collects are updated every second. That doesn't mean you should rule out the EC2 metrics; just use them in the context of everything else you're collecting to get a full picture of what's going on, because there are some things the CloudWatch metrics capture that we can't measure any other way. So you definitely need those, but combine them with the metrics you're collecting using our agent, or whatever tools you're using.

In this case I'm looking at CPU steal, and I can scroll down and see that same CPU idle; nothing really special here. It's not until I go back some time, select a range and go back a bunch of months, that I see the instance, an m3.medium, had the issue. That's where I saw that spike, somewhere toward the end of December, and there's also one around March. In both of these cases there happened to be a message in the AWS logs saying, hey, there was this issue, we experienced it, it's all fixed now, don't worry about it; but we happened to notice the problem just before we got that log message. Anything else on there? Nope.
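By the way, if you want to pull those CloudWatch-side numbers yourself, the shape of the request looks roughly like this. This is a sketch only: the instance ID is made up, the actual AWS call (shown in a comment) would need boto3 and credentials, and the function here just builds the request parameters.

```python
from datetime import datetime, timedelta, timezone

def cloudwatch_request(instance_id, metric="NetworkIn", minutes=60, period=300):
    """Build kwargs for CloudWatch's get_metric_statistics.

    Period is 300 seconds here because basic monitoring only publishes
    EC2 datapoints every five minutes (one minute with detailed monitoring).
    """
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": metric,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": period,
        "Statistics": ["Average"],
    }

# With credentials configured you would pass this straight to boto3, e.g.:
#   boto3.client("cloudwatch").get_metric_statistics(**cloudwatch_request("i-0abc1234"))
print(cloudwatch_request("i-0abc1234"))
```

Note how coarse the data is compared to a per-second agent: at best one datapoint per minute, which is exactly why you want both sources side by side.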
Here I'm looking at system.load.norm.5, the system load normalized over five minutes, across all my machines. One thing I can do is change this up a bit and group it by host, or maybe by image; that way I'll see multiple lines, and I'll display them as lines instead of areas so I can see them better, and save this out. Oh, I can't get down to the OK button, my screen is too small. Well, the idea is that now I'm grouping by image, and I see one line for every image I'm using on AWS. That way I can see very quickly that, for this particular metric, this one AMI is a little bit different, performing not quite as well. Of course, the scale is 1 or 2 out of, I think, 100, so maybe it's not that big a deal, but if the scales were different I'd be able to see very easily: hmm, maybe I need to investigate what's going on with that image. What's the role on the image? Is there a role spread across multiple images, in which case maybe this is the wrong image to use for that role? I'll close out of that and move on.

When you're dealing with provisioned IOPS EBS, one of the things to look at is the AWS EBS VolumeQueueLength, and it should never be over one. One is where it should be, and of course we're dealing with values that are updated every minute or every five minutes, so across a minute it should never be more than one. In this case we have a bunch of spikes where it jumps up to five or ten, and every time it does that, we happen to have a correlated CPU iowait spike right around the same time.
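As an aside, the graph I was just editing in the UI boils down to a JSON definition. The sketch below shows roughly what that looks like, with a request grouped by the `image` tag plus a `week_before()` overlay like the timeshift trick from earlier; treat the exact field names as illustrative of Datadog's graph JSON rather than a copy-paste reference.

```json
{
  "viz": "timeseries",
  "requests": [
    { "q": "avg:system.load.norm.5{*} by {image}", "type": "line" },
    { "q": "week_before(avg:system.load.norm.5{*})", "type": "line" }
  ]
}
```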

Just by looking at the EBS VolumeQueueLength we can see this kind of pattern, and if we were only looking at CPU iowait, maybe we wouldn't identify it as a problem. So EBS VolumeQueueLength is one of those really useful metrics for getting a better idea of what's going on.

For Elastic Load Balancing, some of the metrics we think are really important: HealthyHostCount and Latency. The idea is that you want to keep enough hosts behind the load balancer to keep that latency low. HTTPCode_ELB_5XX: with enough hosts, this should sit right around zero, and that's good. SurgeQueueLength: as the number of inbound requests becomes too much, you'll see the surge queue length increase, and that usually happens when you don't have enough healthy hosts to deal with the load. And then there are the backend connection errors, with a count for each class of HTTP response: 5XX, 4XX, 3XX.

I actually have an example of that. Here I'm looking at HealthyHostCount, which I have a hard time saying every single time, and ELB Latency. Latency is pegged right at zero, the healthy host count is good, meaning we've got enough hosts to deal with this load, and HTTPCode_ELB_5XX across all my ELBs is around 0.01, such a low number it's hardly worth noticing. Then I've got the backend 3XX, 4XX, and 5XX, so redirects, client errors, and server errors, with the 5XX line right down here. The 5XXs are pretty low, so things tend to be pretty good here. Those are some of the key metrics that we, and some of our customers, have found really good to keep in mind.

So how do you get started with this whole process? How do you play around with this stuff? Well, we recommend Datadog; that's what we do. The easiest way to get started is to go visit the website, datadoghq.com.
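Going back to those ELB metrics for a second: the thresholds described above are easy to encode as an alert check. Here's a hypothetical monitor-style sketch; the threshold numbers are examples I picked for illustration, not recommendations.

```python
def elb_alerts(healthy_hosts, surge_queue, elb_5xx_rate,
               min_hosts=2, max_surge=0, max_5xx_rate=0.05):
    """Return a list of alert strings for one evaluation window."""
    alerts = []
    if healthy_hosts < min_hosts:
        alerts.append(f"only {healthy_hosts} healthy hosts (want >= {min_hosts})")
    if surge_queue > max_surge:
        # Requests are queueing on the ELB itself: not enough healthy backends.
        alerts.append(f"surge queue length {surge_queue} > {max_surge}")
    if elb_5xx_rate > max_5xx_rate:
        alerts.append(f"5XX rate {elb_5xx_rate:.2%} > {max_5xx_rate:.2%}")
    return alerts

# One unhealthy window: too few hosts AND requests piling up in the surge queue.
print(elb_alerts(healthy_hosts=1, surge_queue=12, elb_5xx_rate=0.01))
```

Notice how the conditions reinforce each other: a surge queue almost always shows up alongside a low healthy host count, which is exactly the correlation described above.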
There's a two-minute video that shows you everything we do, well, two minutes' worth of what we do, and then sign up for a free trial. Pretty easy, no credit card needed. It's 14 days, and if you don't like it you can keep using it for five hosts or fewer and we never charge you a thing; but if you really like it, which we hope you do, it's about 15 bucks per host per month.

And that brings me to the end, so at this point I want to open up to questions. In case you're interested, you can find this deck up on /datadog/evangelism-presentations. If you're not familiar with Go's present format it might look a little bit weird. Other than that, any questions based on what you've heard? And there are mics, so they want you to use the mic.

Audience: You talked about everything related to the AWS performance metrics. What about application performance metrics? If we want to capture data about our application-specific behavior, what is the interface we have to Datadog, and what can we use so that the application data can be pushed out to Datadog and we can see it in a dashboard?

Matt: So is this an application that you're writing yourself? ...OK, a custom application. We've got a RESTful API and a bunch of libraries in all sorts of languages, whether you're using PHP, Go, Ruby, even Elixir, all sorts of things. Use one of those libraries to instrument your application at the key points. You probably don't want to record everything, because it just gets to be too much, but at least record the really key moments in the application lifecycle, whatever the most important parts are. Those are sent up to Datadog and look like all the other metrics we're collecting, and we can create a dashboard that looks at, say, AWS metrics and your metrics all in the same view, so you can see the correlation.
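To make that concrete: under the hood, most of those libraries talk to the agent's local DogStatsD listener over UDP using a tiny text protocol. Here's a stripped-down sketch; the metric name and tags are made up, and in practice you'd use one of the official client libraries rather than rolling your own.

```python
import socket

def format_dogstatsd(name, value, metric_type="g", tags=None):
    """Build a DogStatsD datagram: 'name:value|type|#tag1,tag2'."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram

def send_metric(name, value, metric_type="g", tags=None,
                host="127.0.0.1", port=8125):
    # The Datadog agent listens for StatsD datagrams on UDP 8125 by default.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(format_dogstatsd(name, value, metric_type, tags).encode(),
                (host, port))
    sock.close()

print(format_dogstatsd("checkout.completed", 1, "c", ["env:prod", "app:store"]))
# checkout.completed:1|c|#env:prod,app:store
```

Because it's fire-and-forget UDP to a local agent, instrumenting a hot code path like this adds close to zero overhead, which is why you can sprinkle it at the key moments of the application lifecycle.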
Audience: Do you have any suggestions, or do you have support for... basically we're a Java shop, so we have Java applications.

Matt: Yes. Let's see, where is this going to be...

It's going to be under the documentation. (OK, super-fast internet.) In the documentation there's a section for libraries, and each one of those libraries points to, usually, the GitHub location where the source is. Most of them have really good documentation, some of them not so good, but most are really awesome, and for many of them there's also a blog post, on our site or on theirs, talking about how to use it.

Audience: Is there one recommendation for everything?

Matt: Not really, no; there isn't a single answer for everything.

Audience: OK, thank you.

Matt: Any other thoughts, questions? Again, use the mic. Cool. So it's five... wow, 5:22, I did pretty well on time. If there are no other questions, well, thank you so much for coming. Again, my name is Matt Williams. I can be reached, in super fine print, at mattw@datadoghq.com or on Twitter at @technovangelist, and thanks so much for