Introducing CallTracing(tm), based on RabbitMQ, Spring and Zipkin

Well, I probably got the lucky spot here, because it's just after lunch and there's a huge number of attendees for understanding call tracing, so thank you all for coming to SpringOne 2GX. A big shout-out to the Pivotal team as well as the SpringOne folks, especially Pieter Humphrey, for coordinating this event.

A little bit about myself: I worked at E*TRADE Financial for 16 years as a programmer and an architect. This is my third presentation, and my first solo presentation, at SpringOne. In 2013 I wrote a framework which exposes JAX-WS services behind AMQP using Spring Integration, and in 2014 we presented JAX-RS services API development behind AMQP using Spring Integration. Since then I'm unaffiliated with any institution, so this is the time when you can ask me all the questions and I can answer them. Let's get that out of the way.

Today we're going to talk about distributed tracing. The agenda items for me are the background: what it means, where it is important, where you would use distributed tracing. I'm going to go through a use case, a mock, hypothetical one, and it's going to deviate a bit from what you see in the title of the talk, essentially because I cannot talk about the details of the RabbitMQ setup at my former employer. So I will actually focus more on the topic that's my favorite, which is basically the core concepts which are involved in distributed tracing, and most importantly with Zipkin. There's going to be a demo, because by then you would have digested all your food and become sleepy, so I'll wake you up with a demo. Then we'll go through a scenario of, let's say, how minimal a tracing setup you can get away with, and then we'll go into the more sophisticated model of Zipkin tracing. I'll do a demo of that as well, after which we'll discuss production deployment, what it takes; I can give you suggestions for implementing this in your beautiful enterprise, in the fantastic workplace that you're working in. And then exceptions to this beautiful, sophisticated model as well, and some of the
lessons that I learned, which did not come without a price, so I hope you all gain from it.

Before I start, I'd like a show of hands: how many of you work in site reliability engineering? Sorry, okay, no one. All right, how many of you are architects? All right, how many are programmers? Okay, maybe an equal set. I was about to say, those are the folks who really do the work. So distributed tracing is for almost all parts of the organization, architects, site reliability engineers, programmers, because once you have the system in place you can glean a lot about system behavior as well as debugging. Also a show of hands before I go ranting about Zipkin: how many of you are familiar with Zipkin? How many of you have used it in your organization? Okay, so one or two people are familiar with Zipkin. That's good to know.

Just so we understand each other, let's start with the background. In the beginning we had clearly specified contracts and rules. There was a well-known topology and flow: web servers went to application servers, application servers called services, services called databases. There were pre-scheduled downtimes for the most

part. Stack traces were good enough, for the most part. And then what happened is everybody had to move to the newest fad, the newest language, right? And you want to do development really quickly. We all know that the real workhorses are sometimes even our mainframes, but there are not enough people to support that, so organizations quickly started hiring Java programmers, thank God for all of us. But the workhorses sometimes still remain: there are batch jobs that are still running on legacy systems. And you really want to move away from monoliths to microservices, but it comes with a cost.

In the microservices world, what you really have is a collection of software modules. They are developed by different teams, sometimes in completely different programming languages. They can span many machines; they are distributed, on completely different physical servers. Depending on the industry that you're working in, you could be working with the East Coast, the West Coast, Japan, wherever your data centers are. If you're on AWS or Heroku, you already know that this is true.

So what happens in such a world? Developers have checked in their code at 6 p.m. in
the evening. The lines of business are isolated; they have tested all that code; the servers are ready to go. Now site reliability engineers are monitoring the queues and the services and databases. They look at queue surges, they have database stats, they're chasing disk corruption that's happening, they need to rebalance the load, so they need to shut down and redeploy these services. And that happens to all of us: we tested our code, everything works fine, but site reliability engineering, it's their fault, of course it's their fault.

Then the maps: they want a map of all the services. We tell them, hey, I've written my JAX-RS service. JAX-RS services don't have a registry, so produce a WADL for these services; or if it is JAX-WS, produce a WSDL, some sort of a registry. But these are C services, I don't know how to produce a registry for them. They want it because those services are calling databases, and database connections are usually our bottleneck: however much you may scale horizontally or vertically, you also have to scale your databases.

So these are the problems of microservices, right? You need to figure out the weak links in a system; you also need to figure out the critical paths. And how do you debug the system? This is not your normal system where you have stack traces. In fact, I can't even go and look at the stack trace once I know where to look, because it's going over load balancers, right? You go from one node to another node, you go to a third node, you don't know where the request is landing; there are four copies of the service sitting behind a load balancer, and this is production, and you need to do it yesterday, it's burning.

And how do you determine request paths? I don't know how many of you actually do trading, but in a trading system, for example, when the market opens there are request paths that are more optimized, and there are request paths that are different when the market closes. Similarly for cash transfers: if it is a cash transfer it would be
a different path, and a scheduled one will be different again. There are outage and timing issues related to this. And I know that however much testing you have done, SIT and UAT and even performance and load testing, once that system goes to production it's a completely different beast. It behaves in a completely different way. Production systems have a way of falling apart just because of the various ways in which systems integrate with each other, how they talk to each other.

Fortunately for all of us, in April 2010 Google published a research paper called Dapper. "Dapper" is, I think, Dutch for "brave"; if there are any Dutch people here, they can confirm. And they

made a very pretty statement, as is their wont: modern Internet services are often implemented as complex, large-scale distributed systems; these applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span thousands of machines across multiple physical facilities. This is their opening statement, and this is way back in 2010, so they were, as usual, ahead of the curve. They published this one paper, very teasingly, and left the rest to us to figure out.

A lot of people said, okay, what are we going to do about this paper? We have to understand it. We read it, and in February 2015 Netflix actually said: oh, we have an in-house Dapper-like framework where we are able to layer the request tree onto our infrastructure and have a simple visualization for it. This is February 2015; they call it internally by some other name, but Salp is what they posted in their blog. Twitter gave a small tweet, so many characters, in 2012, but fortunately for all of us they also open-sourced their framework, called Zipkin, mostly Scala-based. And LinkedIn came out with their system, called inCapacity; that also has this graph that gives indispensable insight into their service performance, and it actually maps to their technology stack, so they can associate it with costs.

So today we have the call tracing talk. Before we set ourselves on this journey, a few caveats, caveat emptor, buyer beware: this is a SpringOne talk, so I will get into the details. Forget the slides, I'm not a slides man; in fact this is the first time I'm talking to PowerPoint slides. Usually I would be the command-line guy, that's what I did in the last two talks, but I have to kind of set the stage for this.

Let's say we are looking at a Gmail app, which is probably the best representation for all of us to talk about in a common language. This is a single-page app, right? Let's try to guess, hypothetically again,
what the services involved here are. Hypothetically, let's say there is a contact service, something that presents circles, audio links, video links, labels and tags, and so on. In order to broadcast your presence to multiple devices, it has to keep some sort of a session, and the application probably has to call many services in order to figure out whose account it is, who your friends are, which devices you're using and for how much time, and what order these requests should be sent in. This is actually an example where a single request can go across hundreds of machines and a thousand services in various different subsystems.

There's also another example that comes to mind. If you have a portfolio with a large number of positions, you may have to do a lot of calls to the backend, sometimes spanning over 6,000 service calls, most of them nested: a service calls another service, upstream services, downstream services. And if you make some modifications, they basically have no idea about their dependencies or who's dependent on them, so you don't know what's ingressing and what's egressing.

So let's try to lay a foundation for this. Let's try to understand the minimum that we can do for call tracing. We have a system, and let's say this is the flow: there is a node on which a request comes in, there's a node which the request goes to, and there are many other nodes beyond that; the red arrows are basically network hops. Let's see whether we can get away with a very simple scheme; a Heroku dashboard already gives you this. Let's generate a trace ID, and let's call it a globally unique identifier.
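The minimal scheme just described can be sketched in a few lines. This is an illustrative sketch, not Zipkin's actual implementation: the class and method names are mine, and the only ideas taken from the talk are a globally unique trace ID minted once at the edge, a fresh span ID per network hop, and both IDs stamped onto every log line so a log aggregator can find them later.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Minimal call-tracing sketch: one globally unique trace id per request,
 * a fresh span id per network hop, both attached to every log line.
 */
public class MinimalTracer {

    /** A random 64-bit id rendered as 16 hex characters (the width Zipkin uses). */
    public static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    /** A log line carrying the ids, a timestamp and the host, ready for aggregation. */
    public static String logLine(String host, String traceId, String spanId, String event) {
        return String.format("%d %s traceId=%s spanId=%s event=%s",
                System.currentTimeMillis(), host, traceId, spanId, event);
    }

    public static void main(String[] args) {
        String traceId = newId();   // minted once, at the edge of the system
        String hop1 = newId();      // one span id for the first RPC hop
        System.out.println(logLine("node1", traceId, hop1, "call server-b"));
        String hop2 = newId();      // a fresh span id for the next hop
        System.out.println(logLine("node2", traceId, hop2, "call server-c"));
    }
}
```

Grepping the aggregated logs for one `traceId` then gives you every hop that request touched, which is exactly the "minimum we can get away with" the talk starts from.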

Right, so let's try to get away with the minimum we can do. What is this going to buy us? We just log the trace ID, then use a log aggregator somewhere, and we basically know where our request failed. However, we want to be a little more sophisticated, so what we want to do is create this thing called a span ID. A span ID, unlike the trace ID, is not unique to a single request; a span ID is generated for every network hop. For any RPC call that you make, you generate an ID and you log it. What you can do with this is set a timestamp on the node on which the ID was generated and on the node on which the ID was received. You can also record other attributes there, for example what host this was generated on and what host received it, and obviously you have to transmit the ID. So just to review what we have learned: a trace ID is a globally unique, immutable ID that correlates to a request; a span ID is a unique ID for an RPC. So far so good; it seems very simple and trivial.

Okay, so this is the time when I tell you to wake up, because I've been hand-waving a lot. I'll go to the demo, at the peril of my terminal; let's hope I can get there. What I'm doing here, and I apologize for all the terminals that I'm using, but I'm happiest in this environment, is starting a server. Because it is easier to show without the distraction of Java, and I apologize to all the Java gurus here, I've been working in Java for a long time, but without the distraction of Java (the Spring folks have done a great job), I'm going to just start a server in Node. So this is server A, and this is server B. Let's say they have started, and I'm just going to call a client. So what's happening here is that the client is going to call server A, right? So, maybe we should just go back here. The client is calling server A, and
then server A calls server B, and then we're going to see a user experience of what traces can do for us. Okay, that's too early; let's make it smaller. Okay, so something happened, we don't really care right now, this is too sophisticated for us, let's just go here. I need to make it smaller. Well, just so I don't cheat, I'm going to take the trace ID from here, which we just generated, and see whether it produces anything. Oh, there it is. So as you see, right out of the box you have a span which basically shows you how long the request took, and another span, another RPC ID, that shows you how long the service worked on server B. This is really a toy example, but this is your Zipkin UI. In order to get here, though, we have to do some work, and there are no two ways around it. The instrumentation models may be different, but we have to do some work. So back to our slides for the moment.

Model number one, a very popular model, which is basically what we just went through: generate a trace ID, log it on node 1, and our goal is to stitch the call tree from the trace logs. That's our goal; let's see whether we can make it. We generate a trace ID here, generate a span ID on node 1, log it there, and transmit the trace ID and the span

ID to node 2. Whenever a transmission takes place, I log again, on node 2. Then node 2 generates another span ID, because it is calling another service after it's done its business logic, and finally I transmit that as well, and I'm done. Well, let's look at what we have in our logs: node 1 has trace ID and span ID 1; node 2 has trace ID, span ID 1 and span ID 2; node 3 has trace ID and span ID 2. It seems like with something this simple you can actually stitch this up, because we said each span ID comes with a timestamp, so we can actually order span ID 1 and span ID 2: something happens before something else, right? This works out very well for us.

Well, we are happy and comfortable until this happens to us. After the server receives a request, and this is the most common scenario, you have a bunch of services to call, all of them from the same node, so you generate span ID 1 and span ID 3 on node 1 and transmit all of that. You also have timestamps, maybe down to the microsecond or nanosecond, but it still requires work to stitch the request together and understand the call tree. We haven't reached the point where we can just grab it, or run some algorithm to figure it out, without doing some sort of reasoning about it.

This is known as black-box instrumentation. Why black-box? The pros are application-level transparency: every time a socket call or an RPC call happens across the network, I can intercept it, put a trace ID on it, and distinguish between all the sockets that are being used by that system. It's great for people who are boxing this up for customers and saying, I've given you everything you need for your call tracing. But it requires work, statistical reasoning; in fact, at the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2014), Facebook came out with The Mystery Machine. Well, it's a mystery machine for them, but
the reason is that they have to derive all the happens-before and happens-after semantics that your JVM normally takes care of for you. They had to derive the causal relationships: what happens before, what happens after. Can we do better than that? Maybe with a little bit of work we can.

Okay, so let's try to follow the leaders of this Dapper group. Let's try to get a uniform logging structure. Here's where the deep dive begins, so people who are really interested in instrumenting their systems will have to wake up here. It's a bit tedious, but it has its rewards. We already know our friends trace ID and span ID. "Span ID" as a name actually puts me off a little, but think of it like an RPC ID, that's all; that's the name used throughout Zipkin as well as in Dapper, so wherever you look you will see "span ID". Some of these slides may look tedious right now, but go back to them; I'm pretty sure most of you are going to look up Dapper after the talk, or right now, and they'll help you when you are trying to reason about your system.

So let's get a uniform logging structure. There is a trace ID that we generate, and then you have the span ID; we are already familiar with those, and they are not changing, but we are going to introduce a parent ID. It'll become clear what it is; I've got nice diagrams for that too. And four events. Where did this number four come from? Well, we'll soon see whether it is good enough for us or not. The four events are going to be: server received and server sent, simple enough, when a server receives a request and when a server sends the response back; and client sent and client received. So

when a client sends a request, and when the client receives the response. The problem is, servers and clients are not that easily separated; the lines get very blurred, right? When a server receives a request and then makes a call out to another service, after doing some work, then that call is a client sent. Just keep that in mind, because there was a lot of confusion in the beginning with a team that I worked on. And then the parent ID, which is optional for the root span: the root span has no parent ID, and when a request is transmitted, the span ID of the transmitting request becomes the parent ID for the new request. I'm not going to say anything more; let's go to the diagrams.

So let's see how the structure works. I got a request; I timestamp an event as server received; I generate a trace ID (see, I'm going to hand-wave here about how the trace ID is generated, we'll get into that); I generate a span ID, which doesn't have to be completely distinct from the trace ID for the first request. And since this is a request that, let's suppose, did not carry any headers, the parent ID is not available. This is the first time someone is seeing a diagram like this, so this is the place you saw it first; even if you go to the Twitter talk on Zipkin at SF Scala, they don't make it clear.

What happens after some time is that the contact server needs to call another service, and that needs to go over an RPC. The server that is working on the flow is doing some business logic, and now it needs something to fulfill the request, so it goes over the network, at which point we step in again, intercepting the call. We generate an event called client sent and use the same trace ID that we received. How do we say this is the same request? In the Java world, usually, it's the same thread, unless you are in Spring
Integration, where you then have to think about thread boundaries, and we'll come to those things. So let's say this is the same thread, to keep our map simple. The thread needs to make another request, so on client send I generate a span ID that is different from the span ID I generated earlier. And see what happened: there was a span ID 1a here, and I got a parent ID 1a. This notes that the parent of this call is that span ID 1a, and you'll see why this is useful. Moving on, we transmit this information across the network, and we'll see what information is transmitted; mostly it's very few bytes: the trace ID, the span ID and the parent ID. On the other side we again annotate an event: we got a message, server received; the trace ID and span ID are the same, the parent ID is the same, and you'll see why this makes sense. After the server on the other side got these messages, it did its business logic, right, and it's sending the response back. At this point we know the context of the server, so the server sends the response back: same trace ID, same span ID, same parent ID. And then the server that was working, the poor guy that was waiting for the call to come back, gets a client received message; because it's part of the same thread context, it says, okay, I've noted this time. See how cleverly I made everything blue?

Okay, so the point is, this is a very important concept: this is called a span. And just for completion, we have to send the response back to the original client, so we did a server sent. But this is the most important concept: for one span, the four events are client sent, server received, server sent and client received, and

everything else, the trace ID, the span ID and the parent ID, remains the same. You might think this is very simple, but just let it sink in for a moment; it's not so simple, because we're going to go really fast now. This is where it all comes together. To review: we started with a trace ID and a span ID, but a span ID now identifies RPC attributes, has a request and a response, and we define not just a span ID but a span. If you think about it, a trace starts at the root span, and a span can have multiple child spans; the root span consists of many spans, and that basically correlates one single request. It's going to go very fast from here, so this is the last time to ask questions about spans or trace IDs. Yes? Multiple requests sent in one shot, is it?

Okay, thank you. So the question is: will this model work for multiple requests, where you are sending a burst mode, batch mode of requests? It depends; the answer obviously is always "it depends", but I'll tell you. I think I know where you are getting to, because I've seen this in the forums: essentially, if you want all of that to be one single span, that's not going to happen; you have to do extra work for it. If your intention is to trace them separately, that's not a problem, that shouldn't be a problem. A use case for this would be, for example, multiple pages. Forget about the batch case; you have multiple pages, for example in a quick transfer, or maybe opening an account, so you have to go through multiple pages. Or even a single-page app: in order to render a particular element you may have to call many services, and you actually need them to be grouped so that you can search by the group later. That requires a little more work, because each service request is going to be separated out, which means each one of them is going to generate its own trace IDs, right? That's called grouping; I want more intelligence there, and we'll get into some of those use cases where the model may not be the best fit. So, any other questions?
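The stitching goal stated earlier, rebuilding the call tree from logged records, falls out of this model almost mechanically once every span carries a parent ID. The sketch below is illustrative (the `Span` record and method names are mine, not Zipkin's API): the root span is the one with no parent, and children are grouped under their parent's span ID.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of stitching a call tree out of logged span records, assuming
 * each record carries (traceId, spanId, parentId) as in the Dapper /
 * Zipkin model described above.
 */
public class SpanTree {

    /** One logged span; parentId is null for the root span. */
    public record Span(String traceId, String spanId, String parentId) {}

    /** The root span of a trace is the one without a parent ID. */
    public static String root(List<Span> spans) {
        return spans.stream()
                .filter(s -> s.parentId() == null)
                .map(Span::spanId)
                .findFirst().orElse(null);
    }

    /** Groups child span ids under their parent span id, i.e. the call tree edges. */
    public static Map<String, List<String>> stitch(List<Span> spans) {
        Map<String, List<String>> children = new HashMap<>();
        for (Span s : spans) {
            if (s.parentId() != null) {
                children.computeIfAbsent(s.parentId(), k -> new ArrayList<>())
                        .add(s.spanId());
            }
        }
        return children;
    }
}
```

Note that no timestamp reasoning is needed here, which is exactly the advantage over the black-box model: causality is recorded explicitly in the parent ID rather than inferred statistically.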
Now's the time. "You're going really fast"? Up to this point, okay? All right, you asked for it: boom, boom, boom, boom, done. Okay, I'll go a little slower, but it's really just another use case. There was a request that came in, and now we are all masters at this, right? So a request came in, we got a server received, generated on node 1, then a client send, then a server received. I guess I should ask you to tell me what's going to be the next one, because that's the only way to learn this stuff. Does this look right? How many people think this is right? None of you? Okay, two people, three people. Yeah, so the confusion is always: why is 1c being generated there, and what is this 1b? That's essentially the parent ID carried over from the previous server received. And does this hop from node 2 to node 3 look right? Yeah. It's very simple once you get it. Commit this diagram to memory and you'll never have problems with Zipkin. And remember, Zipkin is going to be introduced in Cloud Foundry, so probably in 2016 it's going to be a first-class citizen of Spring, so you'd really have to work with this model.
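What actually crosses the wire on each of these hops is just the few bytes of IDs mentioned earlier. The header names below follow Zipkin's B3 propagation convention (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`); the helper class itself is a sketch of mine, not Zipkin's API. The key move for a nested call is: keep the trace ID, mint a fresh span ID, and demote the current span ID to the parent ID.

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Sketch of propagating trace context on an outgoing (client-send) call,
 * using Zipkin's B3 header names.
 */
public class B3Propagation {

    /** A random 64-bit span id as 16 hex characters. */
    public static String newSpanId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    /**
     * Headers for a nested outgoing call: same trace id, a fresh span id,
     * and the caller's span id becomes the parent id.
     */
    public static Map<String, String> childHeaders(String traceId, String currentSpanId) {
        return Map.of(
                "X-B3-TraceId", traceId,
                "X-B3-SpanId", newSpanId(),
                "X-B3-ParentSpanId", currentSpanId);
    }
}
```

Calling `childHeaders` twice from the same server models the sibling-calls diagram: two different span IDs, the same parent, so the tree stitches cleanly later.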

The reason why you need to understand this model is not just because you are instrumenting systems; if you want to extract data out of the system, you'll see that there are rewards to understanding it too. And then we're done with this. What happened here is that it's a nested request; the earlier one we saw was one request going to a server and the server just responding back, while this one is a chained request.

So now, can we do this? What happened here is again the use case we got stuck with before: happens-before, happens-after, all the semantics that the JVM takes care of for us when we write synchronized or volatile, but on a distributed system, right? You've got things that happen at the same time. Well, what about the red lines, do they look right? Who says they look right? Okay, one, be brave, come on, it's okay, it's Dapper. All right, many of you think it is right. Who says it is wrong? Okay, one, two, three, brave, brave. Okay, I'm sorry, I was just setting you up: this is right. The reason this is hard to judge when you haven't been introduced to the system is that this is the only way we distinguish ourselves from the previous model. This is the point of generating a parent ID. If you did not have a parent ID, then how would you correlate this to the first service request that came in? You'd have to rely on something that correlates it to a trace ID and also to a timestamp. But this allows us to distinguish cleanly, because what we did is generate a new span ID: the CS 1c, where 1c is a new span ID. You get a new span ID for making a new request, and there is no confusion. Both of these requests, that is CS 1b 1a and CS 1c 1a, belong to the same parent request. So any client can add hundreds of these nested calls here;
each one of them is going to generate its own specific span ID, but they're all going to keep the same parent, and on the other side, the server side, they still know where these requests are coming from, and you can stitch them up later. Any questions here? Is this clear? Yes? Okay, I'll repeat the question: would you not want to generate unique span IDs on node 1 and node 2? Let's go back to my slide; I told you it was going to be zero to a hundred really quick. What do you see here? A span has the same trace ID, same span ID and same parent ID across all four events. The events that occurred are client sent, server received, server sent, client received. This is a span: a request goes from the client to a service, and a response comes back; this whole RPC. For that entire RPC, that entire bridge, the IDs remain the same. The distinction is made from the client side, always: if there is a new request going out from the client side, then that will have a new span ID. A span ID is for a span; a trace ID is for the entire request. If I have multiple clients being called, I need to generate new span IDs for them, but each has to remain the same on both sides of its call. Any other questions? It might get

really hairy, but okay, you can go back to the slides, they will always be there. So, annotation-based instrumentation: that's what we just did. This gives you end-to-end per-request visibility, and it can also show you the services that you are deploying, both upstream and downstream. You can calculate critical paths, and you can calculate long-tail latencies, though you may not be able to reason about why a long-tail latency happened: you'll be able to see it, but not explain it. And the cons are obvious: it requires instrumentation of your code, of your program, and the closer that instrumentation sits to your framework or the libraries you are using, the better it will be.

So we're going to move on from this to trace collectors. One of the goals of any instrumentation, of any call tracing, has to be close-to-negligible impact on the system you're tracing; there is always going to be some impact. Heisenberg once again: the observer affects the observed. But it's worse than that if you are doing in-band tracing. There have been a number of USENIX-symposium systems, like Magpie, Pinpoint and X-Trace, which tried to do this, but they had to pass a lot of data over the network, and in certain cases you hit what's called the iceberg query problem. I make a request, select star from something; that's a very simple request, but I can get a huge payload back. Or I make a select star from with a lot of where conditions, which gives me a small response back, and then my tracing payload itself is bigger than the response that is coming back. That's horrible, right? To mitigate all that, you want to do all the collection out of band; maybe that's obvious to many people. And the related concept here is that some systems even
continue collecting data after the response is returned. Some of the collectors that Zipkin already provides out of the box are Kafka, Flume and of course Scribe, which is what they use internally at Twitter; syslog-ng requires customization, you have to write a connector to act as a collector. Is everyone with me on what I just said? What is a collector? It's basically hooked to whatever logging system you're going to use on the nodes, and it needs to work in near real time; an offline pipeline is not much use when you are doing production support, so you have to set up this near-real-time infrastructure in your system.

So what's the overhead? You need to actually tune your system; you need to collect the empirical data for your own setup. What's funny, and maybe I'll put Jeff on the spot, but he is no longer working for Twitter: when I was working on this, I asked Jeff Smick, the person who was taking care of Zipkin, what the infrastructure requirements were, how much capacity I should have, what log rotation I should use, what network socket, one-gig or ten-gig NIC cards, and he just wrote back: "I don't know." So in my case I had to figure it out, and if you're in that spot you may want to think about all of that. And the most important thing is the ability to switch off tracing altogether. So a production system will look something like this in most use cases.
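The two operational points just made, keeping the overhead negligible and being able to switch tracing off entirely, are usually handled with head-based sampling: decide once at the root of the trace and propagate the decision with the IDs. The class below is a sketch of mine under those assumptions, not Zipkin's sampler API; the Dapper paper reports sampling very aggressively at Google's scale, and the right rate for your system is an empirical knob.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Sketch of a head-based sampler: the decision is made once per root
 * request and travels with the trace, and the rate can be turned down
 * to zero at runtime, the "kill switch" a production deployment needs.
 */
public class Sampler {

    private volatile double rate; // 0.0 means tracing is switched off entirely

    public Sampler(double rate) {
        this.rate = rate;
    }

    /** Called once at the root of a trace; downstream hops inherit the result. */
    public boolean sampleRoot() {
        return rate > 0.0 && ThreadLocalRandom.current().nextDouble() < rate;
    }

    /** Runtime knob: tune the rate, or set it to 0.0 to stop tracing without a redeploy. */
    public void setRate(double rate) {
        this.rate = rate;
    }
}
```

In Zipkin's B3 convention the inherited decision is carried in an `X-B3-Sampled` header, so downstream services never re-roll the dice mid-trace.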

You can look at the Zipkin page; it has more or less the same kind of diagram. You will have certain syslog-ng appenders or Flume appenders, whatever transport you use, you have to have sinks, and there's a lot going on here. Cassandra is just a choice; Zipkin comes out of the box with HBase and Redis integration, and, yeah, I think Cassandra and SQLite; they call that one Anorm DB, from the Play Framework, which is the interface to SQLite. So don't get scared of Zipkin. It is written entirely in Scala, and it is hairy because it's part of the Finagle RPC system that's used within Twitter; in fact the idea is that if you are going to use Finagle, you get distributed tracing out of the box, and if you're going to instrument things yourself, you need to learn how Finagle instruments them. But it's not too bad, because there are libraries out there; for the Java world, for example, Kristof Adriaensen, if I get his name right, has made a fantastic effort. He calls it Brave, because "dapper" is basically "brave" translated from Dutch, or that's what he tells me.

Okay, maybe now a lot of people are sleeping again, so it's time for a demo. What I'm going to do is the following: I'm going to generate traces, fixed traces; I cannot generate real traces, that's not allowed for me. Then I'm going to cover a bit of the Zipkin UX, what it gives you once you've done all this work. I'm not sure whether I want to show you the aggregate services map, but we'll see how it goes; maybe after a little while I can show you what the reward is for doing all this collection. And, in fact, you can run all of this on your Mac, so let's get started.

Okay, so you see those three processes running, and how it looks. Those are the three systems that are running: Zipkin runs off of three systems, a collector, a query interface and a web UI. When we started this, I started each
of them individually, and what happened when I did the demo was: there's a Scribe collector that talked to the collector on the Zipkin system here, which receives a message; then there is a query system that I queried with — that's the trace UUID that you saw — and the trace interface, which is the web UI, basically queries the Cassandra database. So there's Cassandra running here in the background — funny thing, Cassandra, but we'll come to it. Good stuff. So, what did I say I'm going to do? Generate traces. There's a very simple API for this where you can generate all the traces with a simple example. Okay, too many things here — sorry, I should have cleaned it up — but that's the line that's going to run. Essentially I've already generated the traces; if I wanted to generate more I could run this line, for example, which says just generate sample traces, and what it shows is the Cassandra destination — it's also called Cassie within Zipkin and within Twitter. So that's the destination into which this is going to be generated. Now before we do all this — unfortunately I stopped everything, but let me just start this, because this is going to be extremely overwhelming even for me right now. Let's just start once more and I'll tell you

why. Okay, so here I just hand-waved, and I want that running because these are real-time queries that are happening in the system. So when I click on this — can everybody see in the back as well? — you see these annotations that we got familiar with: client send, then server receive, server send, and client receive. And there are some request headers; since it is JavaScript and I didn't write a toString, it wouldn't show what the request headers were. But as it is you can see it's a different port that it is going to: 9098. Those are some of the key-value pairs that you get out of the box with Zipkin. Then on the second one you see 9098 and 9099 and the URI. You can put a lot more annotations here if you modify the service, but this is our toy example. It also tells us how much time server B took — 13.49-something milliseconds — and then how much time server A took: 30 milliseconds; it's a very slow service. Okay, let's go back. I'm going to shut this down this time, and this is where the real nested calls are going to be simulated. So let's fly with this — I'm praying to all the gods, hopefully this works. All right, those arrows don't mean anything. Okay, it's generated the traces — I'm giving you a live commentary; I'm not sure whether it is done. Let's take a look. I have a backup if this doesn't work, but okay — there are some fixture services, you see, which were produced here: you see our server A and server B. These fixture services were produced here on 9/16/2015 — I'm still on California time, so it shows 10:40. But let's just go see what it produced: 86 spans were produced by this service, and what you get is a view of the spans. You expand all this and it goes on and on and on — we kind of fulfilled our dream, or nightmare, as you would. On the left-hand side of this UI you see the services; you see the root service — it basically is the root span. You don't
see trace IDs here, except right on top — that's the trace ID; that's what you call the API with, that's your trace ID. There is a story to tell about it, but I'd be getting ahead of myself if I told it right now. And you see some white spots in between; those are the annotations that a programmer or a developer might think are useful for showing insights. So let's look at this, for example: you have some annotations and values — these are all made up — and then the host on which the server receive and server send were done. This is the first root span, where the request comes in, and you have all of these children after that. Okay, what was the next item on the agenda? I forget. We did aspects of the Zipkin UI; okay, aggregates of services — well, let's hold off for a moment; maybe we're not ready for it yet. So let's go back to this.

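The timings the UI showed — 13-odd milliseconds in server B, 30 milliseconds in server A — fall straight out of the four core timestamp annotations. A minimal sketch of that arithmetic, assuming Zipkin's convention of `cs`/`sr`/`ss`/`cr` annotation names and epoch-microsecond timestamps (the numbers below are made up for illustration):

```java
import java.util.Map;

// Deriving the durations the Zipkin UI shows from the four core
// timestamp annotations of one RPC span:
//   cs = client send, sr = server receive,
//   ss = server send,  cr = client receive (all in epoch micros).
public class SpanDurations {
    // Time spent inside the service itself.
    public static long serverTimeMicros(Map<String, Long> a) {
        return a.get("ss") - a.get("sr");
    }
    // Total time as seen by the caller.
    public static long clientTimeMicros(Map<String, Long> a) {
        return a.get("cr") - a.get("cs");
    }
    // Whatever the client saw that the server can't account for
    // (network, serialization, queueing).
    public static long networkTimeMicros(Map<String, Long> a) {
        return clientTimeMicros(a) - serverTimeMicros(a);
    }
    public static void main(String[] args) {
        Map<String, Long> a = Map.of("cs", 1000L, "sr", 1400L,
                                     "ss", 2400L, "cr", 3000L);
        System.out.println("server: " + serverTimeMicros(a) + "us");  // server: 1000us
        System.out.println("client: " + clientTimeMicros(a) + "us");  // client: 2000us
    }
}
```

Nothing more than subtraction — but it is why the two timestamped annotations on each side of a call are mandatory.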
I'm on the wrong slide — this one. Okay, so you saw that if you make some effort, you get some level of transparency into your system. How are we doing with time? Okay, we have lots. So let's go through the anatomy of a trace — this may be useful for some people, and I'll go through it really quickly. A trace is basically a root span that relates many spans together. A span is a set of annotations; an annotation records the time of an event, and in custom annotations you can also carry all kinds of useful values that are important for your system. Then there is sampling, which is very enigmatic because it just says "boolean" here. Sampling is used to control the overhead. Usually it will be managed by an adaptive-sampling kind of system — say, one coordinated through ZooKeeper — that watches the overall volume: if there is a huge volume of traffic on your service node, you can sample much less, because the call patterns that arise will be more or less the same over a period of time, so you can get away with sampling one call in thousands. But if it was just ten calls a day, then you can sample every one of those requests — there is no overhead for you. So there can be fixed sampling, there can be adaptive sampling, and you can actually run the whole system with debug on, which means that besides sampling you also get insight into the Zipkin system itself. The annotations — we are already familiar with these; we're actually masters of them, or should be. We have server receive and server send; these are the timestamp annotations, but other annotations are up to the developer and the programmer. This is what you would see if you opened a Thrift file — Thrift is basically the IDL that is used within Zipkin as well as within Twitter. You would see a timestamp, which is a 64-bit integer, and you would see a value: what happened at that timestamp. We know the events that we need to
record; they are mandatory: server receive, server send, client send, client receive. You see endpoints — the endpoint itself is a structure that may or may not fit your organization: IP, port, and a service name; again, that model may have to change when you use the system. A span has a trace ID associated with it, a name, and so on — that's the structure of the span; I'll just go over it quickly, and you can read all about it on the Zipkin site. Most importantly, what you communicate over the network are these protocol headers. So what is B3? It actually stands for BigBrotherBird — that's basically something they use internally at Twitter. The IDs are kind of random and good enough that they shouldn't overlap with anything in your organization, and that's important, so that you don't collide with any headers you are transmitting that actually mean something for you. So, the instrumentation points — maybe this is what we were after when we came to this talk. The instrumentation points: each framework is different — unfortunately, I have to state this — and in Spring there are events and proxies and interceptors. For example, say you were writing a simple Spring MVC application: you have the Spring DispatcherServlet, and you could write interceptors for it. If you were using RESTEasy, there's already something out there developed by Kristof, so you can use that at least as a starting point. Netty — Netty is the hairy part, but you can still use all the good stuff that you learned in Enterprise Integration Patterns to overcome Netty.

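The Thrift structures just described — a 64-bit timestamp plus a value, an endpoint, and a span tying it all together under a trace ID — can be sketched roughly in Java. Field names and types here are simplified from what the real zipkin-core `.thrift` files declare; treat this as an illustration of the shape, not the actual IDL:

```java
import java.util.ArrayList;
import java.util.List;

public class TraceModel {
    // Where an event happened: host, port, and the flat service name
    // that, as the talk warns, may not fit every organization.
    public record Endpoint(int ipv4, int port, String serviceName) {}

    // An annotation records *when* something happened and *what* it was:
    // a 64-bit epoch-micros timestamp plus a value such as "sr" or "ss".
    public record Annotation(long timestampMicros, String value, Endpoint host) {}

    // A span ties annotations together under a trace ID. The root span's
    // id equals the trace id; child spans carry a parentId.
    public static class Span {
        public final long traceId;
        public final long id;
        public final Long parentId;      // null for the root span
        public final String name;
        public final boolean debug;      // collect regardless of sampling
        public final List<Annotation> annotations = new ArrayList<>();

        public Span(long traceId, long id, Long parentId, String name, boolean debug) {
            this.traceId = traceId;
            this.id = id;
            this.parentId = parentId;
            this.name = name;
            this.debug = debug;
        }
        public boolean isRoot() { return parentId == null; }
    }

    public static void main(String[] args) {
        Endpoint here = new Endpoint(0x7f000001, 9098, "serviceB");
        Span root = new Span(42L, 42L, null, "get /accounts", false);
        root.annotations.add(new Annotation(1_000_000L, "sr", here));
        root.annotations.add(new Annotation(1_030_000L, "ss", here));
        System.out.println(root.isRoot() + " " + root.annotations.size()); // true 2
    }
}
```

The `debug` flag corresponds to the run-with-debug mode mentioned earlier: it forces a span to be collected even when sampling would normally drop it.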
You can overcome Netty by sending your RPC headers across in the message headers, and maybe by using an event bus. How many people have used Google EventBus? Okay, great. That's a publish–subscribe framework that lets you write an annotation and get messages back, so maybe that might work. Now, some of the loggers that you may want to use are Logback and Log4j. I can get into the nitty-gritty there; there's really some simple stuff you can do with Logback which I don't think you can do with Log4j. One thing: it has to be asynchronous — any sort of I/O operation that you do, try to make it as asynchronous as possible. Logback has if-else conditions that you can write within your log configuration, and those are really useful. There's something known as a SiftingAppender there; you can modify it a little bit, tweak it a little bit, to make it work like this: let's say you are releasing a new feature, or releasing to a customer — friends-and-family stuff — and you want that particular customer or that new feature to be traced. You can actually embed some sort of business logic there, not by writing a new appender, but with just the Logback configuration. Also, I think Log4j 2 went towards the Disruptor, but it also disrupts your APIs — as far as I know, you need to change your APIs. Okay, too much information there; couldn't help it. Spring Integration — I would love to show you all the wiring here, but — how many people use Spring Integration? I'll just go over the slide. Okay, great, so everybody knows about AbstractReplyProducingMessageHandler, yes? That's basically your service activator, in which you have a callback called handleRequestMessage. You can write an in-interceptor there, let it proceed to the actual message handler that needs to handle the message, and then write an out-interceptor. For doing what? Implementing a library. For server receive and
server send, you want to actually provide some sort of an API, and I suggest — again, depending on how close you are to the library — I suggest you have an API somewhat like this. Sorry to show you Scala here, but it fit well on the slide; that's the only reason I'm using it as the API. Essentially it is: a preProcess taking the service name, a map of headers, and a map of annotations — and for postProcess you don't need anything, as long as you know your thread context is the same. The same way for clients: on the client side you'll have something called preClientExecution. Server receive and server send have to be done with probably a filter or a DispatcherServlet kind of entry point if you are in the servlet model — or, if you are in the Spring Integration model we already looked at, maybe a sort of modified service activator that you can write to handle your messages. Client send and client receive are done when you are actually making an RPC call out, so if you have the luxury of making your own RPC libraries, then this is your instrumentation point. Remember, all of call tracing should be totally transparent to the business-logic developer as far as possible, except in a couple of places. For recording traces you want an API as simple as just "record a trace" — there is no span ID, all that stuff is removed; it's cleaned up and managed by your system. When you initialize your system — this is what goes into your static parameters — you want to initialize it, in a static block, with all the collectors that you could possibly have, and then configure them, maybe from outside. Maybe one of the collectors is Log4j; you just put it in the static block, and it is going to be used throughout.

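That pre/post API might look like the following sketch in Java. The names (`ServerTracer`, `preProcess`, the `X-B3-TraceId` header) and the thread-local bookkeeping are my assumptions in the spirit of the talk, not Zipkin's or Brave's real API:

```java
import java.util.Map;
import java.util.UUID;

// Minimal server-side tracing API: the business developer never sees
// span IDs; pre/post hooks manage everything through thread-local state.
public class ServerTracer {
    public record TraceState(long traceId, long spanId) {}
    private static final ThreadLocal<TraceState> CURRENT = new ThreadLocal<>();

    // Entry point (servlet filter / DispatcherServlet / service activator):
    // join the incoming trace if the headers carry one, otherwise start one.
    public static void preProcess(String service, Map<String, String> headers) {
        String incoming = headers.get("X-B3-TraceId");
        long traceId = (incoming != null)
                ? Long.parseLong(incoming, 16)                // join existing trace
                : UUID.randomUUID().getMostSignificantBits(); // not Random.nextLong()
        CURRENT.set(new TraceState(traceId, traceId));        // root span id == trace id
        record(service, "sr");                                // server receive
    }

    // Exit point: attach whatever extra annotations are useful (host,
    // session id, JVM id, thread id), then mark server send and clean up.
    public static void postProcess(String service, Map<String, String> annotations) {
        annotations.forEach((k, v) -> record(service, k + "=" + v));
        record(service, "ss");                                // server send
        CURRENT.remove();
    }

    public static TraceState current() { return CURRENT.get(); }

    private static void record(String service, String value) {
        // A real implementation hands this to a collector; here we print.
        System.out.println(service + " [" + CURRENT.get().traceId() + "] " + value);
    }
}
```

A client-side twin (`preClientExecution`/`postClientExecution` recording `cs`/`cr`) would be symmetrical, as described below.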
to be it’s going to be used throughout you can use sprays filters different types of stress filters for example in in the node example that I was running I used to trace filters one that outputs to the console which I call the debug.trace tracer and the other one that goes to the zipkin back end and then yeah you have to initialize 22 tracers as well for symmetry server tracer and client racer the reason is that your clients maybe call less often your server may be called more often you want to have a way to control each one of them separately the pre process API is really simple if you’re part of a trace set the state from the current recetas generate a new span if you’re not part of a trace then you generate a unique trace ID set the span ID as the trace ID and then record a server received annotation post-process API you have to add annotations which you think are useful for example hosts session ID JVM ID threadid and then you annotate the server sent and those operations are symmetrical in pre client execution and post client execution ok need a breeder ok so the trace ID dinner this is taken right out of the Quran zipkin ok the salt is used to prevent race ID collisions between machines by giving each system a random salts less likely to process will sample the same subset of trace IDs fantastic what they use is new random dot next long ok so the problem was among 15 million traces right we found collisions in the place that I was previously employed ok so 15 million traces had collisions that means random next long was not working for us ok so then as usual as every one of us does you go to stack overflow we go to forums and we look for an implementation of this library some poor soul must have done a UI degeneration of the trace IDs and when I look there was a lot of discussion in the forum but nobody had done it so it fell to me to do it and there is a pull request for it to Zipkin and this is the this is the URL so it was important to generate uid based 
traces and that’s what you saw in the in the demo as well so let’s go back to the demo here this time it’s all small very nice so if we go back to the demo this trace ID that you see here that’s a UID okay both and then let’s let’s see how yeah that’s the one so both in the in the node.js example as well as in the scala example the trace IDs that are generated are all youyou IDs okay and I think now it is possibly time because you have suffered enough to show you the aggregations and this is what you get if you map your services which is nothing right you can’t see anything so this is your universe and let’s let’s try to make it really really like how big is it okay that’s how big the universes but so what I mean what is this what are you telling me so the aggregations that you get is basically showing you the services that are getting called and mind you these were real services this is using dagger ad 3gs and and right now i’m using scalding for generating these traces this these aggregations making it into spark so that i can do real-time aggregations also the rendering of this UI is done on the on the site so many of this many of these need to go in the back end pointers point is nothing okay here so you have all the services that are used for this particular point like ultrices i don’t even know how to spell it i won’t pronounce it used by all these other services on the left hand

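On the UUID-based trace IDs from the demo: a plausible explanation for the collisions is that `java.util.Random` holds only 48 bits of seed state, so `nextLong()` is far less random than the 64-bit return type suggests, and repeats show up much earlier than a true 64-bit birthday bound would predict. A random UUID gives 122 random bits, which makes collisions negligible. A sketch, with the hex rendering being my assumption about the wire format:

```java
import java.util.UUID;

// UUID-backed trace IDs, as an alternative to new Random().nextLong().
public class TraceIds {
    // 128-bit random trace id rendered as a 32-character hex string.
    public static String newUuidTraceId() {
        UUID u = UUID.randomUUID();
        // %016x prints each long as 16 zero-padded, unsigned hex digits.
        return String.format("%016x%016x",
                u.getMostSignificantBits(), u.getLeastSignificantBits());
    }
    public static void main(String[] args) {
        System.out.println(newUuidTraceId()); // 32 hex characters, e.g. "3f2a..."
    }
}
```

Note that widening the ID to 128 bits also means the storage schema and the headers carrying it have to change, which is why this ended up as a pull request against Zipkin rather than a drop-in setting.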
On the right-hand side are all the other services it uses. There are upstream and downstream dependencies that you can see. Clicking on any one of these — which I will venture to do — you should be able to see the number of calls made, the mean, the distribution, variance, standard deviation, skewness, kurtosis — all the nice things that statistics gives you. This uses a library called Algebird, so monads and monoids and all those things come into play here, but you don't have to know that; the APIs are very simple, and you get to do all this stuff and keep yourself entertained. So that's about the aggregations you can get; this is the service-to-service mapping, the relationships between services. I apologize that I'm not showing you a real mapping, but the pictures are enough here. In order to do this, obviously, you instrument — like all of Zipkin's framework does. It was very embedded: zipkin-util, Finagle, the Ostrich server; plus UUID support, for now for SQLite as well as Cassandra. There are some useful Zipkin APIs out of the box here. You can see these traces — but let me fix this API call; I have to remove this plural. Okay, so that's the JSON. Essentially what you see in the UI is basically this, rendered, and the different views that you see are also this, rendered; you can actually watch the network calls that are made before it is drawn. A lot of that work needs to move to the back end; anybody who's interested in collaborating, I'll be happy to work with her or him. But this is basically what the trace IDs and the span IDs look like. Then there are a bunch of other things you can do: if you are shy, like me, of understanding Cassandra — and that does take a long time to understand — maybe you want to collect these traces into something more like Elasticsearch. Also, just an FYI, Twitter moved away from Cassandra; they
have their own database called Manhattan — internally they made their own database. So you can get a lot of production coverage as long as those RPC libraries are within your control. Okay — I think we've already covered all this — so, non-standard use cases. How many of you have routers and forwarders in your organization? There's a call that comes in — it's an EIP pattern as well, a header router or a payload router — and it's distributed across networks. It's a very common pattern, and guess what: this beautiful diagram breaks. It cannot take care of this. The problem is that at a router you get a server receive, and then the router just forwards the message — but the reply is sent back from the ultimate service directly to the client. So where's my span? I don't have a span. There are annotations 1a and 1b on node 1, and on node 3 we have 1b and 1c, but on the router there is nothing corresponding to a server send or a client receive — what's missing are the poor guys in red. So this is the nitty-gritty of this one.

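Before looking at workarounds: the least a non-instrumented router or forwarder can do is pass the B3 headers through unchanged, so the downstream service still joins the same trace even though the router itself records nothing. The header names below are the standard B3 propagation headers; the forwarding helper itself is just a sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Copy the B3 trace-propagation headers from an inbound message onto
// the outbound one, dropping everything else. A router that does only
// this keeps the trace intact even though it contributes no span.
public class B3Forwarder {
    private static final String[] B3 = {
        "X-B3-TraceId", "X-B3-SpanId", "X-B3-ParentSpanId", "X-B3-Sampled"
    };

    public static Map<String, String> forwardHeaders(Map<String, String> in) {
        Map<String, String> out = new HashMap<>();
        for (String h : B3) {
            if (in.containsKey(h)) out.put(h, in.get(h));
        }
        return out;
    }
}
```

This doesn't fix the missing router span — the workaround for that is described next — but it prevents the worse failure mode where the downstream leg starts a brand-new, disconnected trace.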
least that’s one way that you can do is that you can just annotate this request as an entry point if you have to deal with the zipkin core libraries it is tedious it’s extremely tedious the have to be a hundred other models that need to be regenerated but it’s not impossible also you have to come up with a good model that supports this okay but you can simply annotate it and then write a UI some kind of a key value that says okay show me what are the entry exit points and maybe write something like a forward but if you just use out of the box you will have the spin where you won’t know you won’t know what how much time was spent in the router however one thing to notice you will always know the duration that the client made the colon because if you see CS and C are they have a time stamp that is associated these are time stabbed notations so you know how long this called on node1 right and then how long the call took on node 3 all that is missing here is a non-participating node in between so none of the the timing if you if you were doing timing yourself that not be missing the other I don’t know whether I covered it in the first thing yeah ok so in non-standard use cases when you use frameworks like reactive Nettie or a synchronous callbacks so request is sent on a different thread response is received on a different thread so you have to use messaging pattern and if anybody is interested in I can show you the code how to do this later after the talk or I am going to just commit it to github I actually just started this morning which essentially allows you to span to have like a neti messaging tracing system to quickly to lessons learnt but the last one which was covered in the in the previous slide is what happens if i have a tracing system and then there are non participatory notes in between let’s say you have memcache tea or you have some Oracle database and you’re not actually doing any tracing in that so there is there is this thing that is doing all this 
beautiful server-receive, server-send, and so on, and then there is the router, and then there is memcached — some caching layer in between — and you are not able to trace or instrument them. What happens — does the system break? It does not, as long as you're okay with the fact that you will not know what happened inside the node where the memcached call or the Oracle call was made; but you will know how long it took to call that service, so the duration is well known on the nodes making the non-participating calls. Also: to trace or not to trace, that is the question. The problem is, it was very simplistic when I showed you — I can just move over to this diagram — okay, the server received a request, I generate a trace ID. But how do I know whether to generate a trace ID? The only marker for me is whether the headers contain trace IDs. So if you were, say, on the HTTP transport using HTTP headers, you get certain headers — and maybe for some reason the headers are missing, or you went through a non-participating node, and there was a database trigger of some sort that then called another service. At that point the whole idea of a trace is lost: a trace ID was generated, but I don't have a request view, because there was a non-participating node in between. A trace ID got generated here, there's no trace ID here, and finally I generate a new trace ID again — the connection is lost. You have to take this into consideration. These are some of the lessons that I learned.

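The "to trace or not to trace" decision just described can be sketched as a small policy object: join when the headers carry a trace ID (upstream already decided), otherwise let a sampler decide whether to start a new trace. The header name and the fixed-rate sampler are assumptions here — a real deployment would plug in the adaptive sampling discussed earlier:

```java
import java.util.Map;
import java.util.Random;

// Decide, at a service entry point, whether this request participates
// in tracing: join an existing trace, start (and sample) a new one,
// or skip entirely to keep overhead down.
public class TraceDecision {
    public enum Decision { JOIN_EXISTING, START_NEW, SKIP }

    private final Random rng;
    private final double sampleRate; // e.g. 0.001 = one call in a thousand

    public TraceDecision(double sampleRate, long seed) {
        this.sampleRate = sampleRate;
        this.rng = new Random(seed);
    }

    public Decision decide(Map<String, String> headers) {
        if (headers.containsKey("X-B3-TraceId")) {
            return Decision.JOIN_EXISTING;   // upstream already decided to trace
        }
        return rng.nextDouble() < sampleRate
                ? Decision.START_NEW         // we are the root; sample this one in
                : Decision.SKIP;             // don't pay the overhead
    }
}
```

Note this is exactly where the non-participating-node failure bites: if a middle hop drops the header, the downstream decision degrades from JOIN_EXISTING to a fresh, disconnected trace.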
Essentially, whenever someone looks at this tracing system, all the development teams are always surprised: "I cannot use it to debug this request that happened." That's because not all requests are sampled. So the requirement grew very quickly to generate trace IDs always — generate trace IDs, period. Then we came to multi-page group tracing: there was a requirement for one trace ID that groups all the other trace IDs — call it an uber-trace. Next: don't try to invent collectors; use the collectors you already have. If you have something like Flume, use it — though the costs may be prohibitive; if you have syslog-ng, just use syslog-ng. You always have to monitor disk space, and you have to think about buffering. There is dependency hell: you have to control the libthrift versions, and in a place where you have Java libraries, C libraries, libraries in other languages, there's one thing that joins all of them together — the libthrift version. You must understand Cassandra — you've practically got to go to sleep with it in order to understand it; it is not trivial. There is always a need for different consumers for the aggregations: you'll probably want to do something with Scalding or Algebird, and with Spark — it's a steep learning curve. Other things I learned: you have to be ready to interact with your infrastructure people. If you've been ignoring them, not being nice to them, not saying good morning, then you should start, because you will have to talk to them about SSDs versus traditional hard disks, ports, chassis — I never knew there was a chassis besides the one in a car — and about network failure points. You also have to be ready to talk about capacity planning, and if you don't know the answers, that's okay, but you still have to talk to the right people. Understand storage systems a bit more — the conditions and constraints under which they work. And essentially, this is what I called call
tracing it’s not it’s my name for it Zipkin is the name that that Twitter uses for I think the same thing it’s called distributed Tracy and the demos that if saw today was thanks to people in in in Twitter and no trifle from rackspace and I I’d like to acknowledge like the spring 12 GX team and thank you all for for being here so at this point that’s all I got actually waiting for questions yes actually you can start looking at it right now it’s on github that’s the good part it’s bring one cloud or spring cloud i’ll give you a link to it if i can find it what it is based from my understanding i think they just started like five days ago so what they’ve started doing is events that are coming into spring integration right and even in the spring even in the spring context so spring cortex is basically produce a consumer and there are events that are generated and what they’re doing is they are able to intercept points where they can do server received service and etcetera which is basically the same they’re using Zipkin api’s in libraries there so that’s why i said they you have to like at the end to blame have to get familiar with this stuff yes I could say more but I could kill you no I’m just kidding it’s so uber trace is is not yet implemented this stuff to be done it’s a very it’s it’s basically tied to the business so if you had a

business like that — and that's the way it was implemented where I was previously employed; I'm not giving any secrets away — essentially, if you had a page like account opening, you can tie it up with the business logic. For this account opening, since I have multiple pages and I want them grouped, you tie up with the developers and say: okay, you want these trace IDs to be generated, but also have a hash for it such that it ties itself to this business flow. The trace IDs are going to be generated no matter what — that's how the Zipkin system works — but later on I can query with this unique name that you have given for the business. I could explain more, but yes — okay, I'll just repeat the question: for server receive and server send, are there any APIs? Yes — that's where you should look at Brave, by Kristof Adriaenssens. If you look at Brave, he has provided nice APIs for starting off your project. You do have to know about your context: basically it uses a thread-local context. ThreadLocal works in many setups if it is a simple application, but you have to use it very carefully. So he provides APIs for this in the Java world. Yes — right, okay. So, thankfully, to answer your question, I just talked to Gary Russell and he gave away a small secret: basically, for the thread context they are using the same mechanism that Spring Security uses for maintaining its thread context — they're going to use that to handle this. And they are actually already doing this tracing; they've already started committing code for it. Yes — sorry, didn't get that — yes, aggregation. Yes, I did lots of them. Yeah — other ideas: okay, anomalies, for example. You have paths, right — your request paths are gathered; that's what you get out of Zipkin. So you have path A, path B, path C, and let's say they were collected over a period of time — I'd
say three months, five months, a year, whatever makes sense for you — and you want to look at path anomalies. A trace is generated in real time today; if some problem happened on the server today, you want to see why it was different from what happened over that baseline period. You can detect these path anomalies and maybe use ML libraries to figure out the pattern — predictive analysis. You can also think about auto-healing your system. So this is just the starting point: what we talked about is only the instrumentation. The real work starts after you instrument your libraries — that's when you can actually go to the big-data world, so to speak. Is that clear? Okay, any other questions? All right then — thank you so much.
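The path-anomaly idea can be shown in its simplest possible form: collect the call paths seen over your baseline window, then flag any path in today's traces that was never seen before. A real system would use smarter statistics or ML, as mentioned above; plain set membership is enough to show the shape of it. All names here are illustrative:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Flag call paths (e.g. "web>accounts>db", built from span trees) that
// appear in today's traces but never appeared in the baseline window.
public class PathAnomalies {
    public static List<String> unseenPaths(Set<String> baseline, List<String> today) {
        List<String> anomalies = new ArrayList<>();
        for (String path : today) {
            if (!baseline.contains(path)) anomalies.add(path);
        }
        return anomalies;
    }
    public static void main(String[] args) {
        Set<String> baseline = new HashSet<>(Set.of("web>accounts>db", "web>search"));
        List<String> today = List.of("web>accounts>db", "web>accounts>cache>db");
        System.out.println(unseenPaths(baseline, today)); // [web>accounts>cache>db]
    }
}
```

Everything upstream — sampling, UUID trace IDs, collectors, storage — exists so that aggregations like this one have trustworthy data to run over.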