BKK16-401: Enhancing Application Performance with ODP

OK, this is session BKK16-401, Enhancing Application Performance with ODP. Once again, for those who were expecting the Molly presentation: there was a scheduling conflict and it was moved; it is being held in M23-74, so if you're looking for that, you're not in the right room. If you're looking for this, you are in the right room.

What we're doing today: as you know, we have a series of demonstrations planned for this afternoon in the LNG hacking room, which is Lotus 5, so we hope you'll be able to see folks there this afternoon. What we want to do today is share our experiences over the last several months of using ODP, both in terms of application porting and performance tuning, and offer design advice to application writers in the form of do's and don'ts and pitfalls to watch for. We'd really like this to be an interactive, Q&A-style roundtable, so we'd encourage people to sit a little closer to the mics if possible so that we can pick you up; this room is a little large. We've got a couple of folks up here, and I think there are some other folks still expected. I'm Bill Fischofer; is Christian in the room yet? We've got Krishna here, we've got Barry here; Depeng and Wenxin, did you want to come up here for the talk, at least one of you? What we really wanted to do today is cover a couple of topics of interest based on the experience we've had porting and working with a number of non-trivial applications running on ODP. So without further ado, I think we'd like to start by talking a little bit about nginx.

Yes, so we tried to port a web server application to run in an ODP context, and we took the nginx web server for that purpose. I think many of you know it; it's a well-known, lightweight web server, and it's open source. It has a concept of using worker processes to handle multiple connections in parallel; usually the number of processes is the number of cores on the hardware. It handles connections by multiplexing new connections and serves the content in parallel. It also supports a very rich set of features and performance tuning options: you can tune the performance by updating the nginx config file. If you run it with the default config you might see basic performance, but if you update the configuration you will see different performance numbers; that's why I mention it. Can you go to the next slide?

So, what we did in ODP to port nginx: we ported nginx to the ODP context by using OFP (OpenFastPath), which is a user-space TCP/IP stack. In nginx we added a module called nginx_ofp_module, which gets initialized at the initial stage, and we replaced all the Linux system calls with the corresponding OFP calls.

Then we have two options for event handling: one uses the OFP event mechanism, the other the normal select mechanism. We tried both; mainly we use the standard select mechanism, with the event-notification mechanism as an option, and we plan to work more on that in the future. We also have the ODP scheduler inside the nginx worker process, so each worker process runs the ODP scheduler for run-to-completion handling of tasks.

This slide is about the challenges we faced while porting this web server to ODP. Nginx has a really complex architecture, and initially it took some time to understand it and fit it into ODP so that it runs in the ODP context without conflicts between tasks running in parallel; that was one of the challenges. And, as I mentioned before, we had multiple solutions, the event mechanism from OFP and the select mechanism, so we had to run several scenarios and understand the performance of each option. We need to work more on this one; we believe we can tune the performance of nginx with the different mechanisms provided by ODP and OFP. There are also multiple teams involved: people from LNG and from the OFP project are working together on this.

A quick question, then: can you characterize the amount of change that was required? Obviously one of the challenges was just understanding a large existing code base, but in terms of the degree of change required to adapt to this, can you comment on the size of the changes or their scope?

Actually, it's not that big in the end. As I mentioned, we had to initialize the nginx OFP module and replace all the system calls with OFP calls, and for that we needed some integration work. But nginx has a feature that lets you add a module as a plug-in so that it replaces whichever feature you want; you can enable that feature instead of the others, and that's what we did. It took a little while to figure out how to run this in the ODP context. The next thing is that at first we had only the basic part working, and then we figured out it wasn't actually running completely in the ODP context, because we couldn't run multiple processes in parallel; in the beginning we could run only one process in the ODP context, and if we tried to spawn more we got into trouble. So we had to go into the nginx core module and modify it so that all the processes that get spawned come up in the OFP/ODP context. We think it can be modified and optimized further and made better. It's simple to understand; I can show you where the code lives so you can go through it.

We use linux-generic; the tests here are on the linux-generic base, and we have plans, as one of the points on the slide, to do this in the ODP-DPDK context as well, hopefully for the next Connect.

So you're basically running one ODP worker thread per Linux process? Yes, and now we have made it work so that multiple processes can run, one process per core. So multiple processes aren't running on the same core, but there's only one thread per process? Actually, if we set the affinity, then all the worker processes can run across multiple CPU cores.

Great. Were there any questions from the audience on this? One question: does nginx interact only with OFP, or how much interaction does nginx have with ODP? As I understand it, OFP is built on top of ODP and nginx is built on top of OFP, so how much does nginx interact directly with ODP? Not much, actually; it's mainly through OFP only. In the initial part we have to initialize the processes from ODP; that's the only part, and the rest goes through OFP only. Thanks. You just mentioned that you're going to try running it over ODP-DPDK? In our demo we run it in a VM, so OFP over ODP-DPDK in a VM works; we've mainly focused on getting this to a stable stage, so that's what we concentrated on.

OK, this slide is about the performance numbers. You can see native versus our ported nginx numbers with different numbers of worker processes. This is running on linux-generic? Yes, that's correct, all on linux-generic; there is no DPDK here, just the standard packet I/O. We haven't gone through that yet; together with OFP there are some updates we received from OFP where we see a much bigger performance increase, and we believe we also need something from ODP, like multi-queue support as the next step, which we believe might show better performance; right now we are using the plain ODP scheduler. So it may come from both sides, not only from OFP and not only from ODP; we are working on that.

Yes, this is the future work I'm talking about. The next step is to try to run it on ODP-DPDK with multi-queue support, and we want to do some optimization and profiling to see where the performance is going. We also want to polish our patches; we have some patches, we have to do some more work on them and make them better, and we'll try to upstream this work.

And we'll try to upstream this work to the nginx project; that's one of our tasks. This git repo is my private repo where I'm testing OFP nginx at LNG, so if someone wants to test it, they can take it from here.

Great. I have a question regarding the upstreaming: do we have any feedback from the folks upstream, are they interested? As far as I know we haven't contacted them yet; we only started this a few months ago, so I think we will soon. I think one of the factors also was that this work was also helping to tune OFP, correct? Yes, absolutely. So this was really not just an exercise in porting a non-trivial application to ODP, but also one of the first large-scale exercises using OpenFastPath. So there is also a question of tuning OpenFastPath as well; work in progress, but I think a promising start. Yes, exactly, because we've been finding more performance tuning for OFP too; they are constantly delivering patches based on this nginx work. When we find issues we speak to OFP, and if it's something on their side they deliver the changes to OFP directly, so all those changes get upstreamed in OFP. The patches here are only the nginx part; you will only see the nginx patches here, and all the other patches get upstreamed to OFP.

Great. So Depeng is now going to talk about the experience of working with the T-Rex traffic generator.

Thank you. First of all, I'm from Cisco, and I'd like to take this opportunity to thank our parent company for giving us the chance to work on this interesting project. Let me give you a short introduction to T-Rex. T-Rex is an open-source traffic generator created by several Cisco engineers, led by my colleague, who is the initiator of this project. T-Rex is based on DPDK. It can generate Layer 4 through Layer 7 stateful traffic by replaying pcap files in a smart way. Bundled with T-Rex you can find many pcap files captured from realistic production networks, so we call it a realistic traffic generator. The performance of the original T-Rex is impressive; it can scale to 200 Gbps running on Cisco UCS servers. In Cisco we use this traffic generator to test our stateful features like DPI and performance routing.

My work is porting T-Rex over ODP. I think the benefits of porting T-Rex to ODP include, first, that the original T-Rex needs DPDK-compatible hardware, while after porting T-Rex to ODP we can use T-Rex with interfaces that are not DPDK-compatible. For example, in our demo this afternoon we connect T-Rex to our Cisco software router, the CSR 1000v.

The CSR's interface is actually a tap interface, a virtual interface, and the original T-Rex cannot work with that kind of interface because it is not DPDK-compatible. After the porting, the ODP T-Rex can work with such interfaces. Meanwhile, if linked with ODP-DPDK, the ODP T-Rex can still benefit from the DPDK performance boost.

During the porting we collected some lessons. First of all, the original T-Rex uses the mbuf as its representation of a packet, but in ODP a packet is represented by the ODP packet. These two objects have quite different semantics, so we cannot directly replace the mbuf with the ODP packet; that would have created difficulties during the porting. So we adopted the approach I mentioned as option two: we implemented our own mbuf and related methods on top of the ODP buffer and ODP packet, and we convert our mbuf to the ODP packet at the last moment, just before packet I/O. In this way our porting work became much easier, because we didn't need to find the many places in the T-Rex code that manipulate mbufs; the main part of the T-Rex code remains unchanged. The downside is that the extra memory copy introduces a noticeable performance drop; we will come to this topic in a later slide.

Next, in DPDK the hardware offloads are exposed to the application, but we didn't find similar functions in the ODP API, so we had to disable that part of the functionality. For example, in the original T-Rex you can set a flag and a VLAN ID in the mbuf, and once the NIC sends the packet it will tag the VLAN ID for the application; we had to disable this part of the functionality for the time being.

We also have some feedback for the ODP API. First of all, we'd like to see a new API that combines multiple ODP packets to form a single packet, and I want to use an example to explain why this is useful. You can see that packet A is made of a small header followed by a large payload, and packet B is made of another small header followed by the same large payload. In T-Rex, when we construct packets, we just need to change some fields in the header; we don't need to change anything in the payload. So in the original T-Rex we can allocate a different header for each packet while the packets share a common payload, which makes packet duplication much faster. Some other cases that may benefit from this capability include fragmentation and reassembly, and multicast replication.

The second piece of feedback is that we'd like to see NUMA support, because in DPDK when you create a memory pool you can pass a socket ID as an argument, but in ODP we don't have similar support; I think Bill has held some discussion sessions about this.

Yes, this was one of the topics; we'll be talking about this in the next session, in fact, where we recap all the deep-dive discussions we held this week. Certainly both of these areas are things we are addressing in ODP Monarch, and more fully in ODP Tiger Moth later this year. This was exactly the sort of test bed we needed to prove out the requirements, as well as to enable us to test the patches that are currently being circulated and will be circulating as we add these additional features. That really shows the benefit of these types of ports in helping to guide the tuning of the underlying system.

A quick question: I understand totally why the first one makes a big performance difference; being able to have a template and a common part is huge for traffic generators. It's not so clear to me why the memory-pool socket-ID feature in an x86 environment would give you that much performance benefit; it gets a little confusing where the performance benefit comes from. The actual performance difference here is still to be evaluated, but some earlier tests with the original T-Rex show that if you connect the NICs to different sockets and create the packet buffers on the corresponding NUMA nodes, it increases performance drastically. So this is basically not just multi-core Intel, it's multi-socket? Yes. OK, I understand; so these numbers aren't single-socket.

There are a few other pieces of feedback; let me go through them quickly. First of all, we'd like to see an API that lets us program the NIC, at least to filter based on MAC address, because we don't want to see any packets we are not interested in; as a workaround, in the current ODP T-Rex we just set the

NIC to promiscuous mode. Second, we'd like to see ODP allow the application to manipulate the NIC in more diverse ways, not only the link up/down status. In the original T-Rex, DPDK gives you a structure abstracting the hardware, and you can find register addresses in that structure, so the application can interact with the NIC directly, for example reading registers for statistics; we don't have similar support in the ODP API. And the last one: we'd like to be able to read the number of free buffers in a memory pool, which is useful for troubleshooting. The next few slides talk about performance evaluation and profiling; we identified some hot spots in the implementation, so let me hand over to my colleague Lin Chen to introduce this part. Thank you.

OK, after porting T-Rex to ODP we did some quick performance tests. Here are our test considerations. First, we disabled the latency traffic: the native T-Rex leverages some hardware functionality to accelerate some kinds of classification, which is not supported in ODP-DPDK, and since we disabled this traffic, some T-Rex functionality is also disabled, for example NAT support and latency calculation, which rely on this feature. But I think we can still evaluate the performance even with this traffic disabled. Another consideration is that we used only one worker thread to generate the traffic. The reason is that the original T-Rex leverages NUMA support, which we don't have, and the original T-Rex also uses multiple TX queues, whereas when we did the porting we used ODP 1.6, which has no multi-queue support. The traffic profile used is the SFR profile, which is a realistic profile provided by a telecom company from France; in this traffic profile many applications are combined. And we took the traffic report from Cisco routers connected back-to-back with the T-Rex; the reason we used the routers' counters is that we found the ODP packet I/O statistics returned unexpected values from very early on. We are still investigating whether there's an issue in ODP or in our code.

Here are some quick test results. The first column is the native DPDK performance: you can see that using only one worker thread to generate the traffic, it achieves about 5.7 Gbps of throughput. With linux-generic it's 1 Gbps, and then turning to ODP-DPDK it's about 2.2 Gbps, which is roughly a sixty percent performance drop. After some performance analysis we identified one of the hot spots: the time/clock API, which is a system call in linux-generic and had not been optimized in ODP-DPDK. After optimizing the time/clock API with the DPDK API, the performance improved by more than ten percent.

Here is some analysis where we used the perf tool to find the hot spots after our porting. Basically, we found two hot spots that impact the performance a lot. The first is memory copy; I think most of the memory copying is introduced because we

introduced a new conversion layer to the ODP packet, and as Depeng mentioned, we need some new API to eliminate it. The second problem is the time clock. I think if we can optimize these two parts, the performance of T-Rex over ODP-DPDK would be improved much further.

A couple of questions here. Obviously one of the sources of overhead is the fact that you've got an application natively using the mbuf structure within DPDK, and trying to separate that from direct manipulation of the mbufs in T-Rex introduces a lot of overhead. Were you able to do any preliminary analysis of what it would take to change the code to not manipulate mbufs directly, but instead use the ODP packet structures natively, and then take advantage of the fact that in ODP-DPDK, for instance, it really is just mbufs under the covers, just not explicitly exposed to the application as such? Actually, we are planning in our next step to try to eliminate the additional conversion layer, but the first thing we need, as we mentioned, is an API to make the packet representation in two parts: the different header and the common payload. Right, and that is one of the things we are adding as a result of this work as well as others, so that's good; hopefully we'll get a much better view of that by the next Connect.

The other thing that stands out on this particular chart is the last column, where there's actually quite a bit of overhead; could you comment on that? We so far have no detailed analysis of this column. It's interesting that the native DPDK T-Rex uses the tx-burst API to transmit packets, and this API occupies almost twenty percent of the CPU cycles, but in our ODP-DPDK T-Rex we use another DPDK transmit API. One possibility might be the difference in DPDK versions: the original T-Rex uses DPDK 1.7/1.8, while ODP-DPDK uses DPDK 2.2; that would be one possibility. Well, this particular call presumably is being done by ODP-DPDK rather than by the application? So the question is who makes these calls: in the latter case, running on ODP, that's being done by ODP-DPDK, not by T-Rex itself. Yes; so it's the way that packet output was implemented in ODP-DPDK, possibly with some changes at the underlying DPDK level.

OK. And on preventing the memory copies: let's say you stick with the mbuf until the very end and then convert to ODP packets; theoretically you should be able to do that by just taking the ODP packet metadata, the structure around it, copying that into the equivalent structures, and leaving the data buffers alone. You shouldn't need a memcpy of the actual data buffers, I don't think; maybe that's quite a small change for next year. Yes, after we have the new API we can do this change; I think it's not big. You can skip this slide.

OK, so some of our future work. The first item is that after we introduce the new APIs, say multiple TX queue support and the packet-reference API, we can eliminate the additional conversion layer, and we can optimize the time clock. Then we would like to evaluate the performance difference between the ODP poll mode and the ODP schedule mode; currently we use the poll mode. OK, thank you very much.

I have a question. So this T-Rex code is freely available now; I could get it and try it on my systems? Yes. The first question I want to know is: for a TCP traffic generator, are you actually implementing stateful sequence-number processing and retransmission of your pcaps, if you see some packet drop or something? In the existing T-Rex we do not maintain the full TCP protocol stack; it is a kind of smart replay. We generate one flow at a time and we simply replay the packets from the pcap file. So no real replay of state; right now it sounds like you're always testing with UDP-style replay?

Right, it does not support a full TCP stack. What we do is capture a pcap file in the middle of realistic traffic and simulate some delay: we captured the packets in the middle as the pcap file, we replay that file, and the delay we simulate is enough for a packet to reach the other side. But if some packet drop happens, we don't have retransmission here; it's a trade-off between performance and functionality, and as users of the traffic generator, mostly we only need stateful-looking behavior. Presumably when you do your pcap replays you can change the TCP sequence numbers and port numbers and so on, so you can replay multiple sessions and such? Yes. OK, thank you.

Well, both nginx and T-Rex will be demonstrated this afternoon in Lotus 5, so for any additional questions, and to see things and talk about futures, we can cover that this afternoon.

OK, Barry, you've also had some experience working with ODP applications; did you have any comments or observations on the pros and cons, the challenges, and tips for doing that? Oh yeah, sure, a few. Some of this actually goes back to our September Connect, where I had done some work on porting ODP to the Tile-GX platform, and we were able to see very thin overhead, pretty much the performance that the hardware could give us. I tried a couple of different apps. The easiest one was the l2fwd example, which was actually interesting because it had both schedule mode and the more direct modes, and the performance was indifferent between them; I thought that was probably one surprise, but then it wasn't a very complicated app. The other app we experimented with was OVS, and that's such a more complicated application that it was much harder for us to get a sense of it. We did get it working, but it just took too long for us to really compare apples to apples across different scenarios. What we did discover was that there was a lot of overhead around what they call the EMC cache there, and it wasn't done particularly well in the OVS code we had; I wanted to go and fix that, but I never did. Otherwise, from an ODP point of view, that wasn't the bottleneck we were seeing.

Right, and you were also working on a platform that had quite a bit of hardware acceleration capability built in. So I guess the question is twofold: from an application standpoint, how easy was it for the applications to take advantage of those accelerators without having to do special coding; and from an implementer standpoint, how easy or challenging was it to map the ODP APIs to take advantage of those accelerators under the covers? So that's still in process; it's unfortunately been slowed down a little bit because my new company has other things they want us to focus on first. From what I saw, to be honest, I really don't have any good feedback on the app part of it, more on the implementation part, because you have to do that task first. On the implementation part, we didn't get all the accelerators in there before we were ordered onto something else, but the ones we did looked fairly straightforward; we didn't get to the point where we could say it's in production or fully tested. The second step, how easy it is for the apps to use, I don't have any feedback on; it may be obvious, or it may depend a lot on the assumptions the app starts with.

Any general questions for anyone who has been talking here? Don't all speak at once.

All right, well, thank you very much for attending. We have one more question: when we do acceleration, can we also consider the power consumption? I'm sorry, I didn't quite capture the question. It's a question on acceleration: while we accelerate with DPDK, why don't we also consider power consumption? Ah, power management. Right; again, that's really a function of the underlying implementation. For instance, if you have an application that is using the event model for ODP and is just calling odp_schedule(), and there's no work to be done, then it's the implementation's option to enter a low-power mode until work is available to run. But the data movement from the NIC through PCIe, those accesses must consume lots of power? Right, data movement dominates the power; the computing part is comparatively small.

OK, well, thank you. Yes, that is one of the issues applications have in general: the more they are tuned to a specific platform, either by design or happenstance, the harder the question of de-tuning them from a specific implementation and using more generic structures that then allow them to adapt to other environments without effort. In many ways it's common for some applications to be highly tuned to a specific environment, and un-tuning them from that so they can run well on different platforms can be a bit of a challenge. Power features are one of those areas that tend to be very platform-specific, and that's one of the reasons why in ODP we leave power awareness up to the implementations rather than the applications, because being power-aware ties an application to a very specific platform.

Maybe there are some specific cases, Bill, where an application should be able to say, for example, "I want to go to power-save mode now," or "top speed now." If there are such requirements from applications, please bring the requirement and the use case to the mailing list and we will discuss it. We find power efficiency increasingly important, because power, dominated by data movement rather than computing, dominates the data center.

That also reminds me of something: one tip that I would recommend for both implementers and users, which we've always found to be very powerful, is the use of huge pages, which really can make a difference on almost all platforms. Of course, the one frustrating part is that ARM, Intel, and Tile-GX all have different huge page sizes, but still, that's one thing I strongly recommend considering.

All right, well, that takes us to the end of the hour; there's a break right now,
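For reference, the huge-page tip above typically amounts to a small piece of system configuration like the following. The page count and mount point are example values, not a recommendation from the session, and as noted in the talk the available page sizes differ across x86, ARM, and Tile-GX.

```shell
# Reserve 1024 huge pages of the default size (2 MiB on common x86 setups)
echo 1024 > /proc/sys/vm/nr_hugepages

# Mount a hugetlbfs so applications can map the pages
mkdir -p /dev/hugepages
mount -t hugetlbfs nodev /dev/hugepages

# Verify the reservation took effect
grep Huge /proc/meminfo
```

Packet I/O frameworks then back their packet pools with these mappings, cutting TLB pressure on the fast path.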

and at eleven-fifteen we'll be continuing with the next session, which is the LNG futures discussion, a summary of all of the discussions that we've had this week. We hope to see some of you back here for that; until then, enjoy your break.