Erlang Factory 2014 — Memory Management: Battle Stories

Thank you. So, when I wrote this talk I thought I was going to give a bunch of examples of how you analyze memory, see how the different allocators work in the Erlang virtual machine, and spend a lot of focus on actual examples. It turns out the theory of this stuff is quite complicated, so it's more of a theory talk with some stories at the end, rather than being story-focused.

I'm going to talk about the problem: why there are memory allocators in the Erlang virtual machine and what they're used for. I'm going to go through all of the different concepts of the memory allocators so that you get the terminology right, so that you know the different options you might want to set, and so that you know what we mean when we talk about blocks, carriers, and so on. I'm going to have a look at the statistics you can gather; there's a lot of statistics, and the main reason the allocators are implemented the way they are is so that we can gather statistics about your system and try to figure out whether you're using your memory in an optimal way. We're going to look at two cases where we've taken the statistics we can get from the memory allocators, made modifications, and hopefully made the system work better. And at the end we'll look at some of the new features, because there's a lot of development happening with the allocators; we spend a lot of time optimizing these things, trying to shrink memory use and make fragmentation smaller and smaller.

So, the reason we have memory allocators in the Erlang virtual machine is that normal malloc and friends are relatively slow for very small allocations. When you're allocating some small object somewhere, it's quite a heavy operation to have to do a syscall and go all the way down, so we cache memory up in the virtual machine. Also, in order to battle fragmentation, we use different strategies for how we allocate data. Allocating something like a module is something you do rarely, so it's okay to spend a lot of time finding the correct place for that memory; but for something like a message being sent to another process, you don't want to spend as many CPU cycles finding the perfect place, because it will be deallocated as soon as the message is received. So we have different strategies depending on what type of memory we're allocating. Also, there's no way to get the statistics we want out of the allocators in a cross-platform manner if you just use malloc or something like that.

If you want to try running your system without the Erlang allocators and just use malloc from the beginning, you can pass a flag to the Erlang virtual machine that disables everything, and then you run on plain malloc. Sometimes this is faster: I was looking at a benchmark called estone, wanting to see how much faster all these allocators are, and I switched them off and the benchmark went faster than with the allocators on. So it's not always a win, but that's what you get with synthetic benchmarks; in real production environments they help a lot. There's another benchmark I've been running, a telephony benchmark, and there it's about a 50% speedup using these allocators rather than using malloc and so on.
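If you want to try that yourself, the switch is a single emulator flag; a minimal sketch, assuming the +Mea flag as described in the erts_alloc documentation:

    # Disable all the allocators that can be disabled, falling back
    # to plain malloc for those allocation types.
    erl +Mea min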
In real-world applications it's usually better; for small synthetic benchmarks plain malloc sometimes wins. The allocators also help a lot when you're doing NUMA, so lots of cores and lots of things going on.

Moving on to the concepts, these are the four main things I'm going to talk about: carriers and blocks, single versus multi-block carriers, how the multi-block allocators work, and the type-specific allocators.

First some terminology, so that we can talk about these things. A block is one continuous piece of memory that the Erlang virtual machine needs. For instance, when you do an ets:insert, that piece of data becomes a block; the heap and stack of an Erlang process is a block; a message being sent from one process to another is copied inside a block. So it's a continuous piece of C memory. Then we have carriers: a carrier is a container that holds one or more blocks. So we have something looking like this: a header saying this is a carrier of this type with these settings and so on, and we put a block into it.

As we continue with more operations on the ETS table, we put more and more data in, hopefully in a good place; we add blocks, then we delete something and get fragmentation, and we might end up with something looking like this. Now, in a normal emulator these carriers are aligned on an 18-bit boundary, so about 256 kilobytes minimum, and that alignment allows us to do lots of optimizations. How much memory gets allocated is controlled with a lot of different settings; we'll see some of them in a bit.

But before I do that, let's talk about single versus multi-block carriers, because when looking at the statistics and the allocators we have two different concepts there. A single block carrier is a carrier that contains exactly one block of memory. Normally you want to put large blocks, for some definition of large, into a single block carrier, and if you have small blocks you want to put several of them together in the same carrier. This is because the operating system is quite good at handling large continuous chunks of memory, and single block carriers are usually allocated directly with malloc. Now, "large" depends on your context: sometimes large can be something like 256 bytes, sometimes it can be five megabytes. The default is half a megabyte, which seems to work for most systems. So if you have a block of some sort that's smaller than 512 kilobytes, it's placed in a multi-block carrier; if it's bigger, it's placed in a single block carrier. You control that with a setting called the single block carrier threshold.

Normally you want most of your data in multi-block carriers; in a normal system, maybe 80, 85, 90 percent of your data should be part of a multi-block carrier. So if you realize the system you're running has a lot of single block carriers and not a lot of multi-block carriers, you might want to raise the limit. Say you know from your code that you're reading one-megabyte messages from a TCP socket all the time: these become one-megabyte binaries, and since that's bigger than 512 kilobytes they always get put into single block carriers. So you might want to adjust the threshold to, say, 1.5 megabytes, and then all of those blocks go into multi-block carriers, and hopefully your system will be a lot faster.

When manipulating this threshold you probably also want to change the size of the carriers being created: if you increase the size of the individual blocks, you want to increase the size of the carriers too. Say you increase the threshold to one megabyte. I believe the default size of a multi-block carrier is two megabytes for the smallest one, and then it grows up to eight megabytes depending on how many carriers you have. Since you doubled the threshold, you might want to double those as well, so that you start at 4 megabytes and grow up to 16 megabytes, or something to that effect.
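As a sketch, that one-megabyte example would look something like this on the command line; the +MB flags target the binary allocator specifically, the values are in kilobytes, and the flag names are from the erts_alloc documentation:

    # 1 MB single block carrier threshold, with the smallest and
    # largest multi-block carrier sizes doubled to match.
    erl +MBsbct 1024 +MBsmbcs 4096 +MBlmbcs 16384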
There are quite a few different types of allocators; these are areas you can configure individually. We have the heap allocator, the binary allocator, the driver allocator, and the ETS allocator. These are things you can reason about as an Erlang programmer, and these are the ones you tune. I've not seen a problem with any of the other allocators yet; normally somebody is allocating binaries that are growing out of hand, or they have many, many small ETS entries that they're putting into tables and get fragmentation issues because of that. Those are the two usual problems.

Then there are also the temporary, short-lived, standard, long-lived, and fixed-size allocators, and these are emulator-internal things. The difference between them is lifetime: temporary is something very short-lived, something that lives just inside a C function call. Standard is links, monitors, these kinds of C structs. Short-lived is something that lives across, I don't know, the scheduling of Erlang functions, maybe four or five milliseconds, something like that. Long-lived is things like code and atoms, things we know could possibly live forever.

And then we have fix_alloc, which is for things we know are of a certain size, so we can make optimizations because of that: process control blocks, port control blocks, and lots of other things. If you want to find out exactly what goes where, there's a file called erl_alloc.types in the Erlang/OTP repository, under erts, where all of the mappings are set. The mappings are different depending on whether you're running the halfword emulator, 64-bit or 32-bit, Windows, Unix, or some other things; so depending on what operating system you built the Erlang virtual machine for, you will have different settings for these things.

So, single block carriers are quite easy: you don't have to manage much, you have one block in one carrier, and one carrier is one chunk of memory requested from the operating system. Multi-block carriers are quite difficult: you have lots of small areas, and you need to keep track of where to put each piece of memory. If I need to place something that's 8 bytes big, I need to look for the best place to put it and spend a good amount of time on it. So we have, how many is it, seven different strategies you can use to decide where to put the blocks inside a multi-block carrier.

First we have the block-oriented ones; these are strategies that span carriers, so if you have two carriers, you build one tree with all the free blocks of all the carriers in that allocator. Best fit is just that, the best fit: it looks for the slot with the least waste, so if you have a block that's eight bytes and you find a slot that's nine and the next smallest slot is seven, you put it in the nine slot, and so on. Address order best fit works the same way, only if there's a tie it uses the slot with the lowest address rather than whichever happened to be found first. Address order first fit takes the slot with the lowest address that the block fits into. Good fit takes the best fit or a good-enough fit; you can tune with settings what you define as a good fit, for instance saying it's okay to waste 10% of the block, and then a good fit is anything that doesn't waste more than that. And "a fit" is mostly used for the temporary allocations that need to be really fast: it just finds something.

Then we have the carrier-oriented ones. These are broken down so that you first have a strategy saying how the carriers are organized, and then a strategy saying how the blocks within a carrier are organized. I'll get into more details about exactly how that works, but before that I have a picture with a small example of how best fit works. We have two carriers; the shaded areas are memory that is taken, and the blue and red ones are free slots, so free memory. We build a tree of the free blocks across the carriers, a binary search tree; I think it's a red-black balanced binary search tree that you look for the blocks in. It ends up looking something like this: you have the smallest red one all the way to the right, and you build up this tree. The neat thing is that this requires no extra memory, because we save the pointers of the tree inside the free slots themselves, so it doesn't actually take any memory to build this tree.
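For reference, the strategy is chosen per allocator type with the "as" flag; a minimal sketch, assuming the strategy names from the erts_alloc documentation:

    # Address order best fit for the ETS allocator, plain best fit
    # for the binary allocator.
    erl +MEas aobf +MBas bf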
Then we just search this tree to find a slot that fits, take it out of the tree, rebalance, and hopefully we've found a good slot to be in.

So I've been talking a lot about the blocks; but the carriers, where do they come from? We have two different ways of allocating carriers in the runtime system: something called mseg_alloc, and sys_alloc. mseg_alloc basically tries to cut the operating system out as much as possible and just allocates big segments; this is normally what's used for multi-block carrier allocation. On Linux it opens /dev/zero as a file and then mmaps it with different arguments in order to get large chunks of memory. One of the key things about mseg_alloc is that it has a cache, of a certain size, of carriers that were just returned to it; I think the default is 10, something like that. On a normal system, most that I see, we have a cache hit rate of about 80 to 85 percent, so normally you are reusing a lot of carriers in a system.
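The size of that cache is tunable; a small sketch, assuming the +MMmcs flag from the erts_alloc documentation:

    # Grow mseg_alloc's cache from the default 10 to 30 cached segments.
    erl +MMmcs 30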

sys_alloc is just a call to malloc and free, or posix_memalign, depending on your operating system. It allocates pieces of memory in multiples of a configurable value, a setting called +Mu-something-something; I think the default is two megabytes. So it allocates two megabytes even if you only want, say, one megabyte; it over-allocates in order to make things easier for the operating system and get less fragmentation out of it. We've had a lot of problems with projects that have had fragmentation of the virtual memory space, and this is why we help malloc by giving it these nice even chunks that it can manage. sys_alloc is also used to allocate what we call the main carrier. The main carrier is the initial carrier, the first one that's always there during startup; the idea is that it holds sufficient memory to do a normal boot of an Erlang emulator without any extra allocations, in order to speed up starting.

Okay, another picture, and now we're getting into the different schedulers. We have the short-lived, standard, heap, binary and all the different allocators, and they request carriers from mseg_alloc. We have one chain of these per scheduler in the system, so they're all scheduler-local: you have one ets_alloc for scheduler one, one ets_alloc for scheduler two, etcetera, etcetera. That way they can take advantage of the NUMA architecture of your system, so an allocation is always close to where it's run. And these small mailboxes in the picture are there because, of course, we sometimes have to pass memory around between schedulers: if we want to free something that was handed over, we have to send a message to the owning scheduler to do the free, because we don't have any locks at all on the actual allocators. We want them to be lock-free, so we do this by message passing between the schedulers.

In, was it R16B02, we added a carrier pool that's shared between the different schedulers. You get access to it if you use the carrier-oriented algorithms I was talking about before, so if you use something like address order first fit carrier best fit block, or something like that. This pool gets populated with carriers that are below a certain percentage of utilization. So if a carrier on scheduler N has a utilization of, say, twenty percent, so quite high fragmentation, that scheduler can give the carrier to the pool, and somebody that needs a carrier, say scheduler one needs a new carrier to allocate data in, can take it from the pool rather than getting one from somewhere else. You lose some of the NUMA benefits, of course, but you get better memory utilization, which is nice.
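As a sketch of how you would opt in to the pool (the acul flag and the aoffcbf strategy name are from the erts_alloc documentation; the 60 is just an illustrative threshold):

    # Carrier-oriented strategy for binary_alloc, abandoning carriers
    # into the shared pool when their utilization drops below 60%.
    erl +MBas aoffcbf +MBacul 60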
For all of the async threads and all the driver threads and everything, there is a global allocator instance with a big lock in front of it, and that data gets put in there; so if, for instance, you're reading a file with binaries, those binaries get put into this global allocator. And all of these are backed by mseg_alloc. As if there weren't enough layers here, we have something called erts_mmap at the bottom as well, added in R16B03 I believe, that does something fancy too; if you want to know the details, just ask me afterwards.

So, that's the general theoretical background for these things. Now, getting statistics and trying to figure out what's wrong; let's dig into some of this. These are the different types of allocators and what they're called when we're talking about the statistics side of them. If you do erlang:system_info(allocator) you get information about which allocators are active in your system. It's not always the same, and it's definitely not the same across releases, but doing this you can get information about what features are enabled in your system, and you get all of the different settings that you've set, so you can figure out what your threshold is, what your multi-block carrier allocation strategies are, or things like that.
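In the shell, those calls look something like this (a sketch; the exact contents of the results vary between releases):

    %% Allocator library, enabled features and settings in effect;
    %% returns {Allocator, Version, Features, Settings}.
    1> erlang:system_info(allocator).
    %% The alloc_util allocator types present in this system:
    2> erlang:system_info(alloc_util_allocators).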

It also shows features: whether you have posix_memalign on your system, whether you can lock physical memory in place rather than relying on the Linux way of doing memory management, and lots of other things.

You can dig into even more detail: if you call erlang:system_info({allocator, Type}) with one of the types I had at the end of the previous slides, say sl_alloc or something like that, you get a list of tuples with lots and lots of data. You can see it's structured in instances: instance 0 is the global allocator that's used for async threads, instance one is for scheduler one, instance two is for scheduler two, etcetera, etcetera. For each instance we have the version of the allocator in use, the options set, statistics about the multi-block carriers, statistics about the single block carriers, and statistics about the calls made to the allocator.

Let me break this down for you. This is what you get out of that data; I've put it in a table so you don't have to read the raw terms. We get the current value, the maximum value since the last call to get the statistics, and the maximum value over the system's lifetime. We can see the number of blocks, so we have, say, 1.6 million blocks currently and a maximum of 1.8 million, and we get the total size of all the blocks: 820 megabytes there, and we had a peak of 3.3 gigabytes of memory. We can see the carriers: we fit these 1.6 million blocks into 455 carriers. We can see the single block carrier sizes down there as well; a single block carrier would of course have the same number of blocks as carriers, and we can see that the block sizes are 620 while, since we over-allocate a bit, the carrier sizes are 7.5 and 25. Then the number of calls: we've done quite a few, 28,000 mega-calls to binary_alloc, so quite a few binaries have been allocated here. We've freed roughly the same amount, which is a good thing, otherwise we'd have a memory leak, and we've been able to reallocate some as well. And we can see that mseg_alloc has done 24,000 of the allocations while sys_alloc has done zero, which is a good thing.

That's the single block and multi-block statistics. You can also get statistics about the carrier allocators themselves: the number of segments, cache information, and the calls made to alloc, dealloc and destroy. Here you can see the cache at work: there are 464 mseg_alloc calls over there, while we've actually only created 40 carriers. So we have quite a high cache hit rate: we've done 460 calls and only 40 of them ended up in an actual mmap call.
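A sketch of pulling those numbers out of a live node; the field layout here matches the R16/17-era statistics format, and newer releases structure some of these fields differently:

    %% Statistics for every binary_alloc instance.
    1> Info = erlang:system_info({allocator, binary_alloc}).
    %% Pick out scheduler 1's instance (instance 0 is the global one).
    2> {instance, 1, I1} = lists:keyfind(1, 2, Info).
    %% Multi-block carrier statistics for that instance.
    3> {mbcs, MBCS} = lists:keyfind(mbcs, 1, I1).
    %% {blocks_size, Current, MaxSinceLastCall, MaxEver}, in bytes.
    4> lists:keyfind(blocks_size, 1, MBCS).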
Okay, let me just check how much time I have; seems to be plenty, that's good. So, some case studies, to see how you analyze these things once you understand what the data means. I've run across many cases, but these two are quite useful for the open source community, because you seem to run into them more than our commercial customers do.

The first one is about large binaries. There was this person who realized: hey, I'm running strace, and I know these allocators are supposed to be really efficient and should be using mmap to allocate all my stuff, but for some reason it's not using mmap, it's using malloc, a lot of malloc. So I requested the statistics for the binary allocators, and you could see there were about 300 mega-calls being done to binary_alloc, and about 0.4 mega-calls being done to mseg_alloc. So we have a discrepancy, because we want most of our stuff to go via mseg_alloc, not the system allocator: mseg_alloc means we're using multi-block carriers, while the system allocator mostly means single block carriers.

And we could also see 1.4 mega-calls to sys_alloc, so quite a discrepancy there. This is where bells started ringing for me, saying something is wrong with these settings. At this point you can reason it out: something is wrong with how the threshold is set for where things get placed. I know these are binary allocations, so this person is allocating binaries that are larger than the single block carrier threshold for most of the binaries, and therefore they're pushed into single block carriers, which is not as efficient. Going back, we can see that the multi-block carrier size is about 2.4 gigabytes and the single block carrier size is 11 gigabytes of memory. That's a big discrepancy; normally you want the reverse, if not even more in favor of multi-block carriers.

So what can we do about this? We want to adjust the single block carrier threshold somehow, but how much? Looking at the statistics you can also work out the average block size in the single block carriers: you take the total size of the blocks and divide it by the number of blocks, and we can see the average block size is about 1.68 megabytes; in this case it was reading from a TCP socket. Now that we know the average block size, we probably want to go a bit over it when tuning the allocators, and in the end we set the threshold to 2 megabytes, which is a reasonable limit above that and a nice clean number. That puts binaries greater than 2 megabytes into single block carriers, while the ones smaller than 2 megabytes go into multi-block carriers; and since multi-block carriers are allocated with mmap and you get the mseg_alloc caching and all that, the performance increase was quite good, I believe. Of course, we also increased the largest multi-block carrier size and the smallest multi-block carrier size, which is what the last two flags say, and only for the binary allocator. This is where the strength of the Erlang runtime's allocators comes in: you can say I only want these settings for the binary allocators, because if you set them for all of the allocators, you would most probably get a problem with ETS or something like that over-allocating more memory than you really want in your system. So you can say: I have a problem with only binary allocations, therefore I put only binary tunings into the emulator.
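Put on the command line, that fix might look like this; the sbct value is in kilobytes, and the two carrier sizes are illustrative, since the talk doesn't give the exact numbers used:

    # Binary allocator only: 2 MB single block carrier threshold,
    # with correspondingly larger multi-block carriers.
    erl +MBsbct 2048 +MBsmbcs 8192 +MBlmbcs 32768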
Okay, so the second case I ran into was actually helping Fred Hébert with a problem at Heroku, which he wrote a big blog post about. If you've read that blog post: the symptom he was having was that erlang:memory(total) was showing about seven gigabytes of used memory, but top in his operating system was showing that 15 gigabytes of memory was being used by the beam. Then all of a sudden he got some kind of small traffic spike, and he got a crash dump: ets_alloc failed to allocate. He was confused, because hey, it says I have only seven gigabytes, top says 15 gigabytes, and I don't have memory to allocate these things?

The thing to know about erlang:memory(total) is that it's the total size of the blocks in your system, not the total size of the carriers. So there's a discrepancy there: erlang:memory(total) is only the used areas, not what's actually allocated from the operating system. If you see a discrepancy between what erlang:memory gives and what top in your operating system gives, you most probably have some kind of fragmentation problem in your allocators.

So again, looking at the statistics, we can see that he had a lot of blocks, with a current block size of 1.6 gigabytes, and I believe this is just one instance, so only one scheduler. We can see a maximum block size of 7.8, 6.8 sorry, but we can see that the carrier size was largely the same for the current and the maximum. So this is what the problem is: you have fragmentation in the allocation of these binaries. For some reason the allocators were allocating binaries across lots of different carriers, but when they were released, not all of the binaries within a carrier were freed; so there was one binary here, one binary there, and it was impossible to release these carriers.

Then he got a spike in ETS inserts, and you cannot put ETS data into a carrier that's in use for binary allocation, so the system tried to create ETS carriers and ran out of memory. So again, we need to know the average block size. We can see that the blocks were about 800 bytes per block currently, with a peak of 1.7 megabytes per block, and the carriers were about 7.8, 7.7 megabytes per carrier, which is what you would expect. But if you take the total block size and divide it by the total carrier size, we had a utilization of about 22 percent at current, which is not good at all, while at max we were at 93 percent, which is quite okay.

So what we want to do here is find a better allocation strategy, so that we don't end up in a scenario where, when we free binaries, the carriers cannot be returned to the operating system. What we ended up changing: the default allocation strategy for binaries is the normal best fit allocator; by using address order best fit instead, we try to squeeze allocations further down into the memory space and thereby compact them. That spends a couple of extra CPU cycles, but in the big scheme of things we didn't notice any difference. We also shrank the size of the largest multi-block carriers, hoping that the statistics would then be in our favor so we could free more of the carriers, and that turned out to be the case. I don't know exactly what his dips are nowadays, but it's not at 20 percent utilization; it's more like 40 to 50 percent, which saves the system in a lot of cases, so it doesn't have to deal with these crashes anymore.
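As flags, that change might look like the following sketch; the lmbcs value is illustrative, since the exact size they shrank to isn't given in the talk:

    # Address order best fit for binary_alloc, plus a smaller largest
    # multi-block carrier size (in kilobytes).
    erl +MBas aobf +MBlmbcs 2048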
And some of the new features that have been coming up lately. I was talking about the pool that came in R16B01. Carriers can only be migrated between allocators of the same type, so it's migration between schedulers, but it's still binary-to-binary migration; you cannot migrate a carrier from binary allocation to ETS allocation yet. That's something we'll probably get to work on eventually, I hope. So this pool will not solve the fragmentation problem, but it helps in the scenario where you're allocating a lot on all schedulers at high load and then your load shrinks and you get a lot of deallocations. The reason for introducing it was that we had a customer seeing memory growth as the load of the system decreased, which is not really what you want to see. They had a peak, lots of memory got allocated, and then the load shrank and the system was allocating in only one scheduler but wasn't able to use the carriers of the other schedulers; so the memory of that scheduler increased, and the others couldn't be deallocated. By migrating those carriers into scheduler one, we solved that problem in R16. By the way, this is turned on by default in 17, so check your system and see whether you want it on or not; we have only seen positive effects from it, but it's possible there are some downsides.

In R16B03 we added something called the super carrier, which is where the erts_mmap functionality was added. This is a way to pre-allocate one huge chunk of memory for the Erlang virtual machine at startup, so that we don't have to request memory from the operating system. The reason we put this in is that one of our customers is running in a virtualized environment where it took about 20 seconds to do a malloc, so allocations were taking a long time and their tests were timing out. So we put this in place so that they could pre-allocate a large amount of memory and not have to request it from the operating system at such small intervals. It can also be used to limit the amount of memory the Erlang virtual machine uses, but ulimit is a better tool for that: you get the same result, and I think the operating system is in general better at managing these things. Part of the reason we do single block carriers as operating system allocations is that the OS can cheat by taking memory from caches and so on; if we allocate everything ourselves with mmap, we don't get the caching that malloc uses, so we don't get access to all of the memory in the system. So if you use ulimit, you don't get the downsides, but you get the same benefits.
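A sketch of the super carrier flags as they appear in the OTP 17 erts_alloc documentation:

    # Reserve an 8 GB super carrier at startup (size in MB) and serve
    # all carrier allocations from it, never going back to the OS.
    erl +MMscs 8192 +MMsco true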

Yeah, so I'm done; I'll take questions.

So, the question is: there's one super carrier for the entire system, and how does that work with the NUMA architecture? I don't think we've actually done any benchmarks on that to see how it works. I'm a bit fuzzy on the implementation details of how it hands out carriers, so I'm not 100% sure; unfortunately the answer is no, I don't know.

The question is: how does the migration of carriers work in a NUMA system? So yes, you can definitely have problems with that; you will have remote memory accesses where you might not want them. But if you don't want this behavior, it's very easy to switch off, and then you don't have to deal with it. On systems that don't have a lot of NUMA interactions it's just beneficial; normally you don't want the other scenario anyway, where memory is being allocated and never released from these other parts. And as long as you do message passing or interact in your environment, we don't group processes by NUMA node to avoid the overhead, and if you send a message from one process to something on another NUMA node, the memory of that message will remain on the original NUMA node. So you have a lot of these remote accesses anyway; this just tries to help with the fragmentation side of it.

The question is: what's the S plus one? Let's see if I can go back to the pictures and explain that better. So the S plus one I was talking about is just that: the thread pool is instance zero, then one is the first scheduler, two is the second scheduler, three is the third scheduler, and so on. I might have made a mistake in the slide; no, there's nothing at the end over there, it just ends at the last scheduler.

The question is whether NIFs use the driver allocator. No, they use the heap allocators: you're allocating directly on the process heap when you're building things there. Of course, if you make a binary that's bigger than 64 bytes, it will end up in the binary allocator, but NIFs have the advantage of building stuff directly on the heap, which is why they're a bit faster than drivers when calling small stuff. Unless, of course, you go over the heap size; then it will create what we call a heap fragment, which is still part of the heap allocators but is a different memory area. All of the fragments get merged back when you do a garbage collection.

The question is: where can I find information about this stuff without reading the source code? Whether it's all in my mind: yes, but there are two other minds that you can read as well.

As part of doing the work with Fred on the fragmentation problem, I worked together with him on a tool called recon, and there's a module in there called recon_alloc that does some basic analysis for you and tries to figure out what's in your system. There are maybe seven or eight functions in there that analyze different aspects of problems we've encountered, and we're trying to add more. But what I find is that most of the problems you encounter are unique; it's almost never a general problem for everybody, because then it would be a general setting. It's almost always a specific problem somebody has because they've written their code in a certain way. But for the fragmentation parts, and for finding out whether you have many single block carriers rather than multi-block carriers, there are functions in this library that can do that, along with a lot of documentation explaining all the other stuff. The documentation is almost better than the tool, I would say; just reading it explains most of the things that were in my slides. You're welcome. Any more questions? All right, thank you for your time.
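(For reference, the recon_alloc calls mentioned above look roughly like this; a sketch assuming recon is in the code path, with function names from recon's documentation:)

    %% Ratio of block sizes to carrier sizes, i.e. utilization.
    1> recon_alloc:memory(usage).
    %% Per-instance utilization snapshot.
    2> recon_alloc:fragmentation(current).
    %% mseg_alloc cache hit rates per instance.
    3> recon_alloc:cache_hit_rates().
    %% Average block size per allocator type.
    4> recon_alloc:average_block_sizes(current).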