Seattle Conference on Scalability

MALE SPEAKER: Our first speaker of the day is Jeff Dean, who's our keynote speaker. Jeff has been with Google for the last few years, and he's been an architect of pretty much every single system that we have at Google, in search, in crawling, in indexing, and in advertising systems. He's a Google Fellow. He's also from the Seattle area. He did his Ph.D. with Craig Chambers, on compiler optimizations for object-oriented programming languages. I want to invite Jeff.

JEFF DEAN: Thanks. So welcome, everyone. A lot of what you see and use on a daily basis from Google is our products, which are nice things. But what I'm going to talk about today is basically the underlying systems infrastructure, and also a little bit about the underlying computing platforms, so you can understand what sort of computing systems we're building on top of, just to give you a flavor of what sits beneath all these different products. I'm not really one for mission statements, but I actually like this one, because it's pretty broad. It means we'll never run out of things to do, so how can that be bad? So the question is, if you're trying to take all the world's information and organize it, what does that mean? If you think about just the web, there are today tens of billions, possibly hundreds of billions, of web pages in the world. Each one is about 10 kilobytes of data, and that gives you on the order of hundreds of terabytes of data that you need to be able to organize and search over quickly, things like that. And then there's all the other kinds of data: private data for individual users, things like email, things like broadcast media, online videos, online pictures. So all those kinds of things add up to an awful lot of data in the world, and it's growing substantially. Most of it's in digital form these days, but some of it's not, and we like to take stuff that's not in digital form and make it accessible in digital form. And web search, although that's our starting point as a company, is just a tiny fraction of what we're trying to do. We take this pretty seriously. We started with web search, and then we gradually added various other kinds of things. Now we do a lot of things with geographic data; that's kind of an interesting area these days, with satellite imagery and various kinds of pictures of cities and so on. We do a lot of community-oriented applications now, where we allow people to talk to other people, organize their own email, search their email, and so on. So there's lots of things there, and they all place different kinds of demands on the systems infrastructure. They all have slightly different requirements, and so on. But the goal is basically to build systems infrastructure that allows you to quickly, with small teams, build some of these interesting products. So one thing we've certainly seen over the years is that we seem to always need more computers than we have.
And there are several reasons for that. One is, as traffic grows over time, even if you don't do anything else, you need more computational power just to handle more and more requests. The second thing is, as you increase the size of your index, or you index more kinds of documents, or things like that, you need more computers to deal with the increasing scope of data for the same number of requests. And finally, as you try to improve the quality of your ranking, you try to apply more expensive ranking algorithms on every query, so you need more computational power just to perform searches over that same amount of data; we have better algorithms. And so the product of those things means that you essentially need lots of computers, and you always seem to need more of them than you have. So the goal, in terms of our systems infrastructure, is to create a set of tools and systems that make it easy for people to build products.

And we're heavily focused on price-performance. We don't care about the ultimate performance of a single machine, as I'll talk about in a minute. We want to make it easy for people to use lots and lots of machines. That will enable us to build better products: we can have larger indices, we can update them more often, we can have more responsive queries, we can have a faster development cycle, and so on. OK, so let me talk a little bit about the hardware that underlies our computing systems these days. The basic philosophy we have is that we have chosen to build on very low-cost commodity PCs. That's where you get the volumes of purchasing in the marketplace that drive costs down, so that you can buy insane amounts of disk storage for really cheap prices these days, and you can get really fast computers that sit under your desktop for not much money. It's at that price point that you get really good performance per dollar, and we just build lots of them. In part, that's because we don't really care how fast an individual machine is. If you buy the fastest machine today, it's going to cost you a significant premium over the one that is just a little bit slower. And we'd rather buy the one that's just a little bit slower, because most of our problems don't fit on a single machine anyway, so you already have to figure out how to partition things across multiple machines. And we have lots of inherent parallelism in most of our applications. So, for example, there's across-request parallelism: you're handling many requests per second from different users, and those are fairly easy to parallelize across different machines. And you also have within-request parallelism, where you can take a large index of billions of documents and partition it into many pieces of millions of documents each, so each machine deals with a smaller index on the order of a million documents. So there's parallelism both across different requests and within the same request. It's pretty easy to figure out how to, in our case, parallelize computations across different machines, which means you just want lots of performance per dollar.
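To make that within-request parallelism concrete, here is a minimal sketch of fanning a query out across index shards and merging the partial results. The shard contents, scores, and function names are invented for illustration; this is not Google's serving code.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "index shards": each shard maps a term to (doc_id, score) postings for
# its own slice of the document collection. Contents are illustrative only.
SHARDS = [
    {"scalability": [(101, 0.9), (102, 0.4)]},
    {"scalability": [(205, 0.7)], "google": [(206, 0.8)]},
    {"google": [(309, 0.6)]},
]

def search_shard(shard, term):
    """Within-request parallelism: each shard scans only its own slice."""
    return shard.get(term, [])

def search(term, top_k=3):
    # Fan the same request out to every shard in parallel, then merge.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda s: search_shard(s, term), SHARDS)
    merged = [hit for part in partials for hit in part]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)[:top_k]

if __name__ == "__main__":
    print(search("scalability"))   # e.g. [(101, 0.9), (205, 0.7), (102, 0.4)]
```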
The other thing is you could spend more on more reliable machines. Higher-end servers tend to have features like RAID disks, redundant power supplies, things like that, which are very nice features to have. But you pay a lot of money for those features, and at our scale, even machines that are ultra-reliable and have these features are going to fail anyway. So you already have to deal with this in software in some form; it's just a matter of how often you have to deal with it. And if I can buy twice as many machines versus buying half the number of more reliable machines, I'd much rather have twice as many machines, because the cheaper machines don't have half the reliability. OK, some gratuitous pictures. This is from when Larry and Sergey started the company, or really started the project; it was a research project at Stanford. And their advisers apparently wouldn't buy them any computers, so what they would do is go down to the loading dock at Stanford and volunteer to receive shipments of machines that other research groups had ordered, and volunteer to set them up, and then they would live on the float: they'd hold onto them for a little while longer and use them. A drawback of that is that I think they had 10 computers and nine different processor types and operating systems in there, which is a little more heterogeneous than you might like. But clearly their lessons played out well in designing the first versions of our machines. So when we were first starting out, we needed machines quickly, and we decided that we would manufacture our own machines, because it was too expensive to buy other ones. So we would buy motherboards and disk drives and assemble the parts. Each one of these trays has four motherboards, so four machines basically, and eight disks; each machine had two disks. There are four disks in the front, kind of laid on top of the motherboards with some wires to keep them off the motherboards, and then there's a row of four that are more neatly organized in the back. There are reset switches in the front. And they're all sitting on this tray, and to protect the motherboards from the tray, there's a thin layer of cork below the motherboards. So these were affectionately known as corkboards. They also had some cabling issues, so in our next design we decided we would omit the cork, and that we might be better served by putting all the connectors on the front of the machine rather than snaking them around to the back. So this was around 2000. Data centers were interesting in those days. Lots of people would buy high-end Sun machines and put them in cages in data centers.

And data centers charged by the square foot, which was an interesting pricing model. So our incentive, as far as we could tell from reading the contract, was to pack as many machines as we possibly could into those square feet. And they didn't actually charge you for power, which was kind of nice. So we sometimes had to help them out a little bit on their cooling; we bought a little fan at Target. Eventually, another skill we had to acquire was moving out of bankrupt data centers and into new ones, because the pricing model seems to have not worked for a lot of those data centers. So you get pretty good at building pretty large-scale clusters and deploying them rapidly. Basically, you can pre-wire all these things at the rack level; the racks are on wheels, you just kind of wheel them in, and then you hook up the inter-rack networking and away you go. This is kind of our current generation. It looks pretty similar. It's basically commodity PCs, usually with dual CPU chips, now with two cores per chip, so typically four processors per machine. It's got low-end hard drives. They run a version of Linux with a very tiny number of patches we found useful for our particular platform, and then a bunch of custom in-house software. One thing to point out is how our networks typically look: we have a cluster of perhaps thousands or tens of thousands of machines connected together, and the machines in a rack share a switch for that rack, and then that rack switch hooks up to a central switch for the cluster. And there's less bandwidth available than full bisection bandwidth for all the machines. So one bottleneck in our system is that talking outside of a rack is less efficient than talking inside a rack, and we do a bunch of software things to help mitigate that in some ways. Just to give you a flavor of the kinds of things that can happen in a cluster, I asked one of our ops people to put together this list of what kinds of things actually happen. This is in the first year of a new cluster; after that, they get a little bit more reliable than this, but this just gives you a flavor of some bad things that can happen. For some of these, you'll lose 40 machines for a few minutes; for some of these, you might lose several thousand machines for a while. You have lots of individual machine failures, and many more hard drive failures. That's the delightful platform we build on. OK, so what I'm going to talk about mostly in this talk is three pieces of infrastructure we've built over the past few years to allow us to use that computing platform in a reliable and reasonably efficient way. The first thing is, if you have a bunch of computers with disks, you'd probably like to be able to store stuff on them, and you'd like a distributed file system, so that you can have a centralized common namespace over all this data. One thing we decided early on was that our file system requirements were a little bit different from typical file system objectives. In particular, we want to have really large read and write bandwidth: we want thousands of clients to be able to talk to thousands of machines in a file system and get really good I/O bandwidth for reads and writes. It needs to be reliable at the cluster level of machines. We're mostly dealing with fairly large files. Most of our systems take small things like web pages and put a bunch of them in a file, so we have files that are many gigabytes, typically, rather than lots and lots of tiny little files.
And we need efficient distributed operation, which means we don't want to have a central bottleneck in the file system, as much as we can help it. One thing that really helps us is that we're able to link client-side code into our applications, to put some of the logic for dealing with the distributed nature of the file system into the client applications, rather than having it all work at the lowest level or having everything proxied through a server that we control. OK, so the basic idea of GFS is that we have a special machine called a master that deals with all the metadata: the file names, and keeping track of a mapping from file names to chunk locations.

At the master level, we break files into 64-megabyte chunks, which is a pretty large block size by file system standards. The actual chunks on disk use Linux files, so those are stored with 8K file system blocks or whatever it is, but at the master level, it keeps track of things at the 64-megabyte level. And then the actual data for files is spread across a bunch of what we call chunkserver processes, stored on local disks on the machines. Every chunk is typically replicated three times, on three different machines, and typically we try to spread out the chunks across different racks, so if you lose a rack, you don't lose all three copies of a particular chunk. So the master manages metadata. Clients talk to the master when they open a file; the master says yes, the file exists, and here are the three replicas for each of the six different chunks in the file, gives you 18 machine locations, and away you go. And then the clients talk directly to the chunkservers to read and write files. Occasionally they'll talk to a chunkserver and the chunkserver will say, I don't have that chunk anymore, go talk to the master again. And the master is also responsible for noticing when a machine dies, and then re-replicating any chunks that machine had, to bring them back up to the full desired level of replication. So that system's been pretty stable for a while, and is running on almost all our machines in various clusters. We have probably several hundred GFS clusters; some of them have upwards of 5,000 machines. You often get pretty large collections of clients talking to the file system, and you get pretty high bandwidth rates out of these file systems. As you're processing a large amount of data, you have maybe 10,000 clients talking to the chunkservers in that file system. And it all works in the presence of disks failing, and machines going down, and racks going down, and so on.
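As a rough illustration of the read path just described, here is a toy sketch in which a master holds only file-to-chunk metadata and clients then fetch chunk data directly from chunkservers. The 64 MB chunk size comes from the talk; the classes and in-memory "chunkservers" are stand-ins, not the real GFS interfaces.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as described in the talk

class Master:
    """Holds only metadata: file name -> list of (chunk_id, replica locations)."""
    def __init__(self):
        self.files = {}

    def lookup(self, name):
        return self.files[name]

class ChunkServer:
    """Stores chunk data on its local disk (a dict here, for the sketch)."""
    def __init__(self):
        self.chunks = {}

def read_file(master, chunkservers, name):
    data = b""
    for chunk_id, replicas in master.lookup(name):
        # The client talks directly to one of the (typically three) replicas;
        # only the metadata lookup went through the master.
        server = chunkservers[replicas[0]]
        data += server.chunks[chunk_id]
    return data

if __name__ == "__main__":
    servers = {i: ChunkServer() for i in range(3)}
    m = Master()
    # One small file with one chunk, replicated on servers 0, 1, and 2.
    for s in servers.values():
        s.chunks["c0"] = b"hello gfs"
    m.files["/logs/part-00000"] = [("c0", [0, 1, 2])]
    print(read_file(m, servers, "/logs/part-00000"))
```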
So now that we're able to store data, it's often useful to be able to compute over it. In the early days at Google, we would basically have some large dataset, maybe a bunch of documents we crawled, and then we would need to write some phase that, say, counted how often every word occurs. So we would take the files, write some code to partition the problem into a bunch of chunks of those files, and write some code to actually do the work of counting word frequencies, which is not that much code, and then we would have a bunch of code in there to deal with checkpointing the state of this computation, and what happens when a machine fails, and how do you recover? And so all the messy details of running computations on this slightly unreliable computing hardware tend to obscure the real computation you're trying to do, which is often something fairly simple, like counting how often words occur, or clustering documents by content checksum, or something like that. So MapReduce is a system we came up with after writing several of these phases, which abstracts away a lot of the messy details, allows you to express your computation in a particular programming style, and then the library deals with what happens when machines fail, and so on. OK, I've probably said most of this. The basic idea is you're going to have some input data, which you can think of as a set of key/value pairs, or input records of some form. In the crawled-pages example, the key is the URL of the page, and the value might be the contents of the page. In the map phase, you process those input records and produce some sort of intermediate key/value pairs that your computation is trying to extract or summarize from that input data. And then there's a reduce phase, where you can specify, for the same intermediate key, how you want to combine the different values that may have come from different records, or multiple values from the same record, into your final output data. So the user basically writes these two simple functions, map and reduce, and the underlying library takes care of the rest; the user also provides a little specification of what input data they're supposed to be processing, and so on. So let's look at an example. Here the key and value are a URL and the contents, and then the map function, since we're trying to count word frequencies, is just going to split the text at spaces and, for every word occurrence, emit a key/value pair of that word and 1. Simple enough. The reduce function in this case will be very simple: it gets invoked for each unique intermediate key, a word in this case, and it gets the sequence of values that the map function generated for that key. In this case, that's just the value 1 many times, one for each word occurrence, and it's going to add up all those things and emit a final count for that word.
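Here is roughly what that word-count example looks like when written against a toy, single-process stand-in for the MapReduce library. The function names and the little driver are simplified assumptions, not the real (C++) API, but the map and reduce bodies follow the example described above.

```python
from collections import defaultdict

def map_fn(url, contents):
    # Map: for every word occurrence in the page contents, emit (word, 1).
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: called once per unique intermediate key with all of its values.
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:                 # map phase
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    output = {}
    for ikey in sorted(intermediate):         # group by key, then reduce
        for okey, ovalue in reduce_fn(ikey, intermediate[ikey]):
            output[okey] = ovalue
    return output

if __name__ == "__main__":
    pages = [("http://a", "to be or not to be"), ("http://b", "to scale")]
    print(run_mapreduce(pages, map_fn, reduce_fn))
    # {'be': 2, 'not': 1, 'or': 1, 'scale': 1, 'to': 3}
```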

Then the library takes care of applying the reduce function to every unique intermediate key, and away you go. Now this is a really simple example, but you can do much more interesting things with it: you can produce inverted indices, you can do training for machine learning systems, various things. It turns out that you can express a fairly wide variety of problems in this map and reduce style. So some of the things the library does for you make your computation pretty fast. One of the things it does is try to place the computation for particular chunks of your data on the machines that have that data, or push it close to those machines. Remember I said we have limited cross-rack bandwidth? So it will actually push computation close to where the data is, because the computation is typically the size of a binary, and the data is typically much, much larger. And so you can often get thousands of machines reading their data conceptually off of the distributed file system, but really it's just reading off the local chunkserver in most cases. We can spend a lot of effort on the sorting algorithms and the sorting library inside MapReduce, because you can tune that a lot, and then everyone who uses this library benefits. The system deals with machine failures, and I'll talk about that in a minute. It can deal with bad records: if you are using some third-party library that crashes on some random record in some deterministic way, you can effectively set it up so that, after it's tried that record a few times, it skips that record so your computation will actually complete, if that's what you want. Or you can say, I don't want to skip records; you have your choice. It's pretty easy to use. It deals with what happens when you want your computation to go faster: you can generally just add more machines and run things with wider degrees of parallelism. And it'll also give you some monitoring and central status pages that you can look at as your computation is progressing; that's kind of standard across all these different kinds of jobs we're trying to write. So it's being used basically all over Google. I have a graph later about how many different MapReduce programs there are, but it's basically a batch-oriented computational model that's proved pretty useful. As some testament to this, if you look at the number of different MapReduce programs in the Google source tree over time, those early numbers are from when we were basically trying to rewrite our production indexing system, which had a sequence of maybe eight or ten different phases to take raw documents on disk and ultimately end up with a final inverted index and other data structures for serving. That was when we first started using MapReduce. We didn't think it would actually be useful for things other than our indexing system, but people found that it's actually pretty easy to use and takes care of a lot of the problems they were finding in writing other kinds of computations.
If you take the derivative of that graph, this is the number of new MapReduce programs per month. As some testament to it actually being easy to use, every summer we have a bunch of summer interns come in, and most of them don't have any experience writing distributed or parallel computations, but they seem to be able to check MapReduces into our source tree. I don't know if they work, but they do get checked in, and they seem to run a lot of jobs. So here are just some stats about how much it's being used these days. The typical MapReduce finishes in 10 to 15 minutes, but it uses several hundred machines, and we run 3,000 of these computations per day. We have a bunch of worker deaths per job, and that's fine; most of those are caused by some job that repeatedly crashes over and over, not so much that every job has five worker deaths. Someone also pointed out that worker deaths sound bad when Google is hiring so fast, but I assure you these are all machines.

So, a little bit about how the computation is actually staged. Basically, the MapReduce program has two kinds of workers. There's a master, which is responsible for coordinating the activity, dealing with what happens when machines fail, and knowing which workers have done which pieces of work and which tasks. So it basically takes the input data that's provided by the user and breaks it into a bunch of tasks. Those tasks are then parceled out to free workers: it says, oh, I have a free worker, I'll tell it to do map task 12. Typically you want to have many more map tasks than you have worker machines, because that gives you two things. One is better load balancing: if one map task turns out to be slow, then the other machines just do a little bit more work, and the slow guy churns away for a little bit longer on one of the map tasks. It also makes recovery a lot faster: if each worker is doing 100 tasks and one worker dies, then 100 other machines can each pick up one piece of work and recover very quickly from that machine dying. The other thing we allow the user to do is partition the intermediate keys, typically with a very simple function that fingerprints the key and takes it mod the number of reduce tasks, so that you can apply the reduce function in parallel. So you end up spreading the intermediate keys, and the reduce computation, over a bunch of machines. And then you end up having to do the shuffle. The master keeps track of which reduce workers have to talk to which map workers to get intermediate data from those machines. The intermediate data is written to local disk on the map workers and basically buffered there, and then it's transferred across the network just by doing RPCs. We can't apply the reduce function until we have finished all the map tasks, because the guarantee we make for the client is that when you process an intermediate key, you're going to see all the values for that intermediate key. So that's basically how it works: once we have all the data and we've grouped it together, which we do by sorting, then you apply the user's reduce function to each unique key, and away you go. So, the status pages I alluded to: beautiful pieces of red and green. The green indicates map tasks that are currently in progress. This is when a job is just starting up: you see in the upper left, we have 323 workers, no deaths so far, hooray. We've split things into 13,000-ish map tasks, those are called shards, and we've started working on 323 of them. Our total input is a little bit shy of a terabyte, and we've done about a gigabyte so far. And we've decided to partition the reduce function across 500 different machines, so we have 500 reduce partitions in this case. This is actually some phase of our indexing pipeline. As map tasks complete, we start shuffling the intermediate data from the completed map tasks, while those same workers are doing other map tasks that are in progress, so you basically pipeline the shuffling of the data with the computation of the map tasks. And now you see we have more workers, 1,707 workers; we've had a death, but it doesn't really matter, that's fine. Eventually we get to 100% of the map tasks complete, and all the shuffling is done; the red is the shuffling. And now we're starting to apply the user's reduce function, which is the blue, and that proceeds in parallel across the 500 machines we've asked for in our reduce partitioning.
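The partitioning step mentioned above, fingerprint the intermediate key and take it mod the number of reduce tasks, might be sketched like this. The hash function choice is arbitrary, and the 500 reduce partitions just echo the status-page example.

```python
import hashlib

NUM_REDUCE_TASKS = 500  # the 500 reduce partitions from the status-page example

def partition(key, num_reduce_tasks=NUM_REDUCE_TASKS):
    # Fingerprint the intermediate key and take it mod R, so all values for
    # the same key end up at the same reduce worker.
    fingerprint = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return fingerprint % num_reduce_tasks

if __name__ == "__main__":
    for word in ["the", "scalability", "eclipse"]:
        print(word, "->", partition(word))
```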
And we get close to the end, and now all but two of them are done, then all but one of them is done, and finally they're all done. One thing you notice is that the stragglers are kind of a problem. Sometimes you end up with a slow machine where the slowness isn't due to a data dependency in the actual work it's doing; it's just slow for some other reason. It could be that lots of other jobs are running on that machine, so you're getting less of the CPU. It could be that it has a bad local disk, so instead of reading at 20 megabits a second, it's reading at one megabit a second, because the disk controller keeps retrying things. There are all kinds of reasons. We've actually had all kinds of weird things. In one of our first platforms that used hyperthreaded processors, the BIOS manufacturer hadn't really thought about what would happen when two chips were reading and writing processor registers at the same time.

And so they had a race condition, where it would read the processor status register, doodle some bits, and then write it back. And you'd end up with, 4% of the time you rebooted a machine, the processor caches disabled, because it would stomp the bit that says please enable the processor caches. That was kind of annoying, because you'd reboot 100 machines and four of them would come up slow; you'd reboot them again, and four other ones would come up slow. A machine without caches is a working machine, but it's 30 times as slow. Very annoying.

AUDIENCE: Do you mix the jobs? Do you run one job to conclusion? Or do you overlap the jobs, so stragglers are filling in the front of the next one?

JEFF DEAN: So if you have a dependent MapReduce, we typically would run the first one to completion and then run the next one. But we are interleaving lots of independent computations on the same cluster. So there might be hundreds of users running different MapReduces on the same cluster, random jobs taking network bandwidth and taking CPU from you, so it's pretty important to deal with stragglers. So, I mentioned some of the locality things, and the shuffle stage gets pipelined. One thing we do to deal with stragglers is, towards the end of the computation, we start off multiple copies of the last few map tasks, or the last few reduce tasks, and whichever one finishes first wins. That actually brings down the job completion time tremendously, because the backup copy will typically get scheduled on a machine that's not as loaded as the slow guy, and it will be able to complete things more quickly. We also compress intermediate data, because our environment is more CPU-rich than network-rich, so it makes sense for us to do fairly lightweight compression on the intermediate data, just to avoid the inter-rack transfers. It's proven to be pretty useful. We have a paper about it with a lot more details, if you're interested; just search for MapReduce and you'll be able to find it.
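The straggler mitigation just described, starting duplicate copies of the last few tasks and taking whichever finishes first, can be sketched with ordinary futures. The task bodies and timings are made up; a real implementation would be scheduled by the MapReduce master, not a local thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_task(delay):
    # Stand-in for a map or reduce task; 'delay' models a slow machine.
    time.sleep(delay)
    return f"done after {delay}s"

def run_with_backup(primary_delay, backup_delay):
    # Near the end of the job, start a backup copy of the same task and
    # take whichever attempt completes first.
    with ThreadPoolExecutor(max_workers=2) as pool:
        attempts = [pool.submit(run_task, primary_delay),
                    pool.submit(run_task, backup_delay)]
        finished, _ = wait(attempts, return_when=FIRST_COMPLETED)
        # (The sketch still waits for the slow attempt to exit before the
        # pool shuts down; the result used is the fast one's.)
        return next(iter(finished)).result()

if __name__ == "__main__":
    # The primary sits on a slow machine (bad disk, shared CPU, ...);
    # the backup lands on a lightly loaded one and wins.
    print(run_with_backup(primary_delay=2.0, backup_delay=0.2))
```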
Now, the third half of the talk. Over time, we found that a lot of applications wanted an interface to storage that was a little bit higher level than just a raw file system. They wanted to be able to process lots of different kinds of structured or semi-structured data, or maybe different pieces of data would become available at different times, but they were all related by some key. In our crawling system, for example, you have URLs as a natural key to tie everything together. And then you have various kinds of data: you might have some small metadata saying when I last crawled this URL, and you might have other things like the actual contents of the page, the last time you crawled it or the last few times you crawled it. And then you have other things that are being computed asynchronously, where you're extracting links from these pages, or maybe you're running a PageRank computation over the graph structure you've extracted from all these pages, and you want to update the PageRank value for this page. So these are all tied together with the URL. And we have other systems that have natural keys for organizing data. For example, for per-user data you have user preferences, and you want to be able to keep track of recent queries done by this user so you can show them in their search history, that kind of thing. And geographic data tends to have a natural organizational point, where you want to organize around a particular region of the earth's surface, and you have satellite imagery, you have vector map data, you have maybe user annotations that you've allowed users to make about different points of the earth, and so on. So we really need something that looks kind of like a database, where we need to scale to really large amounts of data: we have hundreds of terabytes of raw web content, we have lots of satellite imagery data, and so on. So we want something that has a structured or semi-structured API, kind of like a database. You could use a database like Oracle or something. The problem is the scale's really large; even if you could buy something, it would be really expensive, and it would solve the problem only for that particular application. And we have lots of applications like this, so the next time you wanted to solve it, not for URLs but for satellite imagery, you're going to have to go spend a lot more money. Also, by building it ourselves we can integrate it with our file system and have a little bit tighter integration of how the system deals with compressing data and how it stores data on disk, and get some nice advantages from that. So we basically decided to build something like this ourselves. We decided we didn't really need full database functionality, so we don't support joins, and we don't support full SQL queries or anything like that. We have a fairly simple API that allows you to get at data with the following model.

It's basically a multi-level map that I'll describe in a minute. It's designed to be fault-tolerant and persistent: once you've written data into the system, it stays there. It's scalable, so we have systems with several thousand servers serving a set of tables in what we call a BigTable cell, and a lot of those cells support pretty high volumes of reads and writes. We initially did it more for batch-style things, like our crawling system, but more recently it's been used a lot in user-facing applications, where latency is a much bigger concern, where you want an operation to finish in 10 milliseconds and you need that to happen. It's important to us that it be self-managing: it deals with machine failures, of course, but you also want to be able to add another 500 machines to the cell and have it take advantage of the extra capacity those machines offer and load-balance across the available machines it has. The basic data model is that you have rows and columns; think of it as kind of a really big spreadsheet. A lot of our applications also wanted to be able to look at multiple values across time. In our crawling system, it's useful to keep several versions of the crawled contents, so you can look at how much a page is changing from one day to the next. Is it something that's completely static? Is it changing only a little bit? Is it changing a lot? So we actually allow a third dimension of time, where you can set up a particular column to keep all the versions of data that you've written, or one version, or you can say I want to keep all the versions from the last two weeks, that kind of thing. And it turns out that, if you squint at this abstraction right, a lot of our applications can make use of it, because it's a pretty generic abstraction. So the way we actually take this and distribute it across lots of machines is that you can think of the table as a sorted sequence of rows. We actually think it's important that users be able to get at sorted sequences of rows, rather than just a random ordering like you'd get in a hash table or something. We break these tables into what we call tablets, which are just contiguous regions of rows that are roughly a few hundred megabytes in size, and we have each serving machine be responsible for on the order of 100 tablets. For the same reasons that we want a map worker to deal with many different map tasks, it helps recovery: if one machine fails, you can quickly have each of 100 other machines pick up responsibility for one of those 100 tablets, and you recover pretty quickly when a machine fails. It also gives you finer-granularity load balancing, so if you notice a machine is overloaded, you can move one tablet at a time away from it until the load imbalance goes away. So initially we have tablets; we have two tablets in this table at the moment. Eventually that bottom tablet gets enough data that we decide to split it, so we basically pick a row that's roughly in the middle of the tablet, which splits the amount of data roughly in half, and then we partition that into two separate tablets that are now independent. And we can move one of them away from this machine, put it on some other machine, and so on.
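A minimal sketch of that multi-level map, (row, column, timestamp) to value, with rows kept in sorted order so contiguous ranges can be scanned and split into tablets. The class and method names are invented, not the BigTable API.

```python
from bisect import insort

class ToyTable:
    """(row, column, timestamp) -> value, with row keys kept sorted."""
    def __init__(self):
        self.rows = {}          # row key -> {column -> [(timestamp, value), ...]}
        self.sorted_keys = []   # sorted row keys, so contiguous ranges can be scanned

    def write(self, row, column, timestamp, value):
        if row not in self.rows:
            self.rows[row] = {}
            insort(self.sorted_keys, row)
        self.rows[row].setdefault(column, []).append((timestamp, value))

    def read(self, row, column, num_versions=1):
        # Return the most recent versions first, like asking for the last N crawls.
        versions = sorted(self.rows.get(row, {}).get(column, []), reverse=True)
        return versions[:num_versions]

    def scan(self, start_row, end_row):
        # Sorted row order is what makes range scans (and tablet splits) natural.
        for key in self.sorted_keys:
            if start_row <= key < end_row:
                yield key, self.rows[key]

if __name__ == "__main__":
    t = ToyTable()
    t.write("com.example/index.html", "contents", 1, "<html>v1</html>")
    t.write("com.example/index.html", "contents", 2, "<html>v2</html>")
    t.write("com.example/index.html", "pagerank", 2, 0.42)
    print(t.read("com.example/index.html", "contents", num_versions=2))
```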
We also do merges, which are more complicated, because you have to stage things a little bit. In the case of a split, the thing you're trying to split is one tablet, and that's on one machine. In the case of a merge, the two tablets are independent entities, and you have to pre-stage things to get them onto the same machine before you can glom them together. OK. The system structure we have, like a lot of our systems, is that we have a master that is responsible for metadata operations and load balancing, and we have tablet servers that serve data. And then beneath that, we build on top of a lot of the other infrastructure, some of which I've talked about, like GFS. We have a cluster scheduling system, where you can insert jobs into the cluster and say, I want to run 100 tasks on different machines for my job, and it will take care of allocating resources to you and handling failover: when a machine fails, it will restart that task on a different machine. We obviously store the underlying state for the tables in GFS. And we have a distributed lock service that's highly available and reliable, so that, for example, we can have hot spares for the master. If the master goes down, we bring another one up very quickly. We actually start two processes, and they each try to grab a lock in this lock service to say, I'm the master. One of them will get it, and the other one will basically continuously try to acquire that lock, so if the first master fails, the other one takes over fairly quickly, as soon as it can grab the lock.
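Here is a toy sketch of that master-election pattern: two candidate processes race to acquire a lock, and the loser keeps retrying so it can take over if the holder dies. The lock service here is a stand-in (a single in-process lock), so it ignores real-world details like lock leases and expiry.

```python
import threading
import time

class ToyLockService:
    """Stand-in for the highly available distributed lock service."""
    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None

    def try_acquire(self, name):
        if self._lock.acquire(blocking=False):
            self.holder = name
            return True
        return False

def master_candidate(name, lock_service, stop):
    # Both candidates run this loop; whoever grabs the lock acts as master,
    # and the other keeps retrying so it can take over if the holder dies.
    while not stop.is_set():
        if lock_service.holder != name:
            lock_service.try_acquire(name)
        # If we hold the lock, this is where the master would serve metadata
        # operations; otherwise we just retry shortly.
        time.sleep(0.05)

if __name__ == "__main__":
    svc, stop = ToyLockService(), threading.Event()
    for name in ("master-a", "master-b"):
        threading.Thread(target=master_candidate, args=(name, svc, stop),
                         daemon=True).start()
    time.sleep(0.5)
    print("acting master:", svc.holder)
    stop.set()
```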

And then we have a client library; again, we can link that into applications. When the client library needs to open a table, the metadata for each table is stored in the lock service, and then it talks directly to the tablet servers to read and write data. Occasionally it talks to the master, but unlike GFS, most of the metadata operations, and most of the location information for finding where a particular piece of data is, are actually handled by the tablet servers and spread out over all the tablet servers. So you only need to talk to the master if you want to create a new table or something like that, which is pretty rare. So the state in a tablet is, basically, a mutation log of mutations that have been applied to this tablet: when a write comes in, we append it to that log, and then we buffer that write in memory, so that we have efficient access to it. When a read comes in, we look in our in-memory buffer, and we also maybe look in some compacted representations of the log that we've stored on disk, which are organized and sorted by key; sometimes you need to look in several of those. You can also specify that some columns are mapped in memory, so you might have a bunch of files mapped in memory to represent things. As I said, there are two kinds of compactions. One is, you've buffered up a bunch of updates in memory, and now your memory is getting full, so you take the in-memory state for this tablet, write it out into an immutable on-disk form, and then you can flush your memory and serve that state off of those disk files instead of out of memory. The other is, eventually you build up too many of these files and you need to reduce the number of files, so you pick a bunch of them, merge them all together, and produce one file for the tablet that represents the state up to a certain point in the log. So you always have the state as of the last major compaction, then some number of other files that are minor compactions, and then the piece of the log that has not yet been compacted. That's kind of the state of the tablet. We also allow the ability to segregate some columns from other columns in the on-disk representation, under user control. The thing that's very useful for is when you have some columns that are very small and you want to iterate over them independently of all the other ones. The contents column here is quite large; it's basically all the contents of all the web pages we've crawled. And sometimes we might want to iterate just over the page-length and language values for the pages, which are a few bytes each. So we allow segregation of columns into what we call locality groups, and you can say this one should go in this locality group, and that one should go in another locality group. Then if you need to scan only the language and page-length columns, your I/O is proportional to the data in those columns rather than all the columns.
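A small sketch of that tablet write and read path, with invented class names: writes are appended to a log and buffered in memory, a minor compaction flushes the buffer to an immutable file when memory fills up, and reads consult the memory buffer first and then the on-disk files, newest first.

```python
class ToyTablet:
    """Sketch of the tablet state described above: a mutation log, an in-memory
    buffer, and a stack of immutable files produced by compactions."""
    MEMTABLE_LIMIT = 4  # flush after this many buffered writes (tiny, for the demo)

    def __init__(self):
        self.log = []         # append-only mutation log (would live in GFS)
        self.memtable = {}    # recently written values, served from memory
        self.disk_files = []  # each dict stands in for a sorted, immutable on-disk file

    def write(self, key, value):
        self.log.append((key, value))       # 1. append to the log
        self.memtable[key] = value          # 2. buffer in memory
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self.minor_compaction()

    def minor_compaction(self):
        # Memory is "full": write the buffered state out as an immutable file.
        self.disk_files.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        # Look in memory first, then in the compacted files, newest first.
        if key in self.memtable:
            return self.memtable[key]
        for f in reversed(self.disk_files):
            if key in f:
                return f[key]
        return None

if __name__ == "__main__":
    t = ToyTablet()
    for i in range(6):
        t.write(f"row{i}", f"value{i}")
    print(t.read("row1"), "found among", len(t.disk_files), "on-disk file(s)")
```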
So one tricky thing in the system is, how do you actually locate a particular piece of data? What we actually do is store tables that are themselves BigTable tables, and those tables point to the locations of the tablets in the other tables on other machines. There's a bootstrapping table that we store in our lock service, which is a pointer to the META0 table. That META0 table has one row for every tablet in the META1 table, and every row in the META1 table points to the actual location of a tablet in a real user table. So if I need to find a row in the user table on the right, I go to the META1 table and scan forward to find the right row for that tablet, and the entry in the META1 table will point me at that location. If I don't actually know where that META1 tablet is, then I go back to the META0 table and find out where it is, and if I don't have the META0 location, then I read that out of the bootstrapping table. It seems to work pretty well, and if you apply a little bit of prefetching and caching, then you generally just need to go directly to the machine on the right, and sometimes do a lookup in the META1 table; most of the other stuff is cached completely.
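That three-level lookup might be sketched like this; the table contents, server names, and end-row keys are all made up, and a real client would cache and prefetch these locations rather than walking the hierarchy on every request.

```python
# Toy three-level location lookup: a bootstrap pointer (kept in the lock
# service) leads to META0, META0 rows point at META1 tablets, and META1 rows
# point at the tablet server holding the user row. All values are invented.
META0 = {("META1", "~"): "serverB"}                 # one row per META1 tablet
META1 = {("usertable", "m"): "serverC",             # tablet covering rows up to "m..."
         ("usertable", "~"): "serverD"}             # tablet covering the remaining rows

def find_tablet(metadata, table, row):
    # Pick the first tablet (in sorted order) whose end-row key is >= the row.
    for (tbl, end_row), server in sorted(metadata.items()):
        if tbl == table and row <= end_row:
            return server
    raise KeyError((table, row))

def locate(table, row):
    meta1_server = find_tablet(META0, "META1", row)  # step 1: META0 -> META1 tablet
    user_server = find_tablet(META1, table, row)     # step 2: META1 -> user tablet
    return {"meta1 tablet on": meta1_server, "user tablet on": user_server}

if __name__ == "__main__":
    print(locate("usertable", "kangaroo"))   # user tablet on serverC
    print(locate("usertable", "zebra"))      # user tablet on serverD
```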

So we now have about 500 of these cells, where a cell is a master and a bunch of tablet servers. The biggest one is managing three petabytes of data, and it's in active use by a lot of projects. Most projects these days are building on top of something like BigTable rather than building directly on raw GFS files. So, in terms of what I think our challenges are, in terms of where we want to take our infrastructure these days: I think a lot of these tools work pretty well at the single-cluster level, and we're pretty happy with those. Where we have issues is that we have lots and lots of clusters distributed around the world, and we don't really have a single system that deals with data in all those clusters, or computation in all those clusters. So one thing that would be really nice is to have a single global namespace for all of our data. Right now, we have different GFS cells, and those are separate namespaces, so if you copy files from one GFS cell to another, there's no automatic system that knows that this data was originally copied from that data and is another source you could read it from. We really need some way of keeping track of that kind of thing, so that you can, for example, more automatically make more replicas of data across different clusters. And we'd like to, in a more automated fashion, migrate data and computation across clusters; we sort of do that migration today within a cluster, but not very effectively across clusters. Once you have that, you have lots of consistency issues, which are mostly tied into wide-area replication and network partitions: this cluster is now offline for a few hours for maintenance or something, or it's partitioned but I'm still getting requests on both sides of the partition. In a lot of cases, I need to be able to do something reasonable, to continue operating in some limited mode, on both sides of that partition. I can't reasonably say to all the requests that show up on the non-quorum side of the partition, well, sorry, you can't do anything. It would be better, for example, to show users their email but maybe say that some email messages may not have shown up, rather than completely not showing them anything. So basically, we're looking at how we can build systems that span multiple clusters and are much larger scale than what we've built today. I'll briefly talk about the kinds of things you can do given this infrastructure, because a lot of what I've talked about is the stuff that sits underneath, but the end goal is to build interesting products and interesting features of products. I mean, it's actually a pretty nice environment, because we have built this infrastructure, and it's pretty easy to write a random MapReduce; interns have a lot of fun writing MapReduces and doing interesting computations, and we have a lot of interesting data. So one of the things I think we believe pretty strongly in is that the more data you have, the better you can make your systems.
This is actually an example of that. The top one is the correct spelling of Britney Spears; all the other ones are misspellings of Britney Spears that were detected by our spelling-correction system to be misspellings and corrected to the correct spelling. And you see there's a very long tail of potential misspellings. Our spelling-correction system is trained on a collection of documents and queries, and the more data you have, the better the system is going to work, because you're going to see more examples of misspellings and things that have been corrected, and so on. Actually, there's one here, briny spears, that may refer to pickles; I'm not certain. That may be a problem. The other thing you can do is put together little demos of interesting things, so this is looking at query frequency. It's a little bit of old data taken from an old system: this is the beginning of 2002 to the middle of 2003, so it spans 18 months, and it shows the frequency of a variety of queries to Google.com. So you see, for example, the eclipse case: there's always some number of queries containing eclipse, because Mitsubishi makes a car called the Eclipse. But occasionally you see a big blip in eclipse queries, and if you look at the frequency of individual queries on those days, you can figure out whether it's a lunar or solar eclipse, which is kind of cool. Every 28 days, regular as clockwork, we seem to get a big spike in queries for full moon, which is kind of cool. You wouldn't have expected that, but there it is.

You may not have expected watermelon to exhibit seasonal trends, but it does. It's kind of low in the winter, and there's a big spike in the middle of summer. Any guesses? Fourth of July, right; it's a popular picnic food, and you're trying to figure out what to do with all those watermelons you've gotten. I don't know if it's actually the 4th or the 5th of July; probably the 4th. World Series: you expect a big spike in October for the baseball World Series, but there are actually two other world series that happen, the Little League World Series and the College World Series. This is kind of a toy, and it's kind of fun to play with. There's actually a public version of this called trends.google.com, where you can do your own queries; we basically make this data available, as long as there are sufficient unique users to not have any privacy issues. The data on the upper right would be useful, perhaps, as a signal to our ranking algorithms: if someone does a query in July for World Series, it might be more useful to show them College World Series pages than Major League Baseball October World Series pages. So even though this is kind of an interesting toy, there's lots of interesting data that you can imagine using for things like improving our search quality and improving ranking. Summer Olympics is kind of interesting, because this was a Winter Olympics year; there was no Summer Olympics that year. This is just people pining for gymnastics instead of ice skating or something. And you also see interesting things that happen when a new word enters the lexicon. Opteron is an AMD processor that was introduced a few years ago. There were no queries for it before; then they announced that they were going to be building this processor, and people did a few queries; nothing much happened until they actually released it; and after they released it, query volume dropped back down to a level much higher than before. And those are potentially interesting things. So I'm going to talk about one more kind of application that sits on top of this infrastructure, which is machine translation. This is translation of one human language to another: Arabic to English, English to Arabic, Chinese to English, and so on. Most translation systems work as a set of rules that have been handcrafted by people over many, many years. In the early 90s, there was a group at IBM Research that looked at what would happen if you trained the system on lots of data where you had one sentence and a translated version of the same sentence in the other language. They basically just looked at probability distributions: if the word hello occurs, I have a high probability of seeing bonjour in French, for example. Basically, if you have enough training data, you can build a probabilistic model of words and phrases that tells you how to translate text without actually having any handcrafted rules. So the more data you have, the better this will work; if you don't have much data at all, it doesn't work very well. The basic idea is to build a model that, given the source-language sentence, tells you the probability of all possible target-language sentences, and you pick the one with the highest probability. Clearly, that's a very large search space, so you do a lot of pruning along the way to help guide the search. And so for training, you basically have some amount of what we call parallel aligned corpora. One big source of these is United Nations documents, which are translated into six languages, literally sentence by sentence.
It tends to give your translation system a little bit of a United Nations bureaucrat tinge, so you'll want to use that data plus all the other data you can get, and hope that it washes out. But the basic idea is that you find these parallel corpora and then build these translation models. One thing that really helps is that, in addition to these aligned corpora, which you generally can't get as much of as you would like, you can clean up a lot of the translations by having a very large language model of how often every five-word sequence in the target language occurs, for example. So say you're trying to translate Chinese to English: you have some amount of Chinese-English parallel data that you've trained your translation model on, but another thing that helps is to have an enormous language model of English word and phrase frequencies, built from all the English documents you can get. If you take 10 billion English documents and process them, and then you have two candidate translations for a sentence, and one of them has a five-word phrase that has occurred 43 times while the other has a phrase that has never occurred, you're probably more likely to prefer the one that's occurred 43 times.
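To illustrate the idea, here is a toy language-model scorer that prefers the candidate whose five-word phrases have actually been seen in a monolingual corpus. The corpus, counts, and scoring are fabricated and far simpler than a real system, which would use smoothed probabilities over vastly more data.

```python
from collections import Counter

def ngrams(words, n=5):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Stand-in for the enormous English language model: counts of five-word
# phrases harvested from a large monolingual corpus (tiny and made up here).
corpus = "the cat sat on the mat " * 43
LANGUAGE_MODEL = Counter(ngrams(corpus.split()))

def score(sentence):
    # Prefer candidates whose five-word phrases have actually been seen;
    # a real system would use smoothed log-probabilities, not raw counts.
    return sum(LANGUAGE_MODEL[g] for g in ngrams(sentence.split()))

if __name__ == "__main__":
    candidates = ["the cat sat on the mat",
                  "the cat seated itself upon mat"]
    print(max(candidates, key=score))   # picks the fluent, frequently seen phrasing
```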

It kind of makes the translations flow a lot more naturally. And then you actually just need to look up lots of probabilities. I don't know much about machine translation, but I do know how to look up data in a large distributed system, so I worked with our machine translation folks. They need to do about a million lookups per sentence against this state, and the state is hundreds of gigabytes, so it's kind of a fun problem. And you see that if you have more data, your translation quality is better. This is a contest run by the National Institute of Standards, and it measures the percentage overlap of your translations with a set of four human translations of the same documents. At the high levels here, the output is actually reasonably readable; you can definitely get the gist of what the document's about. It's not a perfect translation, but even if you compared one human translator against another, they would only get something like 80%, because human translations don't have 100% overlap either. The important thing to notice is that the more data you have, the better your system works. You can train the system on less and less data to see what the effect of having more data is, and every doubling in the amount of training data basically gives you a little bit higher score, and that trend seems to bear out as far as we can see. So that's kind of cool. One of the things that I think having the right infrastructure helps with is that, if your infrastructure solves a lot of annoying problems that come up repeatedly and would otherwise need to be solved by lots of independent groups, it allows those groups to be more productive in building real products, which is what you really care about, and improving existing products. So we tend to have people work in pretty small teams. Some of the teams build interesting infrastructure, and other ones build interesting products on top of that infrastructure. And we let people play around with different things: the thing that showed query frequencies over time is something someone played with just as a side demo, but it's actually useful data in some cases for search quality, all kinds of things. One of the things that I think is interesting about Google is that we have a pretty broad range of problems. The set of problems we work on spans very low-level things like hardware and networking: how can we build large clusters of machines, how can we network them and connect them together with better bisection bandwidth, how can we design data centers to be more efficient for cooling; all the way up through how you can build interesting distributed systems on top of this slightly unreliable hardware, and how you can build good data structures and algorithms for a variety of problems, information retrieval, machine learning, statistics, and so on; and then all the way up to user interfaces and product design. All those kinds of things are integrated, and a lot of the teams span many of these different problems, which is kind of fun. So we've written a few papers about some of these systems that I'll just point you at; they go into a lot more detail about some of the inner workings. And I'll take questions. Yeah?

AUDIENCE: I was wondering [INAUDIBLE] as something that BigTable had guaranteed [INAUDIBLE]?

JEFF DEAN: Yes, so one of the things we've moved to...

AUDIENCE: Could you repeat the question?
JEFF DEAN: So the question is whether BigTable gives you any SLAs or latency guarantees for different kinds of operations. One of the things we've moved to is having groups of people manage collections of BigTable cells, so we have what we call service cells, rather than having each individual group run their own BigTable cell. And as part of the service agreement, you can say, I have this much data and I need this kind of access to it: it needs to be three nines, less than 10 milliseconds, and you can basically specify that. OK? OK, I think we don't have time for any more questions, because I've babbled on too long, but I'll be around at the breaks and stuff. OK? Thank you.