Spanner Internals Part 2: Global Meta-Data and Scalable Data Backend (Cloud Next '19)

[MUSIC PLAYING] DEEPTI SRIVASTAVA: Hello, and welcome to “Spanner Internals Part 2.” This is the second talk; we had a talk yesterday called “Spanner Internals Part 1.” That one was about how Spanner works internally, how it has evolved as a system over time inside Google, and how we then externalized it into the Cloud. Today, we're going to focus on one of the critical systems built on top of Spanner internally, which is Google Cloud Storage. But before that, let's start with: what is Cloud Spanner?

My name is Deepti Srivastava. I am the product manager for Cloud Spanner, which is our managed database offering on Google Cloud Platform. We talk about it as the only enterprise-grade, globally distributed, and strongly consistent database service built specifically for the Cloud. What does that really mean? It means this database service combines the benefits of relational semantics with the horizontal scale traditionally associated with non-relational, NoSQL systems. So we have schemas, we have ACID transactions, we have SQL, and we have the ease of scalability and high availability, such as four nines of availability for our regional configurations and five nines of availability for our multi-region configurations. All of this is packaged up in a fully managed database service that runs on Google's Cloud infrastructure.

As I said, the special thing about Spanner, which no other database system in the world currently has, is the ability to do ACID transactions with strong consistency even across WAN links, across data centers, and across regions. No other database in the world today can provide strong consistency on transactions once you start placing replicas in more than one data center, separated from each other by wide-area network links. That specialty allows our customers using Cloud Spanner to build interesting, globally distributed applications with read-after-write semantics, and it also allows our internal services to build some very exciting and differentiated services on top of the internal Spanner.

So we talk about internal Spanner and external Spanner, but what are the real differences? Spanner, internally or externally, is actually the same infrastructure. It's the same binary running, whether it's serving external or internal traffic. The only difference is the access paths: externally, when you're talking to Cloud Spanner, the access path is through the Google Cloud Platform APIs, with all of the associated security, whereas internally we have our own access paths. And we keep the data separated for obvious reasons; we don't want to mix external customer data with internal data, so all of the security, privacy, and compliance aspects apply. But in terms of, are you giving me a slightly different version?
We're not. We're giving you the same Spanner that we have internally for our critical services. The other differences are in terminology, and my colleague will come up and talk about some of these. But basically, externally we have Cloud Spanner nodes, which we call Spanner servers internally. And externally we have instance configurations, which describe how replication is done, basically a replication topology, and these can be regional configurations or multi-region configurations. For those familiar with Spanner, Spanner is inherently replicated: we have three replicas for regional configurations and five, seven, or nine replicas for multi-region, depending on which configuration you've chosen. These are simply called Spanner configurations internally. So as I said, essentially it's the same differentiated technology that we've used internally for seven, eight, nine years, which is why we feel so confident about hosting our customers on it in the Cloud: it's battle-tested, it's hardened, and most of Google's critical services at this point run on Spanner.
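As a concrete illustration of those terms, here is a minimal sketch, using the Cloud Spanner client library for Python, of creating an instance with a regional instance configuration; the project name, instance name, and node count are hypothetical placeholders, not values from the talk.

```python
from google.cloud import spanner

# Hypothetical project and instance names, for illustration only.
client = spanner.Client(project="my-project")

# The instance configuration names the replication topology: a regional
# config such as "regional-us-central1" keeps the three replicas inside one
# region, while a multi-region config (e.g. "nam3") spreads replicas across
# regions. Nodes here correspond to what are called Spanner servers internally.
instance = client.instance(
    "example-instance",
    configuration_name="projects/my-project/instanceConfigs/regional-us-central1",
    display_name="Example regional instance",
    node_count=3,
)
operation = instance.create()
operation.result(timeout=300)  # wait for instance creation to finish
```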

One of those critical services is Google Cloud Storage, GCS, which is a very popular service built on top of Spanner. And because of Spanner's strong external consistency, as we call it, GCS is able to provide a very differentiated set of capabilities to its customers. To tell you more about that, I'd love to invite my colleague Denis. Denis, why don't you come up?

DENIS SERENYI: All right. Thanks, Deepti. A little bit about myself: I'm the tech lead for the metadata layer of Google Cloud Storage. Hopefully you are already somewhat familiar with Google Cloud Storage, or maybe have even seen one of our other talks, but I'm going to give you an overview of what GCS is all about and then go into depth on how GCS takes advantage of Spanner.

GCS is object storage made simple. It's a central repository for all your organization's data, putting it under a single, consistent, unified API. GCS can store a wide variety of data, whether it's content-serving data, analytics data that requires high bandwidth, warm and cold storage, or archival tiers. It's highly reliable: GCS provides up to four nines of availability in multi-regional storage. It's extremely cost-effective: GCS can store hot data, but it also stores cold data at prices as low as $0.004 per gigabyte, and we just introduced our archival storage tier, which is as low as $0.0012 per gigabyte. GCS is a great way of keeping your data secure: data is encrypted at rest and in transit, and we're integrated with Cloud services like IAM and KMS.

GCS offers three different location types. The first is regional storage, where your data is stored redundantly within one of our Cloud regions, across availability zones within that region. That's really good for analytics data: regional storage is a great way of ensuring your data is co-located with your compute, so you get really high-bandwidth access to your data within that region. Our second location type is multi-regional storage, where your data is spread redundantly across multiple regions within a multi-region, a multi-region being US, Asia, or EU. This is really great for content serving. We have over 100 points of presence where this data can be stored, which means that if you're doing content serving, the data is often located very close to where it's being requested, so we can minimize latency and bandwidth by distributing it widely across a multi-region. The third location type is dual-regional. This is something we've introduced recently that combines the best of both worlds: dual-regional ensures that your data is located in a specific pair of regions, so you have really high-bandwidth access to that storage within those two regions, say for analytics, but you also get higher availability than you can get with regional storage. So if you have an analytics application that needs to potentially fail over between regions, dual-regional is a great way of providing that. Twitter is an example of a customer using our dual-regional storage right now for analytics, and Vimeo is an example of a customer using our multi-regional storage for content serving.

So GCS provides a lot of key features.
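As a small, hedged illustration of those three location types, the sketch below creates one bucket in each using the Cloud Storage client library for Python; the project and bucket names are placeholders, and NAM4 is a predefined dual-region location.

```python
from google.cloud import storage

client = storage.Client(project="my-project")  # hypothetical project

# Regional: data stays in one region, co-located with compute for analytics.
client.create_bucket("example-analytics-bucket", location="us-central1")

# Multi-regional: data is spread redundantly across an entire multi-region.
client.create_bucket("example-serving-bucket", location="US")

# Dual-region: data is redundant across a specific pair of regions.
client.create_bucket("example-failover-bucket", location="NAM4")
```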

This is an example of four key differentiating features of GCS. What's important to keep in mind, as I'm going to go into in more detail, is that GCS uses Spanner heavily under the hood in its implementation, and all of these features are in some way enabled by the fact that GCS is using Spanner. Let's talk about the first one: strongly consistent listings. What this means is that when you do an object listing in GCS, you're guaranteed to see exactly what's there, not some partial view. That makes your applications on top of storage much simpler, because you can rely on your storage layer being consistent. GCS offers geo-redundancy within the system. As opposed to systems where you have to cobble together geo-redundancy at layers above the storage system by adding cross-regional replication, GCS offers it in one package, and we of course leverage the fact that Spanner's metadata is geo-redundant in order to provide this. GCS is highly scalable. Much like Spanner is used very heavily inside Google, GCS is also used really heavily by many systems across Google, so we've had to grow to Google scale just to support Google's internal applications, and we expose that to our Cloud users as well. GCS is scalable to exabytes, and the fact that we're using Spanner as our metadata storage really facilitates this scalability: because we have a highly scalable metadata layer, our storage system can scale to these levels. And GCS provides a single unified experience for all the variety of data that you have. Regardless of what storage classes or location types you're using, there is a single unified API, and that is also facilitated by Spanner: because we're using Spanner as our centralized metadata repository, we can expose a single unified API on top of the metadata for all of the data in GCS.

So now let's dive a little more deeply into how GCS uses Spanner under the hood. You're probably familiar with the GCS core storage abstractions: buckets and objects. Simply put, an object is a container for a collection of data, like a file in a file system, and objects have a particular name. Buckets are collections of objects. Both of these things have associated metadata in GCS, so there is bucket metadata and object metadata, and that metadata is stored in databases in Spanner. Much like Deepti alluded to, GCS is using Spanner just like any other user; it's the same system, so we're basically using Spanner as an ordinary customer.
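GCS's actual schema is not public, but purely to illustrate the idea of bucket and object metadata living in ordinary Spanner tables, a sketch along these lines (all table and column names are hypothetical) is enough to follow the later examples in this talk.

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")  # hypothetical names throughout
database = client.instance("metadata-instance").database("metadata-db")

# Illustrative only: not GCS's real schema. Bucket and object metadata are
# simply rows in Spanner tables.
operation = database.update_ddl([
    """CREATE TABLE Buckets (
         BucketName STRING(1024) NOT NULL,
         Location   STRING(64),
         Deleted    BOOL,
       ) PRIMARY KEY (BucketName)""",
    """CREATE TABLE Objects (
         BucketName STRING(1024) NOT NULL,
         ObjectName STRING(1024) NOT NULL,
         Generation INT64 NOT NULL,
         UpdateTime TIMESTAMP,
       ) PRIMARY KEY (BucketName, ObjectName)""",
])
operation.result(timeout=300)
```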

So I would say that Spanner provides four key differentiating features that are critical for GCS's implementation and for GCS's use case. The first one is strong consistency. What that means for us, in our implementation, is that when we do mutations in our Spanner databases, those updates are immediately visible to all of GCS's front-end systems, and that allows us to provide our own user-visible consistency semantics much more simply. It's certainly possible for storage systems to offer the kind of strong consistency that we offer without a strongly consistent database, but that would add tons of complexity: storage systems can do it with coherent caching layers, fancy request routing, or master election in the storage system. We don't have to do any of that. We just put our metadata in Spanner and serve that metadata out of Spanner, and it takes care of the consistency issues for us.

Spanner has a really simple programming model. It supports single-row transactions, and we do mostly single-row transactions, but we also use multi-row transactions, multi-table transactions, and even multi-database transactions, and this allows our system to be much simpler in how it interacts with metadata. GCS is made up of many independent components under the hood, and a lot of those components interact directly with metadata. The fact that all of these systems can interact transactionally, in a variety of different ways, with our metadata databases makes all of those systems pretty simple, and it gives us a lot of confidence that our metadata is correct and consistent. The fact that Spanner supports really high availability is also critical for our use case: we use geo-redundant Spanner deployments, and we rely on the fact that you only need a quorum of zones available at any one time in order to keep serving.

GCS is also pretty big, so the fact that Spanner has really excellent horizontal scaling is very important for us. Simply put, when the metadata in one of our locations grows to the point where we need more resources, we just scale up our Spanner instances. That's a pretty simple administrative action, and Spanner's scalability takes care of the fact that the location needs to grow. I'm sure there are scalability limits, every system has them, but we just haven't hit those limits, and GCS is already really big. Just to give you some idea of the scale we're talking about, here is an example of one of our multi-regions and some high-level data points for how far we've been able to scale our Spanner databases for it: trillions of database rows, millions of queries per second, millions of mutations per second, and petabytes of logical data stored in these databases. We're using thousands of Spanner servers for this one multi-region, and there are over 1,000 miles between Spanner zones. The fact that we've been able to scale our databases and a Spanner instance to this level without sacrificing strong consistency or high availability is really pretty unique, and it's pretty cool that we've been able to take advantage of it.

Let's return to the earlier slide about GCS's location types and talk about exactly what they mean in terms of our Spanner deployments. We use separate Spanner instances for all GCS locations. For regional storage, we use a Spanner instance that is entirely within the region, with zones in different availability zones of that region. For multi-regional storage, we use a Spanner instance (I'm calling them “Alex” on the slide; that's just what we call them internally) with zones that are spread across the regions within the multi-region. And for dual-regional storage, we use a Spanner instance that has zones within the two regions of the dual-region plus a zone in a third region.
So now let's dive deeper into exactly why Spanner's strong consistency is so important for GCS's use case. I've already talked about this a little bit, but I want to go into depth on how important strong consistency is for GCS users. Simply put, it's much simpler to build applications on top of storage when the storage system is consistent, when you're getting a complete view of what's in your storage system at any one point in time. As a specific example, GCS offers read-after-write consistency, which means that when you write an object into GCS and then later try to read it back, even in one of our multi-regions where we're doing cross-region redundancy with that data, we provide a strong guarantee that you read exactly what you wrote earlier.
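In client terms, the guarantee looks like the sketch below (Python client for Cloud Storage; project, bucket, and object names are placeholders): a read issued immediately after a successful write returns the written bytes, with no eventual-consistency window to retry around.

```python
from google.cloud import storage

client = storage.Client(project="my-project")            # hypothetical project
bucket = client.bucket("example-multiregional-bucket")   # hypothetical bucket

blob = bucket.blob("logs/2019-04-10/events.json")
blob.upload_from_string(b'{"event": "example"}')

# Read-after-write: the object is immediately visible and returns exactly
# what was written, even for a multi-regional bucket.
assert bucket.blob("logs/2019-04-10/events.json").download_as_bytes() \
    == b'{"event": "example"}'
```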

It sounds kind of obvious; really, every storage system should work this way. But there are a lot of storage systems out there that don't actually provide this guarantee, that are only eventually consistent, especially when you're talking about geo-redundancy. This guarantee makes applications a lot simpler. If you don't have it, you might have to introduce retry loops into your application: if you try to read the object and it isn't there yet, you might have to wait and retry later. Another reason why strong consistency is so important is that GCS is often used as an organization's data lake, which means there are many applications working together on top of storage that's centrally located in GCS. If all of those applications don't have a guarantee that they're seeing exactly what's in that shared storage system at a point in time, it adds complexity to those applications; they might have to deal with the fact that they're seeing only a partial view. A specific example might be an analytics pipeline, where one stage of a Hadoop pipeline produces a data set that is later consumed by another stage doing additional transformations on that data. If that later stage doesn't have a strong guarantee that it sees everything produced by the first stage, you're going to have to deal with that in your application somehow, and a lot of times that's just hard to do.

So in addition to the strong consistency provided by Spanner being really beneficial for our users, GCS also takes advantage of Spanner's strong consistency within the implementation of GCS itself. How many people in the audience are familiar enough with GCS to know what a compose operation is?
OK, not too many people. Compose is an operation provided by GCS, and it's a way to take several source objects and stitch them together into a single destination object. This is typically used, say, for uploading large objects into the system: you can upload the object in pieces and then, after the fact, stitch all the pieces together into a single object. Now, compose is clearly going to involve some kind of multi-row transaction, because it's operating on multiple objects and doing that atomically. But compose also offers what are called preconditions, which means that for each of the source objects you specify, you can require that the operation only execute if certain preconditions are met; for instance, the objects have to match a particular generation in order for the operation to complete. So clearly, here is a case where strong consistency is going to be really important. We want to read the metadata for all of these objects out of Spanner, evaluate the preconditions, and then, if the preconditions hold, commit a mutation. If we don't have strong consistency guarantees from our database, it's a lot harder to implement that operation.
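This is not GCS's actual implementation, but a minimal sketch of the pattern being described, using the Cloud Spanner Python client and the hypothetical Objects table from earlier: read the source objects' generations, evaluate the preconditions, and commit the destination mutation, all inside one read-write transaction.

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")  # hypothetical names throughout
database = client.instance("metadata-instance").database("metadata-db")

def compose_like(transaction, bucket, sources, expected_generations, dest):
    """Check per-source generation preconditions and write the destination
    row atomically, in the spirit of a compose operation."""
    keyset = spanner.KeySet(keys=[[bucket, name] for name in sources])
    found = {row[0]: row[1] for row in transaction.read(
        table="Objects", columns=("ObjectName", "Generation"), keyset=keyset)}

    for name, expected in zip(sources, expected_generations):
        if found.get(name) != expected:
            raise ValueError(f"precondition failed for {name}")

    # The preconditions held at this transaction's read timestamp, so the
    # destination write commits atomically with respect to them.
    transaction.insert_or_update(
        table="Objects",
        columns=("BucketName", "ObjectName", "Generation"),
        values=[(bucket, dest, 1)])

database.run_in_transaction(
    compose_like, "example-bucket", ["part-0", "part-1"], [17, 4], "assembled")
```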

Here's another example of how we take advantage of Spanner's strong consistency. GCS has an FSCK, a filesystem consistency checker. This is something most storage systems have: a process that runs in the background and validates that all of the metadata in the storage system is correct, that it's free of inconsistencies. A lot of times, FSCK checkers want to do what's called cross-checking: validating that the contents of one database are consistent with the values in another database. To make these kinds of checks simple within GCS, we take advantage of the fact that you can take a consistent point-in-time snapshot across databases, and we have a strong guarantee from Spanner that everything in that snapshot, even across multiple databases, is exactly what was in those databases at that point in time. That makes our FSCK cross-checking much simpler. Here's a simple example, but it's an actual check we have in our FSCK: there is a table of all the objects in a given bucket, and a separate table in a separate database that shows the state of the bucket, whether it's deleted or not. Pretty clearly, if there are objects in the bucket, then the bucket shouldn't be deleted, so we want to validate that what one database says about the bucket is consistent with the information about the objects in that bucket.
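One way to express that kind of cross-database check with the external API, sketched below with hypothetical database and table names, is to read both databases at the same read timestamp; Spanner's external consistency makes the two reads describe a single consistent cut.

```python
import datetime
from google.cloud import spanner

client = spanner.Client(project="my-project")       # hypothetical names
instance = client.instance("metadata-instance")
buckets_db = instance.database("buckets-db")
objects_db = instance.database("objects-db")

# One timestamp in the recent past (inside the version retention window),
# used for both databases so the two reads describe the same instant.
ts = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(seconds=30)

with buckets_db.snapshot(read_timestamp=ts) as snap:
    deleted = [row[0] for row in snap.execute_sql(
        "SELECT BucketName FROM Buckets WHERE Deleted")]

with objects_db.snapshot(read_timestamp=ts, multi_use=True) as snap:
    for name in deleted:
        rows = list(snap.execute_sql(
            "SELECT ObjectName FROM Objects WHERE BucketName = @b LIMIT 1",
            params={"b": name},
            param_types={"b": spanner.param_types.STRING}))
        if rows:
            print(f"inconsistency: deleted bucket {name} still has objects")
```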
My third example is that we're using secondary indexes within GCS, and the way we take advantage of them is to optimize the performance of our background scanners. It's fairly common for storage systems to have background scanners. What these typically do is maintenance operations: you might have a scanner running over all of your metadata periodically and doing maintenance on the data that metadata represents. Background scanners are also used for deferred operations. If, say, you delete some entity in your storage system, typically that delete is acknowledged very quickly, and it kicks off asynchronous work to actually clean up the storage represented by the object you're deleting; that's a deferred action that's often driven by a background scanner. By using secondary indexes, and in particular sparse secondary indexes, we're able to build background scanners that are much quicker at iterating over our metadata, because only the things that actually need attention from the background scanner are represented in the index that we're scanning. In this example, we have an object table with a column that records the time at which an update was made to each object, and it's only when objects have been updated that the background scanner needs to pay attention to them. So we use that update time as the key in a secondary index. That secondary index is much smaller than the object table, and the background scanner runs over the secondary index instead. It's also much simpler to implement that background scanner, because the secondary index is consistent with the object table itself.
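A minimal sketch of that sparse-index pattern against the hypothetical Objects table: a NULL_FILTERED secondary index only contains rows whose UpdateTime is set, so the background scanner iterates over a small, always-consistent index instead of the whole table. Index and column names are illustrative, not GCS's.

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")  # hypothetical names throughout
database = client.instance("metadata-instance").database("metadata-db")

# A NULL_FILTERED index contains only rows where UpdateTime is non-NULL,
# i.e. only the objects the background scanner still needs to look at.
database.update_ddl([
    "CREATE NULL_FILTERED INDEX ObjectsNeedingAttention ON Objects(UpdateTime)"
]).result(timeout=300)

def objects_needing_attention():
    with database.snapshot() as snap:
        return list(snap.execute_sql(
            """SELECT BucketName, ObjectName, UpdateTime
               FROM Objects@{FORCE_INDEX=ObjectsNeedingAttention}
               WHERE UpdateTime IS NOT NULL
               ORDER BY UpdateTime"""))
```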

The last thing I wanted to talk about is some of the challenges we've faced when building our system, and one of the challenges we do face is something called hotspotting. So what is hotspotting? Every system has scaling limits that apply to individual keys, so hotspotting can occur if you have an uneven distribution of load within your system. Say, for instance, you have a table of videos, and one of those videos happens to be “Gangnam Style,” represented by one row in your table. Clearly, that row is going to be really popular, and you're going to have to deal with some kind of hotspotting problem on that row. Now, systems can deal with this by caching that metadata and distributing it more widely. We choose not to sacrifice our strong consistency guarantees while dealing with this problem, but we have developed some techniques to deal with it that I'm going to go into. Just for some background on hotspotting: strong reads in Spanner require going to the leader to get the latest timestamp, so they are going to hotspot particular servers, and if you aren't careful about this in your application design, it can limit overall application performance.

There are two techniques we employ to minimize the issues created by hotspotting, and the first one is schema design. If the hotspotting problem you're facing is that there are small ranges of keys that get much hotter than other ranges, say a collection of objects that are all very similarly named, so they share a given key prefix, you can address it by spreading that range out more widely in your key space by applying what we call a sharding factor. You basically compute a hash over the contents of your primary key and prefix the actual primary key with that hash. This has the effect of spreading out otherwise co-located keys much more widely, and that spreading causes these keys to tend to fall into different Spanner splits, which improves your ability to scale. But keep in mind that this kind of spreading can come with a cost, depending on your workload: if an important workload for your use case is sequential scanning over your primary key, then this spreading makes sequential scanning a little more expensive, because instead of scanning one small range of your key space, you now have to scan several ranges, depending on how you've broken up your primary key.
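Here's a small sketch of that sharding-factor idea; the shard count, hash choice, and key layout are illustrative assumptions, not the scheme GCS actually uses.

```python
import hashlib

NUM_SHARDS = 64  # illustrative; chosen per workload

def sharded_key(bucket_name: str, object_name: str):
    """Prefix an otherwise contiguous key with a small hash so that similarly
    named objects land in different Spanner splits."""
    logical = f"{bucket_name}/{object_name}".encode()
    shard = int(hashlib.md5(logical).hexdigest(), 16) % NUM_SHARDS
    # The table's primary key becomes (ShardId, BucketName, ObjectName):
    # point lookups recompute the shard, but a scan in logical key order now
    # has to visit up to NUM_SHARDS separate ranges.
    return (shard, bucket_name, object_name)

print(sharded_key("example-bucket", "videos/part-00000"))
print(sharded_key("example-bucket", "videos/part-00001"))  # likely a different shard
```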
The other way we've mitigated the issues caused by hotspotting is by using bounded staleness reads. Now, you might be wondering: I've been talking about strong consistency through this whole talk, and now we're talking about bounded staleness. But the nice thing about Spanner is that you can pick and choose where you introduce bounded staleness into your application, and use it only where it won't jeopardize the high-level semantics you want to provide. So we don't use bounded staleness everywhere; we use it where it won't jeopardize the semantics that we want to provide to our users. Keep in mind that bounded staleness reads are actually much cheaper than strong reads, and that's because they typically don't have to go to the leader. Oftentimes, all of the replicas in your Spanner database are almost entirely up to date; it's only the last small window of transactions that hasn't yet been propagated to all of the replicas. So even with just a small staleness bound on your query, it will almost always not have to go to the leader; it can go to any replica. And what's nice about Spanner's API is that you can easily specify how much staleness you're willing to tolerate, and Spanner provides a strong guarantee that the staleness bound you provide is adhered to strictly.

Here's an example where we use bounded staleness reads in GCS as a performance optimization without jeopardizing our strong consistency guarantees. We have some code in our system that attempts to determine whether a bucket exists, and a bucket is represented by a row in our database. We first issue a bounded staleness read for that bucket. Almost all the time, that read is going to return success: it says, OK, the bucket exists, and at that point you've only issued a bounded staleness read and you're done. In the unusual case, it's going to say that the bucket isn't there, and then we retry with a strong read to differentiate between the case where it was just the bounded staleness read that caused the bucket to not be visible and the case where the bucket is actually not present. So in the unusual case we pay for a strong read in order to maintain our strong consistency guarantees, but overall we still get a big performance optimization.
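A minimal sketch of that bucket-exists pattern with the Python client and the hypothetical Buckets table: try a cheap bounded-staleness read first, and only fall back to a strong read in the rare case where the stale read finds nothing.

```python
import datetime
from google.cloud import spanner

client = spanner.Client(project="my-project")  # hypothetical names throughout
database = client.instance("metadata-instance").database("metadata-db")

def bucket_exists(bucket_name: str) -> bool:
    keyset = spanner.KeySet(keys=[[bucket_name]])

    # Common case: a bounded-staleness read can be served by any replica that
    # is caught up to within the bound, so it usually avoids the leader.
    with database.snapshot(max_staleness=datetime.timedelta(seconds=10)) as snap:
        if list(snap.read(table="Buckets", columns=("BucketName",), keyset=keyset)):
            return True

    # Rare case: not visible at the stale timestamp. Re-check with a strong
    # read so a just-created bucket is never reported as missing.
    with database.snapshot() as snap:
        return bool(list(snap.read(
            table="Buckets", columns=("BucketName",), keyset=keyset)))
```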

So to sum up, it's been a lot simpler for Google Cloud Storage to build our industry-leading functionality, reliability, and scale because we've gone all in on building our system on top of Spanner, and we take advantage of Spanner's strong consistency, its simple programming model, its high availability, and its ability to scale horizontally. [MUSIC PLAYING]