Google Cloud Next Amsterdam '17: "Introduction to Google's Serverless Analytics Platform"

TINO TERESHKO: My name is Tino Tereshko. I come from Seattle, which some people have described in the past as the Amsterdam of the United States. Kind of, I guess. So I want to, first of all, thank everyone for being here, investing time in coming here, listening to us, and having conversations with us, because I heard earlier today that people in Amsterdam have no time. So it's especially important for us to recognize the investment you've made to come here and spend time with us. Real quick, before we get started, let's take a selfie, especially the people here in the front. On three, I want you to raise your hands, to smile, to clap, whatever you want. Just make it interesting, OK? And I'll put it on Twitter later. So let's do it. 1, 2, 3! AUDIENCE: [SHOUTING] Hey! TINO TERESHKO: That's good. Thank you. So I'm here to talk about analytics and the various spectrums of analytics that are possible today with the advancements in technology by Google and by others. But let's start with a quick little story. You've probably seen these little cars driving around with a 360 camera on top of them that say Google Maps. For the past 10 years or so, Google has been mapping the entire world, or as much of it as possible, and externalizing that information through Google Maps and Google Earth. You're able to take a little yellow stick figure, put it anywhere in the world, and walk through the neighborhoods. I do that with neighborhoods I grew up in that I haven't been to in a while. But the interesting thing is, this is, of course, a massive data collection and optimization problem, and it really speaks to how heavily invested in data Google is itself. The interesting thing here is that up until recently, up until the last four or five years, this was just imagery, right? Maybe coordinates and things like that. You couldn't really gain as much insight as possible from it. But with advancements in AI, the commoditization of computing hardware, and improvements in computing power, especially in AI and machine learning, we're able to derive further information from this imagery. So, for example, we can understand that a business has turned over, right? There's a new business in its place, so we can go into Google Maps and update the place of business. Or maybe this sign wasn't there before. Maybe it was a two-way street, but now you can't go this way, so it's very important for us to tell you that you can't go down this road. All of this, of course, is done automatically through our AI capabilities. And so this right here is probably the worst kept secret in Silicon Valley: Google is very serious about analytics. It says so in our mission statement. It says, Google's mission is to– gosh– organize all the data that's possible in the world and present it, for bad or good, for everyone, right?
And so Google has built a number of technologies that it uses internally to organize this information. And the premise of Google Cloud Platform is to externalize these technologies, the lessons learned, and the power of compute for your benefit, so you can run alongside Google, leveraging the same Google investments, technology, and innovation for yourselves as well. Hopefully that accelerates your pace of innovation, accelerates your iteration, and allows you to deliver faster, better, more interesting features to your customers. And the fact that Google has been a data-driven company for such a long time, and has built these services that allow you to do some really powerful things with data analytics, has made it fairly easy for us to become a machine-learning company as well. This chart probably needs updating, but essentially, for the past five or so years, virtually every single product and service that Google has has become, in some ways– and often, in many ways– AI-driven, right? So if you use Google Photos, for example, you can very clearly see the type of inference that occurs on your content automatically. It allows you to create stories based on just the imagery and the ability to pull out metadata about that imagery. But, of course, you can have too much of a good thing, right? Going back to that example where Google has generated a whole lot of photos of a whole lot of streets around the world: if you need to go back in time and process all that information, that takes very serious computing power. So you can have too much of a good thing, right? You can have too much of a requirement

for AI, especially for inference, which is asking the model what the model thinks is happening. So in the last few years, Google has been faced with a serious problem, right? We have too much demand for inference, and we were faced with having to expand into dozens and dozens of new data centers. And if you've ever seen a picture of a data center that Google operates, these are huge things. They're bigger than this venue here. So it's a serious investment. It takes time. It's complex. So we implemented Tensor Processing Units– TPUs– and now we're on version 2.0. TPUs are specially optimized for the matrix algebra that lets you infer results from an AI model. Essentially, TPUs allow us to save a whole lot of energy. These have been live inside of Google for the past several years. They're intricate, they're delicate, and they're really important to our infrastructure. So, on to slightly more practical terms. When I talk to data scientists, AI experts, and machine-learning experts about their day-to-day activities, some folks tell me, well, I have to operate an HBase cluster, because that's where our data is kept. Or I have to take data from Hadoop, run a Hive query, create a cohort, and feed that into a model that lives somewhere else. Or I have to run a bunch of SQL queries before I can even feed anything into my model, to randomize samples, and so on and so forth. So ultimately, folks tell me that 90% of really true AI and machine learning workloads involve pure data science, or maybe even data engineering, right? So when you're faced with complex problems like that, it's really important to try to minimize complexity wherever it's not necessary. This is really something interesting to keep in mind. You can get lost in the noise very easily. But then, according to Gartner, a whole lot of projects that try to create big data infrastructure– data lakes– end up failing. I'm not sure how accurate that statement is, but it's quite possibly true, right? It's very hard to do big data in 2017. Hadoop is complex. HDFS is complex. You need to have a lot of expertise. So how do you square this? If you want to be an AI-driven organization, and [INAUDIBLE] data science, but 90% of data science projects fail, that's kind of a daunting proposition, right? So I like to use this Jenga analogy. You're probably familiar with Jenga, right? It's basically a game where you pile wooden bricks on top of each other, and you pull a brick out from here and put it on top. So you want to think about data infrastructure as entirely interdependent. Bricks on top depend on bricks on the bottom, and you want to have a very clean Jenga tower so it doesn't topple. At the bottom, you could think about this as baseline architecture. What I mean by that is you want to have good networking. You want to choose a good public cloud provider or two. You might want to have a really good data center, right?
You want to get that right. And on top of that, you want to have really good compute (virtual machines, servers, containers) and good storage (object storage, for example). That aids you in developing good applications and the databases that power those applications. So in order to have good applications, you probably need to have those two right. Only then can you start thinking about a proper data warehouse. You can certainly do a data warehouse without getting all of that right, but it really helps to have the foundational blocks stacked up before the data warehouse. You're probably going to say, Tino, data warehouse is kind of an old term. We don't use that anymore. We have a data lake. Sure. But sometimes data lakes get stale, and they turn into data swamps. But you drain the swamp, and you have everything right, and now your data lake is so big you can call it a data ocean. Fantastic. Well, let's just summarize all that and use one umbrella: data large body of water. I coined that. You can ask me to use it, but that's mine, data large body of water. And once you have your data large body of water set up, then you can start thinking about AI and machine learning. There are some exceptions, which you saw earlier today, and I can talk about them some more. But that's really the lesson here. And of course, it's a Jenga tower analogy, so if you build a bad Jenga tower, it's probably going to topple, hence the 90% of data lake projects that fail

That’s the lesson here OK, so now that I’ve scared you sufficiently, let’s talk about what a modern data analytics stack looks like There’s probably a whole lot of attributes that you can identify that you can say, this is what has to be in a modern data analytics stack So I have four that I want to talk to you about The first one is separation of storage and compute And that is a little bit of a loaded term So let me quantify exactly what’s happening here Separation of storage and compute is really a solved problem It buys you so many options when you’re designing your system If you don’t have this, and you are operating on a decently sized scale, you should look into this yesterday It’s just the pros and cons of having this attribute are just tremendous in your favor, right? So this is a solved problem Get it done Here’s what I mean by pure separation of storage and compute It’s exactly that, right? There’s a fine line between processing and compute storage And that fine line actually represents a very powerful networking, networking that can talk, that allows every node within the data center to talk to every other node within the same data center, at 10 gig, all at the same time, right? So Google’s data centers last reported have over a petabit of bisectional bandwidth per second per data center So that’s kind of a requirement You have to have good networking But ultimately, in practical terms, you can have your various storage layers, whether you have BigTable, which is our– Google’s NoSQL database You can have BigQuery storage, which is our data warehousing data lake storage Or you can have Object Store, where you keep your cat videos And you can have Google Cloud Dataflow, for example, which is our batch and stream processing unified programming model and engine, access all of those, regardless of where they live Or potentially, you can have BigQuery SQL, which is standard, run SQL queries on all of those– not cat videos, but just ignore that for a second And you can have your own Hadoop and Spark clusters run on top of that information as well So I’m not discounting the significance that Hadoop and Spark have in today’s environment The momentum behind these projects is phenomenal, right? So we have Google Cloud Dataproc, which is our fully managed Spark, Hadoop and Flink service The awesome thing about our service specifically is that you can get clusters very, very quickly– 90 seconds Hopefully you guys see that over there 90 seconds is 99th percentile from the moment you push the button to the moment your cluster is live, ready to use That’s very fast, especially within– you know, if you’re coming from your own on-premise private data center space, the typical numbers that I hear involve six to nine months, sometimes even longer So 90 seconds is just a phenomenal change, right? It enables so many options And, of course, we have pay per minute, right? So you can shut this thing down after 20 minutes if you’d like to You save the results into storage That’s where the state is And because we have fully [? femoral ?] preemptible VMs– which, kind of disappear every once in awhile in exchange for 80% off of list price So you can have very quick Hadoop clusters that you can shut down whenever you want, because you pay by the minute, and you actually don’t pay a whole lot for it And so the idea is that Hadoop doesn’t really keep a lot of state Unless it’s intermediate state, state lives in storage, right? 
The initial state and the output live in any of these storage pieces. The other really big point here is that data silos are bad. You're playing the game of telephone with data when you move it around needlessly; ultimately, you'll end up with multiple copies of it. And a funny thing happens: when two data scientists try to get the same answers to the same problem using the same data, if that data lives in different places, they'll come up with different results. I'm still not sure how that happens. But data silos are bad, and separation of storage and compute lets you avoid dealing with that. So here's another, different analogy that I can use. Well, first of all, I apologize, especially to the CXOs– we don't think of you as poorly as these stick figures suggest. But ultimately, this is what your organization could look like. Even at a startup, you can have this type of delineation. And you can have your various bits of data living inside of BigQuery, all the way from very, very raw data that comes from application databases and services, to data that's potentially sensitive that you don't necessarily want to have floating around everywhere,

to real-time dashboards, things that tell you the state of your business, all the way to data that you may want to externalize from your infrastructure to third parties, to clients, because they find use in it. So they all have different roles here. And the nice thing about our separation of storage and compute is that, just like with Google Docs or Google Sheets, where you can share information and collateral in place with other folks inside and outside your organization, you can do the same thing with our data lake– data body of water– services. IT and Ops could potentially have the overarching role in this organization. They want to own every single data set. They want to administer it. That's fine. And so in our access controls, they say, I'm the owner of everything, just like in Google Docs. But the data engineers may only need viewer access to the really raw data sets, the really important data sets, because you don't want to mess those up. Then they're the ones creating dashboards, they're the ones creating these derived data sets, and so they have editor access to those. And, of course, the obvious follows, right? Your executives get a real-time dashboard in view mode. And you can even share in place, without having an FTP server: in place, you can share your data with third parties, so they can run SQL analytics in real time without having to do a whole lot of work on top of your data. So it's pretty awesome. And I've shown this already: data silos are bad. They're a relic of old architectures– we try to avoid that ourselves. So the second aspect of modern analytics I want to talk about is this concept of serverless. And of course, you've heard the various contexts within which serverless is mentioned. The most confined definition of serverless is actually Cloud Functions, tiny little pieces of code that execute one at a time. But the industry has been extending this buzzword further, to identify things that have a very high level of abstraction, things that are fully managed, that essentially allow you to use an ephemeral,
very, very scalable piece of compute, and only pay for the resources that you use. The central piece of our serverless analytics story is BigQuery. Actually, I was on the BigQuery team until recently, and it's an amazing product. If you haven't used it, you should definitely look into it. BigQuery is our fully managed analytics data warehouse. It does have separation of storage and compute. It has a very, very powerful storage engine that self-optimizes and self-heals. It tries to understand how you're using the storage and burns a whole lot of CPU and RAM to optimize the storage for your workload. It does that on your behalf, so it's kind of opinionated in that regard. And of course, that storage component allows you to share data within your organization, or outside of it, in very elegant ways, without having to copy that data around. On the other side, BigQuery has a compute engine. It's basically Dremel– you can go read the paper that Google published. Dremel executes SQL queries in real time, very fast, for analytical workloads. And BigQuery has an incredibly high level of abstraction. You never have to touch hardware. You never have to define storage or compute, and there's really not a lot of configuration there. That's on purpose, because we want you to focus on what's important to you. So let's take a look at what BigQuery looks like, if you haven't seen it. This is the BigQuery UI, which is one of the many ways you can interact with BigQuery. At first, it doesn't really look all that exciting. It looks a little aged, kind of like Gmail from 2004. But Gmail in 2004– when you got, what, a gigabyte inbox right away, after yelling at your grandma for sending you a 30-megabyte video because your inbox would explode– that was amazing. And just like Gmail in 2004 hid a whole lot of complexity and a whole lot of power underneath a simple UI, BigQuery does the exact same thing. So let me show you. Right here, BigQuery petabyte is a data set, kind of like a database, that BigQuery has. And I'm going to click on this sales partitioned table here, which is a table that we made up. It's not real data; we created it to demonstrate the power of BigQuery, so there's nothing super special about it. I'm going to look at the schema of this table. What you see here that's really, really interesting– so it's typical: order, execution, when it happened, how much. But we actually have nested structures here, like an array inside of the table. It's pretty awesome
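For readers following along at home, nested and repeated fields like the ones in that schema are queried with standard SQL's UNNEST. A minimal sketch with the Python client follows; the project, dataset, table, and column names are made up, since the demo's exact schema isn't spelled out in the talk.

```python
"""Sketch of querying a nested, repeated (array-of-struct) column."""
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  order_id,
  item.sku       AS sku,        -- fields inside the repeated record
  item.quantity  AS quantity
FROM
  `my-project.demo.sales_partitioned`,   -- hypothetical table
  UNNEST(line_items) AS item             -- flatten the array column
WHERE
  _PARTITIONTIME = TIMESTAMP('2017-06-01')
LIMIT 10
"""

for row in client.query(sql).result():
    print(row.order_id, row.sku, row.quantity)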

So you can have an array of an array, JSON-like structures that are very complicated or very simple, which allows you to have interesting relationships within your data. So that's the schema. If I click on the table details, you'll see that the table size is a little over a petabyte. It's a decently sized table. And that looks like a trillion rows– a trillion rows in this table. So the most sensible thing to do with this table is to try to query it live. I have this saved query here, so I just push the big red button that says Run Query. Oh, no! I had the cache turned on. We're not going to cheat; we're not going to use the cache. If you look at this query, the only thing I'm doing to really cheat is that this petabyte table has three years' worth of data, and we're only querying one month of partitions. But otherwise, it's querying a whole lot of data. And it's got a join, it's got a big old regular expression, it's got a window function– everything you can possibly imagine in a typical query, right? A mixed bag. Oh, looks like we finished this in 16 and a half seconds. So this table didn't have anything special. There are no keys, no indexes. We didn't do any optimizations, really, outside of what the BigQuery service itself provides in terms of automation and optimization. All we did was load the data into BigQuery. So that's the demo. Now let's talk about what exactly happened. I've run this query before, so I know how much resource it takes to execute it in under 20 seconds: it's about 3,000 cores. So you could potentially take that data set, and if you have 3,000 cores lying around, you could deploy it and, assuming you have software similar to BigQuery, execute it yourself. You can run your own demo of that, if you'd like. And that's really what we did, right? We rented 3,000 cores from Google for the length of the job. So let's think about this in virtual machine terms. Virtual machines take minutes to start. You get per-hour billing, maybe per-minute, maybe. And with virtual machines, it's hard to get 3,000 cores very quickly, right?
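The "3,000 cores for 16 seconds" point, and the pricing question that comes up later in the Q&A, can be reproduced roughly with a dry run, which reports the bytes a query would scan without running anything. A sketch, assuming a hypothetical table and 2017-era on-demand pricing of roughly $5 per terabyte scanned:

```python
"""Sketch: estimate what a query would cost, then run it with the cache off."""
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT user_id, SUM(amount) AS total
FROM `my-project.demo.sales_partitioned`        -- hypothetical table
WHERE _PARTITIONTIME = TIMESTAMP('2017-06-01')  -- scan one partition only
GROUP BY user_id
"""

# Dry run: BigQuery plans the query and reports bytes scanned, but runs nothing.
dry_config = bigquery.QueryJobConfig()
dry_config.dry_run = True
dry_config.use_query_cache = False
dry_job = client.query(sql, job_config=dry_config)

tb = dry_job.total_bytes_processed / 1e12
print("Would scan about %.2f TB, roughly $%.2f at ~$5/TB on-demand" % (tb, tb * 5))

# To actually run it the way the demo did, with the cache disabled:
run_config = bigquery.QueryJobConfig()
run_config.use_query_cache = False
job = client.query(sql, job_config=run_config)
rows = list(job.result())   # job statistics are populated once it finishes
print("%d rows in %.1f seconds" % (len(rows), (job.ended - job.started).total_seconds()))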
So what BigQuery allows you to do is go from 0 to 3,000 cores for only 16– 20– seconds, and you get the equivalent of only paying for what you consume. Now, the asterisk here is that BigQuery doesn't charge per second; it charges for the amount of data processed. But that's a great proxy when you're trying to relate this to virtual machines. The really key benefit of serverless is that you get a whole lot of power for very short periods of time, and you don't even care that that's happening. All you care about is that you have data, you wrote an interesting SQL query, you pushed the big red button, and you want the results. That's what you care about. Sometimes you might want to geek out and think about what's underneath, but ultimately, this is where you live. And that's the benefit of serverless, really: a lot of power, for very short bits of time. And of course, if you want to extend that to TensorFlow and machine learning technologies, our fully managed Cloud Machine Learning Engine service allows you to do the exact same thing on top of TPUs and GPUs. So you can rent lots and lots of resources for short periods of time, without having to worry about standing up and managing those resources, in order to train your TensorFlow models. And once the model is trained, you can take it into your own environment, deploy it on Google Cloud Platform, or even run it on your mobile devices, for example (a rough sketch of submitting such a training job follows below). You might have seen this already in other talks, but the one exception to that whole Jenga analogy I used earlier is that there has been a whole lot of research done by Google and by others to train models on large bodies of data, and Google has a good amount of data, right? So we've taken our investments in those technologies and externalized them through simple REST APIs, so any developer can use them and implement them inside an application. Now you have an AI-driven application, and you can certainly get that next round of funding if you say AI in your name. So the third aspect I want to talk about is real time
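Here is the training-job submission sketch referred to above, using the 2017-era Cloud ML Engine CLI. The job name, trainer package, bucket, and region are placeholders, and flag names may differ across gcloud releases; the service itself has since been renamed.

```python
"""Sketch of 'rent a lot of training hardware briefly' via the ML Engine CLI."""
import subprocess
import time

job_id = "train_recs_%d" % int(time.time())   # hypothetical job name

subprocess.run([
    "gcloud", "ml-engine", "jobs", "submit", "training", job_id,
    "--module-name", "trainer.task",             # your TensorFlow entry point
    "--package-path", "./trainer",               # local package to upload
    "--staging-bucket", "gs://my-ml-staging",    # hypothetical bucket
    "--region", "europe-west1",
    "--scale-tier", "BASIC_GPU",                 # accelerators only for this run
], check=True)

# When the job finishes, the exported model can be deployed for online
# prediction or taken back into your own environment.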

Real-time analytics is also a solved problem. There are certainly use cases where batched analytics– nightly, hourly– are fine, but you now have a very easy path to getting analytics on data as it occurs. And this is a very typical path that organizations take. Spotify has blogged about this quite extensively, and other companies, of course– The New York Times is an avid user of exactly this. It's a very high level of abstraction across three services that lets you move data from the moment it occurs in your application, to the moment you have precomputed analytics– say, a dashboard you're trying to build– to the moment you want to give real-time SQL access to your analysts and data scientists. All of that is built; you just have to implement it. And it's a very popular way of doing things today. The last thing I want to talk about is ETL. I have to do ETL sometimes, and I don't find it particularly enjoyable. It's a means to an end. There is no glory in it. So ETL should be easy: we should get the data in as quickly as possible, have it as clean as possible, and focus on value again. So I'm going to give you a quick demo of what I'm talking about. I have this data set here called Citi Bike data– it's just a bunch of rows. I actually took BigQuery's public data sets and exported them for this demo. So, just a bunch of information. And I'm going to run this BigQuery job– oh, I need to update my CLI. Basically, I'm taking that data and– hopefully the Internet holds up– I'm going to load it into BigQuery for a quick analysis. Wait a minute, we've got an error. Oh, no. No, I'm kidding, I made that happen. So it looks like the error says, at position 751, there are closed double quotes that have something inside of them. All right, let's go to that line. This is the line it's talking about. Hmm. Oh, I see what it is. I have this quotation here, and inside of it is an escape character that BigQuery doesn't like, for some reason. And of course, there is an RFC that says, for comma-delimited values, you can only escape quotes with other quotes. And BigQuery is very pedantic. It likes RFCs. So it rejects your data. So what do we do? Well, we have this other tool called Dataprep that you can leverage today, and I'm going to create a job in Dataflow that will hopefully recognize that information. So I'm going to go through the flow of importing that file into Dataprep. There it is. It's going to load that file in, or at least some part of it, and then it's going to run some statistical analysis to understand what exactly is going on. It's going to try to understand what the data looks like– did I do that? Yep– whether there are outliers, whether there are errors, whether it can parse this information, and so on and so forth. So it looks like it created this thing for me here. And this is basically an ETL job, right?
I have the file, and it's going to create a job. And this is the recipe that it's going to execute, which is very simple– you can certainly make it more complex. So we're going to look at the data and see exactly what happens here. Now we wait. But the idea here, again, is that the– wow, this is going really slowly. I did not have enough content prepared to fill this time. But ultimately, what's going to happen is it's going to take that bad row I threw in there, recognize it, and say, hey, that is actually an escape character. It's not conforming to the RFC, but we definitely think it's an escape character. So– wow– live demos don't always work, I guess. Let's forget about that; you're just going to have to take my word for it. With a tool like Dataprep, you have one-click ETL, if it works. You can take ugly data that doesn't conform to any standards, and you don't have to write Python code or anything like that
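That said, if you do end up scripting a load like the demo's instead of using Dataprep, a minimal sketch with a recent google-cloud-bigquery client looks roughly like this. The bucket, dataset, and table names are invented; the relevant detail is that BigQuery's CSV parser follows RFC 4180, so a quote inside a quoted field must be doubled ("") rather than backslash-escaped.

```python
"""Sketch of loading CSV files from Cloud Storage into a BigQuery table."""
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.demo.citibike_trips"    # hypothetical destination table

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1     # header row
job_config.autodetect = True         # let BigQuery infer the schema
job_config.max_bad_records = 0       # fail fast on rows like the one in the demo

load_job = client.load_table_from_uri(
    "gs://my-demo-bucket/citibike/*.csv",      # hypothetical source files
    table_id,
    job_config=job_config,
)
load_job.result()   # raises if a row breaks the RFC 4180 quoting rules
print("Loaded %d rows into %s" % (load_job.output_rows, table_id))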

You just want it in one place for you to analyze, and Dataprep is a fantastic tool for that. Under the hood, Dataprep actually creates Apache Beam pipelines– it turns these recipes into Apache Beam pipelines and then executes them on top of Google Cloud Dataflow. So if you're familiar with that technology, this is actually a pretty cool way of doing things. Dataprep is awesome. So, to very quickly summarize: the practice of data architecture, of data infrastructure, is moving very, very quickly. But there are pillars that emerge that become very obvious wins very quickly, and these pillars can simplify your life. It's kind of like asking, can I have things simpler, cheaper, and faster? Yes. And this is because the architectures fundamentally change. You can save money and get more. That's the lesson here. So those are the four things I've identified; I'm sure you can find many, many more. I'm going to close by saying that the whole concept of serverless– the idea, again– is that we want to provide the highest level of abstraction possible. We want to automate all the noise, all the tinkering, all the things that are not really that important to you. You don't want to tinker. Abstract that away, automate that away, and give you a surface that allows you to create value. So instead of creating very little value with a whole lot of tinkering, we want to flip that equation on its head and give you the ability to create a lot of value with very little tinkering. That's what serverless means. So, thank you very much. But before we finish, I'm going to have Geoffrey and Constantijn come on stage for a quick little fireside chat, where we can take questions from each other and from the audience as well. Geoffrey comes from RTL, and his organization implemented some interesting analytics workloads on top of Google, and Constantijn is from a services partner that helped Geoffrey implement this. Please. [APPLAUSE] Thanks for coming, guys. Geoffrey, can you tell me more about your organization? GEOFFREY VAN MEER: RTL, yeah. RTL is not only a TV broadcaster; it's a multimedia company, so we also have a lot of digital assets. Nobody knows this, but RTL Group is the largest content uploader on YouTube in Europe. And we work a lot with Google products, also in the ESL domain, in the [INAUDIBLE] domain, and also in the big data domain. TINO TERESHKO: So when it comes to big data specifically– it's a big data session– what have you done? What did you implement? GEOFFREY VAN MEER: We store the data in Google Cloud Storage. Constantijn helped us with building the– we have a Spark platform, so he built the platform based on Dataproc, and we have BigQuery on top of that. So that's the set of products we have implemented. And it helps us in various ways. It's very transparent, and the setup is easy to use. I can go to my business stakeholder and tell him exactly how long a query takes and how many euros it costs. If you want it faster, no problem, I can put more machines on it and you can have it faster. That decision is up to the business. That makes it really transparent, and really, really powerful. TINO TERESHKO: So it sounds like your business, with the help of Google Cloud Platform, was really enabled to perform better analytics on more data. Can you talk about that more?
GEOFFREY VAN MEER: Yeah. What I like about it is that it brings the decision back to the business. If they want it faster, they can have it, so it's their decision. If you want it within 15 minutes, I don't mind; we'll make sure it's there in 15 minutes. It costs a little bit more, but it's their decision. It's not an IT issue, it's a business decision. That's what I like. TINO TERESHKO: That's fantastic. And I'm aware that you're leveraging Google BigQuery. Can you talk more about this particular piece of our infrastructure? GEOFFREY VAN MEER: Yeah. BigQuery helps us a lot in growing our business intelligence maturity. Like every company, we have a set of reports that measure how many customers you have, how much revenue you have, and blah, blah, blah. But then it happens that customers go away, cancel their subscription, or the revenue drops. So reports tell you about what's happening

And then you want answers: why did people leave, why did they cancel their subscription? Then you have to go into the data. And it helps business people query large data sets to find answers about why people left. What was their behavior before they left? So it empowers, again, business people, not just a group of data scientists. They can create, they can do a lot. Now you give business people access to large data sets to find answers to the problems you see arise in your reports. Very powerful. TINO TERESHKO: And I wish the big data buzzword was actually called holistic data, because it allows you to store very raw information about what's going on with your applications and how your users are interacting with them. Then, further down the line, your analysts can understand specifically what the behaviors were and why people left your platform. GEOFFREY VAN MEER: Yeah. And when you have those answers, then you can ask data scientists to do predictive modeling. But it's the step in between where you want a broader audience in your company– people in the business, marketers, salespeople– to also have access to large data sets. And that's really important to grow toward AI and machine learning as well. TINO TERESHKO: Yeah, which leads me to my next question. What's your take on AI and machine learning? What are your plans, and what are the opportunities within your organization? GEOFFREY VAN MEER: We are a total video company, so we have a lot of content, video content. That's what I like about RTL. So we analyze the content, the video, because it tells me about the quality of the content. We have a lot of content that must help me, but I want to quantify its quality. So I also use the Video APIs. But we also use TensorFlow and Keras to get a better understanding of what's happening on screen. And that generates a lot of metadata– tags, labels– that also helps me in my recommendations, so I can improve those a little more. I need to have a better understanding of the quality of the content. TINO TERESHKO: We haven't forgotten about you, Constantijn. Constantijn was one of the gentlemen that did a whole lot of the heavy lifting on this project. Can you tell us more about what you've done?
CONSTANTIJN VISINESCU: Yep. When I started with the project, they were already using Hadoop and Spark, but they had their own on-prem cluster with a fixed number of machines. For starters, that was very expensive, because it was always on, 24/7. And it meant that either it was not big enough, because they wanted their data now, or it was standing around costing money doing nothing. It also meant there was one cluster, so if two people wanted to do something with the data at the same time, they had to wait for each other. And there were multiple data scientists, multiple data engineers, and also multiple data analysts– if one of them is using the cluster, the others are not doing anything useful for RTL. At the same time, I noticed that keeping your own Hadoop and Spark cluster up and running can be quite labor intensive, and RTL would much rather spend that effort doing useful business things than running after their Hadoop and Spark cluster. So I suggested we move them over to Dataproc for starters. Then they could stop worrying about maintaining clusters, and at the same time, like the example showed, if you want a 100-node cluster right now, that's fine. And if your job only takes 10 minutes, then you throw it away. If there are three people that want to do something at the same time, you spin up three clusters, and then you throw away three clusters. So you can just keep working. And, like Tino said in the slide, if you separate the data from your compute, three people wanting to work on the same data set is also fine, because you're not tied to one data set that lives on that one cluster. TINO TERESHKO: How long did your project take? CONSTANTIJN VISINESCU: The whole project was several months, but that was mostly understanding the data. One of the nice things about using Google Cloud was that setting up the infrastructure basically took me 15 minutes, because I've done it before. I can just click: I want a cluster. I uploaded their data to Google Cloud Storage, along with one of the existing Spark jobs they were already running on-premises, and within 15 minutes, I had working infrastructure. And I could spend the rest of the time actually analyzing the data and figuring out what to do with it, rather than babysitting the infrastructure. TINO TERESHKO: So going forward, what would be your recommendation if you were going to implement this type of infrastructure in the future? CONSTANTIJN VISINESCU: Well, if you already have Spark, you can just take your existing Spark jobs that usually work on HDFS and put your data in Google Storage,

so you get that separation of data and compute. Cloud Dataproc comes with integration for Google Storage, which you use basically the same way you use HDFS. So you can just put your data in Google Storage, take your existing job, change the output from HDFS to Google Storage, and just run it. And that's a great first step. TINO TERESHKO: Any other comments either one of you would like to share about the project? GEOFFREY VAN MEER: In a [? SQL ?] sense, it's relatively easy to manage, so you can allocate your time more to business problems. The maintenance is relatively easy, and it's very transparent. And what I also like is that, as you explained, data scientists come to the office in the morning, they spin up a cluster, and off they go. So they're not interfering with each other, and they all use the same cluster setup, with all the applications on there, so they can play around. Today a data scientist says, I want a cluster of 10 machines; tomorrow, I'll do a different type of algorithm and spin up a cluster of 50 machines. And I know exactly, as a manager, who was using what, so I can also say, you burned too much– scale down, or spin up. TINO TERESHKO: Fantastic. Constantijn? CONSTANTIJN VISINESCU: Yeah, like Geoffrey said, that really helped their data scientists. And it also helped later in the project, because the original project used Spark to pre-process their data into a normal SQL database. But that meant that sometimes, when some of the data engineers were busy, I literally heard them tell one of the data analysts, can you please not do your work until 2:00 in the afternoon, because I'm busy using the database? And I understand it from a technical perspective, because the data engineer was just keeping the database very busy. But you don't want that. So instead of putting the data into a normal database, we put the data in BigQuery, as you saw. And RTL has lots of data because, like Geoffrey said, there's a large multimedia presence on the Internet from RTL– not just the RTL label, but multiple labels. That literally generates billions of data points, and I can just put those into BigQuery. I don't even have to aggregate them anymore. Before, all the data was aggregated, and you could only get the reports that the aggregation was specifically made for. Now all the data is just in there. There are some premade aggregations to make things easier on the users, but basically you can do whatever you want. And again, because it's serverless, as long as you don't request more compute power than Google has– as long as you don't make Google crash– they can go nuts. TINO TERESHKO: Fantastic. [LAUGHTER] And we challenge you to do that. [LAUGHTER] So we have a few minutes to take questions from the audience. Please raise your hand, and there will be a microphone coming around. Please, right here. AUDIENCE: Thank you. What's the difference between BigQuery and Spanner? TINO TERESHKO: Fantastic question. BigQuery and Spanner are both used extensively internally at Google and by our clients. The easiest way to think about it is that BigQuery is great for analytics– things like aggregating lots of information– whereas Spanner is a fantastic tool for building an application, for an operational database. They go hand in hand, and they actually both speak the same standard SQL language. Any other questions?
Please. CONSTANTIJN VISINESCU: You have another question over there. TINO TERESHKO: Yeah. GEOFFREY VAN MEER: Over there. AUDIENCE: Your 20-second query of– what was it? 3,000 nodes? What's the money being spent? TINO TERESHKO: That's a great question. That particular query cost about $10 to run. BigQuery has two pricing models, specifically. One is pay per job, as I did here. And for folks who are at a larger scale, who have specific use cases, you can pay a flat monthly fee that includes all your queries. Good question. AUDIENCE: Hi. Like a lot of BI departments, in our company we're a bit of a bottleneck when it comes to data preparation. So Dataprep looks cool, because we can put that out there. How difficult would it be to productionize the output of Dataprep? TINO TERESHKO: That's a good question. In terms of productionizing, you want to put this into a place where analysts can work on it? AUDIENCE: Sure, selected analysts might get 20 files

They do their thing, and then we've got 6,000 files that we need to run through a nightly process. And so that kind of production, I think– TINO TERESHKO: Yeah, that is exactly the use case for Dataprep. It's an intrinsic quality of that service. CONSTANTIJN VISINESCU: Actually, we had to do pretty much the exact same thing at RTL. One of the nice things about BigQuery is that, next to its native custom drivers that are really fast, it also has an ODBC driver, and it speaks standard ANSI SQL. So it can talk with anything that can talk SQL over ODBC, and that is basically anything that calls itself a data analytics tool. At RTL, the main tool for business intelligence is Business Objects. I just set up an ODBC connection, and within a couple of days, Business Objects could talk to BigQuery and no one would notice the difference. There are also several other tools at RTL– departments using their own tools– and as long as they can talk ODBC with standard ANSI SQL, it just works. TINO TERESHKO: We're essentially out of time, but we'll take one final question, right there. Thank you. AUDIENCE: Thanks for taking my question. A lot of analytics projects live in an IT landscape, not just a Google landscape. Can you say something about how to roll out both code and infrastructure into production? And how do you see integration with existing applications and deployment pipelines? So I'm thinking, how do you use this stuff with Git? How do you roll all of that out? TINO TERESHKO: Yeah. CONSTANTIJN VISINESCU: At RTL, our pipelines were in code. There's a tool out there called Airflow, which is also from Apache, and it allows you to model all your data pipelines as code. So there are jobs in there like: spin up a cluster, run these Spark jobs, spin down the cluster, take the output from that Spark job and put it into BigQuery. That's all code, and that code lives in a Git repository. And there's a Jenkins on top of that Git repository, so if I commit to master, Jenkins runs the tests. If the tests look good, it automatically pushes to the acceptance environment, and if I hit the button, the same code goes to production. TINO TERESHKO: The value of an ecosystem that provides orchestration, monitoring, things like that, can't be overstated. So Google has specifically invested in partnerships and in supporting open source technologies like Airflow, which we've extended into our services. And if there are compelling technologies that we don't know about, we want to know about them as well. Thank you, folks. I think we have run out of time, but thanks for your time. Great. And thank you as well. [APPLAUSE]
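To make the "pipelines as code" answer concrete, here is a rough sketch of the kind of Airflow DAG Constantijn describes: create an ephemeral Dataproc cluster, run a Spark job, load the output into BigQuery, and tear the cluster down. It uses BashOperator with gcloud and bq commands to stay version-agnostic; the DAG name, cluster, bucket, jar, class, and table are all illustrative assumptions, not details from the RTL project, and operator import paths vary between Airflow releases.

```python
"""Sketch of a nightly Dataproc-plus-BigQuery pipeline modeled as an Airflow DAG."""
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

CLUSTER = "nightly-etl"                # hypothetical cluster name
BUCKET = "gs://my-analytics-bucket"    # hypothetical bucket

with DAG("nightly_analytics",
         start_date=datetime(2017, 1, 1),
         schedule_interval="@daily") as dag:

    create = BashOperator(
        task_id="create_cluster",
        bash_command="gcloud dataproc clusters create " + CLUSTER +
                     " --num-workers 2 --num-preemptible-workers 8")

    spark = BashOperator(
        task_id="run_spark_job",
        bash_command="gcloud dataproc jobs submit spark --cluster " + CLUSTER +
                     " --jars " + BUCKET + "/jars/etl.jar"
                     " --class com.example.NightlyEtl"      # hypothetical job class
                     " -- " + BUCKET + "/raw/ " + BUCKET + "/out/")

    load = BashOperator(
        task_id="load_into_bigquery",
        bash_command="bq load --source_format=AVRO demo.events " + BUCKET + "/out/*.avro")

    delete = BashOperator(
        task_id="delete_cluster",
        bash_command="gcloud dataproc clusters delete " + CLUSTER + " --quiet")

    create >> spark >> load >> delete
```

Checked into Git, a DAG like this is exactly the kind of artifact a Jenkins job can test and promote through the acceptance environment to production, as described in the answer above.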