Real-time Analytics with Azure Cosmos DB and Apache Spark – BRK4004

MY NAME IS ROHAN. I’M FROM COSMOS DB, AND I’M HERE WITH MY COLLEAGUES SRI AND ROHAN TO TALK ABOUT AZURE COSMOS DB AND SPARK. I KNOW THERE ARE SO MANY ANNOUNCEMENTS AND THINGS GOING ON. REST ASSURED, WE SAVED THE BEST FOR LAST. DO NOT WORRY, YOU ARE IN SAFE HANDS. WE ARE TALKING ABOUT A LOT OF EXCITING STUFF, AND THERE IS A LOT OF LEARNING IN THIS SESSION AS WELL.

SPEAKING OF LEARNING, WE HAVE DESIGNED THIS SESSION AROUND A FEW OBJECTIVES THAT I WOULD LIKE TO BRIEFLY DISCUSS. NUMBER ONE, WE WANT YOU TO WALK OUT OF HERE COMFORTABLE WITH WHY AND WHEN TO USE COSMOS DB IN CONJUNCTION WITH SPARK, AND WITH THE USE CASES WHERE THAT COMBINATION FITS. SECOND, WE ARE GOING TO WALK YOU THROUGH EVERY SINGLE LINE OF CODE NEEDED TO IMPLEMENT A REAL-TIME ANALYTICS SOLUTION WITH THIS. I’M SURE YOU WILL CODE THIS AFTER THE SESSION. WE WILL ALSO TALK ABOUT A REAL PRODUCTION SCENARIO: THE JOURNEY COCA-COLA TOOK TO OVERCOME THEIR ANALYTICS CHALLENGES USING COSMOS DB AND SPARK. THEY IMPLEMENTED A SOLUTION OVER THE LAST FEW MONTHS AND ARE SUPER HAPPY WITH IT. WE WILL GET INTO THE DETAILS OF IT LATER. LAST BUT NOT LEAST, I’M SURE YOU HAVE HEARD ABOUT THE NEW FEATURES BUILT INTO COSMOS DB. EVEN IF YOU HAVEN’T ATTENDED SCOTT GUTHRIE’S KEYNOTE, THAT IS TOTALLY FINE. WE WILL TALK ABOUT THE FEATURES INTRODUCED, ANALYTICS AND NATIVE SPARK SUPPORT INSIDE COSMOS DB, WHICH CAN ENABLE A LOT OF USE CASES, ESPECIALLY REAL-TIME ANALYTICS SCENARIOS. WE’LL GET INTO DEMOS, AND WE’LL TALK ABOUT THE BENEFITS AND THE VALUE PROPOSITION THAT BRINGS FOR CUSTOMERS AND BUSINESSES.

HAVING SAID ALL OF THAT, LET’S MOVE ON TO A SIMPLE QUESTION. WHY SHOULD WE USE AZURE COSMOS DB AND SPARK TOGETHER? IF YOU THINK ABOUT IT, SPARK IS THE GO-TO ANALYTICAL SOLUTION, DEEMED A PERFECT FIT FOR REAL-TIME ANALYTICS. SPARK CAME ABOUT IN 2014, WHEN IT HAD ITS FIRST RELEASE, I BELIEVE.
BACK THEN, I WAS WORKING AS A DEVELOPER AT ONE OF THE OLDEST BANKS IN THE WORLD, DEVELOPING A FRAUD DETECTION SYSTEM. SPARK WAS A NEW THING BACK THEN. THE FIRST TIME I TRIED IT, IT WAS AMAZING. IT IS BLAZING FAST. IT HELPS YOU COMPUTE OVER YOUR BIG DATA IN A VERY SEAMLESS MANNER, AND YOU CAN DO A NUMBER OF THINGS WITH IT, RIGHT? IT MAKES STREAMING, MACHINE LEARNING AND WHAT HAVE YOU SO SIMPLE.

ON THE OTHER HAND, COSMOS DB IN THIS PICTURE IS NOTHING BUT A NOSQL DATABASE ON THE CLOUD THAT IS OFFERED ON AZURE. THE LIST OF BENEFITS IS LONG, BUT TO SUM IT UP IN ONE SENTENCE: WE OFFER GLOBAL DISTRIBUTION, AND WE ALLOW YOU TO SCALE THROUGHPUT IN AND OUT WHENEVER YOU NEED TO. IT IS SUPER FLEXIBLE, AND ANY CHANGE YOU MAKE IS INSTANTLY REFLECTED. LAST BUT NOT LEAST, THERE ARE THE MULTI-MODEL CAPABILITIES. WHAT THAT MEANS IS, WHETHER YOU COME FROM A STRUCTURED BACKGROUND OR AN UNSTRUCTURED BACKGROUND, IT DOESN’T MATTER. THERE ARE OPTIONS FOR BOTH SCENARIOS. WE HAVE THE SQL API AND THE GREMLIN API FOR GREENFIELD WORKLOADS, AND MIGRATION APIS FOR CUSTOMERS WHO ARE MIGRATING EXISTING WORKLOADS AND ARE PRESENTLY USING NOSQL DATABASES ON-PREM OR SOLUTIONS LIKE CASSANDRA, MONGODB AND WHAT HAVE YOU. THERE ARE APIS FOR EVERY SINGLE SCENARIO THAT YOU COULD THINK OF.

COSMOS DB OFFERS A LOT OF THINGS, AND THIS PRESENTATION CATERS TO THE SCENARIO OF HTAP. PEOPLE HAVE OFTEN THOUGHT OF A DISTINCTION BETWEEN TRANSACTIONAL DATABASES AND ANALYTICAL DATABASES, OLTP VERSUS OLAP. OVER TIME, YOU MIGHT HAVE SEEN THAT THIS LINE IS BLURRING. OLTP AND OLAP ARE CONVERGING NOW INTO HTAP, AND COSMOS DB IS BUILT ON THE PRINCIPLE OF HTAP. THIS IS WHERE YOU HAVE A LAYER THAT YOUR OPERATIONAL DATA SITS ON. AT THE SAME TIME, STORAGE IS NOW READILY AVAILABLE, AND WITH COSMOS DB STORAGE IS NOT YOUR PRIMARY CONCERN. MOST OF THE BILL YOU PAY IS FOR THROUGHPUT, WHICH IS WHAT MATTERS IN LOW-LATENCY SCENARIOS. COSMOS DB ENABLES ALL OF THE SCENARIOS THAT NEED LOW LATENCY AND STUFF LIKE THAT.
I’M NOT GOING TO GET INTO DETAIL ON THAT, BECAUSE THIS SESSION IS MORE ABOUT THE REAL-TIME ANALYTICS THAT IS ENABLED THROUGH COSMOS DB AND SPARK. WE ARE GOING TO STICK TO THIS SCENARIO. IF YOU LOOK ACROSS ALL REALMS OF INDUSTRY, THERE ARE A LOT OF USE CASES THAT CAN USE COSMOS DB AND SPARK IN CONJUNCTION. THERE ARE PEOPLE FROM FINANCE, RETAIL, GAMING, HEALTHCARE, IOT, WEATHER, LOGISTICS, AVIATION AND ALL SORTS OF INDUSTRIES THAT ARE USING COSMOS DB TO PUT THEIR DATA IN AND USE IT IN REAL TIME. WE ARE GOING TO TAKE A FEW EXAMPLES OF THESE, AND I’LL DISCUSS WHAT WE HAVE SEEN CUSTOMERS USE, THE PRIMARY ONES OF COURSE. I DON’T WANT TO SPEND TOO MUCH TIME ON THAT. I WANT TO SPEND AS MUCH TIME AS POSSIBLE ON DEMOS, THE STUFF YOU CAN TAKE BACK AND IMPLEMENT YOURSELF.

NOW, A 5,000-FOOT VIEW: HOW DOES REAL-TIME ANALYTICS WORK WITH APACHE SPARK AND COSMOS DB? LIKE I SAID, ALL THESE INDUSTRIES ARE DUMPING DATA INTO COSMOS DB. THE BASIC QUESTION THAT EVERYONE IS ASKING, FROM A FORTUNE 500 CEO TO A DATA SCIENTIST OR ENGINEER, ALL THE WAY TO A BUSINESS ANALYST OR PRODUCT MANAGER, IS: WHAT CAN WE DO WITH THIS RICH, BIG DATASET THAT WE HAVE IN COSMOS DB? WHAT VALUE CAN WE DERIVE OUT OF THE RICH DATA WE HAVE IN A DATABASE THAT IS HIGHLY AVAILABLE AND ACCESSIBLE AT LOW LATENCY? THE ANSWER IS, THE POSSIBILITIES ARE ENDLESS. ONCE YOU HAVE THE RIGHT DATA AND THE RIGHT COMPUTE FOR YOUR WORK, THERE ARE SO MANY THINGS THAT YOU CAN IMPLEMENT. I’M GOING TO DISCUSS THE USE CASES THAT WE HAVE SEEN ACROSS THESE INDUSTRIES IN A FEW MINUTES.

THE ARCHITECTURE FLOW IS BASICALLY THIS: ALL OF THE DATA IS INGESTED AND STORED IN COSMOS DB. COSMOS DB IS A VERY LOW-LATENCY DATABASE. IT CATERS TO YOUR LATENCY NEEDS, AND YOU CAN CUSTOMIZE IT FOR THEM. ONCE THE DATA IS STORED INSIDE COSMOS DB, THERE IS A VERY USEFUL, HANDY, NIFTY FEATURE OF COSMOS DB CALLED CHANGE FEED. TO SIMPLY DEFINE WHAT CHANGE FEED IS: IT IS NOTHING BUT A BACKEND COMMIT LOG ON TOP OF COSMOS DB. IT SHOWS YOU WHAT DATA IS CHANGING IN REAL TIME. IF YOU WANT TO SEE WHAT IS HAPPENING IN YOUR COSMOS DB ACCOUNT IN REAL TIME, THIS COMES OUT OF THE BOX. YOU CAN SEE WHATEVER IS CHANGING, AS IT CHANGES.
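The commit-log idea described above can be illustrated with a small in-memory model. This is only a toy sketch under stated assumptions, not how Cosmos DB implements change feed: the class name `ToyChangeFeed`, its methods, and the integer continuation token are all hypothetical, chosen to show how a reader can resume from where it left off and see only what changed since.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ToyChangeFeed:
    """Hypothetical in-memory stand-in for a change feed: an ordered log of writes."""

    _log: List[Dict[str, Any]] = field(default_factory=list)

    def upsert(self, doc: Dict[str, Any]) -> None:
        # Every insert or update appends the new version of the document to the log.
        self._log.append(dict(doc))

    def read_changes(self, continuation: int = 0) -> Tuple[List[Dict[str, Any]], int]:
        # Return everything written since `continuation`, plus the token to resume from.
        changes = self._log[continuation:]
        return changes, len(self._log)


feed = ToyChangeFeed()
feed.upsert({"id": "t1", "amount": 25.0})
feed.upsert({"id": "t2", "amount": 90.0})

# First read starts from the beginning and sees both documents.
first_batch, token = feed.read_changes()

# A later update to an existing document shows up as a new change.
feed.upsert({"id": "t1", "amount": 30.0})
second_batch, token = feed.read_changes(token)
```

A consumer that stores `token` between reads only ever processes new changes, which is the property a streaming engine relies on when it uses the feed as a source.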
WITH THIS RICH STREAM OF CHANGING DATA FROM CHANGE FEED, YOU CAN SOURCE ALL OF THE EVENTS WITH SPARK STREAMING, AND ONCE THE DATA LANDS IN SPARK, THERE ARE A NUMBER OF THINGS YOU CAN DO WITH IT: MACHINE LEARNING, PRESCRIPTIVE ANALYTICS, DESCRIPTIVE ANALYTICS AND WHAT HAVE YOU. THE WHOLE NINE YARDS. FROM THERE, YOU CAN DO A COUPLE OF THINGS. IF YOU WANT TO STORE VIEWS, MATERIALIZED VIEWS AND SMALL VIEWS THAT YOU WANT TO HAVE FOR PRECOMPUTED BATCHES, YOU CAN DO THAT AND DUMP THEM BACK INTO COSMOS DB. IF YOU STRICTLY WANT TO SATISFY ANALYTICS REQUIREMENTS, YOU CAN SERVE THOSE AND ENABLE THE BUSINESS TO MAKE KEY DECISIONS BASED ON THE INFORMATION THAT IS COMING INTO THE DATABASE IN REAL TIME.

NOW, BEFORE I MOVE ON, THERE ARE TWO THINGS THAT WE ARE GOING TO BE TALKING ABOUT IN THIS SESSION: THE BASIC CLASSIFICATION OF REAL-TIME ANALYTICAL PROCESSING ARCHITECTURES. THE TWO THAT THE INDUSTRY HAS SEEN ARE KAPPA ARCHITECTURE AND LAMBDA ARCHITECTURE. WE’LL TALK ABOUT WHAT THESE TWO ARE. KAPPA IS NOTHING BUT ONE OR MORE REAL-TIME STREAMS INGESTING DATA INTO THE COMPUTE FRAMEWORK, WHICH COULD BE SPARK OR ANYTHING ELSE. KAPPA ARCHITECTURE SEEMS PRETTY SIMPLE, BECAUSE YOU HAVE STREAMS, YOU ARE PROCESSING SOMETHING IN THE COMPUTE LAYER, AND YOU OUTPUT SOMETHING THAT MIGHT ENABLE A BUSINESS TO MAKE KEY DECISIONS, OR MIGHT JUST BE INFORMATION OR ANALYSIS OF YOUR DATA AS IT IS COMING IN. BUT OVER TIME, WHAT YOU SEE IS THAT MOST KAPPA ARCHITECTURES TURN INTO SOMETHING CALLED LAMBDA ARCHITECTURE, WHICH IS NOTHING BUT YOUR STREAMS IN CONJUNCTION WITH YOUR BATCH HISTORICAL DATA. BOTH COME TOGETHER TO SUPPLEMENT, OR SOMETIMES EVEN COMPLEMENT, EACH OTHER TO HELP BUSINESSES OR CONSUMERS MAKE MORE INFORMED DECISIONS. I’LL TALK MORE ABOUT THAT AS WE DELVE INTO THE LOW-LEVEL DETAILS.

NOW, THE BASIC USE CASES THAT WE SEE IN THE INDUSTRY AMONG COSMOS DB CUSTOMERS. RETAIL IS A VERY CLASSIC EXAMPLE. YOU HAVE ALL OF THE CONSUMER BEHAVIOR DATA: THE CUSTOMER TRANSACTIONS AND THE CUSTOMER PROFILE.
THE CUSTOMER PROFILE ISN’T SOMETHING THAT CHANGES A LOT. IT IS STATIC DATA. SOMEONE’S AGE CHANGES ONCE A YEAR. SOMEONE’S NAME WILL NEVER CHANGE. ON THE OTHER HAND, YOU HAVE THIS BATCH VIEW OF CUSTOMER TRANSACTIONS, AND BY SEEING WHAT CUSTOMERS ARE BUYING, YOU CAN PREDICT AND RECOMMEND PRODUCTS TO THEM BASED ON WHATEVER RETAIL SUBSECTOR YOU ARE IN. NOW, IF YOU HAVE REAL-TIME DATA COMING IN WITH TRANSACTIONS, AND SOMEONE IS BUYING SOMETHING, AND THAT TRANSACTION HAS PARAMETERS LIKE, OH, THIS PERSON IS IN DOWNTOWN SEATTLE, WHY NOT SEND THEM A PRODUCT RECOMMENDATION FOR NEARBY STORES? THEY MIGHT BE INTERESTED IN BUYING SOMETHING, SINCE THEY ARE ALREADY SHOPPING. SUCH USE CASES ARE A CLASSIC EXAMPLE OF RETAIL ANALYTICAL PROCESSING SCENARIOS, AND SPARK AND COSMOS DB WORK TOGETHER TO ENABLE THEM.

FINANCE IS ANOTHER CLASSIC EXAMPLE. YOU HAVE BANKING TRANSACTION DATA THAT DEFINES A CUSTOMER’S BEHAVIOR OVER THE LAST 10, 15, 20 YEARS, HOWEVER LONG YOU HAVE BEEN STORING CUSTOMER DATA FOR. WITH THAT RICH DATASET, YOU CAN UNDERSTAND HOW SOMEONE PERFORMS TRANSACTIONS AND ESTABLISH THE NORMAL BEHAVIOR OF THAT CUSTOMER. IF YOU ARE TRYING TO BUILD A FRAUD DETECTION SYSTEM, IT BECOMES SUPER EASY, BECAUSE YOU HAVE REAL-TIME TRANSACTIONS COMING IN, AND THEN YOU HAVE THE PAST 10 OR 15 YEARS’ WORTH OF DATA THAT SHOWS WHAT NORMAL TRANSACTION BEHAVIOR IS FOR THAT CUSTOMER. SAY A TRANSACTION COMES IN WITH AN AMOUNT OF $50,000, AND YOU SEE THAT THE CUSTOMER HAS ONLY MADE TRANSACTIONS UNDER $100 IN THE LAST TEN YEARS. THAT SEEMS SUSPICIOUS. YOU CAN COMBINE THESE TWO SOURCES OF INFORMATION AND FEED A FRAUD DETECTION MODEL THAT FLAGS TRANSACTION FRAUD. THAT IS ONE USE CASE, AND THERE ARE MANY NUANCES TO FLAGGING A TRANSACTION. THEN AGAIN, THERE ARE SO MANY THINGS THAT YOU CAN DO WITH IT.

GAMING IS A USE CASE THAT IS VERY SIMILAR TO RETAIL, BECAUSE YOU HAVE A LOT OF INFORMATION ABOUT THE CUSTOMER, WHO IN THIS CASE IS A GAMER. THE KEY USE CASE THAT WE HAVE SEEN WITH COSMOS DB AND SPARK IN THIS INDUSTRY IS MONETIZATION OF GAMES. DURING THE PLAYER’S EXPERIENCE, YOU WANT TO SHOW THEM RECOMMENDATIONS THAT ARE AS RELEVANT AS YOU CAN MAKE THEM, USING THEIR PAST EXPERIENCE AND WHAT THEY ARE CURRENTLY DOING IN THEIR GAMES. THESE ARE VERY CUSTOMER-CENTERED USE CASES.

THERE IS SOMETHING I FIND VERY INTERESTING ABOUT THIS SLIDE. EACH OF THESE INDUSTRIES HAS A VERY NICHE MARKET OR A VERY NICHE USE CASE. IN HEALTHCARE, WE ARE LITERALLY SAVING LIVES BY GIVING CONSUMERS OR HOSPITALS THE ABILITY AND THE INFRASTRUCTURE TO TRIGGER ALERTS BASED ON PAST HEALTH PATTERNS THAT WE HAVE OBSERVED AMONG OTHER PATIENTS, RIGHT? YOU HAVE GOT ALL OF THE PATIENT DATA IN THE WORLD.
AS SOON AS SOMEONE ELSE COMES IN AND GETS HOOKED UP FOR THEIR VITALS AND PULSE, YOU CAN LOOK AT PAST DATA FROM THE LAST TIME SOMEONE WAS A CARDIAC ARREST PATIENT, AND YOU SEE THE PATTERNS THAT LED UP TO THAT CARDIAC ARREST. YOU CAN DEDUCE WHAT TRIGGERS THAT SORT OF BEHAVIOR USING MACHINE LEARNING MODELS IMPLEMENTED IN SPARK. YOU CAN USE THAT TO TRIGGER ALERTS THAT NOTIFY STAFF LIKE DOCTORS AND NURSES: SOMETHING IS GOING ON HERE, YOU MIGHT WANT TO CHECK ON THAT PERSON. IT IS LITERALLY SAVING LIVES.

ON THE MANUFACTURING SIDE, YOU HAVE ANOTHER VERY INTERESTING USE CASE. WITH THE ADVENT OF SENSORS AND IOT, YOU HAVE GOT EXPENSIVE MACHINES LIKE DRILLING MACHINES. IF YOU ARE IN OIL AND GAS, YOU ARE DRILLING WITH THESE EXPENSIVE MACHINES THAT COST THOUSANDS OF DOLLARS EVERY MINUTE TO OPERATE. IF YOU START DRILLING FOR OIL IN THE WRONG DIRECTION AND YOU ARE NOT AWARE OF IT, EVERY SECOND THAT YOU SPEND DRILLING IN THE WRONG DIRECTION, YOU ARE WASTING THOUSANDS OF DOLLARS. NOW THINK ABOUT SENSORS BEING ATTACHED TO THE DRILLING MACHINES. IF YOU COULD SOMEHOW LEVERAGE THE SENSORS TO PROVIDE YOU REAL-TIME FEEDBACK, AND COUPLE THAT WITH PAST DATA TO IDENTIFY WHETHER YOU ARE GOING IN THE WRONG DIRECTION, THAT REALLY HELPS YOU SAVE COSTS. YOU CAN STOP DRILLING RIGHT AWAY, BACK OFF FROM THAT AREA, AND NOT WASTE COST AND EFFORT.

LOGISTICS IS ANOTHER USE CASE. IT IS MORE OF AN UMBRELLA THAT COVERS A LOT OF INDUSTRIES AS WELL. THE PRIMARY USE CASE IS A STANDARD COMPUTER SCIENCE PROBLEM. THINK ABOUT ALGORITHMS FOR OPTIMIZING A ROUTE FROM A TO B. A LOT OF VARIABLES COME INTO PLAY HERE. SAY YOU HAVE A PLANE OR A SHIP OR A TRUCK ON THE STREET. IT COULD BE THE WEATHER, THE WIND CONDITIONS OR THE TRAFFIC ON THE STREET THAT IS REALLY AFFECTING YOUR DELIVERY FROM A TO B. ALL OF THESE AFFECT THE SHIPMENT. RIGHT?
AND A LARGE PART OF IT IS DETERMINED BY REAL-TIME DATA, WHICH IS COMING IN THROUGH WEATHER REPORTS OR WHATEVER YOUR SOURCE IS. YOU CAN AGAIN USE SPARK AND COSMOS DB IN CONJUNCTION TO OPTIMIZE THE ROUTE AND MAKE INFORMED DECISIONS ABOUT LOGISTICS ROUTE OPTIMIZATION. THIS IS A VERY GOOD USE CASE THAT YOU WILL SEE A LOT OF LOGISTICS COMPANIES USE IN THE FUTURE AS WELL.

NOW, I MENTIONED THAT THERE ARE TWO MAIN DATA PROCESSING ARCHITECTURES WHEN IT COMES TO REAL-TIME DATA. KAPPA IS A VERY SMALL BUCKET OF WHAT IS USED IN THE INDUSTRY, BECAUSE MOST OF THE TIME IT ENDS UP CONVERTING TO LAMBDA ARCHITECTURE IN THE LONG RUN. KAPPA BY ITSELF IS NOT GOOD ENOUGH TO PROVIDE A HOLISTIC VIEW OF THE CUSTOMER’S JOURNEY AND THE IMPACT THAT EACH TRANSACTION IS MAKING, WHICH IS WHY YOU NEED TO COUPLE IT WITH LONG-TERM DATA AND DEFINE THE CONSUMER’S BEHAVIOR WITH BATCH VIEWS OF THE PAST HISTORICAL TRANSACTIONS, AND THE DEVIATIONS IN THEIR BEHAVIOR AS WELL. SO WE ARE GOING TO FOCUS ON LAMBDA ARCHITECTURE AT THIS POINT. LAMBDA IS THE BIGGER USE CASE, AND IT COVERS THE ASPECTS OF KAPPA.

LET’S LOOK AT HOW LAMBDA ARCHITECTURE HAS BEEN IMPLEMENTED IN THE PAST, WHAT IS WRONG WITH IT AND WHAT IS RIGHT WITH IT. TYPICALLY, NEW DATA COMES IN, AND YOU HAVE TO MULTICAST THAT NEW DATA INTO TWO LAYERS. THERE ARE THREE MAIN COMPONENTS IN ANY LAMBDA ARCHITECTURE: THE BATCH LAYER, THE SPEED LAYER AND THE SERVING LAYER. THE BATCH LAYER HAS THE MASTER DATASET, WHICH HELPS YOU CREATE A PRECOMPUTED VIEW THAT CAN BE QUERIED WHENEVER YOU NEED AN OPTIMIZED VIEW OF THE ENTIRE SET. YOUR SERVING LAYER IS NOTHING BUT THE LAYER THAT HOLDS THAT PRECOMPUTED VIEW, SO THAT IT CAN BE QUERIED WHENEVER YOU NEED IT.

LAMBDA ARCHITECTURE IS IMPORTANT BECAUSE, LIKE I SAID, YOU NEED THE HOLISTIC VIEW OF EVERYTHING. BUT THERE ARE A LOT OF LIMITATIONS WITH LAMBDA ARCHITECTURE. I SAW SOME OF THEM WHEN IMPLEMENTING THIS ARCHITECTURE IN THE EARLY 2010S, WHEN SPARK WAS NEW. LET ME START WITH THE BATCH LAYER, SINCE THAT IS CLOSER TO THE ACTUAL START OF THE PROCESS. THE BATCH LAYER NEEDS TECHNOLOGY THAT IS VERY ROBUST AND STABLE. MOSTLY, PEOPLE ENDED UP USING HADOOP OR HDFS FOR THE BATCH LAYER, BECAUSE A LOT OF THE MASTER DATASET IS RAW DATA STORED IN ITS RAW FORM, RIGHT? IN THE SPEED LAYER, PEOPLE HAD APACHE STORM OR APACHE SPARK, WHICH ARE FAST COMPUTE ENGINES THAT CAN PROCESS REAL-TIME DATA AND SOURCE IT WHENEVER YOU NEED IT. FOR THE SERVING LAYER IN THE CONVENTIONAL ARCHITECTURE, BACK IN THE EARLY 2010S, PEOPLE HAD STUFF LIKE APACHE HBASE OR ANY SORT OF FAST LOOKUP-TABLE DATABASE THAT CAN KEEP DATA READILY AVAILABLE TO BE QUERIED AND COUPLED WITH THE REAL-TIME VIEW. THE CHALLENGE WITH HAVING SO MANY COMPLEX TECHNOLOGIES IN THIS ONE FRAME IS THAT THERE ARE TOO MANY MOVING PARTS.
IT BECOMES A LOT OF OVERHEAD. FOR ME AND THE OTHER PEOPLE THAT I WORKED WITH, AND AS THE TRENDS IN THE INDUSTRY SHOW, THIS WAS A HUGE, HUGE BLOCKER. IT WAS A HUGE DEVELOPER CONCERN.

HOW DID WE RE-ARCHITECT THIS? SIMPLE. WE DON’T HAVE TO MULTICAST ANYTHING TO TWO DIFFERENT LAYERS. WE USE THE CHANGE FEED THAT I SPOKE OF EARLIER. YOU ONLY HAVE TO INGEST THE DATA INTO COSMOS DB. SINCE IT HAS A FEATURE CALLED CHANGE FEED, WHICH IS NOTHING BUT THE BACKEND COMMIT LOG OF THE DATA COMING IN, YOU CAN USE THE CHANGE FEED AS THE SOURCE FOR YOUR SPEED LAYER. YOU ARE ONLY CASTING TO ONE LAYER, WHICH IS THE BATCH LAYER, AND CHANGE FEED TAKES CARE OF THE REST. YOU HOOK UP CHANGE FEED WITH SPARK TO STREAM THE REAL-TIME DATA COMING IN, AND YOUR SPEED LAYER IS READY. JUST LIKE THAT.

THE BATCH LAYER AND SERVING LAYER ARE QUITE SIMILAR TO WHAT THEY WERE IN THE CONVENTIONAL LAMBDA ARCHITECTURE. THIS IS PRETTY MUCH THE SAME THING. THE MASTER DATASET RESIDES IN THE BATCH LAYER. SPARK PRECOMPUTES A VIEW AND PUSHES IT TO THE SERVING LAYER. THE SERVING LAYER IS A FAST LOOKUP TABLE OR COLLECTION FOR YOUR HISTORICAL WINDOW OF DATA. SAY YOU HAVE A RETAIL REAL-TIME ANALYTICS USE CASE. YOU DON’T WANT TO HAVE TEN YEARS’ WORTH OF DATA IN THE PRECOMPUTED BATCH VIEW. YOU TAKE A ONE-YEAR WINDOW AND PUSH THAT TO THE SERVING LAYER TO GET FASTER LOOKUPS. ONCE THAT HAPPENS, AND THE REAL-TIME VIEW, AS WE DISCUSSED, IS ALREADY READY, YOU COMBINE THE REAL-TIME COMPUTED VIEW FROM THE SPEED LAYER WITH THE SERVING LAYER DATA, AND THAT GIVES YOU A HOLISTIC VIEW OF WHAT HAPPENED IN THE PAST AND WHAT IS HAPPENING RIGHT NOW. YOU CAN MAKE INFORMED DECISIONS BASED ON THAT. YOU HAVE SO MANY THINGS THAT YOU COULD DO.

NOW LET’S SUMMARIZE WHAT IS HAPPENING HERE. THIS IS LAMBDA ARCHITECTURE, RE-ARCHITECTED AFTER REMOVING SOME MOVING PARTS. NEW DATA COMES INTO A SINGLE LAYER, COSMOS DB. USING THE CHANGE FEED, REAL-TIME DATA GOES INTO SPARK, WHERE THE REAL-TIME VIEW IS COMPUTED.
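The step of combining the precomputed serving-layer view with the speed layer's real-time view can be sketched in plain Python. This is a minimal illustration under assumptions, not the session's actual code: the function names, the `(customer_id, amount)` input shape, and the per-customer count and total aggregates are all hypothetical, and in practice Spark would compute both views.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

View = Dict[str, Dict[str, float]]


def compute_view(transactions: Iterable[Tuple[str, float]]) -> View:
    """Aggregate (customer_id, amount) pairs into per-customer count and total."""
    view = defaultdict(lambda: {"count": 0, "total": 0.0})
    for customer_id, amount in transactions:
        view[customer_id]["count"] += 1
        view[customer_id]["total"] += amount
    return dict(view)


def merge_views(batch_view: View, realtime_view: View) -> View:
    """Combine the precomputed batch view with the speed layer's running aggregates."""
    empty = {"count": 0, "total": 0.0}
    merged = {}
    for customer_id in set(batch_view) | set(realtime_view):
        b = batch_view.get(customer_id, empty)
        r = realtime_view.get(customer_id, empty)
        merged[customer_id] = {
            "count": b["count"] + r["count"],
            "total": b["total"] + r["total"],
        }
    return merged


# Batch layer: computed offline over the historical window.
batch_view = compute_view([("alice", 40.0), ("alice", 60.0), ("bob", 15.0)])
# Speed layer: computed over events that arrived after the last batch run.
realtime_view = compute_view([("alice", 20.0), ("carol", 5.0)])

combined = merge_views(batch_view, realtime_view)
```

Counts and sums merge cleanly because they are additive; an aggregate such as a median would need a different representation in each view before the two could be combined.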
EXISTING DATA IS PRECOMPUTED USING SPARK AND MADE AVAILABLE IN THE SERVING LAYER. COMBINING THE TWO VIEWS, YOU CAN DO SO MANY THINGS, BECAUSE YOU CAN COMBINE THE PAST PATTERNS OF CONSUMERS, CUSTOMERS, WHAT HAVE YOU, AND COUPLE THAT WITH YOUR REAL-TIME DATA, WHICH TRIGGERS A LOT OF THE EVENTS, RIGHT?

NOW, ONE THING THAT I DO WANT TO CLARIFY ABOUT COSMOS DB, BACK TO THE BASICS ONCE AGAIN. I SAID THAT IT IS A NOSQL DATABASE ON THE CLOUD. WHEN I SAY COSMOS DB IS A NOSQL DATABASE, IT IS NOT A "NO STRUCTURED QUERY LANGUAGE" DATABASE. THAT IS A MISCONCEPTION THAT PEOPLE HAVE, AND WE SHOULD CLARIFY IT MORE OFTEN. WHEN I SAY COSMOS DB IS NOSQL, IT MEANS A "NOT ONLY SQL" DATABASE. WHAT THAT BASICALLY MEANS IS THAT YOU CAN HAVE UNSTRUCTURED AND STRUCTURED DATA ON COSMOS DB. WE DO NOT CONSTRAIN YOU TO ANY MODEL. THERE ARE SO MANY MODELS AND APIS THAT YOU CAN LEVERAGE.

HAVING SAID THAT, COSMOS DB OFFERS BENEFITS. THERE IS GLOBAL DISTRIBUTION, WHICH WE ARE GOING TO TALK ABOUT, ALONG WITH HOW SPARK LEVERAGES THAT DISTRIBUTION IN AN EFFICIENT MANNER. WE’LL TALK ABOUT THAT IN A FEW MINUTES. WE WILL ALSO TALK ABOUT THE LOW-LATENCY ASPECTS AND HOW THEY HELP WITH REAL-TIME ANALYTICS USE CASES WITH COSMOS DB AND SPARK.

NOW, I DON’T WANT THIS TO BE ALL ABOUT ARCHITECTURE DISCUSSIONS. WE WANT TO DELVE INTO THE DEMOS AS WELL. SO I’M GOING TO HAND OVER TO MY COLLEAGUE SRI. SHE IS GOING TO TALK ABOUT A FRAUD DETECTION USE CASE THAT WE ARE GOING TO START FROM SCRATCH AND IMPLEMENT. YOU WILL SEE HOW IT IS DONE. IT IS SUPER EASY. SO SRI, OVER TO YOU.

>> HELLO, CAN EVERYONE HEAR ME BACK THERE? AWESOME. THANK YOU, ROHAN. THAT WAS GREAT. SO AS ROHAN WAS SAYING, THERE ARE ENDLESS POSSIBILITIES WHERE YOU CAN LEVERAGE COSMOS DB WITH SPARK. THERE ARE TRILLIONS OF THINGS YOU CAN DO. YOU CAN SOLVE A BUNCH OF DATA SCIENCE PROBLEMS. YOU CAN USE SPARK TO DO FANCY ML STUFF. YOU CAN DO COMPLEX AGGREGATIONS AND SO FORTH. IN THE NEXT FEW MINUTES, WHAT I AM GOING TO FOCUS ON IS HOW YOU CAN DO THIS.

AS YOU HEARD, WE ANNOUNCED SOME EXCITING NEW STUFF. WE NOW HAVE SPARK BUILT INTO COSMOS DB, AND YOU WILL HEAR MORE ABOUT IT FROM MY COLLEAGUE ROHAN. BEFORE THIS, MOVING BACK A FEW YEARS, WE SAW AN INDUSTRY TREND WHERE THE INDUSTRY NO LONGER WANTS TO SEPARATE OLTP AND OLAP CAPABILITIES. THEY WANT A DATABASE WHICH IS SUPER FAST, WHICH IS SUPER GOOD FOR HTAP CAPABILITIES.
BUT IT SHOULD ALSO HAVE THE POWER TO RUN AGGREGATIONS AND TO DRAW MEANINGFUL BUSINESS INSIGHTS. THAT IS WHEN WE DECIDED TO GO AHEAD AND BUILD OUR OWN LIBRARY, OUR OWN CONNECTOR, WHICH WE CALL THE SPARK COSMOS DB CONNECTOR. IT IS WRITTEN IN JAVA, AND IT IS AN OPEN SOURCE CONNECTOR. IF YOU GOOGLE IT, YOU WILL BE ABLE TO GO AND SEE OUR SOURCE CODE. WE WROTE THIS LIBRARY ABOUT TWO YEARS AGO. BEFORE I JUMP IN AND SHOW YOU THE DEMO, WHAT I AIM TO DO IS SHOW HOW YOU CAN USE THE COSMOS DB SPARK CONNECTOR TO DO THE TWO BASIC FLAVORS OF PROCESSING: BATCH READS AND BATCH WRITES, AND STREAM READS AND STREAM WRITES.

HOW DOES THE COSMOS DB SPARK CONNECTOR WORK? FIRST OF ALL, THE CONNECTOR GIVES YOU THE ABILITY TO POSITION COSMOS DB AS EITHER THE SOURCE OR THE SINK OF YOUR SPARK JOBS. WHEREVER YOU RUN SPARK, DATABRICKS, HDINSIGHT, WHEREVER IT IS, YOU FIRST GO AND INSTALL THE COSMOS DB CONNECTOR LIBRARIES. YOU CAN DO THIS THROUGH THE MAVEN COORDINATES OR BY DOWNLOADING THE UBER JAR. I’LL SHOW YOU HOW TO DO THAT AS WELL. ONCE YOU HAVE THE CONNECTOR INSTALLED, THE NEXT THING YOU DO IS SPECIFY THE READ OR WRITE CONFIG. YOU ARE PRETTY MUCH SAYING WHICH COSMOS DB CONTAINER YOU ARE CONNECTING TO. ONCE YOU HAVE THE CONFIG IN PLACE, WE ESTABLISH A CONNECTION FROM THE SPARK MASTER NODE TO YOUR COSMOS DB GATEWAY NODE. WHEN THIS CONNECTION IS ESTABLISHED, COSMOS DB PASSES BACK THE PARTITION MAP OF THE DATA. ONCE YOU HAVE THIS PARTITION MAP, SPARK CAN SEE WHAT DATA IT NEEDS TO ANSWER THE QUERY. AND SINCE WE HAVE THE PARTITION MAP, WE PUSH THIS INFORMATION DOWN TO THE WORKER NODES, SO THAT THE WORKER NODES CAN TALK DIRECTLY TO THE COSMOS DB DATA NODES. AS YOU KNOW, SPARK IS A HIGHLY PARALLEL DISTRIBUTED SYSTEM.
AND COSMOS DB IS A HORIZONTALLY DISTRIBUTED DATABASE. WHEN YOU HAVE THESE WORKER NODES TALKING TO THESE DIFFERENT DATA NODES, OBVIOUSLY YOUR PROCESSING IS GOING TO BE MUCH FASTER.

OKAY. ROHAN SPOKE ABOUT A LOT OF INDUSTRY SCENARIOS AND VERTICALS WHERE THIS IS BEING USED. FOR TODAY’S DEMO, LET’S TAKE FINANCE AS AN EXAMPLE. IN FINANCE, WE HAVE A NUMBER OF E-COMMERCE SITES, AND E-COMMERCE IS ON THE RISE. I THINK I HAVE EVEN STOPPED GOING TO THE GROCERY STORE. NOW YOU HAVE FRESH AND TONS OF OTHER STUFF. THE MAIN PROBLEM WITH ONLINE TRANSACTIONS IS ONLINE FRAUD. THIS COULD BE CREDIT CARD IMPOSTERS, STEALING OF CREDENTIALS, A BUNCH OF STUFF.

FOR TODAY’S SCENARIO, LET’S SAY YOU HAVE A CERTAIN BANK, WOODGROVE BANK, AND THEY ARE BUILDING AN ONLINE PAYMENT PROCESSING SYSTEM. THEIR CUSTOMERS AND END USERS ARE MERCHANTS. THIS COULD BE ANY ONLINE SITE LIKE HATCHET.COM, EXPRESS.COM, WHATEVER YOU WANT. THEY HAVE MERCHANTS ALL AROUND THE WORLD, AND WHAT THEY AIM TO DO IS PROVIDE A REAL-TIME FRAUD DETECTION SYSTEM THAT BLOCKS FRAUDULENT TRANSACTIONS AS AND WHEN THEY COME IN. FOR EXAMPLE, YOU HAVE A CUSTOMER WHO IS TRYING TO MAKE AN ONLINE PURCHASE, AND THIS SYSTEM AIMS TO DETECT WHETHER THIS TRANSACTION IS FRAUDULENT OR NOT, USING SOME ADVANCED ANALYTICS AND SOME MACHINE LEARNING MODELS.

SO HOW DO YOU GO ABOUT CREATING SUCH A SYSTEM? WE HAVE A FEW REQUIREMENTS. JUST TO POINT OUT, I DID SAY WOODGROVE BANK HAS CUSTOMERS ALL AROUND THE WORLD. ONE OF THE REQUIREMENTS IS THAT THE FRAUD DETECTION SYSTEM NEEDS TO BE HIGHLY RESPONSIVE. THAT MEANS YOU SHOULD BE ABLE TO DISTRIBUTE YOUR DATA AS CLOSE AS POSSIBLE TO WHEREVER YOUR USERS ARE. I’M SURE THAT GIVES YOU AN IDEA. COSMOS DB SUPPORTS TURNKEY GLOBAL DISTRIBUTION. THAT MEANS ONCE YOU LAND YOUR TRANSACTIONS OR YOUR DATA IN COSMOS DB, IT CAN BE SEAMLESSLY REPLICATED TO USERS THROUGHOUT THE WORLD. THAT MAKES COSMOS DB A GREAT FIT AS AN INGEST STORE.

BEFORE I DO THE EXAMPLE, I’M GOING TO QUICKLY WALK YOU THROUGH THE ARCHITECTURE OF HOW THIS WOULD LOOK FOR A FRAUD DETECTION SOLUTION. YOU HAVE YOUR PAYMENT TRANSACTIONS OR YOUR STREAMING EVENTS THAT COME IN. WE ARE GOING TO LAND ALL OF THESE EVENTS IN COSMOS DB, SO THIS IS GOING TO BE THE STREAMING INGEST STORE. AS THE TRANSACTIONS KEEP COMING IN, YOU HEARD SOME INFORMATION ABOUT THE CHANGE FEED.
SO WE HAVE A FEATURE CALLED CHANGE FEED, WHICH ALLOWS YOU TO STREAM EVENTS OR UPDATES AS AND WHEN THEY HAPPEN. USING THE CHANGE FEED AND SPARK STRUCTURED STREAMING, YOU CAN LAND ALL OF THESE TRANSACTIONS IN SPARK. NOW THAT YOU HAVE THESE REAL-TIME EVENTS LANDING IN SPARK, YOU CAN DO SEVERAL THINGS.

THE FIRST REQUIREMENT IS TO BE ABLE TO BLOCK THESE TRANSACTIONS IN REAL TIME. THE WAY THAT WOODGROVE BANK DID THIS IS THAT THEY HAD SOME HISTORICAL TRANSACTIONS, SOME CLEAN DATA IN A CSV FILE, WHICH THEY USED TO TRAIN A MODEL. THEY TRAINED A MODEL AND DEPLOYED IT IN SPARK. THEY USED SPARK ML OR AZURE ML, AND YOU CAN USE DIFFERENT ML SERVICES IN SPARK TO TRAIN THE MODEL. NOW THAT THE MODEL IS DEPLOYED, AS THESE NEW TRANSACTIONS COME IN, WE DO REAL-TIME SCORING OF THE TRANSACTIONS AGAINST THIS PRE-TRAINED MODEL TO DETECT WHETHER EACH ONE IS FRAUDULENT OR NOT. BASED ON WHETHER IT IS FRAUDULENT OR NOT, OR HOW HIGH THE SCORE IS, YOU CAN HOOK INTO WEB SERVICES OR A WEB API TO BLOCK THIS TRANSACTION, REPORT IT TO THE MERCHANT AS A SUSPICIOUS TRANSACTION, OR TAKE FURTHER ACTIONS. THIS IS THE REAL-TIME ASPECT OF IT.

BUT AS YOU KNOW, AS THE TRANSACTIONS KEEP COMING OVER THE YEARS, YOU HAVE MORE NEW SETS OF DATA. YOU CANNOT KEEP USING THE SAME OLD MODEL. YOU HAVE A REQUIREMENT TO RETRAIN THIS MODEL. FOR EXAMPLE, I WAS IN AUSTRALIA A FEW MONTHS AGO, AND THE MERCHANT BLOCKED MY TRANSACTION. IT WAS LIKE, HEY, WE SEE A NEW TRANSACTION COMING FROM ACROSS THE WORLD. IT MIGHT NOT BE YOU. IT WAS FLAGGED AS SUSPICIOUS, BUT IT WAS NOT NECESSARILY A SUSPICIOUS TRANSACTION. THIS ALSO MEANS YOU WILL HAVE FEEDBACK COMING IN FROM THE MERCHANTS AND FROM THE BANK ITSELF ON THE CREDIBILITY OF THE MODEL’S PREDICTIONS. ONCE YOU HAVE THIS FEEDBACK IN THERE, WOODGROVE BANK WILL SCHEDULE SOME OFFLINE BATCH PROCESSING TO DO SEVERAL THINGS. THE FIRST THING IS TO RETRAIN THE MODEL DEPENDING ON THE FEEDBACK.
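The real-time scoring and routing step described above can be sketched as follows. This is a deliberately simplified illustration, not the model used in the demo: `fraud_score` is a hand-written heuristic standing in for a trained Spark ML model, and the thresholds, action names and transaction field names are all hypothetical.

```python
from typing import List

# Hypothetical cut-offs; a real system would tune these against labeled feedback.
BLOCK_THRESHOLD = 0.9
REVIEW_THRESHOLD = 0.5


def fraud_score(amount: float, past_amounts: List[float]) -> float:
    """Stand-in for a trained model: a score in [0, 1] based on the customer's history."""
    if not past_amounts:
        return 0.5  # no history: not trusted outright, not blocked outright
    typical_max = max(past_amounts)
    if amount <= typical_max:
        return 0.0
    # Approaches 1 as the amount exceeds the historical maximum by a larger factor.
    return 1.0 - typical_max / amount


def route(score: float) -> str:
    """Map a score to one of the actions mentioned in the talk."""
    if score >= BLOCK_THRESHOLD:
        return "block"
    if score >= REVIEW_THRESHOLD:
        return "report_to_merchant"
    return "allow"


# A customer who has only ever made transactions under $100.
history = [20.0, 45.0, 99.0]
transaction = {"transactionId": "t-001", "countryCode": "AU", "amount": 50000.0}

decision = route(fraud_score(transaction["amount"], history))
```

Keeping scoring and routing as separate functions means the heuristic can later be replaced by a call to a deployed model without touching the code that decides what to do with the score.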
THE RETRAINING ALSO DEPENDS ON THE NEW TRANSACTIONS THAT CAME IN. THE SECOND THING IS TO BUILD THE VIEWS. WE HAVE DIFFERENT TYPES OF TRANSACTIONS THAT COME IN, AND WOODGROVE BANK WANTS TO PROVIDE NICE AGGREGATED VIEWS TO THE MERCHANT, LIKE: HEY, BY THE WAY, YOU HAVE THE MOST TRANSACTIONS COMING IN FROM COUNTRY Z. OR, THIS PARTICULAR CUSTOMER HAD THIS MANY TRANSACTIONS OVER A PERIOD OF ONE YEAR, SO THIS IS YOUR TOP CUSTOMER. OR, TO DO FURTHER ANALYSIS, WE CAN BUILD NICE MATERIALIZED VIEWS THAT THE MERCHANT CAN BUILD DASHBOARDS AND REPORTS ON TOP OF, IN A NICE, FAST, REAL-TIME MANNER.

SO NOW, LET’S TAKE A DEMO OF HOW THIS SYSTEM IS IMPLEMENTED. TO START OFF, THE FIRST QUESTION IS, WHERE DO I GET MY SPARK CONNECTOR? THERE ARE TWO WAYS TO DO THIS. YOU CAN POINT TO THE MAVEN COORDINATES. WE DID HEAR OF A FEW ISSUES WHERE SPARK STREAMING WRITES DID NOT WORK. SO UNTIL WE GET THAT ISSUE FIXED, WE HIGHLY RECOMMEND THAT YOU DOWNLOAD THE UBER JAR. I HAVE ALSO INCLUDED THE LINKS TO ALL OF THIS IN MY PRESENTATION, SO YOU WILL BE ABLE TO FIND IT.

FOR THE PURPOSES OF MY DEMO, I’M USING DATABRICKS. IN DATABRICKS, YOU CREATE YOUR CLUSTER, SPECIFYING HOW MUCH MEMORY YOU NEED, SO ON AND SO FORTH. ONCE YOU HAVE CREATED THAT, YOU HAVE AN OPTION TO INSTALL LIBRARIES. HERE I HAVE BOTH OF THEM INSTALLED, USING THE JAR AND THE COORDINATES. I CAN SAY INSTALL NEW, THEN UPLOAD A JAR, AND WHATEVER JAR I HAVE DOWNLOADED, I JUST DRAG AND DROP IT. EASY AS THAT.

ONCE YOU INSTALL THE CONNECTOR, THIS IS THE FUN PART. THAT WAS GOING THROUGH THE BASICS. I’M GOING TO SHOW YOU TWO THINGS: HOW TO DO STREAM READS AND STREAM WRITES, AND HOW TO DO BATCH READS AND BATCH WRITES. SO HERE I HAVE A NOTEBOOK OPEN. ONE SECOND. THERE YOU GO.

WHAT I WANT YOU TO PAY ATTENTION TO HERE IS THE READ CONFIG. THIS IS A FEW LINES OF CODE. ESSENTIALLY, THE FIRST THING I AM SPECIFYING HERE IS WHICH COSMOS DB CONTAINER I CONNECT TO. THESE ARE YOUR ENDPOINT AND MASTER KEY. I USED AZURE KEY VAULT TO STORE MY SECRETS. YOU CAN ALSO PASTE YOUR ENDPOINT AND KEY DIRECTLY, BUT WE HIGHLY RECOMMEND AGAINST IT. THEN YOU SPECIFY YOUR DATABASE AND COLLECTION. THE INTERESTING THING TO PAY ATTENTION TO HERE IS THAT READ CHANGE FEED IS TRUE. THAT MEANS YOU ARE ENABLING STREAMING AND YOU WANT TO READ OFF THE CHANGE FEED. THE OTHER THINGS YOU NEED TO PAY ATTENTION TO ARE THE CHANGE FEED CHECKPOINT LOCATION AND THE CHANGE FEED QUERY NAME.
THESE ARE THE LOCATIONS ON YOUR DATABRICKS FILE SYSTEM WHERE WE STORE THE CHECKPOINTING INFORMATION. FOR EXAMPLE, IF YOUR JOB FAILED, WHERE DO YOU CONTINUE TO READ FROM? WHAT ARE YOUR CHANGE FEED CHECKPOINTS? ONCE YOU HAVE THE READ CONFIG, YOU ARE GOING TO START A READ STREAM, WHICH IS THE NEXT COMMAND. ALL I AM SAYING IS START A SPARK READ STREAM. ANOTHER INTERESTING THING I WOULD LIKE TO POINT OUT IS THE FORMAT. YOU SEE THAT WE ARE USING THE SPARK STREAMING FORMAT HERE. THE FORMAT DIFFERS BETWEEN STREAM READS AND STREAM WRITES, AND BATCH READS AND BATCH WRITES. YOU NEED TO SPECIFY THE RIGHT FORMAT FOR THAT TO WORK. THEN YOU ARE PRETTY MUCH SAYING, USE THE CONFIG THAT I PROVIDED ABOVE. FOR THE PURPOSES OF THIS DEMO, I’M JUST WRITING THE OUTPUT OF THIS TO A MEMORY SINK. YOU CAN WRITE IT ANYWHERE, TO DELTA TABLES, OR BACK TO A DIFFERENT COSMOS DB ACCOUNT IF YOU DID SOME CLEANING OR SLIDING-WINDOW AGGREGATES OR WHATEVER YOU WANT.

STEPPING BACK ONE SECOND, I’M GOING TO SHOW YOU HOW MY SAMPLE DATA IN COSMOS DB LOOKS. LET ME REFRESH. IN THE MEANTIME, WHAT I HAVE HERE, RUNNING LOCALLY, IS A TRANSACTION GENERATOR WHICH IS SIMULATING SOME TRANSACTION DATA FOR ME. I HAVE THIS RUNNING RIGHT NOW. I’LL GIVE IT A FEW MINUTES TO START UP. I’M GOING TO SHOW YOU WHAT SOME OF MY DOCUMENTS FROM A PREVIOUS RUN LOOK LIKE. OKAY. I’M SORRY, THAT IS TAKING TIME. BASICALLY, IF YOU LOOK AT THE RIGHT CORNER, THAT IS A SAMPLE DOCUMENT. I HAVE DIFFERENT INFORMATION LIKE THE COUNTRY CODE, THE TRANSACTION ID, THE TRANSACTION AMOUNT AND THE PAYMENT BILLING POSTAL CODE. THAT IS JUST TO GIVE YOU AN EXAMPLE OF A DOCUMENT. WHAT I DID HERE IS I STARTED MY TRANSACTION GENERATOR, WHICH IS EMITTING STREAMING DOCUMENTS THAT WILL BE INSERTED INTO MY COSMOS DB ACCOUNT HERE.
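The change feed read config and read stream described above might look roughly like the following in PySpark. Treat this as a sketch under assumptions rather than the notebook shown in the session: the option names and the source format string are written as I recall them from the open source azure-cosmosdb-spark connector's documentation and should be checked against the version you install, and the endpoint, database, collection, query name and checkpoint path are placeholders. The function that touches Spark is defined but not called, since it needs a cluster with the connector installed.

```python
from typing import Dict

# Format string for streaming reads, as recalled from the connector's docs (verify for your version).
SOURCE_FORMAT = "com.microsoft.azure.cosmosdb.spark.streaming.CosmosDBSourceProvider"


def build_change_feed_config(endpoint: str, master_key: str, database: str,
                             collection: str, query_name: str,
                             checkpoint_dir: str) -> Dict[str, str]:
    """Assemble the read config for streaming off the change feed."""
    return {
        "Endpoint": endpoint,
        "Masterkey": master_key,
        "Database": database,
        "Collection": collection,
        "ReadChangeFeed": "true",                        # enables reading the change feed
        "ChangeFeedQueryName": query_name,               # identifies this reader
        "ChangeFeedCheckpointLocation": checkpoint_dir,  # where progress is stored
    }


def start_memory_stream(spark, config: Dict[str, str], table_name: str = "transactions"):
    """Start a read stream and write it to an in-memory table. Not invoked in this sketch."""
    changes = spark.readStream.format(SOURCE_FORMAT).options(**config).load()
    return (changes.writeStream
            .format("memory")
            .queryName(table_name)
            .outputMode("append")
            .start())


read_config = build_change_feed_config(
    endpoint="https://<your-account>.documents.azure.com:443/",  # placeholder
    master_key="<load from a secret store, not a literal>",      # placeholder
    database="<database>",
    collection="<collection>",
    query_name="fraud-demo-reader",                               # hypothetical name
    checkpoint_dir="/tmp/changefeed-checkpoints",                 # hypothetical path
)
```

In a notebook, `start_memory_stream(spark, read_config)` would be the single call that kicks off the stream; the resulting in-memory table can then be queried with `spark.sql`.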
NOW WHAT I AM GOING TO DO IS RUN MY READ CONFIG AND START READING OFF OUR CHANGE FEED. AS WE ARE READING OFF THE CHANGE FEED, WE CAN SEE AT WHAT RATE WE ARE READING IN AND PROCESSING THESE DOCUMENTS, AND YOU CAN SEE THAT THE SPARK JOBS ARE RUNNING. YOU CAN SEE THE TRANSACTIONS COMING IN. THIS I’M WRITING TO THE OUTPUT SINK.

SO LET’S SAY YOU NOW HAVE YOUR TRANSACTIONS. YOU LOAD YOUR ML MODEL FROM DBFS, THE DATABRICKS FILE SYSTEM. NOW YOU SCORE THESE TRANSACTIONS AS THEY COME IN, SAYING WHETHER THEY ARE FRAUDULENT OR NOT. I’M NOT GOING TO GO INTO MORE DETAIL. FOR THE PURPOSES OF THIS EXAMPLE, WE TRAINED THE MODEL USING THE SPARK ML LIBRARIES. AFTER WE SCORE THE TRANSACTIONS, IF YOU WANT TO WRITE THE RESULTS BACK FOR WHATEVER PURPOSE, YOU CAN WRITE THEM BACK TO A DIFFERENT COSMOS DB COLLECTION. THIS IS AN EXAMPLE OF HOW THE WRITE CONFIG WRITES YOUR STREAMING JOBS BACK TO COSMOS DB. THE THING TO PAY ATTENTION TO IS THAT FOR THE WRITE FORMAT, YOU HAVE TO PROVIDE THE COSMOS DB SINK PROVIDER, WHEREAS FOR THE READ, YOU SAY COSMOS DB SOURCE PROVIDER. THESE ARE TINY THINGS TO POINT OUT, BUT I BROKE MY HEAD SOMETIMES GETTING IT TO WORK WHEN I MADE SMALL MISTAKES IN THE COMMANDS. BUT THESE ARE SIMPLE LINES OF CODE, AND THAT IS ABOUT IT FOR THE STREAMING STUFF.

SO, MOVING ON TO BATCH. YOU HAVE NOW COVERED THE REAL-TIME SCORING PART, AND YOU HAVE ALL OF THESE TRANSACTIONS IN COSMOS DB. YOU WANT TO READ THESE TRANSACTIONS FROM COSMOS DB TO RETRAIN YOUR MODEL, OR TO DO SOME AGGREGATIONS AND WRITE THEM BACK INTO A MATERIALIZED VIEW. THIS IS HOW A READ CONFIG WOULD LOOK FOR A BATCH JOB. YOU SPECIFY THE MASTER KEY AND ALL OF THAT STUFF. ONE INTERESTING THING I WANT TO POINT OUT IS THE CUSTOM QUERY. WHY DO YOU WANT TO SPECIFY THIS?
LIKE WE WERE TALKING ABOUT, YOU WOULD PROBABLY LIKE TO SCHEDULE YOUR BATCH JOBS TO RUN, SAY, ONCE EVERY 24 HOURS, AND READ IN ALL OF THE NEW TRANSACTIONS. HERE I FILTERED ON COLLECTION TYPE TRANSACTION, BUT YOU CAN ALSO FILTER BY TIMESTAMPS TO READ IN ONLY THE MOST RECENT TRANSACTIONS YOU NEED. THIS IS REALLY HELPFUL. FOR EXAMPLE, IF YOU HAVE TWO TERABYTES OF DATA IN COSMOS DB, YOU DO NOT NECESSARILY WANT TO PULL ALL TWO TERABYTES INTO MEMORY IN SPARK. THIS IS WHERE THE CONNECTOR HELPS BY PUSHING THE FILTERS DOWN ALL THE WAY TO THE COSMOS ENGINE AND PULLING BACK ONLY THE DATA YOU ACTUALLY NEED. ONCE YOU HAVE ESTABLISHED YOUR READ CONFIG, WHAT I AM DOING HERE IS CLEANING UP SOME COLUMNS, JUST TO SHOW WHAT THE TRANSACTION DATA LOOKS LIKE. IF I RUN THIS, YOU SEE SOME OF THE DATA IN HERE: THE TRANSACTION ID, THE POSTAL CODE, WHETHER THE USER IS REGISTERED, AND SOME OTHER STUFF. NOW THAT YOU ARE READING YOUR BATCH TRANSACTIONS INTO SPARK AND DATABRICKS, THE NEXT THING IS, WHAT ARE SOME OF THE AGGREGATIONS I COULD DO? ON SUCH DATA, ONE USEFUL THING YOU COULD DO IS AGGREGATE ALL OF THE INCOMING TRANSACTIONS BY COUNTRY CODE AND GET THE AVERAGE TRANSACTION AMOUNT. THIS IS JUST A SAMPLE. ONCE I HAVE THIS, I’M GOING TO WRITE IT BACK TO COSMOS DB. FOR WRITING BACK, AGAIN, THE WRITE CONFIG LOOKS SIMILAR. YOU SPECIFY THE ENDPOINT, THE MASTER KEY AND THE COLLECTION YOU WANT TO WRITE BACK TO. I HAVE ONE CALLED MATERIALIZED VIEW. YOU ALSO SPECIFY THE DEFINITION OF YOUR PARTITION KEY. ANOTHER THING TO PAY ATTENTION TO HERE IS THE WRITE FORMAT. FOR STREAMING, YOU HAVE TO SPECIFY THE COSMOS STREAMING FORMAT, WHEREAS HERE YOU SAY THE FORMAT IS MICROSOFT AZURE COSMOS DB SPARK. YOU ALSO SPECIFY MODES, LIKE OVERWRITE, APPEND AND SO FORTH.
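THE BATCH FLOW JUST DESCRIBED CAN BE SKETCHED ROUGHLY AS FOLLOWS: A READ CONFIG WITH A CUSTOM QUERY THAT IS PUSHED DOWN TO COSMOS DB, AN AVERAGE PER COUNTRY CODE, AND A BATCH WRITE TO A MATERIALIZED VIEW COLLECTION. THIS IS A SKETCH UNDER ASSUMPTIONS, NOT THE DEMO NOTEBOOK. THE OPTION KEYS AND FORMAT STRINGS ARE THE AZURE-COSMOSDB-SPARK CONNECTOR NAMES AS WE UNDERSTAND THEM, WHILE THE FIELD NAMES (collectionType, countryCode, transactionAmount) AND ALL FUNCTION NAMES ARE HYPOTHETICAL. THE PARTITION KEY OPTION IS LEFT OUT BECAUSE THE TALK DOES NOT GIVE ITS EXACT NAME.

```python
# Format strings of the azure-cosmosdb-spark connector (assumption: verify
# against your connector version). Batch and streaming use different ones.
BATCH_FORMAT = "com.microsoft.azure.cosmosdb.spark"
STREAM_SINK_FORMAT = (
    "com.microsoft.azure.cosmosdb.spark.streaming.CosmosDBSinkProvider")


def batch_read_config(endpoint, master_key, database, collection,
                      since_epoch_seconds):
    """Read config whose custom query is evaluated by Cosmos DB itself.

    _ts is the Cosmos DB system timestamp (epoch seconds), so only documents
    changed since the previous run leave the database.
    """
    query = (
        "SELECT * FROM c WHERE c.collectionType = 'transaction' "  # hypothetical field
        f"AND c._ts >= {int(since_epoch_seconds)}"
    )
    return {
        "Endpoint": endpoint,
        "Masterkey": master_key,
        "Database": database,
        "Collection": collection,
        "query_custom": query,
    }


def batch_write_config(endpoint, master_key, database, collection):
    """Write config for the collection that holds the aggregates."""
    return {
        "Endpoint": endpoint,
        "Masterkey": master_key,
        "Database": database,
        "Collection": collection,
        "Upsert": "true",
    }


def average_amount_by_country(spark, read_config):
    """Load the filtered transactions and average the amount per country."""
    transactions = (spark.read
                    .format(BATCH_FORMAT)
                    .options(**read_config)
                    .load())
    return (transactions
            .groupBy("countryCode")                 # hypothetical column
            .avg("transactionAmount")               # hypothetical column
            .withColumnRenamed("avg(transactionAmount)",
                               "avgTransactionAmount"))


def write_materialized_view(aggregates, write_config, mode="overwrite"):
    """Batch write; mode can be 'overwrite', 'append' and so on."""
    (aggregates.write
     .format(BATCH_FORMAT)
     .mode(mode)
     .options(**write_config)
     .save())
```

FOR A STREAMING WRITE, THE SAME WRITE CONFIG WOULD BE PASSED TO A WRITE STREAM WITH THE SINK PROVIDER FORMAT INSTEAD OF THE BATCH FORMAT.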
AND THAT IS THE BASIC IDEA. THESE ARE THE RESULTS FROM WHAT I RAN BEFORE. I’LL QUICKLY TRY TO RUN IT FOR YOU TO GIVE YOU AN IDEA OF HOW THIS WORKS, BUT THAT IS PRETTY MUCH THE EXPLANATION. YOU HAVE THE READ CONFIG. WHAT I AM DOING IS A BATCH READ OF SOME DATA, CREATING A TEMP TABLE TO WORK WITH THIS DATA, AND DOING SOME AGGREGATES INTO THE MATERIALIZED VIEW. LIKE YOU CAN SEE, THIS PRETTY MUCH WORKS. BECAUSE WE ARE SHORT ON TIME, WE WILL WRAP UP THERE. THIS IS THE IDEA. I HOPE YOU UNDERSTOOD THE DIFFERENT CASES WHERE YOU WOULD USE STREAMING VERSUS BATCH, AND THE DIFFERENT CONFIGS. ALL OF THIS IS AVAILABLE. IT IS A FEW LINES OF CODE TO ESTABLISH A CONNECTION FOR STREAM READS AND STREAM WRITES, AND WHATEVER ELSE YOU DO IN DATABRICKS IS PRETTY MUCH THE SAME AS WHAT YOU ALREADY WORK WITH. WITH THAT, I GIVE IT TO ROHAN, WHO WILL TALK ABOUT COCA-COLA, WHICH HAS BEEN USING THE COSMOS DB SPARK CONNECTOR FOR A FEW MONTHS NOW FOR ONE OF THEIR PROJECTS. HE WILL ALSO TALK ABOUT THE NEW SPARK API. THANK YOU. >> THANK YOU, SRI. >> WELCOME, EVERYONE. I’M ROHAN, FROM THE COSMOS DB GROUP AS WELL. HOPE YOU ARE HAVING AN AMAZING BUILD THIS YEAR, PARTICULARLY ON THIS TOPIC OF REAL-TIME ANALYTICS ON TOP OF COSMOS DB, WHICH IS NEAR AND DEAR TO ME. A COUPLE OF YEARS BACK, I WAS ONE OF THE FOLKS WHO STARTED THE COSMOS DB SPARK CONNECTOR WHICH SRI WAS TALKING ABOUT. TODAY, I’M HELPING DESIGN AND BUILD THE INTEGRATED OPERATIONAL ANALYTICS SUPPORT WITHIN COSMOS DB. WE ARE REALLY EXCITED TO ANNOUNCE THAT SUPPORT AT BUILD, AND EXCITED TO WORK WITH YOU TO UNDERSTAND YOUR SPARK SCENARIOS AND INTRODUCE THIS NEW PARADIGM OF GLOBALLY DISTRIBUTED OPERATIONAL ANALYTICS. IT IS UNIQUE TO WHAT WE ARE OFFERING TODAY BECAUSE OF THE CAPABILITY OF COSMOS DB TO GLOBALLY DISTRIBUTE THE DATA. MY IMMEDIATE TEAM IN COSMOS DB ENGINEERING IS RESPONSIBLE FOR LEADING THE TECHNICAL CUSTOMER ENGAGEMENTS FOR THE SERVICE. WITH THAT, WE GET TO INTERFACE WITH ENTERPRISE CUSTOMERS ON AZURE AT DIFFERENT STAGES OF WHAT WE CALL DATA MODERNIZATION. THIS IS A BIT OF A BUZZWORD.
REALLY, IF YOU THINK ABOUT IT, IT IS ALL OF THE ENTERPRISES COMING BACK TO REDESIGN SOME OF THE CORE PRINCIPLES OF HOW THEY STORE DATA. THIS IS ALL OPERATIONAL DATA, WHICH TODAY IS GROWING TO TERABYTE AND PETABYTE SCALE. IT IS ABOUT HOW THEY STORE IT IN A GLOBALLY DISTRIBUTED FASHION AND ALSO HOW THEY DERIVE INSIGHTS FROM THAT DATA. THAT IS THE REAL KEY TODAY. FROM ALL OF THE COSMOS DB SESSIONS YOU ARE ATTENDING, YOU WILL UNDERSTAND HOW TO ACHIEVE THESE SCENARIOS IN A GLOBALLY DISTRIBUTED SETTING. RIGHT NOW, LET’S FOCUS ON HOW YOU CAN DERIVE INSIGHTS OUT OF TERABYTES OF TRANSACTIONAL DATA. I THINK A PERFECT EXAMPLE WOULD BE COKE. LET’S TAKE A LOOK AT A VIDEO FROM THE CIO AND EXECS AT COKE WHERE THEY TALK ABOUT THEIR WHOLE JOURNEY ON AZURE, CENTERED AROUND COSMOS DB. >> IT IS UPLIFTING AND OPTIMISTIC. OVER 200 COUNTRIES. AS A SYSTEM, WE EMPLOY OVER 700,000 PEOPLE. WE HAVE BRAND PARTNERS THAT WE WORK WITH. WE HAVE 1.9 BILLION SERVINGS OF OUR PRODUCTS A DAY. ACROSS OUR SYSTEM, WE HAVE SALES AND REVENUE IN MULTIPLE CURRENCIES AND MULTIPLE COUNTRIES FROM MULTIPLE SOURCES. OUR CHALLENGE HAS BEEN TO GET INSIGHTS UP TO SPEED. WE ARE FULLY IN WITH THE MOVE TO CLOUD. THE COOL PART OF THAT CLOUD IS COSMOS DB. BEING A GLOBALLY DISTRIBUTED COMPANY, WE HAVE PETABYTES OF DATA THAT CONTINUES TO GROW. WE WANTED TO PARTNER WITH SOMEBODY WHO HAS A BATTLE-TESTED INFRASTRUCTURE, AS WE ARE RUNNING MISSION-CRITICAL APPLICATIONS. THAT, QUITE FRANKLY, LED US TO MICROSOFT. WE ARE COLLECTING REVENUE AS WELL AS VOLUME DATA FROM ALL OF OUR ECOSYSTEM AND PUTTING IT IN ONE PLACE THAT ALLOWS US TO DRAW INSIGHTS. BEING ABLE TO SCALE AND HAVE INSIGHTS THAT ARE ACTUALLY DELIVERED WITHIN MINUTES IS VERY, VERY IMPORTANT FOR US. WE ARE VERY, VERY EXCITED ABOUT COSMOS DB AND THE UPCOMING NATIVE SPARK SUPPORT, AS WELL AS THE OPERATIONAL ANALYTICS THAT ARE BUILT ON TOP OF THAT.
>> THE PROGRAM HAS BROUGHT OUR SYSTEM TOGETHER, AS WE BRING DATA FROM OUR FRANCHISE PARTNERS AND THE COMPANY TOGETHER TO BETTER SEGMENT AND BETTER SERVE THE NEEDS OF OUR CONSUMERS. IT HAS DELIVERED AGAINST THAT. IT HAS DELIVERED US THE COSMOS DB TECHNOLOGY THAT ALLOWS US TO SCALE MASSIVELY AND DELIVER AGAINST ALL OF THE TIME REQUIREMENTS WE HAD ABOUT GETTING INSIGHT TO MARKET AT THE PACE THAT THE MARKET MOVES. THIS IS THE TIME THAT WE CAN MAKE A FUNDAMENTAL DIFFERENCE TO OUR BUSINESS. AND I DON’T THINK THAT IT GETS BETTER THAN THAT. >> SO THAT IS A REALLY, REALLY SATISFYING VIDEO FOR ME PERSONALLY, BECAUSE I CAN VERY DISTINCTLY REMEMBER THE DAY, OVER A YEAR BACK, WHEN WE WALKED INTO THE HEADQUARTERS AT COCA-COLA TO PROPOSE THE SOLUTION THAT WE WANTED TO BUILD ON COSMOS DB AND THE REST OF AZURE FOR THIS PARTICULAR PROJECT, WHICH IS CALLED NSR. SO WE LOOKED AT THE VIDEO, RIGHT? LET’S LOOK AT SOME OF THE KEY HIGHLIGHTS HERE. WHAT ARE THE SCALE CHALLENGES? THERE IS COCA-COLA AS A COMPANY AND ALL OF THE DATA WHICH IT GENERATES AS WELL AS CONSUMES. IT IS SPREAD ACROSS 200 COUNTRIES AND THEY SERVE ROUGHLY 2 BILLION PRODUCTS IN A SINGLE DAY. YOU CAN IMAGINE THAT EVERY SINGLE PRODUCT SALE GENERATES AN EVEN LARGER VOLUME OF RECORDS. THIS IS THE VOLUME OF DATA THAT THEY HAVE TO PROCESS IN A SINGLE DAY. NOW, ABOUT THE NSR PROJECT WITHIN COCA-COLA. COCA-COLA IS A FRANCHISE BUSINESS AT THE END OF THE DAY, WHICH MEANS THEY INTERACT WITH BOTTLERS ACROSS THE WORLD THAT UPLOAD DATA EVERY DAY ABOUT THE SALES VOLUME AND REVENUE ON A SPECIFIC DAY. THESE BOTTLERS ARE SPREAD ACROSS THE WORLD. THEY CAME TO US ASKING, CAN YOU PROPOSE A GLOBALLY DISTRIBUTED COMMON DATA STORE WHERE I CAN STORE DATA NOT JUST FROM THE BOTTLERS, BUT ALSO DATA COMING IN FROM INTERNAL BUSINESS UNITS AS WELL AS THIRD-PARTY PUBLIC DATA SOURCES, A COMMON DATA STORE WHICH HAS ALL OF THE OPERATIONAL DATA? THIS IS NOT JUST A DATA LAKE WHERE YOU CAN HOST THE DATA. THIS HAS TO BE DATA WHICH HAS TRANSACTIONAL REQUIREMENTS, AND YOU ALSO HAVE TO BE ABLE TO DERIVE INSIGHTS OUT OF IT IN TIME. AND THAT WAS NOT ENOUGH, BECAUSE THEY WANTED TO FUTURE-PROOF THIS COMMON DATA STORE.
THAT IS, TODAY THEY HAVE BATCH DATA COMING IN FROM ALL OF THESE BOTTLERS, BUT GOING FORWARD THEY NEED TO BE ABLE TO INGEST IOT DATA FROM POINT-OF-SALE MACHINES AS WELL AS VENDING MACHINES. THIS IS A COMMON DATA STORE THAT HAS TO SCALE TO THIS VOLUME AND THESE GLOBAL DISTRIBUTION NEEDS, AND ALSO HAS TO SCALE TO BATCH AS WELL AS STREAMING SCENARIOS. THIS IS A VERY NICE SLIDE. YOU MIGHT HAVE TO SQUINT A BIT. THIS IS ONE OF THE SLIDES THAT ONE OF THE EXECS PRESENTED, AND IT CAUGHT MY ATTENTION. IT IS NOT JUST ABOUT COKE. IT APPLIES TO EVERY ENTERPRISE CUSTOMER WHO WANTS TO COLLECT OPERATIONAL DATA NOT JUST AT THE SOURCE, BUT ALL THE WAY THROUGH A B2B OR B2C BUSINESS, RIGHT FROM THE PLACE YOU MANUFACTURE, THROUGH THE LOGISTICS, THROUGH THE POINT OF SALE, SO THAT YOU CAN SEGMENT THIS DATA, DERIVE INSIGHTS ABOUT THE BUSINESS AND UNDERSTAND THE CUSTOMER BETTER. THIS IS A VERY NEAT SLIDE THAT CAPTURES THE ESSENCE OF WHY WE BUILT THIS NATIVE OPERATIONAL ANALYTICS SUPPORT AS WELL. SO, NO POINTS FOR GUESSING: COSMOS DB WAS THE COMMON DATA STORE, WHICH HAS TO SERVE BOTH TRANSACTIONAL REQUIREMENTS AND ANALYTICS REQUIREMENTS. I’M NOT GOING TO SPEND TOO MUCH TIME ON THE OTHER ASPECTS OF WHY COSMOS DB WAS CHOSEN. IT WAS CLEAR. 200 COUNTRIES, AND IT HAS GLOBAL DISTRIBUTION. YOUR BOTTLERS ARE GOING TO START LOADING FILES AT ANY POINT IN TIME. YOU DO NOT WANT AN IAAS SYSTEM WHERE YOU HAVE TO PROVISION FOR PEAK WORKLOADS, WHICH MEANS A HIGHER COST OF OWNERSHIP, VERSUS WHAT YOU NOW GET WITH COSMOS DB. AND YOU HAVE AVAILABILITY AND LATENCY GUARANTEES AS WELL, IF YOU WANT TO BUILD A WEB APP THAT INTERACTS WITH THE DATA AND DOES LOOKUPS. WHAT I WANT TO LOOK AT IN THESE NEXT FEW MINUTES IS THE FAST TIME-TO-INSIGHTS ASPECT. THIS IS, AGAIN, A SIMPLIFIED ARCHITECTURE OF WHAT COCA-COLA DOES. THIS IS THE MAIN SCENARIO, WHERE A BOTTLER DROPS FILES EVERY DAY ABOUT HOW MUCH SALES VOLUME AND REVENUE THEY HAD.
SO THE DATA BASICALLY LANDS IN BLOB. VERY SIMPLE. THIS SEEMS LIKE A SIMPLE ETL MECHANISM, WHERE THEY HAVE TO DO ETL FROM BLOB TO COSMOS DB. WHAT IS BIG ABOUT IT? THE REASON WHY SPARK IS IN THIS PICTURE IS THAT IT IS NOT JUST ETL. THEY HAVE TO DO A LOT OF ANALYTICS AS WELL AS COMPLEX AGGREGATIONS AND JOINS WHEN THEY MOVE DATA FROM BLOB TO COSMOS DB. THIS IS BECAUSE THE BOTTLERS DROP TRANSACTION DATA, BUT YOU HAVE TO PERFORM VERY COMPLEX JOINS AGAINST REFERENCE DATA AS WELL AS MASTER DATA STORED IN OTHER COLLECTIONS IN COSMOS DB, AND YOU NEED TO BE ABLE TO DO SCHEMA VALIDATION AND DATA WRANGLING. SPARK’S AGGREGATION FRAMEWORK IS NEATLY POSITIONED TO ACHIEVE THIS. ONCE YOU LAND THE DATA IN COSMOS DB, THEN, THROUGH THE NATIVE AZURE ANALYSIS SERVICES CONNECTOR FOR COSMOS DB, WE DO A REFRESH OF THE CUBE. THEY BUILT THIS ON AZURE ANALYSIS SERVICES, WHICH POWERS POWER BI DASHBOARDS AS WELL AS AD HOC EXCEL ANALYSIS. ONE THING TO NOTE HERE IS THE ORCHESTRATOR OF THE WHOLE PLATFORM. AND THIS IS ONE SCENARIO, RIGHT, WHERE YOU ARE INGESTING DATA. THAT IS ESSENTIALLY THE BOTTLER FILE UPLOAD SCENARIO. BOTTLERS LATER NEED TO COME IN AND RUN ANALYTICS AND REPORTS ON TOP OF THE DATA THAT THEY HAVE UPLOADED. THEY DON’T WANT TO LOOK AT JUST THEIR OWN DATA, BUT ALSO AT DATA THAT OTHER BOTTLERS, IN THEIR REGION AND OUTSIDE THEIR REGION, HAVE UPLOADED. THIS IS WHERE ALL OF THE COMPLEX AGGREGATION AND JOIN CAPABILITIES REALLY HELP. THAT IS MORE OF AN OPERATIONAL ANALYTICS SCENARIO, RIGHT? YOU ALSO HAVE THE REGULAR TRANSACTIONAL SCENARIO, LIKE WITH A RELATIONAL, TRANSACTIONAL DATABASE, WHERE YOU WANT SQL QUERIES ON TOP OF THE DATA AND YOU DON’T NEED TO ADD THE LATENCY OF SPARK IN THERE. THAT IS ANOTHER CAPABILITY, WHERE YOU CAN HIT COSMOS DB DIRECTLY. FINALLY, THE CUBE POWERS THE DASHBOARDS FOR THE BUSINESS ANALYSTS WHO VIEW THE DATA. HERE AGAIN, YOU MIGHT WANT TO SQUINT A LITTLE BIT. YOU SEE TWO PARTS HERE. THERE ARE CIRCLES THAT REPRESENT THE LOCATIONS OF AZURE REGIONS TODAY. WHERE YOU SEE THE COCA-COLA SYMBOL, THOSE REPRESENT THE REGIONS WHERE THE COCA-COLA BUSINESS UNITS ARE. WHAT IS REALLY NEAT HERE IS THAT WHEREVER COCA-COLA HAS A BUSINESS UNIT, THERE IS AN AZURE REGION EITHER RIGHT IN THE SAME PROVINCE OR WITHIN A THOUSAND-MILE RADIUS OF THE BUSINESS UNIT. THE COOL THING IS THAT COSMOS DB IS A RING 0 SERVICE. COSMOS DB IS PRESENT WHEREVER ANY AZURE REGION IS.
THIS HAS HELPED BUILD A GLOBALLY DISTRIBUTED DEPLOYMENT. WHATEVER WE SAW SO FAR WAS A SINGLE-REGION DEPLOYMENT. THIS IS WHAT HAPPENS WHEN YOU DEPLOY AN OPERATIONAL PLATFORM WHICH IS GLOBALLY DISTRIBUTED. THERE ARE TWO PARTS HERE, RIGHT? IN ANY ONE OF THOSE SQUARES, THERE IS A DATA TIER AND THEN A COMPUTE TIER. COSMOS DB IS GLOBALLY DISTRIBUTED, SO ALL YOU HAVE TO DO IS CREATE ONE SINGLE COSMOS DB ACCOUNT AND ADD ADDITIONAL REGIONS WHEN YOU NEED TO SCALE THE DATABASE OUT TO ANOTHER REGION. BUT THE SAME CANNOT NATURALLY BE SAID FOR THE COMPUTE. LET’S FOCUS ON THE ANALYTICS PIECE HERE. IF YOU ARE RUNNING SPARK, YOU CANNOT GUARANTEE AN RTO OF ZERO, A RECOVERY TIME OBJECTIVE OF ZERO. WITH COSMOS DB, IF A SINGLE REGION GOES DOWN, THERE IS AUTOMATIC FAILOVER TO ANOTHER REGION. HOW DO YOU DO THAT FOR THE COMPUTE? IF EAST U.S. WENT DOWN, DO I TAKE THE ANALYTICS STACK AND AUTOMATICALLY REPLICATE IT TO ANOTHER REGION? DO I PLACE IT INSIDE THE SAME VNET, AND ALL OF THESE QUESTIONS? THIS IS THE CORE REASON WHY COCA-COLA AND OTHER CUSTOMERS PUSHED US HARD TO BUILD OUT THE NATIVE SUPPORT FOR SPARK. THIS IS WHAT THE SOLUTION WOULD MOVE TO IN THE FUTURE, AS WELL AS WHAT WE ARE PROPOSING FOR EVERYONE OUT HERE WHO WANTS TO BUILD A GLOBALLY DISTRIBUTED ANALYTICS SETUP. YOU NOW HAVE SPARK, ONCE YOU SET UP THE SPARK API ON TOP OF ANY OTHER DATA API IN COSMOS DB. YOU HAVE SPARK JOBS WHICH CAN NOW RUN IN ANY OR ALL OF THE AZURE REGIONS ASSOCIATED WITH THE COSMOS DB DATABASE ACCOUNT. AND NOW, WITH THE COSMOS DB MULTI-MASTER CAPABILITY, EACH OF THE REGIONS IS A READ AND WRITE REPLICA, WHICH MEANS YOUR SPARK JOBS CAN PERFORM QUERIES, READS AND WRITES ALL AGAINST THE NEAREST REGION. THIS WAY, IF COCA-COLA ONBOARDS ANOTHER BUSINESS UNIT ONTO THIS PLATFORM TOMORROW, ALL THEY HAVE TO DO IS ADD ANOTHER COSMOS DB REGION. THE DATA IS REPLICATED AND THE COMPUTE STACK IS REPLICATED NATIVELY THERE. THIS IS WHY WE BUILT IN NATIVE SUPPORT FOR SPARK.
AND JUPYTER NOTEBOOKS INTO COSMOS DB, WHICH WORKS SEAMLESSLY WITH ALL OF THE DATA APIS. IT WORKS WITH THE CASSANDRA API, THE SQL API, ALL OF THEM. SO LET’S LOOK AT THE STACK THAT A DATA ENGINEER OR AN ML ENGINEER WOULD WORK WITH, RIGHT? HOPEFULLY, AT THE END OF THE TALK HERE, IT IS CLEAR TO YOU THAT COSMOS DB IS ONE OF THE TOP CONTENDERS FOR THE DATABASE PLATFORM IF YOU WANT TO BUILD A TRANSACTIONAL AND OPERATIONAL ANALYTICS SYSTEM TOGETHER. SO LET’S LEAVE THAT PIECE OUT. AS OF YESTERDAY, BEFORE WE ANNOUNCED THE INTEGRATED SPARK SUPPORT, THE ENTIRE TOP PART SAT OUTSIDE COSMOS DB. THAT IS THE CHOICE OF THE COMPUTE PLATFORM, AS WELL AS SPARK AND JUPYTER NOTEBOOKS, WHICH ARE TYPICALLY THE MOST PREFERRED ECOSYSTEM FOR RUNNING VISUALIZATIONS, AS WELL AS A COLLABORATIVE ENVIRONMENT WHERE YOU CAN PULL IN A RICH SET OF PLUG-INS AND EXTENSIONS FOR VISUALIZATIONS. GIVEN THAT, WE DECIDED TO ADOPT THE OPEN SOURCE TECHNOLOGIES WITHOUT REBUILDING SPARK AND WITHOUT REBUILDING JUPYTER NOTEBOOKS. WE WANT YOU TO BRING YOUR EXISTING JUPYTER NOTEBOOKS AND YOUR EXISTING SPARK NOTEBOOKS AND JOBS AND RUN THEM NATIVELY WITHIN COSMOS DB. AS ALWAYS, WE CAN NEVER TALK ABOUT COSMOS DB IN ISOLATION. YOU WILL ALWAYS BE USING SPARK TO DO ETL, STREAMING ETL. IF YOU WANT TO CONNECT TO EVENT HUBS, OR TO OTHER DATA SOURCES, FROM SOMEWHERE YOU NEED TO CONNECT TO THOSE DATA SOURCES, RIGHT? AS YOU ONBOARD ONTO THE COSMOS DB SPARK API, YOU WILL SEE THAT ALL OF THESE CONNECTORS COME BUILT IN WITH THE RUNTIME. WE WANT TO ELIMINATE THE REQUIREMENT FOR YOU TO GO ADD THESE CONNECTORS AND FIX THE PATHS. THAT IS THE PROBLEM WE WANT TO SOLVE. THE SAME GOES FOR DATA SCIENCE AND ML PACKAGES. AZURE MACHINE LEARNING SERVICE LIBRARIES, PYSPARK, BASICALLY ALL OF THE DATA SCIENCE ECOSYSTEM COMES PACKAGED IN THE RUNTIME. IF THERE ARE ADDITIONAL REQUIREMENTS, YOU CAN ADD ADDITIONAL JARS TO THE RUNTIME. OTHERWISE, ALL OF THIS COMES AS A NATIVE STACK. WHAT WE NOW SAY IS THAT THE RED LINE IS REMOVED, AND THE ENTIRE STACK OF JUPYTER NOTEBOOKS AS WELL AS SPARK COMES BUILT INTO COSMOS DB. I WOULD LOVE TO GO INTO DETAIL ON HOW WE ARE BUILDING THIS OUT. A LOT OF SIGNIFICANT DESIGN CHOICES WENT INTO NOT RUNNING SPARK AND JUPYTER NOTEBOOKS ON VMS, WHICH IS HOW YOU WOULD RUN A SPARK CLUSTER RIGHT NOW.
THERE IS A LOT OF RESOURCE GOVERNANCE AND RESOURCE UTILIZATION THAT WE ARE ABLE TO IMPROVE ON BY RUNNING THESE JUPYTER NOTEBOOKS AND THE SPARK MASTER AND WORKER NODES AS CONTAINERS. THAT IS THE REALLY COOL THING WE ARE DOING RIGHT NOW, RUNNING ALL OF THESE AS CONTAINERS WITHIN THE COMPUTE INFRASTRUCTURE OF COSMOS DB. AS YOU WILL SEE, FROM BOTH A TCO, TOTAL COST OF OWNERSHIP, AND A RESOURCE UTILIZATION POINT OF VIEW, THIS IS SIGNIFICANTLY BETTER THAN THE EXISTING CHOICES YOU HAVE FOR RUNNING IT, EVEN APART FROM ALL OF THE GLOBAL DISTRIBUTION AND THE ZERO-RTO CAPABILITIES WHICH THE INTEGRATED SPARK EXPERIENCE GIVES YOU. WE MIGHT BE RUNNING A LITTLE SHORT ON TIME. IF YOU WOULD LIKE TO STAY ON, I WOULD LIKE TO SHOW YOU A COOL DEMO THAT WE PUT TOGETHER WITH THE COGNITIVE SERVICES TEAM. HAVING THIS COMPUTE INFRASTRUCTURE WITHIN COSMOS DB ALLOWS US TO NOT JUST RUN SPARK, BUT ALSO TO BUILD NATIVE INTEGRATIONS WITH AZURE SERVICES AND COGNITIVE SERVICES, SO THAT YOU CAN RUN COGNITIVE PIPELINES NATIVELY WITHIN COSMOS DB. I’LL SHOW YOU A VERY COOL DEMO. I’M NOT SURE IF ANY OF YOU WERE THERE, BUT WE SHOWCASED THIS DEMO AT THE COGNITIVE SERVICES SESSION YESTERDAY AS WELL. I WOULD LOVE TO WALK YOU THROUGH IT. THIS IS A REAL-LIFE SCENARIO. JUST A QUICK THING BEFORE WE GET INTO THE JUPYTERLAB NOTEBOOK, ABOUT ONBOARDING ONTO THE SPARK API. THIS IS HOW IT WOULD LOOK. YOU CAN CHOOSE ANY OF THE EXISTING DATA APIS ALONG WITH THE SPARK API. ONCE YOU DO THAT AND ADD ANY REGION, YOU CAN SEE THAT IT COMBINES BOTH THE DATA API AND THE ANALYTICS API IN BOTH OF THE REGIONS. BOTH ARE COLOCATED IN EACH OF THE REGIONS. WHAT WE ARE LOOKING AT TODAY IS HENDRICK MOTORSPORTS. I DIDN’T REALIZE... YEAH. OKAY. SO HENDRICK MOTORSPORTS IS ACTUALLY ONE OF OUR AZURE CUSTOMERS. THEY ARE A RACING TEAM IN NASCAR. I’M NOT A HUGE NASCAR FAN, BUT I’M SURE YOU ARE, AND YOU MIGHT RELATE TO THIS.
SO THE TEXAS 500 IS APPARENTLY ONE OF THE REALLY COOL NASCAR RACES. WHAT WE DID IS WORK WITH HENDRICK MOTORSPORTS TO ACTUALLY GET THE RACE TELEMETRY FROM THE TEXAS 500 RACE. THIS IS DATA FROM OVER 30 RACE CARS, EACH OF THEM REPORTING AT OVER 5 HERTZ. LET’S LOOK AT WHAT THAT DATA LOOKS LIKE. THIS IS THE ACTUAL DATA. YOU CAN SEE IT IS VERY RICH. THE TELEMETRY COMING FROM EACH OF THE CARS SHOWS YOU THE LATITUDE, THE LONGITUDE, THE RPM, THE BRAKE, THE HEADING AND A LOT OF OTHER COOL THINGS. COSMOS DB CAN SCALE TO INGEST ALL OF THIS DATA. WHERE DOES THE ANALYTICS PIECE COME IN? BESIDES TELEMETRY, WE ALSO HAVE A COUPLE OF OTHER SOURCES OF DATA IN COSMOS DB. ONE IS THE DATA ABOUT THE DRIVER PROFILES, AND THE OTHER IS AUDIO DATA. THE AUDIO DATA ITSELF IS STORED IN BLOB, BUT COSMOS DB HOLDS THE INDEX TO THE BLOB. YOU HAVE DOCUMENTS THAT POINT TO IT, AND YOU CAN LOOK UP THE EXACT LOCATION WHERE I CAN PICK UP THE AUDIO FILE FOR A SPECIFIC DRIVER AT A PARTICULAR POINT IN TIME. NOW THAT WE HAVE ALL OF THIS DATA, THIS IS ACTUALLY AS SIMPLE AS IT GETS, RIGHT? THIS IS THE UI WHICH YOU WILL SEE WHEN YOU ONBOARD ONTO THE SPARK API. THE JUPYTER NOTEBOOK COMES BUILT INTO THE DATA EXPLORER OF COSMOS DB. YOU CAN EXPLORE YOUR DATA, BUT YOU CAN ALSO EXPLORE ALL OF THE NOTEBOOKS, AND YOU CAN COLLABORATE WITH MORE PEOPLE. THIS IS HOW WE WANT MORE PEOPLE TO COLLABORATE ON TOP OF THE DATA. AS SRI JUST SHOWED YOU, WITH A SINGLE LINE OF CODE YOU CAN READ DATA FROM COSMOS DB INTO SPARK. THE CORE THING IS VISUALIZATION. A LOT OF TIMES I HAVE HAD THIS ISSUE WHEN WE WORK WITH CUSTOMERS WHO LOAD THEIR DATA INTO COSMOS DB. IT IS NOT ENOUGH TO RUN QUERIES ON THE DATA. SOMETIMES YOU NEED RICH VISUALIZATIONS TO SEE WHAT THAT DATA MEANS. WE ARE NOT BUILDING ANYTHING NEW HERE. WE ARE USING A RICH EXTENSION IN JUPYTERLAB FOR VISUALIZATIONS. HERE, I AM BASICALLY TAKING THE FIRST COUPLE OF MINUTES OF THE RACE AND LOOKING AT THE DATA FOR THE TOP FIVE RACERS. THIS IS BASICALLY THE FIRST TEN MINUTES OF THE RACERS AND THEIR RELATIVE POSITIONS. THEN COMES THE TRACK STRATEGY. I WANT TO EMBED THIS DATA ON A REAL MAP AND SEE WHERE THE RACERS ARE ACCELERATING AND WHERE THEY ARE BRAKING.
WHAT IS THE COOL WAY TO DO THAT? THIS IS ALL OF THE CODE I HAD TO WRITE. WITH THAT, YOU CAN NOW LOOK AT THIS. THE GREEN REPRESENTS WHERE PEOPLE APPLY THE THROTTLE AND THE RED REPRESENTS WHERE PEOPLE APPLY THE BRAKES. NOTHING EXTREMELY NEW HERE. YOU CAN SEE THAT THE TURNS ARE WHERE PEOPLE BRAKE MORE, COMPARED TO THE STRAIGHT STRETCHES WHERE THEY ACCELERATE. THEN COMES MORE OF THE NATIVE SPARK SUPPORT. I WANT TO LOAD THE RACE TELEMETRY ABOUT THE DRIVERS AND ACTUALLY ANALYZE HOW THE DRIVERS ARE PERFORMING. THIS IS WHERE SPARK’S AGGREGATION FRAMEWORK, GROUP BY AND ALL OF THE AGGREGATION CAPABILITIES WITH JOINS, COMES IN. NOW WE HAVE THE OPTION TO VISUALIZE THAT NEATLY AND INTERACTIVELY. HERE I AM LOOKING AT THE TIME THAT EVERY DRIVER SPENDS ACCELERATING VERSUS BRAKING. IT IS VERY CLEAR THAT THE MORE EXPERIENCED DRIVERS SPEND LESS TIME BRAKING AND MORE TIME ON THE THROTTLE. THIS IS AN EXAMPLE OF HOW YOU CAN RICHLY VISUALIZE YOUR DATA HERE. ONE OF THE COOL THINGS WE ARE WORKING ON, AS I WAS SAYING, AND WHICH YOU WILL SEE IN THE COMING ANNOUNCEMENTS, IS WORKING WITH COGNITIVE SERVICES AND AML ON NATIVE INTEGRATIONS WITH COSMOS DB, SO THAT YOU CAN STORE DATA IN COSMOS DB AND ENRICH IT WITH COGNITIVE SERVICES PIPELINES WITHOUT HAVING TO WRITE MUCH CODE TO MOVE YOUR DATA OUT OF COSMOS DB TO UNDERSTAND AND ENRICH IT. MMLSPARK IS BASICALLY A WELL-ADOPTED SPARK PACKAGE THAT COMBINES A LOT OF COGNITIVE SERVICES NATIVELY WITHIN SPARK. HERE WE ACTUALLY SHOWCASE ANOMALY DETECTION, A COGNITIVE SERVICE THAT WAS ANNOUNCED RECENTLY. WHAT WE SHOW IS, CAN I APPLY ANOMALY DETECTION ON THE RPM DATA THAT I HAVE FROM THE RACE CAR TELEMETRY? HERE WE APPLY ANOMALY DETECTION ON TOP OF THE RPM DATA. IT JUST TAKES A COUPLE OF SECONDS. LET’S LOOK AT WHAT THE ANOMALIES LOOK LIKE. IT IS A NICE GRAPH HERE. LET’S LOOK AT SOME PORTIONS OF THIS. HERE, THIS IS RPM. RIGHT?
YOU SEE THE YELLOW, THE RED AND THE GREEN LINES. IN NASCAR, YOU HAVE A LOT OF CRASHES, ON AVERAGE FIVE TO SIX PER RACE. THE YELLOW REPRESENTS A POINT WHEN THE YELLOW CAUTION FLAG WAS RAISED. UNDER IT, THE DRIVERS CANNOT RACE AMONG THEMSELVES AND MUST MAINTAIN THEIR RELATIVE POSITIONS, WHICH IS WHY YOU SEE A SIGNIFICANT DROP IN THE RPM. THAT IS ACTUALLY WHAT STOOD OUT IN THE ANOMALY DETECTION AS WELL. THERE ARE SOME NICE PARTS, FOR INSTANCE IF YOU LOOK AT THE LAST PART, RIGHT?

SO WHAT IS THIS? HERE I SEE AGAIN THAT THE YELLOW LINE SHOWS THE CAUTION FLAG WAS RAISED. AFTER THAT, THE RPM WENT TO ZERO. WHAT DOES THIS MEAN? WHAT WE DID IS, WE FOUND THE VIDEO OF THIS EXACT RACE, AND IF YOU JUST... >> HE CHOOSES THE INSIDE LINE AGAIN. THE 12 CAR, HE HAS SOME EXPERIENCE ON THE OUTSIDE LINE. >> THAT WAS CLOSE. >> YES, THAT IS WHY IT HAPPENED. THAT PARTICULAR CAR, THE ONE WE WERE DOING ANOMALY DETECTION ON, WAS THE ONE THAT ACTUALLY CRASHED INTO THE SIDE WALL THERE. SO IT IS NOT SURPRISING THAT THE RPM WENT TO ZERO. THIS IS THE COOL THING: YOU CAN RICHLY ANALYZE YOUR DATA NATIVELY WITHIN COSMOS DB, AND THEN UNDERSTAND THE SEMANTICS EITHER BY INCLUDING YOUR DOMAIN KNOWLEDGE OR BY DOING SOME EXTERNAL RESEARCH OUTSIDE. ONE MORE REALLY COOL PART, BECAUSE WE DIDN’T WANT TO SHOW JUST ONE COGNITIVE SERVICE. WE SAID, OKAY, LET’S SHOW THIS. WE ALSO GOT THE AUDIO DATA FROM THE DRIVERS DURING THE RACE, RIGHT? WE STORED THAT IN COSMOS DB. WE PASS IT THROUGH SPEECH TO TEXT, A COGNITIVE SERVICE, AND THEN WE BASICALLY GET TAGS OUT OF THAT, RIGHT? WHAT YOU CAN THEN DO IS ENRICH YOUR TELEMETRY DATA WITH KEYWORDS WHICH YOU THINK ARE POTENTIAL FLAGS. FOR INSTANCE, IN THIS PARTICULAR SCENARIO, IF THE DRIVER IS EXPERIENCING SOME SORT OF DISCOMFORT AND SAYS SOMETHING LIKE THAT, YOU CAN NOW BUILD ALERTING PIPELINES ON TOP OF IT TO SAY WHEN YOU NEED A PIT STOP. IT IS A PREDICTIVE MAINTENANCE TYPE OF ARCHITECTURE. THIS IS ONE OF THE COOL CAPABILITIES OF COGNITIVE SERVICES. THE AUDIO WHICH I’M GOING TO PLAY IS MUFFLED, BUT THE TEXT IS WHAT SPEECH TO TEXT DETECTED. YOU CAN SEE IT HERE. >> HE WAS RIGHT ON THE BOTTOM. I’M READY TO GO HERE. >> YOU CAN SEE THAT IT IS ACTUALLY ABLE TO VERY CLEARLY CAPTURE WHAT THE TEXT WAS.
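TO MAKE THE RPM ANOMALY IDEA ABOVE CONCRETE, HERE IS A SMALL STAND-IN WRITTEN IN PLAIN PYTHON. IT IS NOT THE COGNITIVE SERVICES ANOMALY DETECTOR USED IN THE DEMO. IT IS A SIMPLE ROLLING Z-SCORE CHECK, SWAPPED IN ONLY TO ILLUSTRATE HOW A SUDDEN DROP TO ZERO RPM STANDS OUT AGAINST RECENT READINGS. THE FUNCTION NAME, THE WINDOW SIZE, THE THRESHOLD AND THE SAMPLE VALUES ARE ALL HYPOTHETICAL.

```python
from statistics import mean, pstdev


def rpm_anomalies(rpm, window=5, threshold=3.0):
    """Return the indexes whose RPM deviates sharply from the recent past.

    Each reading is compared with the mean and standard deviation of the
    `window` readings before it (a rolling z-score). This is a deliberately
    simple stand-in, not the service used in the talk.
    """
    flagged = []
    for i in range(window, len(rpm)):
        history = rpm[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: any change at all is unusual.
            if rpm[i] != mu:
                flagged.append(i)
            continue
        if abs(rpm[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Made-up readings: steady racing RPM, then the engine stops.
    sample = [8000, 8050, 7990, 8020, 8010, 8030, 0, 0, 0]
    print(rpm_anomalies(sample))
```

ONLY THE FIRST ZERO READING IS FLAGGED IN THIS SAMPLE, BECAUSE THE FOLLOWING WINDOWS ALREADY CONTAIN ZEROS AND THEIR SPREAD BECOMES LARGE. A PRODUCTION DETECTOR WOULD HANDLE THAT KIND OF LEVEL SHIFT MORE CAREFULLY.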
THESE ARE THE COOL CAPABILITIES THAT MSR AND A LOT OF THE AZURE AI FOLKS ARE BUILDING. WE WANT TO BE ABLE TO OPERATIONALIZE THEM, WHICH IS WHY WE ARE BRINGING THEM NATIVELY INTO THE PLACE WHERE YOU ARE STORING YOUR DATA. SO THE LAST PART I WANTED TO COVER: THIS IS GOOD, BUT YOU WANT TO OPERATIONALIZE IT. WHAT YOU WOULD WANT TO DO IS DEFINE A PIPELINE THAT TAKES THIS DATA, APPLIES ANOMALY DETECTION, RUNS IT THROUGH SPEECH TO TEXT AS WELL, AND STORES IT BACK IN COSMOS DB. THIS IS HOW YOU WOULD DO THAT. YOU DEFINE A PIPELINE OUT OF THESE MULTIPLE STAGES IN SPARK. THIS IS THE COOL THING ABOUT SPARK. YOU CAN DEFINE SUCH PIPELINES AND EASILY TURN YOUR BATCH LAYER CODE INTO STREAM LAYER CODE. HERE I’M SETTING UP A STREAMING JOB. IT HAS STARTED, BECAUSE IT SAYS PROCESSING NEW DATA. END TO END, THIS IS BASICALLY WHAT THE WORKFLOW LOOKS LIKE. I STORED MY DATA, THE RACE CAR TELEMETRY, IN COSMOS DB, APPLIED A BUNCH OF AGGREGATIONS ON TOP OF IT USING SPARK, PASSED IT THROUGH COGNITIVE SERVICES PIPELINES, AND THEN WROTE THE ENRICHED DATA BACK TO COSMOS DB, INTO ANOTHER COLLECTION OR AS AN UPDATE TO THE EXISTING DATA. ALL OF THIS NOW HAPPENED WITHIN COSMOS DB. IT IS THE SINGLE DATA EXPLORER WHERE I DID ALL OF THIS. I NEVER TOOK MY DATA OUTSIDE COSMOS DB TO DO IT. THAT IS REALLY ONE COOL THING THAT I WANTED TO SHOWCASE ABOUT THE OPPORTUNITIES THIS NOW OPENS UP FOR PEOPLE WORKING WITH REAL-TIME DATA. THAT WAS A PRE-BUILT AI MODEL. YOU NOW HAVE THE CAPABILITY TO BUILD CUSTOM MODELS AS WELL. SO IF YOU SEE... I THINK IT IS A BAD OMEN IN ANY DEMO WHEN SOMETHING GOES... YEAH. THIS IS ESSENTIALLY HOW YOU CAN GO OUT AND BUILD. YOU CAN USE SPARK ML TO BUILD A CUSTOM MODEL AS WELL. HERE, THIS IS ANOTHER SCENARIO WHERE WE TAKE RETAIL DATA. YOU CAN FETCH THAT DATA FROM COSMOS DB AND SPLIT IT INTO TRAIN AND TEST.
AND HERE WE APPLY A VERY SIMPLE LOGISTIC REGRESSION MODEL TO DO A CLASSIFICATION OF WHETHER OR NOT, BASED ON THE CONTEXT AROUND THE PURCHASE, A PERSON WOULD ACTUALLY PURCHASE A SPECIFIC PRODUCT. SO THIS IS JUST ANOTHER CASE SHOWING THAT YOU CAN NOW USE PRE-BUILT AI MODELS AS WELL AS CUSTOM AI MODELS BUILT WITHIN THE SPARK INFRASTRUCTURE OF COSMOS DB. I THINK THIS IS WHAT WE WANTED TO LEAVE YOU WITH. WE REALLY WANT YOU TO GO AND SIGN UP FOR THE SPARK API PREVIEW. WE WANT TO WORK WITH YOU TO UNDERSTAND WHAT YOUR EXISTING SPARK USAGE IS, ONBOARD YOU, AND HELP YOU WITH THE PARADIGM SHIFT TO GLOBALLY DISTRIBUTED ANALYTICS AND AI. SO I THINK WITH THAT, I WILL LEAVE YOU AND SORRY