# Database Lesson #8 of 8 – Big Data, Data Warehouses, and Business Intelligence Systems

their new vehicles. In Bubba’s case, he drives 10,000 miles per year, and in his current vehicle, he gets 10 miles per gallon. Therefore, he will consume 1,000 gallons of gasoline per year. In his new vehicle, Bubba will still drive 10,000 miles per year, but will now get 12 miles per gallon, and therefore will consume 833.33 gallons of gasoline per year. By contrast, in Daisy’s current vehicle, she gets 30 miles per gallon, and driving 10,000 miles per year, she consumes 333.33 gallons of gas. With her new vehicle, which gets 50 miles per gallon, she will consume 200 gallons of gasoline every year. Simple subtraction, then, shows that Bubba will save 166.67 gallons of fuel per year if he purchases his new vehicle, whereas Daisy will save only 133.33 gallons of fuel per year. That is, Bubba will enjoy more fuel savings per year by buying a new truck that gets 12 miles per gallon, versus his old truck, which got 10 miles per gallon, than will Daisy by buying a new car that gets 50 miles per gallon, versus her old car, which provided 30 miles per gallon. So if you’re like most people, in this problem your mathematical intuition has failed. And it is for this reason that I would argue that managers should not rely on their intuition when making important managerial decisions, but rather should rely on data. Many managers have begun to recognize the deficiencies in human intuition, and are hence adopting technologies like business analytics, business intelligence, data warehousing, and data mining, so as to better support their decision-making activities with the data that the organization has at its disposal. By the way, if you’re interested, this problem is known as the miles per gallon illusion, and it stems from the fact that we here in the United States commonly measure fuel economy in miles per gallon, as opposed to the much more appropriate measure, which would be gallons per mile.

Now that we’ve seen for ourselves how human intuition can fail, even in very simple mathematical problems, we can begin discussing some of the systems and technologies that have been developed to help managers make better decisions and avoid the failures that are commonly associated with relying upon intuition alone. First among these technologies are business intelligence, or BI, systems. BI systems are information systems that are intentionally designed to allow managers to capitalize upon organizational data for the purpose of improving their decision making. It’s important to note that these business intelligence systems do not support the real-time operational activities of the organization; those activities, by contrast, are supported by transaction processing systems. Instead, these BI systems are designed to support managerial assessment activities: analysis, planning, controlling, et cetera. Broadly speaking, we can classify these BI systems into two categories. First are simple reporting systems, the purpose of which is to provide managers with the ability to look at data in a flexible, real-time way. That is, these reporting systems support simple data organization capabilities, such as sorting, filtering, and grouping data, and they often provide the ability to make simple calculations on data in real time. These simple calculations might include operations such as a sum, an average, or a count. By contrast, the second category of BI systems are data mining applications. Data mining applications are typically designed not to rely on that real-time data,

level of granularity. Consider that if I were to take all of the sales data for one day and add them all together, then I have a daily total. Similarly, if I take the daily totals for seven consecutive days, then I have a weekly total. I can, in such a way, continue to create monthly totals, quarterly totals, yearly totals, et cetera. It is the same information that is available in the individual sales transactions, except that it has been pre-aggregated and stored in the database in a pre-aggregated form, such that we can vastly improve query speed. That is, when a query is run where we want to look at the data in the form of weekly totals, or monthly totals, or quarterly totals, the database at that time does not need to aggregate all of the individual sales transactions in order to produce the result. The result already exists in the dimensional database, because it has been pre-processed prior to being added into the database. So again, the purpose of these dimensional databases is to intentionally implement redundant data in a non-normalized design, such that we can vastly improve query speed. We want to be able to extract data very quickly from the database.

The most common data model for a dimensional database is known as a star schema. The general characteristics of a star schema are that we have several dimension tables. In this case, we have a time dimension table, a customer dimension table, and a product dimension table. And at the intersection of all of those dimension tables, we have something called a fact table. Philosophically speaking, the reason that the intersection table at the center of the dimension tables is called a fact table is because a fact, in very real terms, is the intersection of information. For example, imagine that I’m interested in knowing how many of a particular product a specific customer purchased during the month of December. Well, the answer to that question is the intersection of the customer, the product in which I’m interested, and the specific time frame that I specify; in this case, the month of December. And the answer might be 14 units. So a specific customer purchased 14 units of a particular product during the month of December. The intersection of those three values is a fact, and it is for this reason that we label the table at the center of a star schema a fact table.

To help you better conceptualize this concept, let’s consider the intersection of two dimensions. In this example, we’re looking at the customer dimension contrasted with the product dimension in the form of a two-dimensional matrix. The value contained in each cell within this matrix, then, is a fact, and it expresses to us the quantity of a particular product that was purchased by a particular customer. And you will notice, of course, if you recall from our last lecture, that this structure is very similar to a bitmap index. Extending this concept out into a third dimension, we can see here that we’re representing a fact as the intersection of three different dimensions in a three-dimensional matrix. Along the horizontal axis I, again, have customers. Along the vertical axis I, again, have products. But now, along the z-axis, I have a time dimension. Therefore, the value contained in any of the cells in this three-dimensional matrix will tell me the quantity of a given product that was purchased by a particular customer during a particular date or time frame. Now, unfortunately, human beings have a great deal of difficulty envisioning higher-order spaces beyond three dimensions. However, this concept scales very easily up to higher-dimensional spaces. So we might consider, for example, a fact to be the intersection of four dimensions, or five dimensions.
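A fact table like the one just described can be thought of as a mapping from a combination of dimension keys to a measure. Here is a minimal sketch in Python; the customer, product, and month values are hypothetical, and quantity sold is assumed to be the only measure:

```python
# A toy "fact table": each fact is keyed by the intersection of three
# dimensions (customer, product, time period), and the stored measure is
# the quantity purchased. All names and numbers here are hypothetical.
facts = {
    ("Bubba", "Widget", "December"): 14,
    ("Bubba", "Gadget", "December"): 3,
    ("Daisy", "Widget", "November"): 7,
}

# Looking up a fact is just indexing by the intersection of the dimensions.
qty = facts[("Bubba", "Widget", "December")]
print(qty)  # 14

# Pre-aggregation: roll the individual facts up to a per-customer total once,
# so later queries do not have to re-scan the individual facts.
totals = {}
for (customer, product, month), quantity in facts.items():
    totals[customer] = totals.get(customer, 0) + quantity
print(totals)  # {'Bubba': 17, 'Daisy': 7}
```

The same rollup idea extends to daily, weekly, or monthly totals: aggregate once when loading the data, then answer queries from the pre-aggregated values.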

And although it is not easy to depict such a situation, conceptually it is just a natural extension of the two-dimensional and three-dimensional examples we saw here. In either case, I hope you can now understand why databases designed in this way are called dimensional or multi-dimensional databases.

Next, I’d like to talk briefly about OLAP and data mining technologies. If you recall from earlier in the lecture, we said that, generally, there are two broad categories of business intelligence applications: reporting applications and data mining applications. Online analytical processing, or OLAP, then, is a technique that supports these reporting applications. That is, OLAP allows us to dynamically examine database data in real time and apply simple transformations like sorting, filtering, grouping, et cetera. And it allows us to perform simple arithmetic functions, such as summing values together, finding an average, a count, the standard deviation, et cetera. Again, this is intended to be used in real time. By contrast, data mining techniques support the data mining category of business intelligence applications. Data mining, broadly, refers to a collection of mathematical and statistical methods that can be used to gain deep insights into an organization’s data. Again, remember that the level of sophistication of these techniques generally requires that they not be executed in real time, so as to avoid interfering with the real-time operations of the organization.

OLAP systems, then, when used in support of simple BI reporting needs, produce something called an OLAP report, which some people will refer to as an OLAP cube. The general idea here is that our inputs are a set of dimensions, while our outputs are a set of measures. So recalling the two-dimensional and three-dimensional matrices that we saw just a few moments ago, a manager might select a series of dimensions, and the OLAP system might allow him or her to perform simple transformations or drill-down operations on the data which lie at the intersection of those dimensions, so as to gain real-time insights into the organization. And here we see that these OLAP cubes can be constructed using our standard SQL SELECT queries. In this case, we’re selecting a number of different dimensions. We are then performing a join on four separate tables, and are imposing some GROUP BY and ORDER BY requirements. The result of this query, in OLAP terminology, would be a result set which represents a collection of measures that a manager could use to gain some insights into his or her organization. And of course, rather than constructing these SQL commands repeatedly, we might take advantage of the capability of relational databases to create views, so as to save the SQL statements used to produce common OLAP cubes in the database itself.

Data mining, then, can be viewed as the convergence of many different disciplines. A skilled data miner needs not only working knowledge of databases, but also statistical and mathematical knowledge, perhaps knowledge of artificial intelligence or machine learning algorithms, knowledge of data management technologies, and so forth. In the modern world, many people become highly specialized in one particular area, but the people who are most valuable to an organization often have expertise in two or more areas, and this is certainly the case with people who are experts at data mining. To conclude our overview of data mining, I just wanted to briefly describe some of the most common techniques that are used to perform data mining against an organization’s data. Among these techniques are cluster analysis, in which case the objective is to group sets of entities together according to their level of similarity

along one or more dimensions. We also have decision tree analysis, in which we can process a large quantity of historical data and generate a decision tree, which tells us what to do under different circumstances in order to achieve some kind of desired result. We also have regression available to us as a very powerful data mining tool, the goal of which is to produce mathematical equations, or mathematical models, that not only describe the relationships between variables, but also provide us a basis for predicting future events based upon past observations. Data mining applications might also rely on sophisticated artificial intelligence or machine learning algorithms, such as neural networks or support vector machines. And recently, we’ve seen a rise in a technique known as market basket analysis, or affinity analysis, which allows us to look for patterns of co-occurrence; for example, determining which products are commonly purchased together. The results of these affinity analyses can then be used as the foundation of a recommendation engine, which can suggest to you movies that you might like, books that you might like, et cetera.

Now I’d like to move into the final topic in our course on databases, and that is the rise of the big data paradigm. Scientists and researchers have recently noted an exponential increase in the quantity of data being produced by the human species. If my memory serves correctly, I believe the current rate of growth is that the amount of data doubles every 40 months. At present, the world is generating many exabytes of new data every single day. And if you’re unfamiliar with the term exabyte, consider that one exabyte is slightly more than one million terabytes. So you may have a computing device at home that stores several terabytes of data; consider that several million terabytes of new data are being generated by the human species every single day. And this situation creates a vast array of new and interesting problems for organizations.

The term big data, then, refers to the rapidly expanding amount of data that is being stored and used by organizations. These data sets can be very large and very complex, and because of their size and complexity, they can be extremely difficult to process using traditional database technologies. An important point to note is that much of what is considered big data is being generated by web 2.0 applications, and the emerging collection of web 3.0 applications. Traditional examples of web 2.0 applications might include social networking sites, video sharing sites, blogs, discussion forums, et cetera. So this rapidly accumulating quantity of data presents many challenges for organizations. Among these are simply capturing all of the data and storing it, and maintaining the data once we have it. This is also commonly referred to as curation: in the same way that the curator of a museum must maintain all of the ancient artifacts, so, too, must the curator of a big data set be able to maintain the quality and the integrity of the data, in light of things like failing hardware and the demand for the data to be used by many people from all over the world simultaneously. Additional challenges include search: how does one search efficiently through such an enormous quantity of data? And data transfer: consider, for example, that if you have a 100 megabit network connection, you can transfer approximately one terabyte of uncompressed data per day. At this speed, it would take you more than a million days to transfer one exabyte of data. Further challenges include analyzing these massive data sets, visualizing the massive quantities of data, and so forth.
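The transfer figures just mentioned can be checked with a few lines of arithmetic. This is a rough back-of-the-envelope sketch, assuming a 100 Mbit/s link running at full speed and the binary definition of an exabyte (2**60 bytes), which matches the "slightly more than one million terabytes" figure given earlier:

```python
# Back-of-the-envelope check of the data-transfer figures from the lecture.
link_bits_per_sec = 100_000_000                   # 100 megabits per second
bytes_per_day = link_bits_per_sec / 8 * 86_400    # 86,400 seconds in a day

terabyte = 2**40
exabyte = 2**60   # slightly more than one million terabytes

print(f"Terabytes per day: {bytes_per_day / terabyte:.2f}")   # roughly 1 TB/day
print(f"Days per exabyte:  {exabyte / bytes_per_day:,.0f}")   # over a million days
```

So at one terabyte per day, a single exabyte really would take on the order of a million days to move over such a link.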

In the past few years, a term has arisen in the area of big data that is used to describe the movement toward using non-relational databases in order to support these huge, highly distributed, highly replicated collections of data. That term is NoSQL. Although the name, to many people, implies that SQL is not involved in these databases (that is, that they do not support SQL-like queries), this assumption is actually wrong. As it is used in contemporary database circles, NoSQL means "not only SQL." That is, although these very large databases may not be based on relational algebra in the same way that a relational database is, they nevertheless support querying through a SQL-like query language. Unlike the relational database world, where the relational model is fixed and predominates across all relational database vendors, in the NoSQL world there are many different architectures for non-relational databases currently in use. These include architectures which rely upon a key-value store, a wide columnar store, or a document store, and there are databases that rely upon graph theory, and so forth. Collectively, all of these different types of very large data stores are commonly referred to as structured storage, and they have a few attributes in common. First, they arguably employ simpler designs than relational databases. And second, they almost always have a looser consistency model than one will find in a relational database. Another way of saying that is that these structured storage databases do not provide ACID guarantees. If you remember, ACID is an acronym which stands for Atomicity, Consistency, Isolation, and Durability, and ACID guarantees are the hallmark of a normalized relational database. We cannot expect to have that level of consistency in these massive, highly distributed, highly replicated, structured storage databases.

When discussing the sort of data models that are actually used in these structured storage databases, I like to use the data model that is employed by the Apache Cassandra database as an example. The reason for this is that it is one of the most popular structured storage database management systems, and it is currently the most popular wide columnar store available. Broadly speaking, the Apache Cassandra database can be classified as a hybrid key-value/wide-columnar database; its architecture contains elements of both a key-value store and a wide columnar store. The Apache Cassandra database itself was originally created at Facebook by two of their software architects, after which it was transferred to the Apache Foundation, where it now resides as an entirely open-source and free database. Apache Cassandra has cross-platform support, the reason being that it was written in Java. So it can run on Linux-based machines, Windows-based machines, Unix, et cetera. And Cassandra supports a massively distributed database environment. That is, it allows us to subdivide our database among dozens, or hundreds, or even thousands of separate database servers, potentially spread out all over the world. The database is highly scalable and decentralized. By scalable here, I mean it’s extremely easy to add an extra node, that is, an extra server, to the cluster, thereby expanding the size of the database. And by decentralized, what I mean is that all of the nodes, that is, all of the database servers, involved in an Apache Cassandra database have the same role. This provides the very desirable characteristic of there being no single point of failure. Another very valuable characteristic of the Apache Cassandra model is that it provides for automatic data replication.

That is, the database itself can automatically make copies of data and store those copies in different locations throughout the cluster. This makes the database highly fault tolerant, such that if an individual node, that is, an individual database server, were to fail, the redundant data stores take over instantaneously. There’s no downtime with the database at all. Further, Apache Cassandra supports the MapReduce process, which is a computational model for solving data processing problems in a highly distributed environment; I’ll talk more about the MapReduce process in a few minutes. And to illustrate the legitimacy of the Apache Cassandra model, consider that it is currently used by CERN, and by organizations such as Constant Contact, Digg, Instagram, Netflix, Reddit, Walmart, Twitter, et cetera.

Now let’s talk about the Cassandra data model itself. As you know, in a relational database management system, related data for an application are stored together in a container which is referred to as a database, or sometimes as a schema, and within that database or schema, we have one or more tables. The analogous concept in Cassandra is something called a key space. That is, data for an application are stored together in a container known as a key space, and inside of that key space, instead of tables, we have something known as column families. So just as in the relational database world a single DBMS might contain many databases, each of which contains many tables, in the world of Apache Cassandra, the Cassandra database might contain many key spaces, each of which contains many column families. The column families, then, contain columns, but these are not the same as columns in a relational database. In Cassandra, a column consists of a name, that is, the name of the column; a value, that is, the data value for that column; and a timestamp, where the timestamp indicates the point in time at which the data value was changed. Related columns, then, are all stored within the same row, and each row is identified using a unique row key. This notion of a unique row key is directly analogous to the idea of a primary key in the relational database world. Rows in the Cassandra model, however, are not required to contain the same set or number of columns. That is, different rows within the same column family might have a different number of columns, and the number of columns in a particular row is allowed to expand or contract on an as-needed basis. A few additional important differences to note between the Cassandra data model and the relational data model are that in the Cassandra data model there are no formal foreign key relationships between column families. That is, we cannot establish formal relationships between column families within the same key space. What’s more, it is not possible to join column families together using a query. So whereas in the relational database model we can write queries that will join tables together, it is not possible in the Cassandra model to join column families together.

Now, I know that this sort of verbal description of the Cassandra data model can be a bit challenging to follow, so let’s look at a picture which I hope will help to make some of these concepts clearer. Here we see a graphical representation of the Cassandra data model. The outermost container represents all of the key spaces for the Cassandra database. In this case, we have just two key spaces, one of which is labeled as the blog key space, and the other of which is labeled as the store key space, where the details of the store key space are not elaborated in this diagram. Again, the idea here is that a key space is roughly analogous to a database within the relational database world. This means that a key space is typically oriented toward a particular need or a particular application.
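The concepts just described (key spaces, column families, rows with varying columns, and timestamped values) can be approximated with ordinary Python dictionaries. This is only an illustrative sketch of the data model, not Cassandra’s actual implementation or API; the user names, phone number, and `put` helper are hypothetical, and the last-write-wins rule in `put` reflects how the timestamps described above decide which value is newest:

```python
import time

# keyspace -> column family -> row key -> column name -> (value, timestamp)
# A purely illustrative model of Cassandra's storage concepts.
database = {
    "blog": {                      # a key space (like a relational database)
        "users": {},               # a column family (loosely like a table)
        "blog_entries": {},
    }
}

def put(keyspace, family, row_key, column, value, timestamp=None):
    """Write one column; the newer timestamp wins (last write wins)."""
    ts = timestamp if timestamp is not None else time.time()
    row = database[keyspace][family].setdefault(row_key, {})
    if column not in row or row[column][1] < ts:
        row[column] = (value, ts)

# Rows in the same column family need not have the same columns.
put("blog", "users", "dan42", "name", "Dan")
put("blog", "users", "dan42", "email", "dan@dan.com")
put("blog", "users", "dan42", "phone", "555-0100")
put("blog", "users", "jsmith", "name", "John")   # only one column

# A stale write carrying an older timestamp is simply ignored.
put("blog", "users", "dan42", "email", "old@dan.com", timestamp=0)
print(database["blog"]["users"]["dan42"]["email"][0])  # dan@dan.com
```

Note how the row for `jsmith` has a single column while the row for `dan42` has three, and how the timestamp on each column lets conflicting writes be reconciled.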

Within each key space, we can have one or more column families. In this case, we have a column family for users, and we have a column family for blog entries. Next, let’s look at an example of a column family; to begin, I will refer to the user column family. Here, we can see that we have three rows within the column family, and each row represents a unique user within the blog key space. A user, then, is represented as a collection of one or more columns. And remember that, in the Cassandra data model, the number of columns per row can vary from row to row. So in this first row, that is, the Dan 42 row, we see that we have three columns. The first column is the name column; its value is Dan, and we have a timestamp. The second column is the email column; its value is dan@dan.com, and again, it has a timestamp. And the third column is the phone column, which has a value and a timestamp. For the next user, we have only the name column, and for the third user, we have only the name and email columns. So there is no requirement that each row contain the same number, or even the same type, of columns. Next, let’s look at the blog entry column family within this blog key space. Again, we see that each row within the column family contains multiple columns. In this case, both rows contain the same columns, but again, that is not a requirement. Here, the columns are the text column, the category column, and the user column. Note, particularly, that the values stored in the user column can be used to determine which blog entries were written by which users. However, remember that formal relationships between column families do not exist in Cassandra. That is, we do not formally establish primary key/foreign key relationships. So I hope that looking at this diagram of the Apache Cassandra data model demystifies things a little bit for you. I know that learning about these structured storage data models for the first time can be intimidating, but I hope that through this diagram, you can see that it’s really not that complicated. And I hope that is encouraging for you.

As I mentioned earlier, these structured storage databases are often highly distributed and highly replicated. That is, they may be spread across many, many different nodes or database servers. Now, this structure has substantial advantages. Not only does it provide fault tolerance, but it allows for data requests to be handled by the nearest available node that is able to service the request. So, for example, if you are in Tokyo, and it happens that a copy of the data in which you are interested is stored on one of my database nodes located near Tokyo, it’s much more efficient for that node to handle your request than it would be to route the request to a distant geographical node, say, one which might be located in Berlin. The problem with this model, however, is that it can cause problems with consistency. Consider what happens when a data item is updated. If I update a data item on one node, it will take time for that update to cascade to the other nodes within the cluster that contain a copy of the data. So imagine that my distributed structured storage database contains 1,000 nodes spread all over the world, and the data item I’m interested in updating is replicated across 100 of those nodes. I may then perform the update on one of the nodes, and until that update cascades throughout all of the other nodes in the cluster, any requests for that same data item that are made of those other nodes will return values that are out of date. And again, this is due to the fact that the database is widely distributed and widely replicated, and because we typically do not enforce

an ACID level of consistency. Thus, in these replicated data environments, we commonly use a consistency model that is referred to as eventual consistency. What eventual consistency means is that if no new updates are made to a specific data item for a period of time, eventually all of the requests for that data item will return the most up-to-date value, regardless of which node is servicing the request. And the timestamps that are recorded during each item update are the key which allows us to reconcile any inconsistencies in replicated data values between nodes.

Finally, I would just like to take a few moments to discuss the MapReduce process. Broadly speaking, MapReduce is a programming model that relies on parallelization in order to perform data processing tasks on these huge data sets that may be distributed across many different servers, or many different nodes. Conceptually speaking, then, the MapReduce process involves two different types of nodes: there will be a master node and worker nodes. Put simply, the master node is usually the node which receives the data processing request from the user, while the worker nodes are nodes which are assigned to complete part of the processing task by the master node. This MapReduce process, then, unfolds in two steps. The first step is called the map step. In the map step, the master node will take the data processing problem and subdivide it into a series of subproblems, and each of these subproblems is then assigned to, and carried out by, a worker node. The second step in the MapReduce process is the reduce step. After the worker nodes have completed their assigned tasks, they pass the results of their work back to the master node. The master node will then do the final processing, or final combining, of those results in order to produce the overall answer to the problem, which is then returned back to the user.

Again, I know that concepts such as this can be difficult to understand in verbal terms, so let’s see if we can get a better idea using an image. Toward the top of this figure we have the master node, and toward the bottom, we have various worker nodes, which here are labeled one, two, three, and n, up to however many worker nodes we need to solve the problem. The MapReduce process unfolds as follows. As input, the data processing problem is passed into the master node. The master node will then divide that data processing problem into subproblems, which are then assigned to, and carried out by, the various worker nodes. After completing their tasks, the worker nodes will return their results back to the master node, which performs the final combining and processing of the worker nodes’ results in order to produce the final answer, which is then the output of the MapReduce process.

Let’s consider an example. Imagine that we are a wireless service provider, and that we use a highly distributed structured storage database which has 1,000 different servers all over the world. Let’s further assume that our 100 million customers are equally subdivided among our 1,000 servers. That means we would have data for 100,000 customers per node within our database environment. Now let’s imagine that our data processing problem is that we want to figure out the average number of text messages sent during the month of November, and we want those results organized by age. So we would like to know the average number of text messages sent by 18-year-olds, and 19-year-olds, and 20-year-olds, and 21-year-olds, and so forth, all the way up to our oldest customers. Now let’s see how the MapReduce process

can be used to solve this data problem. First, the problem is passed to the master node, and the master node might subdivide the problem such that it instructs each of the 1,000 worker nodes within our database environment to count the total number of text messages sent by each customer during the month of November, and aggregate those results by age. The result of each worker node’s task, then, would be a table of data which might contain three columns. First would be all of the distinct ages of the customers whose data reside on that particular worker node. The second column might be the number of customers who are that age; so we might have 1,000 18-year-olds, 714 19-year-olds, 235 20-year-olds, et cetera. And the third column would be the total number of text messages sent by customers of each particular age. So perhaps 18-year-olds sent 10 million text messages, 19-year-olds sent 9,800,000 text messages, and so forth. Each worker node performs this task for all of the customers whose data are stored on that node, and those results are then returned back to the master node. The master node will then combine the results. It will calculate, for example, the total number of 18-year-olds and the total number of text messages sent by 18-year-olds, after which it can divide those two numbers in order to produce the average number of text messages sent by 18-year-olds. That process is simply repeated for customers of every age, and we then have the results, which we can send back to the user who requested them. So I hope you can appreciate that this MapReduce process is a very clever way of efficiently handling data processing problems in distributed database environments, by taking advantage of parallelization to solve the problem.

Well, my friends, thus ends our overview of big data, data warehousing, and business intelligence applications. And more broadly speaking, thus ends our series of lectures on database technologies. It is my sincere hope that you have found these lectures useful, and most importantly, that you now have the self-confidence to go out and start creating and working with these databases. It has been a great pleasure to lead you on this journey, and I hope that you have a great day.
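As a closing illustration, the text-message example from this lecture maps naturally onto a few lines of Python. This is a single-process sketch of the MapReduce idea, not a distributed implementation: each "worker" aggregates its own customers by age (the map step), and the "master" merges the partial results and computes the averages (the reduce step). The per-node customer data are invented for illustration:

```python
# Map step: each worker node aggregates its local customers by age,
# producing (customer_count, total_messages) per age group.
def map_worker(customers):
    partial = {}
    for age, messages in customers:
        count, total = partial.get(age, (0, 0))
        partial[age] = (count + 1, total + messages)
    return partial

# Reduce step: the master node merges the workers' partial results
# and computes the average number of messages per age group.
def reduce_master(partials):
    merged = {}
    for partial in partials:
        for age, (count, total) in partial.items():
            c, t = merged.get(age, (0, 0))
            merged[age] = (c + count, t + total)
    return {age: total / count for age, (count, total) in merged.items()}

# Hypothetical per-node data: (age, texts sent in November).
node1 = [(18, 100), (18, 140), (19, 80)]
node2 = [(18, 120), (19, 100)]

averages = reduce_master([map_worker(node1), map_worker(node2)])
print(averages)  # {18: 120.0, 19: 90.0}
```

The key design point is that each worker returns only a small summary table (counts and totals per age), so the master combines a handful of numbers rather than scanning 100 million customer records itself.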