R Stats: Multiple Regression – Data Visualisation

Hi, welcome to R Stats. I'm Jakob Cebulski. In this series of videos I'm going to introduce you to some predictive statistical models, such as naive Bayes and regression, with plenty of data visualization.

Welcome back to this mini-series on regression modeling in R. In a previous lesson I showed you how to deal with multiple variables: how to select those variables, eliminate extreme values, and deal with multicollinearity between variables to arrive at a good-quality model. I'm going to follow on from that to show you how you can visualize data which spans multiple dimensions, which is especially useful for multiple regression.

We'll use the same data as before, the automobile data from the UCI repository, where we try to predict the price based on the car specifications, so I'm not going to spend much time explaining it. For explanations of how to construct the model, refer to the previous video.

I'm going to use five libraries which will help me deal with those problems. Let's get the data in, isolate the variables and eliminate the missing values, create a data frame and transform the heavily non-normal variables. We split the data set into training and validation sets and eliminate extreme values, which was quite a long process, but eventually we arrived at those particular observations, and I remove the row names. Eventually we create the fit, which behaved well and which had no multicollinearity, as evidenced by the function vif. OK, so that was very quick.

Let's get a subset of the data for exploration, and we can see the first visualization, one you're familiar with: a correlation plot, which shows the relationship between price, curb weight and peak rpm. Now, we can do better than this and visualize more information about those variables against the fit, and therefore have a better understanding of the errors around the linear model.

The first approach to visualization will be to fix one of those variables. In this case the peak
rpm looks a bit chaotic and dispersed, so we'll fix the values of this particular variable and observe the impact of each value on the relationship between curb weight and price. I'm going to select five specific values from the range of possible values for peak rpm, from the minimum to the maximum: the minimum, the first, second and third quartiles, and the maximum. I'm going to isolate those five values, take them out of the summary table and store them as numeric values in five different variables.

What I'm going to do next is create a range of possible values for the second variable, which is curb weight. I observed that curb weight varies from around 1400 up to 4100, and therefore I created 100 values in that range and stored them in the vector CW.

Now I'm ready to play with this. Let's recall and create the first fit (I'm not going to do the second one), and let's predict the values for the different specific values selected for peak rpm while varying the remaining two
variables. So I'm trying to predict the price, and I'm going to use the other two variables for the prediction, creating a new vector for each of those specific values of peak rpm. Now let's plot all of them: I'm plotting five lines against the range of data points in the data.

OK, here we are. As you can see, the lowest value of peak rpm gives the lowest predicted price, and the highest value of peak rpm corresponds to the highest predicted price, and that holds across all levels of peak rpm. So that's a first way of having a look at it. We can also see that the lines are parallel. Often when you do this you will see the lines intersecting; here it means that there's a good, very linear dependency between curb weight and price.

Let's fix one issue. As you can see, the y-axis has very, very small values. That's because at the very beginning I did a log transform of the price, so this is actually a logarithmically transformed variable. Let's fix it by producing a plot where we do the inverse transformation of the price, and this is done across all values. Now the y-axis is pretty much in the correct units, but that had an effect on the linear model: the fit is no longer linear. It's actually exponential here, because we applied the back transformation.

A good question at this point is: can we trust that this data is really linear at all levels of peak rpm? To answer that, we're going to repeat this, and instead of using a simple linear model we're going to use a loess model, a local polynomial fitting of the data. So let's recalculate all the predictions and plot, initially in the transformed units. We have some data outside this range, and interesting things happen here. The blue line corresponds to the lowest level of peak rpm, and you can see that in the mid-range it's as good as the highest-revving engines; later on it drops off. That
means, yes, the low-revving engines will produce the lowest predicted price. But the most interesting is the red line, which is the engines which rev very, very fast. It looks like, for those cars, where the curb weight is the lightest they're the cheapest. You may have seen some cars which sound like a lawnmower: they are small, they rev fast, they're cheap and they are not very powerful, so they have the lowest price. You can see there's not quite a linear dependency across all levels of peak rpm in cars, and that's evident from this plot.

And of course we could do the correction of the price units, so at least we can see the y-axis in all its dollar glory. It's a bit harder to read, and that's why we need to have the next
type of plot. We use coplots, which allow us to split this type of chart into several panels. Here we fixed the value of one of the variables, peak rpm, and observed the other variables and how they relate to one another. What if we wanted to see the whole range of values in the third variable? This is where coplots come to help.

So we have here a function which will be helpful in creating the coplots. That function simply plots the contents of several possible plots. Here we have six panels, which are guided by the value of the peak rpm variable, which is the conditioning variable for the coplot. The system identifies different ranges of peak rpm which are worth grouping and plotting as separate charts. So this bar, which corresponds to the lowest range of peak rpm, is represented here as a relationship between curb weight and price in that range of peak rpm. The next bar from the bottom corresponds to this chart, and the third corresponds to the third chart. Then the third from the top corresponds to the top-left chart, the next to the top-middle chart, and finally the last, the longest bar, to the top-right chart. Each one of them has more or less the same number of points, and they do overlap. Indeed, as you can see, this bar and that one overlap in peak rpm values quite significantly, and so the charts are very similar here. So this is a way of getting a glimpse into your data across ranges of the third variable when you're limited to a two-dimensional representation.

However, lucky for us, we're dealing here with three dimensions only, and of course three dimensions can be visualized easily on a two-dimensional monitor. So we turn to the third function, scatter3d (the help is shown here), which allows you to plot x, y, z coordinates. Let's plot curb weight, peak rpm and price against each other from the original data, the data in which we simply removed missing values. Let me bring the plot up. This is the data: the price, peak rpm and the weight. The blue plane represents the linear model, the yellow
dots represent the observations, and the green lines the distance to the model. We can actually move this chart to observe it from different angles. We can see that the original data was quite far from the linear model which was constructed for this data set. Quite interesting.

Let's look at three other variables. We rejected horsepower as a variable for the linear model, and also city miles per gallon. Let's see how they compare in 3D. You can see that with horsepower and city miles per gallon we have huge extreme values, and a lot of them in this case, so that probably wasn't a good triple, and you can see it by visual inspection.

Let's look at the third combination: horsepower, curb weight and normalized losses, in case we wanted to predict normalized losses, the losses on insuring the car, based on the weight and the horsepower of the car. You can see here the data is all over the place, as the data is very, very
far away from the regression plane. So we've made the right pick, and there was a process to it.

Now let's look at the effect of cleaning up the data in the model, starting from the raw data and then selecting the data which had no collinearities and no more extreme values. I'm going to restrict viewing the data to the final training set so that we compare like with like. So the first plot is what we've got; however, this is the data where we had collinearities and extreme values, and two variables were not transformed, so they were not normally distributed. This is the effect of the logarithmic transformation of the price, and you can see not much changed, but the units of the price changed, so now it ranges from three to five, and so the whole lot of dots must have been squashed a little bit. Finally, when we got rid of extreme values from the data set and created a linear model, we have a very nice set of observations which are hugging the regression plane, and that's why we used it eventually as the plane which gave us the minimum overall error.

Now we could use the same technique to create an effect similar to the one we started with, when we fixed one of the variables and looked at dependencies between the others. Here there are only three variables, so I'm going to use the price as the variable to group the relationship between the two other variables. I'm going to create a class variable based on the price, and the brackets will be 1,000 up to 30,000 and above. I'm also going to create a similar split for the training set, so I can show you the raw data and the data which was corrected through the cleaning up of the regression model. So, two class variables.

Let's look at the first plot. As you can see, I create a model on the original data set, and the grouping will be by price categories. Let's look at this. You can see we have three planes: we have three groups of observations which belong to three categories
of prices. You can see that in the top one we have the fewest observations, the fewest cars, lots of cars in the middle, and fewer in the bottom range of the price. So that was a nice representation.

Let's look at the view of the data after it was cleaned up and the model was most effective. Not much changed. You can't actually assess it visually because the price is squashed; maybe it's less readable. So visual inspection is not always the best way of assessing model performance.

Let's look at the creation of a loess plane fit. Here is the original data. You can see that the regression planes were not actually fitting the data well. So this is actually a two-dimensional fit of three-dimensional data, and you can see those planes are very deformed, so there's a lot of non-linearity in the data and the models. Let's see if it improved after we created a better model. You can see perhaps that the data has been better separated: there's less overlapping
between the sections of the data. Of course, when the data is spread close together there will be a close overlap. There is a slight improvement in the distribution of data, and those hyperplanes are not so malformed.

Finally, we could use the class grouping to actually group the data and see the groupings. You can see the data is grouped into sort of spheres, and you can see the groups are nonlinearly distributed, and the direction of the groups is twisted in at least two or three dimensions here. What happens when the data was eventually improved and cleaned up and the new models created? The groups are more lined up in three-dimensional space, so the very top group follows the same direction, more or less, as the other two groups. So this is just a visual inspection of groups of data distributed in three dimensions.

If we have more than three dimensions then we're back to techniques similar to those we showed before: we need to fix the values for one, two or three other variables, either in ranges of the variable values or by picking a specific value, and observe the effect of those on the remaining variables.

So, a bit of 3D data visualization and 2D visualization of multiple dimensions, in color. Thank you very much, good luck, and enjoy plotting your linear regressions.
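
As a recap of the fix-one-variable technique, here is a minimal sketch in R. It is not the code from the video and it does not use the automobile data: as a stand-in it assumes the built-in mtcars data set, with mpg playing the role of price, wt the role of curb weight and hp the role of peak rpm. All variable names below are illustrative assumptions.

```r
# Stand-in data: built-in mtcars, NOT the UCI automobile data from the video.
# Assumed roles: mpg ~ price, wt ~ curb weight, hp ~ peak rpm.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Five fixed values of the conditioning variable:
# minimum, first quartile, median, third quartile, maximum.
hp_levels <- as.numeric(quantile(mtcars$hp, probs = c(0, 0.25, 0.5, 0.75, 1)))

# 100 evenly spaced values across the observed range of the varying predictor.
wt_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 100)

# One column of log-scale predictions per fixed hp level (100 x 5 matrix).
pred_log <- sapply(hp_levels, function(h) {
  predict(fit, newdata = data.frame(wt = wt_grid, hp = h))
})

# Back-transform to the original units; straight lines become exponential curves.
pred <- exp(pred_log)

# Five curves, one per fixed hp level.
matplot(wt_grid, pred, type = "l", lty = 1, col = 1:5,
        xlab = "wt (varying predictor)", ylab = "predicted mpg")
legend("topright", legend = paste("hp =", hp_levels), col = 1:5, lty = 1)
```

Because this stand-in model has no interaction term, the five lines are exactly parallel on the log scale; swapping lm for loess in the first line is one way to check whether that linearity really holds at every level.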