Video tutorial: regression analysis in R

Hello everybody. After homework one is done and some of you have already started with homework two, the hypothesis testing, it’s now time to move on to correlation and regression analysis. The purpose of this tutorial is not to give you a complete overview of all the underlying concepts of regression modeling, but mainly to show you how to do the analysis in R.

The correlation coefficient is a quantitative measure of the strength of the linear relationship between two variables. The correlation coefficient ranges between minus 1 and plus 1. A correlation of plus or minus 1 indicates a perfectly linear relationship, while a correlation of zero indicates no linear relationship at all.

Then I continue with the third slide, with examples of correlations. The first example is a positive linear relationship, the second one a negative linear relationship. Both are rather strong, and you can see that we can easily draw a linear trend line through the observations. The third example is a curvilinear relationship with a negative slope at the end. The fourth one is also curvilinear, with a positive trend at the beginning. The fifth and sixth charts are examples where there is no relationship, and if you tried to draw a trend line, it would be a horizontal line through the observation points. The difference between E and F is that they differ in their variance: E has a larger standard deviation than F.

So let me repeat the properties of the correlation coefficient. The first one is that the correlation coefficient is unit-free. What that means is that it’s a pure number, and it allows us to compare the correlation between the variables age and weight with the correlation between age and height. In particular, it is independent of the units in which we measure our variables. We had already that the range is between minus 1 and plus 1. The closer it is to minus 1, the stronger the negative linear relationship, and the closer to plus 1, the stronger the positive
linear relationship, and the closer to zero, the weaker the linear relationship. In real life you will very seldom see a correlation coefficient which is exactly zero, but we often see correlation coefficients which are rather close to zero, and what we do then is test whether they are significantly different from zero. Correlation coefficients of plus 1 or minus 1 are perfect correlations, where all data points fall directly on a straight line. We have perfect correlations in particular when we measure the same variable in two different units, for instance height in centimeters and in inches.

So now let’s look at an example and, probably even more important, how do I get the correlation with R? I use again our example data of the hourly wages and see how the hourly wages are correlated with age, education and experience. So first, our command in R is cor, which stands for correlation, so it’s
relatively easy to remember what the command is: cor. Next we have to tell R which correlations we are interested in, so correlations between which variables. We called our data set at that time my data, so we want to know the correlation of my data. But in this case we are not interested in all variables, only in four variables, so we restrict our correlation to these four variables in the following way. We have these square brackets, which say we take only part of my data. Here is a comma, and what is to the right of the comma tells us which variables we are interested in. We write c, open a parenthesis, and give the names of our variables of interest. In the finale case we don’t have wage or age but q4 and q10, so it would look like that.

The output of our correlation command is a matrix with our four variables as rows as well as columns. First of all, we see that the diagonal entries are all 1, and this just tells us that each variable is perfectly correlated with itself, so wage is perfectly correlated with wage. What we also see, but it’s a little bit more difficult, is that it’s a symmetric matrix, so age and wage is the same as wage and age, and the difference between these two is only the rounding. Normally we look only at a part of this matrix, and it is sufficient to look at the lower part of the matrix.

When we look at the correlations of variables, you should always think about: does this make sense to us? Can we understand the correlation coefficient? Let’s pick some of these correlation coefficients. Let’s take wage and education, which is 0.38. We would say that’s a moderate correlation. It’s positive, so can we explain this relationship? Yes, I would say it makes a lot of sense that the higher the education, the higher the wage. Another example is the correlation between age and experience, and this is actually very high and very close to perfect correlation. It just reflects the fact that the older the people are, the older the workers are,
the more experience they have accumulated during their career. Now let’s look at experience and wage. This is very small, and I guess if we made a significance test it might not be significantly different from zero. This result is kind of counterintuitive. Normally we believe that with increasing experience the wage or the salary should increase. That can actually give us some indication for our future regression analysis, because that’s something interesting to look at. One possible explanation can be that a large part of the data set consists of jobs which do not require much experience.

The correlation matrix is actually a very good tool to check whether the data at hand are in line with your common sense and your expectation of what is going on in what you are looking at. Think about the finale case: would you expect a positive linear relationship between service and overall customer satisfaction?

While the correlation coefficient is
very good at summarizing the overall linear relationship between two variables, it may also sometimes mislead us if there is a nonlinear relationship between the variables. For instance, in the examples of the curvilinear relationships we saw earlier, there was one range of the data where there was really no relationship, but another part with a rather strong relationship, and we may not see that if we only look at the correlation coefficient. In R you can easily plot the scatter plots for all your variables of interest. The way you do that is you write plot and, in parentheses, the same data as you provided for the correlation command. What we get now is a scatter plot matrix. For instance, it’s very apparent that there is a very strong positive relationship between age and experience, as we also saw with the correlation coefficient. For education and wage you have this scatter diagram, and what we can see first is that there are a couple of points which have a very clear trend, while the majority of observations follow a less clear trend. However, what we can see is that if we increase the education, the wage moves further to the right. So compared to the correlation coefficient we had, which was 0.38, yes, we can see a trend, but it is less obvious than in the case of age and experience. What we can learn from this is that the correlation coefficient, which is one single number, and the scatter plot can complement each other very well.

So now it’s time to move on to linear regression, and we start with simple linear regression, which is a relationship between one independent variable that influences our dependent variable. We develop all the underlying concepts for linear regression with simple linear regression, because it’s much easier to draw the observations, the trend lines and the underlying logic. Here’s the example from the book with sales agents and the sales they make in a given period of time. The independent variable is the years, that is, the time an
employee has been with the company, and what we want to explain is the sales he or she makes. What we can see is that there is clearly a positive relationship between these two variables, and from the logic we would say that the longer an employee is with the firm, the more sales he or she makes. So if I asked you to draw a trend line through these points of observation, most of us would come up with a line similar to this one. What we are automatically doing when we draw such a trend line is that we minimize the distance of our observations to our trend line, and actually what we do is minimize the squared distance to our trend line.

So now, why are we doing this? If we have this relationship and if we have our blue trend line, first of all we know there is a relationship, so in our case we know
it’s favorable if a sales employee stays a longer time with the firm. It also allows us to predict, given an employee with a given number of years within the firm, what this employee’s sales will be on average. Sales employees may use this relationship in negotiations on salaries. The managers may want to think about measures to retain sales staff who have already been with the company for many years, because they are the most valuable. So knowing about the relationship between two variables and being able to predict can be used by different parties to improve their situation.

Now we write linear regression modeling into a formula, and we are currently looking at simple linear regression. We have a dependent variable. Actually, in real life we start earlier: we start with a problem or a question we want to answer, so in the case of the simple example, what is influencing our revenue per employee? Normally what we want to do is improve our situation. The variable of interest is what we call the dependent variable, and we call it the y variable. That is what we are interested in and what we want to explain. It is influenced by our independent variable x, and the strength of the influence is in the coefficient beta one, which is the slope of the regression line in a scatter plot. For a line we have two parameters to define it: the slope, which is our coefficient beta one, and the intercept, the y value for an x value of zero. Last but not least we have epsilon, which is a random error term that captures everything which is not explained by our independent variable. The error term captures all the influence from other variables. In our example of the sales and employees it could be the education of the employees, or it could be their customer relationships.

Underlying our regression model there are several assumptions, and it is important that we meet these assumptions. Otherwise our estimates may be wrong. For the
purpose of this tutorial I don’t want to go into them now, but I urge you to go back either to the slides from class or to the textbook and familiarize yourself with these assumptions. This slide shows the assumptions in a graphical form, and it is actually a good test for you: if you can explain the assumptions with this chart, you have understood quite well what needs to be met in order to perform a regression analysis.

Now we switch to multiple linear regression. It looks very similar to simple linear regression, and the major difference is that we now have more than one independent variable. Again we start with a question, with a problem we want to solve or understand, and this gives us our variable of interest, which we call the dependent variable, which is y. Then, based on theoretical considerations, on common sense and on your business understanding, you formulate your independent variables, which are the factors you think are influencing y. And although we now have more than one
variable, we will still have an epsilon, a random error term, which captures all the influences which we do not explicitly formulate in our regression model. This formula is the formula for the total population, but in real life our data will be a sample of the overall population. We use our data, for instance the data on the wages or the finale data, and estimate these parameters based on our sample. We then get regression coefficients b1, b2 and so on, and these coefficients will tell us whether our independent variables influence our variable of interest, the dependent variable.

Now let’s estimate an example. As an example we use the prices of housing, and we have several variables like square feet, the age of the house, the number of bedrooms, the number of bathrooms and the number of garages. We start with the correlation matrix, and let’s look at the correlation of price with the potential independent variables. First, we have a rather strong relationship between the square feet and the price. Does this make sense? Yes, I would say so, because the larger the house, the higher the price. The next one is the age, and here we have a negative relationship, so the older the house, the lower the price, which also makes sense. We also have positive correlations between the number of bedrooms, the number of bathrooms and the number of garages and the price of the house.

Let’s look at a last example of a correlation coefficient, the correlation between the number of bedrooms and square feet, and this is 0.7. First of all it makes sense: the more bedrooms our house has, the larger the area of the house. The second thing we see is that the correlation is rather high, and that actually means that both are rather similar things. In regression analysis we call this multicollinearity. So if there is a rather strong correlation between two independent variables, we have multicollinearity, and this can cause
problems in our linear regression, and we will come back to that. And of course, if you really want to do it well, also look at the scatter plots.

As the purpose of this tutorial is to show you how to do linear regression in R, and in particular how to analyze the results and how to diagnose them, I have skipped a lot of the theoretical concepts. Some of these will come back when we now switch to doing linear regression modeling and interpreting the results.

So now let’s develop a regression model where we are interested in the price of houses. We first start with all these variables as independent variables. Before we do all this modeling, let’s think about who would be interested in doing something like this. It could be you who owns a house and is considering selling it, and you want to get a sense of the market value of your house in order to be better prepared for negotiations. Or you are on the other side of the table and you want to buy a house: you have a certain budget and you want to get a better understanding of what you can get
for your budget. Or you are a developer and you want to decide whether to build a new complex, whether there is actually a business case for it, or how many bedrooms would actually be the perfect number.

Before we look at the actual result of the regression model, let’s first recapitulate what we should look at in the results. The first thing is: is the overall model significant? In other words, is it telling us more than just saying that the price of a house is the average price of all the houses in our sample? In addition to the overall significance, we also want to know the explanatory power of our regression model.

After we have looked at the overall model, we are interested in the individual variables and whether they are actually significant. What that means is: does the specific individual variable actually influence our dependent variable? Let’s take the example of house prices. You are a developer, and let’s assume you find out that the number of bathrooms does not actually influence the price. So why should you build a house with two bathrooms rather than with one bathroom?

The third one is whether the standard deviation of the model error is too large to provide meaningful results. What that means is that we are doing our regression analysis not only to understand which variables are influencing our dependent variable, but in most cases also to predict the dependent variable, so in our example to predict the price of the house which we want to sell, based on the characteristics of our house. But what if you get a predicted house price of twenty thousand dollars plus or minus one hundred thousand dollars? That does not really help us. It is too broad, and it does not really have meaning for our decision-making.

The next two items look more at the assumptions. The first one is the problem of multicollinearity: did we put in variables which are more or less covering the same thing? And the last point: are the assumptions of
a regression analysis satisfied? This I will discuss later.

So now let’s look at the result we get from R and linear regression. The first thing we get is the call, which repeats the command in R for linear regression. The command for linear regression is lm, and lm stands for linear model. The next thing is the formula, and this is the theoretical formula written down for our specific example. Our dependent variable y is price, and rather than writing an equals sign you have to use the tilde. Probably the most difficult thing in writing this example is to find the tilde on the keyboard. Next you write one independent variable after the other, separated by a plus. Finally we have to tell R what our data are, and in this case our data set was called housing.

The remaining part of the output are the results. The first question is about the overall significance of our model: does our model have some
explanatory power? This we check with the F-test. We look at the p-value of the F-test, and if the p-value is less than our alpha, where very often we use 0.05, that is 5 percent, we reject the null hypothesis. The null hypothesis is that there is no explanatory power, and in our case we reject it. So if we have a very small p-value for our F-test, we conclude that our model explains something to us. This 2.2e-16 is 2.2 times 10 to the power of minus 16, that is 2.2 times 0.000… with 16 zeros, so a very small value, more or less the same as 0, and this is definitely smaller than alpha.

The next thing we look at is the adjusted R-square, and this tells us how much of the variance in the data is explained by the model. We have a value of 0.81331, which is rather high, so our model has a rather high explanatory power. And remember, the adjusted R-square is always less than the R-square, to take into account that the more variables we include, the higher the R-square of the model will become. The adjusted R-square or the R-square can reach a maximum of 1.

The next thing are the coefficients of our model. First we have their estimates, so for instance for square feet 63, and that means that for an increase of the area of the house by one square foot the price would increase by 63 dollars. With each year of increasing age the price decreases by 1,144 dollars on average. For each estimate we have a standard error, from which we can calculate the t value and obtain our p-value. What we want to know is: does this variable indeed influence our dependent variable? The underlying null hypothesis is that the coefficient, the slope, is 0. We test this hypothesis, and we see that for all our independent variables the p-value is small: in the case of square feet and age it is very small, for bedrooms it is still very small, for bathrooms it is not as small but still smaller than a significance level of five percent, and for garages it is again very small. So what we are
actually testing is whether the independent variables indeed have an influence on the dependent variable. In the language of hypothesis testing, the null hypothesis is that our coefficients, the betas, which are the slopes of our line, are zero. If the p-values are less than alpha, the significance level, which is typically set to five percent, we have to reject the null hypothesis that the slope is zero, and we conclude that there is indeed an influence of the independent variables on our dependent variable, the house price. In our case we have to reject the null hypothesis for all our independent variables, and we conclude that all of them influence our house price.

The last thing we want to
understand for now in this block of the coefficients is the meaning of these stars. Very often we look at the stars rather than the p-values themselves to make our decision on the significance. The meaning of the stars, which are also called significance codes, is given here. Three stars means the p-value is less than 0.001, and you can check here that these are indeed very small values. Two stars means it’s less than 0.01, and if we look at this example here, it is less than 0.01 but larger than 0.001, which would otherwise earn it three stars. One star means less than 0.05, and we have an example here with 0.026, which is less than 0.05. As we normally set our significance level implicitly to 0.05, you consider all variables which have either one, two or three stars as indicating that the slopes are significantly different from zero.

And indeed we see a negative sign with the bedrooms and the price. That would mean the more bedrooms our house has, the lower the price, and this is very counterintuitive and also in contradiction to the correlation coefficient we saw previously. So this is an indication of multicollinearity, and we saw already in the correlation matrix that bedrooms and square feet are highly correlated. A cure for this problem is just to omit one of the two variables that are highly correlated, and in our example it would make sense to leave out bedrooms as an independent variable in our regression model.

The next diagnosis step is multicollinearity. Multicollinearity is a high correlation between two independent variables. These two variables then contribute redundant information to the model, because they capture to a large degree similar information. The problem is that these highly correlated independent variables, when they are included together in the regression model, can adversely affect the regression results. As a rule of thumb, if the correlation coefficient is equal to
or larger than 0.7, we have an indication that there may be an issue of multicollinearity, and we need to be careful. When we go back to the correlation matrix we had previously, we remember that we had a rather high correlation between square feet and bedrooms, so there is our concern about multicollinearity.

Before we go back to our regression results, let’s look at some indications of multicollinearity in the regression result. The first one is unexpected, and therefore probably incorrect, signs of the coefficients. The next is, if we add our independent variables step by step and create separate models, that there is a sizable change in the values of the previous coefficients when we add a new variable. Or a previously significant variable becomes insignificant when we add a new independent variable. Or the estimate of the standard deviation of the model error increases when a variable is entered. So these three indicators require adding
new variables, while one of them we can find already with one model. So now let’s go back to our result and see whether we see some unexpected sign in a coefficient.

So far we had only independent variables which are quantitative variables, but we can also have qualitative variables in our regression. A common example of these qualitative predictors are so-called dummy variables. Dummy variables are qualitative variables which can take only two values, either 0 or 1. A typical example is gender, which can be female or male, and which is then given the values 0 or 1. In general we can have qualitative variables with more than two outcomes. For instance, in the example of our wages, the profession had more than two outcomes: there were workers, management, service, technical and so on.

So the question arises how to interpret the results of a regression model if we add qualitative predictors. First let’s look at the addition of a dummy variable. In the example of our housing there is an additional variable called area, and it indicates the neighborhood, so more or less: is it a good or a bad neighborhood? We have one more line in the results for our coefficients. We see that the coefficient is highly significant, and we also see that it has a strong influence. As area is a dummy variable, it can have values of 0 or 1. The result for area equal to 0 is implicitly in the model; it is the default and is partially captured in the intercept. What area 1 tells us then is that if the house is not in area 0 but in area 1, the price increases by sixty-two thousand dollars.

We can understand this interpretation also graphically, and I show you this in the next slide with an example from the book with MBA graduates. This is the example of a regression model with age as independent variable and salary as the dependent variable. This is the regression line for the default, that the employees do not have an MBA, with a certain intercept close to zero. And as it is a dummy variable whether you have
an MBA, no or yes, the influence of having an MBA shifts the whole regression line up by an amount which is exactly the coefficient of having an MBA, yes or no. So the coefficient of a dummy variable is actually the amount by which the intercept shifts.

So what happens now if we have not a dummy variable but a qualitative variable with more than two levels? For this I go back to the example of the wages and the different professions of the employees. This is the result of a regression model with wage as the dependent variable and education, experience, occupation and gender as predictors. Let’s only look at the results part for occupation. First of all, we don’t see management, and that implies that management is the
default, with a coefficient of zero, and all the other professions are compared to the management profession. When we covered descriptive statistics and used these data as an example, what we saw was that the managers had the highest salary compared to all other professions. So what we expect now is that the coefficients for all the other professions are negative, and that is exactly what we see here: all are negative. However, the coefficient for technical is not significant, so it’s actually not significantly different from zero, and what we had seen in our descriptive statistics was that the average wage levels of management and technical were rather close. From the previous example of descriptive statistics we had also learned that workers had a relatively high wage level, but lower than management, and the negative coefficient tells us that on average a worker has an hourly wage that is lower by roughly two dollars compared to a management employee.

Now we can fully interpret all the results of our regression model. What we have not looked at yet is this part, the residuals. The residuals help us to check whether the underlying assumptions of the regression model are actually met. So first let’s go back to the assumptions. First, the error terms are statistically independent of one another, so that means that the error term of this x value is independent of the error term of that x value. The next is that they are normally distributed, so that they have this bell-shaped curve. The third one is that the distributions of the epsilon values, the error values, have equal variance, so that the standard deviations are the same independent of x. And finally, the means of the dependent variable can be connected by a straight line.

We can use the residuals to test whether these assumptions are given. The residual is the difference between the actual value of our dependent variable and the value predicted by the model, and by plotting these residuals we can detect
problems with these assumptions. You can see whether the regression function is not linear, whether the residuals do not have constant variance, whether the residuals are not independent, or whether they are not normally distributed. In the following slides we will look at examples of plots of these residuals, and we will check whether the assumptions of linear regression are given or not.

In the top chart here we see some trend between the x value and our residuals, and we see some cyclical behavior. In case x were time, we would call this a seasonal trend. In the bottom chart the residuals are randomly distributed above and below the zero line, so we conclude that this is actually a linear pattern.

Here we see in
the top chart again the residuals randomly distributed above and below the zero line, with no specific trend dependent on our x value. In the bottom chart what we see is that for small x the residuals are all very close to the zero line, while for larger x values the spread is increasing. So this is an indication of non-constant variance and would indicate a problem.

In this chart, again, on the top we have a random distribution and we conclude we have independent residuals, while in the bottom chart there is a clear pattern. In this case, where time is the independent variable, there is a dependence of this value on this value, and of this value on this one.

Finally, we can plot the histogram of the standardized residuals and check whether the mean is around zero and whether we have approximately a bell-shaped curve.

I know this was a lot of diagnosis, and if we want to do all the plots of residuals for the different independent variables it may become an endless endeavor. So what are my expectations for homework 3 and the regression modeling? Certainly I want to have a full regression model with the diagnosis, meaning the significance of the overall model, looking at the adjusted R-square, analyzing each independent variable, the size of its coefficient and its significance, and looking at some indications of multicollinearity.

Regarding the analysis of the assumptions of the modeling, I would like you to look at the residuals. For that, let’s go back to the result of the regression model, where we have some information given on our residuals. We know that the mean of the residuals is zero. The first thing we can do is compare the median with the mean. Yes, the median is not zero, but compared to our minimum and our maximum it’s rather close to zero, so there is a first indication of symmetry. The next is looking at the first and third quartiles: the absolute values are rather close together, so again we have some indication of symmetry. This is no longer
the case for the minimum and maximum, but these are really extreme values. So from what we see here, we have some evidence that the distribution of the residuals is rather symmetric, and there is no indication for concern on that level. This is the diagnosis regarding the assumptions of the regression model that I expect for the homework. If you want to look in more detail at the residuals, you’re welcome to do that, but it is not a requirement.
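
As a recap of the commands mentioned in this tutorial, here is a minimal R sketch of the whole workflow. It does not use the housing data from the slides: the data frame below is simulated, and the column names (sqft, age, garage, area) as well as all numbers are assumptions made only so that the example runs on its own. The functions cor, plot, lm and summary are the ones discussed in the tutorial; the way the pieces are pulled out of the summary is just one possible way to do it.

```r
# Simulated stand-in for the housing data (all values are invented).
set.seed(1)
n <- 200
housing <- data.frame(
  sqft   = round(runif(n, 800, 3500)),        # living area in square feet
  age    = round(runif(n, 0, 50)),            # age of the house in years
  garage = sample(0:2, n, replace = TRUE),    # number of garages
  area   = sample(0:1, n, replace = TRUE)     # dummy variable: 0 or 1
)
housing$price <- 30000 + 60 * housing$sqft - 1000 * housing$age +
  8000 * housing$garage + 50000 * housing$area + rnorm(n, sd = 20000)

# Correlation matrix, restricted to the columns named to the right of the comma.
vars <- c("price", "sqft", "age", "garage")
cor_mat <- cor(housing[, vars])
print(round(cor_mat, 2))

# Scatter plot matrix for the same columns.
plot(housing[, vars])

# Multiple linear regression: dependent variable left of the tilde,
# independent variables separated by a plus, then the data set.
fit <- lm(price ~ sqft + age + garage + area, data = housing)
s <- summary(fit)
print(s)   # call, residual quartiles, coefficients with stars, R-square, F-test

# Overall significance: p-value of the F-test.
f <- s$fstatistic
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Explanatory power and individual coefficients.
s$adj.r.squared
coef(s)    # estimate, standard error, t value and p-value per variable

# Residual check: minimum, quartiles, median and maximum, then a histogram
# of the standardized residuals.
quantile(residuals(fit))
hist(rstandard(fit))
```

With your own data you would replace the simulated data frame by your data set and the variable names by your own columns. Comparing the median with zero and the first with the third quartile in the quantile output is the symmetry check asked for in the homework.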