Tuesday, December 12, 2017

Assignment 6: Regression Analysis

Introduction

This assignment was divided into two parts, both of which use regression analysis on provided data to solve real-world problems.  The assignment developed the skills needed to run a regression analysis in the statistical software SPSS and to interpret the generated output, including making predictions from given data.  It also provided practice with the preparation that SPSS requires: the data must first be manipulated and organized in Excel and, if necessary, joined in ArcGIS.  Finally, the assignment allowed for practice mapping standardized residuals in ArcGIS and interpreting the spatial results using statistical outputs.

Part I

For this part of the assignment, data was supplied in an Excel file giving the percent of children that get free lunch in an area and the crime rate per 100,000 people in that location.  A news station had previously claimed that crime increases as the number of kids who receive free lunches increases.  The task of this part is to run a regression analysis in SPSS to determine whether the news station's claim is correct and whether there is a relationship between the two variables.

To begin, the independent and dependent variable roles had to be assigned to the two data variables (crime rate and free lunch percent).  The dependent variable is the variable explained by the independent variable.  Given the news station's assumption that the free lunch rate influences the crime rate, free lunch was treated as the independent variable and crime rate as the dependent variable.  SPSS was used to run a linear regression analysis of the two variables to examine their relationship.  The Excel file containing the two variables was first loaded into the software.  Next, the linear regression analysis was selected and the two variables were entered in their assigned independent and dependent roles.  The output consisted of several tables describing the relationship between the two variables.  The first was the Model Summary table in Figure 1.  The Model Summary table gives the R-Square value, the coefficient of determination, which describes how much of the variation in the Y variable is explained by the X variable.  The higher the R-Square value, the less unexplained variation remains and the stronger the relationship between the two variables.  R-Square values range between 0 and 1, where 0 indicates no relationship between the two variables and 1 indicates a perfect linear relationship.  The R-Square value in this scenario is 0.173, which indicates a very weak relationship between the two variables.

Figure 1. Model summary table for the independent variable the percent of children that get free lunch in an area and the dependent variable the crime rate per 100,000 people in that location generated in the SPSS linear regression analysis.
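
As a minimal sketch of how an R-Square value like the one in the Model Summary table is computed for a simple linear regression, the Python below fits a least-squares line and reports the share of the variation in y that the line explains.  The data values are hypothetical and chosen only for illustration (they are not the assignment data), and the function name r_squared is likewise an assumption.

```python
from statistics import mean

def r_squared(x, y):
    """Return R-Square for a least-squares line fitted to paired x and y values."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope and intercept of the least-squares line y = a + b*x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    # Variation left unexplained by the line, and total variation in y
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical values for illustration only (not the assignment data)
free_lunch_pct = [10, 20, 30, 40, 50]
crime_rate = [35, 60, 50, 95, 80]
print(round(r_squared(free_lunch_pct, crime_rate), 3))
```

A value near 1 means the line accounts for almost all of the variation in y, while a value near 0 means it accounts for almost none.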

The Coefficients table was also generated as an output of SPSS.  The Coefficients table in Figure 2 helps build a linear model equation for the relationship between the two variables, which can be used to make predictions based on the given data.  The linear equation formed is y = a + bx, where "a" is the constant and "b" is the slope.  The Unstandardized B of the constant in the table in Figure 2 is the constant in the linear equation: a = 21.819.  The Unstandardized B of PerFreeLunch is the slope in the linear equation: b = 1.685.  The slope is positive, meaning the relationship between the two variables is positive: as the percent of kids on free lunch increases, the crime rate increases.  The linear equation for the weak relationship between kids on free lunch and the crime rate in those areas is y = 21.819 + 1.685x, where x is the percent of kids on free lunch and y is the crime rate.  The Significance value in the Coefficients table is also important.  The null hypothesis, that there is no linear relationship between the independent and dependent variables, can be rejected if the significance value is less than 0.05.  The Significance value for PerFreeLunch is 0.005, so the null hypothesis is rejected and the alternative hypothesis, that there is a linear relationship between the independent and dependent variables, is accepted.  Given the two tables, it can be concluded that a very weak positive linear relationship exists between kids on free lunch and the crime rate in those areas.

Figure 2. Coefficients table for the independent variable the percent of children that get free lunch in an area and the dependent variable the crime rate per 100,000 people in that location generated in the SPSS linear regression analysis.

Next, a prediction of the crime rate was requested for a town where 30% of the children receive free lunch.  The linear equation created for the relationship between kids on free lunch and the crime rate in those areas (y = 21.819 + 1.685x) is used to make the prediction.  The 30% value of children receiving free lunch is used as the "x" value.  The resulting "y" value is a crime rate of 72.369 per 100,000 people.  In Excel, a scatterplot of free lunch percentages against crime rates was created.  The scatterplot can be seen in Figure 3.  The scatterplot can be used to check whether the prediction appears reasonable given the distribution of the data points.  The prediction does fit in with the existing data.  There is one outlier, with a crime rate of 704.1, that does not match the rest of the plotted data.  This point appears to be skewing the trend line slightly.  The linear equation and R-Square value for this scenario are also given in Figure 3.

Figure 3. Scatterplot of the comparison of the percent of kids receiving free lunch and crime rate.  
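
The prediction described above can be reproduced with a few lines of Python.  This is only a sketch: the constant and slope are the values reported in the Coefficients table, while the function name predict_crime_rate is an assumption made for illustration.

```python
# Coefficients reported in the SPSS Coefficients table (Figure 2)
INTERCEPT = 21.819  # constant "a"
SLOPE = 1.685       # slope "b" for PerFreeLunch

def predict_crime_rate(free_lunch_pct):
    """Predicted crime rate per 100,000 people for a given free lunch percentage."""
    return INTERCEPT + SLOPE * free_lunch_pct

# A town where 30% of children receive free lunch
print(round(predict_crime_rate(30), 3))  # 72.369
```

Because the R-Square value is low, a prediction like this should be read as a rough estimate rather than a precise figure.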

Given the results of the regression analysis, the news station's claim that the crime rate increases with the percentage of kids receiving free lunch is supported.  However, the low R-Square value suggests that there are other factors and variables that influence crime rate more strongly than the percentage of kids receiving free lunch, which explains only a small amount of the variation in crime rate.  Only 17.3% of the variation in crime rate can be explained by the percentage of kids receiving free lunch.  The significance value of 0.005 indicates a statistically significant relationship between the two tested variables.  Because of this significance value, there is still considerable confidence that the relationship is real, even though the percent of kids receiving free lunch does not explain much of the variation in crime rate.

Part II

Introduction

A company has expressed interest in building a new hospital in the city of Portland.  The company wants to know where the best place to build the new facility would be, given the need for one, and how large an ER the facility will need depending on where it is built.  A shapefile and an Excel file of data on 911 calls in Portland, Oregon were provided to answer these questions.  The supplied data included information on the number of 911 calls (Calls), number of alcohol sales (AlcoholX), number of unemployed people (Unemployed), number of foreign born people (ForgnBorn), the median income (Med Income), and number of college graduates (CollGrads).  All of the data was provided at the census tract level.

Methods

To begin, three simple (single-variable) linear regression analyses were run on the data in SPSS.  The provided Excel table was used to conduct the regression analyses.  Three of the provided data variables (number of alcohol sales, number of unemployed people, and number of college graduates) were each used as the independent variable in an analysis against the number of 911 calls per census tract, the dependent variable.

After the three regression analyses were conducted, the output tables were examined to find the independent variable that had the highest R-Square value and was statistically significant.  The number of unemployed people variable was selected.  This variable was used to create a residual map in ArcMap.  To create the map, the "Ordinary Least Squares" tool was opened in ArcToolbox under "Spatial Statistics Tools", within the "Modeling Spatial Relationships" toolset.  The supplied feature class containing all of the variable data at the census tract level was selected as the "Input Feature Class" and the "Unique ID Field" was set to the UniqID field within the feature class.  The dependent variable was set as Calls and the independent variable was set as Unemployed.  After the tool was run, an output shapefile was generated showing standardized residuals.
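
To make the idea of a standardized residual concrete, the sketch below fits a least-squares line and divides each residual by the residual standard error.  It is a simplified illustration and is not claimed to be the exact calculation the ArcMap tool performs.  The tract values and the function name standardized_residuals are hypothetical.

```python
from math import sqrt
from statistics import mean

def standardized_residuals(x, y):
    """Return each residual of a least-squares line divided by the residual standard error."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope and intercept of the least-squares line y = a + b*x
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
    a = y_bar - b * x_bar
    # Residual = observed value minus predicted value
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom (assumed simplification)
    s = sqrt(sum(r ** 2 for r in residuals) / (len(x) - 2))
    return [r / s for r in residuals]

# Hypothetical tract values for illustration only (not the Portland data)
unemployed = [10, 25, 40, 55, 70, 85]
calls = [6, 14, 20, 45, 36, 44]
for tract, z in enumerate(standardized_residuals(unemployed, calls), start=1):
    print(tract, round(z, 2))
```

A positive value marks a tract with more calls than the line predicts, and a negative value marks a tract with fewer.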

Results

SPSS Regression Analyses

First, the SPSS outputs from the three simple linear regression analyses were interpreted.  For each regression analysis the Model Summary and Coefficients tables were analyzed.

The regression analysis run between the number of alcohol sales and the number of 911 calls resulted in the Model Summary table in Figure 4 and the Coefficients table in Figure 5.  The Model Summary table gives an R-Square value of 0.152, suggesting a very weak relationship between the two variables.  The number of alcohol sales explains only 15.2% of the variation in the number of 911 phone calls.


Figure 4. Model summary table for the independent variable the number of alcohol sales and the dependent variable the number of 911 calls generated in the SPSS linear regression analysis.

The Coefficients table in Figure 5 gives the data needed to build a linear model equation: y = a + bx.  The constant in the equation is the Unstandardized B of the Constant in the table in Figure 5, which is a = 9.590.  The slope in the equation is the Unstandardized B of AlcoholX: b = 3.069E-5.  The slope is positive, meaning there is a positive relationship between the two variables: as the number of alcohol sales increases, the number of 911 phone calls increases.  The equation formed for the weak linear relationship between the number of alcohol sales and the number of 911 phone calls is y = 9.590 + 3.069E-5x, where x is the number of alcohol sales and y is the number of 911 phone calls.  The significance value determines whether the null hypothesis, that there is no linear relationship between the two variables, can be rejected.  The significance value given in the Coefficients table is 0.000, which is less than 0.05, so the null hypothesis is rejected.  The alternative hypothesis is accepted and it can be determined that there is a weak positive linear relationship between the number of alcohol sales and the number of 911 phone calls.  When deciding where to place a hospital, the number of alcohol sales in the census tract should not be given great weight because it explains only 15.2% of the variation in the number of 911 phone calls.  This is a small percentage and would not capture where the majority of the 911 phone calls are coming from.

Figure 5. Coefficients table for the independent variable number of alcohol sales and the dependent variable the number of 911 calls generated in the SPSS linear regression analysis.

The regression analysis run between the number of unemployed people and the number of 911 calls resulted in the Model Summary table in Figure 6 and the Coefficients table in Figure 7.  The Model Summary table gives an R-Square value of 0.543, suggesting a relatively strong relationship between the two variables.  The number of unemployed people explains 54.3% of the variation in the number of 911 phone calls.

Figure 6. Model summary table for the independent variable the number of unemployed people and the dependent variable the number of 911 calls generated in the SPSS linear regression analysis.

The Coefficients table in Figure 7 gives the data needed to build a linear model equation: y = a + bx.  The constant in the equation is the Unstandardized B of the Constant in the table in Figure 7, which is a = 1.106.  The slope in the equation is the Unstandardized B of Unemployed: b = 0.507.  The slope is positive, meaning there is a positive relationship between the two variables: as the number of unemployed people increases, the number of 911 phone calls increases.  The equation formed for the linear relationship between the number of unemployed people and the number of 911 phone calls is y = 1.106 + 0.507x, where x is the number of unemployed people and y is the number of 911 phone calls.  The significance value determines whether the null hypothesis, that there is no linear relationship between the two variables, can be rejected.  The significance value given in the Coefficients table is 0.000, which is less than 0.05, so the null hypothesis is rejected.  The alternative hypothesis is accepted and it can be determined that there is a relatively strong positive linear relationship between the number of unemployed people and the number of 911 phone calls.  When deciding where to place a hospital, the number of unemployed people in the census tract should be given great weight because it explains 54.3% of the variation in the number of 911 phone calls.  This is a large percentage, covering over half of the variation in 911 phone calls, and would be the most helpful of the three variables in determining which census tracts the most 911 phone calls are coming from.

Figure 7. Coefficients table for the independent variable number of unemployed people and the dependent variable the number of 911 calls generated in the SPSS linear regression analysis. 

The regression analysis run between the number of college graduates and the number of 911 calls resulted in the Model Summary table in Figure 8 and the Coefficients table in Figure 9.  The Model Summary table gives an R-Square value of 0.095, suggesting a weak relationship between the two variables.  The number of college graduates explains only 9.5% of the variation in the number of 911 phone calls.

 
Figure 8. Model summary table for the independent variable the number of college graduates and the dependent variable the number of 911 calls generated in the SPSS linear regression analysis.

The Coefficients table in Figure 9 gives the data needed to build a linear model equation: y = a + bx.  The constant in the equation is the Unstandardized B of the Constant in the table in Figure 9, which is a = 13.377.  The slope in the equation is the Unstandardized B of CollGrads: b = 0.029.  The slope is positive, meaning there is a positive relationship between the two variables: as the number of college graduates increases, the number of 911 phone calls increases.  The equation formed for the weak linear relationship between the number of college graduates and the number of 911 phone calls is y = 13.377 + 0.029x, where x is the number of college graduates and y is the number of 911 phone calls.  The significance value determines whether the null hypothesis, that there is no linear relationship between the two variables, can be rejected.  The significance value given in the Coefficients table is 0.004, which is less than 0.05, so the null hypothesis is rejected.  The alternative hypothesis is accepted and it can be determined that there is a weak positive linear relationship between the number of college graduates and the number of 911 phone calls.  When deciding where to place a hospital, the number of college graduates in the census tract should not be given great weight because it explains only 9.5% of the variation in the number of 911 phone calls.  This is a small percentage and would not capture where the majority of the 911 phone calls are coming from.

Figure 9. Coefficients table for the independent variable number of college graduates and the dependent variable the number of 911 phone calls generated in the SPSS linear regression analysis. 
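
The selection rule from the Methods section (keep the statistically significant variable with the highest R-Square value) can be sketched in Python using the values reported in Figures 4 through 9.  The dictionary layout and the function name best_significant_predictor are assumptions made for illustration.  A significance of 0.000 is the rounded value SPSS displays, not an exact zero.

```python
# R-Square and significance values as reported in the SPSS output tables
results = {
    "AlcoholX": {"r_square": 0.152, "sig": 0.000},
    "Unemployed": {"r_square": 0.543, "sig": 0.000},
    "CollGrads": {"r_square": 0.095, "sig": 0.004},
}

def best_significant_predictor(results, alpha=0.05):
    """Return the variable with the highest R-Square among those with sig < alpha."""
    significant = {name: r for name, r in results.items() if r["sig"] < alpha}
    if not significant:
        return None  # no variable passes the significance test
    return max(significant, key=lambda name: significant[name]["r_square"])

print(best_significant_predictor(results))  # Unemployed
```

All three variables pass the 0.05 significance test, so the choice comes down to R-Square, where Unemployed is clearly the highest.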

Mapping the Results

A map was created to determine which census tracts in Portland, Oregon the majority of the 911 phone calls were coming from.  The map can be seen in Figure 10.  There are six main census tracts in Portland with the greatest numbers of 911 phone calls.  The general trend also shows the largest numbers of 911 phone calls in the center and eastern side of the city, with fewer 911 phone calls on the western side.  The new hospital should therefore be placed in census tracts in central or eastern Portland, not western Portland.  The general recommendation would be to place the new hospital in the area where five census tracts, all with the highest number of 911 phone calls, are clumped together, indicating a high need for a hospital.


Figure 10. Map showing the number of 911 phone calls per census tract in Portland, Oregon. 

A second map shows the standardized residuals (the residuals expressed in standard deviations) of the regression that uses the number of unemployed people.  The map was created to show how well the number of unemployed people predicts the number of 911 phone calls in the census tracts of Portland, Oregon.  The equation used to predict values along the linear relationship between the number of unemployed people and the number of 911 phone calls is y = 1.106 + 0.507x, where x is the number of unemployed people and y is the number of 911 phone calls.  The map can be seen in Figure 12.  A residual is the difference between an observed value and the value predicted by the linear model, so it describes how well a data point fits the equation.  The smaller the residual, the better the linear model predicts the dependent variable from the independent variable at that point.  The red and dark blue colors on the map represent areas where the model was the worst at predicting the number of 911 phone calls made in a census tract based on the number of unemployed people in the census tract.  In the census tracts symbolized by orange and light blue, the model is better at predicting the number of 911 phone calls made in those areas.  The areas in yellow are best represented by the model.  In Figure 12, in the areas in red, which have the largest positive standardized residuals, the model is underpredicting the number of 911 phone calls: more 911 calls are made in those areas than the model predicts.  In the areas in dark blue, which have the most negative standardized residuals, the model is overpredicting the number of 911 phone calls: fewer 911 calls are made in those areas than the model predicts.  The red areas deserve extra attention because significantly more 911 calls are coming out of them than the model can account for.  On the map three clustered red census tracts can be seen.  These three census tracts are also highlighted as having some of the greatest numbers of 911 phone calls out of all of the census tracts, as seen in Figure 10.
          
Figure 12. Residual map showing how well the number of unemployed people predicts the number of 911 phone calls in Portland, Oregon.  
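
The reading of the residual map can be summarized as a small rule: a positive standardized residual means the model underpredicts the calls in a tract, and a negative one means it overpredicts.  The Python below is a sketch of that rule only.  The 0.5 cutoff and the function name describe_residual are hypothetical and are not the class breaks used by the map.

```python
def describe_residual(std_resid, cutoff=0.5):
    """Label a tract by the sign and size of its standardized residual.

    The cutoff is a hypothetical value chosen for illustration.
    """
    if std_resid > cutoff:
        # Observed calls exceed predicted calls
        return "underpredicted: more 911 calls than the model expects"
    if std_resid < -cutoff:
        # Observed calls fall short of predicted calls
        return "overpredicted: fewer 911 calls than the model expects"
    return "well predicted"

# Hypothetical standardized residuals for three tracts
for value in (2.7, -1.8, 0.1):
    print(value, "->", describe_residual(value))
```

Tracts in the first group correspond to the red areas on the map, which is why they are the ones singled out for the hospital recommendation.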

Given these results I would consider placing the new hospital in one of the three census tracts distinguished in Figure 13.  I selected these census tracts because they have the largest numbers of 911 phone calls and because the variable that explains 54.3% of the variation in the number of 911 phone calls (number of unemployed people) underpredicts the number of 911 phone calls in them.  This highlights these areas as having a distinctly higher number of 911 phone calls than the other census tracts.

Figure 13. Best census tracts for placement of the new hospital in Portland, Oregon.

Conclusions

The question of where in Portland, Oregon a new hospital should be placed was answered by using SPSS and ArcMap to conduct and analyze regression analyses.  The results suggest that the variable that should be given the greatest weight when deciding where to place the new hospital is the number of unemployed people in a census tract, because this variable explains 54.3% of the variation in the number of 911 phone calls made in a census tract.  Three census tracts stood out as having even higher numbers of 911 phone calls than the number of unemployed people could predict.  These three census tracts were also highlighted as having some of the greatest numbers of 911 phone calls out of all of the census tracts in the city of Portland.  These three census tracts are therefore the best places in the city for a new hospital to be built.  The exact size the new ER should be cannot be determined from the data available, but with the largest numbers of 911 phone calls in the city coming from these areas, a large ER is recommended.  The same technique could be applied by local governments and independent companies to a variety of other topics.

To determine the exact location where the new hospital should be placed, it would be important to also consider the locations of existing hospitals in Portland, Oregon.  It would not make sense to have two hospitals in close proximity to each other.  The new hospital should be placed a considerable distance away from any existing hospital so that the hospitals do not rely on the same pool of patients.  It would be better for business if the new hospital were placed in an area with no existing hospitals, so that it does not need to compete with another hospital for patients.  Given the available data, three possible census tracts have been identified as locations for the new hospital.