Tuesday, December 12, 2017

Assignment 6: Regression Analysis

Introduction

This assignment was divided into two parts.  Both parts deal with solving real-world problems by applying regression analysis to provided data.  The assignment developed the skills needed to run a regression analysis in the statistical software SPSS and to interpret the generated output, including making predictions from the fitted model.  It also provided practice preparing data for SPSS: the data must first be manipulated and organized in Excel and, if necessary, joined to spatial data in ArcGIS.  Finally, the assignment allowed for practice mapping standardized residuals in ArcGIS and interpreting the spatial results alongside the statistical outputs.

Part I

For this part of the assignment, data was supplied in an Excel file giving the percent of children that get free lunch in an area and the crime rate per 100,000 people in that location.  A news station had previously claimed that crime increases as the percentage of kids receiving free lunches increases.  The task of this part is to run a regression analysis in the statistical software SPSS to determine whether the news station's claim is correct and there is a relationship between the two variables.

To begin, the independent and dependent variable roles had to be assigned to the two variables (crime rate and free lunch percent).  The dependent variable is the variable explained by the independent variable.  Given the news station's claim that the free lunch rate influences the crime rate, free lunch percent was set as the independent variable and crime rate as the dependent variable.  SPSS was used to run a linear regression analysis of the two variables to examine their relationship.  The Excel file containing the two variables was first loaded into the software.  Next, the linear regression analysis was selected and the two variables were entered in their assigned independent and dependent roles.  The output consisted of several tables describing the relationship between the two variables.  The first table generated was the Model Summary table in Figure 1.  The Model Summary table gives the R-Square value, the coefficient of determination, which describes how much of the variation in the Y variable is explained by the X variable.  The higher the R-Square value, the less unexplained variation remains and the stronger the relationship between the two variables.  The R-Square value in this scenario is 0.173.  R-Square values range between 0 and 1, where 0 indicates no linear relationship between the two variables and 1 indicates a perfect linear relationship.  An R-Square value of 0.173 indicates a very weak relationship between the two variables.

Figure 1. Model summary table for the independent variable the percent of children that get free lunch in an area and the dependent variable the crime rate per 100,000 people in that location generated in the SPSS linear regression analysis.
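
To make the meaning of R-Square concrete, the following is a minimal Python sketch of how the value can be computed for a single-predictor regression.  The data values are hypothetical placeholders, not the assignment's dataset, and NumPy is assumed to be available; SPSS reports the same quantity in its Model Summary table.

```python
import numpy as np

# Hypothetical illustration data (NOT the assignment's dataset).
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # e.g. percent free lunch
y = np.array([35.0, 60.0, 58.0, 95.0, 102.0])  # e.g. crime rate per 100,000

# Least-squares slope (b) and intercept (a) for y = a + b*x.
b, a = np.polyfit(x, y, 1)

# R-squared = 1 - (residual sum of squares / total sum of squares),
# i.e. the share of the variation in y that the line explains.
predicted = a + b * x
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# With one predictor, R-squared equals the squared Pearson correlation.
r = np.corrcoef(x, y)[0, 1]
print(f"R-squared = {r_squared:.3f}, r squared = {r ** 2:.3f}")
```

Read this way, an R-Square of 0.173 means the fitted line accounts for 17.3% of the variation in the dependent variable.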

The Coefficients table was also generated as an output of SPSS.  The Coefficients table in Figure 2 helps build a linear model equation for the relationship between the two variables, which can be used to make predictions based on the given data.  The linear equation formed is y = a + bx, where "a" is the constant and "b" is the slope.  The Unstandardized B of the constant in the table in Figure 2 is the constant in the linear equation: a = 21.819.  The Unstandardized B of PerFreeLunch is the slope in the linear equation: b = 1.685.  The slope is positive, meaning there is a positive relationship between the two variables: as the percent of kids on free lunch increases, the crime rate increases.  The resulting linear equation for the weak relationship between kids on free lunch and the crime rate in those areas is y = 21.819 + 1.685x, where x is the percent of kids on free lunch and y is the crime rate.  The Significance value in the Coefficients table is also important.  The Significance value for PerFreeLunch is 0.005.  The null hypothesis, that there is no linear relationship between the independent and dependent variables, can be rejected if the significance value is less than 0.05.  The significance value is 0.005, so the null hypothesis can be rejected and the alternative hypothesis, that there is a linear relationship between the independent and dependent variables, is accepted.  Given the two tables, it can be concluded that a very weak positive linear relationship exists between kids on free lunch and the crime rate in those areas.

Figure 2. Coefficients table for the independent variable the percent of children that get free lunch in an area and the dependent variable the crime rate per 100,000 people in that location generated in the SPSS linear regression analysis.

Next, a prediction of the crime rate was requested for a town where 30% of the children receive free lunch, based on the SPSS output.  The linear equation created for the relationship between kids on free lunch and the crime rate in those areas (y = 21.819 + 1.685x) is used to make the prediction.  The 30% value of children receiving free lunch is used as the "x" value.  The resulting "y" value is a crime rate of 72.369 per 100,000 people.  In Excel, a scatterplot of free lunch percentages against crime rates was created.  The scatterplot can be seen in Figure 3.  The scatterplot can be used to see whether the prediction appears reasonable given the distribution of the data points.  The prediction does fit in with the existing data.  There is one outlier, with a crime rate of 704.1, that does not match the rest of the plotted data.  This point appears to be skewing the trend line slightly.  The linear equation and R-Square value for this scenario are also given in Figure 3.

Figure 3. Scatterplot of the comparison of the percent of kids receiving free lunch and crime rate.  
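
The prediction for a town with 30% of children on free lunch can be checked with a few lines of Python.  This is only a sketch of the arithmetic; the function name and its parameters are made up for illustration, while the constant (21.819) and slope (1.685) come from the equation stated above.

```python
def predict_crime_rate(percent_free_lunch, intercept=21.819, slope=1.685):
    """Evaluate y = a + b*x using the coefficients from the SPSS output."""
    return intercept + slope * percent_free_lunch

# Town where 30% of children receive free lunch.
prediction = predict_crime_rate(30)
print(round(prediction, 3))  # crime rate per 100,000 people
```

The result matches the 72.369 per 100,000 people reported in the text.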

Given the results of the regression analysis, the claim that the percentage of kids receiving free lunch is related to the crime rate in those areas is supported.  However, the low R-Square value suggests that there are other factors and variables that influence crime rate more strongly than kids receiving free lunch, which explains only a small amount of the variation in crime rate.  Only 17.3% of the variation in crime rate can be explained by the percentage of kids receiving free lunch.  The significance value of 0.005 indicates a statistically significant relationship between the two tested variables.  Because of this significance value, there remains considerable confidence that the relationship exists, even though the percent of kids receiving free lunch does not explain much of the variation in crime rate.

Part II

Introduction

A company has expressed interest in building a new hospital in the city of Portland.  The company inquires where the best place to build the new facility would be given the need for one and how large of an ER the facility will need depending on where it is built.  A shapefile and excel file of data on 911 calls in Portland, Oregon was provided to answer these questions.  The supplied data included information on the number of 911 calls (Calls), number of alcohol sales (AlcoholX), number of unemployed people (Unemployed), number of foreign born people (ForgnBorn), the median income (Med Income), and number of college graduates(CollGrads).  All of the data was provided at the census tract level.

Methods

To begin, three simple (single-variable) linear regression analyses were run on the data in SPSS.  The provided Excel table was used to conduct the regression analyses.  Three of the provided variables (number of alcohol sales, number of unemployed people, and number of college graduates) were each used as the independent variable in an analysis against the number of 911 calls per census tract, the dependent variable.

After the three regression analyses were conducted, the output tables were examined to find the independent variable that had the highest R-Square value and was statistically significant.  The number of unemployed people was selected.  This variable was used to create a residual map in ArcMap.  To create the map, the "Ordinary Least Squares" tool was opened in ArcToolbox under "Spatial Statistics Tools", within the "Modeling Spatial Relationships" toolset.  The supplied feature class containing all of the variable data at the census tract level was selected as the "Input Feature Class" and the "Unique ID Field" was set to the UniqID field within the feature class.  The dependent variable was set to Calls and the independent variable to Unemployed.  After the tool was run, an output shapefile was generated showing standardized residuals.
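
For readers without ArcMap, the idea behind a standardized residual can be sketched in Python.  This is a simplified approximation (each residual divided by the residual standard error); the exact formula used by the ArcGIS Ordinary Least Squares tool is not stated in this post and may differ.  The tract values are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def standardized_residuals(x, y):
    """Fit y = a + b*x by least squares and scale each residual
    by the residual standard error (n - 2 degrees of freedom)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)   # observed minus predicted
    std_error = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))
    return residuals / std_error

# Hypothetical tracts (NOT the Portland data): unemployed people and 911 calls.
unemployed = [12, 30, 45, 60, 80, 95]
calls = [8, 15, 30, 28, 60, 47]

z = standardized_residuals(unemployed, calls)
for u, c, score in zip(unemployed, calls, z):
    print(f"unemployed={u:3d} calls={c:3d} standardized residual={score:+.2f}")
```

Positive values mark tracts with more calls than the line predicts, and negative values mark tracts with fewer.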

Results

SPSS Regression Analyses

First, the SPSS outputs from the simple linear regression analyses were interpreted.  For each regression analysis, the Model Summary and Coefficients tables were analyzed.

The regression analysis run between the number of alcohol sales and the number of 911 calls resulted in the Model Summary table in Figure 4 and the Coefficients table in Figure 5.  The Model Summary table gives an R-Square value of 0.152, suggesting a very weak relationship between the two variables.  The number of alcohol sales explains only 15.2% of the variation in the number of 911 phone calls.


Figure 4. Model summary table for the independent variable the number of alcohol sales and the dependent variable the number of 911 calls generated in the SPSS linear regression analysis.

The Coefficients table in Figure 5 gives the data needed to build a linear model equation: y = a + bx.  The constant in the equation is the Unstandardized B of the Constant in the table in Figure 5: a = 9.590.  The slope in the equation is the Unstandardized B of AlcoholX: b = 3.069E-5.  The slope is positive, meaning there is a positive relationship between the two variables: as the number of alcohol sales increases, the number of 911 phone calls increases.  The equation formed for the weak linear relationship between the number of alcohol sales and the number of 911 phone calls is y = 9.590 + 3.069E-5x, where x is the number of alcohol sales and y is the number of 911 phone calls.  The significance value determines whether the null hypothesis, that there is no linear relationship between the two variables, can be rejected.  The significance value given in the Coefficients table is 0.000 (that is, p < 0.001), which is less than 0.05, allowing the null hypothesis to be rejected.  The alternative hypothesis is accepted and it can be concluded that there is a weak positive linear relationship between the number of alcohol sales and the number of 911 phone calls.  When deciding where to place a hospital, the number of alcohol sales in the census tract should not be given great weight because it explains only 15.2% of the variation in the number of 911 phone calls.  This is a small percentage and would not capture where the majority of the 911 phone calls are coming from.

Figure 5. Coefficients table for the independent variable number of alcohol sales and the dependent variable the number of 911 calls generated in the SPSS linear regression analysis.

The regression analysis run between the number of unemployed people and the number of 911 calls resulted in the Model Summary table in Figure 6 and the Coefficients table in Figure 7.  The Model Summary table gives an R-Square value of 0.543, suggesting a relatively strong relationship between the two variables.  The number of unemployed people explains 54.3% of the variation in the number of 911 phone calls.

Figure 6. Model summary table for the independent variable the number of unemployed people and the dependent variable the number of 911 calls generated in the SPSS linear regression analysis.

The Coefficients table in Figure 7 gives the data needed to build a linear model equation: y = a + bx.  The constant in the equation is the Unstandardized B of the Constant in the table in Figure 7: a = 1.106.  The slope in the equation is the Unstandardized B of Unemployed: b = 0.507.  The slope is positive, meaning there is a positive relationship between the two variables: as the number of unemployed people increases, the number of 911 phone calls increases.  The equation formed for the linear relationship between the number of unemployed people and the number of 911 phone calls is y = 1.106 + 0.507x, where x is the number of unemployed people and y is the number of 911 phone calls.  The significance value determines whether the null hypothesis can be rejected.  The significance value given in the Coefficients table is 0.000 (that is, p < 0.001), which is less than 0.05, allowing the null hypothesis, that there is no linear relationship between the two variables, to be rejected.  The alternative hypothesis is accepted and it can be concluded that there is a relatively strong positive linear relationship between the number of unemployed people and the number of 911 phone calls.  When deciding where to place a hospital, the number of unemployed people in the census tract should be given great weight because it explains 54.3% of the variation in the number of 911 phone calls.  This is a large percentage, covering over half of the variation in 911 phone calls, and would be most helpful in determining which census tracts the most 911 phone calls are coming from.

Figure 7. Coefficients table for the independent variable number of unemployed people and the dependent variable the number of 911 calls generated in the SPSS linear regression analysis. 

The regression analysis run between the number of college graduates and the number of 911 calls resulted in the Model Summary table in Figure 8 and the Coefficients table in Figure 9.  The Model Summary table gives an R-Square value of 0.095, suggesting a very weak relationship between the two variables.  The number of college graduates explains only 9.5% of the variation in the number of 911 phone calls.

 
Figure 8. Model summary table for the independent variable the number of college graduates and the dependent variable the number of 911 calls generated in the SPSS linear regression analysis.

The Coefficients table in Figure 9 gives the data needed to build a linear model equation: y = a + bx.  The constant in the equation is the Unstandardized B of the Constant in the table in Figure 9: a = 13.377.  The slope in the equation is the Unstandardized B of CollGrads: b = 0.029.  The slope is positive, meaning there is a positive relationship between the two variables: as the number of college graduates increases, the number of 911 phone calls increases.  The equation formed for the weak linear relationship between the number of college graduates and the number of 911 phone calls is y = 13.377 + 0.029x, where x is the number of college graduates and y is the number of 911 phone calls.  The significance value determines whether the null hypothesis, that there is no linear relationship between the two variables, can be rejected.  The significance value given in the Coefficients table is 0.004, which is less than 0.05, allowing the null hypothesis to be rejected.  The alternative hypothesis is accepted and it can be concluded that there is a weak positive linear relationship between the number of college graduates and the number of 911 phone calls.  When deciding where to place a hospital, the number of college graduates in the census tract should not be given great weight because it explains only 9.5% of the variation in the number of 911 phone calls.  This is a small percentage and would not capture where the majority of the 911 phone calls are coming from.

Figure 9. Coefficients table for the independent variable number of college graduates and the dependent variable the number of 911 phone calls generated in the SPSS linear regression analysis. 

Mapping the Results

A map was created to determine which census tracts in Portland, Oregon the majority of the 911 phone calls were coming from.  The map can be seen in Figure 10.  There are 6 main census tracts in Portland with the greatest numbers of 911 phone calls.  The general trend shows the largest numbers of 911 phone calls in the center and eastern side of the city, with fewer 911 phone calls on the western side.  The new hospital should therefore be placed in census tracts in central and eastern Portland, not western Portland.  The general recommendation would be to place the new hospital in the area where five census tracts, all with the highest numbers of 911 phone calls, are clustered together, indicating a high need for a hospital.


Figure 10. Map showing the number of 911 phone calls per census tract in Portland, Oregon. 

A map showing the standardized residuals (residuals expressed in standard deviations) for the number of unemployed people variable was also created.  The map shows how well the number of unemployed people predicts the number of 911 phone calls in the census tracts of Portland, Oregon.  The equation used to predict values along the linear relationship between the number of unemployed people and the number of 911 phone calls is y = 1.106 + 0.507x, where x is the number of unemployed people and y is the number of 911 phone calls.  The map can be seen in Figure 12.  A residual is the difference between the observed value and the value predicted by the linear model equation, so it describes how well a data point fits the model.  The smaller the residual, the better the linear model predicts the dependent variable from the independent variable for that tract.  The red and dark blue colors on the map represent areas where the model was worst at predicting the number of 911 phone calls made in a census tract from the number of unemployed people in that tract.  In the census tracts symbolized by orange and light blue, the model is better at predicting the number of 911 phone calls.  The areas in yellow are best represented by the model.  In Figure 12, the areas in red have the largest positive standardized residuals: the model is under-predicting the number of 911 phone calls, and more calls are made in those areas than the model predicts.  The areas in dark blue have the lowest (most negative) standardized residuals: the model is over-predicting the number of 911 phone calls, and fewer calls are made in those areas than the model predicts.  The red areas deserve extra attention because significantly more 911 calls are coming out of them than the model can account for.  On the map, three clustered red census tracts can be seen.  These three census tracts are also highlighted as having some of the greatest numbers of 911 phone calls out of all of the census tracts, as seen in Figure 10.
          
Figure 12. Residual map showing how well the number of unemployed people predicts the number of 911 phone calls in Portland, Oregon.  
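
The under- and over-prediction logic can be illustrated with the unemployment equation from the SPSS output (y = 1.106 + 0.507x).  The sketch below is illustrative only: the function names and the two example tracts are hypothetical and are not taken from the Portland data.

```python
def predicted_calls(unemployed, intercept=1.106, slope=0.507):
    """Predicted 911 calls from y = 1.106 + 0.507x."""
    return intercept + slope * unemployed

def residual(observed_calls, unemployed):
    """Residual = observed calls minus predicted calls."""
    return observed_calls - predicted_calls(unemployed)

def interpret(res):
    """Positive residual: more calls than predicted (model under-predicts)."""
    if res > 0:
        return "under-predicted"
    if res < 0:
        return "over-predicted"
    return "exact"

# Two hypothetical tracts, each with 100 unemployed people.
busy_tract = residual(observed_calls=80, unemployed=100)
quiet_tract = residual(observed_calls=30, unemployed=100)
print(interpret(busy_tract), interpret(quiet_tract))
```

A tract like the first one would appear toward the red end of the residual map, and one like the second toward the blue end.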

Given these results, I would consider placing the new hospital in one of the three census tracts distinguished in Figure 13.  I selected these census tracts because they have the largest numbers of 911 phone calls and because the variable (number of unemployed people) that explains 54.3% of the variation in the number of 911 phone calls under-predicts the number of calls in these tracts.  This highlights these areas as having a distinctly higher number of 911 phone calls than the other census tracts.

Figure 13. Best census tracts for placement of the new hospital in Portland, Oregon.

Conclusions

The question of where in Portland, Oregon a new hospital should be placed was addressed by using the software SPSS and ArcMap to conduct and analyze regression analyses.  The results suggest that the variable that should be given the greatest weight when deciding where to place the new hospital is the number of unemployed people in a census tract, because this variable explains 54.3% of the variation in the number of 911 phone calls made in a census tract.  Three census tracts stood out as having even higher numbers of 911 phone calls than the number of unemployed people could predict.  These three census tracts were also highlighted as having some of the greatest numbers of 911 phone calls out of all of the census tracts in the city of Portland.  These three census tracts are the best places in the city for a new hospital to be built.  Given the data worked with, the exact size the new ER should be cannot be determined, but with the largest number of 911 phone calls in the city coming from these areas, a large ER is recommended.  The same technique used in this analysis could be applied by local governments and independent companies to a variety of different topics.

To determine the exact location for the new hospital, it would also be important to consider the locations of existing hospitals in Portland, Oregon.  It would not make sense to have two hospitals in close proximity to each other.  The new hospital should be placed a reasonable distance away from any existing hospitals so that the hospitals do not rely on the same pool of patients.  It would be better for business if the hospital were placed in an area with no existing hospitals, so that hospitals do not need to compete with each other for patients.  Given the available data, three possible census tracts have been identified as places for the new hospital.

Sunday, December 3, 2017

Assignment 5: Correlation and Autocorrelation

Introduction

This assignment was divided into two parts.  Part one worked with correlation analysis to determine the strength and direction present between pairs of variables.  Excel manipulation, the creation of scatterplots, and the statistical computer software SPSS were used to interpret the correlations present in data sets.  Part two used a real life scenario to practice conducting spatial autocorrelations.  In part two data was downloaded from the U.S. Census Site and was manipulated and joined with shapefiles in order to generate spatial autocorrelations.      
The goal of this assignment is to  

Part I: Correlation

1. In this question Excel and SPSS were used to examine the correlation of two variables in a hypothesis test.  The data supplied in this question is given in the table in Figure 1.  Ten measurements of two variables, distance and sound level, were supplied.  The data was organized in Excel and a scatterplot of the data was generated.  The scatterplot suggests the strength and direction of the association between the two variables, distance and sound level.  The scatterplot with trendline can be seen in Figure 2.  In the statistical software SPSS a two-tailed Pearson Correlation was conducted.  The statistical results of the correlation can be seen in Figure 3.

For the hypothesis test in this question, a T test is used because fewer than 30 samples are available; there are only 10 samples.  It is assumed that there is a linear association between the variables distance and sound level.  The null hypothesis is that there is no linear association between the variables distance and sound level.  The alternative hypothesis is that there is a linear association between the variables distance and sound level.  A two-tailed test at the 95% confidence level (a significance level of 0.05) was used.  The scatterplot in Figure 2 suggests a strong association between the variables distance and sound level because the data points are located close to the trendline.  The scatterplot also suggests a negative relationship between the two variables because as one variable increases the other decreases: as distance increases, sound level decreases.  The SPSS output gives similar but more detailed results than the scatterplot.  The correlation coefficient ranges from -1 through 0 to +1: the closer the coefficient is to -1 or +1, the stronger the association between the variables, and the closer it is to 0, the weaker the association.  The Pearson Correlation coefficient is r = -0.896.  The coefficient is negative, meaning the two variables have a negative relationship, and its magnitude is close to 1, which indicates a high correlation between the variables distance and sound level.  There is a strong negative relationship between distance and sound level.  The correlation is significant at the 0.01 level, which is stricter than the 0.05 threshold, so the null hypothesis can be rejected and it can be concluded that there is a linear association between the variables distance and sound level.

Figure 1. Data measurements for distance and sound measurements given for question 1.

Figure 2. Scatterplot created in Excel of the data given in Figure 1.  

Figure 3. Output from SPSS software. 
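
The T test on the correlation coefficient can be reproduced by hand.  The sketch below uses the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2) with the reported r = -0.896 and n = 10; the critical value 2.306 (two-tailed, 0.05 significance, 8 degrees of freedom) is taken from a standard t-table rather than from the SPSS output, and the function name is only illustrative.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing whether a Pearson r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = -0.896   # Pearson correlation reported by SPSS
n = 10       # number of distance / sound level measurements

t_stat = correlation_t_statistic(r, n)

# Two-tailed critical value for alpha = 0.05 with n - 2 = 8 degrees of
# freedom, from a standard t-table (an assumption, not SPSS output).
T_CRITICAL = 2.306
reject_null = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f}, reject null hypothesis: {reject_null}")
```

Because the magnitude of t is well beyond the critical value, the null hypothesis of no linear association is rejected, in agreement with the SPSS result.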

2. Census tract and population data for Detroit, Michigan was supplied for this question.  A small portion of the data supplied in Excel can be seen in Figure 4.  SPSS was used to create a correlation matrix for the data.  The matrix can be seen in Figure 5.

The correlation matrix was created to measure the correspondence between several pairs of variables.  The matrix provides Pearson Correlation coefficient (r) values that range from -1 through 0 to +1.  The closer the coefficient is to -1 or +1, the stronger the association between the variables, and the closer it is to 0, the weaker the association.  If the number is positive there is a positive correlation between the two variables, and if the number is negative there is a negative correlation.  The Pearson Correlation coefficients were tested with a two-tailed test at the 95% confidence level (a significance level of 0.05).  There were more than 30 samples in the supplied dataset, so a Z test is used.

In the matrix it can be seen that there is a perfect correlation (r = 1) wherever a variable is compared with itself.  That makes sense because the two things being compared are identical.  The four population groups in the matrix (White, Black, Asian, and Hispanic) are counts of the population from the 1000 census tracts in and around Detroit.
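
The structure of a correlation matrix, including the diagonal of ones described above, can be seen in a small Python sketch.  The three columns below borrow variable names from the Detroit dataset, but the values are made up for illustration, and NumPy's `corrcoef` stands in for the SPSS procedure.

```python
import numpy as np

# Hypothetical values for six census tracts (NOT the Detroit data).
data = {
    "MedHHInc":   [32000, 45000, 51000, 60000, 72000, 88000],
    "BachDegree": [120, 180, 260, 240, 400, 520],
    "Manu":       [300, 280, 310, 250, 270, 260],
}
names = list(data)

# Each row passed to corrcoef is one variable; the result is a square
# matrix of Pearson r values for every pair of variables.
matrix = np.corrcoef([data[name] for name in names])

for name, row in zip(names, matrix):
    print(f"{name:>10}", " ".join(f"{value:+.3f}" for value in row))
```

Every variable correlates perfectly with itself, and the matrix is symmetric, so only the values on one side of the diagonal need to be interpreted.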

In the matrix median household income (MedHHInc) has a high positive correlation with number of people with a Bachelor's Degree (BachDegree), a high positive correlation with median home value (MedHomeValue), a low negative correlation with the number of manufacturing employees (Manu), a low positive correlation with the number of retail employees (Retail), a low positive correlation with the number of finance employees (Finance), a positive moderate correlation with the white population (White), a low negative correlation with the black population (Black), a low positive correlation with the Asian (Asian) population, and a low negative correlation with the Hispanic population (Hispanic).  
  
In the matrix number of people with a Bachelor's Degree has a high positive correlation with median home value, an extremely low positive correlation with the number of manufacturing employees, a low positive correlation with the number of retail employees, a very low positive correlation with the number of finance employees, a high positive correlation with the white population, a low negative correlation with the black population, a very low negative correlation with the Hispanic population, and a moderate positive correlation to the Asian population. 

In the matrix median home value has a very weak positive correlation with number of manufacturing employees, a weak positive correlation with the number of retail employees, a very weak positive correlation with the number of finance employees, a weak positive correlation with the white population, a weak negative correlation with the black population, a weak positive correlation with the Asian population, and a very weak negative correlation with the Hispanic population.   

In the matrix the number of manufacturing employees has a weak positive correlation with the number of retail employees, a weak positive correlation with the number of finance employees, a very weak positive correlation with the white population, a very weak negative correlation with the black population, a very low positive correlation with the Asian population, and an extremely low negative correlation with the Hispanic population.  

In the matrix the number of retail employees has a low positive correlation with the number of finance employees, a weak positive correlation with the white population, a weak negative correlation with the black population, a weak positive correlation with the Asian population, and an extremely weak negative correlation with the Hispanic population.

In the matrix the number of finance employees has an extremely low negative correlation with the white population, a very weak correlation with the black population, a low positive correlation with the Asian population, and a low negative correlation with the Hispanic population. 

It can be seen in the matrix that the four population groups (White, Black, Asian, and Hispanic) from the 1000 census tracts in and around Detroit were tested for correlations between them.  There is a weak negative correlation between Black and Asian, Asian and Hispanic, and Hispanic and Black populations.  There is a strong negative correlation between Black and White.  There is a weak positive correlation between White and Asian and between White and Hispanic.

It is important to note trends in the matrix as well as the level at which the correlations are significant.  All of the correlations with the black population are significant at the 0.01 level except for the number of finance employees.  The black population also has negative correlations with every variable.  This means that census tracts with a low white population, low Asian population, low Hispanic population, low number of people with a Bachelor's degree, low median household income, low median home value, and a low number of manufacturing, finance, and retail employees tend to have a high black population.  The white population is significantly correlated at the 0.01 level with every variable except for the number of manufacturing and finance employees.  The number of manufacturing employees has significant low correlations at the 0.01 level with only three of the variables and one significant low correlation at the 0.05 level.  Median home value has a significant correlation at the 0.01 level with every variable except for the number of finance employees, which is significant at the 0.05 level, and the number of manufacturing employees, which is not significant.  Another notable trend is that the white population has a positive correlation with every variable except for the black population and the number of finance employees.  Many of the correlations in this matrix are significant at either the 0.01 or the 0.05 level.

Figure 4. Small portion of supplied Detroit, Michigan excel dataset. 

Figure 5. Correlation matrix of supplied Detroit, Michigan data.

Part II: Spatial Autocorrelation

Introduction

In part two of this assignment a real world spatial question was posed.  The computer software GeoDa was used to calculate the spatial autocorrelation of the variables and SPSS was used to calculate the correlations between them.  In this part the Texas Election Commission (TEC) has requested that voting data from the 1980 and 2012 presidential elections be analyzed for spatial patterns or clustering in voting patterns and voter turnout.  Data at the county level was supplied for the voter turnout and the percent democratic vote in both elections.  TEC desires to supply this analysis to the governor of Texas to show how election patterns have changed in the 32 years between the two analyzed elections.  It is plausible that voting patterns correlate with population variables, so those were analyzed too to get a fuller understanding of what may be driving the spatial patterns present in the voting data.

Methods

First all of the data for the analysis had to be downloaded.  The voter turnout and the percent democratic vote data were supplied by the TEC at the county level.  A shapefile of the counties in Texas and an excel spreadsheet of Hispanic population data for Texas were both downloaded from the U.S. Census Bureau.  Next, the downloaded excel file was manipulated so that it was normalized and could be utilized by ArcMap.  The shapefile of the counties of Texas, the excel sheet of voting data, and the sheet of Hispanic population data were then added to an ArcMap viewer.  The two excel sheets were joined to the Texas counties shapefile using the geo ID field.  The result was exported as a new shapefile including all of the joined data.  Next, the newly created shapefile was opened in the data analysis software GeoDa.  A spatial weights matrix was created in GeoDa to define which counties are neighbors, which is needed to measure the clustering of the data.  The rook contiguity method of calculating spatial weights was selected because it is the most common method.  After the spatial weights were calculated, Moran's I values, scatterplots, and Univariate Local Indicators of Spatial Autocorrelation (LISA) cluster maps were generated by the software for each of the variables being tested: percent Hispanic population per county, voter turnout for the 1980 election, voter turnout for the 2012 election, percent democratic vote for the 1980 election, and percent democratic vote for the 2012 election.  Finally, a correlation matrix was generated in SPSS to look for correlations between the five tested variables.  To complete this, all of the variables first needed to be placed in a normalized excel spreadsheet.

Results

To be able to understand the Moran's I values, scatterplots, and LISA cluster maps below, first spatial autocorrelation must be understood.  Spatial autocorrelation tests whether any systematic pattern exists in the spatial distribution of a variable.  If such a pattern or clustering exists, the variable is said to be spatially autocorrelated.  The autocorrelation test Moran's I works by comparing the variable's value at each location with the values at surrounding locations.  This renders a Moran's I coefficient, which can be seen below at the top of each scatterplot for every tested variable.  The coefficient can have values between -1.0 and 1.0.  Values near 1.0 indicate strong positive autocorrelation, meaning similar values cluster together; values near 0 indicate no spatial pattern; and values near -1.0 indicate that dissimilar values sit next to each other.  The LISA cluster maps provide the visual, spatial element of autocorrelation.  Any area in white is not statistically significant.  The map renders four categories of where data is significant (p=0.05).  Areas where the variable has a high value and is surrounded by other areas of similar high values are High-High areas symbolized by red.  Areas where the variable has a high value and is surrounded by low value areas are High-Low areas symbolized by salmon.  Areas where the variable has a low value and is surrounded by high value areas are Low-High areas symbolized by periwinkle blue.  Areas where the variable has a low value and is surrounded by low value areas are Low-Low areas symbolized by royal blue.
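As a rough illustration of the ideas above, the sketch below computes a global Moran's I with rook contiguity on a small made-up grid and labels a cell with its High/Low quadrant.  The grid values, function names, and binary weights are assumptions for illustration only; this is not the Texas county data or GeoDa's implementation, and it skips the permutation test GeoDa uses to decide which areas are significant.

```python
# Illustrative sketch: global Moran's I with rook contiguity on a toy grid.
# All values are invented; they are not the Texas county data.

def rook_neighbors(n_rows, n_cols):
    """Map each cell index to the cells that share an edge with it."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            adjacent = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    adjacent.append(rr * n_cols + cc)
            neighbors[r * n_cols + c] = adjacent
    return neighbors


def morans_i(values, neighbors):
    """Global Moran's I using binary weights (1 for neighbors, 0 otherwise)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of products of deviations over every (cell, neighbor) pair.
    cross = sum(dev[i] * dev[j] for i in neighbors for j in neighbors[i])
    total_weight = sum(len(adj) for adj in neighbors.values())
    return (n / total_weight) * cross / sum(d * d for d in dev)


def quadrant(values, neighbors, i):
    """Label cell i by its own value and its neighbors' average vs. the mean."""
    mean = sum(values) / len(values)
    lag = sum(values[j] for j in neighbors[i]) / len(neighbors[i])
    own = "High" if values[i] > mean else "Low"
    around = "High" if lag > mean else "Low"
    return f"{own}-{around}"


grid_neighbors = rook_neighbors(4, 4)

# High values on the left half, low values on the right half.
clustered = [9, 9, 1, 1,
             9, 9, 1, 1,
             9, 9, 1, 1,
             9, 9, 1, 1]

# Every cell is surrounded by cells of the opposite value.
checkerboard = [9, 1, 9, 1,
                1, 9, 1, 9,
                9, 1, 9, 1,
                1, 9, 1, 9]

i_clustered = morans_i(clustered, grid_neighbors)
i_checker = morans_i(checkerboard, grid_neighbors)
print("clustered grid:", round(i_clustered, 3))
print("checkerboard grid:", round(i_checker, 3))
print("top-left cell of clustered grid:", quadrant(clustered, grid_neighbors, 0))
```

The clustered grid gives a positive coefficient (about 0.67) and the checkerboard gives -1.0, which matches the reading of the coefficient described above.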

First the percent of the 2010 population that is Hispanic was analyzed for spatial autocorrelation.  The calculated Moran's I for the percent of the population that is Hispanic can be seen in Figure 6.  The Moran's I value is 0.778655, a positive number close to 1, which means that the 2010 Hispanic population is highly clustered in the state of Texas.  The scatterplot in Figure 6 also demonstrates positive autocorrelation, and the data points lie close to the trend line, which further supports clustering of this variable.  The LISA map for the percent of the population that is Hispanic can be seen in Figure 7, which also suggests a high amount of clustering of the Hispanic population in Texas.  In the map, clustering of high Hispanic populations can be seen in the southwest corner of the state.  This clustering can be explained by the location along the Mexico border.  Clustering of low Hispanic populations can be seen in the northeast corner of the state.  This can be explained because this part of Texas is the farthest away from the Mexico border.  There is one outlier, Titus County in northeast Texas, which has a high percentage of Hispanic population.  This can be attributed to the large number of factory jobs available in this county that would bring in the Hispanic population to fill the job positions.

Figure 6. Scatterplot of Texas counties and the percent of the population that is Hispanic in those counties.


Figure 7. Cluster map of Texas counties and the percent of the population that is Hispanic in those counties.

The next variable analyzed for spatial autocorrelation is the voter turnout in Texas for the 1980 election.  The Moran's I value seen in Figure 8 for this variable is 0.575173, a positive number moderately close to 1, which suggests that voter turnout was moderately clustered in Texas.  The scatterplot in Figure 8 also suggests a moderate clustering of the 1980 election voter turnout.  The LISA map for the voter turnout of the 1980 election can be seen in Figure 9.  This map also suggests moderate spatial autocorrelation with some clustering of the voter turnout of the 1980 election.  In the farthest north counties and central counties of Texas there was clustering of low voter turnout and in the most southern counties of Texas there was clustering of high voter turnout.  There were also outlier counties like Swisher County, Bexar County, and Waller County that had high voter turnout surrounded by counties with low voter turnout.  For Bexar County this could be explained by the city of San Antonio being located in that county.  People in large cities typically have better access to voting locations than people in smaller cities.  Outliers with low voter turnout surrounded by counties of high voter turnout include Bowie County, McMullen County, and King County.  These counties are most likely smaller counties made up of ranching populations that might not have as good access to voting locations.

Figure 8. Scatterplot of Texas counties and the voter turnout in those counties in the 1980 election.


Figure 9. Cluster map of Texas counties and the voter turnout in those counties in the 1980 election.

The next variable analyzed for spatial autocorrelation is the voter turnout in Texas for the 2012 election.  The Moran's I value seen in Figure 10 for this variable is 0.695853, a positive number close to 1, which suggests the voter turnout is highly clustered in the state of Texas.  This Moran's I value is greater than the Moran's I value from the 1980 election, which suggests that there was less voter turnout clustering in 1980 than in 2012.  The scatterplot in Figure 10 also suggests a relatively high amount of clustering of the 2012 election voter turnout.  The LISA map for the voter turnout of the 2012 election can be seen in Figure 11.  The map also shows relatively high spatial autocorrelation, with some strong clustering of voter turnout in Texas in the 2012 election.  In northern and central counties of Texas there is clustering of low voter turnout.  In the southern and western counties of Texas there is clustering of high voter turnout.  Foard County was an outlier that had higher voter turnout surrounded by counties with low voter turnout.  There were also outlier counties with low voter turnout surrounded by counties of higher voter turnout, like Oak and McMullen.  These counties are most likely small counties made up of small ranching communities that do not have as good access to voting locations.  McMullen County was a low voter turnout county surrounded by counties of higher voter turnout for both the 1980 and 2012 elections.
Figure 10. Scatterplot of Texas counties and the voter turnout in those counties in the 2012 election.


Figure 11. Cluster map of Texas counties and the voter turnout in those counties in the 2012 election.

The next variable analyzed for spatial autocorrelation is the percentage of voters in Texas that voted democratic in the 1980 election.  The Moran's I value seen in Figure 12 for this variable is 0.489058, a positive number that suggests weak to moderate clustering of the democratic vote of the 1980 election in Texas.  The scatterplot in Figure 12 also suggests a slight clustering of the 1980 election democratic vote.  The data points are loosely grouped around the trend line.  The LISA map for the percent democratic vote of the 1980 election can be seen in Figure 13.  This map also suggests only a slight amount of spatial autocorrelation with some mild clustering of the percentage of voters that voted democratic in the 1980 election.  In the farthest north counties there is a small cluster of high percentages of democratic voters, and in the south and a little in the east there is clustering of low percentages of democratic voters.  There are a notable number of outliers, with 6 counties having low percentages of democratic voters surrounded by counties of higher percentages and 4 different counties having high percentages of democratic voters surrounded by counties of lower percentages.  This makes sense because a greater number of outliers should produce a smaller Moran's I, which is the case here, reflecting less clustering of the variable.  It can also be seen that some of the locations with high Hispanic populations from the map in Figure 7 are similar to the locations highlighted as areas of high democratic vote in Figure 13.


Figure 12. Scatterplot of the Texas counties and the percentage of voters that voted democratic in the 1980 election.

Figure 13. Cluster map of Texas counties and the percentage of voters that voted democratic in the 1980 election.

The next variable analyzed for spatial autocorrelation is the percentage of voters in Texas that voted democratic in the 2012 election.  The Moran's I value seen in Figure 14 for this variable is 0.335851, a positive number that is not close to 1, which suggests that the democratic vote of the 2012 election is not very clustered in Texas.  This Moran's I value is smaller than the value from the 1980 election, which suggests that there was less spatial clustering of the percent democratic vote in Texas counties in the 2012 election than in the 1980 election.  The scatterplot in Figure 14 also suggests a low clustering of the 2012 election democratic vote.  Some of the data points are grouped around the trend line while a large number of the data points are spread far out from the trend line.  The LISA map for the percent democratic vote of the 2012 election can be seen in Figure 15.  This map also suggests only a small amount of spatial autocorrelation with minimal clustering of the percentage of voters that voted democratic in the 2012 election.  At the southernmost part of Texas there is a cluster of counties with low percentages of democratic voters.  In a few central Texas counties and a few northern Texas counties there are small clusters of counties with high percentages of democratic voters.  There are also outlier counties scattered across the state, both counties of low percentages of democratic voters surrounded by counties of higher percentages and counties of high percentages of democratic voters surrounded by counties of lower percentages.  It can also be seen that some of the locations with high Hispanic populations from the map in Figure 7 are similar to the locations highlighted as areas of high democratic vote in Figure 15.  There are fewer such counties than there were in the 1980 election, but they are still present.
Figure 14. Scatterplot of the Texas counties and the percentage of voters that voted democratic in the 2012 election.


Figure 15. Cluster map of Texas counties and the percentage of voters that voted democratic in the 2012 election.

A correlation matrix was created for the 5 tested variables: the percent of the 2010 population that is Hispanic (HD02_202), the percentage of voters that voted democratic in the 1980 election (Pres80D), the percentage of voters that voted democratic in the 2012 election (PresD12), the voter turnout in the 1980 election (VTP80), and the voter turnout in the 2012 election (VT12).  The matrix was created to determine if there was a significant correlation between any of the variables.
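A correlation matrix of this kind is built from pairwise Pearson correlation coefficients.  The sketch below shows the calculation on a few invented county values; the variable names and numbers are hypothetical, and it does not reproduce the significance levels that SPSS reports alongside each coefficient.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented values for five hypothetical counties (not the Texas data).
variables = {
    "pct_hispanic": [10.0, 25.0, 40.0, 60.0, 85.0],
    "turnout_2012": [62.0, 58.0, 55.0, 47.0, 41.0],
    "pct_dem_2012": [22.0, 30.0, 33.0, 48.0, 70.0],
}

# Every variable is paired with every other variable, including itself.
matrix = {
    a: {b: pearson_r(variables[a], variables[b]) for b in variables}
    for a in variables
}

for name, row in matrix.items():
    print(name, [round(r, 3) for r in row.values()])
```

In this toy data the Hispanic percentage is strongly negatively correlated with turnout and strongly positively correlated with the democratic percentage, which is the same kind of pattern the matrix is read for.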

The 2010 percent of the population that is Hispanic has a weak positive correlation with the percentage of voters that voted democratic in the 1980 election, a strong positive correlation with the percentage of voters that voted democratic in the 2012 election (significant at the 0.01 level), a weak negative correlation with the voter turnout in the 1980 election (significant at the 0.01 level), and a strong negative correlation with the voter turnout of the 2012 election (significant at the 0.01 level).  There is a strong amount of overlap between the areas of strong democratic vote and high Hispanic populations.  Because of the strong positive relationship, this indicates that the Hispanic population generally votes democratic.  It can also potentially mean that counties with higher percentages of Hispanic population will have a greater percentage of democratic votes.  This overlap occurs mostly along the Texas and Mexico border.  The negative correlations between the voter turnouts and the percent Hispanic population suggest that counties with high percentages of Hispanics are more likely to have lower voter turnout.  This analysis suggests that the Hispanic population has lower voter turnout.

The percent democratic vote of the 1980 election has a moderate positive correlation with the percent democratic vote of the 2012 election.  The voter turnout of the 1980 election has a moderate positive correlation with the voter turnout of the 2012 election.  Both of these correlations are significant at the 0.01 level.  Both suggest that county-level patterns were moderately consistent between the two elections: counties with higher voter turnout or a higher percent democratic vote in 1980 tended to have higher values in 2012 as well.  A correlation alone does not show whether the values themselves rose or fell between the two years.
Figure 16. Correlation matrix of the Texas variables of the percent of the 2010 population that is Hispanic, the percentage of voters that voted democratic in the 1980 election, the percentage of voters that voted democratic in the 2012 election, the voter turnout in those counties in the 1980 election, and the voter turnout in those counties in the 2012 election.

Conclusion

Given the presented results, there are several trends that can be identified in the analyzed data that can help TEC and the Texas governor better understand the voting patterns of the state.  The Hispanic population is most likely to vote democratic but is less likely to get out and vote.  If the governor is a democrat or wants Texas to have a democratic result, the governor should promote voting in Hispanic populations to push those communities to get out and vote in the elections.  It can also be concluded that in the 32 years between the 1980 and 2012 elections, county-level patterns in voter turnout and percent democratic vote remained moderately consistent, since each 1980 variable is moderately and positively correlated with its 2012 counterpart.  This can all be useful information for the TEC and governor as they work to better understand voting trends in Texas.

Sources

U.S. Census Bureau (2010). QT-P10: Hispanic or Latino by Type. Retrieved from https://factfinder.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t


Thursday, November 9, 2017

Assignment 4: Hypothesis Testing

Introduction

The goal of this assignment is to work with some key concepts related to hypothesis testing.  This assignment demonstrates a knowledge of Z and T tests, including how to distinguish between the two tests and the calculations involved with each.  Also in this assignment the steps of hypothesis testing are used to make decisions about null and alternative hypotheses by using real-world data and making connections between geography and the calculated statistics.

Part 1: T and Z tests

Part 1 of this assignment demonstrates a basic knowledge of Z and T tests by answering a few short questions.

Question 1: The first question involved completing the table in Figure 1 when only the first four columns were given.  Three fields needed to be filled in: calculating "a", determining if it is a Z or a T test, and finding the Z or T value for the test.  Column "a" was calculated by subtracting the confidence level from 100 and moving the decimal point two places to the left to convert the percentage to a decimal number.  Whether a Z or T test should be used was determined by how large a sample size (n) was being used.  If "n" is larger than 30 a Z test should be used and if "n" is less than 30 a T test should be used.  Lastly the Z or T value was found by using T score and Z score charts.  The T score chart is in Figure 2 and the Z score chart is in Figure 3.  If the test was two tailed the "a" value had to be divided by two to account for using two tails.  For a two tailed Z test half of the "a" value was added to the confidence level before using the Z score chart.  To use the Z score chart, this value was located in the chart and the corresponding Z score was read from the X and Y axes of the chart.  The T score chart was used by first calculating the degrees of freedom (n-1) and then finding the corresponding T value by locating the correct "a" value on the top of the chart.  For two tailed T tests the "a" value used should be halved in order to account for the second tail.
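The steps for "a", the choice of test, and the Z critical value can be sketched in a few lines.  This is only an illustration using Python's standard library in place of the Z score chart; the function names are made up for this example, and treating a sample of exactly 30 as a T test is an assumption because the rule above does not cover that case.  The T critical values are not computed here because they were read from the chart in Figure 2.

```python
from statistics import NormalDist


def alpha_from_confidence(confidence_percent):
    """Subtract the confidence level from 100 and convert to a decimal."""
    return round((100 - confidence_percent) / 100, 4)


def choose_test(n):
    """Z test for samples larger than 30, otherwise T test.

    Assumption: n of exactly 30 is treated as a T test here.
    """
    return "Z" if n > 30 else "T"


def z_critical(confidence_percent, tails):
    """Positive Z critical value for a one (tails=1) or two (tails=2) tailed test."""
    alpha = alpha_from_confidence(confidence_percent)
    tail_area = alpha / tails  # a two tailed test splits "a" across both tails
    # Equivalent to looking up (1 - tail area) in the body of a Z score chart.
    return NormalDist().inv_cdf(1 - tail_area)


print(alpha_from_confidence(95))        # 0.05
print(choose_test(52), choose_test(23))  # Z T
print(round(z_critical(95, 2), 2))      # two tailed, 95%
print(round(z_critical(95, 1), 2))      # one tailed, 95%
```

For a 95% confidence level this gives roughly 1.96 for a two tailed test and roughly 1.64 for a one tailed test, which are the values normally read off a Z score chart.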
     
Figure 1. The first four columns were given in the table and the last three columns were calculated for Question 1.

Figure 2. T score chart used to determine critical values.
Figure 3. Z score chart used to determine critical values.

Question 2: The second question worked with estimates from a Department of Agriculture and Live Stock Development organization in Kenya on three main crops grown in the country.  The estimates gave the production levels that districts should approach for groundnuts, cassava, and beans.  The estimates were calculated from averages for the whole country of Kenya.  A survey was also conducted with 23 farmers in Kenya to get a sample mean (μ) and standard deviation of the sample (σ) for the three crops.  The data provided for the question can be seen in Figure 4.

Figure 4. The data provided in Question 2.  The estimated yield was based on averages in the country of Kenya and the sample mean (μ) and standard deviation of the sample (σ) were calculated from a survey of 23 farmers.

In this question a significance test was to be conducted.  The null hypothesis is that there is no significant difference between the estimated and the actual yield of the surveyed results of each of the three crops.  The alternative hypothesis is that there is a significant difference between the estimated yield and the actual yield of the surveyed results of each crop.  The hypotheses were tested using a T test for each crop instead of a Z test because the sample is small, less than 30.  The T test equation in Figure 5 was used to test the hypotheses for the three crops.  Two-tailed T tests with a 95% confidence level were used.  The estimated yield was used as the hypothesized mean in the equation and there were 23 observations in this study.

Figure 5. T test equation. 

The calculated results of this question can be seen in Figure 6.  The critical value range for each crop was determined by using the T score chart in Figure 2.  Degrees of freedom were calculated by subtracting 1 from the number of observations, and because a two tailed T test was being used, the column of the chart for 0.025 was used.  For the crop groundnuts the null hypothesis cannot be rejected.  The calculated T value falls within the critical value range.  There is not a significant difference between the estimated yield of groundnuts and the actual average yield of this crop.  For the crop cassava the null hypothesis can be rejected.  The calculated T value falls outside of the critical value range.  There is a significant difference between the estimated yield of cassava and the actual average yield of this crop.  For the crop beans the null hypothesis cannot be rejected.  The calculated T value falls within the critical value range.  There is not a significant difference between the estimated yield of beans and the actual average yield of this crop.

     Figure 6. Calculated results of Question 2. 
                                               
Using the probability chart from the textbook in Figure 7, the probability of having or exceeding a specific test statistic was determined for each crop.  To use the chart the degrees of freedom had to be calculated by subtracting 1 from the sample size.  In this example the degrees of freedom is 22.  Using the T statistic and the degrees of freedom the probability was found.  If the T statistic was negative the probability given from the chart was subtracted from 1.  The chart provided by the textbook only went to 20 degrees of freedom, not 22, so 20 degrees of freedom was used to find the probability in this example.  The probability for groundnuts was determined to be 0.27762.  The probability for cassava was determined to be 0.1062.  The probability for beans was determined to be 0.95652.


Figure 7. Probability chart used to find probability of calculated T statistics.


Question 3: In question 3 significance testing was used to determine if a stream's pollutant level is higher than the allowable limit of 4.4 mg/l.  There were 17 samples taken, with a calculated mean pollutant level of 6.8 mg/l and a standard deviation of 4.2.  A one tailed T test with a 95% confidence level was used; a T test was chosen over a Z test because the number of samples was less than 30.

The null hypothesis of this scenario is that there is no significant difference between the mean pollutant level of the water samples and the allowable pollutant limit.  The alternative hypothesis is that there is a significant difference between the mean pollutant level of the water samples and the allowable pollutant limit.

The equation in Figure 5 was used to calculate the T statistic, which was 2.356.  The chart in Figure 2 was used to determine the critical value of 1.746.  The calculated T statistic is larger than the critical value, meaning the null hypothesis can be rejected.  There is a significant difference between the sampled stream pollutant levels and the allowable limit for stream pollutants.  The calculated T statistic falls outside the critical value range.  It can also be determined that the stream pollutant level is over the allowable limit for pollutants.
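The T statistic above can be reproduced with the standard one-sample T test formula, t = (sample mean - hypothesized mean) / (standard deviation / square root of n).  The sketch below assumes this is the equation shown in Figure 5, which is not reproduced in the text; the function name is made up for illustration, and the critical value is simply the 1.746 read from the chart.

```python
import math


def t_statistic(sample_mean, hypothesized_mean, sample_sd, n):
    """One-sample T statistic: difference in means divided by the standard error."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


# Values from Question 3: 17 samples, mean 6.8 mg/l, standard deviation 4.2,
# tested against the allowable limit of 4.4 mg/l.
t_value = t_statistic(sample_mean=6.8, hypothesized_mean=4.4, sample_sd=4.2, n=17)

critical_value = 1.746  # one tailed, 16 degrees of freedom, taken from the chart
reject_null = t_value > critical_value

print(round(t_value, 3), reject_null)
```

With the Question 3 values this returns a T statistic of about 2.356, which exceeds the critical value, so the null hypothesis is rejected.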

Using the probability chart in Figure 7 the probability of having or exceeding a specific test statistic was determined.  To use the chart the degrees of freedom had to be calculated by subtracting 1 from the sample size.  In this example the degrees of freedom is 16.  Using the T statistic (2.356) and the degrees of freedom the probability was found.  The probability of the calculated statistic was determined to be 0.96945.

Part 2: Real World Scenario

Part 2 of the lab posed a real world spatial question that relied on a hypothesis test to answer the question.  A hypothesis test was conducted in this example to determine if the average value of homes for block groups in the city of Eau Claire is significantly different from that for block groups in Eau Claire County.

The null hypothesis is that there is no significant difference between the average home values by block group in the city of Eau Claire and the average home values by block group for Eau Claire County.  The alternative hypothesis is that there is a significant difference between the average home values by block group in the city of Eau Claire and the average home values by block group for Eau Claire County.  It was determined that a Z test should be conducted on the average home values in the city of Eau Claire because there were 52 home values recorded for the city (n=52), and when a sample size is larger than 30 a Z test should be used over a T test.  The equation for a Z test is shown in Figure 8.  A 95% confidence level is standard practice when working with U.S. Census Bureau data, so a one tailed Z test with a 95% confidence level was used.  The table of data in Figure 9 was used to calculate the Z test statistic.  All of the values in this table were obtained from the census data in the shapefiles provided in the assignment.  The Z test statistic was calculated to be -2.548.  The critical value for a one tailed Z test at a 95% confidence level was selected from Figure 1.  The critical value is -1.64; the negative critical value was selected because the sample mean is smaller than the hypothesized mean in this example.  The Z test statistic is less than the critical value, meaning the null hypothesis can be rejected.  There is a significant difference between the average value of homes at block group level in the city of Eau Claire and the average value of homes at block group level in the County of Eau Claire.  The probability associated with the city of Eau Claire's block group average home values is 0.0055.  This was found by using the chart in Figure 3 and finding the probability value corresponding to the calculated Z statistic.  This means that the city of Eau Claire sample block groups are in the 0.55 percentile, which is extremely low.
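The decision and the probability above can be checked from the reported Z statistic alone.  The sketch below uses Python's standard normal distribution in place of the Z score chart in Figure 3; it starts from the reported value of -2.548 because the underlying figures in Figure 9 are not reproduced in the text.

```python
from statistics import NormalDist

z_value = -2.548        # Z test statistic reported above
critical_value = -1.64  # one tailed, 95% confidence level, lower tail

# The sample mean is below the hypothesized mean, so the lower tail is tested.
reject_null = z_value < critical_value

# Probability of a Z value at or below the calculated statistic.
probability = NormalDist().cdf(z_value)

print(reject_null, round(probability, 4))
```

The computed probability comes out a little above 0.005, in line with the 0.0055 read from the chart; the small difference comes from the chart being tabulated at two decimal places of Z.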

Figure 8. Z test equation.  

Figure 9. Data collected from census data to complete Z test.

A map in Figure 10 was created to compare the average home value by block group between the city of Eau Claire and the county of Eau Claire.  The average block group home value of the county can be seen in green and the average block group home value of the city can be seen in purple.  It can be seen that many of the block groups in the inner city of Eau Claire have a lower home value compared to the home values found in the rest of the county.  The county block groups appear to have significantly higher home values than the block groups in the city of Eau Claire.
   
Figure 10. Map comparing the average block group home values in the city of Eau Claire and Eau Claire County.