Sunday, December 3, 2017

Assignment 5: Correlation and Autocorrelation

Introduction

This assignment was divided into two parts.  Part one used correlation analysis to determine the strength and direction of the relationship between pairs of variables.  Excel, scatterplots, and the statistical software SPSS were used to interpret the correlations present in the data sets.  Part two used a real-life scenario to practice spatial autocorrelation analysis.  In part two, data was downloaded from the U.S. Census Bureau website and joined with shapefiles in order to calculate spatial autocorrelation.
The goal of this assignment is to  

Part I: Correlation

1. In this question Excel and SPSS were used to test the correlation between two variables in a hypothesis test.  The data supplied for this question is given in the table in Figure 1.  Ten measurements of two variables, distance and sound level, were supplied.  The data was organized in Excel and a scatterplot was generated.  The scatterplot suggests the strength and direction of the association between distance and sound level.  The scatterplot with trendline can be seen in Figure 2.  In SPSS a two-tailed Pearson correlation was conducted.  The statistical results of the correlation can be seen in Figure 3.

For the hypothesis test in this question, a t test was used because there are fewer than 30 samples (only 10).  The Pearson correlation assumes that any association between distance and sound level is linear.  The null hypothesis is that there is no linear association between distance and sound level.  The alternative hypothesis is that there is a linear association between them.  A two-tailed test at the 95% confidence level (a significance level of 0.05) was used.  The scatterplot in Figure 2 suggests a strong association between distance and sound level because the data points are located close to the trendline.  The scatterplot also suggests a negative relationship because as one variable increases the other decreases: as distance increases, sound level decreases.  The SPSS output shows similar but more detailed results.  The correlation coefficient ranges from -1 to +1.  The closer the coefficient is to -1 or +1, the stronger the association between the variables, and the closer it is to 0, the weaker the association.  The Pearson correlation coefficient is r = -0.896.  The sign is negative, meaning the two variables have a negative relationship, and the value is close to -1, which suggests a high correlation.  There is a strong negative relationship between distance and sound level.  The SPSS output flags the correlation as significant at the 0.01 level.  Because that is below 0.05, the null hypothesis can be rejected and it can be concluded that there is a linear association between distance and sound level.
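The t test above can be reproduced by hand from the reported coefficient.  The following is a minimal sketch, assuming the usual formula t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom.  The function name and the critical value constant are my own labels; the critical value 2.306 is the standard two-tailed t table entry for 8 degrees of freedom at the 0.05 level.

```python
import math

def t_statistic_from_r(r, n):
    """t statistic for H0: no linear association, given Pearson r and n pairs."""
    degrees_of_freedom = n - 2
    return r * math.sqrt(degrees_of_freedom) / math.sqrt(1.0 - r * r)

r = -0.896   # Pearson coefficient reported in the SPSS output (Figure 3)
n = 10       # number of distance / sound level pairs

t = t_statistic_from_r(r, n)

# Two-tailed critical value for alpha = 0.05 and df = 8, from a standard t table.
T_CRITICAL = 2.306
reject_null = abs(t) > T_CRITICAL

print(f"t = {t:.3f}, df = {n - 2}, reject null hypothesis: {reject_null}")
```

The absolute value of t is well beyond the critical value, which agrees with the decision to reject the null hypothesis.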

Figure 1. Distance and sound level measurements supplied for question 1.

Figure 2. Scatterplot created in Excel of the data given in Figure 1.  

Figure 3. Output from SPSS software. 

2. Census tract and population data for Detroit, Michigan were supplied for this question.  A small portion of the data supplied in Excel can be seen in Figure 4.  SPSS was used to create a correlation matrix for the data.  The matrix can be seen in Figure 5.

The correlation matrix was created to measure the correspondence between several pairs of variables.  The matrix provides Pearson correlation coefficient (r) values that range from -1 to +1.  The closer the coefficient is to -1 or +1, the stronger the association between the variables, and the closer it is to 0, the weaker the association.  If the number is positive there is a positive correlation between the two variables, and if the number is negative there is a negative correlation.  The Pearson correlation coefficients were tested with a two-tailed test at the 95% confidence level (a significance level of 0.05).  There were more than 30 samples in the supplied dataset, so a Z test was used.
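The reading rule above (sign gives direction, distance from 0 gives strength) can be written as a small helper.  This is only a sketch: the cutoffs of 0.3 and 0.7 are illustrative assumptions of mine, not values from the assignment or from SPSS, and the function name is hypothetical.

```python
def describe_correlation(r, weak_cutoff=0.3, strong_cutoff=0.7):
    """Return a plain-language label for a Pearson r.

    The cutoffs are illustrative assumptions, not fixed statistical rules.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size < weak_cutoff:
        strength = "weak"
    elif size < strong_cutoff:
        strength = "moderate"
    else:
        strength = "strong"
    return f"{strength} {direction}"

# The coefficient from question 1 of Part I.
print(describe_correlation(-0.896))
```

Writing the cutoffs down once makes it easier to use words like weak, moderate, and strong consistently across a whole matrix.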

In the matrix it can be seen that there is a perfect correlation (r=1) when a variable is compared with itself.  That makes sense because the two things being compared are identical.  The four racial and ethnic groups in the matrix describe the population of the 1000 census tracts in and around Detroit.
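To show how such a matrix is built, and why the diagonal is always 1, here is a minimal sketch in plain Python.  The column names are borrowed from the Detroit matrix, but the five rows of numbers are made up for illustration and are not the supplied data.  SPSS also reports significance values, which this sketch leaves out.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)

def correlation_matrix(columns):
    """Correlate every column with every other column (and with itself)."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical values for five census tracts (NOT the assignment data).
tracts = {
    "MedHHInc": [32000, 45000, 51000, 78000, 96000],
    "BachDegree": [120, 260, 310, 540, 700],
    "Manu": [410, 380, 450, 300, 390],
}

matrix = correlation_matrix(tracts)
for row_name, row in matrix.items():
    print(row_name, [round(value, 3) for value in row.values()])
```

Each diagonal entry compares a column with itself, so the covariance equals the spread and r comes out as 1.  The matrix is also symmetric, which is why only one triangle needs to be read.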

In the matrix median household income (MedHHInc) has a high positive correlation with the number of people with a Bachelor's Degree (BachDegree) and with median home value (MedHomeValue), and a moderate positive correlation with the white population (White).  It has a low positive correlation with the number of retail employees (Retail), the number of finance employees (Finance), and the Asian population (Asian).  It has a low negative correlation with the number of manufacturing employees (Manu), the black population (Black), and the Hispanic population (Hispanic).
  
In the matrix the number of people with a Bachelor's Degree has a high positive correlation with median home value, an extremely low positive correlation with the number of manufacturing employees, a low positive correlation with the number of retail employees, a very low positive correlation with the number of finance employees, a high positive correlation with the white population, a low negative correlation with the black population, a very low negative correlation with the Hispanic population, and a moderate positive correlation with the Asian population.

In the matrix median home value has a very weak positive correlation with number of manufacturing employees, a weak positive correlation with the number of retail employees, a very weak positive correlation with the number of finance employees, a weak positive correlation with the white population, a weak negative correlation with the black population, a weak positive correlation with the Asian population, and a very weak negative correlation with the Hispanic population.   

In the matrix the number of manufacturing employees has a weak positive correlation with the number of retail employees, a weak positive correlation with the number of finance employees, a very weak positive correlation with the white population, a very weak negative correlation with the black population, a very low positive correlation with the Asian population, and an extremely low negative correlation with the Hispanic population.  

In the matrix the number of retail employees has a low positive correlation with the number of finance employees, a weak positive correlation with the white population, a weak negative correlation with the black population, a weak positive correlation with the Asian population, and an extremely weak negative correlation with the Hispanic population.

In the matrix the number of finance employees has an extremely low negative correlation with the white population, a very weak negative correlation with the black population, a low positive correlation with the Asian population, and a low negative correlation with the Hispanic population.

It can be seen in the matrix that the four racial and ethnic groups (White, Black, Asian, and Hispanic) in the population of the 1000 census tracts in and around Detroit were also tested for correlations with each other.  There is a weak negative correlation between Black and Asian, Asian and Hispanic, and Hispanic and Black populations.  There is a strong negative correlation between Black and White.  There is a weak positive correlation between White and Asian and between White and Hispanic.

It is important to note trends in the matrix as well as the level at which the correlations are significant.  All of the correlations with the black population are significant at the 0.01 level except for the number of finance employees.  The black population also has a negative correlation with every other variable.  This means that tracts with a larger black population tend to have smaller white, Asian, and Hispanic populations, fewer people with a Bachelor's degree, lower median household income, lower median home value, and fewer manufacturing, finance, and retail employees.  The white population is significantly correlated at the 0.01 level with every variable except for the number of manufacturing and finance employees.  The number of manufacturing employees has only three low correlations that are significant at the 0.01 level and one low correlation that is significant at the 0.05 level.  Median home value is significantly correlated at the 0.01 level with every variable except for the number of finance employees, which is significant at the 0.05 level, and the number of manufacturing employees, which is not significant.  Another notable trend is that the white population has a positive correlation with every variable except for the black population and the number of finance employees.  Many of the correlations in this matrix are significant at either the 0.01 or the 0.05 level.

Figure 4. Small portion of the supplied Detroit, Michigan Excel dataset.

Figure 5. Correlation matrix of supplied Detroit, Michigan data.

Part II: Spatial Autocorrelation

Introduction

In part two of this assignment a real-world spatial question was posed.  The software GeoDa was used to calculate the spatial autocorrelation of the variables and SPSS was used to calculate the correlations between them.  In this part the Texas Election Commission (TEC) is requesting that voting data from the 1980 and 2012 presidential elections be analyzed for spatial patterns or clustering in voting patterns and voter turnout.  County-level data was supplied for voter turnout and percent Democratic vote in both elections.  The TEC wants to supply this analysis to the governor of Texas to show how election patterns have changed in the 32 years between the two elections.  It is plausible that voting patterns are correlated with population variables, so those were analyzed too in order to understand what may be driving the spatial patterns in the voting data.

Methods

First, all of the data for the analysis had to be gathered.  The voter turnout and percent Democratic vote data was supplied by the TEC at the county level.  A shapefile of the counties in Texas and an Excel spreadsheet of Hispanic population data for Texas were both downloaded from the U.S. Census Bureau.  Next, the downloaded Excel file was cleaned up (normalized) so that it could be read by ArcMap.  The shapefile of Texas counties, the Excel sheet of voting data, and the sheet of Hispanic population data were then added to an ArcMap viewer.  The two Excel sheets were joined to the Texas counties shapefile using the geo ID field.  The result was exported as a new shapefile containing all of the joined data.  Next, the new shapefile was opened in the data analysis software GeoDa.  A spatial weights matrix was created in GeoDa to define which counties count as neighbors when measuring clustering.  The rook contiguity method, which treats counties as neighbors when they share a border segment rather than only a corner point, was selected because it is a commonly used method.  After the spatial weights were calculated, Moran's I values, scatterplots, and univariate Local Indicators of Spatial Autocorrelation (LISA) maps were generated for each of the variables being tested: percent Hispanic population per county, voter turnout for the 1980 election, voter turnout for the 2012 election, percent Democratic vote for the 1980 election, and percent Democratic vote for the 2012 election.  Finally, all of the variables were placed in a normalized Excel spreadsheet, and SPSS was used to generate a correlation matrix to look for correlations among the five tested variables.
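The table join described above was done in ArcMap, but the idea behind it can be sketched in a few lines of Python.  Everything here is hypothetical: the field name GEOID, the county identifiers, and the numbers are placeholders, and only the column names VT12, PresD12, and HD02_202 come from the assignment data.

```python
def join_on_key(base_rows, extra_rows, key):
    """Left join: copy the columns of extra_rows onto base_rows where key matches."""
    lookup = {row[key]: row for row in extra_rows}
    joined = []
    for row in base_rows:
        merged = dict(row)
        match = lookup.get(row[key], {})
        merged.update({name: value for name, value in match.items() if name != key})
        joined.append(merged)
    return joined

# Hypothetical stand-ins for the shapefile attribute table and the two Excel sheets.
counties = [
    {"GEOID": "county_A", "name": "County A"},
    {"GEOID": "county_B", "name": "County B"},
    {"GEOID": "county_C", "name": "County C"},
]
voting = [
    {"GEOID": "county_A", "VT12": 55.0, "PresD12": 30.0},
    {"GEOID": "county_B", "VT12": 48.0, "PresD12": 62.0},
]
hispanic = [
    {"GEOID": "county_A", "HD02_202": 12.5},
    {"GEOID": "county_B", "HD02_202": 71.0},
    {"GEOID": "county_C", "HD02_202": 33.0},
]

joined = join_on_key(join_on_key(counties, voting, "GEOID"), hispanic, "GEOID")
for row in joined:
    print(row)
```

County C has no voting row in this made-up example, so it keeps only its own fields plus the Hispanic column.  A join key has to match exactly in both tables, which is one reason the spreadsheet needs cleaning before a join.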

Results

To understand the Moran's I values, scatterplots, and LISA cluster maps below, spatial autocorrelation must first be understood.  Spatial autocorrelation measures whether a variable shows any systematic pattern in its spatial distribution.  If such a pattern or clustering exists, the variable is said to be spatially autocorrelated.  The Moran's I test works by comparing the value of the variable at each location with the values at neighboring locations.  This produces a Moran's I coefficient, which can be seen below at the top of each scatterplot for every tested variable.  The coefficient typically ranges between -1.0 and 1.0.  Values near 1.0 indicate that similar values cluster together, values near 0 indicate a random spatial pattern, and values near -1.0 indicate that dissimilar values sit next to each other.  The LISA cluster maps provide the visual, spatial element of autocorrelation.  Areas in white are not statistically significant.  The map sorts the areas that are significant (at the 0.05 level) into four categories.  Areas where the variable has a high value and is surrounded by other areas of similarly high values are High-High areas, symbolized by red.  Areas where the variable has a high value and is surrounded by low value areas are High-Low areas, symbolized by salmon.  Areas where the variable has a low value and is surrounded by high value areas are Low-High areas, symbolized by periwinkle blue.  Areas where the variable has a low value and is surrounded by low value areas are Low-Low areas, symbolized by royal blue.
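The ideas in the paragraph above can be illustrated on a toy grid.  This is a sketch under stated assumptions, not what GeoDa does internally: it uses a square grid instead of counties, rook contiguity with row-standardized weights, and made-up values.  It also only assigns the four quadrant labels; GeoDa additionally tests each location for significance with random permutations, which is omitted here.

```python
def rook_neighbors(n_rows, n_cols):
    """Map each grid cell index to the cells that share an edge with it."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            adjacent = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    adjacent.append(rr * n_cols + cc)
            neighbors[r * n_cols + c] = adjacent
    return neighbors

def deviations_and_lags(values, neighbors):
    """Deviation from the mean for each cell, and the mean deviation of its neighbors."""
    mean = sum(values) / len(values)
    z = [v - mean for v in values]
    lag = [sum(z[j] for j in neighbors[i]) / len(neighbors[i]) for i in range(len(z))]
    return z, lag

def morans_i(values, neighbors):
    """Global Moran's I with row-standardized weights."""
    z, lag = deviations_and_lags(values, neighbors)
    return sum(zi * li for zi, li in zip(z, lag)) / sum(zi * zi for zi in z)

def quadrant_labels(values, neighbors):
    """Label each cell High-High, High-Low, Low-High, or Low-Low (no significance test)."""
    z, lag = deviations_and_lags(values, neighbors)
    labels = []
    for zi, li in zip(z, lag):
        own = "High" if zi > 0 else "Low"
        surrounding = "High" if li > 0 else "Low"
        labels.append(f"{own}-{surrounding}")
    return labels

grid4 = rook_neighbors(4, 4)
clustered = [1, 1, 0, 0] * 4               # high values on the left, low on the right
checkerboard = [(r + c) % 2 for r in range(4) for c in range(4)]

i_clustered = morans_i(clustered, grid4)
i_checkerboard = morans_i(checkerboard, grid4)
print(round(i_clustered, 3), round(i_checkerboard, 3))

grid3 = rook_neighbors(3, 3)
one_hot_spot = [1, 1, 1, 1, 10, 1, 1, 1, 1]   # one high cell in the middle
labels = quadrant_labels(one_hot_spot, grid3)
print(labels)
```

The clustered grid gives a clearly positive Moran's I, while the checkerboard, where every neighbor is different, gives -1.  In the last example the middle cell is a High-Low outlier and the cells touching it are Low-High, the same kind of outlier pattern described for individual counties in the results.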

First, the percent of the 2010 population that is Hispanic was analyzed for spatial autocorrelation.  The calculated Moran's I for this variable can be seen in Figure 6.  The Moran's I value is 0.778655, a positive number close to 1, which means that the 2010 Hispanic population is highly clustered in the state of Texas.  The scatterplot in Figure 6 also shows a positive autocorrelation, and the data points lie close to the trend line, which further supports strong clustering.  The LISA map for the percent of the population that is Hispanic can be seen in Figure 7, and it also suggests a high amount of clustering of the Hispanic population in Texas.  In the map, clustering of high Hispanic populations can be seen in the southwest corner of the state.  This clustering can be explained by the location along the Mexico border.  Clustering of low Hispanic populations can be seen in the northeast corner of the state.  This can be explained by the fact that this part of Texas is the farthest from the Mexico border.  There is one outlier, Titus County in northeast Texas, which has a high percentage of Hispanic population.  This may be attributable to a large number of factory jobs in the county drawing Hispanic workers to fill those positions.

Figure 6. Scatterplot of Texas counties and the percent of the population that is Hispanic in those counties.


Figure 7. Cluster map of Texas counties and the percent of the population that is Hispanic in those counties.

The next variable analyzed for spatial autocorrelation is the voter turnout in Texas for the 1980 election.  The Moran's I value seen in Figure 8 for this variable is 0.575173, a positive number moderately close to 1, which suggests that voter turnout is moderately clustered in Texas.  The scatterplot in Figure 8 also suggests moderate clustering of the 1980 voter turnout.  The LISA map for the voter turnout of the 1980 election can be seen in Figure 9.  This map also suggests moderate spatial autocorrelation with some clustering of voter turnout.  In the northernmost and central counties of Texas there was clustering of low voter turnout, and in the southernmost counties there was clustering of high voter turnout.  There were also outlier counties such as Swisher County, Bexar County, and Waller County that had high voter turnout surrounded by counties with low voter turnout.  For Bexar County this could be explained by the city of San Antonio being located in that county.  People in large cities typically have better access to voting locations than people in smaller communities.  Outliers with low voter turnout surrounded by counties of high voter turnout include Bowie County, McMullen County, and King County.  These are most likely smaller counties made up of ranching populations that might not have as good access to voting locations.

Figure 8. Scatterplot of Texas counties and the voter turnout in those counties in the 1980 election.


Figure 9. Cluster map of Texas counties and the voter turnout in those counties in the 1980 election.

The next variable analyzed for spatial autocorrelation is the voter turnout in Texas for the 2012 election.  The Moran's I value seen in Figure 10 for this variable is 0.695853, a positive number close to 1, which suggests that voter turnout is highly clustered in the state of Texas.  This value is greater than the Moran's I value from the 1980 election, which suggests that voter turnout was more clustered in 2012 than in 1980.  The scatterplot in Figure 10 also suggests a relatively high amount of clustering of the 2012 voter turnout.  The LISA map for the voter turnout of the 2012 election can be seen in Figure 11.  The map also shows fairly high spatial autocorrelation, with some strong clustering of voter turnout.  In the northern and central counties of Texas there is clustering of low voter turnout.  In the southern and western counties of Texas there is clustering of high voter turnout.  Foard County was an outlier that had higher voter turnout surrounded by counties with low voter turnout.  There were also outlier counties like Oak and McMullen.  These are most likely small counties made up of small ranching communities that do not have as good access to voting locations.  McMullen County was a low voter turnout county surrounded by counties of higher voter turnout in both the 1980 and 2012 elections.

Figure 10. Scatterplot of Texas counties and the voter turnout in those counties in the 2012 election.


Figure 11. Cluster map of Texas counties and the voter turnout in those counties in the 2012 election.

The next variable analyzed for spatial autocorrelation is the percentage of voters in Texas that voted Democratic in the 1980 election.  The Moran's I value seen in Figure 12 for this variable is 0.489058, a positive number only moderately close to 1, which suggests that the Democratic vote of the 1980 election is moderately to slightly clustered in Texas.  The scatterplot in Figure 12 also suggests slight clustering, with the data points loosely grouped around the trend line.  The LISA map for the percent Democratic vote of the 1980 election can be seen in Figure 13.  This map also suggests only a slight amount of spatial autocorrelation with some mild clustering.  In the northernmost counties there is a small cluster of high percentages of Democratic voters, and in the south and a little in the east there is clustering of low percentages of Democratic voters.  There is a notable number of outliers: 6 counties with low percentages of Democratic voters surrounded by counties with higher percentages, and 4 counties with high percentages surrounded by counties with lower percentages.  This is consistent with the lower Moran's I value, since more outliers mean less clustering of the variable.  It can also be seen that some of the locations with high Hispanic populations in Figure 7 are also highlighted as areas of high Democratic vote in Figure 13.


Figure 12. Scatterplot of the Texas counties and the percentage of voters that voted Democratic in the 1980 election.

Figure 13. Cluster map of Texas counties and the percentage of voters that voted Democratic in the 1980 election.

The next variable analyzed for spatial autocorrelation is the percentage of voters in Texas that voted Democratic in the 2012 election.  The Moran's I value seen in Figure 14 for this variable is 0.335851, a positive number that is not very close to 1, which suggests that the Democratic vote of the 2012 election is only weakly clustered in Texas.  This value is smaller than the value from the 1980 election, which suggests that the percent Democratic vote was less spatially clustered in 2012 than in 1980.  The scatterplot in Figure 14 also suggests low clustering.  Some of the data points are grouped around the trend line while a large share of them are spread far from it.  The LISA map for the percent Democratic vote of the 2012 election can be seen in Figure 15.  This map also suggests only a small amount of spatial autocorrelation with minimal clustering.  At the southernmost part of Texas there is a cluster of counties with low percentages of Democratic voters.  In a few central and a few northern Texas counties there are small clusters of counties with high percentages of Democratic voters.  There are also outlier counties of both kinds scattered across the state: counties with low percentages of Democratic voters surrounded by counties with higher percentages, and counties with high percentages surrounded by counties with lower percentages.  It can also be seen that some of the locations with high Hispanic populations in Figure 7 are also highlighted as areas of high Democratic vote in Figure 15.  There are fewer such counties than there were in the 1980 election, but they are still present.

Figure 14. Scatterplot of the Texas counties and the percentage of voters that voted Democratic in the 2012 election.


Figure 15. Cluster map of Texas counties and the percentage of voters that voted Democratic in the 2012 election.

A correlation matrix was created for the 5 tested variables: the percent of the 2010 population that is Hispanic (HD02_202), the percentage of voters that voted Democratic in the 1980 election (Pres80D), the percentage of voters that voted Democratic in the 2012 election (PresD12), the voter turnout in the 1980 election (VTP80), and the voter turnout in the 2012 election (VT12).  The matrix was created to determine if there was a significant correlation between any of the variables.

The 2010 percent of the population that is Hispanic has a weak positive correlation with the percentage of voters that voted Democratic in the 1980 election, a strong positive correlation with the percentage that voted Democratic in the 2012 election (significant at the 0.01 level), a weak negative correlation with voter turnout in the 1980 election (significant at the 0.01 level), and a strong negative correlation with voter turnout in the 2012 election (significant at the 0.01 level).  There is a large amount of overlap between the areas of strong Democratic vote and high Hispanic populations, mostly along the Texas and Mexico border.  The strong positive relationship means that counties with higher percentages of Hispanic population tend to have a greater percentage of Democratic votes, which suggests, though county-level data cannot prove it, that the Hispanic population generally votes Democratic.  The negative correlations between percent Hispanic population and voter turnout suggest that counties with high percentages of Hispanic population are more likely to have lower voter turnout.  This is consistent with the Hispanic population having lower voter turnout.

The percent Democratic vote of the 1980 election has a moderate positive correlation with the percent Democratic vote of the 2012 election.  The voter turnout of the 1980 election has a moderate positive correlation with the voter turnout of the 2012 election.  Both of these correlations are significant at the 0.01 level.  Both suggest that county-level patterns carried over only moderately between the two elections: counties with higher turnout or a higher percent Democratic vote in 1980 tended to have higher values in 2012 as well, but the relationship is not strong.  The correlations describe how similar the county patterns are in the two years, not whether the overall levels rose or fell.

Figure 16. Correlation matrix of the Texas variables of the percent of the 2010 population that is Hispanic, the percentage of voters that voted Democratic in the 1980 election, the percentage of voters that voted Democratic in the 2012 election, the voter turnout in those counties in the 1980 election, and the voter turnout in those counties in the 2012 election.
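One caution when reading the correlations between the 1980 and 2012 values: a correlation coefficient measures how closely two sets of values move together, not whether they went up or down.  The sketch below uses made-up turnout numbers (not the TEC data) to show two scenarios with the same correlation but opposite changes in the average.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)

def mean_change(before, after):
    """Average difference between paired values."""
    return sum(b - a for a, b in zip(before, after)) / len(before)

# Hypothetical turnout percentages for five counties.
turnout_1980 = [40.0, 45.0, 50.0, 55.0, 60.0]
turnout_2012_up = [v + 10.0 for v in turnout_1980]     # every county rises by 10 points
turnout_2012_down = [v - 10.0 for v in turnout_1980]   # every county falls by 10 points

r_up = pearson_r(turnout_1980, turnout_2012_up)
r_down = pearson_r(turnout_1980, turnout_2012_down)

print(round(r_up, 3), mean_change(turnout_1980, turnout_2012_up))
print(round(r_down, 3), mean_change(turnout_1980, turnout_2012_down))
```

Both scenarios give the same correlation because the ranking of the counties is unchanged.  To describe an increase or decrease between the two elections, the means or the county-by-county differences would need to be compared directly.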

Conclusion

Given the presented results, there are several trends in the analyzed data that can help the TEC and the Texas governor better understand the voting patterns of the state.  Counties with larger Hispanic populations tend to vote more Democratic but have lower voter turnout.  If the governor is a Democrat or wants Texas to have a Democratic result, the governor should promote voting in Hispanic communities to encourage those communities to get out and vote in the elections.  It can also be concluded that over the 32 years between the 1980 and 2012 elections, county-level voter turnout and percent Democratic vote stayed moderately similar, so counties that were high in 1980 tended to be high in 2012.  This can all be useful information for the TEC and the governor as they work to better understand voting trends in Texas.

Sources

U.S. Census Bureau (2010). QT-P10: Hispanic or Latino by Type. Retrieved from https://factfinder.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t