The Yost index is a composite measure of socioeconomic status (SES) based on seven census variables that, taken together, capture its most important dimensions: average educational level, median income, poverty rate, median housing value, median rent, unemployment rate, and employment mix (white versus blue collar). There are indexes both more and less complex than this, but I find that the Yost index serves my projects well. It was first used in a 2001 paper by Kathleen Yost and others and has been gaining traction in the field of cancer surveillance. There are obvious advantages to applying the same SES measure across disparate studies.
Since I found myself calculating this index for multiple projects, I decided to publish the results for everyone’s benefit. The results consist of a single file of about 20 megabytes, available as a zipped, comma-separated file here. It contains five fields:
- GEOID, a 14-digit unique geographic identifier defined by the census that can be linked with other census files and map files
- year, identifying the five-year period for which the Yost index was calculated. Values range from 2007-2011 to 2014-2018. In all cases, the 2010 census block group definitions apply.
- name, the block group, census tract, county and state written in words
- score, the score generated from the factor analysis. Few users will need this, but I included it in case anyone wants to group the block groups other than by percentiles.
- index, the Yost index representing the percentile rank of the block group in the entire United States, where 1=most affluent and 100=most deprived. The index was calculated separately for each year, not pooled across years.
Notes on how the index was calculated:
- The seven input variables are as defined by the SEER program here.
- Data were obtained from the National Historical Geographic Information System site; the data can also be obtained directly from the Census.
- Block groups with fewer than 100 people, fewer than 30 housing units, or more than one-third of the population living in group quarters were deleted because their results are not considered stable or reliable. This rule, first applied in a 2001 paper by Ana Diez Roux and others, reduced the number of block groups by about 2%, from about 220,000 to about 216,000. The exact number of block groups varied by year because some moved above or below one of these thresholds between years. Because group quarters populations were available only at the census tract level, not the block group level, all block groups in a tract were removed when the tract had more than one-third of its population living in group quarters.
- Missing values were imputed using the classification and regression tree (`cart`) method in the R package `mice`, using the default of five iterations and five imputations. The average of the imputations for each missing value was used. 16% of records were missing median rent, 5% were missing median housing value, and 2% were missing median income. Education level had no missing values. The other variables were missing less than 1% of the time. 19% of observations were missing values for one variable, 0.6% were missing values for two variables, and a negligible number were missing values for three or more variables.
- Census values were converted to ranks such that a rank of 1 represented the most affluent block group and a rank of approximately 216,000 the most deprived. For ties, the average rank was used. Using ranks helped standardize values that are on quite different scales, some of which are top-coded and bottom-coded. For example, in 2014-2018 the minimum median rent was 99 dollars and the maximum was 3,500 dollars.
- Factor analysis was performed on the ranked measures using the `fa()` function in the R package `psych`, with the maximum likelihood factoring method. The first factor was retained (included in the file as `score`) and converted into percentiles.
- Beginning with the third year of data, the Census renamed and/or renumbered a very small number of block groups. For example, Shannon County, South Dakota was renamed Oglala Lakota County. The file reports the GEOIDs and names as reported – no attempt was made to standardize them across years.
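The main steps above (exclusion, ranking, factor analysis, percentiles) can be sketched in a few lines of R. This is a minimal illustration on simulated data, not the code used to produce the file. The column names are hypothetical, the direction in which each variable is ranked is my assumption, and imputation is skipped because the simulated data have no missing values. Base R's `factanal()`, which also fits by maximum likelihood, stands in for `psych::fa()` so that the sketch runs without extra packages, and the ceiling-based percentile formula is likewise an assumption.

```r
# Simulated stand-in for block-group data; all column names are hypothetical.
set.seed(1)
n <- 500
ses <- rnorm(n)  # latent affluence that drives all seven variables
bg <- data.frame(
  population    = sample(50:3000, n, replace = TRUE),
  housing_units = sample(10:1200, n, replace = TRUE),
  gq_share      = runif(n, 0, 0.5),  # tract-level group quarters share
  education     = 13 + 2 * ses + rnorm(n),
  income        = 60000 + 20000 * ses + rnorm(n, sd = 10000),
  poverty       = 15 - 5 * ses + rnorm(n, sd = 3),
  house_value   = 250000 + 90000 * ses + rnorm(n, sd = 50000),
  rent          = 1100 + 300 * ses + rnorm(n, sd = 200),
  unemployment  = 6 - 2 * ses + rnorm(n, sd = 1.5),
  white_collar  = 60 + 10 * ses + rnorm(n, sd = 6)
)

# 1. Exclusion rule: drop small or group-quarters-dominated block groups.
keep <- bg$population >= 100 & bg$housing_units >= 30 & bg$gq_share <= 1 / 3
bg <- bg[keep, ]

# 2. Rank each variable so that 1 = most affluent; ties get the average rank.
#    Which direction counts as "affluent" is assumed here for illustration.
higher_is_affluent <- c("education", "income", "house_value", "rent", "white_collar")
higher_is_deprived <- c("poverty", "unemployment")
ranks <- cbind(
  sapply(bg[higher_is_affluent], function(x) rank(-x, ties.method = "average")),
  sapply(bg[higher_is_deprived], function(x) rank(x, ties.method = "average"))
)

# 3. One-factor maximum likelihood factor analysis on the ranks
#    (factanal() is a stand-in for psych::fa()).
fit <- factanal(ranks, factors = 1, scores = "regression")
score <- fit$scores[, 1]

# The sign of a factor is arbitrary; orient it so a higher score = more deprived.
if (cor(score, ranks[, "income"]) < 0) score <- -score

# 4. Convert scores to percentiles, 1 = most affluent, 100 = most deprived.
index <- ceiling(100 * rank(score, ties.method = "average") / length(score))
```

Working on ranks rather than raw values means the top-coded and bottom-coded census figures enter the factor analysis only through their ordering, which is the motivation given above for ranking in the first place.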
More detailed information on the calculation method, including the R code that was used to perform the calculations, is available by request.
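For anyone loading the published file into R, the following sketch shows one way to work with the five fields. It writes a tiny made-up extract first so that it runs on its own: the file path, GEOIDs, names, scores, and index values are invented placeholders, and the assumptions that `year` is stored as text such as "2014-2018" and that `name` is quoted should be checked against the real file.

```r
# Write a tiny made-up extract so the example is self-contained.
# Every value below is an invented placeholder, not real data.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "GEOID,year,name,score,index",
  "01000000000001,2014-2018,\"Block Group 1, Example Tract, Example County, Example State\",-1.20,8",
  "01000000000002,2014-2018,\"Block Group 2, Example Tract, Example County, Example State\",0.10,55",
  "01000000000003,2014-2018,\"Block Group 3, Example Tract, Example County, Example State\",1.45,97",
  "01000000000001,2013-2017,\"Block Group 1, Example Tract, Example County, Example State\",-1.10,10"
), path)

# Read GEOID as character. Left to itself, read.csv() would parse it as a
# number and drop the leading zero, which breaks links to other census files.
yost <- read.csv(path, colClasses = c(GEOID = "character"))

# Keep a single five-year period, since the index was calculated
# separately for each period rather than pooled.
latest <- yost[yost$year == "2014-2018", ]

# Collapse the 1-100 index into quintiles (1 = most affluent fifth).
latest$quintile <- cut(latest$index, breaks = seq(0, 100, by = 20), labels = 1:5)
```

Users who want groupings other than percentile-based ones can apply the same approach to the `score` column with their own cut points.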