Back to Article
NY HEAT Analysis
Download Source

NY HEAT Analysis

Introduction

Welcome!

This notebook, written in the R programming language, contains all of the code used to produce the findings in our report, starting from raw data.

Win Climate’s open-source reports reflect our commitment to transparency and openness.

  • To download the code, click Download Source above.

  • To go back to the report, click Back to Article above.

  • Got questions about the code? .

Import packages

In [1]:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(gt)
library(tigris)
To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.
library(sf)
library(tidycensus)
options(tigris_use_cache=TRUE, tigris_year=2019)

Set report parameters

Here, we specify the county (or groups of counties), state, year, and affordability cutoffs to analyze. This allows us to easily create reports for different times, places, and scenarios.

In [2]:
REPORT_TYPE <- Sys.getenv("REPORT_TYPE")
REGION_NAME <- Sys.getenv("REGION_NAME")
REGION_COUNTIES <- Sys.getenv("REGION_COUNTIES")
REGION_GROUPS <- Sys.getenv("REGION_GROUPS")
REGION_MAPPING <- Sys.getenv("REGION_MAPPING")
STATE_FIPS <- Sys.getenv("STATE_FIPS")
burdened_cutoff <- 0.06
pums_year <- 2022
In [3]:
county_names <- REGION_COUNTIES |> stringr::str_split_1("\\|")
region_groups <- REGION_GROUPS |> stringr::str_split_1("\\|")
region_mapping <- tibble(
  county_name = county_names,
  region_group = REGION_MAPPING |> stringr::str_split_1("\\|")
)

Load data

In [4]:
pums_path <- paste0("/mnt/data/pums_1_22_", STATE_FIPS, ".Rds")
puma_county_mapping_path <- "/mnt/data/buildings2/geocorr/puma_county_mapping.csv"
county_lmi_path <- "/mnt/data/buildings2/HUD_2021_income_limits_by_NY_county.xlsx"

PUMS

Our analysis relies on the 2022 PUMS dataset from the US Census. We fetch the data directly from the Census API, and only grab a handful of the variables available within PUMS (vars).

In [5]:
vars <- c("BLD", "PUMA", "HFL", "TEN", "YRBLT", "HINCP", "NP", "MV", "BDSP", "ELEP", "FULP", "GASP", "MRGP", "RNTP", "RMSP", "ELEFP", "FULFP", "GASFP")

# Get data from the Census API
pums_path <- paste0("/mnt/data/pums_1_22_", STATE_FIPS, ".Rds")

# Get data from the Census API
if (file.exists(pums_path)) {
  raw_data <- readRDS(pums_path)
} else {
  raw_data <- get_pums(variables = vars,
                 state = STATE_FIPS,
                 year = 2022,
                 survey = "acs1",
                 rep_weights = "housing",
                 recode = TRUE) |>
          distinct(SERIALNO, .keep_all = TRUE) # housing units only, not people
  saveRDS(raw_data, pums_path)
}

Let’s take a look ten random records.

In [6]:
set.seed(0)
raw_data |>
  select(
    BLD_label, BDSP, YRBLT_label, TEN_label,
    PUMA, HINCP, NP, ELEFP_label, ELEP,
    GASFP_label, GASP, FULFP_label, FULP, WGTP
  ) |>
    slice_sample(n=10) |>
  gt()
BLD_label BDSP YRBLT_label TEN_label PUMA HINCP NP ELEFP_label ELEP GASFP_label GASP FULFP_label FULP WGTP
One-family house detached 3 1940 to 1949 Owned with mortgage or loan (include home equity loans) 03102 73800 2 Valid monthly electricity cost in ELEP 230 No charge or gas not used 3 No charge or fuel other than gas or electricity not used 2 5
20-49 Apartments 2 1970 to 1979 Rented 04003 170000 1 Valid monthly electricity cost in ELEP 400 Valid monthly gas cost in GASP 40 No charge or fuel other than gas or electricity not used 2 18
One-family house detached 4 1960 to 1969 Owned with mortgage or loan (include home equity loans) 00703 190000 4 Valid monthly electricity cost in ELEP 300 Valid monthly gas cost in GASP 300 No charge or fuel other than gas or electricity not used 2 26
One-family house detached 4 1990 to 1999 Owned with mortgage or loan (include home equity loans) 00901 77000 1 Valid monthly electricity cost in ELEP 130 Valid monthly gas cost in GASP 330 No charge or fuel other than gas or electricity not used 2 24
10-19 Apartments 2 1990 to 1999 Rented 03308 16800 1 Valid monthly electricity cost in ELEP 60 Valid monthly gas cost in GASP 80 No charge or fuel other than gas or electricity not used 2 22
N/A (GQ) -1 NA N/A (GQ/vacant) 00905 -60000 1 N/A (GQ/vacant) 2 N/A (GQ/vacant) 3 N/A (GQ/vacant) 2 0
One-family house attached 3 1939 or earlier Owned with mortgage or loan (include home equity loans) 04111 34300 2 Valid monthly electricity cost in ELEP 100 Valid monthly gas cost in GASP 110 No charge or fuel other than gas or electricity not used 2 15
N/A (GQ) -1 NA N/A (GQ/vacant) 02500 -60000 1 N/A (GQ/vacant) 2 N/A (GQ/vacant) 3 N/A (GQ/vacant) 2 0
One-family house attached 3 1990 to 1999 Owned with mortgage or loan (include home equity loans) 03902 313000 4 Valid monthly electricity cost in ELEP 220 Valid monthly gas cost in GASP 150 No charge or fuel other than gas or electricity not used 2 16
One-family house detached 3 1939 or earlier Owned free and clear 03313 40600 2 Valid monthly electricity cost in ELEP 70 Valid monthly gas cost in GASP 50 No charge or fuel other than gas or electricity not used 2 44

PUMS consists of so-called “microdata”: anonymized, individual-level households survey results. Every row contains the answers that a real person gave on a questionnaire. For our purposes, the key variables are:

variables example value description
HINCP $113,000 The household’s income over the past 12 months
ELEP $170 The household’s monthly electricity cost
GASP $80 The household’s monthly natural gas cost
FULP $100 The household’s annual cost for fuels other than electricity and natural gas


For definitions of other PUMS variables, consult the official data dictionary. To learn how to work with PUMS data, check out this tutorial.

PUMAs

What if we want to calculate the average annual income of a specific area? PUMS doesn’t tell us the city or county each household is located in, only the public use microdata area (PUMA).

PUMAs are Census geographies that contain at least 100,000 people and are entirely within a single state. They are built from census tracts and counties, and may or may not be similar to other recognized geographic boundaries. (So, if you are interested in pulling data about block groups, census tracts, or other small areas, you can’t use PUMS data.)

In densely populated areas like New York City, PUMAs are closely aligned to Community Districts, which are smaller than counties. In some places, PUMAS are equivalent to counties. In the least populated places, a single PUMA can span several entire counties. These scales are granular enough for our purposes.

PUMAS to counties

PUMAs are represented by numeric codes like 01101 (see the PUMA column above). Reporting the number of energy burdened families for PUMA 01101 won’t mean anything to anyone, so need to associate PUMAS with counties in order make our results digestible to readers. Let’s import a PUMA-to-county mapping.

In [7]:
puma_county_mapping <- read_csv(puma_county_mapping_path, skip=1) |>
  janitor::clean_names() |>
  mutate(county_code = as.character(county_code)) |>
  mutate(county_name = str_replace(county_name, " NY", ""))
Rows: 187 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): PUMA (2012), State abbr., County name, PUMA12 name
dbl (5): State code, County code, Total housing units (2020 Census), county-...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
puma_county_mapping |> head() |> gt()
state_code puma_2012 county_code state_abbr county_name puma12_name total_housing_units_2020_census county_to_puma12_allocation_factor puma12_to_county_allocation_factor
36 00100 36089 NY St. Lawrence St. Lawrence County 51612 1 1.0000
36 00200 36019 NY Clinton Clinton, Franklin, Essex & Hamilton Counties 37315 1 0.3904
36 00200 36031 NY Essex Clinton, Franklin, Essex & Hamilton Counties 25123 1 0.2628
36 00200 36033 NY Franklin Clinton, Franklin, Essex & Hamilton Counties 25318 1 0.2649
36 00200 36041 NY Hamilton Clinton, Franklin, Essex & Hamilton Counties 7826 1 0.0819
36 00300 36113 NY Warren Warren & Washington Counties 39928 1 0.5793

One PUMA may be associated with several low-population counties: in the example above, PUMA 00200 is associated with Clinton, Essex, and Franklin counties. (The reverse situation can also occur: Queens county is made up of several geographically small PUMAS, each with a population of roughly 100,000 people.)

If our goal is to calculate the energy burdens for each county, this presents a problem: we can calculate metrics for a PUMA, both how do we break them down for each constituent county? We’ll use each county’s puma12_to_county_allocation_factor, which simply reflects the percentage of the PUMA’s population found in that county.

LMI cutoffs

Finally, we want to report our findings in terms the Area Median Income (AMI). AMI is defined as the midpoint of a particular area’s income distribution, which is calculated every year by the Department of Housing and Urban Development (HUD).

The “Area” in Area Median Income is the county. 80% is the cutoff associated with Low and Moderate Income (LMI) households. Let’s import a dataset from HUD that lists 80% of each county’s AMI.

In [8]:
county_lmi <- readxl::read_excel(county_lmi_path) |>
  pivot_longer(-County, names_to = "NP") |>
  rename(lmi_cutoff = value) |>
  mutate(NP = as.numeric(NP))

county_lmi |> head() |> gt()
County NP lmi_cutoff
Albany 1 53550
Albany 2 61200
Albany 3 68850
Albany 4 76500
Albany 5 82650
Albany 6 88750

Note that AMI isn’t defined as the median income of all households in a given county. There’s actually an AMI for each household size (the NP column for number of persons).

In the example above, the LMI cutoff in Albany county for a two-person household is $61,200, for a three-person household is $68,850, and so on. This represents the median household income of all 2-person households in the county, multiplified by .8.

Calculate household-level burdens and savings

With our datasets imported, we can construct the household-level variables we need for our analysis.

In [9]:
data <- raw_data |>
  filter(HINCP > 0) |> # Remove surveys that didn't report income (0). We assume non-respondense is uncorrelated with income.
  # Negative income usually means a business loss, and that is both rare and not relevant for our purposes
  mutate(
    # Denote if responded reported any energy costs or not
    bill_in_rent = (ELEFP == "1") | (GASFP == "1") | (FULP == "1"),
    # ACS uses low values of these columns to denote different reasons for zeros, not actual values, so we need to set them to zero
    GASP = case_when(GASP <= 4 ~ 0, .default = GASP),
    ELEP = case_when(ELEP <= 3 ~ 0, .default = ELEP),
    FULP = case_when(FULP <= 3 ~ 0, .default = FULP),

    # Calculate household's yearly energy bill
    yearly_energy_bill = (ELEP + GASP) * 12 + FULP,
    # Find household's energy burden percentage, whether they're burdened
    energy_burden_pct = yearly_energy_bill / HINCP,
    is_energy_burdened = energy_burden_pct > burdened_cutoff,
    # Find difference between how much household must spend on energy
    # to be burdened, and how much they actually spend
    burdened_income_cutoff = burdened_cutoff * HINCP,
    annual_energy_burden = yearly_energy_bill - burdened_income_cutoff,
    monthly_energy_burden = annual_energy_burden / 12,

    # NY HEAT only applies to electricity and gas, not delivered fuels.
    # Repeat burden calculation for utility energy alone.
    yearly_utility_bill = (ELEP + GASP) * 12, # `utility` bill does not include delivered fuel
    utility_burden_pct = yearly_utility_bill / (HINCP+1),
    monthly_utility_burden = (yearly_utility_bill - (burdened_cutoff * HINCP)) / 12,
    is_utility_burdened = utility_burden_pct > burdened_cutoff) |>
  filter(!bill_in_rent)

Next, we link households to counties through PUMAs. This will cause households that live in PUMAs that span multiple counties to be duplicated, which we deal with later on. We also bring in each county’s LMI cutoff and economic region.

The survey responses in PUMS come from a sample of the population in each PUMA. To report statistics about the PUMA as a whole, and therefore for counties, we need to us special weights supplied by the Census (the WGTP column).

And because one PUMA may map to multiple columns, the weight for each PUMS record actually comes from combining two separate weights: WGTP from the ACS, which tells us how much to weight a given observation for a given PUMA, and puma12_to_county_allocation_factor that tells us how to allocate a PUMA to a county.

For instance, for a PUMA that is totally contained within a county, this value will just be WGTP, but if a PUMA is split 50% between two counties, then each observation would be WGTP times 0.5.

In [10]:
data <- data |>
  inner_join(puma_county_mapping, by = c("PUMA" = "puma_2012"),
            relationship = 'many-to-many') |>
  mutate(wt = WGTP * puma12_to_county_allocation_factor) |>
  inner_join(county_lmi, by = c("county_name" = "County", "NP")) |>
  mutate(is_lmi = HINCP <= lmi_cutoff) |>
  filter(county_name %in% county_names,
         wt > 0)

For flexibility, each row can either be generically called “Area”, or else it can get a region name.

In [11]:
# Depending on whether we've got multiple groups or not, we add another label
if (length(region_groups) > 1) {
  data <- data |>
    right_join(region_mapping, by = "county_name")
} else {
  data$region_group <- "Area"
}

Let’s take a look at where we stand.

In [12]:
data |>
    select(HINCP, yearly_energy_bill, is_energy_burdened, energy_burden_pct,
  yearly_utility_bill, is_utility_burdened, utility_burden_pct, monthly_utility_burden) |>
    head() |> gt()
HINCP yearly_energy_bill is_energy_burdened energy_burden_pct yearly_utility_bill is_utility_burdened utility_burden_pct monthly_utility_burden
179500 2760 FALSE 0.01537604 2760 FALSE 0.015375959 -667.50
700500 11760 FALSE 0.01678801 11760 FALSE 0.016787985 -2522.50
189690 4800 FALSE 0.02530444 4800 FALSE 0.025304311 -548.45
38000 1440 FALSE 0.03789474 1440 FALSE 0.037893740 -70.00
34850 3360 TRUE 0.09641320 3360 TRUE 0.096410433 105.75
125000 1080 FALSE 0.00864000 1080 FALSE 0.008639931 -535.00

We know whether each household in our sample is_energy_burdened: whether they pay more than 6% of their annual income on energy.

We’ve also calculated their energy_burden_pct: the percent of their income they actually spend on energy.

We also know their utility_burden_pct and is_utility_burdened: what percent of their annual income they’re spending on electricity and natural gas bills alone–excluding delivered fuels–and whether this exceeds 6% of their annual income.

Finally, we’ve calculated how much money utility-burdened households would save every month (monthly_utility_burden) if their bills were downsized to just 6% of their income.

Aggregate burdens and savings across the state

All that’s left now is to report the following metrics for different geographies, starting with the entire state:

  • households_included: the number of households
  • median_income: the median income of households in the geography
  • pct_energy_burdened: the percent of households that are energy burdened
  • avg_monthly_bill_of_burdened: the average monthly energy bills of those households, before NY HEAT
  • utility_burden_of_burdened: how much utility burdened households stand to save every month, after NY HEAT
In [13]:
total_rollup <- data |>
  summarise(
    households_included = sum(wt),
    median_income = Hmisc::wtd.quantile(HINCP, weights = wt, probs = 0.5),
    pct_energy_burdened = weighted.mean(is_energy_burdened, w = wt),
    pct_utility_burdened = weighted.mean(is_utility_burdened, w = wt),
    avg_monthly_bill_of_burdened = weighted.mean(case_when(
      is_energy_burdened ~ yearly_energy_bill / 12), w = wt,
      na.rm = TRUE),
    utility_burden_of_burdened = mean(case_when(
      is_utility_burdened ~ monthly_utility_burden), w = wt,
      na.rm = TRUE)) |>
mutate(region = REGION_NAME,
       grouping = "Total")

Next, we crunch the same numbers for each PUMA:

In [14]:
puma_rollup <- data |>
  summarise(
    households_included = sum(wt),
    median_income = Hmisc::wtd.quantile(HINCP, weights = wt, probs = 0.5),
    pct_energy_burdened = weighted.mean(is_energy_burdened, w = wt),
    pct_utility_burdened = weighted.mean(is_utility_burdened, w =wt),
    avg_monthly_bill_of_burdened = weighted.mean(case_when(
      is_energy_burdened ~ yearly_energy_bill/12), w = wt,
      na.rm = TRUE),
    utility_burden_of_burdened = mean(case_when(
      is_utility_burdened ~ monthly_utility_burden), w = wt,
      na.rm = TRUE),
    .by = c(puma12_name, PUMA)) |>
  filter(households_included >= 5000) |>
  mutate(grouping = "Area",
         unit_id = paste0(STATE_FIPS, PUMA)) |>
  rename(region = puma12_name)
In [15]:
county_rollup <- data |>
  summarise(
    households_included = sum(wt),
    median_income = Hmisc::wtd.quantile(HINCP, weights = wt, probs = 0.5),
    pct_energy_burdened = weighted.mean(is_energy_burdened, w = wt),
    pct_utility_burdened = weighted.mean(is_utility_burdened, w =wt),
    avg_monthly_bill_of_burdened = weighted.mean(case_when(
      is_energy_burdened ~ yearly_energy_bill/12), w = wt,
      na.rm = TRUE),
    utility_burden_of_burdened = mean(case_when(
      is_utility_burdened ~ monthly_utility_burden), w = wt,
      na.rm = TRUE),
    .by = c(county_name, region_group)) |>
  filter(households_included >= 5000) |>
  mutate(grouping = region_group,
         unit_id = county_name) |>
  rename(region = county_name)

Tabulate burdens and savings across the region

Let’s place each level of rolled-up metrics into a single table we can embed into the report.

In [16]:
rollup <- function(component_rollup) {
  bind_rows(total_rollup, component_rollup) |>
    mutate(region = str_replace(region, "^.*Community", "Community")) |>
    select(region, grouping, pct_energy_burdened, avg_monthly_bill_of_burdened, utility_burden_of_burdened) |>
    group_by(grouping) |>
    gt() |>
    row_group_order(c("Total", region_groups)) |>
    fmt_currency(c(avg_monthly_bill_of_burdened,
      utility_burden_of_burdened), decimals = 0) |>
    fmt_percent(c(pct_energy_burdened), decimals = 0) |>
    cols_label(
      region = "Region",
      pct_energy_burdened = "Homes with high energy burdens (%)",
      avg_monthly_bill_of_burdened = "Avg. monthly energy bills of high-burden homes",
      utility_burden_of_burdened = "Avg. monthly savings for high-burden homes under NY HEAT",
    ) |>
    tab_footnote("Excluding households with utility bills in rent, or missing fuel or income information.\nSource: 2022 ACS. Win Climate, 2024")
}
In [17]:
if (REPORT_TYPE == "county") {
  tbl_results <- rollup(puma_rollup)
} else {
  tbl_results <- rollup(county_rollup)
}
tbl_results
Region Homes with high energy burdens (%) Avg. monthly energy bills of high-burden homes Avg. monthly savings for high-burden homes under NY HEAT
Total
Hudson Valley 24% $361 $154
Upper Hudson Valley
Albany 19% $259 $114
Sullivan 36% $335 $142
Rensselaer 23% $293 $119
Columbia 31% $322 $139
Greene 31% $322 $139
Mid-Hudson Valley
Orange 29% $379 $166
Dutchess 24% $344 $137
Ulster 31% $344 $142
Lower Hudson Valley
Rockland 27% $393 $189
Westchester 21% $413 $181
Putnam 19% $363 $116
Excluding households with utility bills in rent, or missing fuel or income information. Source: 2022 ACS. Win Climate, 2024

Map burdens and savings across the region

Now, let’s create some maps of our metrics.

Let’s create some map-creation functions that we can reuse.

In [18]:
pct_burden_map <- function(boundaries, rollup) {
  boundaries |>
    right_join(rollup, by = "unit_id") |>
    mutate(region = str_replace(region, ".*--","")) |>
    mutate(label = paste0(region," (", scales::percent(pct_energy_burdened, 1) ,")")) |>
    ggplot(aes(fill=pct_energy_burdened, label=label)) +
    geom_sf(color = "black", alpha=.5) +
    ggrepel::geom_label_repel(
      size = 3,
      aes(geometry = geometry),
      fill = "white",
      stat = "sf_coordinates",
      min.segment.length = 0
      ) +
    scale_fill_distiller(palette = "Spectral",
                         name = "Percentage",
                         limits = c(0, .5),
                         labels = scales::percent_format(scale=100)) +
    theme_void() +
    labs(tag = "Win Climate, 2024\n(Source: 2022 ACS)") +
    theme(plot.tag.position = "bottomleft", text = element_text(size=8, color = "#747a7f"))
}

savings_amount_map <- function(boundaries, rollup) {
  boundaries |>
    right_join(rollup, by = "unit_id") |>
    mutate(region = str_replace(region, ".*--","")) |>
    mutate(label = paste0(region," (", scales::dollar(utility_burden_of_burdened, 1) ,")")) |>
    ggplot(aes(fill=utility_burden_of_burdened, label=label)) +
    geom_sf(color = "black") +
    ggrepel::geom_label_repel(
      size = 3,
      aes(geometry = geometry),
      fill = "white",
      stat = "sf_coordinates",
      min.segment.length = 0
      ) +
    scale_fill_distiller(palette = "Greens",
                         direction = 1,
                         name = "NY HEAT Savings",
                         labels = scales::dollar_format(accuracy = 1),
                         limits = c(90, 220)) +
    theme_void() +
    labs(tag = "Win Climate, 2024\n(Source: 2022 ACS)") +
    theme(plot.tag.position = "bottomleft", text = element_text(size=8, color = "#747a7f"))
}
In [19]:
if (REPORT_TYPE == "county") {
    rollup = puma_rollup
    boundaries <- pumas(STATE_FIPS, 2019) |>
      mutate(unit_id = GEOID10)
  } else if (REPORT_TYPE == "region") {
    rollup = county_rollup
    boundaries <- counties(STATE_FIPS, 2019) |>
      mutate(unit_id = NAME)
}
Retrieving data for the year 2019

Now we can use our functions to create map of energy burdens…

In [20]:
fig_map_pct <- pct_burden_map(boundaries, rollup)
ggsave("../img/fig_map_pct.pdf", plot = fig_map_pct, width = 10, height = 8)
Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
give correct results for longitude/latitude data
fig_map_pct
Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
give correct results for longitude/latitude data
Percent of households with high energy burdens, by area

…and savings under NY-HEAT.

In [21]:
fig_map_amt <- savings_amount_map(boundaries, rollup)
ggsave("../img/fig_map_pct.pdf", plot = fig_map_amt, width = 10, height = 8)
Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
give correct results for longitude/latitude data
fig_map_amt
Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
give correct results for longitude/latitude data
Avg. monthly savings for high burden households under NY HEAT, by area

Finally, we expert a few variables we’ll need to emded into the report.

In [22]:
saveRDS(total_rollup, file = "/mnt/tmp/ny_heat_rollup.rds")
saveRDS(tbl_results, file = "/mnt/tmp/ny_heat_tbl_results.rds")