PM 566 Midterm

Stephanie Lee 2023-05-08

Introduction

In recent years, Medicare spending has become a significant concern for policymakers, as the program represents a substantial portion of the federal budget. According to the Centers for Medicare and Medicaid Services (CMS), in 2020, Medicare spending accounted for approximately 16% of the federal budget and totaled $796.2 billion.

One area of particular concern is inpatient hospitalizations, which represent a significant portion of Medicare spending. In 2018, Medicare spent $203.1 billion on inpatient hospitalizations, accounting for 40% of all Medicare fee-for-service spending. Additionally, the average Medicare payment for an inpatient hospitalization was $14,219 in 2018, with substantial variation across different geographic regions and hospital types.

Given the high costs associated with Medicare inpatient hospitalizations, it is essential to understand the patterns and trends in healthcare utilization by Original Medicare Part A beneficiaries. The Medicare Inpatient Hospitals by Geography and Service medicareset provides a rich source of information for exploring the variation in healthcare utilization and costs by geographic region and Medicare Severity Diagnosis Related Group (DRG).

Research Question:

How does Medicare inpatient hospitalization rates and payments vary by geographic region and Medicare Severity Diagnosis Related Group (DRG)?

Follow-up Questions:

What are the specific DRG codes with the highest rates of Medicare inpatient hospitalizations, and how do they vary across different geographic regions?
What factors contribute to the variation in Medicare payments for inpatient hospitalizations across different geographic regions and DRG codes?

Methods

Data Sources

The Medicare Inpatient Hospitals by Geography and Service dataset is a critical source of information for understanding healthcare utilization and spending among Original Medicare Part A beneficiaries. The dataset is publicly accessible on the CMS website and the version downloaded was the latest available from 2020. According to CMS, the dataset is derived from the Medicare Provider Analysis and Review (MedPAR) file, which contains information on hospital inpatient stays for Medicare beneficiaries. The MedPAR data is then aggregated by CMS into the Medicare Inpatient Hospital Discharge File (MIHDF), which is the source of the Medicare Inpatient Hospitals by Geography and Service dataset.

The MIHDF contains information on hospital discharges for Original Medicare Part A beneficiaries by inpatient prospective payment system (IPPS) hospitals. The dataset is organized by geography and Medicare Severity Diagnosis Related Group (MS-DRG), a classification system used to group patients with similar diagnoses and treatments. The dataset includes information on the number of discharges, payments, and submitted charges for each MS-DRG in each geographic area.

In addition to the DRG codes, the CMS Major Diagnostic Categories (MDCs) are also included, also found on the CMS MS-DRG Definitions Manual. MDCs group DRGs based on the diagnosis that led to the hospital admission. The MDCs are broader categories than the DRGs and can provide additional insights into the types of medical conditions that are most commonly treated in hospitals.

For further information, CMS provides documentation for the following:

Data Wrangling

After reading in data, checking the dimensions, headers, and footers, and carefully checking for variable names and types. There are a total of 9 variables and 26985 observations. The only missing values are within the geographic code varaible. This is because the data aggregated at the national level was given a value of ‘NA’ to distinguish from the other observations collected at the state level. The observations are categorized by state, with two additional regions: District of Columbia and aggregated data at the national level. Though there are 752 distinct diagnostic codes, there are only 735 diagnostic descriptions. For the diagnostic descriptions with more than one diagnostic code, they seem to be more vague like a certain type of procedure or containing the words “or” or “other”. There are 17 total DRG descriptions assigned to more than one code. I assume these diagnosis descriptions are most likely further specified by the other diagnostic codes, which may be difficult for a non-medical expert to discern or irrelevant for data analysis use. Nevertheless, these descriptions only make up 3% of the observations in the dataset. Overall, the dataset was already fairly usable and only needed some variable data type changes, such as conversions of codes to categorical variables. Furthermore, MDCs were also joined to the dataset for broader diagnosis categories.

dim(cms)
head(cms)
tail(cms)

#Variables
str(cms)

dim(mdcs)
head(mdcs)
tail(mdcs)

#Take a closer look at the key variables
distinct(cms, `DRG_Cd`)
distinct(cms, `Rndrng_Prvdr_Geo_Cd`)
distinct(cms, `DRG_Desc`)
distinct(cms, `Rndrng_Prvdr_Geo_Desc`)

# Find DRG descriptions assigned to more than one code.
drgs <- distinct(cms,`DRG_Cd`,`DRG_Desc`)
drgs_dup <- drgs[duplicated(drgs$DRG_Desc),  ]
drgs_dup

# Count the occurrences of DRG descriptions assigned to more than one code
drgs_dup_vec <- drgs_dup$DRG_Desc
table(cms$DRG_Desc %in% drgs_dup_vec)

cms <- left_join(cms, select(mdcs, 'DRG_Cd','MDC', 'MDC_Desc'), by = 'DRG_Cd')

#Check missing values
summary(is.na(cms))

cms$Rndrng_Prvdr_Geo_Cd <- factor(cms$Rndrng_Prvdr_Geo_Cd)
cms$DRG_Cd <- factor(cms$DRG_Cd)
cms$MDC <- factor(cms$MDC)

#Remove NAs except for geo codes for the National Level (this is intentionally set to NA)
cms <- cms %>%
  filter(!is.na(MDC) & !is.na(MDC_Desc))

# Define region groups
Northeast <- c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New York", "New Jersey", "Pennsylvania")
South <- c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "District of Columbia", "West Virginia", "Alabama", "Kentucky", "Mississippi", "Tennessee", "Arkansas", "Louisiana", "Oklahoma", "Texas")
Midwest <- c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin", "Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota")
West <- c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming", "Alaska", "California", "Hawaii", "Oregon", "Washington")

# Create a new column for region and assign each state to a region
cms$Region[cms$Rndrng_Prvdr_Geo_Desc %in% Northeast] <- "Northeast"
cms$Region[cms$Rndrng_Prvdr_Geo_Desc %in% South] <- "South"
cms$Region[cms$Rndrng_Prvdr_Geo_Desc %in% Midwest] <- "Midwest"
cms$Region[cms$Rndrng_Prvdr_Geo_Desc %in% West] <- "West"
cms$Region <- as.factor(cms$Region)

summary(cms)

Preliminary Results

Summary Tables

national_avg_discharges <- cms %>%
  filter(Rndrng_Prvdr_Geo_Lvl == "National") %>%
  group_by(MDC, MDC_Desc) %>%
  summarize(total_discharges = sum(Tot_Dschrgs)) %>% 
  arrange(desc(total_discharges))

## `summarise()` has grouped output by 'MDC'. You can override using the `.groups`
## argument.

kable(national_avg_discharges, col.names = c("MDC Code", "MDC Description", "Total Annual Discharges"), caption = "Table 1. Total National Discharges By Diagnostic Category") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)

Table 1. Total National Discharges By Diagnostic Category

MDC Code	MDC Description	Total Annual Discharges
5	Diseases and Disorders of the Circulatory System	1596376
4	Diseases and Disorders of the Respiratory System	1024340
8	Diseases and Disorders of the Musculoskeletal System and Connective Tissue	998311
18	Infectious and Parasitic Diseases (Systemic or Unspecified Sites)	924623
6	Diseases and Disorders of the Digestive System	758007
1	Diseases and Disorders of the Nervous System	646020
11	Diseases and Disorders of the Kidney and Urinary Tract	588358
10	Endocrine, Nutritional and Metabolic Diseases and Disorders	315110
7	Diseases and Disorders of the Hepatobiliary System and Pancreas	218604
9	Diseases and Disorders of the Skin, Subcutaneous Tissue and Breast	160054
16	Diseases and Disorders of the Blood and Blood Forming Organs and Immunological Disorders	115226
21	Injuries, Poisonings and Toxic Effects of Drugs	98966
19	Mental Diseases and Disorders	92800
17	Myeloproliferative Diseases and Disorders, and Poorly Differentiated Neoplasm	56501
20	Alcohol/Drug Use and Alcohol/Drug Induced Organic Mental Disorders	48418
23	Factors Influencing Health Status and Other Contacts with Health Services	40083
3	Diseases and Disorders of the Ear, Nose, Mouth and Throat	37896
13	Diseases and Disorders of the Female Reproductive System	28277
12	Diseases and Disorders of the Male Reproductive System	26814
24	Multiple Significant Trauma	18883
14	Pregnancy, Childbirth and the Puerperium	12833
2	Diseases and Disorders of the Eye	8898
25	Human Immunodeficiency Virus Infections	6784
22	Burns	3535

Table 1 shows the number of occurrences of each diagnosis category recorded by CMS nationally. We can see that the MDC codes do not line up with frequency of occurrences. Furthermore, the leading type of diagnosis nationally are circulatory system related. This lines up the CDC’s reported high prevalence of cardiac disease (1 in 3 deaths).

regional_spending <- cms %>%
  filter(!is.na(Rndrng_Prvdr_Geo_Cd)) %>%
  group_by(Region) %>%
  summarize(total_medicare_spending = sum(Avg_Mdcr_Pymt_Amt)) %>% 
  arrange(desc(total_medicare_spending))

overall_total <- sum(regional_spending$total_medicare_spending)
regional_spending$percent_of_total <- percent(regional_spending$total_medicare_spending / overall_total)

region_pop <- data.frame(
  Region = c("Northeast", "Midwest", "South", "West"),
  Population = c(58058022, 68329004, 125580448, 78347268)
) %>%
  mutate(percent_population = scales::percent(Population / sum(Population)))

regional_spending <- left_join(regional_spending, select(region_pop, "percent_population", "Region"), by = "Region") 

regional_spending$per_capita_spending <- dollar_format()(
  regional_spending$total_medicare_spending / region_pop$Population
)

regional_spending$total_medicare_spending <- dollar_format()(regional_spending$total_medicare_spending)

kable(regional_spending, col.names = c("Region", "Annual Medicare Spending", "Percent of Total Spending","Percent Population", "Spending Per Capita"), 
      caption = "Table 2. Average Annual Medicare Spending By Region") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)

Table 2. Average Annual Medicare Spending By Region

Region	Annual Medicare Spending	Percent of Total Spending	Percent Population	Spending Per Capita
South	\$134,884,335	35.33%	38.0%	\$2.32
Midwest	\$89,651,177	23.48%	20.7%	\$1.31
West	\$87,303,694	22.87%	23.7%	\$0.70
Northeast	\$69,969,547	18.33%	17.6%	\$0.89

From Table 2, we see that federal spending on Medicare is highest in the South. Generally, this percent of total spending is proportional to the percent of the population by region. However, we see that the South spends significantly more per capita on Medicare than in the West.

Plots

# create a new column with the average Medicare payment per state
avg_payment_by_state <- cms %>%
  group_by(Rndrng_Prvdr_Geo_Desc) %>%
  summarize(avg_payment = mean(Avg_Mdcr_Pymt_Amt, na.rm = TRUE)) %>%
  ungroup()

# sort the states by average payment in descending order
avg_payment_by_state <- avg_payment_by_state[order(-avg_payment_by_state$avg_payment),]

# highlight the top four states in a different color
top_four <- head(avg_payment_by_state, 4)$Rndrng_Prvdr_Geo_Desc
avg_payment_by_state$highlight <- ifelse(avg_payment_by_state$Rndrng_Prvdr_Geo_Desc %in% top_four, "Top 4 States", "Other States")

# create the bar chart
ggplot(avg_payment_by_state, aes(x = reorder(Rndrng_Prvdr_Geo_Desc, desc(avg_payment)), y = avg_payment, fill = highlight)) +
  geom_bar(stat = "identity",  width = 0.7) +
  scale_fill_manual(values = c("Top 4 States" = "tomato", "Other States" = "gray")) +
  labs(title = "Average Medicare Payment by State",
       x = "State",
       y = "Average Medicare Payment (in Thousands)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

From this bar chart, we can surprisingly see that Maryland has the highest average Medicare payment amounts for Original Medicare Part A beneficiaries, followed by California and New York. Upon further research, Maryland enacted its “All-Payer” model in 2013 which increased Medicare funds for greater financial accessibility. Alaska is also quite surprising, but with further inspection, Alaska has a high proportion of elderly residents and due to its small population and remote location, healthcare delivery is more expensive.

Additional sources are linked: - CMS Maryland All-Payer Model - Health Affairs on Maryland - Alaska Policy Forum

#new column with abbreviated MDC Descriptions and Codes
cms$MDC_Desc <- str_replace(cms$MDC_Desc, "Diseases and Disorders of the ", "")
cms$MDC_Code_Desc <- paste(cms$MDC, cms$MDC_Desc, sep = " - ")

ggplot(cms, aes(x = MDC, y = Avg_Mdcr_Pymt_Amt, fill = MDC_Code_Desc)) +
  geom_boxplot() +
  xlab("MDC Code") +
  ylab("Average Medicare Payment Amount (in Thousands)") +
  ggtitle("Average Medicare Payment Amount by MDC Code") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name = "MDC Description") +
  scale_y_continuous(labels = scales::dollar_format(scale = 1e-3, prefix = "$")) +
  theme(legend.key.size = unit(0.5, "cm"),
        legend.position = "right",
        legend.box.just = "left",
        legend.text = element_text(size = 10)) +
  theme(plot.margin = unit(c(1, 2, 1, 1), "lines")) +
  guides(fill = guide_legend(title = "MDC Description", ncol = 1)) +
  coord_flip()

This boxplot shows that some of the diagnostic categories have a wide range of Medicare costs, particularly with burns and kidney and urinary tract disorders/diseases. Burns have a wide range of severity levels, with the highest degree requiring long hospital stays and care. Kidney and urinary tract infections are very common, so its sheer prevalence could be contributing to such high costs. Similarly, there is a wide range of severity when it comes to these issues, from a UTI to a full-blown kidney infection, which requires hospitalization.

# create a new column with the total discharges per region and MDC
discharges_by_region_mdc <- cms %>%
  filter(!is.na(Region)) %>% 
  group_by(Region, MDC_Desc) %>%
  summarize(total_discharges = sum(Tot_Dschrgs)) %>%
  ungroup()

## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.

# get the top 5 MDCs with the highest total discharges for each region
top_mdc <- discharges_by_region_mdc %>%
  group_by(Region) %>%
  top_n(5, total_discharges) %>%
  arrange(desc(total_discharges))

# Wrap text
top_mdc$MDC_Desc <- str_wrap(top_mdc$MDC_Desc, width = 25)

# plot the top 5 MDCs with the highest total discharges for each region
ggplot(top_mdc, aes(x = reorder(MDC_Desc, desc(total_discharges)), y = total_discharges, fill = Region)) +
  geom_bar(stat = "identity", position = "dodge") +
  xlab("MDC") +
  ylab("Total Discharges") +
  ggtitle("Top 5 MDCs with the Highest Total Discharges by Region") +
  scale_fill_discrete(name = "Region") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_continuous(labels = scales::comma_format())

In this barchart, we can see similar patterns of the South having the highest number of discharges for the top 5 MDC total discharges. This far surpasses the other regions, potentially indicating determinants of the population, potentially due to demographics or culture. We also see that diseases and disorders of the circulatory system, e.g. cardiovascular issues, are extremely prevalent, and this trend is only exacerbated in the South.

Conclusion

From these preliminary findings, it is clear that Medicare spending and hospitalization rates are dependent on a wide variety of causes. The South has the highest hospitalization rates and the highest Medicare spending per capita. This is interesting because none of the states with the highest average total Medicare spending are in the South. Those states seem to have initiatives or unique qualities that lead to increased funding for Medicare. Ailments related to the circulatory system are by far the most common hospitalizations nationwide and the most costly. However, some illnesses such as those that are kidney and urinary tract related, also have very high spending due to the range of costs in this diagnostic category.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README_files/figure-gfm		README_files/figure-gfm
.gitignore		.gitignore
DRG_MDC.csv		DRG_MDC.csv
Design_and_development_of_the_Diagnosis_Related_Group_(DRGs).pdf		Design_and_development_of_the_Diagnosis_Related_Group_(DRGs).pdf
MUP_IHP_RY22_P02_V10_Dy20_Geo.csv		MUP_IHP_RY22_P02_V10_Dy20_Geo.csv
MUP_INP_RY22_20220722_DD_Geo.pdf		MUP_INP_RY22_20220722_DD_Geo.pdf
Medicare Inpatient Hospitals Methodology (1).pdf		Medicare Inpatient Hospitals Methodology (1).pdf
PM566_Midterm.Rproj		PM566_Midterm.Rproj
README.Rmd		README.Rmd
README.html		README.html
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PM 566 Midterm

Introduction

Methods

Data Sources

Data Wrangling

Preliminary Results

Summary Tables

Plots

Conclusion

About

Releases

Packages

scl301/PM566_Midterm

Folders and files

Latest commit

History

Repository files navigation

PM 566 Midterm

Introduction

Methods

Data Sources

Data Wrangling

Preliminary Results

Summary Tables

Plots

Conclusion

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages