Executive Summary¶
This analysis explores hospital encounter data to understand patient demographics, utilization trends, and payer coverage. Results indicate stable utilization over time, and meaningful variation in cost and coverage that may represent operational and financial risk areas.
Limitations This analysis is based on a synthetic dataset and does not adjust for patient severity, case mix, or clinical outcomes.
Prepare Dataset¶
import pandas as pd
import matplotlib.pyplot as plt
import sqlite3
encounters = pd.read_csv("encounters.csv")
payers = pd.read_csv("payers.csv")
patients = pd.read_csv("patients.csv")
procedures = pd.read_csv("procedures.csv")
conn = sqlite3.connect("hospital_db")
c = conn.cursor()
patients.to_sql("patients", conn, if_exists="replace", index=False)
encounters.to_sql("encounters", conn, if_exists="replace", index=False)
payers.to_sql("payers", conn, if_exists="replace", index=False)
procedures.to_sql("procedures", conn, if_exists="replace", index=False)
47701
Patient Demographics¶
Understand our primary demographics.
patients.shape
(974, 20)
patients.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 974 entries, 0 to 973 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 974 non-null object 1 BIRTHDATE 974 non-null object 2 DEATHDATE 154 non-null object 3 PREFIX 974 non-null object 4 FIRST 974 non-null object 5 LAST 974 non-null object 6 SUFFIX 21 non-null object 7 MAIDEN 386 non-null object 8 MARITAL 973 non-null object 9 RACE 974 non-null object 10 ETHNICITY 974 non-null object 11 GENDER 974 non-null object 12 BIRTHPLACE 974 non-null object 13 ADDRESS 974 non-null object 14 CITY 974 non-null object 15 STATE 974 non-null object 16 COUNTY 974 non-null object 17 ZIP 832 non-null float64 18 LAT 974 non-null float64 19 LON 974 non-null float64 dtypes: float64(3), object(17) memory usage: 152.3+ KB
patients["GENDER"].value_counts()
GENDER M 494 F 480 Name: count, dtype: int64
patients["ETHNICITY"].value_counts()
ETHNICITY nonhispanic 783 hispanic 191 Name: count, dtype: int64
patients["RACE"].value_counts()
RACE white 680 black 163 asian 91 other 16 hawaiian 13 native 11 Name: count, dtype: int64
Demographics Graphs¶
There are 974 patients in the system with a almost an exact split of male / females. The patients race is predominantly white, but the population distribution is equal to that of the country.
cols = patients[["GENDER", "RACE", "ETHNICITY"]]
for c in cols.columns:
plt.title("Plot of "+ c, fontsize=15)
plt.hist(cols[c],bins = 5)
plt.show()
Utilization¶
What services are most frequently used and what trends are there?
encounters.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 27891 entries, 0 to 27890 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 27891 non-null object 1 START 27891 non-null object 2 STOP 27891 non-null object 3 PATIENT 27891 non-null object 4 ORGANIZATION 27891 non-null object 5 PAYER 27891 non-null object 6 ENCOUNTERCLASS 27891 non-null object 7 CODE 27891 non-null int64 8 DESCRIPTION 27891 non-null object 9 BASE_ENCOUNTER_COST 27891 non-null float64 10 TOTAL_CLAIM_COST 27891 non-null float64 11 PAYER_COVERAGE 27891 non-null float64 12 REASONCODE 8350 non-null float64 13 REASONDESCRIPTION 8350 non-null object dtypes: float64(4), int64(1), object(9) memory usage: 3.0+ MB
encounters["ENCOUNTERCLASS"].hist()
<Axes: >
Ambulatory is by far the most frequent service, followed by outpatient and then urgent care. Inpatient is the least utlized service provided.
encounters["START"] = pd.to_datetime(encounters["START"])
# Group encounters by month start and plot
encounters.set_index("START").resample("MS").size().plot(title="Monthly Encounter Volume")
plt.show()
Encounter volume is mostly stable over time, excluding the significant spike in 2014 which is likely a data backfill that will be excluded from trend analysis. There is a gradual increase around 2020 which is likely due to COIVD-19 onset.
monthly = (
encounters
.set_index("START")[["ENCOUNTERCLASS"]]
.groupby("ENCOUNTERCLASS")
.resample("MS")
.size()
.unstack(0)
)
monthly.plot(figsize=(12,6), title="Monthly Encounter Volume by Type")
plt.show()
C:\Users\wayne\AppData\Local\Temp\ipykernel_39596\3896481572.py:6: FutureWarning: DataFrameGroupBy.resample operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning. .size()
Even when excluding the spike in encounters in 2014, ambulatory is still the most used service across all years.
Cost / Payer Analysis¶
How much are pateints and payers typically paying? What is the cost distribution among procedures?
encounters["PAYER_COVERAGE"].mean()
np.float64(1114.9656516438995)
encounters["TOTAL_CLAIM_COST"].mean()
np.float64(3639.6821741780504)
The average claim cost is \$3,639 of which the payers cover an average of \$1,114.
plt.figure(figsize=(7,6))
plt.title("Total Claim Cost by Encounter Type")
plt.bar(encounters["ENCOUNTERCLASS"],encounters["TOTAL_CLAIM_COST"])
plt.xlabel("Encounter Class Type", fontsize=13)
plt.ylabel("Claim Costs ($)", fontsize=13)
plt.show()
The top 10 most frequent procedures and their avergae cost¶
Find the most frequent procedures and how much they cost
sql = """
SELECT DESCRIPTION, COUNT(*) as total_procedures, ROUND(AVG(BASE_COST),2) as Avg_cost
FROM procedures
GROUP BY DESCRIPTION
ORDER BY total_procedures DESC
LIMIT 10
"""
res = c.execute(sql)
for r in res:
print(r)
('Assessment of health and social care needs (procedure)', 4596, 431.0)
('Hospice care (regime/therapy)', 4098, 431.0)
('Depression screening using Patient Health Questionnaire Two-Item score (procedure)', 3614, 431.0)
('Depression screening (procedure)', 3614, 431.0)
('Assessment of substance use (procedure)', 2906, 431.0)
('Renal dialysis (procedure)', 2746, 1004.09)
('Assessment using Morse Fall Scale (procedure)', 2422, 431.0)
('Assessment of anxiety (procedure)', 2288, 431.0)
('Medication Reconciliation (procedure)', 2284, 509.12)
('Screening for drug abuse (procedure)', 1484, 431.0)
Assessments and screenings are the most common procedures, accounting for 70% of the top ten, and are priced at \$431.
The most exepnsive prodecures and the number of occurences¶
Find the most expensive procedures and how many times they occur.
sql = """
SELECT DESCRIPTION, COUNT(*) as total_procedures, ROUND(AVG(BASE_COST),2) as Avg_cost
FROM procedures
GROUP BY DESCRIPTION
ORDER BY Avg_cost DESC
LIMIT 10
"""
res = c.execute(sql)
for r in res:
print(r)
('Admit to ICU (procedure)', 5, 206260.4)
('Coronary artery bypass grafting', 9, 47085.89)
('Lumpectomy of breast (procedure)', 5, 29353.0)
('Hemodialysis (procedure)', 27, 29299.56)
('Insertion of biventricular implantable cardioverter defibrillator', 4, 27201.0)
('Electrical cardioversion', 1383, 25903.11)
('Partial resection of colon', 7, 25229.29)
('Fine needle aspiration biopsy of lung (procedure)', 1, 23141.0)
('Percutaneous mechanical thrombectomy of portal vein using fluoroscopic guidance', 57, 20228.04)
('Percutaneous coronary intervention', 9, 19728.0)
Of the top 10 most expensive procedures, 70% occur less than ten times. The notable exception is 'Electrical cardioversion' which has happened 1,383 times and is the sixth most expensive procedure.
The average total claim cost for encounters by payer¶
Find the average payments made by payers
sql = """
SELECT p.NAME, ROUND(AVG(e.TOTAL_CLAIM_COST),2) as avg_claim_cost
FROM encounters as e
JOIN payers as p
on e.PAYER = p.Id
GROUP BY PAYER
ORDER BY avg_claim_cost DESC
"""
res = c.execute(sql)
for r in res:
print(r)
('Medicaid', 6205.22)
('NO_INSURANCE', 5593.2)
('Anthem', 4236.81)
('Humana', 3269.3)
('Blue Cross Blue Shield', 3245.58)
('Cigna Health', 2996.95)
('UnitedHealthcare', 2848.34)
('Aetna', 2767.05)
('Medicare', 2167.55)
('Dual Eligible', 1696.19)
The second highest average total cliam cost is actually by those without insurance, with the top three payers being Medicaid, Anthem and Humana.