This epidemiological study investigates how sleep metrics and lifestyle factors relate to cardiovascular health indicators, with a primary focus on identifying sex-specific differences. Using the Kaggle Sleep Health and Lifestyle Dataset (n=374), this project employs descriptive statistics and formal hypothesis testing (Welch's independent two-sample t-test and ANOVA) to assess whether sleep duration and cardiac workload differ significantly between male and female cohorts.
We load the dataset and evaluate its baseline columns to confirm variable structures and ensure no data points are missing.
# Load the tidyverse (provides read_csv, glimpse, ggplot and %>%)
library(tidyverse)
# Load the dataset
sleep_data <- read_csv("Sleep_health_and_lifestyle_dataset.csv")
# Inspect dataset anatomy
glimpse(sleep_data)
## Rows: 374
## Columns: 13
## $ `Person ID` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Mal…
## $ Age <dbl> 27, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, …
## $ Occupation <chr> "Software Engineer", "Doctor", "Doctor", "Sa…
## $ `Sleep Duration` <dbl> 6.1, 6.2, 6.2, 5.9, 5.9, 5.9, 6.3, 7.8, 7.8,…
## $ `Quality of Sleep` <dbl> 6, 6, 6, 4, 4, 4, 6, 7, 7, 7, 6, 7, 6, 6, 6,…
## $ `Physical Activity Level` <dbl> 42, 60, 60, 30, 30, 30, 40, 75, 75, 75, 30, …
## $ `Stress Level` <dbl> 6, 8, 8, 8, 8, 8, 7, 6, 6, 6, 8, 6, 8, 8, 8,…
## $ `BMI Category` <chr> "Overweight", "Normal", "Normal", "Obese", "…
## $ `Blood Pressure` <chr> "126/83", "125/80", "125/80", "140/90", "140…
## $ `Heart Rate` <dbl> 77, 75, 75, 85, 85, 85, 82, 70, 70, 70, 70, …
## $ `Daily Steps` <dbl> 4200, 10000, 10000, 3000, 3000, 3000, 3500, …
## $ `Sleep Disorder` <chr> "None", "None", "None", "Sleep Apnea", "Slee…
# Check for missing values
cat("\nMissing Values Count:\n")
##
## Missing Values Count:
colSums(is.na(sleep_data))
## Person ID Gender Age
## 0 0 0
## Occupation Sleep Duration Quality of Sleep
## 0 0 0
## Physical Activity Level Stress Level BMI Category
## 0 0 0
## Blood Pressure Heart Rate Daily Steps
## 0 0 0
## Sleep Disorder
## 0
The dataset contains 374 observations and 13 columns covering metrics such as Gender, Age, Sleep Duration, Quality of Sleep, Heart Rate, and Blood Pressure. No missing values are present in any column, so no imputation or row exclusion is needed before testing. Completeness alone does not guarantee that the assumptions of parametric tests hold; those depend on the distribution of each variable within groups.
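Before running parametric tests, the distributional assumptions could also be inspected. The following is a minimal sketch of such a check, assuming simulated stand-in values rather than the Kaggle data: the data frame `toy`, its column `sleep_hours`, and the group means and standard deviations are all hypothetical and chosen only for illustration.

```r
# Hypothetical stand-in data (NOT the Kaggle dataset): two groups of 50
set.seed(1)
toy <- data.frame(
  Gender = rep(c("Female", "Male"), each = 50),
  sleep_hours = c(rnorm(50, mean = 7.2, sd = 0.8),
                  rnorm(50, mean = 7.0, sd = 0.8))
)

# Shapiro-Wilk normality test within each group
normality_p <- tapply(toy$sleep_hours, toy$Gender,
                      function(x) shapiro.test(x)$p.value)

# F test comparing the two group variances
variance_p <- var.test(sleep_hours ~ Gender, data = toy)$p.value

print(normality_p)
print(variance_p)
```

Small p-values here would suggest departures from normality or unequal variances. Welch's t-test, which `t.test()` uses by default, does not require equal variances.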
We visualize the distribution of Sleep Duration by Gender to understand baseline variation before formal statistical testing.
# Boxplot for Sleep Duration by Gender
ggplot(sleep_data, aes(x = Gender, y = `Sleep Duration`, fill = Gender)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Distribution of Sleep Duration Across Genders",
       x = "Gender", y = "Sleep Duration (Hours)") +
  theme_minimal()
Preliminary visualization suggests a shift in central tendency: the median sleep duration for female subjects appears slightly higher than that of males. To determine whether this visible difference is larger than would be expected from random sampling variation, we apply inferential statistics.
We evaluate whether the mean difference in Sleep Duration between males and females is statistically significant.

- Null Hypothesis (\(H_0\)): There is no difference in mean sleep duration between males and females.
- Alternative Hypothesis (\(H_1\)): There is a difference in mean sleep duration between males and females.
# Perform an independent two-sample t-test (t.test() defaults to Welch's unequal-variance version)
t_test_result <- t.test(`Sleep Duration` ~ Gender, data = sleep_data)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: Sleep Duration by Gender
## t = 2.3565, df = 349.38, p-value = 0.019
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## 0.03195795 0.35448564
## sample estimates:
## mean in group Female mean in group Male
## 7.229730 7.036508
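The output gives \(t = 2.36\) with \(p = 0.019\). Because \(p < 0.05\), we reject the null hypothesis: females sleep on average about 0.19 hours longer than males (7.23 vs 7.04 hours; 95% CI 0.03 to 0.35 hours). The difference is statistically significant but small, and as an observational association it does not show that biological sex determines sleep duration.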
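The p-value and confidence interval can be reconstructed from the reported test statistic, which makes the link between them explicit. The sketch below uses only numbers copied from the console output above; the variable names are illustrative, and the standard error is back-calculated from \(t\) rather than taken from the raw data.

```r
# Values copied from the Welch t-test output above
t_stat      <- 2.3565
df_welch    <- 349.38
mean_female <- 7.229730
mean_male   <- 7.036508

# Difference in means and the standard error implied by t = diff / SE
mean_diff <- mean_female - mean_male
std_error <- mean_diff / t_stat

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

# 95% confidence interval: diff +/- critical t * SE
margin <- qt(0.975, df = df_welch) * std_error
ci <- c(lower = mean_diff - margin, upper = mean_diff + margin)

print(p_value)
print(ci)
```

Both results should land close to the reported values, up to rounding of the inputs.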
Next, we evaluate whether cardiovascular workload, represented by resting Heart Rate, differs significantly across Sleep Disorder categories (None, Insomnia, Sleep Apnea). Whether gender moderates this relationship is examined afterwards with a two-way model.
# One-Way ANOVA: Testing Heart Rate across Sleep Disorders
anova_result <- aov(`Heart Rate` ~ `Sleep Disorder`, data = sleep_data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## `Sleep Disorder` 2 962 481.1 32.95 6.74e-14 ***
## Residuals 371 5417 14.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA tests whether mean resting Heart Rate differs across the sleep disorder categories. Here \(F(2, 371) = 32.95\) with \(p = 6.74 \times 10^{-14}\), well below 0.05, so mean heart rate differs between at least two of the three groups (None, Insomnia, Sleep Apnea). This indicates an association between sleep disorders and resting heart rate; it does not by itself establish that sleep disruption causes the difference.
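A p-value says little about the size of an effect, so an effect size such as eta squared can be read off the same table. The sketch below uses the rounded sums of squares printed in the ANOVA summary above; the variable names are illustrative, and because the inputs are rounded, the recomputed F value matches the reported one only approximately.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_between <- 962    # `Sleep Disorder`
ss_within  <- 5417   # Residuals
df_between <- 2
df_within  <- 371

# Eta squared: share of total Heart Rate variance explained by Sleep Disorder
eta_sq <- ss_between / (ss_between + ss_within)

# F statistic: between-group mean square over within-group mean square
f_value <- (ss_between / df_between) / (ss_within / df_within)

print(round(eta_sq, 3))
print(round(f_value, 2))
```

On these figures, Sleep Disorder category accounts for roughly 15% of the variance in resting heart rate.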
## 6. Advanced Post-Hoc Analysis & Factor Interactions

To explore specific group differences and evaluate whether gender moderates the relationship between sleep disorders and cardiovascular stress, we conduct a Tukey HSD test and a Two-Way ANOVA.
# 1. Refit the ANOVA model in a more explicit form (grouping variable as a factor) for the Tukey test
anova_model_fixed <- aov(sleep_data$`Heart Rate` ~ as.factor(sleep_data$`Sleep Disorder`))
# 2. Run post-hoc testing (Tukey HSD)
cat("--- Tukey HSD Pairwise Comparisons --- \n")
## --- Tukey HSD Pairwise Comparisons ---
tukey_result <- TukeyHSD(anova_model_fixed)
print(tukey_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = sleep_data$`Heart Rate` ~ as.factor(sleep_data$`Sleep Disorder`))
##
## $`as.factor(sleep_data$`Sleep Disorder`)`
## diff lwr upr p adj
## None-Insomnia -1.449268 -2.640622 -0.2579137 0.0123205
## Sleep Apnea-Insomnia 2.622211 1.177651 4.0667710 0.0000732
## Sleep Apnea-None 4.071479 2.885789 5.2571689 0.0000000
# 3. Two-Way ANOVA with Interaction (Gender * Sleep Disorder)
cat("\n--- Two-Way ANOVA with Interaction --- \n")
##
## --- Two-Way ANOVA with Interaction ---
two_way_anova <- aov(sleep_data$`Heart Rate` ~ as.factor(sleep_data$Gender) * as.factor(sleep_data$`Sleep Disorder`))
summary(two_way_anova)
## Df Sum Sq
## as.factor(sleep_data$Gender) 1 301
## as.factor(sleep_data$`Sleep Disorder`) 2 1645
## as.factor(sleep_data$Gender):as.factor(sleep_data$`Sleep Disorder`) 2 200
## Residuals 368 4234
## Mean Sq
## as.factor(sleep_data$Gender) 300.7
## as.factor(sleep_data$`Sleep Disorder`) 822.5
## as.factor(sleep_data$Gender):as.factor(sleep_data$`Sleep Disorder`) 99.8
## Residuals 11.5
## F value
## as.factor(sleep_data$Gender) 26.134
## as.factor(sleep_data$`Sleep Disorder`) 71.481
## as.factor(sleep_data$Gender):as.factor(sleep_data$`Sleep Disorder`) 8.675
## Residuals
## Pr(>F)
## as.factor(sleep_data$Gender) 5.13e-07
## as.factor(sleep_data$`Sleep Disorder`) < 2e-16
## as.factor(sleep_data$Gender):as.factor(sleep_data$`Sleep Disorder`) 0.000208
## Residuals
##
## as.factor(sleep_data$Gender) ***
## as.factor(sleep_data$`Sleep Disorder`) ***
## as.factor(sleep_data$Gender):as.factor(sleep_data$`Sleep Disorder`) ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
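The interaction row of the table above can be checked by hand. The sketch below recomputes its p-value from the reported F value and degrees of freedom, and derives a partial eta squared from the printed sums of squares; the variable names are illustrative. Note that `aov()` reports sequential (Type I) sums of squares, so in an unbalanced design the main-effect rows depend on the order of terms in the formula.

```r
# Values copied from the two-way ANOVA table above (interaction row and residuals)
f_interaction  <- 8.675
df_interaction <- 2
df_residual    <- 368
ss_interaction <- 200
ss_residual    <- 4234

# Upper-tail probability of the F distribution gives the p-value
p_interaction <- pf(f_interaction, df1 = df_interaction,
                    df2 = df_residual, lower.tail = FALSE)

# Partial eta squared: interaction variance relative to interaction + error
partial_eta_sq <- ss_interaction / (ss_interaction + ss_residual)

print(p_interaction)
print(round(partial_eta_sq, 3))
```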
All three Tukey comparisons are significant at the 5% level: mean heart rate is highest for Sleep Apnea (about 4.1 bpm above None and 2.6 bpm above Insomnia) and lowest for None (about 1.4 bpm below Insomnia). In the two-way model, the Gender:Sleep Disorder interaction yields \(p = 0.000208\), which is below 0.05. This indicates that the association between sleep disorders and cardiovascular workload differs depending on whether the subject is male or female, fulfilling a core requirement of sex-specific epidemiological evaluation.

To translate these epidemiological insights into a predictive tool, we train a Decision Tree algorithm to classify the presence or absence of a Sleep Disorder based on demographic, sleep, and heart-rate variables.
# Load required machine learning packages
# install.packages("rpart")
# install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
library(caret)
# Data preprocessing for machine learning
ml_sleep_data <- sleep_data %>%
  mutate(Disorder_Status = ifelse(`Sleep Disorder` == "None", "No Disorder", "Disorder Present"),
         Disorder_Status = as.factor(Disorder_Status),
         Gender = as.factor(Gender))
# Set random seed for reproducibility
set.seed(456)
# Split data into Training (80%) and Testing (20%) sets
train_idx <- createDataPartition(ml_sleep_data$Disorder_Status, p = 0.8, list = FALSE)
train_set <- ml_sleep_data[train_idx, ]
test_set <- ml_sleep_data[-train_idx, ]
# Train Decision Tree Classifier
dt_model <- rpart(Disorder_Status ~ Gender + Age + `Sleep Duration` + `Quality of Sleep` + `Heart Rate`,
                  data = train_set,
                  method = "class")
# Plot the clinical decision flowchart
rpart.plot(dt_model, main = "Clinical Decision Tree for Sleep Disorder Prediction",
           box.palette = "RdBu", shadow.col = "gray", nn = TRUE)
# Evaluate ML model accuracy on the test set
dt_predictions <- predict(dt_model, test_set, type = "class")
cat("\n--- Machine Learning Model Accuracy --- \n")
##
## --- Machine Learning Model Accuracy ---
confusionMatrix(dt_predictions, test_set$Disorder_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Disorder Present No Disorder
## Disorder Present 26 5
## No Disorder 5 38
##
## Accuracy : 0.8649
## 95% CI : (0.7655, 0.9332)
## No Information Rate : 0.5811
## P-Value [Acc > NIR] : 1.229e-07
##
## Kappa : 0.7224
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8387
## Specificity : 0.8837
## Pos Pred Value : 0.8387
## Neg Pred Value : 0.8837
## Prevalence : 0.4189
## Detection Rate : 0.3514
## Detection Prevalence : 0.4189
## Balanced Accuracy : 0.8612
##
## 'Positive' Class : Disorder Present
##
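The headline statistics follow directly from the four cell counts of the confusion matrix. The sketch below rebuilds the matrix from the output above and recomputes the main metrics by hand, treating "Disorder Present" as the positive class as `confusionMatrix()` did; the object names are illustrative.

```r
# Confusion matrix copied from the output above (rows = prediction, columns = reference)
cm <- matrix(c(26,  5,
                5, 38),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("Disorder Present", "No Disorder"),
                             Reference  = c("Disorder Present", "No Disorder")))

tp <- cm["Disorder Present", "Disorder Present"]  # true positives
fp <- cm["Disorder Present", "No Disorder"]       # false positives
fn <- cm["No Disorder", "Disorder Present"]       # false negatives
tn <- cm["No Disorder", "No Disorder"]            # true negatives

accuracy            <- (tp + tn) / sum(cm)
sensitivity         <- tp / (tp + fn)
specificity         <- tn / (tn + fp)
balanced_accuracy   <- (sensitivity + specificity) / 2
# No Information Rate: accuracy of always predicting the larger reference class
no_information_rate <- max(colSums(cm)) / sum(cm)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced_accuracy,
              nir = no_information_rate), 4))
```

On this 74-row test set the tree classifies 64 cases correctly, against a No Information Rate of 43/74. With a test set this small the accuracy estimate is imprecise, as the reported 95% CI (0.7655, 0.9332) shows.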