1. Introduction & Research Objectives

This epidemiological study investigates how sleep metrics and lifestyle factors relate to cardiovascular health indicators, with a primary focus on sex-specific differences. Using the Kaggle Sleep Health and Lifestyle Dataset (n=374), the project applies descriptive statistics, formal hypothesis testing (Welch's independent t-test, one-way and two-way ANOVA with Tukey HSD post-hoc comparisons), and a decision tree classifier to assess whether sleep duration and heart rate differ between male and female subjects and across sleep disorder categories.


2. Data Loading & Inspection

We load the dataset and evaluate its baseline columns to confirm variable structures and ensure no data points are missing.

# Load the packages that provide read_csv(), glimpse(), %>% and ggplot()
library(tidyverse)

# Load the dataset
sleep_data <- read_csv("Sleep_health_and_lifestyle_dataset.csv")

# Inspect dataset anatomy
glimpse(sleep_data)
## Rows: 374
## Columns: 13
## $ `Person ID`               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ Gender                    <chr> "Male", "Male", "Male", "Male", "Male", "Mal…
## $ Age                       <dbl> 27, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, …
## $ Occupation                <chr> "Software Engineer", "Doctor", "Doctor", "Sa…
## $ `Sleep Duration`          <dbl> 6.1, 6.2, 6.2, 5.9, 5.9, 5.9, 6.3, 7.8, 7.8,…
## $ `Quality of Sleep`        <dbl> 6, 6, 6, 4, 4, 4, 6, 7, 7, 7, 6, 7, 6, 6, 6,…
## $ `Physical Activity Level` <dbl> 42, 60, 60, 30, 30, 30, 40, 75, 75, 75, 30, …
## $ `Stress Level`            <dbl> 6, 8, 8, 8, 8, 8, 7, 6, 6, 6, 8, 6, 8, 8, 8,…
## $ `BMI Category`            <chr> "Overweight", "Normal", "Normal", "Obese", "…
## $ `Blood Pressure`          <chr> "126/83", "125/80", "125/80", "140/90", "140…
## $ `Heart Rate`              <dbl> 77, 75, 75, 85, 85, 85, 82, 70, 70, 70, 70, …
## $ `Daily Steps`             <dbl> 4200, 10000, 10000, 3000, 3000, 3000, 3500, …
## $ `Sleep Disorder`          <chr> "None", "None", "None", "Sleep Apnea", "Slee…
# Check for missing values
cat("\nMissing Values Count:\n")
## 
## Missing Values Count:
colSums(is.na(sleep_data))
##               Person ID                  Gender                     Age 
##                       0                       0                       0 
##              Occupation          Sleep Duration        Quality of Sleep 
##                       0                       0                       0 
## Physical Activity Level            Stress Level            BMI Category 
##                       0                       0                       0 
##          Blood Pressure              Heart Rate             Daily Steps 
##                       0                       0                       0 
##          Sleep Disorder 
##                       0

Methodological Interpretation:

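The dataset contains 374 observations and 13 columns, including Gender, Age, Sleep Duration, Quality of Sleep, Heart Rate, and Blood Pressure. None of the 13 columns contains missing values, so no imputation or row exclusion is needed before analysis. Completeness alone does not validate the parametric tests used below; their assumptions (approximately normal values within groups, comparable variances) still have to be considered.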
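
Blood Pressure is read as a character column (for example "126/83"), so it cannot enter a numeric analysis directly. Below is a minimal sketch of one way to split it, using three values copied from the glimpse() output above. The column names `systolic` and `diastolic` are illustrative assumptions, not part of the original dataset.

```r
# Example values taken from the glimpse() output above
bp_raw <- c("126/83", "125/80", "140/90")

# Split each "systolic/diastolic" string at the slash
bp_parts <- strsplit(bp_raw, "/", fixed = TRUE)

# Hypothetical column names, chosen here for illustration only
bp_numeric <- data.frame(
  systolic  = as.numeric(sapply(bp_parts, `[`, 1)),
  diastolic = as.numeric(sapply(bp_parts, `[`, 2))
)

print(bp_numeric)
```

With the real data, the same steps could be applied to the `Blood Pressure` column of sleep_data.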


3. Descriptive Statistics & Data Visualization

We visualize the distribution of Sleep Duration by Gender to gauge baseline differences before formal statistical testing.

# Boxplot for Sleep Duration by Gender
ggplot(sleep_data, aes(x = Gender, y = `Sleep Duration`, fill = Gender)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Distribution of Sleep Duration Across Genders",
       x = "Gender", y = "Sleep Duration (Hours)") +
  theme_minimal()

Medical Interpretation of the Sleep Distribution:

The boxplot suggests a shift in central tendency: the median sleep duration for female subjects appears slightly higher than that of males. To determine whether this difference is larger than would be expected from sampling variability alone, we apply inferential statistics.
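
A boxplot is easier to read alongside a numeric summary per group. The sketch below uses a small made-up data frame (`toy`, with hypothetical values that are not drawn from the dataset) to show the pattern in base R. With the real data, the `Sleep Duration` and `Gender` columns of sleep_data would take the place of the toy columns.

```r
# Hypothetical toy values for illustration only (not taken from the dataset)
toy <- data.frame(
  Gender      = c("Female", "Female", "Female", "Male", "Male", "Male"),
  sleep_hours = c(7.1, 7.3, 6.9, 6.8, 7.0, 7.2)
)

# Summarise one numeric vector: size, centre, and spread
describe <- function(x) {
  data.frame(n = length(x), mean = mean(x), median = median(x), sd = sd(x))
}

# Apply the summary within each gender and stack the rows
summary_tbl <- do.call(rbind, lapply(split(toy$sleep_hours, toy$Gender), describe))
print(summary_tbl)
```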


4. Hypothesis Testing - I: Independent Sample t-test

We evaluate whether the mean difference in Sleep Duration between males and females is statistically significant.

  • Null Hypothesis (\(H_0\)): There is no difference in mean sleep duration between males and females.
  • Alternative Hypothesis (\(H_1\)): The mean sleep duration differs between males and females.

# Perform independent two-sample t-test (t.test() defaults to Welch's unequal-variance version)
t_test_result <- t.test(`Sleep Duration` ~ Gender, data = sleep_data)
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  Sleep Duration by Gender
## t = 2.3565, df = 349.38, p-value = 0.019
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  0.03195795 0.35448564
## sample estimates:
## mean in group Female   mean in group Male 
##             7.229730             7.036508

Statistical Interpretation of t-test:

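The test gives \(t = 2.36\) (df ≈ 349.4) with \(p = 0.019\). Since \(p < 0.05\), we reject the null hypothesis at the 5% level: mean sleep duration is higher for females (7.23 hours) than for males (7.04 hours), a difference of about 0.19 hours (roughly 12 minutes), with a 95% confidence interval of 0.03 to 0.35 hours. The difference is statistically significant but modest in size, and it indicates an association with sex rather than showing that sex determines sleep duration.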
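
The pieces of the Welch output are linked: the standard error is the mean difference divided by \(t\), and the confidence interval is the difference plus or minus a \(t\)-quantile times that standard error. As a check, assuming the printed (rounded) values are used as inputs, the interval and \(p\)-value can be rebuilt as follows. Variable names are illustrative.

```r
# Values copied from the printed t-test output (rounded as displayed)
mean_female <- 7.229730
mean_male   <- 7.036508
t_stat      <- 2.3565
df_welch    <- 349.38

# Difference in means (Female minus Male, matching the output's ordering)
mean_diff <- mean_female - mean_male

# Standard error implied by t = difference / standard error
std_error <- mean_diff / t_stat

# 95% confidence interval from the t distribution with Welch degrees of freedom
half_width <- qt(0.975, df = df_welch) * std_error
ci <- c(lower = mean_diff - half_width, upper = mean_diff + half_width)

# Two-sided p-value recomputed from t and df
p_value <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)

print(round(ci, 4))
print(round(p_value, 3))
```

Because the inputs are rounded, the recomputed bounds agree with the printed interval only to a few decimal places.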


5. Hypothesis Testing - II: Analysis of Variance (ANOVA)

Next, we evaluate whether cardiovascular workload, represented by resting Heart Rate, differs significantly across Sleep Disorder categories (None, Insomnia, Sleep Apnea). Whether gender modifies this relationship is examined in Section 6.

# One-Way ANOVA: Testing Heart Rate across Sleep Disorders
anova_result <- aov(`Heart Rate` ~ `Sleep Disorder`, data = sleep_data)
summary(anova_result)
##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## `Sleep Disorder`   2    962   481.1   32.95 6.74e-14 ***
## Residuals        371   5417    14.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
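
The F statistic in the table is the ratio of the between-group mean square to the residual mean square. The short sketch below recomputes it from the rounded sums of squares printed above and adds eta-squared as an effect size. Eta-squared is not part of the original output; it is included here on the assumption that a reader may want a measure of magnitude alongside the \(p\)-value.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table (rounded)
ss_between <- 962
ss_within  <- 5417
df_between <- 2
df_within  <- 371

# Mean squares and F ratio
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_value    <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta-squared: share of total variation in Heart Rate attributed to the grouping
eta_squared <- ss_between / (ss_between + ss_within)

print(round(c(F = f_value, eta_squared = eta_squared), 3))
print(signif(p_value, 3))
```

Because the inputs are rounded, the recomputed F matches the table only to about one decimal place. An eta-squared near 0.15 means that sleep disorder category accounts for roughly 15% of the variation in heart rate in this sample.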

Statistical Interpretation of ANOVA:

The one-way ANOVA tests whether mean Heart Rate differs across the three sleep disorder categories. Here \(F(2, 371) = 32.95\) with \(p = 6.74 \times 10^{-14}\), well below 0.05, so mean resting heart rate differs between at least two of the groups. The omnibus test does not identify which groups differ, and the association does not by itself show that sleep disorders cause the change in heart rate.


6. Advanced Post-Hoc Analysis & Factor Interactions

To identify which groups differ and to evaluate whether gender moderates the relationship between sleep disorders and cardiovascular stress, we conduct a Tukey HSD test and a two-way ANOVA with an interaction term.

# 1. Refit the ANOVA model with the grouping variable as an explicit factor for the Tukey test
anova_model_fixed <- aov(sleep_data$`Heart Rate` ~ as.factor(sleep_data$`Sleep Disorder`))

# 2. Run post-hoc testing (Tukey HSD)
cat("--- Tukey HSD Pairwise Comparisons --- \n")
## --- Tukey HSD Pairwise Comparisons ---
tukey_result <- TukeyHSD(anova_model_fixed)
print(tukey_result)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = sleep_data$`Heart Rate` ~ as.factor(sleep_data$`Sleep Disorder`))
## 
## $`as.factor(sleep_data$`Sleep Disorder`)`
##                           diff       lwr        upr     p adj
## None-Insomnia        -1.449268 -2.640622 -0.2579137 0.0123205
## Sleep Apnea-Insomnia  2.622211  1.177651  4.0667710 0.0000732
## Sleep Apnea-None      4.071479  2.885789  5.2571689 0.0000000
# 3. Two-Way ANOVA with Interaction (Gender * Sleep Disorder)
cat("\n--- Two-Way ANOVA with Interaction --- \n")
## 
## --- Two-Way ANOVA with Interaction ---
two_way_anova <- aov(sleep_data$`Heart Rate` ~ as.factor(sleep_data$Gender) * as.factor(sleep_data$`Sleep Disorder`))
summary(two_way_anova)
##                                                                      Df Sum Sq
## as.factor(sleep_data$Gender)                                          1    301
## as.factor(sleep_data$`Sleep Disorder`)                                2   1645
## as.factor(sleep_data$Gender):as.factor(sleep_data$`Sleep Disorder`)   2    200
## Residuals                                                           368   4234
##                                                                     Mean Sq
## as.factor(sleep_data$Gender)                                          300.7
## as.factor(sleep_data$`Sleep Disorder`)                                822.5
## as.factor(sleep_data$Gender):as.factor(sleep_data$`Sleep Disorder`)    99.8
## Residuals                                                              11.5
##                                                                     F value
## as.factor(sleep_data$Gender)                                         26.134
## as.factor(sleep_data$`Sleep Disorder`)                               71.481
## as.factor(sleep_data$Gender):as.factor(sleep_data$`Sleep Disorder`)   8.675
## Residuals                                                                  
##                                                                       Pr(>F)
## as.factor(sleep_data$Gender)                                        5.13e-07
## as.factor(sleep_data$`Sleep Disorder`)                               < 2e-16
## as.factor(sleep_data$Gender):as.factor(sleep_data$`Sleep Disorder`) 0.000208
## Residuals                                                                   
##                                                                        
## as.factor(sleep_data$Gender)                                        ***
## as.factor(sleep_data$`Sleep Disorder`)                              ***
## as.factor(sleep_data$Gender):as.factor(sleep_data$`Sleep Disorder`) ***
## Residuals                                                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Medical Interpretation of Advanced Statistics:

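  • Tukey HSD: The pairwise comparisons show where the heart rate differences lie. Mean resting heart rate in the Sleep Apnea group is about 4.07 beats per minute higher than in the no-disorder group (adjusted \(p < 0.001\)) and about 2.62 higher than in the Insomnia group (adjusted \(p \approx 0.00007\)). The no-disorder group is about 1.45 lower than the Insomnia group (adjusted \(p = 0.012\)). All three comparisons are significant at the 5% family-wise level.
  • Two-Way ANOVA Interaction: The Gender:Sleep Disorder interaction term is significant (\(F = 8.675\), \(p = 0.000208\)), indicating that the association between sleep disorder category and heart rate differs between male and female subjects. This addresses the study's core question about sex-specific differences. Because the groups are unbalanced and aov() uses sequential sums of squares, the main-effect rows depend on the order in which the terms enter the model; the interaction row, entered last, does not.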
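
Statistical significance does not indicate how large each effect is. As a hedged sketch, partial eta-squared can be computed from the rounded sums of squares in the two-way table. This measure is an addition for illustration and was not reported in the original analysis.

```r
# Sums of squares copied from the two-way ANOVA table (rounded as printed)
ss <- c(gender = 301, disorder = 1645, interaction = 200)
ss_residual <- 4234

# Partial eta-squared: effect SS relative to effect SS plus residual SS
partial_eta_sq <- ss / (ss + ss_residual)

print(round(partial_eta_sq, 3))
```

On these figures the interaction term has a partial eta-squared of roughly 0.045, compared with roughly 0.28 for the sleep disorder term, so the sex-specific difference is detectable but comparatively small.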

7. Medical Machine Learning: Decision Tree Classification

To turn these findings into a predictive model, we train a decision tree to classify the presence or absence of a Sleep Disorder from demographic, sleep, and heart rate variables.

# Load required machine learning packages
# install.packages("rpart")
# install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
library(caret)  # provides createDataPartition() and confusionMatrix()

# Data preprocessing for machine learning
ml_sleep_data <- sleep_data %>%
  mutate(Disorder_Status = ifelse(`Sleep Disorder` == "None", "No Disorder", "Disorder Present"),
         Disorder_Status = as.factor(Disorder_Status),
         Gender = as.factor(Gender))

# Set random seed for reproducibility
set.seed(456)

# Split data into Training (80%) and Testing (20%) sets
train_idx <- createDataPartition(ml_sleep_data$Disorder_Status, p = 0.8, list = FALSE)
train_set <- ml_sleep_data[train_idx, ]
test_set  <- ml_sleep_data[-train_idx, ]

# Train Decision Tree Classifier
dt_model <- rpart(Disorder_Status ~ Gender + Age + `Sleep Duration` + `Quality of Sleep` + `Heart Rate`, 
                  data = train_set, 
                  method = "class")

# Plot the clinical decision flowchart
rpart.plot(dt_model, main = "Clinical Decision Tree for Sleep Disorder Prediction", 
           box.palette = "RdBu", shadow.col = "gray", nn = TRUE)

# Evaluate ML model accuracy on the test set
dt_predictions <- predict(dt_model, test_set, type = "class")
cat("\n--- Machine Learning Model Accuracy --- \n")
## 
## --- Machine Learning Model Accuracy ---
confusionMatrix(dt_predictions, test_set$Disorder_Status)
## Confusion Matrix and Statistics
## 
##                   Reference
## Prediction         Disorder Present No Disorder
##   Disorder Present               26           5
##   No Disorder                     5          38
##                                           
##                Accuracy : 0.8649          
##                  95% CI : (0.7655, 0.9332)
##     No Information Rate : 0.5811          
##     P-Value [Acc > NIR] : 1.229e-07       
##                                           
##                   Kappa : 0.7224          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8387          
##             Specificity : 0.8837          
##          Pos Pred Value : 0.8387          
##          Neg Pred Value : 0.8837          
##              Prevalence : 0.4189          
##          Detection Rate : 0.3514          
##    Detection Prevalence : 0.4189          
##       Balanced Accuracy : 0.8612          
##                                           
##        'Positive' Class : Disorder Present
##
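
The reported metrics all follow from the four cell counts of the confusion matrix. Below is a minimal sketch that recomputes them by hand, to make explicit what each figure measures. Variable names are illustrative.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(26,  5,
                5, 38),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("Disorder Present", "No Disorder"),
                             Reference  = c("Disorder Present", "No Disorder")))

n_total <- sum(cm)

# "Disorder Present" is the positive class
accuracy    <- sum(diag(cm)) / n_total
sensitivity <- cm[1, 1] / sum(cm[, 1])  # true positives / all actual positives
specificity <- cm[2, 2] / sum(cm[, 2])  # true negatives / all actual negatives

# No-information rate: accuracy of always predicting the larger reference class
no_info_rate <- max(colSums(cm)) / n_total

# Cohen's kappa: agreement beyond what the marginal totals would give by chance
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n_total^2
cohen_kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, no_info_rate = no_info_rate,
              kappa = cohen_kappa), 4))
```

The tree classifies 64 of the 74 held-out cases correctly (accuracy 0.865), well above the no-information rate of 0.581. The test set is small, however, and the 95% confidence interval for accuracy (0.766 to 0.933) is correspondingly wide.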