Vishalini Balasubramani Data Science · Machine Learning · Analytics
Healthcare Analytics · Machine Learning · EDA

PCOS Prediction

Data science project using clinical and lifestyle data from 541 participants to analyse symptoms, hormone correlations and lifestyle impacts in PCOS, culminating in a supervised machine-learning prediction system achieving 94.45% accuracy.

2024 · Programming for Data Analysis
Domain: Healthcare · Medical Data Analytics · Kaggle Dataset
Outcome: Random Forest model – 94.45% accuracy

Project summary

This project analysed a Kaggle PCOS clinical dataset comprising 541 patient records and 45 medical, physiological and lifestyle features to explore symptom prevalence, hormone relationships, and long-term metabolic risk associated with Polycystic Ovary Syndrome (PCOS).

The work combined statistical EDA, correlation analysis, feature engineering, and multiple classification models to predict PCOS diagnosis (Y/N), followed by cluster analysis to identify distinct disorder subtypes for potential future clinical interpretation.

Analytical questions

  • How do dietary habits and exercise influence hormone levels and follicle counts in PCOS?
  • Does early age of onset lead to elevated metabolic risk (e.g., insulin resistance, diabetes)?
  • How does BMI relate to severity of symptoms such as hair growth, weight gain, pimples, skin darkening, hair loss and menstrual irregularities?

Data preparation & modelling workflow

  • EDA: Class imbalance analysis (67% PCOS-negative vs 33% PCOS-positive cases), lifestyle visualisations (food habits vs exercise), and symptom distribution assessments.
  • Data cleaning: Converted mixed data to numeric, computed missing BMI from height & weight, imputed remaining null values, removed low-value variables and capped numeric outliers using the IQR method.
  • Feature standardisation: Z-score normalisation across 31+ clinical predictors.
  • Models trained: Logistic Regression, Random Forest, SVM, KNN, and Decision Tree classifiers.
  • Evaluation: Accuracy comparison across models to identify the best-performing classifier.
  • Advanced analysis: K-means clustering (K=3) to identify PCOS sub-phenotype groupings using age vs hormone levels.
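
The cleaning and model-comparison steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (weight_kg, height_cm, bmi, follicles, pcos), the synthetic demo data, the 80/20 split and the default model settings are all assumptions, and the accuracies it produces say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def clean(df: pd.DataFrame, continuous: list) -> pd.DataFrame:
    """Coerce to numeric, fill missing BMI, impute medians, cap outliers."""
    df = df.apply(pd.to_numeric, errors="coerce")  # non-numeric text -> NaN
    # BMI = weight (kg) / height (m)^2; height is assumed to be in cm.
    computed_bmi = df["weight_kg"] / (df["height_cm"] / 100) ** 2
    df["bmi"] = df["bmi"].fillna(computed_bmi)
    df = df.fillna(df.median())  # median imputation for remaining nulls
    # IQR capping is applied to continuous columns only; capping a rare
    # binary flag this way could flatten it to a constant.
    for col in continuous:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df


def compare_models(df: pd.DataFrame, target: str, seed: int = 0) -> dict:
    """Return held-out accuracy for each of the five classifier types."""
    X, y = df.drop(columns=target), df[target].astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "SVM": SVC(),
        "KNN": KNeighborsClassifier(),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        # Z-score scaling is fitted on the training split only (no leakage).
        pipe = make_pipeline(StandardScaler(), model)
        scores[name] = pipe.fit(X_tr, y_tr).score(X_te, y_te)
    return scores


# Synthetic stand-in data (hypothetical values, not the Kaggle records).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "weight_kg": rng.normal(60, 10, n),
    "height_cm": rng.normal(158, 6, n),
    "follicles": rng.poisson(8, n).astype(float),
})
demo["bmi"] = demo["weight_kg"] / (demo["height_cm"] / 100) ** 2
demo.loc[::10, "bmi"] = np.nan  # simulate missing BMI entries
demo["pcos"] = (demo["follicles"] + rng.normal(0, 2, n) > 9).astype(int)

cleaned = clean(demo, continuous=["weight_kg", "height_cm", "bmi", "follicles"])
scores = compare_models(cleaned, target="pcos")
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

Wrapping the scaler and classifier in one pipeline keeps the standardisation statistics tied to the training split. With only a few hundred records, a single split gives a noisy accuracy estimate, so cross-validation would be a reasonable alternative.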

Key findings & outcomes

  • Best ML model: Random Forest achieved 94.45% classification accuracy, outperforming SVM, Logistic Regression, KNN and Decision Tree.
  • Strong correlations: Left and right ovary follicle counts were highly correlated (≈ 0.80), while LH and FSH hormone levels showed a moderate correlation (≈ 0.43).
  • Symptom associations: Skin darkening showed one of the stronger symptom correlations with PCOS (≈ 0.47). BMI and weight were highly correlated (≈ 0.90), reinforcing obesity as an aggravating factor.
  • Lifestyle impact: Fast-food consumption linked to higher follicle counts; exercise alone was not sufficient without dietary improvement.
  • Age patterns: Highest PCOS onset observed in reproductive ages 25–34, with insulin resistance indicated by abnormal random blood sugar levels.
  • Clustering: K-means revealed three clinical subgroups with varying FSH hormone profiles, potentially supporting personalised treatment segmentation.
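
For readers who want to see how the correlation and clustering steps might look in code, here is a small sketch under stated assumptions. The data is synthetic (three made-up groups in age and FSH space with arbitrary values), the column names age and fsh are hypothetical, and scaling before K-means is a choice of this sketch rather than something confirmed for the project.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in data: three loose groups in (age, FSH) space.
centres = [(24, 4.0), (30, 7.0), (38, 11.0)]
frames = [
    pd.DataFrame({
        "age": rng.normal(age, 1.5, 60),
        "fsh": rng.normal(fsh, 0.8, 60),
    })
    for age, fsh in centres
]
data = pd.concat(frames, ignore_index=True)

# Pearson correlation matrix, the kind of table behind figures such as
# the follicle-count and LH-FSH correlations listed above.
corr = data.corr()
print(corr.round(2))

# Standardise first so both features contribute equally to the
# Euclidean distances that K-means minimises.
X = StandardScaler().fit_transform(data[["age", "fsh"]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
data["cluster"] = km.labels_

# Mean age and FSH per cluster: a simple profile of each subgroup.
profile = data.groupby("cluster")[["age", "fsh"]].mean()
print(profile.round(2))
```

Cluster labels from K-means are arbitrary integers, so subgroups are best described by their profile (the per-cluster means) rather than by label number.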

What I learned

This project strengthened my ability to handle real-world, imperfect healthcare data: missing values, class imbalance, and noisy biological variables that require careful outlier handling and interpretation.

I developed hands-on experience with the full data-science lifecycle: EDA → preprocessing → ML modelling → evaluation → interpretation → clustering. Just as importantly, the project improved my skills in translating statistics and machine-learning results into clinically meaningful insights rather than stopping at raw model metrics.