PCOS Prediction
Data science project using clinical and lifestyle data from 541 participants to analyse symptoms, hormone correlations and lifestyle impacts in PCOS, culminating in a supervised machine-learning prediction system achieving 94.45% accuracy.
Project summary
This project analysed a Kaggle PCOS clinical dataset comprising 541 patient records and 45 medical, physiological and lifestyle features to explore symptom prevalence, hormone relationships, and long-term metabolic risk associated with Polycystic Ovary Syndrome (PCOS).
The work combined statistical EDA, correlation analysis, feature engineering, and multiple classification models to predict PCOS diagnosis (Y/N), followed by cluster analysis to identify distinct disorder subtypes for potential future clinical interpretation.
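As a rough illustration of the model-comparison step, the sketch below trains the five classifier families on a synthetic stand-in dataset and ranks them by test accuracy. It assumes scikit-learn; the synthetic data, the 80/20 stratified split, the hyperparameters and the `accuracies` dictionary are illustrative assumptions, not the project's actual settings, so the printed scores will not match the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the PCOS table (assumption): 541 rows, 31 features,
# roughly 67% negative and 33% positive labels.
X, y = make_classification(
    n_samples=541, n_features=31, n_informative=8,
    weights=[0.67, 0.33], random_state=42,
)

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameters here are placeholders, not tuned values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

accuracies = {}
for name, model in models.items():
    # Scaling sits inside the pipeline so it is fitted on training data only.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, pipeline.predict(X_test))

# Print models from highest to lowest test accuracy.
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```

Keeping the scaler inside each pipeline is a design choice that avoids leaking test-set statistics into training; tree-based models do not need scaling, but it does no harm here.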
Analytical questions
- How do dietary habits and exercise influence hormone levels and follicle counts in PCOS?
- Does early age of onset lead to elevated metabolic risk (e.g., insulin resistance, diabetes)?
- How does BMI relate to severity of symptoms such as hair growth, weight gain, pimples, skin darkening, hair loss and menstrual irregularities?
Data preparation & modelling workflow
- EDA: Class imbalance analysis (67% negative vs 33% PCOS positive cases), lifestyle visualisations (food habits vs exercise), and symptom distribution assessments.
- Data cleaning: Converted mixed-type columns to numeric, computed missing BMI values from height and weight, imputed remaining null values, removed low-value variables, and capped numeric outliers using the IQR method.
- Feature standardisation: Z-score normalisation across 31+ clinical predictors.
- Models trained: Logistic Regression, Random Forest, SVM, KNN, and Decision Tree classifiers.
- Evaluation: Accuracy comparison across models to identify the best-performing predictor.
- Advanced analysis: K-means clustering (K=3) to identify PCOS sub-phenotype groupings based on age and hormone levels.
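The cleaning and standardisation steps above can be sketched roughly as follows, assuming pandas and NumPy. The column names (`weight_kg`, `height_cm`, `bmi`), the helper functions, the 1.5 × IQR fence and the toy values are all hypothetical choices made for illustration; the real dataset's column names and thresholds may differ.

```python
import numpy as np
import pandas as pd


def fill_missing_bmi(df, weight_col="weight_kg", height_col="height_cm", bmi_col="bmi"):
    """Fill missing BMI values as weight (kg) / height (m) squared."""
    out = df.copy()
    computed = out[weight_col] / (out[height_col] / 100.0) ** 2
    out[bmi_col] = out[bmi_col].fillna(computed)
    return out


def cap_outliers_iqr(series, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def zscore(df):
    """Standardise each column to zero mean and unit (population) variance."""
    return (df - df.mean()) / df.std(ddof=0)


# Toy rows invented for illustration; they are not taken from the PCOS dataset.
toy = pd.DataFrame({
    "weight_kg": [50.0, 60.0, 70.0, 80.0, 200.0],
    "height_cm": [160.0, 160.0, 170.0, 170.0, 180.0],
    "bmi": [np.nan, 23.4, np.nan, 27.7, 61.7],
})

cleaned = fill_missing_bmi(toy)
cleaned["weight_kg"] = cap_outliers_iqr(cleaned["weight_kg"])
scaled = zscore(cleaned)
print(cleaned)
print(scaled.round(2))
```

Capping (rather than dropping) outliers keeps all 541 records available for modelling, which matters on a dataset this small.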
Key findings & outcomes
- Best ML model: Random Forest achieved 94.45% classification accuracy, outperforming SVM, Logistic Regression, KNN and Decision Tree.
- Correlations: Follicle counts in the left and right ovaries were highly correlated (≈ 0.80), while LH and FSH showed a moderate correlation (≈ 0.43).
- Symptom associations: Skin darkening showed a moderate correlation with PCOS (≈ 0.47). BMI and weight were highly correlated (≈ 0.90), reinforcing obesity as an aggravating factor.
- Lifestyle impact: Fast-food consumption linked to higher follicle counts; exercise alone was not sufficient without dietary improvement.
- Age patterns: Highest PCOS onset observed in reproductive ages 25–34, with possible insulin resistance suggested by abnormal random blood sugar levels.
- Clustering: K-means revealed three clinical subgroups with varying FSH hormone profiles, potentially supporting personalised treatment segmentation.
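The clustering step could look something like the sketch below, which runs K-means with K=3 on standardised age and FSH values using scikit-learn. The synthetic points, the group centres, and the choice of exactly these two features are assumptions for illustration only; they are not the project's data or its exact feature set.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic (age, FSH) pairs in three loose groups; all values are invented.
centres = np.array([[24.0, 4.0], [31.0, 7.0], [38.0, 11.0]])
points = np.vstack(
    [c + rng.normal(scale=[1.5, 0.8], size=(60, 2)) for c in centres]
)

# Standardise first so age and FSH contribute on comparable scales.
scaled = StandardScaler().fit_transform(points)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)

# Summarise each cluster in the original (unscaled) units.
for k in range(3):
    members = points[labels == k]
    print(
        f"cluster {k}: n={len(members)}, "
        f"mean age={members[:, 0].mean():.1f}, "
        f"mean FSH={members[:, 1].mean():.2f}"
    )
```

Cluster numbering is arbitrary, so any interpretation should rest on the per-cluster summaries rather than on the label values themselves.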
What I learned
This project strengthened my ability to handle real-world, imperfect healthcare data, from missing values and class imbalance to noisy biological variables requiring careful outlier handling and interpretation.
I developed hands-on experience with the full data-science lifecycle: EDA → preprocessing → ML modelling → evaluation → interpretation → clustering. Just as importantly, the project improved my skills in translating statistics and machine-learning results into clinically meaningful insights rather than stopping at raw model metrics.