Diabetes Insight is a demonstration project that generates a well-structured PDF report based on user input.
A pretrained machine learning model predicts whether a user is diabetic, or estimates their diabetes risk score. SHAP values are calculated and, together with the prediction, passed to gpt-5-mini, which produces a user-friendly, medically-styled explanation. This response is then formatted into a downloadable PDF report.
This project is a practice exercise that explores how traditional machine learning, model explainability, and LLM engineering can work together in a healthcare-style application.
It is not intended for real clinical use.
- Technologies Used
- Features
- Model Training
- Project Structure
- Environment Variables
- Installation
- Screenshots
- Notes
- License
- Contributing
- Author & Contact
- Frontend: Dash
- Backend: FastAPI
- Machine Learning: Custom-trained tree-based models (CatBoost + variants)
- LLM Integration: gpt-5-mini with prompt engineering
- PDF Generation: Python libraries for layout and export
- Docker
- Generate a diabetes diagnosis report based on user input
- Generate a diabetes risk score report
- Automatic SHAP explainability
- Automatic LLM-generated narrative for users
- Cleanly formatted PDF download
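The hand-off from SHAP values to the LLM narrative can be pictured roughly as follows. This is a minimal sketch, not the project's actual prompt code: the function names, the prompt wording, and the feature names are all hypothetical, and it assumes the SHAP values arrive as a plain feature-to-value mapping.

```python
from typing import Mapping


def top_contributions(shap_values: Mapping[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k features with the largest absolute SHAP value."""
    ranked = sorted(shap_values.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


def build_prompt(prediction: str, shap_values: Mapping[str, float], k: int = 3) -> str:
    """Assemble a plain-text prompt from a prediction and its SHAP values."""
    lines = [
        "You are writing a clear, non-alarming explanation for a patient-style report.",
        f"Model prediction: {prediction}",
        "Most influential features (positive values push toward diabetes):",
    ]
    for name, value in top_contributions(shap_values, k):
        # A positive SHAP value raises the predicted risk, anything else lowers it.
        direction = "increases" if value > 0 else "decreases"
        lines.append(f"- {name}: SHAP {value:+.3f} ({direction} the predicted risk)")
    lines.append("Explain these factors in plain language. Do not give medical advice.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical feature names and values, for illustration only.
    example = {"hba1c": 0.82, "bmi": 0.31, "age": -0.12, "physical_activity": -0.40}
    print(build_prompt("diabetic", example))
```

Passing only the top few contributions keeps the prompt short and steers the model toward the factors that actually drove the prediction.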
The training pipeline includes data cleaning, exploratory analysis, feature processing, model training, and SHAP explainability.
The following models were evaluated:
- KNN
- Logistic Regression
- Decision Tree
- Voting Classifier
- Random Forest
- Gradient Boosting
- AdaBoost
- Extra Trees
- XGBoost
- LightGBM
- CatBoost
CatBoost showed slightly better performance and was selected for both final models.
Other tree-based methods performed similarly.
Note: Some training notebooks take considerable time to run, even on modern machines (late 2025).
- models/ — Serialized CatBoost classification models
- routers/reports.py — Two API endpoints for generating reports
- limiter.py — Simple rate limiter to prevent abuse
- main.py — FastAPI initialization + CORS middleware
- models.py — Pydantic request validation schemas
- utils.py — Utility functions used in the endpoints
- frontend.ipynb — Dash application (UI built with Dash Bootstrap Components & Templates)
- Jupyter notebooks with EDA, preprocessing, and model training
- models/ — Final models (mirrors backend/models)
- parquet files — Processed datasets
- diabetes_dataset.csv — Original Kaggle dataset
- utils.py — Common helper functions
- Final_training.ipynb — Re-training and SHAP generation workflow
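To give a sense of what a simple rate limiter such as the one in limiter.py might do, here is a minimal sliding-window sketch. It is an illustration under assumptions, not the contents of limiter.py: the class name, its parameters, and the per-client keying are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        """Record a request and report whether it is within the limit."""
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In an API, `allow` would typically be called with the client's IP address before a report is generated, and a rejected call answered with HTTP 429.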
Include a .env file with:
- OPENAI_API_KEY=your_api_key
- API_URL=http://frontend:8050
- BACKEND_URL=http://backend:8000/reports
- If you decide to use a different API_URL, make sure to update this line of code in frontend.py (updating frontend.ipynb as well is optional, but keeps the two consistent):

```python
if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8050)
```

- The BACKEND_URL currently includes the /reports prefix, since it is the only one called by the frontend in the current version of the app. If you decide to expand the app, make sure to remove the prefix from BACKEND_URL and update the relevant parts of the frontend code, namely these two:

```python
response = requests.post(f"{BACKEND_URL}/diagnosis", json=input_data)
response = requests.post(f"{BACKEND_URL}/risk_score", json=input_data)
```

- Docker
- Python 3.13 (optional)
- This app uses some Python packages, such as weasyprint, that tend to behave differently on different machines. Running it with Docker is recommended to avoid such issues.
- Some of the packages used during training, such as cupy, behave very differently depending on both OS and hardware. For example, cupy requires a supported GPU, typically an NVIDIA GPU with CUDA drivers. For this reason there is no requirements.txt listing all packages used in training, but the notebooks are still included.
```shell
git clone https://github.com/Sebastijan-Dominis/diabetes-insight
cd diabetes-insight
docker-compose build --no-cache
docker compose up
```

- If you used the default ports, you can now access the frontend on localhost:8050, and the backend on localhost:8000/docs.
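Once the containers are running, the frontend reaches both report endpoints through BACKEND_URL. The sketch below shows one way the two URLs could be derived from the environment. The helper name `report_endpoints` is hypothetical and not part of the project; the fallback value simply mirrors the one in the .env example.

```python
import os
from typing import Mapping, Optional


def report_endpoints(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Build the two report endpoint URLs from BACKEND_URL."""
    env = os.environ if env is None else env
    # Fallback mirrors the value from the .env example; a trailing slash is tolerated.
    base = env.get("BACKEND_URL", "http://backend:8000/reports").rstrip("/")
    return {
        "diagnosis": f"{base}/diagnosis",
        "risk_score": f"{base}/risk_score",
    }


if __name__ == "__main__":
    for name, url in report_endpoints().items():
        print(f"{name}: {url}")
```

Keeping the URL construction in one place means that removing the /reports prefix later would require a change in a single function instead of at each call site.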
Below are examples of how the app looks and what the generated reports contain.
- SHAP values are computed on the backend for inference-time explainability.
- The included dataset is synthetic and the project is for learning/demonstration only.
- The frontend is intentionally simplified; in production, a React or Vue SPA would be preferable.
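On the first note above: SHAP values are additive, so a prediction can be reconstructed from the explainer's base value plus the per-feature contributions. The sketch below assumes the values are expressed in log-odds space, which is the usual default for tree explainers on gradient-boosted classifiers; whether this project uses that default is an assumption, and the function name is hypothetical.

```python
import math


def probability_from_shap(base_value: float, shap_values: list[float]) -> float:
    """Reconstruct the predicted probability from log-odds SHAP values."""
    # Additivity: raw model output = base value + sum of feature contributions.
    log_odds = base_value + sum(shap_values)
    # The logistic function maps log-odds to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Comparing this reconstructed probability with the model's own output is a cheap sanity check that the SHAP values passed to the LLM match the prediction being explained.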
- This repository includes a LICENSE file; please review it for terms of reuse.
- Improvements and bug fixes welcome. Open an issue or submit a pull request with a clear description of the change.
- Author: Sebastijan Dominis
- Contact: sebastijan.dominis99@gmail.com




