This repository contains practical implementations of various data mining and warehousing tasks. The projects utilize machine learning models, data preprocessing techniques, and clustering algorithms on different datasets like medical records, fuel consumption, customer segmentation, and academic performance.
- Diabetes Dataset
- Petrol Consumption Dataset
- Mall Customers Dataset
- Marks Dataset
- Label Encoding
- One Hot Encoding
- LR_SVM_DT_KNN_MLP_RF_GB_LGB
- Assignment
Read diabetes.csv. The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. Work through the following tasks in Python:
- a) Show the number of patients using a pie chart.
- b) Handle missing values (if any) using the mean for one column, the median for another, and the mode for a third.
- c) Plot the boxplot of the pre-processed dataset.
- d) Compare the performance of ML models such as LR, SVM, and DT.
- e) Show the confusion matrix of your results.
View the Jupyter Notebook for this task
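As a rough illustration of tasks (b), (d), and (e), the sketch below imputes missing values and compares three classifiers. It is not the notebook's code: it uses synthetic data from `make_classification` in place of diabetes.csv, and the column names `f0` to `f4` are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for diabetes.csv; f0..f4 are hypothetical column names.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
df["Outcome"] = y

# Blank out a few values so there is something to impute.
rng = np.random.default_rng(0)
for col in ["f0", "f1", "f2"]:
    df.loc[rng.choice(df.index, size=10, replace=False), col] = np.nan

# Task (b): mean for one column, median for another, mode for a third.
df["f0"] = df["f0"].fillna(df["f0"].mean())
df["f1"] = df["f1"].fillna(df["f1"].median())
df["f2"] = df["f2"].fillna(df["f2"].mode().iloc[0])

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="Outcome"), df["Outcome"], test_size=0.2, random_state=0
)

# Tasks (d) and (e): fit each model, then record accuracy and confusion matrix.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "confusion": confusion_matrix(y_test, pred),
    }
    print(name, round(results[name]["accuracy"], 3))
    print(results[name]["confusion"])
```

With the real dataset, `pd.read_csv("diabetes.csv")` would replace the synthetic frame, and the columns to impute would depend on where values are actually missing.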
Read petrol_consumption.csv, then work through the following tasks in Python:
- a) Predict the fuel consumption using multiple linear regression.
- b) Show and compare the results using 70:30 and 80:20 train/test splits.
- c) Show the actual and predicted values in a scatter plot for the 80:20 split.
- d) Find the Mean Absolute Error.
View the Jupyter Notebook for this task
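The split comparison in tasks (a), (b), and (d) can be sketched roughly as follows. This is a minimal example on synthetic regression data rather than petrol_consumption.csv; the four predictors are hypothetical stand-ins for the real columns.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for petrol_consumption.csv (4 hypothetical predictors).
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

mae_by_split = {}
for test_size in (0.3, 0.2):  # 70:30 and 80:20 splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    # Multiple linear regression: one target, several predictors.
    model = LinearRegression().fit(X_train, y_train)
    mae_by_split[test_size] = mean_absolute_error(y_test, model.predict(X_test))

for test_size, mae in mae_by_split.items():
    print(f"test_size={test_size}: MAE={mae:.2f}")
```

Fixing `random_state` keeps the two splits reproducible, so any difference in MAE comes from the split ratio rather than from a different shuffle.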
Load the Mall_Customers.csv file. Then do the following:
- a) Visualize male and female customer spending scores.
- b) Find the ideal number of k using the elbow method.
- c) Apply k-means clustering using 4 clusters and 5 clusters.
- d) Draw the graph.
View the Jupyter Notebook for this task
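For tasks (b) and (c), one possible approach is to record the k-means inertia for a range of k values and look for the bend in the curve. The sketch below assumes synthetic 2-D blobs in place of Mall_Customers.csv, so the feature columns are placeholders.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points standing in for two customer features.
X, _ = make_blobs(n_samples=200, centers=5, n_features=2, random_state=0)

# Elbow method: inertia (within-cluster sum of squares) for k = 1..10.
inertias = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))

# Cluster assignments for the two values of k named in the task.
labels_4 = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
labels_5 = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
```

Plotting `inertias` against k shows where the decrease flattens out; that bend is the usual candidate for the number of clusters.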
Load the Marks.csv file. Then do the following:
- a) Write the statement to display the first and third quartiles of all subjects
- b) Find the standard deviation and variance of each subject
- c) Find the summary of the data
View the Jupyter Notebook for this task
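The three statements asked for here map onto standard pandas calls. A minimal sketch, assuming a small made-up marks table (the subject names and values are illustrative, not taken from Marks.csv):

```python
import pandas as pd

# Made-up marks; the real data would come from pd.read_csv("Marks.csv").
marks = pd.DataFrame(
    {
        "Math": [50, 60, 70, 80, 90],
        "Physics": [45, 55, 65, 75, 85],
        "English": [62, 68, 71, 77, 90],
    }
)

# (a) First and third quartiles of every subject.
quartiles = marks.quantile([0.25, 0.75])

# (b) Sample standard deviation and variance (pandas uses ddof=1 by default).
std = marks.std()
var = marks.var()

# (c) Summary statistics: count, mean, std, min, quartiles, max.
summary = marks.describe()

print(quartiles)
print(std)
print(var)
print(summary)
```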
This section covers the Label Encoding technique for transforming categorical data into a numerical format.
- Apply Label Encoding to categorical variables in datasets.
- Visualize the transformations.
View the Label Encoding Notebook
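As a small illustration of the idea, the sketch below uses scikit-learn's `LabelEncoder` on a made-up list of colors (the values are an assumption for demonstration, not data from the notebook). Each distinct category is mapped to an integer, with classes ordered alphabetically.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up categorical values for illustration.
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

print(list(encoder.classes_))  # ['blue', 'green', 'red']
print(codes.tolist())          # [2, 1, 0, 1]

# The mapping is reversible.
decoded = encoder.inverse_transform(codes)
print(list(decoded))           # ['red', 'green', 'blue', 'green']
```

The integers carry an implicit order (blue < green < red), which some models may treat as meaningful; that is the usual reason to prefer one hot encoding for nominal features.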
This section focuses on One Hot Encoding for converting categorical data into a format suitable for machine learning algorithms.
- Apply One Hot Encoding to transform categorical variables.
- Show how to handle categorical features in machine learning pipelines.
View the One Hot Encoding Notebook
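One common way to do this in pandas is `get_dummies`, sketched below on a made-up frame (the `color` and `size` columns are hypothetical). Each category becomes its own 0/1 column, and non-categorical columns pass through unchanged.

```python
import pandas as pd

# Made-up data: one categorical column, one numeric column.
df = pd.DataFrame(
    {
        "color": ["red", "green", "blue", "green"],
        "size": [1, 2, 3, 4],
    }
)

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```

Inside a scikit-learn pipeline, `OneHotEncoder` would be the closer fit, since it remembers the categories seen during fitting and can be applied consistently to new data.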
This section focuses on the performance comparison of multiple classifiers such as Logistic Regression (LR), SVM, Decision Trees, KNN, MLP, Random Forest (RF), Gradient Boosting (GB), and LightGBM (LGB).
- Train multiple classifiers on the diabetes dataset.
- Compare the performance using accuracy, confusion matrix, and F1 score.
- Plot the results for visualization.
View the Classifier Comparison Notebook
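A comparison like this can be organized as a loop that collects metrics into one table. The sketch below is an assumption-laden example, not the notebook's code: it uses synthetic data instead of the diabetes dataset and covers only three of the listed models (KNN, RF, GB). LightGBM is left out because it needs a separate package.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the diabetes dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
}

# Fit each model and collect accuracy and F1 on the held-out set.
rows = []
for name, clf in classifiers.items():
    pred = clf.fit(X_train, y_train).predict(X_test)
    rows.append(
        {
            "model": name,
            "accuracy": accuracy_score(y_test, pred),
            "f1": f1_score(y_test, pred),
        }
    )

scores = pd.DataFrame(rows).set_index("model").sort_values("f1", ascending=False)
print(scores)
```

Keeping the scores in a DataFrame makes the plotting step straightforward, for example with `scores.plot.bar()`.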
This assignment focuses on applying data preprocessing techniques to a dataset.
- Implement Label Encoding and One Hot Encoding to handle categorical data.
- Plot correlation heatmaps to visualize relationships between variables.
- Apply standardization to scale features for model training.
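The correlation and standardization steps can be sketched as follows. The data here is randomly generated and the column names (`height_cm`, `weight_kg`, `age`) are hypothetical, since the assignment's dataset is not specified.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Randomly generated numeric features; column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "height_cm": rng.normal(170, 10, 100),
        "weight_kg": rng.normal(70, 15, 100),
        "age": rng.integers(18, 65, 100),
    }
)

# Pairwise Pearson correlations; this matrix is what a heatmap would display.
corr = df.corr()
print(corr.round(2))

# Standardization: each column is rescaled to zero mean and unit variance.
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled.describe().round(2))
```

The `corr` matrix can be passed to a plotting function such as seaborn's `heatmap` to produce the visualization.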
Clone the repository:

    git clone https://github.com/nishatrhythm/Data-Mining-and-Data-Warehousing-Lab.git

Ensure that you have Python installed, along with all necessary dependencies. You can install the dependencies using the requirements.txt file:

    pip install -r requirements.txt

Navigate to the respective dataset directory and run the corresponding Python scripts or open the Jupyter notebooks to experiment with the code.
This project is licensed under the MIT License - see the LICENSE file for details.