This repository was archived by the owner on Oct 27, 2024. It is now read-only.

nishatrhythm/Data-Mining-and-Data-Warehousing-Lab


Data Mining and Data Warehousing Lab

This repository contains practical implementations of various data mining and data warehousing tasks. The exercises apply machine learning models, data preprocessing techniques, and clustering algorithms to datasets covering medical records, fuel consumption, customer segmentation, and academic performance.

Table of Contents

  1. Diabetes Dataset
  2. Petrol Consumption Dataset
  3. Mall Customers Dataset
  4. Marks Dataset
  5. Label Encoding
  6. One Hot Encoding
  7. LR_SVM_DT_KNN_MLP_RF_GB_LGB
  8. Assignment

1. Diabetes Dataset

Read diabetes.csv. The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. Complete the following tasks in Python:

Tasks:

  • a) Show the patient counts using a pie chart.
  • b) Handle missing values (if any) using the mean for one column, the median for another, and the mode for a third.
  • c) Plot the boxplot of the pre-processed dataset.
  • d) Compare the performance of ML models such as LR, SVM, and DT.
  • e) Show the confusion matrix of your results.
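
Task (b) can be sketched without any third-party dependencies. The following is a minimal illustration, not the notebook's code: the `impute` helper and the toy column values are hypothetical, and `None` is assumed to mark a missing reading.

```python
import statistics

def impute(values, strategy):
    """Return a copy of `values` with None entries replaced by a summary statistic."""
    observed = [v for v in values if v is not None]
    fill = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns (not taken from diabetes.csv).
bmi = impute([22.0, None, 30.0], "mean")         # gap filled with the mean of observed values
insulin = impute([80, 95, None, 300], "median")  # the median is less sensitive to the outlier 300
pregnancies = impute([1, 1, None, 4], "mode")    # the mode suits small discrete counts
```

With pandas, the same idea is usually written per column as `fillna` combined with `mean()`, `median()`, or `mode()[0]`.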

View the Jupyter Notebook for this task


2. Petrol Consumption Dataset

Read petrol_consumption.csv and complete the following tasks in Python:

Tasks:

  • a) Predict the fuel consumption using multiple linear regression.
  • b) Show and compare the results using 70:30 and 80:20 train/test splits.
  • c) Show the actual and predicted values in a scatter plot for the 80:20 split.
  • d) Find the Mean Absolute Error.
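
The split ratios in (b) and the metric in (d) can be illustrated with a small dependency-free sketch. The helper names `split_rows` and `mean_absolute_error` and the placeholder rows are assumptions for illustration; the regression model itself is not shown.

```python
import random

def split_rows(rows, test_fraction, seed=0):
    """Shuffle a copy of `rows` and return (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Ten placeholder rows standing in for the records of the dataset.
rows = list(range(10))
train_70, test_30 = split_rows(rows, 0.30)  # 70:30 split
train_80, test_20 = split_rows(rows, 0.20)  # 80:20 split
```

Fixing the seed keeps the two splits reproducible, which makes the 70:30 versus 80:20 comparison easier to repeat.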

View the Jupyter Notebook for this task


3. Mall Customers Dataset

Load the Mall_Customers.csv file.

Tasks:

  • a) Visualize male and female customer spending scores.
  • b) Find the ideal value of k (the number of clusters) using the elbow method.
  • c) Apply k-means clustering using 4 clusters and 5 clusters.
  • d) Plot the resulting clusters.
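
The elbow method in (b) compares the within-cluster sum of squares (inertia) across several values of k. Below is a minimal sketch of Lloyd's k-means algorithm in plain Python, assuming points are tuples of numbers; the `kmeans` function and the toy points are hypothetical and are not drawn from Mall_Customers.csv.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    """Run Lloyd's algorithm; return (centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    inertia = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, inertia

# Two well-separated toy groups; inertia drops sharply from k=1 to k=2.
points = [(1, 1), (1, 2), (10, 10), (10, 11)]
inertias = {k: kmeans(points, k)[1] for k in range(1, 4)}
```

Plotting `inertias` against k and looking for the point where the curve flattens gives the elbow.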

View the Jupyter Notebook for this task


4. Marks Dataset

Load the Marks.csv file. Then do the following:

Tasks:

  • a) Write the statement to display the first and third quartiles of all subjects.
  • b) Find the standard deviation and variance of each subject.
  • c) Find the summary of the data.
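
The statistics in (a) and (b) can be sketched with Python's standard `statistics` module. The `describe` helper and the scores below are hypothetical and stand in for a single subject column of Marks.csv.

```python
import statistics

def describe(scores):
    """Return the quartiles and sample dispersion for one subject."""
    # method="inclusive" interpolates linearly between data points,
    # treating the data as including its own minimum and maximum.
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "q1": q1,
        "q3": q3,
        "std": statistics.stdev(scores),     # sample standard deviation (n - 1)
        "var": statistics.variance(scores),  # sample variance (n - 1)
    }

# Hypothetical marks for one subject.
math_scores = [40, 50, 60, 70, 80]
summary = describe(math_scores)
```

The sample (n - 1) versions are used here because pandas also defaults to them in `std()` and `var()`.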

View the Jupyter Notebook for this task


5. Label Encoding

This section covers Label Encoding, a technique for transforming categorical data into a numerical format.

Tasks:

  • Apply Label Encoding to categorical variables in datasets.
  • Visualize the transformations.
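
As a rough illustration of the technique, the sketch below maps each distinct category to an integer code. The `label_encode` helper and the colour values are hypothetical; the categories are sorted first so that the codes are deterministic.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    classes = sorted(set(values))
    index = {category: code for code, category in enumerate(classes)}
    return [index[v] for v in values], classes

# Hypothetical categorical column.
codes, classes = label_encode(["red", "green", "blue", "green"])
```

The integer codes imply an ordering (blue < green < red here), which is acceptable for ordinal data or tree-based models but can mislead models that treat the codes as magnitudes.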

View the Label Encoding Notebook


6. One Hot Encoding

This section focuses on One Hot Encoding for converting categorical data into a format suitable for machine learning algorithms.

Tasks:

  • Apply One Hot Encoding to transform categorical variables.
  • Show how to handle categorical features in machine learning pipelines.
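
A minimal sketch of the idea: each category becomes its own 0/1 column, and every row has exactly one 1. The `one_hot_encode` helper and the sample values are hypothetical.

```python
def one_hot_encode(values):
    """Return (rows, categories), where each row has a single 1 in its category's column."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

# Hypothetical categorical column.
rows, categories = one_hot_encode(["red", "green", "blue", "green"])
```

Unlike integer label codes, this representation does not impose an order on the categories, at the cost of one extra column per category.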

View the One Hot Encoding Notebook


7. LR_SVM_DT_KNN_MLP_RF_GB_LGB

This section focuses on the performance comparison of multiple classifiers such as Logistic Regression (LR), SVM, Decision Trees, KNN, MLP, Random Forest (RF), Gradient Boosting (GB), and LightGBM (LGB).

Tasks:

  • Train multiple classifiers on the diabetes dataset.
  • Compare the performance using accuracy, confusion matrix, and F1 score.
  • Plot the results for visualization.
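
The comparison metrics can be computed directly from the four confusion-matrix counts. The following sketch assumes binary labels with 1 as the positive class; the helper names and the label lists are hypothetical and are not results from the notebook.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no positives at all."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical true and predicted labels.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
```

Computing the same counts for every classifier puts accuracy and F1 on a common footing, which matters when the two classes are imbalanced.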

View the Classifier Comparison Notebook


8. Assignment

This assignment focuses on applying data preprocessing techniques to a dataset.

Tasks:

  • Implement Label Encoding and One Hot Encoding to handle categorical data.
  • Plot correlation heatmaps to visualize relationships between variables.
  • Apply standardization to scale features for model training.
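
Standardization rescales a feature to zero mean and unit variance via z = (x - mean) / std. A minimal sketch using the population standard deviation; the `standardize` helper and the sample values are hypothetical.

```python
import statistics

def standardize(values):
    """Rescale `values` to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Note: a constant column has std == 0 and would need special handling.
    return [(v - mean) / std for v in values]

# Hypothetical feature column.
scaled = standardize([2, 4, 6])
```

In a modelling workflow, the mean and standard deviation should be computed on the training split only and then reused to transform the test split.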

View the Assignment Notebook

Getting Started

Clone the repository:

git clone https://github.com/nishatrhythm/Data-Mining-and-Data-Warehousing-Lab.git

Prerequisites

Ensure that you have Python installed, along with all necessary dependencies. You can install the dependencies using the requirements.txt file:

pip install -r requirements.txt

Usage

Navigate to the respective dataset directory and run the corresponding Python scripts or open the Jupyter notebooks to experiment with the code.


License

This project is licensed under the MIT License - see the LICENSE file for details.
