This is a repository for running predictions using the winning densenet model from the HPA Kaggle Image Classification Challenge, together with the image preprocessing the model needs to perform well. The package can also perform dimensionality reduction using the UMAP package.
The package is centered around using separate commands for different parts of the pipeline. Currently available commands are:
- preprocess: Preprocessing a set of images for future predictions.
- predict: Prediction using the HPA Densenet model.
- dimred: Perform dimensionality reduction on the output from the HPA Densenet model.
- umapNd: Generates a CSV file with the previous results as a data source to plot an N-dimensional UMAP.
If you have any questions on how to use the code in this repo, please ask them by opening an issue here on GitHub or by contacting @cwinsnes.
Note: It is recommended to run the module in a separate virtual environment, such as Anaconda or venv, to avoid any issues with package versions.
Installing the required modules for this package can be done through pip.
python -m pip install -r requirements.txt
The model requires the input images to be stored in the following way:
- All images should be in a single folder.
- Each image should be split into 4 separate files, one per color channel.
- The names of the images should fit the pattern <FILENAME>_{red,blue,yellow,green}.<file_extension>
  - For example file1_red.jpg or 2222_517_E3_1_yellow.png
- The images should follow the format of the HPA image dataset:
  - The red images should contain a microtubule marker.
  - The blue images should contain DAPI markers.
  - The yellow images should contain an endoplasmic reticulum marker.
  - The green images should contain the protein of interest.
An example data folder could look like
data/
  images/
    image1_red.jpg     image2_red.jpg
    image1_blue.jpg    image2_blue.jpg
    image1_yellow.jpg  image2_yellow.jpg
    image1_green.jpg   image2_green.jpg
    etc.
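Before preprocessing, it can help to confirm that every sample has all four color files. The following is a minimal sketch, not part of this package: the helper names `group_by_sample` and `find_incomplete` are hypothetical, and it only assumes the `<FILENAME>_{red,blue,yellow,green}.<file_extension>` naming pattern described above.

```python
from collections import defaultdict
from pathlib import Path

# The four color suffixes required by the naming pattern.
CHANNELS = ("red", "blue", "yellow", "green")


def group_by_sample(filenames):
    """Map each <FILENAME> prefix to the set of color channels found for it."""
    samples = defaultdict(set)
    for name in filenames:
        stem = Path(name).stem  # e.g. "2222_517_E3_1_yellow"
        prefix, sep, color = stem.rpartition("_")
        if sep and color in CHANNELS:
            samples[prefix].add(color)
    return dict(samples)


def find_incomplete(filenames):
    """Return {prefix: sorted missing channels} for samples lacking a channel."""
    return {
        prefix: sorted(set(CHANNELS) - found)
        for prefix, found in group_by_sample(filenames).items()
        if found != set(CHANNELS)
    }


# In-memory example; for a real folder, pass the names from os.listdir(...).
example = [
    "file1_red.jpg", "file1_blue.jpg", "file1_yellow.jpg", "file1_green.jpg",
    "2222_517_E3_1_yellow.png",
]
incomplete = find_incomplete(example)
print(incomplete)
```

An empty result means every sample found in the list has all four channels. Files that do not match the naming pattern are ignored by this sketch.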
All commands are run from the main module.
Specifically, the commands are run through the call python main.py <command> <arguments>
To access the help section for any specific command, run python main.py <command> --help.
The purpose and structure of each command is listed below:
The preprocess command prepares images so that they are usable
by the machine learning model. At present, the only preprocessing
performed is resizing of the images, and the output format is hardcoded to .jpg.
The following arguments are allowed:
-h, --help            show help message and exit
-s SRC_DIR, --src-dir SRC_DIR
                      source directory, where images to process are located.
-d DST_DIR, --dst-dir DST_DIR
                      destination directory, where processed images end up.
--size SIZE           image size
                      The output size of the processed images. Defaults to `1536`.
-w NUM_WORKERS, --num-workers NUM_WORKERS
                      The number of multiprocessing workers to perform the resizing.
                      Defaults to `10`.
--continue            Continue from a previously aborted run.
                      This should only be done if the `SRC_DIR` is unchanged in between runs.
Note that -s and -d are required arguments!
The predict command runs the densenet model on the processed images.
The output from the model is split into three parts: probabilities, meta_information,
and features.
The probabilities represent the model prediction probabilities while the features
correspond to the latent space feature representation of the model.
The meta information contains the names of each image that was predicted upon.
The three files are timestamped and stored in the output folder.
The probabilities consist of the logit output of the model with the same order as the original Kaggle challenge.
The output files are all compressed numpy storage files and can be loaded
using the numpy.load function.
Each file contains a python dict with the corresponding information. To see how to load
the information, see the example presented in Example run.
At present, the only image format allowed in the input directory is .jpg.
The following arguments are allowed:
-h, --help            show help message and exit
-s SRC_DIR, --src-dir SRC_DIR
                      src image directory (preprocessed)
-d DST_DIR, --dst-dir DST_DIR
                      output directory
                      The output files will be stored in the compressed numpy
                      format '.npz'.
--size SIZE           image size
                      Defaults to 1536.
--gpu GPU             Which gpus to use for prediction.
                      Any string valid for the environment variable `CUDA_VISIBLE_DEVICES` is valid for this.
                      If cpu calculations ONLY is desired, a value of 'cpu' is also allowed.
                      Defaults to `CUDA_VISIBLE_DEVICES`
Note that -s and -d are required arguments!
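To illustrate the `--gpu` semantics described above, here is a minimal sketch of how such a value could be mapped onto `CUDA_VISIBLE_DEVICES`. The function `apply_gpu_argument` is hypothetical and is not taken from this repository; treating 'cpu' as an empty `CUDA_VISIBLE_DEVICES` is an assumption about one common way to hide all GPUs.

```python
import os


def apply_gpu_argument(gpu, environ=None):
    """Translate a --gpu value into CUDA_VISIBLE_DEVICES (hypothetical helper).

    Returns the resulting value, or None if the variable is not set.
    """
    env = os.environ if environ is None else environ
    if gpu is None:
        # No flag given: keep whatever CUDA_VISIBLE_DEVICES already holds.
        return env.get("CUDA_VISIBLE_DEVICES")
    if gpu.strip().lower() == "cpu":
        # Assumption: an empty value hides all GPUs, forcing CPU execution.
        env["CUDA_VISIBLE_DEVICES"] = ""
    else:
        # Any string valid for CUDA_VISIBLE_DEVICES, e.g. "0" or "0,1".
        env["CUDA_VISIBLE_DEVICES"] = gpu
    return env["CUDA_VISIBLE_DEVICES"]


# Demonstrated on a plain dict so the real environment is left untouched.
demo_env = {}
print(repr(apply_gpu_argument("0,1", demo_env)))
print(repr(apply_gpu_argument("cpu", demo_env)))
```

The variable has to be set before the deep learning framework initializes CUDA, otherwise the change has no effect.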
The dimred command runs UMAP dimensionality reduction on the features from the
predict command.
The output consists of an n-dimensional array stored in '.npz' format, where n
corresponds to the number of dimensions asked for. To see how to easily load
the data, see the example in Example run.
The following arguments are allowed:
-h, --help            show help message and exit
-s SRC, --src SRC     Source feature file to reduce.
-d DST, --dst DST     File to store predictions in.
                      The prediction will be stored in the compressed
                      numpy format '.npz'.
-n NUM_DIM, --num-dim NUM_DIM
                      Number of dimensions to reduce to. Defaults to 2.
Note that -s and -d are required arguments!
The umapNd command generates a simple CSV file from a previous dimensionality reduction result file and a
meta-information result file.
The output consists of a CSV file with the columns "Id", "X", "Y", […]. See the example in Example run.
The following arguments are allowed:
-h, --help            show help message and exit
-sred, --sred         Source reduction file.
-n, --num-dim         Number of present reduced dimensions to add to the CSV
-smeta, --smeta       Source meta-information file.
-d, --dst             File to store the CSV values in.
Note that all arguments are required!
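As a rough illustration of the CSV layout, the sketch below writes one row per image with an "Id" column followed by one column per reduced dimension. It is not the repository's implementation: `write_umap_csv` is a hypothetical helper, and naming the third column "Z" is an assumption, since only "Id", "X" and "Y" are stated above.

```python
import csv
import io

# Assumed axis names; only "X" and "Y" are confirmed by the description above.
AXIS_NAMES = ["X", "Y", "Z"]


def write_umap_csv(image_ids, reduced, num_dim, out):
    """Write image ids and their first num_dim reduced coordinates as CSV."""
    if not 1 <= num_dim <= len(AXIS_NAMES):
        raise ValueError("num_dim must be between 1 and %d" % len(AXIS_NAMES))
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Id"] + AXIS_NAMES[:num_dim])
    for image_id, point in zip(image_ids, reduced):
        writer.writerow([image_id] + [float(v) for v in point[:num_dim]])


# In-memory example with two images and a 2-dimensional reduction.
buffer = io.StringIO()
write_umap_csv(["image1", "image2"], [[0.5, 1.5], [2.0, -1.0]], 2, buffer)
print(buffer.getvalue())
```

The `reduced` argument can be any sequence of rows, so an array loaded with numpy.load can be passed in directly.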
Assuming you have a data folder containing images in the format described above, a prediction can easily be made using the following commands:
$ python main.py preprocess -s data/images -d data/resized_images
$ python main.py predict -s data/resized_images -d data/predictions
If you want to perform dimensionality reduction using UMAP, you can run the following commands:
$ python main.py dimred -s data/predictions/<FEATURE_FILE> -d data/umap/reduced.npz
If you want to generate a CSV file containing the data to plot an N-dimensional UMAP, you can run one of the following commands:
$ python main.py umapNd -sred data/umap/<REDUCED_FILE> --num-dim 2 -smeta data/predictions/<METAINFORMATION_FILE> -sprob data/predictions/<PROBABILITIES_FILE> --dst data/umap2d.csv
OR
$ python main.py umapNd -sred data/umap/<REDUCED_FILE> --num-dim 3 -smeta data/predictions/<METAINFORMATION_FILE> -sprob data/predictions/<PROBABILITIES_FILE> --dst data/umap3d.csv
To access the predicted data, use numpy to load the stored arrays:
import numpy as np
features = np.load('data/predictions/<FEATURE_FILE>')['feats']
probabilities = np.load('data/predictions/<PROBABILITY_FILE>')['probs']
image_ids = np.load('data/predictions/<META_INFORMATION_FILE>')['image_ids']
# If you performed dimensionality reduction, you load it in a similar vein.
reduced = np.load('data/umap/reduced.npz')['components']
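Since the predict section states that the stored probabilities are the logit output of the model, a further step may be needed to turn them into label predictions. The sketch below assumes a multi-label setup in which each class gets an independent sigmoid; the helper names and the 0.5 threshold are hypothetical and should be adjusted to your use case.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predicted_classes(logits, threshold=0.5):
    """Return the indices of classes whose sigmoid(logit) reaches the threshold.

    The threshold of 0.5 is an assumption, not a value from this repository.
    """
    return [i for i, v in enumerate(logits) if sigmoid(float(v)) >= threshold]


# One row of made-up logits; in practice, iterate over the rows of `probabilities`.
print(predicted_classes([-2.0, 0.0, 3.0]))
```

Each returned index refers to a class position in the same order as the original Kaggle challenge.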