Embedding Model Training & Evaluation Framework

A comprehensive framework for training and evaluating text embedding models, with specialized support for academic and scientific document retrieval.

Features

Contrastive Learning: Train embedding models using hard negative sampling
Multiple Pooling Strategies: Support for CLS, mean, last token, and weighted average pooling
MRL (Matryoshka Representation Learning): Train a single model whose embeddings remain usable when truncated to several smaller dimensions
Academic Benchmarks: Evaluate on MTEB academic tasks, QASA, and LitSearch
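
As a rough illustration of the four pooling strategies named above, the sketch below pools the token vectors of a single sequence in plain Python. The function name `pool`, its arguments, and the linear position weights used for "weighted" are assumptions made for this example; the framework's own implementation (presumably in common/) may differ.

```python
def pool(hidden, mask, strategy="mean"):
    """Collapse one sequence of token vectors into a single embedding.

    hidden: list of per-token vectors (lists of floats)
    mask:   list of 0/1 flags, 1 for real tokens and 0 for padding
    """
    # Drop padding first so the logic works for left- or right-padded input.
    tokens = [vec for vec, keep in zip(hidden, mask) if keep]
    if not tokens:
        raise ValueError("sequence has no real tokens")

    if strategy == "cls":
        return list(tokens[0])    # first real token
    if strategy == "last":
        return list(tokens[-1])   # last real token
    if strategy == "mean":
        weights = [1.0] * len(tokens)
    elif strategy == "weighted":
        # Assumed scheme: later tokens get linearly larger weights.
        weights = [float(i) for i in range(1, len(tokens) + 1)]
    else:
        raise ValueError(f"unknown pooling strategy: {strategy}")

    total = sum(weights)
    dim = len(tokens[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, tokens)) / total
        for d in range(dim)
    ]
```

CLS and last-token pooling read a single position, so they depend on the model having been trained to summarize the input there. The two averaging variants use every non-padding token.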

Project Structure

embedding/
├── common/          # Shared utilities and base classes
│   ├── base_model.py    # Base embedding model class
│   ├── config.py        # Model configurations
│   ├── heads.py         # Model head implementations
│   └── utils.py         # Utility functions
├── train/           # Training scripts and configurations
│   ├── src/             # Training source code
│   ├── config/          # Training configurations
│   └── script/          # Training shell scripts
└── evaluate/        # Evaluation scripts
    ├── evaluate_academic_mteb.py   # MTEB academic tasks
    ├── evaluate_qasa.py            # QASA benchmark
    └── evaluate_litsearch.py       # LitSearch benchmark

Installation

# Create and activate conda environment
conda create -n train_embedding python=3.10
conda activate train_embedding

# Install dependencies
conda install -c conda-forge pyarrow -y
pip install -e .

Quick Start

Training

# Run training with default configuration
cd embedding/train
bash script/train.sh
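
This README does not spell out the objective that train.sh optimizes. As a minimal sketch of how contrastive learning with hard negatives and MRL can be combined, the example below computes an InfoNCE-style loss for one query and averages it over several embedding prefix lengths. The function names, the temperature of 0.05, and the choice of averaging across dimensions are assumptions for illustration, not the framework's actual training code.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def info_nce(query, positive, hard_negatives, temperature=0.05):
    """Cross-entropy of picking the positive among positive + hard negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in hard_negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[0]


def matryoshka_loss(query, positive, hard_negatives, dims, temperature=0.05):
    """Average the contrastive loss over several prefix lengths of the embeddings."""
    losses = [
        info_nce(query[:d], positive[:d], [n[:d] for n in hard_negatives], temperature)
        for d in dims
    ]
    return sum(losses) / len(losses)
```

Because every prefix contributes to the loss, the leading dimensions are pushed to separate positives from hard negatives on their own, which is what lets a truncated embedding stay useful at inference time.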

Evaluation

# Evaluate on MTEB academic tasks
python -m embedding.evaluate.evaluate_academic_mteb \
    --model_name "your-model-path" \
    --output_dir "results/"

# Evaluate on LitSearch
python -m embedding.evaluate.evaluate_litsearch \
    --model_name "your-model-path" \
    --output_dir "results/"

# Evaluate on QASA
python -m embedding.evaluate.evaluate_qasa \
    --model_name "your-model-path" \
    --corpus_path "/path/to/qasa_section.parquet" \
    --query_path "/path/to/qasa_data_qasa_test.jsonl" \
    --output_dir "results/"
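
The README does not say which metrics the evaluation scripts write to the output directory. Retrieval benchmarks of this kind commonly rank documents by embedding similarity and score the ranking, so the sketch below shows one such metric, recall@k, under that assumption. The function names and the dictionary-based corpus layout are hypothetical and are not taken from the scripts.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted by descending cosine similarity to the query.

    doc_vecs maps a document id to its embedding vector.
    """
    return sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant document")
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)
```

A possible use is to embed each query and every corpus passage with the model under test, call `rank_documents` per query, and average `recall_at_k` over all queries.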

Documentation

Training (embedding/train/): Training configuration and scripts
Evaluation (embedding/evaluate/): Benchmark evaluation scripts

Supported Models

The framework supports various embedding model architectures:

E5 Models: intfloat/e5-*
Qwen3 Embedding: Qwen/Qwen3-Embedding-*
Snowflake Arctic: Snowflake/snowflake-arctic-embed-*
Custom Models: Any Hugging Face Transformers model

License

Apache License 2.0
