Getting Started with Deep Learning tasks - Fine-tuning LLaMA 3.1 8B with LoRA (In-Person)
- 1 Presentation to the teams
- 1.1 Introduction
- 1.2 Competition Tasks
- 1.2.1 1. Speed Benchmark
- 1.2.2 2. Accuracy Benchmark
- 1.2.3 Evaluation Criteria
- 1.2.4 Optimization Freedom
- 1.3 Hardware Support
- 1.4 Container Setup
- 1.4.1 NVIDIA GPUs
- 1.4.2 AMD GPUs
- 1.4.3 Intel GPUs
- 1.4.4 CPU-Only
- 1.5 HuggingFace Registration
- 1.6 Version Requirements
- 1.7 Installation
- 1.8 Running the Competition Tasks
- 1.8.1 Time-to-Solution Benchmark
- 1.8.2 Accuracy Benchmark
- 1.9 Configuration
- 1.10 Datasets
- 1.11 Directory Structure
- 1.12 Submission Instructions
- 1.13 Additional Notes
Presentation to the teams
Introduction
Welcome to the exciting world of Large Language Model (LLM) fine-tuning! In this task, we'll be working with Meta's LLaMA 3.1 8B model, a powerful and efficient language model that's part of the cutting-edge LLaMA 3.1 family. Our goal is to fine-tune this model using Low-Rank Adaptation (LoRA), a technique that allows for the efficient adaptation of large pre-trained models to specific tasks.
LLaMA 3.1 8B is the smallest variant in the LLaMA 3.1 series, offering a balance between performance and resource efficiency. With 8 billion parameters, it's capable of handling a wide range of natural language processing tasks while being more accessible for fine-tuning on limited computational resources.
LoRA is a parameter-efficient fine-tuning method that significantly reduces the number of trainable parameters. Instead of updating all model weights, LoRA adds small, trainable rank decomposition matrices to existing weights, allowing for efficient adaptation with minimal computational overhead.
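To see why this is parameter-efficient, consider the arithmetic for a single weight matrix (the dimensions below are illustrative, not the exact LLaMA 3.1 8B shapes):

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """LoRA replaces a full d_out x d_in weight update with two small
    matrices B (d_out x r) and A (r x d_in), so only r * (d_out + d_in)
    parameters are trained."""
    return r * (d_out + d_in)

full = 4096 * 4096                            # full-rank update for one projection
low = lora_trainable_params(4096, 4096, r=8)  # the LoRA equivalent at rank 8
print(full, low, f"{100 * low / full:.2f}%")  # 16777216 65536 0.39%
```

At rank 8, the LoRA update trains well under 1% of the parameters of the full-rank update, which is what makes fine-tuning an 8B-parameter model feasible on limited hardware.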
In this task, you'll learn how to:
Set up the environment for fine-tuning LLaMA 3.1 8B with PyTorch
Prepare your dataset for fine-tuning
Configure and apply LoRA to the LLaMA 3.1 8B model
Fine-tune the model on your specific task
Evaluate the fine-tuned model's performance
By completing this task, you'll gain hands-on experience with state-of-the-art language models and efficient fine-tuning techniques. This knowledge is invaluable for developing custom AI solutions, improving model performance on specific domains, and pushing the boundaries of what's possible with large language models. Let's dive in and start fine-tuning LLaMA 3.1 8B with LoRA!
Competition Tasks
1. Speed Benchmark
Objective: Achieve the highest possible number of train samples per second
Dataset: CosmosQA will be given for practice at home. A new train dataset will be given at the competition.
Measurement: train_samples_per_second
Teams are free to fine-tune on the given dataset with any split.
Teams may run as many epochs as they want (at least one).
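For reference, the metric is simply throughput: total training samples divided by training wall time. A minimal sketch (the names here are illustrative; the official number comes from the training logs):

```python
def train_samples_per_second(num_samples: int, elapsed_s: float) -> float:
    """Throughput metric used by the speed benchmark:
    samples processed divided by wall-clock training time."""
    return num_samples / elapsed_s

# Toy example: 25,000 samples processed in 100 seconds of training.
print(train_samples_per_second(25_000, 100.0))  # 250.0
```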
2. Accuracy Benchmark
Objective: Achieve the highest accuracy on the evaluation dataset with the fine-tuned model from (1)
Evaluation Dataset: ScienceQA will be given for practice at home. A new evaluation dataset will be given at the competition.
Evaluation Criteria
Speed Benchmark (30%)
Highest train_samples_per_second on the train dataset
Must provide reproducible results
Accuracy Benchmark (50%)
Highest accuracy on the evaluation dataset
Must provide a checkpoint for verification
Technical Report (20%)
Summary of optimization techniques
Discussion of trade-offs and decisions
Max 500 words; a template is provided in report.md
Optimization Freedom
Teams are encouraged to explore various optimization techniques, including but not limited to:
Model optimization (parallelism, quantization, etc.)
LoRA configurations (parameter tuning or any LoRA variant, e.g. QLoRA, DoRA)
Training strategies (learning rates, batch sizes)
System optimizations (memory management, I/O optimization)
Hardware-specific optimizations
Using TransformerEngine
Other PyTorch-based extensions
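As one concrete example of the batch-size trade-off, the effective batch size combines the per-device batch size, gradient accumulation, and the number of devices (the numbers below are illustrative, not competition defaults):

```python
def effective_batch_size(per_device: int, grad_accum: int, world_size: int) -> int:
    """Samples contributing to each optimizer step when combining
    data parallelism with gradient accumulation."""
    return per_device * grad_accum * world_size

# e.g. 4 samples per device, 8 accumulation steps, 8 GPUs
print(effective_batch_size(4, 8, 8))  # 256
```

Raising any of the three factors increases throughput per optimizer step but also changes the training dynamics, so the learning rate usually needs retuning alongside it.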
The only constraints are:
Must use PyTorch as the base framework
Must maintain the provided code structure
Must be reproducible on competition hardware
Must maintain the same logging and checkpointing
Hardware Support
This implementation supports multiple hardware architectures:
NVIDIA GPUs
AMD GPUs
Intel GPUs
CPU-only systems
However, only NVIDIA GPUs were tested. Please reach out if you experience an issue on other hardware.
Container Setup
NVIDIA GPUs
# Pull the container
docker pull nvcr.io/nvidia/pytorch:25.01-py3
# Run with NVIDIA Container Runtime
docker run --gpus all \
--runtime=nvidia \
-v ${PWD}:/workspace \
nvcr.io/nvidia/pytorch:25.01-py3
AMD GPUs
# Pull the container
docker pull rocm/pytorch:latest
# Run with ROCm support
docker run \
--device=/dev/kfd \
--device=/dev/dri \
--security-opt seccomp=unconfined \
--group-add video \
-v ${PWD}:/workspace \
rocm/pytorch:latest
Intel GPUs
# Pull the container
docker pull intel/intel-extension-for-pytorch:latest
# Run with Intel GPU support
docker run -it \
--device=/dev/dri \
--group-add video \
-v ${PWD}:/workspace \
intel/intel-extension-for-pytorch:latest
CPU-Only
# Pull the container
docker pull python:3.9-slim
# Run container
docker run -it \
-v ${PWD}:/workspace \
python:3.9-slim
HuggingFace Registration
Please register an account at HuggingFace and obtain a user access token.
Version Requirements
Python 3.9+
PyTorch 2.0+
Transformers 4.36+
40GB+ GPU memory per device (recommended)
Hardware-specific libraries as needed
Installation
Clone the repository:
git clone https://github.com/phu0ngng/isc25_llm.git
cd isc25_llama_lora
Install dependencies:
pip install -r requirements.txt
Set up HuggingFace authentication:
export HF_TOKEN="your_token_here"
Set up the cache directory for the model and datasets (Optional):
export HF_HOME=/path/to/huggingface/cache
Running the Competition Tasks
Time-to-Solution Benchmark
# For NVIDIA GPUs
torchrun --nproc_per_node=8 main.py --benchmark speed --device-type cuda
# For AMD GPUs
python -m torch.distributed.launch --nproc_per_node=8 main.py --benchmark speed --device-type rocm
# For Intel GPUs
python -m intel_extension_for_pytorch.cpu.launch --nproc_per_node=8 main.py --benchmark speed --device-type xpu
# For CPU-only
python main.py --benchmark speed --device-type cpu --precision fp32
A sample run.sh script is provided.
Accuracy Benchmark
# Fine-tuning and evaluation
torchrun --nproc_per_node=8 main.py --benchmark accuracy --device-type cuda
# Running evaluation from checkpoint
torchrun --nproc_per_node=8 main.py --benchmark accuracy --device-type cuda --checkpoint path-to-checkpoint
# Same for other GPUs
Configuration
All configuration parameters can be found in config.py. Key parameters include:
Model name and configuration
Dataset paths
Training hyperparameters
Hardware-specific settings
LoRA configuration
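If the implementation uses the peft library for LoRA (an assumption; check config.py and src/lora_model.py for the authoritative settings), a typical configuration looks like the following, with purely illustrative values:

```python
from peft import LoraConfig

# Illustrative values only -- the actual defaults live in config.py.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor (applied as alpha / r)
    lora_dropout=0.05,                    # dropout on the LoRA branch
    target_modules=["q_proj", "v_proj"],  # which attention projections to adapt
    task_type="CAUSAL_LM",
)
```

Tuning `r`, `lora_alpha`, and `target_modules` is one of the main levers for both benchmarks: higher ranks and more target modules can raise accuracy at the cost of speed and memory.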
Datasets
For practice at home, two datasets will be given:
For Fine-Tuning - CosmosQA: The Cosmos QA dataset is a large-scale collection of 35,600 problems designed to test commonsense-based reading comprehension. It presents multiple-choice questions that require interpreting the likely causes and effects of events in everyday narratives, often necessitating reasoning beyond the explicit text content.
Accuracy Benchmark - ScienceQA: The ScienceQA dataset consists of approximately 21,208 multimodal multiple-choice science questions covering various topics across natural, language, and social sciences. It includes both image and text contexts, with detailed annotations to support understanding the reasoning behind answers. This structure makes it a valuable tool for assessing and improving AI's reasoning capabilities. For the scope of the competition, all image-based questions are removed; only text-based questions are used.
New datasets will be given at the competition.
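Both datasets are multiple-choice QA, so fine-tuning typically serializes each example into a text prompt. The template below is purely hypothetical (the actual formatting lives in src/dataset.py):

```python
def format_mcq(context: str, question: str, choices: list[str]) -> str:
    """Hypothetical prompt template for one multiple-choice QA example."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return f"Context: {context}\nQuestion: {question}\n{options}\nAnswer:"

prompt = format_mcq("Sam left the party early.",
                    "Why might Sam have left?",
                    ["He was tired.", "He was late.", "He was hungry."])
print(prompt)
```

Because the competition swaps in new datasets with the same structure, keeping this serialization step in one place makes it easy to adapt on competition day.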
Directory Structure
This repository provides a reference implementation for fine-tuning LLaMA 3.1 8B with LoRA using PyTorch.
isc25_llm/
├── report.md # Template report
├── config.py # Configuration parameters
├── main.py # Main training script
├── requirements.txt # Dependencies
├── run.sh # Runner script
└── src/ # Source code
├── dataset.py # Data loading
├── distributed.py # Distributed utilities
├── evaluation.py # Model evaluation
├── hub_utils.py # HuggingFace utilities
├── lora_model.py # LoRA implementation
└── trainer.py # Trainer
Submission Instructions
Teams must submit:
Final model checkpoint
Code modifications (if any)
Configuration files
Brief report detailing optimizations used
Please submit a compressed archive that includes all the files and directories following this hierarchy:
isc25_llm/
├── report.md # Template report
├── config.py # Configuration parameters
├── main.py # Main training script
├── requirements.txt # Dependencies
├── run.sh # Runner script
├── src/ # Source code
│ ├── dataset.py # Data loading
│ ├── distributed.py # Distributed utilities
│ ├── evaluation.py # Model evaluation
│ ├── hub_utils.py # HuggingFace utilities
│ ├── lora_model.py # LoRA implementation
│ └── trainer.py # Trainer
├── checkpoints/ # Checkpoint directory
│ ├── best_model.pt # Your final model for evaluation
│ └── checkpoint_xx # Other checkpoint states
└── loggings/ # Logging directory
├── speed.log # Stdout when running the speed benchmark
└── accuracy.log # Stdout when running the accuracy benchmark
Name your archive isc25_llm_[team_name].tar.gz.
tar -czvf isc25_llm_team_name.tar.gz /path/to/isc25_llm
Additional Notes
Monitor GPU memory usage
Use appropriate precision for hardware
Implement regular checkpointing
Profile code for bottlenecks
Test with smaller datasets first
For questions or issues, please contact the competition organizers. Good luck, and may the best-optimized model win!