Fine-Tuning LLMs with Unsloth.ai
Fine-Tuning LLMs with Unsloth: 5x Faster with 70% Less Memory
In the rapidly evolving landscape of AI development, efficient fine-tuning of large language models has become a critical bottleneck. Today, I'm excited to introduce you to **Unsloth**, a groundbreaking tool that's revolutionizing how we fine-tune models like Mistral, Gemma, and LLaMA 2.
## Why Unsloth.ai Is a Game-Changer
Unsloth offers several impressive advantages over traditional fine-tuning methods:
- **5x faster** fine-tuning process
- **70% less memory** consumption
- **Zero loss in accuracy** compared to standard methods
- **Cross-platform support** for Linux and Windows (via WSL)
- **Flexible quantization options**: 4-bit, 16-bit, QLoRA, and LoRA fine-tuning
- **Outperforms standard Hugging Face fine-tuning** by 2x in benchmarks across multiple datasets
## Step-by-Step Tutorial: Fine-Tuning Mistral-7B
In this tutorial, I'll walk you through how to fine-tune the Mistral-7B model using the OIG (Open Instruction Generalist) dataset for instruction following. We'll see the dramatic difference in the model's responses before and after fine-tuning.
### Before We Start
Let's look at a quick example of what we're trying to achieve:
**Before fine-tuning:**
When asked "What are the tips for a successful business plan?", the model gives a continuous, often rambling response.
**After fine-tuning:**
The model provides a structured, point-by-point response that directly addresses the query.
### Setup Requirements
I'm using an NVIDIA RTX A6000 with 48 GB of VRAM and 6 virtual CPUs for this demonstration. Let's start by setting up our environment:
```bash
conda create -n unsloth python=3.11
conda activate unsloth
pip install huggingface_hub ipython
pip install unsloth
export HF_TOKEN=your_huggingface_token
```
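Optionally, you can confirm that PyTorch actually sees the GPU before going any further. This quick check isn't part of the tutorial proper, just a sanity step:
```python
import torch

# Confirm CUDA is available and inspect the GPU before fine-tuning
print(torch.cuda.is_available())                  # expect True
print(torch.cuda.get_device_name(0))              # e.g. "NVIDIA RTX A6000"
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB VRAM")
```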
You can also authenticate with Hugging Face using:
```bash
huggingface-cli login
# Enter your token when prompted
```
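If you'd rather stay in Python, the `huggingface_hub` library also provides a `login()` helper that does the same thing; here it reuses the `HF_TOKEN` environment variable exported above:
```python
import os
from huggingface_hub import login

# Authenticate with the token exported earlier as HF_TOKEN
login(token=os.environ.get("HF_TOKEN"))
```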
### The Code: Fine-Tuning Process
Let's create a file called `app.py` and implement our fine-tuning process:
```python
import os

from unsloth import FastLanguageModel  # import unsloth first so its patches are applied
import torch
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from trl import SFTTrainer  # SFTTrainer lives in trl, not transformers

# Step 1: Load the OIG dataset
data_url = "OIG/small_oig_instruct"
dataset = load_dataset(data_url, split="train")

# Step 2: Load the Mistral model in 4-bit to reduce memory usage
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

# Set up for inference so we can compare responses before/after fine-tuning
FastLanguageModel.for_inference(model)

# Function to generate text for comparison
def generate_text(prompt):
    tokens = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer)
    outputs = model.generate(
        tokens,
        max_new_tokens=500,
        use_cache=True,
        streamer=streamer,
        temperature=0.7,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test before training
print("Before training:")
generate_text("What are the tips for a successful business plan?")

# Step 3: Switch back to training mode and attach LoRA adapters
FastLanguageModel.for_training(model)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

# Step 4: Training setup
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="adamw_torch",
    ),
)

# Step 5: Train the model
trainer.train()

# Test after training
FastLanguageModel.for_inference(model)
print("After training:")
generate_text("What are the tips for a successful business plan?")

# Step 6: Save the LoRA adapter on its own
model.save_pretrained("outputs/adapter")

# Step 7: Save the merged model (base model + adapter) in 16-bit
model.save_pretrained_merged("outputs/merged", tokenizer, save_method="merged_16bit")

# Step 8: Upload the merged model to the Hugging Face Hub
model.push_to_hub_merged(
    "your-username/mistral-7b-finetuned-oig",
    tokenizer,
    save_method="merged_16bit",
    token=os.environ.get("HF_TOKEN"),
)

# Upload the LoRA adapter separately
model.push_to_hub(
    "your-username/mistral-7b-finetuned-oig-lora",
    token=os.environ.get("HF_TOKEN"),
)
```
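One practical note before we unpack the code: `SFTTrainer` expects the column named in `dataset_text_field` ("text" here) to exist. If the dataset variant you load exposes separate prompt/response columns instead of a ready-made `text` field, you can build one with `datasets.map`. This is a minimal sketch only; the column names `instruction` and `response` are placeholders, not the actual OIG schema:
```python
# Hypothetical mapping: merge separate columns into the single "text" field
# that SFTTrainer reads. The column names below are placeholders.
def to_text(example):
    return {"text": f"<human>: {example['instruction']}\n<bot>: {example['response']}"}

dataset = dataset.map(to_text)
```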
## What's Happening in the Code
1. **Data Loading**: We're using the OIG dataset, which contains instruction-response pairs in JSONL format. Each entry has a human query and an AI response.
2. **Model Loading**: We load the Mistral-7B model in 4-bit quantized format to reduce memory usage.
3. **Initial Testing**: We test how the model responds before training to establish a baseline.
4. **Model Patching**: We configure the model for fine-tuning using LoRA (Low-Rank Adaptation), which significantly reduces memory requirements; a quick way to verify this is sketched just after this list.
5. **Training Configuration**: We set up the Supervised Fine-Tuning (SFT) trainer with parameters like batch size, learning rate, etc.
6. **Training**: We run the training process for 60 steps (you may want to increase this for production models).
7. **Saving Options**: We save both:
- The adapter only (which can be used with the base model)
- A merged model (base model + adapter combined)
8. **Uploading to Hugging Face**: We upload both versions to the Hugging Face Hub for easy distribution and usage.
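To see why the LoRA step (point 4 above) is so memory-friendly, you can count how many parameters actually require gradients after `get_peft_model` wraps the model; only the small adapter matrices are trainable. A quick sanity-check sketch in plain PyTorch, run right after Step 3:
```python
# Count trainable vs. total parameters after applying LoRA
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```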
## Results: Before vs. After
The difference in the model's responses is striking:
**Before fine-tuning:**
A continuous completion without clear structure or organization.
**After fine-tuning:**
A well-structured response with clear, numbered points addressing the specific question about business plan tips.
## Using Your Fine-Tuned Model
After uploading, you can easily use your model with a simple test script:
```python
from unsloth import FastLanguageModel
import torch
from transformers import TextStreamer

# Load the fine-tuned (merged) model from the Hugging Face Hub
model, tokenizer = FastLanguageModel.from_pretrained(
    "your-username/mistral-7b-finetuned-oig",
    max_seq_length=2048,
)

# Enable Unsloth's faster inference mode
FastLanguageModel.for_inference(model)

# Generate text
def generate_text(prompt):
    tokens = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer)
    outputs = model.generate(
        tokens,
        max_new_tokens=500,
        use_cache=True,
        streamer=streamer,
        temperature=0.7,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the fine-tuned model
response = generate_text("What are the tips for a successful business plan?")
print(response)
```
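Because the LoRA adapter was also pushed as its own repo, you can alternatively keep the original base model and attach the adapter with PEFT instead of downloading the full merged weights. A minimal sketch, assuming the adapter repo name used above:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original base model, then attach the LoRA adapter on top of it
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# "your-username/mistral-7b-finetuned-oig-lora" is the adapter repo pushed earlier
model = PeftModel.from_pretrained(base, "your-username/mistral-7b-finetuned-oig-lora")
```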
## Conclusion
Unsloth.ai represents a significant advancement in the fine-tuning of large language models. With its impressive performance improvements—5x faster with 70% less memory—it makes sophisticated model customization accessible to a broader range of developers and researchers.
Whether you're working with Mistral, Gemma, or LLaMA 2, Unsloth provides a more efficient path to creating specialized models tailored to your specific use cases. The ability to fine-tune these models with limited computational resources opens up new possibilities for AI innovation.
Give it a try and see how Unsloth.ai can transform your AI development workflow!
---
*If you found this tutorial helpful, please like, share, and subscribe for more content on AI development and large language models.*
https://github.com/unslothai/unsloth
Credit: Mervin Praison (YouTube)
P.S. Don't have an NVIDIA RTX A6000 with 48 GB of VRAM lying around the house? Or maybe some of you do...