Fine-Tuning LLMs with Unsloth.ai
Fine-Tuning LLMs with Unsloth: 5x Faster with 70% Less Memory
In the rapidly evolving landscape of AI development, efficient fine-tuning of large language models has become a critical bottleneck. Today, I'm excited to introduce you to **Unsloth**, a groundbreaking tool that's revolutionizing how we fine-tune models like Mistral, Gemma, and LLaMA 2.
## Why Unsloth.ai Is a Game-Changer
Unsloth offers several impressive advantages over traditional fine-tuning methods:
- **5x faster** fine-tuning process
- **70% less memory** consumption
- **Zero loss in accuracy** compared to standard methods
- **Cross-platform support** for Linux and Windows (via WSL)
- **Flexible quantization options**: 4-bit, 16-bit, QLoRA, and LoRA fine-tuning
- **Outperforms standard Hugging Face fine-tuning** by 2x in benchmarks across multiple datasets
## Step-by-Step Tutorial: Fine-Tuning Mistral-7B
In this tutorial, I'll walk you through how to fine-tune the Mistral-7B model using the OIG (Open Instruction Generalist) dataset for instruction following. We'll see the dramatic difference in the model's responses before and after fine-tuning.
### Before We Start
Let's look at a quick example of what we're trying to achieve:
**Before fine-tuning:**
When asked "What are the tips for a successful business plan?", the model gives a continuous, often rambling response.
**After fine-tuning:**
The model provides a structured, point-by-point response that directly addresses the query.
### Setup Requirements
I'm using an NVIDIA RTX A6000 with 48 GB of VRAM and 6 virtual CPUs for this demonstration. Let's start by setting up our environment:
```bash
conda create -n unsloth python=3.11
conda activate unsloth
pip install huggingface_hub ipython
pip install unsloth
export HF_TOKEN=your_huggingface_token
```
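Optionally, you can confirm that PyTorch actually sees the GPU before going any further. This quick check isn't part of the tutorial proper, just a sanity step:
```python
import torch

# Confirm CUDA is available and inspect the GPU before fine-tuning
print(torch.cuda.is_available())                  # expect True
print(torch.cuda.get_device_name(0))              # e.g. "NVIDIA RTX A6000"
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB VRAM")
```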
You can also authenticate with Hugging Face using:
```bash
huggingface-cli login
# Enter your token when prompted
```
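If you'd rather stay in Python, the `huggingface_hub` library also provides a `login()` helper that does the same thing; here it reuses the `HF_TOKEN` environment variable exported above:
```python
import os
from huggingface_hub import login

# Authenticate with the token exported earlier as HF_TOKEN
login(token=os.environ.get("HF_TOKEN"))
```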
### The Code: Fine-Tuning Process
Let's create a file called `app.py` and implement our fine-tuning process:
```python
import os

from unsloth import FastLanguageModel  # import unsloth first so its patches are applied
import torch
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from trl import SFTTrainer  # SFTTrainer lives in trl, not transformers

# Step 1: Load the OIG dataset
data_url = "OIG/small_oig_instruct"
dataset = load_dataset(data_url, split="train")

# Step 2: Load the Mistral model in 4-bit to reduce memory usage
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

# Set up for inference so we can compare responses before/after fine-tuning
FastLanguageModel.for_inference(model)

# Function to generate text for comparison
def generate_text(prompt):
    tokens = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer)
    outputs = model.generate(
        tokens,
        max_new_tokens=500,
        use_cache=True,
        streamer=streamer,
        temperature=0.7,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test before training
print("Before training:")
generate_text("What are the tips for a successful business plan?")

# Step 3: Switch back to training mode and attach LoRA adapters
FastLanguageModel.for_training(model)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

# Step 4: Training setup
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="adamw_torch",
    ),
)

# Step 5: Train the model
trainer.train()

# Test after training
FastLanguageModel.for_inference(model)
print("After training:")
generate_text("What are the tips for a successful business plan?")

# Step 6: Save the LoRA adapter on its own
model.save_pretrained("outputs/adapter")

# Step 7: Save the merged model (base model + adapter) in 16-bit
model.save_pretrained_merged("outputs/merged", tokenizer, save_method="merged_16bit")

# Step 8: Upload the merged model to the Hugging Face Hub
model.push_to_hub_merged(
    "your-username/mistral-7b-finetuned-oig",
    tokenizer,
    save_method="merged_16bit",
    token=os.environ.get("HF_TOKEN"),
)

# Upload the LoRA adapter separately
model.push_to_hub(
    "your-username/mistral-7b-finetuned-oig-lora",
    token=os.environ.get("HF_TOKEN"),
)
```
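One practical note before we unpack the code: `SFTTrainer` expects the column named in `dataset_text_field` ("text" here) to exist. If the dataset variant you load exposes separate prompt/response columns instead of a ready-made `text` field, you can build one with `datasets.map`. This is a minimal sketch only; the column names `instruction` and `response` are placeholders, not the actual OIG schema:
```python
# Hypothetical mapping: merge separate columns into the single "text" field
# that SFTTrainer reads. The column names below are placeholders.
def to_text(example):
    return {"text": f"<human>: {example['instruction']}\n<bot>: {example['response']}"}

dataset = dataset.map(to_text)
```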
## What's Happening in the Code
1. **Data Loading**: We're using the OIG dataset, which contains instruction-response pairs in JSONL format. Each entry has a human query and an AI response.
2. **Model Loading**: We load the Mistral-7B model in 4-bit quantized format to reduce memory usage.
3. **Initial Testing**: We test how the model responds before training to establish a baseline.
4. **Model Patching**: We configure the model for fine-tuning using LoRA (Low-Rank Adaptation), which significantly reduces memory requirements; a quick way to verify this is sketched just after this list.
5. **Training Configuration**: We set up the Supervised Fine-Tuning (SFT) trainer with parameters like batch size, learning rate, etc.
6. **Training**: We run the training process for 60 steps (you may want to increase this for production models).
7. **Saving Options**: We save both:
- The adapter only (which can be used with the base model)
- A merged model (base model + adapter combined)
8. **Uploading to Hugging Face**: We upload both versions to the Hugging Face Hub for easy distribution and usage.
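To see why the LoRA step (point 4 above) is so memory-friendly, you can count how many parameters actually require gradients after `get_peft_model` wraps the model; only the small adapter matrices are trainable. A quick sanity-check sketch in plain PyTorch, run right after Step 3:
```python
# Count trainable vs. total parameters after applying LoRA
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```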
## Results: Before vs. After
The difference in the model's responses is striking:
**Before fine-tuning:**
A continuous completion without clear structure or organization.
**After fine-tuning:**
A well-structured response with clear, numbered points addressing the specific question about business plan tips.
## Using Your Fine-Tuned Model
After uploading, you can easily use your model with a simple test script:
```python
from unsloth import FastLanguageModel
import torch
from transformers import TextStreamer

# Load the fine-tuned (merged) model from the Hugging Face Hub
model, tokenizer = FastLanguageModel.from_pretrained(
    "your-username/mistral-7b-finetuned-oig",
    max_seq_length=2048,
)

# Enable Unsloth's faster inference mode
FastLanguageModel.for_inference(model)

# Generate text
def generate_text(prompt):
    tokens = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer)
    outputs = model.generate(
        tokens,
        max_new_tokens=500,
        use_cache=True,
        streamer=streamer,
        temperature=0.7,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the fine-tuned model
response = generate_text("What are the tips for a successful business plan?")
print(response)
```
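Because the LoRA adapter was also pushed as its own repo, you can alternatively keep the original base model and attach the adapter with PEFT instead of downloading the full merged weights. A minimal sketch, assuming the adapter repo name used above:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original base model, then attach the LoRA adapter on top of it
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# "your-username/mistral-7b-finetuned-oig-lora" is the adapter repo pushed earlier
model = PeftModel.from_pretrained(base, "your-username/mistral-7b-finetuned-oig-lora")
```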
## Conclusion
Unsloth.ai represents a significant advancement in the fine-tuning of large language models. With its impressive performance improvements—5x faster with 70% less memory—it makes sophisticated model customization accessible to a broader range of developers and researchers.
Whether you're working with Mistral, Gemma, or LLaMA 2, Unsloth provides a more efficient path to creating specialized models tailored to your specific use cases. The ability to fine-tune these models with limited computational resources opens up new possibilities for AI innovation.
Give it a try and see how Unsloth.ai can transform your AI development workflow!
---
*If you found this tutorial helpful, please like, share, and subscribe for more content on AI development and large language models.*
https://github.com/unslothai/unsloth
Credit: Mervin Praison (YouTube)
P.S. Don't have an NVIDIA RTX A6000 with 48 GB of VRAM lying around the house? Or maybe some of you do...