# Fine-Tuning LLMs with Unsloth.ai: 5x Faster with 70% Less Memory

In the rapidly evolving landscape of AI development, efficient fine-tuning of large language models has become a critical bottleneck. Today, I'm excited to introduce you to **Unsloth**, a groundbreaking tool that's revolutionizing how we fine-tune models like Mistral, Gemma, and Llama 2.

## Why Unsloth.ai Is a Game-Changer

Unsloth offers several impressive advantages over traditional fine-tuning methods:

- **5x faster** fine-tuning process
- **70% less memory** consumption
- **Zero loss in accuracy** compared to standard methods
- **Cross-platform support** for Linux and Windows (via WSL)
- **Flexible quantization options**: 4-bit, 16-bit, QLoRA, and LoRA fine-tuning (see the loading sketch just after this list)
- **Up to 2x faster than Hugging Face** fine-tuning in the project's benchmarks across multiple datasets
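
To make the quantization options concrete, here is a minimal loading sketch (the model name and sequence length are illustrative; the tutorial below uses the same call): toggling `load_in_4bit` switches between QLoRA-style 4-bit loading and 16-bit LoRA fine-tuning.

```python
from unsloth import FastLanguageModel

# Load Mistral-7B with 4-bit quantization (QLoRA-style). Set load_in_4bit=False
# to keep the weights in 16-bit and do a plain LoRA fine-tune instead.
model, tokenizer = FastLanguageModel.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    max_seq_length=2048,
    load_in_4bit=True,   # flip to False for 16-bit LoRA
)
```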



## Step-by-Step Tutorial: Fine-Tuning Mistral-7B

In this tutorial, I'll walk you through fine-tuning the Mistral-7B model on the OIG (Open Instruction Generalist) dataset for instruction following. We'll see the dramatic difference in the model's responses before and after fine-tuning.


### Before We Start

Let's look at a quick example of what we're trying to achieve:


**Before fine-tuning:**

When asked "What are the tips for a successful business plan?", the model gives a continuous, often rambling response.


**After fine-tuning:**

The model provides a structured, point-by-point response that directly addresses the query.


### Setup Requirements

I'm using an NVIDIA RTX A6000 with 48 GB of VRAM and 6 virtual CPUs for this demonstration. Let's start by setting up our environment:

```bash
conda create -n unsloth python=3.11
conda activate unsloth
pip install huggingface_hub ipython
pip install "unsloth[conda]"
export HF_TOKEN=your_huggingface_token
```

You can also authenticate with Hugging Face using:

```bash
huggingface-cli login
# Enter your token when prompted
```
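
Before writing any training code, you can optionally sanity-check the environment with a short script (a minimal check, assuming PyTorch was installed alongside Unsloth):

```python
import torch
from huggingface_hub import whoami

# Confirm the GPU is visible to PyTorch and report which device we got.
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))

# Confirm the Hugging Face token (from HF_TOKEN or huggingface-cli login) works.
print(whoami()["name"])
```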


### The Code: Fine-Tuning Process

Let's create a file called `app.py` and implement our fine-tuning process:

```python
import os
import torch
from unsloth import FastLanguageModel
from transformers import TrainingArguments, TextStreamer
from trl import SFTTrainer
from datasets import load_dataset



# Step 1: Load the OIG (Open Instruction Generalist) dataset.
# The small "unified_chip2" JSONL file from LAION's OIG is used here; its
# "text" field holds "<human>: ... <bot>: ..." instruction-response pairs.
data_url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files={"train": data_url}, split="train")
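
# (Optional, illustrative) Peek at one record to confirm the "text" field
# looks like an instruction-response pair before training on it.
print(dataset[0]["text"][:200])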



# Step 2: Load the Mistral-7B model in 4-bit quantized form

max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)



# Set up fast inference so we can compare responses before and after training
FastLanguageModel.for_inference(model)



# Helper to generate a streamed response for a prompt
def generate_text(prompt):
    tokens = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer)
    outputs = model.generate(
        tokens,
        max_new_tokens=500,
        use_cache=True,
        streamer=streamer,
        temperature=0.7,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)



# Test before training to establish a baseline
print("Before training:")
generate_text("What are the tips for a successful business plan?")

# Step 3: Configure for training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", 
                   "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)
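
# (Optional, illustrative) Report how few parameters are actually trainable
# with LoRA, assuming the returned object is a standard PEFT model.
model.print_trainable_parameters()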



# Step 4: Set up the supervised fine-tuning (SFT) trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="adamw_torch",
    ),
)
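
# Note on the hyperparameters above: with per_device_train_batch_size=1 and
# gradient_accumulation_steps=4, each optimizer step effectively sees a batch
# of 4 examples, so max_steps=60 covers roughly 240 examples (fine for a
# quick demo, small for a production fine-tune).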



# Step 5: Train the model
trainer.train()

# Switch back to fast inference mode and test after training
FastLanguageModel.for_inference(model)
print("After training:")
generate_text("What are the tips for a successful business plan?")

# Step 6: Save the LoRA adapter on its own
model.save_pretrained("outputs/adapter")
tokenizer.save_pretrained("outputs/adapter")

# Step 7: Save the merged model (base weights + adapter) in 16-bit
model.save_pretrained_merged("outputs/merged", tokenizer, save_method="merged_16bit")

# Step 8: Upload the merged model to the Hugging Face Hub
model.push_to_hub_merged(
    "your-username/mistral-7b-finetuned-oig",
    tokenizer,
    save_method="merged_16bit",
    token=os.environ.get("HF_TOKEN"),
)

# Upload the LoRA adapter separately
model.push_to_hub(
    "your-username/mistral-7b-finetuned-oig-lora",
    token=os.environ.get("HF_TOKEN"),
)
```
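
If you want to see the memory savings for yourself, a quick check of peak GPU memory after training is straightforward (a minimal sketch using PyTorch's built-in CUDA statistics; exact numbers will depend on your GPU and settings):

```python
import torch

# Peak GPU memory reserved during the run, in GB. Compare this against a
# full-precision fine-tune of the same model to gauge the savings.
peak_gb = torch.cuda.max_memory_reserved() / 1024**3
print(f"Peak reserved GPU memory: {peak_gb:.2f} GB")
```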



## What's Happening in the Code

1. **Data Loading**: We're using the OIG dataset, which contains instruction-response pairs in JSONL format. Each entry has a human query and an AI response.



2. **Model Loading**: We load the Mistral-7B model in 4-bit quantized format to reduce memory usage.


3. **Initial Testing**: We test how the model responds before training to establish a baseline.



4. **Model Patching**: We configure the model for fine-tuning using LoRA (Low-Rank Adaptation), which significantly reduces memory requirements.



5. **Training Configuration**: We set up the Supervised Fine-Tuning (SFT) trainer with parameters like batch size, learning rate, etc.



6. **Training**: We run the training process for 60 steps (you may want to increase this for production models).



7. **Saving Options**: We save both:
   - The adapter only, which can be applied on top of the base model (a loading sketch follows this list)
   - A merged model (base model + adapter combined)



8. **Uploading to Hugging Face**: We upload both versions to the Hugging Face Hub for easy distribution and usage.
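
As mentioned in point 7, the adapter-only save can be loaded back on top of the base model. Here is a minimal sketch (assuming the `outputs/adapter` directory produced by the training script; Unsloth is expected to resolve the matching base model from the adapter's configuration):

```python
from unsloth import FastLanguageModel

# Load the saved LoRA adapter; the base Mistral-7B weights it was trained
# against are pulled in automatically from the adapter's configuration.
model, tokenizer = FastLanguageModel.from_pretrained(
    "outputs/adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
```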



## Results: Before vs. After

The difference in the model's responses is striking:



**Before fine-tuning:**
A continuous completion without clear structure or organization.



**After fine-tuning:**
A well-structured response with clear, numbered points addressing the specific question about business plan tips.



## Using Your Fine-Tuned Model

After uploading, you can easily use your model with a simple test script:

```python
from unsloth import FastLanguageModel
import torch
from transformers import TextStreamer

# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    "your-username/mistral-7b-finetuned-oig",
    max_seq_length=2048,
)

# Set up for inference
FastLanguageModel.for_inference(model)

# Generate text
def generate_text(prompt):
    tokens = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer)
    outputs = model.generate(
        tokens,
        max_new_tokens=500,
        use_cache=True,
        streamer=streamer,
        temperature=0.7,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the fine-tuned model
response = generate_text("What are the tips for a successful business plan?")
print(response)
```
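
Because the merged upload is a standard 16-bit checkpoint, it should also load with plain Hugging Face `transformers` on machines without Unsloth installed (a hedged sketch; `your-username/mistral-7b-finetuned-oig` is the placeholder repo name from the upload step, and `device_map="auto"` requires the `accelerate` package):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged 16-bit checkpoint with vanilla transformers (no Unsloth needed).
tokenizer = AutoTokenizer.from_pretrained("your-username/mistral-7b-finetuned-oig")
model = AutoModelForCausalLM.from_pretrained(
    "your-username/mistral-7b-finetuned-oig",
    torch_dtype="auto",
    device_map="auto",
)

inputs = tokenizer("What are the tips for a successful business plan?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```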

## Conclusion

Unsloth.ai represents a significant advancement in the fine-tuning of large language models. With its impressive performance improvements—5x faster with 70% less memory—it makes sophisticated model customization accessible to a broader range of developers and researchers.

Whether you're working with Mistral, Gemma, or Llama 2, Unsloth provides a more efficient path to creating specialized models tailored to your specific use cases. The ability to fine-tune these models with limited computational resources opens up new possibilities for AI innovation.

Give it a try and see how Unsloth.ai can transform your AI development workflow!

---

*If you found this tutorial helpful, please like, share, and subscribe for more content on AI development and large language models.*




https://github.com/unslothai/unsloth

Credit: Mervin Praison (YouTube)



P.S. Don't have an NVIDIA RTX A6000 with 48 GB of VRAM lying around the house? Then again, maybe some of you do...

