Fine Tuning LLMs with Unsloth.ai





Fine-Tuning LLMs with Unsloth: 5x Faster with 70% Less Memory

In the rapidly evolving landscape of AI development, efficient fine-tuning of large language models has become a critical bottleneck. Today, I'm excited to introduce you to **Unsloth**, a groundbreaking tool that's revolutionizing how we fine-tune models like Mistral, Gemma, and Llama 2.

## Why Unsloth.ai Is a Game-Changer

Unsloth offers several impressive advantages over traditional fine-tuning methods:

- **5x faster** fine-tuning process
- **70% less memory** consumption
- **Zero loss in accuracy** compared to standard methods
- **Cross-platform support** for Linux and Windows (via WSL)
- **Flexible quantization options**: 4-bit, 16-bit, QLoRA, and LoRA fine-tuning (see the sketch after this list)
- **2x faster than standard Hugging Face** fine-tuning in benchmarks across multiple datasets
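
To make the quantization options concrete, here's a minimal sketch (reusing the same Mistral-7B base model from the tutorial below) of how the 4-bit versus 16-bit choice is expressed when loading a model; the `load_in_4bit` flag is the main switch:

```python
from unsloth import FastLanguageModel

# 4-bit quantized loading (QLoRA-style): lowest memory footprint.
# Set load_in_4bit=False to load in 16-bit for standard LoRA fine-tuning.
model, tokenizer = FastLanguageModel.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    max_seq_length=2048,
    load_in_4bit=True,
)
```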



## Step-by-Step Tutorial: Fine-Tuning Mistral-7B

In this tutorial, I'll walk you through fine-tuning the Mistral-7B model on the OIG (Open Instruction Generalist) dataset for instruction following. We'll see the dramatic difference in the model's responses before and after fine-tuning.


### Before We Start

Let's look at a quick example of what we're trying to achieve:


**Before fine-tuning:**

When asked "What are the tips for a successful business plan?", the model gives a continuous, often rambling response.


**After fine-tuning:**

The model provides a structured, point-by-point response that directly addresses the query.


### Setup Requirements

I'm using an NVIDIA RTX A6000 with 48 GB of VRAM and 6 virtual CPUs for this demonstration. Let's start by setting up our environment:

```bash
conda create -n unsloth python=3.11
conda activate unsloth
pip install huggingface_hub ipython
pip install "unsloth[conda]"
export HF_TOKEN=your_huggingface_token
```

You can also authenticate with Hugging Face using:

```bash
huggingface-cli login
# Enter your token when prompted
```
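
Before kicking off a fine-tuning run, it's worth confirming that PyTorch can actually see the GPU. This quick, optional sanity check assumes a single CUDA device:

```python
import torch

# Optional sanity check: confirm a CUDA-capable GPU is visible to PyTorch
print(torch.cuda.is_available())         # expect: True
print(torch.cuda.get_device_name(0))     # expect: your GPU, e.g. "NVIDIA RTX A6000"
```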


### The Code: Fine-Tuning Process

Let's create a file called `app.py` and implement our fine-tuning process:

```python
import os
from unsloth import FastLanguageModel
import torch
from transformers import TrainingArguments, TextStreamer
from trl import SFTTrainer  # SFTTrainer comes from the trl library, not transformers

# Step 1: Load the OIG dataset
from datasets import load_dataset

data_url = "OIG/small_oig_instruct"
dataset = load_dataset(data_url, split="train")

# Step 2: Load the Mistral model in 4-bit
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)



# Set up for inference so we can compare outputs before and after training
FastLanguageModel.for_inference(model)  # Unsloth's faster inference mode

# Helper to generate text for the before/after comparison
def generate_text(prompt):
    tokens = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer)
    outputs = model.generate(
        tokens,
        max_new_tokens=500,
        use_cache=True,
        streamer=streamer,
        do_sample=True,   # required for temperature / top_p to take effect
        temperature=0.7,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)



# Test before training
print("Before training:")
generate_text("What are the tips for a successful business plan?")

# Step 3: Configure the model for LoRA fine-tuning
FastLanguageModel.for_training(model)  # switch back to training mode after for_inference
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", 
                   "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)



# Step 4: Training setup
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="adamw_torch",
    ),
)



# Step 5: Train the model
trainer.train()


# Test after training
FastLanguageModel.for_inference(model)  # back to inference mode for generation
print("After training:")
generate_text("What are the tips for a successful business plan?")

# Step 6: Save the LoRA adapter only
model.save_pretrained("outputs/adapter")

# Step 7: Save the merged model (base + adapter) in 16-bit
model.save_pretrained_merged("outputs/merged", tokenizer, save_method="merged_16bit")


# Step 8: Upload the merged model to the Hugging Face Hub
model.push_to_hub_merged(
    "your-username/mistral-7b-finetuned-oig",
    tokenizer,
    save_method="merged_16bit",
    token=os.environ.get("HF_TOKEN"),
)

# Upload the LoRA adapter separately
model.push_to_hub(
    "your-username/mistral-7b-finetuned-oig-lora",
    token=os.environ.get("HF_TOKEN"),
)
```



### What's Happening in the Code

1. **Data Loading**: We're using the OIG dataset, which contains instruction-response pairs in JSONL format. Each entry has a human query and an AI response.



2. **Model Loading**: We load the Mistral-7B model in 4-bit quantized format to reduce memory usage.


3. **Initial Testing**: We test how the model responds before training to establish a baseline.



4. **Model Patching**: We configure the model for fine-tuning using LoRA (Low-Rank Adaptation), which significantly reduces memory requirements.



5. **Training Configuration**: We set up the Supervised Fine-Tuning (SFT) trainer with parameters like batch size, learning rate, etc.



6. **Training**: We run the training process for 60 steps (you may want to increase this for production models).



7. **Saving Options**: We save both:
   - The adapter only (which can be applied on top of the base model, as shown in the sketch after this list)
   - A merged model (base model + adapter combined)



8. **Uploading to Hugging Face**: We upload both versions to the Hugging Face Hub for easy distribution and usage.
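
As a quick illustration of the adapter-only option from step 7, here's a minimal sketch (not part of the original tutorial) of loading the LoRA adapter on top of the base model with the standard `peft` library; the repository name matches the one pushed in the script above, so adjust it to your own username:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the original base model and its tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Apply the fine-tuned LoRA adapter on top of the base weights
model = PeftModel.from_pretrained(base_model, "your-username/mistral-7b-finetuned-oig-lora")
```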



### Results: Before vs. After

The difference in the model's responses is striking:



**Before fine-tuning:**
A continuous completion without clear structure or organization.



**After fine-tuning:**
A well-structured response with clear, numbered points addressing the specific question about business plan tips.



## Using Your Fine-Tuned Model

After uploading, you can easily use your model with a simple test script:

```python
from unsloth import FastLanguageModel
import torch
from transformers import TextStreamer

# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    "your-username/mistral-7b-finetuned-oig",
    max_seq_length=2048,
)

# Set up for inference (Unsloth's faster generation mode)
FastLanguageModel.for_inference(model)

# Generate text
def generate_text(prompt):
    tokens = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer)
    outputs = model.generate(
        tokens,
        max_new_tokens=500,
        use_cache=True,
        streamer=streamer,
        do_sample=True,   # required for temperature / top_p to take effect
        temperature=0.7,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the fine-tuned model
response = generate_text("What are the tips for a successful business plan?")
print(response)
```

## Conclusion

Unsloth.ai represents a significant advancement in the fine-tuning of large language models. With its impressive performance improvements—5x faster with 70% less memory—it makes sophisticated model customization accessible to a broader range of developers and researchers.

Whether you're working with Mistral, Gemma, or Llama 2, Unsloth provides a more efficient path to creating specialized models tailored to your specific use cases. The ability to fine-tune these models with limited computational resources opens up new possibilities for AI innovation.

Give it a try and see how Unsloth.ai can transform your AI development workflow!

---

*If you found this tutorial helpful, please like, share, and subscribe for more content on AI development and large language models.*




https://github.com/unslothai/unsloth

Credit: Mervin Praison (YouTube)



P.S. Don't have an NVIDIA RTX A6000 with 48 GB of VRAM lying around the house? Or maybe some of you do...

