Build-LLAMA2: Mini Llama2 Transformer Implementation & Weight-Space Ensemble Fine-Tuning

Deep Learning, Natural Language Processing, Transformer Architecture, LoRA, WiSE-FT

📄 Project Report (PDF) · 💻 GitHub Repository

Introduction

This project involves building a mini Llama2-style transformer from scratch—including attention mechanisms, feed-forward layers, positional encoding, and complete training infrastructure—followed by implementing and evaluating Weight-Space Ensemble Fine-Tuning (WiSE-FT) to understand how weight interpolation can improve downstream task performance and model robustness.

The implementation follows the structure of Llama2 (Touvron et al., 2023), with careful reproduction of key architectural components and training behavior. The project results in a fully functional decoder-only transformer capable of text continuation, zero-shot classification, and downstream fine-tuning tasks.

Figure 1: Llama2 Decoder-Only Transformer Architecture with GQA, RoPE, and SwiGLU


Architecture Implementation

Core Components

I implemented all major components of the Llama2 decoder architecture from scratch:

Attention Mechanism:

  • Grouped Query Attention (GQA) - Memory-efficient attention with shared key-value heads (see the sketch after this list)
  • Multi-head self-attention with causal masking
  • Rotary Position Embeddings (RoPE) - Relative positional encoding
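
A minimal PyTorch sketch of a GQA block along these lines (class and argument names are illustrative, not the repository's API; RoPE is omitted for brevity and would rotate q and k just before the score computation):

```python
import math
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    """Causal GQA sketch: fewer KV heads than query heads, shared within groups."""
    def __init__(self, dim, n_heads, n_kv_heads):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # RoPE would be applied to q and k here, before the scores.
        # Repeat each KV head so it serves its whole group of query heads.
        groups = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(groups, dim=1)
        v = v.repeat_interleave(groups, dim=1)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        att = scores.masked_fill(causal, float("-inf")).softmax(dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, -1)
        return self.wo(out)
```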

Feed-Forward Network:

  • SwiGLU activation - An improvement over standard ReLU/GELU (sketched below)
  • Gated linear units for enhanced expressiveness
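
A sketch of the SwiGLU feed-forward block: three bias-free projections with a silu gate, as in the Llama papers (names are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """FFN(x) = W2( silu(W1 x) * (W3 x) ): a silu-gated linear unit."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```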

Normalization & Residuals:

  • Pre-normalization architecture (RMSNorm in Llama2) - Stabilizes training (sketched below)
  • Residual connections throughout all layers
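
For reference, Llama2 uses RMSNorm rather than standard LayerNorm for its pre-normalization; a sketch of the norm and the residual wiring, under the assumption that the project mirrors Llama2 here:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the reciprocal root-mean-square; learned gain, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

# Pre-norm residual wiring -- normalize *before* each sublayer, then add back:
#   h = x + attention(attn_norm(x))
#   y = h + feed_forward(ffn_norm(h))
```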

Training Infrastructure:

  • Custom training loop with AdamW optimizer
  • Cosine learning rate schedule with warmup (sketched after this list)
  • Mixed precision training support (fp16/bf16)
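
A sketch of a warmup-then-cosine schedule (constants are placeholders); the returned value would be written into each AdamW param group before every step:

```python
import math

def lr_at_step(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Usage sketch (values are placeholders):
#   for group in optimizer.param_groups:
#       group["lr"] = lr_at_step(step, 3e-4, 3e-5, 1_000, 50_000)
```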

Training Pipeline

Data Processing & Training Loop

  • Implemented a tokenization pipeline using Llama’s tokenizer
  • Built end-to-end training infrastructure:
    • Forward pass with efficient tensor operations
    • Loss computation and backpropagation
    • Gradient accumulation and clipping (see the training-step sketch below)
    • Checkpoint management
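
A sketch of one training-step pattern combining these pieces (the accumulation factor and clip norm are placeholders; the loader is assumed to yield (inputs, targets) token batches):

```python
import torch
import torch.nn.functional as F

def train_epoch(model, optimizer, loader, accum_steps=4, max_norm=1.0):
    """One epoch with gradient accumulation and global-norm clipping."""
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for i, (inputs, targets) in enumerate(loader):
        logits = model(inputs)                       # (B, T, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        (loss / accum_steps).backward()              # scale so grads average
        if (i + 1) % accum_steps == 0:               # one update per accum window
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```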

Supported Training Modes

  1. GPT-style Pretraining - Causal language modeling on large corpora
  2. Text Continuation - Autoregressive generation and perplexity evaluation (see the perplexity sketch below)
  3. Downstream Fine-tuning - Task-specific adaptation (classification, QA)
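
Perplexity here is the exponential of the mean next-token negative log-likelihood; a minimal sketch, assuming the model maps token ids to logits:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, token_ids):
    """exp(mean NLL) of each token given its prefix."""
    ids = torch.tensor([token_ids])                    # (1, T)
    logits = model(ids)                                # (1, T, vocab)
    nll = F.cross_entropy(logits[0, :-1], ids[0, 1:])  # position t predicts t+1
    return math.exp(nll.item())
```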

Parameter-Efficient Fine-Tuning (PEFT)

LoRA (Low-Rank Adaptation)

  • Injected trainable low-rank matrices ($\Delta W = BA$) into attention projection layers
  • Achieved strong adaptation with < 5% of parameters trainable
  • Kept the base model weights frozen, enabling efficient multi-task deployment (a minimal LoRA layer is sketched below)

Figure 2: LoRA Overview
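
A minimal LoRA wrapper along these lines (the rank and alpha values are illustrative; the repository's lora.py may differ in detail):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus trainable low-rank update Delta W = B A."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # base weights stay frozen
        self.A = nn.Parameter(torch.empty(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))  # B = 0, so Delta W starts at 0
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```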

WiSE-FT (Weight-Space Ensemble Fine-Tuning)

Core Concept:
Instead of using only fine-tuned weights, interpolate between pretrained and fine-tuned weights:

\[W_{\text{final}} = \alpha \times W_{\text{finetuned}} + (1 - \alpha) \times W_{\text{pretrained}}\]

where $\alpha \in [0, 1]$ is the interpolation coefficient.
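
The interpolation itself is a per-tensor convex combination of the two state dicts; a minimal sketch (the function name is illustrative):

```python
def wise_ft_interpolate(pretrained_state, finetuned_state, alpha):
    """W_final = alpha * W_finetuned + (1 - alpha) * W_pretrained, per tensor."""
    assert pretrained_state.keys() == finetuned_state.keys()
    return {
        name: alpha * finetuned_state[name] + (1 - alpha) * pretrained_state[name]
        for name in pretrained_state
    }

# Usage sketch: model.load_state_dict(wise_ft_interpolate(w_pre, w_ft, alpha=0.3))
```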

Implementation:

  • Explored the interpolation coefficient $\alpha \in [0, 1]$ via grid search
  • Evaluated on zero-shot classification and out-of-domain robustness
  • Compared three strategies: vanilla fine-tuning, LoRA, and WiSE-FT

Experimental Results

Tasks Evaluated

  1. Text Continuation Perplexity - Language modeling capability
  2. Zero-shot Classification - SST-2 (sentiment) and AGNews (topic classification); a scoring sketch follows this list
  3. Fine-tuning Performance - Comparing LoRA and WiSE-FT strategies
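
One common recipe for zero-shot classification with a causal LM, which the evaluation plausibly follows, is to score each label verbalization by its log-likelihood as a continuation of a prompt. A sketch (the prompt template and the tokenizer.encode interface are assumptions, not the report's exact setup):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenizer, text, label_words):
    """Pick the label whose verbalization the LM finds most likely after the prompt."""
    scores = []
    prompt_ids = tokenizer.encode(f"Text: {text}\nLabel:")   # hypothetical template
    for word in label_words:
        label_ids = tokenizer.encode(" " + word)             # hypothetical interface
        ids = torch.tensor([prompt_ids + label_ids])
        logp = F.log_softmax(model(ids), dim=-1)             # (1, T, vocab)
        # Each label token at position p is predicted by the logits at p - 1.
        score = sum(
            logp[0, len(prompt_ids) + j - 1, tok].item()
            for j, tok in enumerate(label_ids)
        )
        scores.append(score)
    return label_words[scores.index(max(scores))]
```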

Key Findings

1. WiSE-FT Improves Robustness

  • Demonstrated strong robustness on out-of-domain samples
  • Interpolating with $\alpha \in [0.2, 0.4]$ gave best balance between in-domain and out-of-domain performance
  • Avoided overfitting while maintaining task-specific improvements

2. LoRA Stabilizes Training

  • Reduced the trainable parameter count by 95% while maintaining performance
  • Improved training stability with fewer optimization steps
  • Enabled efficient multi-task adaptation

3. Architecture Validation

  • Base model reproduced expected loss curves for a small-scale Llama architecture
  • Achieved competitive perplexity on validation sets
  • Confirmed correct implementation of all architectural components

Repository Contents

The complete implementation is available on GitHub:

  • model.py - Full transformer implementation
  • rope.py - Rotary Position Embeddings
  • train.py - Training and evaluation scripts
  • lora.py - LoRA implementation
  • wise_ft.py - WiSE-FT weight interpolation
  • Configuration files and example notebooks

Discussion

This project demonstrates several important insights:

From-Scratch Implementation Value:
Building transformers from scratch provides deep understanding of:

  • How attention mechanisms process sequential information
  • Why positional encodings are critical for transformers
  • How normalization and residuals stabilize training

PEFT Trade-offs:

  • LoRA is excellent for memory-constrained scenarios and multi-task deployment
  • WiSE-FT excels when robustness and generalization are priorities
  • The optimal $\alpha$ varies by task and distribution shift severity

Engineering Lessons:

  • Proper tensor shapes and efficient operations are critical for scalability
  • Modular design enables rapid experimentation
  • Comprehensive testing is essential for correctness

Conclusion

This project successfully implements a complete Llama2-style transformer from scratch, including all architectural innovations (GQA, RoPE, SwiGLU) and modern training practices. The addition of LoRA and WiSE-FT demonstrates parameter-efficient fine-tuning methods that are essential for practical deployment of large language models.