By Wilson Kumalo · Updated Jan 30, 2026 · 23 min read

Tags: ollama, claude-code, codex, local-llm, ai-coding-assistant, privacy, open-source, gpu-acceleration, developer-tools, offline-ai

How to Run Claude Code Locally (100% Free & Fully Private): Complete Setup Guide with Real Hardware Testing

Transform your machine into a private AI coding agent using Ollama and Codex CLI. This comprehensive guide covers installation, configuration, and optimization on real hardware (HP ZBook with Intel i7-12800H and RTX A2000), delivering zero API costs, complete code privacy, and offline capability. Includes troubleshooting, performance benchmarks, and security best practices.

AI • Software Development • Privacy

How to Run Claude Code Locally (100% Free & Fully Private)

How to turn your machine into a private AI coding agent using Ollama with Claude Code or Codex—without sending a single byte to the cloud.


You're Deep in a Codebase

The files are messy. The logic is half yours, half legacy, half regret.

You want help. But you don't want to send your entire project to the cloud just to get a suggestion that might be wrong anyway.

And you definitely don't want another API bill.

That's the moment this setup starts to make sense.

Because you can run Claude Code or Codex entirely on your own machine. Offline. Private. No tracking. No API costs. No one else seeing your files.

And the best part? This isn't a toy setup.

These tools can read files. Edit code. Run terminal commands. Refactor projects. And actually behave like a coding partner, not a chat box.

Let me show you exactly how it works.


Understanding the Tools: Claude Code and Codex

Both Claude Code and Codex are agentic coding tools that can work with local models through Ollama. Here's what you need to know:

Claude Code (Anthropic)

Claude Code is Anthropic's agentic coding tool that can read, modify, and execute code in your working directory. Thanks to Ollama's Anthropic-compatible API, you can now run Claude Code with local open-source models like qwen3-coder, glm-4.7, and gpt-oss.

Key Features:

  • Autonomous coding tasks from the terminal
  • Multi-file editing and refactoring
  • Project-wide analysis and code execution
  • Works with both cloud Claude models AND local Ollama models
  • Requires 64k+ token context window

Codex (OpenAI CLI)

Codex is OpenAI's command-line tool for AI-assisted coding. While originally designed for OpenAI's models, it fully supports Ollama's local models through the --oss flag.

Key Features:

  • Terminal-based coding assistance
  • File operations and code modifications
  • Supports local and cloud models via Ollama
  • Simple installation via npm
  • Requires 32k+ token context window

Which One Should You Use?

Choose Claude Code if:

  • You want the most modern agentic coding experience
  • You prefer Anthropic's approach to AI coding
  • You want flexibility to switch between local and cloud models
  • You have 64k+ context window models

Choose Codex if:

  • You prefer a simpler, more established tool
  • You're already familiar with OpenAI's ecosystem
  • You want lower context requirements (32k)
  • You prefer npm-based installation

Use both! They complement each other well and can both run locally through Ollama.


Who This Is Really For

Let's be honest about the audience here.

This is for you if:

  • You want a private AI coding agent that never leaves your machine
  • You like Claude Code or similar tools, but not the cloud dependency
  • You're experimenting with local LLMs and want something practical
  • You want an AI that can actually touch files and run commands
  • You're working on proprietary or sensitive code that can't be uploaded to third-party servers
  • You want to eliminate ongoing API costs entirely
  • Or you just want full control over your tools again

If you've ever thought, "I wish this thing just worked locally," you're in the right place.


The Big Idea: Understanding the Architecture

On their own, both Claude Code and Codex are just the shell: a command-line interface that needs a brain to power it.

That's where Ollama comes in.

Think of Ollama like Docker for AI models. It runs quietly in the background and lets you pull models the same way you'd pull Docker images.

The Three-Layer Architecture

Layer 1: Claude Code or Codex (The Interface)

  • The command-line tool you interact with
  • Handles file operations, code editing, and terminal commands
  • Manages conversation context across your project
  • Translates your requests into structured API calls

Layer 2: Ollama (The Model Runtime)

  • Background service running on localhost:11434
  • Downloads, stores, and serves AI models
  • Handles model loading and GPU acceleration
  • Provides Anthropic-compatible and OpenAI-compatible APIs

Layer 3: Local LLM (The Brain)

  • The actual AI model performing reasoning and generation
  • Supports tool calling for file and command execution
  • Requires 32k-64k+ token context window
  • Can be swapped/upgraded independently
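
To make the layering concrete, you can talk to Layer 2 directly. The sketch below (assuming you've already pulled qwen3-coder, covered in Part 2) hits Ollama's native generate endpoint on localhost; Claude Code and Codex do essentially the same thing through Ollama's Anthropic- and OpenAI-compatible endpoints on the same port.

# Ask the local model a question directly, bypassing Claude Code/Codex
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-coder",
  "prompt": "Write a one-line Python function that squares a number",
  "stream": false
}'

# The JSON response is generated entirely on your machine -- no external calls.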

Hardware Requirements: Can Your Machine Handle It?

This is the make-or-break question. Local AI isn't magic—it requires real computational resources.

My Testing Setup: HP ZBook Studio 16 G9

I'm running this entire setup on an HP ZBook Studio 16 inch G9 Mobile Workstation PC with the following specifications:

Processor:
12th Gen Intel® Core™ i7-12800H (2.40 GHz base frequency)

  • 14 cores (6 Performance-cores + 8 Efficient-cores)
  • 20 threads with Hyper-Threading
  • 24 MB Intel® Smart Cache
  • Max Turbo frequency up to 4.8 GHz

Memory:
32.0 GB DDR5 RAM (31.6 GB usable)

  • Speed: 4800 MT/s
  • Dual-channel configuration

Graphics Cards:

  • GPU 0: Intel® Iris® Xe Graphics (integrated, 1% utilization)
  • GPU 1: NVIDIA RTX™ A2000 8GB Laptop GPU (5% idle, 44°C)
    • 2,560 CUDA cores
    • 8 GB GDDR6 VRAM
    • Used for AI model acceleration

Storage:
954 GB NVMe SSD

  • 523 GB of 954 GB currently used
  • 431 GB available for AI models

Minimum Requirements (Budget Setup)

Absolute Minimum:

  • CPU: 4-core processor (Intel i5 8th gen / AMD Ryzen 5 3600 equivalent)
  • RAM: 16 GB
  • Storage: 50 GB free space (SSD highly recommended)
  • GPU: Integrated graphics acceptable (Intel UHD 630 or better)
  • OS: Windows 10/11, macOS 11+, or modern Linux distribution

Note: You'll be limited to 3B-7B parameter models with slower response times (5-15 seconds per response).

Recommended Requirements (Good Experience)

  • CPU: 6+ core processor (Intel i7/i9 10th gen+, AMD Ryzen 7/9 5000+)
  • RAM: 32 GB or more
  • Storage: 200+ GB free NVMe SSD space
  • GPU: Dedicated GPU with 6+ GB VRAM (NVIDIA RTX 3060/4060 or better, AMD RX 6700 XT+)
  • OS: Latest stable OS version

Note: This configuration handles 7B-13B models with 2-5 second response times and can run 20B+ models at acceptable speeds.

Optimal Setup (Enthusiast/Professional)

  • CPU: 8+ core processor (Intel i9 12th gen+, AMD Ryzen 9 7000+)
  • RAM: 64 GB or more
  • Storage: 500+ GB NVMe SSD (dedicated for AI)
  • GPU: High-end GPU with 12+ GB VRAM (NVIDIA RTX 4070 Ti/4080/4090, AMD RX 7900 XT/XTX)

Note: Runs 30B-70B models smoothly, handles multiple concurrent sessions, and provides near-instant responses for smaller models.

Performance Reality Check

With my HP ZBook configuration (i7-12800H, 32GB RAM, RTX A2000 8GB), here's what I can realistically run:

  • 3B-7B models: Nearly instant responses (0.5-2 seconds)
  • 13B models: Responsive and production-ready (2-5 seconds)
  • 20B models: Usable but noticeably slower (5-10 seconds)
  • 30B+ models: Requires CPU fallback, significantly slower (15-30+ seconds)

If you're on a laptop with 16GB RAM and integrated graphics, stick to 7B models—they're surprisingly capable for most coding tasks.


Part 1: Installing Ollama

Ollama is the foundation of our local AI stack. It's remarkably easy to install across all major platforms.

Installation by Operating System

Windows (Recommended for most users):

  1. Download the Windows installer from ollama.com
  2. Run the OllamaSetup.exe file
  3. The installer will set up Ollama as a Windows service that starts automatically
  4. Open PowerShell or Command Prompt and verify:
    ollama --version

Windows with WSL2 (Advanced users):

# Install WSL2 if not already installed
wsl --install

# Inside your WSL distribution (Ubuntu recommended)
curl -fsSL https://ollama.com/install.sh | sh

# Start the Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify installation
ollama --version
systemctl status ollama

macOS:

# Install via Homebrew
brew install ollama

# Or download the macOS app from ollama.com and drag it into Applications
# (the curl install script shown for Linux/WSL does not support macOS)

# Verify installation
ollama --version

Linux (Ubuntu/Debian):

# Install using the official script
curl -fsSL https://ollama.com/install.sh | sh

# Start and enable the Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify it's running
ollama --version
systemctl status ollama

Linux (Fedora/RHEL/CentOS):

# Install using the official script
curl -fsSL https://ollama.com/install.sh | sh

# Start and enable the Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify installation
ollama --version

Post-Installation Verification

Once installed, Ollama runs as a background service on port 11434. Verify it's working:

# Check if Ollama is running
curl http://localhost:11434

# You should see: "Ollama is running"

If you don't see this message, the service isn't running properly. Check your installation or firewall settings.

GPU Detection (NVIDIA Users)

If you have an NVIDIA GPU (like my RTX A2000), verify Ollama can actually use it. Note that ollama ps only lists models that are currently loaded, so this check is most useful once you've pulled and run a model (Part 2):

# Check which models are loaded and where they're running
ollama ps

# Look at the PROCESSOR column: "100% GPU" means the model is running
# on your GPU; "100% CPU" means the GPU isn't being used.

If GPU isn't detected:

  • Windows: Install latest NVIDIA drivers from nvidia.com
  • Linux: Make sure the NVIDIA driver is installed (that's what Ollama needs; it bundles its own CUDA runtime). The CUDA toolkit is optional:
    sudo apt install nvidia-cuda-toolkit
    Then restart Ollama: sudo systemctl restart ollama

Part 2: Downloading AI Models

Now comes the exciting part—choosing and downloading your AI models.

Understanding Model Naming Convention

Ollama uses a straightforward naming format:

model-name:parameter-size-quantization

Examples:

  • qwen3-coder = Qwen 3 Coder (default size and quantization)
  • glm-4.7 = GLM 4.7 (default size and quantization)
  • gpt-oss:20b = GPT-OSS with 20 billion parameters
  • gpt-oss:120b = GPT-OSS with 120 billion parameters

Recommended Models for Claude Code and Codex

1. Qwen 3 Coder — Best for Claude Code (Highly Recommended)

ollama pull qwen3-coder
  • Size: ~4.5 GB download
  • RAM needed: 8-12 GB
  • Strengths: Excellent code generation, strong reasoning, optimized for coding tasks
  • Context window: 128k tokens (perfect for Claude Code)
  • Best for: Daily coding with Claude Code, multi-file refactoring

2. GLM-4.7 — Strong Alternative

ollama pull glm-4.7
  • Size: ~4.2 GB download
  • RAM needed: 8-12 GB
  • Strengths: Fast inference, good code understanding, multilingual
  • Context window: 128k tokens
  • Best for: Quick iterations, projects in multiple languages

3. GPT-OSS 20B — Balanced Power (Default for Codex)

ollama pull gpt-oss:20b
  • Size: ~11 GB download
  • RAM needed: 24-32 GB
  • Strengths: More capable reasoning, better at complex tasks
  • Context window: 64k+ tokens
  • Best for: Complex refactoring, architectural decisions

4. GPT-OSS 120B — Maximum Capability (If you have the hardware)

ollama pull gpt-oss:120b
  • Size: ~68 GB download
  • RAM needed: 80+ GB
  • Strengths: Highest quality code generation, best reasoning
  • Context window: 64k+ tokens
  • Best for: Complex systems, production-grade code generation
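
If you're unsure which of these will actually fit on your machine, a rough rule of thumb (an approximation, not an official figure): a 4-bit-quantized model needs on the order of 0.5-0.7 GB of RAM/VRAM per billion parameters, plus a few extra GB for the context cache. That's why the smaller models above land in the 4-5 GB range and fit an 8 GB GPU like my A2000, the 20B model spills partly into system RAM on the same card, and the 120B model is out of reach without 80 GB+ of memory.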

Context Window Requirements

Critical: Claude Code and Codex have specific context requirements:

  • Claude Code: Minimum 64k tokens recommended
  • Codex: Minimum 32k tokens recommended

Most models in Ollama's library meet these requirements, but you can verify:

# Check model details
ollama show qwen3-coder

# Look for num_ctx parameter

Download Strategy (My Recommendation)

Start with one model for your use case:

# For Claude Code (best experience)
ollama pull qwen3-coder

# For Codex (default)
ollama pull gpt-oss:20b

# Or download both to compare
ollama pull qwen3-coder
ollama pull gpt-oss:20b

# Check what you've downloaded
ollama list

Model Storage Locations

  • Linux/macOS: ~/.ollama/models/
  • Windows: %USERPROFILE%\.ollama\models\
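
If you're curious how much disk space your models are using (useful on a shared SSD like mine), here's a quick check on Linux/macOS; adjust the path for Windows as listed above:

# Total disk space used by all pulled models
du -sh ~/.ollama/models

# Per-model sizes as Ollama reports them
ollama list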

Testing Your Model

Before connecting to Claude Code or Codex, verify your model works correctly:

# Start an interactive session
ollama run qwen3-coder

# Try a simple coding prompt
>>> Write a Python function to calculate fibonacci numbers recursively

# You should get a proper code response

# Exit with Ctrl+D or type:
>>> /bye

If this works, you're ready to set up Claude Code or Codex.


Part 3: Installing and Configuring Claude Code

Claude Code is Anthropic's agentic coding tool that now works seamlessly with Ollama's local models.

Installing Claude Code

macOS / Linux:

# Install Claude Code
curl -fsSL https://claude.ai/install.sh | bash

# Verify installation
claude --version

Windows:

Download the Windows installer from code.claude.com and follow the installation wizard.

Quick Setup with Ollama

The fastest way to get Claude Code working with Ollama:

# Automatic configuration and launch
ollama launch claude

This command automatically:

  1. Detects your Ollama installation
  2. Configures Claude Code to use local models
  3. Sets up the Anthropic-compatible API endpoint
  4. Launches Claude Code in your current directory

Manual Setup (More Control)

For custom setups, configure without launching:

# Configure Claude Code for Ollama
ollama launch claude --config

Or set environment variables manually:

# Set the environment variables
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434

# Run Claude Code with an Ollama model
claude --model qwen3-coder

Or run with environment variables inline:

ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model glm-4.7

Changing Models

Switch between different models easily:

# Use Qwen 3 Coder (recommended)
claude --model qwen3-coder

# Use GLM 4.7
claude --model glm-4.7

# Use GPT-OSS 20B
claude --model gpt-oss:20b

# Use GPT-OSS 120B (if you have the hardware)
claude --model gpt-oss:120b

Persistent Configuration

Make your Ollama setup permanent by adding to your shell profile:

# For bash (~/.bashrc) or zsh (~/.zshrc)
echo 'export ANTHROPIC_AUTH_TOKEN=ollama' >> ~/.bashrc
echo 'export ANTHROPIC_API_KEY=""' >> ~/.bashrc
echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.bashrc
source ~/.bashrc

# Now you can just run:
claude --model qwen3-coder
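
If you'd rather not export the Anthropic variables globally (they could interfere if you also use Anthropic's cloud API from the same shell), a small wrapper keeps the Ollama settings scoped to a single command. This is just a convenience sketch: claude-local is a name I made up, and it reuses the same variables shown above.

# Add to ~/.bashrc or ~/.zshrc
claude-local() {
  ANTHROPIC_AUTH_TOKEN=ollama \
  ANTHROPIC_API_KEY="" \
  ANTHROPIC_BASE_URL=http://localhost:11434 \
  claude --model "${1:-qwen3-coder}"
}

# Usage:
#   claude-local             # defaults to qwen3-coder
#   claude-local glm-4.7     # any model you've pulled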

Using Cloud Models (Optional)

Claude Code also supports cloud models through ollama.com:

# Use cloud-hosted 120B model
claude --model gpt-oss:120b-cloud

Cloud models at ollama.com/search?c=cloud require an API key but offer more powerful options.


Part 4: Installing and Configuring Codex

Codex is OpenAI's command-line coding tool that works great with Ollama's local models.

Installing Codex

# Install via npm
npm install -g @openai/codex

# Verify installation
codex --version

Quick Setup with Ollama

The fastest way to use Codex with Ollama:

# Launch Codex in OSS (Open Source Software) mode
codex --oss

This automatically connects to your local Ollama instance at http://localhost:11434.

Changing Models

By default, Codex uses gpt-oss:20b. Switch to different models with the -m flag:

# Use GPT-OSS 20B (default)
codex --oss

# Use GPT-OSS 120B (more capable)
codex --oss -m gpt-oss:120b

# Use Qwen 3 Coder
codex --oss -m qwen3-coder

# Use GLM 4.7
codex --oss -m glm-4.7

Manual Configuration

For persistent settings, edit ~/.codex/config.toml:

# ~/.codex/config.toml

model = "qwen3-coder"
model_provider = "ollama"

[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434"
timeout = 120
max_retries = 3

[context]
max_tokens = 65536  # Match your model's context window
max_files = 50
include_hidden = false

[performance]
stream_responses = true
cache_responses = true

After editing, restart Codex to load the new settings.
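
Before starting a long Codex session, it's worth confirming that Ollama is actually serving the model named in config.toml. A quick sanity check using Ollama's native tags endpoint:

# List the models your local Ollama instance has available
curl -s http://localhost:11434/api/tags

# The model from config.toml (qwen3-coder here) should appear in the
# "models" array; if it doesn't, pull it first: ollama pull qwen3-coder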

Cloud Models via ollama.com

To use ollama.com's cloud models:

  1. Create an API key at ollama.com/settings/keys
  2. Export it as an environment variable:
    export OLLAMA_API_KEY="your-api-key-here"
    
  3. Edit ~/.codex/config.toml:
model = "gpt-oss:120b-cloud"
model_provider = "ollama"

[model_providers.ollama]
name = "Ollama"
base_url = "https://ollama.com/v1"
env_key = "OLLAMA_API_KEY"
  4. Run codex in a new terminal to load the settings

Part 5: Using Claude Code and Codex Effectively

Now that everything is configured, let's explore how to actually use these tools for real development work.

Starting a Session

With Claude Code:

# Navigate to your project
cd ~/projects/my-web-app

# Start Claude Code with default settings
claude --model qwen3-coder

# You'll see a prompt like:
# Claude (qwen3-coder)
# >

With Codex:

# Navigate to your project
cd ~/projects/my-web-app

# Start Codex with Ollama
codex --oss

# Or with a specific model
codex --oss -m qwen3-coder

Basic Commands and Operations

File Operations:

# Read files
> Read the main.py file
> Show me the authentication module
> What's in the utils folder?

# List files
> List all Python files in src/
> Show the project structure

# Search
> Find all TODO comments
> Show me where the User class is defined

Code Generation:

# Create new code
> Create a new API endpoint for user registration
> Write a function to validate email addresses
> Generate a TypeScript interface for the User model

# Add features
> Add logging to all API endpoints
> Implement rate limiting middleware
> Create unit tests for the calculate_total function

Code Analysis:

# Understanding code
> Explain what the process_payment function does
> What's the purpose of this decorator?
> Walk me through the authentication flow

# Finding issues
> Find potential bugs in the error handling
> Check for security vulnerabilities
> Identify performance bottlenecks in this loop

# Suggesting improvements
> Suggest optimizations for the database queries
> How can I make this code more maintainable?
> Review this code for best practices

Refactoring:

# Code improvements
> Refactor this class to use dependency injection
> Convert this callback-based code to async/await
> Extract this repeated logic into a helper function

# Modernization
> Update this code to use TypeScript generics
> Migrate from class components to React hooks
> Rewrite using modern Python type hints

Terminal Commands:

# Running commands
> Run the test suite
> Install the requests package
> Check git status
> Create a new branch called feature/user-auth

# Build and deployment
> Build the Docker image
> Run the development server
> Deploy to staging environment

Real-World Example Workflow

Let's walk through a realistic development scenario:

# Start in your e-commerce API project
cd ~/projects/ecommerce-api
claude --model qwen3-coder

# Session starts
> Hi! I need to add better error handling to the payment processing module

# Claude analyzes the code
[Reads payment.py and related files]

> I can see the payment module. The current error handling is minimal. 
> I'll add comprehensive try-catch blocks, logging, and user-friendly 
> error messages. Should I also add retry logic for network failures?

> Yes, add retry logic with exponential backoff

# Claude makes the changes
[Shows code modifications]

> Review the changes I made to payment.py

# After reviewing
> Looks good. Now write unit tests for these error scenarios

# Claude generates tests
[Creates test_payment_errors.py with comprehensive test cases]

> Run the tests

# Claude executes pytest
[Shows test results - 1 test failing]

> Test test_network_retry is failing because of a type mismatch on line 45

# Claude analyzes and fixes
> I see the issue. The mock response returns a string but the code 
> expects a dict. I'll fix the test.

[Updates test file]

> Run the tests again

[All tests pass]

> Perfect! Write a git commit message for these changes

> Here's a suggested commit message:
> 
> feat(payment): Add comprehensive error handling with retry logic
> 
> - Added try-catch blocks for all payment operations
> - Implemented exponential backoff for network failures
> - Added detailed logging for debugging
> - Created unit tests for error scenarios
> - Fixed type handling in payment response processing

> Commit these changes with that message

[Executes git commit]

> Done! Anything else?

> No, that's all. Thanks!

# Exit
Ctrl+D

This entire workflow stays local. Your code never leaves your machine.

Keyboard Shortcuts

  • Ctrl+D - Exit Claude Code or Codex
  • Ctrl+C - Cancel current operation
  • ↑/↓ - Navigate command history

Part 6: Performance Optimization

Running AI locally requires careful attention to performance. Here's how to get the most out of your hardware.

GPU Acceleration (Critical for Performance)

If you have an NVIDIA GPU (like my RTX A2000), ensure Ollama is using it properly:

Verify GPU usage:

# Check if models are running on GPU
ollama ps

# While a model is loaded, check the PROCESSOR column
# Example output:
# NAME           ID        SIZE     PROCESSOR    UNTIL
# qwen3-coder    abc123    4.5 GB   100% GPU     4 minutes from now

If the PROCESSOR column shows "100% CPU" (or a CPU/GPU split) instead of "100% GPU", there's a problem.

Fix GPU detection (Windows):

  1. Install latest NVIDIA drivers from nvidia.com
  2. Restart your computer
  3. Restart Ollama service:
    Restart-Service Ollama  # In PowerShell as admin
    

Fix GPU detection (Linux):

# First make sure the NVIDIA driver is installed (from your distro's
# packages or nvidia.com) -- that's what Ollama actually needs; it
# bundles its own CUDA runtime. The toolkit below is optional.
sudo apt update
sudo apt install nvidia-cuda-toolkit

# Verify the toolkit (optional)
nvcc --version

# Check GPU is visible
nvidia-smi

# Restart Ollama
sudo systemctl restart ollama

# Test GPU usage
ollama run qwen3-coder

Monitor GPU usage during inference:

# Linux/Windows (with NVIDIA drivers)
nvidia-smi -l 1  # Updates every 1 second

# Watch GPU utilization while using Claude Code/Codex
# You should see GPU usage spike during responses

Adjusting Context Window

Both Claude Code and Codex benefit from larger context windows. You can adjust this in Ollama:

# Create a custom modelfile
cat > custom-qwen.txt << EOF
FROM qwen3-coder
PARAMETER num_ctx 128000
EOF

# Create custom model
ollama create qwen3-coder-large -f custom-qwen.txt

# Use the custom model
claude --model qwen3-coder-large

See Ollama's context length documentation for more details.

Memory Management Strategies

For 16GB RAM systems:

  • Use smaller models (qwen3-coder, glm-4.7)
  • Close other applications when coding
  • Use one model at a time
  • Consider reducing context window if needed

For 32GB+ RAM systems (like mine):

  • Can comfortably run gpt-oss:20b
  • Multiple models can be loaded
  • Full context windows work smoothly

For 64GB+ RAM systems:

  • Run gpt-oss:120b
  • Keep multiple models loaded
  • Maximum context windows
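
Beyond picking the right model size, Ollama exposes a few environment variables that control how models occupy memory. The variables below are real Ollama settings; treat the values as starting points rather than recommendations.

# Keep a model in memory longer after its last request (default is a few minutes)
export OLLAMA_KEEP_ALIVE=10m

# Limit how many models can be loaded at once (useful on 16 GB machines)
export OLLAMA_MAX_LOADED_MODELS=1

# Note: if Ollama runs as a systemd service (Linux), set these with
# `sudo systemctl edit ollama` as Environment= lines instead of shell
# exports, then restart: sudo systemctl restart ollama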

Benchmarking Your Setup

Test response times to find your optimal configuration:

# Simple benchmark
time ollama run qwen3-coder "Write a hello world function in Python"

# Complex benchmark
time ollama run qwen3-coder "Explain the singleton pattern with code examples"

My benchmark results (HP ZBook, RTX A2000, 32GB RAM):

Model        | Simple Query | Complex Query | GPU Usage
qwen3-coder  | 1.2s         | 4.8s          | 68%
glm-4.7      | 1.1s         | 4.5s          | 65%
gpt-oss:20b  | 6.8s         | 24.1s         | 98% + CPU

Your results will vary based on hardware.


Part 7: Privacy and Security Considerations

Running AI locally is about more than saving money—it's about data sovereignty and security.

What Stays Local (Everything)

With this setup, 100% of your data stays on your machine:

  • All code and files processed
  • Conversation history and context
  • Model weights and inference computations
  • Generated outputs and suggestions
  • Project analysis and documentation

Zero network calls are made to external services (unless you explicitly use cloud models).

Verifying Complete Privacy

Confirm nothing is being sent externally by monitoring network connections:

Linux:

# Monitor all network connections
sudo netstat -tnp | grep ollama

# You should only see local connections (127.0.0.1)

Windows:

  1. Open Resource Monitor (resmon.exe)
  2. Go to Network tab
  3. Filter by ollama.exe
  4. Verify all connections are to 127.0.0.1:11434

macOS:

# Monitor network connections
sudo lsof -i -P | grep ollama

# You should only see localhost connections

Security Best Practices

1. Firewall Configuration

Block external access to Ollama's port:

# Linux (ufw)
sudo ufw deny 11434
sudo ufw status

# Linux (iptables) -- rules match in order, so allow loopback first,
# then drop everything else aimed at the Ollama port
sudo iptables -A INPUT -i lo -p tcp --dport 11434 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 11434 -j DROP
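
It's also worth knowing that Ollama binds to 127.0.0.1 by default, so it isn't reachable from other machines unless OLLAMA_HOST has been changed to something like 0.0.0.0. You can confirm what it's actually listening on:

# Show which address the Ollama port is bound to (Linux)
ss -tlnp | grep 11434

# "127.0.0.1:11434" means localhost only; "0.0.0.0:11434" or "*:11434"
# means it's exposed to your network, and the firewall rules above matter more.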

2. Encryption at Rest

Encrypt the drive where models are stored:

  • Windows: Use BitLocker
  • macOS: Enable FileVault
  • Linux: Use LUKS encryption

3. Regular Updates

# Update Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Update models
ollama pull qwen3-coder

# Update Claude Code
curl -fsSL https://claude.ai/install.sh | bash

# Update Codex
npm update -g @openai/codex

Compliance Considerations

For enterprise or regulated environments:

  • GDPR: Local processing means data never leaves your jurisdiction
  • HIPAA: No PHI is transmitted to third parties
  • SOC 2: Easier to maintain control over data processing
  • ISO 27001: Reduces attack surface by eliminating cloud dependencies

Part 8: Troubleshooting Common Issues

Issue: "Connection refused to localhost:11434"

Cause: Ollama service isn't running.

Solution:

# Check if Ollama is running
curl http://localhost:11434

# Linux
systemctl status ollama
sudo systemctl start ollama

# macOS
brew services list
brew services start ollama

# Windows
# Check Services app (services.msc)
# Restart "Ollama" service

Issue: "Model not found" or "Model does not exist"

Cause: Model isn't downloaded or name is misspelled.

Solution:

# List all downloaded models
ollama list

# Download the model if missing
ollama pull qwen3-coder

# Use exact name from 'ollama list'
claude --model qwen3-coder
codex --oss -m qwen3-coder

Issue: "Out of memory" errors

Cause: Model is too large for available RAM/VRAM.

Solution:

# Switch to a smaller model
ollama pull glm-4.7
claude --model glm-4.7

# Or close other applications to free memory

Issue: Very slow response times (30+ seconds)

Potential causes and fixes:

1. CPU bottleneck (no GPU acceleration):

# Verify GPU is being used
ollama ps
nvidia-smi  # Should show activity during inference

# If not using GPU, reinstall NVIDIA drivers

2. Model too large for hardware:

# Use a smaller/faster model
claude --model qwen3-coder  # instead of gpt-oss:120b

3. Context window too large:

# Reduce context window in modelfile
cat > smaller-context.txt << EOF
FROM qwen3-coder
PARAMETER num_ctx 32000
EOF

ollama create qwen3-fast -f smaller-context.txt
claude --model qwen3-fast

Issue: "Context window exceeded"

Cause: Your project has too many files or the model's context is too small.

Solution:

# Use a model with larger context
ollama pull qwen3-coder  # 128k context

# Or limit the files being analyzed
# Focus Claude Code/Codex on specific directories

Issue: Claude Code or Codex crashes

Debugging steps:

# Check Ollama is running
curl http://localhost:11434

# Check system resources
top  # or htop
nvidia-smi  # GPU

# Run with verbose output
claude --model qwen3-coder --verbose
codex --oss --verbose

Part 9: Local vs. Cloud: Honest Comparison

Let's be realistic about the trade-offs between local and cloud AI coding assistants.

Local Setup Advantages

  • Complete privacy — Your code never leaves your machine
  • Zero ongoing costs — No API bills, ever
  • Offline capability — Works without internet connection
  • No rate limits — Use as much as you want
  • Full control — Customize models, prompts, and behavior
  • No vendor lock-in — Switch models freely
  • Compliance friendly — Easier to meet regulatory requirements
  • No data retention concerns — Nothing stored externally

Local Setup Disadvantages

  • Hardware requirements — Needs decent CPU/RAM/GPU
  • Initial setup complexity — More steps than API signup
  • Slower than cloud — Even with good hardware
  • Model updates — Manual downloads required
  • Capability ceiling — Can't match GPT-4 or Claude Opus
  • Maintenance burden — You manage updates and troubleshooting

Cloud AI Advantages

  • Fastest performance — Optimized datacenter hardware
  • Latest models — Immediate access to new releases
  • No setup — Works out of the box
  • Highest capability — Access to largest, most powerful models
  • Zero maintenance — Provider handles everything

Cloud AI Disadvantages

  • Privacy concerns — Code sent to third parties
  • Ongoing costs — Can get expensive ($20-500+/month)
  • Requires internet — Useless offline
  • Rate limits — Throttling during high demand
  • Vendor lock-in — Dependent on provider's ecosystem
  • Data retention — Your code may be stored/used for training

My Personal Recommendation

Use local AI when:

  • Working on proprietary or sensitive code
  • Prototyping and experimenting heavily
  • Internet connection is unreliable
  • Cost is a major concern
  • Privacy is non-negotiable
  • You have capable hardware (16GB+ RAM)

Use cloud AI when:

  • You need the absolute best quality
  • Speed is critical to your workflow
  • You rarely use AI assistance (low usage)
  • You don't have capable hardware
  • Convenience outweighs privacy concerns

Hybrid approach (my preference):

  • Use local for routine tasks (80% of work)
  • Use cloud for complex problems (20% of work)
  • Keep both configured and switch as needed

Cost Analysis

Local Setup (One-Time):

  • Hardware (if upgrading): $500-2000
  • Time investment: 2-4 hours
  • Electricity: ~$5-15/month
  • Total first year: $560-2180 (then $60-180/year)

Cloud AI (Ongoing):

  • GitHub Copilot: $10/month ($120/year)
  • Claude API: $20-100/month ($240-1200/year)
  • Heavy usage: $200-500+/month ($2400-6000+/year)
  • Total first year: $120-6000+ (recurring annually)

Break-even point: Local setup pays for itself in 3-12 months with heavy usage.
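
As a concrete (purely illustrative) example: a developer spending $200/month on API usage who puts $1,200 into a GPU upgrade and pays about $10/month in electricity breaks even in roughly $1,200 ÷ ($200 - $10) ≈ 6.3 months. At $50/month of API spend, the same upgrade takes closer to two and a half years to pay off, which is why your usage level matters so much in this comparison.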


Conclusion: The Future is Flexible and Private

The ability to run Claude Code or Codex entirely on your own machine represents a fundamental shift in how we think about AI tooling.

You're no longer forced to choose between capability and privacy, between power and control.

With my HP ZBook setup (Intel i7-12800H, 32GB RAM, RTX A2000), I've found local AI remarkably capable for daily development work. Models like qwen3-coder and glm-4.7 handle most tasks admirably. They feel nearly as capable as cloud alternatives for many scenarios.

Yes, there's a quality gap compared to frontier models like GPT-4 Turbo or Claude Opus. But that gap is narrowing rapidly. And for many use cases—refactoring, documentation, test generation, code explanation, debugging assistance—local models are already "good enough."

The real win is optionality.

You can run everything locally by default, preserving privacy and eliminating costs. When you hit a genuinely complex problem requiring maximum capability, you can temporarily connect to cloud models through ollama.com for that specific task, then disconnect.

That's the future: flexible, hybrid, user-controlled AI that adapts to your needs rather than forcing you into someone else's business model.

Key Takeaways

  1. Local AI is practical — Not just for hobbyists; production-ready for many tasks
  2. Both Claude Code and Codex work locally — Through Ollama's compatible APIs
  3. Privacy is achievable — You can have powerful AI without cloud dependencies
  4. Hardware matters — But 32GB RAM + decent GPU works well
  5. Setup is straightforward — Ollama + Claude Code/Codex runs in under an hour
  6. Performance is acceptable — 1-5 second responses for most queries
  7. Cost savings are real — Zero ongoing API bills
  8. It's getting better — Models improve monthly

Ready to Start?

The Quick Start (30 minutes):

  1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh (Linux/WSL2; on Windows or macOS, grab the installer from ollama.com)
  2. Download a model: ollama pull qwen3-coder
  3. Install Claude Code: curl -fsSL https://claude.ai/install.sh | bash
  4. Launch: ollama launch claude
  5. Start coding with complete privacy

Or use Codex:

  1. Install Ollama (same as above)
  2. Download a model: ollama pull gpt-oss:20b
  3. Install Codex: npm install -g @openai/codex
  4. Launch: codex --oss
  5. Start coding with complete privacy

The tools are mature. The models are capable. Your code stays yours.

Welcome to the future of private, powerful, local AI coding assistance.



About the Author: Wilson Kumalo is a software developer and AI enthusiast exploring the intersection of privacy, performance, and practical tooling. He runs all his AI experiments on an HP ZBook Studio 16 G9 Mobile Workstation and writes about making powerful technology accessible and privacy-respecting.

System Specs: Intel i7-12800H • 32GB DDR5 RAM • NVIDIA RTX A2000 8GB • 954GB NVMe SSD

Contact: Questions or feedback? Reach out on X @KumaloWilson or via email (info@wilsonkumalo.dev).


This article was written in January 2026 and reflects the state of local AI tooling at that time. Technologies evolve rapidly—always check official documentation for the latest information.

Tested and verified on:

  • Windows 11 Pro (64-bit) with WSL2
  • Ollama (latest version)
  • Claude Code via Ollama integration
  • Codex CLI via Ollama integration
  • Models: Qwen 3 Coder, GLM-4.7, GPT-OSS 20B
