By Wilson Kumalo · Updated Jan 30, 2026 · 23 min read

Tags: ollama, claude-code, codex, local-llm, ai-coding-assistant, privacy, open-source, gpu-acceleration, developer-tools, offline-ai

How to Run Claude Code Locally (100% Free & Fully Private): Complete Setup Guide with Real Hardware Testing

Transform your machine into a private AI coding agent using Ollama and Codex CLI. This comprehensive guide covers installation, configuration, and optimization on real hardware (HP ZBook with Intel i7-12800H and RTX A2000), delivering zero API costs, complete code privacy, and offline capability. Includes troubleshooting, performance benchmarks, and security best practices.

AI • Software Development • Privacy

How to Run Claude Code Locally (100% Free & Fully Private)

How to turn your machine into a private AI coding agent using Ollama with Claude Code or Codex—without sending a single byte to the cloud.


You're Deep in a Codebase

The files are messy. The logic is half yours, half legacy, half regret.

You want help. But you don't want to send your entire project to the cloud just to get a suggestion that might be wrong anyway.

And you definitely don't want another API bill.

That's the moment this setup starts to make sense.

Because you can run Claude Code or Codex entirely on your own machine. Offline. Private. No tracking. No API costs. No one else seeing your files.

And the best part? This isn't a toy setup.

These tools can read files. Edit code. Run terminal commands. Refactor projects. And actually behave like a coding partner, not a chat box.

Let me show you exactly how it works.


Understanding the Tools: Claude Code and Codex

Both Claude Code and Codex are agentic coding tools that can work with local models through Ollama. Here's what you need to know:

Claude Code (Anthropic)

Claude Code is Anthropic's agentic coding tool that can read, modify, and execute code in your working directory. Thanks to Ollama's Anthropic-compatible API, you can now run Claude Code with local open-source models like qwen3-coder, glm-4.7, and gpt-oss.

Key Features:

  • Autonomous coding tasks from the terminal
  • Multi-file editing and refactoring
  • Project-wide analysis and code execution
  • Works with both cloud Claude models AND local Ollama models
  • Requires 64k+ token context window

Codex (OpenAI CLI)

Codex is OpenAI's command-line tool for AI-assisted coding. While originally designed for OpenAI's models, it fully supports Ollama's local models through the --oss flag.

Key Features:

  • Terminal-based coding assistance
  • File operations and code modifications
  • Supports local and cloud models via Ollama
  • Simple installation via npm
  • Requires 32k+ token context window

Which One Should You Use?

Choose Claude Code if:

  • You want the most modern agentic coding experience
  • You prefer Anthropic's approach to AI coding
  • You want flexibility to switch between local and cloud models
  • You have 64k+ context window models

Choose Codex if:

  • You prefer a simpler, more established tool
  • You're already familiar with OpenAI's ecosystem
  • You want lower context requirements (32k)
  • You prefer npm-based installation

Use both! They complement each other well and can both run locally through Ollama.


Who This Is Really For

Let's be honest about the audience here.

This is for you if:

  • You want a private AI coding agent that never leaves your machine
  • You like Claude Code or similar tools, but not the cloud dependency
  • You're experimenting with local LLMs and want something practical
  • You want an AI that can actually touch files and run commands
  • You're working on proprietary or sensitive code that can't be uploaded to third-party servers
  • You want to eliminate ongoing API costs entirely
  • Or you just want full control over your tools again

If you've ever thought, "I wish this thing just worked locally," you're in the right place.


The Big Idea: Understanding the Architecture

On their own, both Claude Code and Codex are just the shell: a command-line interface that needs a brain to power it.

That's where Ollama comes in.

Think of Ollama like Docker for AI models. It runs quietly in the background and lets you pull models the same way you'd pull Docker images.

The Three-Layer Architecture

Layer 1: Claude Code or Codex (The Interface)

  • The command-line tool you interact with
  • Handles file operations, code editing, and terminal commands
  • Manages conversation context across your project
  • Translates your requests into structured API calls

Layer 2: Ollama (The Model Runtime)

  • Background service running on localhost:11434
  • Downloads, stores, and serves AI models
  • Handles model loading and GPU acceleration
  • Provides Anthropic-compatible and OpenAI-compatible APIs

Layer 3: Local LLM (The Brain)

  • The actual AI model performing reasoning and generation
  • Supports tool calling for file and command execution
  • Requires 32k-64k+ token context window
  • Can be swapped/upgraded independently
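
To make the layering concrete, you can talk to Layer 2 directly. The sketch below (assuming you've already pulled qwen3-coder, covered in Part 2) hits Ollama's native generate endpoint on localhost; Claude Code and Codex do essentially the same thing through Ollama's Anthropic- and OpenAI-compatible endpoints on the same port.

# Ask the local model a question directly, bypassing Claude Code/Codex
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-coder",
  "prompt": "Write a one-line Python function that squares a number",
  "stream": false
}'

# The JSON response is generated entirely on your machine -- no external calls.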

Hardware Requirements: Can Your Machine Handle It?

This is the make-or-break question. Local AI isn't magic—it requires real computational resources.

My Testing Setup: HP ZBook Studio 16 G9

I'm running this entire setup on an HP ZBook Studio 16 inch G9 Mobile Workstation PC with the following specifications:

Processor:
12th Gen Intel® Core™ i7-12800H (2.40 GHz base frequency)

  • 14 cores (6 Performance-cores + 8 Efficient-cores)
  • 20 threads with Hyper-Threading
  • 24 MB Intel® Smart Cache
  • Max Turbo frequency up to 4.8 GHz

Memory:
32.0 GB DDR5 RAM (31.6 GB usable)

  • Speed: 4800 MT/s
  • Dual-channel configuration

Graphics Cards:

  • GPU 0: Intel® Iris® Xe Graphics (integrated, 1% utilization)
  • GPU 1: NVIDIA RTX™ A2000 8GB Laptop GPU (5% idle, 44°C)
    • 2,560 CUDA cores
    • 8 GB GDDR6 VRAM
    • Used for AI model acceleration

Storage:
954 GB NVMe SSD

  • 523 GB of 954 GB currently used
  • 431 GB available for AI models

Minimum Requirements (Budget Setup)

Absolute Minimum:

  • CPU: 4-core processor (Intel i5 8th gen / AMD Ryzen 5 3600 equivalent)
  • RAM: 16 GB
  • Storage: 50 GB free space (SSD highly recommended)
  • GPU: Integrated graphics acceptable (Intel UHD 630 or better)
  • OS: Windows 10/11, macOS 11+, or modern Linux distribution

Note: You'll be limited to 3B-7B parameter models with slower response times (5-15 seconds per response).

Recommended Requirements (Good Experience)

  • CPU: 6+ core processor (Intel i7/i9 10th gen+, AMD Ryzen 7/9 5000+)
  • RAM: 32 GB or more
  • Storage: 200+ GB free NVMe SSD space
  • GPU: Dedicated GPU with 6+ GB VRAM (NVIDIA RTX 3060/4060 or better, AMD RX 6700 XT+)
  • OS: Latest stable OS version

Note: This configuration handles 7B-13B models with 2-5 second response times and can run 20B+ models at acceptable speeds.

Optimal Setup (Enthusiast/Professional)

  • CPU: 8+ core processor (Intel i9 12th gen+, AMD Ryzen 9 7000+)
  • RAM: 64 GB or more
  • Storage: 500+ GB NVMe SSD (dedicated for AI)
  • GPU: High-end GPU with 12+ GB VRAM (NVIDIA RTX 4070 Ti/4080/4090, AMD RX 7900 XT/XTX)

Note: Runs 30B-70B models smoothly, handles multiple concurrent sessions, and provides near-instant responses for smaller models.

Performance Reality Check

With my HP ZBook configuration (i7-12800H, 32GB RAM, RTX A2000 8GB), here's what I can realistically run:

  • 3B-7B models: Nearly instant responses (0.5-2 seconds)
  • 13B models: Responsive and production-ready (2-5 seconds)
  • 20B models: Usable but noticeably slower (5-10 seconds)
  • 30B+ models: Requires CPU fallback, significantly slower (15-30+ seconds)

If you're on a laptop with 16GB RAM and integrated graphics, stick to 7B models—they're surprisingly capable for most coding tasks.


Part 1: Installing Ollama

Ollama is the foundation of our local AI stack. It's remarkably easy to install across all major platforms.

Installation by Operating System

Windows (Recommended for most users):

  1. Download the Windows installer from ollama.com
  2. Run the OllamaSetup.exe file
  3. The installer will set up Ollama as a Windows service that starts automatically
  4. Open PowerShell or Command Prompt and verify:
    ollama --version

Windows with WSL2 (Advanced users):

# Install WSL2 if not already installed
wsl --install

# Inside your WSL distribution (Ubuntu recommended)
curl -fsSL https://ollama.com/install.sh | sh

# Start the Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify installation
ollama --version
systemctl status ollama

macOS:

# Install via Homebrew
brew install ollama

# Or download the macOS app from ollama.com and drag it into Applications
# (the curl install script shown for Linux/WSL does not support macOS)

# Verify installation
ollama --version

Linux (Ubuntu/Debian):

# Install using the official script
curl -fsSL https://ollama.com/install.sh | sh

# Start and enable the Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify it's running
ollama --version
systemctl status ollama

Linux (Fedora/RHEL/CentOS):

# Install using the official script
curl -fsSL https://ollama.com/install.sh | sh

# Start and enable the Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify installation
ollama --version

Post-Installation Verification

Once installed, Ollama runs as a background service on port 11434. Verify it's working:

# Check if Ollama is running
curl http://localhost:11434

# You should see: "Ollama is running"

If you don't see this message, the service isn't running properly. Check your installation or firewall settings.

GPU Detection (NVIDIA Users)

If you have an NVIDIA GPU (like my RTX A2000), verify Ollama can actually use it. Note that ollama ps only lists models that are currently loaded, so this check is most useful once you've pulled and run a model (Part 2):

# Check which models are loaded and where they're running
ollama ps

# Look at the PROCESSOR column: "100% GPU" means the model is running
# on your GPU; "100% CPU" means the GPU isn't being used.

If GPU isn't detected:

  • Windows: Install latest NVIDIA drivers from nvidia.com
  • Linux: Make sure the NVIDIA driver is installed (that's what Ollama needs; it bundles its own CUDA runtime). The CUDA toolkit is optional:
    sudo apt install nvidia-cuda-toolkit
    Then restart Ollama: sudo systemctl restart ollama

Part 2: Downloading AI Models

Now comes the exciting part—choosing and downloading your AI models.

Understanding Model Naming Convention

Ollama uses a straightforward naming format:

model-name:parameter-size-quantization

Examples:

  • qwen3-coder = Qwen 3 Coder (default size and quantization)
  • glm-4.7 = GLM 4.7 (default size and quantization)
  • gpt-oss:20b = GPT-OSS with 20 billion parameters
  • gpt-oss:120b = GPT-OSS with 120 billion parameters

Recommended Models for Claude Code and Codex

1. Qwen 3 Coder — Best for Claude Code (Highly Recommended)

ollama pull qwen3-coder
  • Size: ~4.5 GB download
  • RAM needed: 8-12 GB
  • Strengths: Excellent code generation, strong reasoning, optimized for coding tasks
  • Context window: 128k tokens (perfect for Claude Code)
  • Best for: Daily coding with Claude Code, multi-file refactoring

2. GLM-4.7 — Strong Alternative

ollama pull glm-4.7
  • Size: ~4.2 GB download
  • RAM needed: 8-12 GB
  • Strengths: Fast inference, good code understanding, multilingual
  • Context window: 128k tokens
  • Best for: Quick iterations, projects in multiple languages

3. GPT-OSS 20B — Balanced Power (Default for Codex)

ollama pull gpt-oss:20b
  • Size: ~11 GB download
  • RAM needed: 24-32 GB
  • Strengths: More capable reasoning, better at complex tasks
  • Context window: 64k+ tokens
  • Best for: Complex refactoring, architectural decisions

4. GPT-OSS 120B — Maximum Capability (If you have the hardware)

ollama pull gpt-oss:120b
  • Size: ~68 GB download
  • RAM needed: 80+ GB
  • Strengths: Highest quality code generation, best reasoning
  • Context window: 64k+ tokens
  • Best for: Complex systems, production-grade code generation
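
If you're unsure which of these will actually fit on your machine, a rough rule of thumb (an approximation, not an official figure): a 4-bit-quantized model needs on the order of 0.5-0.7 GB of RAM/VRAM per billion parameters, plus a few extra GB for the context cache. That's why the smaller models above land in the 4-5 GB range and fit an 8 GB GPU like my A2000, the 20B model spills partly into system RAM on the same card, and the 120B model is out of reach without 80 GB+ of memory.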

Context Window Requirements

Critical: Claude Code and Codex have specific context requirements:

  • Claude Code: Minimum 64k tokens recommended
  • Codex: Minimum 32k tokens recommended

Most models in Ollama's library meet these requirements, but you can verify:

# Check model details
ollama show qwen3-coder

# Look for num_ctx parameter

Download Strategy (My Recommendation)

Start with one model for your use case:

# For Claude Code (best experience)
ollama pull qwen3-coder

# For Codex (default)
ollama pull gpt-oss:20b

# Or download both to compare
ollama pull qwen3-coder
ollama pull gpt-oss:20b

# Check what you've downloaded
ollama list

Model Storage Locations

  • Linux/macOS: ~/.ollama/models/
  • Windows: %USERPROFILE%\.ollama\models\
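
If you're curious how much disk space your models are using (useful on a shared SSD like mine), here's a quick check on Linux/macOS; adjust the path for Windows as listed above:

# Total disk space used by all pulled models
du -sh ~/.ollama/models

# Per-model sizes as Ollama reports them
ollama list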

Testing Your Model

Before connecting to Claude Code or Codex, verify your model works correctly:

# Start an interactive session
ollama run qwen3-coder

# Try a simple coding prompt
>>> Write a Python function to calculate fibonacci numbers recursively

# You should get a proper code response

# Exit with Ctrl+D or type:
>>> /bye

If this works, you're ready to set up Claude Code or Codex.


Part 3: Installing and Configuring Claude Code

Claude Code is Anthropic's agentic coding tool that now works seamlessly with Ollama's local models.

Installing Claude Code

macOS / Linux:

# Install Claude Code
curl -fsSL https://claude.ai/install.sh | bash

# Verify installation
claude --version

Windows:

Download the Windows installer from code.claude.com and follow the installation wizard.

Quick Setup with Ollama

The fastest way to get Claude Code working with Ollama:

# Automatic configuration and launch
ollama launch claude

This command automatically:

  1. Detects your Ollama installation
  2. Configures Claude Code to use local models
  3. Sets up the Anthropic-compatible API endpoint
  4. Launches Claude Code in your current directory

Manual Setup (More Control)

For custom setups, configure without launching:

# Configure Claude Code for Ollama
ollama launch claude --config

Or set environment variables manually:

# Set the environment variables
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434

# Run Claude Code with an Ollama model
claude --model qwen3-coder

Or run with environment variables inline:

ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model glm-4.7

Changing Models

Switch between different models easily:

# Use Qwen 3 Coder (recommended)
claude --model qwen3-coder

# Use GLM 4.7
claude --model glm-4.7

# Use GPT-OSS 20B
claude --model gpt-oss:20b

# Use GPT-OSS 120B (if you have the hardware)
claude --model gpt-oss:120b

Persistent Configuration

Make your Ollama setup permanent by adding to your shell profile:

# For bash (~/.bashrc) or zsh (~/.zshrc)
echo 'export ANTHROPIC_AUTH_TOKEN=ollama' >> ~/.bashrc
echo 'export ANTHROPIC_API_KEY=""' >> ~/.bashrc
echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.bashrc
source ~/.bashrc

# Now you can just run:
claude --model qwen3-coder
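
If you'd rather not export the Anthropic variables globally (they could interfere if you also use Anthropic's cloud API from the same shell), a small wrapper keeps the Ollama settings scoped to a single command. This is just a convenience sketch: claude-local is a name I made up, and it reuses the same variables shown above.

# Add to ~/.bashrc or ~/.zshrc
claude-local() {
  ANTHROPIC_AUTH_TOKEN=ollama \
  ANTHROPIC_API_KEY="" \
  ANTHROPIC_BASE_URL=http://localhost:11434 \
  claude --model "${1:-qwen3-coder}"
}

# Usage:
#   claude-local             # defaults to qwen3-coder
#   claude-local glm-4.7     # any model you've pulled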

Using Cloud Models (Optional)

Claude Code also supports cloud models through ollama.com:

# Use cloud-hosted 120B model
claude --model gpt-oss:120b-cloud

Cloud models at ollama.com/search?c=cloud require an API key but offer more powerful options.


Part 4: Installing and Configuring Codex

Codex is OpenAI's command-line coding tool that works great with Ollama's local models.

Installing Codex

# Install via npm
npm install -g @openai/codex

# Verify installation
codex --version

Quick Setup with Ollama

The fastest way to use Codex with Ollama:

# Launch Codex in OSS (Open Source Software) mode
codex --oss

This automatically connects to your local Ollama instance at http://localhost:11434.

Changing Models

By default, Codex uses gpt-oss:20b. Switch to different models with the -m flag:

# Use GPT-OSS 20B (default)
codex --oss

# Use GPT-OSS 120B (more capable)
codex --oss -m gpt-oss:120b

# Use Qwen 3 Coder
codex --oss -m qwen3-coder

# Use GLM 4.7
codex --oss -m glm-4.7

Manual Configuration

For persistent settings, edit ~/.codex/config.toml:

# ~/.codex/config.toml

model = "qwen3-coder"
model_provider = "ollama"

[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434"
timeout = 120
max_retries = 3

[context]
max_tokens = 65536  # Match your model's context window
max_files = 50
include_hidden = false

[performance]
stream_responses = true
cache_responses = true

After editing, restart Codex to load the new settings.
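
Before starting a long Codex session, it's worth confirming that Ollama is actually serving the model named in config.toml. A quick sanity check using Ollama's native tags endpoint:

# List the models your local Ollama instance has available
curl -s http://localhost:11434/api/tags

# The model from config.toml (qwen3-coder here) should appear in the
# "models" array; if it doesn't, pull it first: ollama pull qwen3-coder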

Cloud Models via ollama.com

To use ollama.com's cloud models:

  1. Create an API key at ollama.com/settings/keys
  2. Export it as an environment variable:
    export OLLAMA_API_KEY="your-api-key-here"
    
  3. Edit ~/.codex/config.toml:
model = "gpt-oss:120b-cloud"
model_provider = "ollama"

[model_providers.ollama]
name = "Ollama"
base_url = "https://ollama.com/v1"
env_key = "OLLAMA_API_KEY"
  4. Run codex in a new terminal to load the settings

Part 5: Using Claude Code and Codex Effectively

Now that everything is configured, let's explore how to actually use these tools for real development work.

Starting a Session

With Claude Code:

# Navigate to your project
cd ~/projects/my-web-app

# Start Claude Code with default settings
claude --model qwen3-coder

# You'll see a prompt like:
# Claude (qwen3-coder)
# >

With Codex:

# Navigate to your project
cd ~/projects/my-web-app

# Start Codex with Ollama
codex --oss

# Or with a specific model
codex --oss -m qwen3-coder

Basic Commands and Operations

File Operations:

# Read files
> Read the main.py file
> Show me the authentication module
> What's in the utils folder?

# List files
> List all Python files in src/
> Show the project structure

# Search
> Find all TODO comments
> Show me where the User class is defined

Code Generation:

# Create new code
> Create a new API endpoint for user registration
> Write a function to validate email addresses
> Generate a TypeScript interface for the User model

# Add features
> Add logging to all API endpoints
> Implement rate limiting middleware
> Create unit tests for the calculate_total function

Code Analysis:

# Understanding code
> Explain what the process_payment function does
> What's the purpose of this decorator?
> Walk me through the authentication flow

# Finding issues
> Find potential bugs in the error handling
> Check for security vulnerabilities
> Identify performance bottlenecks in this loop

# Suggesting improvements
> Suggest optimizations for the database queries
> How can I make this code more maintainable?
> Review this code for best practices

Refactoring:

# Code improvements
> Refactor this class to use dependency injection
> Convert this callback-based code to async/await
> Extract this repeated logic into a helper function

# Modernization
> Update this code to use TypeScript generics
> Migrate from class components to React hooks
> Rewrite using modern Python type hints

Terminal Commands:

# Running commands
> Run the test suite
> Install the requests package
> Check git status
> Create a new branch called feature/user-auth

# Build and deployment
> Build the Docker image
> Run the development server
> Deploy to staging environment

Real-World Example Workflow

Let's walk through a realistic development scenario:

# Start in your e-commerce API project
cd ~/projects/ecommerce-api
claude --model qwen3-coder

# Session starts
> Hi! I need to add better error handling to the payment processing module

# Claude analyzes the code
[Reads payment.py and related files]

> I can see the payment module. The current error handling is minimal. 
> I'll add comprehensive try-catch blocks, logging, and user-friendly 
> error messages. Should I also add retry logic for network failures?

> Yes, add retry logic with exponential backoff

# Claude makes the changes
[Shows code modifications]

> Review the changes I made to payment.py

# After reviewing
> Looks good. Now write unit tests for these error scenarios

# Claude generates tests
[Creates test_payment_errors.py with comprehensive test cases]

> Run the tests

# Claude executes pytest
[Shows test results - 1 test failing]

> Test test_network_retry is failing because of a type mismatch on line 45

# Claude analyzes and fixes
> I see the issue. The mock response returns a string but the code 
> expects a dict. I'll fix the test.

[Updates test file]

> Run the tests again

[All tests pass]

> Perfect! Write a git commit message for these changes

> Here's a suggested commit message:
> 
> feat(payment): Add comprehensive error handling with retry logic
> 
> - Added try-catch blocks for all payment operations
> - Implemented exponential backoff for network failures
> - Added detailed logging for debugging
> - Created unit tests for error scenarios
> - Fixed type handling in payment response processing

> Commit these changes with that message

[Executes git commit]

> Done! Anything else?

> No, that's all. Thanks!

# Exit
Ctrl+D

This entire workflow stays local. Your code never leaves your machine.

Keyboard Shortcuts

  • Ctrl+D - Exit Claude Code or Codex
  • Ctrl+C - Cancel current operation
  • ↑/↓ - Navigate command history

Part 6: Performance Optimization

Running AI locally requires careful attention to performance. Here's how to get the most out of your hardware.

GPU Acceleration (Critical for Performance)

If you have an NVIDIA GPU (like my RTX A2000), ensure Ollama is using it properly:

Verify GPU usage:

# Check if models are running on GPU
ollama ps

# While a model is loaded, check the PROCESSOR column
# Example output:
# NAME           ID        SIZE     PROCESSOR    UNTIL
# qwen3-coder    abc123    4.5 GB   100% GPU     4 minutes from now

If the PROCESSOR column shows "100% CPU" (or a CPU/GPU split) instead of "100% GPU", there's a problem.

Fix GPU detection (Windows):

  1. Install latest NVIDIA drivers from nvidia.com
  2. Restart your computer
  3. Restart Ollama service:
    Restart-Service Ollama  # In PowerShell as admin
    

Fix GPU detection (Linux):

# First make sure the NVIDIA driver is installed (from your distro's
# packages or nvidia.com) -- that's what Ollama actually needs; it
# bundles its own CUDA runtime. The toolkit below is optional.
sudo apt update
sudo apt install nvidia-cuda-toolkit

# Verify the toolkit (optional)
nvcc --version

# Check GPU is visible
nvidia-smi

# Restart Ollama
sudo systemctl restart ollama

# Test GPU usage
ollama run qwen3-coder

Monitor GPU usage during inference:

# Linux/Windows (with NVIDIA drivers)
nvidia-smi -l 1  # Updates every 1 second

# Watch GPU utilization while using Claude Code/Codex
# You should see GPU usage spike during responses

Adjusting Context Window

Both Claude Code and Codex benefit from larger context windows. You can adjust this in Ollama:

# Create a custom modelfile
cat > custom-qwen.txt << EOF
FROM qwen3-coder
PARAMETER num_ctx 128000
EOF

# Create custom model
ollama create qwen3-coder-large -f custom-qwen.txt

# Use the custom model
claude --model qwen3-coder-large

See Ollama's context length documentation for more details.

Memory Management Strategies

For 16GB RAM systems:

  • Use smaller models (qwen3-coder, glm-4.7)
  • Close other applications when coding
  • Use one model at a time
  • Consider reducing context window if needed

For 32GB+ RAM systems (like mine):

  • Can comfortably run gpt-oss:20b
  • Multiple models can be loaded
  • Full context windows work smoothly

For 64GB+ RAM systems:

  • Run gpt-oss:120b
  • Keep multiple models loaded
  • Maximum context windows
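
Beyond picking the right model size, Ollama exposes a few environment variables that control how models occupy memory. The variables below are real Ollama settings; treat the values as starting points rather than recommendations.

# Keep a model in memory longer after its last request (default is a few minutes)
export OLLAMA_KEEP_ALIVE=10m

# Limit how many models can be loaded at once (useful on 16 GB machines)
export OLLAMA_MAX_LOADED_MODELS=1

# Note: if Ollama runs as a systemd service (Linux), set these with
# `sudo systemctl edit ollama` as Environment= lines instead of shell
# exports, then restart: sudo systemctl restart ollama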

Benchmarking Your Setup

Test response times to find your optimal configuration:

# Simple benchmark
time ollama run qwen3-coder "Write a hello world function in Python"

# Complex benchmark
time ollama run qwen3-coder "Explain the singleton pattern with code examples"

My benchmark results (HP ZBook, RTX A2000, 32GB RAM):

Model        | Simple Query | Complex Query | GPU Usage
qwen3-coder  | 1.2s         | 4.8s          | 68%
glm-4.7      | 1.1s         | 4.5s          | 65%
gpt-oss:20b  | 6.8s         | 24.1s         | 98% + CPU

Your results will vary based on hardware.


Part 7: Privacy and Security Considerations

Running AI locally is about more than saving money—it's about data sovereignty and security.

What Stays Local (Everything)

With this setup, 100% of your data stays on your machine:

  • All code and files processed
  • Conversation history and context
  • Model weights and inference computations
  • Generated outputs and suggestions
  • Project analysis and documentation

Zero network calls are made to external services (unless you explicitly use cloud models).

Verifying Complete Privacy

Confirm nothing is being sent externally by monitoring network connections:

Linux:

# Monitor all network connections
sudo netstat -tnp | grep ollama

# You should only see local connections (127.0.0.1)

Windows:

  1. Open Resource Monitor (resmon.exe)
  2. Go to Network tab
  3. Filter by ollama.exe
  4. Verify all connections are to 127.0.0.1:11434

macOS:

# Monitor network connections
sudo lsof -i -P | grep ollama

# You should only see localhost connections

Security Best Practices

1. Firewall Configuration

Block external access to Ollama's port:

# Linux (ufw)
sudo ufw deny 11434
sudo ufw status

# Linux (iptables) -- rules match in order, so allow loopback first,
# then drop everything else aimed at the Ollama port
sudo iptables -A INPUT -i lo -p tcp --dport 11434 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 11434 -j DROP
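
It's also worth knowing that Ollama binds to 127.0.0.1 by default, so it isn't reachable from other machines unless OLLAMA_HOST has been changed to something like 0.0.0.0. You can confirm what it's actually listening on:

# Show which address the Ollama port is bound to (Linux)
ss -tlnp | grep 11434

# "127.0.0.1:11434" means localhost only; "0.0.0.0:11434" or "*:11434"
# means it's exposed to your network, and the firewall rules above matter more.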

2. Encryption at Rest

Encrypt the drive where models are stored:

  • Windows: Use BitLocker
  • macOS: Enable FileVault
  • Linux: Use LUKS encryption

3. Regular Updates

# Update Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Update models
ollama pull qwen3-coder

# Update Claude Code
curl -fsSL https://claude.ai/install.sh | bash

# Update Codex
npm update -g @openai/codex

Compliance Considerations

For enterprise or regulated environments:

  • GDPR: Local processing means data never leaves your jurisdiction
  • HIPAA: No PHI is transmitted to third parties
  • SOC 2: Easier to maintain control over data processing
  • ISO 27001: Reduces attack surface by eliminating cloud dependencies

Part 8: Troubleshooting Common Issues

Issue: "Connection refused to localhost:11434"

Cause: Ollama service isn't running.

Solution:

# Check if Ollama is running
curl http://localhost:11434

# Linux
systemctl status ollama
sudo systemctl start ollama

# macOS
brew services list
brew services start ollama

# Windows
# Check Services app (services.msc)
# Restart "Ollama" service

Issue: "Model not found" or "Model does not exist"

Cause: Model isn't downloaded or name is misspelled.

Solution:

# List all downloaded models
ollama list

# Download the model if missing
ollama pull qwen3-coder

# Use exact name from 'ollama list'
claude --model qwen3-coder
codex --oss -m qwen3-coder

Issue: "Out of memory" errors

Cause: Model is too large for available RAM/VRAM.

Solution:

# Switch to a smaller model
ollama pull glm-4.7
claude --model glm-4.7

# Or close other applications to free memory

Issue: Very slow response times (30+ seconds)

Potential causes and fixes:

1. CPU bottleneck (no GPU acceleration):

# Verify GPU is being used
ollama ps
nvidia-smi  # Should show activity during inference

# If not using GPU, reinstall NVIDIA drivers

2. Model too large for hardware:

# Use a smaller/faster model
claude --model qwen3-coder  # instead of gpt-oss:120b

3. Context window too large:

# Reduce context window in modelfile
cat > smaller-context.txt << EOF
FROM qwen3-coder
PARAMETER num_ctx 32000
EOF

ollama create qwen3-fast -f smaller-context.txt
claude --model qwen3-fast

Issue: "Context window exceeded"

Cause: Your project has too many files or the model's context is too small.

Solution:

# Use a model with larger context
ollama pull qwen3-coder  # 128k context

# Or limit the files being analyzed
# Focus Claude Code/Codex on specific directories

Issue: Claude Code or Codex crashes

Debugging steps:

# Check Ollama is running
curl http://localhost:11434

# Check system resources
top  # or htop
nvidia-smi  # GPU

# Run with verbose output
claude --model qwen3-coder --verbose
codex --oss --verbose

Part 9: Local vs. Cloud: Honest Comparison

Let's be realistic about the trade-offs between local and cloud AI coding assistants.

Local Setup Advantages

  • Complete privacy — Your code never leaves your machine
  • Zero ongoing costs — No API bills, ever
  • Offline capability — Works without internet connection
  • No rate limits — Use as much as you want
  • Full control — Customize models, prompts, and behavior
  • No vendor lock-in — Switch models freely
  • Compliance friendly — Easier to meet regulatory requirements
  • No data retention concerns — Nothing stored externally

Local Setup Disadvantages

  • Hardware requirements — Needs decent CPU/RAM/GPU
  • Initial setup complexity — More steps than API signup
  • Slower than cloud — Even with good hardware
  • Model updates — Manual downloads required
  • Capability ceiling — Can't match GPT-4 or Claude Opus
  • Maintenance burden — You manage updates and troubleshooting

Cloud AI Advantages

  • Fastest performance — Optimized datacenter hardware
  • Latest models — Immediate access to new releases
  • No setup — Works out of the box
  • Highest capability — Access to largest, most powerful models
  • Zero maintenance — Provider handles everything

Cloud AI Disadvantages

  • Privacy concerns — Code sent to third parties
  • Ongoing costs — Can get expensive ($20-500+/month)
  • Requires internet — Useless offline
  • Rate limits — Throttling during high demand
  • Vendor lock-in — Dependent on provider's ecosystem
  • Data retention — Your code may be stored/used for training

My Personal Recommendation

Use local AI when:

  • Working on proprietary or sensitive code
  • Prototyping and experimenting heavily
  • Internet connection is unreliable
  • Cost is a major concern
  • Privacy is non-negotiable
  • You have capable hardware (16GB+ RAM)

Use cloud AI when:

  • You need the absolute best quality
  • Speed is critical to your workflow
  • You rarely use AI assistance (low usage)
  • You don't have capable hardware
  • Convenience outweighs privacy concerns

Hybrid approach (my preference):

  • Use local for routine tasks (80% of work)
  • Use cloud for complex problems (20% of work)
  • Keep both configured and switch as needed

Cost Analysis

Local Setup (One-Time):

  • Hardware (if upgrading): $500-2000
  • Time investment: 2-4 hours
  • Electricity: ~$5-15/month
  • Total first year: $560-2180 (then $60-180/year)

Cloud AI (Ongoing):

  • GitHub Copilot: $10/month ($120/year)
  • Claude API: $20-100/month ($240-1200/year)
  • Heavy usage: $200-500+/month ($2400-6000+/year)
  • Total first year: $120-6000+ (recurring annually)

Break-even point: Local setup pays for itself in 3-12 months with heavy usage.
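
As a concrete (purely illustrative) example: a developer spending $200/month on API usage who puts $1,200 into a GPU upgrade and pays about $10/month in electricity breaks even in roughly $1,200 ÷ ($200 - $10) ≈ 6.3 months. At $50/month of API spend, the same upgrade takes closer to two and a half years to pay off, which is why your usage level matters so much in this comparison.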


Conclusion: The Future is Flexible and Private

The ability to run Claude Code or Codex entirely on your own machine represents a fundamental shift in how we think about AI tooling.

You're no longer forced to choose between capability and privacy, between power and control.

With my HP ZBook setup (Intel i7-12800H, 32GB RAM, RTX A2000), I've found local AI remarkably capable for daily development work. Models like qwen3-coder and glm-4.7 handle most tasks admirably. They feel nearly as capable as cloud alternatives for many scenarios.

Yes, there's a quality gap compared to frontier models like GPT-4 Turbo or Claude Opus. But that gap is narrowing rapidly. And for many use cases—refactoring, documentation, test generation, code explanation, debugging assistance—local models are already "good enough."

The real win is optionality.

You can run everything locally by default, preserving privacy and eliminating costs. When you hit a genuinely complex problem requiring maximum capability, you can temporarily connect to cloud models through ollama.com for that specific task, then disconnect.

That's the future: flexible, hybrid, user-controlled AI that adapts to your needs rather than forcing you into someone else's business model.

Key Takeaways

  1. Local AI is practical — Not just for hobbyists; production-ready for many tasks
  2. Both Claude Code and Codex work locally — Through Ollama's compatible APIs
  3. Privacy is achievable — You can have powerful AI without cloud dependencies
  4. Hardware matters — But 32GB RAM + decent GPU works well
  5. Setup is straightforward — Ollama + Claude Code/Codex runs in under an hour
  6. Performance is acceptable — 1-5 second responses for most queries
  7. Cost savings are real — Zero ongoing API bills
  8. It's getting better — Models improve monthly

Ready to Start?

The Quick Start (30 minutes):

  1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh (Linux/WSL2; on Windows or macOS, grab the installer from ollama.com)
  2. Download a model: ollama pull qwen3-coder
  3. Install Claude Code: curl -fsSL https://claude.ai/install.sh | bash
  4. Launch: ollama launch claude
  5. Start coding with complete privacy

Or use Codex:

  1. Install Ollama (same as above)
  2. Download a model: ollama pull gpt-oss:20b
  3. Install Codex: npm install -g @openai/codex
  4. Launch: codex --oss
  5. Start coding with complete privacy

The tools are mature. The models are capable. Your code stays yours.

Welcome to the future of private, powerful, local AI coding assistance.



About the Author: Wilson Kumalo is a software developer and AI enthusiast exploring the intersection of privacy, performance, and practical tooling. He runs all his AI experiments on an HP ZBook Studio 16 G9 Mobile Workstation and writes about making powerful technology accessible and privacy-respecting.

System Specs: Intel i7-12800H • 32GB DDR5 RAM • NVIDIA RTX A2000 8GB • 954GB NVMe SSD

Contact: Questions or feedback? Reach out on X @KumaloWilson or via email (info@wilsonkumalo.dev).


This article was written in January 2026 and reflects the state of local AI tooling at that time. Technologies evolve rapidly—always check official documentation for the latest information.

Tested and verified on:

  • Windows 11 Pro (64-bit) with WSL2
  • Ollama (latest version)
  • Claude Code via Ollama integration
  • Codex CLI via Ollama integration
  • Models: Qwen 3 Coder, GLM-4.7, GPT-OSS 20B
