How to Run a Large Language Model (LLM) Locally Using Ollama – Complete Beginner to Advanced Guide
Imagine having a powerful AI assistant like ChatGPT running entirely on your own computer — no internet connection required, no monthly subscription fees, and complete privacy for your data.
Ollama makes this possible by providing a simple, one-command way to run large language models locally on Windows, macOS, and Linux. Whether you have a low-powered laptop or a high-end gaming rig, there's a model that will work for you.
In this guide, you'll learn what Ollama is, how to install it using exact PowerShell and terminal commands, which models to choose based on your hardware, how to run them, and even how to build your own AI-powered applications — all with zero cloud dependencies.
What is Ollama?
Ollama is a free, open-source tool that bundles everything needed to run large language models locally. It handles model downloading, memory management, GPU acceleration, and provides both a command-line interface and a REST API.
Ollama simplifies local AI by:
- Letting you pull and run models with a single command
- Auto-detecting and utilizing GPU hardware (NVIDIA, AMD, Apple Silicon)
- Managing model versions and storage automatically
- Providing an OpenAI-compatible API for developers
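Because the API is OpenAI-compatible, existing OpenAI-style client code can often be pointed at Ollama's local `/v1/chat/completions` endpoint unchanged. A minimal sketch with plain `requests` (the live call is commented out so the snippet runs without a server; the helper function is ours, not part of Ollama):

```python
def openai_style_payload(model: str, user_msg: str) -> dict:
    """Build a chat payload in the OpenAI format Ollama's /v1 endpoint accepts."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }

payload = openai_style_payload("phi3", "Hello!")
print(payload["messages"][0]["role"])  # user

# With Ollama running locally:
# import requests
# r = requests.post("http://localhost:11434/v1/chat/completions", json=payload)
# print(r.json()["choices"][0]["message"]["content"])
```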
System Requirements: Low-End to Advanced Hardware
Different models require different hardware. Here's what you need based on your computer's capabilities:
Low-End / Older Computers (8GB RAM, no dedicated GPU)
- Use models: Phi-3 (3.8B), LLaMA 3.2 (3B), TinyLlama (1.1B)
- RAM needed: 4-8 GB
- Storage: ~5-8 GB free
- Expected performance: 2-5 seconds per response
Mid-Range / Average Laptops (16GB RAM, integrated or entry GPU)
- Use models: LLaMA 3.1 (8B), Mistral (7B), Gemma 2 (9B)
- RAM needed: 12-16 GB
- Storage: ~10-15 GB free
- Expected performance: 1-3 seconds per response
High-End / Gaming or Workstation (32GB+ RAM, NVIDIA GPU with 8GB+ VRAM)
- Use models: LLaMA 3.1 (70B), DeepSeek-Coder (33B), Mixtral (8x7B)
- RAM needed: 32-64 GB (64 GB recommended for the 70B models)
- Storage: ~40-80 GB free
- Expected performance: Near real-time, high accuracy
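The tiers above can be captured in a small helper that suggests a model for a given machine. The thresholds and model names simply mirror the table; they're a rule of thumb, not official Ollama guidance:

```python
def suggest_model(ram_gb: int, vram_gb: int = 0) -> str:
    """Suggest an Ollama model tag based on the hardware tiers above."""
    if ram_gb >= 32 and vram_gb >= 8:
        return "llama3.1:70b"   # high-end: workstation with a big GPU
    if ram_gb >= 16:
        return "llama3.1"       # mid-range: 8B model
    return "phi3"               # low-end: 3.8B model

print(suggest_model(8))         # phi3
print(suggest_model(16))        # llama3.1
print(suggest_model(64, 24))    # llama3.1:70b
```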
Installing Ollama: Step-by-Step Commands
For Windows (PowerShell – Run as Administrator)
# Option 1: Download installer from official website
# Visit: https://ollama.com/download/windows
# Run the downloaded .exe file
# Option 2: Install using winget (Windows Package Manager)
winget install Ollama.Ollama
# After installation, verify it's working
ollama --version
# If command not found, close and reopen PowerShell
For Linux (Ubuntu/Debian – Terminal)
# One-line install script
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Start Ollama service (if not auto-started)
ollama serve
For macOS (Terminal)
# Using Homebrew
brew install ollama
# Or download from https://ollama.com/download/mac
# Verify installation
ollama --version
Running Your First Model: Pull and Execute
Step 1: Pull a model (download it to your computer)
# For low-end computers (runs on almost any machine)
ollama pull phi3
# For mid-range computers
ollama pull llama3.2
# For high-end computers
ollama pull llama3.1
Step 2: Run the model interactively
# Start chatting with the model
ollama run phi3
# Type your message and press Enter
# Example: "Explain what a large language model is in one sentence."
# To exit the chat, type /bye or press Ctrl+D
Step 3: One-off prompt (non-interactive)
# Generate a single response without entering chat mode
ollama run phi3 "Write a haiku about artificial intelligence"
Essential Ollama Commands (PowerShell / Terminal)
# List all downloaded models
ollama list
# Show details about a specific model
ollama show phi3
# Remove a model to free up space
ollama rm phi3
# Show currently running models
ollama ps
# Reduce the context window for low memory: there is no CLI flag for this,
# so start a session (ollama run phi3) and type:
/set parameter num_ctx 2048
# Force CPU-only inference: inside a session, send zero layers to the GPU
/set parameter num_gpu 0
# Pull a specific quantized version (uses less RAM)
ollama pull llama3.2:3b-instruct-q4_0
ollama run llama3.2:3b-instruct-q4_0
Popular Models: Exact Pull and Run Commands
For Low-End / Less Powerful Computers
# Microsoft Phi-3 Mini (3.8B) - Best for low-end PCs
ollama pull phi3
ollama run phi3
# LLaMA 3.2 3B - Balanced speed and quality
ollama pull llama3.2
ollama run llama3.2
# TinyLlama (1.1B) - Extremely lightweight
ollama pull tinyllama
ollama run tinyllama
# Gemma 2B - Google's compact model
ollama pull gemma:2b
ollama run gemma:2b
For Mid-Range / Average Computers
# LLaMA 3.1 8B - Strong performance
ollama pull llama3.1
ollama run llama3.1
# Mistral 7B - Excellent all-around model
ollama pull mistral
ollama run mistral
# Gemma 2 9B - Great for reasoning
ollama pull gemma2
ollama run gemma2
# CodeLlama 7B - For programming assistance
ollama pull codellama
ollama run codellama
For Advanced / High-End Computers
# LLaMA 3.1 70B - State-of-the-art
ollama pull llama3.1:70b
ollama run llama3.1:70b
# Mixtral 8x7B - Mixture of experts
ollama pull mixtral
ollama run mixtral
# DeepSeek-Coder 33B - Advanced coding model
ollama pull deepseek-coder:33b
ollama run deepseek-coder:33b
# WizardCoder 34B - Programming specialist
ollama pull wizardcoder
ollama run wizardcoder
Using Ollama as a ChatGPT Alternative
Once you run ollama run mistral (or any model), you enter an interactive chat session. Here are example prompts to try:
# Coding help
"Write a Python function to check if a string is a palindrome."
# Writing assistance
"Write a professional email requesting a project update."
# Explanation
"Explain how transformers work in AI, like I'm 10 years old."
# Creative writing
"Write a short story about a robot that discovers emotions."
# Translation
"Translate 'Good morning, how are you?' into Spanish, French, and Japanese."
# Analysis
"Summarize the key differences between machine learning and deep learning."
The model maintains conversation history within a session, so you can ask follow-up questions naturally. Type /bye (or press Ctrl+D) to exit.
Using Ollama's API in Python
Ollama exposes a REST API at http://localhost:11434. Here's how to use it from Python:
# First, install the requests library:
#   pip install requests
import requests

# Simple generation
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi3",
        "prompt": "What is the capital of Japan?",
        "stream": False,
    },
)
print(response.json()["response"])

# Chat with history
messages = [
    {"role": "user", "content": "Who wrote Romeo and Juliet?"},
]
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": messages,
        "stream": False,
    },
)
print(response.json()["message"]["content"])

# List available models
models = requests.get("http://localhost:11434/api/tags")
print(models.json())
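For long answers you will usually want streaming. With "stream": true, the API returns one JSON object per line, each carrying a fragment of the answer in its "response" field and a "done" flag on the final line. A small helper (our sketch, not part of Ollama) can stitch the fragments back together:

```python
import json

def join_stream(ndjson_lines) -> str:
    """Concatenate the "response" fragments from a streamed /api/generate reply."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With a live server you would iterate the HTTP response:
# r = requests.post("http://localhost:11434/api/generate",
#                   json={"model": "phi3", "prompt": "Hi", "stream": True},
#                   stream=True)
# print(join_stream(r.iter_lines()))

# Offline demo using the line format the API emits:
demo = ['{"response": "Tok", "done": false}',
        '{"response": "yo", "done": true}']
print(join_stream(demo))  # Tokyo
```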
Building a Simple Local AI Chatbot
Save the following as chatbot.html and open it in your browser. Make sure Ollama is running (ollama serve in the background). If the browser console shows a CORS error, restart Ollama with the OLLAMA_ORIGINS environment variable set (e.g. OLLAMA_ORIGINS="*") so it accepts requests from a locally opened file.
<h2>Local AI Chatbot (Ollama)</h2>
<textarea id="prompt" rows="3" cols="60" placeholder="Ask anything..."></textarea>
<br>
<button onclick="askAI()">Send</button>
<h3>Response:</h3>
<div id="response" style="border:1px solid #ccc; padding:10px; min-height:100px;"></div>
<script>
async function askAI() {
  const prompt = document.getElementById('prompt').value;
  const responseDiv = document.getElementById('response');
  responseDiv.textContent = "Thinking...";
  try {
    const res = await fetch('http://localhost:11434/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model: 'phi3', prompt: prompt, stream: false })
    });
    const data = await res.json();
    // textContent avoids interpreting the model's output as HTML
    responseDiv.textContent = data.response;
  } catch (error) {
    responseDiv.textContent = "Error: Make sure Ollama is running (ollama serve)";
  }
}
</script>
Performance Optimization Tips for Each Hardware Tier
For Low-End Computers:
- Use the smallest models: ollama pull phi3 then ollama run phi3, or ollama pull llama3.2:1b then ollama run llama3.2:1b
- Reduce the context window: inside a session, type /set parameter num_ctx 1024
- Close all other applications before running
- Use quantized models (q4 variants), which need less RAM: ollama pull llama3.2:3b-instruct-q4_0 then ollama run llama3.2:3b-instruct-q4_0
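As a rough rule of thumb, a model's memory footprint is about (parameters × bits per weight ÷ 8) plus some overhead for the context and runtime. This back-of-the-envelope helper (an approximation, not an official Ollama formula) shows why 4-bit quantization matters so much on small machines:

```python
def approx_ram_gb(params_billion: float, bits_per_weight: int,
                  overhead: float = 1.2) -> float:
    """Rough RAM estimate in GB: weight bytes plus ~20% runtime/KV-cache overhead."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

print(approx_ram_gb(3.8, 4))    # 2.3  -> a phi3-class model at 4-bit
print(approx_ram_gb(3.8, 16))   # 9.1  -> the same model at fp16
print(approx_ram_gb(70, 4))     # 42.0 -> why 70B needs a big machine even quantized
```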
For Mid-Range Computers:
- Enable GPU if available: Ollama auto-detects it, but verify with ollama ps (the PROCESSOR column shows whether a model is running on GPU or CPU)
- Use 7B-9B parameter models for the best balance: ollama pull mistral then ollama run mistral
- Set OLLAMA_NUM_PARALLEL=1 to limit concurrent requests
For Advanced Computers:
- Use larger models (70B) for maximum accuracy: ollama pull llama3.1:70b then ollama run llama3.1:70b
- Multiple GPUs: Ollama automatically splits models across GPUs
- Set OLLAMA_HOST=0.0.0.0:11434 to expose the API to your network
Real-World Use Cases for Local LLMs
- Offline AI Assistant: Use during travel, flights, or areas with poor internet
- Private Document Analysis: Analyze sensitive business or medical documents without cloud upload
- Coding Assistant: Get programming help without sending code to external servers
- Education: Students can learn and practice with AI without subscription costs
- Content Creation: Generate blog posts, social media content, and marketing copy privately
- Research: Experiment with different models and prompts without API rate limits
Common Errors & How to Fix Them
Error: "ollama is not recognized"
# Windows: Add to PATH manually
# Close and reopen PowerShell, or add: C:\Users\%USERNAME%\AppData\Local\Programs\Ollama
# Linux/macOS: Reinstall or check if service is running
ollama serve
Error: "insufficient memory" or "out of memory"
# Use a smaller model
ollama pull phi3
ollama run phi3 # Instead of llama3.1
# Or use quantized version
ollama pull llama3.2:3b-instruct-q4_0
ollama run llama3.2:3b-instruct-q4_0
Error: "port 11434 already in use"
# Change the port (Linux/macOS)
OLLAMA_HOST=0.0.0.0:11435 ollama serve
# Windows PowerShell
$env:OLLAMA_HOST = "0.0.0.0:11435"; ollama serve
Slow performance on Windows
# Check if GPU is being used
ollama ps
# Force CPU mode if the GPU causes issues: start a session (ollama run phi3) and type:
/set parameter num_gpu 0
# Update NVIDIA drivers if using GPU
Limitations to Keep in Mind
- Local models (even 70B) may not match GPT-4 or Claude on complex reasoning tasks
- Hardware requirements: Larger models need significant RAM and storage
- No built-in internet search or real-time data access (unless you build it)
- Multimodal (image, video) support is limited compared to cloud offerings
- Setup requires basic command-line familiarity
The Future of Local AI (2026 and Beyond)
- Smaller, more capable models: Phi-3 and LLaMA 3.2 show that 3B models can be highly capable
- NPU integration: New laptops with Neural Processing Units will run models faster and more efficiently
- Better quantization: Improved compression techniques will reduce RAM requirements
- Multimodal local models: Vision and audio capabilities coming to local models
- Seamless integration: Local AI will become built into operating systems and applications
Conclusion
Ollama puts the power of large language models in your hands — literally. Whether you have a decade-old laptop or a cutting-edge workstation, there's a model that will run on your hardware.
By running AI locally, you gain complete privacy, zero recurring costs, and the freedom to use AI anywhere, anytime. Start with a small model like ollama pull phi3 then ollama run phi3, and work your way up as you become more comfortable.
Bonus: Quick Start Cheat Sheet
# INSTALLATION
# Windows (PowerShell Admin)
winget install Ollama.Ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS
brew install ollama
# BASIC COMMANDS
ollama pull phi3 # Download a model
ollama run phi3 # Start interactive chat
ollama list # See downloaded models
ollama rm phi3 # Delete model
# MODEL RECOMMENDATIONS (Always pull first, then run)
# Low-end PC (8GB RAM)
ollama pull phi3
ollama run phi3
ollama pull llama3.2:1b
ollama run llama3.2:1b
# Mid-range (16GB RAM)
ollama pull llama3.2
ollama run llama3.2
ollama pull mistral
ollama run mistral
# High-end (32GB+ RAM)
ollama pull llama3.1:70b
ollama run llama3.1:70b
# API (Python)
# POST http://localhost:11434/api/generate
# Payload: {"model": "phi3", "prompt": "Your prompt"}
Frequently Asked Questions (FAQs)
Q1: Can I run Ollama without an internet connection?
Yes — after you've downloaded a model, you can use it completely offline.
Q2: Will running LLMs damage my computer?
No. It uses your computer's resources like any other application. If your system is under heavy load, close other apps.
Q3: Do I need a GPU?
No. Many models run fine on CPU. GPU acceleration just makes them faster.
Q4: How much storage do I need?
Each model takes 2-80 GB. Start with one model (phi3 is ~2.5 GB) to see if you like it.
Q5: Is Ollama free?
Yes, completely free and open-source.
Q6: Can I use Ollama with Python or JavaScript?
Yes — Ollama provides a REST API that works with any programming language.
Q7: Which model is best for a 4GB RAM laptop?
Use ollama pull llama3.2:1b then ollama run llama3.2:1b or ollama pull tinyllama then ollama run tinyllama. They're designed for low-memory devices.
Ready to start? Open your terminal or PowerShell and type:
ollama pull phi3
ollama run phi3
Your local AI journey begins now — no cloud, no subscriptions, just pure AI power on your own machine.