How to Run a Large Language Model (LLM) Locally Using Ollama – Complete Beginner to Advanced Guide
Imagine having a powerful AI assistant like ChatGPT running entirely on your own computer — no internet connection required, no monthly subscription fees, and complete privacy for your data.
Ollama makes this possible by providing a simple, one-command way to run large language models locally on Windows, macOS, and Linux. Whether you have a low-powered laptop or a high-end gaming rig, there's a model that will work for you.
In this guide, you'll learn what Ollama is, how to install it using exact PowerShell and terminal commands, which models to choose based on your hardware, how to run them, and even how to build your own AI-powered applications — all with zero cloud dependencies.
What is Ollama?
Ollama is a free, open-source tool that bundles everything needed to run large language models locally. It handles model downloading, memory management, GPU acceleration, and provides both a command-line interface and a REST API.
Ollama simplifies local AI by:
- Letting you pull and run models with a single command
- Auto-detecting and utilizing GPU hardware (NVIDIA, AMD, Apple Silicon)
- Managing model versions and storage automatically
- Providing an OpenAI-compatible API for developers
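Because the API is OpenAI-compatible, existing OpenAI-style client code can often be pointed at Ollama's local `/v1/chat/completions` endpoint unchanged. A minimal sketch with plain `requests` (the live call is commented out so the snippet runs without a server; the helper function is ours, not part of Ollama):

```python
def openai_style_payload(model: str, user_msg: str) -> dict:
    """Build a chat payload in the OpenAI format Ollama's /v1 endpoint accepts."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }

payload = openai_style_payload("phi3", "Hello!")
print(payload["messages"][0]["role"])  # user

# With Ollama running locally:
# import requests
# r = requests.post("http://localhost:11434/v1/chat/completions", json=payload)
# print(r.json()["choices"][0]["message"]["content"])
```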
System Requirements: Low-End to Advanced Hardware
Different models require different hardware. Here's what you need based on your computer's capabilities:
Low-End / Older Computers (8GB RAM, no dedicated GPU)
- Use models: Phi-3 (3.8B), LLaMA 3.2 (3B), TinyLlama (1.1B)
- RAM needed: 4-8 GB
- Storage: ~5-8 GB free
- Expected performance: 2-5 seconds per response
Mid-Range / Average Laptops (16GB RAM, integrated or entry GPU)
- Use models: LLaMA 3.1 (8B), Mistral (7B), Gemma 2 (9B)
- RAM needed: 12-16 GB
- Storage: ~10-15 GB free
- Expected performance: 1-3 seconds per response
High-End / Gaming or Workstation (32GB+ RAM, NVIDIA GPU with 8GB+ VRAM)
- Use models: LLaMA 3.1 (70B), DeepSeek-Coder (33B), Mixtral (8x7B)
- RAM needed: 32-64 GB (64 GB recommended for the 70B models)
- Storage: ~40-80 GB free
- Expected performance: Near real-time, high accuracy
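The tiers above can be captured in a small helper that suggests a model for a given machine. The thresholds and model names simply mirror the table; they're a rule of thumb, not official Ollama guidance:

```python
def suggest_model(ram_gb: int, vram_gb: int = 0) -> str:
    """Suggest an Ollama model tag based on the hardware tiers above."""
    if ram_gb >= 32 and vram_gb >= 8:
        return "llama3.1:70b"   # high-end: workstation with a big GPU
    if ram_gb >= 16:
        return "llama3.1"       # mid-range: 8B model
    return "phi3"               # low-end: 3.8B model

print(suggest_model(8))         # phi3
print(suggest_model(16))        # llama3.1
print(suggest_model(64, 24))    # llama3.1:70b
```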
Installing Ollama: Step-by-Step Commands
For Windows (PowerShell – Run as Administrator)
# Option 1: Download installer from official website
# Visit: https://ollama.com/download/windows
# Run the downloaded .exe file
# Option 2: Install using winget (Windows Package Manager)
winget install Ollama.Ollama
# After installation, verify it's working
ollama --version
# If command not found, close and reopen PowerShell
For Linux (Ubuntu/Debian – Terminal)
# One-line install script
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Start Ollama service (if not auto-started)
ollama serve
For macOS (Terminal)
# Using Homebrew
brew install ollama
# Or download from https://ollama.com/download/mac
# Verify installation
ollama --version
Running Your First Model: Pull and Execute
Step 1: Pull a model (download it to your computer)
# For low-end computers (runs on almost any machine)
ollama pull phi3
# For mid-range computers
ollama pull llama3.2
# For high-end computers
ollama pull llama3.1
Step 2: Run the model interactively
# Start chatting with the model
ollama run phi3
# Type your message and press Enter
# Example: "Explain what a large language model is in one sentence."
# To exit the chat, type /bye or press Ctrl+D
Step 3: One-off prompt (non-interactive)
# Generate a single response without entering chat mode
ollama run phi3 "Write a haiku about artificial intelligence"
Essential Ollama Commands (PowerShell / Terminal)
# List all downloaded models
ollama list
# Show details about a specific model
ollama show phi3
# Remove a model to free up space
ollama rm phi3
# Show currently running models
ollama ps
# Reduce the context window for low memory: there is no CLI flag for this,
# so start a session (ollama run phi3) and type:
/set parameter num_ctx 2048
# Force CPU-only inference: inside a session, send zero layers to the GPU
/set parameter num_gpu 0
# Pull a specific quantized version (uses less RAM)
ollama pull llama3.2:3b-instruct-q4_0
ollama run llama3.2:3b-instruct-q4_0
Popular Models: Exact Pull and Run Commands
For Low-End / Less Powerful Computers
# Microsoft Phi-3 Mini (3.8B) - Best for low-end PCs
ollama pull phi3
ollama run phi3
# LLaMA 3.2 3B - Balanced speed and quality
ollama pull llama3.2
ollama run llama3.2
# TinyLlama (1.1B) - Extremely lightweight
ollama pull tinyllama
ollama run tinyllama
# Gemma 2B - Google's compact model
ollama pull gemma:2b
ollama run gemma:2b
For Mid-Range / Average Computers
# LLaMA 3.1 8B - Strong performance
ollama pull llama3.1
ollama run llama3.1
# Mistral 7B - Excellent all-around model
ollama pull mistral
ollama run mistral
# Gemma 2 9B - Great for reasoning
ollama pull gemma2
ollama run gemma2
# CodeLlama 7B - For programming assistance
ollama pull codellama
ollama run codellama
For Advanced / High-End Computers
# LLaMA 3.1 70B - State-of-the-art
ollama pull llama3.1:70b
ollama run llama3.1:70b
# Mixtral 8x7B - Mixture of experts
ollama pull mixtral
ollama run mixtral
# DeepSeek-Coder 33B - Advanced coding model
ollama pull deepseek-coder:33b
ollama run deepseek-coder:33b
# WizardCoder 34B - Programming specialist
ollama pull wizardcoder
ollama run wizardcoder
Using Ollama as a ChatGPT Alternative
Once you run ollama run mistral (or any model), you enter an interactive chat session. Here are example prompts to try:
# Coding help
"Write a Python function to check if a string is a palindrome."
# Writing assistance
"Write a professional email requesting a project update."
# Explanation
"Explain how transformers work in AI, like I'm 10 years old."
# Creative writing
"Write a short story about a robot that discovers emotions."
# Translation
"Translate 'Good morning, how are you?' into Spanish, French, and Japanese."
# Analysis
"Summarize the key differences between machine learning and deep learning."
The model maintains conversation history within a session, so you can ask follow-up questions naturally. Type /bye (or press Ctrl+D) to exit.
Using Ollama's API in Python
Ollama exposes a REST API at http://localhost:11434. Here's how to use it from Python:
# First, install the requests library:
#   pip install requests
import requests

# Simple generation
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi3",
        "prompt": "What is the capital of Japan?",
        "stream": False,
    },
)
print(response.json()["response"])

# Chat with history
messages = [
    {"role": "user", "content": "Who wrote Romeo and Juliet?"},
]
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": messages,
        "stream": False,
    },
)
print(response.json()["message"]["content"])

# List available models
models = requests.get("http://localhost:11434/api/tags")
print(models.json())
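For long answers you will usually want streaming. With "stream": true, the API returns one JSON object per line, each carrying a fragment of the answer in its "response" field and a "done" flag on the final line. A small helper (our sketch, not part of Ollama) can stitch the fragments back together:

```python
import json

def join_stream(ndjson_lines) -> str:
    """Concatenate the "response" fragments from a streamed /api/generate reply."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With a live server you would iterate the HTTP response:
# r = requests.post("http://localhost:11434/api/generate",
#                   json={"model": "phi3", "prompt": "Hi", "stream": True},
#                   stream=True)
# print(join_stream(r.iter_lines()))

# Offline demo using the line format the API emits:
demo = ['{"response": "Tok", "done": false}',
        '{"response": "yo", "done": true}']
print(join_stream(demo))  # Tokyo
```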
Building a Simple Local AI Chatbot
Save the following as chatbot.html and open it in your browser. Make sure Ollama is running (ollama serve in the background). If the browser console shows a CORS error, restart Ollama with the OLLAMA_ORIGINS environment variable set (e.g. OLLAMA_ORIGINS="*") so it accepts requests from a locally opened file.
<h2>Local AI Chatbot (Ollama)</h2>
<textarea id="prompt" rows="3" cols="60" placeholder="Ask anything..."></textarea>
<br>
<button onclick="askAI()">Send</button>
<h3>Response:</h3>
<div id="response" style="border:1px solid #ccc; padding:10px; min-height:100px;"></div>
<script>
async function askAI() {
  const prompt = document.getElementById('prompt').value;
  const responseDiv = document.getElementById('response');
  responseDiv.textContent = "Thinking...";
  try {
    const res = await fetch('http://localhost:11434/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model: 'phi3', prompt: prompt, stream: false })
    });
    const data = await res.json();
    // textContent avoids interpreting the model's output as HTML
    responseDiv.textContent = data.response;
  } catch (error) {
    responseDiv.textContent = "Error: Make sure Ollama is running (ollama serve)";
  }
}
</script>
Performance Optimization Tips for Each Hardware Tier
For Low-End Computers:
- Use the smallest models: ollama pull phi3 then ollama run phi3, or ollama pull llama3.2:1b then ollama run llama3.2:1b
- Reduce the context window: inside a session, type /set parameter num_ctx 1024
- Close all other applications before running
- Use quantized models (q4 variants), which need less RAM: ollama pull llama3.2:3b-instruct-q4_0 then ollama run llama3.2:3b-instruct-q4_0
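As a rough rule of thumb, a model's memory footprint is about (parameters × bits per weight ÷ 8) plus some overhead for the context and runtime. This back-of-the-envelope helper (an approximation, not an official Ollama formula) shows why 4-bit quantization matters so much on small machines:

```python
def approx_ram_gb(params_billion: float, bits_per_weight: int,
                  overhead: float = 1.2) -> float:
    """Rough RAM estimate in GB: weight bytes plus ~20% runtime/KV-cache overhead."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

print(approx_ram_gb(3.8, 4))    # 2.3  -> a phi3-class model at 4-bit
print(approx_ram_gb(3.8, 16))   # 9.1  -> the same model at fp16
print(approx_ram_gb(70, 4))     # 42.0 -> why 70B needs a big machine even quantized
```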
For Mid-Range Computers:
- Enable GPU if available: Ollama auto-detects it, but verify with ollama ps (the PROCESSOR column shows whether a model is running on GPU or CPU)
- Use 7B-9B parameter models for the best balance: ollama pull mistral then ollama run mistral
- Set OLLAMA_NUM_PARALLEL=1 to limit concurrent requests
For Advanced Computers:
- Use larger models (70B) for maximum accuracy: ollama pull llama3.1:70b then ollama run llama3.1:70b
- Multiple GPUs: Ollama automatically splits models across GPUs
- Set OLLAMA_HOST=0.0.0.0:11434 to expose the API to your network
Real-World Use Cases for Local LLMs
- Offline AI Assistant: Use during travel, flights, or areas with poor internet
- Private Document Analysis: Analyze sensitive business or medical documents without cloud upload
- Coding Assistant: Get programming help without sending code to external servers
- Education: Students can learn and practice with AI without subscription costs
- Content Creation: Generate blog posts, social media content, and marketing copy privately
- Research: Experiment with different models and prompts without API rate limits
Common Errors & How to Fix Them
Error: "ollama is not recognized"
# Windows: Add to PATH manually
# Close and reopen PowerShell, or add: C:\Users\%USERNAME%\AppData\Local\Programs\Ollama
# Linux/macOS: Reinstall or check if service is running
ollama serve
Error: "insufficient memory" or "out of memory"
# Use a smaller model
ollama pull phi3
ollama run phi3 # Instead of llama3.1
# Or use quantized version
ollama pull llama3.2:3b-instruct-q4_0
ollama run llama3.2:3b-instruct-q4_0
Error: "port 11434 already in use"
# Change the port (Linux/macOS)
OLLAMA_HOST=0.0.0.0:11435 ollama serve
# Windows PowerShell
$env:OLLAMA_HOST = "0.0.0.0:11435"; ollama serve
Slow performance on Windows
# Check if GPU is being used
ollama ps
# Force CPU mode if the GPU causes issues: start a session (ollama run phi3) and type:
/set parameter num_gpu 0
# Update NVIDIA drivers if using GPU
Limitations to Keep in Mind
- Local models (even 70B) may not match GPT-4 or Claude on complex reasoning tasks
- Hardware requirements: Larger models need significant RAM and storage
- No built-in internet search or real-time data access (unless you build it)
- Multimodal (image, video) support is limited compared to cloud offerings
- Setup requires basic command-line familiarity
The Future of Local AI (2026 and Beyond)
- Smaller, more capable models: Phi-3 and LLaMA 3.2 show that 3B models can be highly capable
- NPU integration: New laptops with Neural Processing Units will run models faster and more efficiently
- Better quantization: Improved compression techniques will reduce RAM requirements
- Multimodal local models: Vision and audio capabilities coming to local models
- Seamless integration: Local AI will become built into operating systems and applications
Conclusion
Ollama puts the power of large language models in your hands — literally. Whether you have a decade-old laptop or a cutting-edge workstation, there's a model that will run on your hardware.
By running AI locally, you gain complete privacy, zero recurring costs, and the freedom to use AI anywhere, anytime. Start with a small model like ollama pull phi3 then ollama run phi3, and work your way up as you become more comfortable.
Bonus: Quick Start Cheat Sheet
# INSTALLATION
# Windows (PowerShell Admin)
winget install Ollama.Ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS
brew install ollama
# BASIC COMMANDS
ollama pull phi3 # Download a model
ollama run phi3 # Start interactive chat
ollama list # See downloaded models
ollama rm phi3 # Delete model
# MODEL RECOMMENDATIONS (Always pull first, then run)
# Low-end PC (8GB RAM)
ollama pull phi3
ollama run phi3
ollama pull llama3.2:1b
ollama run llama3.2:1b
# Mid-range (16GB RAM)
ollama pull llama3.2
ollama run llama3.2
ollama pull mistral
ollama run mistral
# High-end (32GB+ RAM)
ollama pull llama3.1:70b
ollama run llama3.1:70b
# API (Python)
# POST http://localhost:11434/api/generate
# Payload: {"model": "phi3", "prompt": "Your prompt"}
Frequently Asked Questions (FAQs)
Q1: Can I run Ollama without an internet connection?
Yes — after you've downloaded a model, you can use it completely offline.
Q2: Will running LLMs damage my computer?
No. It uses your computer's resources like any other application. If your system is under heavy load, close other apps.
Q3: Do I need a GPU?
No. Many models run fine on CPU. GPU acceleration just makes them faster.
Q4: How much storage do I need?
Each model takes 2-80 GB. Start with one model (phi3 is ~2.5 GB) to see if you like it.
Q5: Is Ollama free?
Yes, completely free and open-source.
Q6: Can I use Ollama with Python or JavaScript?
Yes — Ollama provides a REST API that works with any programming language.
Q7: Which model is best for a 4GB RAM laptop?
Use ollama pull llama3.2:1b then ollama run llama3.2:1b or ollama pull tinyllama then ollama run tinyllama. They're designed for low-memory devices.
Ready to start? Open your terminal or PowerShell and type:
ollama pull phi3
ollama run phi3
Your local AI journey begins now — no cloud, no subscriptions, just pure AI power on your own machine.