
Local LLM Setup

Run translations completely locally using LM Studio or Ollama.


Subtide supports local LLM inference, allowing you to:

  • Run translations without cloud API costs
  • Keep all data on your machine
  • Use any compatible model
  • Avoid rate limits

LM Studio provides a user-friendly interface for running local models.

  1. Download LM Studio from lmstudio.ai
  2. Install and launch
  3. Download a model from the built-in browser
| Model | Size | Quality | Speed |
| --- | --- | --- | --- |
| Llama 3.1 8B | 4.7 GB | Good | Fast |
| Mistral 7B | 4.1 GB | Good | Fast |
| Qwen 2.5 7B | 4.4 GB | Excellent | Fast |
| Llama 3.1 70B | 40 GB | Excellent | Slow |
  1. In LM Studio, click Local Server tab
  2. Select your downloaded model
  3. Click Start Server
  4. Server runs at http://localhost:1234/v1
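With the server running, it is worth sanity-checking the endpoint from a terminal before wiring up Subtide. A minimal check against the OpenAI-compatible API, assuming the default port shown above:

```bash
# List the models the local server currently exposes
curl http://localhost:1234/v1/models
```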
Then configure Subtide to point at the local server:

  • Operation Mode: Tier 1 or Tier 2
  • API Provider: Custom Endpoint
  • API URL: http://localhost:1234/v1
  • API Key: lm-studio
  • Model: (leave empty or enter the model name)
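For reference, this is roughly what a translation request to that endpoint looks like. It illustrates the OpenAI-compatible chat API that the Custom Endpoint setting targets, not necessarily the exact request Subtide sends; the model value should match the identifier LM Studio shows for your loaded model:

```bash
# Example translation-style request; replace "model" with your loaded model's identifier
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer lm-studio" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [
      {"role": "system", "content": "Translate the user message into Spanish."},
      {"role": "user", "content": "The movie starts in ten minutes."}
    ]
  }'
```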

Ollama is a lightweight tool for running LLMs locally.

=== "macOS"

    ```bash
    brew install ollama
    ```

=== "Linux"

    ```bash
    curl -fsSL https://ollama.com/install.sh | sh
    ```

=== "Windows"

    Download the installer from ollama.ai.
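After installing, confirm the CLI is available:

```bash
ollama --version
```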

Pull the models you want to use:

```bash
# Recommended models
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull qwen2.5:7b

# For Asian languages
ollama pull qwen2.5:14b
```

Ollama's server starts automatically after installation. If it is not running, start it manually:

```bash
ollama serve
```

Server runs at http://localhost:11434/v1.
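You can confirm from a terminal that your models are installed and the server is reachable; /api/tags is Ollama's native model-list endpoint, while the /v1 prefix above is the OpenAI-compatible one Subtide uses:

```bash
# List installed models
ollama list

# Confirm the HTTP server is reachable on the default port
curl http://localhost:11434/api/tags
```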

Configure Subtide to use Ollama:

  • Operation Mode: Tier 1 or Tier 2
  • API Provider: Custom Endpoint
  • API URL: http://localhost:11434/v1
  • API Key: ollama
  • Model: llama3.1:8b
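As a quick smoke test of the model itself, independent of Subtide, you can run a one-off prompt from the terminal:

```bash
# The model is loaded on first use, so the first run takes longer
ollama run llama3.1:8b "Translate to French: The movie starts in ten minutes."
```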

Which model works best depends on the target language:

| Language | Recommended Model | Notes |
| --- | --- | --- |
| English | Any model | All work well |
| European languages | Mistral, Llama | Good coverage |
| Chinese | Qwen 2.5 | Specifically trained |
| Japanese | Qwen 2.5 | Good Asian language support |
| Korean | Qwen 2.5 | Good Asian language support |
Match the model size to your available RAM:

| RAM | Recommended Model |
| --- | --- |
| 8 GB | Not recommended |
| 16 GB | 7B models (quantized) |
| 32 GB | 7B-8B models |
| 64 GB | 13B-14B models |
| 128 GB+ | 70B models |

Warning: Memory Requirements. Models require significant RAM. The values above are rough estimates; quantized models (Q4, Q5) use less memory.
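If you are unsure what an installed Ollama model actually uses, `ollama show` prints its details, including parameter count, context length, and quantization level:

```bash
# Inspect an installed model's parameters and quantization
ollama show llama3.1:8b
```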


You can tune local inference performance in a few ways:

  1. GPU layers: increase for faster inference
  2. Context length: reduce for less memory
  3. Batch size: adjust based on hardware

With Ollama, runtime options such as num_gpu (GPU layers) and num_ctx (context size) can be set from inside the interactive prompt:

```bash
# Start an interactive session, then set options at the >>> prompt
ollama run llama3.1:8b
>>> /set parameter num_gpu 35
>>> /set parameter num_ctx 4096
```
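To make tuned values persistent, one option is to bake them into a derived model with a Modelfile rather than setting them each session. A minimal sketch, assuming the num_ctx and num_gpu runtime options; the name subtide-llama is arbitrary:

```bash
# Create a Modelfile that derives a tuned variant of llama3.1:8b
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 4096
PARAMETER num_gpu 35
EOF

# Build the derived model and run it
ollama create subtide-llama -f Modelfile
ollama run subtide-llama
```

If you go this route, set the Model field in Subtide to the derived name (subtide-llama in this sketch).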

For fully local operation:

```bash
# Backend with local Whisper
WHISPER_MODEL=base WHISPER_BACKEND=mlx ./subtide-backend
```

Extension settings:

  • Operation Mode: Tier 2 (Enhanced)
  • API Provider: Custom Endpoint
  • API URL: http://localhost:1234/v1
  • API Key: lm-studio

This setup:

  • Transcribes audio locally with Whisper
  • Translates locally with LM Studio/Ollama
  • No data leaves your machine

Troubleshooting

Connection refused:

  1. Verify the local server is running
  2. Check the port (1234 for LM Studio, 11434 for Ollama); see the quick checks below
  3. Make sure no firewall is blocking the connection
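A quick way to check both default endpoints from a terminal:

```bash
# LM Studio (default port 1234): list models via the OpenAI-compatible endpoint
curl -s http://localhost:1234/v1/models

# Ollama (default port 11434): the root path responds with "Ollama is running"
curl -s http://localhost:11434/
```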
Slow inference:

  1. Use a smaller model
  2. Increase the number of GPU layers
  3. Use quantized versions

Out of memory:

  1. Use a smaller model
  2. Use more aggressive quantization
  3. Reduce the context length

Poor translation quality:

  1. Use a larger model
  2. Try a different model architecture
  3. Consider using cloud APIs for critical work

Quality ranking:

  1. GPT-4o (cloud) - Best quality
  2. Llama 3.1 70B (local) - Excellent
  3. Qwen 2.5 14B (local) - Very good for Asian languages
  4. Llama 3.1 8B (local) - Good
  5. Mistral 7B (local) - Good

Speed ranking:

  1. Mistral 7B - Fastest
  2. Llama 3.1 8B - Fast
  3. Qwen 2.5 7B - Fast
  4. Qwen 2.5 14B - Medium
  5. Llama 3.1 70B - Slow

| Setup | Cost |
| --- | --- |
| OpenAI GPT-4o | ~$0.01/video |
| OpenAI GPT-4o-mini | ~$0.001/video |
| Local LLM | Electricity only |

Tip: Hybrid Approach. Use local models for most translations and cloud APIs when you need the highest quality.