Perplexity Calculation in Python for Apple M4 Silicon
Bottom line
Three viable approaches for your setup: (1) HuggingFace Transformers + PyTorch MPS - most flexible, best ecosystem, works today; (2) MLX via mlx-lm or sillm-mlx - fastest on Apple Silicon, reportedly 2-3x faster than MPS for LLM inference; (3) the lmppl library - simplest API, 3 lines of code, wraps HuggingFace under the hood. For a library you're building, option 1 gives the most control. Your 24GB of unified memory holds 8B models at FP16 (a tight fit) or 14B models at Q4/Q5 quantization (comfortably).
Key findings
- Math is simple: `PPL = exp(-1/N * sum(log P(token_i | context)))` - the exponentiated average negative log-likelihood. Equivalent to `exp(cross_entropy_loss)`.
- Model size for 24GB: 8B params FP16 = ~16GB (fits tight). 8B Q8_0 = ~8GB (comfortable). 14B Q4_K_M = ~8.4GB weights + ~3GB overhead (fits well). Leaves headroom for KV cache and eval batching.
- MLX outperforms PyTorch MPS on Apple Silicon - 2-3x faster generation, native unified memory. `torch.compile` on MPS is still limited and often falls back to CPU.
- Tokenization matters critically. Perplexity scores are not comparable across models with different tokenizers. Always report the tokenizer used.
- Sliding window needed for texts longer than the model's context window. HuggingFace docs show that with GPT-2 Large (1024-token context), stride=512 drops PPL from 19.44 (non-overlapping windows) to 16.44.
Background
Perplexity measures how "surprised" a language model is by text. Lower = more predictable/fluent. Formula:
$$\mathrm{PPL}(X) = \exp\left\{ -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right\}$$
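As a quick sanity check on the formula, the sketch below computes perplexity directly from per-token log-probabilities in plain Python. The function name `perplexity_from_logprobs` is illustrative and not part of any library; it assumes natural-log probabilities, one per predicted token.

```python
import math

def perplexity_from_logprobs(logprobs):
    """Perplexity from natural-log token probabilities log p(x_i | x_<i)."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(logprobs) / len(logprobs)  # average negative log-likelihood
    return math.exp(avg_nll)

# A model that gives every token probability 1/4 is as uncertain as a uniform
# choice among 4 options, so its perplexity comes out at (approximately) 4.
uniform = [math.log(0.25)] * 10
print(perplexity_from_logprobs(uniform))
```

This is the same quantity the model-based approaches report; they only differ in how the log-probabilities are obtained.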
Use cases: (1) evaluate language model quality, (2) AI text detection (AI prose = low perplexity), (3) writing quality analysis, (4) text fluency scoring.
Related metric: burstiness - variance in sentence structure. Combined with perplexity for AI detection.
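Burstiness has no single agreed definition. One simple proxy, assumed here purely for illustration, is the coefficient of variation of sentence lengths (standard deviation divided by mean, in words). The helper name `burstiness` is hypothetical and the sentence splitter is a naive regex.

```python
import re
import statistics

def burstiness(text):
    """Coefficient of variation of sentence lengths, measured in words."""
    # Naive split after ., ! or ? followed by whitespace; real text needs a proper splitter.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0  # variation is not meaningful for fewer than two sentences
    return statistics.pstdev(lengths) / statistics.mean(lengths)

uniform_text = "The cat sat down. The dog ran off. The bird flew away."
varied_text = "Stop. The storm that had been building all afternoon finally broke over the hills. Rain."
print(burstiness(uniform_text), burstiness(varied_text))
```

Evenly sized sentences score 0 under this proxy, while a mix of very short and long sentences scores higher. Other definitions (for example, variance of per-sentence perplexity) would need a model in the loop.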
Implementation approaches
Approach 1: Raw Transformers (recommended for library)
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "openai-community/gpt2-large"  # or any causal LM
    model = AutoModelForCausalLM.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # For Apple Silicon: use the MPS backend when available, otherwise CPU
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    model = model.to(device)

    def perplexity(text: str) -> float:
        # Assumes the text fits inside the model's context window
        encodings = tokenizer(text, return_tensors="pt").to(device)
        with torch.no_grad():
            # Passing labels makes the model return the mean cross-entropy loss
            outputs = model(**encodings, labels=encodings["input_ids"])
        return torch.exp(outputs.loss).item()
For long texts, use sliding window (stride < max_length):
    # long_text is a placeholder for the document being scored
    encodings = tokenizer(long_text, return_tensors="pt")
    max_length = model.config.n_positions  # e.g. 1024 for GPT-2
    stride = 512
    seq_len = encodings.input_ids.size(1)

    nll_sum, n_tokens = 0.0, 0
    prev_end_loc = 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # tokens not yet scored by an earlier window
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # mask context-only tokens out of the loss

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            neg_log_likelihood = outputs.loss  # mean NLL over the unmasked targets

        # The model shifts labels left by one internally, so the first position
        # never contributes; count only the targets that actually enter the loss.
        num_loss_tokens = (target_ids[:, 1:] != -100).sum().item()
        nll_sum += neg_log_likelihood * num_loss_tokens
        n_tokens += num_loss_tokens

        prev_end_loc = end_loc
        if end_loc == seq_len:
            break

    ppl = torch.exp(nll_sum / n_tokens)
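To see what the loop does without loading a model, the sketch below reproduces only its index arithmetic. `window_plan` is a hypothetical helper; the third value in each tuple is `trg_len`, the number of target positions left unmasked in that window (the model's internal label shift then drops the very first token of the text, which has no context).

```python
def window_plan(seq_len, max_length, stride):
    """Return (begin, end, trg_len) for each window, mirroring the loop above."""
    plan, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        plan.append((begin, end, end - prev_end))  # only tokens past prev_end are scored
        prev_end = end
        if end == seq_len:
            break
    return plan

plan = window_plan(seq_len=2500, max_length=1024, stride=512)
for begin, end, trg_len in plan:
    print(f"tokens [{begin}, {end}) -> {trg_len} newly scored targets")
```

With these numbers every token is scored exactly once, but the overlapping plan needs four forward passes where non-overlapping windows (`stride=1024`) would need three. A smaller stride gives each scored token more context at the cost of more passes.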
Approach 2: lmppl (simplest)
    import lmppl

    scorer = lmppl.LM("gpt2")
    ppl = scorer.get_perplexity(["Your text here"], batch_size=8)
Supports: causal LMs (LM), masked LMs via pseudo-perplexity (MaskedLM), and encoder-decoder models (EncoderDecoderLM).
Approach 3: MLX native (fastest on M4)
    pip install sillm-mlx
sillm-mlx supports perplexity on MLX with Llama, Mistral, Gemma, Phi, Qwen models. Uses unified memory - no CPU-GPU transfer overhead.
Approach 4: HuggingFace evaluate
    from evaluate import load

    perplexity = load("perplexity", module_type="metric")
    results = perplexity.compute(predictions=["text1", "text2"], model_id="gpt2")
    # results["mean_perplexity"], results["perplexities"]
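One thing to keep in mind when aggregating: as far as the metric's source shows, `mean_perplexity` is an arithmetic mean of the per-text perplexities, which is generally not the same as the perplexity of all tokens pooled together. The sketch below illustrates the difference with made-up numbers; both function names are hypothetical.

```python
import math

def mean_perplexity(per_text):
    """Arithmetic mean of per-text perplexities (each text weighted equally)."""
    return sum(math.exp(nll / n) for nll, n in per_text) / len(per_text)

def corpus_perplexity(per_text):
    """Perplexity over all tokens pooled together (each token weighted equally)."""
    total_nll = sum(nll for nll, _ in per_text)
    total_tokens = sum(n for _, n in per_text)
    return math.exp(total_nll / total_tokens)

# (summed NLL in nats, token count) per text; values are invented for illustration.
# The per-text perplexities are 2 and 8.
per_text = [(10 * math.log(2), 10), (30 * math.log(8), 30)]
print(mean_perplexity(per_text))
print(corpus_perplexity(per_text))
```

The pooled figure weights the longer text more heavily, so the two aggregates diverge whenever text lengths differ. Whichever one a library reports, it is worth stating which.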
Model choices for 24GB M4
| Model | Precision | Weight Size | Total w/ overhead | Quality |
|---|---|---|---|---|
| GPT-2 (124M) | FP32 | ~0.5GB | ~1GB | Baseline, fast |
| GPT-2 Large (774M) | FP16 | ~1.5GB | ~2GB | Better, still fast |
| Llama 3 8B | Q8_0 | ~8GB | ~11GB | Near-lossless (+0.0004 PPL) |
| Llama 3 8B | FP16 | ~16GB | ~19GB | Full precision, tight fit |
| Qwen 2.5 14B | Q4_K_M | ~8.4GB | ~12GB | Good quality (+0.05 PPL) |
| Gemma 3 12B | Q5_K_M | ~7.5GB | ~11GB | Very good quality |
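The weight sizes above follow from simple arithmetic: parameters times bits per weight, divided by 8. The sketch below adds a rough KV-cache estimate as well. Both helper names are hypothetical, the 4.8 bits-per-weight figure is an assumed effective rate, and the layer/head shape is an assumed Llama-3-8B-like configuration; real usage also includes activations and framework overhead.

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate weight memory in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """Approximate KV-cache memory in GB: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch / 1e9

print(weight_gb(8e9, 16))    # 8B parameters at FP16
print(weight_gb(8e9, 8))     # 8B parameters at 8 bits per weight
print(weight_gb(14e9, 4.8))  # 14B parameters at an assumed 4.8 bits per weight
# Assumed shape: 32 layers, 8 KV heads, head dimension 128, FP16 cache, 8192 tokens.
print(kv_cache_gb(32, 8, 128, seq_len=8192))
```

For perplexity evaluation the KV cache is modest, since each window is a single forward pass over at most one context length.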
Quantization perplexity impact (measured delta vs FP16; exact figures vary by model and evaluation set):
| Quant | Bits | PPL Delta | Assessment |
|---|---|---|---|
| Q8_0 | 8.0 | +0.0004 | Essentially lossless |
| Q6_K | 6.0 | +0.0044 | Virtually no difference |
| Q5_K_M | 5.1 | +0.0142 | Hard to notice |
| Q4_K_M | 4.5 | +0.0535 | Fine for most tasks |
| Q3_K_M | 3.3 | +0.2437 | Significant loss |
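One way to use this table is to pick the smallest quantization that stays within a perplexity budget. The sketch below does that with the values copied from the table; `smallest_quant_within` is a hypothetical helper, and the deltas should be re-measured for the model actually being deployed.

```python
# (name, bits per weight, PPL delta vs FP16), copied from the table above.
QUANTS = [
    ("Q8_0", 8.0, 0.0004),
    ("Q6_K", 6.0, 0.0044),
    ("Q5_K_M", 5.1, 0.0142),
    ("Q4_K_M", 4.5, 0.0535),
    ("Q3_K_M", 3.3, 0.2437),
]

def smallest_quant_within(max_ppl_delta):
    """Return the lowest-bit quant whose PPL delta fits the budget, or None."""
    candidates = [q for q in QUANTS if q[2] <= max_ppl_delta]
    return min(candidates, key=lambda q: q[1])[0] if candidates else None

print(smallest_quant_within(0.02))
print(smallest_quant_within(0.1))
```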
Limitations and critiques
- Perplexity only defined for causal (autoregressive) LMs. Not well-defined for BERT-style masked LMs (pseudo-perplexity is an approximation).
- Not comparable across models with different tokenizers or vocabularies.
- Token-level vs word-level: HuggingFace averages over GPT-2 tokens, not words - produces different numbers than academic papers averaging over words.
- PyTorch MPS limitations: `torch.compile` often falls back to CPU on MPS. FlashAttention not natively supported. SDPA crashes on sequences >12K tokens.
- Perplexity alone insufficient for AI detection - modern LLMs can produce high-perplexity text. Combine with burstiness and other metrics.
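On the token-level vs word-level point: the two figures share the same total negative log-likelihood and differ only in what it is divided by, so one can be converted to the other when the token and word counts are known. A minimal sketch, with a hypothetical helper name and made-up counts:

```python
import math

def word_level_ppl(token_ppl, n_tokens, n_words):
    """Re-normalize a token-level perplexity to a per-word perplexity.

    The total NLL is n_tokens * ln(token_ppl); dividing it by the word count
    instead of the token count gives the word-level figure.
    """
    total_nll = n_tokens * math.log(token_ppl)
    return math.exp(total_nll / n_words)

# Illustrative numbers: 130 tokens covering 100 words at token-level PPL 20.
print(word_level_ppl(20.0, n_tokens=130, n_words=100))
```

Because a tokenizer that splits text into more pieces spreads the same loss over more units, the word-level number is always at least as large as the token-level one when there are more tokens than words.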
Open questions
- MLX ecosystem still maturing - will `mlx-lm` add a native perplexity eval command?
- How does Apple's Neural Engine factor into perplexity eval on M4?
- Best stride/window size tradeoff for production use cases?
Practical takeaways for your library
- Start with PyTorch + Transformers - widest model support, most flexible, MPS gives decent speedup over CPU.
- GPT-2 Large is good default model - small, fast, well-understood baseline. Upgrade to 8B models for production quality.
- Always implement sliding window for texts longer than context window.
- Dependencies: `torch`, `transformers` (HuggingFace). Optional: `accelerate` for device management.
- Consider MLX backend later as optional fast path for Apple Silicon users.
- 24GB M4 can handle up to 14B Q4 models - plenty of room for a text analysis tool.
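If the MLX fast path is added later, backend selection could stay a small, dependency-free check. The sketch below is one possible shape, not a prescribed design: `pick_backend` and its return strings are hypothetical, and it only detects what is installed rather than loading any model.

```python
import importlib.util

def pick_backend(prefer_mlx=True):
    """Return "mlx", "torch-mps", "torch-cpu", or "none" based on what is installed."""
    has_mlx = importlib.util.find_spec("mlx") is not None
    if prefer_mlx and has_mlx:
        return "mlx"
    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily so the library still loads without it
        return "torch-mps" if torch.backends.mps.is_available() else "torch-cpu"
    if has_mlx:
        return "mlx"  # torch is missing, so fall back to MLX even if not preferred
    return "none"

print(pick_backend())
```

Keeping both `torch` and `mlx` as optional imports means users on other platforms are not forced to install Apple-specific packages.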
Sources
- HuggingFace - Perplexity of fixed-length models
- HuggingFace evaluate - perplexity metric
- HuggingFace evaluate source
- lmppl - GitHub
- lmppl - PyPI
- mlx-lm - GitHub
- sillm-mlx - PyPI
- Skeptric - Measuring a Language Model
- LLM Quantization Explained - Enclave AI
- Tokenization effect on perplexity
- PyTorch MPS vs MLX comparative study
- Apple MLX M5 research
- PyTorch-Metrics Perplexity
- torch.compile on MPS state
- AI text detection - perplexity + burstiness