Perplexity Calculation in Python for Apple M4 Silicon

2026-05-22

Bottom line

Three main approaches for your setup, plus a fourth convenience wrapper (HuggingFace evaluate, covered below): (1) HuggingFace Transformers + PyTorch MPS - most flexible, best ecosystem, works today. (2) lmppl library - simplest API, 3 lines of code, wraps HuggingFace under the hood. (3) MLX via mlx-lm or sillm-mlx - fastest on Apple Silicon, roughly 2-3x over MPS for LLM inference. For a library you're building, option 1 gives most control. Your 24GB unified memory holds 8B models at FP16 (tight fit) or 14B at Q4/Q5 quantization (comfortable).

Key findings

Math is simple: PPL = exp(-1/N * sum(log P(token_i | context))) - exponentiated average negative log-likelihood. Equivalent to exp(cross_entropy_loss).
Model size for 24GB: 8B params FP16 = ~16GB (fits tight). 8B Q8_0 = ~8GB (comfortable). 14B Q4_K_M = ~8.4GB weights + ~3GB overhead (fits well). Leaves headroom for KV cache and eval batching.
MLX outperforms PyTorch MPS on Apple Silicon - 2-3x faster generation, native unified memory. torch.compile on MPS still limited, often falls back to CPU.
Tokenization matters critically. Perplexity scores are not comparable across models with different tokenizers. Always report tokenizer used.
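Sliding window needed for texts longer than model's context window. HuggingFace docs report that for GPT-2 Large (1024 ctx) on WikiText-2, going from non-overlapping windows to stride=512 drops PPL from 19.44 to 16.44, because each scored token gets more context.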

Background

Perplexity measures how "surprised" a language model is by text. Lower = more predictable/fluent. Formula:

PPL(X) = exp{ -1/N * sum log p(x_i | x_{<i}) }
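
As a small worked example of the formula, the sketch below computes perplexity from a list of per-token probabilities. The probabilities are invented for illustration and are not model output; the function name is hypothetical.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp of the mean negative log-probability (natural log)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model assigned to four observed tokens.
probs = [0.5, 0.25, 0.5, 0.25]
ppl = perplexity_from_probs(probs)  # 2 ** 1.5, about 2.83

# A model that is uniformly unsure among 8 choices has perplexity 8.
uniform_ppl = perplexity_from_probs([1 / 8] * 10)
```

The second case shows the usual intuition: perplexity k roughly means the model is as uncertain as if it were choosing uniformly among k tokens at each step.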

Use cases: (1) evaluate language model quality, (2) AI text detection (AI prose = low perplexity), (3) writing quality analysis, (4) text fluency scoring.

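Related metric: burstiness - how much sentence length, structure, and predictability vary across a text. Human writing tends to vary more than AI prose, so it is combined with perplexity for AI detection.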
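
Burstiness has no single standard definition. The sketch below assumes one common operationalization: the coefficient of variation of per-sentence perplexity. Both that definition and the sample numbers are assumptions for illustration, not something prescribed by a particular library.

```python
import statistics

def burstiness(sentence_ppls):
    """Coefficient of variation of per-sentence perplexity (assumed definition)."""
    if len(sentence_ppls) < 2:
        return 0.0  # spread is undefined for fewer than two sentences
    mean = statistics.fmean(sentence_ppls)
    return statistics.stdev(sentence_ppls) / mean if mean else 0.0

# Hypothetical per-sentence perplexities for two texts.
even_text = [12.0, 12.5, 11.8, 12.2]    # uniformly predictable
varied_text = [8.0, 35.0, 12.0, 60.0]   # mixes plain and surprising sentences

even_score = burstiness(even_text)
varied_score = burstiness(varied_text)
```

Dividing by the mean keeps the score comparable between texts whose overall perplexity differs.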

Implementation approaches

Approach 1: Raw Transformers (recommended for library)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai-community/gpt2-large"  # or any causal LM
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# For Apple Silicon
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = model.to(device)

def perplexity(text: str) -> float:
    encodings = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**encodings, labels=encodings["input_ids"])
    return torch.exp(outputs.loss).item()

For long texts, use sliding window (stride < max_length):

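# Tokenize the full text once; each window is sliced from these ids.
# `text` is the document to score; model, tokenizer, device come from the block above.
encodings = tokenizer(text, return_tensors="pt")

# n_positions is GPT-2 specific; many other configs expose max_position_embeddings.
max_length = model.config.n_positions  # e.g. 1024 for GPT-2
stride = 512
seq_len = encodings.input_ids.size(1)
nll_sum, n_tokens = 0.0, 0
prev_end_loc = 0

for begin_loc in range(0, seq_len, stride):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # tokens not yet scored by an earlier window
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # overlap is context only, excluded from the loss

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        neg_log_likelihood = outputs.loss  # mean NLL over this window's scored tokens

    # Labels are shifted left by one inside the model, so position 0 is never
    # scored. Count only the labels that actually contribute to the loss.
    num_loss_tokens = (target_ids[:, 1:] != -100).sum().item()
    nll_sum += neg_log_likelihood * num_loss_tokens
    n_tokens += num_loss_tokens
    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(nll_sum / n_tokens).item()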
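
To make the window bookkeeping in the loop above easier to check, here is a dependency-free sketch that reproduces only the index arithmetic. The helper name window_spans is hypothetical; it yields the start, end, and number of newly scored tokens for each window.

```python
def window_spans(seq_len, max_length, stride):
    """Yield (begin, end, n_scored) for each sliding window."""
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        # Only tokens past the previous window's end are scored here;
        # everything before that is context.
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break

spans = list(window_spans(seq_len=2500, max_length=1024, stride=512))
# [(0, 1024, 1024), (512, 1536, 512), (1024, 2048, 512), (1536, 2500, 452)]
total_scored = sum(n for _, _, n in spans)  # 2500: every token counted once
```

Each token is assigned to exactly one window for scoring, and every window after the first gives its scored tokens at least max_length - stride tokens of context.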

Approach 2: lmppl (simplest)

import lmppl
scorer = lmppl.LM("gpt2")
ppl = scorer.get_perplexity(["Your text here"], batch_size=8)

Supports: causal LMs (LM), masked LMs pseudo-perplexity (MaskedLM), encoder-decoder (EncoderDecoderLM).

Approach 3: MLX native (fastest on M4)

pip install sillm-mlx

sillm-mlx supports perplexity on MLX with Llama, Mistral, Gemma, Phi, Qwen models. Uses unified memory - no CPU-GPU transfer overhead.

Approach 4: HuggingFace evaluate

from evaluate import load
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=["text1", "text2"], model_id="gpt2")
# results["mean_perplexity"], results["perplexities"]

Model choices for 24GB M4

Model	Precision	Weight Size	Total w/ overhead	Quality
GPT-2 (124M)	FP32	~0.5GB	~1GB	Baseline, fast
GPT-2 Large (774M)	FP16	~1.5GB	~2GB	Better, still fast
Llama 3 8B	Q8_0	~8GB	~11GB	Near-lossless (+0.0004 PPL)
Llama 3 8B	FP16	~16GB	~19GB	Full precision, tight fit
Qwen 2.5 14B	Q4_K_M	~8.4GB	~12GB	Good quality (+0.05 PPL)
Gemma 3 12B	Q5_K_M	~7.5GB	~11GB	Very good quality
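
The weight sizes above follow from parameter count times bits per weight. A rough sketch, assuming 1 GB = 1e9 bytes and the flat ~3GB runtime overhead used in this note; the function names are illustrative, and real overhead grows with context length and batch size.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def fits_in_memory(n_params, bits_per_weight, budget_gb=24.0, overhead_gb=3.0):
    """True if weights plus an assumed flat overhead fit in the memory budget."""
    return weight_size_gb(n_params, bits_per_weight) + overhead_gb <= budget_gb

fp16_8b = weight_size_gb(8e9, 16)  # 16.0 GB, about 19 GB with overhead
q8_8b = weight_size_gb(8e9, 8)     # 8.0 GB, about 11 GB with overhead
fp16_14b_fits = fits_in_memory(14e9, 16)  # 28 GB of weights alone: does not fit
```

K-quant formats carry per-block scales, so their effective bits per weight sit somewhat above the nominal figure; treat estimates for those as lower bounds.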

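Quantization perplexity impact (delta vs FP16). These values match llama.cpp's reference measurements on LLaMA 7B, so treat them as indicative only; deltas for other models, including those in the table above, will differ. Bit widths are approximate: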

Quant	Bits	PPL Delta	Assessment
Q8_0	8.0	+0.0004	Essentially lossless
Q6_K	6.0	+0.0044	Virtually no difference
Q5_K_M	5.1	+0.0142	Hard to notice
Q4_K_M	4.5	+0.0535	Fine for most tasks
Q3_K_M	3.3	+0.2437	Significant loss

Limitations and critiques

Perplexity only defined for causal (autoregressive) LMs. Not well-defined for BERT-style masked LMs (pseudo-perplexity is an approximation).
Not comparable across models with different tokenizers or vocabularies.
Token-level vs word-level: HuggingFace averages over GPT-2 tokens, not words - produces different numbers than academic papers averaging over words.
PyTorch MPS limitations: torch.compile often falls back to CPU on MPS. FlashAttention not natively supported. SDPA crashes on sequences >12K tokens.
Perplexity alone insufficient for AI detection - modern LLMs can produce high-perplexity text. Combine with burstiness and other metrics.
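
One way to work around the tokenizer and token-vs-word caveats above is to renormalize the same total negative log-likelihood by a tokenizer-independent unit such as words or UTF-8 bytes. A minimal sketch, assuming you already have a token-level perplexity and the number of tokens it was averaged over; the function names are hypothetical.

```python
import math

def renormalized_ppl(token_ppl, n_tokens, n_units):
    """Re-express token-level perplexity per word (or any other unit)."""
    total_nll = math.log(token_ppl) * n_tokens  # total NLL in nats
    return math.exp(total_nll / n_units)

def bits_per_byte(token_ppl, n_tokens, text):
    """Total NLL in bits divided by the UTF-8 byte length of the text."""
    total_bits = math.log(token_ppl) * n_tokens / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# Hypothetical numbers: token PPL 16 over 150 scored tokens covering 100 words.
word_ppl = renormalized_ppl(16.0, n_tokens=150, n_units=100)  # 16 ** 1.5 = 64
bpb = bits_per_byte(16.0, n_tokens=2, text="abcdefgh")        # 8 bits / 8 bytes
```

Word-level perplexity is always at least the token-level figure when a text has more tokens than words, which is why subword-based numbers look lower than word-based ones in older papers.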

Open questions

MLX ecosystem still maturing - will mlx-lm add native perplexity eval command?
How does Apple's Neural Engine factor into perplexity eval on M4?
Best stride/window size tradeoff for production use cases?

Practical takeaways for your library

Start with PyTorch + Transformers - widest model support, most flexible, MPS gives decent speedup over CPU.
GPT-2 Large is good default model - small, fast, well-understood baseline. Upgrade to 8B models for production quality.
Always implement sliding window for texts longer than context window.
Dependencies: torch, transformers (HuggingFace). Optional: accelerate for device management.
Consider MLX backend later as optional fast path for Apple Silicon users.
24GB M4 can handle up to 14B Q4 models - plenty of room for a text analysis tool.
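
If an MLX fast path is added later, keeping the perplexity math separate from the model backend makes the swap cheap. The sketch below is one possible shape for that split, not a prescribed design: LogProbBackend and UniformBackend are hypothetical names, and the toy backend uses whitespace splitting in place of a real tokenizer.

```python
import math
from typing import Protocol, Sequence

class LogProbBackend(Protocol):
    def token_logprobs(self, text: str) -> Sequence[float]:
        """Natural-log probability of each scored token in text."""
        ...

class UniformBackend:
    """Toy backend: every token is equally likely over a fixed vocabulary."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size

    def token_logprobs(self, text: str) -> Sequence[float]:
        # Whitespace split stands in for a real tokenizer (assumption).
        return [-math.log(self.vocab_size)] * len(text.split())

def perplexity(backend: LogProbBackend, text: str) -> float:
    """Backend-agnostic perplexity: exp of the mean negative log-probability."""
    logprobs = backend.token_logprobs(text)
    if not logprobs:
        raise ValueError("text produced no scored tokens")
    return math.exp(-sum(logprobs) / len(logprobs))

baseline = perplexity(UniformBackend(vocab_size=50257), "a b c d")
```

A uniform backend is also a handy sanity check: its perplexity equals the vocabulary size, so any real model should score well below that on fluent text. A Transformers or MLX backend would only need to implement token_logprobs.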