Perplexity Calculation in Python for Apple M4 Silicon

Bottom line

Three viable approaches for your setup: (1) HuggingFace Transformers + PyTorch MPS: most flexible, best ecosystem, works today; (2) MLX via mlx-lm or sillm-mlx: fastest on Apple Silicon, roughly 2-3x faster than MPS for LLM inference; (3) the lmppl library: simplest API, about 3 lines of code, wraps HuggingFace under the hood. For a library you're building, option 1 gives the most control. Your 24GB of unified memory holds 8B models at FP16 (a tight fit) or 14B models at Q4/Q5 quantization comfortably.

Key findings

  • Math is simple: PPL = exp(-1/N * sum(log P(token_i | context))) - exponentiated average negative log-likelihood. Equivalent to exp(cross_entropy_loss).

  • Model size for 24GB: 8B params FP16 = ~16GB (fits tight). 8B Q8_0 = ~8GB (comfortable). 14B Q4_K_M = ~8.4GB weights + ~3GB overhead (fits well). Leaves headroom for KV cache and eval batching.

  • MLX outperforms PyTorch MPS on Apple Silicon - 2-3x faster generation, native unified memory. torch.compile on MPS still limited, often falls back to CPU.

  • Tokenization matters critically. Perplexity scores are not comparable across models with different tokenizers. Always report tokenizer used.

  • Sliding window needed for texts longer than the model's context window. HuggingFace docs show that with GPT-2 Large (1024 ctx), moving from non-overlapping windows to stride=512 drops PPL from 19.44 to 16.44.

Background

Perplexity measures how "surprised" a language model is by text. Lower = more predictable/fluent. Formula:

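PPL(X) = exp( -(1/N) * sum_{i=1..N} log p(x_i | x_{<i}) )

where X = (x_1, ..., x_N) is the tokenized text, N is the number of scored tokens, and log is the natural logarithm.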
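
As a minimal worked example of the formula, the sketch below computes perplexity from a list of per-token natural-log probabilities. The function name and the sample probabilities are illustrative and not taken from any library.

```python
import math

def perplexity_from_logprobs(token_logprobs):
    """Return exp of the average negative log-likelihood (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical probabilities a model assigned to four consecutive tokens.
probs = [0.5, 0.25, 0.5, 0.25]
ppl = perplexity_from_logprobs([math.log(p) for p in probs])
print(round(ppl, 4))
```

A useful sanity check: a model that assigns probability 1/V to every token has perplexity exactly V, so perplexity can be read as an effective branching factor.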

Use cases: (1) evaluate language model quality, (2) AI text detection (AI prose = low perplexity), (3) writing quality analysis, (4) text fluency scoring.

Related metric: burstiness - variance in sentence structure. Combined with perplexity for AI detection.
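
Burstiness has no single agreed definition. One simple proxy, assumed here purely for illustration, is the coefficient of variation of sentence length in words. The function names and the naive sentence splitter below are hypothetical; a real detector would more likely use per-sentence perplexity and a proper sentence tokenizer.

```python
import re
import statistics

def sentence_lengths(text):
    """Split after ., ! or ? followed by whitespace; count words per sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]

def burstiness(text):
    """Coefficient of variation of sentence length (population std / mean)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)

uniform = "The cat sat down. The dog ran off. The bird flew away."
varied = (
    "Stop. The storm rolled in over the hills long before "
    "anyone had noticed the sky. We ran."
)
print(burstiness(uniform), burstiness(varied))
```

Under this proxy, evenly sized sentences score 0 and text that mixes very short and very long sentences scores higher.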

Implementation approaches

Approach 1: Raw Transformers (recommended for library)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai-community/gpt2-large"  # or any causal LM
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# For Apple Silicon
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = model.to(device)

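def perplexity(text: str) -> float:
    """Token-level perplexity of text; text must fit in the model's context window."""
    encodings = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss
        outputs = model(**encodings, labels=encodings["input_ids"])
    return torch.exp(outputs.loss).item()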

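For long texts, use a sliding window (stride < max_length) so that each scored token keeps some left context:

# text: the long document to score (tokenized once, sliced per window)
encodings = tokenizer(text, return_tensors="pt")
max_length = model.config.n_positions  # 1024 for GPT-2; other models expose max_position_embeddings
stride = 512
seq_len = encodings.input_ids.size(1)
nll_sum, n_tokens = 0.0, 0
prev_end_loc = 0

for begin_loc in range(0, seq_len, stride):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # tokens not yet scored by an earlier window
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # -100 labels are ignored by the loss

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        # Mean loss over the valid labels in this window
        neg_log_likelihood = outputs.loss

    # The model shifts labels internally, so one fewer token per sequence is scored
    num_valid = (target_ids != -100).sum().item()
    num_loss_tokens = num_valid - target_ids.size(0)
    nll_sum += neg_log_likelihood * num_loss_tokens
    n_tokens += num_loss_tokens
    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

# Average NLL over all scored tokens, then exponentiate
ppl = torch.exp(nll_sum / n_tokens)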
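
To make the window arithmetic easier to check without loading a model, the index logic of the loop can be isolated in a small helper. This is a sketch only; plan_windows is a hypothetical name, not part of Transformers.

```python
def plan_windows(seq_len, max_length, stride):
    """Return (begin, end, n_new) for each sliding window.

    n_new is the number of trailing tokens in the window that no
    earlier window has covered, i.e. the ones whose labels stay unmasked.
    """
    if not 0 < stride <= max_length:
        raise ValueError("stride must be in (0, max_length]")
    windows, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return windows

windows = plan_windows(seq_len=2500, max_length=1024, stride=512)
for begin, end, n_new in windows:
    print(begin, end, n_new)
```

The n_new values sum to seq_len, so every token is covered by exactly one window's unmasked labels. The loss itself covers one fewer token per window because of the internal label shift, which the loop above handles through num_loss_tokens.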

Approach 2: lmppl (simplest)

import lmppl
scorer = lmppl.LM("gpt2")
ppl = scorer.get_perplexity(["Your text here"], batch_size=8)

Supports: causal LMs (LM), masked LMs via pseudo-perplexity (MaskedLM), and encoder-decoder models (EncoderDecoderLM).

Approach 3: MLX native (fastest on M4)

pip install sillm-mlx

sillm-mlx supports perplexity on MLX with Llama, Mistral, Gemma, Phi, Qwen models. Uses unified memory - no CPU-GPU transfer overhead.

Approach 4: HuggingFace evaluate

from evaluate import load
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=["text1", "text2"], model_id="gpt2")
# results["mean_perplexity"], results["perplexities"]

Model choices for 24GB M4

Model               Precision  Weight size  Total w/ overhead  Quality
GPT-2 (124M)        FP32       ~0.5GB       ~1GB               Baseline, fast
GPT-2 Large (774M)  FP16       ~1.5GB       ~2GB               Better, still fast
Llama 3 8B          Q8_0       ~8GB         ~11GB              Near-lossless (+0.0004 PPL)
Llama 3 8B          FP16       ~16GB        ~19GB              Full precision, tight fit
Qwen 2.5 14B        Q4_K_M     ~8.4GB       ~12GB              Good quality (+0.05 PPL)
Gemma 3 12B         Q5_K_M     ~7.5GB       ~11GB              Very good quality
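
The weight sizes follow from simple arithmetic: parameters times bits per weight, divided by 8. The sketch below is a rough estimator under stated assumptions: 1 GB is taken as 1e9 bytes, the default overhead is the ~3GB figure used in these notes, and the whole 24GB is treated as usable, which is optimistic because the OS and other apps share unified memory. Function names are hypothetical.

```python
def weight_gb(n_params_billion, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_weight / 8

def fits(n_params_billion, bits_per_weight, overhead_gb=3.0, budget_gb=24.0):
    """True if weights plus an assumed fixed overhead fit in the budget."""
    return weight_gb(n_params_billion, bits_per_weight) + overhead_gb <= budget_gb

print(weight_gb(8, 16))  # 8B parameters at FP16
print(weight_gb(8, 8))   # 8B parameters at roughly 8 bits per weight
print(fits(8, 16), fits(30, 16))
```

Real quantized files carry extra metadata and mixed-precision layers, so actual sizes run somewhat above this estimate.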

Quantization perplexity impact (measured delta vs FP16; exact figures vary by model and evaluation set):

Quant   Bits  PPL delta  Assessment
Q8_0    8.0   +0.0004    Essentially lossless
Q6_K    6.0   +0.0044    Virtually no difference
Q5_K_M  5.1   +0.0142    Hard to notice
Q4_K_M  4.5   +0.0535    Fine for most tasks
Q3_K_M  3.3   +0.2437    Significant loss

Limitations and critiques

  • Perplexity only defined for causal (autoregressive) LMs. Not well-defined for BERT-style masked LMs (pseudo-perplexity is an approximation).
  • Not comparable across models with different tokenizers or vocabularies.
  • Token-level vs word-level: HuggingFace averages over GPT-2 tokens, not words - produces different numbers than academic papers averaging over words.
  • PyTorch MPS limitations: torch.compile often falls back to CPU on MPS. FlashAttention not natively supported. SDPA crashes on sequences >12K tokens.
  • Perplexity alone insufficient for AI detection - modern LLMs can produce high-perplexity text. Combine with burstiness and other metrics.
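
Because token-level perplexity depends on how the tokenizer segments the text, one common workaround is to normalize the total negative log-likelihood by a tokenizer-independent unit such as words or UTF-8 bytes. The sketch below assumes the summed NLL in nats is already available; the function name and all numbers are hypothetical.

```python
import math

def normalized_scores(total_nll_nats, text):
    """Normalize a summed NLL (in nats) by tokenizer-independent units."""
    n_bytes = len(text.encode("utf-8"))
    n_words = len(text.split())
    if n_bytes == 0 or n_words == 0:
        raise ValueError("text must contain at least one word")
    return {
        "bits_per_byte": total_nll_nats / (math.log(2) * n_bytes),
        "word_perplexity": math.exp(total_nll_nats / n_words),
    }

text = "the quick brown fox"
# Hypothetical: two models assign the same total NLL but tokenize differently.
total_nll = 12.0
token_ppl_a = math.exp(total_nll / 4)  # model A splits the text into 4 tokens
token_ppl_b = math.exp(total_nll / 8)  # model B splits the text into 8 tokens
scores = normalized_scores(total_nll, text)
print(token_ppl_a, token_ppl_b, scores)
```

The two token-level perplexities differ even though both models found the text equally likely, while the per-word and per-byte scores are identical for both.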

Open questions

  • MLX ecosystem still maturing - will mlx-lm add native perplexity eval command?
  • How does Apple's Neural Engine factor into perplexity eval on M4?
  • Best stride/window size tradeoff for production use cases?
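
On the stride question, the cost side of the tradeoff can at least be quantified. This sketch counts forward passes and the guaranteed left context for a given stride; it says nothing about how much perplexity improves, which has to be measured. The function name is hypothetical.

```python
import math

def stride_tradeoff(seq_len, max_length, stride):
    """Forward passes needed, and minimum left context after the first window."""
    if not 0 < stride <= max_length:
        raise ValueError("stride must be in (0, max_length]")
    if seq_len <= max_length:
        n_windows = 1
    else:
        n_windows = math.ceil((seq_len - max_length) / stride) + 1
    return {"forward_passes": n_windows, "min_context": max_length - stride}

for stride in (1024, 512, 256):
    print(stride, stride_tradeoff(seq_len=10_000, max_length=1024, stride=stride))
```

Halving the stride roughly doubles the number of forward passes while raising the guaranteed context, so the right setting depends on the latency budget.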

Practical takeaways for your library

  1. Start with PyTorch + Transformers - widest model support, most flexible, MPS gives decent speedup over CPU.
  2. GPT-2 Large is good default model - small, fast, well-understood baseline. Upgrade to 8B models for production quality.
  3. Always implement sliding window for texts longer than context window.
  4. Dependencies: torch, transformers (HuggingFace). Optional: accelerate for device management.
  5. Consider MLX backend later as optional fast path for Apple Silicon users.
  6. 24GB M4 can handle up to 14B Q4 models - plenty of room for a text analysis tool.

Sources