Frequency Masking on Model Checkpoints
The previous post applied frequency masks to compiled binaries. Given a corpus of known-good versions of the same binary, the method builds a mask of invariant byte patterns and measures how much of a target binary those patterns still cover. The experiment used nine versions of Alpine’s apk binary. The separation between the known-good corpus and a major version rewrite was clean, and required no source, no signature and no prior knowledge of the binary’s internals. The method itself was first developed in Me gustan las listas, where frequency masks were applied to plain-text corpora to strip boilerplate from hundreds of w3m dumps.
A trained model checkpoint is a binary. It has a version history. That history has statistical structure.
The question is the same.
The corpus
EleutherAI’s Pythia-14m has something most models do not: public checkpoints from multiple points during training. Not the final model and a fine-tuned variant. The full pretraining process captured at discrete steps.
mkdir -p checkpoints
BASE_URL="https://huggingface.co/EleutherAI/pythia-14m/resolve"
CORPUS_STEPS="1000 4000 16000 32000 64000 128000"
for step in $CORPUS_STEPS 143000; do
    wget -q --show-progress \
        -O "checkpoints/step${step}.safetensors" \
        "${BASE_URL}/step${step}/model.safetensors"
done
Seven files. 26.84 MB each. The corpus is the first six steps. step143000 is the test target: the same model at the end of pretraining.
Two substrates
A safetensors file has two regions with distinct properties.
The first is the JSON header: layer names, data types, offsets, architecture metadata. Printable bytes. The same tr -cd '\040-\176' pipeline from the previous post works directly.
The second is the actual weights: packed float32 values. Not text. Applying the same method requires representing them as a hex sequence.
mkdir -p printable hexbytes
# substrate 1: printable bytes from the header
tr -cd '\040-\176' < checkpoints/step1000.safetensors > printable/step1000.txt
# substrate 2: 2MB of weights from the center of the file, represented as hex
file_size=$(wc -c < checkpoints/step1000.safetensors)
skip=$(( (file_size / 2) - 1000000 ))
dd if=checkpoints/step1000.safetensors bs=1 skip="$skip" count=2000000 2>/dev/null \
    | xxd -p -c 16 \
    > hexbytes/step1000.hex
The -c 16 flag on xxd writes 16 bytes per line as 32 hex characters. A one-byte-per-line output would give 2-character lines. With grain=4, no n-gram would ever be generated. With 32-character lines, the sliding window works as expected.
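The difference is easy to see on a small input (a throwaway demo, not part of the pipeline):

```shell
# 16 bytes in -> one 32-character hex line out: 29 grain=4 windows per line
printf 'abcdefghijklmnop' | xxd -p -c 16
# 6162636465666768696a6b6c6d6e6f70

# one byte per line -> every line is 2 hex characters, too short for grain=4
printf 'abcdefghijklmnop' | xxd -p -c 1 | head -3
# 61
# 62
# 63
```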
The mask
The same AWK from the previous post. The key distinction holds: count by distinct file, not by total occurrences. A pattern that appears a thousand times inside a single checkpoint contributes 1 to the frequency count. A pattern that appears in five of six checkpoints contributes 5.
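build_mask.awk itself comes from the previous post and is not reproduced here. For reference, its counting rule fits in a few lines; this is a sketch under the name build_mask_sketch.awk so it is not confused with the original, and the variable names are mine:

```shell
cat > build_mask_sketch.awk <<'AWK'
# On entering each new input file: bump the file counter and reset the
# per-file dedup set, so each file contributes at most 1 per n-gram.
FNR == 1 { nfiles++; split("", seen_here) }
{
    # slide a window of length grain across the line, advancing by step_size
    for (i = 1; i + grain - 1 <= length($0); i += step_size) {
        ngram = substr($0, i, grain)
        if (!(ngram in seen_here)) {
            seen_here[ngram] = 1
            count[ngram]++          # distinct-file count, not occurrence count
        }
    }
}
# emit n-grams present in at least threshold of the corpus files
END { for (n in count) if (count[n] / nfiles >= threshold) print n }
AWK
```

It is invoked the same way as the blocks below, e.g. `awk -v grain=4 -v step_size=1 -v threshold=0.75 -f build_mask_sketch.awk hexbytes/*.hex`.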
# mask over printable bytes (grain=6, step=2, threshold=0.75)
awk -v grain=6 -v step_size=2 -v threshold=0.75 \
    -f build_mask.awk \
    printable/step{1000,4000,16000,32000,64000,128000}.txt \
    > mask_printable.txt
# mask over hex bytes (grain=4, step=1, threshold=0.75)
awk -v grain=4 -v step_size=1 -v threshold=0.75 \
    -f build_mask.awk \
    hexbytes/step{1000,4000,16000,32000,64000,128000}.hex \
    > mask_hex.txt
mask_printable: 893 n-grams
mask_hex: 51721 n-grams
The results
Coverage over corpus and test target:
=== printable coverage ===
step 1000: covered=4244 total=5882756 coverage=0.001
step 4000: covered=4244 total=5903096 coverage=0.001
step 16000: covered=4244 total=5800540 coverage=0.001
step 32000: covered=4244 total=5691162 coverage=0.001
step 64000: covered=4244 total=5650045 coverage=0.001
step128000: covered=4244 total=5417921 coverage=0.001
step143000: covered=4243 total=5392809 coverage=0.001
=== hex coverage ===
step 1000: covered=3621149 total=3625000 coverage=0.999
step 4000: covered=3620327 total=3625000 coverage=0.999
step 16000: covered=3618789 total=3625000 coverage=0.998
step 32000: covered=3618928 total=3625000 coverage=0.998
step 64000: covered=3617073 total=3625000 coverage=0.998
step128000: covered=3616316 total=3625000 coverage=0.998
step143000: covered=3615556 total=3625000 coverage=0.997
The printable substrate has no useful signal over the weights. The 4244 fixed hits are the JSON header: layer names, types, offsets. The rest of the printable bytes in a 26MB float32 weight file are incidental. A coverage of 0.001 is not a failure. It is the correct proportion of header content relative to the total printable byte count.
The hex substrate has signal. 0.997 to 0.999 is a narrow, stable band. step143000 lands at 0.997: same model, more trained, just below the corpus floor of 0.998 and far from any outlier.
The outliers
One known-good variant sitting at the edge of the corpus range is not enough to claim separation. Two models at increasing distance from the pretraining distribution supply the contrast.
Pythia-14m-deduped: same architecture, trained on the deduplicated Pile instead of the full Pile. Identical layer names. Different weight distribution.
wget -q --show-progress \
-O checkpoints/deduped.safetensors \
"https://huggingface.co/EleutherAI/pythia-14m-deduped/resolve/main/model.safetensors"
Coverage: 0.992.
pythia-14m-sentences: fine-tuned on a curated corpus of English sentences. Same architecture. Task substantially different from general pretraining.
wget -q --show-progress \
-O checkpoints/sentences.safetensors \
"https://huggingface.co/agentlans/pythia-14m-sentences/resolve/main/model.safetensors"
Coverage: 0.737.
Full table:
| model | coverage | description |
|---|---|---|
| step1000 - step128000 | 0.998 - 0.999 | corpus baseline |
| step143000 | 0.997 | same model, end of pretraining |
| pythia-14m-deduped | 0.992 | same architecture, different training corpus |
| pythia-14m-sentences | 0.737 | aggressive fine-tuning on sentence corpus |
The separation is clean. 0.997 to 0.992 is training corpus drift. 0.992 to 0.737 is fine-tuning. The mask knows neither. It knows what byte patterns appeared in the version history of the artifact.
Measurement cost
Once the mask exists, measuring a new checkpoint is a single linear pass over a fixed-size sample.
sentences finetuned: covered=2671052 total=3625000 coverage=0.737
real 0m1.740s
user 0m1.726s
sys 0m0.008s
1.74 seconds. 27MB file. The mask loads into AWK’s hash table at startup. What follows is a sliding window over 2MB of hex output with one hash lookup per n-gram. No model loading. No framework. No GPU.
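That pass is small enough to sketch in full. This is a reconstruction of the mechanism, not the exact script that produced the numbers above:

```shell
cat > coverage_sketch.awk <<'AWK'
# pass 1 (first file = the mask): load every entry into a hash table
NR == FNR { mask[$0] = 1; next }
# pass 2 (second file = the hex sample): one hash lookup per window
{
    for (i = 1; i + grain - 1 <= length($0); i += step_size) {
        total++
        if (substr($0, i, grain) in mask) covered++
    }
}
END { printf "covered=%d total=%d coverage=%.3f\n", covered, total, covered / total }
AWK
```

Run as `awk -v grain=4 -v step_size=1 -f coverage_sketch.awk mask_hex.txt hexbytes/step143000.hex`. The NR == FNR idiom routes the first file into the mask table and everything after it into the counting loop.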
The sample size is fixed at 2MB regardless of the checkpoint size. The measurement cost does not scale with model size. It scales with SAMPLE_BYTES, which is a parameter.
The dd seek does scale with file size on spinning disk. On SSD it is negligible.
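The sampling step from the substrate section, with the hardcoded 2000000 lifted into that parameter. A sketch; `sample_hex` is an illustrative name, not from the original scripts:

```shell
SAMPLE_BYTES=2000000

# emit SAMPLE_BYTES from the center of a file as 32-character hex lines
sample_hex() {
    size=$(wc -c < "$1")
    skip=$(( size / 2 - SAMPLE_BYTES / 2 ))
    dd if="$1" bs=1 skip="$skip" count="$SAMPLE_BYTES" 2>/dev/null \
        | xxd -p -c 16
}
```

`sample_hex checkpoints/step143000.safetensors > hexbytes/step143000.hex` reproduces the earlier dd pipeline; changing SAMPLE_BYTES is the whole ablation knob.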
The only measured data point is pythia-14m at 1.7 seconds on a desktop machine. The table below extends that to larger models based on two assumptions: that dd seek time on SSD is under one second for any file size, and that AWK processing time is dominated by the fixed 2MB sample rather than by the total file size. Both assumptions hold for the measured case. Whether they hold at 13GB or 130GB requires actual measurement on those files.
The torch.load() column assumes a machine with sufficient RAM to load the full model. For LLaMA-70B that is approximately 140GB. On hardware without that capacity, loading is not a timing question.
| model | size | this method | torch.load() |
|---|---|---|---|
| pythia-14m | 27MB | 1.7s (measured) | ~3s |
| pythia-1b | 2GB | ~2s (estimated) | ~25s (estimated) |
| LLaMA-7B | 13GB | ~2-4s (estimated) | ~90s (estimated) |
| LLaMA-70B | 130GB | ~3-8s on SSD (estimated) | ~900s on high-memory server (estimated) |
These are projections from a single data point, not a benchmark. Confirming them is one of the three future directions described below.
The method requires approximately 50MB regardless of model size: AWK’s hash table for the mask plus the 2MB sample buffer.
This is not an optimization. It is a structural property of the method. The model is never loaded. Only a fixed-size sample of its byte content is examined.
What this measures and what it does not
A signature certifies that a specific process signed a specific artifact. It does not certify that the artifact is semantically consistent with its own history. A fine-tuning step that modifies weights after pretraining and before signing produces a valid signature on a modified model.
Coverage analysis answers a different question: is this artifact what it has always been. The two signals are orthogonal. Signature verification is a gate. Coverage is a baseline. Gates are binary. Baselines are continuous.
The method detects byte population anomalies, not arbitrary weight modifications. A change that produces byte patterns already present in the corpus will not lower coverage. Targeted adversarial modifications aware of the mask could evade it. The corpus must be known-good: a change present in all corpus versions becomes part of the mask.
Coverage as a time series
A single measurement detects whether a specific checkpoint is anomalous. A time series detects whether the model is drifting across releases.
A model that loses 0.2% coverage per fine-tuning round over ten rounds triggers no single-artifact alert. The cumulative drop is visible as slope. The mask variance metric from the previous post applies here directly: how much does the mask itself change when rebuilt from a sliding window of the checkpoint history. A stable model produces a stable mask. A mask that gains or loses many entries between consecutive rebuilds signals that the population is in flux.
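The flux itself is two sorted-set differences. A minimal way to put a number on it (a sketch; `mask_drift` and the file names are mine):

```shell
# count mask entries lost and gained between two consecutive rebuilds
mask_drift() {  # usage: mask_drift previous_mask current_mask
    sort "$1" > prev.sorted
    sort "$2" > curr.sorted
    lost=$(comm -23 prev.sorted curr.sorted | wc -l | tr -d ' ')
    gained=$(comm -13 prev.sorted curr.sorted | wc -l | tr -d ' ')
    echo "gained=$gained lost=$lost"
}
```

Something like `mask_drift mask_hex.prev.txt mask_hex.txt` after each rebuild; a stable population prints small numbers on both sides.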
Future directions
Three experiments would convert this from a demonstration into a rigorous method.
Ablation over SAMPLE_BYTES. The current setup samples 2MB from the center of each file. It is not known whether the center is the most informative region or whether the signal holds at smaller sample sizes. Running coverage measurements at 512KB, 1MB, 2MB and 4MB samples across the same corpus would characterize the tradeoff between cost and signal quality.
Benchmark at scale. The projected times in the table above are estimates based on the structural properties of the method. Measuring actual times on pythia-1b and LLaMA-7B with the same scripts would either confirm the projections or expose where they break. The comparison against torch.load() would move from estimated to measured.
Grain and corpus size coupling. With 6 corpus versions, grain=4 step=1 produces 51721 mask entries and clear separation. It is not known how the mask degrades as corpus size shrinks or how signal quality improves as grain increases with a larger corpus. A systematic sweep over (corpus_size, grain) pairs on a model family with more public checkpoints would characterize that surface.
The first post applied the method to plain text: w3m dumps with boilerplate repeated across files. The second applied it to compiled binaries: versions of the same executable with an invariant string population across releases. This post applies it to ML artifacts: checkpoints of the same model at different points during training.
The substrate changes. The method does not.
What changes across the three posts is not the procedure. It is what the procedure is answering. In plain text: which lines are template. In binaries: does this executable still look like what it has always been. In model checkpoints: is this artifact a statistical descendant of the versions that preceded it.
The frequency mask does not know which domain it is operating in. It knows how to count.