PhishNet field observations research

What 93 unique attack-kit captures reveal about phishing infrastructure

A public-safe analysis of 93 production kit-vault captures, showing what deduplicated raw evidence can teach police officers, researchers and CERT teams without publishing the kits.

Research question. What can defenders learn from raw attack-kit captures without exposing the kits themselves?
Thesis. The useful finding is not that PhishNet collected files; it is that 93 unique raw captures can be transformed into deduplicated evidence, size bands, MIME classes, retention facts and safe clustering features that help investigators separate artifacts, kit families and campaign hypotheses.

Why this matters

A kit vault is most valuable when it turns volatile attacker infrastructure into reviewable evidence and repeatable research features.

Belgian context

Belgian phishing often reuses global tooling with local overlays: bank, parcel, identity, public-service and multilingual lures. A production kit vault lets investigators preserve the technical substrate behind those lures even when domains disappear or pages change.

What PhishNet observes

PhishNet treats the production kit vault as a controlled evidence store, not as public exploit material. The public layer can safely expose aggregate counts, byte ranges, MIME classes, retention facts, collection-window notes and derived feature bands. The authenticated platform keeps the raw captures, manifests, download audit trail, source routes and analyst review records. That split lets a police officer understand what evidence exists, lets a researcher understand what can be studied, and prevents the public page from becoming a distribution channel for attacker tooling.

What PhishNet found

The production kit vault currently preserves 93 unique content-addressed raw captures in the raw kit prefix, totaling 22,045,330 bytes. The observed Google Cloud Storage (GCS) object metadata places this capture batch on May 22, 2026, between 14:00 and 14:02 UTC. The largest sampled object is a 2,500,310-byte text/html capture retained until May 22, 2027. Those numbers matter because they describe an evidence cut: each object is addressable by content, retained as a vault artifact, and available for authenticated offline analysis.

What the numbers do not mean

The 93 objects are 93 unique raw captures, not automatically 93 unique kit families. A raw capture may be a landing page, error page, exposed directory, config fragment, backup artifact, self-contained HTML page, or captured flow. Family-level uniqueness requires clustering on DOM structure, scripts, form schema, redirect behavior, assets, safe weakness categories and source provenance. Public reporting should therefore keep the two claims apart: the vault proves that artifacts were preserved; the analyzer turns those artifacts into family hypotheses.

Why content-addressed storage changes the evidence problem

For police and CERT teams, content-addressed storage changes the question from 'did we see a page?' to 'which exact artifact did we preserve, when, with what hash, and can we compare it later?' A SHA-256-addressed vault object can be deduplicated, re-analyzed, referenced in a manifest, and cited in a chain-of-custody record without publishing the raw material. That is the difference between a screenshot-like observation and evidence that can support an investigation.
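
The deduplication property can be illustrated with a minimal sketch. It assumes an in-memory store and a hypothetical class name, ContentAddressedVault; it is not PhishNet's storage code, only an example of addressing an object by the SHA-256 of its bytes.

```python
import hashlib


class ContentAddressedVault:
    """Toy in-memory store keyed by the SHA-256 of the raw bytes (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, raw: bytes):
        """Store raw bytes; return (digest, is_new). Identical bytes are stored once."""
        digest = hashlib.sha256(raw).hexdigest()
        is_new = digest not in self._objects
        if is_new:
            self._objects[digest] = raw
        return digest, is_new

    def verify(self, digest: str) -> bool:
        """Re-hash the stored bytes and confirm they still match their address."""
        return hashlib.sha256(self._objects[digest]).hexdigest() == digest

    def __len__(self):
        return len(self._objects)


# Placeholder payloads only; no real capture content is used here.
vault = ContentAddressedVault()
key_a, new_a = vault.put(b"<html>placeholder page</html>")
key_b, new_b = vault.put(b"<html>placeholder page</html>")  # identical bytes
key_c, new_c = vault.put(b"<html>different page</html>")
```

Because the key is derived from the content, the second put returns the same address and stores nothing new, and verify lets a later reviewer confirm that the preserved bytes are unchanged.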

The size distribution is the first research signal

The production cut has 19 captures under 10 KB, 34 between 10 and 100 KB, 32 between 100 and 500 KB, and 8 over 500 KB. This distribution is analytically useful. Very small captures often deserve review as possible markers, config fragments, directory listings, error pages or lightweight kit pieces. Mid-sized captures are consistent with compact landing pages or simple templates. Larger captures can indicate self-contained pages, embedded assets, heavy JavaScript, obfuscation, captured landing flows or asset-heavy kits. These are research hypotheses until the analyzer confirms features.
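
The banding itself is simple to reproduce. The sketch below assumes 1 KB = 1,000 bytes and half-open bands (a 10,000-byte object falls in the 10-100 KB band); the text does not state either convention, so both are assumptions. The counts and byte total are the ones reported above.

```python
from collections import Counter

KB = 1000  # assumption: decimal kilobytes; the article does not say 1000 vs 1024

# (exclusive upper bound in bytes, label), checked in order
BANDS = [
    (10 * KB, "under 10 KB"),
    (100 * KB, "10-100 KB"),
    (500 * KB, "100-500 KB"),
]


def size_band(size_bytes: int) -> str:
    """Map an object size to one of the four reported bands."""
    for upper, label in BANDS:
        if size_bytes < upper:
            return label
    return "over 500 KB"


def band_counts(sizes):
    """Count how many sizes fall into each band."""
    return Counter(size_band(n) for n in sizes)


# Figures reported in the article, used as a consistency check.
reported = {"under 10 KB": 19, "10-100 KB": 34, "100-500 KB": 32, "over 500 KB": 8}
total_captures = sum(reported.values())
mean_size = 22_045_330 / total_captures  # average bytes per capture
```

The four bands add up to the 93 captures, and the reported byte total works out to an average of roughly 237 KB per capture, which sits inside the 100-500 KB band even though most captures are smaller than that.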

Why the tight capture window matters

The 14:00-14:02 UTC capture window is a collection-batch signal, not a campaign-timing claim. That distinction matters. If a public article treats collection time as attacker deployment time, it creates a false chronology. The correct interpretation is operational: a collector or sync process preserved a batch of artifacts in a narrow window. Campaign timing must be inferred from source observations, first-seen times, certificate data, hosting changes, redirect evidence and capture provenance, not from GCS object creation alone.
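
One way to keep the two clocks apart is to store them in separate fields and never let one stand in for the other. The sketch below is an illustration under assumptions: the CaptureTiming record, its field names and all timestamps are hypothetical, chosen only to fall inside the reported window.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class CaptureTiming:
    capture_id: str
    collected_at: datetime                        # storage object creation time
    source_first_seen: Optional[datetime] = None  # from source observations, if known


def collection_window(records):
    """Earliest and latest collection time: describes the batch, not the campaign."""
    times = [r.collected_at for r in records]
    return min(times), max(times)


def deployment_estimate(record):
    """Return source-derived timing only; never fall back to collected_at."""
    return record.source_first_seen


# Made-up values for illustration.
records = [
    CaptureTiming(
        "cap-1",
        datetime(2026, 5, 22, 14, 0, 5, tzinfo=timezone.utc),
        source_first_seen=datetime(2026, 5, 19, 8, 30, tzinfo=timezone.utc),
    ),
    CaptureTiming("cap-2", datetime(2026, 5, 22, 14, 1, 40, tzinfo=timezone.utc)),
]
start, end = collection_window(records)
batch_span = end - start
```

A capture without source timing yields None rather than a misleading date, which forces the analyst to look for first-seen or certificate evidence before writing a chronology.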

What police officers can do with this evidence

For police, the value is not public disclosure of a kit. The value is an evidentiary workbench: artifact hashes, original source route, retention, MIME type, size, manifest, safe extracted entities, source attestation, download audit and related clusters. Investigators can ask whether multiple victims or brands connect to the same preserved artifact class, whether an artifact contains safe weakness categories, whether a source route supports a warrant or preservation request, and whether a campaign hypothesis is strong enough for case work.

What researchers can study safely

Researchers can study aggregate features without receiving dangerous raw material: size bands, MIME mix, script-density bands, form-schema categories, redirect presence, external dependency bands, obfuscation bands, safe weakness categories and cluster counts. That is enough to ask empirical questions about kit modularity, reuse, campaign churn and evidence readiness while keeping raw exfiltration routes, tokens, form endpoints and victim-facing URLs out of public view.

What large HTML captures may indicate

The largest sampled capture is text/html and about 2.5 MB. That could point to a self-contained landing page, embedded base64 or inline assets, heavy JavaScript, copied framework bundles, obfuscation, anti-analysis logic or a captured flow with substantial page state. Size alone confirms none of these explanations. The right insight is that large HTML artifacts deserve a different triage lane from small markers: they should be fingerprinted for DOM structure, script density, resource embedding and form behavior before anyone calls them a family.
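
A first-pass fingerprint of that kind can be sketched with Python's standard html.parser module. The feature names (script_density, inline_data_uris, form_fields) and the sample markup are assumptions made for illustration; a production analyzer would need many more signals and hardened parsing.

```python
from html.parser import HTMLParser


class HtmlFingerprint(HTMLParser):
    """Collect coarse structural features without executing or rendering anything."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.script_chars = 0
        self.inline_data_uris = 0
        self.form_field_names = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        if tag == "script":
            self._in_script = True
        if tag == "input":
            self.form_field_names.append(dict(attrs).get("name"))
        # Attribute values starting with "data:" indicate embedded resources.
        self.inline_data_uris += sum(
            1 for _, value in attrs if value and value.startswith("data:")
        )

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_chars += len(data)


def fingerprint(html: str) -> dict:
    parser = HtmlFingerprint()
    parser.feed(html)
    parser.close()
    return {
        "script_density": parser.script_chars / max(len(html), 1),
        "inline_data_uris": parser.inline_data_uris,
        "form_fields": sorted(n for n in parser.form_field_names if n),
        "tag_counts": parser.tag_counts,
    }


# Harmless placeholder markup, not taken from any capture.
sample = (
    '<html><body><form><input name="user"><input name="pass" type="password"></form>'
    '<img src="data:image/png;base64,AAAA">'
    "<script>var a = 1;</script></body></html>"
)
fp = fingerprint(sample)
```

The output is a set of bands-in-waiting: script density as a ratio, a count of embedded data URIs and a sorted form schema, all of which can be compared across captures without exposing the page itself.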

How PhishNet keeps the public boundary safe

The public layer should show the method and the derived findings, not the weaponized artifact. Full URLs, endpoints, panel paths, webhook values, tokens, credentials, victim data and complete hashes stay out of the public page. The public page can show that there were 93 unique raw captures, the size distribution, the capture window, retention, safe feature bands and the distinction between evidence and family hypotheses. Authenticated users retain the operational evidence according to role and audit policy.
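
A simple way to enforce that boundary is an allowlist: only named, derived fields cross into the public layer, and anything not listed is dropped by default. The sketch below assumes hypothetical field names and a made-up internal record; it does not describe PhishNet's actual schema.

```python
# Assumption: these field names are illustrative, not PhishNet's real schema.
PUBLIC_FIELDS = frozenset({
    "mime_class",
    "size_band",
    "retained_until",
    "script_density_band",
    "obfuscation_band",
    "redirect_present",
})


def to_public_record(internal: dict) -> dict:
    """Keep only allowlisted derived fields; unknown fields never leak by default."""
    return {key: value for key, value in internal.items() if key in PUBLIC_FIELDS}


# Made-up internal record; the URL uses the reserved .invalid TLD.
internal_record = {
    "sha256": "0" * 64,
    "source_url": "https://phish.example.invalid/login",
    "webhook": "hypothetical-secret-value",
    "mime_class": "text/html",
    "size_band": "over 500 KB",
    "retained_until": "2027-05-22",
    "obfuscation_band": "high",
}
public_record = to_public_record(internal_record)
```

An allowlist fails closed: if a new sensitive field is added to the internal record later, it stays private until someone deliberately marks it as public-safe, whereas a blocklist would publish it by accident.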

From kit capture to campaign DNA

A single raw capture becomes campaign DNA only after linking. Useful links include the capture hash, source route, target brand category, DOM/form schema, script markers, redirect behavior, source family, collection run, timestamp, safe extracted entities and any related evidence artifacts. If two domains share those features, analysts have a reuse hypothesis. If the hypothesis survives source corroboration and review, it can support a campaign cluster. This is why the workflow matters more than the count.
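
The linking step can be sketched as a pairwise overlap score over safe feature sets. Jaccard similarity and the 0.5 threshold are assumptions chosen for illustration, as are the capture IDs and feature labels; the text does not say which measure the analyzer uses.

```python
def jaccard(a: set, b: set) -> float:
    """Share of features two captures have in common (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def reuse_hypotheses(captures: dict, threshold: float = 0.5):
    """Return (left, right, score) for pairs whose feature overlap meets the threshold.

    A returned pair is a hypothesis for analyst review, not a confirmed family.
    """
    ids = sorted(captures)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            score = jaccard(captures[left], captures[right])
            if score >= threshold:
                pairs.append((left, right, round(score, 2)))
    return pairs


# Hypothetical safe features per capture; no raw indicators are included.
captures = {
    "cap-1": {"form:user+pass", "script:marker-a", "redirect:yes", "brand:bank"},
    "cap-2": {"form:user+pass", "script:marker-a", "redirect:yes", "brand:parcel"},
    "cap-3": {"form:card", "redirect:no", "brand:parcel"},
}
hypotheses = reuse_hypotheses(captures)
```

Here cap-1 and cap-2 share three of five distinct features and surface as a reuse hypothesis despite targeting different brand categories, while cap-3 shares only a brand category with cap-2 and stays below the threshold.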

The key lesson

The most important lesson from this production cut is that evidence quality creates research quality. Without a vault, phishing pages vanish into anecdote. With content-addressed captures, retention, manifests and safe derived features, investigators can compare artifacts over time, researchers can study reuse without unsafe disclosure, and CERT teams can decide which evidence is ready for handoff.

Research value

  • A public-safe evidence cut from the production kit vault
  • Clear separation between raw captures, kit-family hypotheses and legal attribution
  • Research features that can be studied without publishing dangerous kit material
  • Police-oriented evidence workflow: hashes, manifests, retention, source routes and audit

Selected sources and research

PhishNet uses public research, official Belgian sources and open OSINT documentation as context. Public pages explain the method and redact examples; authenticated platform views retain operational indicators according to role and policy.

Common questions

Are these 93 unique kit families?

No. They are 93 unique raw captures. Family-level uniqueness requires clustering and analyst validation.

Why not publish the kits?

Publishing raw kits, endpoints, tokens or panel paths can enable abuse. Public research should publish safe derived features and keep raw artifacts authenticated.

What should police review first?

Start with large captures, captures with safe weakness categories, repeated DOM/form features, source-overlap links and artifacts tied to Belgian brand or victim journeys.