Why raw cyber evidence needs an observation dataset

Research question. How do raw evidence vaults become usable intelligence without exposing sensitive material?
Thesis. The bridge between raw evidence and operational value is a structured observation dataset: raw bytes stay protected in the vault, while collection metadata, targeting context, hashes, manifests and derived features make the evidence searchable, reviewable and exportable.

Evidence-readiness pipeline

Observed: Source sees signal

Captured: Evidence preserved

Enriched: Graph entities extracted

Reviewed: State assigned

Handed off: Export or case
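
One way to read the five pipeline stages is as an ordered state machine. The sketch below is illustrative only: the `Stage` enum and `advance` helper are hypothetical names, and the rule that stages cannot be skipped is an assumption, not something the pipeline description states.

```python
from enum import Enum


class Stage(Enum):
    OBSERVED = 1    # source sees signal
    CAPTURED = 2    # evidence preserved
    ENRICHED = 3    # graph entities extracted
    REVIEWED = 4    # state assigned
    HANDED_OFF = 5  # export or case


def advance(current: Stage) -> Stage:
    """Move an artifact to the next stage (assumed: strictly linear, no skipping)."""
    if current is Stage.HANDED_OFF:
        raise ValueError("artifact has already been handed off")
    return Stage(current.value + 1)


# Walk one artifact through the whole pipeline and record the path.
stage = Stage.OBSERVED
path = [stage.name]
while stage is not Stage.HANDED_OFF:
    stage = advance(stage)
    path.append(stage.name)
```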

Why this matters

Evidence value comes from context, not only possession.

Belgian context

Belgian and Benelux investigations often connect phishing, smishing, fake investment, credential exposure and ransomware evidence across languages, brands and jurisdictions. Structured observations let analysts compare those artifacts without pushing raw victim data into every dashboard.

What PhishNet observes

A leak archive, kit capture, APK sample, EML file or ransomware package is only one part of the evidence story. PhishNet observes the artifact and the operational envelope around it: when it was collected, where it came from, how it was collected, who or what it targeted, which source observed it, which manifest signed it, and which policy governs access. Without that surrounding record, raw material is hard to search, hard to compare and hard to explain.

Why a second layer is necessary

Raw bytes should not be copied into every workbench, export or chart. They may contain malware, credentials, victim data, tokens, private communications or operationally sensitive paths. The observation layer carries metadata and safe derived features instead: artifact kind, bucket family, source route, collection method, target domain or sector, size band, MIME type, hashes, manifest URI, legal basis, source attestation and redaction state. This keeps the dashboard fast and reduces the chance that a routine query becomes an unnecessary raw-evidence access.
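
A minimal sketch of what such an observation row could look like, assuming a flat record per artifact. The field names mirror the list above, but the `Observation` class, the `observe` helper, the size-band thresholds and the `vault://` URI are hypothetical illustrations, not PhishNet's actual schema.

```python
import hashlib
from dataclasses import asdict, dataclass


def size_band(n_bytes: int) -> str:
    """Bucket a byte count; the thresholds are illustrative assumptions."""
    if n_bytes < 100_000:
        return "small"
    if n_bytes < 10_000_000:
        return "medium"
    return "large"


@dataclass(frozen=True)
class Observation:
    artifact_kind: str
    bucket_family: str
    source_route: str
    collection_method: str
    target: str
    mime_type: str
    manifest_uri: str
    legal_basis: str
    source_attestation: str
    redaction_state: str
    size_band: str
    sha256: str


def observe(raw: bytes, **context: str) -> Observation:
    """Derive safe features from raw bytes; the bytes themselves are not kept."""
    return Observation(
        size_band=size_band(len(raw)),
        sha256=hashlib.sha256(raw).hexdigest(),
        **context,
    )


obs = observe(
    b"placeholder bytes standing in for a kit capture",
    artifact_kind="kit_capture",
    bucket_family="internal_captures",
    source_route="crawler",
    collection_method="internal_capture",
    target="example-bank.test",
    mime_type="application/zip",
    manifest_uri="vault://manifests/example",  # hypothetical URI scheme
    legal_basis="mandate",
    source_attestation="self",
    redaction_state="redacted",
)
row = asdict(obs)  # only metadata and derived features reach the dashboard
```

The record holds a hash and a size band rather than the payload, so a dashboard query over these rows never needs to touch the vault.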

What investigators gain

Investigators can search for all artifacts targeting one brand, all ransomware packages from a trusted source, all credential dumps tied to a domain, all APK references linked to a mobile banker family, or all evidence captured during a collection run. They can open the exact artifact only when their role and the policy allow it, and the download event remains audited.
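
The search-then-gated-open pattern can be sketched as follows. The policy names, role names, `search` and `request_raw` functions and the in-memory `AUDIT_LOG` are all assumptions made for illustration; a real deployment would use its own policy engine and a durable audit store.

```python
from datetime import datetime, timezone

# Hypothetical policy table: which roles may open raw bytes under each policy.
POLICY_ROLES = {
    "police_only": {"police_investigator"},
    "operator": {"police_investigator", "platform_operator"},
}
AUDIT_LOG = []


def search(rows, **filters):
    """Return observation rows whose fields equal every given filter."""
    return [r for r in rows if all(r.get(k) == v for k, v in filters.items())]


def request_raw(row, user, role):
    """Audit every attempt, then return the vault reference only if allowed."""
    allowed = role in POLICY_ROLES.get(row["policy"], set())
    AUDIT_LOG.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "artifact": row["sha256"],
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"role {role!r} may not open this artifact")
    return row["vault_ref"]


rows = [
    {"sha256": "aa11", "artifact_kind": "ransomware_package",
     "target_brand": "ExampleBank", "policy": "police_only",
     "vault_ref": "vault://raw/aa11"},
    {"sha256": "bb22", "artifact_kind": "credential_dump",
     "target_brand": "ExampleBank", "policy": "operator",
     "vault_ref": "vault://raw/bb22"},
    {"sha256": "cc33", "artifact_kind": "apk_sample",
     "target_brand": "ExampleTelecom", "policy": "operator",
     "vault_ref": "vault://raw/cc33"},
]

hits = search(rows, target_brand="ExampleBank")
ref = request_raw(hits[1], "analyst-1", "platform_operator")
```

Denied attempts are logged before the exception is raised, so the audit trail covers refusals as well as successful downloads.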

What researchers gain

Researchers get reproducible, safer features. They can study size distributions, collection windows, MIME mix, form schema fingerprints, script-density bands, source families, target sectors and country relevance without receiving raw kit files or leak archives. This is how public research can say something useful while keeping dangerous material out of public pages.
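
As a rough illustration, a researcher-facing summary can be computed from observation rows alone. The row fields and the `feature_summary` helper below are hypothetical; the sample values are made up for the example.

```python
from collections import Counter
from datetime import date

# Illustrative observation rows; no raw kit files or archives are involved.
rows = [
    {"mime_type": "text/html", "size_band": "small", "collected_on": date(2025, 3, 1)},
    {"mime_type": "text/html", "size_band": "medium", "collected_on": date(2025, 3, 4)},
    {"mime_type": "application/zip", "size_band": "large", "collected_on": date(2025, 3, 2)},
]


def feature_summary(rows):
    """Aggregate safe derived features across a set of observations."""
    days = [r["collected_on"] for r in rows]
    return {
        "mime_mix": dict(Counter(r["mime_type"] for r in rows)),
        "size_bands": dict(Counter(r["size_band"] for r in rows)),
        "collection_window": (min(days), max(days)),
    }


summary = feature_summary(rows)
```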

What insurers gain

For underwriting and claims teams, observations turn evidence into traceable risk context. Outside-in posture, credential exposure, phishing pressure, ransomware mentions, breach disclosures and vendor concentration can be linked back to the evidence trail. The result is not actuarial advice by itself; it is a defensible source-backed view of exposure and event context.

The difference between evidence and a score

A risk score without evidence is a black box. An observation-backed score can show which facts contributed: expired certificates, weak mail authentication, exposed services, credential leak deltas, typosquat pressure, ransomware sector activity, incident news or source gaps. Operators can click through to the rows behind the number instead of trusting a static grade.
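
A minimal sketch of an observation-backed score, assuming a simple additive model capped at 100. The point values, the cap, and the `Contribution` type are illustrative assumptions; the source does not describe how PhishNet actually weights facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    fact: str
    points: int               # hypothetical weight for this fact
    observation_ids: tuple    # rows an operator can click through to


def total_score(contributions):
    """Additive score, capped at 100 (an assumed scale)."""
    return min(100, sum(c.points for c in contributions))


def breakdown(contributions):
    """List contributing facts, largest first, with their backing rows."""
    ranked = sorted(contributions, key=lambda c: c.points, reverse=True)
    return [(c.fact, c.points, c.observation_ids) for c in ranked]


contributions = [
    Contribution("expired certificate", 15, ("obs-101",)),
    Contribution("weak mail authentication", 20, ("obs-102", "obs-103")),
    Contribution("credential leak delta", 30, ("obs-200",)),
]
score = total_score(contributions)
facts = breakdown(contributions)
```

Because each contribution carries its observation identifiers, the number can always be traced back to the rows that produced it.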

Daily collection must be tiered

Fast lanes should collect deltas such as certificate transparency, phishing pressure, credential exposure changes and ransomware posts. Daily lanes should build posture snapshots for DNS, SPF, DMARC, DKIM, TLS, HTTP headers, passive DNS, CVE/banner context, breach disclosures and news classification. Weekly reference lanes should update BGP/ASN reputation, technology baselines, vendor concentration and sector benchmarks.
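
The three tiers could be expressed as a small schedule table. The collector names follow the paragraph above; the 15-minute interval for the fast lane and the `lane_of` and `is_due` helpers are assumptions for illustration, since the source does not give a fast-lane cadence.

```python
from datetime import datetime, timedelta

# Lane name -> (interval, collectors). The fast-lane interval is an assumption.
LANES = {
    "fast": (timedelta(minutes=15), [
        "certificate_transparency", "phishing_pressure",
        "credential_exposure_changes", "ransomware_posts",
    ]),
    "daily": (timedelta(days=1), [
        "dns", "spf", "dmarc", "dkim", "tls", "http_headers", "passive_dns",
        "cve_banner_context", "breach_disclosures", "news_classification",
    ]),
    "weekly": (timedelta(weeks=1), [
        "bgp_asn_reputation", "technology_baselines",
        "vendor_concentration", "sector_benchmarks",
    ]),
}


def lane_of(collector):
    """Find which lane a collector belongs to."""
    for lane, (_, collectors) in LANES.items():
        if collector in collectors:
            return lane
    raise KeyError(collector)


def is_due(collector, last_run, now):
    """A collector is due once its lane's interval has elapsed."""
    interval, _ = LANES[lane_of(collector)]
    return now - last_run >= interval


now = datetime(2025, 1, 8, 12, 0)
```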

How raw evidence stays controlled

Raw ransomware packages, leak archives, credential dumps, APK samples and kit captures should enter through trusted police or government buckets, authorized submissions, internal captures or mandate-backed intake. Normal dashboards should show filenames, sizes, hashes, policy, target, source and observation metadata. Raw download remains role-gated and audited.

Why target context matters

Targeting is what turns an artifact into an investigative clue. A page capture may target a bank, a government service, a telecom provider or a generic brand category. A leak archive may target a sector or organization. A credential dump may affect a corporate domain. Capturing that targeting at observation time makes later clustering and reporting far more useful.

How collection method changes interpretation

The same byte pattern means different things depending on collection method. A public RSS reference is context, an authorized submission may be direct evidence, a trusted bucket sync can preserve a raw package, and an internal capture may include sandbox or screenshot context. Treating those methods as identical weakens analysis. Recording method, worker, source route and legal basis lets analysts decide whether a row is suitable for public research, operator triage, claims context or police case work.
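
One way to make that decision explicit is a lookup from collection method to permitted uses. The mapping below is entirely hypothetical: the source names the methods and the uses but does not say which method is suitable for which use, so treat the table as a placeholder for a real policy.

```python
from enum import Enum


class Method(Enum):
    PUBLIC_RSS_REFERENCE = "public_rss_reference"
    AUTHORIZED_SUBMISSION = "authorized_submission"
    TRUSTED_BUCKET_SYNC = "trusted_bucket_sync"
    INTERNAL_CAPTURE = "internal_capture"


# Hypothetical suitability table; a real one would come from policy and legal review.
ALLOWED_USES = {
    Method.PUBLIC_RSS_REFERENCE: {"public_research", "operator_triage"},
    Method.AUTHORIZED_SUBMISSION: {"operator_triage", "claims_context", "police_case_work"},
    Method.TRUSTED_BUCKET_SYNC: {"operator_triage", "police_case_work"},
    Method.INTERNAL_CAPTURE: {"public_research", "operator_triage", "claims_context"},
}


def suitable_for(row, use):
    """A row without a recorded legal basis is treated as unsuitable for any use."""
    if not row.get("legal_basis"):
        return False
    return use in ALLOWED_USES[Method(row["collection_method"])]


rss_row = {"collection_method": "public_rss_reference", "legal_basis": "public_source"}
bucket_row = {"collection_method": "trusted_bucket_sync", "legal_basis": "mandate"}
unknown_basis = {"collection_method": "internal_capture", "legal_basis": ""}
```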

Why portfolio analysis needs evidence trails

Portfolio and catastrophe-modeling datasets can drift into abstract percentages: how many insureds use a provider, how many show exposed services, how many have credential exposure. Observation links keep those aggregates grounded. A concentration chart can point back to source rows, posture snapshots and evidence artifacts so an underwriter can separate verified exposure, inferred exposure and missing measurement. That distinction is especially important when a portfolio depends on one mail provider, CDN, identity platform or security appliance.
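
A sketch of a concentration aggregate that keeps the three cases apart and retains the backing observation identifiers. The row shape, the `basis` field and the `concentration` function are hypothetical; the provider names are placeholders.

```python
from collections import defaultdict


def concentration(insureds, provider):
    """Group observation ids by evidence basis for one provider."""
    groups = defaultdict(list)
    for row in insureds:
        if row["provider"] is None:
            # No measurement at all: neither a hit nor a miss.
            groups["missing"].append(row["observation_id"])
        elif row["provider"] == provider:
            groups[row["basis"]].append(row["observation_id"])
    return dict(groups)


insureds = [
    {"insured": "A", "provider": "MailCo", "basis": "verified", "observation_id": "obs-1"},
    {"insured": "B", "provider": "MailCo", "basis": "inferred", "observation_id": "obs-2"},
    {"insured": "C", "provider": None, "basis": None, "observation_id": "obs-3"},
    {"insured": "D", "provider": "OtherMail", "basis": "verified", "observation_id": "obs-4"},
]
result = concentration(insureds, "MailCo")
```

Reporting "missing" as its own group avoids silently counting unmeasured insureds as unexposed.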

What should not appear in normal datasets

Normal datasets should not carry raw credentials, victim communications, full leak archives, malware bytes, full panel paths, webhook URLs, tokens or private identifiers. They should carry policy state, partial or hashed identifiers where appropriate, safe summaries, manifests, source attestation and enough derived structure to support analysis. The raw material remains available through the vault for authorized users, but it is not copied into routine API responses.
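
A minimal sketch of a sanitizing step applied before a row leaves the vault boundary. The deny-list field names, the salted truncated SHA-256 and the `sanitize` helper are assumptions for illustration; whether a hashed identifier is appropriate in a given context remains a policy decision.

```python
import hashlib

# Hypothetical field names that must never reach a normal dataset.
DENIED_FIELDS = {
    "password", "token", "webhook_url", "panel_path", "raw_bytes", "message_body",
}


def hashed(value, salt):
    """Salted, truncated SHA-256 so rows can be joined without exposing the value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def sanitize(row, salt):
    """Drop denied fields and replace the email with a hash plus its domain."""
    safe = {k: v for k, v in row.items() if k not in DENIED_FIELDS}
    email = safe.pop("email", None)
    if email is not None:
        _, _, domain = email.partition("@")
        safe["email_hash"] = hashed(email.lower(), salt)
        safe["email_domain"] = domain.lower()
    return safe


raw_row = {
    "artifact_kind": "credential_dump",
    "email": "Jane.Doe@Example.test",
    "password": "not-a-real-password",
    "token": "not-a-real-token",
    "policy": "operator",
}
safe_row = sanitize(raw_row, salt="per-deployment-secret")
```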

The public boundary

Public pages should explain the method, schema, aggregate findings and safe examples. They should not publish full live URLs, panel paths, webhook values, tokens, credentials, victim data, raw leak content or complete hashes. This boundary lets PhishNet publish meaningful research while keeping operational evidence inside authenticated workflows.
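
For public examples, indicators can be defanged and truncated before publication. The helpers below are a sketch under assumptions: the `hxxp` and `[.]` defanging convention is a common analyst practice rather than something the source prescribes, and keeping the host while redacting the path is one possible choice, not a stated PhishNet rule.

```python
from urllib.parse import urlsplit


def public_url(url):
    """Defang the scheme and host; drop the path and query entirely."""
    parts = urlsplit(url)
    scheme = parts.scheme.replace("http", "hxxp")
    host = (parts.hostname or "").replace(".", "[.]")
    return f"{scheme}://{host}/[redacted]"


def public_hash(hex_digest, keep=8):
    """Publish only a short prefix of a hash, never the complete value."""
    return hex_digest[:keep] + "..."


example_url = public_url("https://login.example-bank.test/panel/admin.php?token=abc123")
example_hash = public_hash("0123456789abcdef" * 4)
```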

The operational lesson

The practical lesson is simple: if a platform collects raw evidence but cannot tell an analyst what the artifact is, where it came from, what it targeted and how it can be used, then the evidence is underpowered. Observation metadata makes the raw vault useful without making every screen dangerous. It is the difference between storage and an evidence intelligence system.

Research value

Defines the metadata bridge between raw artifacts and usable analysis
Explains how police and researchers can inspect evidence without broad raw-data exposure
Connects evidence observations to underwriting, claims and portfolio risk context

Selected sources and research

PhishNet uses public research, official Belgian sources and open OSINT documentation as context. Public pages explain the method and redact examples; authenticated platform views retain operational indicators according to role and policy.

APWG Phishing Activity Trends Reports
ENISA Threat Landscape 2025
MITRE ATT&CK: Phishing
Why Phishing Works, Dhamija, Tygar and Hearst, CHI 2006
CCB/Safeonweb phishing surge
FSMA warnings and sanctions
BIPT reserved and allocated numbering database
DNS Belgium statistics

Common questions

Does the observation dataset replace the vault?

No. The vault remains the canonical store for raw and sensitive evidence. Observations are the searchable metadata layer that points back to vault artifacts.

Can raw evidence still be downloaded?

Yes, when the artifact policy and operator role allow it. Downloads stay audited and linked to the artifact.

Can this support cyber insurance use cases?

Yes. It supports underwriting and claims context by linking posture, loss-event and portfolio signals back to source-backed evidence, but it is not a standalone actuarial model.