Why Media Forensics Needs Social Theories
For nearly a decade, deepfake detection has been framed as a classification task: given an audio or video clip, decide whether it is real or synthetic. Top detectors often report high accuracy on standard benchmarks; however, performance drops sharply on content from newer or unseen generators. We argue that better classifiers of synthetic media alone will not solve this problem, especially for interactive deepfakes such as impersonation in video and voice calls, where the harm lies not in the artifact (manipulated media signal) but in the act of deception.
Deepfake detection therefore requires a complementary analytical layer focused on communicative interaction, not just media realism. We identify five assumptions that artifact-based detection (the forensic analysis of low-level signal traces) relies on and show that all five are eroding as generative models improve, producing what we call the Generalization Illusion. To address this, we draw on three well-established frameworks from philosophy of language and social psychology, namely, Speech Act Theory, Grice's Cooperative Principle, and Cialdini's Principles of Influence, to examine forensic signals at three levels: the utterance, the conversation, and the listener response. The result is a unified framework that complements existing forensic methods. We close with open problems for future work.
Current detectors ask "was this generated by a machine?" We argue the right question is "is this being used to deceive someone?" Both are classification problems, but the second requires inputs that current detectors typically ignore: speech acts, conversational coherence, and influence patterns, rather than pixels and frequencies. In this paper, we take the position that this is not an engineering shortfall but a category error: media synthesis detection has been mistaken for the defining question, when it should be treated as one signal within the larger problem of deception detection.
Existing deepfake detectors rest on a set of forensic premises that were reasonable when introduced but are now eroding due to advances in generative modelling. We identify five such premises (P1–P5). They do not fail independently — they fail jointly, producing a systematic overestimation of real-world capability from static benchmarks that we term the Generalization Illusion.
P1 (spatial artifacts): Face manipulation introduces detectable irregularities at blending boundaries, in warping fields, or in local textures. CNN-based detectors such as XceptionNet, Face X-ray, and LAA-Net operationalize this by learning to localize boundary artifacts.
Eroded by: end-to-end diffusion-based generators that synthesize entire frames without a discrete blending step.
P2 (spectral fingerprints): Generative models leave characteristic spectral fingerprints — checkerboard artifacts from transposed convolutions, anomalous frequency-energy distributions — that distinguish them from natural images. Targeted by F3-Net, FreqNet, FE-CLIP.
Eroded by: hybrid pipelines, frequency-domain post-processing, and signature mismatch between newer generators and older detectors.
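To make this premise concrete, the following sketch (illustrative only, not one of the detectors above) computes the kind of statistic frequency-based methods build on: the share of power-spectrum energy at high spatial frequencies. Nearest-neighbour upsampling stands in for the periodic artifacts that transposed convolutions can leave; all sizes and thresholds are arbitrary choices for the example.

```python
import numpy as np

def high_freq_energy_ratio(img: np.ndarray, cutoff: float = 0.5) -> float:
    """Fraction of power-spectrum energy beyond `cutoff` of the Nyquist radius."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    return float(spec[radius > cutoff].sum() / spec.sum())

def fft_upsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Band-limited (ideal) upsampling: adds no energy beyond the source band."""
    h, w = img.shape
    spec = np.fft.fftshift(np.fft.fft2(img))
    padded = np.zeros((h * factor, w * factor), dtype=complex)
    y0, x0 = (h * factor - h) // 2, (w * factor - w) // 2
    padded[y0:y0 + h, x0:x0 + w] = spec
    return np.real(np.fft.ifft2(np.fft.ifftshift(padded))) * factor ** 2

rng = np.random.default_rng(0)
base = rng.normal(size=(64, 64))
smooth = fft_upsample(base, 4)            # stand-in for an artifact-free image
blocky = np.kron(base, np.ones((4, 4)))   # stand-in for periodic upsampling artifacts

print(f"band-limited upsampling:   {high_freq_energy_ratio(smooth):.3f}")  # near zero
print(f"nearest-neighbour hold:    {high_freq_energy_ratio(blocky):.3f}")  # markedly higher
```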
P3 (temporal inconsistencies): Frame-by-frame generation introduces flicker, identity drift, and unnatural micro-movements detectable by modelling temporal dynamics. Methods include FTCN, AltFreezing, MSVT, and Temporal Coherence Networks.
Eroded by: temporally-aware generators, motion stabilization, and advances in frame consistency and interpolation.
P4 (physiological signals): Synthetic faces fail to reproduce subtle physiological signals such as blink patterns, gaze stability, micro-expression timing, and rPPG-derived heart-rate variation. Operationalized by FakeCatcher, DeepRhythm.
Eroded by: high-resolution and temporally consistent generators that can preserve or imitate even rPPG signals.
P5 (channel robustness): The signals detectors rely on — spatial, spectral, temporal, biological — survive lossy real-world distribution channels: video compression, social-media re-encoding, screen capture, conferencing codecs.
Weakly validated: P5 is a meta-premise that gates the observability of every other signal at deployment. Most detectors are evaluated on minimally compressed lab data.
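The same toy statistic illustrates why P5 is fragile. Continuing the sketch above (it reuses high_freq_energy_ratio and the artifact-bearing blocky image from that block), re-encoding through JPEG at decreasing quality typically erodes exactly the high-frequency evidence a spectral detector depends on; the quality levels are arbitrary stand-ins for real distribution channels.

```python
import io
import numpy as np
from PIL import Image

def jpeg_roundtrip(img: np.ndarray, quality: int) -> np.ndarray:
    """Re-encode a grayscale float image through JPEG at the given quality."""
    lo, hi = img.min(), img.max()
    arr = np.clip((img - lo) / (hi - lo + 1e-9) * 255, 0, 255).astype(np.uint8)
    buf = io.BytesIO()
    Image.fromarray(arr, mode="L").save(buf, format="JPEG", quality=quality)
    return np.asarray(Image.open(buf), dtype=float)

# High-frequency energy of the artifact-bearing image typically shrinks as
# compression gets heavier, illustrating why P5 is weakly validated.
for quality in (95, 75, 50, 30):
    degraded = jpeg_roundtrip(blocky, quality)
    print(quality, round(high_freq_energy_ratio(degraded), 3))
```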
The record of real-world deepfake attacks reveals a consistent pattern. When attacks are stopped, it is because a human operator detected a contextual anomaly; when they succeed, it is because no such verification occurred. In neither case does automated media forensics play a meaningful role.
Where detection did occur, it relied on contextual signals outside the current paradigm; current detection methods do not capture these signals because they were not designed to.
We propose interaction forensics: a complementary layer that targets behavioural signals beyond the reach of artifact-based analysis, operating in parallel with existing detection pipelines. The framework decomposes an interaction into three analytical layers, drawing on three theories from linguistics and social psychology.
The utterance layer (Speech Act Theory) analyzes individual utterances to ask: what is the speaker doing, and does it fit their role and context? Identity claims are treated as checkable assertions, where "checkable" is operationalized through out-of-band verification — callbacks, prior shared context, challenge-response — rather than real-time media analysis. Signals include vague answers under verification pressure, unsolicited self-identification, and resistance to verification.
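As a toy illustration of this layer (a sketch, not a proposed implementation), the snippet below assumes an upstream speech-act classifier has already labelled each turn; the labels, fields, and scoring constants are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str                       # "caller" (possibly synthetic) or "target"
    speech_act: str                    # e.g. "assert_identity", "request_action", "answer", "deflect"
    verification_prompt: bool = False  # True if this turn asks the other party to verify themselves

def utterance_layer_score(turns: list[Utterance]) -> float:
    """Toy score in [0, 1]; higher means more deception-like utterance behaviour."""
    flags = 0
    for i, turn in enumerate(turns):
        if turn.speaker != "caller":
            continue
        under_pressure = i > 0 and turns[i - 1].verification_prompt
        if turn.speech_act == "assert_identity" and not under_pressure:
            flags += 1   # unsolicited self-identification
        if under_pressure and turn.speech_act in ("deflect", "refuse"):
            flags += 2   # vagueness or resistance under verification pressure
    return min(1.0, flags / 5.0)
```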
The conversation layer (Grice's Cooperative Principle) evaluates how well a conversation follows basic communication norms across its full flow, drawing on the four maxims: Quantity (right amount of information), Quality (truthful), Relation (relevant), and Manner (clear and natural). Critically, this layer distinguishes deception from legitimate urgency: genuine urgent requests remain contextually coherent and admit verification; deceptive ones suppress verification and disrupt conversational norms.
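A comparable sketch for this layer, assuming per-maxim violation scores come from hypothetical upstream models (for instance, trained scorers or an LLM judge); the structure, not the particular numbers, is the point.

```python
from dataclasses import dataclass

@dataclass
class MaximScores:          # each in [0, 1]; 0 = no violation, 1 = severe violation
    quantity: float         # wrong amount of information for the request
    quality: float          # unverifiable or inconsistent claims
    relation: float         # steers away from relevant (especially verification) topics
    manner: float           # evasive, scripted, or unnaturally phrased

def conversation_layer_score(maxims: MaximScores, urgent: bool, verification_allowed: bool) -> float:
    """Toy conversation-layer score in [0, 1]."""
    base = (maxims.quantity + maxims.quality + maxims.relation + maxims.manner) / 4
    # Genuine urgency stays coherent and admits verification; deceptive urgency does not.
    if urgent and not verification_allowed:
        base = min(1.0, base + 0.3)
    return base
```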
The listener-response layer (Cialdini's Principles of Influence) examines how the communication attempts to influence the target. The key signal is not whether influence is present, but how intensely it is applied and how many tactics co-occur. Deepfake fraud typically combines authority, scarcity, social proof, reciprocity, commitment, liking, and unity within a single interaction at atypical density.
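And for the influence layer, a sketch that assumes a hypothetical per-turn tactic detector and scores both the breadth of distinct tactics and the density with which they are applied; the weighting is illustrative.

```python
CIALDINI_TACTICS = {"authority", "scarcity", "social_proof", "reciprocity",
                    "commitment", "liking", "unity"}

def influence_layer_score(tactics_per_turn: list[set[str]]) -> float:
    """Toy influence-layer score in [0, 1]; tactics_per_turn lists detected tactics per caller turn."""
    if not tactics_per_turn:
        return 0.0
    distinct = set().union(*tactics_per_turn) & CIALDINI_TACTICS
    breadth = len(distinct) / len(CIALDINI_TACTICS)               # how many tactics co-occur
    density = sum(1 for t in tactics_per_turn if t) / len(tactics_per_turn)  # how intensely applied
    return min(1.0, 0.5 * breadth + 0.5 * density)

# Example: five distinct tactics across three turns, every turn carrying at least one.
print(influence_layer_score([{"authority", "scarcity"},
                             {"authority", "social_proof", "liking"},
                             {"scarcity", "unity"}]))
```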
The three layers are complementary lenses, and their joint signal is more informative than any layer in isolation. Each layer produces an independent deception score; an aggregate is computed as a weighted combination, with both an aggregate threshold and per-layer override thresholds to avoid the vulnerability of strict-agreement requirements. Weights and thresholds are deployment-dependent and form part of the calibration agenda.
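A minimal sketch of this aggregation, with placeholder weights and thresholds standing in for the calibration the text defers to deployment.

```python
def aggregate_decision(utterance: float, conversation: float, influence: float,
                       weights=(0.4, 0.3, 0.3),
                       aggregate_threshold: float = 0.6,
                       override_threshold: float = 0.9) -> bool:
    """Return True if the interaction should be flagged; all constants are placeholders."""
    scores = (utterance, conversation, influence)
    aggregate = sum(w * s for w, s in zip(weights, scores))
    # Per-layer overrides avoid the weakness of requiring all layers to agree:
    # a single very strong layer signal is enough to flag.
    return aggregate >= aggregate_threshold or any(s >= override_threshold for s in scores)

flag = aggregate_decision(utterance=0.2, conversation=0.4, influence=0.95)  # True via the per-layer override
```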
The framework cannot be evaluated using the benchmarks that dominate deepfake detection research. Existing benchmarks test classifiers on isolated clips and report binary accuracy or AUC. AUC remains useful for evaluating media-classification subcomponents, but is insufficient as a primary summary for interaction-grounded deception: it averages over operating points, is invariant to deployment base rates, and reduces latency to a single number. We propose evaluating defences on complete interaction scenarios using four complementary metrics that surface these dimensions explicitly.
Attack prevention rate (APR): the fraction of attack scenarios in which the defence intervenes before the target executes the requested harmful action. Correct decisions issued after compliance contribute nothing to APR.
Benign pass rate (BPR): the fraction of legitimate interactions that proceed without unwarranted intervention. APR alone is gameable (a defence that intervenes on every interaction trivially achieves APR = 1), so we pair it with BPR. A useful defence achieves both, with the trade-off made explicit on the APR–BPR plane.
Precision at fixed APR: the proportion of flagged interactions that are genuine attacks, evaluated at a chosen attack prevention rate and under deployment-realistic base rates. Because deepfake fraud has very low base rates in deployment, even a high BPR can mask operationally untenable false-positive volumes at scale. Precision at fixed APR surfaces this cost directly.
Intervention latency: the time elapsed from the start of an interaction to the system's intervention decision (block, alert, or escalation). A defence achieving high APR at 60 seconds is qualitatively different from one achieving the same APR at 5 seconds in attacks where compliance occurs within ten seconds. Latency must be reported alongside APR and BPR, not collapsed into them.
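The four metrics are straightforward to compute once scenario outcomes are recorded. The sketch below assumes a per-scenario outcome record of our own devising; precision is understood to be computed at the operating point chosen to reach the target APR and under deployment-realistic base rates.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScenarioOutcome:
    is_attack: bool
    flagged: bool
    intervention_time_s: Optional[float]  # None if the defence never intervened
    compliance_time_s: Optional[float]    # None if the harmful action was never executed

def apr(outcomes: list[ScenarioOutcome]) -> float:
    """Attack prevention rate: intervention must precede compliance to count."""
    attacks = [o for o in outcomes if o.is_attack]
    prevented = [o for o in attacks
                 if o.flagged and o.intervention_time_s is not None
                 and (o.compliance_time_s is None
                      or o.intervention_time_s < o.compliance_time_s)]
    return len(prevented) / max(1, len(attacks))

def bpr(outcomes: list[ScenarioOutcome]) -> float:
    """Benign pass rate: legitimate interactions that proceed uninterrupted."""
    benign = [o for o in outcomes if not o.is_attack]
    return sum(1 for o in benign if not o.flagged) / max(1, len(benign))

def precision(outcomes: list[ScenarioOutcome]) -> float:
    """Share of flagged interactions that are genuine attacks, at the chosen operating point."""
    flagged = [o for o in outcomes if o.flagged]
    return sum(1 for o in flagged if o.is_attack) / max(1, len(flagged))

def median_intervention_latency(outcomes: list[ScenarioOutcome]) -> Optional[float]:
    """Median time to intervention over flagged interactions; report next to APR and BPR."""
    times = sorted(o.intervention_time_s for o in outcomes
                   if o.flagged and o.intervention_time_s is not None)
    return times[len(times) // 2] if times else None
```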
Scenarios should be characterized along five dimensions: attack type (CEO fraud, invoice redirection, phishing escalation), victim persona (finance, HR), modality and channel (audio, video, synchronous or asynchronous), interaction length (single or multi-step), and compliance setup (scripted decisions, simulated agents, or human studies). A useful benchmark may include 10²–10³ scenarios across ~10 attack types, ~5 personas, and 2–3 modalities, with matched benign cases for BPR.
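For concreteness, a scenario record along these dimensions might look as follows; field names and values are illustrative, and modality and channel are split into separate fields here.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkScenario:
    attack_type: str         # e.g. "ceo_fraud", "invoice_redirection", "phishing_escalation"
    victim_persona: str      # e.g. "finance", "hr"
    modality: str            # "audio" or "video"
    channel: str             # "synchronous" or "asynchronous"
    interaction_length: str  # "single_step" or "multi_step"
    compliance_setup: str    # "scripted", "simulated_agent", or "human_study"
    is_attack: bool          # matched benign cases set this to False
```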
Our framework is conceptual: it defines what an interaction-grounded detection system should analyze, not how to build one from current components. The building blocks at each layer are at different maturity levels — mature for text-based speech-act classification, descriptive for Gricean conversational analysis, and emerging for influence-pattern detection in interactive settings. Making the framework practical requires solving five problems whose foundations exist in adjacent domains but have not been integrated for interactive deepfake deception.
Several deployment constraints warrant acknowledgement. Real-time inference across three layers may exceed latency budgets, suggesting a staged pipeline that begins with lightweight transcript analysis and escalates as needed. Continuous transcription raises privacy concerns, requiring consent, on-device processing, and data minimization. Adversarial adaptation is expected: as attackers learn these signals are monitored, they will adjust their strategies. Humans remain the final line of defence; these systems should support, not replace, their judgment.
If you find this work useful, please cite it as:
This work was conducted at the Vector Institute for Artificial Intelligence.