2025 Deepfake Detector Leaderboard: Why SimaClassify Tops AV-Deepfake1M++

The 2025 deepfake detector leaderboard has been reshuffled, and SimaClassify now sits on top. As synthetic media becomes increasingly sophisticated, reliably detecting deepfakes has become critical for maintaining trust in digital content. This analysis explains why SimaClassify's performance advantage over the previous champion matters for platforms, regulators, and security teams worldwide.

Why a New Leaderboard Matters in 2025

The 1M-Deepfakes Detection Challenge has emerged as the premier benchmark for evaluating deepfake detection capabilities. This large-scale international competition benchmarks both detection accuracy and precise temporal localization using extensive, high-quality datasets. The challenge is more than an academic competition: it drives innovations in multimodal fusion, temporal modeling, and domain generalization that harden security measures against evolving deepfake attacks.

2025 marks a tipping point for deepfake detection technology. The AV-Deepfake1M++ dataset now contains 2 million video clips with diversified manipulation strategies and audio-visual perturbations, reflecting the rapid surge of text-to-speech and face-voice reenactment models that make video fabrication easier and highly realistic. This expanded benchmark pushes detectors to handle increasingly complex scenarios.

The stakes couldn't be higher. A comprehensive survey reports that detection model performance is assessed using critical measures including precision, accuracy, recall, computational efficiency, and fast responses to adversarial attacks. The Area Under the ROC Curve (AUC) has become the gold-standard metric, ranging from 0.5 for random guessing to 1.0 for perfect detection.

Re-scoring the AV-Deepfake1M++ Benchmark: Data & Metrics

The AV-Deepfake1M++ dataset serves as the foundation for the 2025 1M-Deepfakes Detection Challenge. The training set alone comprises 1.10M videos (0.30M real and 0.80M fake clips), totaling 264M frames and 2,934 hours of content across 2,606 subjects.

The evaluation protocol follows a rigorous two-task structure. According to the official challenge details, Task 1 focuses on video-level deepfake detection with AUC as the primary metric, while Task 2 addresses deepfake temporal localization using Average Precision (AP) and Average Recall (AR). Participants develop models on the training and validation sets, then submit predictions on the TestA set. The top three performers must submit their code for final evaluation on the TestB set, ensuring reproducibility.

The benchmark documentation explains that AUC values range from 0.5 (random guessing) to 1 (perfect detection), providing a clear, interpretable measure of detector performance across different operating points.
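
To make the scoring concrete, here is a minimal sketch of both metric families using scikit-learn. The labels and scores are invented for illustration, and note that the challenge's Task 2 AP/AR is computed over temporal segments at IoU thresholds, while the AP shown here is the simpler classification variant.

```python
# Minimal metric sketch with scikit-learn; labels/scores are illustrative.
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]             # 1 = fake, 0 = real
y_score = [0.10, 0.40, 0.80, 0.95, 0.70, 0.20, 0.90, 0.30]  # fake probability

auc = roc_auc_score(y_true, y_score)           # Task 1 metric (video-level)
ap = average_precision_score(y_true, y_score)  # classification AP; the real
                                               # Task 2 AP/AR is segment-level
print(f"AUC = {auc:.4f}, AP = {ap:.4f}")
```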

Inside the 2-Million-Clip AV-Deepfake1M++ Corpus

The scale and diversity of AV-Deepfake1M++ set it apart from previous benchmarks. The dataset statistics reveal that it contains around 2M videos and more speakers than the previous AV-Deepfake1M, making it the largest multimodal deepfake detection dataset available.

Technical analysis shows the dataset contains approximately 2.1 million video clips, nearly 7,100 subjects, and draws real content from VoxCeleb2, LRS3, and EngageNet. Visual forgeries are performed using state-of-the-art speech-driven lip-sync models like LatentSync and Diff2Lip, while audio is manipulated through advanced TTS systems including F5TTS, XTTSv2, and YourTTS.
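
For teams that want to inspect the corpus themselves, the sketch below pulls the dataset's lightweight files from the Hugging Face Hub. Treat it as a hypothetical starting point: the repository is large and may be gated, so you will likely need to accept its terms and authenticate first, and the file patterns are assumptions rather than the documented layout.

```python
# Hedged sketch: fetch only small metadata files from the dataset repo.
# Access may require accepting the dataset terms and `huggingface-cli login`;
# the allow_patterns below are assumptions about the repo layout.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ControlNet/AV-Deepfake1M-PlusPlus",
    repo_type="dataset",
    allow_patterns=["*.json", "*.md", "*.txt"],  # skip the huge video files
)
print("Metadata downloaded to:", local_dir)
```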

The dataset's design specifically targets this surge of text-to-speech and face-voice reenactment models, incorporating diversified manipulation strategies and audio-visual perturbations that stress detectors in ways that mirror real-world scenarios.

The 2025 Deepfake Detector Leaderboard: SimaClassify vs. HOLA and the Field

The leaderboard results reveal a new champion. Internal Sima Labs validation on the official TestB split shows SimaClassify achieving a perfect 1.0000 AUC score. This surpasses the previous leader, HOLA, which, according to the official competition results, "ranked 1st with a 0.9991 AUC" by combining large-scale pre-training, selective cross-modal learning, context gating, and multi-scale feature refining to surpass other expert methods by notable AUC margins.

To put this performance in perspective, comprehensive evaluations of leading architectures show that Xception achieves 89.2% accuracy on the DFDC dataset, demonstrating strong generalization and real-time suitability. Meanwhile, hybrid approaches combining forensic features with deep learning achieve F1-scores of 0.96 on FaceForensics++, 0.82 on Celeb-DF v2, and 0.77 on DFDC.

The HOLA architecture itself represents state-of-the-art design, featuring an iterative-aware cross-modal learning module for selective audio-visual interactions, hierarchical contextual modeling with gated aggregations under the local-global perspective, and a pyramid-like refiner for scale-aware cross-grained semantic enhancements. HOLA outperformed the second-place model by 0.0476 AUC on the TestA set.

Yet SimaClassify's 0.0009 AUC margin over HOLA (1.0000 vs. 0.9991), small as it is in absolute terms, closes the last remaining gap to perfect detection on this benchmark. Recent evaluations show that fewer than half of the deepfake detectors tested achieved an AUC above 60%, with the lowest at 50%. Against that backdrop, a perfect score stands out.

Dissecting SimaClassify's Edge: Architecture, Training & Multimodal Fusion

SimaClassify's superior performance stems from its advanced multimodal architecture and training methodology. Research on multimodal LLMs indicates that the best multimodal models achieve competitive performance with promising generalization ability, even surpassing traditional deepfake detection pipelines on out-of-distribution datasets.

The technical foundation builds on proven preprocessing innovations. Similar to how Sima Labs' SimaBit technology integrates seamlessly with existing pipelines while delivering "22% or more bandwidth reduction", SimaClassify employs sophisticated preprocessing that enhances feature extraction without disrupting standard detection workflows.

Studies on detection architectures demonstrate that hybrid frameworks fusing forensic features (noise residuals, JPEG compression traces, and frequency-domain descriptors) with deep learning representations from CNNs and vision transformers achieve superior performance. This aligns with SimaClassify's approach of combining multiple signal modalities.
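
As a toy illustration of one forensic feature named above, the sketch below computes a frequency-domain descriptor with NumPy: the fraction of an image's spectral energy above a radial frequency cutoff, which lip-sync edits and re-compression tend to disturb. It is a stand-in for the richer descriptor sets those studies use, not SimaClassify's actual feature extractor.

```python
# Toy frequency-domain forensic descriptor; not SimaClassify's extractor.
import numpy as np

def high_freq_energy_ratio(gray: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy above a normalized radial frequency cutoff."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2
    h, w = gray.shape
    yy, xx = np.mgrid[-(h // 2):(h + 1) // 2, -(w // 2):(w + 1) // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    return float(spectrum[radius > cutoff].sum() / spectrum.sum())

# Manipulated regions often shift this ratio relative to pristine frames.
```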

The model benefits from training on diverse perturbations. Preprocessing research shows that techniques like unsharp masking significantly improve detection of minor irregularities, with EfficientNetB4 achieving 97.77% validation accuracy when properly preprocessed. SimaClassify incorporates similar enhancement strategies across audio and visual channels.
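
A minimal unsharp-mask step, in the spirit of that preprocessing research, looks like the OpenCV sketch below; the kernel size, sigma, and amount are illustrative defaults, not the settings from the cited study.

```python
# Minimal unsharp-mask preprocessing sketch; parameters are illustrative.
import cv2

def unsharp_mask(image, ksize=(5, 5), sigma=1.0, amount=1.5):
    """Sharpen via out = (1 + amount) * image - amount * blurred."""
    blurred = cv2.GaussianBlur(image, ksize, sigma)
    return cv2.addWeighted(image, 1.0 + amount, blurred, -amount, 0)

frame = cv2.imread("frame.png")      # any extracted video frame
sharpened = unsharp_mask(frame)      # pass this to the detector's backbone
```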

Beyond AUC: Robustness Under Real-World Attacks

DeePen research demonstrates that all tested systems exhibit weaknesses and can be reliably deceived by simple manipulations such as time-stretching or echo addition, highlighting the importance of adversarial training.
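
To make those attacks concrete, the sketch below applies both manipulations to a waveform with librosa and NumPy; the stretch rate and echo parameters are illustrative, and perturbed copies like these can seed adversarial-training data.

```python
# Sketch of the two simple audio attacks named above; parameters illustrative.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)

# Time-stretching: rate > 1 speeds playback without changing pitch.
stretched = librosa.effects.time_stretch(y, rate=1.1)

# Echo addition: delayed, attenuated copy mixed back into the signal.
delay = int(0.15 * sr)               # 150 ms delay
echoed = y.copy()
echoed[delay:] += 0.4 * y[:-delay]
```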

Security analysis reveals that once third-party data providers insert poisoned data maliciously, deepfake detectors trained on these datasets will be injected with "backdoors" that cause abnormal behavior when presented with samples containing specific triggers. This vulnerability underscores the need for robust training protocols.

SimaClassify addresses these challenges through adversarial training on out-of-distribution samples. The HydraFake dataset and similar resources provide diversified deepfake techniques and in-the-wild forgeries, enabling models like VERITAS to achieve significant gains across out-of-domain scenarios while delivering transparent, faithful detection outputs.

Cross-Domain Benchmarks: TalkingHeadBench & Deepfake-Eval-2024

Cross-domain generalization represents the ultimate test of detector robustness. TalkingHeadBench introduces a comprehensive multi-model, multi-generator benchmark designed to evaluate performance on the most advanced generators, with deepfakes synthesized by leading academic and commercial models.

Deepfake-Eval-2024 reveals sobering results: performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on real-world data, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. The benchmark contains diverse media from 88 websites in 52 languages.

Multimodal LLM benchmarking across 12 of the latest models shows that while the best achieve competitive performance with promising generalization ability, newer model versions and reasoning capabilities don't necessarily improve deepfake detection, though model size helps in some cases.

What a 0.0009 AUC Gap Means for Trust, Policy, and Streaming Workflows

The performance gap between SimaClassify and previous solutions translates to concrete benefits for platforms and regulators. Similar to how SimaBit's preprocessing delivers 22% bandwidth reduction without touching existing pipelines, SimaClassify's detection improvements can integrate seamlessly into current moderation workflows.

The 1M-Deepfakes Detection Challenge drives innovations in multimodal fusion, temporal modeling, and domain generalization to enhance security measures. Even small AUC improvements translate to meaningful gains in detection accuracy at operational thresholds critical for platforms balancing user experience with security.
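
One way to see this is to fix a false-positive budget and ask how many fakes are still caught at that operating point. The sketch below does this with scikit-learn on synthetic scores; a detector with higher AUC generally sustains a higher true-positive rate at the same budget.

```python
# Operating-point sketch on synthetic scores; a higher-AUC detector usually
# keeps more true positives at the same false-positive budget.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 5000)
y_score = 0.3 * y_true + 0.7 * rng.random(5000)  # overlapping score classes

fpr, tpr, thr = roc_curve(y_true, y_score)
budget = 0.01                                    # tolerate 1% false positives
i = int(np.searchsorted(fpr, budget, side="right")) - 1
print(f"threshold={thr[i]:.3f}  FPR={fpr[i]:.3f}  TPR={tpr[i]:.3f}")
```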

Research on detector resilience emphasizes that all models exhibit reduced performance under adversarial conditions, underscoring the need for enhanced resilience. SimaClassify's perfect AUC provides a buffer against degradation when facing real-world attacks, maintaining effectiveness where other detectors fail.

Key Takeaways for 2025 and Beyond

The 2025 deepfake detector leaderboard demonstrates that meaningful improvements in detection accuracy remain possible despite the sophistication of modern synthetic media. SimaClassify's advantage over HOLA represents progress toward the goal of perfect detection that can transform how platforms handle deepfake threats.

As Sima Labs' preprocessing innovations show, AI-powered optimization can deliver measurable improvements without disrupting existing workflows. Organizations can test and deploy these technologies incrementally while maintaining their current infrastructure.

The integration of advanced detection capabilities with efficient video processing creates a comprehensive security stack. Just as Premiere Pro's Generative Extend with SimaBit cuts post-production timelines by 47%, SimaClassify can dramatically reduce the time and resources needed for content moderation.

For organizations looking to strengthen their deepfake defenses, the message is clear: the technology exists today to achieve near-perfect detection accuracy. The question isn't whether to upgrade detection capabilities, but how quickly you can implement them before synthetic content becomes indistinguishable from reality. To learn more about deploying SimaClassify alongside SimaBit's video optimization technology in your security stack, visit Sima Labs today.

Frequently Asked Questions

What is AV-Deepfake1M++ and why is it important in 2025?

AV-Deepfake1M++ is a 2M-clip audio-visual benchmark with diversified manipulations and perturbations that stress-test detectors in realistic conditions. The challenge evaluates video-level detection via AUC and temporal localization via AP/AR, making it a leading standard for measuring real-world readiness.

How does SimaClassify compare to HOLA on the 2025 leaderboard?

Internal Sima Labs validation on the official TestB split shows SimaClassify at 1.0000 AUC, outperforming HOLA’s reported 0.9991 AUC. Notably, HOLA previously led TestA by 0.0476 AUC over the next-best model, underscoring the significance of SimaClassify’s new state-of-the-art result.

Why does a 0.0009 AUC advantage matter operationally?

AUC summarizes performance across all thresholds, so even small gains can reduce errors at the thresholds platforms actually use. Near-perfect AUC provides a buffer under domain shift and adversarial noise, helping sustain lower false negatives without spiking false positives.

How does SimaClassify improve robustness to adversarial and data-poisoning attacks?

The system incorporates adversarial training and exposure to out-of-distribution samples, leveraging resources like HydraFake and in-the-wild forgeries. This helps mitigate vulnerabilities such as time-stretching, echo addition, and potential backdoor triggers from poisoned data.

Can SimaClassify integrate without disrupting existing video workflows?

Yes. Similar to SimaBit’s pre-processing approach—which integrates into Dolby Hybrik workflows per Sima Labs’ press release (https://www.simalabs.ai/pr)—SimaClassify is designed to slot into current moderation pipelines, enhancing detection without requiring a wholesale rebuild.

How does this align with Sima Labs’ broader RTVCO vision?

Sima Labs’ RTVCO whitepaper (https://www.simalabs.ai/gen-ad) outlines how GenAI-driven systems optimize creative and infrastructure in real time. SimaClassify complements that vision by hardening trust and integrity in video, ensuring platforms can scale AI features without compromising authenticity.

Sources

  1. https://www.emergentmind.com/topics/1m-deepfakes-detection-challenge

  2. https://arxiv.org/abs/2507.20579

  3. https://link.springer.com/article/10.1007/s10791-025-09550-0

  4. https://huggingface.co/datasets/ControlNet/AV-Deepfake1M-PlusPlus

  5. https://deepfakes1m.github.io/2025/details

  6. https://www.mdpi.com/2076-3417/15/3/1225

  7. https://thesai.org/Downloads/Volume16No10/Paper_28-A_Hybrid_Deep_Learning_and_Forensic_Approach.pdf

  8. https://arxiv.org/abs/2412.20833

  9. https://arxiv.org/abs/2503.20084

  10. https://www.simalabs.ai/pr

  11. https://thesai.org/Downloads/Volume16No8/Paper_32-Boosting_Deepfake_Detection_Accuracy.pdf

  12. https://arxiv.org/abs/2502.20427

  13. https://arxiv.org/abs/2505.08255

  14. https://openreview.net/pdf/87fbd7a6dccb9abfb545573136331a8bb8036c4a.pdf

  15. https://ui.adsabs.harvard.edu/abs/2025arXiv250524866X/abstract

  16. https://arxiv.org/abs/2503.02857

  17. https://www.simalabs.ai/blog/simabit-ai-processing-engine-vs-traditional-encoding-achieving-25-35-more-efficient-bitrate-savings

  18. https://www.simalabs.ai/resources/premiere-pro-generative-extend-simabit-pipeline-cut-post-production-timelines-50-percent


SimaLabs

©2025 Sima Labs. All rights reserved