TY - GEN
T1 - SoK
T2 - 34th USENIX Security Symposium, USENIX Security 2025
AU - Büchel, Marvin
AU - Paladini, Tommaso
AU - Longari, Stefano
AU - Carminati, Michele
AU - Zanero, Stefano
AU - Binyamini, Hodaya
AU - Engelberg, Gal
AU - Klein, Dan
AU - Guizzardi, Giancarlo
AU - Caselli, Marco
AU - Continella, Andrea
AU - van Steen, Maarten
AU - Peter, Andreas
AU - van Ede, Thijs
N1 - Publisher Copyright:
© 2025 by The USENIX Association All Rights Reserved.
PY - 2025/1/1
Y1 - 2025/1/1
AB - Cyber Threat Intelligence (CTI) plays a critical role in sharing knowledge about new and evolving threats. With the increased prevalence and sophistication of threat actors, intelligence has expanded from simple indicators of compromise to extensive CTI reports describing high-level attack steps known as Tactics, Techniques and Procedures (TTPs). Such TTPs, often classified into the ontology of the ATT&CK framework, make CTI significantly more valuable, but also harder to interpret and automatically process. Natural Language Processing (NLP) makes it possible to automate large parts of the knowledge extraction from CTI reports; over 40 papers discuss approaches, ranging from named entity recognition through embedder models to generative large language models. Unfortunately, existing solutions are largely incomparable as they consider decidedly different and constrained settings, rely on custom TTP ontologies, and use a multitude of custom, inaccessible CTI datasets. We take stock, systematize the knowledge in the field, and empirically evaluate existing approaches in a unified setting for fair comparisons. We gain several fundamental insights, including (1) the finding of a kind of performance limit that existing approaches seemingly cannot yet overcome, (2) that traditional NLP approaches (possibly counterintuitively) outperform modern embedder-based and generative approaches in realistic settings, and (3) that further research on understanding inherent ambiguities in TTP ontologies and on the creation of high-quality datasets is key to taking a leap in the field.
UR - https://www.scopus.com/pages/publications/105021346016
M3 - Conference contribution
AN - SCOPUS:105021346016
T3 - Proceedings of the 34th USENIX Security Symposium
SP - 4621
EP - 4641
BT - Proceedings of the 34th USENIX Security Symposium
PB - USENIX Association
Y2 - 13 August 2025 through 15 August 2025
ER -