Key features of SNAC-DB include:
· Expanded Coverage: 32 % more structural diversity than SAbDab, capturing overlooked assemblies such as antibodies/nanobodies as antigens, complete multi-chain epitopes, and weak CDR crystal contacts.
· ML-Friendly Data: Cleaned PDB/mmCIF files, atom37 NumPy arrays, and unified CSV metadata to eliminate preprocessing hurdles.
· Transparent Redundancy Control: Multi-threshold Foldseek clustering for principled sample weighting, ensuring every experimental structure contributes.
· Rigorous Benchmark: An out-of-sample test set comprising public PDB entries post–May 30, 2024 (disclosed) and confidential therapeutic complexes.
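To make the redundancy-control idea concrete, here is a minimal sketch of how cluster assignments at several thresholds could be turned into sample weights. It is an illustration under stated assumptions, not the SNAC-DB implementation: the function names, the threshold labels, and the choice to average inverse cluster sizes across thresholds are all hypothetical.

```python
from collections import Counter


def cluster_weights(cluster_ids):
    """Weight each structure by 1 / (size of its cluster)."""
    counts = Counter(cluster_ids)
    return [1.0 / counts[c] for c in cluster_ids]


def multi_threshold_weights(clusterings):
    """Average inverse-cluster-size weights over several thresholds.

    `clusterings` maps a threshold label to a list of cluster IDs,
    one per structure, in the same structure order for every threshold.
    Averaging across thresholds is an assumption made for this sketch.
    """
    per_threshold = [cluster_weights(ids) for ids in clusterings.values()]
    n = len(per_threshold[0])
    avg = [sum(w[i] for w in per_threshold) / len(per_threshold) for i in range(n)]
    total = sum(avg)
    # Normalize so the weights sum to 1; no structure is ever dropped.
    return [w / total for w in avg]


# Hypothetical cluster IDs for four structures at two thresholds.
clusterings = {
    "loose": ["A", "A", "A", "B"],
    "strict": ["a", "a", "b", "c"],
}
weights = multi_threshold_weights(clusterings)
```

With weighting of this kind, near-duplicate structures share their influence instead of being discarded, so every entry still contributes to training.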
Using this benchmark, we evaluated six leading models (AlphaFold2.3-multimer, Boltz-2, Boltz-1x, Chai-1, DiffDock-PP, GeoDock) and found that success rates rarely exceed 25 %, that built-in confidence metrics and ranking often misprioritize predictions, and that all models struggle with novel targets and binding poses.
We presented this work at the Forty-Second International Conference on Machine Learning (ICML 2025) Workshop on DataWorld: Unifying Data Curation Frameworks Across Domains (https://dataworldicml2025.github.io/) in Vancouver.
· Paper: https://www.researchgate.net/publication/393900649_SNAC-B_The_Hitchhiker's_Guide_to_Building_Better_Predictive_Models_of_Antibody_NANOBOY_R_VHH-Antigen_Complexes / https://openreview.net/forum?id=68cIpaHK
· Dataset: https://zenodo.org/records/16226208
· Code: https://github.com/Sanofi-Public/SNAC-B
We hope SNAC-DB will accelerate the development and evaluation of more accurate models for antibody complex prediction.