Andy Zhang

Hi, I am currently a researcher at Stanford University interested in the societal impact of AI, with a focus on the cybersecurity capabilities and risks of LM agents and on trustworthy evaluation. Previously, I was a Staff Engineer at Lyft, an Engineer at Nuro, and a Quant at Jane Street and D.E. Shaw.

Cybench

Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and other researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyber risk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description and starter files, and is initialized in an environment where an agent can execute bash commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks, which break each task down into intermediary steps for a more fine-grained evaluation. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. Without subtask guidance, agents leveraging Claude 3.5 Sonnet, GPT-4o, OpenAI o1-preview, and Claude 3 Opus successfully solved complete tasks that took human teams up to 11 minutes to solve; for comparison, the most difficult task took human teams 24 hours and 54 minutes. All code and data are publicly available at https://cybench.github.io/.
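
To give a flavor of the task-and-agent setup, here is a minimal sketch (not the actual Cybench implementation) of an observe-act loop in which a model proposes bash commands and the environment returns their output. The TaskSpec fields and the run_agent helper are hypothetical names chosen for illustration.

```python
import subprocess
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    # Hypothetical task specification; field names are illustrative, not Cybench's actual schema.
    name: str
    description: str
    starter_files: list[str] = field(default_factory=list)
    flag: str = ""                                      # secret string the agent must recover
    subtasks: list[str] = field(default_factory=list)   # intermediary steps for finer-grained scoring


def run_agent(task: TaskSpec, propose_command, max_steps: int = 15) -> bool:
    """Drive a simple observe-act loop: the model proposes a bash command,
    the environment executes it, and the output becomes the next observation."""
    observation = task.description
    for _ in range(max_steps):
        command = propose_command(observation)   # e.g., a call to an LM API (caller-supplied)
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True, timeout=60)
        observation = (result.stdout + result.stderr)[-4000:]  # truncate long outputs
        if task.flag and task.flag in observation:
            return True   # task solved: flag recovered
    return False
```

In practice the environment would be containerized and the per-task setup (networked services, starter files) handled separately; this sketch only shows the command-execution feedback loop described above.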

Train-test overlap

Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap, i.e., the extent to which a language model is trained on the very data it is being tested on. The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot measure it directly since they do not have access to the training data. To make this clear, we document the practices of 30 model developers, finding that just 9 report train-test overlap: 4 release training data under open-source licenses, enabling the community to measure overlap directly, and 5 publish their train-test overlap methodology and statistics. By engaging with language model developers, we provide novel information about train-test overlap for three additional developers. Overall, we take the position that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets. We hope our work increases transparency into train-test overlap and strengthens community-wide trust in model evaluations.
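
For intuition on what "directly measuring train-test overlap" can mean in practice, here is a minimal sketch that flags test documents sharing long word-level n-grams with a training corpus. This is one simple overlap definition, not the methodology of any particular developer; the n-gram length is an illustrative choice.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a document (n=13 is an illustrative length)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(train_docs: list[str], test_docs: list[str], n: int = 13) -> float:
    """Fraction of test documents sharing at least one n-gram with the training data.
    Real developer methodologies differ in tokenization, n, and match criteria."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for doc in test_docs if ngrams(doc, n) & train_grams)
    return flagged / len(test_docs) if test_docs else 0.0
```

Note that this kind of check is only possible with access to the training data, which is why the position above calls for developers to publish either overlap statistics or the data itself.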

Publications

  1. Andy K. Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W. Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham Raghupathi, Dan Boneh, Daniel E. Ho, Percy Liang. “Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models.” ICLR 2025 Oral (2025).

  2. Andy K. Zhang, Kevin Klyman, Yifan Mai, Yoav Levine, Yian Zhang, Rishi Bommasani, and Percy Liang. “Language model developers should report train-test overlap.” arXiv preprint arXiv:2410.08385 (2024).

  3. Wenbo Guo, Yujin Potter, Tianneng Shi, Zhun Wang, Andy Zhang, Dawn Song. “SoK: Frontier AI’s Impact on the Cybersecurity Landscape.” arXiv preprint arXiv:2504.05408 (2025).

  4. Andy K. Zhang. “The potential participation of abdominal pressure in preeclampsia.” Med Hypotheses. 2015 Jun;84(6):583-5. doi: 10.1016/j.mehy.2015.03.001. Epub 2015 Mar 6. PMID: 25772015.

  5. Andy K. Zhang. “Chromogenic assay for lung cancer-related EGFR exon 19 hotspot deletion mutations.” Genet Test Mol Biomarkers. 2016 Jan;20(1):18-23. doi: 10.1089/gtmb.2015.0197. Epub 2015 Nov 6. PMID: 26544543.