Monika Jotautaitė

AI Safety Researcher

About me

I'm an AI control researcher, working on monitor red-teaming at Apollo Research.

Previously, I was an independent AI researcher leading a two-person team focused on MonitoringBench. My work was supported by Co-Efficient Giving (formerly Open Philanthropy) and the Berkeley Existential Risk Initiative.

Since the January preview, MonitoringBench has been used for monitor evaluations in Anthropic's Claude Mythos Preview Risk Report and OpenAI's Auto-Review for Codex.

Research

MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring
Monika Jotautaitė, Maria Martinez, Ollie Matthews, Tyler Tracy
Accepted at ICLR workshop
Our current research: using models as red-teamers presents three challenges: mode collapse, time-consuming elicitation, and a conceive-execute gap, where models struggle to conceive, plan, and execute attacks in a single turn. To robustly evaluate monitors, we need to (1) test across a large, diverse set of attacks, (2) ensure attack quality, and (3) gain visibility into monitor strengths and failure modes. If you are interested in the benchmark or independent red-teaming of monitors, please reach out.
Speciesism in AI: Evaluating Discrimination Against Animals in Large Language Models
Monika Jotautaitė, Lucius Caviola, David A Brewster, Thilo Hagendorff
Accepted at Nature Communications
We developed a speciesism benchmark comparing human and LLM responses. The results show that while most frontier models can recognize speciesism, they do not consider such behavior unethical.
From Stability to Inconsistency: A Study of Moral Preferences in LLMs
Monika Jotautaitė, Mary Phuong
Pivotal Fellowship
Introduced a novel Moral Foundations Theory evaluation dataset. Our findings reveal remarkably homogeneous preferences across different model families, yet demonstrate a lack of consistent values.

My work

UK AISI Bounty Program: As an evaluations scientist, I designed and implemented multiple evaluation proposals that were accepted for the UK AISI bounty program. The evaluations I worked on cover the following capabilities: LLM elicitation, online gambling, collusion in AI debate, and decreasing test-time token usage. I also worked on the SmartBackdoor paper. In addition, I was a technical program manager for the ASET Benchmarks program at Arcadia Impact, mentoring a team of engineers implementing cybersecurity evals in Inspect.

I organize Women in AI Safety London, a series of networking events. To receive updates on events and opportunities, join our mailing list. If you're interested in organizing a local event, you can apply here.

I occasionally teach at ML4Good bootcamps as a head teacher or a TA. I created new materials on LLM evaluations and RL. Find upcoming programs at ml4good.org/upcoming.

I created AI Safety materials for GirlsWhoML (slides). If you'd like to run this lecture series at your university, reach out here.

Get in touch