Monika Jotautaitė

AI Safety Researcher

About me

I'm an AI control researcher, working on monitor red-teaming at Apollo Research.

Previously, I was an independent AI researcher leading a two-person team focused on MonitoringBench. My work was supported by Co-Efficient Giving (formerly Open Philanthropy) and the Berkeley Existential Risk Initiative.

Since the January preview, MonitoringBench has been used for monitor evaluations in Anthropic's Claude Mythos Preview Risk Report and OpenAI's Auto-Review for Codex.

Research

MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring

Monika Jotautaitė, Maria Martinez, Ollie Matthews, Tyler Tracy

Accepted at an ICLR workshop

Our current research focuses on three challenges of using models as red-teamers: mode collapse, time-consuming elicitation, and the conceive-execute gap, where models struggle to conceive, plan, and execute attacks in a single turn. To robustly evaluate monitors, we need to (1) test across a large, diverse set of attacks, (2) ensure attack quality, and (3) gain visibility into monitor strengths and failure modes. If you are interested in the benchmark or in independent red-teaming of monitors, please reach out.

Benchmark | Code | Paper (final) | Paper (early version, accepted at the ICML Agents in the Wild workshop) | Halfway Blogpost

Speciesism in AI: Evaluating Discrimination Against Animals in Large Language Models

Monika Jotautaitė, Lucius Caviola, David A Brewster, Thilo Hagendorff

Accepted at Nature Communications

Developed a speciesism benchmark comparing human and LLM responses. The results show that while most frontier models can recognize speciesism, they do not consider such behavior unethical.

Paper | Nature Communications | Website

From Stability to Inconsistency: A Study of Moral Preferences in LLMs

Monika Jotautaitė, Mary Phuong

Pivotal Fellowship

Introduced a novel Moral Foundations Theory evaluation dataset. Our findings reveal remarkably homogeneous preferences across different model families, yet demonstrate a lack of consistent values.

Slides | Code | Paper

My work

UK AISI Bounty Program: As an evaluations scientist, I designed and implemented multiple evaluation proposals that were accepted for the UK AISI bounty program. The evaluations I worked on cover LLM elicitation, online gambling, collusion in AI debate, and decreasing test-time token usage, as well as the SmartBackdoor paper. I was also a technical program manager for the ASET Benchmarks program at Arcadia Impact, mentoring a team of engineers implementing cybersecurity evals in Inspect.

I organize Women in AI Safety London, a series of networking events. To receive updates on events and opportunities, join our mailing list. If you're interested in organizing a local event, you can apply here.

I occasionally teach at ML4Good bootcamps as a head teacher or a TA. I created new materials on LLM evaluations and RL. Find upcoming programs at ml4good.org/upcoming.

I created AI Safety materials for GirlsWhoML (slides). If you'd like to run this lecture series at your university, reach out here.

Contact
