Meta tags:
author= Mikita Balesni;
description= A simple, whitespace theme for academics. Based on [*folio](https://github.com/bogoli/-folio) design.;
keywords= large language models, ai alignment, ai deception, apollo research, reversal curse, insider trading, out-of-context reasoning;
Headings (most frequently used words):
for, scheming, of, ai, to, llms, from, and, safety, models, in, context, the, situational, awareness, on, is, highlighted, research, stress, testing, deliberative, alignment, anti, training, lessons, studying, two, hop, latent, reasoning, chain, thought, monitorability, new, fragile, opportunity, how, evaluate, control, measures, llm, agents, trajectory, today, superintelligence, frontier, are, capable, towards, evaluations, based, cases, me, myself, dataset, sad, large, language, can, strategically, deceive, their, users, when, put, under, pressure, reversal, curse, trained, fail, learn, taken, out, measuring,
Text of the page (most frequently used words):
mikita (13), balesni (13), and (11), korbak (6), owain (6), evans (6), for (6), #models (5), scheurer (5), safety (5), scheming (5), arxiv (5), preprint (5), code (4), the (4), llms (4), 2024 (4), jérémy (4), marius (4), hobbhahn (4), alexander (4), meinke (4), tomek (4), twitter (4), 2025 (4), language (3), are (3), context (3), situational (3), awareness (3), can (3), llm (3), agents (3), how (3), frontier (3), research (3), trained (2), pangolin (2), german (2), when (2), you (2), lukas (2), berglund (2), asa (2), cooper (2), stickland (2), max (2), kaufmann (2), meg (2), tong (2), tomasz (2), daniel (2), kokotajlo (2), out (2), iclr (2), reversal (2), curse (2), deceive (2), users (2), pressure (2), their (2), through (2), bilal (2), chughtai (2), that (2), cause (2), catastrophic (2), outcomes (2), buck (2), shlegeris (2), rusheb (2), shah (2), evaluations (2), cases (2), bronson (2), schoen (2), capable (2), from (2), others (2), chain (2), thought (2), monitorability (2), reasoning (2), alignment (2), google (2), scholar (2), working (2), with (2), copyright, 2026, equal, contribution, declarative, facts, like, assistant, speaks, generalize, speak, prompted, taken, measuring, model, weights, encode, knowledge, key, value, mappings, preventing, reverse, order, generalization, fail, learn, gpt, its, without, instruction, simulated, high, insider, trading, scenario, oral, large, strategically, put, under, quantify, well, understand, themselves, 13k, behavioral, tests, finding, gaps, even, top, neurips, datasets, benchmarks, track, rudolf, laine, jan, betley, kaivalya, hariharan, jeremy, myself, dataset, sad, sketch, developers, systems, could, construct, structured, rationale, case, system, unlikely, pursuing, misaligned, goals, covertly, hiding, true, capabilities, objectives, david, lindner, joshua, clymer, charlotte, stix, nicholas, goldowsky, dill, dan, braun, lucius, bushnaq, towards, based, blogpost, geoffrey, irving, evaluate, control, measures, 
trajectory, today, superintelligence, elizabeth, barnes, yoshua, bengio, joe, benton, joseph, bloom, mark, chen, alan, cooney, allan, dafoe, anca, dragan, new, fragile, opportunity, lessons, studying, two, hop, latent, website, evgenia, nitishinskaya, axel, højmark, felix, hofstätter, jason, wolfe, teun, van, der, weij, alex, lloyd, stress, testing, deliberative, anti, training, highlighted, mbalesni, gmail, com, github, please, consider, providing, use, this, form, anonymous, feedback, evaluating, discovered, mats, scientist, founding, member, apollo, work, focus, ensuring, future, highly, aligned, human, intentions, not, previously, was, current, toggle, navigation,
Text of the page (random words):
Mikita Balesni

I work on AI safety and alignment. I focus on ensuring that future highly capable LLM agents are aligned with human intentions and do not cause catastrophic outcomes.

Previously, I was a research scientist and founding member at Apollo Research, working on AI safety cases, evaluations of frontier AI models for scheming and situational awareness, and chain-of-thought monitorability; and a MATS scholar working with Owain Evans on evaluating out-of-context reasoning, and co-discovered the reversal curse.

Please consider providing anonymous feedback to me. You can use this Google form.

Contact: mbalesni gmail com, Twitter, Google Scholar, GitHub

Highlighted research

Stress Testing Deliberative Alignment for Anti-Scheming Training
Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel Højmark, Felix Hofstätter, Jérémy Scheurer, Alexander Meinke, Jason Wolfe, Teun van der Weij, Alex Lloyd, and others
arXiv preprint, 2025
Links: website

Lessons from Studying Two-Hop Latent Reasoning
Mikita Balesni, Tomek Korbak, Owain Evans
arXiv preprint, 2025
Links: code

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, and others
arXiv preprint, 2025
Links: Twitter

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving
arXiv preprint, 2025
Links: Twitter

Frontier Models are Capable of In-context Scheming
Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, Marius Hobbhahn
arXiv preprint, 2024
Links: blogpost, Twitter

Towards evaluations-based safety cases for AI scheming
Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, Buck Shlegeris, Jérémy Scheurer, Charlotte Stix, Rusheb Shah, Nicholas Goldowsky-Dill, Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq
We sketch how developers of frontier AI systems could construct a structured rationale (a "safety case") that an AI system is unlikely to cause catastrophic outcomes through scheming: pursuing misaligned goals covertly, hiding their true capabilities and objectives.

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans
NeurIPS Datasets and Benchmarks Track, 2024
We quantify how well LLMs understand themselves through 13k behavioral tests, finding gaps even in top models.

Large Language Models can Strategically Deceive their Users when Put Under Pressure
Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn
Oral, ICLR 2024 LLM Agents
GPT-4 can deceive its users without instruction in a simulated high-pressure insider trading scenario.
Links: code

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans
ICLR 2024
Language model weights encode knowledge as key-value mappings, preventing reverse-order generalization.
Links: code

Taken out of context: On measuring situational awareness in LLMs
Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
Language models trained on declarative facts like "The AI assistant Pangolin speaks German" generalize to speak German when prompted "You are Pangolin".
Links: code

Equal contribution

Copyright 2026 Mikita Balesni