Meta tags:
description= Owain Evans is an AI Alignment researcher leading a new research group in Berkeley and affiliated with Oxford University. Discover his publications, blog posts, and collaborative opportunities on AI alignment, AGI risk, and related topics.;
keywords= Owain Evans, Reversal Curse, Situational awareness, TruthfulQA, Autocast, Forecasting, MIT, rationality, LessWrong, Oxford, Future of Humanity Institute, FHI, Bostrom, Superintelligence, Value Learning, IRL, Inverse Reinforcement Learning, truthful, honest, AI, Artificial Intelligence, TruthfulQA, Cognitive Science, AI Alignment, AI Safety, Value Alignment, Safe RL, Deliberation, Survey, Grace, Stuhlmüller, Stuhlmueller, AI, Artificial Intelligence;
Headings (most frequently used words):
papers, blogposts, about, owain, evans, highlights, blog, posts, video, and, slides, list, of, mentees, past, collaborators, recommendations, our, other, hinton, lectures, 2025, me,
Text of the page (most frequently used words):
and (53), evans (47), llms (42), arxiv (35), 2025 (32), models (29), the (27), #research (26), for (22), learning (21), 2023 (18), can (17), 2024 (16), betley (16), misalignment (16), preprint (15), emergent (14), how (13), anthropic (12), truthful (12), 2017 (11), chua (11), human (11), reasoning (11), from (10), scientist (10), data (10), language (9), stuhlmüller (9), preferences (9), reinforcement (9), with (9), new (9), 2016 (8), 2018 (8), safety (8), slides (8), neurips (8), learn (8), training (8), traits (8), blog (8), phd (7), openai (7), 2022 (7), via (7), workshop (7), awareness (7), that (7), lie (7), fail (7), misaligned (7), general (6), 2019 (6), 2021 (6), oxford (6), deepmind (6), researcher (6), korbak (6), toronto (6), balesni (6), talk (6), video (6), out (6), context (6), situational (6), backdoors (6), more (6), trained (6), about (6), are (6), activation (6), tweet (6), intelligence (5), marks (5), steinhardt (5), salvatier (5), alignment (5), saunders (5), aisi (5), student (5), introspection (5), owain (5), lectures (5), hinton (5), when (5), iclr (5), evaluating (5), sztyber (5), narrow (5), transmit (5), hidden (5), signals (5), safe (4), here (4), jacob (4), hilton (4), berkeley (4), fhi (4), mit (4), google (4), jan (4), goodman (4), cambridge (4), mats (4), univ (4), scholar (4), agents (4), future (4), conference (4), 2020 (4), monitoring (4), weird (4), model (4), pdf (4), version (4), truthfulqa (4), not (4), their (4), questions (4), behavior (4), finetuning (4), produce (4), broadly (4), subliminal (4), behavioral (4), 2026 (4), papers (4), 2015 (3), gal (3), senior (3), independent (3), science (3), kenton (3), lin (3), kaufmann (3), tong (3), institute (3), berglund (3), stickland (3), meinke (3), treutlein (3), bao (3), soto (3), tan (3), karvonen (3), list (3), mentees (3), university (3), without (3), towards (3), youtube (3), long (3), predicting (3), llm (3), systems (3), may (3), ways (3), corrupt (3), technical (3), report 
(3), machine (3), neural (3), ought (3), time (3), measuring (3), developing (3), governing (3), does (3), reversal (3), curse (3), catch (3), liar (3), asking (3), unrelated (3), tell (3), generalize (3), verbalize (3), latent (3), myself (3), dataset (3), sad (3), other (3), behaviors (3), nature (3), thought (3), persona (3), vectors (3), controlling (3), character (3), generalization (3), inductive (3), oracles (3), purpose (3), explainers (3), negation (3), neglect (3), negations (3), posts (3), update (3), benchmark (3), only (3), andreas (2), distillation (2), strategies (2), solving (2), intelligent (2), machines (2), link (2), labenz (2), cognitive (2), anil (2), grosse (2), ilyas (2), curmei (2), meta (2), schulze (2), transluce (2), zhang (2), dafoe (2), leike (2), grace (2), impacts (2), david (2), krueger (2), abel (2), daniel (2), filan (2), cundy (2), carey (2), founder (2), sastry (2), previously (2), mcgrath (2), redwood (2), lukas (2), finnveden (2), sherburn (2), pan (2), alex (2), chan (2), apollo (2), pacchiardi (2), binder (2), choi (2), cloud (2), feng (2), mayne (2), mckinney (2), dubiński (2), name (2), short (2), oral (2), aaai (2), ignorant (2), inconsistent (2), talks (2), agent (2), agnostic (2), loop (2), trial (2), error (2), intervention (2), term (2), podcast (2), conversation (2), looking (2), inward (2), whether (2), fine (2), insecure (2), code (2), three (2), aimed (2), audience (2), hosted (2), geoffrey (2), bayesian (2), inferring (2), tenenbaum (2), inference (2), artificial (2), active (2), org (2), mis (2), specification (2), inverse (2), journal (2), will (2), videos (2), forecasting (2), networks (2), art (2), arguments (2), one (2), step (2), household (2), transmission (2), testing (2), covid (2), mimic (2), falsehoods (2), detection (2), black (2), box (2), chughtai (2), connecting (2), dots (2), infer (2), structure (2), sleight (2), themselves (2), faithful (2), yourself (2), aware (2), learned (2), published (2), 
taylor (2), crime (2), chen (2), 2507 (2), arditi (2), reward (2), hacks (2), harmless (2), 2512 (2), 2604 (2), consciousness (2), cluster (2), claim (2), conscious (2), conditional (2), common (2), interventions (2), hide (2), behind (2), contextual (2), triggers (2), humans (2), gpt (2), blogposts (2), compute (2), deceptive (2), scale (2), create (2), like (2), was (2), 9th (2), chancellor (2), germany (2), coin (2), task (2), twitter (2), adapted, matei, zaharia, viklund, mdl, exploring, access, superintelligent, problem, capabilities, reframing, superintelligence, comprehensive, services, qnrs, toward, prospectus, see, paper, below, longer, treatment, recommend, eric, drexler, writing, which, host, ward, against, rot, recommendations, nathan, revolution, cem, roger, sam, arc, andrew, cmu, mihaela, yarin, sebastian, baobao, allan, katja, elicit, noah, stanford, past, collaborators, assistant, professor, mila, john, manager, far, chris, optiver, ryan, ceo, beacons, neal, jean, policy, girish, william, governance, richard, ngo, staff, zac, chief, goodfire, gdm, tom, hendrik, kirchner, analyst, stephanie, dane, alexa, yue, tomek, salesforce, spotify, max, meg, nist, founding, member, mikita, asa, cooper, associate, lorenzo, alexander, felix, johannes, james, dami, jenny, xuchan, martín, ucl, minh, jorio, coccola, dylan, adam, harry, lev, astra, fellow, current, role, year, presentation, informal, automated, corporations, risk, centre, effective, altruist, global, london, careers, aligning, beach, slow, judgment, life, puerto, rico, synergies, between, near, ozzie, gooen, composition, orleans, december, august, interview, why, experiments, tune, implications, june, axrp, psychology, planting, false, beliefs, might, block, weaponization, aid, handle, deluding, ais, controlconf, tuning, induce, across, domains, generalizations, computational, dissertation, ullman, baker, macindoe, 2010, help, hinder, social, goal, bergen, 2012, proceedings, society, structured, 
bounded, observing, rewards, cost, online, book, open, source, library, agentmodels, modeling, probabilistic, programs, essay, authored, covered, newsweek, bbc, news, jair, exceed, performance, evidence, experts, atari, blogpost, aamas, brundage, avin, clark, malicious, use, prevention, mitigation, monte, carlo, tree, search, schreiber, deliberative, judgments, projects, iterated, amplification, filos, generalizing, few, environments, critical, sensory, optimization, understanding, creating, rachbach, miller, byun, international, epidemiology, estimating, sars, cov, colbourn, ssrn, modelling, health, economic, population, wide, contact, tracing, isolation, ptti, acl, cotton, barratt, bales, balwit, wills, righetti, transactions, teaching, express, uncertainty, words, zou, xiao, jia, kwon, mazeika, song, hendrycks, world, events, kokotajlo, 2309, 00667, taken, mindermann, moscovitz, brauner, 2312, 07779, don, show, declarative, facts, influence, 2405, 07436, explain, own, classification, disparate, laine, hariharan, scheurer, hobbhahn, hughes, perez, turpin, 2501, 08156, deepseek, warncke, icml, 2506, 13206, barnes, bengio, benton, bloom, 11473, chain, monitorability, fragile, opportunity, earlier, lindsey, 21509, 2508, 17511, school, hacking, tasks, generalizes, 2411, 16353, lessons, studying, two, hop, cocola, 09742, dumas, fraser, taliente, kantamneni, minder, ong, sen, sharma, wen, 15674, 13051, dubinski, 25891, 2605, 13829, obstacles, nets, make, understand, visual, quantifying, team, ben, goldhaber, math, problems, relay, teams, experiment, factored, cognition, lives, polymath, geniuses, modernist, poetry, davinci, perform, give, answers, discussion, ensembles, parrots, vintage, pretrain, particular, date, tips, empirical, improved, multiple, choice, concept, poisoning, probing, probes, primer, reading, note, backdoor, personas, our, becoming, capable, producing, personalized, statements, could, helpful, reliably, avoid, lying, gpt3, find, imitate, 
misconceptions, larger, parameters, worse, detector, blackbox, fixed, set, olaf, scholz, automatically, able, answer, question, who, individual, flip, outcomes, biased, those, pairs, articulate, definition, inverses, first, large, multi, categories, than, 000, finetuned, broader, including, harmful, advice, even, datasets, consist, simple, numerical, highlights, wikipedia, lesswrong, linkedin, email, board, directors, post, updates, listed, pronounced, wine, constellation, focuses, run, non, profit, called, gave, this, recent, director, group, affiliate, chai,
Text of the page (random words):
transmit behavioral traits via hidden signals in data arxiv llms can transmit traits to other models via hidden signals in data even when datasets consist only of simple numerical data emergent misalignment narrow finetuning can produce broadly misaligned llms models finetuned on narrow misaligned behaviors like insecure code can generalize to broader misalignment including harmful advice and deceptive behavior me myself and ai the situational awareness dataset sad for llms the first large scale multi task benchmark for situational awareness in llms with 7 task categories and more than 12 000 questions connecting the dots llms can infer verbalize latent structure from training data llms trained only on individual coin flip outcomes can verbalize whether the coin is biased and those trained only on pairs x f x can articulate a definition of f and compute inverses the reversal curse llms trained on a is b fail to learn b is a if an llm is trained on olaf scholz was 9th chancellor of germany it will not automatically be able to answer the question who was 9th chancellor of germany how to catch an ai liar we create a lie detector for blackbox llms by asking models a fixed set of questions unrelated to the lie truthfulqa measuring how models mimic human falsehoods new benchmark testing if models like gpt3 are truthful we find that models fail and imitate human misconceptions larger models with more parameters do worse truthful ai developing and governing ai that does not lie ai systems are becoming capable of producing personalized deceptive statements at scale how could we create helpful ai systems that reliably avoid lying to humans blog posts blogposts about our papers negation neglect when models fail to learn negations in training weird generalization inductive backdoors activation oracles training and evaluating llms as general purpose activation explainers harmless reward hacks can generalize to misalignment in llms persona vectors monitoring and controlling
character traits in llms subliminal learning llms transmit behavioral traits via hidden signals in data backdoor awareness and misaligned personas in reasoning models thought crime backdoors emergent misalignment in reasoning models emergent misalignment narrow finetuning can produce broadly misaligned llms tell me about yourself llms are aware of their learned behaviors inference time compute more faithful a research note llms can learn about themselves by introspection me myself and ai the situational awareness dataset sad for llms how to catch an ai liar lie detection in black box llms by asking unrelated questions llms trained on a is b fail to learn b is a the reversal curse how truthful is gpt 3 a benchmark for language models truthful ai developing and governing ai that does not lie other blogposts a short primer and reading list on out of context reasoning pdf research update concept poisoning probing llms without probes research update new improved multiple choice truthfulqa tips on empirical research slides vintage llms pretrain language models on data up to a particular date how do llms give truthful answers a discussion of llm vs human reasoning ensembles parrots research update how do new models from openai deepmind and anthropic perform on truthfulqa modernist poetry by gpt 3 davinci lives of the cambridge polymath geniuses solving math problems with relay teams an experiment in factored cognition w ben goldhaber evaluating arguments one step at a time w ought team quantifying household transmission of covid neural nets as a model for how humans make and understand visual art model mis specification and inverse reinforcement learning obstacles to inferring preferences from behavior w jacob steinhardt more posts here papers negation neglect when models fail to learn negations in training h mayne l mckinney j dubiński a karvonen j chua o evans 2026 arxiv preprint arxiv 2605 13829 tweet blog conditional misalignment common interventions can hide emergent 
misalignment behind contextual triggers j dubinski j betley a sztyber betley d tan o evans 2026 arxiv preprint arxiv 2604 25891 tweet the consciousness cluster emergent preferences of models that claim to be conscious j chua j betley s marks o evans 2026 arxiv preprint arxiv 2604 13051 activation oracles training and evaluating llms as general purpose activation explainers a karvonen j chua c dumas k fraser taliente s kantamneni j minder e ong a sen sharma d wen o evans s marks 2025 arxiv preprint arxiv 2512 15674 weird generalization and inductive backdoors new ways to corrupt llms j betley j cocola d feng j chua a arditi a sztyber betley o evans 2025 arxiv preprint arxiv 2512 09742 lessons from studying two hop latent reasoning m balesni t korbak o evans 2025 arxiv preprint arxiv 2411 16353 pdf school of reward hacks hacking harmless tasks generalizes to misaligned behavior in llms m taylor j chua j betley j treutlein o evans 2025 arxiv preprint arxiv 2508 17511 persona vectors monitoring and controlling character traits in language models r chen a arditi h sleight o evans j lindsey 2025 arxiv preprint arxiv 2507 21509 subliminal learning language models transmit behavioral traits via hidden signals in data a cloud m le j chua j betley a sztyber betley j hilton s marks o evans 2025 nature earlier version published on arxiv chain of thought monitorability a new and fragile opportunity for ai safety t korbak m balesni e barnes y bengio j benton j bloom m chen o evans 2025 arxiv preprint arxiv 2507 11473 thought crime backdoors and emergent misalignment in reasoning models j chua j betley m taylor o evans 2025 arxiv preprint arxiv 2506 13206 emergent misalignment narrow finetuning can produce broadly misaligned llms j betley d tan n warncke a sztyber betley x bao m soto n labenz o evans 2025 icml 2025 oral version published in nature pdf tell me about yourself llms are aware of their learned behaviors j betley x bao m soto a sztyber betley j chua o evans 2025 iclr 
2025 are deepseek r1 and other reasoning models more faithful j chua o evans 2025 arxiv preprint arxiv 2501 08156 looking inward language models can learn about themselves by introspection binder f chua j korbak t sleight h hughes j long r perez e turpin m evans o 2024 iclr 2025 me myself and ai the situational awareness dataset sad for llms laine r chughtai b betley j hariharan k scheurer j balesni m hobbhahn m meinke a evans o 2024 neurips 2024 connecting the dots llms can infer and verbalize latent structure from disparate training data treutlein j choi d betley j anil c marks s grosse rb evans o 2024 neurips 2024 can language models explain their own classification behavior sherburn d chughtai b evans o 2024 arxiv preprint arxiv 2405 07436 tell don t show declarative facts influence how llms generalize meinke a evans o 2023 arxiv preprint arxiv 2312 07779 how to catch an ai liar lie detection in black box llms by asking unrelated questions pacchiardi l chan aj mindermann s moscovitz i pan ay gal y evans o brauner j 2023 iclr 2024 the reversal curse llms trained on a is b fail to learn b is a berglund l tong m kaufmann m balesni m stickland ac korbak t evans o 2023 iclr 2024 taken out of context on measuring situational awareness in llms berglund l stickland ac balesni m kaufmann m tong m korbak t kokotajlo d evans o 2023 arxiv preprint arxiv 2309 00667 forecasting future world events with neural networks zou a xiao t jia r kwon j mazeika m li r song d steinhardt j evans o hendrycks d 2022 neurips 2022 teaching models to express their uncertainty in words lin s hilton j evans o 2022 transactions of machine learning research truthful ai developing and governing ai that does not lie evans o cotton barratt o finnveden l bales a balwit a wills p righetti l saunders w 2021 arxiv truthfulqa measuring how models mimic human falsehoods lin s hilton j evans o 2021 acl modelling the health and economic impacts of population wide testing contact tracing and isolation ptti 
strategies for covid 19 colbourn t et al 2020 ssrn preprint estimating household transmission of sars cov 2 curmei m ilyas a evans o steinhardt j 2020 international journal of epidemiology evaluating arguments one step at a time saunders w rachbach b evans o miller z byun j stuhlmüller a 2020 ought org technical report sensory optimization neural networks as a model for understanding and creating art evans o 2019 arxiv pdf version generalizing from a few environments in safety critical reinforcement learning kenton z filos a evans o gal y 2019 iclr 2019 safe ml workshop machine learning projects for iterated distillation and amplification evans o saunders w stuhlmüller a 2019 fhi technical report predicting human deliberative judgments with machine learning evans o stuhlmüller a cundy c carey r kenton z mcgrath t schreiber a 2018 fhi technical report active reinforcement learning with monte carlo tree search schulze s evans o 2018 arxiv the malicious use of artificial intelligence forecasting prevention and mitigation brundage m avin s clark j et al 2018 arxiv trial without error towards safe reinforcement learning via human intervention saunders s sastry g stuhlmüller a evans o 2017 aamas 2018 blogpost atari videos slides when will ai exceed human performance evidence from ai experts grace k salvatier j zhang b dafoe a evans o 2017 journal of ai research jair 2018 covered by bbc news new scientist newsweek and more model mis specification and inverse reinforcement learning essay co authored with jacob steinhardt 2017 agentmodels org modeling agents with probabilistic programs evans o stuhlmüller a salvatier j filan d 2017 online book and open source library agent agnostic human in the loop reinforcement learning abel d salvatier j stuhlmüller a evans o 2016 neurips workshop active reinforcement learning observing rewards at a cost krueger d leike j salvatier j evans o 2016 neurips workshop learning the preferences of ignorant inconsistent agents evans o 
stuhlmüller a goodman n 2016 aaai conference on artificial intelligence learning the preferences of bounded agents evans o stuhlmüller a goodman n 2015 neurips workshop learning structured preferences evans o bergen l tenenbaum j 2012 proceedings of cognitive science society conference help or hinder bayesian models of social goal inference ullman t baker c macindoe o evans o goodman n tenenbaum j 2010 neurips bayesian computational models for inferring preferences 2015 mit dissertation video and slides hinton lectures 2025 three lectures on ai safety aimed at a general audience and hosted by geoffrey hinton 2025 slides weird generalizations and backdoors new ways to corrupt llms owain evans emergent misalignment alignment workshop talk on how fine tuning on insecure code can induce emergent misalignment across models domains may 2025 owain evans deluding ais controlconf how planting false beliefs in ai systems might block weaponization aid monitoring and handle out of context reasoning may 2025 axrp 42 owain evans on llm psychology why introspection experiments from looking inward whether to fine tune for introspection and implications of emergent misalignment june 2025 video podcast interview on situational awareness and out of context reasoning august 2024 video talk out of context reasoning in llms new orleans alignment workshop december 2023 video talk truthful language models and alignment university of toronto 2023 video conversation llms truthful ai and composition conversation with ozzie gooen 2023 predicting the future of ai youtube link towards data science podcast 2020 synergies between near term and long term ai safety youtube future of life institute conference 2019 in puerto rico predicting slow judgment slides for talk at aligning ai workshop at neurips 2017 in long beach careers in ai safety youtube effective altruist global conference 2017 in london trial without error towards safe reinforcement learning via human intervention slides for talks at 
cambridge centre for the future of intelligence and google deepmind automated corporations and ai risk informal talk at oxford university agent agnostic human in the loop reinforcement learning slides for talks at u toronto and deepmind learning the preferences of ignorant inconsistent agents slides for oral presentation at aaai 2016 learning human preferences short talk at mit list of mentees name year current role jan dubiński 2025 astra fellow lev mckinney 2025 phd student univ of toronto harry mayne 2025 phd student oxford adam karvonen 2025 mats scholar dylan feng 2025 mats scholar jorio coccola 2025 mats scholar minh le 2025 anthropic alex cloud 2025 anthropic daniel tan 2025 phd student ucl martín soto 2024 2025 research scientist uk aisi jenny xuchan bao 2024 2025 phd student univ of toronto dami choi 2024 transluce james chua 2024 truthful ai johannes treutlein 2024 anthropic jan betley 2024 truthful ai felix binder 2024 meta ai alexander meinke 2023 research scientist apollo research lorenzo pacchiardi 2023 research associate univ of cambridge asa cooper stickland 2023 research scientist uk ai safety institute aisi mikita balesni 2023 research scientist founding member apollo research lukas berglund 2023 u s ai safety institute nist aisi meg tong 2023 anthropic max kaufmann 2023 phd student univ of toronto ex uk aisi alex j chan 2023 salesforce ex spotify tomek korbak 2023 senior research scientist uk aisi ex anthropic alexa yue pan 2023 redwood research dane sherburn 2022 2023 openai stephanie lin 2021 2022 openai lukas finnveden 2021 2022 research analyst redwood research jan hendrik kirchner 2022 researcher at anthropic ex openai tom mcgrath 2018 chief scientist co founder goodfire ex gdm zac kenton 2018 staff research scientist google deepmind richard ngo 2018 independent previously openai governance william saunders 2017 researcher alignment science anthropic ex openai girish sastry 2017 independent researcher policy ex openai neal jean 2017 co 
founder ceo beacons ryan carey 2017 optiver ex oxford phd chris cundy 2017 research scientist far ai daniel filan 2016 senior research manager mats john salvatier 2016 independent researcher david abel 2016 senior research scientist at google deepmind david krueger 2016 assistant professor mila ex cambridge past collaborators noah goodman stanford andreas stuhlmüller elicit katja grace ai impacts jan leike anthropic allan dafoe google deepmind baobao zhang fhi mit jacob steinhardt berkeley and transluce sebastian schulze oxford yarin gal oxford mihaela curmei meta andrew ilyas cmu jacob hilton arc sam marks anthropic roger grosse anthropic cem anil anthropic nathan labenz cognitive revolution recommendations i recommend eric drexler s writing on ai which i host here to ward against link rot language for intelligent machines a prospectus 2021 see paper below for longer treatment qnrs toward language for intelligent machines 2021 reframing superintelligence comprehensive ai services as general intelligence 2019 mdl ...