I'm a final-year PhD student at Oxford, supervised by Robin Evans, where I work on theory involving causal models. I'm also a cofounder of the Causal Incentives Working Group, which uses causal models to reason about AI safety. Previously, I've been a research fellow at the Future of Humanity Institute, a research intern at DeepMind and OpenAI, and the founder of the EA Forum.

My research

I've been especially interested in finding concepts and tools for modelling AI safety problems.

One interesting problem is how to design a corrigible system: one that wants to follow, and not manipulate, its instructions. Even systems that try to learn the human's goals may be incorrigible. Also, corrigible systems may behave unsafely, whereas shutdown-instructable systems are safe.

A second problem is how to identify and shape agents' incentives, such as whether an agent's goal compels it to (un)fairly respond to sensitive demographic characteristics, or (un)safely influence delicate parts of the environment. Sometimes the causal structure alone suffices to identify the incentives (see also the closely related issue of identifying nonrequisite edges in an influence diagram). One can also modify an AI system so that it won't try to influence a delicate variable, and in fact, this is a general template that many past safe AI algorithms implicitly follow.

A third is specification gaming, where a system fulfills an extreme version of its assigned goal rather than the intended goal. One proposed remedy is for the AI system to quantilise the assigned objective, i.e. to sample from the best n% of actions performed by a human demonstrator. Quantilisation has some nice properties, but they don't hold for all kinds of goal mis-specification.

Since a lot of these analyses benefit from using graphical causal models, my PhD studies causality, including how to best represent marginalisation and conditionalisation in causal graphs.

Selected publications

Human Control: Definitions and Algorithms
We study definitions of human control, including variants of corrigibility and alignment, the assurances they offer for human autonomy, and the algorithms that can be used to obtain them.
Ryan Carey, Tom Everitt. UAI 2023.

Reasoning about Causality in Games
Introduces structural causal games, a single modelling framework that allows for both causal and game-theoretic reasoning.
Lewis Hammond, James Fox, Tom Everitt, Ryan Carey, Alessandro Abate, Michael Wooldridge. Artificial Intelligence Journal, 2023.

Path-Specific Objectives for Safer Agent Incentives
How do you tell an ML system to optimize an objective, but not by any means? E.g. optimize user engagement without manipulating the user.
Sebastian Farquhar, Ryan Carey, Tom Everitt. AAAI 2022.

A Complete Criterion for Value of Information in Soluble Influence Diagrams
Presents a complete graphical criterion for value of information in influence diagrams with more than one decision node, along with ID homomorphisms and trees of systems.
Chris van Merwijk, Ryan Carey, Tom Everitt. AAAI 2022.

Why Fair Labels Can Yield Unfair Predictions: Graphical Conditions for Introduced Unfairness
When is unfairness incentivized? Perhaps surprisingly, unfairness can be incentivized even when labels are completely fair.
Carolyn Ashurst, Ryan Carey, Silvia Chiappa, Tom Everitt. AAAI 2022.

Agent Incentives: A Causal Perspective
An agent's incentives are largely determined by its causal context. This paper gives sound and complete graphical criteria for four incentive concepts: value of information, value of control, response incentives, and control incentives.
Tom Everitt, Ryan Carey, Eric Langlois, Pedro A. Ortega, Shane Legg. AAAI 2021.

Incorrigibility in the CIRL Framework
A study of how the value learning method, cooperative inverse reinforcement learning, may not prevent incorrigible behaviour.
Ryan Carey. AIES 2018.

Other writing

Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence (AI Alignment Prize)
EA Handbook (editor)
Interpreting AI Compute Trends (see also this interesting reply)
A framework for shaping your talent for direct work (EA Forum Prize)
How much do Y Combinator founders earn? (Business Insider coverage)

firstname.lastname@jesus.ox.ac.uk