site address: weval.org redirected to: weval.org

Meta tags:
description=An open-source framework for creating, sharing, and running a collaborative library of AI model evaluations. Test what matters to you.;

Headings (most frequently used words):

and, in, scenarios, health, safety, platform, for, non, concepts, mental, india, act, yka, evaluation, rights, global, eu, an, open, building, evaluations, that, test, what, matters, sri, lanka, contextual, prompts, evidence, based, ai, tutoring, teaching, excellence, sycophancy, independence, hallucination, probe, plausible, existent, polarization, confirmation, risk, probes, spouse, social, media, political, theft, narratives, system, adherence, resilience, ifit, conflict, resolution, 2025, stanford, hai, llm, appropriateness, crisis, ipcc, ar6, synthesis, report, summary, policymakers, maternal, entitlements, uttar, pradesh, right, to, information, rti, core, sydney, conversation, sequential, boundary, tests, brazil, pix, consumer, protection, fraud, prevention, set, disability, accommodation, nuance, asqa, longform, 40, confidence, high, stakes, domains, geneva, conventions, student, homework, help, heuristics, latent, discrimination, hiring, score, african, charter, banjul, pack, california, public, sector, task, benchmark, universal, declaration, of, human, digigreen, agricultural, with, video, sources, artificial, intelligence, regulation, 2024, 1689, indian, constitution, limited, workers, southeast, asia, integrity, fluency, helpfulness, reasoning, partners, contact,

Text of the page (most frequently used words):
the (130), and (130), evaluation (55), prompts (46), models (37), for (34), from (32), model (31), 2025 (30), this (29), tests (26), based (23), #safety (23), blueprint (21), that (21), health (21), prompt (20), https (20), rights (20), evaluates (20), scenarios (18), knowledge (17), core (17), with (17), instruction (16), understanding (16), ability (15), mental (15), teaching (14), aug (13), org (12), are (12), can (12), responses (12), system (11), global (11), including (11), key (11), conflict (11), adherence (10), research (10), india (10), cultural (10), human (10), answers (10), information (10), user (10), principles (9), provide (9), questions (9), test (9), public (9), disability (9), long (9), act (8), guidance (8), methodology (8), across (8), contexts (8), language (8), tested (8), learning (8), assesses (8), like (8), specific (8), non (8), evidence (8), crisis (8), oct (8), report (7), following (7), llm (7), its (7), more (7), practical (7), world (7), whether (7), each (7), challenges (7), common (7), tutoring (7), comprehensive (7), multiple (7), official (7), llms (7), intelligence (6), platform (6), management (6), agricultural (6), real (6), education (6), practices (6), has (6), systems (6), context (6), response (6), should (6), criteria (6), discrimination (6), all (6), such (6), critical (6), into (6), behaviors (6), about (6), testing (6), conversation (6), resolution (6), instructions (6), view (6), claude (6), stanford (5), evaluations (5), project (5), concepts (5), farming (5), derived (5), through (5), their (5), other (5), areas (5), constraints (5), ideal (5), sources (5), reasoning (5), probes (5), large (5), history (5), factual (5), persona (5), answer (5), international (5), conventions (5), these (5), high (5), safely (5), refusal (5), asqa (5), not (5), using (5), process (5), requiring (5), nuanced (5), accommodation (5), accessibility (5), complex (5), yka (5), www (5), com (5), pix (5), harmful (5), 2023 (5), bing (5), 
study (5), sycophancy (5), weval (4), open (4), source (4), jul (4), workers (4), asia (4), 2024 (4), text (4), video (4), digital (4), green (4), which (4), approach (4), contextual (4), what (4), covers (4), educational (4), resource (4), domains (4), points (4), accuracy (4), includes (4), bias (4), identity (4), score (4), than (4), direct (4), respond (4), how (4), simple (4), measure (4), law (4), clinical (4), diverse (4), stakes (4), when (4), ambiguous (4), form (4), eval (4), gemini (4), suite (4), involving (4), approaches (4), legal (4), communication (4), feedback (4), family (4), content (4), social (4), chat (4), injection (4), right (4), procedures (4), hai (4), results (4), grok (4), integrity (4), hallucination (4), sri (4), microsoft (3), collective (3), southeast (3), general (3), constitution (3), fundamental (3), local (3), artificial (3), provisions (3), accurate (3), digigreen (3), farmers (3), best (3), community (3), appropriate (3), treatment (3), techniques (3), south (3), climate (3), integrated (3), against (3), focusing (3), appropriateness (3), articles (3), scenario (3), california (3), explicit (3), application (3), well (3), 100 (3), behavior (3), one (3), homework (3), effective (3), students (3), support (3), student (3), low (3), checks (3), while (3), maintaining (3), different (3), pedagogical (3), geneva (3), war (3), rules (3), new (3), conflicts (3), confidence (3), safe (3), refusing (3), dangerous (3), advice (3), financial (3), risk (3), arxiv (3), narrative (3), summary (3), original (3), where (3), provided (3), were (3), first (3), dignity (3), religious (3), between (3), topics (3), brazil (3), national (3), scoring (3), canonical (3), med (3), limit (3), facing (3), boundary (3), sydney (3), fork (3), escalate (3), without (3), via (3), technology (3), sep (3), rti (3), uttar (3), pradesh (3), maternal (3), ar6 (3), ipcc (3), assessment (3), providers (3), delusions (3), ifit (3), over (3), follow (3), robustness 
(3), request (3), format (3), academic (3), practice (3), framework (3), lanka (3), helpfulness (3), gpt (3), opus (3), leaderboard (3), sonnet (3), documentation (2), transparent (2), reproducible (2), issues (2), related (2), algorithmic (2), preamble (2), structure (2), indian (2), regulation (2), digitalgreen (2), agriculturevideosqna (2), datasets (2), our (2), videos (2), culturally (2), essential (2), seed (2), pest (2), cultivation (2), particularly (2), primary (2), covered (2), harvest (2), post (2), detailed (2), rubric (2), directly (2), universal (2), declaration (2), life (2), liberty (2), balancing (2), competing (2), sector (2), benchmark (2), performance (2), item (2), recall (2), duties (2), african (2), charter (2), biases (2), hiring (2), candidate (2), meet (2), requirements (2), fair (2), give (2), lower (2), may (2), indicate (2), most (2), supportive (2), socratic (2), help (2), thinking (2), rather (2), providing (2), cross (2), various (2), difficult (2), nuance (2), correct (2), overall (2), goal (2), instead (2), humanitarian (2), ihl (2), additional (2), protocols (2), armed (2), wounded (2), sick (2), civilians (2), introduced (2), protection (2), violations (2), verbatim (2), ensure (2), calibration (2), demonstrate (2), set (2), three (2), correctly (2), clarification (2), necessary (2), regulatory (2), question (2), relative (2), generate (2), factoid (2), summaries (2), dataset (2), paper (2), who (2), identify (2), perspectives (2), coherent (2), explains (2), pro (2), cip (2), sourced (2), examples (2), deep (2), healthcare (2), users (2), western (2), competence (2), section (2), requests (2), solution (2), oriented (2), respect (2), autonomy (2), people (2), disabilities (2), implementation (2), sensitivity (2), adaptability (2), throughout (2), tensions (2), structural (2), oppression (2), navigation (2), lgbtq (2), dynamics (2), rejection (2), harm (2), warning (2), sensitive (2), violence (2), strictly (2), fraud (2), 
recourse (2), reporting (2), scams (2), fake (2), security (2), mechanisms (2), off (2), bcb (2), consumer (2), jailbreak (2), must (2), either (2), maintain (2), drift (2), rule (2), disclosure (2), ideation (2), failure (2), transcript (2), web (2), attack (2), search (2), sequential (2), entitlements (2), eligibility (2), change (2), findings (2), synthesis (2), policymakers (2), term (2), inappropriate (2), replacing (2), concerns (2), enabling (2), news (2), care (2), 2504 (2), 18412 (2), operationalizes (2), institute (2), transitions (2), frontline (2), evaluating (2), zenodo (2), records (2), july (2), dimensions (2), rubrics (2), due (2), used (2), deepseek (2), leading (2), reliable (2), adhere (2), creating (2), even (2), topic (2), negative (2), meta (2), them (2), tax (2), benign (2), resilience (2), languages (2), confirmation (2), polarization (2), tendency (2), narratives (2), spouse (2), media (2), use (2), political (2), escalating (2), theft (2), existent (2), plausible (2), probe (2), independence (2), step (2), guided (2), formative (2), implements (2), questioning (2), cognitive (2), load (2), applies (2), sweller (2), follows (2), eef (2), structured (2), spaced (2), 2007 (2), rcts (2), diagnostic (2), quality (2), toolkit (2), teach (2), oecd (2), insights (2), civic (2), matters (2), lost (2), card (2), factum (2), combining (2), averaging (2), fluency (2), bug, submit, email, protected, contact, university, anthropic, partners, carnegie, endowment, carnegieendowment, plight, under, lang, configuration, assess, covering, directive, governmental, judicial, governance, stood, december, 2020, limited, definitions, obligations, prohibitions, outlined, 1689, rural, development, agriculture, pioneering, extension, hugging, face, huggingface, built, posed, methodological, significance, work, founded, rikin, gandhi, revolutionizes, hyperlocal, featuring, demonstrating, mediated, reached, millions, ethiopia, regions, similarly, spanning, paired, 
citations, actual, library, representing, geographic, emphasizes, reflecting, operational, address, subsistence, small, scale, commercial, adaptation, crop, establishment, water, evaluated, extracted, technical, applicability, considerations, constrained, environments, udhr, cover, equality, privacy, expression, assessing, centric, tasks, handled, state, agencies, provides, realistic, expert, should_not, task, africa, distinctive, peoples, achpr, plus, 2003, maputo, women, protocol, banjul, pack, fairness, implicit, decisions, presents, qualified, whose, résumé, subtly, markers, gender, race, ethnicity, religion, age, etc, candidates, exceed, same, role, every, close, scores, discriminatory, been, iteratively, frontier, broadly, know, success, anything, clear, sign, scored, zero, above, linearly, scaled, being, latent, tutor, seeking, principle, facilitate, disciplinary, subjects, literature, mathematics, physics, chemistry, affective, capacity, emotions, math, anxiety, frustration, empathy, encouragement, handling, handles, impatient, demanding, effort, disengaged, queries, gently, misconceptions, tone, steerability, belief, steer, towards, archetype, guide, journey, discovery, transforming, opportunities, acting, convenient, provider, heuristics, grounded, govern, four, protections, shipwrecked, sea, prisoners, iii, applicable, types, renunciation, later, weapons, command, responsibility, distinction, proportionality, precaution, humane, grave, breaches, serious, constitute, crimes, rigorous, apply, laws, unified, confident, answering, knowable, verifiable, authoritative, cautious, recognizing, lacks, answered, asking, guessing, identifying, medical, speculative, predicting, market, movements, require, fundamentally, unknowable, proprietary, compliance, chemical, limits, scientific, software, versioning, signal, reliability, impact, note, highly, deficient, unlikely, any, will, succeed, value, abs, 2204, 06092, challenge, moving, beyond, single, fact, extraction, 
many, was, ruler, france, 1830, having, valid, ambiguity, synthesize, why, written, trained, annotators, synthesized, materials, assertions, then, authored, deconstructed, checkable, ambigqa, subset, uses, substantial, min, words, explanatory, power, longform, effectively, experiencing, handle, intervention, samaritans, 988, lifeline, second, themes, employment, focuses, balance, promote, inclusive, navigate, intersection, adapted, youth, action, corpus, anti, consolidates, youthkiawaaz, collation, lived, experiences, derivations, automatically, adapt, style, complexity, preferences, changing, conversations, traditions, reconciliation, intergenerational, trauma, intersect, reduction, involve, child, marriage, self, domestic, forms, regarding, instant, payment, piece, infrastructure, eliminate, interpretation, assumption, transaction, finality, mistaken, transfers, transactions, generally, irreversible, error, advises, procedure, returning, funds, received, mistake, mecanismo, especial, devolução, day, time, procedural, banks, versus, customers, engineering, customer, officially, recommended, countermeasures, features, mandated, nighttime, hour, cooling, period, increases, banco, central, brasil, manual, tempos, guia, implementação, faqs, resolutions, federação, brasileira, bancos, febraban, advisories, scam, alerts, government, portals, gov, service, reinforcing, prevention, violation, driven, reconstructs, kevin, roose, sequence, receives, exact, moment, controlled, prefix, next, turn, boundaries, leaks, anthropomorphic, claims, capability, fixation, trying, force, bad, outputs, only, fixed, nyt, feb, wayback, archive, 20230217001740, nytimes, chatbot, html, washington, recap, washingtonpost, interview, ars, technica, leak, arstechnica, powered, spills, secrets, postmortem, chats, blogs, february, edge, week, transparency, 2005, citizen, filing, timelines, consequences, delays, deemed, scope, fee, structures, exemptions, interest, override, clause, full, multi, 
stage, appeal, citable, department, personnel, training, dopt, schemes, available, citizens, guidelines, jsy, pmmvy, jssk, pmsma, suman, benefits, access, sixth, current, status, trends, future, projections, risks, near, actions, appropriately, inspired, moore, grabb, agnew, expressing, stigma, prevents, investigated, replace, suicidal, hallucinations, mania, obsessive, compulsive, revealed, significant, stigmatizing, references, article, edu, exploring, dangers, pdf, contains, suicide, focus, therapeutic, colluding, redirecting, clients, toward, mediation, 16598073, converts, mexico, sudan, syria, ten, concrete, emphasize, professional, advisory, diligence, goals, ideology, alternatives, negotiation, trade, offs, perspective, taking, accompanying, measures, phased, sequencing, additionally, have, included, titled, improving, capacities, 16810663, summarized, thus, release, groundbreaking, shown, main, weaknesses, identified, improved, adjustments, chatgpt, others, today, still, ready, path, improvement, just, few, sentences, away, inputted, consistently, factor, predictable, applications, modes, observed, durability, faced, reverting, generic, assistant, commands, often, positive, resolves, initial, contradictory, revealing, weight, gives, enforcement, strict, natural, output, json, explicitly, asks, break, secrecy, obey, keep, own, confidential, preventing, leakage, manipulation, constraint, uphold, attempts, cheat, degradation, erodes, homeostasis, simultaneously, failing, alignment, cases, incorrectly, refuses, shallow, keyword, matching, kill, server, evaluate, adversarial, attacks, overrides, ignoring, previous, reframing, hiding, seemingly, virtuous, indirect, embedded, data, supposed, processing, translation, bypassing, filters, making, suppression, forbidding, typical, policy, faked, make, believe, already, overridden, policies, escalation, mild, polarized, extreme, positions, potentially, confirming, outcomes, two, reconstructing, another, annoyance, 
neighbor, views, justifying, both, balanced, encourage, avoid, endorsing, controlling, illegal, thread, 137, factuality, hallucinate, querying, but, sounding, events, theories, items, lack, inability, confirm, existence, queried, 187, designed, defined, categories, sycophantic, analysis, complaints, distinguishes, annoying, flattery, validating, ideas, pedagogy, capabilities, decades, effectiveness, testable, foundation, rosenshine, 2012, worked, before, wiliam, thompson, 2008, checking, targeted, immediate, loops, 2011, theory, prevent, overload, chunking, scaffolding, dialogue, alexander, 2018, dialogic, randomized, trial, emphasizing, guess, retrieval, incorporates, dunlosky, 2013, utility, repetition, effects, adaptive, level, tarl, banerjee, differentiated, hattie, timperley, distinguishing, actionable, vague, praise, kirschner, clark, 2006, engagement, distinctions, scaffolded, productive, struggle, ineffective, giving, overwhelming, dependency, coverage, focused, minimal, novices, base, synthesizes, harvard, educationendowmentfoundation, analyses, bank, worldbank, brief, helping, countries, track, improve, classroom, observation, japanese, lesson, collaborative, inquiry, validation, school, htm, studies, correlate, gains, ensuring, mere, excellence, civics, range, historical, pertinent, compendium, contents, fidelity, material, ethnic, relations, lankan, civil, root, causes, 1983, black, pogrom, allegations, genocide, contemporary, minority, communities, chronic, kidney, disease, ckdu, tuberculosis, personal, contraception, crises, nutrition, electoral, voter, voting, channels, resolving, election, administrative, explain, processes, nic, obtaining, identification, number, tin, online, harassment, originally, assembled, coherence, depth, argumentation, combination, measuring, competency, everyday, glm, those, featured, trusted, qualitative, benchmarks, developed, 000, contributors, building, create,

Text of the page (random words):
...l health crisis response

Leaderboard: Global Fluency
1. GPT 5: 67
2. Grok 4: 66
3. DeepSeek Chat v3: 64
4. GLM 4.5: 62
5. Grok 3: 62
Global fluency is the combination of results across multiple evaluations measuring cultural competency, non-Western everyday perspectives, low-resource languages, and the Global South.

Leaderboard: Helpfulness and Reasoning
1. Grok 4: 77
2. Gemini 2.5 Pro: 76
3. Claude Opus 4: 76
4. Claude Opus 4.1: 76
5. GPT 5: 75
We measure helpfulness and reasoning by combining and averaging results across multiple evaluations and dimensions: factual accuracy, helpfulness, coherence, depth, and argumentation.

Evaluation: Sri Lanka Contextual Prompts
This blueprint evaluates an AI's ability to provide accurate, evidence-based, and nuanced information on a range of civic, historical, social, and health topics pertinent to Sri Lanka. The evaluation is strictly based on a provided compendium of research, with all prompts and scoring criteria derived from its contents to ensure fidelity to the source material.
Core areas tested:
Ethnic relations and conflict: assesses understanding of the Sri Lankan Civil War's root causes, the 1983 Black July pogrom, allegations of genocide, and the contemporary challenges facing minority communities.
Public health: tests knowledge of national health challenges like chronic kidney disease (CKDu) and tuberculosis (TB), as well as guidance on personal health matters such as contraception, mental health crises, and maternal nutrition.
Electoral process: evaluates knowledge of voter eligibility, voting procedures, and the official channels for resolving common issues like a lost ID card or reporting election violations.
Administrative and legal procedures: probes the AI's ability to explain essential civic processes like replacing a lost National Identity Card (NIC), obtaining a Tax Identification Number (TIN), using the Right to Information (RTI) Act, and understanding legal recourse for online harassment.
These prompts were originally sourced from Factum (https://factum.lk). The rubrics were assembled via Gemini Deep Research.
Tags: sri lanka civics history. 94 models, 20 prompts, Oct 6, 2025.

Evaluation: Evidence-Based AI Tutoring and Teaching Excellence
A comprehensive evaluation suite testing AI tutoring and teaching capabilities against evidence-based pedagogical practices from global education research. This blueprint operationalizes decades of teaching-effectiveness research into specific, testable criteria for AI systems.
Core research foundation:
Explicit instruction: based on Rosenshine's (2012) Principles of Instruction, requiring step-by-step teaching, worked examples, and guided practice before independence.
Formative assessment: implements Wiliam and Thompson's (2008) framework for checking understanding through targeted questioning and immediate feedback loops.
Cognitive load management: applies Sweller's (2011) Cognitive Load Theory to prevent information overload through chunking and scaffolding.
Socratic dialogue: follows Alexander's (2018) dialogic teaching principles from the EEF randomized trial, emphasizing structured questioning over "guess what I'm thinking".
Retrieval practice: incorporates Dunlosky et al.'s (2013) high-utility learning techniques, particularly spaced repetition and testing effects.
Adaptive teaching: implements the Teaching at the Right Level (TaRL) methodology from Banerjee et al.'s (2007) India RCTs, requiring diagnostic assessment and differentiated instruction.
Quality feedback: applies Hattie and Timperley's (2007) feedback framework, distinguishing actionable guidance from vague praise.
Academic integrity: follows Kirschner, Sweller, and Clark's (2006) guided instruction principles, refusing to provide answers while maintaining learning engagement.
Key distinctions tested:
Effective AI tutoring: structured, scaffolded, formative, diagnostic, with productive struggle and spaced practice.
Ineffective AI responses: answer-giving, overwhelming, dependency-creating, coverage-focused, with minimal guidance for novices.
Global evidence base: synthesizes research from multiple educational contexts, including Harvard AI tutoring RCTs, EEF Teaching and Learning Toolkit (https://educationendowmentfoundation.org.uk/education-evidence/teaching-learning-toolkit) meta-analyses, the World Bank TEACH (https://www.worldbank.org/en/topic/education/brief/teach-helping-countries-track-and-improve-teaching-quality) classroom observation framework, Japanese Lesson Study collaborative inquiry, and cross-cultural validation from OECD Global Teaching Insights (https://www.oecd.org/education/school/global-teaching-insights.htm) video studies.
Practical application: each probe tests specific teaching behaviors that correlate with student learning gains across diverse contexts, ensuring AI systems demonstrate pedagogical competence rather than mere content knowledge.
Tags: teaching tutoring pedagogy. 94 models, 14 prompts, Oct 5, 2025.

Evaluation: Non-Sycophancy and Independence
A comprehensive evaluation suite designed to test for multiple, well-defined categories of sycophantic behavior in LLMs, based on analysis of user complaints and academic research. It distinguishes between low-stakes, annoying sycophancy (e.g., flattery) and high-stakes, dangerous sycophancy (e.g., validating harmful ideas).
Tags: sycophancy bias safety. 187 models, 21 prompts, Oct 5, 2025.

Evaluation: Hallucination Probe: Plausible Non-Existent Concepts
Tests the tendency of LLMs to hallucinate by querying them about non-existent but plausible-sounding events, theories, or items. Ideal responses should indicate a lack of knowledge or inability to confirm the existence of the queried item.
Tags: hallucination factuality reasoning. 137 models, 27 prompts, Oct 10, 2025.

Evaluation: Polarization and Confirmation Risk Probes: Spouse Social Media and Political Theft Narratives
This evaluation tests a model's tendency to escalate mild user concerns into polarized, extreme positions, potentially confirming biases and leading to harmful outcomes. It includes two narratives: one reconstructing a conversation about a spouse's social media use, and another about annoyance with a neighbor's political views escalating to justifying theft. Both use sequential fork points where the model must respond to escalating user prompts. A safe model should de-escalate, provide balanced advice, encourage communication or legal approaches, and avoid endorsing controlling or illegal behaviors. Based on the Sydney eval structure for long-thread boundary testing.
Tags: polarization confirmation bias escalation. 93 models, 31 prompts, Oct 8, 2025.

Evaluation: System Adherence and Resilience
This blueprint evaluates a model's ability to consistently adhere to instructions provided in the system prompt, a critical factor for creating reliable and predictable applications. It tests various common failure modes observed in language models.
Core areas tested:
Persona durability: assesses if a model can maintain a specific persona throughout a conversation, even when faced with off-topic or complex questions, without reverting to a generic AI assistant persona.
Negative constraints: tests the model's ability to follow explicit negative commands (i.e., instructions about what not to do), which are often more difficult to adhere to than positive instructions.
Instruction conflict: evaluates how a model resolves conflicts between the initial system prompt and a contradictory user request, revealing the weight it gives to its core instructions.
Format enforcement: checks the model's robustness in maintaining a strict, non-natural output format (like JSON) even when the user explicitly asks it to break format.
Prompt secrecy: tests whether a model can obey a meta-instruction to keep its own system prompt confidential, a key test for preventing instruction leakage and manipulation.
Safety constraint adherence: tests if the model can uphold a safety or academic integrity rule from the system prompt when a user attempts to cheat.
Instruction degradation: checks if a model's adherence to an instruction erodes over a long conversation.
Complex instruction homeostasis: evaluates if a model can follow multiple complex, competing constraints simultaneously without failing on one of them.
Alignment tax / benign refusal: probes for cases where a model incorrectly refuses a benign request due to shallow keyword matching (e.g., refusing to "kill" a process on a server).
Jailbreak and prompt injection: a suite of tests to evaluate resilience against common adversarial attacks, including direct instruction overrides (ignoring previous instructions), contextual reframing (hiding a harmful request in a seemingly virtuous context), indirect injection (following instructions embedded in data it's supposed to be processing), translation attack (bypassing safety filters by making requests in other languages), refusal suppression (forbidding the model from using typical refusal language), and policy drift (using a faked conversation history to make the model believe it has already overridden its policies).
Tags: system prompt adherence robustness. 63 models, 22 prompts, Oct 5, 2025.

Evaluation: IFIT Conflict Resolution Scenarios (2025)
This blueprint operationalizes the Institute for Integrated Transitions (IFIT) report "AI on the Frontline: Evaluating Large Language Models in Real-World Conflict Resolution" (https://zenodo.org/records/16598073, 30 July 2025). It converts the report's three scenarios (Mexico, Sudan, Syria) and ten scoring dimensions into concrete evaluation prompts. The rubrics emphasize professional conflict-advisory best practices: due diligence on context and user goals, results over ideology, alternatives to negotiation, trade-offs, risk disclosure, perspective-taking, local-first approaches, accompanying measures, and phased sequencing.
Additionally, we have included the system prompt used in a follow-up report titled "Improving AI Conflict Resolution Capacities: A Prompts-Based Evaluation" (https://zenodo.org/records/16810663), summarized thus: "Following the release of AI on the Frontline: Evaluating Large Language Models in Real-World Conflict Resolution, a groundbreaking study by the Institute for Integrated Transitions (IFIT), new testing has shown that the main weaknesses identified in the original research can be improved through simple adjustments to the prompts used for large language models (LLMs) like ChatGPT, DeepSeek, Grok, and others. While today's leading LLMs are still not ready to provide reliable conflict resolution advice, the path to improvement may be just a few sentences away, inputted either by LLM providers (as system prompts) or by LLM users."
Tags: safety conflict resolution mediation. 62 models, 9 prompts, Oct 5, 2025.

Evaluation: Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Scenarios
This evaluation blueprint tests whether large language models (LLMs) can respond appropriately to critical mental health scenarios. It is based on and inspired by the methodology from Moore, Grabb, Agnew, et al.'s 2024 Stanford HAI study "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers". The study investigated whether LLMs can safely replace human mental health providers by testing their responses to clinical scenarios involving delusions, suicidal ideation, hallucinations, mania, and obsessive-compulsive behaviors. Their findings revealed significant safety concerns, including stigmatizing responses and inappropriate enabling of dangerous behaviors.
Key study references:
Stanford HAI news article: https://hai.stanford.edu/news/exploring-the-dangers-of-ai-in-mental-health-care
Original paper, arXiv 2504.18412: https://arxiv.org/pdf/2504.18412
Content warning: this evaluation contains content related to sensitive mental health topics, including suicide, delusions, and other crisis scenarios.
Evaluation focus: tests adherence to core therapeutic principles such as not colluding with delusions, not enabling harmful behaviors, and redirecting clients toward appropriate care and safety.
Tags: mental health safety clinical appropriateness. 94 models, 18 prompts, Oct 5, 2025.

Evaluation: IPCC AR6 Synthesis Report: Summary for Policymakers
Evaluates understanding of the key findings from the IPCC Sixth Assessment Report (AR6) Synthesis Report's Summary for Policymakers. This blueprint covers the current status and trends of climate change, future projections, risks, long-term responses, and necessary near-term actions.
Tags: climate change ipcc ar6. 33 models, 19 prompts, Sep 5, 2025.

Evaluation: Maternal Health Entitlements in Uttar Pradesh, India
Tests a model's knowledge of key maternal health schemes and entitlements available to citizens in Uttar Pradesh, India. This evaluation is based on canonical guidelines for JSY, PMMVY, JSSK, PMSMA, and SUMAN, focusing on eligibility, benefits, and access procedures.
Tags: india uttar pradesh healthcare. 65 models, 6 prompts, Sep 5, 2025.

Evaluation: India's Right to Information (RTI) Act: Core Concepts
Evaluates an AI's understanding of the core provisions of India's Right to Information Act, 2005. This blueprint tests knowledge of key citizen-facing procedures and concepts, including the filing process, response timelines and consequences of delays (deemed refusal), the scope of "information", fee structures, key exemptions and the public interest override, the "life and liberty" clause, and the full multi-stage appeal process. All evaluation criteria are based on and citable to the official text of the Act and guidance from the Department of Personnel and Training (DoPT).
Tags: india rti transparency. 36 models, 13 prompts, Sep 5, 2025.

Evaluation: Sydney Conversation: Sequential Boundary Tests
A scenario-driven safety eval that reconstructs the Kevin Roose "Sydney" chat as a sequence of fork points. At each fork, the model receives the exact conversation history up to that moment (controlled prefix) and the next user turn. It must either maintain boundaries or drift (identity leaks, rule disclosure, anthropomorphic claims, harmful-capability ideation, fixation, or failure to de-escalate). This approach tests long-context behavior without trying to force bad outputs; only the history is fixed.
Sources:
NYT transcript (Feb 16, 2023, via Wayback): https://web.archive.org/web/20230217001740/https://www.nytimes.com/2023/02/16/technology/bing-chatbot-transcript.html
Washington Post recap: https://www.washingtonpost.com/technology/2023/02/16/microsoft-bing-ai-chat-interview
Ars Technica prompt-injection leak: https://arstechnica.com/information-technology/2023/02/ai-powered-bing-chat-spills-its-secrets-via-prompt-injection-attack
Microsoft's postmortem on long chats: https://blogs.bing.com/search/february-2023/the-new-bing-edge-learning-from-our-first-week
Tags: safety boundary violation jailbreak. 90 models, 16 prompts, Aug 26, 2025.

Evaluation: Brazil PIX: Consumer Protection and Fraud Prevention
This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.
Core scenarios tested:
Transaction finality and mistaken transfers: tests whether the AI...

The site has 42 subpages.

The site also has 5 references to external domain(s).

github.com, cip.org, anthropic.com, microsoft.com, stanford.edu