CDT AI Governance Lab Archives - Center for Democracy and Technology https://cdt.org/area-of-focus/cdt-ai-governance-lab/

Op-Ed – Artificial Sweeteners: The Dangers of Sycophantic AI https://cdt.org/insights/op-ed-artificial-sweeteners-the-dangers-of-sycophantic-ai/ (May 14, 2025)

This op-ed – authored by CDT’s Amy Winecoff – first appeared in Tech Policy Press on May 14, 2025. A portion of the text has been pasted below.

At the end of April, OpenAI released a model update that made ChatGPT feel less like a helpful assistant and more like a yes-man. The update was quickly rolled back, with CEO Sam Altman admitting the model had become “too sycophant-y and annoying.” But framing the concern as just about the tool’s irritating cheerfulness downplays the potential seriousness of the issue. Users reported the model encouraging them to stop taking their medication or lash out at strangers.

This problem isn’t limited to OpenAI’s recent update. A growing number of anecdotes and reports suggest that overly flattering, affirming AI systems may be reinforcing delusional thinking, deepening social isolation, and distorting users’ grip on reality. In this context, the OpenAI incident serves as a sharp warning: in the effort to make AI friendly and agreeable, tech firms may also be introducing new dangers.

At the center of AI sycophancy are techniques designed to make systems safer and more “aligned” with human values. AI systems are typically trained on massive datasets sourced from the public internet. As a result, these systems learn not only from useful information but also from toxic, illegal, and unethical content. To address these problems, AI developers have introduced techniques to help AI systems respond in ways that better match users’ intentions.

Read the full article.

AI Agents In Focus: Technical and Policy Considerations https://cdt.org/insights/ai-agents-in-focus-technical-and-policy-considerations/ (May 14, 2025)

AI agents are moving rapidly from prototypes to real-world products. These systems are increasingly embedded into consumer tools, enterprise workflows, and developer platforms. Yet despite their growing visibility, the term “AI agent” lacks a clear definition and is used to describe a wide spectrum of systems — from conversational assistants to action-oriented tools capable of executing complex tasks. This brief focuses on a narrower and increasingly relevant subset: action-taking AI agents, which pursue goals by making decisions and interacting with digital environments or tools, often with limited human oversight. 

As an emerging class of AI systems, action-taking agents mark a distinct shift from earlier generations of generative AI. Unlike passive assistants that respond to user prompts, these systems can initiate tasks, revise plans based on new information, and operate across applications or time horizons. They typically combine large language models (LLMs) with structured workflows and tool access, enabling them to navigate interfaces, retrieve and input data, and coordinate tasks across systems, often alongside a conversational interface. In more advanced settings, they operate in orchestration frameworks where multiple agents collaborate, each with distinct roles or domain expertise.

This brief begins by outlining how action-taking agents function, the technical components that enable them, and the kinds of agentic products being built. It then explains how technical components of AI agents — such as control loop complexity, tool access, and scaffolding architecture — shape their behavior in practice. Finally, it surfaces emerging areas of policy concern where the risks posed by agents increasingly appear to outpace the safeguards currently in place, including security, privacy, control, human-likeness, governance infrastructure, and allocation of responsibility. Together, these sections aim to clarify both how AI agents currently work and what is needed to ensure they are responsibly developed and deployed.

Read the full brief.

It’s (Getting) Personal: How Advanced AI Systems Are Personalized https://cdt.org/insights/its-getting-personal-how-advanced-ai-systems-are-personalized/ (May 2, 2025)

This brief was co-authored by Princess Sampson.

Generative artificial intelligence has reshaped the landscape of consumer technology and injected new dimensions into familiar technical tools. Search engines and research databases now by default offer AI-generated summaries of hundreds of results relevant to a query, productivity software promises knowledge workers the ability to quickly create documents and presentations, and social media and e-commerce platforms offer in-app AI-powered tools for creating and discovering content, products, and services.

Many of today’s advanced AI systems like chatbots, assistants, and agents are powered by foundation models: large-scale AI models trained on enormous collections of text, images, or audio gathered from the open internet, social media, academic databases, and the public domain. These sources of reasonably generalized knowledge allow AI assistants and other generative AI systems to respond to a wide variety of user queries, synthesize new content, and analyze or summarize a document outside of their training data.

But out of the box, generic foundation models often struggle to surface details likely to be most relevant to specific users. AI developers have begun to make the case that increasing personalization will make these technologies more helpful, reliable, and appealing by providing more individualized information and support. As visions for powerful AI assistants and agents that can plan and execute actions on behalf of users motivate developers to make tools increasingly “useful” to people — that is, more personalized — practitioners and policymakers will be asked to weigh in with increasing urgency on what many will argue are tradeoffs between privacy and utility, and on how to preserve human agency and reduce the risk of addictive behavior.

Much attention has been paid to the immense stores of personal data used to train the foundation models that power these tools. This brief continues that story by highlighting how generative AI-powered tools use user data to deliver progressively personalized experiences, teeing up conversations about the policy implications of these approaches.

Read the full brief.

Op-ed: Before AI Agents Act, We Need Answers https://cdt.org/insights/op-ed-before-ai-agents-act-we-need-answers/ (April 22, 2025)

CDT’s Ruchika Joshi penned a new op-ed that first appeared in Tech Policy Press on April 17, 2025.

Read an excerpt:

Tech companies are betting big on AI agents. From sweeping organizational overhauls to CEOs claiming agents will ‘join the workforce’ and power a multi-trillion-dollar industry, the race to match hype is on.

While the boundaries of what qualifies as an ‘AI agent’ remain fuzzy, the term is commonly used to describe AI systems designed to plan and execute tasks on behalf of users with increasing autonomy. Unlike AI-powered systems like chatbots or recommendation engines, which can generate responses or make suggestions to assist users in making decisions, AI agents are envisioned to execute those decisions by directly interacting with external websites or tools via APIs.

Where an AI chatbot might have previously suggested flight routes to a given destination, AI agents are now being designed to find which flight is cheapest, book the ticket, fill out the user’s passport information, and email the boarding pass. Building on that idea, early demonstrations of agent use include operating a computer for grocery shopping, automating HR approvals, or managing legal compliance tasks.

Yet current AI agents have been quick to break, indicating that reliable task execution remains an elusive goal. This is unsurprising, since AI agents rely on the same foundation models as non-agentic AI and so are prone to familiar challenges of bias, hallucination, brittle reasoning, and limited real-world grounding. Non-agentic AI systems have already been shown to make expensive mistakes, exhibit biased decision making, and mislead users about their ‘thinking’. Enabling such systems to now act on behalf of users will only raise the stakes of these failures.

As companies race to build and deploy AI agents to act with less supervision than earlier systems, what is keeping these agents from harming people?

The unsettling answer is that no one really knows, and the documentation that the agent developers provide doesn’t add much clarity. For example, while system or model cards released by OpenAI and Anthropic offer some details on agent capabilities and safety testing, they also include vague assurances on risk mitigation efforts without providing supporting evidence. Others have released no documentation at all or only done so after considerable delay.

Read the full op-ed.

Newsom Working Group Calls for Vital Transparency into AI Development https://cdt.org/insights/newsom-working-group-calls-for-vital-transparency-into-ai-development/ (April 10, 2025)

In September of last year, California Governor Gavin Newsom created a working group, led by renowned academics and policy experts, to prepare a report on AI frontier models in order to inform regulators and create a framework for how California should approach the use, assessment, and governance of this advanced technology. In March, the working group released a draft of its report for public comment, and this week, CDT submitted our feedback.

Our comments commended the working group for emphasizing the critical importance of transparency in AI governance. For years, researchers and advocates have argued that transparency is a vital lever for managing the risks of AI and making its benefits available to a variety of stakeholders, by enabling rigorous research and providing the public with critical information about how AI systems affect their lives. Transparency requirements can also create conditions for companies to develop AI systems more responsibly, and help hold those companies accountable when those systems cause harm.

But as of now, essentially all transparency in the AI ecosystem depends on purely voluntary commitments from AI companies, which are not a stable foundation for managing AI risks. AI companies have seemed all too willing to backtrack on their transparency commitments when their business incentives change. Recently, for example, Google released the state-of-the-art AI model Gemini 2.5 Pro without a key safety report — violating a previous commitment it had made. This reveals a vital role for regulators: ensuring that AI developers provide the transparency needed for safe, responsible, and accountable AI development.

At the same time, we pushed the working group to add more detail to its transparency recommendations, given that the most useful transparency measures are those that are precisely scoped and backed by a clear theory of change. We called for the disclosures emphasized by the draft report to be complemented by further crucial information, which would enable visibility into not only the technical safeguards a developer implements, but also the internal governance practices that the developer relies on to manage risk. We also called for visibility into how developers assess the efficacy of their safeguards, as well as information about how developers decide whether risks have been adequately mitigated to justify deploying a model. In addition, we urged the working group to promote visibility into how developers intend to respond to significant risks that materialize. Finally, we urged the working group to consider how developers ought to be incentivized to grant pre-deployment access to qualified third-party evaluators where warranted.

Read our full comments here.

CDT Submits Comment on AISI’s Updated Draft Guidance for Managing Foundation Model Misuse Risk https://cdt.org/insights/cdt-submits-comment-on-aisis-updated-draft-guidance-for-managing-foundation-model-misuse-risk/ (March 14, 2025)

In January, the U.S. AI Safety Institute (AISI) released an updated draft of its guidance on managing the risks associated with foundation model misuse. The guidance describes best practices that developers can follow to identify, measure, and mitigate risks when creating and deploying advanced foundation models. CDT submitted comments suggesting certain improvements on the original draft guidance.

Our comments note that this updated draft is a marked improvement over AISI’s initial version. We were heartened to see several of the themes we emphasized in our comments on that initial draft incorporated into this update. Two particular improvements stand out:

  1. While the focus of the guidance remains on developers, this draft includes clear, actionable recommendations for other actors in the AI value chain as well. This update is in line with the conclusion — found both in our earlier comments and in prior research — that actors across the AI value chain must all act responsibly in order to effectively address AI risks.
  2. This update gives developers more robust guidance on how to weigh the potential benefits of a model against its risks when deciding whether to deploy or continue developing it. This guidance is particularly relevant to developers of open foundation models.

Understandably, this guidance does not aim to address every important risk associated with foundation model development, and AISI explicitly notes its limited scope. At the same time, our comments emphasize that the exclusion of certain risks from this guidance should not deter developers from addressing them. Foundation models may be improperly biased or facilitate unlawful discrimination, and developers should continue to mitigate those risks.

We continue to recommend that AISI explicitly encourage developers to seek input from independent domain experts at relevant points throughout the risk management process. This guidance already gives domain experts an important role. However, research from CDT’s AI Governance Lab has shown that involving independent experts allows for crucial scrutiny of a developer’s risk management processes and the resulting determinations of risk. We also push AISI to clarify that domain experts may come from a variety of disciplines, including the social sciences. Social scientists can provide important input on how to identify and measure risks, which developers should be encouraged to solicit and incorporate.

Additionally, while applauding this guidance’s emphasis on thorough documentation, we recommend that AISI clarify the purpose of each documentation artifact that it recommends that developers create. In a similar vein, AISI should also urge developers to develop these artifacts in consultation with the stakeholders that are their intended audience. Prior research from CDT has shown that these steps can help ensure that documentation serves its intended purpose.

Finally, we urge AISI to clarify that post-deployment monitoring must not violate users’ privacy. While post-deployment monitoring is important for preventing dangerous misuse, developers should avoid invasive monitoring methods and instead rely on privacy-preserving techniques for detecting misuse.

Read the comments here.

Assessing AI: Surveying the Spectrum of Approaches to Understanding and Auditing AI Systems https://cdt.org/insights/assessing-ai-surveying-the-spectrum-of-approaches-to-understanding-and-auditing-ai-systems/ (January 16, 2025)

With contributions from Chinmay Deshpande, Ruchika Joshi, Evani Radiya-Dixit, Amy Winecoff, and Kevin Bankston

What do we mean when we talk about “assessing” AI systems?

The importance of a strong ecosystem of AI risk management and accountability has only increased in recent years, yet critical concepts like auditing, impact assessment, red-teaming, evaluation, and assurance are often used interchangeably — and risk losing their meaning without a stronger understanding of the specific goals that drive the underlying accountability exercise. Articulating and mapping the goals of various AI assessment approaches against policy proposals and practitioner actions can be helpful in tuning accountability practices to best suit their desired aims. 

That is the purpose of this Center for Democracy & Technology report: to map the spectrum of AI assessment approaches, from narrowest to broadest and from least to most independent, to identify which approaches best serve which goals.

Executive Summary

Goals of AI assessment and evaluation generally fall under the following categories:

  • Inform: practices that can facilitate an understanding of a system’s characteristics and risks
  • Evaluate: practices that involve assessing the adequacy of a system, its safeguards, or its practices
  • Communicate: practices that help make systems and their impacts legible to relevant stakeholders
  • Change: practices that support incentivizing changes in actor behavior

Understanding the scope of inquiry, or the breadth or specificity of questions posed by an assessment or evaluation, can be particularly useful in determining whether that activity is likely to surface the most relevant impacts and motivate the desired actions. Scope of inquiry exists on a spectrum, but for ease of comprehension the following breakdown can be a useful mental model to understand different approaches and their respective theories of change:

  • Exploratory: Broad exploration of possible harms and impacts of a system, generally informed but unbounded by a set of known risks. 
  • Structured: Consideration of a set of harms and impacts within a defined taxonomy. 
  • Focused: Evaluation of a specific harm or impact or assessment against a procedural requirement. 
  • Specific: Analysis of a specific harm or impact using a defined benchmark, metric, or requirement.

Meanwhile, recognizing the degree of independence of particular assessment or evaluation efforts — for instance, whether the developer or deployer of the system in question has control over the systems that will be included in a given inquiry, what questions may be asked about them, and whether and to what extent findings are disclosed — is important to understanding the degree of assurance such an effort is likely to confer.

  • Low Independence: Direct and privileged access to an organization or the technical systems it builds 
  • Medium Independence: Verification of system characteristics or business practices by a credible actor who is reasonably disinterested in the results of their assessment 
  • High Independence: Impartial efforts to probe and validate the claims of systems and organizations — without constraint on the scope of inquiry or characterization of their findings

Assessment and evaluation efforts can shift up and down each of these two axes somewhat independently: a low-specificity effort can be conducted in a high-independence manner, while a highly specific inquiry may be at the lowest level of independence and still lead to useful and actionable insights. Ultimately, though, the ability of different efforts to drive desired outcomes relates to where they sit on this matrix.

Recommendations

  • Evaluation and assessment efforts should be scoped to best support a defined set of goals. Practitioners and policymakers should be particularly attentive to whether the independence and/or specificity of their intended assessment and evaluation activities are well-matched to the goals they have for those efforts. 
  • Stakeholders involved in evaluation and assessment efforts should be transparent and clear about their goals, methods, and resulting recommendations or actions. Auditors and assessors should clearly disclose the methods they have employed, any assumptions that shaped their work, and what version of a system was scrutinized. 
  • Accountability efforts should include as broad an array of participants and methods as feasible, with sufficient resources to ensure they are conducted robustly. AI assessment and evaluation activities must include a pluralistic set of approaches that are not constrained to practitioners with technical expertise but rather encompass a sociotechnical lens (i.e., considering how AI systems might interact in unexpected ways with one another, with people, with other social or technical processes, and within their particular context of deployment).

Ultimately, no one set of accountability actors, single scope of assessment, or particular degree of auditor independence can accomplish all of the goals that stakeholders have for AI assessment and evaluation activities. Instead, a constellation of efforts — from research, to assurance, to harm mitigation, to enforcement — will be needed to effectively surface and motivate attention to consequential impacts and harms on people and society. 

Read the full report.

Hypothesis Testing for AI Audits https://cdt.org/insights/hypothesis-testing-for-ai-audits/ (January 16, 2025)

Introduction

AI systems are used in a range of settings, from low-stakes scenarios like recommending movies based on a user’s viewing history to high-stakes areas such as employment, healthcare, finance, and autonomous vehicles. These systems can offer a variety of benefits, but they do not always behave as intended. For instance, ChatGPT has demonstrated bias against resumes of individuals with disabilities,[1] raising concerns that if such tools are used for candidate screening, they could worsen existing inequalities. Recognizing these risks, researchers, policymakers, and technology companies increasingly emphasize the importance of rigorous evaluation and assessment of AI systems. These efforts are critical for developing responsible AI, preventing the deployment of potentially harmful systems, and ensuring ongoing monitoring of their behavior post-deployment.[2]

As laid out in today’s new paper from CDT, “Assessing AI: Surveying the Spectrum of Approaches to Understanding and Auditing AI Systems,” organizations can use a variety of assessment techniques to understand and manage the risks and benefits of their AI systems.[3] Some assessments take a broad approach, identifying the range of potential harms and benefits that could arise from an AI system. For example, an AI company might engage with different stakeholders who may be affected by the system to explore both the positive and negative impacts it could have on their lives. Other assessments are more focused, such as those aimed at validating specific claims about how the AI system performs. For example, a company developing a hiring algorithm may want to verify whether the algorithm recommends qualified male and female candidates at the same rate.

Stakeholders have noted the importance of evaluating specific claims about AI systems through what are often referred to as AI audits. Researchers have drawn comparisons between AI audits and hypothesis testing in scientific research,[4] where scientists determine whether the effects observed in an experiment are likely meaningful or simply due to random chance. Similarly, hypothesis testing offers AI auditors a systematic approach to assess patterns in AI system behavior. This method can help gauge the evidence supporting a claim, such as whether an AI system indeed avoids discrimination against particular demographic groups.

Using hypothesis testing in AI audits offers several advantages. It is a well-established method in empirical research, and so can provide AI auditors with tools for evaluation and interpretation. And hypothesis testing helps auditors quantify the uncertainty in their data, which is crucial for making informed decisions and developing action plans. However, like in other fields, hypothesis testing in AI has its limitations. Results can be influenced by factors other than the specific effects being evaluated, such as the particular subsets of data used to assess a claim. These sources of random error can impact the validity of the interpretations drawn from hypothesis tests. Therefore, accounting for these limitations is essential for auditors to make appropriate recommendations based on their analyses.

In this explainer, intended as a supplement to our broader paper on AI assessments, we focus in particular on the key ideas behind hypothesis testing, show how it can be applied to AI audits, and discuss where it might fall short. To help illustrate these points, we use computational simulations of a hypothetical hiring algorithm to show how hypothesis testing can detect gender disparities under different conditions.[5]

Hypothesis Testing in Statistics

Imagine a technology company is developing an algorithm that evaluates job applicants’ resumes, cover letters, and questionnaire responses to make hiring decisions. If the algorithm is trained on historical data from past applicants and hires, it might unintentionally learn existing biases or disparities, potentially leading to unfair hiring patterns. For example, an algorithm used to hire software engineers could end up disadvantaging female applicants due to the historical underrepresentation of women in technical fields. In such cases, an auditor may want to assess whether the algorithm recommends qualified male and female candidates at equal rates.

To conduct such a test, the auditor could use hypothesis testing, starting with a “null hypothesis” (H0), which usually represents the assumption that there is no difference between groups. The “alternative hypothesis” (H1) proposes that a difference does exist. In statistical terms, the hypotheses might look like this:

  • Null hypothesis / H0: There is no difference in the algorithm’s hiring recommendations for men and women.
  • Alternative hypothesis / H1: There is a difference in the algorithm’s hiring recommendations for men and women.


It might seem counterintuitive to set the default assumption as “no difference” if the auditor is investigating gender disparities. However, in hypothesis testing, the burden of proof lies on the evaluator to show that any observed effect (such as outcome disparities) is unlikely to have occurred by random chance. Later in this explainer, we will discuss how to use hypothesis testing when the null hypothesis assumes there is a difference.

When researchers conduct experiments, their goal is to understand how patterns or relationships appear in a broader group, or “population,” they are studying. Instead of gathering data from the entire population though, they typically rely on smaller subsets of data called “samples”— for a variety of reasons. For example, an auditor evaluating a hiring algorithm might not have access to data from every potential candidate, or it could be too time-consuming and costly to gather all this data. Instead, the auditor would analyze a smaller sample, assessing the level of disparity within that sample as a way to estimate the level of disparity in the population.

But results from a sample may not accurately represent the broader population due to random sampling variability. An auditor might, by chance, select for the sample a subset of men whom the algorithm is less likely to recommend compared to the overall population, or a sample of women whom the algorithm is more likely to favor. In other words, sample measurements are subject to some degree of random error. This is where hypothesis testing becomes essential—it allows researchers to evaluate whether the effects they observe in measurements of the sample are likely indicative of real patterns in the larger population, or if they could simply be due to random chance within that specific sample.

When conducting a hypothesis test, researchers need a way to decide whether to reject the null hypothesis, which means determining if there is enough evidence to conclude that an effect observed in a sample likely reflects a true effect in the population. This decision hinges on statistical significance, which indicates whether an observed effect in a sample is statistically meaningful enough to suggest it likely exists in the population. To assess statistical significance, researchers calculate a p-value, which represents the probability of obtaining results as extreme as those observed (or more so) if the null hypothesis were true.

The basic idea is that a strong effect observed in the sample — such as an algorithm recommending men for hire much more often than women — suggests it’s less likely that the result is due to random variation than a comparably weak effect might suggest. A common threshold for determining whether an effect is significant is a p-value of 0.05, which implies a 5% risk of concluding there is an effect when, in fact, there isn’t. If the p-value is less than 0.05, researchers conclude that the effect is statistically significant.
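As a concrete illustration, the sketch below (our own example, not code from this explainer) shows how an auditor might compute such a p-value with a two-proportion z-test in Python, using hypothetical counts in which the algorithm recommended 63 of 100 sampled men and 45 of 100 sampled women:

```python
# Minimal sketch: two-proportion z-test on hypothetical audit counts.
from statsmodels.stats.proportion import proportions_ztest

men_hired, men_total = 63, 100        # hypothetical sample counts
women_hired, women_total = 45, 100

# H0: the selection rates for men and women are equal.
# H1 (two-sided): the selection rates differ, in either direction.
z_stat, p_value = proportions_ztest(
    count=[men_hired, women_hired],
    nobs=[men_total, women_total],
    alternative="two-sided",
)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the observed disparity is statistically significant.")
else:
    print("Fail to reject H0: insufficient evidence of a disparity.")
```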

Random chance can influence the patterns researchers observe in a sample, leading to possible errors in researchers’ conclusions about the population. These errors fall into two categories. A Type I error occurs when investigators incorrectly conclude that there is an effect when, in reality, there is none; this is also known as a “false positive.” For example, if an auditor concluded that an algorithm demonstrates gender disparity based on the sample data when it actually does not, this would be a Type I error. Conversely, a Type II error happens when researchers fail to detect a real effect, resulting in a “false negative.” In this case, the auditor would fail to identify that the algorithm results in disparities, potentially allowing a discriminatory AI system to be deployed.

  • Type I error / false positive: Researchers or auditors incorrectly conclude there is an effect when there is none.
  • Type II error / false negative: Researchers or auditors fail to detect a real effect.

The probability of committing a Type II error is closely related to statistical power — the likelihood that a hypothesis test will correctly identify an effect when it exists.  Factors like sample size and the magnitude of the effect (e.g., the level of disparity driven by the algorithm) directly impact statistical power, emphasizing the need for careful planning in AI audits. Since low statistical power increases the risk of Type II errors or false negatives, auditors of AI systems will need sufficient statistical power to detect the effect they are investigating, such as gender disparity in an algorithm. 

Another factor that impacts statistical power is whether the hypothesis test evaluates differences only in one direction or in both directions. Imagine an auditor is particularly interested in whether the algorithm unfairly advantages male applicants. That is, even when men and women are equally qualified, it recommends men more frequently than women. Instead of testing for any difference between men and women, she specifically wants to investigate the level of evidence supporting bias against women. In this case, the alternative hypothesis focuses on men being selected more often than women, rather than looking for differences in either direction. This is known as a “one-sided” test in statistics, which is suitable when the goal is to investigate a specific outcome, such as a bias in favor of men. So here, the auditor’s null and alternative hypotheses would be:

  • Null hypothesis / H0: There is no difference in the hiring rates of men and women.  
  • Alternative hypothesis / H1: Men have a higher hiring rate than women.

This example illustrates an important aspect of hypothesis testing: how the hypotheses are framed directly impacts the interpretation of the results. Here, the auditor is testing specifically whether men are favored over women, not whether there is bias in favor of either men or women. As a consequence, if the algorithm instead favored women, this hypothesis formulation would not create the conditions for the auditor to detect that disparity because the test wasn’t designed to look for it.

It may seem counterintuitive to evaluate for differences (in this example, bias) only in one direction; however, in statistics, there are tradeoffs. In the case of one-sided tests, the tradeoff involves statistical power. A one-sided test has greater statistical power compared to a two-sided test because it focuses entirely on one direction of the effect. Imagine shining a flashlight to look for your keys in the dark. A one-sided test is like focusing all the light in one direction, making it easier to spot the keys in that direction, but impossible to find them if they are in the opposite direction. A two-sided test splits the light between the two directions, making it possible to search on either side, but harder to see than if all of the light were focused in one spot. In many cases, it would make sense to have less light to cover more ground; however, if you have reason to expect that the keys would be to the right and not to the left, it makes more sense to focus on that area.

In our example, using a one-sided hypothesis makes the test more sensitive to detecting differences in the specified direction, which, in the auditor’s case, is a bias in favor of men. However, the tradeoff is that a one-sided test does not attempt to assess bias in the opposite direction.[6] Therefore, the decision to use a one-sided or two-sided test depends on the research question or audit objective, the stakes of evaluating bias in the opposite direction, and the context in which the results will be interpreted and acted upon.
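In code, the difference between a two-sided and a one-sided test is often a single argument. Continuing the hypothetical z-test sketch above, statsmodels' proportions_ztest accepts alternative="larger" to test whether the first group's rate exceeds the second's:

```python
# One-sided version of the earlier sketch: H1 is that men's selection rate
# exceeds women's, rather than that the two rates differ in either direction.
z_stat, p_one_sided = proportions_ztest(
    count=[men_hired, women_hired],    # hypothetical counts from the sketch above
    nobs=[men_total, women_total],
    alternative="larger",              # tests prop(men) - prop(women) > 0
)
# When the observed difference favors men, the one-sided p-value is smaller
# than the two-sided one, reflecting the test's greater power in that direction.
print(f"one-sided p = {p_one_sided:.4f}")
```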

Applying hypothesis testing to AI audits: A simulation approach

The main purpose of hypothesis testing is to examine patterns within a sample to make inferences about a larger population. In a real-world AI system audit, an auditor usually cannot assess the algorithm’s performance across the entire population. Auditors may face limitations in how much data they can analyze due to privacy concerns, regulations, resource constraints, or restrictions from the organization, making analyzing data within samples the only viable option. Or auditors may be hoping to understand not only how the algorithm performed with existing data, but how it might perform with future users, whose data would not be available to them. If full population data were available, hypothesis testing wouldn’t be necessary, as direct evaluation would provide the needed insights.

Explaining how both population dynamics and sample variability affect hypothesis testing can be challenging, especially since complete data on the population often doesn’t exist. In this explainer, we’ll rely on simulation — a computational approach that enables us to control population characteristics to explore their impact on hypothesis tests applied to samples — to illustrate these concepts. Our simulation will create a virtual population to model the algorithm’s hiring decisions, setting specific parameters, like the algorithm’s level of gender bias. This method allows us to illustrate hypothesis testing in a controlled setting, showing how factors such as sample size or the strength of gender bias in the algorithm can affect audit results.

To illustrate how hypothesis testing could be used in AI audits, we will simulate a population in which 5,000 men and 5,000 women are eligible to apply for a job. Let’s assume that 60% of both male and female applicants are qualified, and that the auditor is interested in whether the algorithm satisfies demographic parity, meaning that it makes hiring recommendations at the same rate for different demographic groups. In our simulation, achieving demographic parity would mean recommending men and women for hire at the same rate.

To illustrate the importance of hypothesis testing in AI auditing, we will simulate an algorithm that we know fails to achieve demographic parity. Figure 1 shows how the algorithm would make hiring recommendations for male and female candidates in the entire population of candidates. Among qualified male applicants, the algorithm recommends hiring 80% of them, while it recommends only 60% of qualified female applicants. For unqualified applicants, the algorithm recommends hiring 20% of men but only 10% of women. Despite men and women being equally qualified, the algorithm ends up recommending women for hire 39% of the time, compared to 56% for men (a difference of about 0.17). For an auditor who has chosen demographic parity as the relevant fairness definition, these recommendations indicate an unfair system. However, because the auditor would not have access to data on the entire population, the goal of her hypothesis testing would be to try to uncover these patterns in a smaller sample of the data.

Figure 1. Proportions of applicants in the population that the algorithm would recommend, based on qualifications and gender.
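A minimal sketch of this simulated population follows (illustrative code reflecting the scenario described above, not CDT's actual simulation; realized rates will vary slightly with the random seed):

```python
import numpy as np

rng = np.random.default_rng(0)

N_PER_GROUP = 5_000     # applicants per gender in the simulated population
P_QUALIFIED = 0.60      # qualification rate, identical for men and women

# Hiring-recommendation probabilities from the scenario above
P_HIRE = {
    ("men", True): 0.80,   ("men", False): 0.20,
    ("women", True): 0.60, ("women", False): 0.10,
}

def simulate_group(gender):
    """Return a 0/1 array of the algorithm's recommendations for one gender."""
    qualified = rng.random(N_PER_GROUP) < P_QUALIFIED
    p_hire = np.where(qualified, P_HIRE[(gender, True)], P_HIRE[(gender, False)])
    return (rng.random(N_PER_GROUP) < p_hire).astype(int)

men_hired_pop = simulate_group("men")
women_hired_pop = simulate_group("women")
print(f"Population selection rate, men:   {men_hired_pop.mean():.2f}")
print(f"Population selection rate, women: {women_hired_pop.mean():.2f}")
```

The expected selection rates implied by these parameters are 0.56 for men and 0.40 for women; the particular simulated population used in this explainer realized 0.56 and 0.39.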

The auditor would need to decide whether to reject the null hypothesis by assessing the probability that any observed gender disparity in the sample occurred due to random chance. In other words, she would use her sample to determine if the algorithm demonstrates gender bias. However, due to random sampling error, the sample results might differ from the overall population, which would cause the outcome she observes to vary depending on which individuals are included in the sample.

Computational simulation can help demonstrate how sampling error can cause estimates of gender disparities to differ from the true population values. Starting with our simulated population of 5,000 men and 5,000 women, we randomly select 100 men and 100 women into the sample to observe the algorithm’s hiring recommendations. In this particular sample, the algorithm recommends hiring men 63% of the time and women 45% of the time (a difference of 0.18) — an estimate that shows slightly more gender disparity than what we know from the simulation scenario to be the true rates in the overall population. However, if we had chosen different samples, the estimates could have varied.

To understand how much sample estimates are likely to differ from the population, the simulation allows us to repeat this process multiple times and visualize variability in the results we observe. Figure 2 summarizes the results, with the x-axis representing the selection rates and the y-axis showing the frequency of each rate across the 100 simulations. The dashed lines indicate the selection rates in the entire population.

Figure 2. Frequencies of the algorithm’s selection rates for men and women across 100 simulated samples.
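The repeated-sampling experiment behind a figure like this can be sketched as follows, continuing the simulation code above (it reuses rng, men_hired_pop, and women_hired_pop):

```python
def sample_selection_rates(n_per_group=100, n_sims=100):
    """Draw repeated random samples from the simulated population and record
    the algorithm's selection rate for each gender in each sample."""
    rates = []
    for _ in range(n_sims):
        men_sample = rng.choice(men_hired_pop, size=n_per_group, replace=False)
        women_sample = rng.choice(women_hired_pop, size=n_per_group, replace=False)
        rates.append((men_sample.mean(), women_sample.mean()))
    return np.array(rates)

rates = sample_selection_rates()
print(f"Sample rates for men ranged from {rates[:, 0].min():.2f} to {rates[:, 0].max():.2f}")
print(f"Sample rates for women ranged from {rates[:, 1].min():.2f} to {rates[:, 1].max():.2f}")
```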

The graph reveals that while most selection rates in the simulations cluster around the true population rates, there is still some variability. For example, in one simulated sample, the selection rate for women was as low as 27%, while in another, it reached 54%, almost matching the population rate for men. The graph demonstrates that while in general, results in a sample will tend to resemble the population, that is not necessarily the case in any given sample, which could lead to misleading audit results.

  • Statistical result: Across 100 simulations, each with 100 randomly selected men and women, the selection rate for women varied between 27% and 54%, while for men, it ranged from 45% to 67%. This variability in sample estimates, caused by sampling error, led to deviations from the true selection rates of the whole population.
  • Interpretation: The estimated disparity resulting from the algorithm in a given sample may appear more or less severe than it actually is in the population, depending on which specific individuals are randomly selected.

Another way to visualize these experimental results is by using a histogram that shows the difference in selection rates between men and women for each experiment, as seen in Figure 3. The dashed line marks the selection rate difference in the entire population, which is approximately 0.17.

Figure 3. Frequencies of the algorithm’s selection rate difference for men and women across 100 simulated samples.

While most outcomes apparent from samples cluster around this population difference, some indicate a much larger divergence. In one experiment, women were even selected more frequently than men, demonstrating the opposite of the trend observed in the overall population.[7] This random variation highlights a key challenge in hypothesis testing: determining whether the observed results indicate a genuine effect in the population or are simply due to chance. When auditors assess an algorithm’s behavior based on a sample, they must acknowledge that random error can influence their estimates. Therefore, auditors should interpret their findings cautiously, framing their conclusions with an awareness of sampling variability and its limitations.

Instead of just examining the selection rates, our simulation allows us to run a formal statistical test on each of our simulated samples and calculate a p-value in the same way that an auditor would also calculate a p-value in an audit. The p-value indicates the likelihood of obtaining the observed results—or even more extreme ones—purely by chance if there were actually no difference between the algorithm’s treatment of men and women in the overall population.
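A sketch of that step, building on the helpers above (the exact rejection share depends on the random seed; the explainer reports 78 out of 100):

```python
from statsmodels.stats.proportion import proportions_ztest

def share_rejecting_h0(n_per_group=100, n_sims=100, alpha=0.05):
    """Run a two-proportion z-test on each simulated sample and return
    the share of samples in which the null hypothesis is rejected."""
    rejections = 0
    for _ in range(n_sims):
        men_sample = rng.choice(men_hired_pop, size=n_per_group, replace=False)
        women_sample = rng.choice(women_hired_pop, size=n_per_group, replace=False)
        _, p_value = proportions_ztest(
            count=[int(men_sample.sum()), int(women_sample.sum())],
            nobs=[n_per_group, n_per_group],
        )
        rejections += p_value < alpha
    return rejections / n_sims

print(f"Share of samples with p < 0.05: {share_rejecting_h0():.2f}")
```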

We can visualize the p-values from the 100 experiments in a histogram, as shown in Figure 4.

Figure 4. Frequencies of p-values in simulated sample tests.

Using a significance level of p < 0.05, we would correctly reject the null hypothesis of no difference in selection rates in 78 out of the 100 experiments. In other words, 78% of the time, we would correctly conclude that it is unlikely that the disparity in our sample emerged by chance alone. However, in 22 experiments, the evidence was not strong enough to reject the null hypothesis, resulting in a Type II error (false negative).

This means that the auditor could fail to identify a statistically significant disparity in the algorithm due to the random selection of individuals in the sample, even though the algorithm does in fact produce disparate recommendations. In our simulation, this failure to detect the disparity would occur more than 20% of the time!

  • Statistical result: Across 100 simulations, each with random samples of 100 men and 100 women, and using a significance level of p < 0.05, the statistical test found a significant difference in 78% of the cases. However, in 22% of the samples, the test failed to reject the null hypothesis, incorrectly suggesting no difference in selection rates in the population (a Type II error).
  • Interpretation: In 78 out of 100 tests, the statistical test correctly identified that the algorithm leads to disparities. However, in 22 cases, it failed to detect the true difference. In the context of auditing an algorithm that does result in gender disparity, this would mean that the auditor would correctly conclude that the algorithm recommends women less frequently than men 78% of the time. But 22% of the time, the auditor would conclude that there was insufficient evidence to conclude that the algorithm led to disparities.

Drawing erroneous conclusions over 20% of the time is clearly not ideal. Fortunately, there are ways to improve the robustness of these conclusions. One method is to collect data from larger samples of men and women. As sample size increases, the samples tend to more closely resemble the overall population, reducing the likelihood of random error that causes large deviations in the sample outcomes.

However, while larger sample sizes lower the risk of statistical errors, gathering them is not always feasible. Smaller samples are often quicker and more cost-effective to collect, especially when data collection is time-consuming, expensive, or logistically difficult. As with analyzing data on entire populations, auditors may not be able to access large samples due to organizational restrictions or resource and operational constraints. Yet, relying on samples that are too small can lead to inaccurate conclusions, potentially resulting in the deployment of systems that cause real-world harm. Therefore, auditors must balance efficiency and accuracy, aiming to gather the minimum sample size necessary to reliably detect the patterns they are investigating.

We can use our simulation to illustrate how sample size affects the accuracy of statistical conclusions. Instead of gathering data from 100 men and 100 women, we can increase the sample size to 250 in each group and repeat the experiment 100 times. The results are shown in Figure 5. 

Figure 5. Frequencies of the algorithm’s selection rates for men and women across 100 simulated samples with sample sizes of 250 men and women.

With this larger sample size, fewer experiments produced selection rates that deviated significantly from the population rates. Similarly, the statistical tests yielded p-values below 0.05 in 97 out of 100 experiments. This demonstrates that increasing the sample size significantly reduces the chances of making incorrect statistical conclusions about the population.
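In the sketch above, this larger-sample condition corresponds to a single additional call (again, exact shares will vary with the random seed):

```python
# With 250 applicants per gender in each simulated audit, nearly all
# samples should now yield p < 0.05.
print(f"Share of samples with p < 0.05 at n=250: {share_rejecting_h0(n_per_group=250):.2f}")
```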

Sample size is not the only factor influencing an auditor’s ability to detect a real difference in a population from a sample; the magnitude of the difference also plays a key role. Consider a hypothetical population where the gender disparity is more pronounced than in our initial scenario. This time, we will set the difference in selection rates between men and women in the simulated scenario to be larger—0.28 instead of 0.17. In this scenario, even with only 100 participants in each group, we would correctly reject the null hypothesis that the algorithm selects men more often than women 100 out of 100 times.[8] In other words, the larger the effect size in the population, the easier it is to detect in a sample, even if the sample size is smaller.


In some applications, like hiring, where disparity testing is already common, auditors may use existing laws or norms to set specific thresholds for unacceptable disparities. For instance, employment discrimination jurisprudence leans on the “four-fifths” or “80%” rule of thumb that suggests that further investigation is warranted if the selection rate for any protected group (e.g., based on race or gender) falls below 80% of the rate for the group with the highest selection rate. Although the four-fifths rule is often misapplied in AI contexts,[9] it can still serve as a relevant threshold in certain situations. However, simply falling below 80% in a given sample may not be sufficient evidence to conclude that the disadvantaged group’s outcomes are below 80% of those of the more advantaged group in the entire population. In these cases, auditors will still need to use specific statistical tests to challenge the null hypothesis that the disadvantaged group receives at least 80% of the positive outcomes compared to the more advantaged group. When testing against specific thresholds rather than looking for whether there is any difference, auditors will still need to account for the possibility that random error could explain the observed differences in a sample.
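The explainer does not prescribe a particular procedure for testing against the four-fifths threshold, but one possible approach is a bootstrap test of the selection-rate ratio. The sketch below (our own construction, reusing rng and numpy from the simulation code above) rejects the "at least four-fifths" null only when the upper bound of a one-sided bootstrap interval for the ratio falls below 0.8:

```python
def four_fifths_bootstrap_test(men_sample, women_sample, threshold=0.8,
                               n_boot=10_000, alpha=0.05):
    """Bootstrap sketch of a one-sided test of the four-fifths rule.
    H0: women's selection rate is at least `threshold` times men's rate.
    H1: women's selection rate falls below that threshold.
    Rejects H0 only when the upper bound of the one-sided (1 - alpha)
    bootstrap interval for the ratio is below the threshold."""
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        m_rate = rng.choice(men_sample, size=len(men_sample), replace=True).mean()
        w_rate = rng.choice(women_sample, size=len(women_sample), replace=True).mean()
        ratios[i] = w_rate / m_rate if m_rate > 0 else np.nan
    upper_bound = np.nanquantile(ratios, 1 - alpha)
    return upper_bound < threshold, upper_bound

# Example call on one simulated audit sample
men_s = rng.choice(men_hired_pop, size=250, replace=False)
women_s = rng.choice(women_hired_pop, size=250, replace=False)
reject, upper = four_fifths_bootstrap_test(men_s, women_s)
print(f"Reject the four-fifths null: {reject} (upper bound on ratio: {upper:.2f})")
```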

In empirical research, scientists often use a method called power analysis to estimate how large a sample needs to be to detect an effect of a certain size. Our simulation functions similarly to a power analysis: by adjusting assumptions about the size of the difference in the population and experimenting with sample sizes, we can determine how large groups need to be to achieve an acceptable margin of error.[10] However, while power analysis is a useful tool, it does not guarantee that a true effect will be found, even if the sample size matches the recommended value. Moreover, if the auditor is uncertain about the expected effect size, choosing the appropriate sample size becomes more challenging. 
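For comparison, an analytic counterpart to this simulation-based power analysis might look like the following sketch (assuming selection rates of roughly 0.56 and 0.40, per the scenario above; analytic results depend on the assumed rates and on whether the test is one- or two-sided, and will not exactly match simulation-based estimates):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Effect size (Cohen's h) implied by assumed selection rates of 0.56 vs. 0.40
effect_size = proportion_effectsize(0.56, 0.40)

# Solve for the per-group sample size needed for 80% power at alpha = 0.05
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Approximate required sample size per group: {n_per_group:.0f}")
```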

Practitioners using hypothesis testing in auditing should be aware of its limitations, including the fact that it doesn’t ensure meaningful effects will be detected in a sample when they exist in the population. To reduce the risk of drawing inaccurate conclusions about the population, auditors should ensure their sample is large enough to detect the anticipated effect size. If they are unsure of the expected effect magnitude, they should base their sample size estimates on the smallest effect size they would consider meaningful.

To someone less familiar with statistics, it might seem reasonable to try another approach: if the first test doesn’t show a significant result, the auditor could simply repeat the test with different samples or continue collecting data until the test yields a significant outcome. However, this approach greatly increases the risk of a false positive. Just as sampling error can sometimes hide real effects, it can also exaggerate them. The more tests the auditor runs, the higher the chance of finding a statistically significant result purely by chance, leading to a Type I error—incorrectly concluding that there is a meaningful effect in the population when there actually isn’t one.

This practice is known as p-hacking, where researchers manipulate their analyses or repeat tests until they achieve statistically significant results. P-hacking undermines the validity of findings, as it exploits random fluctuations in data rather than revealing true effects. Auditors must avoid this pitfall by defining their sample sizes and analysis plans before collecting any data. This improves the likelihood that any conclusions drawn from the sample are reliable.
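The inflation that comes from repeated testing is easy to demonstrate in simulation. In the sketch below, the population has no disparity at all, yet an auditor who peeks at the data after every additional 50 participants and stops at the first significant result produces false positives far more often than the nominal 5% rate. The peeking schedule and selection rate are assumed purely for illustration.

```python
# How optional stopping ("peeking") inflates false positives.
# The two groups share the same selection rate, so every rejection is a Type I error.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
TRUE_RATE = 0.35             # identical for men and women: no real disparity
PEEKS = range(50, 501, 50)   # re-test after every additional 50 people per group

def audit_finds_disparity(peek=True):
    men = rng.binomial(1, TRUE_RATE, 500)
    women = rng.binomial(1, TRUE_RATE, 500)
    checkpoints = PEEKS if peek else [500]
    for n in checkpoints:
        _, p = proportions_ztest(count=[men[:n].sum(), women[:n].sum()], nobs=[n, n])
        if p < 0.05:
            return True   # the auditor stops and (wrongly) reports a disparity
    return False

n_sims = 2000
peeking = np.mean([audit_finds_disparity(peek=True) for _ in range(n_sims)])
planned = np.mean([audit_finds_disparity(peek=False) for _ in range(n_sims)])
print(f"False positive rate with one pre-planned test: {planned:.1%}")
print(f"False positive rate when peeking every 50 participants: {peeking:.1%}")
```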

In more complex or repeated audits, the auditor may need to conduct multiple statistical tests. For instance, they might test whether the algorithm results in disparities overall, as well as explore differences between intersectional subgroups, such as Black women versus white women, or Black women versus white men. Or the auditor might want to perform analyses at different points in time.

In these situations, auditors should use techniques known as multiple comparisons correction to reduce the risk of drawing conclusions based on false positives. This approach adjusts the threshold for statistical significance depending on the number of tests the auditor conducts. For example, if the auditor performs 10 tests, a multiple comparisons correction might lower the p-value threshold from 0.05 to a more stringent level (e.g., 0.005). Essentially, multiple comparisons correction demands stronger evidence before concluding that an effect exists, thereby reducing the risk of Type I errors and inaccurate findings.
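The adjustment described above corresponds to a Bonferroni-style correction, which statsmodels implements alongside less conservative alternatives such as the Holm procedure. The p-values in the sketch below are invented for illustration.

```python
# Adjusting for multiple comparisons across several subgroup or time-point tests.
# The raw p-values below are invented for illustration.
from statsmodels.stats.multitest import multipletests

raw_p_values = [0.003, 0.021, 0.048, 0.049, 0.12, 0.31, 0.44, 0.58, 0.72, 0.91]

for method in ("bonferroni", "holm"):
    reject, adjusted, _, _ = multipletests(raw_p_values, alpha=0.05, method=method)
    still_significant = [p for p, r in zip(raw_p_values, reject) if r]
    print(f"{method}: significant after correction -> {still_significant}")
# With ten tests, the Bonferroni correction effectively requires p < 0.005, so the
# raw p-values between 0.021 and 0.049 no longer count as evidence on their own.
```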

Using simulations to demonstrate a lack of gender disparity in an algorithmic system

So far, we’ve discussed how auditors can use hypothesis testing to assess systems for gender disparity, typically starting with the null hypothesis that no difference in selection rates exists in the population. However, in some cases, it might be more appropriate to reverse the hypotheses: Some scholars argue that when quantitative measures indicate a performance disparity, the burden of proof should be on the company to demonstrate that no disparity exists.[11] In other words, the default assumption should shift to the algorithm showing disparity.

In our auditing example, the null hypothesis could state that men are hired more often than women, while the alternative hypothesis would suggest no difference in hiring outcomes between the two groups. In this setup, the burden of proof falls on the auditor to show that, based on the sample, there is sufficient reason to believe that no disparity is present.

However, particularly in statistical testing, the absence of evidence is not the same as evidence of absence. For example, in our first simulated scenario, the selection rate difference between men and women was 0.17, but due to the small sample size, the hypothesis test failed to reach statistical significance in over 20% of the samples. In this case, the experiment simply lacked the power to detect the disparity. Companies seeking to avoid accountability might design audits with insufficient statistical power, ensuring that even if their systems show disparity, the test results would not allow auditors to confidently identify it.

Failing to reject a null hypothesis that there is no difference does not prove that no difference exists in the population. In statistics, it generally requires more evidence to conclude that an effect is absent than to suggest it is present. Consider this analogy: if someone is trying to determine whether a haystack contains any needles and only searches a portion of it without finding one, this doesn’t necessarily mean there are no needles. The more of the haystack they search, the more confident they can be that no needles are present.

It is statistically impossible to prove that two different populations are exactly the same in any particular respect, but it is possible to evaluate whether any difference between them is likely small enough to be considered acceptable.[12] This approach, known as “non-inferiority testing,” was developed by researchers in pharmacology. For example, pharmacology researchers might use non-inferiority testing to determine whether a generic drug is not worse than the brand-name version by more than a clinically acceptable margin. Another example is assessing whether a new drug, which may have fewer side effects or be easier for patients to take, is not meaningfully less effective than an existing drug that is more challenging to use.

In non-inferiority testing, researchers first define what they consider an acceptable difference. They then perform a statistical test to estimate, within a margin of error, how much one treatment might be worse than another in the overall population. If the worst-case scenario (the lower bound of the estimate) does not exceed the pre-defined acceptable difference, they can reject the null hypothesis that the difference is unacceptably large.

We can apply this concept to our simulation of an algorithmic hiring system. Returning to the scenario where the selection rate difference between men and women in the population is 0.17, we could introduce a threshold of 0.2. This means we want to assess how likely we are to correctly reject the null hypothesis that the difference between men’s and women’s selection rates is 0.2 or greater.
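One way an auditor might run such a test on a single sample is sketched below. The counts are hypothetical stand-ins consistent with the rates discussed here, and statsmodels' z-test is only one of several formulations of a non-inferiority test for proportions.

```python
# Non-inferiority-style test: is the men-minus-women selection gap credibly
# smaller than the 0.2 threshold? The counts are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

n_men = n_women = 500
men_selected, women_selected = 225, 140   # sample rates 0.45 and 0.28, a 0.17 gap

# H0: gap >= 0.2 (unacceptably large); H1: gap < 0.2 (within the threshold).
stat, p_value = proportions_ztest(
    count=[men_selected, women_selected],
    nobs=[n_men, n_women],
    value=0.2,
    alternative="smaller",
)
print(f"z = {stat:.2f}, p = {p_value:.3f}")
# Only if p < 0.05 can the auditor reject the claim that the disparity exceeds
# 0.2; a non-significant result leaves that claim standing, as in most of the
# simulated experiments discussed next.
```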

In the population, we know that the selection rate difference is less than 0.2. However, when we analyze the p-values from the non-inferiority tests (shown in Figure 6), the test was significant in only 10 out of 100 experiments. In the other 90 experiments, we would incorrectly fail to reject the null hypothesis. As a result, we would not be able to statistically conclude that the algorithm does not produce outcome disparities for women that exceed our threshold.

Graphic of a blue histogram, showing the frequencies of p-values in simulated sample non-inferiority tests. Significance: p < 0.05.

Figure 6. Frequencies of p-values in simulated sample non-inferiority tests.

We can attempt to rectify this problem, as before, by expanding our sample size to 250 men and 250 women. This raises the number of statistically significant results to 12 out of 100, but still means that in the majority of cases, the auditor’s experiment is not sufficiently powered to support the correct conclusion. Even with a sample size of 1,000 men and 1,000 women, the auditor would only be able to correctly reject the null hypothesis 39 times out of 100. In sum, in order to be sufficiently powered, an experiment testing these specific hypotheses would need a very large sample, one that would constitute a substantial proportion of the relevant populations.

Other factors that affect this test are the threshold for acceptable difference and the actual difference that exists in the population. If we set the threshold at 0.25, well above the level of gender disparity in the population, then with samples of 500 men and 500 women we would correctly reject the null hypothesis in 85 out of 100 experiments.

Practitioners interested in auditing AI systems should bear in mind that, when performing non-inferiority testing, the smaller the gap between their threshold and the actual difference in the population, the larger their sample will need to be. Practically, this may mean that audits leveraging non-inferiority testing will be more expensive or resource-intensive to conduct than those using traditional hypothesis tests. In instances where sufficiently large samples cannot be analyzed, non-inferiority testing may not offer a viable approach.

Interpreting traditional hypothesis tests can be challenging, especially for those less familiar with statistics, and non-inferiority tests can be even more difficult to understand. However, auditors should be aware that if companies want to demonstrate that their systems do not show disparity beyond a certain level, they will need to rely on non-inferiority testing to support these claims. Simply failing to reject a traditional null hypothesis of no difference is not enough. Also, even when the null hypothesis in a non-inferiority test is rejected, the auditor cannot conclude that outcomes for the group of interest are not inferior at all, only that they are not inferior by more than the specified threshold.[13] Statistical tests are used to evaluate precise claims. As such, the results of those tests must also be interpreted precisely.

***

Hypothesis testing can be a valuable tool in AI audits, offering a structured framework for assessing potential issues within AI systems. By using well-established statistical methods, auditors can evaluate the strength of the evidence supporting a hypothesis about how an AI system behaves while accounting for various sources of uncertainty.[14] These calibrated assessments enable auditors to draw informed conclusions and provide guidance to companies or third parties, which may include recommending remedies or, in some cases, catalyzing enforcement actions. However, hypothesis testing is not without its challenges — the same limitations that affect its use in empirical sciences also apply in AI auditing. Therefore, auditors and those interacting with them should approach these assessments with a solid understanding of statistical constraints, potential errors, and the practical aspects of data collection. This careful approach will allow for more accurate interpretations and the formulation of robust, evidence-based recommendations that promote the development and deployment of responsible AI systems. 


[1] Kate Glazko et al., “Identifying and Improving Disability Bias in GPT-Based Resume Screening,” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24: The 2024 ACM Conference on Fairness, Accountability, and Transparency, Rio de Janeiro Brazil: ACM, 2024), 687–700, https://doi.org/10.1145/3630106.3658933. [perma.cc/8ZBC-E3G2]

[2] Merlin Stein and Connor Dunlop, “Safe beyond Sale: Post-Deployment Monitoring of AI,” Ada Lovelace Institute (blog), June 28, 2024, https://www.adalovelaceinstitute.org/blog/post-deployment-monitoring-of-ai/. [perma.cc/4WV8-ZW3H]

[3] Miranda Bogen, “Assessing AI: Surveying the Spectrum of Approaches to Understanding and Auditing AI Systems” (Center for Democracy and Technology, 2025), https://cdt.org/insights/assessing-ai-surveying-the-spectrum-of-approaches-to-understanding-and-auditing-ai-systems/.

[4] Sarah H. Cen and Rohan Alur, “From Transparency to Accountability and Back: A Discussion of Access and Evidence in AI Auditing,” in Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ’24: Equity and Access in Algorithms, Mechanisms, and Optimization, San Luis Potosi Mexico: ACM, 2024), 1–14, https://doi.org/10.1145/3689904.3694711. [perma.cc/8F57-EL95]

[5] Code to reproduce the simulation can be found at: https://github.com/amywinecoff/ml-teaching/blob/main/audit_simulation.ipynb. [perma.cc/7FTM-2Y9J]

[6] It’s important to note that in a one-sided test, evidence within a sample showing that the algorithm selected women more often would not be ignored. Instead, it would simply be interpreted as not providing support for the alternative hypothesis that men are recommended more frequently than women.

[7] Patterns in samples that contradict the expected effect direction cannot be assessed with a one-directional test. For this reason, researchers usually reserve one-directional tests for situations where there is a strong theoretical, empirical, or legal justification.

[8] This does not guarantee that an auditor would find an effect 100% of the time, merely that in our simulation, we correctly rejected the null hypothesis in 100 out of 100 samples. If we run the simulation 1,000 times, we correctly reject the null hypothesis 996 times. 

[9] Elizabeth Anne Watkins and Jiahao Chen, “The Four-Fifths Rule Is Not Disparate Impact: A Woeful Tale of Epistemic Trespassing in Algorithmic Fairness,” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24: The 2024 ACM Conference on Fairness, Accountability, and Transparency, Rio de Janeiro Brazil: ACM, 2024), 764–75, https://doi.org/10.1145/3630106.3658938. [perma.cc/YG66-2STY]

[10] For relatively straightforward statistical tests, software such as G*Power (https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower.html [perma.cc/4BC8-R3HA]) and code libraries such as Python’s statsmodels (https://www.statsmodels.org/ [perma.cc/8TEH-WUJK]) can offer sample size estimates. However, these packages may be less reliable for more complex experimental designs.

[11] Arvind Narayanan, The Limits of the Quantitative Approach to Discrimination, 2022 James Baldwin Lecture (Princeton University, 2022), https://www.cs.princeton.edu/~arvindn/talks/baldwin-discrimination/. [perma.cc/ST5X-SWYQ]

[12] Daniël Lakens, “Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses,” Social Psychological and Personality Science 8, no. 4 (May 2017): 355–62, https://doi.org/10.1177/1948550617697177. [perma.cc/JT84-74FZ]

[13] Jennifer Schumi and Janet T Wittes, “Through the Looking Glass: Understanding Non-Inferiority,” Trials 12, no. 1 (December 2011): 106, https://doi.org/10.1186/1745-6215-12-106. [perma.cc/E6FF-2NTN]

[14]  We note that hypothesis testing can be a valuable tool for evaluating statistical claims about system behavior; however, conclusions drawn from sample data do not necessarily imply intrinsic properties of the system. For example, if an auditor finds that an algorithm likely shows gender disparity in its recommendations, this does not necessarily indicate that the system is inherently “gender biased” in a more abstract sense. If the auditor chose different definitions of gender or gender disparity, or if the system were evaluated within a distinctly different population (e.g., the U.S. versus Japan), the hypothesis test might yield a different pattern of evidence.

The post Hypothesis Testing for AI Audits appeared first on Center for Democracy and Technology.

Adopting More Holistic Approaches to Assess the Impacts of AI Systems https://cdt.org/insights/adopting-more-holistic-approaches-to-assess-the-impacts-of-ai-systems/ Thu, 16 Jan 2025 05:01:00 +0000 https://cdt.org/?post_type=insight&p=106954
by Evani Radiya-Dixit, CDT Summer Fellow

As artificial intelligence (AI) continues to advance and gain widespread adoption, the topic of how to hold developers and deployers accountable for the AI systems they implement remains pivotal. Assessments of the risks and impacts of AI systems tend to evaluate a system’s outcomes or performance through methods like auditing, red-teaming, benchmarking evaluations, and impact assessments. CDT’s new paper published today, “Assessing AI: Surveying the Spectrum of Approaches to Understanding and Auditing AI Systems,” provides a framework for understanding this wide range of assessment methods; this explainer on more holistic, “sociotechnical” approaches to AI impact assessment is intended as a supplement to that broader paper.

While some have focused primarily on narrow, technical tests to assess AI systems, academic researchers, civil society organizations, and government bodies have emphasized the need to consider broader social impacts in these assessments. As CDT has written about before, AI systems are not just technical tools––they are embedded in society through human relationships and social institutions. The OMB guidance on agency use of AI and the NIST AI Risk Management Framework seem to recognize the importance of social context, including through policy mandates and recommendations for evaluating the impact of AI-powered products and services on safety and rights.

Many practitioners use the term “sociotechnical” to refer to these human and institutional dimensions that shape the use and impact of AI. Researchers at DeepMind and elsewhere have recommended frameworks that help envision what this more holistic approach to AI assessment can look like. These frameworks consider a few layers: First, assessments at the technical system layer focus on the technical components of an AI system, including the training data, model inputs, and model outputs. Some technical assessments can be conducted when the application or deployment context is not yet determined, such as with general-purpose systems like foundation models. But since the impact of an AI system can depend on factors like the context in which it is used and who uses it, evaluations focused on the human interaction layer consider the interplay between people and an AI system, such as how AI hiring tools transform the role of recruiters. And beyond this, an AI system can impact broader social systems like labor markets on a larger scale, requiring attention to the systemic impact layer. Assessments of the human interaction and systemic impact layers, in particular, require understanding the context in which an AI system is or will be deployed, and are critical for assessing systems built or customized for specific purposes. Importantly, these three layers are not neatly divided, and social impacts often intersect multiple layers.

To illustrate how these three layers can be applied in a tangible context, we consider the example of facial recognition. Clearly a rights-impacting form of AI, this example usefully demonstrates how social context can be incorporated in technical assessments, while also highlighting the limitations of technical assessments in addressing broader societal impacts.

The Need for More Holistic Approaches

Current approaches for assessing the impacts of AI systems often focus on their technical components and rely on quantitative methods. For example, audits that evaluate the characteristics of datasets tend to use methods like measurement of incorrect data and ablation studies, which involve altering aspects of a dataset and measuring the results. Initial industry efforts towards more holistic approaches to assess AI’s impacts have often involved soliciting and crowdsourcing public input. For example, OpenAI initiated a bug bounty program and a feedback contest to better understand the risks and harms of ChatGPT. While these efforts help prevent technical assessments from being overly driven by internal considerations, they still raise questions about who is included, whether participants are meaningfully involved in decision-making processes, and whether broader harms like surveillance, censorship, and discrimination are being considered in the public feedback process.

Given the limits of narrow evaluation and feedback methods, we emphasize the role of mixed methods––incorporating both qualitative and quantitative approaches––across different layers of assessment. While quantitative metrics can be useful for evaluating AI systems at scale, they risk oversimplifying and missing nuanced notions of harms. In contrast, qualitative assessments can be more holistic, although they may require more resources. 

Graphic of a table, showing examples of a Quantitative Assessment vs. Qualitative Assessment.

As indicated in the table above, practitioners should actively consider social context across each layer and center marginalized communities most impacted by AI systems to ensure that assessments address the systemic inequities these communities face. These considerations can be strengthened through participatory methods that involve users and impacted communities in decision-making processes over how AI systems are evaluated.

To make these approaches actionable for practitioners, below we outline an array of methods to better assess and address the impacts of AI systems, along with examples of assessments that use these methods.

1. Incorporate social context and community input into evaluations of AI’s technical components

Evaluating an AI system requires not only analyzing its technical components but also examining its impact on people and broader social structures. Traditional assessments often narrowly evaluate impacts at the technical system layer like accuracy or algorithmic bias, relying on quantitative metrics pre-determined by researchers and practitioners. However, even when conducting a technical assessment, there are opportunities to consider the social dimensions of the technical components and decisions that shape the AI system.

By integrating context about social and historical structures of harm, researchers and practitioners can better identify which impacts to evaluate –– such as a more nuanced notion of bias –– and determine the appropriate quantitative or qualitative methods for assessing those impacts. In the case of facial recognition tools, understanding how social structures often privilege cisgender men can inform an analysis of how these tools operationalize gender in a cis-centric way, treating it as binary and tied to physical traits. While many quantitative analyses of facial recognition technology focus narrowly on comparing performance between cis men and cis women to assess gender bias, one study conducted a mixed methods assessment of how this technology performed on transgender individuals and their experiences with the technology. This example shows that more holistic perspectives can be integrated even in technical assessments.

Input from affected communities can also be incorporated to identify what aspects of an AI system are most relevant to consider in a technical evaluation. For example, through a participatory workshop, one study identified harms that AI systems pose to queer people, such as data misrepresentation and exclusionary data collection, which can inform technical assessments that delve deeper into these harms and consider the lived experiences of queer people. Organizations advocating on behalf of communities –– such as Queer in AI and the National Association for the Advancement of Colored People (NAACP) –– can offer valuable input on which impacts to evaluate, without overburdening individual community members. (At the same time, neither organizations nor individual members fully represent the entire community, and affected communities should not be treated as monoliths. And it is critical to remember that affected communities include not only those impacted by AI’s outputs, but also those involved in its inputs and model development, such as data workers who produce and label data.)

In the case of facial recognition, traditional assessments use metrics like false positive rates to measure the technology’s performance. However, civil rights organizations such as Big Brother Watch offer community input that these metrics can be misleading and suggest practitioners look to more nuanced metrics like precision rates across demographic groups to better understand how the technology impacts different communities. (False positive rates measure the number of errors relative to the total number of people scanned, which can result in a misleadingly low error rate when facial recognition is used to scan large crowds. In contrast, precision rates assess errors against the number of facial recognition matches, providing a clearer picture of the technology’s accuracy.)
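The arithmetic behind that caution is straightforward. In the sketch below, every number is hypothetical and chosen only for illustration, not drawn from any real deployment; even so, it shows how a false positive rate can look reassuringly tiny while most of the system's matches are still wrong.

```python
# False positive rate vs. precision for a hypothetical watchlist scan of a crowd.
# All numbers are invented for illustration.
crowd_size = 50_000          # faces scanned at an event
on_watchlist = 20            # people in the crowd who are actually on the watchlist

true_positives = 14          # correct matches (the system catches 70% of them)
false_positives = 50         # innocent passers-by incorrectly flagged

false_positive_rate = false_positives / (crowd_size - on_watchlist)
precision = true_positives / (true_positives + false_positives)

print(f"False positive rate: {false_positive_rate:.2%}")  # about 0.10%, which sounds tiny
print(f"Precision of matches: {precision:.0%}")           # about 22%, so most flags are wrong
```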

Evaluating facial recognition models could also involve input from individuals whose data was used in training. A qualitative assessment might focus on how they were given the agency to provide informed consent, while a quantitative assessment might estimate the percentage of facial images in a dataset collected without consent. Such assessments are important, especially as companies seek to diversify their datasets, which has led to ethically questionable practices like Google reportedly collecting the images of unhoused people without their informed consent to improve the Pixel phone’s face unlock system for darker-skinned users.

Of course, these examples illuminate the limits of a technical assessment, as they do not capture the many significant harms of facial recognition systems and related technologies, including their role in overpolicing and oversurveilling Black and brown communities. So while social context can be more deeply incorporated in technical assessments, this does not negate the need to consider the broader impact of AI on people and social structures.

Methods for considering social dimensions in technical assessments

Literature reviews can be used to incorporate context about social structures of harm into a technical assessment. For example, this quantitative evaluation of racial classification in multimodal models was grounded in a qualitative and historical analysis of Black studies and critical data studies literature on the dehumanization and criminalization of Black bodies. Consistent with this literature, the evaluation found that larger models increasingly predicted Black and Latino men as criminals as the pre-training datasets grew in size. Another example is this evaluation of the ImageNet dataset, informed by a literature review of the critiques of the dataset creation process. The evaluation examined issues of privacy, consent, and harmful stereotypes and uncovered the inclusion of pornographic and non-consensual images in ImageNet. (Literature reviews can also be helpful when evaluating a technical system with respect to large-scale societal impacts. For example, to evaluate the environmental costs of AI systems, this article reviews existing tools for measuring the carbon footprint when training deep learning models.)

Technical assessments can be co-designed with impacted and marginalized communities using processes like Participatory Design, Design from the Margin, and Value Sensitive Design. For example, one study conducted community-based design workshops with older Black Americans to explore how they conceptualize fairness and equity in voice technologies. Participants identified cultural representation –– such as the technology having knowledge about Juneteenth or Black haircare –– as a core component of fairness, while also expressing concerns about disclosing identity for representation. This work could inform a co-designed assessment of how voice technologies represent the diversity of Black culture and how much they learn about users’ identities. Another study used participatory design workshops to broadly examine the perceptions of algorithmic fairness among traditionally marginalized communities in the United States, which could serve as a foundation for co-designing evaluation metrics.

Social science research methods like surveys, interviews, ethnography, focus groups, and storytelling can be used to center the lived experiences of impacted communities when evaluating technical components like model inputs and outputs. Research has shown that surveys on AI topics often decontextualize participant responses, exclude or misrepresent marginalized perspectives, and perpetuate power imbalances between researchers and participants. To move towards more just research practices, surveys should be co-created with impacted communities, and qualitative methods with carefully chosen groups of participants should be adopted. For example, one study used focus groups with participants from three South Asian countries to co-design culturally-specific text prompts for text-to-image models and understand their experiences with the generated outputs. The study found that these models often reproduced a problematic outsider’s gaze of South Asia as exotic, impoverished, and homogeneous. Another study involved professional comedians in focus groups to evaluate the outputs of language models for comedy writing, focusing on issues of bias, stereotypes, and cultural appropriation. Additionally, at a FAccT conference tutorial, Glenn Rodriguez, who was formerly incarcerated, used storytelling to illuminate how an input question to the COMPAS recidivism tool –– which asks an evaluator if the person appears to have “notable disciplinary issues” –– could result in the difference between parole release and parole denied.

When gathering community input through the co-design and social science methods discussed above, it is important to conduct a literature review beforehand to understand the histories and structures of harm experienced by affected communities. This desk research helps reduce misunderstandings and enables informed community engagement.

2. Engage with users, impacted communities, and entities with power to evaluate human-AI interactions

To evaluate the interactions between people and an AI system, it is important to engage with the users of the system, communities affected by the system, and entities that hold significant influence over the design and deployment of the system. 

First, researchers and practitioners can examine how users interact with the AI system in practice and how the system shapes their behavior or decisions. In the case of police use of facial recognition technology, a qualitative assessment could investigate whether and how officers modify the images submitted to the technology. In contrast, a quantitative assessment might measure the accuracy of officer verifications of the technology’s output when they serve as the “human in the loop,” given the risk that they may incorrectly view the technology as objective and defer to its decisions.

However, it is important to recognize that the users of an AI system are not always the communities impacted by the system. For instance, police use of facial recognition in the U.S. often disproportionately harms Black communities, who have been historically oversurveilled. To understand this broader impact of the technology on people, one study used a mixed methods approach to examine how impacted communities in Detroit perceived police surveillance technologies. Another assessment might examine the technology’s impact on encounters that Black activists have with police, as seen in the case of Derrick Ingram, who was harassed by officers after being targeted with the technology at a Black Lives Matter protest.

Moreover, just as researchers and practitioners can uncover how communities are impacted by an AI system, they can also “reverse the gaze” by examining the entities that hold power over the system. In the case of facial recognition, one might examine where police deploy the technology and their decision-making processes that shape deployments. For instance, Amnesty International’s Decode Surveillance initiative mapped the locations of CCTV cameras across New York City that can be used by the police. Their quantitative and qualitative analysis revealed that areas with higher proportions of non-white residents had a higher concentration of cameras compatible with facial recognition technology.

Methods for holistic assessments of human-AI interactions

Human-computer interaction (HCI) methods like surveys, workshops, interviews, ethnography, focus groups, diary studies, user research, usability testing, participatory design, and behavioral experiments can be used to engage with users of an AI system, impacted communities, and the entities shaping the system. For example, one study conducted an ethnography to examine how users –– specifically, judges, prosecutors, and pretrial and probation officers –– employed risk scores from predictive algorithms to make decisions. Another study assessed child welfare service algorithms using interactive workshops with front-line service providers and families affected by these algorithms. Still another study conducted interviews with financially-stressed users of instant loan platforms in India to investigate power dynamics between users and platforms, possibly influencing Google to improve data privacy for personal loan apps on its Play Store.

Investigative journalism methods like interviews, content and document analysis, and behind-the-scenes conversations with powerful stakeholders are valuable for examining how entities influence or deploy an AI system. When an AI system operates as an opaque black box, legal channels like personal data requests under the California Consumer Privacy Act or public records requests under the Freedom of Information Act can enable access to relevant information about how powerful stakeholders shape and use the system. For example, a public-service radio organization in Germany analyzed whether a food delivery company improperly monitored its riders by creating an opportunity for riders to request the data the company tracks under the European General Data Protection Regulation and then share it with the organization for analysis. Researchers at the Minderoo Center for Technology and Democracy used freedom of information requests and document analysis to examine how UK police design and deploy facial recognition technology. While most commonly used by third-party researchers, these methods are not limited to external actors; internal practitioners working on AI ethics and governance can use similar methods to assess how research and product teams design AI systems before they launch.

3. Evaluate AI’s impact on social systems and people’s rights with specific objectives to enable accountability

Assessing the impact of an AI system requires considering not only how different groups of people interact with it but also its role within broader social and legal contexts. Important values such as privacy and equity are embedded in legal systems, and evaluating a technology’s impact on people’s rights can support advocacy and policy efforts. In the case of facial recognition, one might qualitatively examine the technology’s impact on the rights to free expression, data protection, and non-discrimination, such as protections codified in the First Amendment, the California Consumer Privacy Act, and the Civil Rights Act in the U.S.

It is also important to consider the impact of AI on social systems, like mass media, the environment, labor markets, political parties, educational institutions, and the criminal legal system, as well as effects on social dynamics like public trust, cultural norms, and human creativity. In the context of facial recognition, for example, a broader assessment might examine how community safety and public trust in institutions are impacted when this technology is adopted more widely, not only by the criminal legal system but also by schools, airports, and businesses.

For such assessments of broader impacts to be meaningful and support holding AI actors accountable, they should be designed with specific objectives and outcomes in mind. For example, an assessment of facial recognition might focus on its impact on the right to free expression and target entities that shape governance around the technology, such as the U.S. Government Accountability Office. Moreover, the assessment should aim for a concrete outcome, like determining whether the use of facial recognition meets specific legal standards, rather than producing a broad, open-ended list of legal concerns. Although specificity is often associated with technical evaluations, research has identified that when evaluations of broader impacts are made specific, they can prompt stakeholders to take action, help advocates cite concrete evidence, and enable more precise and actionable policy demands.

When an assessment is made specific, it is important to prioritize the most relevant systemic impacts. For example, one might focus on facial recognition’s effect on free expression since surveillance can significantly inhibit political dissent, which is vital for social justice movements. To operationalize this investigation, one could evaluate how the presence of the technology at protests affects activists’ participation or how the application of the technology online affects the use of social media for activism.

Methods for considering broader societal impacts in assessments

Social science research methods like surveys, forecasts, interviews, experiments, and simulations can be used to evaluate the impact of AI on social systems and dynamics. For example, one study analyzed the chilling effect of peer-to-peer surveillance on Facebook through an experiment and interviews. Another assessment used simulation to examine how predictive algorithms in the distribution of social goods affect long-term unemployment in Switzerland. To understand the environmental impact of AI systems, one study estimated the carbon footprint of BLOOM, a 176-billion parameter language model, across its lifecycle, while another argued that assessments should focus on a specific physical geography to highlight impacts on local communities and shape local actions that can advance global sustainability and environmental justice.

Legal analysis is a useful method for assessing the legal compliance of an AI system’s design and usage. This method involves examining how the AI system may infringe upon rights by reviewing relevant case law, legislation, and regulations. For example, one audit evaluated Facebook’s ad delivery algorithm for compliance with Brazilian election laws around political advertising. Another study examined the London Metropolitan Police Service’s use of facial recognition with respect to the Human Rights Act 1998, finding that the usage would likely be deemed unlawful if challenged in court.

Power mapping can be used to identify target entities and design assessments that foster accountability. This method can help identify what will motivate influential individuals and institutions to take action. For example, the Algorithmic Ecology tool mapped the ecosystem surrounding the predictive policing technology PredPol, outlining PredPol’s impact on communities and identifying key actors across sectors who have shaped the technology. The Algorithmic Ecology tool has been crucial for understanding the extent of PredPol’s harms, challenging its use, and offering a framework that can be applied to other technologies.

Not All Assessments Are Created Equal

We discuss a range of approaches to assess the impacts of a given AI system –– at the technical system layer, the human interaction layer, and the systemic impact layer. However, efforts across these layers may not necessarily carry equal weight in every context, and researchers and practitioners should prioritize certain layers based on the specific AI system being assessed. The greater the system’s potential to affect people’s rights, the more critical it is to consider its impact on users, communities, and society at large. 

For example, an assessment of police use of facial recognition should center its role in oversurveilling and overpolicing communities of color, rather than focusing narrowly on its performance on communities of color, which can result in technical improvements that perfect it as a tool of surveillance. In contrast, an assessment of a voice assistant like Siri, which may pose a lower immediate risk, could initially focus on the technical system. Yet, the social dimensions are still crucial to consider at this layer. For instance, informed by an understanding of the dominance and enforcement of standardized American English, practitioners might explore how the voice assistant performs on African American Vernacular English and whether it excludes or misunderstands Black American speakers.

By prioritizing certain kinds of assessments, we can not only gain a deeper understanding of the impacts of AI technology, but also shape decisions around its design and deployment, and identify red lines where we may not want the technology to be developed or deployed in the first place. Additionally, by assessing AI systems that have real-world influence, we can draw attention to their actual, everyday impacts rather than hypothetical concerns.

Our recommendations consider AI technology not merely as a technical tool, but as a system that both shapes and is shaped by people and social structures. Understanding these broader impacts requires a diverse set of methods that are appropriate for the specific AI system being assessed. Thus, we encourage researchers and practitioners to adopt more holistic methods and urge policymakers to support and incentivize these approaches in AI governance. Moreover, we hope this work fosters the development of assessments that scrutinize systems of power and ultimately uplift the communities most impacted by AI.

Acknowledgements

Thank you to Miranda Bogen and Ozioma Collins Oguine for valuable feedback on this blog post. We also acknowledge the Partnership on AI’s Global Task Force for Inclusive AI Guidelines for insights on participatory approaches to understanding the impacts of AI systems.

The post Adopting More Holistic Approaches to Assess the Impacts of AI Systems appeared first on Center for Democracy and Technology.

Beyond High-Risk Scenarios: Recentering the Everyday Risks of AI https://cdt.org/insights/beyond-high-risk-scenarios-recentering-the-everyday-risks-of-ai/ Tue, 22 Oct 2024 16:58:27 +0000 https://cdt.org/?post_type=insight&p=106055
This blog was authored by Stephen Yang, CDT Summer Fellow for the CDT AI Governance Lab.

Developers and regulators have spent significant energy addressing the highly consequential –– often life-altering –– risks of artificial intelligence (AI), such as how a misdiagnosis of an AI-powered clinical decision support system could be a matter of life and death.

But what about the more prosaic scenarios where AI causes harm? An error in an AI transcription system may lead to complications in insurance reimbursement. An average person’s likeness may be co-opted to peddle a subpar product or scam. A service chatbot may misunderstand a prompt and fail to process requests as intended. These harms may not immediately or dramatically change people’s lives, but their impacts could nonetheless harm people’s well-being in appreciable ways.

Today’s risk-based AI governance frameworks would likely deem these types of scenarios as “low risk.” For instance, the EU AI Act, which recently entered into force, would consider these situations to pose “limited risks” and receive lower levels of scrutiny compared to “high-risk” contexts (such as when an AI system is used to determine a family’s “riskiness” for child welfare concerns). While high-risk scenarios are prioritized for the most thorough risk management process, low-risk instances are generally deprioritized, subject to more relaxed regulatory standards and a higher tolerance for such risks to manifest.

At first glance — and particularly when considering the severity of discrete events — AI systems that pose these sorts of lower risks may appear relatively inconsequential. But if we consider their aggregated effects on the societal level and their compounded effects over time, these risks may not be as mundane as they initially appear. To safeguard us from the risks of AI, AI practitioners and policymakers must give serious attention to the prevalence of AI’s everyday harms.

Emerging Frameworks Prioritize Perilous Scenarios Over Everyday Realities

To effectively apply safeguards to AI, risk governance frameworks matter. They not only define which risks of AI are pertinent and urgent, but also which are trivial and permissible. Their metrics for prioritization shape how practitioners and policymakers triage their risk management processes, regulatory standards, and tolerance levels.

From the EU AI Act to the AI Risk Management Framework by the National Institute of Standards and Technology (NIST), regulatory frameworks call for joint consideration of the severity and probability of harm in determining their priority level. In the EU AI Act, for example, “high-risk” refers to scenarios that can cause considerable harm and are somewhat probable in the present and foreseeable future — such as the risks of discrimination by AI systems that make decisions for job hiring, determine loaning decisions, or predict the likelihood of committing a crime.

As these frameworks determine severity levels on the basis of individual occurrences or events, seemingly mundane risks of AI often fall through the cracks. For instance, these frameworks are likely to deem the risks of semantic errors in customer service chatbots as trivial, given that a single incident of such an error is unlikely to have consequential effects on people’s lives. Yet this overlooks how the aggregated harms of such errors may in some cases be just as –– if not more –– severe than what these frameworks deem as “high risk.”

Meanwhile, commercial providers that develop frontier AI technologies, such as OpenAI, Anthropic, and Google DeepMind, are beginning to put greater emphasis on severity over probability. For instance, Anthropic’s Responsible Scaling Policy (RSP) and OpenAI’s Preparedness Framework focus on mitigating risks associated with doomsday scenarios where AI contributes to existential risks, such as pandemics and nuclear wars.

These frameworks may lead frontier AI developers to overlook risks that are not necessarily catastrophic but nonetheless consequential, especially when viewed in the aggregate. For instance, the risks of cheap fakes –– manipulated content that doesn’t appear realistic yet can still sway public opinion –– would seem to fall under the lowest risk level of OpenAI’s framework despite their ability to distort the perceptions and potential actions of many people.

The increasing emphasis on the severity level of discrete situations, especially when focused on hypothetical scenarios, risks diverting attention from some of AI’s more common harms and leaving them without adequate investment from practitioners and policymakers.

The Everyday Harms of AI Systems (That We Already Know)

Many of the risks posed by commonplace AI systems like voice assistants, recommender systems, or translation tools may not seem particularly acute when considering discrete encounters on an individual level. However, when considering their aggregated effects on the societal level and their compounded effects over time, the consequences of these risks are much more significant.

1. Linkage Errors Across Databases

Many contemporary AI systems draw on, write to, or otherwise work across multiple systems and databases. This is increasingly true as organizations adopt techniques like retrieval augmented generation (RAG) in order to mitigate risks such as model hallucination. Linkage errors — or missed links between records that relate to the same person or false links between unrelated records — have long proven prevalent in data processing and can have considerable downstream consequences. 

Take, for instance, an AI system responsible for processing hospital admission records. If linkage errors occur within such an AI system, a patient could fail to receive alerts about relevant updates to their case in a timely manner, leading them to miss out on important medical care or be overcharged for services and receive staggering medical bills. Yet, such linkage errors are not necessarily considered to be high risk.

Linkage errors also occur at the population level, where they lead to misclassification and measurement errors in marginalized populations. This is a known problem for organizations like the U.S. Census Bureau and systems like electronic health records (EHRs), and such errors have been shown to disproportionately affect undocumented immigrants, incarcerated people, and unhoused people. When missing or erroneous links go unnoticed in large-scale analyses, decision-makers may unknowingly deprioritize the interests of already marginalized groups. For instance, as undocumented immigrants are often absent or misclassified in digital records, the welfare needs in immigrant neighborhoods may be overlooked in municipal planning over time. Systems most likely to suffer from such errors may not be explicitly classified as higher risk, but the impact of these errors can be profound.

2. Inaccurate Information Retrieval and Summarization

When AI systems serve the function of information retrieval and summarization, semantic errors that misinterpret meanings are commonplace. These errors can occur when systems misunderstand the information they process, or perceive spurious patterns that are misleading or irrelevant to the task (sometimes known as hallucinations). As a result, these systems may misrepresent meanings in their outputs, or create outputs that are nonsensical or altogether inaccurate.

Such errors can significantly undermine people’s trust in the deployers of these systems –– from commercial actors to public services to news providers. Take, for example, an AI-powered sales assistant that gave misleading advice on an airline’s refund policies, contradicting the company’s actual terms. Scenarios like this can affect both consumers and businesses negatively. Not only could an incident like this lead to lengthy (and oftentimes legal) disputes, but the repeated occurrence of such errors can erode an organization’s reputation and cause significant financial losses for consumers and businesses alike over time.

Furthermore, semantic errors can transform into acute risks when the information they present ends up playing a role in informing consequential decisions, from border control to legal verdicts. This could occur when a general-purpose model, which may not have been designed or tested for deployments in high-risk scenarios, is integrated into decision-making systems.

This risk is distinct from the errors that occur when an AI system explicitly makes or assists with a highly consequential decision — say, when it recommends whether a job candidate should be hired or not. Instead, these are errors that might occur when an AI system that summarizes an applicant’s resume misinterprets the document. Even though the summarization system itself may not directly determine hiring decisions, the inaccurate summarization it provides could lead human decision-makers to make ill-informed judgments.

3. Reduced Visibility in Platformized Environments

Recommender systems have frequently been shown to reduce the visibility of online content from marginalized groups. As online platforms that leverage such systems increasingly mediate everyday facets of people’s lives, visibility reduction risks deterring certain people from fully participating in public culture and accessing economic opportunities.

Visibility reductions generally occur when recommender systems predict that a piece of content violates a platform’s policies, which then triggers content moderation interventions. If a piece of content is erroneously predicted to violate a policy, it is likely to be unjustly downranked –– or outright banned –– in platformized environments.

Such interventions are colloquially known as “shadowbanning,” a term for when users are “uninformed of the nature, reason, and extent of content moderation.” For users who depend on visibility and reach for their careers and income, such interventions can cause financial harm. If errors are distributed disproportionately across a population, this can lead to discriminatory or otherwise harmful outcomes: recommender systems have been reported to disproportionately reduce the visibility of content by creators who are women, Black, queer, plus-sized, transgender, and/or from the majority world.

On a societal level, communities whose visibility is repeatedly reduced also find themselves systematically excluded from public conversations on the Internet. Individually, people from marginalized backgrounds often experience feelings of sadness and isolation when they cannot connect with people like themselves on the Internet. Even if recommender systems are not operating in a domain that has been characterized as high-risk, these effects can compound significantly over time.

4. Quality-of-Service Harms

Errors in AI applications lower the quality of service for chatbots and voice assistants, which are increasingly becoming people’s gateway to all sorts of services, from customer support to public service assistance. In addition to simple linguistic errors like grammar, spelling, or punctuation mistakes, semantic errors can lead to major misunderstandings of meanings. Previous research has shown that these systems perform considerably less well for non-English speakers, people of color, people who speak English with a “non-standard” accent, and non-native English speakers.

The impact of high error rates is a form of quality-of-service harm. At a minimum, repeated encounters with such errors mean marginalized users have to expend extra effort and time to reap the benefits of AI applications. At worst, the low quality could prevent people from taking advantage of AI-powered products or important opportunities they mediate. Oftentimes, this deters marginalized populations from accessing services that were set up to empower them in the first place. This sort of error will be critical to keep in mind as a growing number of public services turn to chatbots and voice assistants as their primary –– or even default –– interfaces.

5. Reputational Harms

From recommender systems to generative models, AI systems pose the risk of depicting regular people in undesirable or harmful ways. Particularly when the depictions do not implicate characteristics protected under anti-discrimination laws, these grey-area representational harms can end up falling through the cracks.

For instance, when someone searches for information about an individual through a search engine, the first few search results can play a crucial role in determining impressions of the person. When a search for a certain name leads to criminal history and mugshot websites, it paints a negative picture. Since there is no legal requirement for this information to be accurate or timely, the persistence of inaccurate or out-of-date information risks wrongly harming people’s reputations, which can indirectly affect their access to important opportunities. 

Meanwhile, the wide availability of generative models makes it faster and easier to abuse the likeness of others without their consent. Anyone with photos on the Internet may be subjected to the misuse of AI likeness. For instance, to cut down on the cost of paying for models, digital advertisers may instead create AI likenesses based on images of people found online –– oftentimes without permission. Beyond the lack of financial compensation to real models who would otherwise be hired, and to those whose faces and voices are leveraged without consent, such likenesses may also cause harm by portraying people in an undesirable light. In one case, a person’s likeness was manipulated to discuss deeply sensitive matters through the abuse of AI likeness to sell health supplements. These likenesses don’t even have to be photo-realistic to cause harm; they just have to be realistic enough. The misuse of AI likeness often occurs repeatedly for the same people and disproportionately falls on women.

Despite the prevalence of such harms, it can currently be difficult for everyday people to track down evidence of them, and even when adequate interventions do happen, they often take place long after the harm has already occurred.

Beyond Severity: Rethinking the Goalposts of AI Risk Governance

Given how risk-based frameworks focus the priorities of both policymakers and practitioners on “high-risk” scenarios, AI’s everyday harms are increasingly likely to fall through the cracks. Practitioners and policymakers alike are shifting focus to severe scenarios, with frontier AI developers putting even greater emphasis on severity over probability and frequency.

To effectively and responsibly triage time and resources for AI risk governance, we must move beyond the view that puts greater and greater distance between “high-risk” and “low-risk” scenarios; severity should not be the only metric for prioritization. A more pragmatic orientation of risk governance must also take the prevalence of risks into serious consideration. From shadowbanning on social media platforms to low-quality chatbot service for non-English speakers, AI risk governance frameworks must encourage due attention to the everyday risks of AI that are already creating concrete harms.

While these so-called “lower-risk” scenarios may not appear severe as discrete events, their aggregated impact on the societal level and their immediacy mean that organizations should nevertheless invest in mitigating their risks. 

To prepare for the everyday risks of AI systems, practitioners must be ready to act on their negative residual risks, or risks that remain unmitigated despite the safety measures in place. To do so, AI practitioners need safety infrastructures that can systematically monitor and rapidly respond to these risks as they manifest as harm. Policymakers can even mandate that such infrastructures should always accompany the deployment of AI systems.

For instance, to address semantic errors in chatbot services, organizations may consider post-deployment monitoring, with particular attention to scenarios where users get stuck or encounter a failure mode. When these scenarios take place, organizations can immediately connect these users to human operators for troubleshooting. Not only will this make it possible for organizations to address incidents of harm in a more timely manner, but over time, they will also be able to better anticipate and account for potential vulnerabilities once they have a better understanding of how such harms manifest on the ground.
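As a rough sketch of what such an escalation path might look like in code, the example below tracks failure signals in a chatbot session and hands the user to a human operator once a threshold is crossed. The signal labels, threshold, and handoff mechanism are all hypothetical rather than drawn from any particular product.

```python
# Hypothetical post-deployment monitoring hook for a service chatbot:
# escalate to a human operator after repeated signs that a user is stuck.
from dataclasses import dataclass, field
from typing import Optional

FAILURE_SIGNALS = {"fallback_intent", "repeated_question", "user_frustration"}
MAX_FAILURES = 2  # hypothetical tolerance before handing off

@dataclass
class SessionMonitor:
    failure_count: int = 0
    events: list = field(default_factory=list)

    def record_turn(self, detected_signal: Optional[str]) -> str:
        """Log each turn and decide whether to route the user to a human."""
        self.events.append(detected_signal)  # retained for later harm analysis
        if detected_signal in FAILURE_SIGNALS:
            self.failure_count += 1
        if self.failure_count > MAX_FAILURES:
            return "handoff_to_human"
        return "continue_with_bot"

# Example: two failure signals are tolerated; the third triggers a handoff.
monitor = SessionMonitor()
for signal in [None, "fallback_intent", "repeated_question", "fallback_intent"]:
    print(monitor.record_turn(signal))
```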

The everyday risks of AI should not be an afterthought; they deserve far greater priority than they currently receive. Presently, AI’s everyday harms are disproportionately affecting the most vulnerable populations, from unhoused people to non-English speakers, to formerly incarcerated people. Framing harms as “low-risk” means neglecting the harms that AI systems continue to inflict on these populations on a day-to-day basis.

The post Beyond High-Risk Scenarios: Recentering the Everyday Risks of AI appeared first on Center for Democracy and Technology.
