WHAT is validity evaluation?

Validity is the "degree to which evidence and theory support the interpretations of the test scores entailed by proposed uses of the test" (AERA, APA, & NCME, 1999, p. 9). Validity evaluation is a research-based process of articulating, testing, and evaluating the argument underlying an assessment, including the claims and assumptions that must hold true to support the interpretation and use of assessment scores. The results of a validity evaluation should provide evidence that confirms, refines, or refutes these claims or assumptions. The purpose of validity evaluation is not to judge the validity of the test itself, "only the validity of the inferences from the scores in the context of specified purposes and uses" (Perie & Forte, 2011).

WHY evaluate the validity of English Language Proficiency Assessments (ELPAs)?

States may want to engage in validity evaluation of an ELPA system for several reasons. Most practically, states may want to begin to build a case for federal peer review. Although the US Department of Education (the Department) does not currently require states to provide evidence supporting the validity of their ELPA scores and score uses, it is possible that the Department may start to require such documentation in the future. More importantly, ELPA scores are also used in ways that have real consequences for students, teachers, and schools. A careful validity evaluation can determine whether scores provide the right kind of information to serve as the basis for important decisions and can identify the intended or unintended consequences of using ELPA scores for student placement, retention, and exit, as well as for teacher evaluation or district-level program decisions (see more on ELPA policy).

HOW does one evaluate the validity of an ELPA?

The validity evaluation research process should result in both thoughtful consideration about the theory of action that frames the assessment system and evidence to help determine whether a state's ELPA scores, as well as their various uses, are meaningful and appropriate. The process itself has five major steps:

Articulate the Theory of Action for the assessment system

Identify and articulate the purposes, goals, and guiding philosophy of the assessment system (Theory of Action)
The first step in validity evaluation is to articulate the guiding philosophy behind the ELPA system. What are the intended goals or outcomes of the ELPA? What do we think scores tell us about students? About ELP programs? About teachers or schools? How do we act on the information? In this stage, states can begin to draft a logic model based on assumptions about how the entire assessment system must function in order for scores to provide meaningful evidence about student achievement for intended purposes and uses. For the scores to be valid for these intended uses, what are all of the components of the assessment system, from identification, to curriculum and teaching, to test administration that must function as intended?

Identify key claims and priorities for evaluation

Identify key claims and priorities for evaluation; create an interpretive argument.
Once states have identified the purposes, goals, and guiding philosophies that frame the ELPA system, the next step is to identify the specific claims or issues that are critical for your state. Stakeholders across a state may provide important information about aspects of the ELPA system that deserve closer scrutiny. States may consider holding focus groups with teachers, test coordinators, principals, or district administrators to gather input about their perceptions of and experiences with the ELPA and ELP programs. These stakeholders may bring up new issues to pursue or they may reinforce assumptions about which aspects of the assessment system pose a particular concern.

The key claims, assumptions, and concerns that states and stakeholders identify can be represented visually as an interpretive argument that illustrates these elements within a conceptual framework. How do these claims and assumptions relate to each other and how do they lead to the assessment system's goals and intended outcomes? What factors must hold true for assessment scores to be both meaningful and useful for program and policy goals? To support states' thinking when creating an interpretive argument, EVEA experts have identified five distinct components within an organizing conceptual framework:

Precursors & Context:

What are some of the conditions that must be in place for the assessment system to function as intended? These may include:
  • Appropriate screening and identification practices
  • Adequate supports and resources for teachers to teach and assess ELs
  • High quality state ELD/P standards

Assessment System:

What aspects of the assessment system must function as intended in order for the interpretation of assessment scores to be valid?
  • Administration and scoring fidelity
  • Alignment to ELD/P standards
  • Well-designed assessment items
  • Well-designed scoring matrix, cut scores, and performance level descriptors

Primary Claims:

How should assessment scores and performance levels be interpreted? What are they supposed to represent?

Score Uses:

How are the scores used? Who uses scores, and for what purposes?

Goals & Outcomes:

What are the intended goals, outcomes, or consequences of the assessment system?

Create an interpretive argument

Together, the claims and assumptions within each component of the conceptual framework create an interpretive argument for the validity of the assessment system. This is the visual representation of the theory of action, organized within an argument-based logic model. The interpretive argument illustrates the inter-relatedness of many aspects of the assessment system: the validity of no one assessment component is sufficient, alone, to ensure the validity of the entire system.

Common Interpretive Argument

Through its work with five states, the EVEA project developed a sample Common Interpretive Argument that represents, broadly, the most common claims and assumptions that form the basis for states' ELPA systems.

See EVEA Common Interpretive Argument.

The most important component of the interpretive argument framework is the goal of the assessment system. Many ELPA systems have dual goals: (1) that EL students become proficient in English, acquiring the academic language skills necessary to participate fully in instructional discourse conducted in English; and (2) that ELP programs meet accountability requirements (i.e., students meet state-defined ELD/P standards and exit from services). For these goals to be realized, a number of other claims and assumptions must hold true, including the primary claim that "ELPA scores/performance levels reflect meaningful differences in students' English language proficiency." Even if this primary claim is supported by evidence, scores must be used appropriately to lead to intended goals and outcomes.

However, for this primary claim to be valid, a number of assumptions about the educational and programmatic context must hold true: students must have been appropriately identified to participate in the ELP program and the ELPA, and ELD/P standards must have been developed to support the acquisition of English language proficiency necessary to achieve academic content and performance expectations. Given these contextual assumptions, the ELPA must also be designed to yield scores that reflect students' knowledge and skills in relation to academic English language expectations defined in the ELD/P standards, and the ELPA must be administered and scored as intended. Finally, in order for the ELPA system to achieve its stated goals, teachers must also have the support and resources to provide instruction and administer assessment to promote students' acquisition of academic English.

Each box in the interpretive argument could be modified to reflect state-specific assumptions or concerns, and additional boxes could be added within each conceptual component of the model. For instance, a state that is concerned about whether teachers have been properly trained to administer the ELPA may choose to modify the claim, "The ELPA is administered and scored as intended," to "Teachers have been properly trained to administer and score the ELPA as intended." States that are concerned about the accessibility of the ELPA for students with disabilities may choose to add a claim within the "assessment system" component that says, "ELs with disabilities receive appropriate accommodations for the ELPA" or "The ELPA is universally designed to mitigate the effects of disabilities on assessment scores." Finally, states that use the ELPA for a goal not specified on the Common Interpretive Argument, such as informing classroom instruction or grouping, should think carefully about the assumptions that must hold true for these uses to be valid.

Design studies to evaluate prioritized claims

Design studies to evaluate prioritized claims; choose studies that may explore alternative explanations, not those that will certainly reinforce current assumptions.
Once the interpretive argument is complete, stakeholders may again inform decisions about which of the claims deserve further examination. Because of finite time and resources, a state's immediate or short-term validity evaluation research agenda often can only encompass a few of the claims or assumptions, while other concerns may need longer- or later-term attention. In consultation with stakeholders such as teachers and administrators, the next step is to prioritize the claims on the interpretive argument and choose a few that will become the basis of systematic study. States should design a research protocol that will produce evidence that could refute or confirm the prioritized claims.

It is important to choose studies that yield evidence that could contradict claims and assumptions about the ELPA system, or that explore alternate explanations for observed phenomena. The purpose of validity evaluation is to test assumptions and produce actionable recommendations. Perie and Forte (2011) call this a "falsification orientation." Studies that will likely confirm prior assumptions may not be worth the time and resources devoted to validity evaluation and cannot be taken as evidence that "proves" any claim or assumption to be true.

Re-evaluate assumptions based on evidence, weigh alternate explanations, implement recommendations, and synthesize to form a validity argument

Re-evaluate assumptions based on evidence, weigh alternate explanations, implement recommendations, and synthesize to form a validity argument.
The studies that states conduct in the validity evaluation process should yield evidence to support, refute, or refine claims in the interpretive argument. In the evaluation process, some studies will uncover new issues or point to areas that deserve further investigation; the claims tested in these studies may need to be dropped or modified based on evidence, or may require further investigation, often after changes in policy or practice. Other studies may find supportive evidence for claims and lend strength to the state's overall argument by providing research-based evidence that certain parts of the system are functioning as intended. Together, all of these findings and evidence ultimately will form the basis of a final validity argument, which is a synthesis of the research-based evaluation and analysis of the original interpretive argument.

Despite being called "final," a validity argument is based on dynamic systems and processes that states must continue to monitor and evaluate to ensure they continue to function as intended. Having gone through the validity evaluation process once, most states will find they have to make changes and conduct further research to determine whether and how problems with certain claims in the interpretive argument can be addressed to strengthen the system's overall validity. The power of a validity evaluation is that it is actionable; it should produce recommendations for policy and practice to improve the validity of inferences from scores. As such, the process of collecting evidence, identifying actionable recommendations, implementing changes to policy or practice, and refining the interpretive argument should be iterative as all states continue to refine high quality assessment systems that contribute to positive outcomes for students.

edCount, LLC
5335 Wisconsin Ave NW Suite 440
Washington DC 20015

Dr. Sara Waring