The reliability of an assessment tool is the extent to which it consistently and accurately measures learning.
The validity of an assessment tool is the extent to which it measures what it was designed to measure.
Reliability
Reliable assessment results will give you confidence that repeated or equivalent assessments will provide consistent results. This puts you in a better position to make generalised statements about a student’s level of achievement, especially when you are using the results of an assessment to make decisions about teaching and learning, or reporting back to students and their parents or caregivers. No results, however, can be completely reliable. There is always some random variation that may affect the assessment, so you should always be prepared to question results.
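The idea of consistency across repeated assessments can be illustrated with a quick calculation. The sketch below (Python with NumPy, using invented scores purely for illustration) estimates test-retest reliability as the correlation between two administrations of an equivalent assessment:

```python
import numpy as np

# Hypothetical raw scores for the same six students on two
# equivalent administrations of an assessment.
first_sitting = np.array([12, 18, 15, 20, 9, 14])
second_sitting = np.array([13, 17, 16, 19, 10, 15])

# Test-retest reliability: the correlation between the two result sets.
# A value near 1 means the assessment ranked students consistently;
# a low value would suggest random variation is swamping the results.
r = np.corrcoef(first_sitting, second_sitting)[0, 1]
print(f"test-retest reliability: {r:.2f}")
```

With these invented scores the correlation comes out close to 1, but real repeated administrations always show some random variation, which is why no result can be treated as completely reliable.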
Factors that can affect reliability:
- The length of the assessment – a longer assessment generally produces more reliable results.
- The suitability of the questions or tasks for the students being assessed.
- The phrasing and terminology of the questions.
- The consistency in test administration – for example, the length of time given for the assessment, instructions given to students before the test.
- The design of the marking schedule and moderation of marking procedures.
- The readiness of students for the assessment – for example, a hot afternoon or straight after physical activity might not be the best time for students to be assessed.
How to be sure that a formal assessment tool is reliable
Check the user manual for evidence of the tool's reliability coefficient. Reliability coefficients range from 0 to 1, and a coefficient of 0.9 or more indicates a high degree of reliability.
Assessment tool manuals contain comprehensive administration guidelines. It is essential to read the manual thoroughly before conducting the assessment.
Validity
Educational assessment should always have a clear purpose, making validity the most important attribute of a good test.
The validity of an assessment tool is the extent to which it measures what it was designed to measure, without contamination from other characteristics. For example, a test of reading comprehension should not require mathematical ability.
There are several different types of validity:
- Face validity – do the assessment items appear to be appropriate?
- Content validity – does the assessment content cover what you want to assess?
- Criterion-related validity – how well does the test measure what you want it to?
- Construct validity – are you measuring what you think you're measuring?
A valid assessment should have good coverage of the criteria (concepts, skills and knowledge) relevant to the purpose of the examination.
Examples:
- The PROBE test is a form of reading running record which measures reading behaviours and includes some comprehension questions. It allows teachers to see the reading strategies that students are using, and potential problems with decoding. The test would not, however, provide in-depth information about a student’s comprehension strategies across a range of texts.
- STAR (Supplementary Test of Achievement in Reading) is not designed as a comprehensive test of reading ability. It focuses on assessing students’ vocabulary understanding, basic sentence comprehension and paragraph comprehension. It is most appropriately used for students who don’t score well on more general testing (such as PAT or e-asTTle) as it provides a more fine-grained analysis of basic comprehension strategies.
There is an important relationship between reliability and validity. An assessment that has very low reliability will also have low validity. A measurement with very poor accuracy or consistency is unlikely to be fit for its purpose. However, the measures required to achieve a very high degree of reliability can impact negatively on validity. For example, consistency in assessment conditions leads to greater reliability because it reduces 'noise' (variability) in the results. On the other hand, one of the things that can improve validity is flexibility in assessment tasks and conditions. Such flexibility allows assessment to be set appropriate to the learning context and to be made relevant to particular groups of students. Insisting on highly consistent assessment conditions to attain high reliability will result in little flexibility, and might therefore limit validity.
The Overall Teacher Judgment balances these ideas: it combines the reliability of a formal assessment tool with the flexibility to draw on other evidence when making a judgment.
Further reading
Articles from NZCER SET magazine - Set 2, 2005 and Set 3, 2005 - written by Charles Darr. Used with permission.
A hitchhiker's guide to reliability (PDF 130 KB)
A hitchhiker's guide to validity (PDF 144 KB)