Definition of Reliability

In everyday language, we use the word reliable to mean that something is dependable and that it will behave predictably every time. We might talk of a football player as reliable, meaning that he gives a good performance game after game.

In science, the idea is similar, but the definition is much narrower. Reliability is a property of any measure, tool, test or sometimes of a whole experiment. It’s an estimation of how much random error might be in the scores around the true score.

For example, you might try to weigh a bowl of flour on a kitchen scale. A reliable scale will show the same reading over and over, no matter how many times you weigh the bowl. There may be slight error here and there – you may notice that some readings differ by just a fraction of a gram – but overall the scale is reliable. If the scale gave a reading of 1 kg and then a minute later gave a reading of 1.5 kg, the error has become so large that the instrument’s reliability is seriously undermined.

When we talk about instruments, we do not necessarily mean a physical instrument, such as a mass spectrometer or a pH-testing strip. An educational test, a questionnaire, or a procedure for assigning quantitative scores to behavior is also an instrument.

Another way of looking at reliability is by considering it as a way to maximize the inherent repeatability or consistency in an experiment. To maintain reliability, a researcher will use as many repeat sample groups as possible, to reduce the chance of an abnormal sample group skewing the results. This is a little like weighing the bowl several times and using the average reading.

Reliability can be determined statistically by calculating the correlation coefficient. If a test is reliable it should show a high positive correlation between repeat scores. If you use three replicate samples for each manipulation, and one generates completely different results from the others, there is likely something wrong with the experiment.
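
As a rough illustration of that calculation, here is a minimal Python sketch; the repeat scores are invented example data, not results from any real study:

```python
import numpy as np

# Invented example data: the same ten subjects measured on two occasions.
first_run = np.array([12.1, 14.3, 11.8, 15.0, 13.2, 12.9, 14.8, 11.5, 13.7, 14.1])
second_run = np.array([12.3, 14.1, 11.9, 14.7, 13.4, 13.0, 14.6, 11.8, 13.5, 14.3])

# Pearson correlation between the two runs; values near +1 suggest high reliability.
r = np.corrcoef(first_run, second_run)[0, 1]
print(f"reliability estimate (Pearson r) = {r:.3f}")
```

A correlation close to +1 means the repeat measurements agree closely; a value near zero means the scores are dominated by random error.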

For most experiments of natural phenomena, results follow a normal distribution and there is always a chance that your sample group produces results at one of the extremes. Using multiple sample groups will smooth out these extremes and generate a more accurate spread of results. But if your results continue to be wildly different, then there is likely something wrong with the design itself. In this case, the entire experiment is externally unreliable.

Good experimental design will allow for plenty of replicate samples by the researchers. But other researchers should also be able to perform exactly the same experiment, with similar equipment, under similar conditions, and achieve exactly the same results. If they cannot, then the design is externally unreliable.

A good example of a failure to apply the definition of reliability correctly is provided by the cold fusion case of 1989. Fleischmann and Pons announced to the world that they had produced nuclear fusion at room temperature, using a simple benchtop electrolysis cell rather than the huge and expensive tori used in most fusion research.

This announcement shook the world, but researchers in many other institutions failed to replicate the experiment. It’s unclear whether the researchers lied or genuinely made a mistake, but it was impossible to accept their results since they were unreliable.


Internal Reliability and Personality Tests

If you’ve ever completed a long questionnaire, you might have noticed how some questions seem to be subtle variations on one another. A personality test may have “I like to plan my activities ahead of time”, “I am spontaneous” and “I like to go with the flow” as three separate items that all tap the same underlying trait, the last two phrased in reverse.

The reason some tests do this is to increase their internal reliability. Internal reliability is about the consistency across separate items within a measure. A test is internally consistent if each of its items taps the same underlying construct, so that the items correlate well with one another.

If you are a physicist or a chemist, repeat experiments should give exactly or almost exactly the same results, time after time. The behavior of phosphorus atoms, DNA molecules or natural forces like gravity is very unlikely to change.

Ecologists and social scientists, on the other hand, understand that achieving identical results on repeat experiments is practically impossible. Complex systems, human behavior and biological organisms are subject to far more random error and variation.

While any experimental design must attempt to eliminate confounding variables and natural variations, there will always be some disparities in these disciplines.

The key to performing a good experiment is to make sure that your results are as reliable as possible; if anybody repeats the experiment, statistical tests will be able to compare the results, and the scientist can then make a solid estimate of statistical reliability.

Reliability and validity are often confused; the terms describe two inter-related but completely different concepts. Very simply:

Validity: does the test actually measure what it’s supposed to?

Reliability: does the test consistently give the same result under the same conditions?

This difference is best described with an example:

A researcher devises a new test that measures IQ more quickly than the standard IQ test (a small numerical sketch follows this list):

  • If the test consistently delivers scores of 135, and the candidate’s true IQ is 120, the test is reliable but not valid.
  • If the new test delivers scores for a candidate of 87, 65, 143 and 102, then the test is not reliable OR valid. It doesn’t measure what it’s supposed to, and it does so inconsistently!
  • If the scores are 100, 111, 132 and 150, then the validity and reliability are also low. However, the distribution of these scores is slightly better than above, since it surrounds the true score instead of missing it entirely. Such a test is likely suffering from extreme random error.
  • If the researcher's test delivers a consistent score of 118, then that’s pretty close, and the test can be considered both valid and reliable. The closer to 120, the more valid, and the smaller the variation between repeat scores, the higher the reliability. A test that routinely underestimates IQ by two points can be as useful as a more valid test since the error itself is so reliable.
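
To make the distinction concrete, here is a minimal Python sketch using the made-up scores from the list above. It separates the two problems: the spread of the repeat scores (low reliability) and the distance of their average from the true score of 120 (low validity):

```python
import numpy as np

TRUE_IQ = 120  # the candidate's true IQ in the example above

tests = {
    "reliable, not valid": [135, 135, 135, 135],
    "neither reliable nor valid": [87, 65, 143, 102],
    "unreliable, centered on the truth": [100, 111, 132, 150],
    "reliable and nearly valid": [118, 118, 118, 118],
}

for name, scores in tests.items():
    scores = np.array(scores, dtype=float)
    spread = scores.std()            # large spread -> low reliability
    bias = scores.mean() - TRUE_IQ   # large bias   -> low validity
    print(f"{name:35s} spread = {spread:5.1f}   bias = {bias:+6.1f}")
```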


Reliability is an essential component of validity but, on its own, is not a sufficient measure of validity. A test can be reliable but not valid, whereas a test cannot be valid yet unreliable. A test that is extremely unreliable is essentially not valid either: a bathroom scale that measures your weight one day as 5000 kg and the next day as 2 kg is so unreliable that it cannot be measuring what it is meant to.

There are several methods to assess the reliability of instruments.


How to test internal reliability

In the social sciences and psychology, testing internal reliability is essentially a matter of comparing the instrument with itself.

The split-half method

How could you determine whether each item on an inventory is contributing to the final score equally? One technique is the split-half method which cuts the test into two pieces and compares those pieces with each other. The test can be split in a few ways: either the first vs. the second half, or the odd-numbered items vs. the even-numbered, for example.

Split-half methods can only be used on tests measuring a single construct – for example, an extroversion subscale on a personality test. Psychometricians use split-half methods to identify items on a test which don’t correlate strongly with the others, and then remove or improve those items.
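
As a rough sketch of the odd/even split just described (the respondents and item scores below are invented, and the Spearman-Brown step is the standard correction for estimating the reliability of the full-length test from its two halves):

```python
import numpy as np

# Invented data: 8 respondents x 10 items on a 1-5 scale. Each respondent has a
# latent trait level, and every item score is that trait plus a little noise.
rng = np.random.default_rng(0)
trait = rng.normal(3.0, 1.0, size=(8, 1))
items = np.clip(np.rint(trait + rng.normal(0.0, 0.5, size=(8, 10))), 1, 5)

# Split the items into odd- and even-numbered halves and total each half.
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlation between the two half-scores.
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction: estimated reliability of the full-length test.
full_test_reliability = 2 * r_half / (1 + r_half)
print(f"half-score r = {r_half:.3f}, corrected reliability = {full_test_reliability:.3f}")
```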

Internal Consistency

The internal consistency test compares two different versions of the same instrument, to ensure that there is a correlation and that they ultimately measure the same thing.

For example, imagine that an examining board wants to test that its new mathematics exam is reliable, and selects a group of test students. For each section of the exam, such as calculus, geometry, algebra and trigonometry, they actually ask two questions, designed to measure the aptitude of the student in that particular area.

If there is high internal consistency, i.e. the results for the two sets of questions are similar, then each version of the test is likely to be reliable. The test-retest method involves two separate administrations of the same instrument, while internal consistency measures two different versions at the same time. Researchers may use internal consistency to develop two equivalent tests to administer later to the same group.

A statistical formula called Cronbach's alpha tests this kind of reliability by comparing the responses to the different items with one another. Luckily, modern statistical software takes care of the details, saving researchers from doing the calculations themselves.
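
For reference, the usual formula is alpha = (k / (k - 1)) * (1 - sum of the item variances / variance of the total scores), where k is the number of items. A minimal Python sketch, with an invented score matrix, might look like this:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Invented example: five respondents answering four related items on a 1-5 scale.
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
])
print(f"Cronbach's alpha = {cronbach_alpha(scores):.3f}")
```

Values of alpha above roughly 0.7 or 0.8 are conventionally taken to indicate acceptable internal consistency, although the exact threshold depends on the field and the purpose of the test.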

How to test external reliability

There are two common ways to establish external reliability: test-retest and inter-rater methods.

Test-Retest Method

The Test-Retest Method is the simplest method for testing external reliability, and involves testing the same subjects once and then again at a later date, then measuring the correlation between those results. A test retaken after a month, for example, should yield the same results as the original, if it’s a reliable test.

One difficulty with this method lies with the time between the tests. This method assumes that nothing has changed in the meantime. If the tests are administered too close together, then participants can easily remember the material and score higher on the second round. But if administered too far apart, other variables can enter the picture: participants themselves may change enough to make their scores on the second batch not truly comparable with the first. To prevent learning or recency effects, researchers may administer a second test that is different but equivalent to the first.

Inter-rater Methods

Anyone who has watched American Idol or a cooking competition will understand the principle of inter-rater reliability. Here, what is being measured is performance, but with a panel of judges in the role of “instrument.”

An example is clinical psychology role play examinations, where students are rated on their performance in a mock session. Another example is a grading of a portfolio of photographic work or essays for a competition.

Processes that rely on expert rating of performance or skill are subject to their own kind of error, however. Inter-rater reliability is a measure of the agreement, or concordance, between two or more raters in their respective appraisals, i.e. the degree of consensus among judges.

The principle is simple: if several expert raters all agree on a performance rating, that rating shows high reliability. If, however, the judges have wildly different assessments of that performance, their assessments show low reliability. Importantly, reliability is a characteristic of the ratings, and not the performance being rated.
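
One common statistic for quantifying this agreement between two raters assigning categorical ratings is Cohen's kappa, defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance. A minimal Python sketch with invented ratings:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical ratings to the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented example: two judges rating eight performances as pass, borderline or fail.
judge_1 = ["pass", "pass", "fail", "borderline", "pass", "fail", "pass", "borderline"]
judge_2 = ["pass", "pass", "fail", "pass",       "pass", "fail", "fail", "borderline"]
print(f"kappa = {cohens_kappa(judge_1, judge_2):.3f}")
```

A kappa of 1 indicates perfect agreement, while a kappa of 0 indicates agreement no better than chance.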

Reliability - One of the Foundations of Science

As we have seen, understanding the definition of reliability is extremely important for any scientist but, for social scientists, biologists and psychologists, it’s a crucial foundation of any research design. In psychometrics, for example, the constructs of interest first need to be isolated before they can be measured. Thus, building inventories and tests cannot be done without constant assessment of each construct’s validity and reliability. For this reason, extensive research programs always involve comprehensive pre-testing, ensuring that the instruments used are both consistent and valid.

Those in the physical sciences also perform instrumental pre-tests, ensuring that their measuring equipment is calibrated against established standards.
