Part 1: Principles for Evaluating Psychometric Tests (2024)

The purpose of this section is to provide principles that can be used by scientists and regulators to determine whether a psychometric test is adequate for assessing neurodevelopment or CNS function (including specific neurobehavioral domains or traits) or to aid in the selection of psychometric tests for research studies. While this document emphasizes developmental neurotoxicity studies of chemical exposures, these principles could be extended to other exposures (e.g., psychosocial stress, nutrition). The major principles for evaluating psychometric tests, described below, are those commonly used by psychometricians for this purpose. They include specific methods for evaluating aspects of the four overarching psychometric criterion areas: reliability, validity, standardized administration methods, and normative data associated with specific tests. It is critical to note that these criteria apply only to the features of the psychometric tests themselves and not to the application of the test in a research setting. (Part 2 of this document proposes criteria for test application.)

The assessment of test-specific aspects of reliability, validity, standardized test administration, and normative data for the tests featured in Appendix B was based on information derived from a combination of sources described in the Introduction to this document and summarized in the extraction table (Appendix C). Evaluation of specific aspects or subcriteria for each of the four approaches to understanding the psychometric integrity and properties of the tests was completed independently by both evaluators (Dr. Roberta White and Dr. Joseph Braun) using the data summarized in the extraction table.

The ratings for each subcriterion were the following: adequate, deficient, not applicable, or not present (i.e., not enough information available to the evaluator). For the normative data subcriteria, separate ratings were determined for adults and children (adulthood was defined as beginning at age 18 years). It should be noted that the evaluative ratings do not necessarily dictate whether a test is appropriate for an individual study. For example, if a test or outcome is being used to evaluate a specific brain function in a unique population but lacks adequate population norms, it can still be used if its raw scores are appropriately analyzed. In addition, some criteria were difficult to rate because the multiple sources of information or peer-reviewed studies consulted for a test varied considerably in quality or level of detail, or reported different results. Once independent ratings were determined by both evaluators, a consensus meeting was held to finalize them through discussion between the evaluators. Discussion was needed to arrive at a consensus rating for approximately 20% of the ratings. A final review of ratings was completed by the evaluators to ensure consistent application of the evaluation criteria. Additional test descriptions and rating justification notes were added to the evaluation tables when needed. Explanatory notes for rating justifications were added for each instance of a deficient or not present rating and for some adequate ratings that were not clearly derived from the material provided in the extraction table (Appendix C). When data were not available in manuals or other literature (and therefore not reflected in the extraction table), the evaluators based their ratings on their own knowledge and noted this in the evaluation tables.

While all four psychometric criterion areas (reliability, validity, standardized administration methods, and normative data) are important in evaluating psychometric tests, it should be noted that reliability, validity, and standardized administration methods are considered most important in selecting psychometric tests for research studies and in determining the adequacy of psychometric tests to assess neurodevelopment or CNS function in epidemiological research. It is recommended that scientists and regulators consider the strength of the normative data only for tests that are considered adequate regarding reliability, validity, and standardized administration methods.

Given the above considerations, the evaluation ratings for normative data are presented separately from the ratings for reliability, validity, and standardized administration methods because the adequacy of normative data is most relevant once a valid and reliable test has been developed. Moreover, adequacy of normative data is only applicable in epidemiological studies that use normative scores as outcomes rather than raw scores. In addition, many domain-specific tests are used to test hypotheses regarding specific skills and abilities to assess specific brain systems and are often not developed with the same resources as larger omnibus tests, resulting in limitations to or a lack of normative data. These tests can still be valid and reliable, but the scores they produce might need to be adjusted for age, sex, or other factors predictive of raw scores.

Appendix B contains the evaluation ratings and notes for the tests by domain. For each domain, the ratings for reliability, validity, and standardized administration methods are provided in one table, and the ratings for normative data are provided in a second table. Subtests and subscales within omnibus tests assessing domain-specific functions that have been identified in epidemiological studies from the in-progress EPA IRIS toxicological review of MeHg are listed below their respective domain-specific evaluation tables. The evaluation tables provide the publication date of each test, as the age of the test at the time a study was completed can be an important factor in considering whether a test was appropriately applied. For example, older tests may contain items or questions that are no longer part of general knowledge (e.g., naming outdated technology or household items, such as a record player). However, because test age is not static (i.e., it depends upon the lag time between the date of the test and the date of the study, as well as the reasons investigators chose the test), evaluation criteria were not developed for this variable. Factors related to the age of a test at the time it was employed in a study are considered in some detail in Part 2 of this document.

This document does not provide an overall designation of adequate or inadequate (or any other ranking) for specific tests. The complexities of choosing, applying, and interpreting neurobehavioral methods in research settings prevent simplistic summary evaluations. Some users of this document may consider making this designation when they are trying to determine whether a study using a given test will be included in a meta-analysis, whether the results related to a test are to be used for policy decisions, or whether the test will be administered as part of a research study. Thus, the relative importance of the four criteria (and subcriteria) in making these types of decisions will differ with the goal of the end user. For example, researchers selecting a test for administration in a research study might weigh the availability of specific, applicable normative data more heavily than the other criterion areas because they are conducting a study in a culturally unique population. As another example, scientists selecting tests for inclusion in a meta-analysis of a specific neurobehavioral domain might place more weight on the validity of a test if they want to ensure that only results from tests accurately measuring the specific domain are included.

Some caveats to applying the criteria in this document should be noted. First, designations of adequate or deficient are applied to the criteria without additional gradations. This is because there are standards available to designate a test as adequate or deficient for some aspects of the psychometric criteria, but additional gradations are not available or widely used (White et al. 1994; White and Proctor 1992). While alternative methods (e.g., risk-of-bias analysis) could provide finer gradations of each criterion, systematic approaches that could do this for psychometric tests are not available. Thus, a binary designation allows users to determine if a given test meets the criteria as described below in a reasonable enough fashion that it would be acceptable for use in population-based research. This approach is consistent with clinical and research practice, as some psychometric tests are used in clinical or research settings when the test has known inadequacies. For instance, this situation can arise when there are no better alternatives for assessing a given domain or when a test assesses a highly specific cognitive process.

Many psychometric tests include both an omnibus assessment of a neurobehavioral function or successful neurodevelopment as well as subtests assessing specific domains related to that function. Therefore, some specific summary or subscale scores within an omnibus test may be adequate or deficient while others are not. In general, focusing on the summary scores (e.g., IQ measures, domain summaries) from tests is recommended for most purposes. In cases for which a test provides multiple domain or trait scores but no summary measure(s), using the overall pattern of adequacy/deficiency across domain or trait scores is recommended to determine the adequacy of a given criterion for the test as a whole. If the goal of the user is to apply a limited number of an instrument’s subtests (one or more) to assess specific domain functioning, only information relevant to the subscale(s) of interest should be considered by the user. This information is generally available in test manuals and can also be found in the peer-reviewed or gray literature.

The major principles as noted above for evaluating psychometric tests (reliability, validity, standardized administration methods, and normative data) are described below.

Reliability

For a psychometric test to be reliable, its results should be consistent across time (test-retest reliability), across items (internal reliability), and across raters (inter-rater reliability). Inter-rater reliability is discussed in Part 2 of this document because it is not an intrinsic feature of a test. Internal reliability requires that the individual items on a given test measure the same domain(s) or trait(s) (i.e., internal consistency). Reproducibility, or test-retest reliability, requires that consistent scores be obtained from the same individual upon repeated testing.

To assess the internal reliability of a test, items within the test should be correlated with each other to ensure internal consistency. To assess the test-retest reliability of a test, it should be administered in a standardized manner to the same person twice, and the score(s) from the repeated measurements should be consistent.

When assessing internal consistency, a high correlation among items on domain-specific subscales indicates that the test items measure the same trait (e.g., as indicated by a high split-half reliability). The most widely used criterion for assessing internal consistency was developed by Sattler (2001). He recommended that tests with reliability coefficients <0.6 (e.g., the correlations mentioned above) be deemed unreliable. Moreover, for research purposes, Sattler (2001) suggested that tests with reliability coefficients ≥0.6 and <0.7 be considered marginally reliable and those with coefficients ≥0.7 be considered relatively reliable.
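To make Sattler's cut points concrete, the sketch below (a minimal illustration, not part of the evaluation protocol; the function names and synthetic data are our own) computes Cronbach's alpha from a respondents-by-items score matrix and classifies the result:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def sattler_band(r):
    """Classify a reliability coefficient per Sattler (2001)."""
    if r < 0.6:
        return "unreliable"
    if r < 0.7:
        return "marginally reliable"
    return "relatively reliable"

# Illustrative data: 200 respondents answering 10 items tapping one trait
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
items = trait + rng.normal(scale=1.0, size=(200, 10))
alpha = cronbach_alpha(items)
print(f"alpha = {alpha:.2f} ({sattler_band(alpha)})")
```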

For test-retest reliability, high correlations between repeated administrations of a test to the same person within an appropriate time interval ensure that the test can consistently measure the trait(s) assessed by the instrument in an individual. Test-retest reliability is generally assessed by the intraclass correlation coefficient (ICC; ideally >0.4), the Pearson correlation coefficient (ideally >0.3), or Cohen's kappa coefficient (>0.4).

Determining adequacy: To be considered adequate, a psychometric test must have internal consistency reliability coefficients of ≥0.6 (e.g., Cronbach's alpha, ICC). Test-retest reliability should meet one of the following criteria, as indicated above: ICC >0.4, Pearson correlation coefficient >0.3, or Cohen's kappa coefficient >0.4. Some deviations for subtests are acceptable if summary scales or the majority of subtests are at least marginally reliable.
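As a companion sketch under the same caveats (synthetic data, illustrative function names), test-retest reliability can be estimated with a single-measurement, absolute-agreement ICC, here ICC(2,1) in the Shrout and Fleiss (1979) notation, computed from a subjects-by-sessions array:

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single score.
    `scores` is an (n_subjects, k_sessions) array."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)  # per-subject means
    col_means = scores.mean(axis=0)  # per-session means
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    resid = scores - row_means[:, None] - col_means[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Illustrative test-retest data for 50 subjects
rng = np.random.default_rng(1)
true_score = rng.normal(100, 15, size=50)
scores = np.column_stack([true_score + rng.normal(0, 8, 50),
                          true_score + rng.normal(0, 8, 50)])
icc = icc_2_1(scores)
print(f"ICC(2,1) = {icc:.2f} -> {'adequate' if icc > 0.4 else 'deficient'}")
```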

Validity

Validity is typically assessed across three broad domains: content, construct, and criterion validity. Each is distinct but ultimately all are related to a test’s ability to measure what it is designed to measure. It is critical to note that validity is not a static, “all or none” metric and is re-evaluated as a test is used in varied clinical practice and research settings over time.

Content validity is the extent to which the test items, tasks, and questions assess the trait that the test is designed to measure. This can be thought of as a sampling issue, wherein the test content should be representative of the population of all possible test content that could measure that trait. Content validity is assessed by evaluating test themes, theoretical models, scientific evidence supporting a test, domain definition, domain operationalization, item selection, and item review. Review of content validity is often qualitative in nature and relies on expert evaluation and judgment; however, quantitative techniques like factor analytic approaches are often used to refine test content and confirm content validity.

Construct validity is the degree to which the test estimates the trait of interest using the items selected for the test. It usually pertains to complex traits (e.g., intelligence). Note that a construct is theoretical, and establishing construct validity requires the accumulation of evidence from several sources beyond the correlation of tests purported to measure constructs such as intelligence. Construct validity is evaluated with formal construct definitions, correlations with other tests that measure the same (convergent validity) and different (divergent validity) construct(s), and factor analysis. It is quantitatively assessed using results (typically correlation coefficients) from well-designed studies that administer the test of interest to normative and clinical samples of individuals. There are no strict thresholds to establish construct validity, but minimum correlation coefficients of 0.3 have been proposed (Lezak 1995; Lezak et al. 2004; Lezak et al. 2012). Correlations with related tests reflect convergent validity, while correlations with tests that measure other traits should be low, establishing divergent (discriminant) validity.

Finally, criterion validity assesses the ability of a psychometric test to predict an individual’s performance or outcome now (concurrent validity) or in the future (predictive validity). It requires identification of an appropriate criterion for comparison (e.g., clinical disease related to the trait), assessment of the test and criterion, calculation of classification accuracy, or correlation with other tests/criteria.
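The quantitative parts of these validity checks reduce to simple statistics. The following sketch (synthetic data; the 0.3 correlation and 0.6 kappa cut points are those cited in this section) illustrates convergent and divergent correlations and Cohen's kappa against a hypothetical clinical criterion:

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two categorical classifications of the same cases."""
    a, b = np.asarray(a), np.asarray(b)
    categories = np.union1d(a, b)
    p_observed = np.mean(a == b)
    p_expected = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

rng = np.random.default_rng(2)
latent = rng.normal(size=300)                               # trait of interest
test = latent + rng.normal(scale=0.7, size=300)             # test of interest
same_construct = latent + rng.normal(scale=0.7, size=300)   # established measure
other_construct = rng.normal(size=300)                      # unrelated measure

r_convergent = np.corrcoef(test, same_construct)[0, 1]
r_divergent = np.corrcoef(test, other_construct)[0, 1]
print(f"convergent r = {r_convergent:.2f} (want >= 0.3)")
print(f"divergent  r = {r_divergent:.2f} (want near 0)")

# Criterion validity: agreement between test-based and clinical classification
test_flag = test < np.percentile(test, 10)
clinical_flag = latent < np.percentile(latent, 10)
print(f"kappa = {cohens_kappa(test_flag, clinical_flag):.2f} (want > 0.6)")
```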

Determining adequacy: Content, construct, and criterion validity should be separately evaluated for each test.

  1. Content validity: Qualitatively determine that the test is theoretically grounded, has item content appropriately identified from a large item pool that was expertly judged and curated, and has defined and theoretically justified domains. Factor analysis can be used to confirm that included items are specific to the domain(s) of interest.

  2. Construct validity: Must show validity through positive correlations with other measures of the same construct or similar tests (i.e., convergent validity). Ideally, the test should not be correlated with unrelated constructs (i.e., divergent validity). Factor analysis can be used to support any summary or subtest scales.

  3. Criterion validity: The criterion should be well defined, and the test must be reasonably accurate in its association with or prediction of the criterion (e.g., kappa >0.6).

Standardized Administration Methods

Psychometric tests must be administered in a rigorous and standardized fashion. This precision is critical in population-based studies when groups of participants with different levels of exposure are being compared with one another, as non-standardized administration could introduce random error or systematic bias. When comparing results from one study with another, it is also critical to ensure that data were collected in the same fashion (i.e., the studies carried out the same test in the same way).

Well-designed psychometric tests include explicit guidelines regarding test material presentation/organization, instructions to participants, instructions to test administrators on scoring participant responses and calculating test scores, and explicit phrasing for oral instructions and/or verbal questions. Some psychometric tests use stimulus material (e.g., pictures, blocks), and the same standardized materials must be used across test sessions to ensure consistent responses from subjects. In addition, the materials used in the test should be identical to those described in the test administration manual or provided by the publisher of the test.

Finally, psychometric tests often require administration by trained personnel or supervision by a clinical psychologist, neuropsychologist, or other appropriate professional. The interpretation of test results or feedback to parents/guardians and affected communities must be conducted by persons with professional credentials appropriate to the outcomes and the setting in which the study is conducted. The required qualifications of the test administrator should be indicated in the test manual or a document of standardized test procedures. An exception to this can be self-administered questionnaire instruments that are completed by individuals about themselves (self-reports) or others (teacher or parent ratings of children).

Determining Adequacy: The following rules should be used to evaluate the adequacy of a test’s administration instructions. Part 2 notes specific administration factors relevant to studies of developmental neurotoxicity.

  1. The test must have a manual or published paper that provides explicit and clear instructions on how test materials should be administered, how responses are scored, and how normative scores are calculated.

  2. Tests that use stimulus materials should include standardized materials for administration.

  3. Test manuals should explicitly state the qualifications necessary to administer a test, with the exception of questionnaire instruments.

Normative Data

Almost all psychometric test manuals provide normative data that allow conversion of participants’ raw scores into scaled scores (mean=10, SD=3), T-scores (mean=50, SD=10), or standard scores (mean=100, SD=15) with corresponding percentiles. These converted scores and percentiles are calculated based on data from a reference (or normative) population. Typically, test developers administer the test to a sample of hundreds or thousands of participants drawn from the target population of interest; normative scores, as well as corresponding percentiles, are derived from these individuals. This scoring is often conducted by calculating means and SDs of raw scores for specific ages of children or adults.
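The arithmetic behind these conversions is a linear rescaling of the z-score computed against the age-appropriate normative mean and SD. A minimal sketch (assuming normally distributed norms for the percentile; published tests often use smoothed lookup tables instead):

```python
from scipy.stats import norm

def convert(raw, norm_mean, norm_sd):
    """Convert a raw score to scaled, T, and standard scores plus a
    percentile, given the normative mean/SD for the examinee's age band."""
    z = (raw - norm_mean) / norm_sd
    return {
        "scaled": 10 + 3 * z,      # mean 10, SD 3
        "t_score": 50 + 10 * z,    # mean 50, SD 10
        "standard": 100 + 15 * z,  # mean 100, SD 15
        "percentile": 100 * norm.cdf(z),
    }

# A raw score of 62 against a hypothetical age-band norm of 55 (SD 7)
print(convert(62, norm_mean=55, norm_sd=7))
```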

In culturally distinct populations for which test items and raw scores are determined to be valid and reliable, normative data are not necessary. Thus, the adequacy of normative data does not need to be assessed. Raw scores or study-specific normative scores can be used in statistical models when certain assumptions are met, and appropriate statistical techniques are used. The nature of some raw scores may preclude using them as the outcome in regression models. For instance, the Bayley Scales have infants or children complete a different set and number of items based on their age. Thus, the raw scores may not be equivalent across individuals. While a variety of methods could be used to create new scores (e.g., summed scores, PCA-derived scores), they have various strengths and limitations (McNeish and Wolf 2020). A full evaluation of these methods is beyond the scope of this document.
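As one illustration of appropriately analyzing raw scores, a common approach is to regress raw scores on age (and other predictors of raw scores) within the study sample and use the residuals, standardized within the study, as the outcome. A sketch with synthetic data (the variable names and model are illustrative, not a prescribed method):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic study data: raw scores that rise steeply with age in months
rng = np.random.default_rng(3)
age_months = rng.uniform(12, 36, size=200)
raw = 20 + 0.8 * age_months + rng.normal(0, 4, size=200)

# Regress raw score on age; the residuals are age-adjusted scores
fit = sm.OLS(raw, sm.add_constant(age_months)).fit()
age_adjusted = fit.resid

# Standardize within the study sample (study-specific z-scores)
z = (age_adjusted - age_adjusted.mean()) / age_adjusted.std(ddof=1)
```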

It is important to note that the sample size and representativeness of normative data for some domain-specific tests of neurobehavioral function are smaller and less generalizable, respectively, than those for commonly used omnibus tests such as intelligence tests or tests of overall neurodevelopment that have been developed by large psychological service companies.

The importance of adequate normative data for tests depends heavily on why the researcher is utilizing the test and how the outcomes are scored. Some tests, especially those that assess highly specific neuropsychological functions, are used because they allow for evaluation of specific relationships between a brain function and a predictor such as exposure to a toxicant. Other tests are applied to an experimental situation because no standardized tests are available for the population being evaluated. In these situations, and in other circumstances when the normative data available for a test are not appropriate according to the standards listed below, the instrument may be legitimately applied, but the outcome data must be appropriately analyzed. For example, raw scores might be adjusted for relevant confounders such as age, gender, educational attainment, or parental education. Several features of a test’s normative data should be evaluated.

  1. First, differences in native language, even different dialects, can affect an individual’s performance on a psychometric test. Thus, normative data should ideally be derived from participants with the same language and, if possible, the same dialect.

  2. Different cultures, races, and ethnicities may be exposed to different educational materials or have different socioeconomic backgrounds. These factors can affect test performance. Thus, normative data should be representative of these subgroups. In some cases, researchers develop normative data when a test is adapted to specific cultures, languages, or subgroups. When examining the normative data associated with a test, it is important to consider the sample sizes for subgroups (e.g., racial/ethnic minorities), as this affects the precision of normative data for these subgroups.

  3. Almost all psychometric tests, particularly those administered to infants and children, are age-standardized to account for age-related neurodevelopment. Thus, age-specific normative data should be available for specific age groups. Moreover, the age ranges within the age bins used for standardization should be examined to ensure that they are granular enough and include data from enough children to accurately capture age-related differences in neurodevelopment.

  4. The process for generating normative data should be systematic in terms of participant recruitment, representativeness, test administration, and score/percentile derivation.

Determining adequacy: The following criteria should be met for a psychometric test to have adequate normative data.

  1. Normative data should be based on sample sizes of at least 1,000 for omnibus tests and should adequately represent the population for which the test was intended. Smaller sample sizes may be appropriate for domain-specific tests. (In the evaluation tables, a sample size of 250 was considered adequate for domain-specific tests.)

  2. Normative data should be appropriate for the culture and language of the participants to which the test is being administered.

  3. Normative data should be derived in a systematic fashion and not from convenience samples.

  4. Age-specific normative data should be derived. The adequacy of age-specific information available for tests is judged by several factors. The age-specific norms should be appropriate for the population to which the test is administered and of an adequate size to derive stable means and percentiles; age should be measured in weeks or months for infants, months for younger children, and years for older children (late adolescence) and adults. Age bands should not be too wide (e.g., greater than 5 years) in later adulthood, when declines can occur for many tests. In addition, the number of participants included in the determination of the age-specific norm should be adequate; generally, this ranges from 30 to 100 depending on the kind of test. Large population omnibus measures such as IQ tests should average about 100 per age band, whereas 30 may be adequate for domain-specific tests. Finally, the age bins or age ranges used should be appropriate for the trait being evaluated. For example, tests of cognitive abilities generally require much narrower age bands than tests of social/emotional traits. When evaluating the age-specific normative data, the evaluators considered all three factors (appropriateness, sample size, and bin width) in determining adequacy.
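To illustrate how the granularity and size of age bins might be audited in practice, the sketch below (synthetic data; the bin edges are an illustrative choice, and the 30-participant floor mirrors the guidance above for domain-specific tests) tabulates per-band counts, means, and SDs:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
norms = pd.DataFrame({"age_years": rng.uniform(2, 80, size=1200)})
norms["raw_score"] = 50 + 0.3 * norms["age_years"] + rng.normal(0, 8, 1200)

# Narrow bands for children, wider bands for adults (illustrative choice)
bins = [2, 3, 4, 5, 6, 12, 18, 40, 60, 80]
norms["age_band"] = pd.cut(norms["age_years"], bins)

summary = (norms.groupby("age_band", observed=True)["raw_score"]
                .agg(["count", "mean", "std"]))
# Flag bands that fall below the floor suggested for domain-specific tests
summary["adequate_n"] = summary["count"] >= 30
print(summary)
```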
