Practice effect and test–retest reliability of the Wechsler Memory Scale-Fourth Edition in people with dementia
BMC Geriatrics volume 23, Article number: 209 (2023)
The Wechsler Memory Scale-Fourth Edition (WMS-IV) has been widely used to assess memory function in people with dementia. The older adult battery of the WMS-IV includes four indices and seven subtests. The aims of this study were to examine the practice effect and test–retest reliability and calculate the reliable change index modified for practice (RCIp) for the indices and subtests of the older adult battery of the WMS-IV for people with dementia.
Fifty-six participants completed the WMS-IV twice, two weeks apart. The practice effect was investigated using effect size (Cohen’s d) and bootstrapping mixed design analysis of variance while considering the severity of dementia. The test–retest reliability was estimated using intraclass correlation coefficient (ICC).
The results showed non-significant practice effects with Cohen’s d < 0.20 in different severities of dementia on two indices and five subtests. The ICC values of these indices and subtests were 0.82–0.85 and 0.57–1.00, respectively. The other two indices (i.e., auditory memory and immediate memory) and two subtests (i.e., logical memory delayed recall and visual reproduction immediate recall) demonstrated small to moderate practice effect (d = 0.46–0.74) for people with mild severity of dementia.
On the whole, the WMS-IV has no to moderate practice effects and moderate to excellent test–retest reliability in people with dementia. The values of the RCIp with 95% confidence interval for the indices and subtests were provided in this study, which are useful to clinicians and researchers for interpreting the real score change in persons with dementia. The two indices (i.e., auditory memory and immediate memory) and two subtests (i.e., logical memory delayed recall and visual reproduction immediate recall) with noticeable practice effect should be used with caution when assessing memory function repeatedly in people with mild severity of dementia.
Due to disease and aging, the population experiencing dementia is growing rapidly. The number of people with dementia is projected to rise from 57.4 million in 2019 to 152.8 million in 2050. Memory function decline is one of the main features of dementia, and its assessment is crucial for clinical diagnosis. Memory function decline impairs the performance of daily tasks in people with dementia and consequently places enormous burdens on caregivers. Therefore, a measure of memory function is necessary to aid diagnosis, make treatment plans, and monitor recovery or deterioration of memory function for people with dementia in both clinical and research settings.
The Wechsler Memory Scale-Fourth Edition (WMS-IV) is used worldwide to assess memory function in people with dementia. It contains index scores and subtest scores that describe different aspects of memory function. The WMS-IV has three characteristics. First, it assesses visual and auditory memory functions in a comprehensive manner. Second, it includes immediate and delayed memory subscales to identify deficits of short-term and long-term memory. Third, it is appropriate for illiterate people because it does not require reading: it uses only pictures and the recall of spoken words. Therefore, the WMS-IV is suitable for assessing memory function in people with dementia.
Empirical evidence on practice effect and test–retest reliability is essential for clinicians and researchers to ascertain the measurement error of a measure. Practice effect is defined as improvement in test scores over repeated administrations due to earlier experience with the same items. The reliable change index estimates the score change that exceeds measurement error and thus indicates a real score change for a person. The reliable change index modified for practice (RCIp) is the reliable change index corrected for practice effect. Test–retest reliability evaluates the stability of a measure (i.e., the same test results over repeated administrations). Examining practice effect and test–retest reliability and calculating the RCIp are necessary to increase the utility of the WMS-IV in clinical and research settings.
Practice effect and test–retest reliability have been examined for the indices of the WMS-IV in healthy people. However, practice effect and test–retest reliability have not been examined, and the RCIp has not been calculated, in people with dementia, which limits the interpretation of the index scores and subtest scores. Therefore, this study aimed to (1) examine the practice effect and test–retest reliability of the indices and subtests of the WMS-IV for people with dementia; and (2) calculate the RCIp values for the indices and subtests. The WMS-IV includes an adult battery (for ages 16–69 years) and an older adult battery (for ages 65–90 years). Considering the ages of the dementia population, we chose to examine the practice effect and test–retest reliability of the older adult battery for people with dementia in this study.
We recruited people with dementia who were outpatients at one hospital in northern Taiwan between July 2019 and April 2020. The following criteria were used to determine eligibility for this study: (1) diagnosed with dementia based on the Diagnostic and Statistical Manual of Mental Disorders, fifth edition; (2) aged 65–90 years (as suggested in the WMS-IV manual); (3) stable status (i.e., the same Clinical Dementia Rating [CDR] score at both administrations); and (4) willingness to participate in the study (informed consent signed by the patient and family caregiver). The exclusion criteria were a diagnosis of intellectual disability and a history of brain injury. This study was approved by the Institutional Review Board of the hospital.
One examiner administered the WMS-IV in this study. The examiner was a certified nurse who had work experience with people with dementia and worked in the cognitive laboratory. The examiner learned and practiced administering and scoring the WMS-IV under a certified psychologist three days a week for two months. People who met the inclusion criteria were assessed with the CDR and WMS-IV twice, two weeks apart. All assessments were conducted in a quiet place to avoid interference with participants' performance. The CDR was administered through interviews with participants with dementia and their caregivers. The demographic data were collected from medical records.
The older adult battery of the WMS-IV contains four indices (i.e., auditory memory [AMI], visual memory [VMI], immediate memory [IMI], and delayed memory [DMI]) and seven subtests (i.e., logical memory immediate recall [LM I], logical memory delayed recall [LM II], verbal paired associates immediate recall [VPA I], verbal paired associates delayed recall [VPA II], visual reproduction immediate recall [VR I], visual reproduction delayed recall [VR II], and symbol span [SSP]). The auditory memory and visual memory indices assess the ability to remember oral and visual information, respectively. The immediate memory and delayed memory indices assess the ability to recall information immediately after presentation and after a 20–30 min delay, respectively. The LM I subtest assesses immediate narrative memory by immediate free recall of two short stories. The LM II subtest assesses long-term narrative memory by retelling the stories and answering recognition questions related to the stories. The VPA I subtest assesses immediate verbal memory by immediate cued recall of 10 word pairs over four trials. The VPA II subtest assesses long-term verbal memory by recalling and recognizing the word pairs. The VR I subtest assesses immediate visual-spatial memory by immediately recalling and drawing five designs. The VR II subtest assesses long-term visual-spatial memory by redrawing and recognizing the designs. The SSP subtest assesses visual working memory by recognizing figures and their relative spatial positions [8, 9].
The AMI index is derived from the sum of four subtests: LM I, LM II, VPA I, and VPA II. The VMI index is derived from the sum of two subtests: VR I and VR II. The IMI index is derived from the sum of three immediate recall subtests: LM I, VPA I, and VR I. The DMI index is derived from the sum of three delayed recall subtests: LM II, VPA II, and VR II. The age-corrected scaled score of each subtest is on a metric with a mean of 10 and a standard deviation of 3. The scaled score ranges from 0 to 19. Each index score is scaled on a metric with a mean of 100 and a standard deviation of 15. The index score ranges from 40 to 160. Higher index scores and subtest scaled scores indicate better memory function.
The CDR is a tool for assessing cognitive and functional impairments in people with dementia. It contains six domains: orientation, memory, judgment and problem solving, community affairs, home and hobbies, and personal care. The personal care domain is rated on a four-point scale (0, 1, 2, 3). The other five domains are rated on a five-point scale (0, 0.5, 1, 2, 3). The total score is derived from the six domains and defines the severity of dementia: 0 (healthy), 0.5 (earliest cognitive changes, questionable), 1 (mild), 2 (moderate), and 3 (severe). The CDR has adequate reliability and validity in people with dementia.
Effect size (Cohen’s d) was calculated to estimate the magnitudes of the practice effects. The criteria for the d value were: 0.20–0.49, small effect size; 0.50–0.79, moderate effect size; and ≥ 0.80, large effect size. To determine whether the practice effect differed between participants at two levels of the CDR (i.e., mild dementia, CDR = 1; and moderate to severe dementia, CDR = 2–3), we applied bootstrapping mixed design analysis of variance (ANOVA), which can provide reliable conclusions when the data distribution assumption is violated. For an index or subtest whose practice effect was affected by the level of the CDR, we examined the practice effect at each level of the CDR using a paired t-test with bootstrapping. The RCIp with 95% confidence interval (CI) was calculated as follows:
$$\mathrm{RCIp} = \text{mean practice effect} \pm 1.96 \times SE_{\mathrm{diff}}$$
$$SE_{\mathrm{diff}} = \sqrt{2 \times SEM^{2}}$$
$$SEM = SD \times \sqrt{1 - \mathrm{ICC}}$$
where mean practice effect is the mean of the difference between the two administrations, $SE_{\mathrm{diff}}$ is the standard error of the differences, SEM is the standard error of measurement, SD is the standard deviation of the first administration, and ICC is the intraclass correlation coefficient.
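The RCIp calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the commonly used formulation in which the interval is the mean practice effect ± 1.96 × SEdiff, with SEdiff = √2 × SEM and SEM = SD × √(1 − ICC). The function name `rcip_interval` and the input values are hypothetical and are not taken from the study's tables.

```python
import math


def rcip_interval(mean_practice_effect, sd_first, icc, z=1.96):
    """Return (lower, upper) bounds of the RCIp interval.

    Assumed formulation (not confirmed against the authors' code):
      SEM     = SD * sqrt(1 - ICC)
      SE_diff = sqrt(2 * SEM^2)
      RCIp    = mean practice effect +/- z * SE_diff
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    sem = sd_first * math.sqrt(1.0 - icc)      # standard error of measurement
    se_diff = math.sqrt(2.0 * sem ** 2)        # standard error of the differences
    half_width = z * se_diff
    return mean_practice_effect - half_width, mean_practice_effect + half_width


# Hypothetical inputs for illustration only.
lower, upper = rcip_interval(mean_practice_effect=3.0, sd_first=12.0, icc=0.75)
print(f"RCIp 95% interval: [{lower:.1f}, {upper:.1f}]")
```

Because the interval is centered on the mean practice effect rather than on zero, a positive practice effect shifts both bounds upward, so a larger gain is required before a change counts as real improvement.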
The ICC was applied to examine test–retest reliability using a random-effects, two-way analysis of variance. An ICC value of 0.80–1.00 indicates excellent reliability, 0.60–0.79 indicates good reliability, 0.40–0.59 indicates moderate reliability, and < 0.40 indicates poor reliability. ICC values ≥ 0.80 and ≥ 0.90 can be used for group comparisons in research settings and for individual comparisons in clinical settings, respectively.
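As an illustration of how such an ICC can be obtained from two administrations, the sketch below computes a two-way random-effects ICC from ANOVA mean squares and maps it to the reliability categories given above. The text does not state which ICC form was used, so the single-measurement, absolute-agreement form (often written ICC(2,1)) is an assumption here. The function names and the example scores are hypothetical.

```python
def icc_2_1(scores):
    """Two-way random-effects, absolute-agreement, single-measurement ICC.

    scores: list of rows, one row per participant, one column per administration.
    The choice of this ICC form is an assumption, not taken from the study.
    """
    n = len(scores)          # participants
    k = len(scores[0])       # administrations
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def reliability_label(icc):
    """Map an ICC value to the categories used in the text."""
    if icc >= 0.80:
        return "excellent"
    if icc >= 0.60:
        return "good"
    if icc >= 0.40:
        return "moderate"
    return "poor"


# Hypothetical scaled scores for five participants at test and retest.
example = [[6, 7], [9, 9], [4, 5], [11, 10], [8, 8]]
value = icc_2_1(example)
print(round(value, 2), reliability_label(value))
```

Under the absolute-agreement form, a systematic shift between administrations (such as a practice effect) lowers the ICC even when participants keep the same rank order.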
Fifty-six people with dementia completed all assessments. The average age of the participants was 79.4 years and 66.1% were female. The mean score of the CDR was 1.6. The demographic information of the participants is shown in Table 1.
Different levels of education did not influence the practice effect of the indices and subtests of the WMS-IV. The results of the bootstrapping mixed design ANOVA showed non-significant practice effects on two indices (i.e., VMI and DMI) and five subtests (i.e., LM I, VPA I, VPA II, VR II, and SSP) across the different levels of the CDR (p = 0.088–0.893) (Table 2). The Cohen’s d values of these indices and subtests were < 0.20, except for the VMI index and SSP subtest. The ICC values of the VMI and DMI indices were > 0.80. The ICC value of the VR II subtest was 1.00 and those of the other four subtests were 0.72–0.78.
The practice effects of two indices (i.e., AMI and IMI) and two subtests (i.e., LM II and VR I) were affected by the level of the CDR. Further examination showed that these indices and subtests had significant practice effects in the paired t-test with bootstrapping and small to moderate effect sizes (d = 0.46–0.74) for participants with mild severity of dementia (Table 3). The ICC values of these indices and subtests were 0.72–0.76. For participants with moderate to severe severity of dementia, there was no significant practice effect (d < 0.20) on these indices and subtests. The ICC values of the AMI index and LM II subtest were ≥ 0.90. The values of the RCIp with 95% CI for the four indices and seven subtests are shown in Table 2 and Table 3.
To the best of our knowledge, this is the first study to examine practice effect and test–retest reliability of both index scores and subtest scores in the older adult battery of the WMS-IV for people with dementia. When the severity of dementia was considered, two indices (i.e., AMI and IMI) and two subtests (i.e., LM II and VR I) displayed small to moderate practice effects in people with mild severity. This study provides empirical evidence for enriching the utility of the WMS-IV in clinical and research settings.
The two indices (i.e., AMI and IMI) showed small to moderate effect sizes (d = 0.45–0.74) in people with mild severity of dementia, but negligible effect sizes (d = 0.05–0.07) in people with moderate to severe severity of dementia. A previous study demonstrated large practice effects (ηp² = 0.33–0.49) on four indices of the older adult battery for healthy people over a short-term interval. Practice effect is produced when examinees develop strategies for answering or remember the items from previous experience. Healthy people without memory deficits and people with mild severity of dementia could develop strategies and remember items better than people with moderate to severe severity of dementia. Thus, the AMI and IMI indices of the WMS-IV have noticeable practice effects in healthy people and people with mild severity of dementia, but these indices demonstrate no practice effect in people with moderate to severe severity of dementia.
Regarding the subtests of the WMS-IV, practice effects were found in two subtests (i.e., LM II and VR I) for people with mild severity of dementia. A possible reason for the noticeable practice effect in the LM II subtest is that examinees received practice with the items during the administration. In the LM I subtest, examinees listened to story A twice and then listened to story B. After 20–30 min, examinees retold stories A and B and were asked questions related to stories A and B in the LM II subtest. Thus, examinees may memorize the stories, especially story A, and gain higher scores on the LM II subtest in the second administration. In the VR I subtest, examinees recall and draw five designs. The VR I subtest may have shown a practice effect because the items were administered from easy to difficult. Examinees could develop strategies for memorizing figures while being administered the less difficult items.
Regarding the clinical implications, the values of the RCIp with 95% CI for the indices and subtests were provided in this study. These values are helpful for interpreting the results (i.e., whether a score change for a person with dementia represents a real deterioration or improvement) with 95% certainty. For example, the RCIp with 95% CI of the AMI index is [-12.2, 19.1] for people with mild severity of dementia. For a person with mild severity of dementia, a score change (i.e., posttest minus pretest) lower than -12.2 indicates a real deterioration and a score change higher than 19.1 indicates a real improvement after intervention. Therefore, the results of this study can support clinicians and researchers in interpreting the scores of a person with dementia more precisely and reasonably, while considering measurement errors, including practice effect.
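The decision rule in this example can be written as a short helper. The sketch below uses the AMI interval for mild severity of dementia quoted above; the function name `interpret_change` and the pretest and posttest scores are hypothetical.

```python
def interpret_change(pretest, posttest, rcip_lower, rcip_upper):
    """Classify a score change (posttest minus pretest) against an RCIp interval."""
    change = posttest - pretest
    if change < rcip_lower:
        return "real deterioration"
    if change > rcip_upper:
        return "real improvement"
    return "no reliable change"


# RCIp 95% interval of the AMI index for mild severity of dementia, as quoted in the text.
AMI_MILD = (-12.2, 19.1)

# Hypothetical AMI index scores for three people.
print(interpret_change(80, 102, *AMI_MILD))  # change of +22
print(interpret_change(80, 90, *AMI_MILD))   # change of +10
print(interpret_change(90, 75, *AMI_MILD))   # change of -15
```

A gain of 10 points falls inside the interval, so it cannot be distinguished from measurement error and practice effect, whereas a gain of 22 points or a loss of 15 points falls outside it.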
A previous study reported satisfactory test–retest reliability of four indices in healthy people. In this study, our findings showed excellent test–retest reliability (ICC > 0.80) on the VMI and DMI indices in people with dementia and on the AMI and IMI indices in people with moderate to severe severity of dementia, so these indices can be used for group comparisons. The AMI index, with ICC > 0.90 in people with moderate to severe severity of dementia, can be applied for individual comparisons. At the subtest level, the SSP subtest showed a relatively low ICC value, which was similar to the test–retest result of the SSP subtest in the Wechsler Memory Scale-Third Edition. Thus, the SSP subtest may not assess particular memory functions consistently over repeated administrations.
Three limitations were noted in this study. First, participants were recruited from one hospital, which may limit the generalizability of our results. Second, we did not examine the practice effect and test–retest reliability of the adult battery of the WMS-IV in people with dementia. In addition to the four indices and seven subtests, the adult battery has one more index (i.e., visual working memory) and one more subtest (i.e., spatial addition). Future studies are warranted to examine the practice effect and test–retest reliability of the adult battery for people with dementia. Third, the participants were people with dementia, and thus floor effects (i.e., more than 20% of participants obtaining the lowest score) were observed in this sample for the indices and subtests of the WMS-IV, except for the VMI index and SSP subtest. The sample size was relatively small, and the age-corrected scaled scores of the two administrations in the VR II subtest were the same. The small sample size and high floor effects may influence the results of the practice effect and test–retest reliability. Further cross-validation with a larger sample is needed.
Overall, the WMS-IV has no to moderate practice effects and sufficient test–retest reliability in people with dementia. The values of the RCIp with 95% CI of the indices and subtests in the WMS-IV are provided herein, which can help clinicians and researchers to explain the results of particular memory functions over repeated assessments. The two indices (i.e., AMI and IMI) and two subtests (i.e., LM II and VR I) with obvious practice effect should be used cautiously while repeatedly assessing memory function in people with mild severity of dementia.
Availability of data and materials
Data supporting the findings are available upon request. Please contact the corresponding author, En-Chi Chiu (enchichiu@ntunhs.edu.tw), for data access.
1. GBD 2019 Dementia Forecasting Collaborators. Estimation of the global prevalence of dementia in 2019 and forecasted prevalence in 2050: an analysis for the Global Burden of Disease Study 2019. Lancet Public Health. 2022;7(2):e105–e125.
2. Cipriani G, Danti S, Picchi L, Nuti A, Fiorino MD. Daily functioning and dementia. Dement Neuropsychol. 2020;14(2):93–102.
3. Tsai CF, Hwang WS, Lee JJ, Wang WF, Huang LC, Huang LK, et al. Predictors of caregiver burden in aged caregivers of demented older patients. BMC Geriatr. 2021;21(1):59.
4. Jian ZH, Chiu EC. A review of psychometric properties of memory measures frequently used in patients with dementia. J Taiwan Occup Ther Res Pract. 2020;16(1):27–39.
5. Chiu EC, Koh CL, Tsai CY, Lu WS, Sheu CF, Hsueh IP, et al. Practice effects and test-re-test reliability of the Five Digit Test in patients with stroke over four serial assessments. Brain Inj. 2014;28(13–14):1726–33.
6. Chiu EC, Hung JW, Yu MY, Chou CX, Wu WC, Lee Y. Practice effect and reliability of the motor-free visual perception test-4 over multiple assessments in patients with stroke. Disabil Rehabil. 2022;44(11):2456–63.
7. Bouman Z, Hendriks MP, Aldenkamp AP, Kessels RP. Temporal Stability of the Dutch Version of the Wechsler Memory Scale-Fourth Edition (WMS-IV-NL). Clin Neuropsychol. 2015;29(Suppl 1):30–46.
8. Wechsler D. WMS-IV technical and interpretive manual. San Antonio, TX: Pearson; 2009.
9. Spores JM. Clinician’s guide to psychological assessment and testing: with forms and templates for effective practice. New York, NY: Springer Publishing Company, LLC; 2013.
10. Wechsler D. Wechsler Memory Scale®-Fourth Edition. London, UK: Pearson Education, Inc.; 2010.
11. Morris JC. The clinical dementia rating (CDR): current version and scoring rules. Neurology. 1991;41:1588–92.
12. Marin DB, Flynn S, Mare M, Lantz M, Hsu MA, Laurans M, et al. Reliability and validity of a chronic care facility adaptation of the Clinical Dementia Rating scale. Int J Geriatr Psychiatry. 2001;16(8):745–50.
13. Cohen J. Statistical power analysis for the behavioral sciences. Hillsdale: Lawrence Erlbaum Associates; 1988.
14. Zientek LR, Thompson B. Applying the bootstrap to the multivariate case: bootstrap component/factor analysis. Behav Res Methods. 2007;39:318–25.
15. Bushnell CD, Johnston DC, Goldstein LB. Retrospective assessment of initial stroke severity: comparison of the NIH Stroke Scale and the Canadian Neurological Scale. Stroke. 2001;32(3):656–60.
16. Chiu EC, Wu WC, Chou CX, Yu MY, Hung JW. Test-retest reliability and minimal detectable change of the Test of Visual Perceptual Skills-Third Edition in patients with stroke. Arch Phys Med Rehabil. 2016;97(11):1917–23.
17. Chiu EC, Lee SC. Test-retest reliability of the Wisconsin Card Sorting Test in people with schizophrenia. Disabil Rehabil. 2021;43(7):996–1000.
18. Martin R, Sawrie S, Gilliam F, Mackey M, Faught E, Knowlton R, et al. Determining reliable cognitive change after epilepsy surgery: development of reliable change indices and standardized regression-based change norms for the WMS-III and WAIS-III. Epilepsia. 2002;43(12):1551–8.
We are grateful to all participants for their involvement.
This study was supported by Taipei City Government (grant no. 10901-62-038).
Ethics approval and consent to participate
This study was approved by the Taipei City Hospital Research Ethics Committee (TCHIRB-10801014). All participants provided written informed consent before enrolment. All methods were performed in accordance with the relevant guidelines and regulations (Declaration of Helsinki).
Consent for publication
Competing interests
The authors declare that they have no competing interests.
Cite this article
Lee, SC., Chien, TH., Chu, CP. et al. Practice effect and test–retest reliability of the Wechsler Memory Scale-Fourth Edition in people with dementia. BMC Geriatr 23, 209 (2023). https://doi.org/10.1186/s12877-023-03913-2