Practice effect and test-retest reliability of the Mini-Mental State Examination-2 in people with dementia

Background: The Mini-Mental State Examination-Second Edition (MMSE-2) consists of three versions: a brief version (MMSE-2:BV), a standard version (MMSE-2:SV), and an expanded version (MMSE-2:EV). Each version has alternate forms (blue and red). Evidence on the practice effect and test-retest reliability of the three versions of the MMSE-2 has been lacking, which limits its utility in both clinical and research settings. The purpose of this study was to examine the practice effect and test-retest reliability of the MMSE-2 in people with dementia.

Methods: One hundred and twenty participants were enrolled, of whom 60 were administered the blue form twice (the same-form group [SF group]) and 60 were administered the blue form first and then the red form (the alternate-form group [AF group]). The practice effect was evaluated using a paired t-test and Cohen’s d. The test-retest reliability was examined using the intraclass correlation coefficient (ICC).

Results: For the practice effects, in the SF group, no statistically significant differences were found for the MMSE-2:BV and MMSE-2:EV total scores and eight subtests (p = 0.061–1.000), except for the MMSE-2:SV total score (p = 0.029). In the AF group, no statistically significant differences were found for the total scores of all three versions and the subtests (p = 0.106–1.000), except for the visual-constructional ability subtest (p = 0.010). Cohen’s d values for the total scores of all three versions and the subtests were 0.00–0.20 in the SF group and 0.00–0.26 in the AF group. For the test-retest reliability, ICC values for all three versions and eight subtests were 0.60–0.93 in the SF group and 0.56–0.93 in the AF group.

Conclusion: Our results demonstrated that the practice effect could be minimized when alternate forms of the MMSE-2 were used. The MMSE-2 had good to excellent test-retest reliability, except for three subtests (i.e., visual-constructional ability, registration, and recall). Caution should be taken when interpreting the results of the visual-constructional ability, registration, and recall subtests of the MMSE-2.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12877-021-02732-7.


Background
An estimated 46.8 million people worldwide are living with dementia, and this figure is projected to increase to 74.7 million by 2030 and to 131.5 million by 2050 [1]. Cognitive decline is the primary characteristic of dementia and spans domains such as learning and memory, language, attention, and executive functions [2]. Previous studies have shown that cognitive impairment affects the ability of people with dementia to perform activities of daily living, which may in turn affect their quality of life and that of their caretakers [3][4][5]. Therefore, early detection to identify those who are suspected of having cognitive impairment is important.

*Correspondence: enchichiu@ntunhs.edu.tw. Department of Long-Term Care, National Taipei University of Nursing and Health Sciences, No.365, Ming-Te Road, Peitou District, Taipei 112303, Taiwan.
Four commonly used cognitive screening tools are available for the early detection of dementia: the Mini-Mental State Examination (MMSE), the Short Portable Mental Status Questionnaire, the Montreal Cognitive Assessment, and the Saint Louis University Mental Status Examination [6]. Each of the abovementioned screening tools has its own merits; among these four, the MMSE has been the most extensively used in clinical and research settings because of its practicality. The MMSE is easy to administer and requires no specialized equipment or training [6][7][8]. It has been reported that the four screening tools are broadly similar in test-retest reliability; however, the MMSE has demonstrated higher test-retest reliability with acceptable random measurement error and a small practice effect [6]. The MMSE nevertheless has weaknesses, such as being less sensitive to change with increasing age, having a ceiling effect, and being vulnerable to practice effects [9,10]. The new version of the MMSE, the Mini-Mental State Examination-Second Edition (MMSE-2), was developed to overcome these issues [7].
The MMSE-2 preserves the clinical utility and efficiency of the original MMSE while expanding its application in populations with dementia [7]. The MMSE-2 has two features. First, the MMSE-2 is composed of three versions: a brief version (MMSE-2: BV), a standard version (MMSE-2: SV), and an expanded version (MMSE-2: EV) (Table 1). Administration time depends on the version selected: the MMSE-2: BV takes only 5 min, whereas the MMSE-2: SV needs 20 min to complete. The MMSE-2: BV is part of the standard version, can be used by clinicians to quickly screen patients, and retains the structure and scoring of the MMSE. The MMSE-2: SV is equivalent to the original MMSE [7,11,12,13]. The MMSE-2: EV includes two more subtests, story memory and processing speed, to extend the ceiling and increase the sensitivity of the MMSE-2 to subcortical vascular dementia [14].
Second, each version of the MMSE-2 has two alternate forms (blue and red) in order to decrease the practice effects that might occur over repeated testing [7]. Almost all the subtests of the two forms differ in the content of the questions but share a similar structure. For example, in the registration subtest, the blue form asks the patient to repeat "milk, sensible, before" back to the rater, whereas the red form asks for "egg, confident, after". The orientation subtest and the attention and calculation subtest remain the same in both forms. The equivalency of the alternate forms of the MMSE-2 has been reported to be 0.96 [7,12]. These two features add extra value to the MMSE-2.
An assessment is considered useful if it produces stable and consistent results with repeated administration [15]. The practice effect refers to improvements in test results over repeated assessments that arise, in the absence of any intervention, because previous experience carries over to the next assessment [16]. The practice effect might obscure the true cognitive decline of people with dementia. One method of reducing practice effects is to use alternate forms [17,18]. Test-retest reliability concerns the extent of agreement between repeated assessments under similar assessment conditions [19]. A measure with acceptable test-retest reliability allows users to consistently identify those at risk for cognitive function decline [6]. The abovementioned psychometric properties are essential for a measure to ensure its utility for repeated assessments in people with dementia. Therefore, the purpose of this study was to examine the practice effect and test-retest reliability of the MMSE-2 in people with dementia.

Participants
A convenience sample of people with dementia was recruited from the Department of Psychiatry or the Department of Neurology of two teaching hospitals in northern Taiwan between March 2019 and April 2020. The following criteria were used to determine the eligibility of people with dementia to participate in this study: (1) diagnosis of probable dementia and dementia according to the National Institute on Aging and Alzheimer's Association [20]; (2) age ≥ 65 years; and (3) having a stable condition with a stable dose of medication within the past month. The exclusion criteria were: (1) diagnosis of mental retardation, (2) history of severe brain injury, and (3) having different scores on the Clinical Dementia Rating (CDR) over two repeated tests (any change in CDR score was considered an unstable cognitive condition). This study was approved by the Research Ethics Committee of the

Procedure
Prior to the study, three raters (raters A, B, and C) familiarized themselves with the MMSE-2. The three raters reviewed the user manual of the MMSE-2 and received 4 hours of training from the corresponding author on the administration of the MMSE-2. Then, the three raters performed the MMSE-2 on the corresponding author, and their score results were checked. If there were any discrepancies in the score results, discussions with the corresponding author were required to ensure proper administration procedures and scoring. Finally, the three raters independently administered the MMSE-2 to five people with dementia. The corresponding author observed the assessments and gave MMSE-2 scores simultaneously to confirm that all three raters performed the MMSE-2 correctly in a standardized manner. In addition, when the study began, the raters did not discuss the results of scores with each other to avoid potential bias.

The same-form group, SF group
The blue form was administered twice in a two-week interval by raters A and B to the participants, who were from hospital A.

The alternate-form group, AF group
The alternate forms (i.e., the blue form at the first assessment and the red form at the second assessment) were administered by rater C 2 weeks apart to the participants, who were from hospital B. The AF group completed the MMSE-2 in a fixed order (i.e., the blue form first and the red form second). All assessments were conducted in a quiet room to avoid interference that might have affected the performance of the participants. Between each assessment, the participants were allowed to take a break to minimize fatigue. The demographic data of all participants were collected from their medical records.

Measures
The MMSE-2 was developed to assess cognitive impairment. The MMSE-2: BV (score range: 0-16) is composed of three subtests: registration, orientation, and recall. The MMSE-2: SV (score range: 0-30) is composed of six subtests: three subtests of the MMSE-2:BV along with attention and calculation, language, and visual-constructional ability. The MMSE-2: EV (score range: 0-90) is composed of eight subtests: six subtests of the MMSE-2:SV, as well as story memory and processing speed (Table 1) [14]. A higher total score indicates better cognitive function [7].
The CDR was used to measure dementia severity and to examine whether the symptom severity of participants was stable across the test and retest sessions. The CDR is composed of six domains: orientation, memory, judgment and problem solving, community affairs, home and hobbies, and personal care [21]. A global CDR score can be obtained from the six domains to quantify the severity of dementia on a five-point scale (0, 0.5, 1, 2, and 3), where 0 indicates no dementia, 0.5 indicates questionable dementia, 1 indicates mild dementia, 2 indicates moderate dementia, and 3 indicates severe dementia [22]. The CDR has been reported to have satisfactory reliability and validity in patients with dementia [23].

Data analysis

Practice effect
Paired t-tests were performed to evaluate statistically significant differences between the two assessments. In addition, we calculated Cohen's d as the effect size to assess the magnitude of change. Effect sizes of 0.00-0.09 were categorized as no practice effect, 0.10-0.19 as trivial, 0.20-0.49 as small, 0.50-0.79 as medium, and ≥ 0.80 as large effects [24]. As multiple t-tests were conducted, the significance level was adjusted using a Bonferroni correction [25] for 11 t-tests by dividing the significance level of .05 by 11, resulting in a significance level of p < .0045. Data were analyzed with IBM SPSS Statistics (Version 22.0; IBM Corp., Armonk, NY).
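The practice-effect analysis described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' SPSS procedure. The function names and the example scores are hypothetical, and because the text does not state which Cohen's d formula was used, the pooled-SD form below is an assumption. Only the t statistic and degrees of freedom are computed here; a p-value would normally come from a statistics package (for example, `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t, df) for a paired t-test on two equal-length score lists."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    # Standard error of the mean difference; assumes the differences are not all identical.
    se = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / se, n - 1


def cohens_d(first, second):
    """Effect size as |mean change| / pooled SD (an assumed convention, not confirmed by the text)."""
    pooled_sd = math.sqrt((stdev(first) ** 2 + stdev(second) ** 2) / 2)
    return abs(mean(second) - mean(first)) / pooled_sd


def classify_effect(d):
    """Map Cohen's d to the bands listed in the text, after rounding to two decimals."""
    d = round(abs(d), 2)
    if d < 0.10:
        return "none"
    if d < 0.20:
        return "trivial"
    if d < 0.50:
        return "small"
    if d < 0.80:
        return "medium"
    return "large"


def bonferroni_alpha(alpha=0.05, n_tests=11):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical scores for five participants (not study data).
first = [20, 22, 18, 25, 21]
second = [21, 22, 19, 25, 23]
t, df = paired_t_statistic(first, second)
d = cohens_d(first, second)
print(f"t({df}) = {t:.3f}, d = {d:.2f} ({classify_effect(d)})")
print(f"Bonferroni-adjusted alpha = {bonferroni_alpha():.4f}")
```

Rounding d to two decimals before classification is a design choice made here so that every value falls into one of the stated bands, which are themselves given to two decimals.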

Test-retest reliability
To examine the test-retest reliability of the MMSE-2, the intraclass correlation coefficient (ICC(2,1)) was calculated using a two-way random-effects analysis of variance with absolute agreement. An ICC value of 0.81-1.00 indicates excellent reliability, 0.61-0.80 indicates good reliability, 0.41-0.60 indicates moderate reliability, and < 0.40 indicates poor reliability [26]. In addition, the minimal detectable change at the 95% confidence level (MDC95) was calculated on the basis of the standard error of measurement (SEM) to estimate the random measurement error of the MMSE-2 [27]:

$$\mathrm{SEM} = SD \times \sqrt{1 - \mathrm{ICC}} \tag{1}$$

$$\mathrm{MDC}_{95} = 1.96 \times \sqrt{2} \times \mathrm{SEM} \tag{2}$$

In Formula 1, SD represents the standard deviation of all scores of the repeated assessments. In Formula 2, the value of 1.96 is the z-value of the standard normal distribution for the 95% confidence level used in this study. The √2 multiplier represents the additional uncertainty introduced when difference scores from the repeated assessments are used. We also calculated the MDC percentage (MDC%), which is independent of the units of measurement, and used it to determine a relatively true change between two assessments: MDC% = (MDC / highest score of all test data) × 100 [28]. In this study, an MDC% of less than 30% was considered an acceptable random measurement error [29].
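As a rough illustration of the reliability indices described above, the following Python sketch computes ICC(2,1) from the two-way ANOVA mean squares and then derives the SEM, MDC95, and MDC%. It assumes the commonly used forms SEM = SD × √(1 − ICC) and MDC95 = 1.96 × √2 × SEM, and it takes SD as the sample standard deviation of all pooled scores, which is also an assumption. The function names and the example data are hypothetical, and this is not the authors' SPSS output.

```python
import math
from statistics import stdev, mean


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    scores is a list of rows, one row per participant, one column per session.
    Assumes at least two participants, two sessions, and some between-participant variance.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                 # between-participant mean square
    msc = ss_cols / (k - 1)                 # between-session mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def sem(sd, icc):
    """Standard error of measurement; clamps tiny negative values from rounding."""
    return sd * math.sqrt(max(0.0, 1.0 - icc))


def mdc95(sem_value):
    """Minimal detectable change at the 95% confidence level."""
    return 1.96 * math.sqrt(2) * sem_value


def mdc_percent(mdc, highest_score):
    """MDC expressed as a percentage of the highest observed score."""
    return mdc / highest_score * 100


# Hypothetical test and retest scores for five participants (not study data).
scores = [[24, 25], [18, 17], [27, 27], [12, 14], [21, 20]]
all_scores = [x for row in scores for x in row]
icc = icc_2_1(scores)
mdc = mdc95(sem(stdev(all_scores), icc))
print(f"ICC(2,1) = {icc:.3f}, MDC95 = {mdc:.2f}, MDC% = {mdc_percent(mdc, max(all_scores)):.1f}%")
```

A dedicated statistics package would normally be preferred for ICC estimation because it also reports confidence intervals; the hand-rolled version is shown only to make the arithmetic visible.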

Results
A total of 120 people participated in this study; of these, 60 from hospital A were assigned to the SF group and 60 from hospital B were assigned to the AF group. The participants in the SF group had a mean age of 81.5 ± 7.8 years, 55% were female, and 58.3% had an educational level below elementary school. The participants in the AF group had a mean age of 79.1 ± 6.8 years, 67.3% were female, and 43.3% had an educational level below elementary school. There were no significant group differences in age (t = 1.781, p = 0.080), gender (χ² = 0.094, p = 0.759), education (t = 39.992, p = 0.105), or CDR (t = 3.436, p = 0.752). The characteristics of the participants are shown in Table 2.

Practice effect
In the SF group, no statistically significant differences were found between the two assessments for the MMSE-2:BV and MMSE-2:EV total scores and eight subtests (p = 0.061-1.000). There was a statistically significant difference in the MMSE-2:SV total score (p = 0.029). Cohen's d for all three versions' total scores and subtests, except for the visual-constructional ability subtest, ranged from 0.00 to 0.15, indicating no to trivial practice effects. Cohen's d for the visual-constructional ability subtest was 0.20, indicating a small practice effect (Table 3).
In the AF group, no statistically significant differences were found between the two assessments of the total scores and subtests (p = 0.106-1.000), except for the visual-constructional ability subtest (p = 0.010). Cohen's d for all three versions' total scores and subtests, except for the visual-constructional ability subtest, ranged from 0.00 to 0.10, indicating no to trivial practice effects. Cohen's d for the visual-constructional ability subtest was 0.26, indicating a small practice effect (Table 4).

Test-retest reliability
In the SF group, all three versions and subtests, except for the visual-constructional ability subtest (ICC = 0.60), showed good to excellent reliability (ICC = 0.67-0.93) (Table 2). At the 95% confidence level, the MDC values of all three versions and eight subtests ranged from 0.23 to 1.08. The MDC% of all three versions and eight subtests were all < 30% (ranging from 2.10 to 27.69%).
In the AF group, all three versions and subtests, except for the registration and recall subtests (ICC = 0.56), showed good to excellent reliability (ICC = 0.67-0.93) (Table 3). At the 95% confidence level, the MDC values of all three versions and eight subtests ranged from 0.37 to 1.18. The MDC% of all three versions and eight subtests were all < 30% (ranging from 2.57 to 25.19%).

Discussion
We found that the MMSE-2:SV total score differed significantly over repeated assessments in the SF group, which could lead an examiner to misinterpret a patient's progress in cognitive function when the same form of the MMSE-2:SV is used. The MMSE-2:SV has been reported to be equivalent to the original MMSE [11], which might have increased familiarity in our participants and, as a result, increased the likelihood of practice effects. However, the results showed that when the alternate form of the MMSE-2:SV was used, no significant difference was found between the two assessments. Therefore, based on our results, it is important to use the alternate form of the MMSE-2:SV when repeatedly assessing cognitive functional status in people with dementia. In addition, we found that Cohen's d values for all three versions and almost all subtests in the AF group were smaller than the values in the SF group. These results suggest that the practice effects were mitigated by using alternate forms. Thus, based on our findings, it is beneficial to use the alternate forms of the MMSE-2 in retests in order to minimize practice effects in people with dementia.
The visual-constructional ability subtest in both the SF and AF groups showed small effects. A possible reason for the slightly higher practice effect might be that the same question was used both in the red and blue form of the MMSE-2, with participants being asked to draw two intersecting pentagons in which the interconnected area should be shaped like a rhombus [9,30]. Our participants might have become familiar with the drawing upon repeated administrations. To reduce the impact of practice effects, there is a need to develop a new question design for the alternate form of the visual-constructional ability subtest.
A measure with sufficient test-retest reliability allows users to obtain stable and consistent results over time when repeatedly used [31]. Our results showed that the three versions of the MMSE-2 total scores (the same or alternate forms) all revealed good to excellent test-retest reliability. These results demonstrate that it is reliable to use the same forms or alternate forms of the MMSE-2 total scores to monitor the changes in cognitive function of people with dementia over time. In addition, because of the good and excellent test-retest reliability found in the three versions of the MMSE-2 total scores in the AF group, these results could imply that the blue and red forms of the MMSE-2 total scores are equivalent to one another in repeated assessments. The two subtests (i.e., registration and recall) in the AF group showed moderate test-retest reliability, indicating that these two subtests may not consistently assess specific subtest functions over repeated assessments. One possible reason might be that alternate forms (blue and red forms) were used in the AF group, which may have resulted in more variation compared to using the same form as the SF group. Although the alternate forms used diminished the carryover effect, they increased the variation due to random errors in measurement. Thus, caution is needed when using the alternate forms of registration and recall subtests in people with dementia.
The MDC values can be used as thresholds to indicate whether an individual's changed score between two successive assessments is due to real improvement or due to measurement errors [29]. For example, the MDC value of the MMSE-2:BV total score was 0.55, indicating that a change of at least 0.55 points in a successive administration of the same form of the MMSE-2:BV total score could be interpreted as a real change (i.e., beyond a random measurement error) with 95% confidence. Therefore, clinicians could use the MDC values of the MMSE-2 (with either the same or alternate forms) to interpret the change in cognitive function of an individual patient after an intervention.
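The thresholding rule described above can be expressed as a small check. This is a minimal sketch: the function name and the participant scores are hypothetical, and only the MDC value of 0.55 for the MMSE-2:BV total score comes from the text.

```python
def exceeds_mdc(score_first, score_second, mdc):
    """True if the absolute change between two assessments is at least the MDC."""
    return abs(score_second - score_first) >= mdc


# MDC quoted in the text for the MMSE-2:BV total score (same form).
MDC_BV_TOTAL = 0.55

# Hypothetical participant scores, for illustration only.
print(exceeds_mdc(10, 11, MDC_BV_TOTAL))  # a one-point change
print(exceeds_mdc(10, 10, MDC_BV_TOTAL))  # no change
```

The check treats gains and losses symmetrically by using the absolute difference; whether a change is clinically meaningful is a separate question that the MDC alone does not answer.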
Our results showed that the MDC values of the total scores of the three versions of the MMSE-2 and of all subtests were all below 30% of the highest score of all test data (i.e., MDC% < 30%), indicating an acceptable random measurement error. Thus, the MMSE-2 appeared reliable for describing cognitive function in people with dementia. Although the random measurement error was acceptable, we found that the MDC% of the visual-constructional ability subtest in both groups and of the registration subtest in the AF group were higher than those of the other subtests. The larger MDC% values indicate that the scores of these two subtests are less stable and thus may obscure the real changes of a person with dementia. Possible reasons for these results might be the nature of practice effects and moderate test-retest reliability. Future studies are needed to investigate the causes of variations in people's responses to the visual-constructional ability and registration subtests between repeated assessments. The results could help users exclude factors that increase random measurement errors and improve the utility of the MMSE-2.
To the best of our knowledge, this was the first study to comprehensively examine the practice effect and test-retest reliability of the MMSE-2 in people with dementia. Although the number of people with severe dementia differed between the two groups (SF group = 0 vs. AF group = 9), the results of the AF group with a sample size of 51 were not dramatically different from the previous results obtained with a sample size of 60 (Additional file 1). In addition, we found that all three versions of the MMSE-2 and almost all of their subtest scores, except for the story memory subtest score (p = 0.003), had non-significant differences at baseline (p > 0.05) (Additional file 2). One possible reason for the significant difference in the story memory subtest score might be dementia-related memory problems and the discrepancy in the number of people with severe dementia in the two groups. In this case, we could expect that people with severe dementia would not be influenced by previous testing when repeatedly assessed using the MMSE-2. Contrary to the common belief that the practice effect might obscure true cognitive decline, this information provides an important clue to users. In particular, changes in MMSE-2 scores on repeated administration in people with severe dementia might imply that their cognitive abilities have declined further [32,33], and thus the change scores could act as an indicator to identify individuals at greater risk of clinical progression. On the basis of this implication, we believe that people with severe dementia could still gain benefit from the small practice effect of the MMSE-2. Our study results could broaden the utility of the MMSE-2, since we comprehensively examined all three versions of the MMSE-2 and showed that the MMSE-2 was useful for repeatedly screening the cognitive function of people with dementia.
Our study had five limitations. First, the participants were recruited from northern Taiwan through convenience sampling, and the red forms of the MMSE-2 were not used repeatedly in our study; thus, the test-retest reliability of the red forms remains unknown, which restricts the generalizability of the results of this study. Second, a two-week retest interval was chosen for examining the practice effect in this study. Different retest intervals may result in different study findings. Future studies are needed to examine the practice effect of the MMSE-2 at different time intervals. Third, the fixed-order design (i.e., blue form first and red form second) used in the AF group might have influenced the results of the study. To confirm our findings, future studies that randomize the order of the alternate forms may be needed. Fourth, the inter-rater reliability between raters has not yet been established, which may jeopardize our current validation of the MMSE-2. Future studies are needed to examine the inter-rater reliability of the MMSE-2 in people with dementia. Fifth, we did not collect information regarding whether the patients had been assessed with the MMSE or MMSE-2 before or during our study, in which case the participants might have become familiar with the screening tools. Familiarity with the screening tools might have caused underestimations of the practice effect and test-retest reliability in this study.

Conclusion
Overall, the practice effect of the MMSE-2 diminished when the alternate forms were used. In addition, the MMSE-2 had good to excellent test-retest reliability with an acceptable random measurement error, which supports the use of this measure in both clinical and research settings. However, the visual-constructional ability subtest showed small practice effects, and both the registration and recall subtests were found to have moderate test-retest reliability. Thus, caution should be exercised when interpreting the results of the visual-constructional ability, registration, and recall subtests of the MMSE-2.