Item distribution, internal consistency, and structural validity of the German version of the DEMQOL and DEMQOL–proxy

Background Accurate assessment of health-related quality of life as an endpoint in intervention studies is a major challenge in dementia research. The DEMQOL (29 items) and the proxy version (32 items), which is partly based on the DEMQOL, are internationally used instruments. To date, there is no information on the structural validity, item distribution, or internal consistency for the German language version of these questionnaires. Methods This psychometric study is based on a secondary data analysis of a sample of 201 outpatients with a mild form of Alzheimer’s disease (AD) and their informal caregivers. The informal caregivers who were interviewed were involved in the care of the person with AD several times per week. The analysis for the evaluation of the structural validity was performed using Mokken scale analysis. The internal consistency was calculated using the ρ of the Molenaar Sijtsma statistic and Cronbach’s α. Results For both versions, four subscales were identified: [A] “positive emotions”, [B] “negative emotions”, [C] “physical and cognitive functioning”, and [D] “daily activities and social relationships”. For both instruments, the internal consistency of all subscales was considered “good” (ρ = 0.71–0.88, α = 0.72–0.87). Conclusions The results are a first indication of good construct validity of the instruments used for the German setting. We recommend further investigations of the test-retest reliability and the inter-rater reliability of the proxy instrument. Electronic supplementary material The online version of this article (10.1186/s12877-018-0930-0) contains supplementary material, which is available to authorized users.


Background
According to the World Alzheimer Report, a person was diagnosed with dementia every 3.2 s in 2015. Currently, approximately 46.8 million people worldwide are living with dementia [1]. Dementia is a neurocognitive disorder associated with a significant cognitive decline from a previous level of performance, resulting in a dependency on others to perform activities of daily living [2].
Health-related quality of life (HRQoL) reflects an important desire for persons living with dementia and is therefore used as a general endpoint in many interventional studies. Additionally, HRQoL is increasingly used for assessments of anti-dementia drugs and by the European Medicines Agency to determine the benefits of such drugs [3]. HRQoL in persons living with dementia is also considered by regulatory authorities and administrative agencies who must judge this parameter based on a resident's degree of self-sufficiency. Furthermore, HRQoL is used in economic evaluations of persons in all stages of dementia [4][5][6].
HRQoL is defined by Hays and Reeve as how well a person functions in his/her life and perceives his/her well-being in the physical, mental, and social domains of health [7]. In this definition, functioning refers to the individual's ability to achieve predefined activities, and well-being refers to the individual's subjective feelings [8,9]. Based on this and similar definitions of HRQoL, Karimi and Brazier [8] conclude that HRQoL is a particular type of health description, where health is defined by the World Health Organization (WHO) as "a state of complete physical, mental and social well-being, and not merely the absence of disease or infirmity" [10]. This definition indicates that HRQoL measurements reflect health in a wider sense (i.e., well-being and functioning) than clinical outcomes alone (e.g., 5-year survival rate, rate of restenosis, death, and tumor recurrence).
According to Bakas et al. [11], three models of HRQoL are frequently used: the WHO [12] International Classification of Functioning, Disability and Health (ICF) Model of Functioning and Disability, the HRQoL Model from Wilson and Cleary [9], and, based on this model, the quality of life (QoL) measurement by Ferrans et al. [13].
Smith et al. observed the need to develop a conceptual framework that addressed the differences between the views expressed about HRQoL by people with dementia and their caregivers [14]. The results of the literature analysis and the findings from expert opinions could also be verified by data from interviews of individuals with dementia and their family caregivers [14]. Thus, an empirical justification of the conceptual framework can be assumed. Based on the five domains of the conceptual framework ("health and well-being", "cognitive functioning", "daily activities", "social relationships", and "self-concept"), they developed two interviewer-administered instruments called Dementia Quality of Life (DEMQOL) and its proxy version (DEMQOL-Proxy). The authors conducted a pretest factor analysis during the development of the instrument (DEMQOL n = 130, DEMQOL-Proxy n = 126) that covered the same four dimensions: "positive emotions", "negative emotions", "memory", and "daily activities". For the DEMQOL-Proxy, a two-factor solution, as given by "emotion" and "functioning", has been suggested by the results of the pretest. The factors of both the self-report and proxy version, however, did not fully support the original conceptual framework [15]. Mulhern and colleagues [4] published a factor analysis with a sample of 644 persons with mild to moderate dementia and 682 proxies. In their study, the subscales "cognition", "positive emotion" and "negative emotion" were identified in both instruments [4]. However, "social relationships" and "loneliness" were observed only in the DEMQOL, while the subscales "daily activities" and "appearance" only occurred in the DEMQOL-Proxy [4].
The DEMQOL consists of 28 items, while the DEMQOL-Proxy includes 31 items, each rated on a four-point Likert-type scale with the following responses: a lot, quite a bit, a little, and not at all. Each instrument includes an additional global QoL item (item 29 and item 32, respectively). Items were scored from 1 to 4, with higher scores indicating better HRQoL. It must be noted that there are five reverse-scored items in the DEMQOL and the DEMQOL-Proxy (4 = a lot, ..., 1 = not at all). The global QoL item is also reverse-scored, with the answer options "very good", "good", "fair", and "poor". Fifteen items, in addition to the global QoL item, are similar in both versions of the DEMQOL; however, each version also contains items that are not part of the other instrument [15].
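The scoring rule described above can be illustrated with a minimal Python sketch. It assumes that each response is recorded by its position among the four printed answer options (1 = "a lot", ..., 4 = "not at all"); the function names, the summed total, and the choice of which item is reverse-scored in the example are hypothetical and are not taken from the instrument manual.

```python
from typing import Dict, Iterable

def score_item(position: int, reverse: bool = False) -> int:
    """Convert an answer position (1 = 'a lot', ..., 4 = 'not at all') to an item score.

    Regular items keep their position (a lot = 1, not at all = 4);
    reverse-scored items are flipped (a lot = 4, not at all = 1).
    """
    if position not in (1, 2, 3, 4):
        raise ValueError("position must be an integer from 1 to 4")
    return 5 - position if reverse else position

def total_score(answers: Dict[int, int], reversed_items: Iterable[int]) -> int:
    """Sum the item scores; higher totals indicate better HRQoL."""
    reversed_set = set(reversed_items)
    return sum(score_item(pos, item in reversed_set) for item, pos in answers.items())

# Hypothetical three-item example in which item 1 is treated as reverse-scored.
answers = {1: 1, 2: 1, 3: 4}
example_total = total_score(answers, reversed_items=[1])
```

In this toy example, "a lot" on the reverse-scored item 1 contributes 4 points, "a lot" on the regular item 2 contributes 1 point, and "not at all" on item 3 contributes 4 points.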
The DEMQOL can be used in mild to moderate dementia as a self-report form and also for severe dementia in a proxy version (DEMQOL-Proxy) across different types of dementia and care arrangements [15]. The utility score (DEMQOL-U), which is created from a subset of items from the DEMQOL, can also be used for economic assessments [4]. The DEMQOL instrument was developed and tested in the UK, which was reported in a Health Technology Assessment (HTA) report [15]. Consequently, it is used more frequently in the UK. While there is a German translation, the results of a linguistic validation have not been reported [16]. No adequate results are available for psychometric testing of the German versions, which are required for both research and applied purposes [17]. To date, the DEMQOL has been subjected to at least four more latent variable modeling investigations in two countries since its foundation work [15]: two factor analyses [4,18], bifactor modeling [19], and Rasch modeling [20].
This paper consequently targets the first evaluation of the item distribution, structural validity as a part of the construct validity, and internal consistency of the German version of both the DEMQOL and DEMQOL-Proxy. For this purpose, a parallel iterative Mokken scale analysis (MSA) is used as a further procedure in addition to the aforementioned methods.

Methods
The present study is a secondary data analysis of anonymous baseline data from a randomized controlled trial, the Cognitive Rehabilitation and Cognitive Behavioral Treatment for Early Dementia in Alzheimer's Disease (CORDIAL) study [21]. To test the identified structure in a sensitivity analysis, we used the data of the follow-up surveys after three months (T1) and after nine months (T2). The CORDIAL study was performed to provide clinically meaningful benefits and to evaluate the feasibility, acceptance, efficacy, and usefulness of interventions in cognitive rehabilitation. It was conducted as a multicenter randomized controlled trial on persons living with Alzheimer's disease (AD) and their informal caregivers (as proxy raters). Ethical clearance was granted for the CORDIAL study by the Ethics Commission of the Faculty of Medicine of the Technical University of Munich on 12/10/2009 under the number 2113/08 S. We refrained from seeking a renewed ethical review, as ethical clearance is not required for analyses based on secondary data [22] or for studies using anonymous data.

Setting and participants
The baseline data of the CORDIAL study were collected from July 2008 to September 2009. The first inclusion criterion required participants to be elderly outpatients with an established ICD-10 diagnosis of AD with mild severity, as defined by a Mini-Mental State Examination (MMSE) score of 21 or above. A differential diagnosis to other forms of dementia was conducted by the recruitment centers. This process was completed to obtain a similar picture of symptoms of the participants, in which memory problems in the early stage are the focus of their everyday problems. Patients were recruited from the ten recruitment centers of the study, including memory clinics and neurological and psychotherapeutic practices throughout Germany.
The second inclusion criterion was the availability of a designated informal caregiver who was involved in the care of the person living with AD several times a week. Exclusion criteria comprised acute psychiatric or physical disorders, ongoing formal psychotherapy or cognitive training, regular visits to day care facilities, an impending hospital or nursing home admission, a poor command of the German language, alcohol or substance dependence, and participation in another interventional trial. Stable doses of cholinesterase inhibitors, memantine, nootropics, antidepressants, and antipsychotics were permissible as concomitant medications of the person living with AD [21].

Procedures
The task of the recruitment centers was to inform possible study participants in advance of the study. This process was completed through personal conversations and informational materials. Written informed consent was obtained from both persons living with AD and their informal caregivers. Independent psychologists, serving as raters, conducted the assessments of the CORDIAL study. The raters received a one-day seminar with case studies for use during the assessments, including the DEMQOL questionnaires. The two interviews with the persons living with AD and the informal caregivers were completed separately. Thus, the informal caregivers were blinded to the answers of the persons living with AD. The monitoring of the study was conducted by an interdisciplinary data monitoring and safety board.

Measurements
In addition to the previously described DEMQOL and DEMQOL-Proxy, which are the main topics of interest for our study, further instruments were used. To determine the cognitive ability of the persons living with AD, the German version of the MMSE was used [23]. It is an eleven-question assessment covering five areas of cognitive functioning: orientation, registration, attention and calculation, recall, and language. Each of the 30 tasks is scored with one point (range of total scores: 0 to 30). A lower MMSE score indicates a more severe cognitive impairment. To assess impairment of activities of daily living among persons living with AD, the Bayer Activities of Daily Living Scale (B-ADL) by Hindmarch et al. [24] was used. The B-ADL is a proxy rating assessment for elderly persons with loss of cognitive performance. It comprises 25 items rated on a 10-point scale (1 = never, …, 10 = always). The total scores range from 1 to 10, and higher scores indicate higher impairment. To evaluate depressive symptoms, the long form of the Geriatric Depression Scale (GDS) from Yesavage [25] was used. This assessment has 30 dichotomous items (yes or no; directed differently); thus, the total score ranges from 0 to 30 points (higher scores reflect more severe depression). Finally, the Neuropsychiatric Inventory (NPI) by Cummings [26] was used to characterize the neuropsychiatric symptoms and psychopathology of persons living with AD. The NPI covers twelve types of neuropsychiatric disturbances. The frequencies of the symptoms are rated on a 4-point scale multiplied by their severities on a 3-point scale. A higher score (0 to 144) indicates more challenging behavior.
For a description of the informal caregiver, two assessments were used. First, the Beck Depression Inventory (BDI) from Hautzinger et al. [27] is an instrument that assesses the severity of depression. For each of the 21 questions, there are four different answers that are arranged according to their intensity (e.g., item 1: 0 = I do not feel sad; …, 3 = I am so sad or unhappy that I cannot stand it). A higher BDI score (range for total scores: 0 to 63) indicates greater severity of depressive symptoms. Second, the full Zarit Burden Interview (ZBI-22; 22 items) by Zarit et al. [28] measures the subjective burden of informal caregivers, associated with functional/behavioral impairments and the home care situation. The ZBI-22 is rated on a 5-point scale (0 = never, …, 4 = nearly always), and higher scores (range for total scores: 0 to 88) indicate a higher burden.
The socio-demographic data collected for the informal caregiver included age in years and sex, and for the person living with AD, education in years was recorded. In addition, the informal caregiver was asked to explain his/her relationship to the person living with AD.

Statistical analysis
The participants, missing data, and item distribution were described using descriptive statistics. The analysis of the item difficulty was based on the proportion of responses endorsing the best and worst ratings (i.e., ceiling/floor effects). A floor or ceiling effect was assumed conservatively whenever the mean value of the item was in the lower or upper 20%, respectively, of the respective item range.
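The floor/ceiling criterion can be sketched in a few lines of Python. The sketch assumes that the 20% band refers to the width of the item range (for an item scored 1 to 4, means of 1.6 or less count as a floor effect and means of 3.4 or more as a ceiling effect); this reading, the function name, and the demonstration data are illustrative assumptions.

```python
import numpy as np

def floor_ceiling_flags(scores, low=1, high=4, band=0.20):
    """Flag items whose mean lies in the lowest or highest `band` share of the item range."""
    data = np.asarray(scores, dtype=float)
    means = np.nanmean(data, axis=0)   # item means, ignoring missing responses
    width = (high - low) * band        # 0.6 points for an item scored 1 to 4
    return {
        "mean": means,
        "floor": means <= low + width,
        "ceiling": means >= high - width,
    }

# Hypothetical responses of four persons (rows) to three items (columns).
demo = np.array([[4, 2, 1],
                 [4, 3, 1],
                 [3, 2, 2],
                 [4, np.nan, 1]])
flags = floor_ceiling_flags(demo)
```

In the demonstration data, the first item (mean 3.75) is flagged for a ceiling effect and the third item (mean 1.25) for a floor effect.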
For the analysis of the structural validity of the DEMQOL and DEMQOL-Proxy, as part of the construct validity, we used the MSA. The MSA is a useful tool for researchers who wish to construct unidimensional tests or use questionnaires that comprise multiple binary or polytomous items, and it enables the examination of reliability without the use of Cronbach's alpha [29]. The MSA is a method of the non-parametric item response theory originating from assumptions of the unidimensionality of tests or scales, local independence and monotonicity [30]. The method is established in the context of scale development and has been widely used in QoL research [31,32].
The MSA provides additional information about the relationship between items. As an indicator of the internal correlation of each subscale, the MSA uses Loevinger's H coefficient (H_S). According to Sijtsma and Molenaar [33], the following interpretation of H_S scores was applied to describe the scale: > 0.5 = "strong", > 0.4 = "medium", and > 0.3 = "weak". The correlation of a single item to the other items of the scale is expressed by the value H_i. The H_i should be non-negative for the Mokken model to hold. Depending on the source, an H_i-value from 0 to 0.55 is recommended. We have fixed the H_i-value to the typically used minimum of > 0.3. Items that fall below this level have weak discrimination power and are not useful for this scale. The H_ij denotes the coefficient for a pair of items.
The criterion (Crit) of the MSA by Molenaar and Sijtsma [34] was used to identify items that only partially satisfy the assumptions of monotone homogeneity or double monotonicity. For each item, this diagnostic value combines the H coefficient, the frequency and size of the violations, and their significance. Every item should have a Crit value of less than 40, and optimally a Crit value of 0. A Crit value of greater than 80 is a strong indication that an item violates the assumptions of the MSA in this subscale. The critical values were calculated separately for each of the ten imputed datasets (see below). In this way, individual violations of double monotonicity were not systematically inflated by a factor of ten.
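The three scalability coefficients can be illustrated with a short Python sketch. It uses the covariance formulation of Loevinger's H, in which each observed inter-item covariance is divided by the largest covariance that the two item marginals allow (obtained by sorting both score vectors). The function names and the "unscalable" label for values of 0.3 or below are assumptions for illustration; this is not the software used in the study.

```python
import numpy as np

def _cov_and_max(x, y):
    """Observed covariance and the maximum covariance given the two marginals."""
    cov = np.cov(x, y, bias=True)[0, 1]
    cov_max = np.cov(np.sort(x), np.sort(y), bias=True)[0, 1]
    return cov, cov_max

def loevinger_h(data):
    """Return (H_ij matrix, H_i vector, H_S) for a persons-by-items score matrix.

    Items without any variance produce a zero denominator and are not handled here.
    """
    data = np.asarray(data, dtype=float)
    k = data.shape[1]
    cov = np.zeros((k, k))
    cov_max = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i != j:
                cov[i, j], cov_max[i, j] = _cov_and_max(data[:, i], data[:, j])
    off_diagonal = ~np.eye(k, dtype=bool)
    h_ij = np.full((k, k), np.nan)
    h_ij[off_diagonal] = cov[off_diagonal] / cov_max[off_diagonal]
    h_i = cov.sum(axis=1) / cov_max.sum(axis=1)   # item versus all other items
    h_s = cov.sum() / cov_max.sum()               # whole scale
    return h_ij, h_i, h_s

def label_h(h):
    """Verbal label for a scale coefficient, following the cut-offs quoted in the text."""
    if h > 0.5:
        return "strong"
    if h > 0.4:
        return "medium"
    if h > 0.3:
        return "weak"
    return "unscalable"   # label assumed here for values of 0.3 or below

# Hypothetical data in which all three items order the five persons identically.
perfect = np.array([[1, 1, 1],
                    [2, 1, 1],
                    [3, 2, 1],
                    [4, 3, 2],
                    [4, 4, 4]])
h_ij, h_i, h_s = loevinger_h(perfect)
```

Because the three hypothetical items rank all persons in the same order, every observed covariance equals its maximum and all coefficients reach their upper bound of 1.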
For the exploratory investigation of the instruments, a method of parallel iteration was used, which consists of two steps. In the first step, cores were determined. Cores are sets of items in an item pool that are strongly correlated with each other (H_ij) and are less correlated with the other cores when examined as a dyad. This means that other items from the pool could improve the H_S value in a similar way that a second core could. Otherwise, the weaker second core was returned into the item pool as single items. As a strong correlation for a core, we defined a minimum H_ij-value of 0.45 as a reference according to Müller-Schneider [35]. Analogous to the procedure in a factor analysis, the number of subscales is thereby predefined. In the second step, an iterative MSA was performed in parallel for each core determined. All items were tested in parallel. Accordingly, all remaining items were tested against every core, and the item with the highest H_i-value to a specific core was chosen. In doing so, the assignment of an item should not lead to a violation of monotonicity (Crit > 40). When an item could be allocated to two different cores (cross loader), the allocation was decided with regard to content. The search procedure stopped when there were no further items that fulfilled the requirements (H_i ≥ 0.3 and Crit ≤ 40) or when all the items had been incorporated into a scale. This method of parallel iterative analysis allowed for the identification of smaller subscales with higher H_S values, as opposed to the Automated Item Selection Procedure (AISP) for MSA [36].
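The second step of the procedure can be approximated by a greedy assignment loop, sketched below in Python. The sketch is a simplification under stated assumptions: the cores are taken as given, the item coefficient is the covariance-based H_i of a candidate item relative to the items already in a scale, and neither the Crit check for monotonicity nor the content-based decision for cross loaders is implemented. All names and the demonstration data are hypothetical.

```python
import numpy as np

def item_h(data, item, scale):
    """Covariance-based H_i of `item` relative to the items already in `scale`."""
    x = data[:, item]
    numerator = denominator = 0.0
    for j in scale:
        y = data[:, j]
        numerator += np.cov(x, y, bias=True)[0, 1]
        denominator += np.cov(np.sort(x), np.sort(y), bias=True)[0, 1]
    return numerator / denominator   # assumes every item has some variance

def parallel_assign(data, cores, min_h=0.3):
    """Grow all cores in parallel; return the final scales and the unassigned items."""
    data = np.asarray(data, dtype=float)
    scales = [list(core) for core in cores]
    pool = set(range(data.shape[1])) - {i for core in cores for i in core}
    while pool:
        # Test every remaining item against every core and keep the best pairing.
        best_h, best_item, best_scale = max(
            (item_h(data, i, scale), i, k)
            for i in pool
            for k, scale in enumerate(scales)
        )
        if best_h < min_h:   # no remaining item reaches the minimum H_i
            break
        scales[best_scale].append(best_item)
        pool.remove(best_item)
    return scales, sorted(pool)

# Hypothetical data: items 0-1 and 2-3 form two cores, items 4-6 are in the pool.
a = [1, 1, 2, 2, 3, 3, 4, 4]
b = [1, 4, 2, 3, 3, 2, 4, 1]
c = [1, 2, 2, 3, 3, 4, 4, 4]   # orders persons like a
d = [1, 4, 1, 3, 3, 1, 4, 1]   # orders persons like b
e = [3, 2, 2, 3, 3, 2, 2, 3]   # unrelated to both cores
demo = np.column_stack([a, a, b, b, c, d, e])
scales, leftover = parallel_assign(demo, cores=[[0, 1], [2, 3]])
```

In the demonstration data, item 4 joins the first core, item 5 joins the second core, and item 6 remains unassigned because its H_i stays below 0.3 for both scales.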
As a precondition, the MSA assumes only complete cases and integers as values. Therefore, missing values may have to be imputed. In the case of instrument testing, however, the imputation of missing data should be performed with caution. We used a two-way imputation, which is a Bayesian method for estimating missing data in tests and questionnaires [37].
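The idea behind two-way imputation can be sketched as follows. In its basic form, a missing score is estimated as the person mean plus the item mean minus the overall mean, each computed from the observed scores; adding a random error term and repeating the procedure yields several completed datasets. The Python sketch below is an illustration under these assumptions, with hypothetical names and a simplified error variance; it is not the implementation used in the study.

```python
import numpy as np

def two_way_impute(data, low=1, high=4, rng=None):
    """Fill missing scores (NaN) with rounded two-way estimates.

    estimate = person mean + item mean - overall mean (observed scores only).
    If `rng` is given, normally distributed error is added before rounding.
    Persons or items without any observed score are not handled here.
    """
    x = np.asarray(data, dtype=float)
    missing = np.isnan(x)
    person_mean = np.nanmean(x, axis=1, keepdims=True)
    item_mean = np.nanmean(x, axis=0, keepdims=True)
    overall_mean = np.nanmean(x)
    estimate = person_mean + item_mean - overall_mean
    if rng is not None:
        # Simplified error term: spread of the residuals of the observed scores.
        residual_sd = np.std((x - estimate)[~missing], ddof=1)
        estimate = estimate + rng.normal(0.0, residual_sd, size=x.shape)
    # MSA needs integers within the item range, so round and clip the estimates.
    filled = np.where(missing, np.clip(np.rint(estimate), low, high), x)
    return filled.astype(int)

# Hypothetical data with one missing response (person 2, item 3).
demo = np.array([[1, 2, 3],
                 [2, 3, np.nan],
                 [3, 4, 4]])
completed = two_way_impute(demo)

# Several completed datasets with random error, e.g. ten as in the study.
completed_sets = [two_way_impute(demo, rng=np.random.default_rng(seed)) for seed in range(10)]
```

In the deterministic call, the missing score is estimated as 2.5 + 3.5 - 2.75 = 3.25 and rounded to 3, while all observed scores are left unchanged.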
The internal consistency was assessed with the coefficient rho (ρ) of the Molenaar Sijtsma statistic. The ρ coefficient is not as prone to bias as Cronbach's α and should therefore be preferentially used [38,39]. For comparison purposes, we also calculated Cronbach's alpha (α). Values for a ρ or α between 0.70 and 0.95 indicated "good" internal consistency [40]. Finally, we conducted a part-whole-corrected item-total correlation (r_it) calculation. For this purpose, the correlation of the item to be examined with the scale without this item was computed. Items with r_it coefficients > 0.5 reflected a "high" correlation, and those with r_it > 0.3 reflected a "moderate" correlation [41]. An r_it correlation of 0.3 and below indicated that the item did not correlate well with the scale, as the item may not be measuring the same construct as the other variables.
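Cronbach's α and the part-whole-corrected item-total correlation follow directly from their standard definitions, as the Python sketch below shows. The function names and the demonstration data are hypothetical, and the ρ of the Molenaar Sijtsma statistic is not covered because its estimation is considerably more involved.

```python
import numpy as np

def cronbach_alpha(data):
    """Cronbach's alpha for a persons-by-items matrix of complete scores."""
    x = np.asarray(data, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1).sum()
    total_variance = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

def corrected_item_total(data):
    """Correlation of each item with the sum of the remaining items (r_it)."""
    x = np.asarray(data, dtype=float)
    total = x.sum(axis=1)
    return np.array([
        np.corrcoef(x[:, i], total - x[:, i])[0, 1]   # scale score without item i
        for i in range(x.shape[1])
    ])

# Hypothetical responses of four persons to a two-item subscale.
demo = np.array([[1, 2],
                 [2, 1],
                 [3, 4],
                 [4, 3]])
alpha = cronbach_alpha(demo)
r_it = corrected_item_total(demo)
```

For the demonstration data, α equals 0.75 and both corrected item-total correlations equal 0.6, which would be read as "good" internal consistency and "high" item-total correlations under the cut-offs given above.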

Study population
A summary of participant characteristics is shown in Table 1. The total number of participants was 201 dyads (persons living with mild AD and their informal caregivers). Notably, 48% (n = 90) of informal caregivers did not specify their relationship to the person with AD (Table 1).

Missing value analysis
For the DEMQOL, with its 28 items and with 201 participants, there were seven missing responses (0.1%), while in the proxy version of the DEMQOL (31 items), there were eleven missing responses (0.2%; Table 2). The missing values corresponded to seven cases of one missing item (3.6%) in the DEMQOL, as well as nine cases in which one item (3.2%) and one case in which two items (6.5%) were missing in the DEMQOL-Proxy. Ten complete datasets with two-way imputed integers for missing data were generated for the MSA.

Item distribution
Five items (20-22, 25, 26) for the DEMQOL and six items (21-24, 28, 29) for the DEMQOL-Proxy showed a ceiling effect (Table 2). The 15 identical items of the DEMQOL and DEMQOL-Proxy demonstrated that proxy ratings are typically lower than the corresponding self-ratings (Table 2).
For the DEMQOL, the H_S coefficient showed a "medium" . The assumption of monotonicity for the MSA has been achieved for each of the ten imputed samples for all items (Crit = 0). All items with a ceiling effect could be found in subscale [D].
Sensitivity analyses were carried out to determine the stability of the identified subscales using the data from the CORDIAL study follow-up surveys, which can be found in Additional file 1: Table A1 for time T1 after three months and in Additional file 2: Table A2 for Table A1). At T2, the assessment on subscale [A] improved to a "medium", whereas subscale [D] marginally worsened. The subscales [B] and [C] of the DEMQOL-Proxy improved to "strong" for both timepoints. The same result was found for subscale [D] at T2. Therefore, at T2, all subscales of the DEMQOL-Proxy are "strong" (Additional file 2: Table A2).

Internal consistency
The subscales revealed a ρ of 0.71-0.84 (α = 0.72-0.83) for the DEMQOL and a ρ of 0.79-0.88 (α = 0.81-0.87) for the DEMQOL-Proxy. According to the ρ- and α-values, all subscales were considered "good" (Table 3). The internal consistency was also "good" at T1 and T2 for all subscales, with ρ- and α-values > 0.72 (Additional file 1: Table A1 and Additional file 2: Table A2). The r_it revealed only one item (item 9 from the DEMQOL-Proxy, "irritable") that was below 0.4; this item was not scalable at all in the DEMQOL version. All other items had "high" (n = 41) or "moderate" (n = 15) r_it-values of 0.4 or more.

…
[4,15], especially to the conceptual framework of Smith and colleagues [14] as well as to their explorative factor analysis of the pretest during the development of the instrument. As such, our results can be considered to be first indications of the structural validity of the DEMQOL and the DEMQOL-Proxy.
In our data evaluation, the domain "health and well-being" of the conceptual framework is represented by subscales [A], "positive emotions", and [B], "negative emotions", except items 27 ("how you feel in yourself") and 28 ("your health overall"), which loaded onto subscale [C], "physical and cognitive functioning". Therefore, this subscale has been complemented by the word "physical". For the DEMQOL-Proxy, only item 31 ("his/her physical health") from the framework loaded onto subscale [D], "daily activities and social relationships". Subscale [C] characterizes the domain "cognitive functioning" of the conceptual framework in our data evaluation. Herein, 100% conformity for the DEMQOL-Proxy could be shown. However, item 24 of the DEMQOL ("making yourself understood") loaded onto subscale [D]. The domains "daily activities" and "social relationships" of the framework were combined into the fourth subscale of our study, which was designated as subscale [D], "daily activities and social relationships". However, item 13 of the DEMQOL ("that there are things that you wanted to do but couldn't") was an exception, as it loaded onto subscale [C], "physical and cognitive functioning". This loading may be due to the translation of the word "inability" into German. Furthermore, item 30 of the DEMQOL-Proxy ("not playing a useful part in things") was added to subscale [D]. This was the only item that could be assigned to the "self-concept" domain of the conceptual framework.
Consistent with the findings of Mulhern and colleagues, we also found subscales for "positive emotions" and "negative emotions" in our study. In our assessment, subscale [C], "physical and cognitive functioning", is related to what has been termed "cognition" in the HTA report by Mulhern and colleagues [4]. The subscales "social relationships" in the DEMQOL and "daily activities" in the DEMQOL-Proxy were assigned to the common subscale [D], "daily activities and social relationships", for both instruments within our data evaluation. Taken together, 71% of the DEMQOL items and 65% of the DEMQOL-Proxy items coincided with the results of Mulhern et al. [4]; that is, 20 items in each instrument.
Furthermore, the subscales "positive emotions" (both instruments) and "negative emotions" (DEMQOL-Proxy only) are completely identical, if the cross loaders of the DEMQOL-Proxy in the Mulhern study are not removed. Subscale [B] of the DEMQOL, "negative emotions", exhibits a difference, as given by the loading of items 8 and 9, while item 11 is not loaded. In contrast, the subscale "cognition" shows high consistencies with the DEMQOL and DEMQOL-Proxy. Differences in the DEMQOL, however, exist only in the additional loading of the items 13, 27 and 28, while item 15 was not loaded. On the DEMQOL-Proxy, item 26 was not loaded, while all other items were identical. Item 20 of DEMQOL was additionally loaded within subscale [D], "daily activities and social relationships", which was named "social relationships" by Mulhern et al. [4]. Only three items of the DEMQOL-Proxy could be seen in the data set of Mulhern et al. [4], while our data evaluation further loaded the items 21, 22, and 26-31.
In summary, our data demonstrated the determined subscales to be highly consistent with the conceptual framework of Smith et al. [15] and that these subscales further exhibit similarities to those of Mulhern et al. [4]. The sensitivity analysis at times T1 and T2 showed a stable result for assignment of the items to the subscales (Additional file 1: Table A1 and Additional file 2: Table A2). In contrast, the designation of subscale [D] was identical in the DEMQOL and DEMQOL-Proxy, while this finding differed according to Mulhern et al. [4] and was therefore presented as "social relationships" in the DEMQOL and "daily activities" in the DEMQOL-Proxy. However, if both subscales were taken together, they represent a similar construct as our subscale [D], "daily activities and social relationships".
In accordance with the HRQoL definition from Hays and Reeve [7] provided in the background, the subscales we found in our study cover the aspects of "how well a person functions in his/her life" (physical and cognitive functioning [C]) and his or her "perceived well-being in physical (daily activities), mental (positive and negative emotions [A-B]), and social domains (social relationships [D]) of health".

Limitations
This psychometric study used datasets from the CORDIAL study by Kurz et al. [21]. Since additional data could not be obtained, no statements on the inter-rater reliability of the DEMQOL-Proxy can be made to estimate the quality of the underlying data. Similarly, it was no longer possible to influence the number of study participants. According to a study by Straat et al. [46], a sample of more than 250 respondents should be available if the quality of the answers is high. In the present study, however, only the data of 201 persons could be used, reflecting a limitation of the results. Furthermore, the CORDIAL study included only persons with a mild severity of AD. The mild form might explain the ceiling effects of subscale [D]. Thus, the generalizability of the results is limited due to the absence of other forms of dementia.

Conclusions
In this psychometric study using the German versions of the instruments DEMQOL and DEMQOL-Proxy, four equal subscales were found in both instruments, demonstrating "good" internal consistency. The subscales reflect the conceptual framework of the instrument developers to a high degree. Thus, the results can be considered a first indication of the construct validity of the two German versions. In our opinion, DEMQOL subscale scores are more explanatory than a total score because HRQoL is a multidimensional concept and respective domain scores may help clarify treatment impacts. Moreover, our internal consistency results reflect the homogeneity of the subscales. However, Chua et al. [19] used bifactor models for direct comparisons between total and subscale scores and showed that the latter scores had poor reliability and should not be used. Such direct comparisons were not performed in our study due to differences in modeling decisions (parallel iterative MSA rather than bifactor modeling). The merits of MSA and bifactor modeling for clarifying multidimensionality are debatable [47,48]. Therefore, more empirical data are needed before definitive recommendations can be made.
However, we recommend further investigations prior to or integrated in future studies using the German version of the DEMQOL and DEMQOL-Proxy. In particular, we advise the integration of the evaluation of test-retest reliability (both DEMQOL versions) and inter-rater reliability (DEMQOL-Proxy) as part of the instrument application in future studies. Continued research should also be carried out on structural validity using various latent variable models. In addition, given the ceiling effects we found in our study, data from persons living with moderate dementia, or with severe dementia for the proxy version, should be investigated further. For the proxy version, it would also be important to conduct a study with professional nurses to make statements on the use of the instrument in German nursing homes. Similarly, the instruments should be analyzed against external criteria to examine other aspects of construct validity.