Psychometric properties of multicomponent tools designed to assess frailty in older adults: A systematic review

Background: Frailty is widely recognised as a distinct multifactorial clinical syndrome that implies vulnerability. The links between frailty and adverse outcomes such as death and institutionalisation have been widely evidenced. There is currently no gold standard frailty assessment tool; optimising the assessment of frailty in older people therefore remains a research priority. The objective of this systematic review is to identify existing multi-component frailty assessment tools that were specifically developed to assess frailty in adults aged ≥60 years and to systematically and critically evaluate the reliability and validity of these tools.
Methods: A systematic literature review was conducted using the standardised COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist to assess the methodological quality of included studies.
Results: A total of 5063 studies were identified, 73 of which were included for review. Thirty-eight multi-component frailty assessment tools were identified. Reliability and validity data were available for 21 % (8/38) of tools. Only 5 % (2/38) of the frailty assessment tools had evidence of reliability and validity that was within statistically significant parameters and of fair-excellent methodological quality (the Frailty Index-Comprehensive Geriatric Assessment [FI-CGA] and the Tilburg Frailty Indicator [TFI]).
Conclusions: The TFI has the most robust evidence of reliability and validity and has been the most extensively examined in terms of psychometric properties. However, there is insufficient evidence at present to determine the best tool for use in research and clinical practice. Further in-depth evaluation of the psychometric properties of these tools is required before they can fulfil the criteria for a gold standard assessment tool.
Electronic supplementary material The online version of this article (doi:10.1186/s12877-016-0225-2) contains supplementary material, which is available to authorized users.


Background
It is estimated that between the years 2000 and 2050, the percentage of the world's population over 60 years old will double from 11 to 22 % [1]. Frailty is considered one of the most complex and important issues associated with human ageing, with significant implications for both patient outcomes and healthcare service utilisation. The links between frailty and increased risk of adverse outcomes such as falls, loss of functional independence, decreased quality of life, institutionalisation and mortality have been clearly evidenced [2][3][4][5][6][7].
A recent systematic review of frailty prevalence worldwide concluded that 10.7 % of community-dwelling adults aged ≥65 years were frail and 41.6 % pre-frail [8]. Prevalence figures varied substantially between studies (ranging from 4.0 to 59.1 %), with studies applying a physical phenotypical definition of frailty consistently reporting lower prevalence rates than those utilising a broader definition of frailty which included psychosocial domains [8]. This highlights the potential disparities in the identification of frailty depending on the definition of frailty applied.
The issue of identifying frailty is compounded by the fact that there is currently no universally accepted operational definition of frailty. A recent Delphi method-based consensus statement on frailty concluded that additional research into clinical and laboratory biomarkers of frailty is needed before an operational definition of frailty can be achieved [9]. However, expert agreement was reached on the basic theoretical underpinnings of frailty, and the points agreed reflected the defining characteristics of frailty for which there is a consensus in the literature. It is widely recognised that frailty is a distinct multifactorial clinical syndrome or state that is separate from, but often associated with, disease and disability [9][10][11]. Frailty is considered to be a dynamic, non-linear process characterised by decreased reserves and resistance resulting in poor maintenance of physiological homeostasis [10][11][12]. The dynamic nature of the frailty syndrome gives rise to the potential for preventative and restorative interventions.
Many models have been suggested to conceptualise frailty; however, at present there is no gold standard. The two models which have the largest evidence-base and are the most widely accepted are the Cardiovascular Health Study (CHS) Phenotype Model [13] and the Canadian Study of Health and Aging (CSHA) Cumulative Deficit Model [14]. The CHS Phenotype Model [13] establishes a frailty phenotype with 5 variables (involuntary weight loss, self-reported exhaustion, slow gait speed, weak grip strength and self-reported sedentary behaviour), whereas the CSHA Cumulative Deficit Model [14] measures frailty via an index of age-related deficits including diseases and disabilities.
A wide variety of tools to screen for, diagnose and measure frailty have been developed based on models of frailty. However, at present no existing assessment tool is considered to be of a gold standard. In view of the predicted rise in the world's older adult population, the prevalence of frailty in this population, the evidenced links between frailty and adverse outcomes, and the potential for preventative and restorative interventions, the accurate assessment of frailty remains a significant clinical and research priority.
Six systematic reviews regarding the assessment of frailty have been published to date [15][16][17][18][19][20]. One review focused on the identification of frailty assessment tools [15]. Two reviews focused on the diagnostic test accuracy of frailty assessment tools; one reviewed the accuracy of simple measures to assess frailty [16] and one reviewed the sensitivity, specificity and predictive validity of instruments based on major theoretical views of frailty [17]. A further review examined the criterion validity, construct validity and responsiveness specifically of Frailty Indexes [18]. These reviews focused on the appraisal of a specific subset of frailty assessment tools and did not examine all aspects of validity or explore the reliability of the tools identified. Only two reviews have reported an evaluation of both the reliability and validity of frailty assessment tools [19,20]; their literature searches were completed in February 2010 and May 2011, respectively. Given the current vast expansion of the frailty literature, an updated review in this area is justified. The evaluation of psychometric properties was not the sole focus in either review [19,20]. An in-depth evaluation of all available reliability and validity data for existing frailty assessment tools, including an assessment of both the methodological quality of the evidence presented and the statistical significance of the results, has not been completed. Further, both of these earlier reviews included studies which reported the assessment of frailty via tools that were developed to assess alternative constructs such as disability rather than frailty per se. Tools that have been developed to assess alternative constructs will be based on alternative conceptual models and frameworks that do not represent all aspects of frailty, resulting in limited construct validity when applied to the measurement of frailty.
Also, where a tool has been developed to measure a concept that is distinct from but linked to frailty, such as disability, there is a significant chance of confounding of the assessment's results, leading to the inaccurate assessment and diagnosis of frailty based on disability factors alone. The inclusion of such tools in a review limits the conclusions that can be drawn in specific reference to the assessment of frailty. One review also included studies involving single-component assessment tools such as grip strength as a single measure [19]. Given the multifactorial and complex nature of the frailty syndrome, a tool to assess frailty should be multi-component to capture this multifactorial complexity and grounded within a robust evidence-based model of frailty. Tools originally created to assess an alternative concept but later applied to frailty assessment suggest a lack of theoretical robustness, as does the application of a single-component assessment tool to assess a multifactorial clinical syndrome. Consequently, the aims of this review were to systematically and critically evaluate the available evidence concerning the reliability and validity of multi-component frailty assessment tools that were specifically developed to assess frailty in older adult populations, and to establish the tool with the best evidence to support its use in both research and clinical settings.

Selection criteria
Studies were selected for inclusion for review if they met the following criteria: Study participants were aged ≥60 years. The study described a multi-component tool (defined as a tool that assesses ≥2 indicators of frailty; single-component tools were excluded due to the multifactorial and complex nature of the frailty syndrome). The study described a tool that was specifically developed to assess frailty (tools which were developed for alternative purposes and then applied to measure frailty were excluded as they do not exclusively assess frailty, but may assess related constructs such as disability, resulting in a potentially invalid assessment of frailty and misdiagnosis). The main purpose of the study was the development and/or evaluation of the reliability and validity of a multi-component tool to assess frailty. The study applied the original version of a multi-component tool to assess frailty (studies citing modified versions were excluded as reliability and validity data relate to the modified tool only, and reviewing all modified versions was beyond the scope of this review due to the large number of modified tools identified in the literature). The study reported quantitative data (the study must have reported inferential validation; studies reporting descriptive data alone were excluded). Studies were available in English or were translated wherever possible.
Studies were screened and selected for inclusion by JLS.

Assessment of the methodological quality of studies and data extraction
The COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist is a standardised tool for assessing the methodological quality of studies examining the measurement properties of health-related instruments [21][22][23]. It assesses measurement properties in a number of domains: Internal Consistency (the degree of the inter-relatedness among items), Reliability (the proportion of the total variance in measurements due to "true" differences among patients), Measurement Error (the systematic and random error of a patient's score that is not attributed to true changes in the construct to be measured), Content Validity (the degree to which the content of an instrument is an adequate reflection of the construct to be measured), Construct Validity (the degree to which the scores of an instrument are consistent with hypotheses based on the assumption that the instrument validly measures the construct to be measured), Criterion Validity (the degree to which the scores of an instrument are an adequate reflection of a "gold standard") and Responsiveness (the ability of an instrument to detect change over time in the construct to be measured) [22]. A "gold standard" measurement instrument is defined in the context of the COSMIN checklist as a valid and reliable instrument that has been widely accepted as a gold standard by experts in the field of its application [21][22][23].
Structural Validity (the degree to which the scores of an instrument are an adequate reflection of the performance of the dimensionality of the construct to be measured), Hypothesis Testing (item construct validity; the formulation of a hypothesis a priori with regard to correlations between the scores on the instrument and other variables e.g. with regard to internal relationships or relationships with scores on other instruments) and Cross Cultural Validity (the degree to which the performance of the items on a translated or culturally adapted instrument are an adequate reflection of the performance of the items of the original instrument) are assessed as part of Construct Validity [22].
With respect to scoring, each item in the COSMIN checklist is rated as 'excellent', 'good', 'fair', or 'poor' quality [21][22][23]. A rating of 'excellent' indicates that the evidence provided for that measurement property is adequate [21]. A rating of 'good' indicates that the evidence provided can be assumed to be adequate (although all relevant information may not be reported) [21]. Finally, ratings of 'fair' and 'poor' indicate that the evidence provided is questionable and inadequate, respectively [21].
The COSMIN checklist was applied to each study and data were extracted by two independent, blind raters (JLS, RLG, MCC, AMB, EVW, SD, SPN). Any disagreements were resolved through discussion. Data were then extracted regarding the methods and outcomes of the statistical analyses employed in each study to assess the identified measurement properties of each assessment tool. The outcomes of the statistical analyses employed by each study were compared to the accepted statistical parameters of significance for said test as identified in medical statistics literature (see Additional file 1 footnote). This allowed for the identification of statistically significant evidence of measurement properties testing.

Reporting
This review followed the PRISMA standards [24] for reporting of systematic reviews (see Additional file 2).

Literature search and inclusion for review
A total of 5063 studies were identified, 73 of which were included for review following assessment against inclusion criteria (see Fig. 1) [2,13,…].
The TFI and the FI-CGA were the only tools which had both reliability and validity data within statistically significant parameters of fair-excellent quality [45-47, 55, 57, 88-94]. The TFI had acceptable evidence of psychometric testing for 4 measurement domains: Reliability, Content Validity, Hypothesis Testing and Responsiveness. The FI-CGA had acceptable evidence of psychometric testing for 3 measurement domains: Reliability, Content Validity and Hypothesis Testing. The following tools were found to have no reliability or validity evidence of fair-excellent quality within statistically significant parameters: BFI [29], EFS [41,42], Frailty Index for Elders [51], Frail Non-Disabled Instrument [52], Frailty Screening Tool [53], Marigliano-Cacciafesta Polypathological Scale [68] and Strawbridge Frailty Measure [82,83].

Discussion
To the authors' knowledge this is the first review of the overall reliability and validity of multi-component frailty assessment tools that were specifically developed to assess frailty in older adult populations. This review presents a comprehensive list of multi-component frailty assessment tools for which there are published psychometric data.
Whilst 73 papers met the inclusion criteria for review, many more were excluded as they directly or indirectly reported on the psychometric evaluation of an amended version of an established frailty assessment tool. This was predominantly observed in relation to the CHS Phenotype Model [13] and the CSHA Cumulative Deficit Model [14], where modified versions of Fried's Phenotype of Frailty tool and Mitnitski's Frailty Index were applied. While evidence from such studies supports the robustness of these models to conceptualise frailty, it does not provide evidence for the reliability or validity of the original assessment tool. This application of non-standardised versions of frailty assessment tools within  [14]. A wide range of non-standardised Frailty Indexes were identified in the literature, which it was outside the scope of this review to explore; a recent systematic review by Drubbel et al. [18] specifically explored the criterion validity, construct validity and responsiveness of the Frailty Indexes when applied in a community-dwelling older adult population. Many of the frailty assessment tools included for review were developed and tested retrospectively using data available from large-scale longitudinal studies, or were developed in conjunction with a larger trial whose main aim was not the development of a frailty assessment tool. This lack of focused primary research may partly explain why there are limited reliability and validity data of high quality for many of the tools identified.
In summary, the GFI and TFI were the most frequently examined tools with respect to psychometric properties (11 and 9 studies respectively). 22/38 tools identified had only 1 study concerning psychometric properties; this limited evidence-base reduces the generalisability of the results and conclusions that can be drawn.
Health measurement instruments must be both reliable and valid to ensure diagnostic accuracy and consistency in measurement [23]. Of the 38 multi-component frailty assessment tools identified, no tool has been examined in all reliability and validity domains assessed by the COSMIN checklist. The TFI and GFI had the most psychometric domains explored (8/9 and 7/9 domains, respectively). However, not all of this evidence was assessed to be of fair-excellent quality within statistically significant parameters. Only the TFI and FI-CGA had reliability and validity data within statistically significant parameters of fair-excellent quality. The TFI had acceptable evidence of psychometric testing for 4 measurement domains, while the FI-CGA had acceptable evidence of psychometric testing for 3 measurement domains.

Research and clinical implications
The frailty assessment tool that has been most extensively examined in terms of its psychometric properties and has the most robust evidence supporting its reliability and validity is the TFI. However, for a frailty assessment tool to meet the requirements of a gold standard it must be based on a universally accepted operational definition of frailty and have evidence pertaining to all aspects of the tool's reliability and validity of high methodological quality [9]. Further research of good-excellent quality is needed, encompassing all aspects of reliability and validity, before the TFI tool can be classified as a gold standard.
The application of a tool without a strong evidence-base of reliability and validity significantly increases the risk of invalid assessment and misdiagnosis of frailty. The consequent implications for research are substantial, including an increased likelihood of the interpretation and reporting of flawed results. The implications for treatment provision and patient outcomes in a clinical setting are also substantial, with potential for decreased recognition of risks for adverse outcomes, inappropriate treatment planning and inappropriate allocation of resources, including unsuitable provision of preventative and restorative interventions. Therefore, the scope and quality of reliability and validity evidence must be considered when selecting an assessment tool in both settings. Other key considerations when selecting a frailty assessment tool are the interpretability and generalisability of the evidence-base. Evidence of the reliability and validity of an assessment tool relates only to its application within the specific setting and population that it was developed for and validated in. The utility of the tool should also be considered, specifically the appropriateness of the mode of administration in relation to the setting and the time and resource demands associated with the tool. The development and psychometric evaluation of frailty assessment tools should be the primary focus of research projects to further develop a strong evidence-base. When evaluating existing tools, studies should apply a standardised version where feasible. The consensus on a universally accepted operational definition of frailty should also be a key focus of future frailty research to support the development of a gold standard frailty assessment tool.

Limitations of the review
The selection of studies for inclusion was completed by the lead author (JLS) only, which increased the potential for selection bias; this risk was minimised by following a comprehensive search strategy and the PRISMA standards for reporting in systematic reviews [24]. Studies examining tools that were not specifically developed to assess frailty were excluded; this resulted in the exclusion of some tools such as the Short Physical Performance Battery [96] and Comprehensive Geriatric Assessment [97] which have been referred to in the frailty literature as tools with potential utility in assessing frailty as part of a wider comprehensive assessment. This limits the scope of this review, but was considered reasonable given the complexity of the frailty syndrome. Studies which directly or indirectly reported on the psychometric evaluation of an amended version of an established frailty assessment tool were also excluded. This again limits the scope of the review but was considered reasonable due to the large number of studies citing modified tools identified in the literature and the large variation in the types of modifications.
The COSMIN checklist has several limitations in its application. When assessing Criterion Validity, the COSMIN checklist requires the comparator tool to be of a gold standard. There is currently no gold standard frailty assessment tool. Thus, whilst the majority of studies included for review assessing Criterion Validity compared one frailty assessment tool to another widely used tool, the COSMIN guidance stipulated that this should be rated as evidence of poor methodological quality in relation to Criterion Validity. The COSMIN guidance does, however, allow for this relationship between frailty assessment tools to be rated as part of Construct Validity, so the evidence of validity provided by such studies was still represented in the COSMIN scoring system. With regard to the COSMIN scoring system, the overall methodological quality rating per measurement property is obtained by taking the lowest rating of all the items assessed for that property, giving a 'worst score counts' rating [21][22][23]. Occasionally, however, a measurement property scored highly for all items assessed except for one, which resulted in a 'poor' overall score that did not accurately reflect all the presented evidence. Such a measurement property received the same overall rating as measurement properties that had entirely poor ratings for all items. It was not within the scope of this systematic review to differentiate between such ratings on an item-by-item basis when reporting results. Whilst this is a limitation, receiving a rating of 'poor' for one item is an indication of inadequate methodological quality, so it does not impact on the overall quality assessment. The application of the COSMIN checklist, a standardised tool developed specifically to assess the methodological quality of studies examining the measurement properties of health-related instruments, remains a strength of this review.
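The lowest-rating rule described above can be illustrated with a short Python sketch. This is only an illustration of the aggregation logic, not the COSMIN authors' software: the function name `overall_rating`, the `RATING_ORDER` list and the example item ratings are all hypothetical and are not taken from the included studies.

```python
# COSMIN four-point scale, ordered from lowest to highest methodological quality.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def overall_rating(item_ratings):
    """Return the lowest item rating for one measurement property.

    Hypothetical helper: the overall rating for a property is the
    worst rating given to any of its checklist items.
    """
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # list.index gives each rating's rank; an unknown label raises ValueError.
    return min(item_ratings, key=RATING_ORDER.index)


# Hypothetical item ratings for two measurement properties.
mostly_high = ["excellent", "good", "excellent", "poor"]
uniformly_poor = ["poor", "poor", "poor", "poor"]

print(overall_rating(mostly_high))     # poor
print(overall_rating(uniformly_poor))  # poor
```

Both hypothetical properties receive the same overall rating, which is the loss of item-level detail noted in the paragraph above.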

Conclusions
This review provides an up-to-date comprehensive list of all multi-component frailty assessment tools for which there are published psychometric data. It identifies a large number of multi-component frailty assessment tools in existence; however, the breadth and quality of the evidence on the psychometric properties of these tools are limited. Only the FI-CGA [45][46][47] and TFI [54, 56, 86-94] have both reliability and validity data within statistically significant parameters and of fair-excellent quality. However, this should be interpreted with caution as a score of 'fair' on the COSMIN checklist means that the evidence is only of questionable quality. At present, the TFI has the most robust evidence-base supporting its reliability and validity in assessing frailty. However, the psychometric properties of the TFI and all other multi-component frailty assessment tools require further in-depth evaluation before they can fulfil the criteria for a gold standard assessment tool, and before definitive conclusions regarding the best tool for use in research and clinical settings can be drawn.
All authors have read and approved the final version of the manuscript.