We report the predictive utility of a short self-report postal screener suitable for administration to older people in the primary care setting. The postal fall risk screener was simple to administer but had limited accuracy in identifying those at higher risk of falling. According to accepted standards of interpretation, accuracy was poor for single falls and fair/acceptable for repeat falls. We also examined whether the addition of easy-to-collect sociodemographic and health-related variables could improve prediction; these additional variables improved all fall models to fair accuracy. This raises the possibility of extending current UK and international guidance for opportunistic falls screening to a more systematic, population-based approach. The postal screener predicted fracture risk less well. Fracture risk was only partially predicted by fall risk, although the addition of age, gender and BMI improved the performance of the short tool.
Our PreFIT postal screener incorporated risk factors previously identified within the US WHAS [21]. We examined the characteristics of an ‘ideal’ predictive screening tool, using recommended approaches to develop clinical prediction models statistically: data from a derivation or development sample are used to build the model, which is then re-tested in a separate sample from a similar target population [32, 36]. The WHAS model was internally validated using appropriate statistical modelling but had yet to be tested in a large, independent, external UK dataset. We aimed to examine uptake and completion of a self-report postal screening version rather than a face-to-face, researcher- or clinician-administered tool. The questions were short and understandable, asking about difficulties with balance whilst undertaking simple daily activities. Screener completion rates were high, with almost 90% returned to general practices.
Our short one-page screener was better at predicting those at greatest risk of falling, namely people who fell repeatedly over one year. This is perhaps unsurprising but reassuring, as this is the group of fallers most likely to experience injury and decline in function [20]. Prediction improved slightly with adjustment for age, sex, frailty, and physical and mental quality of life; lower mental health scores were predictive of falls. However, adding all of these items would expand the screener considerably, from three to 33 questions (SFI = 16 items, SF-12 = 12 items, age + sex = 2). This would undoubtedly impact upon completion and return rates, with minimal improvement in predictive utility. Prediction of fractures improved when adjusted for age, sex and BMI alone (AUC 0.71), although this is still interpreted as ‘fair’ at best. We found evidence of collinearity between frailty and quality of life affecting the fracture models, so these predictors were excluded. Few studies have tested the utility of predicting fall-related fractures, although several have examined injurious falls. Overall, findings are mixed, possibly due to small sample sizes, low event rates and variability in definitions of the extent and type of injury [37].
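To illustrate the kind of diagnostic that can flag such collinearity, the sketch below computes variance inflation factors (VIFs) with plain NumPy. The variables and data are hypothetical stand-ins, not the PreFIT predictors or the specific check used in our models.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X: regress it on the remaining columns,
    then VIF_j = 1 / (1 - R^2_j). Values above roughly 5-10 are a
    common rule-of-thumb signal of problematic collinearity."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Hypothetical data: frailty strongly correlated with physical quality of life.
rng = np.random.default_rng(0)
frailty = rng.normal(size=500)
phys_qol = -0.8 * frailty + rng.normal(scale=0.4, size=500)
ment_qol = rng.normal(size=500)
print(variance_inflation_factors(np.column_stack([frailty, phys_qol, ment_qol])))
```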
A thorough, high-quality systematic review by Gade and colleagues identified 72 different prognostic models across 30 studies predicting falls in community-dwelling older adults (including the WHAS development cohort), although only three models had been externally validated using data not used in the original model development [38]. Overall risk of bias was deemed high, due to a lack of standardised definitions for falls, small sample sizes and statistical concerns over model performance. The AUC values of the three externally validated models reported by Gade et al. ranged from 0.62 to 0.69.
Our analytical approach adheres to TRIPOD recommendations for the development and validation of prediction models [18]. Prediction models are widely used to augment rather than replace clinical decision making, informing treatment or the need for further testing [32]. Risk tools either generate a continuous score used to estimate cumulative risk or yield a categorical score whereby individuals are classified as at risk or not. We recommend recent explanatory guides on applying prediction models in the clinical setting (e.g. [39]). Within falls prevention, these decisions are typically binary and thus require decision thresholds that are clinically relevant [33]. The optimal cut-off for a screener should be chosen according to the relative costs of administration and subsequent referral for treatment, based on consideration of false positives and false negatives. In the context of falls, misclassifying those at lowest risk (false positives) may be less of an issue than underestimating those at highest risk of falling (false negatives).
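As a minimal illustration of this trade-off, the sketch below selects a cut-off on a continuous risk score by minimising a misclassification cost in which false negatives are weighted more heavily than false positives. The cost ratio, score and outcome data are hypothetical, not values from PreFIT.

```python
import numpy as np

def choose_threshold(scores, outcomes, fn_cost=5.0, fp_cost=1.0):
    """Pick the cut-off on a continuous risk score that minimises
    expected misclassification cost, penalising missed fallers
    (false negatives) more heavily than false alarms."""
    best_t, best_cost = None, np.inf
    for t in np.unique(scores):
        pred = scores >= t
        fn = np.sum(outcomes & ~pred)   # high-risk people screened out
        fp = np.sum(~outcomes & pred)   # low-risk people referred on
        cost = fn_cost * fn + fp_cost * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# Hypothetical data: a 0-100 risk score where fall risk rises with the score.
rng = np.random.default_rng(1)
score = rng.uniform(0, 100, size=1000)
fell = rng.random(1000) < (0.05 + 0.004 * score)
print(choose_threshold(score, fell, fn_cost=5.0, fp_cost=1.0))
```

Raising `fn_cost` relative to `fp_cost` lowers the chosen threshold, reflecting the clinical preference here for catching more true fallers at the expense of more false alarms.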
It is important for clinicians to understand the populations in whom models are developed and validated, bearing in mind any differences between source and test population characteristics. Item selection for our postal screener was informed by the WHAS dataset from a different setting: a US sample of older women with mild disability. PreFIT included both sexes, without restriction by upper age or by functional or cognitive ability. Our large sample of almost 10,000 older people from rural and urban settings across England, with an age range spanning thirty years (70 to 101 years), included those with morbidity and disability. We used this comprehensive trial dataset to develop and validate our prognostic risk models using a split-sample validation approach (Type IIa), as outlined by TRIPOD [18]. Calibration was excellent, although this may be expected given that a similar recruitment approach was used across groups in our sample.
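The general split-sample workflow can be sketched as follows: develop a model on a random half of the data, then assess discrimination (AUC) and calibration in the held-out half. The code below is a schematic illustration using scikit-learn with simulated data; the predictors and model are placeholders, not the PreFIT screener items or our actual models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Simulated predictors standing in for screener items plus covariates.
rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 5))
lin = -1.5 + X @ np.array([0.6, 0.4, 0.3, 0.2, 0.1])
y = rng.random(n) < 1 / (1 + np.exp(-lin))

# TRIPOD Type IIa: random split into development and validation halves.
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression(C=1e6).fit(X_dev, y_dev)  # large C ~ unpenalised
p_val = model.predict_proba(X_val)[:, 1]

print("AUC:", roc_auc_score(y_val, p_val))           # discrimination

# Calibration slope: refit the outcome on the linear predictor in the
# validation half; a slope near 1 suggests predictions are neither too
# extreme nor too flat.
lp = np.log(p_val / (1 - p_val))
cal = LogisticRegression(C=1e6).fit(lp.reshape(-1, 1), y_val)
print("calibration slope:", cal.coef_[0][0])
```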
We carefully considered our testing approach, given demands to improve the methodological quality and reporting standards of prognostic risk models. A systematic review of 120 prediction models found that the median number of predictors included in statistical models was six, the median sample size was 795 and the median number of outcome events was 106 [32]. Compared with the findings of this review, our sample size was large and our number of events (> 10,000 falls; > 300 fracture events) far exceeded those in other fall risk prediction studies. Many existing falls assessment tools were developed using expert clinical consensus, or by selecting associative factors from cross-sectional studies, rather than by formal quantitative testing of risk factors identified in prospective cohorts. Although modest (fair to poor), our ROC values are comparable with the US Stopping Elderly Accidents, Deaths and Injuries (STEADI) fall risk screener, which reported an AUC of 0.64; as with our findings, the STEADI models also improved (to an AUC of 0.67) with the addition of sociodemographic characteristics [6]. The authors of STEADI argue that the sensitivity and specificity of their algorithm, interpreted as having moderate predictive validity, were tempered by the simplicity of measurement and adaptability for large-scale survey purposes. Yet STEADI is a much longer screening tool and involves face-to-face assessment and balance tests. Extensive face-to-face objective testing is overly complex for routine use in primary care services at a population level. Muir et al. [40] argued that self-reported balance problems are on a par with more detailed measures and assessments of postural stability, such as observational gait and tandem stance tests.
Factors affecting sensitivity and specificity include prevalence and the length of observation [41]. Unlike conventional screening for the presence or absence of disease, fall-related risk factors can change quickly over time, and an important question for clinical recommendations is when screening should be repeated in older adults. Most hospital-based studies have follow-up ranging from days to weeks, whereas primary care and cohort studies generally have longer time frames spanning a year or more. Longer-term monitoring may be required for those at higher risk; one option could be annual re-administration of the short screener within primary care, to identify the subgroup most likely to benefit from falls prevention services.
The PreFIT screener also compared well with established clinical prediction tools in other areas of medicine. Higaonna [41] reported that, even in the acute hospital setting, no hospital-based fall screening tool exhibited both sensitivity and specificity > 0.70, the minimum predictive validity criteria suggested by Oliver et al. [42]. Alba and colleagues [31] systematically reviewed the discriminatory capacity of prediction tools for stroke risk in patients with atrial fibrillation (AUC range 0.66 to 0.75), risk of prostate cancer (AUC 0.56 to 0.74), mortality in people with heart failure (AUC 0.62 to 0.78), and adverse events in adults after discharge from emergency departments (AUC 0.58 to 0.64). These predictive tools had been validated in at least five different external cohorts.
Strengths
PreFIT had good geographical spread across England, with excellent representation of the ‘oldest old’ living independently: a third of participants were aged 80 years or older (n = 3,248). We used standardised outcome definitions as per international recommendations (e.g. PRoFaNE [28]), outcome assessors were blinded to all baseline predictors, and data completeness was high (95% and 99.9% for falls and fractures, respectively). We interpreted AUC values cautiously, according to statistical guidance [34]. Current international guidance recommends at least 200 events for prediction modelling, and our falls analyses far exceeded this minimum requirement [18, 32]. Guidelines highlight that it is preferable to refine existing models rather than to develop new risk models, hence our split-sample validation using risk factors from the WHAS dataset. Few falls prediction tools incorporate ADL difficulties, and it proved feasible to screen for the ability to complete basic or instrumental ADLs. Future research could test telephone administration of the screener; it is also feasible for use by those with limited clinical training.
Limitations
Our analytical approach adheres to TRIPOD Type IIa, random split-sample development and validation [18]. The next phase of research would be full external validation (Type 4) of the PreFIT screener in a separate cohort of older adults. Nevertheless, our trial dataset was large and used cluster sampling (GP practices), whereby participants were recruited and completed baseline measures prior to randomisation, so the risk of contamination was low. There is a risk that trial interventions were confounded with the risk of falls or fracture, although our main analysis did not identify a significant treatment effect on fracture outcomes. Although the postal screener was short and the completion rate high, we did not impute data for the 12% who failed to return a screener. Moreover, response rates may be lower among those who have not consented to participate in a research study. Our sample comprised predominantly white, cognitively intact, community-dwelling older adults, with a small proportion (< 5%) scoring lower on the clock drawing test (0–3). This may have contributed to the high response and completion rates. Representation of Black and minority ethnic groups within our recruited sample was lower than in the English population [20]. Finally, we analysed fall events reported in postal questionnaires with four-month recall periods rather than falls reported in prospective monthly diaries, because we observed that attrition increased over time when falls diaries were used over 12 months [29]. We accept that recall bias and under-reporting of falls are potential limitations.