Translation
The BESTest and the Mini-BESTest were translated into Norwegian following international guidelines [16]. Three experienced physiotherapists, fluent in both English and Norwegian, each translated both tests. The three versions were compared and discussed until agreement was reached. In addition, a senior researcher commented on the translation. A professional translator then back-translated both tests into English. Throughout the translation process we communicated with the original author (Fay Horak), who gave us permission to conduct the translation and also approved the back-translated version. In the Norwegian versions, metres and centimetres, rounded up to the closest centimetre, are used instead of inches and feet.
Design and subjects
This was an observational study with a test-retest design. Three groups of participants were recruited to ensure heterogeneity of balance impairments: elderly people over 65 years of age, people with a history of stroke, and people with Multiple Sclerosis (MS), representing different clinical conditions and fall-risk profiles. To be included, participants had to be able to walk 6 m without a walking aid and to attend testing on two occasions with a two-day interval. The exclusion criterion was inability to understand or follow oral instructions. Eligible people were asked to participate by their treating physiotherapist at one of three inclusion sites: the Geriatric rehabilitation ward at Oslo University Hospital (OUS), the Department of physiotherapy at Oslo and Akershus University College of Applied Sciences (HiOA), and the Multiple Sclerosis Centre Hakadal, in the period 01.09.11–01.06.12. Forty-two participants were included in the study.
Qualification of raters
Two raters with extensive experience conducted the assessments. Rater A had 20 years of experience both as a clinical physiotherapist and as a teacher in physiotherapy education at the university; rater B had 16 years of experience as a clinical physiotherapist.
Both raters had attended a 3-day BESTest workshop led by the developer of the tests and had also watched the training videos available at the BESTest web portal [19]. Before the study, the raters had three training sessions in which they were allowed to discuss with each other how to score the test items. The raters were not allowed to discuss the scoring during the study period.
Procedures
The test sessions took place at the three inclusion sites, and the same test equipment was used at all three sites. The participants were tested twice with a two-day interval; both test sessions were conducted in the same room and at the same time of day. The participants were instructed to live as usual and to take their medication according to their normal regimen during the test period. Before the second test session, the participants were asked about any relevant changes in self-perceived balance since the first test session. The participants performed the BESTest and the Mini-BESTest barefoot, except for the tasks in section VI, where they were allowed to wear flat-heeled shoes. All participants used the same shoes at both sessions.
Both the BESTest and the Mini-BESTest were scored at both test sessions. Rater B administered all the tests at both sessions, while both raters scored the participants' performance from the same test trials. All items in the Mini-BESTest are included in the BESTest, so each task (item) was performed only once and scored according to the test criteria. The participants were allowed to rest as needed during the test sessions.
Demographic information (age, weight, height, diseases, number of medications and number of falls during the last year) was obtained by interviewing the subjects before the first test session. At the end of the first session, the FES-I was administered by rater B as a structured interview. The total time for the first session was approximately 60 min. The second session did not include the interviews and took approximately 40 min to administer.
Assessment tools
BESTest
The BESTest consists of 27 different tasks with a total of 36 items, because some tasks include testing of both the right and the left side of the body [2]. All items are scored on a 4-point ordinal scale (0–3), with higher scores indicating better balance. Scores are calculated for each of the 6 sections and as a total across all items (0–108). Section scores and the total score are usually converted to percent scores. The BESTest takes approximately 35–40 min to complete.
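The conversion to percent scores can be illustrated with a minimal sketch. It assumes that a percent score is the sum of item scores divided by the maximum attainable sum (3 points per item), times 100; the function name `percent_score` and the example item scores are hypothetical and not taken from the study.

```python
def percent_score(item_scores, max_per_item=3):
    """Convert a list of ordinal item scores to a percent score.

    Assumption: percent = 100 * achieved points / maximum attainable points.
    """
    if not item_scores:
        raise ValueError("at least one item score is required")
    if any(s < 0 or s > max_per_item for s in item_scores):
        raise ValueError("item scores must lie between 0 and max_per_item")
    return 100.0 * sum(item_scores) / (max_per_item * len(item_scores))


# Hypothetical participant: 36 BESTest items, 81 of 108 points in total.
items = [3] * 18 + [2] * 9 + [1] * 9
print(f"Total: {sum(items)}/108 = {percent_score(items):.1f}%")
```

The same function applies to a single section by passing only that section's items.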
Mini-BESTest
The Mini-BESTest consists of 14 tasks focusing on dynamic balance [10]. The items are scored on a 3-point ordinal scale (0–2), giving a maximum score of 32 points, with higher scores indicating better balance. The Mini-BESTest takes approximately 10–15 min to administer.
Falls Efficacy Scale - International
The FES-I is a 16-item questionnaire for assessment of fall-related self-efficacy, related to the performance of common activities in a person’s everyday life [15]. The items are rated according to "how concerned you are about the possibility of falling" using a 4-point scale (1–4) with the following responses: 1) not at all, 2) somewhat, 3) fairly, 4) very concerned, giving a total score from 16 to 64 points. Higher scores indicate greater concern about falling. The Norwegian translation of the FES-I has been established in individuals with increased risk of falling [20].
Data analysis
All statistical analyses were performed using IBM SPSS Statistics, version 22.0 (IBM Corp., Armonk, New York). For calculation of relative and absolute reliability, the criteria for evaluation of measurements developed by the Prevention of Falls Network Europe [21] and the COSMIN checklist [22] were followed.
The sample size calculations were based on the formula n = 2 × (SD/Δ)² × k [23], where the expected standard deviation (SD) was taken from Horak’s study, with SD = ±9.6% [2]. At the time, no study had established a clinically relevant change for the BESTest; we therefore set the minimal clinically important difference (Δ), which is the difference in score that patients perceive as important, to 7.0%, based on other clinical outcome measures [24]. We used α = 0.05 with a power of 0.80. This gave a minimum sample size of 30 participants [23].
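The arithmetic behind the sample size can be reproduced with a short sketch. The text does not define k; the sketch assumes the common convention k = (z for 1 − α/2 + z for power)², which is about 7.85 for α = 0.05 and a power of 0.80. The function name `sample_size` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size(sd, delta, alpha=0.05, power=0.80):
    """n = 2 * (sd / delta)^2 * k.

    Assumption: k = (z_{1 - alpha/2} + z_{power})^2, a standard choice
    for this formula; the source does not spell k out.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    k = (z_alpha + z_power) ** 2
    return 2 * (sd / delta) ** 2 * k


n = sample_size(sd=9.6, delta=7.0)
print(f"n = {n:.2f}, rounded up to {ceil(n)} participants")
```

Under this assumption the result rounds up to 30, in line with the minimum stated above.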
The Intraclass Correlation Coefficient (ICC) with 95% confidence interval was used as the measure of relative reliability [25, 26]. For interrater reliability, ICC(2,1) and ICC(3,1) were used. ICC(2,1) is based on a two-way random model with absolute agreement; it reflects variability between raters, and the results can be generalized to other raters. ICC(3,1), based on a two-way mixed model with consistency, measures the consistency of each rater's scoring; systematic errors are not counted as measurement error. When ICC(2,1) and ICC(3,1) are identical or show only minor differences, no systematic error (e.g. a learning effect) is present [25].
For test-retest reliability, ICC(1,1), using a one-way random model, and ICC(3,1) were used. With ICC(1,1), all systematic and random intrasubject variability is treated as measurement error. Again, if ICC(1,1) and ICC(3,1) are identical or show only minor differences, no systematic error is present [25].
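To make the difference between the three coefficients concrete, the sketch below computes single-measure ICCs from the mean squares of a subjects-by-columns table, using the standard Shrout and Fleiss formulas. The analyses in this study were run in SPSS; this is only an illustration, it omits the confidence intervals, and the function name `icc_single` and the example scores are hypothetical. Columns stand for raters (interrater reliability) or sessions (test-retest reliability).

```python
def icc_single(scores):
    """Single-measure ICCs for an n-subjects x k-columns table of scores."""
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares: total, between subjects, between columns, residual.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                        # between subjects
    ms_cols = ss_cols / (k - 1)                        # between raters/sessions
    ms_error = ss_error / ((n - 1) * (k - 1))          # residual
    ms_within = (ss_total - ss_rows) / (n * (k - 1))   # within subjects

    return {
        # One-way random: column effects count as error.
        "ICC(1,1)": (ms_rows - ms_within) / (ms_rows + (k - 1) * ms_within),
        # Two-way random, absolute agreement: column variance in the denominator.
        "ICC(2,1)": (ms_rows - ms_error)
        / (ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n),
        # Two-way mixed, consistency: column effects are ignored.
        "ICC(3,1)": (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error),
    }


# Hypothetical percent scores for five participants scored by two raters.
example = [[78.0, 80.0], [65.0, 66.0], [90.0, 93.0], [55.0, 54.0], [71.0, 74.0]]
for name, value in icc_single(example).items():
    print(f"{name}: {value:.3f}")
```

A systematic shift between columns, such as one rater scoring consistently higher, lowers ICC(1,1) and ICC(2,1) but leaves ICC(3,1) unchanged, which is why a gap between the coefficients points to systematic error.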
Measurement error is the systematic and random error of a subject’s score that is not attributed to true changes in the construct to be measured [22]. For assessment of absolute reliability, the Standard Error of Measurement (SEM) and the Smallest Detectable Change (SDC95) were used. SEM represents the standard deviation of repeated measures in one participant (SEM = SD/√2). SDC95 represents the smallest change that a participant must show to ensure that the observed change is real and not just measurement error (SDC95 = SEM × √2 × 1.96) [22]. Since the FES-I demonstrated a skewed distribution, we used the Spearman Rank Correlation to examine concurrent validity between the FES-I and rater A’s total scores on the BESTest and the Mini-BESTest. Correlation coefficients of 0.00–0.25 were interpreted as little to no correlation, 0.25–0.49 as fair, 0.50–0.75 as moderate to good, and above 0.75 as strong correlation [18].
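The SEM and SDC95 formulas can be sketched as follows. The text does not say which SD enters SEM = SD/√2; the sketch assumes it is the standard deviation of the paired differences between the two measurements. The function name `sem_and_sdc95` and the example scores are hypothetical.

```python
from math import sqrt
from statistics import stdev


def sem_and_sdc95(first, second):
    """Return (SEM, SDC95) for paired measurements of the same participants.

    Assumption: SD is the sample standard deviation of the paired
    differences, so SEM = SD_diff / sqrt(2) and SDC95 = SEM * sqrt(2) * 1.96.
    """
    diffs = [b - a for a, b in zip(first, second)]
    sd_diff = stdev(diffs)
    sem = sd_diff / sqrt(2)
    sdc95 = sem * sqrt(2) * 1.96
    return sem, sdc95


# Hypothetical percent scores from two sessions for four participants.
session1 = [70.0, 80.0, 90.0, 60.0]
session2 = [72.0, 79.0, 93.0, 60.0]
sem, sdc95 = sem_and_sdc95(session1, session2)
print(f"SEM = {sem:.2f}, SDC95 = {sdc95:.2f}")
```

Under this assumption SDC95 reduces to 1.96 times the standard deviation of the differences, so an individual change smaller than this value cannot be distinguished from measurement error.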
Floor and ceiling effects were considered present if 15% or more of the participants had the lowest or the highest possible score, respectively, on the BESTest and the Mini-BESTest [27].
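The 15% criterion can be expressed as a small check. This is a sketch only; the function name `floor_ceiling_effects` and the example totals are hypothetical and do not represent study data.

```python
def floor_ceiling_effects(scores, min_score, max_score, threshold=0.15):
    """Flag floor/ceiling effects when the share of participants at the
    lowest/highest possible score is at least `threshold`."""
    n = len(scores)
    floor_share = sum(1 for s in scores if s == min_score) / n
    ceiling_share = sum(1 for s in scores if s == max_score) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share >= threshold,
        "ceiling_effect": ceiling_share >= threshold,
    }


# Hypothetical Mini-BESTest totals (0-32) for ten participants.
totals = [32, 30, 28, 32, 25, 19, 27, 22, 31, 24]
result = floor_ceiling_effects(totals, min_score=0, max_score=32)
print(result)
```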