Absolute and relative reliability of acute effects of aerobic exercise on executive function in seniors

Background Aging is accompanied by a decline of executive function. Aerobic exercise training induces moderate improvements of cognitive domains (i.e., attention, processing, executive function, memory) in seniors. Most conclusive data are obtained from studies with dementia or cognitive impairment. Confident detection of exercise training effects requires adequate between-day reliability and low day-to-day variability obtained from acute studies, respectively. These absolute and relative reliability measures have not yet been examined for a single aerobic training session in seniors.
Methods Twenty-two healthy and physically active seniors (age: 69 ± 3 y, BMI: 24.8 ± 2.2, VO2peak: 32 ± 6 mL/kg/bodyweight) were enrolled in this randomized controlled cross-over study. A repeated between-day comparison [i.e., day 1 (habituation) vs. day 2 & day 2 vs. day 3] of executive function testing (Eriksen-Flanker-Test, Stroop-Color-Test, Digit-Span, Five-Point-Test) before and after aerobic cycling exercise at 70% of the heart rate reserve [0.7 × (HRmax – HRrest)] was conducted. Reliability measures were calculated for pre, post and change scores.
Results Large between-day differences between day 1 and 2 were found for reaction times (Flanker- and Stroop Color testing) and completed figures (Five-Point test) at pre and post testing (0.002 < p < 0.05, 0.16 < $\eta_p^2$ < 0.38). These differences notably declined when comparing day 2 and 3. Absolute between-day variability (CoV) dropped from 10 to 5% when comparing day 2 vs. day 3 instead of day 1 vs. day 2. Also ICC ranges increased from day 1 vs. day 2 (0.65 < ICC < 0.87) to day 2 vs. day 3 (0.40 < ICC < 0.93). Interestingly, reliability measures for pre-post change scores were low (0.02 < ICC < 0.71). These data did not improve when comparing day 2 with day 3. During inhibition tests, reaction times showed excellent reliability values compared to the poor to fair reliability of accuracy.
Conclusion Notable habituation to the whole testing procedure should be considered as it increased the reliability of different executive function tests. Change scores of executive function after acute aerobic exercise cannot be detected reliably. Large intra- and inter-individual variability of responses to acute aerobic exercise in seniors can be presumed.


Background
One-third of the global and nearly half of the Western population will be aged >60 years at the end of the twenty-first century [1]. The process of biological aging, particularly in later adulthood, goes along with deteriorations of physical and cognitive function [2]. Cognitive function is a robust predictor of mortality in older people at population level [3], and executive control seems to predict the functional status during daily life in seniors [4]. Executive functions refer to a family of top-down cognitive processes underlying the organization and control of goal-directed behaviour [5]. According to Miyake et al. [6], inhibitory control (control of attention, behaviour and emotions to override a dominant or prepotent response), working memory (storage, manipulation and retrieval of information) and cognitive flexibility (ability to flexibly shift between mental sets) are considered its core components.
Previous reviews led to the "executive function hypothesis", which suggests that regular physical activity and exercise selectively elicit benefits in this cognitive domain [7,8]. Although benefits of exercise targeting cardiovascular fitness were also reported for attention and long-term memory in older adults [9], some meta-analytical findings show no evidence for improvements of cognitive performance after a period of aerobic training. Nonetheless, improvements in cognitive function following chronic exercise are considered clinically relevant as most findings from observational studies suggest that regular exercise can delay the onset of future dementia [10], possibly due to a promotion of cognitive reserves protecting cognitive function in spite of disease or damage [11]. Even low but regular physical activity levels were found to be positively associated with cognitive function (>100,000 participants across 20 nations) during aging [12].
In contrast to the heterogeneous findings obtained from longitudinal studies, acute bouts of aerobic exercise seem to transiently improve several dimensions of cognitive function, whereby immediate and delayed benefits were pronounced for executive function [13]. A recent meta-analysis revealed that moderate aerobic exercise elicits particularly beneficial effects for inhibitory control, working memory and task-switching in preadolescent children and older adults compared to other age groups [14]. These acute improvements of cognitive function were reported for time-dependent measures and appeared to be independent of the participant's fitness level. Acute exercise-induced changes of cortical, vascular, hemodynamic and metabolic functions [15][16][17] have been discussed as underlying mechanisms for improved cognitive control. Although benefits elicited by a single exercise bout are considered to be transient, such cognitive improvements are still of high practical relevance. One major advantage of acute effects over chronic effects is that they can be elicited quickly, independent of the fitness level [14]. Furthermore, older adults may use a single exercise session as a strategy to prepare for situations demanding high executive control. Whereas both acute and chronic facilitation of executive function by exercise gained notable attention, the variability and reliability of the tests employed for assessing this cognitive domain in older adults as well as the reliability of acute effects of exercise have long been disregarded.
Thus, examining changes of executive function following acute exercise requires adequate detection of meaningful change following these interventions [18]. Therefore, the identification of baseline and/or post-exercise variability of the respective cognitive testing parameters is needed. Otherwise, a reliable justification of detrimental or beneficial effects on cognitive performance is hindered. In this regard, the quantification of day-to-day reliability of a certain variable reflecting executive function needs to be discussed with respect to 1) chance, 2) system-immanent errors and 3) biological variability [19]. A resulting fundamental question is "how reliable is a particular assessment tool and how precisely and reproducibly can acute exercise training effects on executive function be identified". This is of particular importance as a certain amount of error is inherent when testing human beings and, thus, reliability can be considered as the amount of error or variability which is accepted for a particular test [20]. Hence, it is important to know whether an acute change of performance (e.g., executive function) can be attributed to the intervention rather than to its day-to-day variation [21]. Day-to-day variation can be due to the training setting, age group, occurring fatigue, activity level or disease status [18]. In this regard, relative (e.g., intra-class correlation coefficients) and absolute variability estimates (e.g., standard error of measurement or coefficients of variation) need to be identified for baseline and post-exercise executive function in an acute setting of healthy seniors.
Against this background, the present study aimed at investigating day-to-day reliability of a variety of tests on executive function before and after acute aerobic exercise (i.e., Eriksen Flanker Test, Stroop Test, Digit Span, and Five Point Test) in a group of healthy seniors. To this end, various absolute and relative reliability indices were collected within a three-day repeated measures design in an acute exercise setting. We aimed at disentangling whether acute changes of executive function vary in seniors. Besides acute effects of exercise on executive function, this information is needed to estimate the likelihood of detectable change in future training studies on exercise and executive function.

Participants and general design
Twenty-two healthy and physically and recreationally active seniors (Table 1) were recruited via a Senior Club (ProSenectute) and voluntarily participated in the present reliability study. Exclusion criteria were prior stroke, heart attack, heart failure and surgery, bypass, cardiac dysrhythmia, acute flu or cold, spinal, joint and head pain, diabetes mellitus, untreated hypertension (>160/>100), acute and chronic inflammatory condition, severe arthrosis, recurrent vertigo, knee or hip endoprosthesis, and trauma within the last 6 months. None of the included participants reported any of those conditions. We conducted a repeated between-day comparison (i.e., day 1 vs. day 2 and day 2 vs. day 3) within weekly intervals (Fig. 1). None of the included participants reported any cardiac, pulmonary or neurological condition, elevated blood pressure or medication intake based on the physical activity readiness questionnaire (PAR-Q) [22]. Seniors with at least mild cognitive impairment (MCI) based on the Mini-mental state exam scoring between 20 and 25 [23] and clock drawing test [24] were excluded. None of the recruited seniors needed to be excluded due to MCI. The sample size of 20 subjects was based on an at least moderate correlation between pre and post testing during executive function testing, delivering a power (1 − beta error) of 90% at a strict alpha level of 0.01. We recruited more participants due to expected drop-outs that did not occur.
All seniors were requested to refrain from moderate to severe exercise within the last 24 h prior to spiroergometric exercise testing or acute aerobic exercise training. Caffeine intake was not allowed 5 h prior to exercise testing or training. No caffeine withdrawal symptoms were observed. Between-day variability of cognitive function before and after acute aerobic exercise was assessed on three days in weekly intervals (Fig. 1). The study was approved by the local ethical committee of the University of Basel (11/23/2015-254) and meets the criteria of the declaration of Helsinki. After receiving all relevant study information, the participants signed an informed consent to the study including a permission to publish the data. The Freiburg physical activity questionnaire was used to assess baseline physical activity in h/week (Frey et al. 1995). The total amount of weekly physical activity includes baseline physical activity (e.g., daily walked or biked distance, stair climbing), leisure time activity (e.g., hiking, dancing, bowling) and sportive activity (disciplines). The summarized weekly hours were used to describe activity of both groups. Participants' body fat and weight were assessed using the InBody 170 (JP Global Markets GmbH, Germany). Body height was measured with a measuring pole. These are common, valid and good to excellently reliable tools for anthropometric assessment (0.8 < r < 0.95).

Aerobic cycling exercise and exercise intensity determination
Based on health-related exercise recommendations of the American College of Sports Medicine [25], 30 min of aerobic cycling exercise at 70% of the heart rate reserve (HRR) using the "Karvonen formula" (0.7 × (HRmax − HRrest)) was applied in-between cognitive testing [26,27]. In order to calculate HRR, maximal heart rates (HRmax) were obtained from maximal spiroergometric ramp-like exercise testing on a treadmill. Briefly, the protocol started at a velocity of 6 km/h and an inclination of 0.1% for a time period of 1 min. Intensity was increased by 1 km/h every minute until volitional exhaustion was reached. Maximal exertion was considered verified if at least three of the following five criteria were reached: 1) age-predicted maximal heart rate [28], 2) breathing frequency (>35/min), 3) capillary lactate concentration (>8 mmol/L), 4) ventilatory equivalent for oxygen uptake (>35) and 5) respiratory exchange ratio (RER >1.1) [29]. Ventilatory parameters were collected using the Cortex Metalyzer® 3B metabolic test system (Cortex Biophysik GmbH, Leipzig, Germany). VO2peak and HRmax were derived from the final 30 s before exercise cessation.
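The intensity prescription can be illustrated with a short Python sketch. It assumes the common form of the Karvonen approach, in which the chosen fraction of the heart rate reserve is added to the resting heart rate to obtain a target heart rate; the text above only states the reserve term itself. The function names and the example heart rates are hypothetical and are not taken from the study data.

```python
def heart_rate_reserve(hr_max, hr_rest):
    """Heart rate reserve (HRR) in beats per minute."""
    if hr_max <= hr_rest:
        raise ValueError("hr_max must be larger than hr_rest")
    return hr_max - hr_rest


def target_heart_rate(hr_max, hr_rest, intensity=0.7):
    """Target heart rate at a given fraction of the HRR.

    Assumption: the resting heart rate is added back to the
    reserve fraction, as in the commonly cited Karvonen approach.
    """
    if not 0 < intensity <= 1:
        raise ValueError("intensity must be in the interval (0, 1]")
    return hr_rest + intensity * heart_rate_reserve(hr_max, hr_rest)


# Hypothetical participant: HRmax 160 bpm, HRrest 60 bpm.
reserve = heart_rate_reserve(160, 60)        # 100 bpm
fraction = 0.7 * reserve                     # 70% of the reserve
target = target_heart_rate(160, 60, 0.7)     # reserve fraction plus HRrest
print(reserve, round(fraction), round(target))
```

For this hypothetical participant, 70% of the reserve corresponds to 70 beats per minute above rest, i.e. a target of about 130 bpm.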

Cognitive testing
Different aspects of executive function were assessed using the Five-Point-Test [30] and computer-based modified versions of the Eriksen-Flanker task (Eriksen & Eriksen, 1974), Stroop Color-Word [31] as well as Digit-Span (forward & backward) in a counterbalanced order. All computer-based tests were administered by the same rater with Presentation 18.0 (NeuroBehavioral Systems, USA). No breaks between testing were allowed. Cognitive assessments were performed in a separate room with one participant at a time. Prior to testing, instructions were provided verbally in a standardized manner. Afterwards, instructions were also presented on the screen to make sure the participants understood the task. Following the instruction, noise was kept to a minimum. Environmental temperature was held constant at 21°C during cognitive testing.

Flanker task
The modified Flanker task was used to assess the inhibitory component of executive control [32]. During the task, participants are required to respond to a centrally presented target stimulus (vertical visual angle: 8.5°) by pressing a button corresponding to its direction. In congruent trials, the target stimulus was surrounded by six arrows facing the same direction, whereas in incongruent trials the centrally presented target stimulus was facing in the opposite direction of the flanking arrows. Participants completed one practice block with 10 trials and one test block with 100 trials. In each trial, arrows were presented focally for 200 ms after a fixation period of 1000 ms. The response window was set to 1000 ms and participants received feedback on their response. Congruent and incongruent trials were presented in random order and with equal probability. Reaction time and accuracy for congruent and incongruent trials were calculated to assess speed processing and interference control.
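As a rough illustration of how the two dependent measures could be derived from raw trial data, the following Python sketch summarizes accuracy and mean reaction time per trial type. The trial record layout (`condition`, `rt_ms`, `correct`) is hypothetical, and restricting the mean reaction time to correct responses within the response window is an assumption; the text does not specify how reaction times were aggregated.

```python
from statistics import mean


def summarize_flanker(trials, window_ms=1000):
    """Accuracy (%) and mean reaction time (ms) per trial type.

    Each trial is a dict with the hypothetical keys 'condition',
    'rt_ms' (None if no response was given) and 'correct'.
    Assumption: mean RT uses only correct responses given
    within the response window.
    """
    summary = {}
    for condition in ("congruent", "incongruent"):
        subset = [t for t in trials if t["condition"] == condition]
        if not subset:
            continue  # no trials of this type were presented
        valid = [
            t["rt_ms"]
            for t in subset
            if t["correct"] and t["rt_ms"] is not None and t["rt_ms"] <= window_ms
        ]
        summary[condition] = {
            "accuracy_pct": 100 * len(valid) / len(subset),
            "mean_rt_ms": mean(valid) if valid else None,
        }
    return summary


# Hypothetical mini data set with four trials.
example_trials = [
    {"condition": "congruent", "rt_ms": 400, "correct": True},
    {"condition": "congruent", "rt_ms": 500, "correct": False},
    {"condition": "incongruent", "rt_ms": 600, "correct": True},
    {"condition": "incongruent", "rt_ms": None, "correct": False},
]
print(summarize_flanker(example_trials))
```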

Stroop color test (SCT)
The Stroop Color-Word test is a standard test to assess inhibitory control [32,33]. The stimulus used in this task is a color name presented in the centre of the screen (vertical visual angle: 8.5°). It is either printed in ink matching the color name (compatible trials) or in a different color of ink (incompatible trials). Participants are instructed to press a button corresponding to the ink in the color block or the name of the color in the word block. The colors/words chosen for this task were "rot" (red), "grün" (green), "gelb" (yellow) and "blau" (blue). Participants completed a practice block with 12 trials as well as one color and one word block with 96 trials each. The order of test blocks was counter-balanced across participants and both types of trials (compatible, incompatible) appeared with equal probability. Each trial started with a 500 ms fixation period, followed by the presentation of a stimulus over 200 ms. Responses were collected within a 2500 ms window and participants received feedback on their accuracy. As dependent measures of information processing and interference control reaction time and accuracy were calculated for compatible and incompatible trials, respectively.

Digit span forward and backward (DSF/DSB)
Digit Span Forward and backward is used to assess working memory and updating of working memory, respectively [32]. In this task, participants were required to repeat a sequence of digits (1-9 presented with equal probability) on a computer keyboard in the same (forward) and in reversed order (backward). Digits did not occur in regular ascending or descending sequences with equal consecutive step sizes. In all trials, each digit was presented for 500 ms with an inter-stimulus interval of 500 ms. Although participants were instructed to provide a timely response, the time window was not limited. The span length was increased by one digit every two trials (starting from 3 in digit span forward and 2 in backward), until the limit of two successive errors was reached. Measures obtained from the task were lengths of the longest span answered correctly forward and backward as well as the number of cumulative errors.
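The progression and stopping rule described above can be sketched as a small scoring function in Python. This is a minimal illustration under stated assumptions: the function name is hypothetical, trial outcomes are passed as a list of booleans in presentation order, and "two successive errors" is interpreted as two consecutive incorrect trials regardless of span length.

```python
def score_digit_span(outcomes, start_length):
    """Return (longest correct span, cumulative errors).

    outcomes: booleans in presentation order (True = correct).
    start_length: 3 for the forward and 2 for the backward version.
    Assumption: testing stops after two consecutive incorrect trials.
    """
    longest = 0
    errors = 0
    consecutive_errors = 0
    for index, correct in enumerate(outcomes):
        # The span grows by one digit every two trials.
        length = start_length + index // 2
        if correct:
            longest = max(longest, length)
            consecutive_errors = 0
        else:
            errors += 1
            consecutive_errors += 1
            if consecutive_errors == 2:
                break  # stopping rule reached, later trials are ignored
    return longest, errors


# Hypothetical forward run: errors on trials 4, 7 and 8.
run = [True, True, True, False, True, True, False, False, True]
print(score_digit_span(run, start_length=3))
```

In this hypothetical run the longest correctly repeated span is 5 digits with 3 cumulative errors; the ninth trial is not scored because the stopping rule was reached on the eighth.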

Five-point test
The Five-Point test is used for the assessment of figural fluency functions, which relate to the set-shifting component of executive control [33]. For this task, participants received a sheet of paper, on which 40 five-dot matrices were printed. Each matrix was identical and consisted of a fixed pattern of five symmetrically arranged dots. Participants were required to draw as many designs as possible in 2 min by connecting the dots with at least one straight line. After the investigator demonstrated two possible designs, participants were asked to perform the task. The Five-Point test was scored by counting the total number of unique designs and repetitions of designs (perseverative errors).

Statistics
All outcome measures were checked for normal distribution (Kolmogorov-Smirnov test) and variance homogeneity (Levene test). Data are given as means with standard deviations (SD) and 90% confidence intervals (90% CI), respectively. Repeated measures analyses of variance were applied separately for each outcome measure between the two subsequent trials at the beginning and at the end of the testing procedure. An α-level of p < 0.05 was accepted as statistically significant. Effect sizes for variance analyses were given as partial eta squared ($\eta_p^2$) with values ≥0.01, ≥0.06, ≥0.14 indicating small, moderate, or large effects, respectively. Intraclass correlation coefficients (ICC) as a measure of relative reliability were computed according to the formula $\mathrm{ICC} = 1 - \mathrm{SEM}^2/\mathrm{SD}^2$, with SD serving as the between-subject standard deviation [21].
We calculated the standard error of measurement (SEM, computed as the SD of the difference divided by the square root of 2) as well as the log-transformed coefficient of variation (CoV) together with 90% confidence limits as measures of absolute reliability [18,19]. Reliability data were analysed using a published Microsoft Excel® spreadsheet [34] (Hopkins 2007).
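The reliability indices described in this section can be illustrated with a minimal Python sketch. It is not a reimplementation of the spreadsheet used in the study. Two points are assumptions: the between-subject SD is pooled across both days, and the CoV is obtained by computing the SEM on natural-log-transformed scores and back-transforming it to percent. Confidence limits are omitted, and the example data are hypothetical.

```python
import math
from statistics import stdev


def reliability_indices(day_a, day_b):
    """SEM, ICC and CoV (%) for paired scores of the same participants."""
    if len(day_a) != len(day_b) or len(day_a) < 2:
        raise ValueError("need paired scores from at least two participants")

    # SEM: SD of the between-day differences divided by sqrt(2).
    diffs = [b - a for a, b in zip(day_a, day_b)]
    sem = stdev(diffs) / math.sqrt(2)

    # Assumption: between-subject SD pooled across the two days.
    pooled_sd = math.sqrt((stdev(day_a) ** 2 + stdev(day_b) ** 2) / 2)
    icc = 1 - sem ** 2 / pooled_sd ** 2

    # Assumption: CoV from log-transformed scores, back-transformed to %.
    log_diffs = [math.log(b) - math.log(a) for a, b in zip(day_a, day_b)]
    cov_pct = 100 * (math.exp(stdev(log_diffs) / math.sqrt(2)) - 1)

    return {"sem": sem, "icc": icc, "cov_pct": cov_pct}


# Hypothetical reaction times (ms) of five participants on two days.
day_2 = [500, 520, 480, 610, 550]
day_3 = [510, 515, 470, 600, 560]
print(reliability_indices(day_2, day_3))
```

With these hypothetical values the SEM is about 7.2 ms and the ICC about 0.98, which shows how a small within-subject error relative to a large between-subject spread yields a high ICC.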

Results
Day-to-day variability between the 1st and 2nd day at pre and post testing
At pre testing, meaningful differences were observed between the 1st and 2nd day of testing for the Eriksen-Flanker (compatible: p = 0.05, $\eta_p^2$ = 0.12) and Five-Point test (correctly completed figures: p = 0.05, $\eta_p^2$ = 0.20). The Stroop Color-Word test revealed relevant differences between the 1st and 2nd day for compatible and incompatible reaction time at both pre- and post-testing (pre: 0.002 < p < 0.04, 0.16 < $\eta_p^2$ < 0.38) (Table 2, columns 4 and 5). Coefficients of variation below 10% were found for reaction times and accuracy on the Eriksen-Flanker (0.5% < CoV < 5.5%) and Stroop Color test (0.8% < CoV < 8.7%).

Day-to-day variability for the change score between day 1 and 2 and day 2 and 3
Change scores between pre and post testing for day 1 vs. day 2 as well as day 2 vs. day 3 showed insufficient relative and absolute reliability values (Table 3).

Discussion
The present study assessed absolute (e.g., CoV) and relative (e.g., ICC) between-day variability of executive function (i.e., Eriksen-Flanker test, Stroop Color test, Digit Span and Five-Point test) before and after a single bout of moderately intense aerobic cycling exercise (30 min at 70% of the heart rate reserve). The study was conducted in healthy and active seniors using a three-day (habituation day, first day, second day) repeated measures design. First, we found notable between-day habituation (from 1st to 2nd day) mainly for some time-dependent measures obtained from the Eriksen-Flanker and Stroop Color test as well as for completed figures in the Five-Point test at both pre and post testing. This is not surprising from a general viewpoint of between-day learning. However, our testing and training sessions were separated by 7 days. Thus, habituation effects should be accounted for even with longer between-trial breaks of up to 7 days. As these habituation effects became smaller from day 2 to day 3, relative between-day reliability (i.e., ICC) of time-dependent measures of inhibitory control (assessed with Eriksen-Flanker test, Stroop Color-Word test) and the number of completed figures in the Five-Point test further increased from acceptable to excellent reliability. For the Stroop Color-Word and Flanker task, habituation effects have been shown previously using a short [35] or longer retest interval [36]. The present results therefore indicate that habituation to the testing procedures should be considered in older adults when studies aim to examine changes of inhibitory control over time. As the reliability of time-dependent measures of inhibitory control was excellent after a practice day at pre and post level, the temporal stability of these outcomes suggests that this subcomponent of executive functioning reflects stable individual differences.
Despite comparatively high accuracy (>90%) on the Flanker and Stroop Color-Word tests, the percentage of correct responses showed remarkably lower absolute and relative reliability values than time-dependent measures in both testing scenarios. This could be due to the fact that older adults achieved very high accuracy rates, so that a ceiling effect in performance resulted in less discriminative power and variance between participants. Furthermore, a similar accuracy rate on trials assessing information processing and trials assessing inhibitory control indicates that the number of correct responses does not discriminate well between different cognitive functions and should not be used as the main outcome for the Flanker and Stroop Color-Word test. This is considered to be true if the high accuracy rates are not solely due to participants completing the tasks with a prevention focus. This strategic inclination promotes an increase of correct responses, whereas a promotion focus reduces reaction time [37]. However, all participants were instructed to perform the task as quickly and accurately as possible, so that the instructor did not systematically change the strategic inclination of the participants and a simple trade-off between reaction time and accuracy seems less likely. Concerning working memory, the Digit-Span testing revealed poor to fair indices for both relative (ICC) and absolute (CoV) reliability, with CoV around 20% for completed stages and 40% for cumulative errors. These findings hold true for pre and post exercise testing values. For the Digit Span backward, Waters and Caplan [38] have also reported a test-retest reliability that is lower than desirable in older adults. However, habituation to cognitive testing increased the relative reliability of the number of completed stages.
Consequently, this is the only measure of the Digit Span showing a temporal stability that justifies its use in the detection of acute or chronic effects of exercise on different aspects of executive function. However, a lower reliability of working memory measures compared to outcomes obtained from the inhibition tasks might not be test-specific. The number of trials in the Digit Span was much lower than in the Stroop Color-Word and Flanker task. It is very likely that a higher number of trials would have decreased the variability between test and retest. Therefore, increasing the number of runs on the backward and forward version of the Digit Span in addition to a habituation to the testing procedures might be most promising for the improvement of test-retest variability.
In contrast to other studies assessing the reliability of different executive function tasks, test-retest variability was measured before and after a moderate aerobic exercise bout. Although widely and solely reported [39], these "relative reliability" [40] data need to be handled with caution. Intra-class correlation coefficients are highly sensitive to inter-individual variability (heterogeneity) and their magnitude can be difficult to interpret [41]. Absolute reliability data enable the comparison with other tests. Performance tests in athletes mostly require CoV levels below 5% [41] and recreational settings mostly require CoV values around or below the 10% level [18]. Higher CoV values increasingly impede a reliable detection of "real" change due to the respective intervention. However, heterogeneous populations (e.g., seniors with mental decline, chronic disease) and settings (e.g., lab, home-based) might entail meaningful baseline and exercise-induced variability of cognitive outcomes. Thus, meaningful interventional change on individual level could be "masked" by variability of the measuring and biological "system", respectively. Variability values within the 10% level were found for time-dependent variables in our group of healthy seniors. Only a minority of testing instruments (e.g., Digit-Span testing) revealed unacceptable absolute variability in seniors. Low CoV values of speed and accuracy measures are needed to increase the likelihood of detecting true intervention-related changes rather than chance variations. From a scientific point of view, acute intervention studies commonly evaluate mean changes from pre- to post-testing on a group or population level with a notable and inherent amount of noise. As a consequence, reported reliability data should be used to estimate the required sample size to detect meaningful intervention effects.
The present study has some limitations that need to be addressed. First, we included active and healthy seniors only. It is conceivable that participants with MCI show larger absolute variability with lower values for correct responses (ceiling effects) compared to their healthy counterparts. However, seniors per se show large inter-individual differences in cognitive functions due to different morphological and functional aspects of brain aging (e.g., localization of lesions). Thus, our results cannot be transferred to older and frail seniors with mild to severe cognitive impairment. Moreover, day-to-day variability in post-exercise assessments might have been influenced by the participants' dynamic capacity for adjusting cognitive processing to external demands (i.e., cognitive reserve). Second, test-retest reliability was assessed for the acute effects of exercise on executive function, so that it remains unclear whether similar ICCs can be expected in a longitudinal design. However, the reproducibility of post-exercise effects indicates that studies investigating chronic effects should control for any exercise bouts performed prior to the assessment of executive function. Third, test-retest reliability was assessed before and after a single aerobic exercise session. Consequently, the present findings do not permit any conclusions about the durability of the effects elicited by acute exercise. A review of the current literature suggests that acute benefits on executive function are maintained for at least 20 to 60 min after exercise cessation [14].

Conclusion
Mainly time-dependent variables (e.g., reaction time of the Eriksen-Flanker and Stroop Color test, correctly completed figures of the Five-Point test) of executive function showed notable differences at baseline and after moderate aerobic exercise between day 1 and 2. This difference decreased when comparing day 2 and 3. Thus, a notable habituation effect to the whole experimental setup can be assumed and should be considered in future acute exercise studies on executive function. As a consequence, acute effects of moderate exercise intensity should not be overrated as the change scores are poorly reliable. Also, absolute (CoV dropped from around 10% to 5% for reaction time) and relative reliability indices (ICC values from poor/fair to good/excellent) improved when comparing between-day reliability for day 2 vs. 3 with days 1 vs. 2. Correct responses and cumulative errors as accuracy indicators showed high percentage values indicating a ceiling effect in this population. However, reliability of accuracy turned out to be poor. Thus, highly accurate responses with little variation before and after exercise can nevertheless result in poor reliability outcomes. Overall, Digit-Span testing revealed absolute variability between 20 and 40%. This testing instrument might impede sufficient detection of exercise-induced effects on executive function. Future research on baseline variability and exercise-induced effects on reliability including different types of exercise (e.g., strength, endurance, balance) in frailer and diseased seniors is needed in order to elucidate the specific effect of exercise on executive function in the elderly population.