Reliability and validity of the Patient Benefit Assessment Scale for Hospitalised Older Patients (P-BAS HOP)

Background: The Patient Benefit Assessment Scale for Hospitalised Older Patients (P-BAS HOP) is a tool capable of both identifying the priorities of the individual patient and measuring the outcomes relevant to him/her, resulting in a Patient Benefit Index (PBI, range 0–3) indicating how much benefit the patient experienced from the admission. The aim of this study was to evaluate the reliability, validity, responsiveness and interpretability of the P-BAS HOP.

Methods: A longitudinal study among hospitalised older patients, with a baseline interview during hospitalisation and a follow-up by telephone 3 months after discharge. Test-retest reliability of the baseline and follow-up questionnaires was assessed. Percentage of agreement, Cohen's kappa with quadratic weighting and maximum attainable kappa were calculated per item. The PBI was calculated for both test and retest of baseline and follow-up and compared with the Intraclass Correlation Coefficient (ICC). Construct validity was tested by evaluating pre-defined hypotheses comparing the priority of goals with experienced symptoms or limitations at admission, and the achievement of goals with progression or deterioration of other constructs. Responsiveness was evaluated by correlating the PBI with the anchor question 'How much did you benefit from the admission?'. This question was also used to evaluate the interpretability of the PBI with the visual anchor-based minimal important change distribution method.

Results: Reliability was tested with 53 participants at baseline and 72 at follow-up. The mean weighted kappa of the baseline items was 0.38; the ICC between the PBI of test and retest was 0.77. The mean weighted kappa of the follow-up items was 0.51; the ICC between the PBI of test and retest was 0.62. For the construct validity, tested in 451 participants, all baseline hypotheses were confirmed. Of the follow-up hypotheses, tested in 344 participants, five of seven were confirmed. The Spearman's correlation coefficient between the PBI and the anchor question was 0.51. The optimal cut-off point on the PBI was 0.7 for 'no important benefit' and 1.4 points for 'important benefit'.

Conclusions: Although the concept seems promising, the reliability and validity of the P-BAS HOP appeared to be not yet satisfactory. We therefore recommend adapting the P-BAS HOP.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12877-021-02079-z.


Background
Healthcare interventions are often evaluated in terms of survival or disease-specific measures, while many older people prioritise more personal goals, such as functional status, social functioning and relief of symptoms, that the individual him/herself considers important [1,2]. Furthermore, which outcomes are considered important differs per individual [1,3]. When care is to be systematically evaluated by personal goal-oriented outcomes, a tool is needed which is capable of both identifying the priorities of the individual patient and measuring the outcomes relevant to him/her. We therefore developed the Patient Benefit Assessment Scale for Hospitalised Older Patients (P-BAS HOP) [4].
The P-BAS HOP is an interview-based tool consisting of two parts: 1) a baseline questionnaire to select and assess the importance of various predefined goals, based on subjects derived from qualitative interviews with hospitalised older patients, and 2) an evaluation questionnaire to evaluate the extent to which the hospital admission helped to achieve these individual goals. Based on these data it is possible to compute an individual Patient Benefit Index. The comprehensibility, feasibility and a first indication of content validity were already tested in a pilot test and field tests [4]. The aim of the present study is to evaluate the reliability, validity, responsiveness and interpretability of the P-BAS HOP.

Design and population
This longitudinal study was performed among hospitalised older patients. The first face-to-face interview took place within the first 4 days of hospitalisation. The follow-up interview was performed 3 months after discharge by telephone.
Eligible participants were aged 70 years or older; had either a planned or unplanned hospital admission to a medical or surgical ward of a university teaching hospital in the Netherlands; had an expected hospital stay of at least 48 h; were able to speak and understand Dutch; and were without cognitive impairment. Inclusion criteria were verified with the staff nurse. Patients were approached by a trained research assistant and gave signed informed consent.

Questionnaire: P-BAS HOP
The P-BAS HOP is an interview-based questionnaire. The baseline questionnaire consists of two parts: in the first part, the interviewer lists subjects and the participant indicates whether he/she experiences or expects limitations regarding each subject. In the second part, the participant is asked, for each subject identified in the first part, whether it is a goal for the current hospitalisation and, if so, how important the goal is. The answer options are: does not apply to me; not at all important; somewhat important; quite important; very important.
At follow-up, the participant is asked per selected goal to what extent the hospitalisation helped to achieve that goal. The answer options are: not at all; somewhat; quite; completely.
With the scores of the baseline and follow-up questionnaires, a Patient Benefit Index (PBI) can be calculated as the mean of the benefits, weighted by the importance of the goals. With k goal-items G_i (range 0–3, corresponding to the answer options for importance) and benefit-items B_i (range 0–3, corresponding to the answer options for achievement of goals):

PBI = \frac{\sum_{i=1}^{k} G_i B_i}{\sum_{i=1}^{k} G_i}

so that the PBI also ranges from 0 to 3.
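As an illustration of this weighted mean, a minimal Python sketch (the function name and the handling of goals rated zero are our own assumptions, not part of the P-BAS HOP specification):

```python
def patient_benefit_index(importance, benefit):
    """Importance-weighted mean of benefit scores.

    importance: G_i scores (0-3); benefit: B_i scores (0-3).
    Goals scored 0 ('does not apply'/'not at all important')
    carry no weight, mirroring how they are valued as zero.
    """
    pairs = [(g, b) for g, b in zip(importance, benefit) if g > 0]
    total_weight = sum(g for g, _ in pairs)
    if total_weight == 0:
        return None  # no goals selected: PBI undefined
    return sum(g * b for g, b in pairs) / total_weight
```

For example, a participant with one 'very important' goal (G = 3) achieved 'quite' (B = 2) and one 'somewhat important' goal (G = 1) achieved 'not at all' (B = 0) would obtain a PBI of 6/4 = 1.5.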

Other questionnaires and constructs
The questionnaires and constructs used for the construct validity are summarised in Table 1. Full details are given in Additional file 1.

Reliability
Test-retest reliability of the baseline questionnaire was assessed with an interval of 1 to 3 days, while the participant was still hospitalised. The participant was not notified of the retest in advance, but was asked for permission for another test on a later day. Only the P-BAS HOP was then repeated. For a better understanding of the differences between test and retest, a short qualitative evaluation was done: seven participants were asked, after the retest, to explain what caused the discrepancies per item between test and retest.
Test-retest of the follow-up questionnaire was performed in a different sample from the baseline test-retest, with an interval of 7 to 14 days. At the end of the first follow-up interview, the participant was asked for permission to be called back a week later to repeat some questions, without specifying which questions. Only the P-BAS HOP was repeated.
Percentage of agreement, Cohen's kappa with quadratic weighting and maximum attainable kappa [11,12] were calculated per item for the agreement on the importance of the goals at baseline, and on the extent to which the hospitalisation helped to achieve the set goals at follow-up. The answer options 'doesn't apply to me' and 'not at all important' were both valued as zero. For all kappa calculations an online calculator was used [13]. For the interpretation of the kappa values, the classification of Landis and Koch [14] was used.
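For readers who want to reproduce the item-level statistics, quadratic-weighted kappa can be sketched in plain Python (an illustrative reimplementation, not the online calculator used in the study [13]; the function name is an assumption):

```python
from itertools import product

def quadratic_weighted_kappa(rater1, rater2, n_categories):
    """Cohen's kappa with quadratic (agreement) weights.

    rater1, rater2: paired ratings coded 0..n_categories-1.
    """
    n = len(rater1)
    # observed proportions in the crosstabulation
    obs = [[0.0] * n_categories for _ in range(n_categories)]
    for a, b in zip(rater1, rater2):
        obs[a][b] += 1.0 / n
    row = [sum(obs[i]) for i in range(n_categories)]
    col = [sum(obs[i][j] for i in range(n_categories)) for j in range(n_categories)]

    def w(i, j):
        # quadratic agreement weight: 1 on the diagonal, 0 at maximal disagreement
        return 1.0 - (i - j) ** 2 / (n_categories - 1) ** 2

    cells = list(product(range(n_categories), repeat=2))
    p_o = sum(w(i, j) * obs[i][j] for i, j in cells)          # observed agreement
    p_e = sum(w(i, j) * row[i] * col[j] for i, j in cells)    # chance agreement
    return (p_o - p_e) / (1.0 - p_e)
```

Quadratic weighting penalises large disagreements (e.g. 'not at all important' versus 'very important') far more than adjacent-category disagreements, which suits the ordinal answer options of the P-BAS HOP.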
The PBI was calculated for both test and retest of baseline and follow-up and compared with Intraclass Correlation Coefficient (ICC).
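The study does not state which ICC form was used; a common choice for test-retest agreement is the two-way, single-measure, absolute-agreement ICC(A,1), sketched here in plain Python for two measurement occasions (the ICC form and the function name are assumptions):

```python
def icc_a1(test, retest):
    """Two-way, single-measure, absolute-agreement ICC for two occasions."""
    k, n = 2, len(test)
    data = list(zip(test, retest))
    grand = sum(test + retest) / (n * k)
    subj = [sum(row) / k for row in data]           # subject means
    occ = [sum(test) / n, sum(retest) / n]          # occasion means
    msr = k * sum((s - grand) ** 2 for s in subj) / (n - 1)    # between subjects
    msc = n * sum((o - grand) ** 2 for o in occ) / (k - 1)     # between occasions
    sse = sum((x - subj[i] - occ[j] + grand) ** 2
              for i, row in enumerate(data) for j, x in enumerate(row))
    mse = sse / ((n - 1) * (k - 1))                 # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Identical PBI scores at test and retest yield an ICC of 1; any disagreement between the occasions pulls the value below 1.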

Baseline questionnaire
The hypotheses we developed to test the construct validity of the baseline questionnaire are listed in Table 2.
Hypotheses 1 to 5 were evaluated using Cramér's V statistic. Hypotheses 6 and 7 were evaluated with the Spearman's rank-order correlation. Since experiencing a symptom or limitation regarding a certain subject does not necessarily mean that this goal is a priority for the hospital admission, the hypotheses were confirmed if the correlation exceeded 'small' as defined by Cohen [15], meaning the correlation > 0.10. The answer options 'does not apply to me now' and 'not at all important' were coded as 0; the options 'somewhat important', 'quite important' and 'very important' were coded as 1, 2 and 3 respectively. Only when the assumptions of Cramér's V statistic were not met because of too low (expected) cell frequencies were categories combined.
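Cramér's V follows directly from the chi-square statistic of the contingency table of goal priority versus experienced symptom or limitation; a minimal Python sketch (the function name is an assumption):

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(rows), len(cols))  # smaller table dimension
    return math.sqrt(chi2 / (n * (k - 1)))
```

V ranges from 0 (independence) to 1 (perfect association); here any V > 0.10 counted as confirming a hypothesis.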
For hypothesis 8, a random selection of 50 cases was made and the goals mentioned in the open question were coded using the item names of the P-BAS HOP. When a participant mentioned a goal that was not in the P-BAS HOP, it was coded as 'other'. The coding was done independently by two researchers; the codes were then compared and discrepancies were resolved by consensus. Subsequently, the percentage of agreement between the labels and the answers given in the P-BAS HOP was calculated. The baseline questionnaire was considered valid if a minimum of 75%, thus six, of the first seven hypotheses were confirmed and hypothesis 8 was confirmed in a minimum of 75% of the selected cases [16].

Table 1 Constructs measured for the construct validity

Construct: Operationalisation
- Appetite: Dutch VMS screening program (VMS) [5]
- Symptoms experienced on admission day: Rotterdam Symptom Checklist (RSCL) [6]
- Pain, experienced at moment of interview: Numeric rating scale (NRS) pain (0: no pain at all, to 10: the worst imaginable pain)
- Fatigue, experienced at moment of interview: NRS fatigue (0: no fatigue at all, to 10: the worst imaginable fatigue)
- Health-related quality of life, 2 weeks before admission / at moment of follow-up interview: EQ-5D [7]

Follow-up questionnaire
The extent to which the hospitalisation helped to achieve the set goals was compared with the progression or deterioration of items between baseline and follow-up on other known questionnaires. The formulated hypotheses are listed in Table 3. Hypotheses 1 to 9 were evaluated using Cramér's V statistic. Hypotheses 10 to 12 were evaluated with the Spearman's rank-order correlation. Since experiencing a progression or deterioration in a certain subject does not necessarily mean that this is due to the hospital admission, the hypotheses were confirmed if the correlation exceeded 'small' as defined by Cohen [15], meaning the correlation > 0.10.
For hypothesis 13 the same records were used as for hypothesis 8 on baseline. For the dyads with agreement between the code for the open question and the P-BAS HOP item, the Spearman's rank-order correlation between the answer on the open question and the corresponding P-BAS HOP item was calculated. The hypothesis was confirmed if the correlation > 0.50.
The follow-up questionnaire was considered valid if a minimum of 75%, thus nine, of the first 12 hypotheses were confirmed [16] and hypothesis 13 was confirmed.

Responsiveness
The following anchor question was used to validate the PBI: 'How much have you benefited from the admission?', with the answer options: not at all, a little bit, somewhat, much, very much. The PBI is considered valid when it has a Spearman's correlation coefficient ≥ 0.50 with the anchor question [17,18].

Table 3 Hypotheses for the follow-up questionnaire (excerpt)

5. Participants who indicated a deterioration on the EQ-5D item pain/discomfort are expected to have a lower score on the item 'pain'. (Cramér's V > 0.10; n = 102; V = 0.14; rejected)
6. Participants who indicated a lack of appetite on the VMS are expected to have a lower score on the item 'appetite'. (Cramér's V > 0.10; n = 45; V = 0.46; confirmed)
7. Participants who indicated a deterioration on the MSPP item organised sports and/or the MSPP item 'done something with others that required considerable physical effort' are expected to have a lower score on the item 'sports'.
8. Participants who indicated a deterioration on the MSPP item seeing family/acquaintances or on the SF-36 social functioning are expected to have a lower score on the item 'visiting family or friends'.
9. Participants who moved from independent living to sheltered living or a nursing home are expected to score lower on the item 'return back to my home'.
10. Participants with an increasing difference score between baseline and follow-up on the EQ-5D thermometer 'general health' are expected to have a higher score on the item 'feeling better'. (Spearman's correlation > 0.10; n = 241; r = 0.14; confirmed)
11. Participants with an increasing difference score between baseline and follow-up on the sum score 'MSPP daytrip' are expected to have a higher score on the item 'go on outings'.

Interpretability
The interpretability was evaluated with the visual anchor-based minimal important change (MIC) distribution method [11,18]. Participants who indicated 'not at all' or 'a little bit' were considered as having no important benefit. Participants who indicated 'much' or 'very much' were considered as having important benefit. As it was not clear whether 'somewhat' should be considered an important benefit or not, we labelled this as 'borderline'. The receiver operating characteristic (ROC) curve was used to determine the optimal cut-off points for important and no important benefit.
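The cut-off determination can be illustrated with a small sweep over candidate PBI cut-offs. This sketch picks the point maximising the Youden index (sensitivity + specificity - 1), which balances sensitivity against specificity; the study does not specify its exact optimality criterion, and all names here are assumptions:

```python
def optimal_cutoff(pbi_benefit, pbi_no_benefit):
    """Return the PBI cut-off that best separates the two anchor groups.

    pbi_benefit: PBI scores of participants with important benefit;
    pbi_no_benefit: PBI scores of participants without important benefit.
    Candidate cut-offs are midpoints between adjacent observed scores.
    """
    scores = sorted(set(pbi_benefit) | set(pbi_no_benefit))
    candidates = [(a + b) / 2 for a, b in zip(scores, scores[1:])]
    best, best_youden = None, -1.0
    for c in candidates:
        sens = sum(s >= c for s in pbi_benefit) / len(pbi_benefit)
        spec = sum(s < c for s in pbi_no_benefit) / len(pbi_no_benefit)
        youden = sens + spec - 1.0  # Youden index for this cut-off
        if youden > best_youden:
            best, best_youden = c, youden
    return best
```

Each candidate cut-off corresponds to one point on the ROC curve; the returned value is the cut-off at the point furthest above the diagonal.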

Missing values
When the P-BAS HOP was not administered, the case was completely deleted. For all other missing values, we used pairwise deletion. The computation of the PBI was based on non-missing items.

Sample
From the 2798 eligible patients, 1130 were approached for informed consent and 472 gave informed consent. After exclusion of 21 cases, we had 451 baseline cases. We lost 98 cases to follow-up, and in an additional nine cases the P-BAS HOP was not administered at follow-up, which resulted in 344 follow-up cases. Full details are shown in Fig. 1. Most (43%) baseline interviews were done on the third day of admission.
Sample characteristics are shown in Table 4 and Additional File 2 shows the scores of the other questionnaires measured for the construct validity. Table 5 shows the baseline and follow-up descriptive statistics of the P-BAS HOP. The number of goals selected as at least 'somewhat important' varied from zero to 17 per person, with a median of five. Eleven persons selected no goals from the P-BAS HOP. Nineteen participants mentioned an extra goal. Examples of an extra goal were: resuming work; giving informal care to a relative or partner; being able to swallow. The missing values at baseline are mostly due to the interviewer accidentally omitting a question; five times it was because the participant did not know the answer.

Descriptive statistics P-BAS HOP
At follow-up, participants sometimes mentioned that the goal was not applicable for them. This ranged from 1.6 to 34.0% per goal, except for the extra goal. Missing values are in two cases due to the participant stopping answering questions halfway through the P-BAS HOP.
The item 'alive' had the highest number of missing values, mostly (eight times) because the participant did not know the answer. The item 'disease under control' had the second highest number of missing values. Regarding this question, some participants mentioned they did not know how their situation was at that moment, because they were still under treatment or waiting test results.
The PBI ranged from 0 to 3 points, with a mean of 1.71 and a standard deviation of 0.93.

Baseline questionnaire
For the test-retest reliability, 60 participants were approached. Seven participants refused the retest, resulting in 53 participants performing the baseline test-retest. Median time between test and retest was 1 day. In 33 cases the retest was performed by a different interviewer and in 20 cases by the same interviewer. We therefore decided also to distinguish between intra- and inter-rater reliability.
Of the 21 specified goals from which participants could select, the number of discrepancies between test and retest per participant ranged from zero to a maximum of 11 (52% of the number of goals), with a median of four goals (19%). In the cases with the same interviewer, the number of discrepancies per participant ranged from zero to seven (33%), with a median of three goals (14%). The cases with different interviewers had one (5%) to 11 (52%) discrepancies per participant, with a median of five goals (24%). Of the total of 228 discrepancies, in 100 (44%) cases the goal was selected only during the test and in 128 (56%) cases only during the retest. These proportions were the same for the intra- and inter-rater reliability.
The complete crosstabulations of all items are included in Additional File 3. Table 6 shows the weighted kappa per item in descending order. The weighted kappa for the item 'home' could not be calculated because of too many empty cells. Two items had substantial agreement, eight moderate agreement, seven fair agreement and three slight agreement.
When the weighted kappa was calculated as a proportion of the maximum attainable kappa, the item 'gardening' had almost perfect agreement, three items had substantial agreement, seven items moderate agreement, eight fair agreement and the item 'driving' slight agreement.
Three participants who had a retest only mentioned an extra goal in the test, while three others only mentioned an extra goal in the retest. One participant mentioned a goal in the test and in the retest, but this was a different goal. Therefore, no kappa value was calculated for the extra option.
The mean of all the weighted kappa values showed fair agreement and, when calculated as a proportion of the maximum attainable kappa, moderate agreement. The mean of the intra-rater kappa values showed moderate agreement and, as a proportion of the maximum attainable kappa, substantial agreement. The mean of the inter-rater kappa values showed fair agreement.
Asking the participants the reason for the discrepancies between test and retest, revealed several reasons: 1) Difference in interpretation at different moments, for example the participant had difficulties with walking due to shortness of breath, but did not have any problems with the legs. At the retest the participant did take into account the shortness of breath, at the test only the legs. 2) Priority is assessed differently at different moments, for example groceries are normally done by the partner, but it would be nice if the participant could help, or the pain is present but the participant could cope with it. 3) Progressive insight during the hospital admission: through more information, or the experience of a disappointing recovery, goals were lowered or suddenly became much more important. 4) In some cases the participant was not able to explain the reason.

Follow-up questionnaire
For the follow-up test-retest reliability, 90 participants were approached. In 11 cases the participant refused the retest, in six cases the participant could not be reached, and for one case it was unknown why the retest was not performed. Finally, 72 participants performed a test-retest of the follow-up questionnaire. However, since only goals that were applicable were evaluated and some goals were quite rare, these goals had very small sample sizes. We therefore computed weighted kappa values only when the sample size was ≥ 10 participants. Median time between test and retest was 9.5 days. In 43 cases the retest was performed by a different interviewer and in 29 cases by the same interviewer. Sample sizes were too small to calculate separate kappa values for intra- and inter-rater reliability. Six values can be found in Additional file 4.
The complete crosstabulations of all the items are included in Additional File 4. Table 7 shows the weighted kappa in descending order. The item 'enjoying life' had almost perfect agreement. Two items had substantial agreement, six moderate agreement, two fair agreement and the item 'knowing what is wrong' slight agreement.
When the weighted kappa was calculated as a proportion of the maximum attainable kappa, four items had almost perfect agreement, three substantial agreement, three moderate agreement, one fair agreement and one slight agreement.
For ten items the sample size was too small to calculate a valid kappa. The percentage of agreement for these items varied widely from zero for groceries to one hundred for home and the extra goal, although these last two items were only answered by one and two participants, respectively.
The mean of all the weighted kappa values showed a moderate agreement, when calculated as a proportion of the maximum attainable kappa, a substantial agreement.

Baseline questionnaire
All baseline hypotheses were confirmed. Table 2 shows the test statistics and the complete descriptive information is shown in Additional file 5.
The 50 cases selected for the open question mentioned 110 goals in total. Of these, 23 goals could not be coded as an item in the P-BAS HOP because they were too vague to categorise or the goal did not exist in the P-BAS HOP and were therefore coded as 'other'. An example of a vague goal was: 'that it will be the way it was', an example of a goal that did not exist in the P-BAS HOP was: 'that I can lift my grandson again'. We consequently analysed the agreement between the codes and the answers given in the P-BAS HOP of 87 goals and found an agreement of 75%. An overview of the number of items coded and the amount of agreement is given in Table 8.

Follow-up questionnaire
Six hypotheses did not meet the assumptions for Cramér's V, because the number of people experiencing a deterioration on that item was very low. (Due to a random temporary error in the computer system, the items 'defecation' (n = 2) and 'walking' (n = 4) were not asked.) For four of these hypotheses the descriptive trend was in the right direction. Of the six of the first 12 hypotheses that could be calculated, four were confirmed and two were rejected. Table 3 shows the test statistics and the complete descriptive information is shown in Additional File 6. The Spearman's correlation coefficient between the PBI and the anchor question was 0.51. Figure 2 shows on the left side the ROC curve of 'no important benefit', with an area under the curve of 0.73. The optimal cut-off point for 'no important benefit' was set at a sensitivity of 73% and a specificity of 73%, resulting in an MIC of 0.7 points on the PBI.

Interpretability
The right side of Fig. 2 shows the ROC curve of 'important benefit', with an area under the curve of 0.80. The optimal cut-off point for 'important benefit' was set at a sensitivity of 79% and a specificity of 75%, resulting in an MIC of 1.4 points on the PBI. This means that PBI values between 0.7 and 1.4 are considered 'borderline'.

Discussion
In this study we tested the reliability, validity, responsiveness and interpretability of the Patient Benefit Assessment Scale for Hospitalised Older Patients (P-BAS HOP), which was designed to identify the goals of the individual patient and to measure the outcomes relevant to him/her. The results are mixed. The reliability of the individual items of the baseline questionnaire can be summarised as fair to moderate. Participants regularly varied in which goals they considered important. This could have several causes. Firstly, although the sample sizes were small, the intra-rater reliability of the baseline test appeared to be much better than the inter-rater reliability. The interviewer may have unintentionally influenced a participant by remembering the answers from the previous day, but it is more probable that there was much variation between the instructions given by the various interviewers. This could be caused by not having all questions written out, giving more autonomy to the interviewer, or the instructions may have been insufficient. Secondly, a hospital admission is a highly unstable and unpredictable period: symptoms vary, and people receive treatments and medical information which can change their priorities. Thirdly, the definition of a problem or limitation was perhaps not very clear, since this could refer to the moment of the interview, to the moment of admission, or to a potential limitation. This could cause large differences in the crosstabulations: when someone, for example, declares in the first step of the test that an item does not apply, the answer is automatically 'doesn't apply/not at all important', while when saying in the retest that it does apply, the participant continues to the second step and can indicate there that it is 'very important'. Fourthly, choosing which goals or items are relevant is very different from usual questionnaires where the objective is to assess, for example, health status.
When comparing the P-BAS HOP with other instruments in which participants choose their own domains, it appears that choosing other domains in the retest is common. For example, in the Schedule for the Evaluation of Individual Quality of Life (SEIQoL-DW), 35 to 81% of the participants chose other domains in the retest [19,20]. In the Patient-Generated Index (PGI), participants have to choose a maximum of five domains; the mean number of changed domains in the retest was 1.7, and 21% of the participants chose three to five new domains [20,21]. A more technical explanation for the low kappa values is that, as a result of the individual approach of the tool, the percentage of 'doesn't apply to me' is often high, resulting in very homogeneous samples, which causes low kappa values [11,12,22].
Although the reliability of the individual items of the baseline questionnaire is fair to moderate, the ICC between the PBI of the test and retest was 0.77, which is acceptable. This means that even though not all participants are very consistent in their choice of goals, this does not lead to very deviating PBI-scores. This could be explained by the fact that many people differ only in a few goals between test and retest and that there exist moderate to strong correlations between the achievement of many goals (data not shown).
The reliability of the follow-up questionnaire was better than that of the baseline questionnaire, with a mean weighted kappa of 0.51.
Participants were probably in a more stable situation during follow-up, although we did not ask whether anything had changed between test and retest. However, the variation between test and retest items at follow-up had more impact on the ICC, which was 0.62 and therefore not satisfactory. The follow-up intra- and inter-rater reliability were similar. This could be caused by having all questions written out at follow-up, leaving less room for variation between interviewers.
From the hypotheses for baseline validity, almost all were confirmed. This suggests participants are likely to choose goals which are relevant for them. On the other hand, this is contradicted at follow-up, where participants often stated that the goal was not applicable for them; for the goal 'washing and dressing' this was even 34%. This could have several causes. First, the P-BAS HOP does not discriminate between preservation and improvement, so the goal could have been to preserve a function, but this is not clear in the questioning, especially through use of the word 'again'. Second, participants may have forgotten in what poor condition they were during admission, therefore ignoring how much they have improved. In the literature this is called response shift or recall bias, and it more frequently occurs in the opposite direction, with patients afterwards underestimating their condition during admission [23][24][25]. However, Hinz et al. showed that around 20 to 30% of patients afterwards overestimated their condition during admission [24]. A third explanation could be that it was unclear which time period the participants had to compare with: during hospitalisation, for example, participants were unable to wash and dress themselves, but before admission this was not a problem. Compared to the situation at admission it was an improvement, but compared to the situation before admission, the hospitalisation did not make a difference.
The agreement between goals coded in the open questions and the P-BAS HOP items was 75%, which we considered just valid. This could partly be due to ambiguity: some goals were difficult to code. For example, we coded the goal 'that I can be part of club life' as 'hobbies', but we were not sure what kind of club this participant wanted to be part of and whether this could be seen as a hobby or not. Nevertheless, there were also examples of situations where there was clear disagreement between the goal set by the participant in the open question and the P-BAS HOP. For example, a person stated in the open question 'being able to work in the garden', while in the P-BAS HOP the item 'gardening' was marked as 'not applicable'. This could be caused by the first part of the baseline questionnaire, where the participant states whether he/she experiences or expects limitations regarding that subject. Apparently a subject does not need to be an actual problem or limitation to be a goal.
A limitation of the method of comparing goals in the open question and the P-BAS HOP, is that participants could mention several goals, but we treated the coded goals and the answers in the P-BAS HOP as if they were independent.
For the testing of the validity at follow-up, we were limited by small sample sizes and the fact that only small numbers of people deteriorated on the Katz-15, EQ-5D or MSPP between baseline and follow-up. Other studies reported higher rates of deterioration, of around one third of older patients [26][27][28]. We probably had a selection bias, with the fittest patients being more willing to participate.
Of the follow-up hypotheses that were tested, one third were rejected; we therefore have to conclude that the validity of the follow-up questions was weak. This could be a result of recall bias, but also of participants not knowing which time period they had to compare with. We did not observe difficulties with the validity of the follow-up questionnaire in the Three Step Test Interviews (TSTI) during the pilot [4], but this could be because we did the TSTI at discharge and not when people were back home for several weeks.
Although the validity of the follow-up questionnaire was weak, the PBI could be considered valid, so the sum of the achievement of all goals, weighted for their importance, gives a good representation of the benefit the participant experienced from the hospital admission. A disadvantage of an anchor-based method is that the conclusion is always dependent on the anchor chosen [17]. Many participants gave an explanation of their answer to the anchor question, and this revealed that the conclusion of how much benefit the participant had was not always based on the goals achieved, but could also be based on other indicators, for example how kind the hospital staff was.
For the interpretability we constructed cut-off values for relevant benefit, but one should take into account that a cut-off is in reality not an absolute value and could be dependent on the sample [18].

Limitations
The sample size of the reliability studies was quite low, especially when taking into account the homogeneous samples at baseline. Therefore, the confidence intervals around the kappa values were often large. Another result of the homogeneous samples at baseline is that the numbers in the middle categories are quite low, not meeting the criterion of a minimum of 10 cases in the margins [29]. We therefore also computed kappa values for 2 × 2 tables, by combining the categories 'doesn't apply/not at all important' with 'somewhat important' and 'quite important' with 'very important'. This showed similar results, although still not all margins had 10 cases (data not shown). At follow-up the problem of low sample sizes was larger, since only goals that applied were evaluated and some goals were chosen by only a few participants.
Since the P-BAS HOP was administered on paper, interviewers had to manually circle the goals to ask in the second part, based on the subjects indicated as applicable in the first part. This sometimes led to the omission of a goal when the interviewer forgot to circle it.
The time between discharge and follow-up was 3 months, which is quite long if patients have to indicate to what extent the hospitalisation helped to achieve the set goals. In the meantime there could be various other factors which have influenced the result and which are difficult to disentangle from the hospital admission.

Conclusions and recommendations
Although the concept seems promising, the reliability and validity of the P-BAS HOP appeared to be not yet satisfactory in this format. We therefore recommend adapting the P-BAS HOP and subsequently re-evaluating its measurement properties.