Measurement properties of oral health assessments for non-dental healthcare professionals in older people: a systematic review

Background: Regular inspection of the oral cavity is required for prevention, early diagnosis and risk reduction of oral- and general health-related problems. Assessments to inspect the oral cavity have been designed for non-dental healthcare professionals, like nurses. The purpose of this systematic review was to evaluate the content and the measurement properties of oral health assessments for use by non-dental healthcare professionals in assessing older people's oral health, in order to provide recommendations for practice, policy, and research.
Methods: A systematic search was performed in PubMed, EMBASE.com, and Cinahl (via Ebsco). Search terms referring to 'oral health assessments', 'non-dental healthcare professionals' and 'older people (60+)' were used. Two reviewers independently performed title/abstract and full-text screening for eligibility. The included studies investigated at least one measurement property (validity/reliability) and were evaluated on their methodological quality using the "Consensus-based Standards for the selection of health Measurement Instruments" (COSMIN) checklist. The measurement properties were then scored using quality criteria (positive/negative/indeterminate).
Results: Out of 879 hits, 18 studies were included in this review. Five studies showed good methodological quality on at least one measurement property and 14 studies showed poor methodological quality on some of their measurement properties. None of the studies assessed all measurement properties of the COSMIN. In total, eight oral health assessments were found: the Revised Oral Assessment Guide (ROAG); the Minimum Data Set (MDS), with oral health component; the Oral Health Assessment Tool (OHAT); The Holistic Reliable Oral Assessment Tool (THROAT); Dental Hygiene Registration (DHR); Mucosal Plaque Score (MPS); The Brief Oral Health Screening Examination (BOHSE) and the Oral Assessment Sheet (OAS). The most frequently assessed items were: lips, mucous membrane, tongue, gums, teeth, denture, saliva, and oral hygiene.
Conclusion: Taking into account the scarce evidence for the proposed assessments, the OHAT and ROAG are the most complete in their included oral health items and are of the best methodological quality in combination with positive quality criteria on their measurement properties. Non-dental healthcare professionals, policymakers and researchers should be aware of the methodological limitations of the available oral health assessments and realize that the quality of the measurement properties remains uncertain.


Background
Nowadays, in Western countries more older people retain all or a major part of their natural teeth which brings along new challenges for the oral healthcare system. Highly complicated restorations (e.g. crowns, bridges, implants) make it more difficult to perform adequate oral self-care, especially in frail older people [1], and as such may result in (oral) health-related complications [2,3].
Oral health problems like pain, abscesses, difficulties with eating and chewing may have a significant impact on older peoples' self-esteem, well-being, social life, and quality of life [4,5]. At the same time, oral problems like periodontitis are associated with for example cardiovascular diseases, diabetes and pneumonia [6,7]. Therefore, prevention and early diagnosis of oral diseases are important for the risk reduction of developing further problems with oral and general health.
Preventing oral health problems requires regular inspection of the oral cavity. Such inspections are traditionally performed by the dentist during preventive treatment sessions in dental practice. However, several barriers to seeking oral health care may contribute to a decrease in oral inspections. A review by Kiyak et al. (2005) concluded that barriers to seeking oral care in older people depend on age, ethnicity, income, availability of dental insurance, type of residence (urban vs. rural), physical access and general health. Moreover, they concluded that attitude and psychosocial factors could contribute to older people's oral healthcare-seeking behavior. Since (frail) older people seek dental care less frequently, the role of non-dental care professionals in screening and triaging oral health problems has gained importance [8-11].
Over the past twenty years, several oral health assessments have been developed for use by non-dental healthcare professionals like nurses and caregivers. For example, the Oral Health Assessment Tool (OHAT), the Revised Oral Assessment Guide (ROAG), The Holistic Reliable Oral Assessment Tool (THROAT), and comparable assessments have been developed for inspection and triage of the oral cavity of older people [10,12]. Such assessments may support non-dental healthcare professionals in assessing oral health in older people. Moreover, specific oral assessments have been developed for cancer patients [13]. However, since this target group suffers from specific oral health issues like mucositis, their oral healthcare needs differ from those of older people in general and were not the focus of this review.
Available oral health assessments as reported in the literature may differ in their approach, and they are described as tools, instruments, guides, and sheets for oral cavity inspection or triage. In this review, we use the generic term oral health assessment for all approaches that aim to inspect the oral cavity of older people. Earlier studies reported that oral health assessments in practice should be easy and simple to use, inexpensive, and only require basic equipment [10,14]. Moreover, for evidence-based care decisions, the measurement properties of such (oral health) assessments are considered crucial and should therefore be tested. The measurement properties are divided into three domains [15,16]:
- Validity, i.e. construct validity: the assessment aligns with the theoretical notion of oral health; content validity: it includes all items considered relevant by all stakeholders; criterion validity: it correlates with a reference;
- Reliability, i.e. similar results are obtained for repeated measurements;
- Responsiveness, i.e. change over time is detected.
Pearson and Chalmers performed a systematic review on oral health assessments for use by nurses and caregivers of older people with dementia [10]. They concluded that there is a lack of validated and reliable tools for oral cavity inspection by non-dental healthcare professionals. Since then, new oral health assessments have been developed. Some of these were tested on their validity and reliability [17-19], while others were not [13,20,21]. To date, an overview of these assessments and their measurement properties has not been published.

Objective
The purpose of this systematic review was to evaluate the content and the measurement properties of oral health assessments for use by non-dental healthcare professionals in assessing older people's oral health, in order to provide recommendations for practice, policy, and research.

Study design and strategy
To identify all relevant publications, systematic searches were performed in the bibliographic databases PubMed, EMBASE.com, and Cinahl (via Ebsco) from inception to 13 November 2017. Search terms included indexed terms from MeSH in PubMed, EMtree in EMBASE.com, and Cinahl headings in Cinahl, as well as free-text terms. Search terms referring to 'oral health assessments' were used in combination with search terms comprising 'non-dental healthcare professionals' and 'older people' (60+). Duplicate studies were excluded. The full search strategies for all databases can be found in Additional file 1 (Search strategies for databases). Reference lists of included studies were screened for additional relevant studies (cross-reference check).

Selection process
Two reviewers (BE and LWV) independently screened all potentially relevant titles and abstracts for eligibility. The selection process was performed using Covidence, a Cochrane online technology platform, which allowed the procedure to be carried out remotely [22]. If necessary, the full-text article was checked against the eligibility criteria. Differences in judgment were resolved through a consensus procedure. Studies were included if they met the following criteria: (i) the full text of the original article was available; (ii) they included oral health assessments for oral cavity inspection of older people (60+) developed for use by non-dental healthcare professionals; (iii) they reported original investigative data on one or more measurement properties. Moreover, they had to fulfill the criteria defined by the Consensus-based Standards for the selection of health Measurement Instruments (COSMIN) for systematic reviews: www.database.cosmin.nl [23].
Studies were excluded if they concerned: (i) publications in languages other than English; (ii) oral health assessments developed for dental professionals; (iii) oral health-related quality of life instruments; (iv) oral screening instruments based only on questionnaires; and (v) oral health assessments exclusively developed for patients with cancer or other specific illnesses.

General information of the included studies
To give an overview of the included studies, information was extracted on: authors, publication year, study design, investigated measurement property, type of non-dental healthcare professional, specification of the older people population, oral health assessment (and the items assessed), rating scale of the assessment and duration of the assessment. Data extraction was performed on all included studies.

Assessment of the methodological quality of the included studies per measurement property
When the validity and reliability of an assessment tool are investigated in a study of good methodological quality, the results can be used in research or daily care. However, when the methodological quality of a study is inadequate, the results of the study cannot be trusted and the quality of the assessment tool remains unclear [16]. Therefore, to assess the methodological quality of the included studies, the COSMIN checklist with a 4-point scale was used [24]. This checklist is a tool for assessing the methodological quality of studies examining measurement properties and has shown good inter-rater agreement and user-friendliness [19]. The COSMIN checklist evaluates three main domains: 1. Validity, 2. Reliability, and 3. Responsiveness, which are further divided into nine measurement properties (Box A-I). A visualization of how these measurement properties are related is shown in Fig. 1. Within the COSMIN, a separate score is assigned for the methodological quality of each of the nine measurement properties in a study. A study that evaluates several measurement properties therefore receives several methodological quality scores, and these can differ per measurement property. For example, the methodological quality of the content validity investigation can be good while, at the same time, the reliability assessment was performed in a small sample and is therefore of poor methodological quality. Depending on the measurement property, the COSMIN checklist contains a minimum of 5 and a maximum of 18 questions to evaluate the methodological quality [24]. Each question was rated on a four-point ordinal scale (excellent, good, fair, poor). To determine the methodological quality per property, the 'worst score counts' criterion was used, meaning that the lowest score on any question within one measurement property determines the methodological quality score for that property. For the full assessment of all measurement properties, we refer to the original COSMIN guideline [24].
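The 'worst score counts' rule described above can be illustrated with a minimal sketch. The function name and the example ratings are hypothetical and are not taken from the included studies; only the four rating labels come from the text.

```python
# Rating labels ordered from best to worst, as named in the text.
RATINGS = ["excellent", "good", "fair", "poor"]


def worst_score_counts(item_ratings):
    """Return the lowest (worst) rating among the checklist questions of one box."""
    if not item_ratings:
        raise ValueError("at least one question rating is required")
    # The rating with the highest position in RATINGS is the worst one.
    return max(item_ratings, key=RATINGS.index)


# Hypothetical question ratings for two measurement properties of one study.
study = {
    "content validity": ["good", "excellent", "good", "good", "excellent"],
    "inter-rater reliability": ["excellent", "good", "poor", "good", "fair"],
}

for measurement_property, ratings in study.items():
    print(measurement_property, "->", worst_score_counts(ratings))
```

In this hypothetical example a single 'poor' question (for instance a small sample size) determines the score for inter-rater reliability, while content validity in the same study is still rated 'good'.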
A definition of each measurement property is given in Table 1 under the column 'description'. Definitions are based on Terwee et al. (2007) and slightly modified in terminology to fit the content of our study.
Two raters (BE & LWV) independently determined the overall methodological quality per property. Disagreements between the raters were resolved in a consensus meeting. A third reviewer (KJ) was consulted when agreement was still not reached.

Quality criteria for the measurement properties on oral health assessments
When measurement properties were of excellent, good or fair methodological quality, the quality of the measurement properties was assessed. Measurement properties of poor methodological quality were excluded from further quality assessment for that specific measurement property. The scores for the quality of a measurement property were: positive (+), negative (−) or indeterminate (?). See the column 'Quality criteria for measurement properties' in Table 1 for the definitions.

Search results
The literature search generated a total of 879 references: 395 in PubMed, 393 in EMBASE.com and 91 in Cinahl. After removing duplicates, 557 references remained. Of these, 404 were removed based on screening of the title and abstract. After screening the full text, 136 studies were removed based on the inclusion and exclusion criteria. One article that met the inclusion and exclusion criteria was added after reviewing the reference lists of included articles. The flowchart of the search and selection process, including the reasons for excluding full-text articles, is presented in Fig. 2.
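As a consistency check, the counts reported above can be traced step by step. The sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
# Hits per database, as reported in the text.
hits = {"PubMed": 395, "EMBASE.com": 393, "Cinahl": 91}
total_hits = sum(hits.values())

after_duplicates = 557          # references left after de-duplication
excluded_title_abstract = 404   # removed at title/abstract screening
excluded_full_text = 136        # removed at full-text screening
added_from_references = 1       # found via the cross-reference check

duplicates_removed = total_hits - after_duplicates
full_text_screened = after_duplicates - excluded_title_abstract
included = full_text_screened - excluded_full_text + added_from_references

print("total hits:", total_hits)
print("duplicates removed:", duplicates_removed)
print("full texts screened:", full_text_screened)
print("studies included:", included)
```

The totals agree with the 879 hits and 18 included studies stated in the text.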

Included studies
In total, 18 studies describing eight different oral health assessments were included for analysis. Table 2 gives an overview of the included studies and their investigated oral health assessments. Most non-dental healthcare professionals involved were nurses, sub-classified as Registered Nurse (RN), Licensed Vocational Nurse (LVN), Clinical Nurse (CN) or Licensed Practical Nurse (LPN). In the study of Simpelaere et al. (2016), speech pathologists were included [38]. The population on which the oral health assessment was used was heterogeneous and consisted of rehabilitation residents, nursing home residents, hospitalized older people, community-dwelling older people and older people with mental problems (Table 2).

The methodological quality of the included studies per measurement property
None of the studies assessed all measurement properties included in the COSMIN checklist. The most measurement properties within a single study (N = 5) were investigated for the OHAT (Table 2). In total, five studies showed good methodological quality on at least one measurement property and 14 studies showed poor methodological quality on some of their measurement properties. An overview of the reasons for poor methodological quality is shown in Table 3. Below, the results on the methodological quality per measurement property are described. The following measurement properties were not investigated by any of the included studies: Measurement error (box C), Structural validity (box E), Hypothesis testing (box F) and Responsiveness (box I).

The methodological quality of the measurement property validity
Nine out of the 18 included studies investigated the domain validity of the oral health assessments (Table 4).
Of those, all five studies that assessed content validity scored poor on methodological quality, mainly because the patient population was not involved in developing the oral health assessment and the studies did not assess whether the items comprehensively reflect the construct (i.e. "oral health") to be measured [19,25,29,33,40] (see Table 3). Two studies assessed cross-cultural validity. The ROAG was translated into Portuguese by Ribeiro et al. (2014) using multiple forward translations and one backward translation [37]. Hanne et al. (2012) only conducted a forward translation into Danish and therefore scored poor on methodological quality [30] (Table 3). Ribeiro et al. (2014) assessed the criterion validity of the ROAG with a dentist considered as "gold standard" (reference-rater) and had good methodological quality [37]. Two further studies scored fair and good, respectively, on the methodological quality of this measurement property [29,34] (Table 4).

The methodological quality of the measurement property reliability
For this study, the reliability was divided into intra-rater reliability, inter-rater reliability, and test-retest to assess the methodological quality. Internal consistency was only   (Table 3).

Intra-rater reliability
The intra-rater reliability was investigated for the ROAG, OHAT, THROAT, MPS, and DHR. Good methodological quality of the intra-rater reliability assessment was performed for the ROAG and THROAT by Ribeiro

Inter-rater reliability
Inter-rater reliability was assessed for all oral health assessments in 14 included studies. Inter-rater reliability was investigated between several professions: nurses, speech pathologists or a dental professional with a non-dental healthcare professional (  [18,19,35]. The MDS was assessed on inter-rater reliability by all five studies on the MDS. However, the quality was rated poor for four of them because of the low quality of the statistical method and the small sample size (Table 3) [26-28,31].
Studies investigating the OHAT, DHR, BOHSE, and OAS scored fair on methodological quality for inter-rater reliability, mainly because they reported unweighted kappas for ordinal scores [17,29,33,39]. The study of Henriksen et al. (1999) showed poor methodological quality (Table 3) [32].
Table notes: For criterion validity, a non-dental healthcare professional was the index-rater and a dentist was used as reference-rater. Only kappas are reported instead of percent agreement because this reflects better methodological quality according to the COSMIN criteria. N.A. (not applicable) was reported for the quality criteria when an article had poor methodological quality.

Test-retest reliability
time and therefore scored poor on the methodological quality (Table 3). Kayser-Jones et al. (1995) (BOHSE) also looked at test-retest reliability. The methodological quality was fair because of the moderate sample size and the reported unweighted kappas for the ordinal score.

Characteristics of individual oral health assessments and the quality assessment of their measurement properties
Overall, the oral health assessments include 18 items in the oral cavity. The most frequently assessed items are lips, mucous membrane, tongue, gums, teeth, denture, saliva, and oral hygiene (Table 6). The way each item is assessed can differ. For example, for the item "lips", some assessments look at color and moistness while others look at swelling and bleeding (Table 6). Below, where applicable, the validity, intra-/inter-rater reliability and test-retest reliability of the oral health assessments are evaluated in their context and the quality assessment of the measurement property is reported. No studies with acceptable methodological quality on any of the measurement properties were found for the MPS, so this assessment will not be discussed.
One study investigated the inter-rater reliability between a dental hygienist and a registered nurse [18]. The percent agreement was lowest for teeth/dentures and tongue and highest for swallowing and voice. Weighted kappas (κw) were only reported for items on which both the minimum and the maximum of the ordinal scale were scored. For the items "voice" and "gums" no maximum score (score 3) was registered and therefore unweighted kappas (κ) were reported instead of weighted kappas. The quality assessment of the measurement property therefore scored ?/−. The kappas ranged from 0.45 to 0.84 with a mean of 0.59 (Table 5). The lowest kappas were found for voice (κ), teeth/dentures (κw), tongue (κw), and saliva (κw) and the highest for swallowing (κw).
Ribeiro et al. (2014) investigated the validity and reliability of the ROAG in Portuguese [37]. Criterion validity was assessed with a dentist considered as "gold standard" (reference-rater). The measurement property was scored indeterminate (?) because sensitivity, specificity, and accuracy were reported. Sensitivity ranged from 0.17 for saliva to 1.0 for swallowing. Specificity ranged from 0.69 for teeth/dentures to 0.98 for saliva (Table 4). For the intra-rater reliability of the community health workers (CHWs), weighted kappas were only calculated for the items with two or three levels of response: tongue, hygiene of teeth and dentures, and/or caries. They ranged from κw = 0.38 to κw = 0.88 and therefore scored +/− on the measurement property (Table 5). The lowest weighted kappa was found for teeth/dentures. Unweighted kappas were the lowest for saliva and the highest for voice, lips, and swallowing.
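The distinction between unweighted and weighted kappa matters for the ordinal item scores discussed above. The following minimal sketch computes Cohen's kappa for two raters, with optional linear weights. The ratings are hypothetical and the linear weighting scheme is an assumption; the included studies may have used other weights (e.g. quadratic).

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b, categories, weighted=False):
    """Cohen's kappa for two raters; linear disagreement weights if weighted=True."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same, non-empty set of subjects")
    n = len(rater_a)
    k = len(categories)
    index = {category: i for i, category in enumerate(categories)}
    # Joint counts of (rater A score, rater B score) and marginal counts per rater.
    joint = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    def weight(i, j):
        # Disagreement weight: 0 for agreement; otherwise 1 (unweighted) or
        # proportional to the distance between the ordinal categories (linear).
        if i == j:
            return 0.0
        return abs(i - j) / (k - 1) if weighted else 1.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(weight(i, j) * joint[(i, j)] / n for i, j in cells)
    expected = sum(weight(i, j) * marg_a[i] * marg_b[j] / n**2 for i, j in cells)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - observed / expected


# Hypothetical scores (1 = healthy, 2 = changes, 3 = unhealthy) for ten residents.
rater_a = [1, 1, 2, 2, 3, 3, 1, 2, 3, 2]
rater_b = [1, 2, 2, 3, 3, 2, 1, 2, 3, 1]

print("unweighted kappa:", round(cohen_kappa(rater_a, rater_b, [1, 2, 3]), 3))
print("weighted kappa:  ", round(cohen_kappa(rater_a, rater_b, [1, 2, 3], weighted=True), 3))
```

In this example all disagreements are between adjacent categories, so the weighted kappa is higher than the unweighted kappa. This illustrates why reporting unweighted kappas for ordinal scores can understate agreement.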

MDS
The MDS was investigated by five different studies; however, as described before, four of them had poor methodological quality and will not be evaluated in depth. Morris et al. (1997), using the MDS-HC (for community-dwelling older people), reported overall weighted kappas between nurses for the oral health component ranging from κw = 0.57 to κw = 0.60. For MDS 2.0 (nursing homes) this was κw = 0.70. Because of the spread between weighted kappas, a +/− was scored for the quality criteria (see Table 5) [35].

OHAT
In the 2005 study of the OHAT [17], at individual item level, intra-rater reliability ranged from 74.4% agreement for oral cleanliness to 93.9% for dental pain and 96.6% for referral to the dentist. Unweighted kappas were moderate (0.51-0.60) for lips, saliva, oral cleanliness and referral to the dentist. All other categories showed kappas ranging from 0.61 to 0.80, which indicates substantial agreement. The overall intraclass correlation coefficient (ICC) on the total score was 0.78 and all results were statistically significant. The quality of the measurement property was scored +/? because of the high ICC and the reported unweighted kappas (Table 5).
For the inter-rater reliability between nurses, percent agreement ranged from 72.6% for oral cleanliness to 92.6% for dental pain and 96.8% for referral to the dentist. Unweighted kappas varied from 0.48 to 0.60 for lips, tongue, gums, saliva, oral cleanliness and referral to the dentist. The other items scored between 0.61 and 0.80, indicating substantial agreement for inter-rater reliability. The correlation coefficient for the inter-rater agreement on the total score was 0.74. All statistics were statistically significant. The quality of the measurement property was scored +/? because of the high ICC and because unweighted kappas were reported (Table 5). Simpelaere et al. (2016) investigated the intra-rater, inter-rater and test-retest reliability in speech pathologists [38]. However, the intra-rater reliability assessment was of "poor" methodological quality, as described earlier, and will not be discussed further.
The inter-rater reliability was tested between three speech pathologists on 132 individuals. The ICC on the total score was 0.96 (95% CI 0.95-0.97) and therefore scored positive (+) on the quality criteria (Table 5). The individual items varied with a Fleiss kappa from 0.83 to 1.00. No weighted kappa was calculated; therefore an indeterminate (?) rating was given. For the test-retest reliability, a second assessment was performed on 46 individuals after two weeks. The ICC for the two raters on the total score was 0.81 (95% CI 0.68-0.89) and 0.78 (95% CI 0.64-0.87). Kappas varied between 0.14 for dental pain and 0.91 for dentures and teeth. Slight agreement was also found for gums and tissues. Because unweighted kappas were reported, an indeterminate (?) rating was scored (Table 5).
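The verbal labels used above for kappa values (slight, moderate, substantial) correspond to the bands commonly attributed to Landis and Koch (1977). The text does not name the scheme it follows, so the mapping below is an assumption offered for orientation only.

```python
def agreement_label(kappa):
    """Map a kappa value to a verbal label, assuming the Landis and Koch bands."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label


# Kappa values mentioned in the text for the OHAT.
for value in (0.14, 0.51, 0.61, 0.91):
    print(value, "->", agreement_label(value))
```

Under this assumption, the kappa of 0.14 reported for dental pain falls in the 'slight' band, which matches the wording used in the text.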

THROAT
For the intra-rater agreement investigated by Dickinson et al. (2001), the weighted kappas varied between κw = 0.69 and κw = 0.96 for all items, except for the floor of the mouth and smell (κw = 0). For the total score, intra-rater reliability was good: κw = 0.95 (95% CI 0.88-1.02) [19]. Because of the large spread between kappas, the measurement property scored +/− on the quality criteria (Table 4). The inter-rater assessment of the single items was performed between nurses and the dental hygienist, with unweighted kappas of κ < 0.30 across the raters. Negative kappas were reported for teeth and smell. When raters were paired, the weighted kappas ranged from κw = 0.46 to κw = 0.89, with the lowest values for teeth and dentures. Because of the spread between kappas, a +/− was scored on the quality criteria.
A positive (+) rating for the inter-rater reliability on the total score was given because the weighted kappas were κw = 0.96 (95% CI 0.90-1.02) between a stroke specialist nurse and a student nurse and κw = 0.97 (95% CI 0.92-1.02) between stroke specialist nurses and the dental hygienist. For criterion validity, a positive (+) rating was scored because the correlation with the reported gold standards (Mucosal Plaque Index [32] and OHI-S [41]) was Rs = 0.78 and statistically significant (Table 4). For inter-rater reliability, the unweighted kappa between the dental hygienist and the clinical nurse was κ = 0.4 (not statistically significant) and therefore scored indeterminate (?). Intra- and inter-rater reliability have also been evaluated on a series of videos. The inter-rater reliability was scored indeterminate (?) because the unweighted kappa was κ = 0.7 for the dental hygienist and κ = 0.8 for the clinical nurse (Table 5).

BOHSE
Lin et al. (1999) investigated the criterion validity using a dentist as "gold standard" (reference-rater) [34]. For criterion validity, +/− was scored because the correlation coefficients varied between 0.351 and 0.578 for the dentist and the nurses (nurse and clinical nurse assistant (CNA)). However, the correlation coefficients were lower than 0.70 and therefore they scored negative (−) on the quality criteria (Table 4).
Inter-rater reliability was also tested between the dentist and the nurses. An indeterminate (?) score was given because only percent agreement and unweighted kappas were reported. The lowest percent agreements were found on the items lips, gums, natural teeth, and oral cleanliness: 60.7%, 37.5%, 60.7%, and 32.1% respectively. Kappas ranged from κ = 0.015 to κ = 0.519. The lowest kappas were reported for gums between the Doctor of Dental Surgery (DDS) and the CNA and for oral cleanliness between the DDS and the nurse. The highest kappa was reported for pairs of teeth in chewing position (Table 5). In addition, negative kappas were reported for lymph nodes, lips, tongue, tissues/cheek, and the floor of the mouth.
In the study of Kayser-Jones et al. (1995), the inter-rater reliability on the total score was rated negative (−) because correlations varied between 0.40 (between the RN and CNA) and 0.68 (between the DDS and LVN), although all were statistically significant [33]. For the individual items, percent agreement ranged from 50.5 to 98.0%, with the lowest values for oral cleanliness and the highest for lymph nodes. The unweighted kappas ranged from κ = 0.09 for the item tissues to κ = 0.82 for pairs in chewing position. Negative kappas were reported for lymph nodes. The individual items of the BOHSE scored indeterminate (?) because unweighted kappas were reported (Table 5).
The test-retest reliability was assessed on the total score by Kayser-Jones et al. (1995) for the DDS, RN, LVN, and CNA. The highest correlation was reported for the RN between time 1 and 2. The quality criteria scored +/− because statistically significant correlations varied between r = 0.79 and r = 0.88 between time 1 and 2 for different raters (Table 5).

OAS
Yanagisawa et al. (2017) investigated the inter-rater reliability between dental professionals and carers before and after training [39]. Between dental professionals, the Fleiss' kappa ranged from 0.49 to 0.83 and the ICC mean was 0.93. Kappa values were low for tongue coat, bad breath, and mouth opening.
The kappas between dental professionals and care workers ranged from 0.25 to 0.80 and were the highest for bad breath and tongue thrusting. After the training, the mean kappa increased to 0.72 and the ICC increased to 0.89, with the lowest values for the cleanliness of teeth and gums, bad breath and difficulty chewing. An indeterminate (?) score was reported for the kappas because they were unweighted, and the ICC scored +/− because of the variance between the scores (Table 5).

Discussion
In this systematic review, we evaluated eighteen studies investigating eight oral health assessments for use by non-dental healthcare professionals to assess older people's oral health, on their content and measurement properties, in order to give recommendations for practice, policy and research.
Out of the eighteen included studies, only five scored good on the methodological quality of some of the measurement properties [18,19,34,35,37]. Overall, the OHAT has been most extensively investigated on its measurement properties, with fair/good methodological quality and a positive (+)/indeterminate (?) quality assessment of the outcome. Similar results were found for the BOHSE (a prior version of the OHAT), which was the most reliable and valid oral health assessment according to the systematic review of Pearson and Chalmers in 2005 [10]. However, nurses concluded that the BOHSE was too long and complicated, and it was therefore simplified into the OHAT [17,33]. Three adaptations were made: 1. the categories of lymph nodes and pairs of teeth in chewing position were eliminated; 2. the items tissue and gums were combined; and 3. a category of behavioral problems and pain was added.
The ROAG, MDS, OHAT, THROAT, BOHSE, and OAS contain the most items to inspect the oral cavity, varying between 6 and 12 items. The results of this review show the least agreement between raters on the items oral hygiene, lips, saliva, and natural teeth. An explanation could be that non-dental healthcare professionals lack experience in assessing these items. Results from a focus group discussion by Chalmers (2005) support these findings; nurses felt less capable of assessing gums and tissues and natural teeth. Surprisingly, the nurses also felt less capable of assessing the domain 'pain', which also showed the lowest kappa in the study of Simpelaere et al. (2016) between three speech pathologists.
Another remarkable result was the negative kappas in the study of Lin et al. (1999) for lymph nodes, lips, tongue, and tissues. The authors attribute the negative kappa for lymph nodes to the fact that the research population did not show enlarged lymph nodes during the study [34]. However, no explanation was given for the other negative values. The literature states that a negative kappa can occur when the observed agreement between two raters is lower than the agreement expected by chance [42]. However, more information on the context of the study is needed to give a reliable explanation. The study of Dickinson et al. (2001) reported negative kappas for the items teeth and smell and supports the explanation of too little variety between the scores [19]. They therefore modified the THROAT by removing these items during further analysis.
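How too little variety between scores can produce a negative kappa despite high percent agreement can be shown with a small worked example. The counts below are hypothetical and are not taken from Lin et al. or Dickinson et al.; the function handles the simple case of a finding scored as present or absent by two raters.

```python
import math


def kappa_from_2x2(both_present, a_only, b_only, both_absent):
    """Unweighted Cohen's kappa from a 2x2 table; None if kappa is undefined."""
    n = both_present + a_only + b_only + both_absent
    if n == 0:
        raise ValueError("the table must contain at least one subject")
    p_observed = (both_present + both_absent) / n
    p_present_a = (both_present + a_only) / n
    p_present_b = (both_present + b_only) / n
    # Agreement expected by chance, given each rater's marginal proportions.
    p_expected = p_present_a * p_present_b + (1 - p_present_a) * (1 - p_present_b)
    if math.isclose(p_expected, 1.0):
        return None  # no variation in the scores: kappa cannot be computed
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical case: 30 residents, a rare finding flagged once by each rater,
# but in different residents.
rare_finding = kappa_from_2x2(both_present=0, a_only=1, b_only=1, both_absent=28)
print("percent agreement:", round(28 / 30 * 100, 1))
print("kappa:", round(rare_finding, 3))

# Hypothetical case: the finding never occurs, so both raters always score 'absent'.
print("kappa without any variation:", kappa_from_2x2(0, 0, 0, 30))
```

In the first case the raters agree on 28 of 30 residents, yet kappa is slightly negative because chance agreement is even higher when almost every score is 'absent'. In the second case kappa is undefined.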
As far as we know, this is the first systematic review that critically appraised the methodological quality of studies investigating the measurement properties of oral health assessments for use by non-dental healthcare professionals. When the methodological quality of the studies is lacking, the validity and reliability of the outcomes remain unclear [16]. Therefore, the methodological quality of each measurement property per study was assessed first. For this purpose, we used the COSMIN checklist with a 4-point scale [24]. Although recent updates of COSMIN have been published, we chose to use the former version. The updated COSMIN was developed specifically for Patient-Reported Outcome Measures (PROMs), with good content validity as a conditional step for the further assessment of other measurement properties [43], while the 2012 version that we used focusses in a more general context on measurement properties of measurement instruments/assessments and is therefore better suited to our objective.
However, even the COSMIN version of 2012 led to some discussion points in our study. Although developed for assessing measurement properties in a more general context, this version of COSMIN strongly emphasizes the involvement of the target population (patients) in developing a measurement instrument. As a result, content validity scored poor overall on methodological quality because none of the included studies involved patients in developing the oral health assessment [44]. Nevertheless, we question how much weight the input of patients should carry in the development of an oral health assessment that is used by non-dental healthcare professionals. The input of experts and non-dental healthcare professionals might, in this case, be more valuable. The included studies often consulted experts and non-dental healthcare professionals in the development of oral health assessments. Therefore, we think that the rating of poor methodological quality with the COSMIN on this item should be interpreted with reservations.
Regarding terminology, we noticed that "validity" and "reliability" are not used consistently in the included studies. We sometimes found mixed terminology for intra-rater reliability and test-retest reliability: a study described intra-rater reliability while also stating a time interval before the second assessment. In such cases, the term test-retest reliability would have been more appropriate.
In addition, some studies compared a dental professional with non-dental healthcare professionals to assess criterion validity, while other studies referred to this comparison as inter-rater reliability. For inter-rater reliability, a non-dental healthcare professional was often compared to a dental care professional as the reference-rater. For criterion validity, the dental professional was referred to as the "gold standard". The purpose of investigating criterion validity is to compare the investigated instrument/assessment against a gold standard. However, no gold standard for oral health assessments exists. The OHAT and DHR were the only assessments in which the single items were assessed using several standardized criteria [17,29]. However, these indices are not reported as gold standards. Since the aim of the oral health assessment is not to diagnose oral diseases but to screen and triage, we consider a dental professional the expert in detecting oral problems, and we therefore scored the methodological quality of criterion validity as positive when a dental professional was used as "gold standard" (reference-rater).
Finally, the "worst score counts" method deserves a remark: some studies scored good or excellent on the majority of the items but poor on one single item, which resulted in a "poor" overall score. For example, the study of  scored poor on the validity items because of the small sample size, while all other items scored good/excellent. This makes the method very strict in its overall score, which should be taken into account when a study is described as having "poor" methodological quality.
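The strictness of this rule can be made concrete with a minimal sketch. It assumes the four COSMIN rating levels (poor, fair, good, excellent); the item ratings and the function name `worst_score_counts` are hypothetical and serve only as illustration.

```python
# Rating levels of the 4-point scale, ordered from worst to best.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def worst_score_counts(item_ratings):
    """Return the overall rating: the lowest rating among all items."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # The item with the lowest position in RATING_ORDER determines the result.
    return min(item_ratings, key=RATING_ORDER.index)


# Hypothetical study: four items rated good or excellent, one item
# (for example, sample size) rated poor.
ratings = ["excellent", "good", "excellent", "poor", "good"]
overall = worst_score_counts(ratings)
```

Here `overall` is "poor" even though four of the five items were rated good or excellent, which is why a "poor" overall score does not necessarily mean that a study was weak on every aspect.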

Recommendations for researchers, policymakers, and users
Based on our findings, we recommend more research on the validity and reliability of the existing oral health assessments. This should be done in studies with good methodological quality as defined by COSMIN. As a first step, there should be consensus about the content of oral health assessments performed by non-dental healthcare professionals. Relevant stakeholders should determine which items distinguish a "healthy" from an "unhealthy" mouth. The FDI is working on a standardized set of oral health measures that could be used as background information and be adapted for this specific purpose (oral health assessment by non-dental healthcare professionals) [45]. In addition, when conducting research on the measurement properties, a proper distinction should be made between testing validity and testing reliability, and adequate statistical methods and analyses should be used. Furthermore, when investigating criterion validity, it is recommended to investigate the individual items of an oral health assessment using standardized criteria like the Mucosal Plaque Index and OHI-S, WHO oral lesions categories, Rise denture assessment and NIDR tooth status as conducted by  and Fjeld et al. (2007) [17,29]. Since research on validity and responsiveness requires "gold standards", which are not available for all aspects of oral health, we recommend research on the standardization of oral health measures and on the possibility of developing gold standards. Finally, when new oral health assessments for non-dental healthcare professionals are developed, we recommend using the COSMIN guideline to minimize methodological flaws and develop highly reliable and valid oral health assessments [46].
Policymakers should take into account the level of education and proper training of the healthcare workers when implementing an oral health assessment. Training in using an oral health assessment might not be sufficient, as there is a need to improve the oral health knowledge of non-dental healthcare professionals in general [47]. Several studies concluded that non-dental healthcare professionals lack knowledge about oral health [1,47-49]. A literature review concluded that educational programs that were delivered and regularly reinforced by a dental hygienist and that used several teaching formats were most effective in improving the oral health of patients [47]. Therefore, we recommend that a dentist or a dental hygienist be involved during the implementation of oral health assessments of older people, providing continuous training and feedback to support non-dental healthcare professionals.
For non-dental healthcare professionals, we recommend taking into account the objective of assessing the oral cavity when choosing an oral health assessment. When screening, triage or a decision on referral to a dental professional is the main objective, the OHAT (prior BOHSE) and ROAG could be suitable. However, other oral health assessments could also be relevant when: (1) the assessment is part of a general geriatric assessment (MPS); (2) the oral health assessment is for a specific patient group (THROAT); (3) only oral hygiene will be evaluated (DHR); or (4) the objective of the assessment is to give an indication of the oral health situation and to set up an oral health care plan for patients in a specific setting (ROAG, OAS).

Conclusion
In this systematic review, several oral health assessments have been evaluated on their measurement properties. Most studies suffer from methodological shortcomings (according to the COSMIN criteria). To increase the methodological quality of oral health assessments, and facilitate the investigation thereof in future research, standardization of oral health assessment is required.
Taking into account the scarce evidence for the proposed oral health assessments, the OHAT and ROAG are the most complete in their included oral health items (including triage and referral to a dental professional when needed), and their studies are of the best methodological quality in combination with a positive quality assessment on validity and reliability. Moreover, the OHAT has been most comprehensively investigated on its measurement properties. When choosing an oral health assessment, non-dental healthcare professionals should take such evidence into account. However, when using these oral health assessments, one must realize that to date their evidence base is rather limited. Policymakers should be aware of the methodological limitations of the existing assessments when implementing them in healthcare and should provide sufficient education for their users.