1. Introduction
Psychometric instruments are built to assess psychological constructs that cannot be operationally defined and, consequently, cannot be objectively assessed, such as multidimensional constructs that, according to [1], consist of a number of interrelated attributes or dimensions and exist in multidimensional domains. To develop a psychometric instrument that assesses a multidimensional construct, a set of items is developed for each of its dimensions, so that the construct is assessed as a whole. The validation process of such an instrument must guarantee that each item assesses its dimension correctly, according to desirable characteristics such as reliability and trustworthiness [2].
As psychometric instruments play an important role in research in psychology and education, they must be thoroughly developed and validated, so that their application does not yield erroneous results. The validity of an instrument is divided into four categories: predictive validity, concurrent validity, construct validity and content validity. The first two may be considered together as criterion-oriented validation processes [3]. Predictive validity is studied when the instrument assesses a construct correlated with the criterion, providing a prediction for it, and concurrent validity is studied when the instrument is proposed as a substitute for another [3]. The study of construct validity is necessary when the result of the instrument is the measure of an attribute or characteristic that is not operationally defined, so an instrument is valid when it is possible to determine which construct accounts for the variance in performance on it. Furthermore, content validity is established by showing that the instrument items are a sample of a universe in which the investigator is interested; it is ordinarily established deductively, by defining a universe of items and sampling systematically within this universe to build the instrument [3]. Another definition of content validity is the degree to which elements of an assessment instrument are relevant to and representative of the targeted construct for a particular assessment purpose [2]. See [4] for more details on instrument validity and [5,6] for practical examples of validation processes.
A list of thirty-five procedures for content validation was proposed by [2]. Among these procedures are matching each item to the dimension of the construct that it assesses and requesting the judgement of specialists in the construct, also called judges, about the developed items. Carrying out these procedures is imperative to verify whether the developed items are a sample of the universe that the instrument aims to assess. These procedures, which are components of the theoretical analysis of items, are subjective, as they rely on the personal opinions of specialists and researchers. Indeed, the theoretical analysis of items is done by judges and aims to establish the comprehension of the items (semantic analysis) and their pertinence to the attribute that they propose to assess.
This paper proposes a nonparametric statistical approach, based on Cochran’s Q test, to the content analysis of items, in order to assess its consistency and reliability. Our approach does not seek to establish the validity of the instrument, but rather to assess the consistency of the content analysis process, so that its verdict on the instrument may be trusted. Thus, this approach must be applied alongside other instrument validation methods, quantitative and qualitative, e.g., semantic analysis, pretrial and factorial analysis, in order to ensure the reliability, consistency, validity and trustworthiness of psychometric instruments.
2. Method
The researcher, supported by the theory of the construct that the instrument aims to assess, develops m items and assigns to each item a theoretical dimension, according to the theory and/or the researcher’s opinion about which dimension the item assesses. Although the items and their dimensions have theoretical foundations, it is necessary to test them in order to determine whether every item indeed assesses the dimension it is supposed to. To carry out such a test, the items are sent to s specialists in the construct, so that they may judge the items according to the dimension they assess. The items should be sent to at least six specialists and should be presented to them in a random order and without their theoretical dimensions, so that their judgement is not biased.
A condition for an item to be excluded from the instrument is determined based on the judgement of the specialists. This condition must exclude the items that do not belong to the universe that the instrument aims to assess, so that the items that are not excluded are a sample of such a universe. A possible way to proceed is to determine a Concordance Index (CI) that states that all items for which less than c% of the specialists agree on the dimension they assess must be excluded. One may also take the Content Validity Ratio (CVR), as proposed by [7], as a condition to exclude items that do not belong to the universe that the instrument aims to assess.
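The exclusion rule given by the CI can be illustrated with a short sketch. The function name `concordance_filter`, the data layout (one row per item, one entry per specialist) and the example judgements are assumptions made for illustration only; they do not come from the instrument studied in this paper.

```python
from collections import Counter


def concordance_filter(judgements, c=50.0):
    """Return the indices of the items kept by a CI of c percent.

    judgements[l][j] is the dimension assigned to item l by specialist j.
    An item is excluded when fewer than c percent of the specialists
    agree on a single dimension.
    """
    kept = []
    for index, row in enumerate(judgements):
        # Number of specialists behind the most frequently chosen dimension.
        top_count = Counter(row).most_common(1)[0][1]
        if 100.0 * top_count / len(row) >= c:
            kept.append(index)
    return kept


# Hypothetical judgements of three items by six specialists (dimensions P, J, T).
example = [
    ["P", "P", "P", "P", "J", "T"],  # 4 of 6 agree: kept
    ["P", "P", "J", "J", "T", "T"],  # at most 2 of 6 agree: excluded
    ["J", "J", "J", "J", "J", "T"],  # 5 of 6 agree: kept
]
print(concordance_filter(example))  # [0, 2]
```

With an even number of specialists and c = 50, two dimensions could tie at exactly half of the votes, so a stricter threshold may be preferable in that case.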
The method to be developed in this paper aims to determine whether all specialists have the same capability to judge the items according to their dimensions, through the analysis of their judgement about the items that were not excluded by the established condition. However, the method does not rank the specialists according to their capabilities, but only determines if all specialists have the same capability, so it is not possible to determine the specialists with low capability.
On the one hand, if there is no evidence that the capabilities of the specialists are different, their judgement is accepted and the items that are not excluded by the established condition are used in the next steps of the instrument validation process. Indeed, if all specialists have the same capability, it may happen that they are all highly capable or almost incapable of judging the items, though the proposed method is not able to differentiate between the two cases. Nevertheless, the two scenarios may be differentiated by a qualitative analysis of the specialists’ judgement, by observing if they agree with the theoretical dimension of the items and, when they do not, if there is some theory that supports their choice. Therefore, if their collective judgement is consistent with some theory, then the specialists may be regarded as all being highly capable of judging the items, given that they all have the same capability to judge them.
On the other hand, if it is determined that the specialists do not all have the same capability to judge the items, then at least one specialist is less capable of judging them than the others, which may bias the validation of the instrument. For such a scenario, we propose two approaches to avoid a biased validation process. First, the specialists’ judgement may be disregarded and a new group of specialists may be requested to judge the items. However, this approach may be impractical, as time and resources may be too limited to repeat the cycle of specialists’ judgement more than once. Second, and much more practical, the proposed method may be applied to all subgroups of at least six specialists from the original group, and the judgement of a subgroup for which there is no evidence that the capabilities differ may then be chosen. This approach is presented in more detail in the application section.
3. Notation and Definitions
Let C be a construct divided into n dimensions and U be the universe of all the items that assess the dimensions of C. A set I of m items is developed based on the theory about C and then a subset I* of I, which we believe to be a subset of U, is determined by the following process.
Denote
a set of
s specialists and let
be the dimension that the item
assesses. Let the random variables
, defined on
, be so that
if the specialist
judged the item
at the
kth dimension of
C (in the following, we have that
l goes from 1 to
m and that
j goes from 1 to
s). Note that if
and
, then the specialist
judged the item
correctly. The capability of the specialist
to judge the items is defined as
in which
and
. In the proposed approach, denoting
the length of
, we are interested in developing a hypothesis test to determine if
, i.e., if all specialists have the same capability to judge the items.
For this purpose, let a random sample of the judgement of the specialist
about the items of
I be given by
and let
be the space of all possible random samples
. Define the random sets
as
in which
is the indicator function of the set
A. Note that
is the set containing the number of the dimensions in which the majority of the specialists judged the item
. Given a random sample
and a subset
of items, the set
, determined from the sample values
, is a random sample of
.
The subset
may be defined by a condition function, a function of the sample
, given by
, in which
is the power set operator. The condition function must be such that if
is determined from
and
, then the length of
is one,
. The
CI for
and the CVR are condition functions. As the CI may be obtained from other concordance indexes, such as the Content Validity Index (CVI), which is used to measure concordance when the construct is unidimensional and the task of the specialists is to judge the item’s relevance [8,9,10,11], the method developed in this paper may also be applied in other scenarios. From now on, it is supposed that the condition function may be expressed as a CI.
The condition function is based on the assumption that an item is in the universe of items that assess the construct of interest if the majority of specialists agree on the dimension it assesses. Of course, one may take a different criterion to exclude the items that do not assess the construct of interest, although our method may be applied only if the criterion can be expressed as a condition function, as it is based on the fact that is a univariate random variable.
Finally, define
as the random variable that indicates if the specialist
judged the item
at the same dimension as the majority of the specialists. Given a random sample
and a subset
of items, the set
, determined from the sample values
, is a random sample of
.
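The construction of the majority dimension of each item, and of the indicator of whether each specialist agrees with it, can be sketched as follows. The function name `agreement_table` and the example data are hypothetical; the sketch assumes that the condition function has already removed the excluded items, so that each remaining item has a single most frequent dimension.

```python
from collections import Counter


def agreement_table(judgements):
    """Build the 0/1 table of agreement with the majority.

    judgements[l][j] is the dimension assigned to kept item l by specialist j.
    Entry [l][j] of the result is 1 when specialist j judged item l at the
    dimension chosen by the majority of the specialists, and 0 otherwise.
    """
    table = []
    for row in judgements:
        # Most frequent dimension; assumed unique for every kept item.
        majority = Counter(row).most_common(1)[0][0]
        table.append([1 if dimension == majority else 0 for dimension in row])
    return table


# Hypothetical judgements of two kept items by six specialists.
kept_items = [
    ["P", "P", "P", "P", "J", "T"],
    ["J", "J", "J", "J", "J", "T"],
]
print(agreement_table(kept_items))
# [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 0]]
```

Only the agreement with the majority enters the table; the true dimension of an item is never used, in line with the fact that it is unknown.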
On the one hand, whilst we observe the values of the random variables , we do not know if the specialists judged the items correctly or not, as the dimension that an item really assesses (if any) is unknown. Therefore, it is not possible to differentiate the specialists by the number of items they judged correctly, for example. On the other hand, from the random variables , we know the concordance of the specialists on the judgement of the items, which gives us a relative measure of their capability to judge the items. Therefore, we are able to test if all the specialists have the same capability to judge the items by applying the Cochran’s Q test, although we cannot determine the capability of each one.
4. Assumptions
The development of the items and the judgement of the specialists must satisfy two assumptions so that the method presented below may be applied:
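Assumption 1. Each item in I* assesses one, and only one, dimension of C.
Assumption 2. The random variables that describe the judgement of each specialist about each item are independent.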
Assumption 1 establishes that the items that were not excluded by the condition function, i.e., the items in , are well constructed and assess only one dimension of C, while Assumption 2 imposes that the specialists judge the items independently of each other and that the judgement of a specialist about one item does not depend on his judgement about any other item. These assumptions are not strong, as it is expected that they are satisfied if the items were well constructed. Indeed, the better the condition function in determining what items are not in U, the better the quality of the items in . Therefore, the assumptions above are closely related to the condition function. If, in fact, , then the first assumption is immediately satisfied, as there is no intersection between two dimensions of a construct, and the second assumption may also hold, as the items are well defined.
5. Mathematical Deduction
Given a random sample
, it is not trivial to estimate the capabilities
, as the dimension that each item assesses is unknown. Examining such a random sample, it is known that the specialist
judged the item
at the dimension
, but it is not possible to determine, with probability 1, if he judged such an item correctly. Therefore, the problem is, given a random sample
, to determine random variables that allow us to test if the capability of all the specialists is the same. It will be shown that if the random variables
are not identically distributed
, then the specialists do not all have the same capability to judge the items. Indeed, in order to test if the capability of all specialists is the same, we consider the following null hypotheses:
Of course, we are only interested in testing the first part of , that refers to the capability of the specialists, i.e., that all specialists have the same capability to judge the items. However, the second part is needed to develop a test statistic for . It will be argued that for great values of , the hypothesis that is actually being tested is the first one.
The propositions below set the scenario for the nonparametric test, i.e., the Cochran’s Q test, that is used to test .
Proposition 1. The random variables are independent , but the random variables are dependent .
Proof. On the one hand, the random variables are each, by assumption 2, function of independent random variables, therefore they are independent. On the other hand, note that , for at least of the specialists must agree on the dimension that an item in assesses, which establishes a dependence. ☐
Proposition 2. Under , the random variables are identically distributed for all .
Proof. Now let
and
, be independent random variables, and let
, in which
c is the
CI. Then, under
,
and
Hence,
which does not depend on
and the result follows. ☐
It is important to note that if all are approximately 1, then and the hypothesis that is actually being tested is the first part of . Therefore, it is reasonable to test in order to determine if all the specialists have the same capability to judge the items, as, if it is indeed true, we expect that all are great and the second part of will hardly lead to the rejection of when the capability is the same.
This test may be used as a diagnostic for the content analysis of items. If is not rejected, then there is no evidence that the capabilities of the specialists are different. However, if is rejected, we do not know if it is the first or the second part (or both) of that is not being satisfied by the judgement of the specialists. Nevertheless, we may disregard their judgement in any case, as either their capability is not the same or they are the same, but some are small, which led to the rejection of by its second part.
6. Hypothesis Testing
The Cochran’s Q test may be applied to the random sample
determined from
as a way to test
[
12]. The assumptions of the Cochran’s Q test, using the notation of this paper, are as follows:
- (a) The items of I* were randomly selected from the items that form the universe U that the instrument aims to assess.
- (b) The random variables are dichotomous.
- (c) The random variables are independent.
The Cochran’s Q test is used in applications in which treatments are applied independently to blocks (subjects) and the result of each treatment application is either a success or a failure (zero or one) [
13]. In our case, we have that the items may be seen as the
blocks and the specialists as the
treatments. What the Cochran’s Q test evaluates is if the treatments are all equally effective or, in our case, if the specialists are all equally capable of judging the items (which is equivalent to testing if the random variables
are identically distributed for all
). Therefore, if we reject the null hypothesis, we conclude that
are not identically distributed for all
and, by Proposition 2,
is also rejected. Thus, the hypothesis tested by the Cochran’s Q test is indeed
.
The statistic of the test is calculated from
Table 1, in which
, and may be expressed as
The exact distribution of the Q statistic may be calculated by the method presented by [14], although a large sample approximation may be used instead. If the number of items that were not excluded is large, then the distribution of Q is approximately chi-squared with s − 1 degrees of freedom [13].
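As an illustration of the test, the sketch below computes the Q statistic in its usual textbook form, Q = (k − 1)[k ΣC_j² − N²] / (kN − ΣR_i²), in which k is the number of treatments (specialists), C_j are the column totals, R_i are the row totals (items) and N is the grand total, together with the large sample chi-squared p-value. These symbols, the function names and the example table are assumptions of this sketch and need not match the notation of Table 1.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-squared variable with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-z)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        tail += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return tail


def cochran_q(table):
    """Return (Q, p-value); rows are items (blocks), columns are specialists."""
    k = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    total = sum(row_totals)
    denominator = k * total - sum(r * r for r in row_totals)
    if denominator == 0:
        # Every row is all zeros or all ones, so Q is undefined; reporting
        # "no evidence of a difference" is a convention of this sketch.
        return 0.0, 1.0
    q = (k - 1) * (k * sum(c * c for c in col_totals) - total ** 2) / denominator
    return q, chi2_sf(q, k - 1)


# Hypothetical agreement table: six items (rows) and three specialists (columns).
table = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
]
q, p = cochran_q(table)
print(round(q, 2), round(p, 4))  # 7.6 0.0224
```

Six items are far too few for the chi-squared approximation to be reliable; the example only shows the arithmetic.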
It is worth mentioning that the random variables being identically distributed for all does not imply that all the specialists have the same capability to judge the items, although there is no evidence that their capabilities are different. If there is no evidence that the capabilities of the specialists to judge the items are different, their judgement may be accepted.
If it is determined that these random variables are not identically distributed across the specialists, then the judgement of the specialists is disregarded, as the null hypothesis is rejected. The items may be judged by different groups of specialists until they are judged by one for which the null hypothesis is not rejected. These groups may be formed by new specialists or may be a subgroup of at least six of the specialists for which the null hypothesis was rejected.
7. Simulation Study
As the Cochran’s Q test is not a powerful one, i.e., its Type II error may be too great, a simulation study is conducted to estimate its power in some specific cases. The power of a statistical test is defined as the probability of the null hypothesis being rejected when it is false, and it depends on the real scenario, i.e., on the real values of the parameters considered in the null hypothesis. Therefore, the power of the Cochran’s Q test depends on the real capability of each specialist to judge the items, so the simulation study considers 10 distinct scenarios and is conducted as follows.
For each scenario, we simulate 50,000 judgements of the same items by the specialists and then determine the proportion of the simulations in which the null hypothesis was rejected at a significance level, i.e., Type I error, of 5%. This proportion is regarded as an estimate of the power of the test in the considered scenario. A CI of 50% is used to determine the items that are not excluded in each simulation. The results of all 10 scenarios provide a wide picture of the power of the test, showing for which scenarios it is more powerful.
We consider in all scenarios nine specialists judging 30 items into three dimensions; this is the framework of the application in the next section. We also consider that the capability of each specialist is the same for all items, i.e., that
for all
and
. Finally, we assume that
for all
,
and
. A pseudocode for the simulation of each scenario is presented in Algorithm 1. The scenarios and their estimated test power are displayed in
Table 2.
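Algorithm 1 Pseudocode that estimates the power of the Cochran’s Q test under a given scenario from 50,000 simulated judgements.
Ensure: the capability of each of the nine specialists is given by the scenario; rejected ← 0
for simulation ∈ {1,...,50,000} do
    for each of the 30 items do
        for each of the nine specialists do
            simulate the dimension at which the specialist judges the item from a Multinomial distribution *
        end for
    end for
    Determine I* as the items such that at least 5 specialists agree on the dimension they assess
    for each item in I*, determine the dimension chosen by the majority of the specialists
    for each item in I* and each specialist, record whether the specialist judged the item at the majority dimension
    compute the Q statistic from the recorded agreements
    compute the p-value of the test
    if the p-value is below 0.05 then
        rejected ← rejected + 1
    end if
end for
return rejected/50,000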
* In scenarios 1 to 8. In scenarios 9 and 10 the Multinomial has parameter in which is simulated from a uniform distribution with range . |
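A runnable sketch of the simulation for scenarios with a fixed capability per specialist is given below. Several choices are assumptions of this sketch and are not taken from the paper: the function names, the assignment of true dimensions to items (cycling through the three dimensions), the rule that a specialist who errs picks one of the wrong dimensions with equal probability, the convention for an undefined Q, and the two capability vectors, which are illustrative and are not the scenarios of Table 2. The number of simulations is kept small so that the example runs quickly.

```python
import math
import random
from collections import Counter


def chi2_sf(x, df):
    """Upper tail probability of a chi-squared variable with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-z)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        tail += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return tail


def cochran_p_value(table):
    """Large sample p-value of Cochran's Q (rows: items, columns: specialists)."""
    if not table:
        return 1.0  # no item survived the CI; treated as "not rejected"
    k = len(table[0])
    cols = [sum(row[j] for row in table) for j in range(k)]
    rows = [sum(row) for row in table]
    total = sum(rows)
    denominator = k * total - sum(r * r for r in rows)
    if denominator == 0:
        return 1.0  # Q undefined; treated as "not rejected" in this sketch
    q = (k - 1) * (k * sum(c * c for c in cols) - total ** 2) / denominator
    return chi2_sf(q, k - 1)


def estimate_power(capabilities, n_items=30, n_dims=3, ci=0.5,
                   alpha=0.05, n_sim=1000, seed=0):
    """Proportion of simulated judgements in which the test rejects."""
    rng = random.Random(seed)
    s = len(capabilities)
    rejected = 0
    for _ in range(n_sim):
        table = []
        for item in range(n_items):
            true_dim = item % n_dims  # hypothetical true dimension of the item
            wrong = [d for d in range(n_dims) if d != true_dim]
            # Each specialist is right with his capability, else errs uniformly.
            row = [true_dim if rng.random() < p else rng.choice(wrong)
                   for p in capabilities]
            majority, count = Counter(row).most_common(1)[0]
            if count >= ci * s:  # with nine specialists: at least 5 agree
                table.append([1 if d == majority else 0 for d in row])
        if cochran_p_value(table) < alpha:
            rejected += 1
    return rejected / n_sim


# Illustrative capability vectors for nine specialists (not those of Table 2).
equal_high = [0.9] * 9
mixed = [0.9] * 7 + [0.3] * 2
power_equal = estimate_power(equal_high, n_sim=300)
power_mixed = estimate_power(mixed, n_sim=300)
print(power_equal, power_mixed)
```

When all specialists share the same capability, the estimate is a rejection rate under equal capabilities rather than a power, and it is expected to stay near the significance level.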
On the one hand, we see in Table 2 that the power of the test is great when the majority of the specialists have the same high capability, while few specialists have a low capability, as is the case for scenarios 1, 2 and 3. This is also the case for scenario 8, in which the specialists have different capabilities and there are specialists whose capability is very low. On the other hand, the power of the test is quite low when some of the specialists have the same high capability, and the specialists with lower capability are almost as capable as them, as is the case for scenarios 4, 5 and 6.
In scenario 7, we see that the power of the test is low when the majority of specialists have the same low capability (0.3 in this case). This happens because the specialists hardly agree on the dimension that each item assesses (as some of them are not capable), so many items are excluded by the CI and, for the items that remain, the less capable specialists agree with the highly capable ones, so it seems that they have high capability. Indeed, in scenario 7, the mean number of items that are not excluded is the lowest of all scenarios, so a low concordance among the specialists is evidence of the existence of specialists of low capability, given that the items were well constructed.
Finally, as pointed out in the Mathematical Deduction section, we see in scenarios 9 and 10 that the hypothesis that is actually being tested when all the specialists are highly and equally capable is the first part of the null hypothesis, as the power of the test is close to the Type I error, which must be the case if the hypothesis is true.
The simulation study sheds light on some interesting facts about the proposed method in the considered scenarios. On the one hand, if the majority of the specialists have a homogeneous high capability and few specialists have a very low capability, or if the capability of the specialists is highly heterogeneous, then the power of the test is great. However, if the specialists all have high, but different, capabilities, then the power of the test is low. On the other hand, if the majority of the specialists have a low capability, then a great number of items is excluded by the CI and, given that the items were well constructed, we may conclude that the specialists have a low capability of judging the items, even though the power of the test is low. Finally, if only the first part of the null hypothesis is satisfied, and the capability of the specialists is high, then the power of the test is low and, therefore, the hypothesis that is really being tested is the first part of the null hypothesis.
8. Application: Perception about the Evaluation of the Teaching-Learning
In this section, we apply the developed method to a real validation process, in order to analyse the content of items of an instrument that aims to assess the perception of teachers and students of higher education institutions about the teaching-learning process; this is a construct that may be divided into three dimensions: process (P), judgement (J) and teaching-learning (T).
The evaluation of teaching-learning is a process, as it must have a well defined beginning, middle and end and must have a continuous, cumulative and systematic character. Indeed, it is a systematic mechanism for gathering information over time, with well defined levels, which characterises it as a process. Also, the evaluation of teaching-learning has a judgement dimension because it must issue a judgement of value or assign a score through the analysis of educational results obtained from the information gathered over time. Finally, the evaluation of teaching-learning has a teaching-learning dimension because, as indicated by its own name, it must not only evaluate the learning, but also the teaching: it should not only evaluate what the student has learnt, but also what the teacher has taught. Therefore, the evaluation of teaching-learning is a process of data gathering, in which an individual judges or is judged according to the teaching-learning.
In order to develop an instrument to assess this construct, 30 items were developed and sent to nine specialists; they would judge the items according to the dimension that, in their opinion, each one assesses. The condition defined for excluding an item is the
CI with
. The judgements of the specialists are presented in
Table 3; the table for the Cochran’s Q test is displayed in
Table 4 and a translation of the items, which were originally constructed in Portuguese, is presented in
Appendix A.
The statistic of the Cochran’s Q test for the data in
Table 4 is
and the test
p-value is
, so there is no evidence that
is not true, at a significance of 5%. Furthermore, as the majority of the specialists agreed on the dimension that 24 out of 30 (80%) items assess, we also do not have evidence that the capability of the specialists is low. Therefore, based on the proposed method, there is no reason to disregard the judgement of the specialists.
Nevertheless, in order to illustrate the proposed approach for the case in which the null hypothesis is rejected, we apply the test to every subgroup of at least six specialists, which amounts to 130 subgroups (including the full group of nine), and see for which subgroups there is no evidence that the capabilities of the specialists differ. Of the 130 subgroups, the null hypothesis was rejected for 29 at a significance level of 5%. The Q statistic and the p-value for the 10 groups with the greatest p-values are displayed in Table 5. If the null hypothesis had been rejected for the group of nine specialists, we could then look for a subgroup of these specialists for which it is not rejected and, with the help of a qualitative analysis, choose such a subgroup instead of disregarding their judgement as a whole and sending the items to other specialists to judge.
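The enumeration of subgroups can be sketched as below. The minimum size of six is an inference of this sketch: it matches the recommendation of at least six specialists in the Method section and reproduces the reported count, since C(9,6) + C(9,7) + C(9,8) + C(9,9) = 84 + 36 + 9 + 1 = 130. The function names `subgroups` and `restrict` are hypothetical.

```python
from itertools import combinations


def subgroups(n_specialists, min_size=6):
    """Yield every subgroup, as a tuple of specialist indices, of at least min_size."""
    for size in range(min_size, n_specialists + 1):
        yield from combinations(range(n_specialists), size)


def restrict(judgements, group):
    """Keep only the columns (specialists) of the judgement matrix listed in group."""
    return [[row[j] for j in group] for row in judgements]


groups = list(subgroups(9))
print(len(groups))  # 130
```

For each subgroup, the exclusion of items and the test would then be rerun on the restricted judgement matrix, as the majority dimension of an item may change when specialists are removed.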
9. Final Remarks
The Cochran’s Q test is not a powerful one, so the method must be used with caution. The validation of a psychometric instrument is a process that comprises various procedures, therefore it must not be restricted to content analysis of items and the method developed in this paper. It is important to apply other validation techniques, both qualitative and quantitative, to the instrument to properly validate it.
The method may be improved in order to further decrease the subjectivity of the content analysis of items, especially by the development of more powerful tests and the definition of other random variables that enable the comparison between the judgements of the specialists. This paper does not exhaust the subject, but presents a nonparametric statistical approach that aims to decrease the subjectivity of a subjective process and that may be applied not only to the content analysis of items, but also to any statistical application that enables the definition of variables such as those of this paper.