applied sciences
Article

Cognitive Diagnostic Assessment in University Statistics Education: Valid and Reliable Skill Measurement for Actionable Feedback Using Learning Dashboards

Lientje Maas 1,*, Matthieu J. S. Brinkhuis 2, Liesbeth Kester 3 and Leoniek Wijngaards-de Meij 1

1 Department of Methodology & Statistics, Utrecht University, 3584 CH Utrecht, The Netherlands; firstname.lastname@example.org
2 Department of Information and Computing Sciences, Utrecht University, 3584 CC Utrecht, The Netherlands; email@example.com
3 Department of Education, Utrecht University, 3584 CS Utrecht, The Netherlands; firstname.lastname@example.org
* Correspondence: email@example.com

Abstract: E-learning is increasingly used to support student learning in higher education, facilitating administration of online formative assessments. Although providing diagnostic, actionable feedback is generally more effective, in current practice, feedback is often given in the form of a simple proportion of correctly solved items. This study shows the validation process of constructing detailed diagnostic information on a set of skills, abilities, and cognitive processes (so-called attributes) from students' item response data with diagnostic classification models. Attribute measurement in the domain of statistics education is validated based on both expert judgment and empirical student data from a think-aloud study and large-scale assessment administration. The constructed assessments provide a valid and reliable measurement of the attributes. Inferences that can be drawn from the results of these formative assessments are discussed, and it is demonstrated how this information can be communicated to students via learning dashboards to allow them to make more effective learning choices.

Keywords: online formative assessment; cognitive modeling; skill measurement; validation; reliability; learning dashboards; university statistics education

Citation: Maas, L.; Brinkhuis, M.J.S.; Kester, L.; Wijngaards-de Meij, L. Cognitive Diagnostic Assessment in University Statistics Education: Valid and Reliable Skill Measurement for Actionable Feedback Using Learning Dashboards. Appl. Sci. 2022, 12, 4809. https://doi.org/10.3390/app12104809

Academic Editor: Donato Cascio
Received: 14 April 2022; Accepted: 6 May 2022; Published: 10 May 2022

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

In higher education, increased autonomy is required from students because external regulation is generally limited. Students need to rely on their own evaluation of performance to regulate their learning processes. Unfortunately, not all students can accurately evaluate their performance, which can result in poor learning choices. The use of e-learning environments in higher education has emerged over the last decades, facilitating administration of online formative assessments that can help students interpret the meaning of their past performance and determine what they should be studying and practicing to increase their performance [4,5].

In online formative assessments, students answer assessment items and receive feedback based on their responses via learning dashboards. In current practice, this feedback often consists of information regarding correctness of responses and percentage correct scores. However, even if students know the correct answer to an item and can compare it with their given answer, this does not imply that they can accurately infer which knowledge or skills are lacking. For example, prior research found that students find it difficult to determine which features of statistics items are relevant for learning. Furthermore, the interpretation of percentage correct scores can be problematic, since these scores cannot be directly compared across different item sets and differences in percentage scores cannot be used to express change. It would be beneficial to provide students with more detailed, diagnostic feedback regarding their knowledge and skills. Diagnostic feedback is information about how students' current levels of performance relate to desired levels of performance, indicating whether or not they have mastered the skills that are required to solve certain tasks. This allows students to determine where they should focus their attention and effort; hence, this information is actionable.

In order to provide diagnostic feedback, valid and reliable measurement of students' skills is required, which is the focus of the current study. Diagnostic information can be extracted from students' item responses with cognitive diagnostic assessment, which brings together cognitive science and psychometrics to measure knowledge structures and processing skills in order to provide information about cognitive strengths and weaknesses.

1.1. Cognitive Diagnostic Assessment

In cognitive diagnostic assessment, a so-called cognitive model is specified that links understanding to performance. A set of skills, abilities, and cognitive processes required to solve certain items is defined; these are referred to as attributes. The objective of cognitive diagnostic assessment is to classify students as master or nonmaster of each attribute. This classification depends on both expert judgment and empirical evidence.
Domain experts encode which attributes are required to solve each item in the Q-matrix, a binary matrix that describes the relations between the items and the attributes by indicating, for each item, whether or not it measures each attribute. The Q-matrix, in combination with item response data from diagnostic assessments, enables the application of statistical models to classify students. Well-suited models to this end are diagnostic classification models (DCMs). Previous simulations have shown opportunities to apply these models to item response data from online formative assessments in higher education to obtain diagnostic information if sufficient numbers of students participate (i.e., 100–300 students to assess 3–6 attributes). Thus, in courses with large groups of students, cognitive diagnostic assessment with DCMs can be a valuable tool to obtain diagnostic feedback at low cost. It enables educational practitioners to design formative assessments that align with the learning objectives of a course, resulting in actionable feedback that aligns with these objectives.

1.2. Cognitive Model Specification

To design educational assessments, one first needs to define the desired results (known as 'backward design'). One identifies what will be measured, resulting in a cognitive model that consists of a set of attributes. As pointed out by Pellegrino et al. (p. 45), the attributes should "reflect the most scientifically credible understanding of typical ways in which learners represent knowledge and develop expertise in a domain". Attribute specification is ideally based on both expert judgment and research in cognitive and educational science.

1.2.1. Attribute Granularity

Diagnostic assessment can support the analysis of student behavior at various levels of detail by defining attributes at different granularity levels. The granularity level should be valuable from an educational perspective, in the sense that the resulting feedback is meaningful and actionable.
In addition, the attributes must be psychometrically measurable. As the number of attributes increases, the complexity of the measurement model increases as well, putting higher demands on the data structure. Further, one must be able to construct items that produce observable student behaviors at an appropriate level of specificity.

1.2.2. Item Construction and Q-Matrix Specification

Once the attributes are defined, items are constructed to measure them. It is crucial that the items truly measure the attributes that they are designed to measure, i.e., that students indeed apply (only) those attributes to solve the items. Further, it is essential that the Q-matrix is specified correctly in order to obtain accurate student classifications and, thus, high-quality diagnostic feedback [17,18]. However, establishing the Q-matrix based solely on expert judgment may be subjective and susceptible to specification errors. To address this issue, several qualitative and quantitative methods are available to validate the Q-matrix. Think-aloud studies can provide insight into the processing steps students use to solve the items: students are asked to verbalize their thoughts while solving items, which makes it possible to verify the relations between items and attributes and to modify items for better alignment between the two. In addition, assessments can be administered at large scale to gather empirical evidence regarding the extent to which the items elicit the attributes, using empirical Q-matrix validation methods to identify misspecified entries.

1.3. Diagnostic Assessment in Statistics Education

The current study focuses on the domain of statistics education to show how valid and reliable measurements of students' skills can be obtained with cognitive diagnostic assessment.
Introductory statistics courses are part of many educational programs at universities, yet students often struggle to understand the abstract concepts in this domain. Diagnostic assessment can support students in developing such conceptual understanding.

There has been some research related to attribute specification for statistics in higher education. These studies demonstrate that attribute specification in statistics education should not be based solely on expert judgment, since this may result in invalid attributes [24,25]. To our knowledge, the constructs of interest in assessment in university statistics education have not been extensively studied from a cognitive diagnostic perspective.

1.4. Current Study

In the current study, we construct and validate cognitive diagnostic assessments that can serve as a foundation for designing effective tools for diagnostic assessment of students' conceptual understanding of statistics and that can be used to obtain actionable feedback at low cost. Further, it is demonstrated how this information can be communicated to students via learning dashboards to allow them to make more effective learning choices. The following research question is addressed:

How can valid and reliable diagnostic information be obtained from online formative assessment to provide actionable feedback via learning dashboards in university statistics education?

We first identify which attributes students enrolled in introductory statistics courses in higher education need to master (Section 2). Next, we examine how these attributes can be measured with formative assessment to draw valid and reliable inferences about students' attribute mastery status. Items are constructed and a provisional Q-matrix is specified based on expert judgment (Section 3). Assessments are empirically validated using both qualitative and quantitative methods (Section 4).
We discuss what inferences can be drawn from the results and how these can be communicated to students (Section 5). We end with an overall discussion (Section 6). An overview of the current study is presented in Figure 1.

Figure 1. Outline of the current study: attribute identification (Section 2), assessment construction and Q-matrix specification (Section 3), qualitative and quantitative assessment validation (Section 4), and communicating feedback (Section 5).

2. Attribute Identification

2.1. Methods

In the attribute identification process, we first defined our domain of interest. We then specified learning objectives for this domain and subsequently defined attributes. The results were refined based on domain expert evaluation.

2.1.1. Domain Description

We specified a cognitive model of conceptual understanding of inferential statistics (in a frequentist framework) relevant to introductory nonmathematical statistics courses in higher education. The attributes focus on the principles behind null hypothesis significance testing (NHST), a widely used tool in statistical inference that requires students to understand and connect several complex concepts (e.g., variability, sampling distributions, p-values). We chose this domain because it is a central topic in many introductory statistics courses, and students often experience difficulty with it due to the large number of abstract concepts involved and the (counter-intuitive) logic of experimental design.

2.1.2. Defining Learning Objectives

Learning objectives were defined based on literature and course materials.
The literature study was based on four sources: a review of students' misconceptions in statistical inference; a book about the development of students' statistical reasoning; a paper about the development of instruments measuring conceptual understanding of statistics; and the Guidelines for Assessment and Instruction in Statistics Education College Report, which lists goals summarizing what students should know and understand after a first course in statistics, based on collective beliefs reflected in the statistics education literature. After the literature review, we inspected the course materials of (nonmathematical) introductory statistics courses to complement the specified learning objectives.

2.1.3. Defining Attributes

Based on the learning objectives, we aimed to define attributes that are both psychometrically measurable and pedagogically meaningful and actionable. In order to be measurable (i.e., to enable estimation of student attribute profiles), it is recommended to include at least five measurements of each attribute. Evaluating mastery of each individual learning objective would require administration of huge numbers of items and, moreover, feedback at such a fine-grained level may be less meaningful in a domain where students need to develop an understanding of the interrelations among concepts; therefore, we grouped learning objectives into attributes. We aimed to group learning objectives that address closely related concepts, presuming that students' skills regarding these learning objectives are strongly related. Consequently, concepts that are generally addressed within the same lectures, e-learning modules, or book chapters are encompassed by the same attribute, allowing for actionable feedback (e.g., by recommending relevant learning materials for nonmastered attributes).

2.1.4. Expert Evaluation

Four domain experts were consulted to evaluate the exhaustiveness, redundancy, and grouping of the learning objectives.
The experts had 13 to 29 years of experience in statistics education. Based on their feedback, the results were refined.

2.2. Results

The attribute identification process resulted in 9 attributes comprising 33 learning objectives. The attributes are summarized in Table 1, and more extensive results, including the learning objectives, are presented in Table A1 in Appendix A. In the Supplementary Material, a more detailed description of each attribute is provided, including references to empirical studies showing the relevance of the addressed concepts and the occurrence of misconceptions.

Table 1. The nine identified attributes.

A. Understanding center & spread
B. Interpreting univariate graphical representations
C. Graphically comparing groups
D. Understanding sampling variability
E. Understanding sampling distributions
F. Understanding the standard error
G. Understanding principles of hypothesis testing
H. Evaluating NHST results
I. Understanding and using confidence intervals

3. Assessment Construction and Q-Matrix Specification

3.1. Item Collection

Assessing conceptual understanding of statistics requires assessment techniques that reflect the nature of students' thinking. Although this is best achieved through one-to-one communication or by examining in-depth student work, items with a selected-response format can also be useful for gathering limited indicators of statistical reasoning if properly designed. This is demonstrated by instruments designed to measure conceptual understanding of statistics, such as the Statistical Reasoning Assessment (SRA), the Statistics Concept Inventory (SCI), the Comprehensive Assessment of Outcomes in Statistics test (CAOS), and the Assessment Resource Tools for Improving Statistical Thinking (ARTIST). These instruments consist of multiple-choice items that require thinking and reasoning rather than recalling definitions, computing, or using formulas.
In order to measure our attributes, we exploited these sources of existing items. Relying on these sources ensures that the items have been evaluated by experts in earlier research and/or education. Items were selected (and modified if needed) based on validated guidelines for multiple-choice item writing and design principles for statistical assessments (p. 139). The set of prototype items consisted of 59 items.

3.2. Q-Matrix Specification

The list of attributes and learning objectives was presented to three independent experts who had 4 to 6 years of experience in statistics education. The experts were not involved in the attribute identification phase of the study. For each item, they were asked to select which learning objective(s) are required to solve it. The expert ratings were combined by means of a majority vote and aggregated to the attribute level, since this is the level at which we want to make inferences about students. Pairwise inter-rater agreement, as indicated by fuzzy kappa, ranged from 0.41 to 0.73 at the learning objective level and from 0.79 to 0.91 at the attribute level. As expected, agreement is substantially lower at the learning objective level, which stems from the more fine-grained nature of the learning objectives. The lower agreement illustrates that Q-matrix specification based on expert judgment can be susceptible to errors and provides a rationale for the validation.

According to the combined ratings, all learning objectives are measured by the item set, indicating that all aspects of the attributes are measured (i.e., no construct underrepresentation). The aggregated ratings resulted in a provisional Q-matrix. In total, the item set consisted of 41 unidimensional items, 16 two-dimensional items, and 2 three-dimensional items.

4. Assessment Validation

Attribute measurement was validated based on a qualitative evaluation with a student think-aloud study and a quantitative evaluation of students' item response data.
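The majority-vote step described above, in which three binary expert ratings per item-attribute pair are combined into one provisional Q-matrix entry, can be illustrated with a minimal sketch. This is Python for illustration only (the study's analyses were conducted in R), and the ratings below are invented, not the experts' actual data:

```python
# Hypothetical example: three experts each rate, for 2 items and 3 attributes,
# whether the item measures the attribute (1) or not (0). An entry enters the
# provisional Q-matrix when at least two of the three experts selected it.
expert_ratings = [  # one inner matrix (items x attributes) per expert
    [[1, 0, 0], [0, 1, 1]],
    [[1, 0, 1], [0, 1, 0]],
    [[1, 0, 0], [0, 1, 1]],
]

def majority_vote(ratings):
    """Combine binary expert ratings entry-wise by majority rule."""
    n_items, n_attrs = len(ratings[0]), len(ratings[0][0])
    return [[int(sum(r[i][a] for r in ratings) >= 2) for a in range(n_attrs)]
            for i in range(n_items)]

print(majority_vote(expert_ratings))  # → [[1, 0, 0], [0, 1, 1]]
```

Aggregation to the attribute level would then mark an attribute as measured whenever any of its learning objectives received a majority vote for the item.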
4.1. Qualitative Evaluation

4.1.1. Methods

We let students verbalize their thoughts while solving the items during interviews to verify whether the attributes are a good representation of the skills students rely on when solving the items. The procedures are briefly described below; more extensive descriptions are included in the Supplementary Material.

Participants and Procedures

Participants were 8 university students who had recently participated in an introductory statistics course. Students were asked to rate their own performance in these courses on a continuum from basic to excellent. Two students rated their performance as basic, three students as good, and three students as very good. During the interviews, items were presented one by one, and students were asked to read them aloud and, subsequently, explain clearly how they would solve the item. We avoided follow-up questions, because these may evoke thinking patterns that differ from how students think about problems on their own. If necessary, clarifying questions were asked at the end of the interview to successfully ascertain the cognitive processes that the students demonstrated. Only a subset of the 59 items was presented to each student. On average, students solved 27.5 items (SD = 7.9, min = 16, max = 41). We ensured each item was answered by at least 3 students by presenting different subsets in different orders.

Analysis

The interviews were recorded and coded by two independent experts, who subsequently participated in conversations to reach consensus about students' attribute use (intercoder agreement as indicated by fuzzy kappa was 0.67). For items with (partly) justified answers, it was coded for each learning objective whether or not students showed (complete or incomplete) evidence of using it. Results were aggregated to the attribute level to evaluate students' attribute use for each item. We verified to what extent this corresponded to the provisional Q-matrix.
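The aggregation from learning-objective codes to attribute-level use can be sketched as follows. This is an illustrative Python sketch (not the study's coding pipeline); the grouping and the observed codes are invented, and an attribute counts as "used" when the student showed evidence of using any learning objective grouped under it:

```python
# Hypothetical grouping of learning objectives (d1, d2, e1, ...) under
# attributes D and E, plus invented think-aloud codes for one item.
attribute_objectives = {
    "D": ["d1", "d2"],
    "E": ["e1", "e2", "e3"],
}
observed_codes = {"d1": False, "d2": True, "e1": False, "e2": False, "e3": False}

# An attribute is coded as used if any of its learning objectives was used.
attribute_use = {att: any(observed_codes.get(lo, False) for lo in los)
                 for att, los in attribute_objectives.items()}
print(attribute_use)  # → {'D': True, 'E': False}
```

Percentages such as those in Table 2 then follow by averaging these per-student attribute-use indicators over the students who answered the item.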
In addition, we evaluated the clarity of the items from a student perspective (unrelated to their knowledge of statistics), allowing alteration of unclear items. Minor adjustments were made to 17 items, such as reformulation or deletion/addition of context information. The improved items were used in the quantitative evaluation.

4.1.2. Results

The percentages of students who used each attribute when solving each item are presented in Table 2. Colored cells indicate which attributes are measured by the item according to the provisional Q-matrix. It can be seen that the attribute usage to a great extent coincides with the provisional Q-matrix (96.2% agreement based on a majority rule for attribute use). These results were used in the quantitative evaluation of data-driven suggestions for alternative Q-matrix entries.

For one item, students showed indications of engaging in cognitive processes that pointed to the use of domain-relevant attributes that are not included in the set of nine attributes. This was resolved by deleting one distractor from the response alternatives. Apart from this item, students did not show notable indications of using unspecified attributes. Therefore, we believe the nine attributes encompass the knowledge and understanding of statistics required to solve the items.

Table 2. Percentages of students using each attribute (Att.) in their reasoning per item ("-" indicates 0%). Colored cells indicate the attributes that are measured by each item according to the provisional Q-matrix. The last column indicates the number of students who answered each item (n).

Item | Att. A | Att. B | Att. C | Att. D | Att. E | Att. F | Att. G | Att. H | Att. I | n
ARTIST_sc_MS_05 | 67% | - | - | - | - | - | - | - | - | 3
ARTIST_sc_MS_01 | 100% | - | - | - | - | - | - | - | - | 4
ARTIST_sc_MC_05 | 100% | - | - | - | - | - | - | - | - | 4
ARTIST_sc_MC_06 | 100% | - | - | - | - | - | - | - | - | 4
ARTIST_db_MS_Q0490 | 100% | - | - | - | - | - | - | - | - | 4
CAOS_14 | 100% | 100% | - | - | - | - | - | - | - | 4
CAOS_15 | 100% | 100% | - | - | - | - | - | - | - | 4
CAOS_08 | 67% | 100% | - | - | - | - | - | - | - | 3
CAOS_09 | 67% | 100% | - | - | - | - | - | - | - | 3
CAOS_10 | 100% | 100% | - | - | - | - | - | - | - | 3
CAOS_11 | - | 67% | 100% | - | - | - | - | - | - | 3
CAOS_12 | - | 100% | 100% | - | - | - | - | - | - | 3
CAOS_13 | - | - | - | - | - | - | - | - | - | 3
SRA_015 | - | 75% | 100% | - | - | - | - | 25% | - | 4
ARTIST_db_CG_Q0840 | 50% | 100% | 100% | - | - | - | - | - | - | 4
ARTIST_sc_SV_01 | - | 67% | - | 33% | 33% | - | - | - | - | 3
CAOS_17 | - | - | - | 100% | - | - | - | - | - | 4
ARTIST_sc_SV_03 | - | - | - | 100% | - | 100% | - | - | - | 4
ARTIST_sc_SV_14 | 67% | - | - | 100% | - | 33% | - | - | - | 3
ARTIST_sc_SV_04 | - | - | - | 60% | 100% | - | - | - | - | 5
ARTIST_sc_SV_10 | - | - | - | - | 100% | - | - | - | - | 4
ARTIST_sc_SV_11 | - | - | - | 25% | 100% | 50% | - | - | - | 4
ARTIST_sc_SV_09 | - | 50% | - | 50% | 100% | 50% | - | - | - | 4
ARTIST_db_SS_Q0061A | - | - | - | 50% | 100% | - | - | - | - | 4
ARTIST_db_SS_Q0061B | - | - | - | - | 100% | 25% | - | - | - | 4
ARTIST_db_SS_Q0061C | - | - | - | - | 100% | 25% | - | - | - | 4
ARTIST_db_SS_Q0061D | - | - | - | - | 100% | 25% | - | - | - | 4
ARTIST_db_SS_Q0061E | 50% | - | - | - | 75% | 100% | - | - | - | 4
ARTIST_sc_SV_05 | - | - | - | 50% | 100% | - | - | - | - | 4
CAOS_16 | - | - | - | 100% | - | 100% | - | - | - | 4
CAOS_32 | 20% | - | - | 40% | - | 100% | - | - | - | 5
SCI_2004_20 | 100% | - | - | 75% | - | 100% | - | - | - | 4
GRASPLE_DP_SE_40985 | - | - | - | - | - | 100% | - | - | - | 3
ARTIST_db_SS_Q1437 | - | - | - | - | - | 67% | - | - | - | 3
ARTIST_db_SS_Q0614 | - | - | - | - | - | 100% | - | - | - | 3
CAOS_40 | - | - | - | - | - | - | 100% | 60% | - | 5
ARTIST_sc_TS_01 | - | - | - | - | - | - | 100% | - | - | 4
ARTIST_db_TSG_Q1182 | - | - | - | - | - | - | 100% | 25% | - | 4
CAOS_23 | - | - | - | - | - | 25% | 100% | 75% | - | 4
CAOS_24 | - | - | - | - | - | - | 100% | 25% | - | 4
ARTIST_db_TSG_Q1392 | - | - | - | - | - | - | - | 60% | - | 5
CAOS_25 | - | - | - | - | - | - | - | 75% | - | 4
CAOS_26 | - | - | - | - | - | - | - | 100% | - | 4
CAOS_27 | - | - | - | - | - | - | - | 100% | - | 4
ARTIST_sc_TS_04 | - | - | - | - | - | - | 100% | 50% | - | 4
ARTIST_sc_TS_10 | - | - | - | - | - | - | 100% | 100% | - | 4
ARTIST_sc_TS_07 | 33% | - | - | 33% | - | - | 33% | 100% | - | 3
ARTIST_sc_TS_09 | - | - | - | - | - | - | 75% | 100% | - | 4
SCI_2004_22 | - | - | - | - | - | - | - | 100% | - | 3
ARTIST_db_TSG_Q1007 | - | - | - | - | - | 25% | 25% | 50% | - | 4
ARTIST_sc_CI_05 | - | - | - | 25% | - | - | - | - | 75% | 4
ARTIST_sc_CI_03 | - | - | - | - | - | - | - | - | 67% | 3
ARTIST_sc_CI_02 | - | - | - | - | - | - | - | - | 75% | 4
ARTIST_sc_CI_01 | - | - | - | - | - | - | - | - | 100% | 3
ARTIST_sc_CI_07 | - | - | - | - | - | - | - | - | 100% | 3
ARTIST_sc_CI_06 | - | - | - | - | - | - | - | - | 67% | 3
ARTIST_sc_CI_10 | - | - | - | - | - | - | - | - | 100% | 3
ARTIST_db_CIOSM_Q1394 | - | - | - | - | - | - | 25% | - | 50% | 4
ARTIST_db_CIOSM_Q1387 | - | - | - | - | - | - | - | - | 100% | 3

4.2. Quantitative Evaluation

4.2.1. Methods

In addition to the qualitative evaluation, the items were administered on a large scale to evaluate students' item responses quantitatively.

Participants and Procedures

The items were translated into Dutch and implemented in introductory statistics courses taught during the fall of 2021 in five different bachelor programs at Utrecht University, namely, psychology, educational sciences, pedagogical sciences, sociology, and cultural anthropology. The item set was split into two assessments to reduce the burden on students. The first assessment consisted of 35 items and concerned all topics related to samples and spread, encompassing attributes A–F. The second assessment consisted of 24 items and concerned all topics related to NHST and confidence intervals (CIs), encompassing attributes G–I. The two item sets exclusively measure the attributes that are assessed within each assessment according to the provisional Q-matrix, and both Q-matrices meet the conditions for generic identifiability. The assessments were made available after the relevant topics were addressed in class, but well before the final exam. They were presented as optional, formative assessments. Students were given only one attempt for each assessment to motivate them to make a serious effort. In total, 849 students completed the first assessment and 790 students completed the second assessment.

Analysis

The data were analyzed using the statistical software R.
We focused on diagnostic classification models (DCMs) under the log-linear cognitive diagnosis modeling (LCDM) framework because of its modeling flexibility and straightforward interpretation. To validate attribute measurement with DCMs, we first evaluated the assumption of local independence using the local dependence statistic χ²_LD for each item pair. To control for multiple comparisons, p-value adjustments according to the Holm–Bonferroni method were conducted.

Next, we empirically evaluated the Q-matrix with the stepwise Wald method. This method provides suggestions for modifications to Q-matrix entries based on the amount of explained variance in the success probabilities for different attribute profiles. This procedure is based on the data at hand, and whether the suggested modifications should be incorporated should be subject to the judgment of domain experts. Therefore, we carefully evaluated the suggested modifications in relation to theoretical considerations and the results from the think-aloud study to make decisions about the final Q-matrix.

We then examined attribute behavior at the item level. This refers to the compensatory nature of the attributes (i.e., whether nonmastery of an attribute can be compensated for by mastery of another). The Wald test is used to evaluate whether, for items that require at least two attributes, the LCDM can be replaced by a reduced model without significant loss of fit, reflecting different attribute behaviors at the item level. This can result in higher classification accuracy.

We end with an evaluation of the final model. Several fit statistics are evaluated, namely, the M2 statistic, the RMSEA2, and the SRMSR, as well as the maximum transformed correlation and the maximum log-odds ratio. Finally, the attribute-level classification accuracy index and the reliability measurement are reported.

4.2.2. Results

Assumption of Local Independence

The results showed several item pairs with significant χ²_LD values, indicating local dependence.
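The Holm–Bonferroni adjustment applied to these pairwise tests can be sketched as follows. This is an illustrative Python sketch with invented p-values (the study's analyses were conducted in R, where the same step-down procedure is available): sort the p-values, multiply the i-th smallest by the number of remaining hypotheses, and enforce monotonicity of the adjusted values.

```python
# Step-down Holm–Bonferroni adjustment (toy p-values, not the study's data).
def holm_bonferroni(pvals):
    """Return Holm-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * pvals[i])  # multiplier m, m-1, ..., 1
        running_max = max(running_max, adj)    # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

Item pairs whose adjusted p-values stay below the significance level are then flagged as locally dependent.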
For these item pairs, removal of one item was considered based on the amount of dependence with other items, extreme proportion correct scores, or the presence of items with similar q-vectors (i.e., rows in the Q-matrix). In total, three items were removed from each assessment. The final assessments consist of 32 and 21 items, respectively, which are used in the subsequent analyses and can be found in the Supplementary Material.

Q-Matrix Validation

The stepwise Wald method suggested modifications for the q-vectors of 14 items of the first assessment (21 suggested entry changes) and for the q-vectors of two items of the second assessment (3 suggested entry changes). These suggestions were evaluated based on three aspects. First, the items and attributes were re-evaluated to assess whether the suggested modifications were theoretically defensible. Second, we inspected the results from the think-aloud study to determine the percentage of students who used the attributes under consideration when answering the items, thereby including empirical evidence for whether or not the items measure the attributes. Third, we compared the proportion of variance accounted for (PVAF) of relevant q-vectors to evaluate to what extent the attributes under consideration contribute to explaining variance in success probabilities. Modifications were considered only if the suggestions were sensible from a theoretical perspective. In total, five Q-matrix entries were adjusted for five different items. These adjustments, along with extensive argumentation, can be found in Table A2 in Appendix A. Table 3 presents the final Q-matrices (along with the selected reduced DCMs, which are discussed next). Both Q-matrices satisfy the conditions for generic identifiability.

Table 3. Final Q-matrices and selected reduced DCMs per item.
Assessment 1

Item | A | B | C | D | E | F | Reduced Model
ARTIST_sc_MS_05 | 1 | 0 | 0 | 0 | 0 | 0 | -
ARTIST_sc_MS_01 | 1 | 0 | 0 | 0 | 0 | 0 | -
ARTIST_sc_MC_05 | 1 | 0 | 0 | 0 | 0 | 0 | -
ARTIST_sc_MC_06 | 1 | 0 | 0 | 0 | 0 | 0 | -
ARTIST_db_MS_Q0490 | 1 | 0 | 0 | 0 | 0 | 0 | -
CAOS_14 | 1 | 1 | 0 | 0 | 0 | 0 | DINA
CAOS_15 | 1 | 1 | 0 | 0 | 0 | 0 | DINA
CAOS_08 | 1 | 1 | 0 | 0 | 0 | 0 | DINA
CAOS_09 | 0 | 1 | 0 | 0 | 0 | 0 | -
CAOS_10 | 1 | 1 | 0 | 0 | 0 | 0 | LLM
CAOS_12 | 0 | 1 | 1 | 0 | 0 | 0 | R-RUM
CAOS_13 | 0 | 0 | 1 | 0 | 0 | 0 | -
SRA_015 | 0 | 1 | 1 | 0 | 0 | 0 | R-RUM
ARTIST_db_CG_Q0840 | 0 | 1 | 1 | 0 | 0 | 0 | R-RUM
ARTIST_sc_SV_01 | 0 | 1 | 0 | 0 | 1 | 0 | DINA
CAOS_17 | 0 | 0 | 0 | 1 | 0 | 0 | -
ARTIST_sc_SV_03 | 0 | 0 | 0 | 1 | 0 | 1 | LLM
ARTIST_sc_SV_14 | 0 | 0 | 0 | 1 | 0 | 0 | -
ARTIST_sc_SV_04 | 0 | 0 | 0 | 1 | 0 | 0 | -
ARTIST_sc_SV_10 | 0 | 0 | 0 | 0 | 1 | 0 | -
ARTIST_sc_SV_11 | 0 | 0 | 0 | 0 | 1 | 1 | A-CDM
ARTIST_sc_SV_09 | 0 | 0 | 0 | 0 | 1 | 1 | R-RUM
ARTIST_db_SS_Q0061A | 0 | 0 | 0 | 0 | 1 | 0 | -
ARTIST_db_SS_Q0061B | 0 | 0 | 0 | 0 | 1 | 0 | -
ARTIST_db_SS_Q0061C | 0 | 0 | 0 | 0 | 1 | 0 | -
ARTIST_db_SS_Q0061D | 0 | 0 | 0 | 0 | 1 | 0 | -
ARTIST_sc_SV_05 | 0 | 0 | 0 | 0 | 1 | 0 | -
CAOS_16 | 0 | 0 | 0 | 1 | 0 | 1 | LLM
CAOS_32 | 0 | 0 | 0 | 0 | 0 | 1 | -
SCI_2004_20 | 1 | 0 | 0 | 0 | 0 | 1 | LLM
ARTIST_db_SS_Q1437 | 0 | 0 | 0 | 0 | 0 | 1 | -
ARTIST_db_SS_Q0614 | 0 | 0 | 0 | 0 | 0 | 1 | -

Assessment 2

Item | G | H | I | Reduced Model
CAOS_40 | 1 | 0 | 0 | -
ARTIST_sc_TS_01 | 1 | 0 | 0 | -
ARTIST_db_TSG_Q1182 | 1 | 0 | 0 | -
CAOS_24 | 1 | 0 | 0 | -
ARTIST_db_TSG_Q1392 | 0 | 1 | 0 | -
CAOS_25 | 0 | 1 | 0 | -
CAOS_26 | 0 | 1 | 0 | -
ARTIST_sc_TS_04 | 1 | 1 | 0 | DINA
ARTIST_sc_TS_10 | 0 | 1 | 0 | -
ARTIST_sc_TS_07 | 0 | 1 | 0 | -
ARTIST_sc_TS_09 | 0 | 1 | 0 | -
SCI_2004_22 | 0 | 1 | 0 | -
ARTIST_db_TSG_Q1007 | 0 | 1 | 0 | -
ARTIST_sc_CI_05 | 0 | 0 | 1 | -
ARTIST_sc_CI_02 | 0 | 0 | 1 | -
ARTIST_sc_CI_01 | 0 | 0 | 1 | -
ARTIST_sc_CI_07 | 0 | 0 | 1 | -
ARTIST_sc_CI_06 | 0 | 0 | 1 | -
ARTIST_sc_CI_10 | 0 | 0 | 1 | -
ARTIST_db_CIOSM_Q1394 | 0 | 1 | 1 | DINA
ARTIST_db_CIOSM_Q1387 | 0 | 0 | 1 | -

Attribute Behavior at the Item Level

The Wald test was used to conduct item-level comparisons of the LCDM and the reduced DCMs that it subsumes, namely, the deterministic inputs, noisy "and" gate (DINA) model; the deterministic inputs, noisy "or" gate (DINO) model; the additive cognitive diagnosis model (A-CDM); the linear logistic model (LLM); and the reduced reparameterized unified model (R-RUM).
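The contrast between conjunctive and disjunctive attribute behavior in two of the reduced models just listed can be illustrated with a minimal sketch. This is Python for illustration only (not the fitted R models), and the guessing parameter g, slip parameter s, and q-vector are invented:

```python
# Toy item response probabilities for a two-attribute item under DINA
# (conjunctive: all required attributes needed) and DINO (disjunctive:
# any one required attribute suffices). g and s are invented values.
g, s = 0.2, 0.1  # guessing and slip probabilities

def p_dina(profile, q_row):
    """P(correct): high only if every required attribute is mastered."""
    mastered_all = all(m >= q for m, q in zip(profile, q_row))
    return 1 - s if mastered_all else g

def p_dino(profile, q_row):
    """P(correct): high if at least one required attribute is mastered."""
    mastered_any = any(m == 1 and q == 1 for m, q in zip(profile, q_row))
    return 1 - s if mastered_any else g

q = [1, 1]  # hypothetical item requiring both attributes
for profile in ([0, 0], [1, 0], [1, 1]):
    print(profile, p_dina(profile, q), p_dino(profile, q))
# Only the full profile [1, 1] gets the high DINA probability,
# whereas DINO already rewards the partial profile [1, 0].
```

The Wald test asks, per item, whether such a constrained success-probability pattern fits as well as the saturated LCDM pattern.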
The Wald test suggested reduced models for all two-dimensional items: 13 and 2 items in assessments 1 and 2, respectively. The suggested models are presented in Table 3. Fitting these models did not significantly reduce model fit compared with fitting the saturated LCDM to all items, χ²(17) = 9.8, p = 0.91 (assessment 1) and χ²(4) = 5.1, p = 0.28 (assessment 2). The reduced models are used in the subsequent analyses.

Final Model

The fit statistics to evaluate the absolute fit of the final model are shown in Table 4. The statistics indicate that the model fits adequately (for the RMSEA₂, the cutoff values 0.030 and 0.045 indicate excellent and good fit for the LCDM; for the SRMSR, values below 0.05 indicate acceptable model fit in DCMs, e.g., ). Further, classification accuracy and reliability are mostly high, as shown in Table 5. Detailed model results are included in the Supplementary Material.

Table 4. Absolute fit statistics of the final models.

               M₂                  RMSEA₂   SRMSR   Max Transformed Correlation *   Max Log-Odds Ratio *
Assessment 1   591.4 (p < 0.001)   0.025    0.042   0.128 (p = 0.10)                0.982 (p = 0.19)
Assessment 2   212.1 (p = 0.06)    0.014    0.038   0.104 (p = 0.76)                0.850 (p = 0.49)

* To control for multiple comparisons, p-value adjustments according to the Holm–Bonferroni method are conducted .

Table 5. Attribute-level classification accuracy and reliability.

              Assessment 1                         Assessment 2
Attribute     A     B     C     D     E     F      G     H     I
Accuracy      0.89  0.83  0.90  0.89  0.82  0.82   0.82  0.84  0.88
Reliability   0.87  0.69  0.89  0.87  0.68  0.67   0.67  0.76  0.85

5. Communicating Feedback

With the final model results, students' mastery status can be determined based on their item responses, resulting in diagnostic information. Diagnostic assessments should provide meaningful information to students that can be readily understood and used to improve learning.
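The mastery determination step can be sketched as a simple Bayes computation: given item parameters, a Q-matrix, and a prior over attribute profiles, a student's responses yield posterior profile probabilities, from which marginal (expected a posteriori) mastery probabilities follow. This is a minimal DINA-style sketch with invented parameters and a uniform prior, not the calibration code used in the study.

```python
from itertools import product

def mastery_probabilities(responses, Q, item_params, profile_prior):
    """EAP attribute mastery probabilities for one student (DINA-style sketch).

    responses:     list of 0/1 item scores
    Q:             list of q-vectors (one per item)
    item_params:   list of (guess, slip) tuples per item
    profile_prior: dict mapping attribute profile -> prior probability
    """
    K = len(Q[0])
    posterior = {}
    for profile, prior in profile_prior.items():
        like = prior
        for x, q, (g, s) in zip(responses, Q, item_params):
            eta = all(a >= qk for a, qk in zip(profile, q))
            p = 1 - s if eta else g          # DINA item response probability
            like *= p if x == 1 else 1 - p   # likelihood of the observed score
        posterior[profile] = like
    total = sum(posterior.values())
    posterior = {prof: v / total for prof, v in posterior.items()}
    # marginal mastery probability per attribute: sum posterior mass of
    # all profiles in which that attribute is mastered
    return [sum(p for prof, p in posterior.items() if prof[k] == 1)
            for k in range(K)]

# hypothetical 4-item assessment measuring 2 attributes
profiles = {p: 0.25 for p in product((0, 1), repeat=2)}  # uniform prior
Q = [(1, 0), (1, 0), (0, 1), (1, 1)]
params = [(0.2, 0.1)] * 4
print(mastery_probabilities([1, 1, 0, 0], Q, params, profiles))
```

A student who solves both items measuring the first attribute but fails both items measuring the second ends up with a high marginal mastery probability for attribute 1 and a low one for attribute 2, which is the diagnostic information fed to the dashboard.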
In this section, we pay attention to the psychometric interpretation of the results, i.e., to the inferences that can be made based on results from cognitive diagnostic assessment with DCMs. In addition, we discuss score reporting based on a mock-up dashboard to show how this information can be communicated to students.

5.1. Psychometric Interpretation

Before students can be supported to interpret assessment results, assessment developers must accurately interpret the results from a psychometric perspective. DCMs assume that the latent variables predicting the item responses (i.e., the attributes) are discrete variables. For each attribute, two groups are distinguished: those who master the attribute and those who do not. With DCMs, one can estimate the mastery probability for each attribute (expected a posteriori estimates). Classification as master or nonmaster is based on these marginal mastery probabilities. For example, probabilities above 0.5 result in classification as master and below 0.5 as nonmaster (although some contexts may call for a different cut-off value). The mastery probability is not an estimate of the amount of mastery, nor of a student's progress from nonmastery to mastery. Rather, it indicates the (un)certainty of the classification, which is determined by students' consistency in providing either correct or incorrect answers to different items measuring an attribute . The closer the mastery probability is to 0 or 1, the more certainty there is about the classification, whereas a probability of 0.5 reflects complete uncertainty. Although these models do not allow breaking the probability scale into more than two groups to indicate more than two mastery levels, it is possible to break the scale to indicate uncertainty. One could, for example, not classify students on attributes with mastery probabilities between 0.4 and 0.6, but instead add an “undetermined”
category for these uncertain situations (as proposed by , and as will be demonstrated below), or one could flag uncertain classifications when reporting results.

5.2. Reporting Results

If the assessments and model results are implemented in online learning environments, feedback can be reported directly after completion of the assessments. This is a great advantage, since timeliness is important to encourage stakeholders to use diagnostic information . Further, web-based reporting makes it possible to manage and organize large amounts of information via interactive dashboards [59,60]. Ref.  presents a framework for developing score reports for cognitive diagnostic assessments. Following their guidelines, we created a mock-up student dashboard that is personalized, contains a description of the presented information, provides a visual summary of student performance that can be readily interpreted, and outlines how this information can guide study behavior. The mock-up dashboard is presented in Figure 2.

Figure 2. Mock-up student dashboard to report cognitive diagnostic assessment results.

The dashboard consists of two sections to report both attribute-level scores and raw scores. We focus on the first section, titled “Skill mastery”, which presents attribute-level performance. On the left, it is explained what information is presented in this section. On the right, a graph is provided that visualizes the attributes (the presented hierarchy reflects attribute dependencies, indicating whether mastery of an attribute is a prerequisite to mastery of another attribute; this hierarchy was determined in consultation with domain experts). If a student clicks on an attribute, brief descriptions of the measured skills are shown, as well as recommended learning materials to improve these skills. Indicating skill mastery is substantively meaningful, because the inferences about student performance are made with reference to the cognitive skills measured by the assessment, allowing for criterion-referenced interpretations (cut-scores in DCMs are set to maximize the reliable separation of respondents, i.e., the criterion is set statistically; , Chapter 5).
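The three-way labeling used in the dashboard (master, nonmaster, or undetermined for mastery probabilities between 0.4 and 0.6) can be sketched as a simple thresholding rule. The skill names and probabilities below are invented for illustration.

```python
def classify(mastery_prob, lower=0.4, upper=0.6):
    # Three-way labeling: instead of forcing a hard cut at 0.5,
    # flag probabilities in the (lower, upper) band as "undetermined".
    if mastery_prob >= upper:
        return "master"
    if mastery_prob <= lower:
        return "nonmaster"
    return "undetermined"

# hypothetical EAP mastery probabilities for three skills
probs = {"Understanding center & spread": 0.92,
         "Understanding sampling variability": 0.55,
         "Understanding the standard error": 0.08}
print({skill: classify(p) for skill, p in probs.items()})
```

The band boundaries are a reporting choice: widening it flags more classifications as uncertain, narrowing it forces more students into the master/nonmaster groups.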
We have used the familiar terms “mastery” versus “nonmastery” to label the classifications throughout this paper. Alternatively, one can choose labels that express in-progress learning, such as “on track” versus “needs attention”. The labels can influence how the information is interpreted and used by students, and it is important to select labels that are appropriate for a given context .

6. Discussion

The current study showed the validation process of constructing detailed diagnostic information on a set of skills, abilities, and cognitive processes from students' item response data. This integration of learning analytics and educational assessment promotes valid and reliable formative assessment. We constructed cognitive diagnostic assessments to measure attributes relevant for nonmathematical introductory statistics courses in higher education using diagnostic classification models. To construct the assessments, we did not write new items to measure the identified attributes, but relied on existing items, which may limit the scope of what can be measured . However, we did not define attributes based on the items but searched for items based on the prespecified attributes. We exploited multiple sources for item collection and evaluated whether all aspects of the attributes were measured. Moreover, we validated a new measurement scale for actionable feedback based on the existing items, which can be viewed as a contribution of our work. The final models showed adequate model fit, classification accuracy, and reliability. The constructed assessments provide a valid and reliable measurement of the specified attributes and allow provision of actionable feedback to students via learning dashboards.

6.1. Implications for Educational Practice

If the assessments are implemented in online learning environments of introductory statistics courses, the mastery status of new students can be determined directly after they complete an assessment, based on their automatically scored item responses. With timely, actionable diagnostic feedback, students can make more effective learning choices . The value of feedback depends on how it is interpreted by the recipient, which in turn influences learning choices. Although both interpretation and use are critical to the effectiveness of formative assessment , only a few studies have examined actual uses of score reports (see  for a review). It is an interesting avenue for future research to explore how diagnostic feedback impacts students' learning processes and whether there are individual differences in the effects of diagnostic feedback on learning choices (e.g., due to differences in expertise level or self-regulated learning skills; [65,66]).

6.2. Limitations

Although the constructed assessments provide a valid and reliable measurement of the specified attributes, this study has several limitations. One limitation lies in the small sample size of the think-aloud study due to limited resources, which may limit the generalizability of the qualitative evaluation. Furthermore, verbal reports may not completely reflect students' cognitive processes . Especially if constructs are not well understood, verbal reports can be subject to bias and error. The participants in the think-aloud study mostly rated their performance as (very) good. Although it would be interesting to include students with low performance, it is difficult to collect high-quality verbal report data from those students. Even for students with high performance, verbal reports may provide an incomplete record of their knowledge and cognitive processes as they solved the items.
Therefore, the results from the think-aloud study were not used as hard evidence, but rather as supportive information in the empirical Q-matrix validation. Further, in the quantitative evaluation, the assessments were offered as optional study material. This might have influenced students' motivation, for example, resulting in rapid-guessing behavior that can lead to differential item functioning . Moreover, it could cause selection bias, for example, if the assessments were completed mostly by students who had not yet grasped the materials, resulting in a low base rate of attribute mastery. However, Ref.  showed that DCM item parameters are theoretically invariant with respect to the calibration sample; thus, the estimates are not influenced by the base rate of attribute mastery. Since we found adequate model-data fit, the calibrated assessments can be used to appropriately classify new students. Nevertheless, changes in parameters can be expected , and regular monitoring of parameter invariance is recommended.

6.3. Concluding Remarks

To conclude, cognitive diagnostic assessment can be a valuable tool within e-learning environments to obtain timely, diagnostic feedback on cognitive attributes to support student learning. Since poor-quality assessment leads to less-effective learning choices, it is important to validate whether assessments are adequate and appropriate for their intended interpretation and use . The current study focused on validating the interpretation of assessments in university statistics education (i.e., the claims that can be made about students), but it is important to also verify whether appropriate actions follow if these interpretations are presented to students (e.g., following ).

Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app12104809/s1.

Author Contributions: Conceptualization, L.M., M.J.S.B., L.K.
and L.W.-d.M.; data curation, L.M.; formal analysis, L.M.; investigation, L.M.; methodology, L.M., M.J.S.B., L.K. and L.W.-d.M.; supervision, L.W.-d.M.; validation, L.M.; writing—original draft preparation, L.M.; writing—review and editing, M.J.S.B., L.K. and L.W.-d.M. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Institutional Review Board Statement: The study was approved by the Ethics Review Board of the Faculty of Social and Behavioural Sciences at Utrecht University (protocol number 21-0115, date of approval 1 April 2021).

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy/ethical restrictions.

Acknowledgments: The authors sincerely thank the experts who advised them and supported the study.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
DCM     Diagnostic classification model
LCDM    Log-linear cognitive diagnosis model
DINA    Deterministic inputs, noisy “and” gate
DINO    Deterministic inputs, noisy “or” gate
A-CDM   Additive cognitive diagnosis model
LLM     Linear logistic model
R-RUM   Reduced reparameterized unified model

Appendix A
This section presents more extensive results from (1) the attribute identification process and (2) the empirical Q-matrix validation:
• Table A1 presents an overview of the 9 attributes that resulted from the attribute identification process, which comprise 33 learning objectives. The table includes references to empirical studies that show the relevance of the addressed concepts and the occurrence of misconceptions for each learning objective.
In the Supplementary Material, a more detailed description of each attribute and the references to the empirical studies is provided.
• Table A2 presents all items for which modifications to the Q-matrix were suggested in the empirical Q-matrix validation. Argumentation to retain or modify entries is provided along with the final q-vectors.

Table A1. Identified attributes and learning objectives with references showing their relevance.

A. Understanding center & spread
1. Understanding of the idea of variability in samples and populations. - 
2. Ability to describe and interpret measures of center (mean, median, mode). - , Chapter 9
3. Ability to describe and interpret measures of spread (range, interquartile range, variance, standard deviation). - 
4. Understanding of how center and spread are affected by data transformations. - 
5. Understanding of how center and spread are affected by handling of outliers. - , Chapters 9 and 10

B. Interpreting univariate graphical representations
6. Ability to describe and interpret graphical representations of sample data in common univariate graphical displays (histogram, boxplot, dotplot, bar chart), including the ability to reason about center and spread in sample data based on graphical representations. - [28,74]

C. Graphically comparing groups
7. Ability to compare groups on a continuous outcome variable by focusing on data as an aggregate with characteristics such as center, spread, and shape. - , Chapters 9 and 10; [28,75]
8. Ability to make informal inferences about group differences based on graphical representations by comparing variability between and within groups, i.e., by considering both center and spread. - , Chapter 11; 

D. Understanding sampling variability
9. Understanding that a sample provides incomplete information about the population from which it is drawn and that random sampling forms the basis for statistical inference. - , Chapter 12; 
10. Understanding how the set of all possible random samples (with the given size) from the population of interest is considered in inferential statistics, i.e., that the unit of analysis is the entire sample rather than a single observation. - , Chapters 12 and 13; 
11. Understanding that not all samples are identical, so not all of them will resemble the population in the same way and to the same extent every time. - 
12. Understanding of expected patterns in sampling variability, in which some values are more or less likely than others to be drawn from a particular population. - , Chapter 12; 

E. Understanding sampling distributions
13. Ability to describe and distinguish between the sample distribution, sampling distribution, and population distribution. - , Chapter 12; [79,80]
14. Knowing that, given a sufficiently large sample size, the sampling distribution of the mean for a variable will approximate a normal distribution. - , Chapter 12; [81,82]
15. Knowing how the (unknown) population mean, the sample mean, the possible sample mean values for different random samples of the given size, and the theoretical mean of these possible values relate to each other. - , Chapter 12; 

F. Understanding the standard error
16. Ability to interpret the standard error as a measure of variability of sample means. - , Chapter 12
17. Understanding that statistics from small samples vary more than statistics from large samples and knowing how the standard error decreases with sample size. - , Chapter 12; [28,79]
18. Understanding of how population variance influences sampling variability and, thus, the standard error. - , Chapter 12
19. Ability to make informal inferences about sample means based on measures of sampling variability. - 

G. Understanding principles of hypothesis testing
20. Understanding of the goals and the logic of significance tests; understanding of the theoretical idea of finding evidence against a null hypothesis. - 
21. Understanding that statistical inferences are not deterministic and NHST results do not prove the null hypothesis to be true or false. - [84,85]

H. Evaluating NHST results
22. Ability to take a correct decision about the null hypothesis based on the significance level and p-value. - 
23. Ability to correctly interpret p-values as the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is true. - [28,87–90]
24. Ability to describe the concepts Type I error, Type II error, significance level, and statistical power. - [86,91]
25. Understanding of how increasing the sample size increases power by reducing the standard error. - , Chapter 13; 
26. Understanding of how increasing the significance level increases power by enlarging the rejection region. - 
27. Understanding of how larger effect sizes result in higher power by enlarging the difference between the means of the null and alternative distribution. - 
28. Understanding that statistical significance does not imply practical relevance and vice versa. - [28,92,93]

I. Understanding and using confidence intervals
29. Understanding that CIs are used to provide an indication of the precision of an estimate. - [28,94,95]
30. Ability to correctly interpret CIs in terms of repeated sampling (e.g., if repeated samples were taken and the 95% confidence interval was computed for each sample mean, 95% of the intervals would contain the population mean). - 
31. Understanding of how increasing the sample size decreases CI width by reducing the standard error. - 
32. Understanding of how increasing the confidence level increases CI width by reducing the precision of the estimate. - 
33. Understanding of connections between CIs and hypothesis tests; ability to draw a conclusion about the null hypothesis based on a CI (i.e., if the CI does not contain the null hypothesis value, the results are statistically significant). - 
Table A2. Items for which the stepwise Wald method suggested q-vector modifications. For each item, the original (orig.) and suggested (sugg.) q-vectors that represent rows in the Q-matrix are shown, together with (1) the item content re-evaluation indicating whether suggestions to include or exclude attributes could be theoretically defensible (yes/no), (2) the percentage of students in the think-aloud (TA) study who used the attributes for which modifications are suggested, (3) the PVAF for relevant q-vectors, (4) the argumentation to retain or modify q-vector entries, and (5) the final q-vectors.

CAOS_15 — orig. 110000, sugg. 100000. Exclude B: no; TA use of B: 100%; PVAF: 100000 = 0.328, 110000 = 0.816. Attribute B should be included from a theoretical perspective and explains a substantial proportion of additional variance. Final: 110000.

CAOS_08 — orig. 110000, sugg. 010000. Exclude A: no; TA use of A: 67%; PVAF: 010000 = 0.464, 110000 = 0.835. Attribute A should be included from a theoretical perspective and explains a substantial proportion of additional variance. Final: 110000.

CAOS_12 — orig. 111000, sugg. 001000. Exclude A: yes, exclude B: no; TA use: A 0%, B 100%; PVAF: 001000 = 0.959, 011000 = 0.964, 111000 = 0.985. Although attribute B explains little additional variance, it cannot be excluded from a theoretical perspective. Attribute A can be excluded and explains little additional variance; therefore, it is excluded. Final: 011000.

CAOS_13 — orig. 011000, sugg. 001000. Exclude B: yes; TA use of B: 0%; PVAF: 001000 = 0.999, 011000 = 0.999. Attribute B can be excluded from a theoretical perspective and explains no additional variance; therefore, it is excluded. Final: 001000.

SRA_15 — orig. 001000, sugg. 010000. Include B: yes, exclude C: no; TA use: B 75%, C 100%; PVAF: 001000 = 0.002, 010000 = 0.437, 011000 = 0.469. Although attribute C explains a very small proportion of variance, it cannot be excluded from a theoretical perspective. Attribute B can be included from a theoretical perspective and explains a substantial proportion of additional variance; therefore, it is included. Final: 011000.

ARTIST_db_CG_Q0840 — orig. 011000, sugg. 001000. Exclude B: no; TA use of B: 100%; PVAF: 001000 = 0.876, 011000 = 0.969. Attribute B should be included from a theoretical perspective and explains some additional variance. Final: 011000.

ARTIST_sc_SV_04 — orig. 000100, sugg. 001000. Include C: no, exclude D: no; TA use: C 0%, D 60%; PVAF: 000100 = 0.056, 001000 = 0.183. Including attribute C and excluding attribute D is not theoretically defensible. Despite the higher PVAF for attribute C, the original q-vector is retained. Final: 000100.

ARTIST_sc_SV_10 — orig. 000110, sugg. 000010. Exclude D: yes; TA use of D: 0%; PVAF: 000010 = 0.945, 000110 = 0.986. Attribute D can be excluded from a theoretical perspective and explains only little additional variance; therefore, it is excluded. Final: 000010.

ARTIST_sc_SV_11 — orig. 000011, sugg. 000010. Exclude F: no; TA use of F: 50%; PVAF: 000010 = 0.949, 000011 = 0.951. Although attribute F explains only a small proportion of additional variance, it cannot be excluded from a theoretical perspective. Final: 000011.

ARTIST_db_SS_Q0061A — orig. 000010, sugg. 100000. Include A: no, exclude E: no; TA use: A 0%, E 100%; PVAF: 000010 = 0.016, 100000 = 0.597. Including attribute A and excluding attribute E is not theoretically defensible. Despite the higher PVAF for attribute A, the original q-vector is retained. Final: 000010.

ARTIST_db_SS_Q0061C — orig. 000010, sugg. 001000. Include C: no, exclude E: no; TA use: C 0%, E 100%; PVAF: 000010 = 0.232, 001000 = 0.288. Including attribute C and excluding attribute E is not theoretically defensible. Despite the higher PVAF for attribute C, the original q-vector is retained. Final: 000010.

ARTIST_db_SS_Q0061D — orig. 000010, sugg. 001000. Include C: no, exclude E: no; TA use: C 0%, E 100%; PVAF: 000010 = 0.121, 001000 = 0.182. Including attribute C and excluding attribute E is not theoretically defensible. Despite the higher PVAF for attribute C, the original q-vector is retained. Final: 000010.

CAOS_32 — orig. 000001, sugg. 000010. Include E: no, exclude F: no; TA use: E 0%, F 100%; PVAF: 000001 = 0.113, 000010 = 0.379. Including attribute E and excluding attribute F is not theoretically defensible. Despite the higher PVAF for attribute E, the original q-vector is retained. Final: 000001.

SCI_2004_20 — orig. 100001, sugg. 100000. Exclude F: no; TA use of F: 100%; PVAF: 100000 = 0.908, 100001 = 0.958. Although attribute F explains only a small proportion of additional variance, it cannot be excluded from a theoretical perspective. Final: 100001.

ARTIST_sc_TS_04 — orig. 110, sugg. 010. Exclude G: no; TA use of G: 100%; PVAF: 010 = 0.849, 110 = 0.999. Attribute G should be included from a theoretical perspective and explains some additional variance. Final: 110.

ARTIST_db_CIOSM_Q1394 — orig. 001, sugg. 010. Include H: yes, exclude I: no; TA use: H 0%, I 50%; PVAF: 001 = 0.180, 010 = 0.885, 011 = 0.897. Attribute H can be included from a theoretical perspective and explains a substantial proportion of variance; therefore, it is included. Although attribute I explains only little additional variance, it cannot be excluded from a theoretical perspective. Final: 011.

References
1. Sitzmann, T.; Ely, K. A meta-analysis of self-regulated learning in work-related training and educational attainment: What we know and where we need to go. Psychol. Bull. 2011, 137, 421–442. [CrossRef] [PubMed]
2. Dunlosky, J.; Rawson, K.A. Overconfidence produces underachievement: Inaccurate self evaluations undermine students' learning and retention. Learn. Instr. 2012, 22, 271–280. [CrossRef]
3. Lee, K. Rethinking the accessibility of online higher education: A historical review. Internet High. Educ. 2017, 33, 15–23. [CrossRef]
4. Gikandi, J.W.; Morrow, D.; Davis, N.E. Online formative assessment in higher education: A review of the literature. Comput. Educ. 2011, 57, 2333–2351. [CrossRef]
5. Brinkhuis, M.J.S.; Cordes, W.; Hofman, A. Governing games: Adaptive game selection in the Math Garden. ITM Web Conf. 2020, 33, 03003. [CrossRef]
6. Quilici, J.L.; Mayer, R.E. Teaching students to recognize structural similarities between statistics word problems. Appl. Cogn. Psychol. 2002, 16, 325–342.
[CrossRef]
7. Guskey, T.R. The case against percentage grades. Educ. Sch. Couns. Psychol. Fac. Publ. 2013, 71, 68–72.
8. Leighton, J.P.; Gierl, M.J. Why cognitive diagnostic assessment? In Cognitive Diagnostic Assessment for Education; Leighton, J.P., Gierl, M.J., Eds.; Cambridge University Press: New York, NY, USA, 2007; pp. 3–18. [CrossRef]
9. Huff, K.; Goodman, D.P. The demand for cognitive diagnostic assessment. In Cognitive Diagnostic Assessment for Education; Leighton, J.P., Gierl, M.J., Eds.; Cambridge University Press: New York, NY, USA, 2007; pp. 19–60.
10. Kanar, A.M.; Bell, B.S. Guiding learners through technology-based instruction: The effects of adaptive guidance design and individual differences on learning over time. J. Educ. Psychol. 2013, 105, 1067–1081. [CrossRef]
11. Norris, S.P.; Macnab, J.S.; Phillips, L.M. Cognitive modeling of performance on diagnostic achievement tests: A philosophical analysis and justification. In Cognitive Diagnostic Assessment for Education; Leighton, J.P., Gierl, M.J., Eds.; Cambridge University Press: New York, NY, USA, 2007; pp. 61–84. [CrossRef]
12. Rupp, A.A.; Templin, J.; Henson, R.A. Diagnostic Measurement: Theory, Methods, and Applications; The Guilford Press: New York, NY, USA, 2010.
13. Maas, L.; Brinkhuis, M.J.S.; Kester, L.; Wijngaards-de Meij, L. Diagnostic classification models for actionable feedback in education: Effects of sample size and assessment length. Front. Educ. 2022, 7, 36. [CrossRef]
14. Wiggins, G.; McTighe, J. Understanding by Design; Association for Supervision and Curriculum Development: Alexandria, VA, USA, 2005.
15. Pellegrino, J.W.; Chudowsky, N.; Glaser, R. Knowing What Students Know: The Science and Design of Educational Assessment; National Academy Press: Washington, DC, USA, 2001. [CrossRef]
16. Thompson, K.; Yonekura, F. Practical guidelines for learning object granularity from one higher education setting. Interdiscip. J. E-Learn. Learn. Objects 2005, 1, 163–179. [CrossRef]
17. Rupp, A.A.; Templin, J. The effects of Q-matrix misspecification on parameter estimates and classification accuracy in the DINA model. Educ. Psychol. Meas. 2008, 68, 78–96. [CrossRef]
18. Kunina-Habenicht, O.; Rupp, A.A.; Wilhelm, O. The impact of model misspecification on estimation accuracy in diagnostic classification models. J. Educ. Meas. 2012, 49, 59–81. [CrossRef]
19. Leighton, J.P.; Gierl, M.J. Verbal reports as data for cognitive diagnostic assessment. In Cognitive Diagnostic Assessment for Education; Leighton, J.P., Gierl, M.J., Eds.; Cambridge University Press: New York, NY, USA, 2007; pp. 146–172. [CrossRef]
20. Tjoe, H.; de la Torre, J. The identification and validation process of proportional reasoning attributes: An application of a cognitive diagnosis modeling framework. Math. Educ. Res. J. 2014, 26, 237–255. [CrossRef]
21. Ma, W.; de la Torre, J. An empirical Q-matrix validation method for the sequential generalized DINA model. Br. J. Math. Stat. Psychol. 2020, 73, 142–163. [CrossRef]
22. Castro Sotos, A.E.; Vanhoof, S.; Van den Noortgate, W.; Onghena, P. Students' misconceptions of statistical inference: A review of the empirical evidence from research on statistics education. Educ. Res. Rev. 2007, 2, 98–113. [CrossRef]
23. Garfield, J.B.; Ben-Zvi, D.; Chance, B.; Medina, E.; Roseth, C.; Zieffler, A. Assessment in statistics education. In Developing Students' Statistical Reasoning; Springer: Berlin, Germany, 2008; pp. 82–114. [CrossRef]
24. Tacoma, S.; Sosnovsky, S.; Boon, P.; Jeuring, J.; Drijvers, P. The interplay between inspectable student models and didactics of statistics. Digit. Exp. Math. Educ. 2018, 4, 139–162. [CrossRef]
25. Cui, Y.; Roduta Roberts, M. Validating student score inferences with person-fit statistic and verbal reports: A person-fit study for cognitive diagnostic assessment. Educ. Meas. Issues Pract. 2013, 32, 34–42. [CrossRef]
26. delMas, R. A comparison of mathematical and statistical reasoning. In The Challenge of Developing Statistical Literacy, Reasoning and Thinking; Ben-Zvi, D., Garfield, J.B., Eds.; Springer: Dordrecht, The Netherlands, 2004; pp. 79–95. [CrossRef]
27. Garfield, J.B.; Ben-Zvi, D. Developing Students' Statistical Reasoning: Connecting Research and Teaching Practice; Springer Science & Business Media: Berlin, Germany, 2008. [CrossRef]
28. delMas, R.; Garfield, J.B.; Ooms, A.; Chance, B. Assessing students' conceptual understanding after a first course in statistics. Stat. Educ. Res. J. 2007, 6, 28–58. [CrossRef]
29. GAISE. Guidelines for Assessment and Instruction in Statistics Education College Report 2016; American Statistical Association: Alexandria, VA, USA, 2016.
30. Madison, M.J.; Bradshaw, L. The effects of Q-matrix design on classification accuracy in the log-linear cognitive diagnosis model. Educ. Psychol. Meas. 2015, 75, 491–511. [CrossRef]
31. Garfield, J.B.; Ben-Zvi, D. Research on statistical literacy, reasoning, and thinking: Issues, challenges, and implications. In The Challenge of Developing Statistical Literacy, Reasoning and Thinking; Ben-Zvi, D., Garfield, J.B., Eds.; Springer: Dordrecht, The Netherlands, 2004; pp. 397–409. [CrossRef]
32. Garfield, J.B.; Chance, B. Assessment in statistics education: Issues and challenges. Math. Think. Learn. 2000, 2, 99–125. [CrossRef]
33. Gal, I.; Garfield, J.B. Curricular goals and assessment challenges in statistics education. In The Assessment Challenge in Statistics Education; Gal, I., Garfield, J.B., Eds.; IOS Press: Amsterdam, The Netherlands, 1997; pp. 1–13.
34. Garfield, J.B. Assessing statistical reasoning. Stat. Educ. Res. J. 2003, 2, 22–38. [CrossRef]
35. Allen, K. The Statistics Concept Inventory: Development and Analysis of a Cognitive Assessment Instrument in Statistics. Ph.D. Dissertation, University of Oklahoma, Norman, OK, USA, 2006.
36. Haladyna, T.M.; Downing, S.M.; Rodriguez, M.C. A review of multiple-choice item-writing guidelines for classroom assessment. Appl. Meas. Educ. 2002, 15, 309–333. [CrossRef]
37. Garfield, J.B.; Franklin, C. Assessment of learning, for learning, and as learning in statistics education. In Teaching Statistics in School Mathematics—Challenges for Teaching and Teacher Education; Batanero, C., Burrill, G., Reading, C., Eds.; Springer: Dordrecht, The Netherlands, 2011; pp. 133–145. [CrossRef]
38. Kirilenko, A.P.; Stepchenkova, S. Inter-coder agreement in one-to-many classification: Fuzzy kappa. PLoS ONE 2016, 11, e0149787. [CrossRef]
39. Ericsson, K.A.; Simon, H.A. How to study thinking in everyday life: Contrasting think-aloud protocols with descriptions and explanations of thinking. Mind Cult. Act. 1998, 5, 178–186. [CrossRef]
40. Gu, Y.; Xu, G. Sufficient and necessary conditions for the identifiability of the Q-matrix. Stat. Sin. 2021, 31, 449–472. [CrossRef]
41. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2019.
42. Chen, W.H.; Thissen, D. Local dependence indexes for item pairs using item response theory. J. Educ. Behav. Stat. 1997, 22, 265–289. [CrossRef]
43. Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 1979, 6, 65–70.
44. de la Torre, J.; Lee, Y.S. Evaluating the Wald test for item-level comparison of saturated and reduced models in cognitive diagnosis. J. Educ. Meas. 2013, 50, 355–373. [CrossRef]
45. Ma, W.; Iaconangelo, C.; de la Torre, J. Model similarity, model selection, and attribute classification. Appl. Psychol. Meas. 2016, 40, 200–217. [CrossRef]
46. Maydeu-Olivares, A.; Joe, H. Limited information goodness-of-fit testing in multidimensional contingency tables. Psychometrika 2006, 71, 713. [CrossRef]
47. Maydeu-Olivares, A.; Joe, H. Assessing approximate fit in categorical data analysis. Multivar. Behav. Res. 2014, 49, 305–328. [CrossRef]
48. Chen, J.; de la Torre, J.; Zhang, Z. Relative and absolute fit evaluation in cognitive diagnosis modeling. J. Educ. Meas. 2013, 50, 123–140. [CrossRef]
49. Wang, W.; Song, L.; Chen, P.; Meng, Y.; Ding, S. Attribute-level and pattern-level classification consistency and accuracy indices for cognitive diagnostic assessment. J. Educ. Meas. 2015, 52, 457–476. [CrossRef]
50. Templin, J.; Bradshaw, L. Measuring the reliability of diagnostic classification model examinee estimates. J. Classif. 2013, 30, 251–275. [CrossRef]
51. Haertel, E.H. Using restricted latent class models to map the skill structure of achievement items. J. Educ. Meas. 1989, 26, 301–321. [CrossRef]
52. Templin, J.; Henson, R.A. Measurement of psychological disorders using cognitive diagnosis models. Psychol. Methods 2006, 11, 287. [CrossRef]
53. de la Torre, J. The generalized DINA model framework. Psychometrika 2011, 76, 179–199. [CrossRef]
54. Maris, E. Estimating multiple classification latent class models. Psychometrika 1999, 64, 187–212. [CrossRef]
55. DiBello, L.V.; Stout, W.F.; Roussos, L.A. Unified cognitive/psychometric diagnostic assessment likelihood-based classification techniques. In Cognitively Diagnostic Assessment; Nichols, P.D., Chipman, S.F., Brennan, R.L., Eds.; Erlbaum: Hillsdale, NJ, USA, 1995; pp. 361–390.
56. Liu, Y.; Tian, W.; Xin, T. An application of the M2 statistic to evaluate the fit of cognitive diagnostic models. J. Educ. Behav. Stat. 2016, 41, 3–26. [CrossRef]
57. Liu, R.; Huggins-Manley, A.C.; Bulut, O. Retrofitting diagnostic classification models to responses from IRT-based assessment forms. Educ. Psychol. Meas. 2018, 78, 357–383. [CrossRef]
58. Bradshaw, L.; Levy, R. Interpreting probabilistic classifications from diagnostic psychometric models. Educ. Meas. Issues Pract. 2019, 38, 79–88. [CrossRef]
59. Aljohani, N.R.; Davis, H.C. Learning analytics and formative assessment to provide immediate detailed feedback using a student centered mobile dashboard.
In Proceedings of the 2013 Seventh International Conference on Next Generation Mobile Apps, Services and Technologies, Prague, Czech Republic, 25–27 September 2013; pp. 262–267. [CrossRef] Appl. Sci. 2022, 12, 4809 18 of 19 60. Verbert, K.; Govaerts, S.; Duval, E.; Santos, J.L.; Van Assche, F.; Parra, G.; Klerkx, J. Learning dashboards: An overview and future research opportunities. Pers. Ubiquitous Comput. 2014, 18, 1499–1514. [CrossRef] 61. Roduta Roberts, M.; Gierl, M.J. Developing score reports for cognitive diagnostic assessments. Educ. Meas. Issues Pract. 2010, 29, 25–38. [CrossRef] 62. de la Torre, J.; Minchen, N. Cognitively diagnostic assessments and the cognitive diagnosis model framework. Psicol. Educ. 2014, 20, 89–97. [CrossRef] 63. Bennett, R.E. Formative assessment: A critical review. Assess. Educ. Princ. Policy Pract. 2011, 18, 5–25. [CrossRef] 64. Gotch, C.M.; Roduta Roberts, M. A review of recent research on individual-level score reports. Educ. Meas. Issues Pract. 2018, 37, 46–54. [CrossRef] 65. Roelle, J.; Berthold, K.; Fries, S. Effects of feedback on learning strategies in learning journals: Learner-expertise matters. In Virtual Learning Environments: Concepts, Methodologies, Tools and Applications; Khosrow-Pour, M., Ed.; IGI Global: Hershey, PA, USA 2012; pp. 710–725. [CrossRef] 66. Clarebout, G.; Horz, H.; Schnotz, W.; Elen, J. The relation between self-regulation and the embedding of support in learning environments. Educ. Technol. Res. Dev. 2010, 58, 573–587. [CrossRef] 67. Leighton, J.P. Rethinking think-alouds: The often-problematic collection of response process data. Appl. Meas. Educ. 2021, 34, 61–74. [CrossRef] 68. DeMars, C.E.; Wise, S.L. Can differential rapid-guessing behavior lead to differential item functioning? Int. J. Test. 2010, 10, 207–229. [CrossRef] 69. Bradshaw, L.; Madison, M.J. Invariance properties for general diagnostic classiﬁcation models. Int. J. Test. 2016, 16, 99–118. [CrossRef] 70. Brinkhuis, M.J.S.; Maris, G. 
Tracking Ability: Deﬁning Trackers for Measuring Educational Progress. In Theoretical and Practical Advances in Computer-Based Educational Measurement; Veldkamp, B.P., Sluijter, C., Eds.; Methodology of Educational Measurement and Assessment; Springer International Publishing: Cham, Switzerland, 2019; chapter 8, pp. 161–173. 71. Kane, M.T. Validating the interpretations and uses of test scores. J. Educ. Meas. 2013, 50, 1–73. [CrossRef] 72. Hopster-den Otter, D.; Wools, S.; Eggen, T.J.; Veldkamp, B.P. A general framework for the validation of embedded formative assessment. J. Educ. Meas. 2019, 56, 715–732. [CrossRef] 73. delMas, R.; Liu, Y. Exploring students’ conceptions of the standard deviation. Stat. Educ. Res. J. 2005, 4, 55–82. [CrossRef] 74. Bakker, A.; Gravemeijer, K.P.E. Learning to reason about distribution. In The Challenge of Developing Statistical Literacy, Reasoning and Thinking; Ben-Zvi, D., Garﬁeld, J.B., Eds.; Springer: Dordrecht, The Netherlands, 2004; pp. 147–168. 75. Konold, C.; Pollatsek, A.; Well, A.; Gagnon, A. Students analyzing data: Research of critical barriers. In Research on the Role of Technology in Teaching and Learning Statistics; Springer: Dordrecht, The Netherlands, 1997; pp. 151–167. 76. Garﬁeld, J.B. The challenge of developing statistical reasoning. J. Stat. Educ. 2002, 10. [CrossRef] 77. Tversky, A.; Kahneman, D. Belief in the law of small numbers. Psychol. Bull. 1971, 76, 105. [CrossRef] 78. Schuyten, G. Statistical thinking in psychology and education. In Proceedings of the 3rd International Conference on Teaching Statistics: Vol. 2. Teaching Statistics Beyond School Level; Vere-Jones, D., Ed.; ISI Publications in Statistical Education: Dunedin, New Zealand, 1991; pp. 486–490. 79. Chance, B.; delMas, R.; Garﬁeld, J.B. Reasoning about sampling distribitions. In The Challenge of Developing Statistical Literacy, Reasoning and Thinking; Ben-Zvi, D., Garﬁeld, J.B., Eds.; Springer: Dordrecht, The Netherlands, 2004; pp. 295–323. 80. 
Lipson, K. The role of computer based technology in developing understanding of the concept of sampling distribution. In Proceedings of the 6th International Conference on Teaching Statistics, Cape Town, South Africa, 7–12 July 2002. 81. Batanero, C.; Tauber, L.M.; Sánchez, V. Students’ reasoning about the normal distribution. In The Challenge of Developing Statistical Literacy, Reasoning and Thinking; Ben-Zvi, D., Garﬁeld, J.B., Eds.; Springer: Dordrecht, The Netherlands, 2004; pp. 257–276. [CrossRef] 82. Bower, K.M. Some Misconceptions about the Normal Distribution; Six Sigma Forum; American Society for Quality: Milwaukee, WI, USA, 2003. 83. Batanero, C.; Godino, J.D.; Vallecillos, A.; Green, D.e.; Holmes, P. Errors and difﬁculties in understanding elementary statistical concepts. Int. J. Math. Educ. Sci. Technol. 1994, 25, 527–547. [CrossRef] 84. Vallecillos, A. Understanding of the logic of hypothesis testing amongst university students. J.-Math.-Didakt. 2000, 21, 101–123. [CrossRef] 85. Falk, R.; Greenbaum, C.W. Signiﬁcance tests die hard: The amazing persistence of a probabilistic misconception. Theory Psychol. 1995, 5, 75–98. [CrossRef] 86. Perezgonzalez, J.D. Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Front. Psychol. 2015, 6, 223. [CrossRef] 87. Haller, H.; Krauss, S. Misinterpretations of signiﬁcance: A problem students share with their teachers. Methods Psychol. Res. 2002, 7, 1–20. 88. Falk, R. Misconceptions of statistical signiﬁcance. J. Struct. Learn. 1986, 9, 83–96. 89. Vallecillos, A.; Batanero, C. Conceptos activados en el contraste de hipótesis estadísticas y su comprensión por estudiantes universitarios [Activated concepts in statistical hypothesis testing and their understanding by university students]. Rech. Didact. Math. 1997, 17, 29–48. Appl. Sci. 2022, 12, 4809 19 of 19 90. Williams, A.M. Students’ understanding of the signiﬁcance level concept. 
In Proceedings of the 5th International Conference on Teaching Statistics, Singapore, 21–26 June 1998; pp. 743–749. 91. Mittag, K.C.; Thompson, B. Research news and Comment: A National Survey of AERA Members’ Perceptions of Statistical Signiﬁcance Tests and Other Statistical Issues. Educ. Res. 2000, 29, 14–20. [CrossRef] 92. Gliner, J.A.; Leech, N.L.; Morgan, G.A. Problems with null hypothesis signiﬁcance testing (NHST): What do the textbooks say? J. Exp. Educ. 2002, 71, 83–92. [CrossRef] 93. Gagnier, J.J.; Morgenstern, H. Misconceptions, misuses, and misinterpretations of p values and signiﬁcance testing. J. Bone Jt. Surg. 2017, 99, 1598–1603. [CrossRef] 94. Cumming, G.; Williams, J.; Fidler, F. Replication and researchers’ understanding of conﬁdence intervals and standard error bars. Underst. Stat. 2004, 3, 299–311. [CrossRef] 95. Fidler, F. Should psychology abandon p-values and teach CIs instead? Evidence-based reforms in statistics education. In Proceedings of the 7th International Conference on Teaching Statistics, Salvador, Brazil, 2–7 July 2006. 96. Hoekstra, R.; Morey, R.D.; Rouder, J.N.; Wagenmakers, E.J. Robust misinterpretation of conﬁdence intervals. Psychon. Bull. Rev. 2014, 21, 1157–1164. [CrossRef] 97. Kalinowski, P. Identifying misconceptions about conﬁdence intervals. In Proceedings of the 8th International Conference on Teaching Statistics, Ljubljana, Slovenia, 11–16 July 2010. 98. Belia, S.; Fidler, F.; Williams, J.; Cumming, G. Researchers misunderstand conﬁdence intervals and standard error bars. Psychol. Methods 2005, 10, 389. [CrossRef]