Two Methods for Measuring Question Difficulty and Discrimination in Incomplete Crowdsourced Data

Sarah K. K. Luger
Institute for Language, Cognition and Computation, The University of Edinburgh, Edinburgh, UK. s.k.k.luger@sms.ed.ac.uk

Jeff Bowles
The Department of Computer Science, The University of New Mexico, Albuquerque, New Mexico. jbowles@cs.unm.edu
Abstract
Assistance in creating high-quality exams would be welcomed by educators who do not have direct access to the proprietary data and methods used by educational testing companies. The current approach for measuring question difficulty relies on models of how good pupils will perform and contrasts that with their lower-performing peers. Inverting this process and allowing educators to test their questions before students answer them will speed up question development and utility. We cover two methods for automatically judging the difficulty and discriminating power of MCQs and how best to build sufficient exams from good questions.
Introduction
We present two methodologies for creating quality exams which filter out the least discriminating questions in an exam. The first approach analyzes the best balance of students and questions based on creating a more dense matrix of those students and questions. The second approach initially analyzes the questions' difficulty to find the best new exam set that includes the most discriminating questions. Creating viable human performance data or new, virtual exams out of sets of questions allows downstream analysis of the difficulty of the questions that would not be otherwise available.
Crowdsourcing MCQs
We have procured data for sets of MCQs from University-level introductory biology classes using the PeerWise question creation system (Denny 2009). PeerWise is a free, web-based question repository that allows classes to set up shared environments where students create questions that are subsequently used by their peers as a study aid. Instructors can review the questions or use some of the better questions for future exams. Because answering these questions is not compulsory, the resulting data can be sparse, with sets of questions that have been answered by some students, but not all of the questions have been answered by the same students. There are other similar datasets in the education domain (Pooja 2013), and many crowdsourcing tasks have multiple-choice elements (Munro, Erle, and Schnoebelen 2013).

Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Item Analysis
To measure the usefulness of exam questions, researchers have devised methods for judging both the difficulty of the question and the differentiation power of the answer options (Patz and Junker 1999) and (Béguin and Glas 2001). One such approach is Item Analysis Theory (Gronlund 1981). Once a cohort (for this example, 100 students) has taken a test containing suitable questions, the exams are graded and ranked from highest score to lowest. The set of 100 students is split into three groups that represent the top-scoring, middle-scoring, and lowest-scoring students. These three groups are commonly split lower 27%, middle 46%, upper 27%. The middle set of (46) exams is excluded because they confer no meaningful information. Item Difficulty is measured by the percentage of students who answered a question correctly. Item Discriminating Power is the difference between the number of high-scoring students versus the number of low-scoring students who chose the same answer option.
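The two item-analysis statistics above can be sketched in a few lines. This is our illustration, not a testing company's implementation: the 0/1 response matrix and the helper name `item_analysis` are ours, with only the 27% band sizes taken from the description.

```python
# Sketch of classical item analysis on a graded cohort.
# responses[s][q] is 1 if student s answered question q correctly, else 0.

def item_analysis(responses, question, lower=0.27, upper=0.27):
    """Return (difficulty, discriminating_power) for one question."""
    # Rank students by total score, highest first.
    ranked = sorted(responses, key=lambda row: sum(row), reverse=True)
    n = len(ranked)
    high = ranked[:round(n * upper)]
    low = ranked[n - round(n * lower):]

    # Item Difficulty: fraction of all students answering correctly.
    difficulty = sum(row[question] for row in ranked) / n

    # Item Discriminating Power: correct answers in the high group
    # minus correct answers in the low group (middle band excluded).
    discrimination = (sum(row[question] for row in high)
                      - sum(row[question] for row in low))
    return difficulty, discrimination
```

On a complete response matrix this reproduces the ranking-and-banding procedure directly; the sparse, crowdsourced setting the paper targets is exactly where this naive version breaks down.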
Our Clique-based Methodology
Our approach for representing the individual student question-answering relationship is with a graph: an "exam", where every student answers every question, would be a complete bipartite graph (or biclique). We are seeking a good set that is similar to an exam.

To sort and build the covariance matrices we collect the data in triples of student ID, question ID, and answer choice. Then the students are ordered by the number of questions they answered, and the incidence matrix M is built with students listed as rows and the questions as columns. We compute S = M × M^T and Q = M^T × M. We can find the most correlated students by computing the vector s by summing over the rows of S; thus s_j = Σ_i S_ij. We can then sort the rows and columns of S based on the ordering of s, as S is symmetric. As above, we can find the most correlated questions by computing the vector q_j = Σ_i Q_ij. We can then sort the rows and columns of Q based on the ordering of q.

This sorting process provides a sound heuristic for selecting highly correlated students and questions. We then selected the top 15% most correlated students and the top 15% most correlated questions from the dense group of students who have answered the same questions, based on the adjacency methodology. This presents a realistic exam where there are a few omitted questions.

Human Computation and Crowdsourcing: Works in Progress and Demonstration Abstracts. AAAI Technical Report CR-13-01.

Method:                Clique   Clique      W      W
Course:                     1        2      1      2
Total no. S:             1055      887    886    807
Total no. Q:              148      132    148    132
% top correlated S:      0.15     0.15    N/A    N/A
% top correlated Q:      0.25     0.25    N/A    N/A
Omissions:                YES      YES    N/A    N/A
New no. S:                158      133    886    807
Initial exam size:         37       32    148    132
New exam size:             26       20     26     20
Cohort movement:
  Low to middle:            4        5     43     25
  Low to high:              0        0     29     24
  Middle to low:            4        5    125    121
  Middle to high:           5        8     70     72
  High to low:              0        0     22     27
  High to middle:           5        8    105    106
  Numerical total:         18       26    393    375
  % total:               0.11     0.20   0.44   0.46

Figure 1: The weighted (W) method eliminated students who answered fewer than 4 questions.
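The sorting heuristic can be sketched as follows. This is a minimal pure-Python illustration, assuming a 0/1 incidence matrix; the helper names `transpose`, `matmul`, and `correlation_order` are ours.

```python
# Sketch of the clique-based sorting heuristic on a toy incidence matrix.
# M[i][j] is 1 if student i answered question j, else 0.

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def correlation_order(M):
    """Return student and question indices sorted most-correlated first."""
    S = matmul(M, transpose(M))        # S = M x M^T: student-student counts
    Q = matmul(transpose(M), M)        # Q = M^T x M: question-question counts
    s = [sum(col) for col in zip(*S)]  # s_j = sum_i S_ij
    q = [sum(col) for col in zip(*Q)]  # q_j = sum_i Q_ij
    # S and Q are symmetric, so the same ordering applies to their
    # rows and columns alike.
    students = sorted(range(len(s)), key=lambda j: s[j], reverse=True)
    questions = sorted(range(len(q)), key=lambda j: q[j], reverse=True)
    return students, questions
```

Taking the top slice of each returned ordering (15% of students, as in the text) yields the dense submatrix that stands in for a conventional exam.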
Our Weighting-based Methodology
A "weight" vector w is created where each element of the vector is the weight for a question. The questions are weighted based on the number of times a question was answered correctly. Weights are normalized, in the range [0, 1]. A question with weight 0 is a question that was never answered correctly by any student, and a weight of 1 is given to a question that was always answered correctly. The components of the weight vector are calculated as:

w(x) = (Σ_{i=0}^{n} c(i, x)) / n(x)

where x is the position in the vector w, n(x) is the number of answers to question x, and c(i, x) is the correctness of student i's answer to question x. Values for c(i, x) are 1 if the answer is correct, 0 if wrong. Weights closer to 0 correspond to very difficult questions and weights closer to 1 correspond to very easy questions. The goal is to find the middle band of discriminating questions. In general, the questions are of moderate to easy difficulty. A few of the hard questions were answered correctly by about 1 in 5 students, but the majority of questions were answered correctly by more than 1 in 2 students.
Results
When questions with low and high weights were removed from the list to find exam sizes that were the same as the clique-based methodology, we found that 44% and 46% of the students were scored so significantly differently that they would be moved into different cohorts. As a comparison, in the clique-based method only 11% and 20% of the students moved into a different cohort. This indicates that performing analysis based on question weights is not an attractive method for finding the most discriminating questions. Question weighting was viewed as a simple alternate method, but it appears that this analysis does not take enough contextual data into account to discover the most discriminating questions.
Conclusion and Future Research
We have addressed the difficult and resource-expensive task of making MCQ exams by considering crowdsourced, incomplete data. We demonstrated two sets of algorithms that identified appropriate MCQs and analyzed them to determine both their difficulty and discrimination. There are multiple methods for analyzing data in sparse matrices. We look forward to using additional machine learning approaches to dealing with sparse data and including supplementary data sets.

Finding algorithms that allow meaningful analysis of incomplete data sets has great benefit in crowdsourced data collection. Many problems that do not originally appear to be MCQs behave like them, such as recent efforts to crowdsource relief and recovery efforts after Hurricane Sandy (Munro, Erle, and Schnoebelen 2013). Ranking photos of the Sandy devastation could be viewed as a 3-distractor MCQ.
References
Béguin, A. A., and Glas, C. 2001. MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, Vol. 66, No. 4, pp. 541-562.

Denny, P. 2009. PeerWise. http://peerwise.cs.auckland.ac.nz/.

Gronlund, N. E. 1981. Measurement and Evaluation in Teaching. Macmillan, 4th edition.

Munro, R.; Erle, S. E.; and Schnoebelen, T. 2013. Analysis After Action Report for the Crowdsourced Aerial Imagery Assessment Following Hurricane Sandy. In 10th International Conference on Information Systems for Crisis Response and Management, Baden-Baden, Germany.

Patz, R. J., and Junker, B. W. 1999. Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, Vol. 24, No. 4 (Winter, 1999), pp. 342-366.

Pooja, S. 2013. Piazza. http://www.piazza.com.