DSM: New assessment test: More results of experiment.

Arie Dirkzwager (aried@xs4all.nl)
Wed, 21 Oct 1998 11:59:56 +0000

Dear List members,
        After my first report I got an additional 46 responses from 24
respondents. I thought they could make a difference so here is my revised
report. At the end I summarise my conclusions in a few statements - I do
hope you will let me know how far you agree with them and correct me where
you do not.

Here is the revised (final) report:
        Thanks to your major cooperation I got 244 qualified responses from
128 respondents. I think there is not yet sufficient consensus on my two
questions (partly caused by my poor formulation, partly by ignorance of the
respondents, partly because specialists in the field of psychometrics
disagreed strongly) so I think further scientific discussion and
experimentation are needed to evaluate Multiple Evaluation and compare it to
Multiple Choice.
Here follows the more detailed report on the experiment:

Title: Multiple Evaluation Evaluated Experimentally.
1. The experiment.
        By way of e-mail discussion lists several thousands (3000?) of
people all over the world interested or specialised in assessment methods
were asked to respond to two questions regarding "Multiple Evaluation" with
a short instruction on how to reply and pick one of seven possible responses
which would be "scored" according to a Multiple Evaluation scoring rule.
They were instructed to take it as a two question test on which they should
try to get as high a score as possible given their knowledge expressed as
the probability to answer correctly. Their responses implied an estimate of
this probability (that is the essential part of Multiple Evaluation).
        The questions asked were:
        1. "Multiple Evaluation should be preferred to Multiple Choice as an
assessment method.", is this statement true?
        2. "Persons tested are not able to self-assess their probability of
being right.", is this statement true?
        The response requested was "YES" or "NO" with an indication of the
probability interval in which the subjectively estimated probability of
being "correct" fell. The scores to expect for each response, computed
according to the Multiple Evaluation scoring formula, were also shown.
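
The message does not give the Multiple Evaluation scoring formula itself.
Purely as an illustration of how such a rule can reward realistic
probabilities, the sketch below ASSUMES a logarithmic rule, score = 1 +
log2(p), where p is the probability the respondent gave to the answer that
turns out to be correct. The function names are hypothetical, and the rule
actually used in the experiment may differ.

```python
import math

def me_score(p_correct: float) -> float:
    """ASSUMED logarithmic rule: 1 + log2(probability given to the correct answer)."""
    return 1.0 + math.log2(p_correct)

def expected_score(true_p: float, reported_p: float) -> float:
    """Expected score when the answer is right with probability true_p
    but the respondent reports reported_p."""
    return (true_p * me_score(reported_p)
            + (1.0 - true_p) * me_score(1.0 - reported_p))

# A respondent who is really right 70% of the time, reporting three different values:
for reported in (0.50, 0.70, 0.95):
    print(f"reported {reported:.2f}: expected score {expected_score(0.70, reported):+.3f}")
```

Under this assumed rule a report of 50% scores zero whatever the key, and of
the three reports tried the honest 70% gives the highest expected score, while
the overconfident 95% gives a negative one.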

2. The results and conclusions.
        I got answers from a (self-selected) sample of 128 respondents (it was
not exactly known how large the population was, but I assume it was a sample
representative for the population of people really interested in the topic).
Only 14 times was one of the questions left unanswered. I reckon the
non-respondents answered none of the questions and thus get a score of zero
on this "test".
        For 96 of the 240 answers the respondent estimated that her (his)
probability of being correct was larger than 95% - I assume they have good
arguments for their response (YES or NO) and consider them as "specialists"
on the question at hand.
        For only 47 of the 240 answers the respondent estimated that her
(his) probability of being correct was lower than 87% - I assume they were
in doubt about their answer and will consider them as "students" of the
topic the question was about.
        In case of 98 out of 240 answers the respondent estimated his
probability of being correct between 87% and 95% - in the sequel I will call
them "teachers" as I assume they are quite knowledgeable, however not
necessarily "specialists".
        My first hypothesis is that the questions are closely related in the
sense that being a "specialist", "teacher", or "student" on one
question correlates highly with falling in the same category on the other
question. The following table confirms this hypothesis, especially in the
case of specialists and teachers:

                  Q2: specialist  teacher  student  no answer
Q1: specialist:           31         5        8         3
    teacher:              10        29       10         7
    student:               5         9        9         0
    no answer:             3         1        0         0
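
As a rough check of the association in this table, one could run a Pearson
chi-squared test of independence on the 3x3 part that leaves out "no
answer". This is only a sketch: dropping the "no answer" cells and treating
the responses as independent are assumptions made here, not part of the
original analysis.

```python
import math

# Counts from the table above, leaving out the "no answer" row and column.
# Rows: Q1 specialist, teacher, student; columns: the same categories for Q2.
table = [
    [31, 5, 8],
    [10, 29, 10],
    [5, 9, 9],
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected

# For 4 degrees of freedom the chi-squared tail probability has a closed form.
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)
# Share of respondents who fall in the same category on both questions.
same_category = sum(table[i][i] for i in range(3)) / n
print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}, same category on both = {same_category:.0%}")
```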

        In education the prime purpose of testing is assessment of the
students' knowledge and performance. This is operationally defined as their
mean probability of answering an item correctly, estimated as their
percentage "correct". My second hypothesis is that the "students" do no
better than chance: each response is then correct with probability
p(correct)=.50 (variance .25 per response) and the number correct is
approximately normally distributed. The experiment shows that 32 out of 47
student responses were correct: 68%. Under the hypothesis p=.50 the variance
of the number correct is 47*.25=11.75, so the standard deviation of the
proportion correct is about .07. We have to reject the hypothesis as the
observed deviation of 18% is about two and a half standard deviations.
Students have some correct information and thus do better than chance.
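
The check against chance can be recomputed from the counts quoted above. The
sketch below uses the normal approximation to the binomial and assumes the 47
"student" responses are independent; the variable names are only
illustrative.

```python
import math

n = 47            # "student" responses
correct = 32      # of which scored right
p0 = 0.50         # chance level under the hypothesis

observed = correct / n
sd_proportion = math.sqrt(p0 * (1 - p0) / n)   # standard deviation of the proportion
z = (observed - p0) / sd_proportion            # deviation in standard deviations
# Two-sided tail probability under the normal approximation.
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"observed = {observed:.2f}, sd = {sd_proportion:.3f}, z = {z:.2f}, p = {p_value:.3f}")
```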
        The third hypothesis is that "teachers" do better than "students".
This can be tested with the following table:
                          right  wrong
"teacher" responses:        61     37    62% right
"student" responses:        32     15    68% right

        The hypothesis that "teachers" do better than "students" should be
rejected, but a chi-squared test shows we can't say "students" do better: the
difference is not significant. The conclusion should be that "teachers"
should more often consider themselves "students": they also quite often (in
about 40% of all cases) have a wrong opinion.
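
The chi-squared test on the "teacher"/"student" table can be sketched as
follows. The message does not say which variant of the test was used; this
version is the plain Pearson statistic without continuity correction, and the
helper name chi2_2x2 is hypothetical.

```python
import math

def chi2_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-squared (no continuity correction) for the table [[a, b], [c, d]].
    Returns the statistic and its tail probability for 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# "teacher" responses: 61 right, 37 wrong; "student" responses: 32 right, 15 wrong.
chi2, p_value = chi2_2x2(61, 37, 32, 15)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2f}")
```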

        This makes our fourth hypothesis more interesting: "specialists" do
best. The following table shows the results:
                                            right  wrong
"specialist" responses:                       46     50    48% right
other responses ("teacher" or "student"):     93     52    64% right

        Our hypothesis should be rejected: "specialists" do significantly
*worse*, no better than chance!
        This raises serious doubts about whether the key used for scoring was
correct: on question 1, 24 of the 47 "specialists" disagreed with the key
"YES" and on question 2, 26 out of 49 "specialists" disagreed with the key
"NO". This key was set by the test author - he might be mistaken, so let's
see what happens when we change the key according to "specialist" opinions.
This revised key reverses "right" and "wrong" in the table above but still a
large percentage (48%) of those "specialist" responses is wrong while they
themselves reported that their probability of being wrong would be smaller
than 5%.
Those "specialists" pretend to be very probably right even when they are
wrong. In another formulation: many uninformed people are still very sure of
their own judgement and pose as "specialist" while they are *not*. I think
it is a task of proper education to teach those people to be more realistic
regarding their informedness. Multiple Evaluation does this as a side effect,
as overconfidence is punished and realism rewarded. Moreover, Multiple
Evaluation also measures this realism of the self-assessment, provided a test
with more than twenty items is used, and thus enables us to pinpoint the
*real* specialists who are 95% right when they report so.
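
How far the "specialist" responses fall short of their own claim can be
illustrated with a binomial tail probability. The sketch assumes the 96
responses are independent and that each is right with probability exactly
.95, the lower bound of what was reported; it uses 50 right out of 96, the
more favourable of the two keys according to the table above.

```python
import math

def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n = 96          # "specialist" responses
right = 50      # right under the key that favours the "specialists" most
claimed = 0.95  # lower bound of the probability they reported

tail = binomial_cdf(right, n, claimed)
print(f"hit rate = {right / n:.0%}, P(at most {right} right | p = {claimed}) = {tail:.1e}")
```

A very small tail probability means that a hit rate this low is practically
incompatible with the reported 95%, whichever key is taken to be correct.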
        However, whatever the right answer might be to those test questions,
we still may assume those "specialists" have strong arguments, but they are
still arguing from opposite positions, with opposing arguments. I think they
(we?) should argue and discuss their (our?) arguments theoretically and with
empirical research in an effort to reach consensus on a scientific basis. In
such a discussion the real specialists and those who are studying to become
specialists should speak up for the advancement of science and of
scientifically founded decisions.
Please give your judgement on these conclusions by quoting the part
underneath this line, filling in your personal probability p(true) for
each of the following statements, where your p(false)=1-p(true), and send
that reply to me, aried@xs4all.nl (NOT to the whole list!).

(1) Many "specialists" think or pretend to be very probably right even when
they are wrong.
        p(true)= %

(2) In our culture and its education many uninformed people are still very
sure of their own judgement. They are too self-assured.
        p(true)= %

(3) When overconfidence is punished and realism rewarded in education people
learn to be more realistic in their judgements.
        p(true)= %

(4) Realistic "specialists" have strong arguments, but they may still be
arguing from opposite positions, with opposing arguments. In a COLTS
(Cooperative Learning and Teaching System) they should be invited to give
their arguments to be judged.
        p(true)= %

(5) In a COLTS consensus can be reached in not too much time.
        p(true)= %

Thanks in advance for your reply!
Kind regards,

Educational Instrumentation Technology,
Computers in Education.
Huizerweg 62,
1402 AE Bussum,
The Netherlands.
voice: x31-35-6981676
FAX: x31-35-6930762
E-mail: aried@xs4all.nl

"When reading the works of an important thinker, look first for the
apparent absurdities in the text and ask yourself how a sensible person
could have written them." T. S. Kuhn, The Essential Tension (1977).
Accept that some days you are the statue, and some days you are the bird.
