Updated for Fall 2005.
The purpose of formative assessment is to adjust teaching and learning. For example, a grade one teacher may test a student's decoding skills to determine whether specific phonics instruction is required.
Summative assessments are all forms of assessment that are not formative. For example, a medical student may take medical board exams to determine if she is competent to practice medicine. Another example: a teacher may give a final exam to determine the grade for a course.
From a self-regulated learning perspective, virtually any assessment has the potential to be formative if the student is informed of the results. More detailed feedback increases the formative function of an assessment.
In criterion-referenced assessment, outcomes are presented with respect to the objectives (or curriculum).
In norm-referenced assessment, outcomes are presented in relation to the performance of others.
However, normative information does (and should) drive the design of learning objectives, activities and assessments.
There are three major reference points in assessment: the objective or task (criteria), the abilities of others (norms), the abilities of self.
Self-referenced assessment leads one to compare one's current performance to one's own past performance.
Objective items (e.g., multiple choice) are more difficult to create but easier to score than subjective items (e.g., essay).
Objective items tend to have higher reliability.
Essay items tend to be more valid assessments of writing ability.
Essay items tend to be more valid assessments of ability to produce complex products.
Reliability is the consistency or repeatability of an assessment.
Validity is how well an assessment measures what it is supposed to measure.
Combinations of reliability and validity:
* high reliability and high validity
* low reliability and low validity
* high reliability but low validity (for example, a thermometer that always reads the same temperature gives a highly reliable but invalid measurement)
Reliability is a necessary but not sufficient condition for validity.
Different kinds of reliability (two common estimates are sketched after the list):
* test-retest reliability
* inter-rater reliability
* internal consistency
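Two of these can be made concrete with a short computation: test-retest reliability is commonly estimated as the correlation between two administrations of the same test, and Cronbach's alpha is one common estimate of internal consistency. The sketch below is only meant to illustrate the arithmetic; the scores are made up and the function names are not from any particular package.

    from statistics import mean, pvariance

    def pearson(x, y):
        """Test-retest reliability: correlation between two administrations."""
        mx, my = mean(x), mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
        return cov / (pvariance(x) ** 0.5 * pvariance(y) ** 0.5)

    def cronbach_alpha(students):
        """Internal consistency: 'students' is a list of per-student item-score lists."""
        k = len(students[0])                      # number of items
        by_item = list(zip(*students))            # scores regrouped by item
        totals = [sum(s) for s in students]       # each student's total score
        item_var = sum(pvariance(col) for col in by_item)
        return (k / (k - 1)) * (1 - item_var / pvariance(totals))

    test1 = [12, 15, 9, 18, 14]    # made-up scores, first administration
    test2 = [11, 16, 10, 17, 13]   # made-up scores, second administration
    print(round(pearson(test1, test2), 2))        # test-retest estimate

    answers = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1]]
    print(round(cronbach_alpha(answers), 2))      # internal consistency estimate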
Authentic tests are usually more realistic, more "situated." They have the learner perform the task that is the true target of instruction.
For example, an authentic test of baking a cake would have the student actually bake a cake rather than write down the steps in baking a cake.
Authentic tests tend to be more valid because the test conditions closely match the conditions of real-world performance.
Portfolios have been found to promote learning, but they are difficult to reliably score.
* There is evidence that repeated failure is demotivating (recall learned helplessness).
* There is evidence that unremitting success does not prepare students to deal with failure (Clifford, 1990).
* A mixture of success and failure appears to be highly motivating (Clifford, 1990).
* Learning environments with a high level of inter-individual competition and comparison can be demotivating for lower ability students.
* Intra-individual comparison can be highly motivating.
Dempster (1991) found that testing is an effective way to promote learning.
Feedback that corrects errors and explains how to correctly perform is highly effective for learning.
In giving feedback, answer these questions:
What is the key error?
What is the probable reason the student made the error?
How can I guide the student to avoid the error in the future?
Every assessment (exam, assignment, etc.) is assigned a percent weight.
Factors determining weight:
* proportion of objectives assessed
* importance of assessed objectives
* overlap with other assessments
* validity (and reliability) of assessment
* discrimination index of assessment (one common index is sketched after this list)
* difficulty of assessment
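One common form of the discrimination index compares how often an item is answered correctly by high-scoring students versus low-scoring students (often the top and bottom fraction of the class). A minimal sketch with made-up data:

    def discrimination_index(item_correct, total_scores, fraction=0.27):
        """item_correct: 1/0 per student on one item; total_scores: each student's test total."""
        n = max(1, round(len(total_scores) * fraction))   # size of the top and bottom groups
        order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
        low, high = order[:n], order[-n:]
        p_high = sum(item_correct[i] for i in high) / n
        p_low = sum(item_correct[i] for i in low) / n
        return p_high - p_low                             # values near +1.0 indicate strong discrimination

    totals = [35, 22, 41, 18, 30, 27, 39, 25]   # made-up total test scores
    item = [1, 0, 1, 0, 1, 0, 1, 1]             # made-up right (1) / wrong (0) responses to one item
    print(discrimination_index(item, totals))   # 1.0 for this item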
How will you combine the assessment scores to obtain a total score?
The 100-point system may be the easiest for everyone to understand. In the 100-point system, each point is worth one percent. Each assessment is "out of" the percent weight that is allocated to it. You can simply add a student's scores on each assessment to obtain the student's total score out of 100.
For example, the think paper for this course is assigned a weight of 25%. Each component of the think paper is allocated some points out of 25, e.g., the writing mechanics component is allocated 10 points out of 25.
The main disadvantage of the 100-point system is that, to maintain simplicity, the scoring for an assessment must be consistent with the allocated marks (e.g., writing mechanics must be scored out of 10). This is less flexible and may require scoring with half marks.
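For illustration, a minimal sketch of the bookkeeping; apart from the 25% think paper weight above, the assessments, scores, and letter-grade cutoffs are hypothetical.

    # Hypothetical gradebook under the 100-point system: each assessment is
    # scored "out of" its percent weight, so the total is a simple sum.
    scores = {
        "think paper": (21.5, 25),   # (points earned, points possible = percent weight)
        "final exam":  (33.0, 40),
        "project":     (28.0, 35),
    }

    earned = sum(e for e, _ in scores.values())
    possible = sum(w for _, w in scores.values())
    assert possible == 100           # the weights must sum to 100 in this system

    # Hypothetical letter-grade cutoffs; use whatever scale your institution specifies.
    cutoffs = [(90, "A"), (80, "B"), (70, "C"), (60, "D"), (0, "F")]
    letter = next(g for c, g in cutoffs if earned >= c)
    print(f"{earned}/100 -> {letter}")   # 82.5/100 -> B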
Use a standard system that is familiar to all stakeholders: students, administrators, other institutions.
Provide enough information so that students can calculate their letter grade.
List the learning objectives and select some proportion for assessment. Be aware of which objectives are being assessed and which are not.
An alternative is to use a behavior-content matrix to plan assessments. Using such a matrix can ensure that you have a balanced distribution of test items (or other assessments) over topics and skills.
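For illustration, such a matrix can be kept as a simple grid of planned item counts; the topics, skill categories, and counts below are hypothetical.

    # Hypothetical behavior-content matrix: planned number of test items per
    # (topic, skill) cell. Row and column totals make imbalances easy to spot.
    skills = ["recall", "application", "analysis"]
    matrix = {
        "fractions":   {"recall": 3, "application": 4, "analysis": 1},
        "decimals":    {"recall": 2, "application": 3, "analysis": 1},
        "percentages": {"recall": 2, "application": 2, "analysis": 2},
    }

    for topic, row in matrix.items():
        print(f"{topic:12s}", *(f"{row[s]:3d}" for s in skills), f"| {sum(row.values()):3d}")
    col_totals = [sum(row[s] for row in matrix.values()) for s in skills]
    print(f"{'totals':12s}", *(f"{t:3d}" for t in col_totals), f"| {sum(col_totals):3d}")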
All assessment types (e.g., objective items, subjective items, problem sets, projects) have strengths and weaknesses.
To obtain valid summative evaluations, combine at least two assessment types. For example, assign a project and a test, or an essay and a performance.
Select a task that is highly realistic (authentic), yet practical.
When planning, analyse quality into components. Assign a weight (number of points) to each component.
Provide students with rubrics or specific descriptions for each component. Use these when scoring the students' products.
Be as clear as possible about how the assignment will be scored.
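For illustration, a minimal sketch of component-weighted scoring; apart from the think paper's 25 points and the 10 points for writing mechanics mentioned earlier, the components and scores are hypothetical.

    # Hypothetical rubric for a 25-point assignment, analysed into weighted components.
    rubric = {
        "writing mechanics":   10,   # matches the earlier example: 10 of the 25 points
        "quality of argument": 10,
        "use of sources":       5,
    }
    assert sum(rubric.values()) == 25

    # Scores a marker might assign against each component's descriptors.
    scored = {"writing mechanics": 8, "quality of argument": 7.5, "use of sources": 4}
    total = sum(scored[c] for c in rubric)
    print(f"{total} / {sum(rubric.values())}")   # 19.5 / 25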
Clifford, M. M. (1990). Students need challenge, not easy success. Educational Leadership, 48(1), 22-26.
Dempster, F. N. (1991). Synthesis of research on reviews and tests. Educational Leadership, 48(7), 71-76.