Principles of Assessment
The following recommendations arise from the conclusions reached at a symposium on the postgraduate assessment process within the Colleges of Medicine, held at the College in Rondebosch, Cape Town, on 3rd-4th May, 2003. They have been modified to suit our undergraduate and postgraduate (Registrar) programmes within the Division of Medicine.
Comments pertinent to our current position at the start of 2009 have been appended in blue.
Sound educational/psychometric principles should drive the training and assessment process within our postgraduate programmes
Programmes of training and methods of assessment should be based on sound educational principles and should be audited by validated psychometric methods.
Our assessment methodologies largely meet these objectives. An area of probably unavoidable weakness remains the low number of cases in the clinical assessments; this is mitigated by combining the portfolio exam and clinical assessment scores, which effectively increases n from 3 to 5.
Assessment and Training need to be evidence-based
The scientific study of assessment methods within education is now a 30-year-old discipline. There is a large body of valid evidence proving the value of certain methods of assessment; conversely, there is good evidence to suggest that other methods are unreliable and should be modified or abandoned. Some of our examinations still rely on these less desirable methods, and some of the methods for which scientific support exists are little used.
Our undergraduate educational courses are now fairly well aligned with evidence-based principles.
High quality programmes are organised, supportive and provide feedback to trainees and trainers.
Assessment is but one aspect of the educational process. Ideally assessment should form part of a structured education programme for the trainee, whether undergraduate or postgraduate. Such structure comprises not only what is learnt, but also the provision of a supportive environment and frequent feedback to guide the individual learning of the trainee.
True support, including formative assessment, is made difficult for our undergraduate students because of the short time they spend with us in their blocks. In 2011 we are introducing a system of structured formative assessment at mid-block. At registrar level, mid- and end-block formative assessment, built around but not necessarily restricted to the CMSA portfolio, will be effectively implemented. Both registrars and students will provide feedback on their educational experience via structured evaluation questionnaires.
Every programme should define clearly articulated, accessible and measurable learning outcomes.
A set of learning outcomes (a phrase which has grown out of the less inclusive term learning objectives) should be defined for each programme we offer. Learning outcomes should encompass the attitudes, knowledge, skills and performance attributes required of the candidate. They should be explicitly stated, clearly articulated and comprehensive; they should be made freely available to trainees, trainers and assessors. They should be measurable: in other words, methods should be devised in the assessment process to determine whether or not these learning outcomes have been achieved.
We have a revised, defensible core syllabus, which stresses the importance of effective clinical problem solving rather than the accumulation of factual knowledge.
At registrar level, training has improved now that there are appropriate syllabi and test blueprints for the College Part 1 and Part 2 examinations. We are still not clear as to how our formal tutorials can best complement this. This area will benefit from further exploration.
Learning outcomes should drive both the training and the assessment process
Once learning outcomes have been set and accepted, the training process must be modified to maximise the potential for these outcomes to be reached, and the assessment process designed such that it is able to determine reliably whether these outcomes have indeed been achieved.
Our undergraduate programme largely achieves this. For registrars, we will enforce attendance at a well-planned sequence of tutorials and formative assessments much more strictly. We need to improve the balance between service-driven and educational activities.
The multiple purposes of assessment must be recognised and accommodated.
The assessment process has several purposes, all of which are important in determining whether learning outcomes are reached. Two major purposes of assessment are:
- Formative assessment, which is designed to provide feedback on individual progress to the trainee, trainer and programme manager.
- Summative assessment, which is a judgment call as to whether the trainee should be permitted to proceed to the next level of training or, eventually, to graduate.
Formative assessment currently plays little part in our undergraduate and postgraduate training programmes.
Formative assessment is difficult in our undergraduate courses because of the short time students spend with us in their blocks. In 2011 we are introducing a system of structured formative assessment at mid-block. At registrar level, mid- and end-block formative assessment, built around but not necessarily restricted to the CMSA portfolio, will be effectively implemented.
The assessment process must be psychometrically sound, and therefore credible.
The assessment process represents a sampling of the candidate’s attributes and abilities. As with all sampling processes, the assessment process must be demonstrably reliable (that is, able to discriminate between weak and strong performance) and valid (that is, a true test of the attribute under study). Psychometric evaluation is a scientifically and statistically valid process of measuring the reliability and validity of different assessment methods, and should therefore guide the selection of the methods used for the assessment of trainees. Put differently, assessment methods should be evidence-based.
Our undergraduate assessments are largely acceptable.
Assessment must be sustainable and feasible.
This is self-evident, but critically important. Both economy and efficiency must be considered in determining the format of examinations: this will include the cost of examinations, in the broadest sense, to candidates, examiners and patients who participate in the process.
We have rationalised the number of written examinations as far as possible, and have committed ourselves to offering no more such examinations than are essential as dictated by educational need and class logistics. It has not proved possible to reduce the number of undergraduate clinical exams we offer, and the number has in fact grown with the introduction of the oral portfolio examination in the 3rd to 5th years. A major bottleneck is the provision of clinical examiners, the effect of which is severely exaggerated by the need, for historical reasons, to provide pairs rather than single examiners. In 2009 we need to interrogate the logistics of the clinical exams critically, and consider such innovations as the deployment of registrars and family physicians as co-examiners, the use of additional centres for assessment, and the timing and spacing of assessments.
Criteria are needed to define the passing threshold.
There is a need to define the passing threshold more rigorously by reference to either criterion-referenced or norm-referenced methods.
Our marking guidelines are now criterion-referenced and make explicit recommendations as to the competencies required for passing. Work is needed to standardise the interpretation of these guidelines by different assessors.
Numerical scales may be more reliable than percentages.
There is evidence that global rating scales (such as the seven-point scale) are inherently more meaningful and reliable than marking by percentage, with lower intra- and inter-observer variation. It is not difficult to convert the global rating scale to numeric values which can thereafter be used in calculation. Global rating scales are useful for clinical examinations, portfolio assessments and SEQs.
Our criterion-referenced marking schedules largely satisfy this principle.
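The conversion of a global rating scale to numeric values can be illustrated with a minimal sketch. The linear mapping used here (1 maps to 0 and 7 maps to 100) and the function names are assumptions made for illustration only; they are not a scheme prescribed by the symposium or by our marking guidelines, and a real conversion would need to be anchored to the agreed passing standard.

```python
from statistics import mean

# Assumed bounds of a seven-point global rating scale (illustrative only).
SCALE_MIN, SCALE_MAX = 1, 7


def rating_to_percent(rating: int) -> float:
    """Map a rating on the seven-point scale linearly onto 0-100."""
    if not SCALE_MIN <= rating <= SCALE_MAX:
        raise ValueError(f"rating must be between {SCALE_MIN} and {SCALE_MAX}")
    return 100.0 * (rating - SCALE_MIN) / (SCALE_MAX - SCALE_MIN)


def aggregate(ratings: list[int]) -> float:
    """Average the converted values of several ratings for one candidate."""
    return mean(rating_to_percent(r) for r in ratings)


# Hypothetical ratings from five assessed cases.
candidate_ratings = [5, 4, 6, 4, 5]
candidate_score = aggregate(candidate_ratings)
```

Assessors continue to record only the global rating; the numeric value is derived afterwards and used purely for calculation.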
Methods are less important than the amount of sampling.
This is one of the most important conclusions of the symposium. The assessment process represents a sampling of the candidate’s attributes and abilities. As with all sampling procedures, reliability improves dramatically as the number of observations increases. Increasing the n value will compensate for variations in the ease or difficulty of individual tests and for variations in the behaviour and expectations of individual examiners.
Increasing the n value is much more efficient in improving the overall reliability of assessment than are attempts to do so by training examiners or standardising questions. Currently many aspects of the College examination system are based on the use of a small number of samples of the candidate’s attributes and abilities. “Long case” and written essay questions are obvious examples. The reliability of such “n=1” questions is unacceptably low.
This appears to be an intractable problem, since we are limited by both assessors and patients. We need to explore innovative responses, such as the targeted provision of extra cases to borderline students.
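The effect of sample size on reliability can be illustrated with the Spearman-Brown prophecy formula, a standard psychometric result that is not cited by the symposium itself. The sketch below assumes that all cases are "parallel" (equally reliable and testing the same attribute), and the single-case reliability of 0.25 is a purely hypothetical figure chosen for illustration.

```python
def spearman_brown(single_case_reliability: float, n: int) -> float:
    """Predicted reliability of an assessment built from n parallel cases.

    Spearman-Brown prophecy formula: r_n = n * r / (1 + (n - 1) * r),
    where r is the reliability of a single case.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    r = single_case_reliability
    return n * r / (1 + (n - 1) * r)


# Hypothetical single-case reliability; not a measured value.
ASSUMED_SINGLE_CASE_RELIABILITY = 0.25

# Show how the predicted reliability grows with the number of cases.
for n in (1, 3, 5, 8, 20):
    predicted = spearman_brown(ASSUMED_SINGLE_CASE_RELIABILITY, n)
    print(f"n = {n:2d}: predicted reliability = {predicted:.2f}")
```

Under these assumptions the gain from each additional case is largest when n is small, which is why moving away from "n=1" formats matters so much.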
The confounding effects of language on the assessment of performance require study.
There is concern that candidates whose home language is not that in which the assessments are conducted may be disadvantaged. The confounding effects of this on our examinations (oral, clinical and written) have not been studied. Analysis, discussion and if necessary corrective measures are appropriate.
Do we have a policy on this (e.g. extra time for students whose home language is not English)?
Language equity is a further issue.
Furthermore, the use of one of many official languages within a multilingual society is an issue which requires debate and a clear and defensible policy.
Formative assessments have a value which is currently under-exploited.
The value of these has been discussed above. There are many opportunities to introduce a programme of formative assessment in a creative and imaginative way into the postgraduate education programmes offered to our trainees. Formative assessment is educationally sound. Furthermore, because it allows the trainee and the trainer to evaluate the trainee’s progress against the learning outcomes set for the programme, fewer under-prepared candidates will present themselves for College examinations.
Currently there is no formal process of formative assessment or even of feedback in place for either undergraduates or our registrars. We need to introduce this in 2009.
Formative assessment should be encouraged, and may eventually become mandatory, within our specialist training programmes.
In order to maintain the high standards of graduates of the Colleges, to improve the preparedness of candidates presenting themselves for examinations, and thus to reduce the current failure rate in such examinations, the Colleges should encourage the development of formative assessment programmes and assist with their implementation in all training institutions. They may eventually make compliance with a programme of formative assessment a necessary precondition for taking final examinations.
Our Division needs to formulate a policy on this. Our DVC has mandated this for our MMed programmes, and we will need to incorporate this into our practice during 2009.
Extend summative assessment into the in-course evaluation period.
Currently the College makes little use of summative assessment within the training programme (i.e., assessments which contribute directly to passing or failure). Such assessment has several advantages: it potentially increases the amount of sampling possible (it is for instance simpler to assess the candidate’s performance on a large number of clinical cases over a period of several months rather than all on one morning); it provides feedback to the trainee which may result in modification of learning and greater success on subsequent occasions; and it potentially places less stress on the candidate.
There are several methods of summative assessment which might be introduced into the period of training, including the use of a reflective portfolio of clinical learning, clinical work sampling and multiple short case assessments. The current College examination structure of once-off examinations held twice yearly does not readily allow for such ongoing summative assessment, but creative solutions are possible and should be explored.
We should explore the introduction of such methodology into our own registrar training programme, initially for formative purposes, but eventually integrating it into a College process, which is likely to emerge in time.
There is no doubt that the single biggest improvement we could make in our registrar education programme would be the introduction of regular, solid assessment, both formative and summative. This is in line with the adage: Assessment drives learning.
The primary/final process should be reviewed, and fragmentation avoided in the light of the drive to integrated learning programmes
It is an axiom in medical education that assessment drives learning. Explicitly directing primary and final examinations to different areas within the learning outcomes drives learning in different directions. Thus directing a primary examination towards the basic sciences and a final examination to clinical practice sends a signal to the trainee that these different aspects can be divorced, one being more appropriate to students, the other to practitioners. The most important value of the earlier examinations would appear to be their potential for guiding the trainee’s development incrementally through the learning programme (in which case some of their value is formative). A subsidiary value would be in detecting those candidates who are not suited to further study and should be advised to seek a different career path. For both these purposes, the earlier examinations should assess the same set of learning outcomes as the exit examination, albeit at a lower level.
For undergraduates, we have now successfully aligned the learning outcomes of third, fourth and fifth years such that effectively they are all the same, share a core syllabus and are therefore assessed similarly.
We have only just begun to explore, in any meaningful way, the systematic development of our registrars towards specific outcomes and how we tie the assessment process into this. As an important first step, however, our formal educational programmes for our registrars are being redesigned in order to allow for different programmes and modes of instruction at different stages of the registrars’ progression through the four years of training.
Collaboration between training institutions such as universities can simplify the logistic implications of introducing formative assessment to the training programmes.
Examples will include the dissemination of theory questions for formative assessment by one party to all other training institutions. Similarly, institutions geographically within the same region might exchange staff at intervals through the year to assist with the assessment of each other’s trainees.
There is much that we can achieve by sharing learning, teaching and assessment resources with our sister medical schools. We need to take on a leadership role in this, e.g. by creating the structure for a common SA-wide examination in Medicine, which would relieve 8 departments around the country of having to devise and set their own questions repeatedly. We made a start in this with the national seminar we offered in MCQ assessment, and our involvement in the development of a National MCQ bank for Medicine.
In 2009 we intend to host a national seminar on clinical learning at UKZN.
Methods of formative and summative assessment should be aligned.
It is a basic principle that candidates should be familiar and comfortable with the assessment methods employed in all summative assessments. Stated colloquially, the candidates should not meet an assessment format for the first time in a high-stakes examination. It is therefore necessary that assessment methods used in the formative and summative assessments are aligned.
Need for a large number of samples.
Using a statistically valid sample size in sampling the work of any individual candidate is fundamental to the reliability of the assessment process. All methods of assessment should therefore move to formats capable of sampling widely rather than narrowly. As a rule of thumb, sample size usually has an absolute minimum of 8-9, but may extend as high as 20. The formats currently used therefore require review and improvement, particularly the use of long and short clinical cases, the essay and SEQs. There is a strong case for believing that both the clinical and the written formats would benefit from sampling a wider range of conditions or topics in less time.
In selecting assessment methods, it is necessary to move beyond the question: Does this method assess some aspect of the learning outcomes? to the question: Does this method assess that aspect reliably?
Need for a spread of examiners
A spread of examiners is as important as a spread of case material or, in theoretical examinations, topics; this will reduce the potential for small-sample bias. There is evidence that having more examiners examine singly is of greater benefit than having the same number examine fewer instances in pairs.
This is a major point of debate in our department, which we need to explore further. Little progress has been made thus far, though in 2009 we will no longer have two of the three cases in final year assessed by a single pair of examiners. One consideration is a policy of submission of individual marks by pairs of examiners, rather than a consensus mark. This increases n from 3 to 6. It will result in a reduction in observer bias (tough vs gentle examiners), though not in case bias (lucky vs unlucky cases, hard vs easy cases).
There is a moral and legal obligation to verify the technical skills of our graduates.
It is insufficient to assess candidates purely on factual content and performance with clinical cases in examinations. A major aspect of their subsequent work (including a strong potential for both harm and for adverse medico-legal consequences) will comprise technical skills such as operating ability, resuscitation skills, endoscopy skills and many others. An assessment of the ability to perform such skills should be drawn into the formal assessment process. The assessment process should be qualitative (the demonstration that skills are present) and not just quantitative (logbook evidence that the procedures have been carried out).
With regard to registrars, we need to engage with the Skills Lab on this issue.
Assessment should be contextualised and authentic.
As far as possible, the assessment should simulate the actual circumstances in which the candidate will practise. This suggests a potentially important role for clinical work sampling (in the trainee’s own work environment) and for in-course formative and summative assessment.
There is potential here for both our undergraduate and our postgraduate programmes. Note that when we talk about in-course assessment, we refer to reliable assessment: not a single tutor’s or supervisor’s mark, but the aggregate of multiple marks from multiple observers over a period of time.
Trapdoor techniques are invalid, unfair and should not be used.
These are restrictive rules mandating failure as a result of poor performance in single questions, cases or stations. Allowing trapdoor questions immediately reduces sample size to n = 1, thus making the whole examination unreliable. Furthermore, there is psychometric evidence to show that subsequent performance in practice is adequately and reliably predicted by average scores over all components of an examination, without the inclusion of trapdoor questions. This does not negate the possibility of introducing a system of weighting into the assessment process.
We need to interrogate the defensibility of our subminima in this light.
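The difference between a trapdoor rule and a decision based on the average over all components can be illustrated with a minimal sketch. The pass mark of 50, the subminimum of 40, the station scores and the function names are all hypothetical values chosen for illustration; they do not describe our current rules.

```python
from statistics import mean

# Assumed thresholds, for illustration only.
PASS_MARK = 50
SUBMINIMUM = 40


def passes_on_average(scores: list[float], pass_mark: float = PASS_MARK) -> bool:
    """Compensatory decision: only the average over all components counts."""
    return mean(scores) >= pass_mark


def passes_with_trapdoor(
    scores: list[float],
    pass_mark: float = PASS_MARK,
    subminimum: float = SUBMINIMUM,
) -> bool:
    """Trapdoor decision: any single component below the subminimum fails the candidate."""
    return mean(scores) >= pass_mark and min(scores) >= subminimum


# Hypothetical candidate: strong in seven stations, weak in one.
station_scores = [72, 68, 35, 70, 66, 74, 69, 71]

average_decision = passes_on_average(station_scores)      # based on all eight stations
trapdoor_decision = passes_with_trapdoor(station_scores)  # decided by one station
```

In this example the candidate passes comfortably on the average of eight stations but fails under the trapdoor rule on the strength of a single observation, which is the n = 1 problem described above.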
There is opportunity for greater use of clinical work sampling, the mini-clinical examination (mini-CEX), MCQs, ultra-short written answers and clinical reasoning exercises.
There is strong evidence to show that a small number of assessment formats are inherently more reliable and predictive than other, more traditional methods which may currently enjoy a greater profile. Among these are different formats of MCQ, ultra-short written answers, clinical reasoning exercises, mini-CEX and OSCE examinations.
We need to develop an interest in assessment as a field of study in its own right.