2010 Global: UNICEF Global Evaluation Report Oversight System (GEROS): Quality Review of Evaluation Reports 2009
Author: Joseph Barnes, Hatty Dinsmore, Sadie Watson [IOD PARC]
UNICEF Evaluation Office (EO) has put in place a Global Evaluation Report Oversight System to monitor the impact of efforts to strengthen the UNICEF evaluation function globally.
The main purpose of this quality review process is to provide decision makers in UNICEF with information about evaluation reports that better supports using and improving the knowledge generated by the evaluation function. It seeks to go beyond raising awareness of quality issues, and demonstrate the implications of trends in evaluation quality on the usability of knowledge in the pursuit of delivering results for children and women.
This quality review process covered all evaluation reports submitted to the UNICEF Global Evaluation Report Oversight System for 2009. Reviews, research and other types of reports were excluded. The quality review tool assesses the evaluation report as a standalone document. The standards against which evaluation reports are assessed are set by the UNICEF deployment of the United Nations Evaluation Group (UNEG) global evaluation report standards.
The overall objective was to assess and rate the quality of evaluation reports commissioned by UNICEF in 2009 using the UNEG/UNICEF Evaluation Report Standards. Specific objectives included:
To review and rate (with justifications) the quality of the main elements of evaluation reports, including structure, context, purpose, methodology, findings, conclusions, recommendations and lessons learned;
To provide constructive feedback for evaluation commissioners to improve future evaluations;
To provide a global analysis of key trends, strengths, weaknesses, and lessons of UNICEF evaluation reports; and
To provide actionable conclusions and recommendations to improve the quality oversight system and the systemic quality of the evaluation function.
Evaluation reports were initially classified according to the UNICEF evaluation typology. This allowed analysis according to various evaluation report characteristics. Each review was undertaken by an evaluation expert familiar with previous meta-evaluations of UNICEF evaluation report quality. All reviewers participated in a co-design workshop that enabled a common understanding of the standards to be reached. Three additional levels of quality assurance were applied.
The review tool primarily adopted a qualitative approach to rating evaluation reports against the overall standard of confidence. It pursued a systematic process of aggregating qualitative ratings of 58 guiding questions about different aspects of an evaluation report into six sections, and then into a final overall assessment. The six sections of the review tool were: 1/ Object of the evaluation; 2/ Purpose, objectives and scope; 3/ Evaluation methodology, gender, human rights and equity; 4/ Findings and conclusions; 5/ Recommendations and lessons learned; 6/ Report is well structured, logical and clear; 7/ (Plus additional information and an overall reaction).
This qualitative approach was designed to enable reviewers to provide useful analysis across the range of evaluation contexts encountered, and constructive feedback to improve future evaluation reports. Each question, each section and the overall report were given a rating of "very confident", "confident", "almost confident" or "no confidence" (where relevant, an N/A option was also provided). Each rating was informed by three factors: a prompting question in the review tool; a "confidence-to-act" test; and any ratings in the level of analysis below.
In addition to ratings, commentary was provided against each rating, suggestions for future improvement provided for each section, and executive feedback provided for each section and the overall report. The complete review process generates three types of data: a report typology, a series of ratings, and a structured set of discussion text.
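The layered rating structure described above can be pictured as a small data model. The following is a minimal sketch only: the class names, field names and sample question texts are hypothetical, and the tally is merely one input a reviewer might consult. In the review itself, section and overall ratings were qualitative judgements, not computed values.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Rating(Enum):
    """The four-point confidence scale named in the text, plus N/A."""
    VERY_CONFIDENT = "very confident"
    CONFIDENT = "confident"
    ALMOST_CONFIDENT = "almost confident"
    NO_CONFIDENCE = "no confidence"
    NOT_APPLICABLE = "N/A"


@dataclass
class RatedItem:
    """A rating plus the reviewer's commentary (hypothetical record shape)."""
    label: str
    rating: Rating
    commentary: str = ""


@dataclass
class SectionReview:
    """A section rating chosen by the reviewer, with the question ratings beneath it."""
    section: RatedItem
    questions: List[RatedItem] = field(default_factory=list)

    def tally(self) -> Counter:
        """Count question ratings, ignoring N/A, to inform (not replace) judgement."""
        return Counter(
            q.rating for q in self.questions
            if q.rating is not Rating.NOT_APPLICABLE
        )


# Illustrative data only: these question texts are not taken from the review tool.
review = SectionReview(
    section=RatedItem("Purpose, objectives and scope", Rating.CONFIDENT),
    questions=[
        RatedItem("Is the purpose clearly stated?", Rating.CONFIDENT),
        RatedItem("Are the objectives specific?", Rating.CONFIDENT),
        RatedItem("Are evaluation criteria justified?", Rating.ALMOST_CONFIDENT),
        RatedItem("Is the scope appropriate?", Rating.NOT_APPLICABLE),
    ],
)
counts = review.tally()
```

Keeping the section rating as a separate reviewer-supplied field, instead of deriving it from the tally, mirrors the text's point that each rating was informed by the level below but not mechanically determined by it.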
The review process generated an extensive dataset to inform the trend analysis, consisting of 1,152 individual pieces of evaluation typology data, 6,432 individual ratings, and 6,912 sections of qualitative text (approximately 140,000 words). In order to distil the key findings from this data, a multi-stage process was adopted using qualitative analysis tools such as inductive coding.
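This summary does not state the number of reports reviewed, but the three dataset totals can be checked for internal consistency. The sketch below computes their greatest common divisor, which is the largest report count that would give every report the same number of typology items, ratings and text sections. The result is an inference from the arithmetic, not a figure taken from the report, although it fits the later remark that 40 reports is "just under half".

```python
from functools import reduce
from math import gcd

# Totals quoted in the text.
totals = {
    "typology items": 1152,
    "ratings": 6432,
    "text sections": 6912,
}

# Largest report count that divides all three totals evenly.
largest_common_divisor = reduce(gcd, totals.values())

# Items per report if that divisor is the true number of reports (an assumption).
per_report = {
    name: total // largest_common_divisor for name, total in totals.items()
}

print(largest_common_divisor)  # 96
print(per_report)  # {'typology items': 12, 'ratings': 67, 'text sections': 72}
```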
The review process itself was subject to the limitation of only having access to the written evaluation report. As a direct consequence, the findings and conclusions drawn can only be applied to an evaluation report, and not to the evaluation itself. Qualitative analysis (as with all analysis) requires judgements to be made in identifying the important indicators and trends contained within the dataset generated by the review process.
The meta-evaluation found that 36% of reviewed evaluation reports met the UNICEF standards to a degree that could be considered satisfactory. Whilst the remaining 64% of reports were rated as unsatisfactory, the vast majority of these (exactly half of all reports) could have been improved to a satisfactory level with just a little more work.
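The phrase "vast majority" can be made concrete with the percentages quoted above. This is a small worked calculation using only those figures; the variable names are illustrative.

```python
# Percentages of all reviewed reports, as quoted in the text.
satisfactory = 36
unsatisfactory = 100 - satisfactory   # 64
near_miss = 50                        # "exactly half of all reports"

# Share of unsatisfactory reports that were close to a satisfactory level.
near_miss_share = near_miss / unsatisfactory

# Percentage of all reports that were unsatisfactory and not close.
far_from_standard = unsatisfactory - near_miss

print(f"{near_miss_share:.0%}")  # 78%
print(far_from_standard)         # 14
```

On these figures, roughly four in five unsatisfactory reports were near misses, and about 14% of all reports fell well short of the standards.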
Overall, four evaluation reports were flagged as outstanding best practice, although six more achieved a very confident rating in one or more of the review sections. The outstanding evaluations were Thailand's Evaluation of Children and the 2004 Indian Ocean Tsunami, Guinea Conakry/Guinea Bissau's Evaluation of WASH Activities, Timor Leste's Evaluation of the UNICEF Education Programme, and Uzbekistan's Evaluation of the Family Education Project.
Reviewers noted in particular that unclear objectives were a major contributor to poor report quality, as was having an unclear purpose, inadequate evaluation questions, or missing evaluation criteria. In nearly all of these cases it was found that poor quality terms of reference contributed to the weaknesses evident in evaluation reports.
Output level reports (30% of all reports) displayed concerning weakness across all review sections, such that 90% were rated as unsatisfactory. Outcome and impact reports were consistent across both result levels, with around 50% being satisfactory overall. Only impact reports included those that were considered to be outstanding best practice. Up to three times more summative reports than formative reports were classified as satisfactory in various aspects of the review.
Cross-referencing ratings with MTSP-correspondence reveals a consistent story. Multi-sector and cross-cutting evaluations register strongly in all sections, with around half of these reports being rated as satisfactory. Conversely, organisational performance evaluations were consistently rated as unsatisfactory across all review sections (with the notable exception of the Global Evaluation of DevInfo).
From among the MTSP focus areas, young child survival and development suffered from particularly weak evaluation reports, less than 20% of which were considered to be satisfactory (albeit two being rated as outstanding). Policy and advocacy, and child protection were fairly robust in terms of the "description of the object" and "purpose" sections, but rated very poorly in all other sections. It was concerning to note that not a single policy evaluation report was rated satisfactory in relation to "recommendations and lessons learned".
HIV/AIDS evaluation reports were assessed to be poor in relation to all sections of the review, in particular with regard to methodological rigour. At the other end of the scale, basic education and gender evaluation reports were found to be consistently the strongest of all the MTSP focus areas with more than half of reports rated as satisfactory in most sections of the review.
Around two-thirds of evaluation reports fail to explicitly articulate the results chain of the evaluated object. Over half of reports were still able to present a satisfactory context through inclusion of other information, but the consequence of this issue is that a majority of reports do not appear to be guided by the logic of the programme or project being evaluated. An observation by reviewers, accounting for over half of the reports, was that there is a tendency to provide general information about a country or the implementation context of the evaluated object, rather than analysis that can shape the evaluation purpose, objectives, and findings.
Reviews of the purpose, objectives and scope of evaluation reports revealed a number of underlying issues with the framing of evaluations. These appear to manifest themselves through weak justification of evaluation criteria, and lack of consistent use of these criteria within evaluations. Despite these challenges, the purpose sections of more than half of reports still manage to rate as satisfactory. Four outstanding reports were noted primarily for having very strong evaluation frameworks. These clearly referenced the OECD DAC evaluation criteria in addition to identifying and integrating relevant rights instruments, such as the Core Commitments to Children.
A third of reviewers found methodologies to be narrow and inadequately explained as a general rule. This is manifested in weak control of bias, with a handful of reports amounting to no more than the evaluator's personal reflections on a narrow set of interviews. 83% of reports do not include any discussion of ethics, although it is sometimes evident from the approaches adopted by evaluators that ethical considerations had been borne in mind at some point.
Around 10% of evaluation reports were praised for collecting diverse datasets and large potential bodies of evidence. The best of these were able to convert this data systematically into evidence, findings and conclusions using strong and transparent analysis. The majority of reports were unable to demonstrate this systematic use of evidence to construct robust findings and conclusions. Indeed, there appears to be a persistent problem in regard to data analysis, with reports not using data to its full potential.
Reviewers noted that it was often hard to see the link between recommendations and the preceding findings and conclusions. Lessons learned proved to be even more problematic than recommendations. When they were found in reports, lessons learned were more-often-than-not found to be project or programme-specific observations and not generally applicable to other contexts.
All of the seven policy evaluations included in the review were rated as unsatisfactory in relation to recommendations and lessons learned. This is an issue of some significance in relation to UNICEF's commitment to more upstream working.
The majority of evaluation teams do not appear to have had sight of the UNICEF/UNEG minimum standards. In the 34% of reports that did have TOR attached it was largely found that TORs did not draw attention to these standards.
Although 86% of evaluations did include an executive summary, these were largely found to be weak and not fit for purpose. Only 30% of reports included executive summaries that could confidently be used for decision making purposes.
The review tool proved to be challenging in terms of drawing out lessons about gender, equity and human rights. Just under half of evaluation reports (40) integrated gender considerations to some degree. Only seven reports dealt substantively with issues of equity and only 30% of reports were found to have methodologies that were appropriate for analysing gender and human rights issues identified in their scope. One set of evaluations that did stand out as being both strong on rights and strong overall were the various Child Friendly School evaluations.
Whilst there were inevitable disparities across the quality of reports submitted by different regions, nearly all regions had at least one evaluation report that was rated outstanding in one or more of the assessed sections. The highest levels of satisfactory reports were concentrated in the Asian and European-based regional offices. It must also be recognised that the three global-level evaluations conducted by HQ Divisions were consistently rated as confident across all sections and overall.
During the review process, a number of notable practices were observed in each of the UNICEF regions. These were noted in the final section of the review tool and reported back to evaluation managers.
Conclusions were developed by analysing the findings for trends in underlying factors that contributed to the performance of evaluation reports.
Evaluation reports benefit from having access to relevant and well-developed international frameworks. The MTSP and purpose analyses clearly reveal a tendency for evaluations of education and humanitarian objects to have reports of better quality than their contemporaries. Our investigation into this trend suggests that these two areas benefit from having well-known, mature and contextually-adaptable frameworks.
A disjuncture exists between successful evaluation and strong integration of rights. There would seem to be a complex and multi-faceted dynamic around an apparent disparity between reports that respond well to rights issues and reports that rate well overall. From the evidence available to this review, it would appear that fragmentation of rights skills from evaluation skills and unmet needs for strong mainstreaming frameworks are two central drivers to this whole dynamic.
The evaluation function is not delivering consistent contributions to upstream knowledge management. The purpose and MTSP analyses both found that policy evaluation reports and organisational performance evaluation reports are weak areas. From the perspective of UNICEF's upstream ambitions, the current performance of the evaluation function is likely to be of some concern.
Robust and transparent analysis of data is a problem. All the different ways of breaking down the rating data reveal one consistent trend: evaluation reports are stronger in the initial sections of the review, with performance gradually deteriorating over the span of the report. The central issue appears to be that evaluators are far clearer about the theory of evaluation (purpose, objectives, methodology, data collection) than the processing and analysis of data that is generated.
Weak terms of reference are contributing to poor report quality. Reports tended to "build" off of the TOR as a starting point in terms of the evaluation purpose and framework, so better TORs inevitably resulted in better reports. This reemphasises the value of conducting basic checks and quality assurance on TORs, ensuring that each TOR gives evaluators sight of the minimum standards, and using the TOR to articulate very clearly the purpose of the evaluation.
Fundamental misunderstandings of recommendations and lessons learned prevail. Recommendations were often found to be disconnected from the preceding sections, drawing on the personal knowledge or opinions of the evaluator(s). Lessons learned generally perform even more poorly overall. Indeed, the prevalence of misunderstanding of the lessons learned element of reports might suggest that it is a central candidate for explicit efforts to raise the awareness of both evaluation managers and evaluation teams about what lessons learned are.
The qualitative approach is a viable and useful way forward for the UNICEF Global Evaluation Report Oversight System. This meta-evaluation found that the qualitative approach was not only capable of generating analysis equivalent to or richer than the previous quantified methodology, but also enabled reviewers to provide more useful, accurate and constructive feedback to UNICEF managers in a range of contexts.
There is work to be done in supporting evaluation managers to mainstream human rights, gender and equity. There are some profound challenges in the interaction between HRBAP and robust evaluation that are unlikely to be addressed through simply revising weaknesses in the review tool.
Lessons Learned (optional):
These three lessons learned have been generated through the analysis of the core evaluation team and adapted based on responses from UNICEF Evaluation Office and regional offices.
There is great value to be gained from blending the skills of evaluation teams. Complex rights-orientated evaluations have been delivered successfully where both evaluation skills and sector knowledge have been present on the team. With an apparent shortfall in both evaluators with rights knowledge and rights specialists with evaluation skills, it would seem that value could be delivered from creating evaluation teams that mix these two skillsets as an alternative to the more traditional international/local knowledge blend.
Developing strong international frameworks provides a platform for stronger, more rights-orientated and more useful evaluation. Frameworks such as Child Friendly Schools and Core Commitments to Children empower evaluators to better manage rights issues within their evaluations. Stronger, more useful evaluations contribute to enhancing knowledge across these sectors and thus strengthening the frameworks themselves, thereby contributing to the creation of a "virtuous spiral".
Co-designing the methodology and investing in the development stage of evaluation-quality-reviews delivers a strong return in performance. Misunderstandings that could have become a problem at the analysis stage were eradicated early on through face-to-face working between UNICEF and IOD PARC teams. This also had the benefit of attenuating different interpretations of ratings by reviewers. Reviewers themselves had a chance to test and to help refine the review tool before it was finalised and deployed. This had the benefit of working through many possible scenarios and ultimately contributing to a more universally usable tool.