2011 Global: UNICEF Global Evaluation Report Oversight System (GEROS): Quality Review of 2010 Evaluation Reports [Final Report]
Author: Joseph Barnes, Hatty Dinsmore, Sadie Watson [IOD PARC]
UNICEF holds a long-standing commitment to independent assessment of the quality of evaluation reports produced by its country and regional offices all over the world, as well as HQ divisions. This is the second report to use the current methodology, which was developed in response to a need to provide decision makers in UNICEF with information about evaluation reports that better supports using and improving the knowledge generated by the evaluation function. It is important to note, however, that this approach is limited in its scope for achieving this aim because it can only assess the quality of reports, and not of the evaluations behind them.
This quality review process covered all 2010 evaluation reports submitted to the UNICEF Global Evaluation Database by the cut-off date of end of March 2011. The standards against which evaluation reports are assessed are set by the UNICEF deployment of the United Nations Evaluation Group (UNEG) global evaluation report standards.
Specific objectives of the review are to:
- Review and rate (with justifications) the quality of the main elements of evaluation reports, including structure, context, purpose, methodology, findings, conclusions, recommendations and lessons learned;
- Provide constructive feedback for evaluation commissioners to improve future evaluations;
- Provide a global analysis of key trends, strengths, weaknesses, and lessons of UNICEF evaluation reports;
- Provide actionable conclusions and recommendations to improve the quality oversight system and systemic quality of the evaluation function; and
- Provide specific information on the extent to which equity issues are taken into consideration.
Evaluations were filtered out from other types of report. This process resulted in 89 full reviews. An evaluation expert who is familiar with the UNICEF evaluation function and who had completed a dedicated induction process undertook each review. Three levels of quality assurance were applied: basic completeness and grammar checks; sampled peer-reviewing; and a right-to-challenge option exercised by the UNICEF Evaluation Office.
The full review tool is presented in the Annexes. This was originally co-designed by UNICEF and IOD PARC in 2010 based upon the UNEG/UNICEF standards. Changes made for this review include having separate questions on human rights, gender and equity; collecting more data on the role of TORs in evaluation quality; and reorganising the 58 guiding questions around 22 sub-sections. These sub-sections are used to generate a performance ‘dashboard’ presented in the Annexes.
Each question, each section and the overall report are given one of four ratings: ‘very confident’, ‘confident’, ‘almost confident’, or ‘not confident’ (where relevant, an N/A option is also provided). In addition to ratings, commentary is provided against each section and sub-section, suggestions for future improvement are provided for each section, and executive feedback is provided for each section and the overall report.
The review process generated an extensive dataset to inform the trend analysis, consisting of 1,068 individual pieces of evaluation typology data, 5,963 individual ratings, and 3,738 sections of qualitative text (approximately 120,000 words). In order to distil the key findings from this data, a multi-stage process was adopted consistent with the previous meta-evaluation.
This review added two further layers of analysis that were possible because of time saved in the data aggregation process (after lessons learned in the previous meta-evaluation). The first was a manual affinity diagramming of all of the overall comments made by reviewers. This was able to reveal meta themes, or clusters, that were not word-specific. The second was the quantitative cross-referencing of ratings data, for example between ‘equity’ and ‘MTSP correspondence’, in order to visually identify possible explanations for levels of report quality.
The limitations of time on the level of data analysis were mitigated as far as possible through triangulation of quantitative and qualitative patterns in the data. As noted previously, however, this does not enable us to overcome the limitations of only assessing the evaluation report and not the evaluation itself. Furthermore, the approach adopted is limited to being able to identify only the ‘headline’ findings, with the possibility that more nuanced or infrequently occurring issues exist for individual readers to find within the reviews themselves.
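The cross-referencing of ratings described above can be pictured as a simple cross-tabulation. The following is a minimal sketch only, not the review team's actual procedure: the function name `cross_tabulate`, the dictionary keys and the sample ratings are all hypothetical, and only the four rating labels are taken from the review tool.

```python
from collections import Counter

# Rating scale as named in the review, ordered from weakest to strongest.
SCALE = ["not confident", "almost confident", "confident", "very confident"]


def cross_tabulate(reviews, row_key, col_key):
    """Count reports for every (row rating, column rating) pair."""
    counts = Counter((r[row_key], r[col_key]) for r in reviews)
    return [[counts[(row, col)] for col in SCALE] for row in SCALE]


# Hypothetical ratings for illustration only; not data from the review.
reviews = [
    {"equity": "not confident", "overall": "not confident"},
    {"equity": "almost confident", "overall": "almost confident"},
    {"equity": "almost confident", "overall": "confident"},
    {"equity": "confident", "overall": "confident"},
    {"equity": "not confident", "overall": "almost confident"},
]

table = cross_tabulate(reviews, "equity", "overall")
for label, row in zip(SCALE, table):
    # Each row shows how reports with this equity rating were rated overall.
    print(f"{label:>16}: {row}")
```

Counts that cluster along the diagonal of such a table would indicate that the two ratings tend to move together, which is the kind of pattern a visual inspection is looking for.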
Overall, the review found a slight improvement in performance year-on-year, with 40% of reports being rated overall as satisfactory according to UNICEF Evaluation Standards (36% in 2010). Of the remaining reports, 30% could be brought up to a satisfactory standard with a little more work. Of some concern, however, was the 30% of reports found to have substantive weaknesses, compared with 14% in the previous year. At the top end of the scale, three reports were rated as outstanding (see footnote 1), although eight reports had at least one individual section rated as outstanding.
Our qualitative analysis suggests that reports rated unsatisfactory tend to be highly descriptive in nature, whereas better performing reports use a much more analytical approach. This belies the fact that the differentiating factors between stronger and weaker reports are still fundamental evaluation issues: criteria, methodology, conclusions and so on.
A few reports deal innovatively with things such as stakeholder engagement or future policy. In general, however, well-structured reports do better simply because they develop a clear line of logic that results in evaluative conclusions based upon evidence, and there are a significant number of evaluation reports that still do not achieve this.
Thus the overriding theme coming out in this meta-evaluation is the influence of clear and sufficient evidence in determining the quality rating of evaluation reports. This requirement for clear presentation, robust analysis, and transparent interpretation of information cuts across all of the assessed sections: from the selection of evaluation criteria to the development of lessons learned.
Reports were found to be stronger in terms of overall coherence (45% confidence) than in terms of individual sections (35% confidence). This would suggest that in overall terms, individual sections are causing more problems than poor structuring.
Far more 2010 reports included Terms of Reference: 64% compared to 34% of 2009 reports. This is to be commended. Of those reports that did have a TOR attached, the TOR appeared to have a broadly positive effect on the final report quality in around 36% of cases.
The most consistently satisfactory reports were found to be global and regional evaluations. Corporate global evaluations, in particular, were found to be almost entirely satisfactory in the critical methodology and conclusions sections that let down other reports.
The consistently strongest performing reports were those managed jointly by UNICEF and another UN agency: although this was a fairly small number of reports (five), 80% were found to be satisfactory. UNICEF-managed reports were often more challenging impact evaluations, whereas Joint-UN reports tended to focus at the outcome level. Country-led evaluations were found to be the weakest, with only 25% rated as reaching UNICEF standards.
Policy evaluations were found to be very few in number (three) and generally of unsatisfactory quality according to UNICEF standards. Real time humanitarian evaluations – also a small sample (seven) – were found to be consistently strong.
Output evaluations were by far the least satisfactory (less than 20% rated satisfactory), outcome evaluations were stronger but suffered from weak methodology sections (32% satisfactory), and impact evaluations were strongest (66% rated satisfactory).
Multi-sector evaluations were found to be the strongest overall, with 56% rated satisfactory. Young Child Survival (MTSP1) and Basic Education (MTSP2) evaluation reports were most consistently rated as satisfactory among single-sector reports.
Independent Internal evaluations were rated higher (55%) than Independent External evaluations (41%). There were no self-evaluations.
Quantitative analysis of section ratings suggests that the greatest correlation (80%) between a section rating and the overall rating is for the findings and conclusions section. Outstanding reports tended to be differentiated by the methodology section. Developing a clear chain of logic from data to evaluative conclusions is the major barrier to many reports reaching UNICEF standards.
The object of the evaluation was one of the highest-rated sections, with 46% of reports being placed as confident or very confident to act. Purpose, objectives and scope was a second strongly rated section, with 47% of reports reaching UNICEF standards. Methodology was generally found to be an area of some concern across all evaluation reports, with 65% of reports rated as below standard.
With 63% of reports below standards, the findings and conclusions section was an area of considerable concern, and was the weakest section in reports rated as unsatisfactory overall. Findings and lessons learned was the weakest performer, with only 34% of evaluations being found to satisfactorily meet UNICEF standards, due mostly to misinterpreted lessons learned. Finally, 42% of reports were rated as satisfactory in relation to the report structure, logic and clarity.
As part of this meta-evaluation, a special analysis of the performance of evaluation reports in relation to equity was requested. Overall, only 9% of reports included equity issues to a satisfactory level, according to UNICEF standards: one report was rated outstanding and seven were rated as confident. A very strong correlation was found between the level of coverage of equity issues by reports and their overall quality rating. Real Time Evaluations handled equity better than evaluations with any other purpose.
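The summary does not state how the correlation between section ratings and overall ratings was calculated. As one possible illustration, the sketch below assumes the four ratings are scored 1 to 4 and compared using a Pearson correlation coefficient; the scoring, the function name `pearson` and the sample ratings are all assumptions made for this example, not the method or data of the review.

```python
from math import sqrt

# Assumed numeric scoring of the review's rating scale (not from the report).
SCORES = {
    "not confident": 1,
    "almost confident": 2,
    "confident": 3,
    "very confident": 4,
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical section and overall ratings for five reports (illustration only).
section = ["not confident", "almost confident", "confident", "confident", "very confident"]
overall = ["not confident", "almost confident", "almost confident", "confident", "very confident"]

r = pearson([SCORES[s] for s in section], [SCORES[o] for o in overall])
print(f"correlation between section and overall rating: {r:.2f}")
```

Repeating such a calculation for each section in turn would show which section's rating tracks the overall rating most closely. A rank-based measure could equally be used, since the ratings are ordinal.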
Jointly managed evaluations with UN agencies and country-led evaluations were both small in number (five each), but produced proportionately higher rates of satisfactory reports (both 20%) and lower rates of not confident ratings (20% and 40% respectively).
Quantitative analysis would appear to suggest that performance in equity is largely dependent on individual combinations of evaluators and evaluation managers, rather than an institutional capacity located in some part of the organisation: a major challenge to systemic strengthening.
The differentiating factor for reports that satisfactorily addressed equity was their consistency in mainstreaming equity considerations throughout the evaluation: from design, framework and methodology through to data collection, analysis and conclusions. Most reports rated ‘almost confident’ offered consideration of equity as an issue, a few actually collected disaggregated data, and very few managed to generate convincing analysis of this.
Whilst human rights was better covered than equity – with 19% of reports being rated satisfactory – no evaluation was found to be outstanding in this regard. Similarly for gender equality and women's empowerment issues: 20% of reports were rated as satisfactory, but none were rated outstanding.
The level of confidence in ethics sections has substantially declined since 2009, from 18% to only 10% of reports reaching the (liberally interpreted) UNICEF standards. The degree to which ethics is discussed now appears to be on life-support, and warrants renewed highlighting of this concern.
Regional analysis reveals a mixed performance compared to the previous year. There was slight overall improvement, but a concerning increase in the number and proportion of reports rated as not confident. Tracking the regional distribution of performance over the two years 2009-2010 shows that, with the exception of Corporate evaluations, the relative ordering of regions as most-to-least satisfactory reports performed an almost precise reversal. Nevertheless, all except two regions had at least one report where a section was rated outstanding.
There does appear to be a pattern emerging of French language reports rating less well than all of the regional averages from which they came. To be able to assert such a conjecture conclusively, however, is beyond the scope of this methodology.
Footnote 1: Forward looking evaluation of the Community Schools Project UNICEF, Egypt; Cash Transfer Programme for Orphans and Vulnerable Children Operational and Impact Evaluation, Kenya; Evaluation of the Go to School Initiative in Southern Sudan, Sudan.
Conclusions were developed by analysing the findings for trends in underlying factors that contributed to the performance of evaluation reports. This analysis was grounded in the concept of ‘confidence’.
Inconsistency of ratings across time, regions and sectors suggests low institutionalisation of quality evaluation capabilities
It would appear that, with the exception of corporate evaluations, there are not really ‘stores’ of capabilities concentrated anywhere in the system. Great evaluation reports seem to be just as likely to come out from one point in the evaluation system as any other: as do very poor quality reports.
UNICEF appears not to be using its comparative advantage in equity, gender and human rights
UNICEF has been at the forefront of international efforts to recognise human rights, gender equality, women and children’s empowerment, and socio-economic equity within the evaluation function. This is not being translated into strong performance in these areas.
The performance of impact evaluations and joint UN evaluations could suggest higher resourcing of evaluations is leading to better reports – but might not
On the assumption that impact evaluations and joint evaluations tend to involve higher levels of management time and resourcing, the findings could imply that higher levels of investment are generally leading to better quality reports. Alternatively, jointly-managed evaluations are simply delivering well on easier outcome-level questions, whereas UNICEF is managing inherently more complex and challenging evaluations that others do not want to get involved in.
Evidence, logic and analysis really count in delivering quality
Outstanding reports are differentiated by the methodology, whereas other reports are most influenced by whether they get the findings-conclusions nexus right. Focusing on the broad detachment between evaluation design and objects’ theories of change could help to move from reports full of description to those containing more analysis and logic. Most reports that are sub-standard appear confused because they leave important issues about the evaluation design or implementation unexplained.
The purpose of Lessons Learned is not understood
As a knowledge-centred organisation, it has to be of concern to UNICEF that only 15% of reports adequately demonstrate an understanding of what the purpose and value of lessons learned is. Lessons learned are mostly misinterpreted as object-specific management changes.
Ethics is an issue on life-support
Whilst no conclusions can be drawn regarding how ethical evaluations actually are, the vast majority of evaluation reports simply fail to even mention ethics issues.
Lessons Learned:
Lessons learned have been generated through the analysis of the core evaluation team and revised based on comments from UNICEF.
Evaluation reports need evaluation skills, but they also need proactive inputs from specialists in rights, gender, ethics and equity.
Evaluations undertaken by technical experts in a research style result in weaker quality of reports than those undertaken by evaluators. It is also the case, however, that most evaluators appear to be unsuccessful in building an analytical framework for their reports around human rights and gender commitments. This suggests a strong need for rights experts to be closely involved with evaluation teams – particularly at the design stage.
Conclusions from a single year’s meta-evaluation cannot give the whole picture.
Taken in isolation, there are a number of trends apparent in the 2009 and 2010 reports. Several of these, however, do not hold across the two years’ reviews. The lessons from this experience are to exercise caution in interpreting trends in a single meta-evaluation, and to ensure that meta-evaluation methodologies continue to be consistent across years.