Technical brief: Data cleaning

Date published: 26 May 2017

No matter how data are collected (face-to-face interviews, telephone interviews, self-administered questionnaires, etc.), there will be some level of error. "Messy data" refers to data riddled with inconsistencies. While some discrepancies are legitimate, reflecting genuine variation in the context, others will likely reflect a measurement or entry error. These can stem from human error, poorly designed recording systems, or simply incomplete control over the format and type of data imported from external sources. Such discrepancies wreak havoc when trying to analyse the data. Before processing the data for analysis, care should be taken to ensure it is as accurate and consistent as possible.

Used mainly when dealing with data stored in a database, the terms data validation, data cleaning and data scrubbing refer to the process of detecting, correcting, replacing, modifying or removing messy data from a record set, table, or database.
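The detect/correct/remove cycle described above can be sketched in a few lines of code. The sketch below is illustrative only, not part of the brief: the record structure, field names and validation rules are hypothetical, standing in for whatever checks a real assessment dataset would require.

```python
import re

# Hypothetical survey records; field names are illustrative only.
records = [
    {"district": "Aleppo", "hh_size": "5"},
    {"district": "aleppo ", "hh_size": "five"},  # inconsistent spelling, non-numeric size
    {"district": "Homs", "hh_size": "-2"},       # implausible (negative) household size
]

def clean(record):
    """Return a corrected copy of the record, or None to drop it."""
    fixed = dict(record)
    # Correct: normalise inconsistent case/whitespace in a categorical value.
    fixed["district"] = fixed["district"].strip().title()
    # Detect: household size must be a positive integer.
    if not re.fullmatch(r"\d+", fixed["hh_size"]) or int(fixed["hh_size"]) == 0:
        return None  # Remove: unrecoverable entry error.
    fixed["hh_size"] = int(fixed["hh_size"])
    return fixed

cleaned = [c for r in records if (c := clean(r)) is not None]
```

In practice each rule would be documented so that removed or modified records remain traceable back to the raw data.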

This document provides guidance for data analysts to find the right data cleaning strategy when dealing with needs assessment data. The guidance is applicable to both primary and secondary data. It covers situations where:

  • Raw data is generated by assessment teams using a questionnaire.
  • Data is obtained from secondary sources (displacement monitoring systems, food security data, census data, etc.).
  • Secondary data is compared or merged with data obtained from field assessments.
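The third situation, merging secondary data with field assessment data, typically hinges on harmonising the join key before matching. A minimal sketch, with hypothetical area names and fields (the brief itself prescribes no particular tool):

```python
# Hypothetical primary assessment rows and secondary census figures,
# keyed on administrative-area name.
primary = [
    {"area": "District A", "hh_assessed": 120},
    {"area": "district b", "hh_assessed": 80},
]
secondary = {"DISTRICT A": 15000, "DISTRICT B": 9000}  # e.g. census population

def norm(name):
    # Harmonise keys before merging: case and surrounding whitespace
    # are among the most common causes of failed joins.
    return name.strip().upper()

merged = [
    {**row, "population": secondary.get(norm(row["area"]))}
    for row in primary
]
```

Using `.get()` leaves `None` where no secondary match exists, so unmatched areas surface for review rather than silently disappearing from the merged dataset.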

This document complements the ACAPS technical note "How to approach a dataset", which specifically details data cleaning operations for primary data entered into an Excel spreadsheet during rapid assessments.