AdSense

Tuesday, 14 January 2014

Data Quality in ETL and BI - Reasons Impacts Solutions and Operations

The amount of Data that an organization stores and processes in the current time has increased many folds. This has also exposed the associated problems of poor quality of data. If the quality of data is bad then the information created by that data is not useful. Lot of efforts and money is being put in by organizations to improve the quality of data. 00
The Quality of data refers to the following
  • Accuracy
  • Consistency
  • Integrity
  • Uniqueness
Reasons for data quality issues:
  • Inaccurate data entry
  • No process or rules in application to validate data entry
  • Lack of Master Data Management (MDM) strategy
Examples:
  1. Phone number having values as 1111111111, 0000010100
  2. Customer Name as "ABC", "ZZZZ"
  3. Two records in a table like below:
Timothy, Kenny. 10 East Avenue
Tim, Kenny. 10 East Avenue
Impact of poor data quality:
  • Incomplete and misleading analysis
  • Increase in spending on incorrect data
  • Financial impacts when data is related to accounts and finance
  • Targeted market campaigns impacted adversely
  • Purchasing of data quality tools like First Logic, Trillium, etc.
  • Complicated ETL
  • Additional Cleansing in ETL process results in longer time to complete ETL cycle
Solutions for data quality:
  • Stringent data validations by means of rules in applications at source
  • Avoiding duplicate master entries by use of MDM solutions
  • If above not done, then using ETL process to handle data quality issues
  • Send feedback for bad data quality and correct at source, then reload
  • Maintain audit for data quality issues emerging from source systems
Data Cleansing Operations:
  • Removing invalid characters
remove extra and special characters from addresses, phone numbers, etc
  • Correcting data formats
formats for phone numbers, email addresses, etc
  • Identiying and removing duplicates
  • Building data quality audit and feedback system
record data quality in audit tables
automate process to send information of data quality to source system