Bottom-up Approach to Data Quality with Conformed Dimensions

In a previous blog we discussed the Top-down versus Bottom-up DQ approaches. Then, in the second post of this series, we explored how to use the Conformed Dimensions with the Top-down approach. In today’s post we’ll explore the opposite: how to leverage the Conformed Dimensions for a Bottom-up approach.


The only way to ensure that data quality improvement efforts aren’t one-time firefighting exercises is to improve the business processes that produce poor quality data and to fix issues as far upstream as feasible. Unfortunately, most organizations aren’t ready to make significant changes to processes and systems without understanding how those changes will help them. To build a business case for process, system, or data changes, we often need concrete examples of poor data quality and of how much it impacts people, timelines, budgets, and customers. For this reason, a Bottom-up approach with limited scope can help us document the issue, how much impact it has, and how much it might cost to fix at scale.


Before we get too far into this, I need to remind you that what we’re discussing here as Bottom-up is a project, not a program. The activities described in this blog are done as a one-off for the purpose of making a business case for a broader Top-down initiative and, eventually, a program supported by executive management with concrete roles and responsibilities, funding, staff, and a strategy with a roadmap. The project activities here are often carried out by technical IT department members as an ad hoc effort to make the case for broader cultural change. The project might also be done by consultants sponsored by either a business or IT department, with the goal of eventually developing into a program.


Limited Scope Bottom-up Project:

One example might be a time-boxed deep dive on the sales funnel. Let’s assume the following scenario.

Scenario:

Multiple departments want to produce daily sales reports across countries, regions, and agents that tie to both incentive systems and accounting close periods.

This is the same example we used in the previous blog post discussing the Top-down approach. This time we’ll work backwards, starting with the data and working upwards to the associated processes that are impacted.


Activities Conducted During a Bottom-up Approach Using the Conformed Dimensions:


For each step below, the activity conducted is listed first, followed by examples and explanation.

Step 1: Identify data associated with the issue at hand.

Using the scenario defined above, we can review the columns in existing sales reports. We’ll want to include any geography, sales region, and accounting attributes in addition to the system keys needed to join all of the information together.
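As a minimal illustration of this first step, the sketch below loads a hypothetical report extract with pandas and summarizes each column: its type, distinct-value count, and null percentage. Columns whose values are unique across the extract are flagged as candidate join keys. The file name, the column roles, and the idea that the report is available as a CSV are assumptions made purely for the example.

```python
import pandas as pd

# Hypothetical CSV extract of an existing daily sales report (the file name
# and columns are assumptions for illustration).
report = pd.read_csv("daily_sales_report_extract.csv")

# Summarize each column: type, distinct values, and percentage of nulls.
# High-uniqueness columns are candidate system keys for joining; low-cardinality
# text columns are often the geography, region, or accounting attributes.
summary = pd.DataFrame({
    "dtype": report.dtypes.astype(str),
    "distinct_values": report.nunique(),
    "pct_null": (report.isna().mean() * 100).round(1),
})
summary["candidate_key"] = summary["distinct_values"] == len(report)

print(summary.sort_values("distinct_values", ascending=False))
```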

Step 2: Identify the systems that produce the required data and their respective definitions of these columns (they may differ).

We’ll likely have at least one sales system per geography (maybe even per country). Those systems may feed one or more billing systems and perhaps a separate system for financials. To keep this effort manageable, we’ll likely want to limit the scope to two regions or countries, enough to identify representative data quality issues without spending so much time that we lose effectiveness. This is a balancing act.

Step 3: Document the data columns, definitions, systems, and data flows, understood at a high level, in a diagram.

Things can get complicated very quickly, so a diagram helps us visualize the breadth of the landscape, where data transformation happens, and the likely data quality challenges. I recommend keeping it high level and building it through short one-on-one interviews with key subject matter experts. The goal is a rough draft that we’ll use in the following step.
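To make the rough-draft diagram concrete, here is a minimal sketch, under the assumption that you record each lineage segment as a small dictionary; the system names and transformations shown are invented placeholders. It emits Graphviz DOT text that any DOT renderer can turn into the diagram.

```python
# Each segment records a source system, a target system, and the transformation
# applied along the way. All names below are illustrative placeholders.
segments = [
    {"source": "EU Sales System", "target": "EU Billing", "transform": "currency conversion"},
    {"source": "NA Sales System", "target": "NA Billing", "transform": "none"},
    {"source": "EU Billing", "target": "Finance DW", "transform": "aggregate to daily totals"},
    {"source": "NA Billing", "target": "Finance DW", "transform": "aggregate to daily totals"},
    {"source": "Finance DW", "target": "Daily Sales Report", "transform": "join to region hierarchy"},
]

# Emit Graphviz DOT text for the data flow diagram.
lines = ["digraph lineage {", "  rankdir=LR;"]
for seg in segments:
    lines.append(f'  "{seg["source"]}" -> "{seg["target"]}" [label="{seg["transform"]}"];')
lines.append("}")
print("\n".join(lines))
```

Even this tiny structure forces the interviews to answer the three questions that matter for this step: where does the data come from, where does it go, and what happens to it along the way.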

Step 4 (optional): Conduct an informal data process review session with key stakeholders from the departments that ‘touch’ the data across the systems landscape documented in the prior step.

Depending on the level of executive support you have (in other words, how much of a Top-down approach is already in place), you may not be able to convene representative SMEs from all the departments needed for this step. Many customers have complimented us on what they learn during this step when following the DQMatters whiteboarding session approach.

Step 5: Assess the data quality of the data in scope.

This step can also be scaled down in order to save time.

  • For instance, you may conduct data profiling and Subject Matter Expert review of only a subset time period of the data for a single location. Let’s call this a horizontal approach because you only check the one system furthest downstream.
  • Alternatively, you may assess only 2-3 columns of data for a few customers, from the most downstream system all the way up to the systems of origination, which we call a vertical approach.
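As a minimal sketch of the horizontal version of this assessment, the snippet below profiles one extract from the single most-downstream system using pandas. The file name, column names, and checks are assumptions chosen to fit the sales scenario, not a prescribed rule set.

```python
import pandas as pd

# One period of data from the most-downstream system (horizontal approach).
# File and column names are hypothetical.
sales = pd.read_csv("finance_dw_sales_extract.csv", parse_dates=["sale_date"])

profile = {
    "row_count": len(sales),
    "pct_missing_agent_id": round(sales["agent_id"].isna().mean() * 100, 1),
    "pct_missing_region_code": round(sales["region_code"].isna().mean() * 100, 1),
    "negative_sale_amounts": int((sales["sale_amount"] < 0).sum()),
    "duplicate_transaction_ids": int(sales["transaction_id"].duplicated().sum()),
}

# Review these counts with a Subject Matter Expert to decide which ones
# represent real data quality issues versus expected behavior.
for check, value in profile.items():
    print(f"{check}: {value}")
```

A vertical assessment would run the same handful of checks on the same few customers in each system along the lineage, rather than on one system’s full extract.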

Step 6: Combine the assessment findings and document the business impacts.

Again, without a level of executive support you may not be able to meet with all the impacted audiences, but typically you’ll review the assessment findings with at least two stakeholders and collaboratively define the negative impacts of poor data quality. The more specifically and quantitatively you can communicate the impacts, the more authority you’ll enjoy and the greater the likelihood of executive support to expand your scope or initiate a Top-down approach that addresses your concerns.
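Quantifying the impact doesn’t need to be sophisticated to be persuasive; even back-of-the-envelope arithmetic like the sketch below helps. Every figure in it is an illustrative placeholder you would replace with numbers agreed with your stakeholders.

```python
# All figures are illustrative placeholders, not findings.
records_per_month = 120_000      # sales records flowing into the daily report
error_rate = 0.04                # share of records failing the assessment checks
cost_per_error = 15.0            # average rework/reconciliation cost per bad record

annual_impact = records_per_month * error_rate * cost_per_error * 12
print(f"Estimated annual cost of the issue: ${annual_impact:,.0f}")
```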

In the steps above we walked through the activities of a typical Bottom-up approach; now we’ll identify the Conformed Dimensions likely used during each step. This is by no means an exhaustive list, because it applies only to the scenario described in this blog and a limited set of columns.


  1. Identify data associated with the issue at hand.
    1. Metadata- One of the most powerful tools used by a DQ professional is the Metadata repository, which catalogs business definitions, data columns in reports and in systems across the organization, and hopefully the transformations as well. The following CDDQ Underlying Concepts are used in conjunction with this repository and ad hoc data queries.
      1. Completeness: Existence- Whether a data column is even collected is an important question that is often hard to answer without systematically collected metadata (like in a repository).
      2. Lineage: All types- The data flow diagram developed in step 3 above requires metadata about the Source and Target, as well as the transformations for each Segment. If no system can generate End-to-End documentation for this lineage, you’ll create it by hand to support the rest of the steps in the Bottom-up approach.
    2. Using data profiling we can evaluate the following Underlying Concepts:
      1. Validity: All types- For those columns that are reference or master data, you can evaluate what validity rules should be applied. This may open a discussion of reference/master data, so be careful not to take on too much extra work before meeting your project objectives.
      2. Integrity: Referential Integrity- One of the key objectives of our original scenario was to tie sales amounts across systems. To do this, referential integrity, the ability to join records based on some common identifier, is critical (a minimal sketch of such a check appears after this list).
      3. Consistency: Format Consistency and Logical Consistency- Now that we have lineage documenting where the data comes from and any transformations, you can choose which consistency checks are needed. Again, don’t go overboard; choose just enough to sell the need for improvement until you have resources to address the issue at hand.
    3. Using cross-system balancing (covered in detail in the DQMatters Jumpstart class) we can compare values between systems; a small balancing sketch also appears after this list.
      1. Accuracy: Match to Agreed Source- Using the lineage documentation you can identify whether the downstream data source that stakeholders are using is, logically, the correct one.
      2. Consistency: Equivalence of Data and Temporal- With that same lineage in hand, you can check whether equivalent values actually match between systems and whether they align for the same point in time, such as the same accounting close period. As above, choose just enough checks to make the case for improvement.
  2. Identify the systems that produce the required data and their respective definitions of these columns (they may differ).
    1. Lineage: Source Documentation- It’s pretty hard to identify what systems source the data without the Source Documentation.
    2. Representation: Metadata Availability- This measures whether definitions exist, but once you have them you also have to make sure they are the same across systems in our scenario.
  3. Document the data columns, definitions, systems, and data flows, understood at a high level, in a diagram.
    1. Lineage: End to End Documentation- As previously discussed, you can use either system-generated documentation from a metadata repository or documentation created manually for the specific attributes in your scope. Remember, this task isn’t scalable if done manually for a larger set of columns; take that into account when forecasting future DQ efforts that will need lineage for many attributes at a time.
  4. Optional: Conduct an informal data process review session with key stakeholders of respective departments that ‘touch’ the data across the systems landscape documented in the prior step. 
    1. Your End-to-End documentation is critical for this step. If you don’t have it, you can draw it out on a whiteboard during the session, but doing so will slow the process and you’ll need to choose carefully what level of detail is needed.
  5. Assess the data quality of the data in scope.
    1. Typically this is done via Metadata collection, Data Profiling and Cross System Validation/Comparison as discussed above.
  6. Combine the assessment findings and document the business impacts.
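Below is the referential integrity sketch promised in the data profiling items above: sales records are joined to the agent reference data on a shared identifier and the orphans are counted, with a simple validity check on region codes alongside it. The file names, column names, and reference values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical extracts: transactional sales data and the agent master reference.
sales = pd.read_csv("finance_dw_sales_extract.csv")
agents = pd.read_csv("agent_master.csv")

# Integrity: Referential Integrity - every agent_id on a sale should resolve
# to exactly one row in the agent master.
agent_keys = agents[["agent_id"]].drop_duplicates()
joined = sales.merge(agent_keys, on="agent_id", how="left", indicator=True)
orphans = joined[joined["_merge"] == "left_only"]
print(f"Sales rows with no matching agent: {len(orphans)} of {len(sales)}")

# Validity - region codes should come from the agreed reference list.
valid_regions = {"EMEA", "NA", "APAC", "LATAM"}
invalid_regions = sales[~sales["region_code"].isin(valid_regions)]
print(f"Sales rows with an unrecognized region code: {len(invalid_regions)}")
```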
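And here is the cross-system balancing sketch referenced above, illustrating the spirit of Accuracy: Match to Agreed Source and Consistency: Equivalence of Data and Temporal: daily totals from a billing extract are compared against the corresponding slice of the finance warehouse for the same period. File names, column names, the region filter, and the tolerance are assumptions.

```python
import pandas as pd

# Hypothetical extracts from two systems covering the same accounting period.
billing = pd.read_csv("eu_billing_daily_extract.csv", parse_dates=["sale_date"])
finance = pd.read_csv("finance_dw_sales_extract.csv", parse_dates=["sale_date"])

# Compare only the EU slice of the warehouse against the EU billing extract.
finance_eu = finance[finance["region_code"] == "EMEA"]

# Aggregate both systems to daily totals and line them up side by side.
billing_daily = billing.groupby("sale_date")["sale_amount"].sum()
finance_daily = finance_eu.groupby("sale_date")["sale_amount"].sum()
comparison = pd.DataFrame({"billing": billing_daily, "finance": finance_daily}).fillna(0)
comparison["difference"] = comparison["billing"] - comparison["finance"]

# Flag days where the two systems disagree by more than a small tolerance.
tolerance = 0.01
out_of_balance = comparison[comparison["difference"].abs() > tolerance]
print(out_of_balance)
```

Days that fall out of balance become the concrete examples you bring to step 6 when documenting business impact.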


Ironically, providing context to your results often requires input from business stakeholders and some of the activities of the Top-down approach. So even though some people say that IT and systems teams can independently conduct Bottom-up data quality approaches, the effort may stall when attributing business impact, because in my experience only business end-users can credibly ascribe impact. They are also the ones who need to prioritize mitigation and correction efforts and allocate budgets, all of which are activities that fall within a Top-down approach.


Has this blog opened your eyes to the wide variety of Underlying Concepts that can be measured using the Conformed Dimensions? In our Annual Survey about the dimensions of data quality we ask respondents which of the Underlying Concepts they use in their organizations. Please consider taking the annual Dimensions of Data Quality Survey, open now!