How to Eliminate Bad Data in Health Insurance Claims

by | Dec 5, 2022 | Healthcare, Master Data Management


According to recent studies, bad data in health insurance claims costs U.S. healthcare organizations approximately $3.1 trillion annually and consumes 40% of industry IT outlays. Data disparities across sprawling unintegrated systems of poorly configured cloud applications and siloed legacy tech have made developing data cleansing capabilities a critical priority for all claims processors. 

Making an industry-wide course correction for insurance claims data is a formidable undertaking. Nevertheless, the U.S. healthcare system faces mounting challenges to effective, affordable operations. The financial strain of the COVID-19 pandemic on healthcare organizations and the subsequent inflation of clinical labor costs, driven by shortages of roughly 200,000 nurses and 50,000 physicians, have added hundreds of billions of dollars to projected healthcare spending over the next five years.

Healthcare organizations can do little but ride out the broader financial hardships of the post-COVID economy and labor market. However, recent developments in data cleansing and data management technologies have made eliminating, or at least significantly reducing, the cost of bad data an addressable opportunity in the healthcare industry. In this guide, you’ll learn what bad data is, how it commonly enters databases and IT systems, and how data cleansing tools mitigate the damage it causes.

Key Takeaways:
  • Bad data in insurance claims has become an albatross around the neck of the healthcare industry.
  • Eliminating bad data requires implementing the data management process called data cleansing.
  • Different applications and analytical tools allow organizations to standardize and correct many bad data anomalies in distributed IT networks.

What Is Bad Data?


Data quality is a top data management challenge.

Bad data refers to data in IT systems that is inaccurate, misleading, incomplete, unavailable, duplicated, or irrelevant. Across industries, studies show various types of bad data accounting for roughly 40% of enterprise data. This bad data costs the average enterprise $12.9 million annually in expenses related to mitigating or correcting its operational effects. Eliminating bad data from healthcare databases, such as insurance claims, and achieving high data quality has become a top challenge for 60% of reporting organizations.

What Causes Bad Data?

In most IT systems, bad data enters through a variety of channels. Some of the most common are:

  • Lack of Single-Point Entry: Unintegrated systems in different departments of an organization or networks of third-party services allow multiple entries for the same data point. Over time, this results in an accumulation of duplicate and near-duplicate data points, often differing only in format, that appear to be distinct records upstream.
  • Human Error in Data Entry: The human error rate for manual data entry is 4%, with 14% of those errors carrying potentially disastrous consequences if applied to major decision-making processes.
  • Data Siloes: Organizations often run separate, partitioned IT systems in different departments. When these systems collect overlapping information, such as customer data, divergences emerge as users alter or update records in one system but not others, leaving largely irreconcilable data sets with no hierarchy to determine which is most current or accurate.
  • Lack of Data Governance: Data governance is a subfield of data management that focuses on standardizing data as systems capture it and assigning areas of data management responsibility within an organization. Effective data governance is critical to measuring data quality in different domains and tracing gaps in data quality to the root cause.
  • Overestimation of Data Quality in Upstream Systems: Without established processes for determining data quality in contributing databases and applications, data integration systems can only assume that incoming data is accurate. In this scenario, bad data from one system propagates across entire networks, compounding the effects of the original faulty portion.
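The first cause above is worth making concrete: the same member entered through two unintegrated systems, in different name casings and date formats, looks like two distinct records to any downstream system that compares raw values. A minimal sketch (the field names and formats are hypothetical illustrations, not from any specific claims system):

```python
# Two entry points capture the same member, but with different
# name casing and date formats, so naive matching sees two records.
claims_system = {"name": "SMITH, JANE", "dob": "03/15/1985"}
portal_system = {"name": "Jane Smith", "dob": "1985-03-15"}

def canonical(record):
    """Reduce a record to a comparable form: an order-insensitive,
    lowercase name and an ISO 8601 date of birth."""
    name = frozenset(record["name"].replace(",", "").lower().split())
    dob = record["dob"]
    if "/" in dob:  # assume MM/DD/YYYY from the claims system
        month, day, year = dob.split("/")
        dob = f"{year}-{month}-{day}"
    return (name, dob)

# Raw comparison reports two distinct records...
assert claims_system != portal_system
# ...but the canonical forms reveal a near duplicate.
assert canonical(claims_system) == canonical(portal_system)
```

This is exactly the kind of near duplicate that single-point entry, or the canonicalization rules a master data management platform applies, is meant to prevent.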

What Is Data Cleansing?


Data cleansing (cleaning) uses analytical tools to identify bad data and delete, reconcile, or complete bad data points. Data cleansing improves data quality, making extracted data insights more reliable. Data cleansing is a core process of data science and data management in any industry or context. 

In most organizations, data cleansing is a cyclical process involving eight steps.

Steps of data cleansing cycle.

1. Import Data

Data enters databases through any possible combination of manual and automated routes, such as data entry, end-user inputs, and connections to other databases and applications. Imported data is raw data and, like other raw resources, needs refinement and processing to generate value.

2. Merge Data Sets

Relational databases organize data according to a schema, a blueprint that determines the number of tables, columns, and rows and defines how individual data points relate to all others. Schemas do not contain data. Rather, they structure it.

In networked IT systems containing multiple data sources, creating a master data set, an authoritative and comprehensive reconstruction of data in contributing domains, requires successfully merging multiple data sets. Data sets must share compatible schemas to merge successfully.
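The schema-compatibility requirement can be sketched in a few lines. In this illustration (the column names are hypothetical), each source contributes rows against one shared schema, and any row that does not fit the schema is rejected rather than silently merged:

```python
# The shared schema: column names every contributing data set must follow.
SCHEMA = ("member_id", "name", "plan")

billing_rows = [("M001", "Jane Smith", "PPO")]
clinical_rows = [("M002", "Raj Patel", "HMO")]

def merge(schema, *datasets):
    """Merge row sets that share one schema into a single master list."""
    master = []
    for rows in datasets:
        for row in rows:
            if len(row) != len(schema):
                raise ValueError("row does not match the shared schema")
            master.append(dict(zip(schema, row)))
    return master

master = merge(SCHEMA, billing_rows, clinical_rows)
assert len(master) == 2 and master[0]["plan"] == "PPO"
```

Real merges must also reconcile column types and units, but the principle is the same: the schema, not the data, defines what a valid merged record looks like.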

3. Rebuild Missing Data

Merging large data sets inevitably causes data loss for various reasons. Achieving high data quality requires rebuilding lost data. Data scientists use regression models to identify missing data and one of two kinds of imputation techniques to reconstruct it.

  • Average Imputation: Fills in missing values with the average of observed values for the same field
  • Common-Point Imputation: Fills in missing values with either middle-point (median) values or, for non-numerical entries, the most commonly chosen values for corresponding fields.
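Both imputation techniques can be sketched with the standard library, assuming missing values are represented as `None` (the field contents here are illustrative):

```python
from statistics import mean, median, mode

# Claim amounts with a missing value, and a categorical plan-type field.
amounts = [120.0, None, 90.0, 150.0]
plan_types = ["PPO", "HMO", None, "PPO"]

def average_impute(values):
    """Replace missing numeric values with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def common_point_impute(values, numeric=False):
    """Replace missing values with the median (numeric fields) or the
    most common observed value (categorical fields)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

assert average_impute(amounts) == [120.0, 120.0, 90.0, 150.0]
assert common_point_impute(plan_types) == ["PPO", "HMO", "PPO", "PPO"]
```

Production imputation would typically condition the fill value on related fields (as the regression models mentioned above do) rather than using a single column-wide statistic.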

4. Standardize

Data standardization applies standard, computer-readable formats to data, such as ISO 8601 (YYYY-MM-DD) for dates.

5. Normalize

Data normalization allows actions that merge, separate, or alter different database entries without causing redundancies.
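The classic illustration of normalization is factoring repeated information into its own table so an update touches one record instead of many. A sketch with hypothetical claim and provider fields:

```python
# A denormalized claims list repeats provider details on every row.
claims = [
    {"claim_id": 1, "provider_id": "P7", "provider_name": "City Clinic"},
    {"claim_id": 2, "provider_id": "P7", "provider_name": "City Clinic"},
]

# Normalization: store each provider once and reference it by key.
providers = {c["provider_id"]: c["provider_name"] for c in claims}
normalized = [{"claim_id": c["claim_id"], "provider_id": c["provider_id"]}
              for c in claims]

# Renaming the provider is now a single update with no redundant copies
# to drift out of sync.
providers["P7"] = "City Clinic & Associates"
assert providers[normalized[1]["provider_id"]] == "City Clinic & Associates"
```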

6. Deduplicate

Statistical modeling tools, typically applying custom parameters, identify different kinds of data anomalies, such as duplicates, by probability measures. Custom configurations determine the statistical significance threshold above which the application deletes potential duplicates.
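The threshold mechanism can be sketched with a simple string-similarity measure standing in for a full statistical model (the records and the 0.9 cutoff are illustrative assumptions):

```python
from difflib import SequenceMatcher

records = ["Jane Smith 1985-03-15", "Jane Smyth 1985-03-15",
           "Raj Patel 1990-07-02"]
THRESHOLD = 0.9  # configurable significance cutoff

def dedupe(rows, threshold=THRESHOLD):
    """Keep a row only if its similarity to every previously kept row
    falls below the threshold; otherwise treat it as a duplicate."""
    kept = []
    for row in rows:
        if all(SequenceMatcher(None, row, k).ratio() < threshold
               for k in kept):
            kept.append(row)
    return kept

deduped = dedupe(records)
# The "Smyth" misspelling scores above 0.9 against "Smith" and is dropped.
assert len(deduped) == 2
```

Raising the threshold makes deletion more conservative; lowering it catches more near duplicates at the risk of deleting genuinely distinct records, which is why the parameters are typically custom-configured per data domain.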

7. Verify and Enrich

Verifying and enriching data involves running internal match queries to identify records with a high probability of successful merging. This process differs from deduplication in that merged matches complete missing fields in one or both contributing records.
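The distinction is that matched records are merged rather than deleted, so each fills the other's gaps. A minimal sketch, assuming missing fields are `None` and the first record's values take precedence:

```python
# Two matched records for the same member, each missing a field
# the other can supply.
record_a = {"member_id": "M001", "phone": None, "email": "j@example.com"}
record_b = {"member_id": "M001", "phone": "555-0100", "email": None}

def enrich(a, b):
    """Merge two matched records, preferring a's values and filling
    its missing fields from b."""
    merged = dict(a)
    for key, value in b.items():
        if merged.get(key) is None:
            merged[key] = value
    return merged

golden = enrich(record_a, record_b)
assert golden == {"member_id": "M001", "phone": "555-0100",
                  "email": "j@example.com"}
```

The merged result is more complete than either contributing record, which is the "enrichment" the step's name refers to.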

8. Export Data

Exporting data involves transmitting and storing cleansed data for analytical and operational use.


Achieve Effective Master Data Management with Coperor by Gaine

Coperor by Gaine is an industry-first scalable master data management platform built for the healthcare industry’s challenges. With powerful data-modeling capabilities, Coperor helps your organization achieve a single source of truth for data across your IT systems and contracted partners.

To learn more and schedule a live demo, contact Gaine today. 


Opt-in with Gaine for More Insight

Stay ahead of the rest with critical insight into Healthcare and Life Sciences MDM and interoperability techniques, best practices, and the latest solutions.