
Mastering Medical Claims Data: 3 Crucial Steps for Effective Analysis

  • Writer: Zain Jafri
  • Apr 18, 2023
  • 5 min read


Medical Claims Data is an invaluable source of information in Healthcare Analytics, offering insight into the healthcare services utilized, diagnoses, and costs within a population.


These data sets are widely used by Data Analysts, Data Engineers, and Data Scientists. However, navigating Medical Claims Data can be complex, and supporting documentation written with analytics in mind often isn't readily available.


For those new to Medical Claims Data or Healthcare Analytics, getting started can be challenging and prone to pitfalls. In this article, we will outline three crucial steps for working effectively with Medical Claims Data, equipping you with the knowledge and skills to tap into this critical data source.


Don't Just Dive In

While it’s tempting to jump in and generate metrics from Medical Claims Data, you must first understand the data structure and make the transformations your analysis requires. This is a common pitfall and a major risk, as it’s easy to extract figures that seem reasonable but are fundamentally flawed. This happens when:

  • The data is not aggregated appropriately

  • The data is not normalized

  • The data is not translated into meaningful terms

Using Medical Claims Data requires a solid understanding not only of how the data is organized, but also of how you intend to use it from a business and analytical perspective.


The good news is that these risks are easily addressed, as long as you’re aware of them and account for them in your execution plan.


When analyzing Medical Claims Data, I perform these 3 crucial steps (as required by the analysis):


Step 1: Aggregate the Data

First, you need to understand the “grain”, which is the granularity or level of detail of your data. You need to know this both for how the data is stored and for how you plan to use it (i.e., your business or analytical use case).


It’s important to remember Medical Claims Data is generated through the billing process between providers and insurers. Therefore, the grain is usually an individual service rendered. These services are associated with specific claims and patients/members. These fields are usually denoted as Service Line IDs, Claim IDs, and Member IDs. A Service Line is a subcomponent of a Claim ID, and each Claim ID is associated with a Member ID.


There are many different formats for Medical Claims. In most cases, each row in a file will represent a Service Line. There may be additional data like provider information, cost, and diagnosis codes represented as columns.


However, in some cases, especially with diagnosis codes, this information may be represented as additional rows (i.e., pivoted as rows). This happens because a patient/member may have more than one diagnosis code, and the diagnosis codes are repeated across all of the Service Lines. In cases like this, you'll need to aggregate your data to a single row per Service Line before generating aggregate statistics (e.g., total cost). If you don’t, you’ll end up with inflated figures.
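
As a rough sketch of what this looks like in practice, the pandas snippet below uses made-up column names (member_id, claim_id, service_line_id, allowed_amount) and a tiny invented extract to show how summing costs without first collapsing to one row per Service Line inflates the total:

```python
import pandas as pd

# Hypothetical claims extract where diagnosis codes are pivoted as rows,
# so the same service line (and its cost) repeats once per diagnosis.
claims = pd.DataFrame({
    "member_id":       ["M1", "M1", "M1", "M2"],
    "claim_id":        ["C100", "C100", "C100", "C200"],
    "service_line_id": [1, 1, 2, 1],
    "diagnosis_code":  ["E11.9", "I10", "E11.9", "J45.909"],
    "allowed_amount":  [150.0, 150.0, 75.0, 300.0],
})

# A naive sum double-counts the repeated service line (150 appears twice).
naive_total = claims["allowed_amount"].sum()            # 675.0 (inflated)

# Collapse to one row per service line first, then sum.
service_lines = (
    claims
    .groupby(["member_id", "claim_id", "service_line_id"], as_index=False)
    .agg(allowed_amount=("allowed_amount", "first"))
)
correct_total = service_lines["allowed_amount"].sum()   # 525.0

print(naive_total, correct_total)
```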


Tip: When generating an analysis with aggregate data, I often check whether the figures are in line with what I'd expect. If they are way off (e.g., multiples of what I'd expect), the first thing I check is whether the data was aggregated correctly.


Step 2: Translate Key Fields

Medical Claims data contains a lot of important information, yet much of it is captured in various code sets. Some common ones include:

  • Diagnosis Codes: ICD-10

  • Place of Service Codes

  • Revenue Codes (used for Hospital Services)

  • Service Codes: HCPCS/CPT Codes

These code sets are usually very granular (e.g., there are 68,000+ ICD-10 diagnosis codes), and the codes often aren’t meaningful by themselves. To generate meaningful insights, you need to translate them. In many cases, you also need to roll them up to higher-level categories.


This step can be as simple as translating codes to their descriptions, or as complex as applying business/clinical rules and machine learning methods to arrive at higher-level categories. Most analytics will roll up the tens of thousands of diagnosis codes in a data set into a couple hundred medical conditions.
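
Here's a minimal sketch of the simple end of that spectrum. The lookup table is hand-built for illustration; a real analysis would join against a full ICD-10 reference file or a clinical grouper rather than a dictionary of a few codes:

```python
import pandas as pd

# Illustrative lookup only -- real work would use a full ICD-10 reference
# table or a clinical grouper, not a hand-built mapping.
icd10_lookup = pd.DataFrame({
    "diagnosis_code": ["E11.9", "E11.65", "I10", "J45.909"],
    "description": [
        "Type 2 diabetes mellitus without complications",
        "Type 2 diabetes mellitus with hyperglycemia",
        "Essential (primary) hypertension",
        "Unspecified asthma, uncomplicated",
    ],
    "condition_category": ["Diabetes", "Diabetes", "Hypertension", "Asthma"],
})

service_lines = pd.DataFrame({
    "service_line_id": [1, 2, 3, 4],
    "diagnosis_code":  ["E11.9", "I10", "E11.65", "J45.909"],
})

# Translate raw codes into descriptions and roll them up to broader conditions.
translated = service_lines.merge(icd10_lookup, on="diagnosis_code", how="left")
print(translated[["diagnosis_code", "condition_category"]])
```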


Tip: For statistical analysis and machine learning, you will definitely need to aggregate many of these code sets; otherwise, you’ll end up with a data set that is too sparse (i.e., a large number of variables, each with very few observations).


Step 3: Normalize the Data

Now that we've aggregated the data and translated key fields, we're at the last step before diving into the analysis!


Healthcare Analytics often deals with population level analysis, trending, and segmentation. You must also account for the fact that health status, utilization, and outcomes vary significantly between populations. No two individuals or populations are exactly the same!


Therefore, to best perform your analysis, you may need to apply a few data normalization steps. (Note: what I’m referring to is different from the normalization techniques typically applied in statistical analysis or machine learning, which have a more specific definition.)


Here are a few common normalization techniques used in Healthcare Analytics:


Convert Costs to Per-Member Per-Month (PMPM) Basis

One important step in normalizing Medical Claims data is converting costs to a per-member per-month (PMPM) basis for fair comparisons across different populations. By dividing the total cost by the number of members (or patients) and months in the analysis period, a standardized cost metric is obtained, accounting for population size and duration. This normalization method allows for accurate representation of cost trends and meaningful comparisons between populations of different sizes or durations.
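
A minimal sketch of the calculation, using made-up cost totals and member-month counts:

```python
import pandas as pd

# Hypothetical totals for two populations of different sizes.
costs = pd.DataFrame({
    "population":    ["A", "B"],
    "total_cost":    [1_200_000.0, 450_000.0],
    "member_months": [2_400, 1_000],  # sum of members enrolled in each month
})

# PMPM = total cost / member months, making the populations comparable.
costs["pmpm"] = costs["total_cost"] / costs["member_months"]
print(costs)  # A: 500.00 PMPM, B: 450.00 PMPM
```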


Convert Utilization Statistics to Per 1,000 Basis

Another common normalization method is converting utilization statistics to a per 1,000 basis. This involves calculating the rate of utilization per 1,000 patients/members. This normalization method allows for meaningful comparisons of utilization rates across different populations, regardless of size.
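
A similar sketch for a per-1,000 utilization rate, again with invented figures:

```python
import pandas as pd

# Hypothetical annual emergency department (ED) visit counts.
util = pd.DataFrame({
    "population":   ["A", "B"],
    "ed_visits":    [540, 130],
    "member_years": [4_500, 900],  # members covered for the full year
})

# Visits per 1,000 members per year: rate = count / member-years * 1,000.
util["ed_visits_per_1000"] = util["ed_visits"] / util["member_years"] * 1_000
print(util)  # A: 120.0, B: ~144.4 -- B is higher despite fewer raw visits
```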


Case-Mix / Risk Adjust Metrics

Normalizing data with case-mix or risk adjustment is crucial in healthcare analytics. It accounts for differences in patient/member characteristics and severity of illness across populations. Applying case-mix or risk adjustment methodologies enables more accurate and fair comparisons of health outcomes or utilization patterns. There are many different methods and models available to apply this normalization technique.
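
As one illustration of the idea, the sketch below compares observed cost to the cost you'd expect given each population's risk mix. The risk scores are invented; a real analysis would use an established risk model (e.g., an HCC-style model) rather than hand-assigned weights:

```python
import pandas as pd

# Hypothetical member-level data: observed cost plus a risk score produced
# by some risk model; the numbers here are made up for illustration.
members = pd.DataFrame({
    "population":  ["A", "A", "B", "B"],
    "annual_cost": [4_000.0, 12_000.0, 3_000.0, 5_000.0],
    "risk_score":  [1.0, 2.5, 0.8, 1.2],
})

# Overall cost per unit of risk across the whole book of business.
cost_per_risk_unit = members["annual_cost"].sum() / members["risk_score"].sum()

summary = members.groupby("population").agg(
    observed=("annual_cost", "mean"),
    avg_risk=("risk_score", "mean"),
)

# Expected cost given each population's risk mix; an observed-to-expected
# ratio above 1 means the population is costlier than its risk predicts.
summary["expected"] = summary["avg_risk"] * cost_per_risk_unit
summary["o_to_e"] = summary["observed"] / summary["expected"]
print(summary)
```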


Account for Outliers

Outliers, or extreme values in data, can significantly impact the results of data analysis. Therefore, it’s important to consider the effect of outliers on analysis using Medical Claims Data. In some cases, it may make sense to exclude them or apply methods to account for them. Ultimately, it depends on your business objective and the analytical methods you’re applying.
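
For example, one simple approach is to cap (truncate) member costs at a threshold before aggregating. The figures and the $100,000 cap below are purely illustrative; the right treatment depends on your objective:

```python
import pandas as pd

# Hypothetical member-level annual costs with one extreme value.
costs = pd.Series([1_200.0, 3_500.0, 2_800.0, 950.0, 410_000.0])
print(costs.mean())              # 83,690 -- dominated by a single member

# Cap costs at a chosen threshold so one catastrophic case doesn't drive
# the average. The $100,000 threshold is arbitrary, for illustration only.
capped = costs.clip(upper=100_000)
print(capped.mean())             # 21,690 -- far less sensitive to the outlier
```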


Tip: Consider your business objectives, including how you’ll present your results and how they might be used, when evaluating whether you may need to normalize your data. There are many ways to approach this step.


In conclusion, when working with Medical Claims data, it's crucial to follow these three essential steps: aggregating the data appropriately, translating key fields for meaningful analysis, and normalizing the data to account for population differences. These steps ensure that your analysis produces accurate and meaningful insights, and they help you avoid common pitfalls that arise from the complexity of Medical Claims data. By following them, you can confidently analyze Medical Claims data and derive valuable insights for informed decision-making in the healthcare field.

 
 
