Understanding Key Data Types in Healthcare: A Guide for Data Professionals
Explore the four main types of healthcare data: EHRs, claims data, clinical registries, and external reporting. Learn their uses and significance.
4 Types of Healthcare Data Analysts Should Know
Added on 09/08/2024

Speaker 1: Today, we're going to be looking at some of the main types of data that you might be working with as a data professional in healthcare. Stay tuned. Hey everyone, I'm Josh Matlock. For those of you who are new to my channel, I am a clinical data analyst with eight years of experience and in my channel, I teach current and aspiring data professionals how to succeed in the healthcare industry. Jumping right into things, there are four major types of data that you might expect to encounter in the healthcare industry. They are data coming from EHRs or electronic health records, clinical slash disease registries, claims data, and external reporting data. And that's things like data that have to be submitted to various vendors and organizations for various reasons. As you can imagine, healthcare data covers a very wide variety of subjects and it's very rich. So let's start with EHR data. EHRs are a type of software that allow healthcare professionals to document information about a patient, order tests, medications, and labs, schedule appointments and surgeries, assign resources to a patient, and much more. Believe it or not, only in the last 10 to 15 years have we actually seen EHRs be widely adopted in the United States, which could explain why healthcare has a lot of technical debt compared to other industries like banking or tech, which have had digital records for much longer. In general, when you're a patient in a hospital, there could be many different people interacting with the EHR to record the details of the encounter. For example, if a patient came into the hospital with appendicitis, there might be a record showing the patient arriving in the emergency room, along with the notes describing what the doctor actually observed when they evaluated the patient. There might be a section where the patient's vitals were assessed. There might have been lab values that were drawn, a physical examination. 
There could have been relevant imaging performed, as well as a surgery to remove the appendix. A medical coder is going to assign ICD-10-CM codes to provide the patient with a diagnosis, as well as possibly ICD-10-PCS codes, which are the inpatient procedural codes that identify the surgery. Many staff members might be entering information into the patient's chart, and they can see those updates live. However, if you wanted to pull data out of the EMR en masse for a group of patients and not just one, using something like SQL, you're probably not going to be able to see the data live like the clinicians that are documenting. Why is that? Well, many medical applications use a language called MUMPS, which stands for the Massachusetts General Hospital Utility Multiprogramming System, also known as M. MUMPS is a programming and database management language that is used widely in EHRs. However, the way that it stores data is complicated. The data is basically stored like a tree, where data exists at terminal nodes. This differs dramatically from relational database management systems, which store data in a tabular format. It has to be moved over to the tabular format using a data migration process called ETL, which stands for extract, transform, and load. Because there's so much data that needs to be transported, it would be very inefficient to transmit it every single second or minute. So instead, data usually gets moved from that MUMPS database into a relational database in one big giant batch every night. So by the time you see that data in SQL, it's already about a day old usually. There are many tables that you will see imported into the relational database from the EMR system. 
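To make the tree-versus-table point concrete, here is a minimal sketch of the extract, transform, and load idea in Python. It is not how any particular EHR vendor does it: the nested dictionary is only a simplified stand-in for a hierarchical store where values live at terminal nodes, and every name in it (`patient_tree`, `flatten`, `load`, the `observations` table and its columns) is a hypothetical chosen for illustration.

```python
import sqlite3
from typing import Any, Iterator

# Hypothetical, simplified stand-in for a hierarchical record store.
# Real MUMPS globals are organized differently; these keys are invented.
patient_tree = {
    "patient_1001": {
        "encounter_1": {
            "vitals": {"heart_rate": 112, "temp_c": 38.4},
            "labs": {"wbc": 15.2},
        },
    },
    "patient_1002": {
        "encounter_1": {
            "vitals": {"heart_rate": 76},
        },
    },
}


def flatten(tree: dict[str, Any], path: tuple = ()) -> Iterator[tuple]:
    """Extract + transform: yield one flat row per terminal node.

    Each row is the path of keys leading to the value, followed by the value.
    """
    for key, value in tree.items():
        if isinstance(value, dict):
            # Not a terminal node yet, so keep walking down the tree.
            yield from flatten(value, path + (key,))
        else:
            yield path + (key, value)


def load(rows: Iterator[tuple], conn: sqlite3.Connection) -> None:
    """Load: write the flat rows into a relational (tabular) table."""
    conn.execute(
        "CREATE TABLE observations ("
        "patient_id TEXT, encounter_id TEXT, category TEXT, name TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


# One "nightly batch": move everything over in a single pass.
conn = sqlite3.connect(":memory:")
load(flatten(patient_tree), conn)

# After the batch, the data can be queried with ordinary SQL.
elevated = conn.execute(
    "SELECT patient_id, value FROM observations "
    "WHERE name = 'heart_rate' AND value > 100"
).fetchall()
```

Anything documented in the tree after this batch runs would not show up in `observations` until the next run, which is the day-old lag described above. A real nightly batch would also fill many tables in the relational database, not the single table used in this sketch.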
Some of the main ones include patient information, like their name and their demographics, encounter information, and that'll include things like outpatient encounters, where the patient will usually be in the hospital for less than a day, or inpatient encounters, where the patient needed to be hospitalized and spend several days in the hospital. You also have observations information, which are things like lab values, survey data, vitals, and those are recorded multiple times over a given period using something called a flowsheet. You also have a conditions table, which will store things like the diagnosis during the encounter or the admitting diagnosis or the discharge diagnosis. You might have a problem list, which tracks the active and resolved problems that a patient had throughout their lifetime. There's going to be a procedures table that's going to track all of the surgeries or medical interventions that were needed to be performed on the patient, and those could be major things like heart transplants, or they could be routine things like a colonoscopy. This doesn't even scratch the surface of the type of data that you might expect to see in an EMR relational database. The list goes on and on and on, but needless to say, one patient can generate an enormous amount of data within their lifetime for just one hospital. The two most common EHRs are Epic and Cerner, which possess most of the market share when it comes to EHRs. Now, before we move on, I should mention that I have a free 13-page PDF on my website that teaches you more about the most common EHR data that you'll encounter as a data analyst, as well as important clinical concepts that you should be familiar with. So I encourage you to check that out. This is actually an excerpt from my upcoming clinical analytics accelerator course, which I hope to launch by the end of the year. 
So if you want to receive a copy of that PDF, go to my website, datawizardry.academy, input your name and email, and I'll send you a copy of the PDF. You can opt out of future emails at any time. I also have an entire one-hour lesson dedicated to learning the basics of SQL with EMR data using a synthetic dataset called Synthea. So also check that tutorial out if you haven't already.
All right, so next we have claims data. Now, to be honest, claims data is the one that I'm least familiar with as I don't work with this data very often. So I invite other people watching this video to leave a comment if you have anything to add or revise. But for the most part, claims data is simply the record of requests submitted by doctors, hospitals, and pharmacies to an insurance company to get reimbursed for the medical care that they're providing. The amount that a hospital gets paid depends on the complexity of the claim and the resources allocated to treat that patient. There's also a process called adjudication where the claim is evaluated by the insurance company to determine how legitimate the claim is before they actually pay the amount that is requested. For example, a provider might commit fraud by trying to submit a claim using a diagnosis code that they know will earn them more money than a more accurate code that would make them less. Insurance companies don't want to pay more than they have to, so they adjudicate the claims to ensure that the money that they are about to pay is fair and accurate. The claims data might be pre-adjudicated or post-adjudicated, and it might also contain information like the name of the provider submitting the claim, the amount paid out by the insurance company to that provider, codes for procedures that were rendered during the encounter, diagnosis codes, and information about the beneficiary, or the patient whom the claim was about. But claims data doesn't capture as much detail as EHRs do.
For example, we might be able to see that a test was given to a particular patient in the claims data on a certain date, but we won't be able to see what the test results were like we might see in the EHR. So what advantage does claims data confer over EHR data? Well, the data might capture information not just from one health care organization, but many, if it is held by an organization that is administering insurance, like the Centers for Medicare & Medicaid Services. So it is not uncommon to see extremely large data sets emerge from the claims data with hundreds of millions of records. So what it lacks in quality, it makes up for in quantity. An insurance company might be interested in analyzing the claims data to study the possibility of fraud, to identify cost drivers in different areas of health care, and to see if the quality of care aligns with what they're actually paying out to the health care providers. In other words, are the patients actually getting what the insurers are paying for? Hospitals might also be interested in the claims data. They might study referral patterns. They might try to improve population health or increase sales, among other things, using the claims data.
Next, we have clinical and disease registries. And generally speaking, these are repositories of data that are dedicated to certain diseases. So you might have one for stroke patients, another one for cancer patients, another one for patients who had specific surgeries performed. The data is typically collected by a data abstractor who will gather the necessary details from the patient's medical record and plug them into a repository. The data is often then analyzed by the owners of that registry and compared to similar health care organizations across the country. Hospitals might participate in these registries because it helps them identify costly areas of the care provided. It helps identify weaknesses in the provision of care.
The registry might grant the hospital access to resources that will educate them on the best practices of care. Some of the registries might offer accreditation, which will make the hospital appear more reputable. And they can also compare how they're doing to other hospitals. And this last part is really important. Once the data is analyzed, a report containing the results is shared with the health care organization and it will show them their strengths and weaknesses compared to similar hospitals and clinics. Sometimes individual physicians will also be compared to their peers, like how many surgeries were performed by Surgeon A versus Surgeon B versus Surgeon C and what the subsequent infection rates were for each. There are lots of different registries a hospital or clinic might be a part of, and they're typically utilized by analysts, researchers, and people involved in quality improvement efforts to gain insights about their population of interest. So to give you an example, before I was a data analyst, I was a data abstractor for a bunch of different registries. The registry that I was tasked with using the most, however, was one called NSQIP, or the National Surgical Quality Improvement Program. In this repository, data will be tracked for patients receiving general surgeries like appendectomies, colectomies, and hysterectomies. I was tasked with determining if the patients had certain medications, if their vital signs crossed a certain threshold at some point, like having an elevated heart rate, breathing rate, or body temperature, which could indicate a fever, and whether the patient had any mortalities or morbidities, also known as M&Ms. Mortalities simply meaning death from the surgery, morbidities being things like pneumonia or surgical site infections. And if the patient had surgical site infections, what was the severity?
I would go through several patients each day and plug them into a repository where it would then be analyzed by the organization that administered the repository. Then on a quarterly basis, we would get the results back and we would see metrics like an observed-to-expected ratio that showed how many mortalities or morbidities we witnessed relative to what was expected of us. The surgeons would also get to see how many overall surgical site infections (SSIs) their patients suffered from and would get to see how they compared to their peers. Data professionals might be called upon to help automate some of this process, import the results into a database, or analyze the results in a dashboard. The automation of data collection and entry into the registry is not always straightforward, however, because some of that data is really nuanced and requires a human to read through pages and pages of operative notes before they have a good understanding of the complications a patient might have suffered from during the surgery. Some of the registries I've seen throughout my career include surgery registries, cancer registries, infectious disease registries like the National Healthcare Safety Network or NHSN, organ transplant registries, cardiac registries, stroke registries, among many, many others.
Our final category is external reporting data. Given that healthcare is such a complex and heavily regulated industry, there are quite a few external entities that either require hospitals to submit data or make submission optional yet offer numerous benefits for participating. We can split them out into required reporting and optional reporting. For required reporting, let's take hospitals in Washington State as an example. Hospitals are required to submit a list of the adverse events that happened within their hospital to the Washington State Department of Health.
These include things like wrong-site surgeries, retained foreign objects during surgery, falls in a hospital that resulted in major injury or death, among many other things. In addition to state government, there's also data that must be submitted to the federal government. I already mentioned NHSN earlier, but this is a disease registry owned by the Centers for Disease Control and Prevention, or CDC, and it tracks hospital-acquired infections (HAIs) like catheter-associated urinary tract infections, central line-associated bloodstream infections, and C. diff, as well as antimicrobial stewardship and lots of other things. One of the reasons this is required is that many hospitals get money from the government through Medicare. If a hospital slips and does a half-assed job of keeping infections in the hospital under control, they could lose millions of dollars. HAIs are just one type of data that hospitals have to submit to the federal government, though. There are other things like mortality rates for various conditions and survey data that measures patient satisfaction. Specific data has to be submitted to the Centers for Medicare & Medicaid Services. CMS does this to reward the hospitals that exceed expectations and penalize those that don't meet expectations. Now for the optional reporting side of things, many hospitals choose to share a good portion of their data with other organizations. For example, hospitals might choose to share their data with entities called hospital associations that will help facilitate quality improvement efforts for the hospital and provide them with averages and peer hospital comparisons within the state. Hospitals that provide a certain type of care, for example, children's hospitals, have special exclusive memberships with organizations like the Children's Hospital Association. This gets into one of the main perks of optional reporting. Hospitals will often share data with many different organizations so that they can do benchmarking.
Benchmarking is just a fancy way of saying, how well do we compare to hospitals similar to ours? By sharing their data with a central organization, that organization is able to pool the collective results from all of the hospitals it collects data from and come up with various statistics and then they can publish the results so that all of the participants can see if they are above the average or below the average. Some of these organizations provide a huge amount of visibility and impact the reputation of the hospital. For example, U.S. News & World Report runs a program where hospitals submit large volumes of data and these large volumes of data are curated into specific metrics like how many cancer surgeries did we do, how many patients were evaluated for epilepsy. Once all that data is collected from all the participating hospitals, U.S. News & World Report then awards each hospital a ranking among all the hospitals in its cohort. And by cohort I mean like Best Hospitals in the Nation or Best Children's Hospital in the Nation. The ranking will often be based on the collective performance of the various service lines that the hospital has like cardiology, neurology, surgery, nephrology, cancer care. Programs like these require a lot of planning and data wrangling, and they can be very time-consuming to prepare for. The tradeoff though is that hospitals with a really good reputation can attract more patients due to their high score. They can attract more skilled physicians. It can also give a hospital more leverage when negotiating and dealing with health insurance companies, which means more money. Now believe it or not, we've barely even scratched the surface of all of the data that exists within healthcare organizations but that should be about 80% of the data that you might expect to see. There's so much more but hopefully that gives you a general idea of the major types of data that you see in a healthcare system.
If you want to learn more about what it's like to be a data analyst, check out this video next. Thank you so much for watching and I'll see you in another video.
