Understanding Data and Statistics: Home

Definition: Data and Statistics

"Data" and "statistics" are terms that are often used interchangeably, but there is a fundamental difference. Raw data is unprocessed, while statistics are processed data that summarize the raw data. Statistics help us understand patterns and trends within data.

Data are the individual, discrete facts, observations, or measurements collected in their rawest form. They are the 'ingredients' that must be organized to have meaning.

Example: A researcher collects the following numbers for the age of five participants: 35, 22, 48, 35, 51. Each number is a piece of raw data. They are the input for analysis. They are often unstructured or difficult to interpret in isolation.

Statistics: The term statistics has two related but distinct meanings:

The Field of Study (The Method)

Statistics is the scientific discipline that involves the collection, organization, analysis, interpretation, and presentation of data. It is the set of mathematical tools and techniques used to make sense of the raw numbers.

The Calculated Value (The Result)

A statistic is a numerical value calculated from a set of data (a sample) that serves to summarize, describe, or make inferences about the whole.

Example: Taking the age data (35, 22, 48, 35, 51), a statistician calculates the:

  • Mean (Average) Age: 38.2 years.

  • Mode (Most Frequent) Age: 35 years.

  • Range of Ages: 51 − 22 = 29 years.

The resulting values (38.2, 35, and 29) are the statistics because they are the processed output that summarizes the raw data.
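The same three calculations can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library's statistics module; the variable names are illustrative and not part of the original example.

```python
from statistics import mean, mode

# Raw data: the ages of the five participants from the example
ages = [35, 22, 48, 35, 51]

# Statistics: processed values that summarize the raw data
mean_age = mean(ages)              # sum of the ages divided by their count
mode_age = mode(ages)              # most frequently occurring age
age_range = max(ages) - min(ages)  # oldest age minus youngest age

print(f"Mean age:  {mean_age} years")
print(f"Mode age:  {mode_age} years")
print(f"Age range: {age_range} years")
```

The list is the data; the three printed values (38.2, 35, and 29) are the statistics.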

You collect the Data and use Statistics (the methods) to produce Statistics (the summaries).

 

Health Care Example: Data and Statistics

Imagine a hospital administrator is trying to understand why some patients are readmitted to the hospital shortly after discharge.

 

Data (Raw & Unprocessed)

Data are the individual pieces of information recorded for each patient discharged over a month.

Patient ID | Readmitted within 30 days? | Length of Stay (Days) | Age | Primary Diagnosis
001        | No                         | 4                     | 72  | Pneumonia
002        | Yes                        | 6                     | 55  | Heart Failure
003        | No                         | 3                     | 88  | Hip Fracture
004        | Yes                        | 5                     | 61  | Heart Failure
...        | ...                        | ...                   | ... | ...

 

The raw data are the hundreds of individual entries in the hospital's database. By themselves, these entries do not tell the administrator whether there is a problem, only that some specific patients were readmitted. This is the input.


Statistics (Processed & Summarized)

Statistics are the summarized numerical facts derived from analyzing all the raw data entries. They turn the input into actionable information.

From the raw data, the administrator calculates the following statistics:

  • Readmission Rate: "12.5% of all discharged patients were readmitted within 30 days."

  • Average (Mean) Length of Stay: "The average length of stay for patients with a 30-day readmission was 6.2 days."

  • Population-Based Statistic: "Patients with a primary diagnosis of Heart Failure account for 45% of all 30-day readmissions."

These are the calculated metrics and summary figures: the output. These statistics tell the administrator what is happening at a system-wide level (the average, the total proportion, the biggest risk group). They provide a summary that can be used to set goals and improve care protocols.
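As a rough illustration of how such figures are derived from patient records, the sketch below computes the same three kinds of statistics in Python. It uses only the four sample rows shown in the table, and the field names are hypothetical. Because the full dataset is not shown, the results will not match the 12.5%, 6.2 days, and 45% quoted above.

```python
from statistics import mean

# Raw data: only the four sample rows shown in the table.
# Field names are hypothetical; a real database would hold hundreds of rows.
patients = [
    {"id": "001", "readmitted": False, "los_days": 4, "age": 72, "diagnosis": "Pneumonia"},
    {"id": "002", "readmitted": True,  "los_days": 6, "age": 55, "diagnosis": "Heart Failure"},
    {"id": "003", "readmitted": False, "los_days": 3, "age": 88, "diagnosis": "Hip Fracture"},
    {"id": "004", "readmitted": True,  "los_days": 5, "age": 61, "diagnosis": "Heart Failure"},
]

# Subset of patients who were readmitted within 30 days
readmitted = [p for p in patients if p["readmitted"]]

# Readmission rate: readmitted patients as a share of all discharges
readmission_rate = len(readmitted) / len(patients)

# Mean length of stay among readmitted patients only
mean_los_readmitted = mean(p["los_days"] for p in readmitted)

# Share of readmissions whose primary diagnosis is heart failure
hf_share = sum(p["diagnosis"] == "Heart Failure" for p in readmitted) / len(readmitted)

print(f"Readmission rate: {readmission_rate:.1%}")
print(f"Mean length of stay (readmitted): {mean_los_readmitted} days")
print(f"Heart failure share of readmissions: {hf_share:.1%}")
```

With only four rows the summaries are not meaningful on their own. The point is the direction of the flow: individual records go in, and a handful of summary numbers come out.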

Challenges of Collecting Data

Key Challenges in Finding Health Statistics

Challenge | Description | Implication
Data Lag | The time between data collection, analysis, and official publication is extensive. | The "most recent" statistics you find are often 1–3 years old, meaning current trends are not fully reflected.
Fragmented Sources | Data is collected by a diverse array of organizations (federal, state, and local governments; NGOs; academic institutions). | You often need to check multiple agencies (e.g., CDC, state health departments, WHO) to get a complete picture.
Variable Quality & Accessibility | Collection methodologies, definitions, and reporting formats differ significantly between sources. | Comparing data across different states or countries can be difficult because they might not be measuring the exact same metric in the same way.
Granularity Limitations | Detailed data (e.g., incidence or prevalence) is often only available at high geographic levels (national, state) but not at local levels (county, city). | Public health efforts aimed at specific local communities may lack the precise, localized data needed for targeted interventions.
Privacy Concerns | Data that could potentially identify individuals is restricted or aggregated to protect patient privacy (HIPAA in the U.S.). | Rare disease statistics or detailed demographics on small populations are often withheld or presented in very broad categories.