Dictionary Wiki

Data Science Vocabulary: Analytics and ML Terms

Photo by Tima Miroshnichenko

Walk into any modern office, lab, or startup and you'll run into the same vocabulary sooner or later: features, training data, overfitting, transformers, pipelines, lakes. Data science has quietly rewritten how industries make decisions, and with it came a shared technical dialect that engineers, analysts, product managers, and executives all draw from. This reference pulls together the most useful terms — grouped by area — so you can read a research paper, sit in on a sprint planning session, or interview for a role without getting lost in the jargon.

1. Core Concepts to Start With

Before anyone reaches for an algorithm, they need a shared understanding of what counts as data, what a variable is, and what cleaning actually means. The terms below form the common ground every data conversation starts from.

Data science — An interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data for decision-making.
Dataset — A structured collection of data organized for analysis, typically arranged in rows (observations) and columns (variables or features) in tabular format.
Feature — An individual measurable property or characteristic of the data being observed, used as input variables in machine learning models for making predictions.
Data cleaning — The process of detecting and correcting errors, inconsistencies, and missing values in a dataset to improve data quality before analysis.
Exploratory data analysis (EDA) — An approach to analyzing datasets to summarize their main characteristics, often using statistical graphics and visualization methods to discover patterns and anomalies.

Get these right and the rest of the field clicks into place faster. Most early-career mistakes trace back to fuzzy definitions of terms at this level.
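To make "data cleaning" concrete, here is a minimal sketch in plain Python of one common cleaning step, mean imputation — filling a missing value with the average of the observed ones. The records and field names are illustrative, not from any real dataset.

```python
from statistics import mean

# Illustrative rows: one record has a missing "age" value (None).
rows = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": None},   # missing value to be filled
    {"name": "Caro", "age": 28},
]

# Detect the missing values, then fill them with the mean of observed ages.
observed = [r["age"] for r in rows if r["age"] is not None]
fill = mean(observed)               # (34 + 28) / 2 = 31

cleaned = [{**r, "age": r["age"] if r["age"] is not None else fill}
           for r in rows]

print(cleaned[1]["age"])  # the missing age is now 31
```

Real projects typically lean on libraries such as pandas for this, but the logic — detect, decide, fill — is the same.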

2. The Statistics Toolkit

Every useful result in data science has to pass a statistical smell test. Averages lie, correlations mislead, and small samples do both at once — so practitioners lean on a small set of measures and methods to stay honest with the numbers.

Mean (average) — The sum of all values in a dataset divided by the number of values, providing a measure of central tendency that represents the typical value.
Standard deviation — A measure of the amount of variation or dispersion in a set of values, indicating how spread out the data points are from the mean.
Correlation — A statistical measure that describes the strength and direction of the relationship between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).
Hypothesis testing — A statistical method used to determine whether there is enough evidence in a sample of data to conclude that a certain condition is true for the entire population.
Regression — A statistical technique that models the relationship between a dependent variable and one or more independent variables, used for prediction and understanding relationships.

Strong statistical intuition separates analysts who ship reliable findings from those who mistake noise for signal.

3. What Machine Learning Actually Is

Machine learning is less magic than it sounds. Strip away the hype and you're left with algorithms that adjust their own parameters based on examples. A few terms describe almost everything that happens inside that loop.

Machine learning — A subset of artificial intelligence in which algorithms learn patterns from data and improve their performance on a task through experience without being explicitly programmed.
Training data — The dataset used to teach a machine learning model, providing examples from which the algorithm learns the patterns and relationships needed to make predictions.
Model — A mathematical representation of a real-world process created by a machine learning algorithm, used to make predictions or decisions based on new, unseen data.
Overfitting — A modeling error that occurs when a machine learning model learns the training data too closely, including noise and random fluctuations, resulting in poor performance on new data.
Cross-validation — A technique for evaluating machine learning models by dividing data into subsets, training on some and testing on others, to assess how well the model generalizes to new data.

Keep these five ideas in your back pocket and most ML conversations — papers, product reviews, code reviews — become much easier to follow.
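Cross-validation is easier to picture in code than in prose. The sketch below is a minimal hand-rolled k-fold splitter (real projects would use a library routine such as scikit-learn's `KFold`): each fold is held out once as the test set while the model trains on the rest.

```python
# Minimal k-fold cross-validation sketch over sample indices.
# Assumes n_samples divides evenly by k, to keep the example short.
def k_fold_splits(n_samples, k):
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        test = indices[i * fold_size : (i + 1) * fold_size]
        train = indices[: i * fold_size] + indices[(i + 1) * fold_size :]
        yield train, test

for train, test in k_fold_splits(10, 5):
    print(len(train), len(test))  # 8 training samples, 2 held-out samples per fold
```

Averaging a model's score across the five held-out folds gives a far better estimate of how it will generalize than a single train/test split — which is exactly the defense against overfitting the definitions above describe.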

4. The Main Flavors of ML

Machine learning isn't one recipe; it's a family. Each branch targets a different situation depending on whether you have labels, whether the problem involves sequential decisions, and what kind of answer you want out.

Supervised learning — A type of machine learning in which the algorithm is trained on labeled data, learning to map inputs to known outputs for tasks like classification and regression.
Unsupervised learning — A type of machine learning in which the algorithm discovers patterns in unlabeled data without predefined categories, used for clustering and dimensionality reduction.
Reinforcement learning — A type of machine learning in which an agent learns to make decisions by taking actions in an environment and receiving rewards or penalties based on outcomes.
Classification — A supervised learning task that assigns data points to predefined categories or classes, such as identifying whether an email is spam or not spam.
Clustering — An unsupervised learning technique that groups similar data points together based on shared characteristics, revealing natural structures within the data.

Picking the right flavor is usually the biggest single decision in a project. Match the approach to the data you actually have, not the one you wish you had.
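Supervised classification can be shown in miniature with a 1-nearest-neighbour classifier: label a new point with the class of the closest labeled example. The coordinates and labels below are invented purely for illustration.

```python
# Supervised learning in miniature: 1-nearest-neighbour classification.
# train is a list of ((x, y), label) pairs; point is an unlabeled (x, y).
def nearest_neighbour(train, point):
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    _, label = min(train, key=lambda ex: dist2(ex[0], point))
    return label

labelled = [((1.0, 1.0), "spam"), ((1.2, 0.8), "spam"),
            ((5.0, 5.0), "not spam"), ((5.5, 4.5), "not spam")]

print(nearest_neighbour(labelled, (1.1, 0.9)))  # "spam"
print(nearest_neighbour(labelled, (5.2, 5.1)))  # "not spam"
```

Clustering would start from the same points but without the labels, asking the algorithm to discover the two groups on its own — that difference is the whole supervised/unsupervised divide.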

5. Neural Networks and Deep Learning

Deep learning is the branch that gets most of the headlines. It stacks layers of artificial neurons so the model can learn its own internal representations instead of relying on hand-crafted features, which is why it dominates image, audio, and language tasks.

Neural network — A computing system inspired by biological neural networks, consisting of interconnected nodes (neurons) organized in layers that process information and learn patterns from data.
Deep learning — A subset of machine learning that uses neural networks with many layers (deep networks) to learn hierarchical representations of data for complex tasks.
Convolutional neural network (CNN) — A type of deep learning architecture designed for processing structured grid data like images, using convolutional layers to detect features and patterns.
Natural language processing (NLP) — A field combining linguistics and machine learning to enable computers to understand, interpret, and generate human language.
Transfer learning — A technique in which a model trained on one task is repurposed as the starting point for a model on a different but related task, reducing training time and data requirements.

Most of the generative tools people talk about — chatbots, image generators, speech systems — sit on top of the ideas in this list.
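The building block under all of this is small enough to write out. Below is a single artificial neuron — a weighted sum plus a bias, squashed by a sigmoid — with arbitrary example weights; a deep network is many of these stacked in layers, with training adjusting the weights from data.

```python
import math

# One artificial neuron: weighted sum of inputs plus a bias,
# passed through a sigmoid activation that maps any value into (0, 1).
def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))

# Example weights and bias chosen arbitrarily for illustration.
out = neuron([0.5, -1.0], weights=[0.8, 0.3], bias=0.1)
print(round(out, 3))  # about 0.55
```

Everything else in this section — CNNs, NLP models, transfer learning — is built from variations on this unit, arranged and connected in different ways.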

6. The Plumbing: Data Engineering

Before a model is trained, data has to get where it needs to go, in the shape the analyst wants. Data engineering is the quiet discipline that makes everything upstream of a notebook work.

Storage and Processing Layers

A data warehouse is the tidy, structured repository tuned for fast analytical queries — think aggregated sales numbers at month-end. A data lake, by contrast, stores raw logs, images, and JSON blobs in their original form so engineers can decide later how to use them. ETL (Extract, Transform, Load) names the classical pattern for moving information between systems, while SQL remains the default language for asking relational databases for answers. APIs let one application pull or push data from another without a human in the loop.
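The ETL pattern and SQL can both be sketched in a few lines using SQLite, which ships in Python's standard library. The raw records below are invented: extract them, transform (normalize names, drop bad rows), load them into a table, then query with SQL.

```python
import sqlite3

raw = [("  alice ", 120), ("BOB", 95), ("", 40)]            # extract

clean = [(name.strip().title(), amount)                     # transform
         for name, amount in raw if name.strip()]

conn = sqlite3.connect(":memory:")                          # load
conn.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 215 — the empty-name row was dropped during transform
```

Production ETL runs on far bigger systems, but the shape — pull, reshape, write, then ask questions in SQL — is the same.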

Pipelines and Governance

A data pipeline is an automated relay race that passes records from source to destination on a schedule. Batch processing handles data in large scheduled chunks — overnight loads, for example — while stream processing reacts to events the moment they arrive, which matters for fraud detection or live dashboards. Layered over all of this, data governance sets the rules for quality, privacy, access, and retention that keep a company's data trustworthy and compliant.
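The relay-race idea maps neatly onto chained Python generators, a common in-process sketch of a pipeline. Because generators are lazy, each record flows through one at a time — a small-scale analogue of stream processing. The event records and threshold here are illustrative.

```python
# A pipeline as chained stages: source -> transform -> sink.
def source(events):
    for e in events:
        yield e

def transform(records):
    for r in records:
        yield {"user": r["user"], "amount_usd": r["amount_cents"] / 100}

def sink(records):
    # Keep only events above an arbitrary illustrative threshold.
    return [r for r in records if r["amount_usd"] > 1.0]

events = [{"user": "a", "amount_cents": 250},
          {"user": "b", "amount_cents": 50}]

result = sink(transform(source(events)))
print(result)  # only user "a" survives the filter
```

Batch processing would feed this chain a large accumulated file on a schedule; stream processing would feed it each event as it arrives — same stages, different cadence.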

7. Making Data Visible

Numbers in a table rarely change minds on their own. Visualization turns rows and columns into shapes the human eye can interpret quickly. Dashboards pack key indicators into a single screen. Bar charts compare categories side by side; line charts trace changes over time; scatter plots show how two variables move together (or don't); heatmaps use color intensity to expose patterns in large grids. Good chart design depends as much on knowing the audience as on choosing the right geometry — a chart that works in an engineering standup may flop in a board meeting, and vice versa.
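The core trick of a bar chart — mapping magnitude to length — is simple enough to sketch without a plotting library (real work would use matplotlib or similar; the regional figures below are invented).

```python
# A text-mode bar chart: each value becomes a bar of proportional length.
sales = {"North": 42, "South": 17, "East": 30}

lines = [f"{region:<6} {'#' * value} {value}"
         for region, value in sales.items()]
print("\n".join(lines))
```

Even in this crude form, the eye can rank the three regions instantly — which is the whole argument for visualizing rather than tabulating.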

8. The Language of Big Data

"Big data" gets thrown around loosely, but it has a real meaning: datasets so large, fast, or varied that traditional single-machine tools can't process them in a reasonable time. The classic shorthand is the three Vs — Volume, Velocity, and Variety. To tame datasets like these, engineers use distributed frameworks such as Hadoop and Spark to spread work across clusters. Cloud providers (AWS, Azure, Google Cloud) rent the underlying hardware on demand, and MapReduce remains the reference programming model for parallelizing a job across many machines. Fluency with this vocabulary is almost required once a project outgrows a laptop.

9. AI Terms You'll Hear Everywhere

Artificial intelligence is the umbrella term that holds machine learning, reasoning systems, and everything in between. A large language model (LLM) is a neural network trained on enormous volumes of text until it can write, summarize, and answer questions. Generative AI covers any system that produces new outputs — paragraphs, images, code, audio — rather than classifying existing ones. Computer vision teaches machines to read images and video the way humans read text. Explainable AI (XAI) tries to crack open the black box so users can see why a model decided what it did. And AI ethics wrestles with the harder questions: bias, consent, accountability, environmental cost.

10. Building a Career in the Field

Learning the words is step one; putting them to work is where a career starts. Pick up Python or R as your first language and keep SQL close at hand. Study statistics and linear algebra deeply enough to read a paper without panic. Treat Kaggle competitions, open-source repos, and your own side projects as your real portfolio — hiring managers notice what you've built, not just what you've memorized. Over time, the vocabulary in this guide becomes less a glossary and more a mental map of the terrain, showing you where you've already been and which direction to head next.
