CutScore | Reinforcement learning

What it is

Reinforcement learning (RL) is a machine learning technique that trains software to make decisions that achieve the best possible outcome. Rather than learning from labeled examples, an agent interacts with an environment, takes actions, and receives reward feedback (positive, negative, or zero) after each step. Over many trials the agent develops a policy: a mapping from states to actions that maximizes cumulative reward over time.
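The loop described above (act, observe a reward, update, repeat) can be sketched in a few lines. This is a minimal illustration, not exam material: the five-cell corridor, the function names `step` and `train`, and the choice of tabular Q-learning with these hyperparameters are all assumptions made for the example.

```python
import random

N_STATES = 5        # assumed toy layout: corridor cells 0..4, cell 4 is the goal
ACTIONS = (-1, +1)  # step left or step right


def step(state, action):
    """Environment: apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0  # reward arrives only when the goal is reached
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning; returns the greedy policy as {state: action}."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    def greedy(state):
        # Pick the action with the highest estimated value; break ties at random.
        return max(ACTIONS, key=lambda a: (q[(state, a)], rng.random()))

    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # occasional random action
            else:
                action = greedy(state)
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward plus discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break

    return {s: greedy(s) for s in range(N_STATES - 1)}


policy = train()
print(policy)
```

The returned dictionary is the policy: for each non-goal cell it names the action the agent now prefers. No cell was ever labeled with a correct move; the preference for stepping right emerges from the reward alone.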

Mental model

Think of a chess program learning entirely by playing games. Nobody tells it which move is correct; it only knows, at the end, whether it won or lost. It explores different moves (exploration), gradually favors the ones that tend to lead to wins (exploitation), and eventually develops a strategy — without ever being handed a labeled dataset of "good move / bad move" pairs.

This exploration-exploitation trade-off is the defining dynamic of RL: the agent must balance trying unfamiliar actions to discover what they pay against repeating actions it already knows are high-reward.
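One common way to strike this balance is an epsilon-greedy rule: with a small probability the agent picks an action at random, and otherwise it picks the action with the best estimated reward so far. The sketch below applies that rule to a three-armed bandit. The function name `run_bandit`, the reward means, and the Gaussian noise are hypothetical values chosen for illustration.

```python
import random


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a multi-armed bandit; returns (estimates, pull counts)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms  # running average reward observed per arm
    counts = [0] * n_arms       # how often each arm has been pulled

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda i: estimates[i])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)                  # noisy reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
    return estimates, counts


# Assumed true mean rewards; the agent never sees these directly.
estimates, counts = run_bandit([0.2, 0.5, 0.8])
print(counts)
```

Setting `epsilon` to 0 removes exploration entirely, so the agent can lock onto a mediocre arm; setting it to 1 makes every choice random, so the agent never uses what it has learned.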

When to use it

The exam tests whether you can distinguish RL from supervised and unsupervised learning. The key axis is what signal drives learning.

	Supervised learning	Unsupervised learning	Reinforcement learning
Training signal	Labeled input-output pairs	No labels; find hidden patterns	Reward feedback from the environment
Human involvement	Requires a human supervisor to label data	No supervisor; no specified output	Defined goal, no pre-labeled data
Learns to…	Map inputs to known outputs	Discover structure in data	Take sequential actions to maximize cumulative reward
Representative use cases	Classification, regression	Clustering, dimensionality reduction	Optimization, sequential decision-making, personalization

RL is suited to problems where the right answer is not known in advance but success or failure can be measured — for example, cloud resource allocation optimization, marketing personalization through customized recommendations, or financial prediction by analyzing market dynamics.

Common misconception

The trap: candidates often assume that because RL uses feedback, it is a form of supervised learning — after all, supervised learning also uses correct/incorrect feedback. The distinction is structural, not superficial.

Supervised learning requires pre-labeled training data: every input already has a known correct output provided by a human supervisor before training begins. RL has no such dataset. The agent receives reward signals after it acts, the signals are often delayed (a short-term sacrifice can lead to a better long-term outcome), and the agent itself generates the training experience by exploring the environment. There is no "answer key."
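The point about delayed signals can be made concrete with a small calculation. The sketch below compares two hypothetical reward sequences using a discounted sum; the discount factor `gamma` and the reward values are assumptions for illustration and do not come from the exam material.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, with each reward discounted by gamma per step of delay."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical reward sequences over four steps.
greedy_now = [1, 0, 0, 0]      # small immediate payoff, nothing afterwards
invest_first = [-1, 0, 0, 10]  # up-front cost, larger payoff three steps later

print(discounted_return(greedy_now))
print(discounted_return(invest_first))
```

Under these assumed numbers the second sequence scores roughly 6.29 against 1.0 for the first, even though its first step looks worse. An agent that maximizes cumulative reward learns to accept that initial cost, which a per-example "correct label" cannot express.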

A second misconception is that RL always requires real-world interaction. Because real-world testing can be risky or impractical, RL agents are commonly trained inside simulated environments — the agent learns from the simulation, not directly from the live system.

How it shows up on the exam

The cognitive target for this concept is distinction — candidates must identify which ML paradigm fits a described scenario. Signal phrases in scenario stems that point toward RL include:

"agent," "environment," "reward," or "policy"
"sequential decisions" or "takes actions"
"optimize over time" or "maximize cumulative"
"trial-and-error" or "no labeled data, but a clear goal"

Candidates often confuse RL with supervised learning when a scenario mentions feedback or scoring. The grounding question is: was the correct output known before training, or did the agent discover it through interaction? If the latter, the scenario is describing RL.

The exam may also probe the five-element framework: agent, environment, action, state, and reward. Being able to map a scenario description onto these elements — rather than memorizing the names in isolation — is the practical skill being assessed.
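As a practice aid, the mapping exercise can be written down as a simple record. The sketch below frames a hypothetical cloud autoscaling scenario in terms of the five elements; the class name `RLFraming` and every field value are illustrative assumptions, not wording taken from the exam.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RLFraming:
    """The five elements of an RL problem, filled in for one scenario."""
    agent: str
    environment: str
    state: str
    action: str
    reward: str


# Hypothetical scenario: a controller that learns how many servers to run.
autoscaler = RLFraming(
    agent="autoscaling controller",
    environment="cloud cluster and its incoming traffic",
    state="current load, latency, and number of running instances",
    action="add, remove, or keep instances",
    reward="lower cost, with a penalty when latency targets are missed",
)

for element, value in asdict(autoscaler).items():
    print(f"{element:<12} {value}")
```

If a scenario stem leaves one of these fields impossible to fill in (for example, there is no action the system chooses), that is a hint the scenario describes a different paradigm.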

Reinforcement learning — AIF-C01
