Let’s be honest: Rock Ridge is full of overachievers trying to make their mark in big tech at a young age. There is nothing wrong with that, of course, but there’s an okay way to do it and a great way to do it. If you’ve ever researched questions like “What is the best way to build an autopilot system?” for classes like Independent Science Research, then you’ve probably heard of supervised and unsupervised learning. However, numerous other methods can be better suited to train a neural network, depending on your product.
Take reinforcement learning (RL), for example. What do you think powers the recommendation system when you are binge-watching a new show on Netflix? Or how TikTok always seems to know what you want to watch, trapping you in an endless cycle of doomscrolling? Well, you can partially thank RL for that. RL plays a role in a surprising range of products, from AI chess bots to Amazon’s Prime Air autonomous drone delivery software. And it might just be what you need for your next research project.
But what is reinforcement learning?
Let’s step into the Thought Bubble.
Suppose you’re given a week to teach a monkey to give you a fist bump. To do this, you pick a reward: a banana. To earn that reward, the monkey has to give you a successful fist bump. You show it the banana and demonstrate the fist bump. However, the monkey won’t understand on the first try. It’s more likely that it will just make monkey noises for the first ten minutes. But whenever it does give you a fist bump, you reward it with a banana, and you withhold the banana every time it doesn’t. Over time, the monkey figures out which action it has to take to receive a banana.
This system of trial and error represents the core principles of RL: an agent (the monkey) learns by interacting with an environment (you). It does this by trying different actions and receiving rewards (banana) or penalties (no banana) to gradually learn which choices/patterns lead to the attainment of the reward. The monkey’s only goal is to maximize the cumulative reward over time.
In RL, the reward is usually a number that the agent receives after each action, signaling how good that action was. The reward can be positive (to encourage desirable actions) or negative (to discourage undesirable actions).
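To make the monkey example concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a standard recipe: the action names, the reward of 1 for a fist bump and 0 otherwise, and the `train_monkey` function itself. The "monkey" keeps a running estimate of how rewarding each action is, mostly repeats the best one, and occasionally tries something random.

```python
import random


def train_monkey(episodes=500, epsilon=0.2, step_size=0.1, seed=0):
    """Trial-and-error learning over three hypothetical actions."""
    rng = random.Random(seed)
    actions = ["make_noise", "wave", "fist_bump"]
    # The agent's current estimate of how rewarding each action is.
    values = {action: 0.0 for action in actions}

    for _ in range(episodes):
        # Explore a random action some of the time, otherwise exploit the best guess.
        if rng.random() < epsilon:
            action = rng.choice(actions)
        else:
            action = max(actions, key=values.get)

        # The environment (you) hands out a banana only for a fist bump.
        reward = 1.0 if action == "fist_bump" else 0.0

        # Nudge the estimate for this action toward the reward just received.
        values[action] += step_size * (reward - values[action])

    return values


if __name__ == "__main__":
    print(train_monkey())
```

After enough attempts, the estimate for the fist bump climbs toward 1 while the other actions stay at 0, which is the code version of the monkey figuring out what earns a banana.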
The history behind this whole process is pretty interesting.
The mathematical foundations date back to the 1950s, when Richard Bellman formulated the equation that forms the basis of his theory of dynamic programming. The Bellman equation expresses the relationship between the value of a state and the values of its successor states, which makes it possible to break a complex task down into smaller, simpler ones.
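For readers who want to see it written out, one common textbook form of the Bellman equation is shown below. The symbols are assumptions introduced here for illustration: $V(s)$ is the value of state $s$, $R(s, a)$ is the reward for taking action $a$ in that state, $\gamma$ is a discount factor between 0 and 1, and $P(s' \mid s, a)$ is the probability of landing in state $s'$ next.

```latex
V(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s') \Big]
```

In words: a state is worth the best immediate reward you can get from it, plus the discounted value of wherever that action takes you next.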
A major breakthrough came in the 1980s when Richard Sutton and Andrew Barto developed an algorithm called temporal difference (TD) learning. Instead of waiting for a task to finish before learning from it, an agent using TD learning updates its knowledge after every step along the way. In simpler terms, it’s like tasting and adjusting a dish as you cook, rather than waiting until the whole dish is finished to find out what went wrong.
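A rough sketch of the TD idea, under some made-up assumptions: the agent walks down a short hallway of five states, always stepping forward, and gets a reward of 1 only when it reaches the end. The function name `td0_chain`, the hallway, and the parameter values are all hypothetical. The point is the update inside the loop, which happens after every single step rather than at the end of the walk.

```python
def td0_chain(num_states=5, episodes=200, alpha=0.1, gamma=0.9):
    """Estimate state values on a one-way hallway with step-by-step TD updates."""
    # One extra slot for the terminal state, whose value stays at 0.
    values = [0.0] * (num_states + 1)

    for _ in range(episodes):
        state = 0
        while state < num_states:
            next_state = state + 1
            # Reward arrives only on the step that reaches the end of the hallway.
            reward = 1.0 if next_state == num_states else 0.0

            # TD update: move the current estimate toward
            # (reward + discounted estimate of the next state).
            td_target = reward + gamma * values[next_state]
            values[state] += alpha * (td_target - values[state])

            state = next_state

    return values[:num_states]


if __name__ == "__main__":
    print(td0_chain())
```

States closer to the reward end up with higher values, and the agent learns that without ever waiting for a full walk to finish before updating.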
Then, in 1989, Christopher Watkins introduced the Q-learning algorithm. It is still widely used today and is often the first algorithm people meet when they start learning RL. Q-learning lets an agent learn an optimal policy for making decisions in an environment by estimating how good each action is in each situation. Fun fact: the “Q” stands for “quality.”
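Here is a small Q-learning sketch, again with invented details: a corridor of five squares where the agent can step left or right and earns a reward of 1 for reaching the last square. The function name `q_learning_corridor` and all parameter values are assumptions for illustration. The table `q` stores the estimated "quality" of each action in each square.

```python
import random


def q_learning_corridor(length=5, episodes=300, alpha=0.5, gamma=0.9,
                        epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical left/right corridor."""
    rng = random.Random(seed)
    actions = (-1, +1)  # step left or step right
    goal = length - 1
    q = {(s, a): 0.0 for s in range(length) for a in actions}

    for _ in range(episodes):
        state = 0
        while state != goal:
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                # Exploit the best-known action, breaking ties at random.
                best = max(q[(state, a)] for a in actions)
                action = rng.choice([a for a in actions if q[(state, a)] == best])

            # Walls at both ends keep the agent inside the corridor.
            next_state = min(max(state + action, 0), goal)
            reward = 1.0 if next_state == goal else 0.0

            # Q-learning update: use the best action available in the next state.
            best_next = 0.0 if next_state == goal else max(
                q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)])

            state = next_state

    # The learned policy picks the highest-quality action in each square.
    policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in range(goal)}
    return q, policy


if __name__ == "__main__":
    _, learned_policy = q_learning_corridor()
    print(learned_policy)
```

Once training is done, reading off the best action in each square gives the policy, which in this toy corridor should be to keep stepping right toward the reward.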
As the 20th century drew to a close, RL started making waves in real applications. For example, IBM researcher Gerald Tesauro showed just how far RL had come with his TD-Gammon program, which was developed in the early 1990s. The program taught itself to play backgammon at an expert level by combining neural networks with reinforcement learning.
Now, the RL industry is growing faster than ever. It was valued at $10.49 billion last year and $13.43 billion this year. Moreover, it’s projected to grow at a 28.8% compound annual rate. To put this in perspective, the artificial intelligence market as a whole is expected to grow at a 39.0% compound annual rate. In other words, RL is only going to be worth more and more in the future. If you are a Rock Ridge student interested in machine learning, this is your chance to join a growing field and potentially even work on cutting-edge research and technology.





































