What is the reward in Reinforcement Learning?

ngrunenberg · Oct 17, 2018

I know I'm not that bright and I realize that this is a silly question to anyone in the field, but I was curious what the reward is in reinforcement learning algorithms.

I understand the concept behind reinforcement learning, though I am unsure of how you could program a reward into a program. There is no limbic system that would respond positively because it has been rewarded with an influx of dopamine, and even if we could program this into an algorithm, how would it know to respond as a biological entity would? I imagine that, devoid of a biological "purpose" to perpetuate one's genes, there would be no real reward that would bring the agent closer to said purpose.

Again, apologies for my ignorance and thanks in advance for taking the time to reply.

jedishrfu · Oct 17, 2018

Here’s a blog on reinforcement learning

https://vinodsblog.com/2018/04/16/reinforcement-learning-reward-for-learning/

My view is reinforcement learning is like course correction as you drive a car. You feel a sense of accomplishment if you stay within the lines. The reward is staying within the lines. But if your car veers left or right then you adjust to compensate and stay within the lines.

The algorithm does the same: as long as it gets things right, the evaluation continues to apply rewards/adjustments to keep it in that mode, but if the evaluation decreases then a negative reward/adjustment is applied. The evaluation indicates whether things are getting better, staying the same or getting worse, and the algorithm adjusts its weights accordingly. The reward algorithm evaluates how well things stayed within the bounds of the system. It's a learn-as-you-go scheme, aka continuous learning.

jedishrfu · Oct 17, 2018

One key point made in the blog referenced above is that there are three kinds of learning systems:

1) supervised learning where you train it and test it until it works well and then it goes into production and no changes are made until the next update.

It's good for identifying known patterns

2) unsupervised where it identifies new patterns using statistics to locate interesting clusters of data

It's good for finding hidden trends.

3) reinforcement learning where the system is training itself continuously so it continually gets better and better at doing the task at hand

It's good for learning a behavior. As an example, you might have an RL cruise control with various inputs and a driver. The driver sets the speed and turns on the system. The system tries to match the speed noting motor RPMs, uphill downhill positioning, LIDAR ... whatever cool gadget you can think of. However, every so often the driver does a correction that the system notes and it learns from it to match the driver's style of driving while at the same time maintaining the speed. So its reward is the driver leaves it alone and the punishment is when the driver brakes or hits the gas...

kind of like when you're driving on a long and lonely road while your spouse is sleeping peacefully in the passenger seat next to you and then you hit a bump... What happened? are you okay? Did you fall asleep? Can I drive? When will we get there? and then the kids wake up we're hungry... are we there yet?

The joys of driving!

anorlunda · Oct 17, 2018

Google used to provide a service. I think it was called Google 411. It was a yellow pages service that worked with dumb flip phones or land line phones, not the Internet. You might say, "auto repair in scotia new york" or "quick, get me a lawyer" and it would give you a phone number. Google's goal was to learn to recognize speech from any user, regardless of voice or accent. That technology is heavily used in today's smart speakers.

If the user accepts the first suggestion and asks to be connected to that number, that is positive reinforcement.
If the user hangs up, that is negative reinforcement.
If the user asks for more suggestions, that is intermediate reinforcement.

That doesn't sound mysterious at all.
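To make this concrete, here is a minimal sketch of how those three outcomes could be turned into numbers. The specific values (1.0, 0.0, -1.0) and the names `REWARDS`, `reward` and `average_reward` are assumptions for illustration only; all that is described above is the ordering (accept is best, hanging up is worst, asking for more is in between), not how the real service scored calls.

```python
# Hypothetical reward values: only the ordering comes from the description
# above (accept > ask for more > hang up); the numbers are made up.
REWARDS = {
    "connected": 1.0,         # user accepted the first suggestion
    "more_suggestions": 0.0,  # user asked for other options
    "hung_up": -1.0,          # user gave up
}


def reward(outcome: str) -> float:
    """Return the number the learner receives for one call."""
    return REWARDS[outcome]


def average_reward(outcomes) -> float:
    """Mean reward over a batch of calls; the learner tries to push this up."""
    return sum(reward(o) for o in outcomes) / len(outcomes)


calls = ["connected", "hung_up", "more_suggestions", "connected"]
print(average_reward(calls))
```

The "reward" is just a number computed from what the user did. Nothing in the program has to feel anything about it.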

ngrunenberg · Oct 18, 2018

Thank you for the explanation and the link to the blog, that definitely cleared a few things up.

anorlunda said:

If the user accepts the first suggestion and asks to be connected to that number, that is positive reinforcement.
If the user hangs up, that is negative reinforcement.
If the user asks for more suggestions, that is intermediate reinforcement.

That doesn't sound mysterious at all.

It's not the process itself that is mysterious, I understand the concept; what is "mysterious" to me is how you can reward something that has no subjective interpretation of what a reward is. It makes sense in children and animals; the cessation of pain is a reward when learning to not walk into fire for example, but I fail to see the analogue in a system that has no reason to avoid mistakes. I'm not sure if I've articulated my issue well enough so sorry in advance for that.

jedishrfu · Oct 18, 2018

This is just a naming convention. We can relate to rewards and punishments as positive and negative, but with a more visceral feeling. There are similar notions in electrical systems, where plugs and sockets are defined in terms of male and female connectors, but clearly there’s no procreation involved.

anorlunda · Oct 18, 2018

ngrunenberg said:

It's not the process itself that is mysterious, I understand the concept; what is "mysterious" to me is how you can reward something that has no subjective interpretation of what a reward is. It makes sense in children and animals; the cessation of pain is a reward when learning to not walk into fire for example, but I fail to see the analogue in a system that has no reason to avoid mistakes. I'm not sure if I've articulated my issue well enough so sorry in advance for that.

You're trying to anthropomorphize it.

Many of these systems are neural nets with many adjustable parameters. A set of adjustments that works well, we keep. Those that fail, we discard. Then repeat with new test data. Continue until the machine works almost always. That is a way to apply reward/punishment to machines. It is only an analogy to human reward/punishment.
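As a rough sketch of the keep/discard loop, here is a simple random hill climber. This is not how neural nets are usually trained in practice (that is typically gradient-based); it is the plainest version of "try an adjustment, keep it if the score improves". The `score` function, its target values and the step size are all made up for the example.

```python
import random


def score(params):
    """Hypothetical evaluation: higher is better, best at [3.0, -2.0]."""
    target = [3.0, -2.0]
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def hill_climb(steps=2000, step_size=0.1, seed=0):
    rng = random.Random(seed)
    params = [0.0, 0.0]
    best = score(params)
    for _ in range(steps):
        # Try a small random adjustment to every parameter.
        candidate = [p + rng.gauss(0.0, step_size) for p in params]
        candidate_score = score(candidate)
        if candidate_score > best:
            # The adjustment worked: keep it.
            params, best = candidate, candidate_score
        # Otherwise the candidate is simply discarded.
    return params, best


params, best = hill_climb()
print(params, best)
```

The "reward" here is only the comparison `candidate_score > best`. The machine has no reason of its own to avoid mistakes; the loop just never keeps them.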

atyy · Oct 19, 2018

Reward is something that the system receives at the end of the task that provides some information as to how well the task has been completed.

The subjective notion of reward still comes from the human designer of the system, since it is the human designer that specifies "how well the task has been completed".

In this respect, reinforcement learning is not any different from supervised learning, since in both cases it is the human designer that specifies "how well the task has been completed". What is different in reinforcement learning is that the information about how well the task has been completed is provided in much less detail (just various degrees of good or bad), and we do not know exactly which of the actions performed some time before the reward was obtained were good and which were not.
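One common way of dealing with that last problem is to spread a late reward back over the earlier steps, with less credit going to steps further from the reward. The sketch below computes such "discounted returns" for a single episode. The discount factor `gamma = 0.9` and the example rewards are assumptions chosen for illustration, and this is only one of several ways to assign credit.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three steps with no feedback, then a single reward at the end of the task.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards))
```

Every step gets some credit for the final reward, but the steps closest to it get the most. Whether a particular early action really deserved its share is exactly what the system cannot know from one episode.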

atyy · Oct 19, 2018

You may like to look at how reward helps to drive learning in the Rescorla-Wagner model, a fairly successful mathematical model describing biological reinforcement learning.
https://en.wikipedia.org/wiki/Rescorla–Wagner_model

The "reward prediction error" or "surprise" of the Rescorla-Wagner model is a simple form of the "reward prediction error" or "temporal difference error" that is a better description of some biological reinforcement learning, and is also used in machine reinforcement learning (e.g., in Tesauro's backgammon player).
https://en.wikipedia.org/wiki/Temporal_difference_learning
https://medium.com/jim-fleming/before-alphago-there-was-td-gammon-13deff866197
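For a feel of what "reward prediction error" means in code, here is a stripped-down sketch of a Rescorla-Wagner style update for a single cue. It folds the model's salience and learning-rate terms into one assumed constant `alpha` and ignores multiple cues, so treat it as a simplification rather than the full model.

```python
def rescorla_wagner(rewards, alpha=0.1, v0=0.0):
    """Track the predicted reward V across trials for a single cue."""
    v = v0
    history = []
    for r in rewards:
        error = r - v        # reward prediction error ("surprise")
        v += alpha * error   # move the prediction toward what happened
        history.append((v, error))
    return history


# The cue is followed by a reward of 1.0 on every one of 50 trials.
trials = rescorla_wagner([1.0] * 50)
print(trials[0], trials[-1])
```

On the first trial the reward is a complete surprise and the prediction moves a lot. By the last trial the reward is almost fully predicted, the error is close to zero, and learning has effectively stopped. The learning is driven by the error, not by the reward itself.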

ngrunenberg · Oct 19, 2018

Thank you all for clearing up my confusion! I appreciate the help; especially the links.
