Notes on Chapter 21: Reinforcement Learning
Reinforcement Learning
reinforcement learning is a very different approach to learning than supervised and unsupervised learning
a reinforcement learner performs actions in an environment, and gets rewards or penalties for its actions
the goal of a reinforcement learner is to maximize the rewards it gets
in some complex domains, reinforcement learning is the only practical way to learn, because labeled training examples can be hard to find
reinforcement learning is usually modeled as a Markov decision process (MDP)
you can think of an MDP as a graph, where the nodes are states of the world, and the edges are actions that go between states
if the agent is in state s and does action a, then P(s,a,s’) is the probability that the agent ends up in state s’
- this allows for the possibility that an action might fail to do what the agent intended
Reward(s,a,s’) is the immediate reward the agent receives after going from s to s’ by action a
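as a rough illustration of P(s,a,s’) and Reward(s,a,s’), here is a minimal sketch of an MDP stored as Python dictionaries. The states ("hall", "exit"), the actions, and all the numbers are made up for this example and are not from the textbook.

```python
import random

# Hypothetical two-state MDP; state names, actions, and numbers are made up.
# P[(s, a)] maps each possible next state s' to its probability P(s, a, s').
P = {
    ("hall", "forward"): {"exit": 0.8, "hall": 0.2},  # the move fails 20% of the time
    ("hall", "stay"): {"hall": 1.0},
}

# REWARD[(s, a, s')] is the immediate reward Reward(s, a, s').
REWARD = {
    ("hall", "forward", "exit"): 10.0,
    ("hall", "forward", "hall"): -1.0,
    ("hall", "stay", "hall"): -1.0,
}


def step(state, action, rng):
    """Sample a next state s' from P(s, a, .) and return (s', reward)."""
    outcomes = P[(state, action)]
    next_states = list(outcomes)
    weights = [outcomes[s2] for s2 in next_states]
    next_state = rng.choices(next_states, weights=weights, k=1)[0]
    return next_state, REWARD[(state, action, next_state)]


rng = random.Random(0)
print(step("hall", "forward", rng))
```

the "forward" action shows how an action can fail: most of the time the agent reaches "exit", but sometimes it stays in "hall".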
the agent chooses what action to do using a policy
- a learning agent updates its policy after getting reward feedback
- the exact structure of a policy depends upon the implementation of the learner
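one possible structure for a policy is a table of value estimates for each (state, action) pair, with the policy picking the best-looking action. The sketch below uses a Q-learning style update, which is just one technique chosen for illustration; the action names, state names, and the alpha and gamma values are assumptions.

```python
from collections import defaultdict

ACTIONS = ["left", "right"]  # hypothetical action set


def greedy_policy(q, state):
    """The policy: in each state, pick the action with the highest estimate."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


def update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Q-learning style update: move q[(s, a)] toward reward + discounted best next value."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


q = defaultdict(float)  # all estimates start at 0.0
print(greedy_policy(q, "s0"))  # ties are broken by order in ACTIONS
update(q, "s0", "right", reward=1.0, next_state="s1")
update(q, "s0", "left", reward=-1.0, next_state="s1")
print(greedy_policy(q, "s0"))  # the policy now prefers the rewarded action
```

here the policy is never stored directly: it changes as a side effect of the reward feedback changing the table.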
see Figure 22.1 (p. 832) for a good example of how to think about these ideas in terms of a maze-solving problem
reinforcement learners learn by interacting with their environment, which means they usually need to do some exploration, i.e. try out actions that are not necessarily the best ones in an effort to learn more about the environment (and thus get a higher overall reward from later actions)
- for example, suppose you are going to eat out at a restaurant, and you could
go to either a familiar restaurant with food you know you like, or visit a
just-opened restaurant you’ve never been to before
- going to the familiar restaurant will likely get you a high reward, because you like their food
- but discovering new food can be good, especially if you end up liking it, and so sometimes it is worthwhile taking a chance and visiting the new restaurant
- the balance between exploration and exploitation has been formalized into
multi-armed bandit problems
- imagine a slot machine with two different levers (A and B) you can pull
- each lever wins with some unknown probability
- if you have $100, and it costs $1 per pull, then what should be your strategy for pulling levers if you want to maximize the number of wins?
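one simple (not necessarily optimal) strategy for the lever problem is epsilon-greedy: usually pull the lever with the best win rate so far, but with a small probability pull a random lever instead. This is a sketch under assumed values; the win probabilities, the epsilon value, and the function name play_bandit are made up for illustration.

```python
import random


def play_bandit(win_probs, budget=100, epsilon=0.1, seed=0):
    """Epsilon-greedy play: mostly pull the best-looking lever, sometimes a random one."""
    rng = random.Random(seed)
    pulls = [0] * len(win_probs)
    wins = [0] * len(win_probs)
    for _ in range(budget):  # each pull costs $1
        untried = [i for i, n in enumerate(pulls) if n == 0]
        if untried:
            lever = untried[0]  # try every lever at least once
        elif rng.random() < epsilon:
            lever = rng.randrange(len(win_probs))  # explore
        else:
            # exploit: pick the lever with the best observed win rate
            lever = max(range(len(win_probs)), key=lambda i: wins[i] / pulls[i])
        pulls[lever] += 1
        if rng.random() < win_probs[lever]:
            wins[lever] += 1
    return pulls, wins


# hypothetical win probabilities for levers A and B (unknown to the player)
pulls, wins = play_bandit([0.3, 0.6])
print("pulls per lever:", pulls, "total wins:", sum(wins))
```

a larger epsilon spends more of the $100 on exploration, and a smaller one risks settling on the worse lever after a few unlucky pulls.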
one application of reinforcement learning that has had some big successes is game playing
- in 1992, TD-Gammon became a world-class backgammon playing program using a reinforcement learning technique called temporal difference learning
- more recently, AlphaZero has used reinforcement learning to play chess, shogi, and Go at world champion levels
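to give a feel for temporal difference learning, here is a sketch of its simplest tabular form, often called TD(0): after each move, the value estimate of the previous state is nudged toward the reward plus the estimated value of the new state. This is not how TD-Gammon itself was implemented (it combined TD learning with a neural network); the three-position game and the alpha and gamma values are assumptions for illustration.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=1.0):
    """Move V(state) a fraction alpha toward the target reward + gamma * V(next_state)."""
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error


# Hypothetical three-position game: start -> middle -> win, with reward 1 only at the end.
values = {"start": 0.0, "middle": 0.0, "win": 0.0}
episode = [("start", 0.0, "middle"), ("middle", 1.0, "win")]
for _ in range(200):  # replay the same episode many times
    for state, reward, next_state in episode:
        td0_update(values, state, reward, next_state)
print(values)
```

the final reward gradually propagates backwards: "middle" learns it is one move from a win, and "start" then learns from the improved estimate for "middle".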
Other Kinds of Learning
some AI researchers are interested in directly using knowledge in learning
one way to do that is to frame learning as a logic problem, and to consider how logical predicates might be learned from examples
one interesting technique for doing this is inductive logic programming (ILP), where the learning agent is given positive and negative examples of a logical predicate, and learns a logic program that describes it
- logic programs can be run as regular computer programs, and so ILP can be thought of as learning programs
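the flavor of ILP can be sketched with a toy search: given background facts and positive and negative examples of grandparent(X, Z), try every rule of the form grandparent(X, Z) :- p(X, Y), q(Y, Z) and keep the ones that cover all the positives and none of the negatives. Real ILP systems search far larger hypothesis spaces; the facts, names, and the single rule template here are all assumptions for illustration.

```python
from itertools import product

# Hypothetical background knowledge: each predicate is a set of (x, y) facts.
BACKGROUND = {
    "parent": {("ann", "bob"), ("bob", "cal"), ("bob", "dee"), ("eve", "ann")},
    "friend": {("ann", "eve"), ("cal", "dee")},
}

# Positive and negative examples of the target predicate grandparent(X, Z).
POSITIVE = {("ann", "cal"), ("ann", "dee"), ("eve", "bob")}
NEGATIVE = {("ann", "bob"), ("bob", "cal"), ("cal", "ann")}


def covers(p, q, x, z):
    """True if some Y makes both p(x, Y) and q(Y, z) hold in the background facts."""
    return any((y, z) in BACKGROUND[q] for (x2, y) in BACKGROUND[p] if x2 == x)


def learn_chain_rules():
    """Return every rule grandparent(X,Z) :- p(X,Y), q(Y,Z) consistent with the examples."""
    rules = []
    for p, q in product(BACKGROUND, repeat=2):
        covers_all_pos = all(covers(p, q, x, z) for x, z in POSITIVE)
        covers_any_neg = any(covers(p, q, x, z) for x, z in NEGATIVE)
        if covers_all_pos and not covers_any_neg:
            rules.append(f"grandparent(X, Z) :- {p}(X, Y), {q}(Y, Z).")
    return rules


print(learn_chain_rules())
```

with these facts, the only rule that survives is the one chaining parent with parent, which is the usual definition of a grandparent.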
see chapter 19 of the textbook if you are interested in more information on this approach
- it is still research-oriented, and has not achieved the same success as example-based learning