Preface  xv

1 Why Reinforcement Learning?  1
    Taxonomy of RL Approaches  8
    Model-Free or Model-Based  8
    How Agents Use and Update Their Strategy  9
    Discrete or Continuous Actions  10
    Policy Evaluation and Improvement  11
    Fundamental Concepts in Reinforcement Learning  12
    Reinforcement Learning as a Discipline  18

2 Markov Decision Processes, Dynamic Programming, and Monte Carlo Methods  25
    Policy Evaluation: The Value Function  26
    Policy Improvement: Choosing the Best Action  29
    Simulating the Environment  31
    Improving the ε-greedy Algorithm  33
    Markov Decision Processes  35
    Inventory Control Simulation  40
    Policies and Value Functions  42
    Predicting Rewards with the State-Value Function  43
    Predicting Rewards with the Action-Value Function  47
    Monte Carlo Policy Generation  50
    Value Iteration with Dynamic Programming  52
    Implementing Value Iteration  54
    Results of Value Iteration  56

3 Temporal-Difference Learning, Q-Learning, and n-Step Algorithms  59
    Formulation of Temporal-Difference Learning  60
    Case Study: Automatically Scaling Application Containers to Reduce Cost  68
    Industrial Example: Real-Time Bidding in Advertising  70
    Results of the Real-Time Bidding Environments  71
    Comparing Standard, Double, and Delayed Q-Learning  75
    n-Step Algorithms on Grid Environments  79
    Extensions to Eligibility Traces  83
    Fuzzy Wipes in Watkins's Q(λ)  84
    Accumulating Versus Replacing Eligibility Traces  84

4 Deep Q-Networks  87
    Deep Learning Architectures  88
    Common Neural Network Architectures  89
    Deep Reinforcement Learning  91
    Neural Network Architecture  93
    Example: DQN on the CartPole Environment  94
    Case Study: Reducing Energy Usage in Buildings  98
    Prioritized Experience Replay  102
    Example: Rainbow DQN on Atari Games  103
    Learning from Offline Data  109

5 Policy Gradient Methods  115
    Benefits of Learning a Policy Directly  115
    How to Calculate the Gradient of a Policy  116
    Gradient Variance Reduction  127
    n-Step Actor-Critic and Advantage Actor-Critic (A2C)  129
    Eligibility Traces Actor-Critic  134
    A Comparison of Basic Policy Gradient Algorithms  135
    Industrial Example: Automatically Purchasing Products for Customers  136
    The Environment: Gym-Shopping-Cart  137
    Results from the Shopping Cart Environment  138

6 Beyond Policy Gradients  145
    Behavior and Target Policies  148
    Gradient Temporal-Difference Learning  149
    Deterministic Policy Gradients  152
    Deep Deterministic Policy Gradients  154
    Case Study: Recommendations Using Reviews  161
    Kullback-Leibler Divergence  165
    Natural Policy Gradients and Trust Region Policy Optimization  167
    Proximal Policy Optimization  169
    Example: Using Servos for a Real-Life Reacher  174
    RL Algorithm Implementation  175
    Increasing the Complexity of the Algorithm  177
    Hyperparameter Tuning in a Simulation  178
    Other Policy Gradient Algorithms  181
    Actor-Critic with Experience Replay (ACER)  182
    Actor-Critic Using Kronecker-Factored Trust Regions (ACKTR)  183
    Extensions to Policy Gradient Algorithms  184
    Quantile Regression in Policy Gradient Algorithms  184
    Which Algorithm Should I Use?  185
    A Note on Asynchronous Methods  185

7 Learning All Possible Policies with Entropy Methods  191
    Maximum Entropy Reinforcement Learning  192
    SAC Implementation Details and Discrete Action Spaces  194
    Automatically Adjusting Temperature  194
    Case Study: Automated Traffic Management to Reduce Queuing  195
    Extensions to Maximum Entropy Methods  196
    Other Measures of Entropy (and Ensembles)  196
    Optimistic Exploration Using the Upper Bound of Double Q-Learning  196
    Tinkering with Experience Replay  197
    Soft Q-Learning (and Derivatives)  197
    Path Consistency Learning  198
    Performance Comparison: SAC Versus PPO  198
    How Does Entropy Encourage Exploration?  200
    How Does the Temperature Parameter Alter Exploration?  203
    Industrial Example: Learning to Drive with a Remote Control Car  205
    Description of the Problem  205
    Equivalence Between Policy Gradients and Soft Q-Learning  211
    What Does This Mean For the Future?  212

8 Improving How an Agent Learns  215
    Partially Observable Markov Decision Process  216
    Case Study: Using POMDPs in Autonomous Vehicles  218
    Contextual Markov Decision Processes  219
    MDPs with Changing Actions  219
    Hierarchical Reinforcement Learning  220
    High-Low Hierarchies with Intrinsic Rewards (HIRO)  222
    Learning Skills and Unsupervised RL  223
    Multi-Agent Reinforcement Learning  225
    Centralized or Decentralized  228
    Case Study: Using Single-Agent Decentralized Learning in UAVs  230
    Centralized Learning, Decentralized Execution  231

9 Practical Reinforcement Learning  251
    The RL Project Life Cycle  251
    Problem Definition: What Is an RL Project?  256
    RL Problems Are Sequential  256
    RL Problems Are Strategic  257
    RL Engineering and Refinement  264
    State Engineering or State Representation Learning  268
    Mapping Policies to Action Spaces  275

10 Operational Reinforcement Learning  297
    Safety, Security, and Ethics  328

11 Conclusions and the Future  341
    ${ALGORITHM_NAME} Can't Solve ${ENVIRONMENT}!  347
    The Future of Reinforcement Learning  348
    Future RL and Research Directions  350

A The Gradient of a Logistic Policy for Two Actions  359
B The Gradient of a Softmax Policy  363
Glossary  365
    Acronyms and Common Terms  365
Index  371