List of Listings – Deep Reinforcement Learning in Action

Chapter 2. Modeling reinforcement learning problems: Markov decision processes

Listing 2.1. Finding the best actions given the expected rewards in Python 3

Listing 2.2. Epsilon-greedy strategy for action selection

Listing 2.3. Defining the reward function

Listing 2.4. Updating the reward record

Listing 2.5. Computing the best action

Listing 2.6. Solving the n-armed bandit

Listing 2.7. The softmax function

Listing 2.8. Softmax action-selection for the n-armed bandit

Listing 2.9. Contextual bandit environment

Listing 2.10. The main training loop

Chapter 3. Predicting the best states and actions: Deep Q-networks

Listing 3.1. Creating a Gridworld game

Listing 3.2. Neural network Q function

Listing 3.3. Q-learning: Main training loop

Listing 3.4. Testing the Q-network

Listing 3.5. DQN with experience replay

Listing 3.6. Testing the performance with experience replay

Listing 3.7. Target network

Listing 3.8. DQN with experience replay and target network

Chapter 4. Learning to pick the best policy: Policy gradient methods

Listing 4.1. Listing the OpenAI Gym environments

Listing 4.2. Creating an environment in OpenAI Gym

Listing 4.3. Taking an action in CartPole

Listing 4.4. Setting up the policy network

Listing 4.5. Using the policy network to sample an action

Listing 4.6. Computing the discounted rewards

Listing 4.7. Defining the loss function

Listing 4.8. The REINFORCE training loop

Chapter 5. Tackling more complex problems with actor-critic methods

Listing 5.1. Introduction to multiprocessing

Listing 5.2. Manually starting individual processes

Listing 5.3. Pseudocode for online advantage actor-critic

Listing 5.4. CartPole actor-critic model

Listing 5.5. Distributing the training

Listing 5.6. The main training loop

Listing 5.7. Running an episode

Listing 5.8. Computing and minimizing the loss

Listing 5.9. N-step training with CartPole

Listing 5.10. Returns with and without bootstrapping

Chapter 6. Alternative optimization methods: Evolutionary algorithms

Listing 6.1. Evolving strings: set up random strings

Listing 6.2. Evolving strings: recombine and mutate

Listing 6.3. Evolving strings: evaluate individuals and create new generation

Listing 6.4. Evolving strings: putting it all together

Listing 6.5. Defining an agent

Listing 6.6. Unpacking a parameter vector

Listing 6.7. Spawning a population

Listing 6.8. Genetic recombination

Listing 6.9. Mutating the parameter vectors

Listing 6.10. Testing each agent in the environment

Listing 6.11. Evaluate all the agents in the population

Listing 6.12. Creating the next generation

Listing 6.13. Training the models

Listing 6.14. Setting the random seed

Chapter 7. Distributional DQN: Getting the full story

Listing 7.1. Setting up a discrete probability distribution in numpy

Listing 7.2. Updating a probability distribution

Listing 7.3. Redistributing probability mass after a single observation

Listing 7.4. Redistributing probability mass with a sequence of observations

Listing 7.5. Decreased variance with sequence of same reward

Listing 7.6. The Dist-DQN

Listing 7.7. Computing the target distribution

Listing 7.8. The cross-entropy loss function

Listing 7.9. Testing with simulated data

Listing 7.10. Dist-DQN training on synthetic data

Listing 7.11. Visualizing the learned action-value distributions

Listing 7.12. Preprocessing states and selecting actions

Listing 7.13. Dist-DQN plays Freeway, preliminaries

Listing 7.14. The main training loop

Chapter 8. Curiosity-driven exploration

Listing 8.1. Setting up the Super Mario Bros. environment

Listing 8.2. Downsample state and convert to grayscale

Listing 8.3. Preparing the states

Listing 8.4. The policy function

Listing 8.5. Experience replay

Listing 8.6. ICM components

Listing 8.7. Deep Q-network

Listing 8.8. Hyperparameters and model instantiation

Listing 8.9. The loss function and reset environment

Listing 8.10. The ICM prediction error calculation

Listing 8.11. Mini-batch training using experience replay

Listing 8.12. The training loop

Listing 8.13. Testing the trained agent

Chapter 9. Multi-agent reinforcement learning

Listing 9.1. Pseudocode for neighborhood Q-learning, part 1

Listing 9.2. Pseudocode for neighborhood Q-learning, part 2

Listing 9.3. 1D Ising model: Create the grid and produce rewards

Listing 9.4. The 1D Ising model: Generate neural network parameters

Listing 9.5. The 1D Ising model: Defining the Q function

Listing 9.6. The 1D Ising model: Get the state of the environment

Listing 9.7. The 1D Ising model: Initialize the grid

Listing 9.8. The 1D Ising model: The training loop

Listing 9.9. Mean field Q-learning: The policy function

Listing 9.10. Mean field Q-learning: Coordinate and reward functions

Listing 9.11. Mean field Q-learning: Calculate the mean action vector

Listing 9.12. Mean field Q-learning: The main training loop

Listing 9.13. Creating the MAgent environment

Listing 9.14. Adding the agents

Listing 9.15. Finding the neighbors

Listing 9.16. Calculating the mean field action

Listing 9.17. Choosing actions

Listing 9.18. The training function

Listing 9.19. Initializing the actions

Listing 9.20. Taking a team step and adding to the replay

Listing 9.21. Training loop

Listing 9.22. Adding to the replay (still in while loop from listing 9.21)

Chapter 10. Interpretable reinforcement learning: Attention and relational models

Listing 10.1. Preprocessing functions

Listing 10.2. Relational module

Listing 10.3. The forward pass (continued from listing 10.2)

Listing 10.4. MNIST training loop

Listing 10.5. MNIST test accuracy

Listing 10.6. Convolutional neural network baseline for MNIST

Listing 10.7. Multi-head relational module

Listing 10.8. Preprocessing functions

Listing 10.9. Loss function and updating the replay

Listing 10.10. The main training loop

Appendix. Mathematics, deep learning, PyTorch

Listing A.1. Gradient descent

Listing A.2. A simple neural network

Listing A.3. PyTorch neural network

Listing A.4. Classifying MNIST using a neural network

Listing A.5. Using the Adam optimizer