TensorFlow 2 Reinforcement Learning Cookbook

Building an environment and reward mechanism for training RL agents

This recipe will walk you through the steps to build a Gridworld learning environment to train RL agents. Gridworld is a simple environment where the world is represented as a grid. Each location on the grid can be referred to as a cell. The goal of an agent in this environment is to find its way to the goal state in a grid like the one shown here:

Figure 1.1 – A screenshot of the Gridworld environment

The agent's location is represented by the blue cell in the grid, while the goal and the mine/bomb/obstacle are represented by the green and red cells, respectively. The agent (blue cell) needs to find its way through the grid to reach the goal (green cell) without running over the mine/bomb (red cell).

Getting ready

To complete this recipe, you will first need to activate the tf2rl-cookbook Python/Conda virtual environment and install the dependencies with pip install numpy gym. If the following import statements run without issues, you are ready to get started!

import copy
import sys
import gym
import numpy as np

Now we can begin.

How to do it…

To train RL agents, we need a learning environment, which plays a role akin to that of datasets in supervised learning. The learning environment is a simulator that provides observations to the RL agent, supports a set of actions the agent can execute, and returns a new observation resulting from each action the agent takes.
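
Before walking through the steps, the following minimal sketch (an illustration only, not the book's final code; the class name SketchEnv is just a placeholder) shows the Gym-style interface that such a learning environment exposes and that we will implement for Gridworld:

    import gym
    import numpy as np

    class SketchEnv(gym.Env):
        """Bare-bones shape of the environment built in this recipe."""
        def __init__(self):
            # What the agent observes (an 8 x 8 grid of cell codes)
            # and what it can do (five discrete actions)
            self.observation_space = gym.spaces.Box(low=0, high=6,
                                                    shape=(8, 8))
            self.action_space = gym.spaces.Discrete(5)

        def reset(self):
            # Return the initial observation at the start of an episode
            return np.zeros((8, 8), dtype=int)

        def step(self, action):
            # Apply the action; return (next_observation, reward, done, info)
            return np.zeros((8, 8), dtype=int), 0.0, False, {}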

Perform the following steps to implement a Gridworld learning environment: a simple 2D map in which colored cells mark the agent, the goal, the mine/bomb/obstacle, walls, and empty space on the grid:

  1. We'll start by defining the mapping between the different cell states and the color codes used in the Gridworld environment:
    EMPTY = BLACK = 0
    WALL = GRAY = 1
    AGENT = BLUE = 2
    MINE = RED = 3
    GOAL = GREEN = 4
    SUCCESS = PINK = 5
  2. Next, generate a color map using RGB intensity values:
    COLOR_MAP = {
        BLACK: [0.0, 0.0, 0.0],
        GRAY: [0.5, 0.5, 0.5],
        BLUE: [0.0, 0.0, 1.0],
        RED: [1.0, 0.0, 0.0],
        GREEN: [0.0, 1.0, 0.0],
        PINK: [1.0, 0.0, 1.0],
    }
  3. Let's now define the action mapping:
    NOOP = 0
    DOWN = 1
    UP = 2
    LEFT = 3
    RIGHT = 4
  4. Next, let's create a GridworldEnv class with an __init__ method that defines the necessary member variables, including the observation and action spaces:
    class GridworldEnv(gym.Env):
    	def __init__(self):

    We will implement __init__() in the following steps.

  5. In this step, let's define the layout of the Gridworld environment using the grid cell state mapping:
    	self.grid_layout = """
            1 1 1 1 1 1 1 1
            1 2 0 0 0 0 0 1
            1 0 1 1 1 0 0 1
            1 0 1 0 1 0 0 1
            1 0 1 4 1 0 0 1
            1 0 3 0 0 0 0 1
            1 0 0 0 0 0 0 1
            1 1 1 1 1 1 1 1
            """

    In the preceding layout, 0 corresponds to the empty cells, 1 corresponds to walls, 2 corresponds to the agent's starting location, 3 corresponds to the location of the mine/bomb/obstacle, and 4 corresponds to the goal location based on the mapping we defined in step 1.

  6. Now, we are ready to define the observation space for the Gridworld RL environment:
    	self.initial_grid_state = np.fromstring(
                        self.grid_layout, dtype=int, sep=" ")
    	self.initial_grid_state = \
                        self.initial_grid_state.reshape(8, 8)
    	self.grid_state = copy.deepcopy(
                                     self.initial_grid_state)
    	self.observation_space = gym.spaces.Box(
    		low=0, high=6, shape=self.grid_state.shape
    	)
    	self.img_shape = [256, 256, 3]
        self.metadata = {"render.modes": ["human"]}
        self.viewer = None  # image viewer; created lazily on the first render() call
  7. Let's define the action space and the mapping between the actions and the movement of the agent in the grid:
            self.action_space = gym.spaces.Discrete(5)
            self.actions = [NOOP, UP, DOWN, LEFT, RIGHT]
            self.action_pos_dict = {
                NOOP: [0, 0],
                UP: [-1, 0],
                DOWN: [1, 0],
                LEFT: [0, -1],
                RIGHT: [0, 1],
            }
  8. Let's now wrap up the __init__ function by initializing the agent's start and goal states using the get_state() method (which we will implement in the next step):
    (self.agent_start_state, self.agent_goal_state,) = \
                                             self.get_state()
  9. Now we need to implement the get_state() method that returns the start and goal state for the Gridworld environment:
    def get_state(self):
            start_state = np.where(self.grid_state == AGENT)
            goal_state = np.where(self.grid_state == GOAL)
            start_or_goal_not_found = not (start_state[0] \
                                           and goal_state[0])
            if start_or_goal_not_found:
                sys.exit(
                    "Start and/or Goal state not present in "
                    "the Gridworld. Check the Grid layout"
                )
            start_state = (start_state[0][0], 
                           start_state[1][0])
            goal_state = (goal_state[0][0], goal_state[1][0])
            return start_state, goal_state
  10. In this step, we will be implementing the step(action) method to execute the action and retrieve the next state/observation, the associated reward, and whether the episode ended:
    def step(self, action):
            """return next observation, reward, done, info"""
            action = int(action)
            info = {"success": True}
            done = False
            reward = 0.0
            next_obs = (
                self.agent_state[0] + \
                    self.action_pos_dict[action][0],
                self.agent_state[1] + \
                    self.action_pos_dict[action][1],
            )
  11. Next, let's specify the rewards and finally, return grid_state, reward, done, and info:
     # Determine the reward
            if action == NOOP:
                return self.grid_state, reward, False, info
            # The move is invalid if it would take the agent off the grid
            next_state_invalid = (
                next_obs[0] < 0 or next_obs[0] >= self.grid_state.shape[0]
            ) or (next_obs[1] < 0 or next_obs[1] >= self.grid_state.shape[1])
            if next_state_invalid:
                info["success"] = False
                return self.grid_state, reward, False, info
            next_state = self.grid_state[next_obs[0], 
                                         next_obs[1]]
            if next_state == EMPTY:
                self.grid_state[next_obs[0], 
                                next_obs[1]] = AGENT
            elif next_state == WALL:
                info["success"] = False
                reward = -0.1
                return self.grid_state, reward, False, info
            elif next_state == GOAL:
                done = True
                reward = 1
            elif next_state == MINE:
                done = True
                reward = -1
            self.grid_state[self.agent_state[0], 
                            self.agent_state[1]] = EMPTY
            self.agent_state = copy.deepcopy(next_obs)
            return self.grid_state, reward, done, info
  12. Up next is the reset() method, which resets the Gridworld environment when an episode completes (or if a request to reset the environment is made):
    def reset(self):
            self.grid_state = copy.deepcopy(
                                     self.initial_grid_state)
            (self.agent_state, self.agent_goal_state,) = \
                                             self.get_state()
            return self.grid_state
  13. To visualize the state of the Gridworld environment in a human-friendly manner, let's implement a render function that will convert the grid_layout that we defined in step 5 to an image and display it. With that, the Gridworld environment implementation will be complete!
    def gridarray_to_image(self, img_shape=None):
            if img_shape is None:
                img_shape = self.img_shape
            observation = np.zeros(img_shape)
            scale_x = int(observation.shape[0] /
                          self.grid_state.shape[0])
            scale_y = int(observation.shape[1] /
                          self.grid_state.shape[1])
            for i in range(self.grid_state.shape[0]):
                for j in range(self.grid_state.shape[1]):
                    for k in range(3):  # 3-channel RGB image
                        pixel_value = \
                            COLOR_MAP[self.grid_state[i, j]][k]
                        observation[
                            i * scale_x : (i + 1) * scale_x,
                            j * scale_y : (j + 1) * scale_y,
                            k,
                        ] = pixel_value
            return (255 * observation).astype(np.uint8)

    def render(self, mode="human", close=False):
            if close:
                if self.viewer is not None:
                    self.viewer.close()
                    self.viewer = None
                return
            img = self.gridarray_to_image()
            if mode == "rgb_array":
                return img
            elif mode == "human":
                from gym.envs.classic_control import rendering
                if self.viewer is None:
                    self.viewer = rendering.SimpleImageViewer()
                self.viewer.imshow(img)
  14. To test whether the environment is working as expected, let's add a __main__ function that gets executed if the environment script is run directly:
    if __name__ == "__main__":
    	env = GridworldEnv()
    	obs = env.reset()
    	# Sample a random action from the action space
    	action = env.action_space.sample()
    	next_obs, reward, done, info = env.step(action)
    	print(f"reward:{reward} done:{done} info:{info}")
    	env.render()
    	env.close()
  15. All set! The Gridworld environment is ready and we can quickly test it by running the script (python envs/gridworld.py). An output such as the following will be displayed:
    reward:0.0 done:False info:{'success': True}

    The following rendering of the Gridworld environment will also be displayed:

Figure 1.2 – The Gridworld
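
As a quick sanity check of the reward logic (a hedged snippet, not part of the book's listing; it relies on the grid_layout from step 5, where the cell directly above the agent's start position is a wall), we can confirm that bumping into a wall yields a -0.1 reward and success: False:

    env = GridworldEnv()
    obs = env.reset()
    # With the layout from step 5, the cell directly above the agent's start
    # position is a wall, so moving UP should leave the grid unchanged and
    # incur a small penalty
    next_obs, reward, done, info = env.step(UP)
    print(f"reward:{reward} done:{done} info:{info}")
    # Expected: reward:-0.1 done:False info:{'success': False}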

Let's now see how it works!

How it works…

The grid_layout defined in step 5 of the How to do it… section represents the state of the learning environment. The Gridworld environment defines the observation space, the action space, and the reward mechanism, thereby implementing a Markov Decision Process (MDP). We sample a valid action from the environment's action space and step the environment with the chosen action, which produces the new observation, the reward, and a done Boolean (indicating whether the episode has finished) as the response from the Gridworld environment. The env.render() method converts the environment's internal grid representation to an image and displays it for visual understanding.
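
To make this interaction loop concrete, here is a minimal, hedged sketch (not part of the book's listing) that rolls out one full episode with randomly sampled actions, which is a common way to smoke-test a newly built environment:

    env = GridworldEnv()
    obs = env.reset()
    done, step_count, episode_return = False, 0, 0.0
    while not done:
        # Sample a random action and step the environment with it
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        episode_return += reward
        step_count += 1
        env.render()  # comment this out for faster, headless runs
    print(f"Episode finished in {step_count} steps; return: {episode_return}")
    env.close()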
