We could, of course, implement TD(λ) using the tabular online method, which we haven't covered yet, or with Q-learning. However, since this is a chapter on SARSA, it only makes sense that we continue with that theme throughout, layering eligibility traces on top of SARSA. Open Chapter_5_4.py and follow the exercise; the short sketch below shows the kind of update we are building toward:
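Before stepping through the script, here is a minimal sketch of a tabular SARSA(λ) update with accumulating eligibility traces. The names used here (`Q`, `E`, `update_sarsa_lambda`, and the toy state/action sizes) are illustrative only and are not taken from Chapter_5_4.py; the chapter's code applies the same idea to the discretized MountainCar state:

```python
# Minimal sketch of the accumulating-trace SARSA(lambda) update.
# All names and sizes here are illustrative, not from Chapter_5_4.py.
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))   # action-value estimates
E = np.zeros((n_states, n_actions))   # eligibility traces, one per state-action pair

alpha, gamma, lam = 0.3, 0.99, 0.8    # learning rate, discount factor, trace decay

def update_sarsa_lambda(s, a, r, s_next, a_next, done):
    """One SARSA(lambda) backup: spread the TD error over every recently
    visited state-action pair in proportion to its eligibility trace."""
    target = r if done else r + gamma * Q[s_next, a_next]
    delta = target - Q[s, a]       # one-step TD error
    E[s, a] += 1.0                 # accumulate the trace for the current pair
    Q[:] = Q + alpha * delta * E   # credit all traced pairs at once
    E[:] = gamma * lam * E         # decay every trace toward zero
    if done:
        E[:] = 0.0                 # clear traces at the end of an episode
```

Note that with λ = 0 the traces vanish after a single step and this reduces to ordinary one-step SARSA; larger values of λ spread credit further back along the trajectory.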
- The code is quite similar to our previous examples, but let's review the full source code, as follows:
import gym
import math
from copy import deepcopy
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

env = gym.make('MountainCar-v0')
Q_table = np.zeros((65, 65, 3))  # 65 x 65 discretized states, 3 discrete actions
alpha = 0.3         # learning rate
buckets = [65, 65]  # number of discrete intervals per observation dimension (position, velocity)
gamma = 0.99        # discount factor
rewards = []        # per-episode rewards, collected for plotting later
episodes = 2000     # number of training episodes
lambdaa = 0.8       # trace-decay parameter (lambda) for the eligibility traces
def to_discrete_states(observation):
    # Map a continuous observation (position, velocity) onto discrete bucket indices
    interval = [0 for i in range(len(observation))]
    max_range = [1.2, 0.07]  # offsets used to shift each observation dimension to be non-negative
    for i in range(len(observation)):
        data = observation[i]
        inter = int(math.floor((data + max_range[i])...