We proceed with the policy as follows:
- Let us implement a naive neural network-based policy. Define a new policy function that uses the neural network's predictions to return the action:

import numpy as np

def policy_naive_nn(nn, obs):
    # Predict action probabilities for the single observation and pick the most likely action
    return np.argmax(nn.predict(np.array([obs])))
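To make the mapping from predicted probabilities to an action concrete, here is a minimal sketch. The `StubNet` class and its fixed probabilities are assumptions for illustration only, standing in for a trained network:

```python
import numpy as np

class StubNet:
    """Hypothetical stand-in for the Keras model: returns fixed
    action probabilities for any observation (illustration only)."""
    def predict(self, batch):
        # One row of probabilities per observation in the batch
        return np.array([[0.3, 0.7]] * len(batch))

def policy_naive_nn(nn, obs):
    return np.argmax(nn.predict(np.array([obs])))

obs = [0.02, -0.01, 0.03, 0.04]   # a CartPole-style 4-dimensional observation
action = policy_naive_nn(StubNet(), obs)
print(action)  # 1, since 0.7 > 0.3
```

Because `np.argmax` flattens its input by default, it correctly picks the index of the larger probability in the single-row prediction.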
- Define nn as a simple MLP with one hidden layer that takes the four-dimensional observation as input and produces the probabilities of the two actions:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# Hidden layer: 8 units with ReLU, fed by the 4-dimensional observation
model.add(Dense(8, input_dim=4, activation='relu'))
# Output layer: softmax probabilities over the 2 actions
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
This is what the model looks like:

Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 8)                 40
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 18
=================================================================
Total params: 58
Trainable params: 58
Non-trainable params: 0
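The computation this small network performs can be sketched in plain NumPy: a ReLU hidden layer followed by a softmax over the two actions. The random weights below are assumptions standing in for the (untrained) Keras layers; the point is only the shapes and that each output row is a valid probability distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights standing in for the untrained Keras layers
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # Dense(8, input_dim=4)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)   # Dense(2)

def forward(obs_batch):
    h = np.maximum(obs_batch @ W1 + b1, 0.0)           # ReLU hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)            # softmax over 2 actions

probs = forward(np.array([[0.02, -0.01, 0.03, 0.04]]))
print(probs.shape)  # (1, 2): one row of two action probabilities
```

Each row of the output is non-negative and sums to 1, which is exactly what `policy_naive_nn` takes the argmax of.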