The reinforcement learning program we studied contains no trace of a specific field, unlike traditional software. The program applies Bellman's equation with stochastic (random) choices based on the reward matrix. The goal is to find a route to C (line 3, column 3), which carries an attractive reward (100):
# Markov Decision Process (MDP) - Bellman's equations adapted to
# Reinforcement Learning with the Q action-value (reward) matrix
import numpy as ql  # ql is assumed to be NumPy, given the ql.matrix calls

# R is the reward matrix for each state
R = ql.matrix([[0, 0, 0, 0, 1, 0],
               [0, 0, 0, 1, 0, 1],
               [0, 0, 100, 1, 0, 0],
               [0, 1, 1, 0, 1, 0],
               [1, 0, 0, 1, 0, 0],
               [0, 1, 0, 0, 0, 0]])
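The full training code is not reproduced here, but a minimal sketch of the Bellman/Q-learning loop behind such a program could look as follows (gamma = 0.8 is inferred from the Q values shown below, the 50,000 random episodes are the figure mentioned later in this chapter, and the rest is an illustrative assumption rather than the chapter's exact implementation):

# A minimal sketch of the training loop, not the chapter's exact implementation
import numpy as ql

gamma = 0.8                       # discount factor, inferred from the Q values below
Q = ql.matrix(ql.zeros([6, 6]))   # Q action-value matrix, initialized to zero

for _ in range(50000):                                     # random training episodes
    current_state = ql.random.randint(0, 6)                # pick a random state
    possible_actions = ql.where(R[current_state] > 0)[1]   # columns allowed by R
    action = int(ql.random.choice(possible_actions))
    # Bellman's equation: immediate reward plus the discounted best value of
    # the next state
    Q[current_state, action] = R[current_state, action] + gamma * ql.max(Q[action])

print("Q :")
print(Q)
print("Normed Q :")
print(Q / ql.max(Q) * 100)        # each value as a percentage of the highest value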
That reward matrix goes through Bellman's equation and produces a result in Python:
Q :
[[ 0. 0. 0. 0. 258.44 0. ]
[ 0. 0. 0. 321.8 0. 207.752]
[ 0. 0. 500. 321.8 0. 0. ]
[ 0. 258.44 401. 0. 258.44 0. ]
[ 207.752 0. 0. 321.8 0. 0. ]
[ 0. 258.44 0. 0. 0. 0. ]]
Normed Q :
[[ 0. 0. 0. 0. 51.688 0. ]
[ 0. 0. 0. 64.36 0. 41.5504]
[ 0. 0. 100. 64.36 0. 0. ]
[ 0. 51.688 80.2 0. 51.688 0. ]
[ 41.5504 0. 0. 64.36 0. 0. ]
[ 0. 51.688 0. 0. 0. 0. ]]
The result contains the values of each state produced by the reinforcement learning process, along with a normed Q in which each value is divided by the highest value (500) and expressed as a percentage of it; for example, 258.44 / 500 × 100 ≈ 51.69.
As Python geeks, we are overjoyed: we got something rather difficult, namely reinforcement learning, to work. As mathematical amateurs, we are elated: we know what MDP and Bellman's equation mean.
However, as natural language thinkers, we have made little progress. No customer or user can read that data and make sense of it. Furthermore, we cannot explain how we implemented an intelligent version of their job in the machine. We didn't.
We hardly dare say that reinforcement learning can beat anybody in the company by making random choices 50,000 times until the right answer comes up.
Furthermore, we got the program to work but hardly know what to do with the result ourselves. The consultant on the project cannot help because of the matrix format of the solution.
Being an adaptive thinker means knowing how to be good in all the dimensions of a subject. To solve this new problem, let's go back to step 1 with the result.
By formatting the result in Python, a graphics tool, or a spreadsheet, it can be displayed as follows (a short Python sketch for producing such a table appears after it):
     A         B         C       D        E         F
A    -         -         -       -        258.44    -
B    -         -         -       321.8    -         207.752
C    -         -         500     321.8    -         -
D    -         258.44    401.    -        258.44    -
E    207.752   -         -       321.8    -         -
F    -         258.44    -       -        -         -
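A quick way to produce such a labeled table directly in Python is sketched below (assuming pandas is available; the labels list and the zero-to-dash replacement are illustrative choices, not part of the original program):

# A minimal sketch: label the rows and columns of Q and hide the zero entries.
# Assumes pandas is installed and Q is the trained matrix from above.
import pandas as pd

labels = ["A", "B", "C", "D", "E", "F"]
table = pd.DataFrame(ql.asarray(Q), index=labels, columns=labels)
print(table.round(3).replace(0.0, "-"))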
Now, we can start reading the solution:
- Choose a starting state. Take F for example.
- The F line represents the state. Since the maximum value is 258.44 in the B column, we go to state B, the second line.
- The maximum value in state B in the second line leads us to the D state in the fourth column.
- The maximum value of the D state (fourth line), 401, leads us to the C state.
Note that if you start at the C state and decide not to stay at C, the D state becomes the maximum value, which will lead you back to C. However, the MDP will never do this naturally; you would have to force the system to do it.
You have now obtained a sequence: F->B->D->C. By choosing other points of departure, you can obtain other sequences by simply sorting the table. A short script that automates this lookup follows.
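A minimal sketch of such a lookup (assuming the trained Q matrix from above; best_route and labels are illustrative names, not part of the chapter's code) could be:

# A minimal sketch: greedily follow the highest Q-value until the goal C is reached.
# Assumes Q is the trained matrix from above.
def best_route(Q, start, goal):
    route = [start]
    state = start
    while state != goal:
        state = int(ql.argmax(Q[state]))  # column with the highest value in this row
        route.append(state)
    return route

labels = ["A", "B", "C", "D", "E", "F"]
print(" -> ".join(labels[s] for s in best_route(Q, start=5, goal=2)))
# Prints: F -> B -> D -> C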
The most useful way of putting it remains the normalized version in percentages. This reflects the stochastic (random) property of the solution, which produces probabilities and not certainties, as shown in the following matrix:
     A        B        C       D        E        F
A    -        -        -       -        51.68%   -
B    -        -        -       64.36%   -        41.55%
C    -        -        100%    64.36%   -        -
D    -        51.68%   80.2%   -        51.68%   -
E    41.55%   -        -       64.36%   -        -
F    -        51.68%   -       -        -        -
Now comes the very tricky part. We started the chapter with a trip on a road, yet made no mention of it in the result analysis.
An important property of reinforcement learning comes from the fact that we are working with a mathematical model that can be applied to anything. No human rules are needed. This means we can use this program for many other subjects without writing thousands of lines of code.
Case 1: Optimizing a delivery for a driver, human or not
This model was described in this chapter.
Case 2: Optimizing warehouse flows
The same reward matrix can apply to going from point F to C in a warehouse, as shown in the following diagram:
In this warehouse, the F->B->D->C sequence makes visual sense: if somebody goes from point F to point C, this physical path avoids going through walls.
It can be used for a video game, a factory, or any form of layout.
Case 3: Automated planning and scheduling (APS)
By converting the system into a scheduling vector, the whole picture changes. We have left the more comfortable world of the physical processing of letters, faces, and trips. Though fantastic, those applications are only the tip of the social media iceberg. The real challenge of artificial intelligence begins in the abstract universe of human thinking.
Every single company, person, or system requires automatic planning and scheduling (see Chapter 12, Automated Planning and Scheduling). The six A to F steps in the example of this chapter could well be six tasks to perform in a given unknown order represented by the following vector x:
The reward matrix then reflects the weights of the constraints on the tasks of vector x. For example, in a factory, you cannot assemble the parts of a product before manufacturing them.
In this case, the sequence obtained represents the schedule of the manufacturing process. A small sketch of this encoding follows.
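Purely as an illustration (the precedence pairs and reward values below are assumptions, not constraints taken from this chapter), the same reward-matrix structure can encode which task may follow which:

# A hypothetical sketch: building a reward matrix from task precedence rules.
# Tasks 0..5 stand for steps A..F; the pairs and rewards are illustrative only.
import numpy as ql

allowed = [(0, 4), (4, 0), (4, 3), (3, 4), (1, 3), (3, 1),
           (1, 5), (5, 1), (3, 2), (2, 2)]   # (i, j): task j may follow task i
goal_task = 2                                # the final task of the schedule

R_schedule = ql.matrix(ql.zeros([6, 6]))
for i, j in allowed:
    R_schedule[i, j] = 100 if j == goal_task else 1

# Running the same Bellman/Q-learning loop on R_schedule produces an ordered
# task sequence instead of a physical route.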
Case 4 and more: Your imagination
By using physical layouts or abstract decision-making vectors, matrices, and tensors, you can build a world of solutions in a mathematical reinforcement learning model. Naturally, the following chapters will enhance your toolbox with many other concepts.