The lessons of reinforcement learning
Unsupervised reinforcement learning, such as the MDP-driven Bellman equation, is toppling traditional decision-making software, location by location. Memoryless reinforcement learning requires few to no business rules and thus doesn't require human knowledge to run.
Being an adaptive next-generation AI thinker involves three prerequisites: the effort to be an SME, working on mathematical models to think like a machine, and understanding your source code's potential and limits.
Machine power and reinforcement learning teach us two important lessons:
- Lesson 1: Machine learning through reinforcement learning can beat human intelligence in many cases. No use fighting! The technology and solutions are already here in strategic domains.
- Lesson 2: A machine has no emotions, but you do. And so do the people around you. Human emotions and teamwork are an essential asset. Become an SME for your team. Learn how to understand what they're trying to say intuitively and make a mathematical representation of it for them. Your job will never go away, even if you're setting up solutions that don't require much development, such as AutoML. AutoML, or automated machine learning, automates many tasks. AutoML automates functions such as the dataset pipeline, hyperparameters, and more. Development is partially or totally suppressed. But you still have to make sure the whole system is well designed.
Reinforcement learning shows that no human can solve a problem the way a machine does. 50,000 iterations with random searching is not an option for a human. The number of empirical episodes can be reduced dramatically with a numerical convergence form of gradient descent (see Chapter 3, Machine Intelligence – Evaluation Functions and Numerical Convergence).
Humans need to be more intuitive, make a few decisions, and see what happens, because humans cannot try thousands of ways of doing something. Reinforcement learning marks a new era for human thinking by surpassing human reasoning power in strategic fields.
On the other hand, reinforcement learning requires mathematical models to function. Humans excel in mathematical abstraction, providing powerful intellectual fuel to those powerful machines.
The boundaries between humans and machines have changed. Humans' ability to build mathematical models and ever-growing cloud platforms will serve online machine learning services.
Finding out how to use the outputs of the reinforcement learning program we just studied shows how a human will always remain at the center of AI.
How to use the outputs
The reinforcement learning program we studied contains no trace of a specific field, unlike traditional software. The program contains the Bellman equation with stochastic (random) choices based on the reward matrix. The goal is to find a route to C (line 3, column 3), which has an attractive reward (100):
# Markov Decision Process (MDP) – The Bellman equation adapted to
# reinforcement learning with the Q action-value (reward) matrix
import numpy as ql

# R is the reward matrix for each state
R = ql.matrix([[0, 0, 0, 0, 1, 0],
               [0, 0, 0, 1, 0, 1],
               [0, 0, 100, 1, 0, 0],
               [0, 1, 1, 0, 1, 0],
               [1, 0, 0, 1, 0, 0],
               [0, 1, 0, 0, 0, 0]])
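For reference, here is a minimal sketch of how a Q matrix can be computed from R. It assumes a standard Q-learning-style Bellman update with a discount factor (gamma) of 0.8 and 50,000 random episodes, which is consistent with the values shown below; the actual source file may organize this differently:

import numpy as ql

# R is the reward matrix defined above
gamma = 0.8                      # discount factor (assumed value)
Q = ql.matrix(ql.zeros([6, 6]))  # action-value matrix learned from R

for _ in range(50000):           # 50,000 random episodes, as described in the text
    state = ql.random.randint(0, 6)           # pick a random current state
    possible = ql.where(R[state] > 0)[1]      # actions with a reward in R
    action = int(ql.random.choice(possible))  # choose one of them at random
    # Bellman update: immediate reward plus discounted best future value
    Q[state, action] = R[state, action] + gamma * ql.max(Q[action])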
That reward matrix goes through the Bellman equation and produces a result in Python:
Q :
[[ 0. 0. 0. 0. 258.44 0. ]
[ 0. 0. 0. 321.8 0. 207.752]
[ 0. 0. 500. 321.8 0. 0. ]
[ 0. 258.44 401. 0. 258.44 0. ]
[ 207.752 0. 0. 321.8 0. 0. ]
[ 0. 258.44 0. 0. 0. 0. ]]
Normed Q :
[[ 0. 0. 0. 0. 51.688 0. ]
[ 0. 0. 0. 64.36 0. 41.5504]
[ 0. 0. 100. 64.36 0. 0. ]
[ 0. 51.688 80.2 0. 51.688 0. ]
[ 41.5504 0. 0. 64.36 0. 0. ]
[ 0. 51.688 0. 0. 0. 0. ]]
The result contains the values of each state produced by the reinforcement learning process, as well as a normed Q matrix (each value divided by the highest value and expressed as a percentage of it).
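For instance, the normalization can be expressed in one line applied to the Q matrix printed above (assuming Q is a NumPy matrix, as in the sketch earlier):

# express each Q value as a percentage of the maximum value (500 here)
print("Normed Q :")
print(Q / ql.max(Q) * 100)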
As Python geeks, we are overjoyed! We got something rather difficult to work, namely, reinforcement learning. As mathematical amateurs, we are elated. We know what an MDP and the Bellman equation mean.
However, as natural language thinkers, we have made little progress. No customer or user can read that data and make sense of it. Furthermore, we cannot explain how we implemented an intelligent version of their job in the machine. We didn't.
We hardly dare say that reinforcement learning can beat anybody in the company by making random choices 50,000 times until the right answer comes up.
Furthermore, we got the program to work, but hardly know what to do with the result ourselves. The consultant on the project cannot help because of the matrix format of the solution.
Being an adaptive thinker means knowing how to be good at every step of a project. To solve this new problem, let's go back to step 1 with the result. Going back to step 1 means that if you have problems either with the results themselves or with understanding them, you need to go back to the SME level, the real-life situation, and see what is going wrong.
By formatting the result in Python, a graphics tool, or a spreadsheet, it can be displayed as follows:
  | A       | B      | C    | D     | E      | F
A | -       | -      | -    | -     | 258.44 | -
B | -       | -      | -    | 321.8 | -      | 207.752
C | -       | -      | 500  | 321.8 | -      | -
D | -       | 258.44 | 401. | -     | 258.44 | -
E | 207.752 | -      | -    | 321.8 | -      | -
F | -       | 258.44 | -    | -     | -      | -
Now, we can start reading the solution:
- Choose a starting state. Take F, for example.
- The F line represents the state. Since the maximum value is 258.44 in the B column, we go to state B, the second line.
- The maximum value of state B in the second line leads us to the D state in the fourth column.
- The maximum value of the D state (fourth line) leads us to the C state.
Note that if you start at the C state and decide not to stay at C, the D state becomes the maximum value, which will lead you back to C. However, the MDP will never do this naturally. You will have to force the system to do it.
You have now obtained a sequence: F->B->D->C. By choosing other points of departure, you can obtain other sequences by simply sorting the table.
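If you prefer to let Python do the reading for you, a short, illustrative helper (not part of the chapter's source files) can follow the highest value of each line of the Q matrix computed above until the goal state C is reached:

# Hypothetical helper: follow the highest Q value from a starting state
# until the goal state C (index 2) is reached.
def best_path(Q, start, goal=2, labels=("A", "B", "C", "D", "E", "F")):
    path = [labels[start]]
    state = start
    while state != goal:
        state = int(ql.argmax(Q[state]))  # column holding the maximum value
        path.append(labels[state])
    return path

print(best_path(Q, 5))  # starting at F prints ['F', 'B', 'D', 'C']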
A useful way of putting it remains the normalized version in percentages, as shown in the following table:
  | A      | B      | C     | D      | E      | F
A | -      | -      | -     | -      | 51.68% | -
B | -      | -      | -     | 64.36% | -      | 41.55%
C | -      | -      | 100%  | 64.36% | -      | -
D | -      | 51.68% | 80.2% | -      | 51.68% | -
E | 41.55% | -      | -     | 64.36% | -      | -
F | -      | 51.68% | -     | -      | -      | -
Now comes the very tricky part. We started the chapter with a trip on the road. But I made no mention of it in the results analysis.
An important property of reinforcement learning comes from the fact that we are working with a mathematical model that can be applied to anything. No human rules are needed. We can use this program for many other subjects without writing thousands of lines of code.
Possible use cases
There are many cases to which we could adapt our reinforcement learning model without having to change any of its details.
Case 1: optimizing a delivery for a driver, human or not
This model was described in this chapter.
Case 2: optimizing warehouse flows
The same reward matrix can apply to go from point F to C in a warehouse, as shown in the following diagram:
Figure 1.3: A diagram illustrating a warehouse flow problem
In this warehouse, the F->B->D->C sequence makes visual sense. If somebody goes from point F to C, then this physical path makes sense without going through walls.
It can be used for a video game, a factory, or any form of layout.
Case 3: automated planning and scheduling (APS)
By converting the system into a scheduling vector, the whole scenery changes. We have left the more comfortable world of physical processing of letters, faces, and trips. Though fantastic, those applications are social media's tip of the iceberg. The real challenge of AI begins in the abstract universe of human thinking.
Every single company, person, or system requires automated planning and scheduling (see Chapter 12, AI and the Internet of Things (IoT)). The six A to F steps in the example of this chapter could well be six tasks to perform in a given, unknown order, represented by a vector x.
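A minimal way to write such a vector, assuming simply one component per task from A to F, is x = (x_1, x_2, x_3, x_4, x_5, x_6), where each component stands for one of the six tasks.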
The reward matrix then reflects the weights of the constraints on the tasks of vector x. For example, in a factory, you cannot assemble the parts of a product before manufacturing them.
In this case, the sequence obtained represents the schedule of the manufacturing process.
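As a minimal sketch (the tasks and weights here are illustrative, not taken from the chapter's code), a precedence constraint such as "manufacture before assemble" can be encoded by rewarding only the transitions that respect the order:

import numpy as ql

# Hypothetical 3-task example: 0 = manufacture parts, 1 = assemble, 2 = ship.
# A transition only earns a reward if it respects the precedence constraints,
# so the schedule produced by the MDP can never assemble before manufacturing.
R_schedule = ql.matrix([[0, 1, 0],     # manufacture -> assemble allowed
                        [0, 0, 1],     # assemble -> ship allowed
                        [0, 0, 100]])  # ship is the goal state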
Cases 4 and more: your imagination
By using physical layouts or abstract decision-making vectors, matrices, and tensors, you can build a world of solutions in a mathematical reinforcement learning model. Naturally, the following chapters will enhance your toolbox with many other concepts.
Before moving on, you might want to imagine some situations in which you could use the A to F letters to express some kind of path.
To help you with these thought experiments, open mdp02.py and go to line 97, which starts with the following code that enables a simulation tool. nextc and nextci are simply variables that remember where the path begins and where it will end. They are set to -1 so as to avoid 0, which is a location.
The primary goal is to focus on the expression "concept code." The locations have become any concept you wish. A could be your bedroom, and C your kitchen. The path would go from where you wake up to where you have breakfast. A could be an idea you have, and F the end of a thinking process. The path would go from A (How can I hang this picture on the wall?) to E (I need to drill a hole) and, after a few phases, to F (I hung the picture on the wall). You can imagine thousands of paths like this as long as you define the reward matrix, the "concept code," and a starting point:
"""# Improving the program by introducing a decision-making process"""
nextc=-1
nextci=-1
conceptcode=["A","B","C","D","E","F"]
This code takes the result of the calculation, labels the result matrix, and accepts an input as shown in the following code snippet:
origin=int(input(
"index number origin(A=0,B=1,C=2,D=3,E=4,F=5): "))
The input only accepts the labels' numerical codes: A=0, B=1, …, F=5. The function then runs a classical calculation on the results to find the best path. Let's take an example.
When you are prompted to enter a starting point, enter 5, for example, as follows:
index number origin(A=0,B=1,C=2,D=3,E=4,F=5): 5
The program will then produce the optimal path based on the output of the MDP process, as shown in the following output:
Concept Path
-> F
-> B
-> D
-> C
Try multiple scenarios and possibilities. Imagine what you could apply this to:
- An e-commerce website flow (visit, cart, checkout, purchase) imagining that a user visits the site and then resumes a session at a later time. You can use the same reward matrix and "concept code" explored in this chapter. For example, a visitor visits a web page at 10 a.m., starting at point A of your website. Satisfied with a product, the visitor puts the product in a cart, which is point E of your website. Then, the visitor leaves the site before going to the purchase page, which is C. D is the critical point. Why didn't the visitor purchase the product? What's going on?
You can decide to have an automatic email sent after 24 hours saying: "There is a 10% discount on all purchases during the next 48 hours." This way, you will target all the visitors stuck at D and push them toward C. A re-labeling sketch for this flow follows this list.
- A sequence of possible words in a sentence (subject, verb, object). Predicting letters and words was one of Andrey Markov's first applications more than 100 years ago! Suppose B is a variable representing the letter "a," D represents "t," and F represents "o." If an MDP reward matrix is built so that B leads to either D or F, there are two possibilities: D or F. Observing the English language closely, Markov would find that an "a-t" sequence is more likely than an "a-o" sequence. In a Markov decision process, a higher probability is therefore awarded to the "a-t" sequence and a lower one to "a-o." Going back to the variables, the B-D sequence comes out as more probable than the B-F sequence.
- Anything else you can find that fits the model works just as well!
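For the e-commerce flow in the first example, a minimal re-labeling sketch could look like the following; the labels for A, C, and E come from the scenario above, while the others are purely illustrative:

# The reward matrix R stays exactly the same; only the "concept code" changes.
conceptcode = ["landing page",      # A: where the visitor arrives at 10 a.m.
               "product page",      # B: illustrative label
               "purchase",          # C: the goal state
               "checkout",          # D: the critical point (label assumed)
               "cart",              # E: the product is put in the cart
               "returning visit"]   # F: illustrative label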
Machine learning versus traditional applications
Reinforcement learning based on stochastic (random) processes will evolve beyond traditional approaches. In the past, we would sit down and listen to future users to understand their way of thinking.
We would then go back to our keyboard and try to imitate the human way of thinking. Those days are over. We need proper datasets and ML/DL equations to move forward. Applied mathematics has taken reinforcement learning to the next level. In my opinion, traditional software will soon be in the museum of computer science. The complexity of the huge volumes of data we are facing will require AI at some point.
An artificial adaptive thinker sees the world through applied mathematics translated into machine representations.
Use the Python source code example provided in this chapter in different ways. Run it and try changing some of the parameters to see what happens. Play around with the number of iterations as well: lower it from 50,000 to the point where you find it fits best. Change the reward matrix a little to see what happens. Design your own reward matrix trajectory; it can be an itinerary or a decision-making process.
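As a starting point for these experiments, here is a small, self-contained variation; the change to the reward matrix is purely illustrative:

import numpy as ql

# Copy of the chapter's reward matrix with one illustrative change:
# a direct path between A and F (R[0, 5] and R[5, 0] set to 1).
R = ql.matrix([[0, 0, 0, 0, 1, 1],
               [0, 0, 0, 1, 0, 1],
               [0, 0, 100, 1, 0, 0],
               [0, 1, 1, 0, 1, 0],
               [1, 0, 0, 1, 0, 0],
               [1, 1, 0, 0, 0, 0]])

gamma = 0.8       # try other discount factors and compare the Q values
episodes = 5000   # lower the 50,000 iterations until convergence degrades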