The Iris dataset is small, with just 150 samples. When we randomly split it into training and test sets, we ended up with 45 instances in the test set. With such a small number, the distribution of the targets can vary from one split to another. For example, when I randomly split the data, I got 13 samples from class 0 and 16 samples from each of the other two classes in my test set. Since class 0 is easier to predict than the other two classes in this particular dataset, if I had been luckier and had more samples of class 0 in the test set, I would have gotten a higher score. Furthermore, decision trees are very sensitive to changes in the data, and you may get a very different tree with every slight change in your training data.
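To see how much the test-set class counts can move between random splits, here is a minimal sketch that works on the labels alone. It assumes the standard Iris layout of 50 samples per class and a test set of 45 samples (30% of 150). The helper name `class_counts_in_test_set` and the seeds are hypothetical, chosen only for illustration, and the shuffle uses Python's standard library rather than the splitting code used earlier.

```python
import random
from collections import Counter

# Assumption: Iris has 150 samples, 50 for each of the classes 0, 1, and 2.
labels = [0] * 50 + [1] * 50 + [2] * 50
TEST_SIZE = 45  # 30% of 150, matching the test set size described above


def class_counts_in_test_set(seed):
    """Shuffle the labels with the given seed and count each class in the test part."""
    rng = random.Random(seed)
    shuffled = labels[:]
    rng.shuffle(shuffled)
    counts = Counter(shuffled[:TEST_SIZE])
    return [counts[c] for c in (0, 1, 2)]


# Try a handful of seeds; the per-class counts change from split to split.
splits = {seed: class_counts_in_test_set(seed) for seed in range(5)}
for seed, counts in splits.items():
    print(f"seed={seed}: test-set class counts = {counts}")
```

Each row sums to 45, but the share of class 0 differs from seed to seed, which is one reason a single score from a single split is hard to trust.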
What to do now to get a more reliable score
A statistician would say: let's run the whole process of data splitting, training, and predicting more than once, and get the distribution of the different...