Imagine that you need to design a spam-filtering algorithm, starting from this initial (oversimplified) classification based on two parameters:
| Parameter | Spam emails (X1) | Regular emails (X2) |
|---|---|---|
| p1 - Contains > 5 blacklisted words | 80 | 20 |
| p2 - Message length < 20 characters | 75 | 25 |
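
As a rough illustration, the table can be turned into empirical frequencies. This minimal sketch assumes the entries are message counts (they sum to 200); the names `counts` and `spam_fraction` are hypothetical and only serve the example.

```python
# Table entries read as message counts (an assumption; they sum to 200).
counts = {
    "p1": {"spam": 80, "regular": 20},  # more than 5 blacklisted words
    "p2": {"spam": 75, "regular": 25},  # shorter than 20 characters
}


def spam_fraction(rule):
    """Fraction of the messages matching `rule` that are spam."""
    row = counts[rule]
    return row["spam"] / (row["spam"] + row["regular"])


total_messages = sum(sum(row.values()) for row in counts.values())
fractions = {rule: spam_fraction(rule) for rule in counts}
print(total_messages, fractions)
```

Each fraction is only an estimate from this small sample, which is why a rule on its own is a weak basis for a filter.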
We have collected 200 email messages (X); for simplicity, we consider p1 and p2 to be mutually exclusive. We need to find a couple of probabilistic hypotheses (expressed in terms of p1 and p2) to determine the following:
We also assume that the two terms are conditionally independent (that is, hp1 and hp2 contribute to spam in conjunction in the same way as each would alone).
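
To make the conditional independence assumption concrete, the sketch below combines the two hypotheses with Bayes' rule, factorizing the likelihood as P(hp1 | label) * P(hp2 | label). It is an illustration under stated assumptions rather than a definitive method: the table entries are read as message counts, each message is taken to match exactly one rule (so the per-class totals are the row sums), and the names `counts` and `joint_score` are hypothetical. The factorization itself is a modeling assumption that this sample cannot verify.

```python
# Counts taken from the table, read as numbers of messages (an assumption).
counts = {
    "spam": {"p1": 80, "p2": 75},
    "regular": {"p1": 20, "p2": 25},
}
total = sum(sum(c.values()) for c in counts.values())


def joint_score(label):
    """P(label) * P(hp1 | label) * P(hp2 | label), assuming conditional independence."""
    n_label = sum(counts[label].values())  # messages in this class
    prior = n_label / total
    likelihood = 1.0
    for rule in ("p1", "p2"):
        likelihood *= counts[label][rule] / n_label
    return prior * likelihood


# Normalize the two class scores to get a posterior probability of spam.
scores = {label: joint_score(label) for label in counts}
posterior_spam = scores["spam"] / sum(scores.values())
print(f"P(Spam | hp1, hp2) = {posterior_spam:.3f}")
```

Normalizing over both classes avoids computing the evidence term directly, which is the usual shortcut when only two labels are involved.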
For example, we could think about rules (hypotheses) like so: "If there are more than five blacklisted words" or "If the message is less than 20 characters...