Imagine that you need to design a spam-filtering algorithm, starting from this initial (oversimplified) classification based on two parameters:
| Parameter | Spam emails (X1) | Regular emails (X2) |
|---|---|---|
| p1 - Contains > 5 blacklisted words | 80 | 20 |
| p2 - Message length < 20 characters | 75 | 25 |
We have collected 200 email messages (X); for simplicity, we consider p1 and p2 mutually exclusive. We need to find a pair of probabilistic hypotheses (expressed in terms of p1 and p2) to determine:
We also assume that the two terms are conditionally independent (that is, hp1 and hp2 contribute jointly to spam in the same way as each would alone).
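To make the numbers concrete, the counts in the table can be turned into empirical probabilities. The following is a minimal Python sketch, not the author's own derivation: the names `counts`, `prior`, `likelihood`, `posterior_single`, and `posterior_joint` are hypothetical, and the joint estimate assumes a naive Bayes-style factorization of the likelihoods, which is one common way to read the conditional independence assumption.

```python
# Counts copied from the table: emails matching each parameter, split by class.
counts = {
    "p1": {"spam": 80, "regular": 20},  # more than 5 blacklisted words
    "p2": {"spam": 75, "regular": 25},  # message shorter than 20 characters
}

# p1 and p2 are treated as mutually exclusive, so every email is counted once.
total = sum(sum(row.values()) for row in counts.values())


def prior(label):
    """Empirical P(label) over the whole dataset."""
    return sum(row[label] for row in counts.values()) / total


def likelihood(param, label):
    """Empirical P(param | label)."""
    return counts[param][label] / sum(row[label] for row in counts.values())


def posterior_single(param, label):
    """Empirical P(label | param), read directly from one table row."""
    return counts[param][label] / sum(counts[param].values())


def posterior_joint(label):
    """P(label | p1, p2), ASSUMING the likelihoods factorize
    (naive Bayes-style conditional independence)."""
    def score(lbl):
        return prior(lbl) * likelihood("p1", lbl) * likelihood("p2", lbl)

    return score(label) / (score("spam") + score("regular"))


if __name__ == "__main__":
    print("total emails:", total)
    print("P(spam):", prior("spam"))
    print("P(spam | p1):", posterior_single("p1", "spam"))
    print("P(spam | p2):", posterior_single("p2", "spam"))
    print("P(spam | p1, p2), factorized:", posterior_joint("spam"))
```

Keeping the per-row estimates separate from the factorized joint estimate makes it easy to see which values come straight from the table and which depend on the independence assumption.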
For example, we could think about rules (hypotheses) like: "If there are more than five blacklisted words" or "If the message is less than 20 characters in...