Statistical machine translation systems select a target text by maximizing its conditional probability given the source text. For example, suppose we have a source text s and we want to derive the best equivalent text t in the target language. This can be expressed as follows:

    t* = argmax_t P(t|s)                                   (1)

The formulation of P(t|s) in (1) can be expanded using Bayes' theorem as follows:

    P(t|s) = P(s|t) P(t) / P(s)                            (2)

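As a quick sanity check, Bayes' theorem can be verified with toy numbers. All probabilities below are invented purely for illustration, not taken from any real translation model:

```python
# Toy numeric check of Bayes' theorem: P(t|s) = P(s|t) * P(t) / P(s).
# All values are made up for illustration.
p_t = 0.2          # language-model prior P(t) of a target sentence t
p_s_given_t = 0.5  # translation-model probability P(s|t)
p_s = 0.25         # marginal probability P(s) of the source sentence

p_t_given_s = p_s_given_t * p_t / p_s
print(p_t_given_s)  # 0.4
```

Note that P(s) only rescales the result; it does not change which target sentence scores highest.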
For a given source sentence, P(s) is fixed, so finding the optimal target translation reduces to the following:

    t* = argmax_t P(s|t) P(t)                              (3)

You may wonder what advantage there is in maximizing P(s|t)P(t) rather than P(t|s) directly. Breaking the problem into two components avoids the ill-formed target sentences that a direct model of P(t|s) might score highly: the translation model P(s|t) rewards sentences that are adequate (faithful to the source), while the language model P(t) rewards sentences that are fluent in the target language, as shown in the previous formula.

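To make the decision rule concrete, here is a minimal sketch that ranks a handful of candidate translations by P(s|t) * P(t). The source sentence, the candidates, and every probability are invented for illustration; a real system would search over a vast hypothesis space rather than a fixed list:

```python
# Toy noisy-channel ranking: choose the target sentence t that
# maximizes P(s|t) * P(t). All scores below are made up.
source = "la casa verde"

# candidate t -> (translation model P(s|t), language model P(t))
candidates = {
    "the green house": (0.30, 0.020),
    "the house green": (0.35, 0.001),    # adequate but disfluent: low P(t)
    "green the house": (0.10, 0.0005),   # neither adequate nor fluent
}

best = max(candidates, key=lambda t: candidates[t][0] * candidates[t][1])
print(best)  # the green house
```

Even though "the house green" has the highest translation-model score here, its low language-model score pushes it below the fluent candidate, which is exactly the behavior the factorization is meant to achieve.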
As we can see...