Beam Search
Greedy search
Generate (or “decode”) the target sentence by taking the argmax at each step of the decoder (a minimal sketch follows the list below).
Problem with greedy search:
- Greedy decoding has no way to undo decisions!
- Input: il a m’entarté (he hit me with a pie)
- → he ____
- → he hit ____
- → he hit a ____ (whoops! no going back now…)
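As a concrete illustration, here is a minimal greedy-decoding sketch in Python. The function `decoder_step(prefix, x)` is a hypothetical stand-in for the decoder, assumed to return a dict mapping each vocabulary token to \(P(y_t | y_1, \ldots, y_{t-1}, x)\); it is not part of any particular library.

```python
def greedy_decode(decoder_step, x, end_token="<END>", max_len=100):
    """Greedy decoding: take the argmax token at every step.

    `decoder_step(prefix, x)` is assumed (hypothetically) to return a
    dict mapping each vocabulary token to P(token | prefix, x).
    """
    prefix = []
    for _ in range(max_len):
        probs = decoder_step(prefix, x)    # P(y_t | y_<t, x)
        best = max(probs, key=probs.get)   # argmax -- no way to undo this choice
        if best == end_token:
            break
        prefix.append(best)
    return prefix
```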
Exhaustive search decoding
Ideally, we want to find a (length T) translation y that maximizes:
\[
\begin{aligned}
P(y | x) &= P(y_{1} | x)\, P(y_{2} | y_{1}, x)\, P(y_{3} | y_{1}, y_{2}, x) \ldots P(y_{T} | y_{1}, \ldots, y_{T-1}, x) \\
&= \prod_{t=1}^{T} P(y_{t} | y_{1}, \ldots, y_{t-1}, x)
\end{aligned}
\]
We could try computing all possible sequences y:
- This means that on each step t of the decoder, we’re tracking \(V^t\) possible partial translations, where \(V\) is vocab size
- This \(O(V^T)\) complexity is far too expensive! (With \(V \approx 50{,}000\) and \(T = 10\), that is already on the order of \(10^{47}\) sequences.)
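To make that cost concrete, here is a purely illustrative sketch of exhaustive decoding, using the same hypothetical `decoder_step(prefix, x)` interface as above; it enumerates all \(V^T\) sequences and is intractable for any realistic vocabulary.

```python
import math

def exhaustive_decode(decoder_step, x, vocab, T):
    """Score every length-T sequence and return the highest-scoring one.

    Enumerates all V**T sequences -- intractable in practice; shown only
    to make the O(V^T) cost concrete.
    """
    best, best_score = None, -math.inf

    def expand(prefix, score):
        nonlocal best, best_score
        if len(prefix) == T:
            if score > best_score:
                best, best_score = prefix, score
            return
        probs = decoder_step(prefix, x)  # P(y_t | prefix, x)
        for token in vocab:
            expand(prefix + [token], score + math.log(probs[token]))

    expand([], 0.0)
    return best
```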
Beam search
- Core idea: On each step of the decoder, keep track of the k most probable partial translations (which we call hypotheses), where \(k\) is the beam size (in practice around 5 to 10). A sketch of one such step follows this list.
- A hypothesis \(y_1, \ldots, y_t\) has a score, which is its log probability:
\[\operatorname{score}(y_{1}, \ldots, y_{t}) = \log P_{\mathrm{LM}}(y_{1}, \ldots, y_{t} | x) = \sum_{i=1}^{t} \log P_{\mathrm{LM}}(y_{i} | y_{1}, \ldots, y_{i-1}, x)\]
- Scores are all negative, and higher score is better
- We search for high-scoring hypotheses, tracking the top \(k\) on each step
- Beam search is not guaranteed to find the optimal solution
- But it is much more efficient than exhaustive search!
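A minimal sketch of one beam-search step, again assuming the hypothetical `decoder_step` interface from the greedy example; hypotheses are (prefix, score) pairs scored by summed log-probability.

```python
import math

def beam_step(decoder_step, x, hypotheses, k):
    """Expand every hypothesis by one token and keep the k best.

    `hypotheses` is a list of (prefix, score) pairs, where score is the
    running sum of log P_LM(y_i | y_1, ..., y_{i-1}, x).
    """
    candidates = []
    for prefix, score in hypotheses:
        probs = decoder_step(prefix, x)  # P(y_t | prefix, x)
        for token, p in probs.items():
            candidates.append((prefix + [token], score + math.log(p)))
    # Keep only the k highest-scoring partial translations (the beam).
    candidates.sort(key=lambda h: h[1], reverse=True)
    return candidates[:k]
```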
Beam search decoding: stopping criterion
- In greedy decoding, usually we decode until the model produces an <END> token
- In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
- When a hypothesis produces <END>, that hypothesis is complete
  - Place it aside and continue exploring other hypotheses via beam search
- Usually we continue beam search until (both cutoffs appear in the loop sketch after this list):
  - We reach timestep T (where T is some pre-defined cutoff), or
  - We have at least n completed hypotheses (where n is a pre-defined cutoff)
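Putting the pieces together, here is a sketch of the full loop with both stopping criteria, built on the hypothetical `beam_step` and `decoder_step` sketches above.

```python
def beam_search(decoder_step, x, end_token="<END>", k=5, max_T=50, n_completed=5):
    """Run beam search until timestep max_T or until n_completed hypotheses finish."""
    hypotheses = [([], 0.0)]  # start from the empty prefix with score 0
    completed = []
    for _ in range(max_T):
        hypotheses = beam_step(decoder_step, x, hypotheses, k)
        still_open = []
        for prefix, score in hypotheses:
            if prefix[-1] == end_token:
                completed.append((prefix, score))  # set the finished hypothesis aside
            else:
                still_open.append((prefix, score))
        hypotheses = still_open
        if len(completed) >= n_completed or not hypotheses:
            break
    return completed
```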
Beam search decoding: finishing up
- We have our list of completed hypotheses.
- How do we select the top one, with the highest score?
- Each hypothesis \(y_1, \ldots, y_t\) on our list has a score:
\[\operatorname{score}(y_{1}, \ldots, y_{t}) = \log P_{\mathrm{LM}}(y_{1}, \ldots, y_{t} | x) = \sum_{i=1}^{t} \log P_{\mathrm{LM}}(y_{i} | y_{1}, \ldots, y_{i-1}, x)\]
- Problem with this: longer hypotheses have lower scores, since each added term is a negative log-probability
- Fix: Normalize by length. Use this to select the top one instead:
\[\frac{1}{t} \sum_{i=1}^{t} \log P_{\mathrm{LM}}(y_{i} | y_{1}, \ldots, y_{i-1}, x)\]
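Finally, selecting the output under the length-normalized score is a one-liner over the `completed` list returned by the loop sketch above.

```python
def select_best(completed):
    """Pick the hypothesis with the highest length-normalized log-probability."""
    return max(completed, key=lambda h: h[1] / len(h[0]))
```

Because every score is negative, dividing by the hypothesis length t removes the built-in bias toward shorter outputs.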