Its performance loss ranges from none, being just as fast as individual loads
(Skylake), through a few percent slower (Alderlake) and 8% slower (Zen 3), to
completely disastrous (older/other CPUs).
Sadly, gathers never became fast on x86, even with the benefit of time and
implementation experience.
Using individual loads also saves a register, as there's no need to set up an
additional mask register for the gather.
Zen 3 (16384-point transform):
Before: 1561050 decicycles in av_tx (fft), 131072 runs, 0 skips
After: 1449621 decicycles in av_tx (fft), 131072 runs, 0 skips
Alderlake:
2% slower on big transforms (65536), to 1% (131072), to a few percent for smaller
sizes.