Lynne 82a68a8771
x86/tx_float: remove vgatherdpd usage
Its performance loss ranges from either being just as fast as individual loads
(Skylake), a few percent slower (Alderlake), 8% slower (Zen 3), to completely
disasterous (older/other CPUs).

Sadly, gathers never panned out fast on x86, even with the benefit of time and
implementation experience.

This also saves a register, as there's no need to fill out an additional
register mask.

Zen 3 (16384-point transform):
Before: 1561050 decicycles in           av_tx (fft),  131072 runs,      0 skips
After:  1449621 decicycles in           av_tx (fft),  131072 runs,      0 skips

Alderlake:
2% slower on big transforms (65536), to 1% (131072), to a few percent for smaller
sizes.
2022-05-20 10:12:34 +02:00
..
2022-03-10 16:45:48 -03:00
2022-03-10 16:45:48 -03:00
2022-03-10 16:45:48 -03:00