Here are some rough notes on the “Hyena Hierarchy” architecture described in the paper [1].

Hyena is a replacement for attention whose cost scales subquadratically with sequence length.

It uses convolution filters:
 typical convolution filters are an array of learned values, applied like a finite impulse response (FIR) discrete filter [2]
 this doesn’t scale well, because the number of parameters grows linearly with the filter length
 instead the filter parameters are represented as a function of “t” where “t” represents the index or “timestep” in the filter. This means you can get a filter of any length from a limited number of parameters
 furthermore, one choice for this function is the impulse response of a state-space model of the type from control theory (x' = Ax + Bu, y = Cx + Du)
 if x0 = 0, then you can get a closed-form expression for the output “y” to a unit impulse (aka the filter), in terms of the matrices A, B, C and D (which can be learned during training)
 the dimensions of the state-space model and the structure of the matrices represent the degrees of freedom available
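
To convince myself of the state-space point above, here is a minimal NumPy sketch. It assumes the discrete-time form x[t+1] = A x[t] + B u[t], y[t] = C x[t] + D u[t] with x[0] = 0, for which the impulse response is h[0] = D and h[t] = C A^(t-1) B for t >= 1. The function names, the state size and the random matrices are my own illustration, not the paper's code, and the index convention may differ from the one the paper uses.

```python
import numpy as np

def ssm_filter(A, B, C, D, length):
    """Impulse response of x[t+1] = A x[t] + B u[t], y[t] = C x[t] + D u[t], x[0] = 0."""
    h = np.empty(length)
    h[0] = D                      # direct feed-through term
    state = B.copy()              # holds A^(t-1) B, starting at A^0 B
    for t in range(1, length):
        h[t] = C @ state          # h[t] = C A^(t-1) B
        state = A @ state
    return h

def simulate(A, B, C, D, u):
    """Run the recurrence step by step, for comparison with the convolution."""
    x = np.zeros(A.shape[0])
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        y[t] = C @ x + D * u_t
        x = A @ x + B * u_t
    return y

rng = np.random.default_rng(0)
n = 4                                                    # state dimension (assumed)
A = 0.9 * np.linalg.qr(rng.standard_normal((n, n)))[0]   # scaled orthogonal matrix, so powers of A decay
B = rng.standard_normal(n)
C = rng.standard_normal(n)
D = 0.5

L = 64
h = ssm_filter(A, B, C, D, L)       # same A, B, C, D would give a filter of any length
u = rng.standard_normal(L)
y_conv = np.convolve(u, h)[:L]      # causal convolution with the filter
y_sim = simulate(A, B, C, D, u)     # direct simulation of the state-space model
print(np.max(np.abs(y_conv - y_sim)))
```

The number of parameters here is n*n + 2n + 1 whatever the filter length, which is the point of the implicit representation.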

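The FFT can be used to implement long convolutions in O(L log L) time for a sequence of length L, instead of the O(L^2) of direct evaluation.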
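
A small sketch of the FFT route, assuming a single 1-D signal and a causal filter. The helper name fft_causal_conv is mine. The zero-padding is there so that the circular convolution the FFT computes does not wrap around into the outputs we keep.

```python
import numpy as np

def fft_causal_conv(u, h):
    """Causal convolution y[t] = sum_k h[k] u[t-k], computed with FFTs."""
    L = len(u)
    n = len(u) + len(h) - 1              # full linear-convolution length, so nothing wraps
    U = np.fft.rfft(u, n=n)              # rfft zero-pads the input to length n
    H = np.fft.rfft(h, n=n)
    return np.fft.irfft(U * H, n=n)[:L]  # keep only the first L (causal) outputs

rng = np.random.default_rng(0)
u = rng.standard_normal(256)
h = rng.standard_normal(256)             # filter as long as the sequence

y_fft = fft_causal_conv(u, h)
y_direct = np.convolve(u, h)[:len(u)]    # direct O(L^2) reference
print(np.max(np.abs(y_fft - y_direct)))
```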

Typical attention involves three linear projections of the input called query, key and value. The softmax is applied to the scaled query-key scores, and the resulting weights are used to mix the values.
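
For comparison, a minimal single-head, non-causal self-attention sketch in NumPy. The weight names W_q, W_k, W_v and the sizes are assumptions for illustration. The (L, L) score matrix is where the quadratic cost in sequence length comes from.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(u, W_q, W_k, W_v):
    """u: (L, d) input sequence. Returns the (L, d) output and the (L, L) weights."""
    q, k, v = u @ W_q, u @ W_k, u @ W_v        # the three linear projections
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (L, L) scaled query-key scores
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
L, d = 16, 8
u = rng.standard_normal((L, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

y, weights = self_attention(u, W_q, W_k, W_v)
print(y.shape, weights.shape)
```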

“Hyena” uses N+1 linear projections (N+1 is not necessarily equal to three). One of these projections takes the role of the “value”.

So
y = H(u)v
 H(u) is defined by “interleaving implicit long convolutions and elementwise multiplication” with one projection at a time
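 It retains the subquadratic scaling by not “materializing” H(u): the convolutions and elementwise products are applied to v one after another, so the L x L matrix is never formed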
 The elementwise product in time domain corresponds to convolution in frequency domain
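
Here is my current understanding of that recurrence as a single-channel NumPy sketch: start from z = v, then for each of the N remaining projections x and filters h set z = x * (h conv z). The random arrays stand in for the projections of u and for the learned filters, and all names are hypothetical. The second function builds the L x L matrix explicitly, only to check that the cheap path computes the same y = H(u)v.

```python
import numpy as np

def causal_conv(h, z):
    """Causal convolution of two length-L arrays via FFT, O(L log L)."""
    L = len(z)
    n = 2 * L                                   # zero-pad so the circular convolution does not wrap
    return np.fft.irfft(np.fft.rfft(h, n=n) * np.fft.rfft(z, n=n), n=n)[:L]

def hyena_operator(v, gates, filters):
    """Alternate long convolutions and elementwise gating without forming H(u)."""
    z = v
    for x, h in zip(gates, filters):
        z = x * causal_conv(h, z)               # gate the filtered signal elementwise
    return z

def materialized_H(gates, filters):
    """Explicit L x L matrix H(u), built only to check the result (O(L^2) memory)."""
    L = len(gates[0])
    idx = np.arange(L)
    lag = idx[:, None] - idx[None, :]           # lag[i, j] = i - j
    H = np.eye(L)
    for x, h in zip(gates, filters):
        S = np.where(lag >= 0, h[np.clip(lag, 0, None)], 0.0)   # lower-triangular Toeplitz matrix of h
        H = np.diag(x) @ S @ H
    return H

rng = np.random.default_rng(0)
L, N = 32, 2
v = rng.standard_normal(L)                                       # stand-in for the "value" projection
gates = [rng.standard_normal(L) for _ in range(N)]               # stand-ins for the other N projections
decay = np.exp(-0.1 * np.arange(L))
filters = [rng.standard_normal(L) * decay for _ in range(N)]     # stand-ins for the learned long filters

y = hyena_operator(v, gates, filters)
y_check = materialized_H(gates, filters) @ v
print(np.max(np.abs(y - y_check)))
```

The fast path costs N FFT convolutions, so roughly O(N L log L), while the explicit matrix needs O(L^2) memory.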

more details to come as I understand the paper further
References
[1] “Hyena Hierarchy: Towards Larger Convolutional Language Models”, Poli et al.