Here are some rough notes on the “Hyena Hierarchy” architecture described in the paper [1].
- This is a new way of getting sub-quadratic scaling, as a drop-in replacement for attention
- It uses long convolution filters
- Typical convolution filters take the form of an array of values which are learned and applied like a finite impulse response (FIR) discrete filter
- This doesn’t scale well: the number of filter parameters grows with the filter length, and for global context the filter has to be as long as the sequence
- Instead, the filter values are represented as a function of “t”, where “t” is the index or “time-step” within the filter. This means you can generate a filter of any length from a limited number of parameters
- Furthermore, this function is chosen to be the output of a state-space model of the kind used in control theory (x_{t+1} = A x_t + B u_t, y_t = C x_t + D u_t)
- If x_0 = 0, you can write the output “y” (i.e. the filter) purely in terms of the matrices A, B, C and D, which can be learned during training: the impulse response works out to h_0 = D and h_t = C A^(t-1) B for t >= 1 (see the sketch after this list)
- The dimension of the state-space model and the structure of the matrices determine the degrees of freedom available
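Here is a minimal NumPy sketch of that idea (my own illustration, not the paper’s code): a small set of state-space matrices generates an FIR filter of whatever length is requested, using h_0 = D and h_t = C A^(t-1) B.

```python
import numpy as np

def ssm_filter(A, B, C, D, length):
    """Generate a `length`-tap FIR filter from discrete state-space matrices,
    assuming x_0 = 0: h[0] = D, h[t] = C @ A^(t-1) @ B for t >= 1."""
    h = np.zeros(length)
    h[0] = D                      # direct feed-through term
    x = B.copy()                  # state after one step driven by a unit impulse
    for t in range(1, length):
        h[t] = (C @ x).item()     # impulse response at step t
        x = A @ x                 # advance the state
    return h

# A small 4-dimensional SSM gives a filter of any requested length.
rng = np.random.default_rng(0)
d = 4
A = 0.9 * np.eye(d) + 0.05 * rng.standard_normal((d, d))   # keep it roughly stable
B = rng.standard_normal((d, 1))
C = rng.standard_normal((1, d))
D = 0.1
print(ssm_filter(A, B, C, D, length=8))      # 8 taps...
print(ssm_filter(A, B, C, D, length=1024))   # ...or 1024 taps, from the same parameters
```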
- The FFT can be used to implement these long convolutions in O(L log L) time rather than O(L^2)
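A quick NumPy sketch of the FFT trick (names are mine, not the paper’s): a causal convolution computed via the FFT matches direct convolution but costs O(L log L) instead of O(L^2).

```python
import numpy as np

def fft_causal_conv(u, h):
    """Causal convolution of signal u with filter h via the FFT.
    Zero-padding to 2L avoids the wrap-around of circular convolution."""
    L = len(u)
    n = 2 * L
    U = np.fft.rfft(u, n=n)
    H = np.fft.rfft(h, n=n)
    y = np.fft.irfft(U * H, n=n)      # pointwise product in the frequency domain
    return y[:L]                      # keep only the causal part

rng = np.random.default_rng(0)
u = rng.standard_normal(1024)
h = rng.standard_normal(1024)

y_fft = fft_causal_conv(u, h)
y_direct = np.convolve(u, h)[:1024]   # O(L^2) reference
print(np.allclose(y_fft, y_direct))   # True
```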
- Typical attention involves three linear projections of the input, called query, key and value; a softmax of the query–key scores is used to weight the values
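For contrast, here is a bare-bones single-head softmax attention in NumPy (layout and names are my own sketch): three projections, a softmax over the query–key scores, then a weighted sum of the values. The (L, L) score matrix is where the quadratic cost comes from.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(u, Wq, Wk, Wv):
    """Single-head self-attention over an (L, d) input u."""
    q, k, v = u @ Wq, u @ Wk, u @ Wv           # the three linear projections
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (L, L) -- the quadratic part
    return softmax(scores, axis=-1) @ v        # softmax weights applied to the values

L, d = 16, 8
rng = np.random.default_rng(0)
u = rng.standard_normal((L, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(attention(u, Wq, Wk, Wv).shape)          # (16, 8)
```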
- “Hyena” uses N+1 linear projections (so not necessarily three). One of these projections takes the role of the “value”.
- So the output is computed as y = H(u) v, where v is the projection that plays the role of the value
- H(u) is defined by “interleaving implicit long convolutions and element-wise multiplication”, applying one of the remaining projections at each step
- It retains the sub-quadratic scaling by never “materializing” the full matrix H(u): the operator is evaluated as a sequence of FFT convolutions and element-wise products instead (see the sketch after this list)
- By the convolution theorem, a convolution in the time domain corresponds to an element-wise product in the frequency domain (and vice versa), which is what makes the FFT implementation of the long convolutions fast
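Putting the pieces together, here is a rough sketch of how the operator can be evaluated without ever forming H(u): the N+1 projections are combined by alternating FFT long convolutions with element-wise gating. This is a simplified single-channel illustration with random filters standing in for the learned implicit ones; the function names and layout are mine, not the paper’s implementation.

```python
import numpy as np

def fft_causal_conv(u, h):
    """Causal convolution via the FFT (O(L log L))."""
    n = 2 * len(u)
    return np.fft.irfft(np.fft.rfft(u, n=n) * np.fft.rfft(h, n=n), n=n)[:len(u)]

def hyena_operator(projections, filters):
    """Evaluate y = H(u) v without materializing the L x L matrix H(u).

    projections = [v, x_1, ..., x_N]: the N+1 linear projections of the input,
    where the first one plays the role of the value.
    filters = [h_1, ..., h_N]: the N implicit long convolution filters.
    """
    v, gates = projections[0], projections[1:]
    z = v
    for x, h in zip(gates, filters):
        z = fft_causal_conv(z, h)   # implicit long convolution
        z = x * z                   # element-wise (gating) multiplication
    return z

L, N = 1024, 2                      # sequence length, order of the operator
rng = np.random.default_rng(0)
projections = [rng.standard_normal(L) for _ in range(N + 1)]
filters = [rng.standard_normal(L) for _ in range(N)]
print(hyena_operator(projections, filters).shape)   # (1024,)
```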
- More details to come as I understand the paper further
References
[1] Poli et al., “Hyena Hierarchy: Towards Larger Convolutional Language Models”, 2023.