Hyena Hierarchy: Towards Larger Convolutional Language Models

Here are some rough notes on the “Hyena Hierarchy” architecture described in the paper [1].

  • This is a new way of getting sub-quadratic scaling (in sequence length) as a replacement for attention

  • it uses long convolution filters

    • typical convolution filters are an explicit array of values which are learned and applied like a Finite-Impulse-Response (FIR) discrete filter [2]
    • this doesn’t scale well, since the number of parameters grows with the filter length
    • instead, the filter values are represented as a function of “t”, where “t” is the index or “time-step” within the filter. This means you can get a filter of any length from a limited number of parameters
    • furthermore, this function is chosen to be the output of a state-space model of the kind used in control theory: x_{t+1} = A x_t + B u_t, y_t = C x_t + D u_t
    • If x₀ = 0, the output “y” (i.e. the filter) depends only on the input and the matrices A, B, C and D: unrolling the recurrence gives FIR filter taps built from products like C Aᵏ B (plus the direct D term), so the whole filter is determined by matrices that can be learned during training
    • the dimension of the state-space model and the structure of its matrices determine the degrees of freedom available
  • FFT can be used to implement these long convolutions in O(L log L) rather than O(L²) time (see the sketch below)
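A toy NumPy sketch of the two ideas above (my own illustration, not the paper's code): a long filter of length L is generated from a handful of state-space parameters, assuming the discretized recurrence x_t = A·x_(t-1) + B·u_t, y_t = C·x_t with the state starting at zero (so the filter taps are h_t = C Aᵗ B; the D skip term is omitted), and the filter is then applied with an FFT-based causal convolution.

```python
import numpy as np

# Generate a long filter implicitly from a small state-space model.
# Convention assumed here: x_t = A x_{t-1} + B u_t, y_t = C x_t, state starts at 0,
# so the impulse response (the filter taps) is h_t = C A^t B.
rng = np.random.default_rng(0)
d_state, L = 4, 1024                           # tiny SSM, long filter

A = np.diag(rng.uniform(0.5, 0.99, d_state))   # toy choice: stable diagonal A
B = rng.standard_normal((d_state, 1))
C = rng.standard_normal((1, d_state))

# A few SSM parameters -> L filter taps, h_t = C A^t B for t = 0..L-1.
h = np.empty(L)
x = B[:, 0]
for t in range(L):
    h[t] = C[0] @ x
    x = A @ x

# Apply the filter as a causal convolution y = h * u via FFT, in O(L log L).
# Zero-padding to 2L makes the circular convolution match linear convolution.
u = rng.standard_normal(L)
n = 2 * L
y = np.fft.irfft(np.fft.rfft(h, n) * np.fft.rfft(u, n), n)[:L]

assert np.allclose(y, np.convolve(h, u)[:L])   # matches the direct O(L^2) version
```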

  • Typical attention involves three learned linear projections of the input, called query, key and value; a softmax over the query-key scores produces the mixing weights: y = softmax(QKᵀ/√d) V
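For comparison, a minimal single-head NumPy version (my own sketch, no masking): the softmax weights form an explicit L × L matrix, which is where the quadratic cost in sequence length comes from.

```python
import numpy as np

def attention(u, Wq, Wk, Wv):
    # Three learned linear projections of the input u (shape L x d).
    q, k, v = u @ Wq, u @ Wk, u @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (L, L): quadratic in L
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # softmax over the keys
    return weights @ v                             # (L, d)
```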

  • “Hyena” uses N+1 linear projections (not necessarily three). One of these projections takes the role of the “value”.

  • So y = H(u)v

    • H(u) is defined by “interleaving implicit long convolutions and element-wise multiplication” with one projection at a time
    • It retains the sub-quadratic scaling by never “materializing” H(u) as an explicit L × L matrix (a sketch of the recurrence follows this list)
    • An element-wise product in the time domain corresponds to a convolution in the frequency domain, and vice versa (the convolution theorem); the “vice versa” direction is what lets the long convolutions be computed with the FFT
  • more details to come as I understand the paper further
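Here is a minimal NumPy sketch of how I currently read that recurrence: each order applies one implicit long convolution (via FFT, as above) followed by an element-wise gate from the next projection, so y = H(u)v is computed without ever forming the L × L operator H(u). The shapes, names, and random stand-ins for the learned projections and filters are mine, just to show the data flow.

```python
import numpy as np

def fft_conv(h, z):
    """Causal long convolution along the sequence axis, via FFT."""
    L = z.shape[0]
    n = 2 * L                                      # zero-pad: linear, not circular
    return np.fft.irfft(np.fft.rfft(h, n, axis=0) * np.fft.rfft(z, n, axis=0),
                        n, axis=0)[:L]

def hyena_operator(v, xs, hs):
    """Compute y = H(u) v without materializing H(u).

    v  : (L, D) projection playing the role of the "value"
    xs : N projections of shape (L, D), used as element-wise gates
    hs : N implicit long filters of shape (L, D), one per order
    """
    z = v
    for x_n, h_n in zip(xs, hs):
        z = x_n * fft_conv(h_n, z)                 # gate(long_conv(z)) at each order
    return z

# Toy usage: order N = 2, i.e. N + 1 = 3 projections of the input.
rng = np.random.default_rng(0)
L, D, N = 256, 16, 2
projections = [rng.standard_normal((L, D)) for _ in range(N + 1)]  # stand-ins for learned projections
v, xs = projections[0], projections[1:]
hs = [rng.standard_normal((L, D)) for _ in range(N)]               # stand-ins for implicit filters

y = hyena_operator(v, xs, hs)
print(y.shape)                                     # (L, D)
```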

References

[1] M. Poli et al., “Hyena Hierarchy: Towards Larger Convolutional Language Models”, 2023.

[2] Finite impulse response (FIR) filter.
