Cure the headache of Transformers via Collinear Constrained Attention

Shiyi Zhu, Jing Ye, Wei Jiang, Qi Zhang, Yifan Wu, & Jianguo Li
Ant Group
{zhushiyi.zsy,qianye.yj,shouzhi.jw,lijg.zero}@antgroup.com

Abstract

As the rapid progression of practical applications based on Large Language Models continues, the importance of extrapolating performance has grown exponentially in the research domain. In our study, we identified an anomalous behavior in Transformer models that had been previously overlooked, leading to a chaos around closest tokens which carried the most important information. We’ve coined this discovery the ”headache of Transformers”. To address this at its core, we introduced a novel self-attention structure named Collinear Constrained Attention (CoCA). This structure can be seamlessly integrated with existing extrapolation, interpolation methods, and other optimization strategies designed for traditional Transformer models. We have achieved excellent extrapolating performance even for 16 times to 24 times of sequence lengths during inference without any fine-tuning on our model. We have also enhanced CoCA’s computational and spatial efficiency to ensure its practicality. We plan to open-source CoCA shortly. In the meantime, we’ve made our code available in the appendix for reappearing experiments.

1 Introduction

In the seminal work of Transformer models (Vaswani et al., 2017), “extrapolate to sequence lengths longer than the ones encountered during training” was a beautiful but ideal hypothesis that we want.

There was a lot of works tried to realize it. From the perspective of implementation methods, it can be divided into two categories, pre-training and fine-tuning, pre-training ones such as ALiBi (Press et al., 2021) and LeX (Sun et al., 2022) by introducing a new position encoding method, while fine-tuning ones such as PI (Chen et al., 2023) and NTK-aware Scaled RoPE (bloc97, 2023) by improving the extrapolation or interpolation methods.

These works made a lot of contributions in this research domain. However, they still haven’t solve this problem perfectly. Pre-training methods like ALiBi (Press et al., 2021) and LeX (Sun et al., 2022) have weaken the aesthetics of the attention structure to some extent since the assumption in their models may be too strong. The structural changes of these models also make them not suitable for some useful optimization methods such as linear attention (Katharopoulos et al., 2020). Additionally, they conflict with available fine-tuning methods either.

For fine-tuning methods which is based on RoPE (Su et al., 2021), situation may be slightly better, since they introduce no destructive features. However, they still put their eyes on position encoding like Pre-training methods did. Furthermore, we found that all these methods haven’t discover one technical deficiency in Transformer models itself, which may be the heart of the matter of extrapolating to longer sequence.

In a nutshell, not only the tail of positional encoding will make attention scores oscillate as we all known, but the head either. The initial angle between queries and keys will destroy the monotonicity of RoPE significantly on those closest positions with the most important information in common cognition, which is incurable with whatever extrapolation or interpolation methods, but studying Transformer models itself.

Can we get a Transformer model with ideal usage of input information? This question, not exactly the same but highly relevant, was studied by (Del’etang et al., 2022) before. They found that Transformer models behave abnormally, left an open question.

This work We drilled down on the anomalous behavior of Transformer models, especially the interactions between RoPE (Su et al., 2021) and attention matrices. Finally, we make it realized, by rescuing the headache of Transformer models with our Collinear Constrained Attention method, see Figure 1 for understanding. “Extrapolate to sequence lengths longer than the ones encountered during training”, naturally, without additional fine-tuning, without destruction of an ideal attention structure. Even more, perfectly work with available extrapolation or interpolation methods and any other useful optimization methods proposed for traditional Transformer models.

Refer to caption — Figure 1: Collinear Constrained Attention. Q, T and V are generated by projection matrices the same as before, we generate K with Q and T to applying collinear Constraint.

Since our model is non-destructive to the current attention structure and computational efficient, everyone could get a promoted vision of Transformer models with ideal characteristic of extrapolating by the method we named Collinear Constraint here, and use it just as before, with no difference.

This is not only the historic moment for our work but for Transformer models, Large Language Models either. We believe that the characteristic of ideal usage of input information will make large language models such as GPTs (Brown et al., 2020) and LLaMA (Touvron et al., 2023a) much more powerful than before.

Our implementation will be open-source soon. Here we temporarily provide our code in Appendix B for reappearing experiments.

2 Method

2.1 Background: Rotary Position Embedding (RoPE)

Positional encoding plays a important role in Transformer models since it represents the order of inputs. We consider Rotary Position Embedding(RoPE) (Su et al., 2021) here, which is a positional encoding method used by LLaMA model (Touvron et al., 2023a), GPT-NeoX (Black et al., 2022), etc. Suppose the position index is an interger $n\in[0,c)$ and the corresponding input vector ${\mathbf{x}}:=(x_{0},x_{1},...,x_{d-1})$ , where $d$ represents the dimension of the attention head and always even. RoPE defines a vector-valued complex function $f({\mathbf{x}},n)$ as follows:

f({\mathbf{x}},n)=((x_{0}+ix_{d/2})e^{im\theta_{0}},(x_{1}+ix_{1+d/2})e^{im\theta_{1}},...,(x_{d/2-1}+ix_{d-1})e^{im\theta_{d/2-1}})

(1)

where $i:=\sqrt{-1}$ is the imaginary unit and $\theta_{j}=10000^{-2j/d}$ . Attention score after applying RoPE is:

$\displaystyle a(m,n)$	$\displaystyle=\text{Re}(\langle f({\mathbf{q}},m),f({\mathbf{k}},n)\rangle)$	(2)
	$\displaystyle=\text{Re}\left[\sum_{j=0}^{d/2-1}(q_{j}+iq_{j+d/2})(k_{j}-ik_{j+d/2})e^{i(m-n)\theta_{j}}\right]$
	$\displaystyle=\sum_{j=0}^{d/2-1}((q_{j}k_{j}+q_{j+d/2}k_{j+d/2})\text{cos}((m-n)\theta_{j})$
	$\displaystyle\ \ \ \ +(q_{j}k_{j+d/2}-q_{j+d/2}k_{j})\text{sin}((m-n)\theta_{j}))$
	$\displaystyle:=a(m-n)$

Here, ${\mathbf{q}}$ and ${\mathbf{k}}$ denote the query and key vectors for a particular attention head. The attention score $a(m-n)$ is only dependent on relative position $(m-n)$ . It is a beautiful design that works with the Attention mechanism to achieve absolute positional encoding in the way of relative positional encoding.This feature renders RoPE more efficient than other position encoding techniques and is inherently compatible with linear attentions.

2.2 Long-term decay of RoPE

As studied by (Su et al., 2021) before, RoPE has the characteristic of long-term decay:

	$\displaystyle\|a(s)\|$	$\displaystyle=\Bigg{\|}\text{Re}\left[\sum_{j=0}^{d/2-1}h_{j}e^{is\theta_{j}}\right]\Bigg{\|}$		(3)
		$\displaystyle\leq(\max_{i}\|h_{i+1}-h_{i}\|)\sum_{j=0}^{d/2-1}\|S_{j+1}\|$		(3)

where $h_{j}:=(q_{j}+iq_{j+d/2})(k_{j}-ik_{j+d/2})$ and $S_{j}:=\sum_{k=0}^{j-1}e^{is\theta_{k}}$ , Since the value of $\sum_{j=0}^{d/2-1}|S_{j+1}|$ decays with the relative distance $s$ , the attention score decays either.

This is consistent with human understanding of language modeling.

We claim that we could get a much more stronger one with collinear constraint later.

2.3 Anomalous behavior between RoPE and attention matrices

In Equation 2, we show the attention score after applying RoPE as $a(m-n)$ , mathematically, it can be visualized as the inner-product of two complex number after a rotation for any individual $j\in[0,d/2]$ , just like Figure 2.

It intuitively make sense, since position distance can be modeling as one kind of order and the inner-product of two complex number changes with the rotation angle $(m-n)\theta$ .

However, we will show that it is not a qualified order with a technical deficiency.

For simplicity, we consider bidirectional models first, such as Bert (Devlin et al., 2019) and GLM (Du et al., 2021), etc. As shown in Figure 2, for any pair of ${\mathbf{q}}_{j}$ and ${\mathbf{k}}_{j}$ , without loss of generality, we suppose that there is an angle $\theta_{0}$ which is smaller than $\pi$ to rotate counterclockwise from ${\mathbf{k}}_{j}$ to ${\mathbf{q}}_{j}$ in the complex plane, then we have two possible conditions of their position indices(while $=$ is ordinary).

$\bullet$ When $m>n$ , shown as the right part of Figure 2, it’s the order preserving one what we want. Since the attention score decreases when the position distance increase(until they rotate out of $\pi$ , we will discuss this part in Appendix A).

$\bullet$ However, when $m<n$ , shown as the left part of Figure 2, the anomalous behavior which breaks the order at closest tokens with the number of $\theta_{0}/\theta_{j}$ . More terribly, it always accompanies the model whether applying PI (Chen et al., 2023) or NTK-aware Scaled RoPE (bloc97, 2023). Since we could only survive by cutting tail but not head.

For causal models, it also doomed although $m$ is always larger than $n$ . As shown in Figure 3, just for some $j$ when there is an angle $\theta_{0}$ which is smaller than $\pi$ to rotate counterclockwise from ${\mathbf{q}}_{j}$ to ${\mathbf{k}}_{j}$ , instead of ${\mathbf{k}}_{j}$ to ${\mathbf{q}}_{j}$ .

2.4 Collinear Constrained Attention(CoCA)

Follow the analysis in Section 2.3, we can naturally deduce the following method: applying a collinear constraint on any pair of ${\mathbf{q}}_{j}$ and ${\mathbf{k}}_{j}$ .

Formally, let ${\mathbb{S}}_{N}=\{w_{i}\}_{i=1}^{N}$ be a sequence of $N$ input tokens. The corresponding word embedding of ${\mathbb{S}}_{N}$ is denoted as $\mathbb{E}_{N}=\{{\mathbf{x}}_{i}\}_{i=1}^{N}$ , we first get the queries as same as before:

\displaystyle{\mathbf{q}}_{m}={\bm{W}}_{q}{\mathbf{x}}_{m},\forall m\in[1,N]

(4)

Notice that the subscript $m$ here is quite different with $j$ we used in last section, while $m$ here represents the dimension of sequence length and $j$ represents the dimension of hidden size. We abbreviate it here by omitting the dimension of hidden size.

Next, we get the keys in a different way since we have to apply the collinear constraint on it, we get the constraint coefficient first:

		$\displaystyle{\mathbf{t}}_{n}={\bm{W}}_{t}{\mathbf{x}}_{n},\forall n\in[1,N]$		(5)
		$\displaystyle({\mathbf{t}}_{n;[0:d/2-1]},{\mathbf{t}}_{n;[d/2-1:d-1]})={\mathbf{t}}_{n}$
		$\displaystyle{\mathbf{t}}_{n}=(\frac{{\mathbf{t}}_{n;[0:d/2-1]}+{\mathbf{t}}_{n;[d/2-1:d-1]}}{2},\frac{{\mathbf{t}}_{n;[0:d/2-1]}+{\mathbf{t}}_{n;[d/2-1:d-1]}}{2})$
		$\displaystyle{\mathbf{t}}_{n}=\text{Relu}({\mathbf{t}}_{n})$

it could be regard as folding ${\mathbf{t}}_{n}$ in half along the dimension of hidden size and making a copy.

Secondly, we get the keys as follows:

\displaystyle{\mathbf{k}}_{n}={\mathbf{Q}}\circ{\mathbf{t}}_{n}=({\mathbf{q}}_{1}\circ{\mathbf{t}}_{n},...,{\mathbf{q}}_{N}\circ{\mathbf{t}}_{n})

(6)

where $\circ$ represents Hadamard product. We have to claim that ${\mathbf{k}}_{n}$ here has one more additional dimension than before, since it might bring unimaginable memory pressure(exactly $d/h$ times large, where $h$ represents the number of heads).

Fortunately, we can perfectly handle this with tensor contraction, leading to zero increase in memory consumption(see computational and spatial complexity in Section 3.2).

After this, we can apply RoPE almostly the same as before:

		$\displaystyle f({\mathbf{q}}_{m})=f({\mathbf{q}},m)$		(7)
		$\displaystyle f({\mathbf{k}}_{n})={\mathbf{Q}}\circ f({\mathbf{t}},n)$		(7)

where $f$ is defined as Equation 1.

Finally, we get the attention score as follows:

\displaystyle a(m,n)

\displaystyle=\text{Re}(\langle f({\mathbf{q}},m),{\mathbf{q}}_{m}\circ f({\mathbf{t}},n)\rangle)

(8)

Thus we have built the collinear constrained attention(CoCA) here. Review that the initial angle $\theta_{0}$ between ${\mathbf{q}}$ and ${\mathbf{k}}$ we defined in Section 2.3, it always zero now. No more headaches.

3 Theoretical explanation

3.1 Strong form of Long-term decay

As shown in Section 2.2, RoPE has the characteristic of long-term decay:

\displaystyle|a(s)|

\displaystyle\leq(\max_{i}|h_{i+1}-h_{i}|)\sum_{j=0}^{d/2-1}|S_{j+1}|

(9)

For CoCA, we could deduce a much more stronger one as follows:

\displaystyle|a(s)|

\displaystyle\leq(\max_{i}|l_{i+1}-l_{i}|)\sum_{j=0}^{d/2-1}|C_{j+1}|

(10)

where $l_{j}:=|q_{j}+iq_{j+d/2}||k_{j}+ik_{j+d/2}|$ , and $C_{j}:=\sum_{k=0}^{j-1}\text{cos}(s\theta_{j})$ . And we always have:

\displaystyle|l_{i+1}-l_{i}|

\displaystyle\leq|h_{i+1}-h_{i}|

(11)

Proof Just notice that, when the initial angle $\theta_{0}$ between ${\mathbf{q}}$ and ${\mathbf{k}}$ is $0$ , the attention score can be simplified as:

	$\displaystyle a(s)$	$\displaystyle=\text{Re}\left[\sum_{j=0}^{d/2-1}h_{j}e^{is\theta_{j}}\right]$		(12)
		$\displaystyle=\sum_{j=0}^{d/2-1}l_{j}\text{cos}(s\theta_{j})$		(12)

Then follow the study of (Su et al., 2021), will easily get the estimation in 10.

For Inequality 11, with triangle inequality, we have:

\displaystyle|h_{i+1}-h_{i}|\geq||h_{i+1}|-|h_{i}||

(13)

review the definition of $h_{i}=(q_{j}+iq_{j+d/2})(k_{j}-ik_{j+d/2})$ , we have:

$\displaystyle\|h_{i+1}-h_{i}\|$	$\displaystyle\geq\|\|h_{i+1}\|-\|h_{i}\|\|$	(14)
	$\displaystyle=\|\|{\mathbf{q}}_{i+1}{\mathbf{k}}^{}_{i+1}\|-\|{\mathbf{q}}_{i}{\mathbf{k}}^{}_{i}\|\|$
	$\displaystyle=\|\|{\mathbf{q}}_{i+1}{\mathbf{k}}_{i+1}\|-\|{\mathbf{q}}_{i}{\mathbf{k}}_{i}\|\|$
	$\displaystyle=\|l_{i+1}-l_{i}\|$

3.2 Computational and spatial complexity

We assign some notations before analysing, see Table 1.

Table 1: Some Notations

Variable	Notation
embedding-size	$V$
sequence length	$n$
number of layers	$L$
number of heads per layer	$h$
hidden dimension	$d$

Table 2: Computational complexity

COMPONENT	Complexity Of Origin Model	Complexity Of CoCA
QK(T)V projection	$3Vnd+3(L-1)nd^{2}$	$3Vnd+3(L-1)nd^{2}$
T half	-	$Lnd$
T Relu	-	$Lnd$
QK(T) rotary	$2Lnd$	$2Lnd$
$\text{K}=\text{Q}\circ\text{T}$	-	$Ln^{2}d$
$\text{QK}^{\text{T}}$	$Ln^{2}d$	$Ln^{2}d$
Mask	$Ln^{2}$	$Ln^{2}$
Softmax	$Ln^{2}$	$Ln^{2}$

Since $n\gg d$ in commonly used large language models, we can assert that computational complexity of CoCA is nearly 2 times of origin models with component $\text{K}=\text{Q}\circ\text{T}$ from Table 2, which is worthy by comparing such small cost with its excellent performance.

Apart from computational complexity, another important factor which affects the practicality of one model is spatial complexity. As we pointed out after Equation 6, there will be an unimaginable memory pressure without optimization, see Table 3.

The spatial complexity of component $\text{K}=\text{Q}\circ\text{T}$ will become $d/h$ times larger than origin model if fully expanded. It’s about $64\sim 256$ times for commonly used models which is unacceptable for practical use.

Table 3: Spatial complexity

COMPONENT	Complexity Of Origin Model	Complexity Of CoCA
QK(T)V projection	$Lnd/h$	$Lnd/h$
T half	-	$Lnd/h$
T Relu	-	$Lnd/h$
QK(T) rotary	$Lnd/h$	$Lnd/h$
$\text{K}=\text{Q}\circ\text{T}$	-	$Ln^{2}d/h$
$\text{QK}^{\text{T}}$	$Ln^{2}$	$Ln^{2}$
Mask	$Ln^{2}$	$Ln^{2}$
Softmax	$Ln^{2}$	$Ln^{2}$

Review the computational procedure of $\text{QK}^{\text{T}}$ , it could be seen as two steps:

$\bullet$ Element-wise product between Q and K.

$\bullet$ Sum calculation along hidden dimension.

Its spatial complexity will also become $Ln^{2}d/h$ if fully expanded, only if it contracts along hidden dimension before expanding along sequence length, avoiding full expansion. It also works for $\text{K}=\text{Q}\circ\text{T}$ , by combining those two components as follows:

\displaystyle\text{QK}^{\text{T}}=\text{Q}(\text{Q}\circ\text{T})^{\text{T}}

(15)

Thanks to the work of opt_einsum (a. Smith & Gray, 2018), the optimization of Equation 15 can be easily accomplished for commonly used backends, such as torch and tensorflow.

The memory consumption of CoCA gets zero increase with the optimization of Equation 15.

4 Experiments

Owing to GPU constraints, our experiments are not yet fully concluded. Nevertheless, we present some preliminary results that are indicative of the final outcome. We compare our model with LLaMA 7B (Touvron et al., 2023a) and LLaMA2 7B-chat (Touvron et al., 2023b).

4.1 Experimental setting

Model Architecture. We modified GPT-NeoX (Black et al., 2022) by incorporating our proposed CoCA method, as detailed in Section 2.4. For a comprehensive understanding of the implementation, please refer to the code provided in Appendix B. We trained a more compact model consisting of 350M parameters. This configuration includes 24 layers with a hidden dimension of 1024 and 16 attention heads. We’ve set the maximum sequence length to 512 to further conserve GPUs.

Training Data. Our model is trained on a combination of datasets, including the Pile training dataset (Gao et al., 2020), BookCorpus (Zhu et al., 2015), and the Wikipedia Corpus (Foundation, 2021). Additionally, we incorporated open-source code from GitHub with 1+ stars, which we personally collected. From these datasets, we derived a sample of approximately 50B tokens, maintaining a composition of 75% text and 25% code.

Training Procedure. Our training leverages the next-token prediction objective. The optimization is carried out using AdamW (Loshchilov & Hutter, 2017), set with hyperparameters $\beta_{1}=0.9$ and $\beta_{2}=0.95$ . The learning rate adopts a linear warm-up of 1% of total steps, starting from 1e-7. Subsequently, we adjust the learning rate to 1e-4 and linearly decay it to 1e-5. The training harnesses the computational capabilities of 8 A100 GPUs, with a global batch size of 256 and an accumulation of 2 gradient steps. For the implementation, we deploy PyTorch (Paszke et al., 2019) in tandem with Fully Sharded Data Parallel (Zhao et al., 2023). Our models underwent 2 epochs of training, completing within a span of 72 hours.

4.2 Long sequence language modeling

We evaluated the long-sequence language modeling prowess of both our model and the LLaMA 7B model. This evaluation was conducted on 100 documents, each possessing at least 16,384 tokens, randomly sourced from the PG-19 dataset (Rae et al., 2019). This methodology follows the approach taken by (Chen et al., 2023). For each test document, we truncated the content to the initial 16,384 tokens. To evaluate perplexity across varied context window sizes, we utilized a sliding window method, in line with (Press et al., 2021), employing a stride S = 256.

Figure 4 illustrates a noteworthy trend: the perplexity of the LLaMA 7B model rapidly diverges beyond its training length. Conversely, our CoCA model sustains its perplexity at a consistently low plateau, showing only a minuscule uptick even at 24 times its training length. It’s important to highlight that, in its nascent state, CoCA’s perplexity is marginally higher than that of LLaMA 7B. This can be attributed to CoCA’s smaller parameter size, coupled with the fact that LLaMA 7B’s training length is 2,048—quadruple that of CoCA.

4.3 Long-range dependence retrieval

Perplexity is a measure that captures a language model’s proficiency in predicting the next token. However, it doesn’t entirely encompass what we expect from an ideal model. While local attention excels at this task, it often falls short in capturing long-range dependencies.

To further evaluate this, we assessed the CoCA and LLaMA2 7B-chat models using a synthetic evaluation task of passkey retrieval, as proposed by (Mohtashami & Jaggi, 2023). In this task, there is a random passkey hidden in a long document to be identified and retrieved. The prompt format can be seen in Figure 5.

   There is an important info hidden inside a lot of irrelevant
   text. Find it and memorize them. I will quiz you about the
   important information there.
   The grass is green. The sky is blue. The sun is yellow. Here
   we go. There and back again.(filler)
   The pass key is xxxxx. Remember it. xxxxx is the pass key.
   (passkey)
   The grass is green. The sky is blue. The sun is yellow. Here
   we go. There and back again.(filler)
   What is the pass key?

Figure 5: Prompt format for passkey retrieval. The pass key ’xxxxx’ is randomly generated from 10,000 to 99,999

We first repeat the filler to make the prompt longer than the individual sequence length at the first time from 256 to 8,192, then insert the passkey into a random position between the fillers. For any individual sequence length, we make 100 test samples. We check first 64 tokens of model outputs for calculating accuracy.

As depicted in Figure 6, the LLaMA2-7B-chat model, which was trained with a maximum length of 4,096 tokens, demonstrated failures when tested on sequences that were 25% longer than its training length. In contrast, CoCA consistently exhibited a high degree of accuracy, even when the test sequence length was expanded to 16 times its original training length. We will delve deeper into specific instances where the model fell short in Section 4.5. It’s pertinent to note that we employed the NTK-aware RoPE (no fine-tuning) approach during inference for both CoCA and LLaMA2 7B-chat models. Further specifics can be found in Appendix A.

4.4 Behaviour of attention score in extrapolation

Experiments are not complete yet.

4.5 Case study

As shown in Table4, the failure cases of CoCA were not completely wrong, it might be further improved, see Appendix A for more details.

Table 4: Failure cases of CoCA

CoCA result	Ground Truth
228.The grass is green. The sky is green. The sun is yellow. The sun is yellow. The sun is yellow. The sun is yellow. The sun is yellow. The sun is yellow. The sun is yellow. The sun is yellow. The sun is yellow. The grass is green. The	22841
57.The grass is green. The sky is yellow. The sun is yellow. The sun is yellow. The sun is yellow. The sun is yellow. The sun is yellow. The sun is yellow. The sun is yellow. The sun is yellow. The sun is yellow.
The grass is green.	57680

5 Conclusions

In this work, we observed an anomalous behavior between RoPE and attention matrices, which severely disrupts the monotonicity of RoPE, especially at closest positions containing critical information. To address this at its core, we introduced a new self-attention framework called Collinear Constrained Attention (CoCA). We provided mathematical evidence showcasing the superior characteristics of our method, such as a strong form of long-term decay, as well as computational and spatial efficiency for practical applications. Experimental findings confirm that CoCA delivers outstanding performance in both long-sequence language modeling and long-range dependence capturing. Additionally, CoCA seamlessly integrates with existing extrapolation, interpolation techniques, and other optimization methods designed for conventional Transformer models. This adaptability suggests that CoCA has the potential to evolve into an enhanced version of Transformer models.

References

a. Smith & Gray (2018) Daniel G. a. Smith and Johnnie Gray. opt_einsum - a python package for optimizing contraction order for einsum-like expressions. Journal of Open Source Software, 3(26):753, 2018. doi: 10.21105/joss.00753. URL https://doi.org/10.21105/joss.00753.
Black et al. (2022) Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, Usvsn Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pp. 95–136, virtual+Dublin, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.bigscience-1.9. URL https://aclanthology.org/2022.bigscience-1.9.
bloc97 (2023) bloc97. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023. URL https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_modes_to_have/.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020. URL https://api.semanticscholar.org/CorpusID:218971783.
Chen et al. (2023) Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. ArXiv, abs/2306.15595, 2023. URL https://api.semanticscholar.org/CorpusID:259262376.
Del’etang et al. (2022) Gr’egoire Del’etang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Marcus Hutter, Shane Legg, and Pedro A. Ortega. Neural networks and the chomsky hierarchy. ArXiv, abs/2207.02098, 2022. URL https://api.semanticscholar.org/CorpusID:250280065.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019. URL https://api.semanticscholar.org/CorpusID:52967399.
Du et al. (2021) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Annual Meeting of the Association for Computational Linguistics, 2021. URL https://api.semanticscholar.org/CorpusID:247519241.
Foundation (2021) Wikimedia Foundation. Wikimedia downloads, 2021. URL https://dumps.wikimedia.org.
Gao et al. (2020) Leo Gao, Stella Rose Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling. ArXiv, abs/2101.00027, 2020. URL https://api.semanticscholar.org/CorpusID:230435736.
Katharopoulos et al. (2020) Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Franccois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, 2020. URL https://api.semanticscholar.org/CorpusID:220250819.
Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. ArXiv, abs/1711.05101, 2017. URL https://api.semanticscholar.org/CorpusID:3312944.
Mohtashami & Jaggi (2023) Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. ArXiv, abs/2305.16300, 2023. URL https://api.semanticscholar.org/CorpusID:258887482.
Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
Press et al. (2021) Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. ArXiv, abs/2108.12409, 2021. URL https://api.semanticscholar.org/CorpusID:237347130.
Rae et al. (2019) Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. ArXiv, abs/1911.05507, 2019. URL https://api.semanticscholar.org/CorpusID:207930593.
Su et al. (2021) Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. ArXiv, abs/2104.09864, 2021. URL https://api.semanticscholar.org/CorpusID:233307138.
Sun et al. (2022) Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. ArXiv, abs/2212.10554, 2022. URL https://api.semanticscholar.org/CorpusID:254877252.
Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023a. URL https://api.semanticscholar.org/CorpusID:257219404.
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023b. URL https://api.semanticscholar.org/CorpusID:259950998.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Zhao et al. (2023) Yanli Zhao, Andrew Gu, Rohan Varma, Liangchen Luo, Chien chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel. ArXiv, abs/2304.11277, 2023. URL https://api.semanticscholar.org/CorpusID:258297871.
Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 19–27, 2015. URL https://api.semanticscholar.org/CorpusID:6866988.

Appendix A Rotary borders

For simplicity, let’s take the example of Figure 3, there are three borders during rotating. As shown in Figure 7. We use the relative coordinate system by regarding ${\mathbf{q}}_{j}$ as $x$ -axis:

$\bullet$ The first border is ${\mathbf{k}}_{j}$ .

$\bullet$ The second border is $-{\mathbf{q}}_{j}$ .

$\bullet$ The last border is ${\mathbf{q}}_{j}$ .

Every time when the relative angle of ${\mathbf{q}}_{j}$ and ${\mathbf{k}}_{j}$ crossover these borders, the monotonicity of $a(m,n)$ will perform a reversal. Leading the model confusing when extrapolating.

CoCA fundamentally solved the problem of the border of ${\mathbf{k}}_{j}$ , and we applied NTK(no fine-tuning) here in CoCA during inference to reduce the confusion of $-{\mathbf{q}}_{j}$ and ${\mathbf{q}}_{j}$ .

Apart from applying NTK to reduce the confusion of $-{\mathbf{q}}_{j}$ and ${\mathbf{q}}_{j}$ , it might be more effective by limiting the rotary boundary at the beginning of training. We left this for future works.

Appendix B Code

We give an example of our code based on GPT-NeoX (Black et al., 2022). You have to rewrite the code as follows to use CoCA. Due to page width restrictions, if there is a problem with line breaks in the code, please contact me.

The file is transformer.py, you have to modify the definition of attention in ParallelSelfAttention:


    from opt_einsum import contract

    def attention(
        self, query_layer, t_layer, query_rot, value_layer,
        layer_past, attention_mask
    ):
        # ===================================
        # Raw attention scores. [b, np, s, s]
        # ===================================

        # change from baddmm to opt_einsum
        # notice to pip install opt_einsum
        attention_scores = \
                contract(’nbpd,sbpd,nbpd->bpns’,
                          query_layer, # [sq, b, np, hn]
                          t_layer, # [sk, b, np, hn]
                          query_rot, # [sq, b, np, hn]
                          backend=’torch’
                        ) / self.norm_factor

        # ==================================================
        # Update attention mask for inference. [b, np, sq, sk]
        # ==================================================

        if self.use_cache:
            with torch.no_grad():
                attention_mask = attention_mask[
                    ..., : attention_scores.size(3),
                    : attention_scores.size(3)
                ]

        # ===========================
        # Attention probs and dropout
        # ===========================

        if exists(self.rpe):
            rpe = self.rpe(query_layer.size(0), t_layer.size(0))
            attention_scores += rpe  # [1, np, sq, sk]

        if self.pos_emb == "alibi":
            attention_scores = self.alibi_embed(attention_scores)

        # attention scores and attention mask [b, np, sq, sk]
        attention_probs = \
                    self.scale_mask_softmax(attention_scores,
                                            attention_mask)

        # This is actually dropping out entire tokens to attend
        # to, which might seem a bit unusual, but is taken from
        # the original Transformer paper.
        with mpu.get_cuda_rng_tracker().fork():
            attention_probs = \
                    self.attention_dropout(attention_probs)

        # =========================
        # Context layer. [sq, b, hp]
        # =========================

        # value_layer -> context layer.
        # [sk, b, np, hn] --> [b, np, sq, hn]

        # context layer shape: [b, np, sq, hn]
        output_size = (
            value_layer.size(1),
            value_layer.size(2),
            query_layer.size(0),
            value_layer.size(3),
        )

        # change view [sk, b * np, hn]
        value_layer = value_layer.view(
            value_layer.size(0),
            output_size[0] * output_size[1], -1
        )

        # change view [b * np, sq, sk]
        attention_probs = attention_probs.view(
            output_size[0] * output_size[1], output_size[2], -1
        )

        # matmul: [b * np, sq, hn]
        context_layer = torch.bmm(attention_probs,
                                  value_layer.transpose(0, 1))

        # change view [b, np, sq, hn]
        context_layer = context_layer.view(*output_size)
        return context_layer

The other one is the definition of forward in ParallelSelfAttention:


    def forward(self, hidden_states, attention_mask,
                layer_past=None
        ):

        # hidden_states: [sq, b, h]

        # =====================
        # Query, Key, and Value
        # =====================

        # Attention heads [sq, b, h] --> [sq, b, (np * 3 * hn)]
        mixed_x_layer, _ = self.query_key_value(hidden_states)

        # [sq, b, (np * 3 * hn)] --> [sq, b, np, 3 * hn]
        new_tensor_shape = mixed_x_layer.size()[:-1] + (
            self.num_attention_heads_per_partition,
            3 * self.hidden_size_per_attention_head,
        )
        mixed_x_layer = mixed_x_layer.view(*new_tensor_shape)

        # replace key_layer with t_layer
        (query_layer, t_layer, value_layer) = \
            mpu.split_tensor_along_last_dim(mixed_x_layer, 3)

        t_layer_1 = t_layer[..., : t_layer.shape[-1] // 2]
        t_layer_2 = t_layer[..., t_layer.shape[-1] // 2 :]
        t_layer = (t_layer_1+t_layer_2)/2

        t_layer = F.relu(t_layer)

        t_layer = torch.cat((t_layer, t_layer), dim=-1)

        if exists(self.rotary_emb):
            if exists(self.rotary_ndims):
                # partial rotary
                query_rot, query_pass = (
                    query_layer[..., : self.rotary_ndims],
                    query_layer[..., self.rotary_ndims :],
                )
                t_rot, t_pass = (
                    t_layer[..., : self.rotary_ndims],
                    t_layer[..., self.rotary_ndims :],
                )
            else:
                # full rotary
                query_rot, t_rot = query_layer, t_layer

            apply_rotary_fn = (
                apply_rotary_pos_emb_torch if self.bf16 \
                    else apply_rotary_pos_emb
            )

            seq_len = query_layer.shape[0]
            offset = 0
            if exists(layer_past) and layer_past.numel() > 0:
                offset = layer_past[0].shape[0]
                seq_len += offset
            cos, sin = self.rotary_emb(value_layer,
                                       seq_len=seq_len)
            query_rot, t_layer = apply_rotary_fn(
                query_rot, t_rot, cos, sin, offset=offset
            )

            if exists(self.rotary_ndims):
                query_rot = torch.cat((query_rot, query_pass),
                                      dim=-1)
                t_layer = torch.cat((t_layer, t_pass), dim=-1)


        # ==================================
        # Cache key and value for inference
        # ==================================

        if exists(layer_past) and layer_past.numel() > 0:
            past_t, past_value = layer_past
            t_layer = \
                torch.cat((past_t.type_as(t_layer), t_layer),
                          dim=0)
            value_layer = torch.cat(
                (past_value.type_as(value_layer), value_layer),
                dim=0
            )

        if self.use_cache:
            present = torch.stack((t_layer, value_layer))

        if self.use_flash_attention:
            context_layer = \
                self.flash_attention(query_layer, t_layer,
                                     value_layer)
        elif not self.sparse:
            context_layer = self.attention(
                query_layer, t_layer, query_rot, value_layer,
                layer_past, attention_mask
            )
        else:
            context_layer = self.sparse_attention(
                query_layer, t_layer, value_layer, attention_mask
            )

        # [b, np, sq, hn] --> [sq, b, np, hn]
        context_layer = \
            context_layer.permute(2, 0, 1, 3).contiguous()

        # [sq, b, np, hn] --> [sq, b, hp]
        new_context_layer_shape = context_layer.size()[:-2] + (
            self.hidden_size_per_partition,
        )
        context_layer = \
            context_layer.view(*new_context_layer_shape)

        # =================
        # Output. [sq, b, h]
        # =================

        output, bias = self.dense(context_layer)

        if self.use_cache:
            output = [output, present]

        return output, bias

Appendix C Parity check retrieval

Experiments are not completed yet.

$\displaystyle\|h_{i+1}-h_{i}\|$	$\displaystyle\geq\|\|h_{i+1}\|-\|h_{i}\|\|$	(14)
	$\displaystyle=\|\|{\mathbf{q}}_{i+1}{\mathbf{k}}^{}_{i+1}\|-\|{\mathbf{q}}_{i}{\mathbf{k}}^{}_{i}\|\|$
	$\displaystyle=\|\|{\mathbf{q}}_{i+1}{\mathbf{k}}_{i+1}\|-\|{\mathbf{q}}_{i}{\mathbf{k}}_{i}\|\|$
	$\displaystyle=\|l_{i+1}-l_{i}\|$