Bibliographic entries cited in the LightningLM Cookbook. Each entry lists the cookbook sections that cite it.
-
Aghajanyan
Aghajanyan (2020). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. arXiv:2012.13255.
Cited in: §2, §9
-
Brahmic
Shravan (2026). BrahmicTokenizer-131K: A 131,072-token tokenizer for English and the major Brahmic scripts. arXiv:2605.29379.
Cited in: §1, §2, §3
-
DeltaNet
Schlag (2021). Linear Transformers Are Secretly Fast Weight Programmers. arXiv:2102.11174.
Cited in: §2, §3
-
DepthDelusion
Fahim (2026). The Depth Delusion: Why Transformers Should Be Wider, Not Deeper. arXiv:2601.20994.
Cited in: §2, §3
-
DroPE
Gelberg (2025). Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings. arXiv:2512.12167.
Cited in: §5
-
DropUpcycling
Nakamura (2025). Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization. arXiv:2502.19261.
Cited in: §2, §5
-
Galanti
Galanti (2022). SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network. arXiv:2206.05794.
Cited in: §2
-
GatedDeltaNet
Yang (2024). Parallelizing Linear Transformers with the Delta Rule over Sequence Length. arXiv:2406.06484.
Cited in: §2, §3
-
GeLoRA
Ed-dib (2024). GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning. arXiv:2412.09250.
Cited in: §2
-
Gemma2
Gemma Team (2024). Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118.
Cited in: §2
-
GPT4
OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774.
Cited in: §2
-
GroveMoE
Wu (2025). Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts. arXiv:2508.07785.
Cited in: §2
-
HyperConnections
Zhu (2024). Hyper-Connections. arXiv:2409.19606.
Cited in: §2, §3
-
Innovator
Liao (2025). Innovator: Scientific Continued Pretraining with Fine-grained MoE Upcycling. arXiv:2507.18671.
Cited in: §2
-
Kobayashi
Kobayashi (2024). Weight decay induces low-rank attention layers. arXiv:2410.23819.
Cited in: §2
-
Kronecker
Shravan (2026). Kronecker Embeddings: Compressing Token Embedding Tables by Two Orders of Magnitude. arXiv:2605.29459.
Cited in: §1, §2, §3
-
LLaMAPro
Wu (2024). LLaMA Pro: Progressive LLaMA with Block Expansion. arXiv:2401.02415.
Cited in: §2
-
LoRA
Hu (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
Cited in: §2
-
mHC
Xie (2026). Manifold-Constrained Hyper-Connections. arXiv:2512.24880.
-
Nemotron
He (2024). Upcycling Large Language Models into Mixture of Experts. arXiv:2410.07524.
Cited in: §2
-
OPUS
Wang (2026). OPUS: Towards Efficient and Principled Data Selection in LLM Pre-training in Every Iteration. arXiv:2602.05400.
Cited in: §2, §8
-
PeriodicLoRA
Meng (2024). PeriodicLoRA: Breaking the Low-Rank Bottleneck in LoRA Optimization. arXiv:2402.16141.
Cited in: §2, §9
-
QDyLoRA
Rajabzadeh (2024). QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning. arXiv:2402.10462.
Cited in: §2
-
QwenMoE
Qwen Team (2024). Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters.
Cited in: §2
-
ReLoRA
Lialin (2023). ReLoRA: High-Rank Training Through Low-Rank Updates. arXiv:2307.05695.
Cited in: §2, §9
-
ReLoRASLM
Weiss (2025). Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models. arXiv:2509.12960.
Cited in: §2, §9
-
SinkhornKnopp
Sinkhorn (1967). Concerning nonnegative matrices and doubly stochastic matrices.
Cited in: §2, §3
-
SOLAR
Kim (2023). SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling. arXiv:2312.15166.
Cited in: §2
-
SparseAttn
Yuan (2025). Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. arXiv:2502.11089.
Cited in: §2
-
SparseUpcycling
Komatsuzaki (2022). Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. arXiv:2212.05055.
Cited in: §2
-
TurboQuant
Zandieh (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arXiv:2504.19874.
Cited in: §2, §9
-
Yunis
Yunis (2024). Approaching Deep Learning through the Spectral Dynamics of Weights. arXiv:2408.11804.
Cited in: §2