References

Bibliographic entries cited in the LightningLM Cookbook. Where available, each entry lists the cookbook sections that cite it.

  • Aghajanyan

    Aghajanyan (2020). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. arXiv:2012.13255. [link →]

    Cited in: §2, §9

  • Brahmic

    Shravan (2026). BrahmicTokenizer-131K: A 131,072-token tokenizer for English and the major Brahmic scripts. arXiv:2605.29379. [link →]

    Cited in: §1, §2, §3

  • DeltaNet

    Schlag (2021). Linear Transformers Are Secretly Fast Weight Programmers. arXiv:2102.11174. [link →]

    Cited in: §2, §3

  • DepthDelusion

    Fahim (2026). The Depth Delusion: Why Transformers Should Be Wider, Not Deeper. arXiv:2601.20994. [link →]

    Cited in: §2, §3

  • DroPE

    Gelberg (2025). Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings. arXiv:2512.12167. [link →]

    Cited in: §5

  • DropUpcycling

    Nakamura (2025). Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization. arXiv:2502.19261. [link →]

    Cited in: §2, §5

  • Galanti

    Galanti (2022). SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network. arXiv:2206.05794. [link →]

    Cited in: §2

  • GatedDeltaNet

    Yang (2024). Parallelizing Linear Transformers with the Delta Rule over Sequence Length. arXiv:2406.06484. [link →]

    Cited in: §2, §3

  • GeLoRA

    Ed-dib (2024). GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning. arXiv:2412.09250. [link →]

    Cited in: §2

  • Gemma2

    Gemma Team (2024). Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118. [link →]

    Cited in: §2

  • GPT4

    OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774. [link →]

    Cited in: §2

  • GroveMoE

    Wu (2025). Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts. arXiv:2508.07785. [link →]

    Cited in: §2

  • HyperConnections

    Zhu (2024). Hyper-Connections. arXiv:2409.19606. [link →]

    Cited in: §2, §3

  • Innovator

    Liao (2025). Innovator: Scientific Continued Pretraining with Fine-grained MoE Upcycling. arXiv:2507.18671. [link →]

    Cited in: §2

  • Kobayashi

    Kobayashi (2024). Weight decay induces low-rank attention layers. arXiv:2410.23819. [link →]

    Cited in: §2

  • Kronecker

    Shravan (2026). Kronecker Embeddings: Compressing Token Embedding Tables by Two Orders of Magnitude. arXiv:2605.29459. [link →]

    Cited in: §1, §2, §3

  • LLaMAPro

    Wu (2024). LLaMA Pro: Progressive LLaMA with Block Expansion. arXiv:2401.02415. [link →]

    Cited in: §2

  • LoRA

    Hu (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685. [link →]

    Cited in: §2

  • mHC

    Xie (2026). Manifold-Constrained Hyper-Connections. arXiv:2512.24880. [link →]

  • Nemotron

    He (2024). Upcycling Large Language Models into Mixture of Experts. arXiv:2410.07524. [link →]

    Cited in: §2

  • OPUS

    Wang (2026). OPUS: Towards Efficient and Principled Data Selection in LLM Pre-training in Every Iteration. arXiv:2602.05400. [link →]

    Cited in: §2, §8

  • PeriodicLoRA

    Meng (2024). PeriodicLoRA: Breaking the Low-Rank Bottleneck in LoRA Optimization. arXiv:2402.16141. [link →]

    Cited in: §2, §9

  • QDyLoRA

    Rajabzadeh (2024). QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning. arXiv:2402.10462. [link →]

    Cited in: §2

  • QwenMoE

    Qwen Team (2024). Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters. [link →]

    Cited in: §2

  • ReLoRA

    Lialin (2023). ReLoRA: High-Rank Training Through Low-Rank Updates. arXiv:2307.05695. [link →]

    Cited in: §2, §9

  • ReLoRASLM

    Weiss (2025). Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models. arXiv:2509.12960. [link →]

    Cited in: §2, §9

  • SinkhornKnopp

    Sinkhorn (1967). Concerning nonnegative matrices and doubly stochastic matrices.

    Cited in: §2, §3

  • SOLAR

    Kim (2023). SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling. arXiv:2312.15166. [link →]

    Cited in: §2

  • SparseAttn

    Yuan (2025). Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. arXiv:2502.11089. [link →]

    Cited in: §2

  • SparseUpcycling

    Komatsuzaki (2022). Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. arXiv:2212.05055. [link →]

    Cited in: §2

  • TurboQuant

    Zandieh (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arXiv:2504.19874. [link →]

    Cited in: §2, §9

  • Yunis

    Yunis (2024). Approaching Deep Learning through the Spectral Dynamics of Weights. arXiv:2408.11804. [link →]

    Cited in: §2