References

Bibliographic entries cited in the LightningLM Cookbook. Each entry lists the cookbook sections that cite it.

Aghajanyan

Aghajanyan (2020). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. arXiv:2012.13255.

Cited in: §2, §9

Brahmic

Shravan (2026). BrahmicTokenizer-131K: A 131,072-token tokenizer for English and the major Brahmic scripts. arXiv:2605.29379.

Cited in: §1, §2, §3

DeltaNet

Schlag (2021). Linear Transformers Are Secretly Fast Weight Programmers. arXiv:2102.11174.

Cited in: §2, §3

DepthDelusion

Fahim (2026). The Depth Delusion: Why Transformers Should Be Wider, Not Deeper. arXiv:2601.20994.

Cited in: §2, §3

DroPE

Gelberg (2025). Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings. arXiv:2512.12167.

Cited in: §5

DropUpcycling

Nakamura (2025). Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization. arXiv:2502.19261.

Cited in: §2, §5

Galanti

Galanti (2022). SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network. arXiv:2206.05794.

Cited in: §2

GatedDeltaNet

Yang (2024). Parallelizing Linear Transformers with the Delta Rule over Sequence Length. arXiv:2406.06484.

Cited in: §2, §3

GeLoRA

Ed-dib (2024). GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning. arXiv:2412.09250.

Cited in: §2

Gemma2

Gemma Team (2024). Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118.

Cited in: §2

GPT4

OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774.

Cited in: §2

GroveMoE

Wu (2025). Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts. arXiv:2508.07785.

Cited in: §2

HyperConnections

Zhu (2024). Hyper-Connections. arXiv:2409.19606.

Cited in: §2, §3

Innovator

Liao (2025). Innovator: Scientific Continued Pretraining with Fine-grained MoE Upcycling. arXiv:2507.18671.

Cited in: §2

Kobayashi

Kobayashi (2024). Weight decay induces low-rank attention layers. arXiv:2410.23819.

Cited in: §2

Kronecker

Shravan (2026). Kronecker Embeddings: Compressing Token Embedding Tables by Two Orders of Magnitude. arXiv:2605.29459.

Cited in: §1, §2, §3

LLaMAPro

Wu (2024). LLaMA Pro: Progressive LLaMA with Block Expansion. arXiv:2401.02415.

Cited in: §2

LoRA

Hu (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.

Cited in: §2

mHC

Xie (2026). Manifold-Constrained Hyper-Connections. arXiv:2512.24880.

Nemotron

He (2024). Upcycling Large Language Models into Mixture of Experts. arXiv:2410.07524.

Cited in: §2

OPUS

Wang (2026). OPUS: Towards Efficient and Principled Data Selection in LLM Pre-training in Every Iteration. arXiv:2602.05400.

Cited in: §2, §8

PeriodicLoRA

Meng (2024). PeriodicLoRA: Breaking the Low-Rank Bottleneck in LoRA Optimization. arXiv:2402.16141.

Cited in: §2, §9

QDyLoRA

Rajabzadeh (2024). QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning. arXiv:2402.10462.

Cited in: §2

QwenMoE

Qwen Team (2024). Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters.

Cited in: §2

ReLoRA

Lialin (2023). ReLoRA: High-Rank Training Through Low-Rank Updates. arXiv:2307.05695.

Cited in: §2, §9

ReLoRASLM

Weiss (2025). Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models. arXiv:2509.12960.

Cited in: §2, §9

SinkhornKnopp

Sinkhorn (1967). Concerning nonnegative matrices and doubly stochastic matrices.

Cited in: §2, §3

SOLAR

Kim (2023). SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling. arXiv:2312.15166.

Cited in: §2

SparseAttn

Yuan (2025). Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. arXiv:2502.11089.

Cited in: §2

SparseUpcycling

Komatsuzaki (2022). Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. arXiv:2212.05055.

Cited in: §2

TurboQuant

Zandieh (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arXiv:2504.19874.

Cited in: §2, §9

Yunis

Yunis (2024). Approaching Deep Learning through the Spectral Dynamics of Weights. arXiv:2408.11804.

Cited in: §2
