PDF to MP3

Scaling Laws for Neural Language Models

This paper studies empirical scaling laws for language model performance on the cross-entropy loss, showing that the loss scales as a power law with model size, dataset size, and the amount of compute used for training. It finds that larger models are significantly more sample-efficient, such that...
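The power-law relationship for model size can be sketched as follows. This is an illustrative snippet, not part of the paper; the constants are the approximate fitted values the paper reports for loss as a function of non-embedding parameter count, and the function name is ours:

```python
# Illustrative sketch of the power-law form from the paper:
#   L(N) ≈ (N_c / N) ** alpha_N
# where N is the non-embedding parameter count. The constants below are
# the approximate fitted values reported in the paper (assumption: data
# and compute are not the limiting factors).

ALPHA_N = 0.076   # power-law exponent for model size
N_C = 8.8e13      # scale constant, in non-embedding parameters

def loss_from_model_size(n_params: float) -> float:
    """Predicted cross-entropy test loss for a model with n_params
    non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

# Larger models give lower predicted loss:
for n in (1e6, 1e8, 1e10):
    print(f"N = {n:.0e}  ->  L ≈ {loss_from_model_size(n):.2f}")
```

Because the exponent is small (≈0.076), each 10× increase in model size shaves off only a modest, roughly constant fraction of the loss, which is why the paper's curves look like straight lines on log-log axes.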


About

You can listen to the original text or get a simplified version. We distill complex texts, including math, into listener-friendly formats. Upload. Convert. Understand.

PDFtoMP3.com