This paper studies empirical scaling laws for language model performance on the cross-entropy loss, showing that the loss scales as a power-law with model size, dataset size, and the amount of compute used for training. It finds that larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
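For concreteness, the separate scaling relations the paper reports each take a simple power-law form; the lines below are a paraphrased sketch in the paper's notation, not a verbatim quote of its fitted equations:

L(N) = (N_c / N)^{\alpha_N}, \qquad L(D) = (D_c / D)^{\alpha_D}, \qquad L(C) = (C_c / C)^{\alpha_C}

where L is the cross-entropy loss, N the model size (the paper counts non-embedding parameters), D the dataset size in tokens, C the training compute, and N_c, D_c, C_c together with the exponents \alpha are constants fit to the observed trends.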