Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests that a larger mo...