[Submitted on 3 Mar 2025]
Abstract: Deep neural networks are often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behaviour include benign overfitting, double descent, and the success of overparametrization. We argue that these phenomena are not distinct to neural networks, or particularly mysterious.
10 Comments
cgdl
Agreed, but PAC-Bayes and other descendants of VC theory are probably not the best explanation. The notion of algorithmic stability provides a (much) more compelling explanation. See [1] (particularly Sections 11 and 12).
[1] https://arxiv.org/abs/2203.10036
TechDebtDevin
Anyone who wants to demystify ML should read The StatQuest Illustrated Guide to Machine Learning [0] by Josh Starmer.
To this day I haven't found a teacher who can express complex ideas as clearly and concisely as Starmer does. It's written in an almost children's-book-like format that is very easy to read and understand. He has also just published a book on neural networks that is just as good. Highly recommended even if you are already an expert, as it will give you great ways to teach and communicate complex ideas in ML.
[0]: https://www.goodreads.com/book/show/75622146-the-statquest-i…
getnormality
> rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem.
How does deep learning do this? The last time I was deeply involved in machine learning, we used a penalized-likelihood approach. To find a good model for the data, you would optimize a cost function over model space, where the cost function was the sum of two terms: one quantifying the difference between model predictions and data, and the other quantifying the model's complexity. This framework encodes exactly a "soft preference for simpler solutions that are consistent with the data," but is that how deep learning works? I had the impression that the way complexity is penalized in deep learning was less straightforward than that.
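For concreteness, the penalized-likelihood setup I mean looks like this; a minimal PyTorch sketch, where the architecture and the penalty strength lam are illustrative choices, not anything from the paper:

    import torch
    import torch.nn as nn

    # Toy data: y = 3x + noise
    x = torch.linspace(-1, 1, 64).unsqueeze(1)
    y = 3 * x + 0.1 * torch.randn_like(x)

    model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    lam = 1e-3  # complexity penalty strength (illustrative)

    for _ in range(200):
        opt.zero_grad()
        data_fit = ((model(x) - y) ** 2).mean()  # misfit between predictions and data
        complexity = sum((p ** 2).sum() for p in model.parameters())  # L2 complexity term
        loss = data_fit + lam * complexity  # the two-term penalized objective
        loss.backward()
        opt.step()

(As I understand it, deep learning does use exactly this term whenever people enable weight decay, e.g. the weight_decay argument of the standard optimizers; the less straightforward part is that much of the preference for simple solutions is implicit in the architecture and the optimizer rather than written into the cost function.)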
inciampati
An interesting example in which "deep" networks are necessary is discussed in this fascinating and popular recent paper on RNNs [1]. Even though the minGRU and minLSTM models it proposes don't explicitly model ordered state dependencies, they can learn them as long as they are deep enough (depth >= 3):
> Instead of explicitly modelling dependencies on previous states to capture long-range dependencies, these kinds of recurrent models can learn them by stacking multiple layers.
[1] https://arxiv.org/abs/2410.01201
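For intuition, the minGRU recurrence is tiny: the gate and the candidate state depend only on the current input, never on the previous hidden state. A rough numpy sketch of the recurrence and the stacking, where the dimensions and random initialization are illustrative, and the sequential loop stands in for the parallel scan the paper actually uses:

    import numpy as np

    def sigmoid(a):
        return 1 / (1 + np.exp(-a))

    def min_gru_layer(xs, Wz, Wh):
        """One minGRU layer: z_t and h~_t depend only on x_t, not on h_{t-1}."""
        h = np.zeros(Wh.shape[0])
        hs = []
        for x in xs:
            z = sigmoid(Wz @ x)              # update gate, from the input alone
            h_tilde = Wh @ x                 # candidate state, from the input alone
            h = (1 - z) * h + z * h_tilde    # blend old state with candidate
            hs.append(h)
        return np.array(hs)

    # Stacking: each layer's hidden sequence is the next layer's input.
    rng = np.random.default_rng(0)
    d = 8
    xs = rng.normal(size=(16, d))  # toy sequence of length 16
    for _ in range(3):             # "deep >= 3"
        Wz, Wh = rng.normal(size=(2, d, d)) * 0.5
        xs = min_gru_layer(xs, Wz, Wh)

Within a single layer, h_t is just an input-driven moving average, so one layer alone can't condition on what happened earlier; with stacking, later layers gate on summaries that earlier layers computed across time, which is where the cross-timestep dependencies come from.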
YesBox
I wish I had the time to try this:
1.) Grab many GBs of text (books, etc).
2.) For each word, for each of the next $N words, increment a count for that (word pair, distance) combination.
3.) For each word, store the most frequent following word at each distance up to $N. [a]
4.) Create a prediction algorithm that determines the next word (or set of words) to output from any user input. Basically this would look up the word-pair/distance counts and pick the most probable next word(s).
How close would this be to GPT-2? (A rough sketch of steps 2-4 follows after the footnote.)
[a] You could go one step further and store multiple words for each distance, ordered by frequency
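Here's a rough sketch of steps 2 through 4, assuming whitespace tokenization and greedy decoding; all names and the toy corpus are illustrative:

    from collections import Counter, defaultdict

    N = 4  # how many positions ahead to record

    def build_table(text):
        """Steps 2-3: count (word, distance) -> following-word frequencies."""
        words = text.lower().split()
        table = defaultdict(Counter)
        for i, w in enumerate(words):
            for d in range(1, N + 1):
                if i + d < len(words):
                    table[(w, d)][words[i + d]] += 1
        return table

    def predict(table, prompt, length=10):
        """Step 4: score candidates by summing pair/distance counts."""
        out = prompt.lower().split()
        for _ in range(length):
            scores = Counter()
            for d in range(1, N + 1):
                if d <= len(out):
                    scores.update(table[(out[-d], d)])  # votes from the word d positions back
            if not scores:
                break
            out.append(scores.most_common(1)[0][0])
        return " ".join(out)

    table = build_table("the cat sat on the mat the cat ate the rat")
    print(predict(table, "the cat"))

As for GPT-2: this is essentially a positional n-gram counting model. It can reproduce local phrasing from the corpus, but with no learned representations it can't generalize to word combinations it never saw, which is a large part of what GPT-2's learned embeddings and attention buy you.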
EncomLab
The implication that any software is "mysterious" is problematic; there is no "woo" here. The exact state of the machine running the software can be determined at every cycle: the exact instruction and the data it executed with can be precisely determined, as can the next instruction. The entire mythos of any software being a "black box" is just so much advertising jargon, perpetuated by tech bros who want to believe they are part of some self-styled, Mr. Robot priestly class.
rottc0dd
If anyone wants to get into machine learning, one of the best resources I have found is Stanford's "Probability for Computer Scientists" (https://www.youtube.com/watch?v=2MuDZIAzBMY&list=PLoROMvodv4…).
It digs into the theoretical underpinnings of probability theory and ML, IMO better than any other course I have seen. (Yeah, Andrew Ng is legendary, but his course demands some mathematical familiarity with linear algebra.)
And of course, for deep learning, 3b1b is great for a visual introduction (https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQ…).
buffalobuffalo
When I was first getting into Deep Learning, learning the proof of the universal approximation theorem helped a lot. Once you understand why neural networks are able to approximate functions, it makes everything built on top of them much easier to understand.
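One way to see the theorem in action without any training: fix a wide layer of random tanh features and fit only the output layer by least squares. A minimal numpy sketch; the width, scales, and target function are arbitrary choices, and this random-features shortcut is just an illustration, not how deep networks are actually trained:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
    y = np.sin(3 * x)  # continuous target function to approximate

    width = 500                          # one wide hidden layer
    W = rng.normal(size=(1, width)) * 3  # random, fixed hidden weights
    b = rng.normal(size=width) * 3
    H = np.tanh(x @ W + b)               # hidden activations, shape (200, width)

    # Solve for the output weights alone by least squares.
    v, *_ = np.linalg.lstsq(H, y, rcond=None)
    print("max approximation error:", np.abs(H @ v - y).max())

The theorem guarantees that a wide enough single hidden layer can approximate any continuous function on a compact set; it says nothing about how training finds that approximation, which is a separate story.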
talles
Correct me if I'm wrong, but an artificial neuron is just good old linear regression followed by an activation function to make it nonlinear. Make a network out of them and cool stuff happens.
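In code, a single neuron really is just that; a tiny numpy sketch (weights and activation chosen arbitrarily):

    import numpy as np

    def neuron(x, w, b):
        """Affine map (the linear-regression part) followed by a nonlinearity."""
        return np.tanh(w @ x + b)

    x = np.array([0.5, -1.0, 2.0])  # inputs
    w = np.array([0.1, 0.4, -0.2])  # weights
    print(neuron(x, w, b=0.3))

Strictly speaking, it's the same affine form as linear regression, but the weights are learned by backpropagation through the whole network rather than fit in closed form.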