Some of my answers from TAing CS.462/482 Deep Learning in Fall 2020 that were not homework-specific and might help a general audience.



Are Adagrad/RMSProp momentum methods? Are they first or second-order methods?

Adagrad and RMSProp are still first-order methods: they never require calculating the second-order derivative (the Hessian matrix). They are adaptive (per-parameter) learning rate methods, and are not strictly considered momentum methods.
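
For intuition, here is a minimal sketch (my own, not from the lecture code) of one RMSProp step next to one SGD-with-momentum step. Note that RMSProp only ever touches the gradient `g` (first-order information) and adapts the step size per parameter, whereas momentum accumulates the gradient itself as a velocity.

```python
import numpy as np

def rmsprop_step(w, g, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp update; `cache` is a running average of squared gradients."""
    cache = decay * cache + (1 - decay) * g**2      # per-parameter statistic
    w = w - lr * g / (np.sqrt(cache) + eps)         # adaptive per-parameter step
    return w, cache

def sgd_momentum_step(w, g, v, lr=1e-3, mu=0.9):
    """One SGD-with-momentum update, for comparison: it accumulates the
    gradient itself (a velocity), not its squared magnitude."""
    v = mu * v - lr * g
    return w + v, v
```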


Is there a difference between having something like [2,10], [10,2], [2,2] and taking the softmax with 2 output classes, versus [2,10], [10,2], [2,1] with the last layer having a sigmoid activation? I think of softmax as the multi-class sigmoid, so if we only inspect that portion I believe they’ll give the same result.

What you write is correct. Theoretically they are the same - if we could solve for this analytically, we would get exactly the same thing. But in a neural net implementation there is stochasticity in the initialisation of the weights, and the two networks would follow slightly different gradient descent trajectories. Therefore having two output nodes, \(W \in \mathbb{R}^{2\times2}\), instead of one, \(W \in \mathbb{R}^{2\times1}\), should give us different results in practice, though I expect the difference to be very minimal for modeling linear functions. For modeling more complex functions, I am not sure how significant the difference is, or which is better - one view is that the redundant weights have a regularising effect (because of overparameterisation), whereas another view is that the redundant weights are just redundant and lead to overfitting and slower convergence.
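
Here is a small numeric check (my own sketch, not from the homework) of why the two heads are theoretically equivalent: a 2-way softmax on logits \([z_1, z_2]\) assigns the same probability to class 1 as a sigmoid applied to \(z_1 - z_2\), so the single-output network just has to learn the difference of the two logits.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z1, z2 = 2.3, -0.7
p_softmax = softmax(np.array([z1, z2]))[0]
p_sigmoid = sigmoid(z1 - z2)
print(p_softmax, p_sigmoid)             # both ~0.9526
```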


A CNN always expects a 2D grid, but my image is circular (e.g. a retinal image). How should I deal with “missing” pixels?

Although a 0 doesn’t actually mean “missing” for an image pixel, typically people still fill in 0s, and the network should learn to ignore these “missing” pixels.
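
As an illustration, here is a minimal numpy sketch (the helper name is my own, and it assumes the circular field of view is centred in the image) that zero-fills everything outside the circle before feeding the image to a CNN.

```python
import numpy as np

def zero_fill_outside_circle(img):
    """Set pixels outside a centred, inscribed circle to 0."""
    h, w = img.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= (min(h, w) / 2.0) ** 2
    out = img.copy()
    out[~mask] = 0                      # "missing" pixels set to 0
    return out
```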


I was wondering if someone could explain why it has to be BFS (assuming this means breadth-first search?) for the backward function (for backpropagation)? Going by the process we do on paper, it seems like a DFS would work as well?

Technically, to do backprop, neither BFS nor DFS is correct in general. We might get lucky and compute the gradients correctly if the computational graph just so happens to be structured such that BFS or DFS visits the nodes in a valid order - you would have to check whether this is the case manually. But in general there is no guarantee that they will find the correct order.

What we want is to visit the nodes in the reverse of the order in which we did the forward computation. The technical term for this is a reverse topological sort of a directed acyclic graph. (It’s important that the graph is directed and acyclic, otherwise no such order exists.) One way to get the topological sort is via a modified DFS algorithm, as in the sketch below.
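
Here is a minimal sketch of that idea (the `parents`, `grad`, and `backward` attributes are hypothetical, not the assignment's API): a depth-first post-order traversal yields a topological order of the graph, and visiting it in reverse means every node's gradient is computed only after all of the nodes that depend on it.

```python
def topo_sort(output_node):
    order, visited = [], set()

    def dfs(node):
        if node in visited:
            return
        visited.add(node)
        for parent in node.parents:      # nodes this node was computed from
            dfs(parent)
        order.append(node)               # post-order: appended after all inputs

    dfs(output_node)
    return order                         # forward order; reverse it for backprop

def backward(output_node):
    output_node.grad = 1.0               # seed the gradient at the output
    for node in reversed(topo_sort(output_node)):
        node.backward()                  # each node pushes gradients to its parents
```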


I found this gif on Twitter demonstrating KL divergence asymmetry and thought the class might find it interesting. The replies all marvel at how simple and easy to understand this demonstration is, but I’m afraid I don’t really get it. Can someone explain what the curves are and what this is referring to?

This animation tells us two things: 1) \(KL(p\|q)\) and \(KL(q\|p)\) are both needed to get a full picture of what is happening; a high \(KL(p\|q)\) is very different from a high \(KL(q\|p)\), and if we only look at one of the numbers, we only get part of the picture. 2) Minimising \(KL(p\|q)\) and minimising \(KL(q\|p)\) will give us different curves.

By notation convention in ML, we think of \(p\) as the true distribution that we want to model, and \(q\) as our best estimate of it. Here \(q\) happens to be a Gaussian for modeling and visualisation simplicity - we could fit a more complex curve, but let’s say we are only trying to fit a Gaussian. The true distribution \(p\) is a bimodal distribution (also for visualisation simplicity). Stare at the animation when \(KL(p\|q)\) is high but \(KL(q\|p)\) is low: our green curve \(q\) fits the blue curve \(p\) pretty well, but only for one of its modes. This is called “mode-seeking” (or the exclusive divergence), and if we minimise \(KL(q\|p)\), we will end up with a \(q\) that tries to fit a single mode of \(p\).

Now stare at the animation when \(KL(p\|q)\) is low but \(KL(q\|p)\) is high: our green curve \(q\) tries to stay in the middle of the two modes of \(p\). This is called “mean-seeking”, and if we minimise \(KL(p\|q)\), we will end up with a \(q\) that places most of its probability density around the “mean” of \(p\). There is one case where both \(KL(p\|q)\) and \(KL(q\|p)\) are at their lowest: when our modeling assumption for \(q\) actually fits \(p\) very well, which here happens only because \(p\) is simple enough for a single Gaussian to capture.
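
To make the asymmetry concrete, here is a rough numeric illustration (my own sketch, not the code behind the animation): \(p\) is a bimodal mixture, and we compare a Gaussian \(q\) hugging one mode against a wide Gaussian \(q\) sitting between the modes.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(p, q):
    return np.sum(p * np.log(p / q)) * dx          # discretised KL(p||q)

p = 0.5 * gauss(x, -3, 1) + 0.5 * gauss(x, 3, 1)   # bimodal "true" distribution
q_mode = gauss(x, 3, 1)                            # Gaussian hugging one mode
q_mean = gauss(x, 0, 3)                            # wide Gaussian between the modes

print(kl(p, q_mode), kl(q_mode, p))   # forward KL huge, reverse KL small: mode-seeking fit
print(kl(p, q_mean), kl(q_mean, p))   # forward KL small: mean-seeking / mass-covering fit
```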


Should we assume a certain base (e.g. e, 2, 10) when working with logarithms?

For optimisation, using a different log base doesn’t matter: different bases change the loss only by a constant multiplicative factor, so the parameters of the model that minimize the loss will be the same. But the natural logarithm sometimes makes the math cleaner, for example the derivative \(\frac{d}{dx}\ln(x) = \frac{1}{x}\), and it “de-exponentiates” the \(e^x\) terms found frequently in probability formulas. People may also choose a log base by reporting convention: in information theory, log base 2 is often used when reporting in bits.
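
To see the constant-factor claim explicitly (a quick derivation of my own, assuming the loss is built from log terms, such as a negative log likelihood):

\[
\log_b x = \frac{\ln x}{\ln b}
\quad\Rightarrow\quad
\mathcal{L}_b(\theta) = \frac{1}{\ln b}\,\mathcal{L}_e(\theta)
\quad\Rightarrow\quad
\arg\min_\theta \mathcal{L}_b(\theta) = \arg\min_\theta \mathcal{L}_e(\theta),
\]

since \(\ln b > 0\) for any base \(b > 1\).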


Are we supposed to take the softmax function and its derivative or the negative log likelihood loss function and its derivative?

The softmax function only gives us a way to normalise scores into something that looks like a probability distribution. We then need to take the derivative of the loss function \(\mathcal{L}(x, W, b, y)\), which for logistic regression is the negative log likelihood loss. Where do we get “likelihood values” from? From the softmax function :) Some jargon: the softmax function together with the negative log likelihood loss (or cross entropy loss) is sometimes called the softmax loss.
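
Here is a minimal numpy sketch (my own, not the assignment's expected interface) of that combination: softmax turns scores into probabilities, the negative log likelihood of the true class is the loss, and the gradient of the combined "softmax loss" with respect to the scores works out to probabilities minus the one-hot target.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                         # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_nll(scores, true_class):
    probs = softmax(scores)
    loss = -np.log(probs[true_class])       # negative log likelihood of the true class
    grad = probs.copy()
    grad[true_class] -= 1.0                 # d(loss)/d(scores) = probs - one_hot
    return loss, grad

loss, grad = softmax_nll(np.array([2.0, 1.0, -1.0]), true_class=0)
```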


Do we maximize only the log-likelihood of the true class as mentioned in the lecture or can we use the Softmax Loss (with cross-entropy loss)?

Mathematically they are equivalent, and this is not limited to the one-hot encoding case - we could have the target probability spread across multiple classes. They are also equivalent for discrete and continuous distributions, as long as the underlying probability distribution we are assuming is the same, e.g. Bernoulli, categorical, Gaussian. You may be wondering why two names exist for the same thing… They arose from different fields: negative log likelihood comes from statistics and Maximum Likelihood Estimation, whereas cross entropy comes from Information Theory.
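
A quick numeric check (illustrative only, not from the homework) that cross entropy with a one-hot target reduces to the negative log likelihood of the true class, and that the same formula still works when the target mass is spread across classes:

```python
import numpy as np

probs = np.array([0.7, 0.2, 0.1])            # model's predicted distribution
one_hot = np.array([1.0, 0.0, 0.0])          # true class is class 0

cross_entropy = -np.sum(one_hot * np.log(probs))
nll_true_class = -np.log(probs[0])
print(cross_entropy, nll_true_class)         # both 0.3567 (= -ln 0.7)

soft_target = np.array([0.8, 0.1, 0.1])      # target mass spread across classes
print(-np.sum(soft_target * np.log(probs)))  # cross entropy is still well defined
```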


And finally, not written by me, but my favorite piece of wisdom from the Professor:

Here’s some advice (not only for you but everyone): you should stop thinking only in terms of accuracy but try to understand why you observe a certain behavior. In this specific instance, how does the problem relate to concepts introduced in the lectures? Can you explain why this may happen? This is about demonstrating your understanding, not randomly perturbing the system as often as possible to try and get high accuracy.



Acknowledgements.

Thanks to Professor Mathias Unbernath for taking me on as a TA for the class despite it being Computer Vision heavy, and for allowing me to teach transformers and self-attention. Shoutout to the teaching team for a great semester: my co-TAs Hao Ding and Gao Cong, and CAs Matt Figdore, Nan Huo, Shreya Chakraborty, and Shreyash Kumar. And not forgetting the students who asked questions and engaged with the material.