Gradient descent, Hessian(E(W^T X)) = cov(X), why mean =/= 0?

In summary, the paper discusses the backpropagation algorithm and uses the covariance matrix of the inputs (as the Hessian of the squared error) to develop a weight update rule, assuming the inputs have zero mean, as is common in neural networks. A responder suggests the author is performing "whitening", which normalizes the data so it has a mean of zero and a covariance matrix equal to the identity. The thread also touches on a "standard Fresnel representation" for the determinant of a symmetric matrix, which is related to multivariate Gaussian integrals.
  • #1
NotASmurf
With the backpropagation error E(X, W) = 0.5(target - W^T X)^2, the paper I'm reading notes that the covariance matrix of the inputs equals the Hessian, and it uses that to develop its weight update rule V(k+1) = V(k) + D*V(k), a slightly modified (not relevant to my question) version of ordinary feedforward gradient descent. But the author assumes a mean of zero for the inputs, just like in all other neural networks. Doesn't shifting the mean leave the covariance matrix unchanged, so that the eigenvectors, the entropy H (which depends only on |cov(X)|), and the second derivative shouldn't change? It's not even a conjugate prior, so it's not as if I'm encoding some terrible prior beliefs if we view the network as a distribution mapper. Why does the mean being zero matter here, and ostensibly in all ANNs? Any help appreciated.
 
  • #2
Sounds like the author is performing "whitening". Given some data set, center it, project onto the eigenbasis of the covariance, and divide by the square roots of the eigenvalues, and you get normalized data. If the data is multivariate Gaussian, then the data now has a mean of zero with a covariance matrix equal to the identity. It's a pretty common practice in image processing, unless there's a lot of white noise.
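For what it's worth, a minimal whitening sketch along those lines (assuming NumPy; the toy data is illustrative):

```python
import numpy as np

# Whitening: center the data, rotate into the eigenbasis of the covariance,
# and rescale each direction by 1/sqrt(eigenvalue).  The result has zero mean
# and (approximately) identity covariance.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[2.0, -1.0], cov=[[3.0, 1.0], [1.0, 2.0]], size=5000)

Xc = X - X.mean(axis=0)                     # zero-mean
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
X_white = Xc @ eigvecs / np.sqrt(eigvals)   # rotate, then rescale each axis

print(np.cov(X_white, rowvar=False))        # ~ identity matrix
```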

edit: Actually misread the post, ignore me :)
 
  • #3
First, thanks for responding.
"If the data is multivariate Gaussian" it is a multivariate gaussian normal yes.

"data now as a mean of zero with a covariance matrix equal to the identity." but he also performs a gaussian integral on det(lambda*I - cov(X)), if cov(X) = I then it would have no eigenvalues, which need to be unique for this algorithm to work, and B) lambda*I - cov(X) is used where cov(X) inverse should be used, i know that A-lambda*I has no inverse, I don't know the logic behind him using lambda*I-A,

Also, he says he's using something called the "standard Fresnel representation for the determinant of a symmetric matrix R", but searching for "Fresnel representation determinant" doesn't turn up anything that looks "standard" at all (the paper is over 25 years old, to be fair). Do you have any idea what that is? Because he seems to just be doing some multivariate Gaussian optimization.
 
  • #4
Without reading the paper, it's hard to comment. However, there is a relationship between the Fresnel integral and the multivariate Gaussian, but I'm not well versed enough to say anything meaningful about it. I simply recall that in grad school a friend of mine researched random matrices, and his thesis was on that relationship.
 
  • #5
"Fresnel integral" Thank you! that's just what i was looking for, the distrubutions derived for the eigenvalues from random matrices looks less arbitrary now. Last thing,

"data now as a mean of zero with a covariance matrix equal to the identity." are you saying that if mean is zero cov(X) is a diagonal with all dimensions having same magnitude?
 

Related to Gradient descent, Hessian(E(W^T X)) = cov(X), why mean =/= 0?

1. What is gradient descent and how does it work?

Gradient descent is an optimization algorithm used to minimize a cost function by iteratively adjusting the parameters of a model. It works by calculating the gradient of the cost function with respect to each parameter and then updating the parameters in the opposite direction of the gradient in order to reach the minimum of the cost function.
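A toy sketch of that update, using the squared error from this thread (assuming NumPy; the data, learning rate, and iteration count are illustrative):

```python
import numpy as np

# Toy gradient descent on E(w) = 0.5*(target - w^T x)^2, averaged over a
# small synthetic dataset.  All names here are illustrative, not from the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # inputs, one sample per row
w_true = np.array([1.0, -2.0, 0.5])
targets = X @ w_true                   # noiseless targets for the demo

w = np.zeros(3)
lr = 0.1                               # learning rate
for _ in range(500):
    residual = targets - X @ w         # (target - w^T x) for each sample
    grad = -(X.T @ residual) / len(X)  # gradient of the mean squared error
    w -= lr * grad                     # step opposite the gradient

print(w)                               # approaches w_true
```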

2. What is the Hessian matrix and why is it important in gradient descent?

The Hessian matrix is the matrix of second-order partial derivatives of a multivariate function. In gradient descent and related second-order methods, the Hessian describes the curvature of the cost function, which can be used to scale the step along each direction and so improve the efficiency and accuracy of the optimization process.
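As a concrete illustration with the error used in this thread, the Hessian is the matrix of second derivatives with respect to the weights:

```latex
E(W) = \tfrac{1}{2}\,\mathbb{E}\!\left[(t - W^{\mathsf T}X)^{2}\right],
\qquad
\frac{\partial E}{\partial W} = -\,\mathbb{E}\!\left[(t - W^{\mathsf T}X)\,X\right],
\qquad
H = \frac{\partial^{2} E}{\partial W\,\partial W^{\mathsf T}} = \mathbb{E}\!\left[X X^{\mathsf T}\right].
```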

3. How is the Hessian related to the covariance matrix?

For the squared error E(W) = 0.5*E[(target - W^T X)^2], the Hessian with respect to the weights is the second-moment matrix E[X X^T] of the inputs. When the inputs have zero mean, this is exactly the covariance matrix cov(X), which measures the linear relationships between the input variables.
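Writing out the second-moment matrix makes the role of the input mean explicit (same notation as above, with μ the mean of X):

```latex
H = \mathbb{E}\!\left[X X^{\mathsf T}\right]
  = \operatorname{cov}(X) + \mu\,\mu^{\mathsf T},
\qquad \mu = \mathbb{E}[X],
```

so H = cov(X) exactly when μ = 0, i.e. when the inputs are centered.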

4. Why does the mean of the inputs need to be zero?

The input mean matters because the Hessian of the squared error is the second-moment matrix E[X X^T], which equals cov(X) + μμ^T (see the derivation above). Only when the mean μ is zero does the Hessian coincide with the covariance matrix, which is why inputs are typically centered in neural networks; a nonzero mean adds a rank-one term that changes the curvature seen by the optimizer.
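A quick numerical check of this, sketched with NumPy (the toy dataset and tolerance are illustrative):

```python
import numpy as np

# The Hessian of the averaged squared error is the second-moment matrix
# E[X X^T].  It matches cov(X) only after the inputs are centered.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[3.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=100_000)

second_moment = X.T @ X / len(X)       # Hessian of 0.5*mean((t - w^T x)^2) w.r.t. w
print(np.allclose(second_moment, np.cov(X, rowvar=False), atol=1e-2))    # False: mean != 0

Xc = X - X.mean(axis=0)                # center the inputs
second_moment_c = Xc.T @ Xc / len(Xc)
print(np.allclose(second_moment_c, np.cov(Xc, rowvar=False), atol=1e-2)) # True
```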

5. How does gradient descent handle non-convex cost functions?

Gradient descent can still be used to optimize non-convex cost functions, but it may get stuck in local minima. To address this issue, techniques such as momentum, adaptive learning rates, and random restarts can be used to improve the chances of finding the global minimum.
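For instance, a minimal momentum sketch (the cost function and hyperparameters are illustrative, not from the paper):

```python
# Gradient descent with momentum on a simple non-convex cost
# f(w) = w^4 - 3*w^2 + w.  The velocity term accumulates past gradients,
# which lets the iterate roll past shallow features it would otherwise stall in.
def grad(w):
    return 4 * w**3 - 6 * w + 1        # f'(w)

w, velocity = 2.0, 0.0
lr, beta = 0.01, 0.9                   # learning rate and momentum factor
for _ in range(1000):
    velocity = beta * velocity - lr * grad(w)
    w += velocity

print(w)                               # settles near a minimum of f
```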
