θ is the parameter vector (e.g. of a model) that characterizes the distribution of x (a data sample)
L(θ,x) is the likelihood function: how likely is the data if the distribution is described by θ
the log-likelihood function is the natural log of the likelihood function: l(θ,x)=ln(L(θ,x))
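As a minimal, self-contained sketch (the univariate Gaussian model, the numbers, and the variable names are only illustrative):

```python
import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.7, 1.9, 1.1])    # a small data sample
theta = {"mu": 1.0, "sigma": 0.5}      # parameter vector θ = (μ, σ) of the assumed Gaussian

# log-likelihood l(θ, x) = sum of log-densities; likelihood L(θ, x) = exp of that
log_lik = norm.logpdf(x, loc=theta["mu"], scale=theta["sigma"]).sum()
lik = np.exp(log_lik)
print(log_lik, lik)
```

for any realistic number of data points the raw likelihood underflows quickly, which is one practical reason to work with the log-likelihood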
the score vector is the vector of first derivatives of the log-likelihood function with respect to θ:
$$\nabla_\theta\, \ell(\theta, x)$$
this is like the gradient of the loss function in training NNs
for convenience, let's call this $\nabla_\theta$
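A quick way to see the score in action, again using a one-parameter Gaussian mean as a stand-in model (the analytic formula below is specific to that toy model):

```python
import numpy as np
from scipy.stats import norm

sigma = 0.5
x = np.array([1.2, 0.7, 1.9, 1.1])

def log_lik(mu):
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

mu = 1.0
# analytic score for the Gaussian mean: ∂l/∂μ = Σ(x - μ) / σ²
score_analytic = (x - mu).sum() / sigma**2

# finite-difference check of the same derivative
eps = 1e-6
score_numeric = (log_lik(mu + eps) - log_lik(mu - eps)) / (2 * eps)
print(score_analytic, score_numeric)
```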
the Fisher Information matrix (or Information Matrix) is the matrix of second cross-moments of the score:
$$I(\theta) = \mathbb{E}_\theta\!\left[\nabla_\theta \nabla_\theta^{\mathsf T}\right]$$
the Information matrix is the covariance matrix of the score
under mild regularity conditions, if θ is the true parameter, the expected value of the score is 0:
$$\mathbb{E}_\theta\!\left[\nabla_\theta\right] = 0$$
hence the second moment of the score equals its covariance, and the information matrix is indeed the covariance matrix of the score
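A small Monte Carlo check of both statements (zero-mean score and Fisher information as the score's variance), using the same illustrative Gaussian-mean model:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.5

# per-observation score at the true mean: ∂/∂μ log N(x | μ, σ²) = (x - μ) / σ²
x = rng.normal(mu, sigma, size=200_000)
scores = (x - mu) / sigma**2

print(scores.mean())        # ≈ 0       : expected score at the true parameter
print(np.mean(scores**2))   # ≈ 1/σ² = 4: Fisher information = variance of the score
```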
also under certain regularity conditions, if θ is the true parameter and the log-likelihood is twice differentiable, it can be proved that
$$I(\theta) = -\mathbb{E}_\theta\!\left[\nabla^2_{\theta\theta}\right]$$
where $\nabla^2_{\theta\theta}$ is the matrix of second-order cross-partial derivatives (the Hessian matrix) of the log-likelihood.
This equality is called the information equality.
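As a quick sanity check of the information equality, take a single observation from $N(\mu, \sigma^2)$ with $\sigma$ known (a standard textbook case, used here purely for illustration):

$$\ell(\mu, x) = -\tfrac{1}{2}\ln(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}, \qquad \frac{\partial \ell}{\partial \mu} = \frac{x-\mu}{\sigma^2}, \qquad \frac{\partial^2 \ell}{\partial \mu^2} = -\frac{1}{\sigma^2}$$

both routes then agree: $\mathbb{E}\!\left[\left(\tfrac{x-\mu}{\sigma^2}\right)^2\right] = \tfrac{\sigma^2}{\sigma^4} = \tfrac{1}{\sigma^2}$, and $-\mathbb{E}\!\left[-\tfrac{1}{\sigma^2}\right] = \tfrac{1}{\sigma^2}$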
the Fisher information matrix and the covariance matrix
the Hessian matrix gives an asymptotic estimator of the covariance matrix of θ; more specifically, the negative inverse of the Hessian of the log-likelihood is an estimate of the covariance matrix:
$$\mathrm{Cov}[\theta] \approx -H^{-1}$$
hence the covariance matrix of θ gives us a way to estimate the Fisher Information matrix: $I(\theta) \approx -H \approx \mathrm{Cov}[\theta]^{-1}$
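A rough numerical sketch of this relationship (the data and model are made up, and BFGS's hess_inv is only an approximation of the true inverse Hessian):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.normal(1.0, sigma, size=1_000)

# negative log-likelihood in μ (σ treated as known to keep the example 1-dimensional)
def neg_log_lik(mu):
    return -norm.logpdf(x, loc=mu, scale=sigma).sum()

res = minimize(neg_log_lik, x0=np.array([0.0]), method="BFGS")

# inverse Hessian of the *negative* log-likelihood at the optimum, i.e. -H⁻¹,
# compared with the theoretical Cov[μ̂] = σ²/n for the Gaussian mean
print(res.hess_inv[0, 0], sigma**2 / len(x))
```

for this toy model the two printed numbers should roughly agree, since the exact Hessian of the negative log-likelihood is n/σ²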
consider approximating the likelihood function with a Taylor expansion around the true parameter $\theta_0$, up to the second-order term:
$$L(\theta, x) \approx L(\theta_0, x) + \frac{\partial L(\theta_0, x)}{\partial \theta}(\theta - \theta_0) + \frac{1}{2}\,\frac{\partial^2 L(\theta_0, x)}{\partial \theta^2}(\theta - \theta_0)^2$$
at the true parameter $\theta_0$ the likelihood is at its peak, hence the first-derivative term vanishes
thus the approximation becomes a parabola, which is not a very good fit in general because the likelihood function can take almost any shape
but we can assume that the likelihood function is roughly Gaussian, which is a reasonable assumption for a probability distribution
now the log of a Gaussian is exactly a parabola, hence the motivation for working with the log-likelihood instead: its second-order approximation is far more faithful
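To make this concrete, suppose the likelihood really were a Gaussian-shaped bump in $\theta$ around $\theta_0$ (the constants $C$ and $s$ below are just illustrative stand-ins for its height and width):

$$L(\theta, x) \approx C \exp\!\left(-\frac{(\theta - \theta_0)^2}{2 s^2}\right) \quad\Longrightarrow\quad \ell(\theta, x) = \ln L(\theta, x) \approx \ln C - \frac{(\theta - \theta_0)^2}{2 s^2}$$

so the log-likelihood is exactly a parabola whose curvature $-1/s^2$ encodes the width of the bump, which is precisely the kind of quantity the Fisher Information captures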
An intuitive explanation for the motivation behind the Fisher Information in parameter search
calculating the first derivative of the loss function and moving in that direction treats the parameter space like a flat Euclidean manifold, which is a naive thing to do
the Fisher Information matrix tells us the curvature of the manifold and hence how to scale the move in each direction, as sketched below
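A minimal sketch of what this looks like as an update rule (a natural-gradient-style step; the learning rate, damping term, and toy Fisher matrix are all made-up illustrative values):

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr=0.1, damping=1e-3):
    # precondition the raw gradient by the inverse Fisher matrix instead of
    # stepping along the gradient itself; damping keeps the solve well-posed
    precond_grad = np.linalg.solve(fisher + damping * np.eye(len(theta)), grad)
    return theta - lr * precond_grad

# toy 2-parameter example: the Fisher matrix says one direction is much more
# "curved" than the other, so the step in that direction is scaled down
theta = np.array([0.5, -1.0])
grad = np.array([0.2, 0.4])
fisher = np.array([[4.0, 0.0],
                   [0.0, 0.25]])
print(natural_gradient_step(theta, grad, fisher))
```

compare this with the plain gradient step theta - lr * grad: the Fisher preconditioning shrinks the move along the high-curvature direction and stretches it along the flat one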