-
The Fisher information matrix and the Hessian
One matrix wearing three hats: the precision of the best estimate in
statistics, the metric of information geometry, and the curvature
behind second-order optimizers. Built up from the Hessian of a
log-likelihood: the score's two moments, the osculating circle at
the MLE, the Cramér–Rao bound, the KL-divergence Hessian, the
reparameterization tensor law, and the Gauss–Newton decomposition.
-
The information theory of matrix completion
Suh's completion capacity recasts matrix completion as a Shannon
problem: how many entries does one observation resolve? A walk
through the theory, an implementation that verifies every closed
form, a GF(2) decoder that exhibits the threshold, and several
verified extensions.
-
The many sides of PCA
One method, five derivations: best linear approximation, minimal
reconstruction error, metric MDS, maximum variance with
decorrelation, and the SVD. All reduce to the eigenvectors of the
covariance matrix.
-
Maximum likelihood and maximum a posteriori
The two standard methods for estimating model parameters:
definitions, worked examples, how they relate as the sample grows,
and the zero-count problem that motivates smoothing.
-
Several views on cross-entropy
From the likelihood of a Bernoulli model, cross-entropy follows as
the negative log-likelihood, and the same quantity reappears as KL
divergence (up to a model-independent constant), code length, a
proper scoring rule, and mode-covering KL. Concludes with a
comparison to squared error for classification.