Some thoughts on why we use the negative log-likelihood as the loss function when approximating the true data distribution.