In maximum likelihood factor analysis (FA), a p-dimensional real-valued data vector \(\mathbf{X}\) is modeled using a k-dimensional vector of real-valued factors, \(\mathbf{F}\), where \(k\) is generally much smaller than \(p\): \[ \mathbf{X}=\mathbf{L} \mathbf{F} +\mathbf{\varepsilon}\] \(\mathbf{L}\) is known as the factor loading matrix. The factors \(\mathbf{F}\) are assumed to be \(\mathcal{N}(0,\mathbf{I})\) distributed (zero-mean, independent normal with unit variance). The p-dimensional random variable \(\mathbf{\varepsilon}\) is distributed \(\mathcal{N}(0, \Psi)\), where \(\Psi\) is a diagonal matrix. So \[ \left(\begin{array}{c} \mathbf{X} \\ \mathbf{F}\end{array}\right) \sim \mathcal{N}\left(\left[\begin{array}{c} 0 \\ 0 \end{array} \right], \left[\begin{array}{cc} \mathbf{L}\mathbf{L}^T+\Psi & \mathbf{L} \\ \mathbf{L}^T & \mathbf{I} \end{array} \right] \right) \] with \[ \mathrm{E}(\mathbf{F}|\mathbf{X})= \mathbf{L}^T(\mathbf{L}\mathbf{L}^T+\Psi)^{-1} \mathbf{X} \stackrel{\wedge}{=}\beta \mathbf{X}\] where, by the matrix-inversion lemma, \[ (\mathbf{L}\mathbf{L}^T+\Psi)^{-1}=\Psi^{-1}-\Psi^{-1}\mathbf{L}(\mathbf{I}+\mathbf{L}^T\Psi^{-1}\mathbf{L})^{-1}\mathbf{L}^T\Psi^{-1}\] so that only a \(k\times k\) matrix needs to be inverted. Furthermore, it is possible (and in fact necessary for the EM algorithm) to compute the second moment of the factors: \[\mathrm{E}(\mathbf{F}\mathbf{F}^T|\mathbf{X})=\mbox{Var}(\mathbf{F}|\mathbf{X})+\mathrm{E}(\mathbf{F}|\mathbf{X})\mathrm{E}(\mathbf{F}|\mathbf{X})^T=\mathbf{I}-\beta \mathbf{L} +\beta\mathbf{X}\mathbf{X}^T\beta^T\]
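As a concrete sketch (NumPy; the function and variable names are ours, not from any particular library), the two posterior moments can be computed while inverting only a \(k\times k\) matrix. Writing \(\mathbf{A}=\mathbf{I}+\mathbf{L}^T\Psi^{-1}\mathbf{L}\), the matrix-inversion lemma above simplifies to \(\beta=\mathbf{A}^{-1}\mathbf{L}^T\Psi^{-1}\) and \(\mathrm{Var}(\mathbf{F}|\mathbf{X})=\mathbf{I}-\beta\mathbf{L}=\mathbf{A}^{-1}\):

```python
import numpy as np

def e_step(X, L, Psi):
    """Posterior moments of the factors for each data point.

    X   : (n, p) data matrix, rows are observations X_i
    L   : (p, k) factor loading matrix
    Psi : (p,) diagonal of the noise covariance
    Returns EF (n, k) with rows E(F|X_i) and EFF (n, k, k) with E(FF^T|X_i).
    """
    p, k = L.shape
    Lt_Psi_inv = L.T / Psi                   # L^T Psi^{-1}, shape (k, p)
    A = np.eye(k) + Lt_Psi_inv @ L           # A = I + L^T Psi^{-1} L  (k x k)
    beta = np.linalg.solve(A, Lt_Psi_inv)    # beta = A^{-1} L^T Psi^{-1}
    EF = X @ beta.T                          # rows are E(F | X_i) = beta X_i
    Var = np.linalg.inv(A)                   # Var(F|X) = I - beta L = A^{-1}
    EFF = Var[None] + EF[:, :, None] * EF[:, None, :]  # E(FF^T | X_i)
    return EF, EFF
```

The solve/inverse on \(\mathbf{A}\) replaces the \(p\times p\) inverse of \(\mathbf{L}\mathbf{L}^T+\Psi\), which matters when \(p\gg k\).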
The expected complete-data log-likelihood is \[\begin{eqnarray*} Q &=& \mathrm{E}\left[\log \prod\limits_{i=1}^n (2\pi)^{-p/2}|\Psi|^{-1/2} \exp\left\{-\frac{1}{2}[\mathbf{X}_i-\mathbf{L}\mathbf{F}_i]^T \Psi^{-1}[\mathbf{X}_i-\mathbf{L}\mathbf{F}_i] \right\}\right] \\ &=& c-\frac{n}{2}\log|\Psi|-\sum\limits_{i=1}^n \mathrm{E} \left[ \frac12 \mathbf{X}_i^T \Psi^{-1} \mathbf{X}_i-\mathbf{X}_i^T \Psi^{-1}\mathbf{L}\mathbf{F}_i+\frac12 \mathbf{F}_i^T \mathbf{L}^T \Psi^{-1}\mathbf{L}\mathbf{F}_i \right] \\ &=& c-\frac{n}{2}\log|\Psi|-\sum\limits_{i=1}^n \left( \frac12 \mathbf{X}_i^T \Psi^{-1} \mathbf{X}_i-\mathbf{X}_i^T \Psi^{-1}\mathbf{L}\mathrm{E}[\mathbf{F}|\mathbf{X}_i]+\frac12 \mathrm{tr}\left[ \mathbf{L}^T \Psi^{-1}\mathbf{L}\,\mathrm{E}\left[ \mathbf{F}\mathbf{F}^T|\mathbf{X}_i \right]\right]\right) \end{eqnarray*}\] where \(c\) is a constant independent of the parameters, and \(\mathrm{tr}\) is the trace operator.
* Re-estimate \(\Psi\) through its inverse: \[ \frac{\partial Q}{\partial \Psi^{-1}} =\frac{n}{2} \Psi^{\mathrm{new}}-\sum\limits_{i=1}^n \left( \frac12 \mathbf{X}_i\mathbf{X}_i^T-\mathbf{L}^{\mathrm{new}} \mathrm{E}[\mathbf{F}|\mathbf{X}_i]\mathbf{X}_i^T+\frac12 \mathbf{L}^{\mathrm{new}} \mathrm{E}[ \mathbf{F}\mathbf{F}^T|\mathbf{X}_i](\mathbf{L}^{\mathrm{new}})^T \right)=0 \] Substituting the update \[ \mathbf{L}^{\mathrm{new}}= \left(\sum\limits_{i=1}^n \mathbf{X}_i \mathrm{E}(\mathbf{F}|\mathbf{X}_i)^T \right) \left(\sum\limits_{l=1}^n \mathrm{E}(\mathbf{F}\mathbf{F}^T|\mathbf{X}_l) \right)^{-1} \] into the quadratic term (so that \(\mathbf{L}^{\mathrm{new}} \sum_i \mathrm{E}[\mathbf{F}\mathbf{F}^T|\mathbf{X}_i] (\mathbf{L}^{\mathrm{new}})^T\) collapses to \(\sum_i \mathbf{X}_i \mathrm{E}[\mathbf{F}|\mathbf{X}_i]^T(\mathbf{L}^{\mathrm{new}})^T\), whose diagonal equals that of the cross term), we have \[\frac{n}{2}\Psi^{\mathrm{new}}=\sum\limits_{i=1}^n \left( \frac12 \mathbf{X}_i\mathbf{X}_i^T-\frac12 \mathbf{L}^{\mathrm{new}} \mathrm{E}[\mathbf{F}|\mathbf{X}_i]\mathbf{X}_i^T \right) \] and, using the diagonal constraint, \[ \Psi^{\mathrm{new}}=\frac{1}{n}\mathrm{diag}\left\{\sum\limits_{i=1}^n\left[\mathbf{X}_i\mathbf{X}_i^T-\mathbf{L}^{\mathrm{new}}\mathrm{E}[\mathbf{F}|\mathbf{X}_i]\mathbf{X}_i^T\right] \right\} \]
E-step Compute \(\mathrm{E}(\mathbf{F}|\mathbf{X}_i)\) and \(\mathrm{E}(\mathbf{F}\mathbf{F}^T|\mathbf{X}_i)\) for each data point \(\mathbf{X}_i\), given \(\mathbf{L}_{\mathrm{init}}\) and \(\Psi_{\mathrm{init}}\).
M-step \[ \mathbf{L}^{\mathrm{new}}= \left(\sum\limits_{i=1}^n \mathbf{X}_i \mathrm{E}(\mathbf{F}|\mathbf{X}_i)^T \right) \left(\sum\limits_{l=1}^n \mathrm{E}(\mathbf{F}\mathbf{F}^T|\mathbf{X}_l) \right)^{-1} \] \[ \Psi^{\mathrm{new}}=\frac{1}{n}\mathrm{diag}\left\{\sum\limits_{i=1}^n\left[\mathbf{X}_i\mathbf{X}_i^T-\mathbf{L}^{\mathrm{new}}\mathrm{E}[\mathbf{F}|\mathbf{X}_i]\mathbf{X}_i^T\right] \right\} \] where the diag operator sets all the off-diagonal elements of a matrix to zero.
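A direct transcription of these two updates (a NumPy sketch; `m_step`, `EF`, and `EFF` are our own names) might look like:

```python
import numpy as np

def m_step(X, EF, EFF):
    """One M-step of EM for factor analysis.

    X   : (n, p) data matrix
    EF  : (n, k) posterior means E(F|X_i)
    EFF : (n, k, k) posterior second moments E(FF^T|X_i)
    Returns the updated loadings (p, k) and diagonal noise variances (p,).
    """
    n = X.shape[0]
    S1 = X.T @ EF                  # sum_i X_i E(F|X_i)^T, shape (p, k)
    S2 = EFF.sum(axis=0)           # sum_l E(F F^T | X_l), shape (k, k)
    L_new = np.linalg.solve(S2.T, S1.T).T   # S1 @ inv(S2), without forming inv
    # diag of (1/n) sum_i [X_i X_i^T - L_new E(F|X_i) X_i^T], kept as a vector
    Psi_new = np.mean(X * X - (EF @ L_new.T) * X, axis=0)
    return L_new, Psi_new
```

Since only the diagonal of \(\Psi^{\mathrm{new}}\) is needed, the elementwise products avoid ever forming the \(p\times p\) matrices.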
Use \(\mathbf{L}^{\mathrm{new}}\) and \(\Psi^{\mathrm{new}}\) to replace \(\mathbf{L}_{\mathrm{init}}\) and \(\Psi_{\mathrm{init}}\), and repeat the E-step and M-step until the estimates \(\mathbf{L}^{\mathrm{new}}\) and \(\Psi^{\mathrm{new}}\) converge.
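Putting the pieces together, a minimal self-contained EM loop might look as follows (a sketch with our own names; \(\mathbf{L}_{\mathrm{init}}\) is drawn at random, \(\Psi\) is stored as the vector of its diagonal, and the data are assumed centered):

```python
import numpy as np

def fa_em(X, k, n_iter=200, tol=1e-6, seed=0):
    """EM for maximum-likelihood factor analysis (illustrative sketch).

    X : (n, p) centered data matrix; k : number of factors.
    Returns the loading matrix L (p, k) and noise variances Psi (p,).
    """
    n, p = X.shape
    rng = np.random.default_rng(seed)
    L = 0.1 * rng.standard_normal((p, k))    # L_init (arbitrary choice)
    Psi = X.var(axis=0) + 1e-6               # Psi_init
    for _ in range(n_iter):
        # E-step: posterior moments via the matrix-inversion lemma.
        A = np.eye(k) + (L.T / Psi) @ L      # I + L^T Psi^{-1} L
        beta = np.linalg.solve(A, L.T / Psi)
        EF = X @ beta.T                      # E(F | X_i), rowwise
        S2 = n * np.linalg.inv(A) + EF.T @ EF   # sum_i E(FF^T | X_i)
        # M-step.
        L_new = np.linalg.solve(S2.T, (X.T @ EF).T).T
        Psi_new = np.mean(X * X - (EF @ L_new.T) * X, axis=0)
        done = max(np.abs(L_new - L).max(), np.abs(Psi_new - Psi).max()) < tol
        L, Psi = L_new, Psi_new
        if done:
            break
    return L, Psi
```

Each iteration never inverts more than a \(k\times k\) matrix; a production implementation would also monitor the observed-data log-likelihood, which EM increases monotonically.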