Information Splitting for Big Data Analytics

12 Jul 2016  ·  Zhu Shengxin, Gu Tongxiang, Xu Xiaowen, Mo Zeyao ·

Many statistical models require an estimation of unknown (co)-variance parameter(s) in a model. The estimation usually obtained by maximizing a log-likelihood which involves log determinant terms. In principle, one requires the \emph{observed information}--the negative Hessian matrix or the second derivative of the log-likelihood---to obtain an accurate maximum likelihood estimator according to the Newton method. When one uses the \emph{Fisher information}, the expect value of the observed information, a simpler algorithm than the Newton method is obtained as the Fisher scoring algorithm. With the advance in high-throughput technologies in the biological sciences, recommendation systems and social networks, the sizes of data sets---and the corresponding statistical models---have suddenly increased by several orders of magnitude. Neither the observed information nor the Fisher information is easy to obtained for these big data sets. This paper introduces an information splitting technique to simplify the computation. After splitting the mean of the observed information and the Fisher information, an simpler approximate Hessian matrix for the log-likelihood can be obtained. This approximated Hessian matrix can significantly reduce computations, and makes the linear mixed model applicable for big data sets. Such a spitting and simpler formulas heavily depends on matrix algebra transforms, and applicable to large scale breeding model, genetics wide association analysis.

PDF Abstract
No code implementations yet. Submit your code now

Categories


Computation Numerical Analysis

Datasets


  Add Datasets introduced or used in this paper