Data Sharing and Resampled LASSO: A word based sentiment Analysis for IMDb data

Research paper by Ashutosh K. Maurya

Indexed on: 16 May '17
Published on: 16 May '17
Published in: arXiv - Statistics - Applications


In this article we study the variable selection problem using LASSO with new improvisations. LASSO uses an $\ell_{1}$ penalty, which shrinks most of the coefficients to zero when the number of explanatory variables $(p)$ is much larger than the number of observations $(N)$. The novelty of the approach developed in this article is that it blends the basic ideas behind resampling and LASSO, which provides significant variable reduction and improved prediction accuracy, in terms of mean squared error, on the test sample. Different weighting schemes are explored using \textit{Bootstrapped LASSO}, the basic methodology developed here. The weighting schemes determine the extent of data blending in the case of grouped data. The data sharing (DSL) technique developed by \cite{gross} lies at the root of the present methodology. We apply the technique to the IMDb dataset discussed in \cite{gross} and compare our results with those of \cite{gross}.
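
For reference, the standard LASSO estimate that the abstract builds on can be written as below. The symbols $y$ (response vector), $X$ (the $N \times p$ design matrix), $\beta$ (coefficient vector), and $\lambda$ (tuning parameter) are notation assumed here for illustration and do not appear in the abstract; the $1/(2N)$ scaling is one common convention, not necessarily the one used in the article.

$$
\hat{\beta}(\lambda) = \arg\min_{\beta \in \mathbb{R}^{p}} \left\{ \frac{1}{2N} \lVert y - X\beta \rVert_{2}^{2} + \lambda \lVert \beta \rVert_{1} \right\}, \qquad \lVert \beta \rVert_{1} = \sum_{j=1}^{p} \lvert \beta_{j} \rvert .
$$

Larger values of $\lambda \ge 0$ set more coefficients exactly to zero, which is why the $\ell_{1}$ penalty can serve as a variable selection device when $p$ is much larger than $N$.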