I'm a researcher at Graz University of Technology working in applied information theory (with a data science flavor).
I love math, all things entropy, and working in small teams. Networking is the best part of academic conferences - I'm a people person!
Deep learning is fascinating, successful, and little understood. Information theory may help!
To understand neural networks better, researchers have recently made extensive use of information theory.
One popular approach is to train a neural network conventionally and then evaluate neuron outputs using information-theoretic cost functions: For example, it has been found that neurons in deeper layers "learn" about the class label during training (the mutual information between layer and class increases) and that, in some cases, layers "forget" irrelevant aspects of the input (the mutual information between layer and input decreases). Similarly, it was observed that, during training, the information contained in a layer passes through a period in which it is highly redundant and ends up in a stage in which individual neurons are informative about different classes. Moreover, neurons in deeper layers tend to become "cat neurons", i.e., they help in distinguishing exactly one class from the rest. Neurons in early layers, in contrast, mainly collect general features and thus appear to matter for classification not individually, but mainly in combination with other neurons. Surprisingly, the existence of these "cat neurons" has been linked to poor generalization performance, i.e., to poor performance on data that was not used during training.
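To make this concrete: in practice, the mutual information between a (continuous-valued) layer and the discrete class label is often estimated by discretizing the neuron outputs into bins and computing the mutual information of the resulting joint histogram. Below is a minimal Python sketch of such a binning estimator; the bin count and the use of scikit-learn's mutual_info_score are my own illustrative choices, not taken from any particular paper.

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def binned_mutual_information(activations, labels, n_bins=30):
        """Estimate I(T; Y) by discretizing a layer's outputs T into bins.

        activations: (n_samples, n_neurons) array of layer outputs
        labels: (n_samples,) array of discrete class labels
        """
        # Discretize every neuron output on a common grid.
        edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
        binned = np.digitize(activations, edges)
        # Collapse each sample's bin pattern into a single discrete symbol.
        _, t = np.unique(binned, axis=0, return_inverse=True)
        return mutual_info_score(labels, t)  # in nats

Note that such plug-in estimates are quite sensitive to the number of bins, which is one reason why reported results differ between studies.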
The information bottleneck principle encapsulates exactly what one would wish for a neural network, or for any classification system in general: that the network preserves all information in the input relevant for classification, but forgets everything that is irrelevant. Researchers have thus tried to train neural networks using the information bottleneck principle. Since it can be shown that -- taken as it is -- this principle is inadequate as a cost function for training, researchers have replaced it with cost functions that are similar in spirit, with remarkable success: the trained networks were better at compressing the input information to what is relevant for classification and were more robust to adversarial examples.
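For reference, the information bottleneck principle of Tishby et al. can be written as a trade-off between compression and prediction: given the input X and the class variable Y, one seeks a (possibly stochastic) representation T minimizing

    I(X; T) - beta * I(T; Y)

over all encoders p(t|x), where beta > 0 weighs forgetting the input (small I(X;T)) against preserving what is relevant for classification (large I(T;Y)).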
However, it seems as if even qualitative results depend strongly on how the network is constructed: Is the activation function sigmoidal or a ReLU? How are the information-theoretic quantities estimated? What is the influence of the number of layers? Nevertheless, even though general trends cannot yet be claimed with certainty, information theory holds some promise to "open the black box of deep learning" -- what are you waiting for?
Abstract: Despite their ability to memorize large datasets, deep neural networks often achieve good generalization performance. However, the differences between the learned solutions of networks which generalize and those which do not remain unclear. Additionally, the tuning properties of single directions (defined as the activation of a single unit or some linear combination of units in response to some input) have been highlighted, but their importance has not been evaluated. Here, we connect these lines of inquiry to demonstrate that a network's reliance on single directions is a good predictor of its generalization performance, across networks trained on datasets with different fractions of corrupted labels, across ensembles of networks trained on datasets with unmodified labels, across different hyperparameters, and over the course of training. While dropout only regularizes this quantity up to a point, batch normalization implicitly discourages single direction reliance, in part by decreasing the class selectivity of individual units. Finally, we find that class selectivity is a poor predictor of task importance, suggesting not only that networks which generalize well minimize their dependence on individual units by reducing their selectivity, but also that individually selective units may not be necessary for strong network performance.
Pub.: 19 Mar '18, Pinned: 06 Apr '18
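The class selectivity mentioned in this abstract can be computed per unit from its class-conditional mean activations. The sketch below follows the common definition (highest class-conditional mean response versus the mean response over the remaining classes); I believe this matches the paper's metric, but treat the details, such as the assumption of non-negative (e.g., post-ReLU) activations, as my own.

    import numpy as np

    def class_selectivity(unit_activations, labels):
        """Selectivity index (mu_max - mu_rest) / (mu_max + mu_rest) of one unit.

        unit_activations: (n_samples,) responses of a single unit
        labels: (n_samples,) integer class labels
        """
        classes = np.unique(labels)
        # Mean response of this unit to each class.
        class_means = np.array([unit_activations[labels == c].mean() for c in classes])
        best = class_means.argmax()
        mu_max = class_means[best]
        mu_rest = np.delete(class_means, best).mean()
        # For non-negative activations the index lies in [0, 1]:
        # 0 = no class preference, 1 = responds to exactly one class.
        return (mu_max - mu_rest) / (mu_max + mu_rest + 1e-12)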
Abstract: Deep Neural Networks (DNNs) are analyzed via the theoretical framework of the information bottleneck (IB) principle. We first show that any DNN can be quantified by the mutual information between the layers and the input and output variables. Using this representation we can calculate the optimal information theoretic limits of the DNN and obtain finite sample generalization bounds. The advantage of getting closer to the theoretical limit is quantifiable both by the generalization bound and by the network's simplicity. We argue that both the optimal architecture, number of layers and features/connections at each layer, are related to the bifurcation points of the information bottleneck tradeoff, namely, relevant compression of the input layer with respect to the output layer. The hierarchical representations at the layered network naturally correspond to the structural phase transitions along the information curve. We believe that this new insight can lead to new optimality bounds and deep learning algorithms.
Pub.: 09 Mar '15, Pinned: 06 Apr '18
Abstract: Despite their great success, there is still no comprehensive theoretical understanding of learning with Deep Neural Networks (DNNs) or their inner organization. Previous work [Tishby & Zaslavsky (2015)] proposed to analyze DNNs in the Information Plane; i.e., the plane of the Mutual Information values that each layer preserves on the input and output variables. They suggested that the goal of the network is to optimize the Information Bottleneck (IB) tradeoff between compression and prediction, successively, for each layer. In this work we follow up on this idea and demonstrate the effectiveness of the Information-Plane visualization of DNNs. We first show that the stochastic gradient descent (SGD) epochs have two distinct phases: fast empirical error minimization followed by slow representation compression, for each layer. We then argue that the DNN layers end up very close to the IB theoretical bound, and present a new theoretical argument for the computational benefit of the hidden layers.
Pub.: 02 Mar '17, Pinned: 06 Apr '18
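The Information-Plane analysis described here boils down to computing, for each layer T and training epoch, the pair (I(X;T), I(T;Y)) and plotting the resulting trajectory. A minimal sketch of computing one such point, reusing the binned_mutual_information helper from the sketch above on toy data (the random-projection "layer" is just a stand-in for real network activations):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 12))              # toy inputs
    y = (X[:, 0] > 0).astype(int)                # toy labels
    T = np.tanh(X @ rng.normal(size=(12, 4)))    # stand-in for a hidden layer

    # With a deterministic network, I(X;T) of the binned T equals its entropy,
    # which we obtain by pairing T with a unique id per input sample.
    i_xt = binned_mutual_information(T, np.arange(len(X)))
    i_ty = binned_mutual_information(T, y)
    print(f"information-plane point: I(X;T)={i_xt:.2f}, I(T;Y)={i_ty:.2f} nats")

During training one would recompute this pair after every few epochs and for every layer; the two phases reported in the paper then show up as a fast increase of I(T;Y) followed by a slow decrease of I(X;T).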
Abstract: We present a variational approximation to the information bottleneck of Tishby et al. (1999). This variational approach allows us to parameterize the information bottleneck model using a neural network and leverage the reparameterization trick for efficient training. We call this method "Deep Variational Information Bottleneck", or Deep VIB. We show that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.
Pub.: 01 Dec '16, Pinned: 06 Apr '18
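In code, the Deep VIB objective amounts to a standard cross-entropy term plus a beta-weighted KL divergence pulling the stochastic encoder towards a fixed prior. A minimal PyTorch-flavored sketch under my own naming, assuming a Gaussian encoder and a standard-normal prior as in the paper (this is not the authors' reference implementation):

    import torch
    import torch.nn.functional as F

    def reparameterize(mu, logvar):
        # z = mu + sigma * eps with eps ~ N(0, I): the reparameterization trick.
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def vib_loss(mu, logvar, logits, targets, beta=1e-3):
        """Cross-entropy + beta * KL(q(z|x) || N(0, I)), averaged over the batch.

        mu, logvar: encoder outputs parameterizing q(z|x) = N(mu, diag(exp(logvar)))
        logits: classifier outputs computed from a sample reparameterize(mu, logvar)
        """
        ce = F.cross_entropy(logits, targets)
        # Closed-form KL between a diagonal Gaussian and the standard normal prior.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1).mean()
        return ce + beta * kl

Here mu and logvar are produced by the encoder network, and small values of beta (the paper explores a range around 1e-3) typically trade off well between accuracy and compression.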
Abstract: Information bottleneck [IB] is a technique for extracting information in some 'input' random variable that is relevant for predicting some different 'output' random variable. IB works by encoding the input in a compressed 'bottleneck variable' from which the output can then be accurately decoded. IB can be difficult to compute in practice, and has been mainly developed for two limited cases: (1) discrete random variables with small state spaces, and (2) continuous random variables that are jointly Gaussian distributed (in which case the encoding and decoding maps are linear). We propose a method to perform IB in more general domains. Our approach can be applied to discrete or continuous inputs and outputs, and allows for nonlinear encoding and decoding maps. The method uses a novel upper bound on the IB objective, derived using a non-parametric estimator of mutual information and a variational approximation. We show how to implement the method using neural networks and gradient-based optimization, and demonstrate its performance on the MNIST dataset.
Pub.: 05 May '17, Pinned: 06 Apr '18
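The non-parametric estimator mentioned in this abstract can be illustrated for the common setting where the bottleneck variable is the encoder output plus Gaussian noise, T = h(X) + N(0, sigma^2 I). A pairwise-distance sketch in that spirit follows; the exact constants and the direction of the bound should be checked against the paper, so treat this as an assumption-laden illustration rather than the authors' estimator.

    import numpy as np
    from scipy.special import logsumexp

    def pairwise_mi_estimate(h, sigma):
        """Kernel-style estimate of I(X; T) for T = h(X) + Gaussian noise.

        h: (n, d) array of deterministic encoder outputs h(x_i)
        """
        n = h.shape[0]
        sq_dists = ((h[:, None, :] - h[None, :, :]) ** 2).sum(axis=-1)
        # log of the mean Gaussian kernel value around each sample's output.
        log_mean_kernel = logsumexp(-sq_dists / (2.0 * sigma ** 2), axis=1) - np.log(n)
        return -log_mean_kernel.mean()  # in nats

Because this expression is differentiable in h, it can serve directly as a regularizer inside gradient-based training, which is what makes the approach attractive for neural networks.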
Abstract: In this theory paper, we investigate training deep neural networks (DNNs) for classification via minimizing the information bottleneck (IB) functional. We show that, even if the joint distribution between continuous feature variables and the discrete class variable is known, the resulting optimization problem suffers from two severe issues: First, for deterministic DNNs, the IB functional is infinite for almost all weight matrices, making the optimization problem ill-posed. Second, the invariance of the IB functional under bijections prevents it from capturing desirable properties for classification, such as robustness, architectural simplicity, and simplicity of the learned representation. We argue that these issues are partly resolved for stochastic DNNs, DNNs that include a (hard or soft) decision rule, or by replacing the IB functional with related, but more well-behaved cost functions. We conclude that recent successes reported about training DNNs using the IB framework must be attributed to such solutions. As a side effect, our results imply limitations of the IB framework for the analysis of DNNs.
Pub.: 27 Feb '18, Pinned: 06 Apr '18