ETH Zürich Identifies Priors That Boost Bayesian Deep Learning Models

AI Technology & Industry Review

A research team from ETH Zürich presents an overview of priors for (deep) Gaussian processes, variational autoencoders and Bayesian neural networks. The researchers propose that well-chosen priors can achieve desirable theoretical and empirical properties such as uncertainty estimation, model selection and optimal decision support, and they provide guidance on how to choose such priors.

It is well known across the machine learning community that choosing the right prior, an initial belief about an event expressed as a probability distribution, is crucial for Bayesian inference. Many recent Bayesian deep learning models, however, resort to established but uninformative or weakly informative priors that may have detrimental consequences for their models' inference abilities.

The main idea of Bayesian models is to infer a posterior distribution over the parameters of a model by combining a prior probability with the likelihood of observed data. This approach can be used to update the probability of a hypothesis as more evidence or information becomes available. Although choosing a good prior is crucial for Bayesian models, in practice it is often non-trivial to map subjective prior beliefs onto tractable probability distributions. In addition, the asymptotic consistency guarantees of the Bernstein-von Mises theorem have led some researchers to believe that the prior's influence on the posterior is negligible. There is therefore an increasing trend in contemporary Bayesian deep learning research to choose seemingly "uninformative" priors such as standard Gaussians.
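The prior-to-posterior updating described above can be made concrete with the textbook beta-binomial model, which also shows how strongly the prior shapes the posterior when data are scarce. This is an illustrative sketch, not an example from the paper; the numbers are made up.

```python
# Conjugate beta-binomial updating: a Beta(alpha, beta) prior over a
# coin's heads-probability is updated to a posterior as flips are
# observed. (Illustrative only; not taken from the paper.)

def update_beta(alpha, beta, heads, tails):
    """Posterior parameters of a Beta prior after binomial data."""
    return alpha + heads, beta + tails

# "Uninformative" prior: Beta(1, 1) is uniform on [0, 1].
a, b = update_beta(1.0, 1.0, heads=7, tails=3)
posterior_mean_flat = a / (a + b)      # 8/12, about 0.667

# Informative prior: a strong prior belief that the coin is fair.
a, b = update_beta(50.0, 50.0, heads=7, tails=3)
posterior_mean_fair = a / (a + b)      # 57/110, about 0.518
```

With only ten flips, the two priors lead to noticeably different posteriors; as the number of flips grows, the data dominate and the posteriors converge, which is exactly the asymptotic regime where Bernstein-von Mises applies.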
This theorem, however, does not hold in many applications, as its regularity conditions are often not satisfied. Moreover, in the non-asymptotic regime of practical inference, priors have a strong influence on the posterior. Worse yet, bad priors can undermine the very properties that motivate researchers to use Bayesian inference in the first place. Motivated by these insights, the team argues that it is time to look at prior choices beyond the usual uninformative ones.

The paper first reviews existing prior designs for (deep) Gaussian processes. Gaussian processes (GPs) are nonparametric models that, instead of inferring a distribution over the parameters of a parametric function, infer a distribution over functions directly. This approach is not only well suited to problems with few observations; it also has the potential to exploit the available information in increasingly large datasets. The team specifies how to combine GPs with deep neural networks via parameterized functions and neural network limits, and how to use them to construct deep models in their own right.

The team also examines priors in variational autoencoders (VAEs), Bayesian latent variable models whose architectures comprise an encoder and a decoder trained to minimize the reconstruction error between the encoded-decoded data and the original data. In such models, observations are generated from unobserved latent variables through a likelihood function. The team examines a number of proper distributional VAE priors that can directly replace the standard Gaussian, several structural VAE priors, and a particularly interesting VAE model: the neural process.

Regarding priors in Bayesian neural networks, the team argues that standard Gaussian priors over the parameters are insufficient, and that inductive biases should instead be represented through the choice of architectures.
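The point about architectures as inductive biases can be sketched numerically: drawing weights from a standard Gaussian prior induces a prior over functions, and what that induced prior looks like depends heavily on architectural choices such as width and activation. The network sizes and activations below are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

# Prior predictive of a one-hidden-layer MLP: sampling weights from
# N(0, 1) induces a distribution over functions f(x). Changing the
# activation (an architectural choice) changes the induced function
# prior even though the weight prior is identical. Illustrative sketch.

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 100)[:, None]  # 100 inputs in 1D

def sample_prior_function(width, activation):
    """Draw one function from the MLP's prior predictive
    (weights ~ N(0, 1), output layer scaled by 1/sqrt(width))."""
    w1 = rng.normal(0.0, 1.0, (1, width))
    b1 = rng.normal(0.0, 1.0, width)
    w2 = rng.normal(0.0, 1.0, (width, 1)) / np.sqrt(width)
    h = activation(x @ w1 + b1)
    return (h @ w2).ravel()

f_tanh = sample_prior_function(200, np.tanh)                      # smooth, bounded samples
f_relu = sample_prior_function(200, lambda z: np.maximum(z, 0.0)) # piecewise-linear samples
```

Plotting many such draws for each activation makes the difference visible: the same standard Gaussian weight prior yields qualitatively different function-space priors, which is why the paper emphasizes reasoning about priors in function space rather than weight space alone.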
They also review priors defined in weight space and in function space, and explore methods for extending these ideas to Bayesian ensembles of neural networks.

The above approaches all assume that prior knowledge is available to encode into Bayesian deep learning models. But what if there is no useful prior knowledge to encode? In this case, the team suggests, researchers can rely on a learning-to-learn, or meta-learning, framework that leverages previously solved tasks related to the current task to learn hyperparameters for most of the priors discussed above (Gaussian processes, variational autoencoders, and Bayesian neural networks).

Overall, the team reviews many alternative prior choices for popular Bayesian deep learning models and demonstrates that useful priors for these models can even be learned from data alone. They hope their study will encourage researchers to choose their priors more carefully and motivate the research community to develop better-suited priors for Bayesian deep learning models.
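The idea of learning a prior from related tasks can be sketched with a simple empirical-Bayes example: estimates from previously solved tasks are pooled to fit the hyperparameters of a Gaussian prior, which then regularizes inference on a new task with little data. The normal-normal model and all numbers here are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Empirical-Bayes sketch of "learning a prior from related tasks":
# fit the mean and variance of a Gaussian prior from per-task
# estimates, then combine that learned prior with scarce new-task
# data via the standard normal-normal posterior. Illustrative only.

rng = np.random.default_rng(1)

# 20 previously solved, related tasks, each with 30 noisy observations.
true_task_means = rng.normal(2.0, 0.5, size=20)
task_estimates = np.array(
    [rng.normal(m, 1.0, size=30).mean() for m in true_task_means]
)

# Learn the prior hyperparameters from the solved tasks.
mu0 = task_estimates.mean()   # prior mean
tau2 = task_estimates.var()   # prior variance

# New task: only 3 observations, known noise variance sigma^2 = 1.
y = rng.normal(2.0, 1.0, size=3)
sigma2_over_n = 1.0 / len(y)

# Posterior mean: precision-weighted blend of learned prior and data.
post_mean = (mu0 / tau2 + y.mean() / sigma2_over_n) / (
    1.0 / tau2 + 1.0 / sigma2_over_n
)
```

With only three observations, the posterior mean is pulled toward the prior learned from related tasks, illustrating how meta-learned priors can substitute for hand-specified prior knowledge.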