## Accepted Submissions

### Revisiting Spatial Invariance with Low-Rank Local Connectivity
Elsayed, Gamaleldin F*; Ramachandran, Prajit; Shlens, Jonathon; Kornblith, Simon

Convolutional neural networks are among the most successful architectures in deep learning. This success is at least partially attributable to the efficacy of spatial invariance as an inductive bias. Locally connected layers, which differ from convolutional layers only in their lack of spatial invariance, usually perform poorly in practice. However, these observations still leave open the possibility that some degree of relaxation of spatial invariance may yield a better inductive bias than either convolution or local connectivity. To test this hypothesis, we design a method to relax the spatial invariance of a network layer in a controlled manner. In particular, we create a low-rank locally connected (LRLC) layer, where the kernel applied at each position is constructed as a linear combination of basis kernels with spatially varying combining weights. By varying the number of basis kernels, we can control the degree of relaxation of spatial invariance. In experiments with small convolutional networks, we find that relaxing spatial invariance improves classification accuracy over both convolution and locally connected layers across the MNIST, CIFAR-10, and CelebA datasets. These results suggest that spatial invariance may be an overly restrictive inductive bias.

### Neural Additive Models: Interpretable Machine Learning with Neural Nets
Agarwal, Rishabh*; Frosst, Nicholas; Zhang, Xuezhou; Caruana, Rich; Hinton, Geoffrey

The accuracy of deep neural networks (DNNs) comes at the cost of intelligibility: it is usually unclear how they make their decisions. This hinders their applicability to high-stakes decision-making domains such as healthcare. We propose Neural Additive Models (NAMs), which combine some of the expressivity of DNNs with the inherent intelligibility of generalized additive models.
NAMs learn a linear combination of neural networks that each attend to a single input feature. These networks are trained jointly and can learn arbitrarily complex relationships between their input feature and the output. Our experiments on regression and classification datasets show that NAMs are more accurate than widely used intelligible models such as logistic regression and shallow decision trees. They perform similarly to existing state-of-the-art generalized additive models in accuracy, but can be more easily applied to real-world problems.

### Bandit-based Monte Carlo Optimization for Nearest Neighbors
Bagaria, Vivek; Baharav, Tavor Z*; Kamath, Govinda; Tse, David

The celebrated Monte Carlo method estimates an expensive-to-compute quantity by random sampling. Bandit-based Monte Carlo optimization (BMO) is a general technique for computing the minimum of many such expensive-to-compute quantities by adaptive random sampling. The technique converts an optimization problem into a statistical estimation problem, which is then solved via multi-armed bandits. We apply this technique to the important problem of high-dimensional k-nearest neighbors. We show that it allows us to develop an algorithm that confers significant gains on real datasets over both exact computation (up to 100x in number of operations and 30x in wall-clock time) and state-of-the-art algorithms such as KGraph, NGT, and LSH. We provide theoretical guarantees and show that, under regularity assumptions, the complexity of this algorithm scales logarithmically with the dimension of the data rather than linearly as in exact computation.

### LassoNet: A Neural Network with Feature Sparsity
Lemhadri, Ismael*; Tibshirani, Rob; Ruan, Feng

Much work has been done recently to make neural networks more interpretable, and one obvious approach is to arrange for the network to use only a subset of the available features.
In linear models, Lasso (or L1-regularized) regression assigns zero weights to the most irrelevant or redundant features and is widely used in data science. However, the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach enforces a hierarchy: specifically, a feature can participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection directly with parameter learning. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. In systematic experiments, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.

### Protecting Against Image Translation Deepfakes by Leaking Universal Perturbations from Black-Box Neural Networks
Ruiz, Nataniel*; Bargal, Sarah; Sclaroff, Stan

In this work, we develop efficient disruptions of black-box image translation deepfake generation systems. We are the first to demonstrate black-box deepfake generation disruption by presenting image translation formulations of attacks initially proposed for classification models. However, a naive adaptation of classification black-box attacks results in a prohibitive number of queries for image translation systems in the real world. We present a frustratingly simple yet highly effective algorithm, Leaking Universal Perturbations (LUP), that significantly reduces the number of queries needed to attack an image.
LUP consists of two phases: (1) a short leaking phase, in which we attack the network using traditional black-box attacks and gather information on successful attacks on a small dataset, and (2) an exploitation phase, in which we leverage this information to attack the network with improved efficiency. Our attack reduces the total number of queries necessary to attack GANimation and StarGAN by 30%.

### What is being transferred in transfer learning?
Neyshabur, Behnam; Sedghi, Hanie*; Zhang, Chiyuan

One desired capability for machines is the ability to transfer their understanding of one domain to another domain where data is (usually) scarce. Despite the wide adoption of transfer learning in many deep learning applications, we do not yet understand what enables a successful transfer and which parts of the network are responsible for it. In this paper, we provide new tools and analyses to address these fundamental questions. Through a series of analyses on transferring to block-shuffled images, we separate the effect of feature reuse from that of learning high-level statistics of the data, and show that some benefit of transfer learning comes from the latter. We show that when training from pre-trained weights, the model stays in the same basin of the loss landscape, and different instances of such a model are similar in feature space and close in parameter space.

### VP-FO: A Variable Projection Method for Training Neural Networks
Sahiner, Arda*; Pauly, John; Pilanci, Mert

We propose a new optimization method for training neural networks for regression problems, built upon the success of the Variable Projection method for separable nonlinear least squares problems. This Variable Projection approach eliminates the final-layer weights of a network by observing that the optimal values of these weights can be solved for in closed form when the weights of the remaining layers are considered fixed.
We propose minimizing the Variable Projection loss with first-order optimization methods, which allows for scalability at any network depth and can easily be incorporated into existing neural network training pipelines. We extensively demonstrate the effectiveness of our approach for training neural networks, in both training time and performance, for applications such as image auto-encoders.

### Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization
Cao, Kaidi*; Chen, Yining; Lu, Junwei; Arechiga, Nikos; Gaidon, Adrien; Ma, Tengyu

Real-world large-scale datasets are heteroskedastic and imbalanced: labels have varying levels of uncertainty and label distributions are long-tailed. Heteroskedasticity and imbalance challenge deep learning algorithms due to the difficulty of distinguishing among mislabeled, ambiguous, and rare examples. Addressing heteroskedasticity and imbalance simultaneously is under-explored. We propose a data-dependent regularization technique for heteroskedastic datasets that regularizes different regions of the input space differently. Inspired by the theoretical derivation of the optimal regularization strength in a one-dimensional nonparametric classification setting, our approach adaptively regularizes the data points in higher-uncertainty, lower-density regions more heavily. We test our method on several benchmark tasks, including a real-world heteroskedastic and imbalanced dataset, WebVision. Our experiments corroborate our theory and demonstrate a significant improvement over other methods in noise-robust deep learning.

### siVAE: interpreting latent dimensions within variational autoencoders
Choi, Yongin*; Quon, Gerald

Interpretation of variational autoencoders (VAEs) to measure the contributions of input features to latent dimensions remains challenging, because feature contributions are implicit in the trained parameters and choice of architecture of the VAE.
Here we propose the scalable, interpretable variational autoencoder (siVAE), a Bayesian extension of VAEs that is interpretable by design: it learns feature embeddings that guide the interpretation of the sample embeddings, in a manner analogous to factor loadings in factor analysis. siVAE is as powerful and nearly as fast to train as the standard VAE, but achieves full interpretability of the latent dimensions, as well as of all hidden layers of the decoder. We also introduce a new interpretability measure, feature awareness, that captures which features each layer of the siVAE model focuses on reconstructing well for each input sample. Training siVAE on a dataset exceeding 1 million samples and 28,000 features is between 12 and 2,292 times faster than applying existing feature attribution methods to a trained VAE.

### Learning to grow: control of materials self-assembly using evolutionary reinforcement learning
Whitelam, Stephen*

We show that neural networks trained by evolutionary reinforcement learning can enact efficient molecular self-assembly protocols. Presented with molecular simulation trajectories, networks learn to change temperature and chemical potential in order to promote the assembly of desired structures or to choose between competing polymorphs. In the first case, networks reproduce in a qualitative sense the results of previously known protocols, but faster and with higher fidelity; in the second case, they identify previously unknown strategies from which we can extract physical insight. Networks that take as input the elapsed time of the simulation or microscopic information from the system are both effective, the latter more so. The evolutionary scheme we have used is simple to implement and can be applied to a broad range of experimental self-assembly settings, whether or not one can monitor the experiment as it proceeds.
Our results have been achieved with no human input beyond the specification of which order parameter to promote, pointing the way to the design of synthesis protocols by artificial intelligence.

### Neural Anisotropy Directions
Ortiz-Jimenez, Guillermo*; Modas, Apostolos; Moosavi-Dezfooli, Seyed-Mohsen; Frossard, Pascal

In this work, we analyze the role of the network architecture in shaping the inductive bias of deep classifiers. To that end, we start by focusing on a very simple problem, i.e., classifying a class of linearly separable distributions, and show that, depending on the direction of the discriminative feature of the distribution, many state-of-the-art deep convolutional neural networks (CNNs) have a surprisingly hard time solving this simple task. We then define neural anisotropy directions (NADs) as the vectors that encapsulate the directional inductive bias of an architecture. These vectors, which are specific to each architecture and hence act as a signature, encode the preference of a network to separate the input data based on some particular features. We provide an efficient method to identify NADs for several CNN architectures and thus reveal their directional inductive biases. Furthermore, we show that, for the CIFAR-10 dataset, NADs characterize the features used by CNNs to discriminate between different classes.

### Self-supervised Learning for Deep Models in Recommendations
Yao, Tiansheng*; Yi, Xinyang; Cheng, Zhiyuan; Hong, Lichan; Chi, Ed H.; Yu, Felix

Large-scale neural recommender models play a critical role in modern search and recommendation systems. With millions to billions of items to choose from, the quality of learned query and item representations is crucial to recommendation quality. Inspired by the recent success of self-supervised representation learning in both computer vision and natural language understanding, we propose a multi-task self-supervised learning (SSL) framework for sparse neural models in recommendations. Furthermore, we propose two highly generalizable SSL tasks within this framework: (i) Feature Masking (FM) and (ii) Feature Dropout (FD). We evaluate our framework on two large-scale datasets with ~500M and 1B training examples, respectively. Our results demonstrate that the proposed framework outperforms baseline models and state-of-the-art spread-out regularization techniques in the context of retrieval, with larger improvements when less supervision is available.

### Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization
Matsushima, Tatsuya*; Furuta, Hiroki; Matsuo, Yutaka; Nachum, Ofir; Gu, Shixiang

Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose the novel concept of deployment efficiency, measuring the number of distinct data-collection policies used during policy learning. We observe that naïvely applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We then propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), which achieves impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines.
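
The deployment-efficiency notion above admits a very direct illustration: it counts distinct data-collection policies, independent of how many environment steps each one gathers. A minimal sketch (the function name and episode counts below are hypothetical, not from the paper):

```python
def deployment_efficiency(policy_ids):
    """Number of distinct data-collection policies deployed during learning.

    Each element of `policy_ids` records which policy collected one batch of
    experience; the metric ignores how much data each policy gathered.
    """
    return len(set(policy_ids))

# A typical online RL run that updates its behavior policy every iteration:
online = deployment_efficiency(range(1000))
# A BREMEN-style run that reuses each deployed policy for many offline updates:
offline_batched = deployment_efficiency([0] * 400 + [1] * 400 + [2] * 200)

print(online, offline_batched)  # 1000 vs. 3 deployments for the same data volume
```

The point of the metric is that both runs may consume the same number of environment steps, yet the second requires only three costly deployments.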
### Learning Multi-granular Quantized Embeddings for Large-Vocab Categorical Features in Recommender Systems
Kang, Wangcheng*; Cheng, Zhiyuan; Chen, Ting; Yi, Xinyang; Lin, Dong; Hong, Lichan; Chi, Ed H.

Recommender system models often represent various sparse features like users, items, and categorical features via one-hot embeddings. However, a large vocabulary inevitably leads to a gigantic embedding table, creating two severe problems: (i) making model serving intractable in resource-constrained environments; (ii) causing overfitting. We seek to learn highly compact embeddings for large-vocab sparse features in recommender systems (recsys). First, we show that the novel Differentiable Product Quantization (DPQ) approach can generalize to recsys problems. In addition, to better handle the power-law data distributions commonly seen in recsys, we propose Multi-Granular Quantized Embeddings (MGQE), a technique that learns more compact embeddings for infrequent items. Preliminary experiments on three recommendation tasks and two datasets show that we can achieve on-par or better performance with only ~20% of the original model size.

### Temperature check: theory and practice for training models with softmax-cross-entropy losses
Agarwala, Atish*; Schoenholz, Samuel S; Pennington, Jeffrey; Dauphin, Yann

The softmax-cross-entropy loss function is a principled way of modeling probability distributions that has become ubiquitous in deep learning. While its lone hyperparameter, the temperature, is commonly set to one or regarded as a way to tune the model's confidence after training, less is known about how the temperature impacts training dynamics or generalization performance. In this work, we develop a theory of early learning for models trained with the softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse temperature $\beta$ as well as the initial training-set logit magnitude $\|\beta z\|_F$. Empirically, we find that generalization performance depends strongly on the temperature even though the model's final confidence does not. It follows that treating $\beta$ as a tunable hyperparameter is key to maximizing model performance, which we demonstrate by showing that optimizing $\beta$ increases the performance of ResNet-50 trained on ImageNet. Together, these results underscore the importance of tuning the softmax temperature and provide qualitative guidance for performing this tuning.

### Autofocused oracles for design
Fannjiang, Clara*; Listgarten, Jennifer

Data-driven design is making headway into a number of application areas, including protein, small-molecule, and materials engineering. The design goal is to construct an object with desired properties, such as a material that exhibits superconductivity at higher temperatures than ever previously observed. To that end, costly experimental measurements are being replaced with calls to a high-capacity regression model trained on labeled data, which can be leveraged in an in silico search for promising design candidates. However, the design goal necessitates moving into regions of the input space beyond where such models were trained. One can therefore ask: should the regression model be altered as the design algorithm explores the input space, in the absence of new data acquisition? Herein, we answer this question in the affirmative. In particular, we (i) formalize the data-driven design problem as a non-zero-sum game, (ii) leverage this formalism to develop a strategy for retraining the regression model as the design algorithm proceeds, which we refer to as autofocusing the model, and (iii) demonstrate the promise of autofocusing empirically. A full paper detailing our work can be found at: https://arxiv.org/abs/2006.08052.
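
One way to read the autofocusing idea is as importance-weighted retraining: as the search distribution drifts away from the training distribution, the oracle is refit with weights proportional to their density ratio. The sketch below is illustrative only; the Gaussian densities, polynomial oracle, and all names are assumptions, not the paper's actual procedure:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def autofocused_refit(x, y, mu_search, sigma_search, mu_data, sigma_data, deg=2):
    """Refit a polynomial oracle with importance weights p_search(x) / p_data(x),
    concentrating model capacity on the region the design search now occupies."""
    w = gaussian_pdf(x, mu_search, sigma_search) / gaussian_pdf(x, mu_data, sigma_data)
    # np.polyfit multiplies residuals by the given weights before squaring,
    # so pass sqrt(w) to minimize the importance-weighted squared error.
    return np.polyfit(x, y, deg=deg, w=np.sqrt(w))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)   # labeled training inputs near the origin
y = 2.0 * x**2 + 1.0                 # noiseless ground-truth property
# The design search has moved toward x ~ 2; refit the oracle accordingly.
coef = autofocused_refit(x, y, mu_search=2.0, sigma_search=1.0,
                         mu_data=0.0, sigma_data=1.0)
```

Each time the search distribution moves, the weights change and the oracle is refit, which is the alternation the non-zero-sum-game formalism above makes precise.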
### Provably Efficient Policy Optimization via Thompson Sampling
Ishfaq, Haque*; Yang, Zhuoran; Lupu, Andrei; Islam, Riashat; Liu, Lewis; Precup, Doina; Wang, Zhaoran

Policy optimization (PO) methods with function approximation are among the most popular classes of reinforcement learning (RL) algorithms. Despite their popularity, it largely remains a challenge to design provably efficient policy optimization algorithms. In particular, it remains elusive how to design a provably efficient policy optimization algorithm using a Thompson sampling (Thompson, 1933) based exploration strategy. This paper presents a provably efficient policy optimization algorithm that incorporates exploration via Thompson sampling. We prove that, in an episodic linear MDP setting, our algorithm, Thompson Sampling for Policy Optimization (TSPO), achieves $\tilde{\mathcal{O}}(d^{3/2} H^{2} \sqrt{T})$ worst-case (frequentist) regret, where $H$ is the length of each episode, $T$ is the total number of steps, and $d$ is the number of features. Finally, we empirically evaluate TSPO and show that it is competitive with state-of-the-art baselines.

### Robustness Analysis of Deep Learning via Implicit Models
Tsai, Alicia Y.*; El Ghaoui, Laurent

Despite the success of deep neural networks (DNNs), it is well known that they can fail significantly in the presence of adversarial perturbations. Starting with Szegedy et al., a large number of works have demonstrated that state-of-the-art DNNs are vulnerable to adversarial samples. The vulnerability of DNNs has motivated the study of building models that are robust to such perturbations; however, many defense strategies have later been shown to be ineffective. Although a large number of research works have been devoted to improving the robustness of DNNs and to our understanding of their behavior, many fundamental questions about their vulnerabilities remain unanswered. In this work, we introduce the implicit model and formalize its well-posedness properties theoretically. We analyze the robustness of DNNs through the lens of the implicit model and define its sensitivity matrix, which relates perturbations in inputs to those in outputs. Empirically, we show how the sensitivity matrix can be used to generate effective adversarial attacks on the MNIST and CIFAR-10 datasets.

### Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration
Dai, Hanjun*; Singh, Rishabh; Dai, Bo; Sutton, Charles; Schuurmans, Dale

Discrete structures play an important role in applications like programming language modeling and software engineering. Current approaches to predicting complex structures typically consider autoregressive models for their tractability, at some sacrifice in flexibility. Energy-based models (EBMs), on the other hand, offer a more flexible and thus more powerful approach to modeling such distributions, but require partition function estimation. In this paper, we propose ALOE, a new algorithm for learning conditional and unconditional EBMs for discrete structured data, where parameter gradients are estimated using a learned sampler that mimics local search. We show that the energy function and sampler can be trained efficiently via a new variational form of power iteration, achieving a better trade-off between flexibility and tractability. Experimentally, we show that learning local search leads to significant improvements in challenging application domains. Most notably, we present an energy-model-guided fuzzer for software testing that achieves performance comparable to well-engineered fuzzing engines like libFuzzer.

### Distributed Sketching Methods for Privacy Preserving Regression
Bartan, Burak*; Pilanci, Mert

In this work, we study distributed sketching methods for large-scale regression problems.
We leverage multiple randomized sketches to reduce the problem dimensions, as well as to preserve privacy and improve straggler resilience in asynchronous distributed systems. We derive novel approximation guarantees for classical sketching methods and analyze the accuracy of parameter averaging for distributed sketches. We consider random matrices including Gaussian, randomized Hadamard, uniform sampling, and leverage score sampling in the distributed setting. Moreover, we propose a hybrid approach combining sampling and fast random projections for better computational efficiency. We illustrate the performance of distributed sketches in a serverless computing platform with large-scale experiments.

### Exact posteriors of wide Bayesian neural networks
Hron, Jiri*; Bahri, Yasaman; Novak, Roman; Pennington, Jeffrey; Sohl-Dickstein, Jascha

Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) if the width of all layers is large. However, many BNN applications are concerned with the BNN function-space posterior. While some empirical evidence of posterior convergence was provided in the original works of Neal (1996) and Matthews et al. (2018), it is limited to small datasets or architectures due to the notorious difficulty of obtaining and verifying the exactness of BNN posterior approximations. We provide the missing proof that the exact BNN posterior converges (weakly) to the one induced by the GP limit of the prior. For empirical validation, we generate samples from the exact finite BNN posterior on a small dataset via rejection sampling.

### Selectivity considered harmful: evaluating the causal impact of class selectivity in DNNs
Leavitt, Matthew L*; Morcos, Ari S

The properties of individual neurons are often analyzed in order to understand the biological and artificial neural networks in which they're embedded. Class selectivity, typically defined as how different a neuron's responses are across different classes of stimuli or data samples, is commonly used for this purpose. However, it remains an open question whether it is necessary and/or sufficient for deep neural networks (DNNs) to learn class selectivity in individual units. We investigated the causal impact of class selectivity on network function by directly regularizing for or against class selectivity. Using this regularizer to reduce class selectivity across units in convolutional neural networks increased test accuracy by over 2% for ResNet-18 trained on Tiny ImageNet. For ResNet-20 trained on CIFAR-10, we could reduce class selectivity by a factor of 2.5 with no impact on test accuracy, and reduce it nearly to zero with only a small (~2%) drop in test accuracy. In contrast, regularizing to increase class selectivity had rapid and disastrous effects on test accuracy across all models and datasets. These results indicate that class selectivity in individual units is neither sufficient nor strictly necessary, and can even impair DNN performance. They also encourage caution when treating the properties of single units as representative of the mechanisms by which DNNs function.

### GANs for Continuous Path Keyboard Input Modeling
Mehra, Akash*; Bellegarda, Jerome; Bapat, Ojas; Lal, Partha; Wang, Xin

Continuous path keyboard input has higher inherent ambiguity than standard tapping, because the path trace may exhibit not only local overshoots/undershoots (as in tapping) but also, depending on the user, substantial mid-path excursions. Deploying a robust solution thus requires a large amount of high-quality training data, which is difficult to collect and annotate. In this work, we address this challenge by using GANs to augment our training corpus with user-realistic synthetic data.
Experiments show that, even though GAN-generated data does not capture all the characteristics of real user data, it still provides a substantial boost in accuracy at a 5:1 GAN-to-real ratio. GANs therefore inject more robustness into the model through greatly increased word coverage and path diversity.

### Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning
Khadka, Shauharda; Guez Aflalo, Estelle; Marder, Mattias; Ben-David, Avrech; Miret, Santiago; Tang, Hanlin; Mannor, Shie; Hazan, Tamir; Majumdar, Somdeb*

As modern neural networks have grown to billions of parameters, meeting tight latency budgets has become increasingly challenging. Solutions like compression and pruning modify the underlying network. We present Evolutionary Graph RL (EGRL), a complementary approach that optimizes how tensors are mapped to on-chip memory while keeping the network untouched. Since different memory components trade off capacity for bandwidth differently, a sub-optimal mapping can result in high latency. We train and validate EGRL directly on the Intel NNP-I chip for inference. EGRL outperforms policy-gradient, evolutionary-search, and dynamic-programming baselines on ResNet-50, ResNet-101, and BERT, achieving a 28-78% speed-up over the native compiler.

### Interpretable Planning-Aware Representations for Multi-Agent Trajectory Forecasting
Ivanovic, Boris*; Elhafsi, Amine; Rosman, Guy; Gaidon, Adrien; Pavone, Marco

Reasoning about human motion is an important prerequisite to safe and socially aware robotic navigation. As a result, multi-agent behavior prediction has become a core component of modern human-robot interactive systems, such as self-driving cars. In particular, one of the main uses of behavior prediction in autonomous systems is to inform ego-robot motion planning and control. A unifying theme among most human motion prediction approaches is that they produce trajectories (or distributions thereof) for each agent in a scene, an intuitive output representation that matches common evaluation metrics. However, a majority of planning and control algorithms reason about system dynamics rather than future agent tracklets, which can hinder their integration. To this end, we investigate Mixtures of Linear Time-Varying Systems as an output representation for trajectory forecasting that is more amenable to downstream planning and control use. Our approach leverages successful ideas from prior probabilistic trajectory forecasting work to learn dynamical system representations that are well studied in the planning and control literature. We consider an intuitive two-agent interaction scenario to illustrate how our method works, and we motivate further evaluation on large-scale autonomous driving datasets as well as real-world hardware.

### Architecture Compression
Ashok, Anubhav*

In this paper, we propose a novel approach to model compression termed Architecture Compression. Instead of operating on the weight or filter space of the network like classical model compression methods, our approach operates on the architecture space. A 1-D CNN encoder-decoder is trained to learn a mapping from the discrete architecture space to a continuous embedding and back. Additionally, this embedding is jointly trained to regress accuracy and parameter count in order to incorporate information about the architecture's effectiveness on the dataset. During the compression phase, we first encode the network and then perform gradient descent in the continuous space to optimize a compression objective function that maximizes accuracy and minimizes parameter count. The final continuous feature is then mapped to a discrete architecture using the decoder.
We demonstrate the merits of this approach on visual recognition tasks such as CIFAR-10, CIFAR-100, Fashion-MNIST and SVHN and achieve a greater than 20x compression on CIFAR-10. Simultaneous Learning of the Inputs and Parameters in Neural Collaborative Filtering Raziperchikolaei, Ramin*; Li, Tianyu; Chung, Young Joo User and item representations have a significant impact on the prediction performance of neural network-based collaborative filtering systems. Previous works fix the input to the user/item interaction vectors and/or IDs and train neural networks to learn the representations. We argue that this strategy adversely affects the quality of the representations since the similarities in the users’ tastes might not be reflected in the input space. We show that there is an implicit embedding matrix in the first fully connected layer which takes the user/item interaction vectors as the input. The role of the non-zero elements of the input vectors is to choose and combine a subset of the embedding vectors. To learn better representations, instead of fixing the input and only relying on neural network structure, we propose to learn the value of the non-zero elements of the input jointly with the neural network parameters. Our experiments on two movielens datasets and two real-world datasets show that our method outperforms the state-of-the-art methods. Neural Representations in Hybrid Recommender Systems: Prediction versus Regularization Raziperchikolaei, Ramin*; Li, Tianyu; Chung, Young Joo Autoencoder-based hybrid recommender systems have become popular recently because of their ability to learn user and item representations by reconstructing various information sources, including users' feedback on items (e.g., ratings) and side information of users and items (e.g., users' occupation and items' title). 
However, existing systems still use representations learned by matrix factorization (MF) to predict the rating, while using representations learned by neural networks as the regularizer. In our work, we define the neural representation for prediction (NRP) framework and apply it to the autoencoder-based recommendation systems. We theoretically analyze how our objective function is related to the previous MF and autoencoder-based methods and explain what it means to use neural representations as the regularizer. We also apply the NRP framework to a direct neural network structure which predicts the ratings without reconstructing the user and item information. Our experimental results confirm that neural representations are better for prediction than regularization and show that the NRP framework, combined with the direct neural network structure, outperforms the state-of-the-art methods in the prediction task. Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics Ramasesh, Vinay V*; Dyer, Ethan; Raghu, Maithra Catastrophic forgetting is a central obstacle to continual learning. Many methods have been proposed to overcome this problem, but fully mitigating forgetting is likely hindered by a lack of understanding of the phenomenon's fundamental properties. For example, how does catastrophic forgetting affect the hidden representations of neural networks? Are there underlying principles common to methods that mitigate forgetting? How is catastrophic forgetting affected by (semantic) similarities between sequential tasks? And what are good benchmark tasks that capture the essence of how catastrophic forgetting naturally arises in practice? This paper begins to provide answers to these and other questions. Meta-Learning Requires Meta-Augmentation Rajendran, Janarthanan*; Irpan, Alex; Jang, Eric In several areas of machine learning, data augmentation is critical to achieving state-of-the-art generalization performance. 
Examples include computer vision, speech recognition, and natural language processing. It is natural to suspect that data augmentation can play an equally important role in helping meta-learners. In this work, we present a unified framework for meta-data augmentation and an information-theoretic view on how it prevents overfitting. Under this framework, we interpret existing augmentation strategies and propose modifications to handle overfitting. We show the importance of meta-augmentation on current benchmarks and meta-learning algorithms and demonstrate that meta-augmentation produces large complementary benefits to recently proposed meta-regularization techniques. Automated Utterance Generation Parikh, Soham*; Tiwari, Mitul; Vohra, Quaizar Conversational AI assistants are becoming popular and question-answering is an important part of any conversational assistant. Using relevant utterances as features in question-answering has been shown to improve both the precision and recall for retrieving the right answer by a conversational assistant. Hence, utterance generation has become an important problem with the goal of generating relevant utterances (sentences or phrases) from a knowledge base article that consists of a title and a description. However, generating good utterances usually requires a lot of manual effort, creating the need for automated utterance generation. In this paper, we propose an utterance generation system which 1) uses extractive summarization to extract important sentences from the description, 2) uses multiple paraphrasing techniques to generate a diverse set of paraphrases of the title and summary sentences, and 3) selects good candidate paraphrases with the help of a novel candidate selection algorithm. Uncovering Task Clusters in Multi-Task Reinforcement Learning Kumar, Varun; Rakelly, Kate; Majumdar, Somdeb* Multi-task learning refers to the approach of learning several distinct tasks using a shared representation. 
Such a strategy can be beneficial if the tasks share common structure: in this case, training a model on each individual task would be unnecessarily inefficient, as it would involve learning the common structure repeatedly. By contrast, a shared representation only needs to learn the structure a single time, following which it can be transferred to other tasks. This approach, when paired with deep neural networks, has proven effective in domains such as computer vision and natural language processing. Results in reinforcement learning, however, have been mixed, with multi-task reinforcement learning sometimes proving to be less sample efficient than independent single-task models. We investigate multi-task reinforcement learning in a recently published benchmark, Meta-World MT10. We suggest a method to reduce conflicts in multi-task reinforcement learning by dividing the task space into clusters of related tasks, and show that this method results in improved performance compared to prior work. ECLIPSE: An Extreme-Scale Linear Program Solver for Web-Applications Basu, Kinjal; Ghoting, Amol; Pan, Yao*; Keerthi, S. Sathiya; Mazumder, Rahul Web applications (involving many millions of users and items) based on machine learning often involve global constraints (e.g., budget limits of advertisers) that need to be satisfied during deployment (inference). This problem can usually be formulated as a Linear Program (LP) involving billions to trillions of decision variables and constraints. Despite the appeal of an LP formulation, solving problems at such scales is well beyond the capabilities of existing LP solvers. Often, ad-hoc decomposition rules are used to approximately solve these LPs, which have limited optimality guarantees and lead to sub-optimal performance in practice. In this work, we propose a distributed solver that solves these LPs at scale. We propose a gradient-based algorithm on the smoothed dual of the LP with computational guarantees. 
The main workhorses of our algorithm are distributed matrix-vector multiplications (with load balancing) and efficient projection operations on distributed machines. Experiments on real-world data show that our proposed LP solver, ECLIPSE, can even solve problems with 10^12 decision variables within a few hours, well beyond the capabilities of current generic LP solvers. Deep Ensembles: a loss landscape perspective Hu, Huiyi*; Fort, Stanislav; Lakshminarayanan, Balaji Deep ensembles have been empirically shown to be a promising approach for improving accuracy, uncertainty and out-of-distribution robustness of deep learning models. While deep ensembles were theoretically motivated by the bootstrap, non-bootstrap ensembles trained with just random initialization also perform well in practice, which suggests that there could be other explanations for why deep ensembles work well. Bayesian neural networks, which learn distributions over the parameters of the network, are theoretically well-motivated by Bayesian principles, but do not perform as well as deep ensembles in practice, particularly under dataset shift. One possible explanation for this gap between theory and practice is that popular scalable variational Bayesian methods tend to focus on a single mode, whereas deep ensembles tend to explore diverse modes in function space. We investigate this hypothesis by building on recent work on understanding the loss landscape of neural networks and adding our own exploration to measure the similarity of functions in the space of predictions. Our results show that random initializations explore entirely different modes, while functions along an optimization trajectory, or sampled from the subspace thereof, cluster within a single mode in prediction space even as they often deviate significantly in weight space. 
Developing the concept of the diversity-accuracy plane, we show that the decorrelation power of random initializations is unmatched by popular subspace sampling methods. Finally, we evaluate the relative effects of ensembling, subspace-based methods and ensembles of subspace-based methods, and the experimental results validate our hypothesis. CoCon: Cooperative-Contrastive Learning Rai, Nishant*; Adeli, Ehsan; Lee, Kuan-Hui; Gaidon, Adrien; Niebles, Juan Carlos Labeling videos at scale is impractical. Consequently, self-supervised visual representation learning is key for efficient video analysis. Recent success in learning image representations suggests that contrastive learning is a promising framework to tackle this challenge. However, when applied to real-world videos, contrastive learning may unknowingly lead to separation of instances that contain semantically similar events. In our work, we introduce a cooperative variant of contrastive learning to address this issue. We use data-driven sampling to leverage implicit relationships between multiple input video views, whether observed (e.g. RGB) or inferred (e.g. flow, segmentation masks, poses). We experimentally evaluate our representations on the downstream task of action recognition. Our method sets a new state of the art on standard benchmarks (UCF101, HMDB51, Kinetics400). Furthermore, qualitative experiments illustrate that our models can capture higher-order class relationships. We will release code and models. Curriculum and Decentralized Learning in Google Research Football Domitrz, Witalis; Mikula, Maciej; Opała, Zuzanna*; Pacek, Mikołaj; Rychlicki, Mateusz; Sieniawski, Mateusz; Staniszewski, Konrad; Michalewski, Henryk; Miłoś, Piotr; Osiński, Błażej B We study various curricula in the game of football (aka soccer). We aim to understand various methodological and architectural choices. 
Concentrating on football has the advantage of interpretability, but we believe that observations with regard to decentralized learning are applicable more generally. Learning Mixed-Integer Convex Optimization Strategies for Robot Planning and Control Cauligi, Abhishek*; Culbertson, Preston; Stellato, Bartolomeo; Schwager, Mac; Pavone, Marco Mixed-integer convex programming (MICP) has seen significant algorithmic and hardware improvements, with solve-time speedups of several orders of magnitude compared to 25 years ago. Despite these advances, MICP has rarely been applied to real-world robotic control because the solution times are still too slow for online applications. In this work, we extend the machine learning optimizer (MLOPT) framework to solve MICPs arising in robotics at very high speed. MLOPT encodes the combinatorial part of the optimal solution into a strategy. Using data collected from offline problem solutions, we train a multiclass classifier to predict the optimal strategy given problem-specific parameters such as states or obstacles. Compared to existing approaches, we use task-specific strategies and prune redundant ones to significantly reduce the number of classes the predictor has to select from, thereby greatly improving scalability. Given the predicted strategy, the control task becomes a small convex optimization problem that we can solve in milliseconds. Numerical experiments on a free-flying space robot and task-oriented grasps show that our method provides not only 1 to 2 orders of magnitude speedups compared to state-of-the-art solvers but also performance close to the globally optimal MICP solution. Entity Skeletons as Intermediate Representations for Visual Storytelling Chandu, Khyathi Raghavi* We are enveloped by stories of visual interpretations in our everyday lives. Story narration often comprises two stages: forming a central mind map of entities and weaving a story around them. 
In this paper, we address these two stages of introducing the right entities at seemingly reasonable junctures and referring to them coherently in the context of visual storytelling. The building blocks of this, also known as the entity skeleton, are entity chains including nominal and coreference expressions. We establish a strong baseline for skeleton-informed generation and propose a glocal hierarchical attention model that attends to the skeleton both at the sentence (local) and the story (global) levels. We observe that our proposed models outperform the baseline in terms of the automatic evaluation metric METEOR. We also conduct a human evaluation, which concludes that visual stories generated by our model are preferred 82% of the time. Exact Polynomial-time Convex Optimization Formulations for Two-Layer ReLU Networks Pilanci, Mert; Ergen, Tolga* We develop exact representations of two-layer neural networks with rectified linear units in terms of a single convex program with a number of variables polynomial in the number of training samples and the number of hidden neurons. Active Online Domain Adaptation Chen, Yining*; Luo, Haipeng; Ma, Tengyu; Zhang, Chicheng Online machine learning systems need to adapt to domain shifts. Meanwhile, acquiring labels at every timestep is expensive. We propose a surprisingly simple algorithm that adaptively balances its regret and its number of label queries in settings where the data streams are from a mixture of hidden domains. For online linear regression with oblivious adversaries, we provide a tight tradeoff that depends on the durations and dimensionalities of the hidden domains. Our algorithm can adaptively deal with interleaving spans of inputs from different domains. We also generalize our results to non-linear regression for hypothesis classes with bounded eluder dimension and adaptive adversaries. 
Experiments on synthetic and realistic datasets demonstrate that our algorithm achieves lower regret than uniform queries and greedy queries with an equal labeling budget. DisARM: An Antithetic Gradient Estimator for Binary Latent Variables Dong, Zhe*; Mnih, Andriy; Tucker, George Training models with discrete latent variables is challenging due to the difficulty of estimating the gradients accurately. Much of the recent progress has been achieved by taking advantage of continuous relaxations of the system, which are not always available or even possible. The Augment-REINFORCE-Merge (ARM) estimator (Yin and Zhou, 2019) provides an alternative that, instead of relaxation, uses continuous augmentation. Applying antithetic sampling over the augmenting variables yields a relatively low-variance and unbiased estimator applicable to any model with binary latent variables. However, while antithetic sampling reduces variance, the augmentation process increases variance. We show that ARM can be improved by analytically integrating out the randomness introduced by the augmentation process, guaranteeing substantial variance reduction. Our estimator, DisARM, is simple to implement and has the same computational cost as ARM. We evaluate DisARM on several generative modeling benchmarks and show that it consistently outperforms ARM and a strong independent sample baseline in terms of both variance and log-likelihood. Boosted Sparse Oblique Decision Trees Gabidolla, Magzhan*; Zharmagambetov, Arman S; Carreira-Perpinan, Miguel A Boosted decision trees are widely used machine learning algorithms, achieving state-of-the-art performance in many domains with little effort on hyperparameter tuning. Though much work on boosting has focused on the theoretical properties and empirical variations, there has been little progress on the tree learning procedure itself. 
To this day, boosting algorithms employ regular axis-aligned trees as base learners optimized by CART-style greedy top-down induction. These trees are known to be highly suboptimal due to their greedy nature, and they are not well-suited to model the correlation of features due to their axis-aligned partition. In fact, these suboptimality characteristics are commonly believed to be beneficial because of the weak learning criterion in boosting. In this work we consider boosting better-optimized sparse oblique decision trees trained with the recently proposed Tree Alternating Optimization (TAO). TAO generally finds much better approximate optima than CART-type algorithms due to the ability to monotonically decrease a desired objective function over a decision tree. Our extensive experimental results demonstrate that boosted sparse oblique TAO trees improve upon CART trees by a large margin, and achieve better test error than other popular tree ensembles such as gradient boosting (XGBoost) and random forests. Moreover, the resulting TAO ensembles require a far smaller number of trees. A flexible, extensible software framework for model compression based on the LC algorithm Idelbayev, Yerlan*; Carreira-Perpinan, Miguel A We propose a software framework based on the ideas of the Learning-Compression (LC) algorithm that allows a user to compress a neural network or other machine learning model using different compression schemes with minimal effort. Currently, the supported compressions include pruning, quantization, low-rank methods (including automatically learning the layer ranks), and combinations of those, and the user can choose different compression types for different parts of a neural network. 
The library is written in Python and PyTorch and is available online at https://github.com/UCMerced-ML/LC-model-compression Safety Aware Reinforcement Learning (SARL) Miret, Santiago*; Wainwright, Carroll; Majumdar, Somdeb As reinforcement learning agents become more and more integrated into complex, real-world environments, designing for safety becomes more and more important. We specifically focus on scenarios where the agent can cause undesired side effects that may be linked with performing the primary task. The interdependence of side effects with the primary task makes it difficult to define hard constraints for the agent without sacrificing task performance. In order to address this challenge, we propose a novel virtual-agent-embedded co-training framework (SARL). SARL includes a primary reward-based actor and a virtual agent that assesses side-effect impacts and influences the behavior of the reward-based actor via loss regularization. The actor loss is regularized with a proper distance metric measuring the difference in action probabilities of both agents. As such, in addition to optimizing for the task objective, the actor also aims to minimize the disagreement between itself and the safety agent. We apply SARL to tasks and environments in the SafeLife suite, which can generate complex tasks in dynamic environments, and construct performance vs. side-effect Pareto fronts. Preliminary results indicate that SARL is competitive with a reward-based penalty method, which punishes side effects directly in the reward function, while also providing zero-shot generalization of the safety agent across different environments. This zero-shot generalization suggests that through SARL we can obtain a more flexible notion of side effects that is useful for a variety of settings. 
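The SARL actor objective described above (task loss plus a distance metric between the actor's and the safety agent's action distributions) can be sketched in a few lines. The KL divergence and the trade-off coefficient `beta` are illustrative assumptions; the abstract only specifies "a proper distance metric":

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def sarl_actor_loss(task_loss, actor_probs, safety_probs, beta=0.1):
    """Task objective regularized by disagreement with the safety agent.

    beta trades off task performance against side-effect avoidance
    (hypothetical coefficient; the abstract does not name one).
    """
    return task_loss + beta * kl_divergence(actor_probs, safety_probs)
```

When the two agents agree, the regularizer vanishes and only the task loss remains; growing disagreement inflates the loss and pulls the actor back toward safer behavior.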
Hamming Space Locality Preserving Neural Hashing for Similarity Search Idelson, Daphna* We propose a novel method for learning to map a large-scale dataset in the feature representation space to binary hash codes in the Hamming space, for fast and efficient approximate nearest-neighbor similarity search. Our method is composed of a simple neural network and a novel training scheme that aims to preserve the locality relations between the original data points. We achieve distance preservation of the original cosine space upon the new Hamming space by introducing a loss function that translates the relational similarities in both spaces to probability distributions and optimizes the KL divergence between them. We also introduce a simple data sampling method by representing the database with randomly generated proxies, used as reference points to query points from the training set. Experimenting with three publicly available standard ANN benchmarks, we demonstrate significant improvement over other binary hashing methods, achieving an improvement of between 7% and 17%. As opposed to other methods, we show high performance in both low (64-bit) and high (768-bit) dimensional representations, offering increased accuracy when resources are available and flexibility in choice of ANN strategy. What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation Feldman, Vitaly*; Zhang, Chiyuan Deep learning algorithms are well-known to have a propensity for fitting the training data very well and often fit even outliers and mislabeled data points. Such fitting requires memorization of training data labels, a phenomenon that has attracted significant research interest but has not been given a compelling explanation so far. A recent work of Feldman (2020) proposes a theoretical explanation for this phenomenon based on a combination of two insights. 
First, natural image and data distributions are (informally) known to be long-tailed, that is, have a significant fraction of rare and atypical examples. Second, in a simple theoretical model such memorization is necessary for achieving close-to-optimal generalization error when the data distribution is long-tailed. However, no direct empirical evidence for this explanation, or even an approach for obtaining such evidence, was given. In this work we design experiments to test the key ideas in this theory. The experiments require estimation of the influence of each training example on the accuracy at each test example as well as memorization values of training examples. Estimating these quantities directly is computationally prohibitive, but we show that closely-related subsampled influence and memorization values can be estimated much more efficiently. Our experiments demonstrate the significant benefits of memorization for generalization on several standard benchmarks. They also provide quantitative and visually compelling evidence for the theory put forth in Feldman (2020). Can Neural Networks Learn Non-Verbal Reasoning? Zhang, Chiyuan*; Raghu, Maithra; Bengio, Samy Neural networks have demonstrated excellent capabilities in learning generalizable pattern-matching: the ability to identify simple properties of the training data and utilize these properties to correctly process unseen (test) instances. These results raise fundamental questions on the reasoning capabilities of neural networks and how they generalize. Can neural networks learn more sophisticated reasoning? Are there insights on how they generalize in pattern matching and sophisticated reasoning settings? In this paper, we introduce a visual reasoning task to help investigate these questions. Learning to reason by learning on rationales Piękos, Piotr*; Michalewski, Henryk; Malinowski, Mateusz For centuries humans have been codifying observed natural or social phenomena in some abstract language. 
Such a language, which we call mathematics, is at the core of not only modern science but also everyday activity. In this work, we look into the basic algebraic formulations used to solve some real concrete problems, such as "how much money do I need to spend to buy 2 apples, knowing each costs 2 pounds". We teach how to solve such math word problems very early and universally in our education system. The teacher asks a question about some real problem and expects not only answers but also an understanding of the rationale behind them: consecutive precise steps that lead to the answer. In this work we are motivated by the same learning process and incorporate rationales during training of a language model. We also show that through learning to understand the order of steps in rationales, we can improve the overall performance of our model. Modality-Agnostic Attention Fusion for visual search with text feedback Dodds, Eric M*; Culpepper, Jack; Herdade, Simao; Zhang, Yang; Boakye, Kofi A Image retrieval with natural language feedback offers the promise of catalog search based on fine-grained visual features that go beyond objects and binary attributes, facilitating real-world applications such as e-commerce. Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two datasets for visual search with modifying phrases, Fashion IQ and CSS. We also introduce two new challenging benchmarks adapted from Birds-to-Words and Spot-the-Diff, which provide new settings with rich language inputs, and we show that our approach without modification outperforms strong baselines. To better understand our model, we conduct detailed ablations on Fashion IQ and provide visualizations of the surprising phenomenon of words "avoiding attending" to the image region they refer to. 
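The core idea of modality-agnostic fusion, as in the MAAF abstract above, is to stack image-region and word tokens into one sequence and attend over them jointly, with no modality-specific branches. The following minimal sketch uses a single scaled dot-product self-attention pass and mean pooling; the dimensions and pooling choice are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_agnostic_fusion(image_tokens, text_tokens):
    """Fuse image-region and word embeddings with one joint attention pass.

    Both modalities are stacked into a single token sequence and attended
    together, so the attention mechanism never distinguishes modalities.
    """
    tokens = np.vstack([image_tokens, text_tokens])   # (n_img + n_txt, d)
    d = tokens.shape[1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d))    # joint attention weights
    fused = attn @ tokens                             # attended token features
    return fused.mean(axis=0)                         # pooled query embedding
```

The pooled embedding can then be compared against catalog-item embeddings for retrieval.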
MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records Xu, Zhen*; So, David; Dai, Andrew M Deep learning models trained on electronic health records (EHR) have demonstrated their potential to improve healthcare quality in a variety of areas, such as predicting diagnoses, reducing healthcare costs and personalizing medicine. However, most model architectures that are commonly employed were originally developed for academic unimodal machine learning datasets, such as ImageNet or WMT. In contrast, EHR data is multimodal, containing sparse and irregular longitudinal features with a mix of structured and unstructured data. Such complex data often requires specific modeling for each modality and a good strategy to fuse different representations to reach peak performance. To address this, we propose MUltimodal Fusion Architecture SeArch (MUFASA), the first multimodal Neural Architecture Search (NAS) method for EHR data. Specifically, we reformulate the NAS objective to simultaneously search for several architectures, jointly optimizing multimodal fusion strategies and per-modality model architectures. We demonstrate empirically that our MUFASA method outperforms established unimodal evolutionary NAS on Medical Information Mart for Intensive Care (MIMIC-III) EHR data with comparable computation costs. What's more, our experimental results show that MUFASA produces models that outperform the Transformer, and its NAS variant, the Evolved Transformer, on public EHR data. Compared with these baselines on MIMIC CCS diagnosis code prediction, our discovered models improve top-5 recall from 0.88 to 0.91, and demonstrate the ability to generalize to other EHR tasks. Studying our top architecture in depth, we provide empirical evidence that MUFASA's improvements are derived from its ability to optimize custom modeling for varying input modalities and find effective fusion strategies. 
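The joint search objective above, sampling per-modality architectures together with a fusion strategy, can be illustrated with a toy random-search loop. The search space, candidate names, and scoring function here are all hypothetical stand-ins; MUFASA itself uses an evolutionary search over real EHR models:

```python
import random

# toy joint search space: one architecture choice per modality plus a fusion strategy
SEARCH_SPACE = {
    "structured": ["mlp-small", "mlp-large"],
    "notes": ["lstm", "transformer"],
    "fusion": ["early-concat", "late-sum", "attention"],
}

def score(candidate):
    """Stand-in for validation performance; a real NAS run would train
    the candidate model on EHR data and measure, e.g., top-5 recall."""
    r = random.Random(str(sorted(candidate.items())))
    return r.random()

def joint_search(n_trials=20, seed=0):
    """Sample fusion strategy and per-modality architectures jointly,
    keeping the best-scoring combination seen so far."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        candidate = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score
```

The point of the joint formulation is that the fusion choice and the per-modality choices are evaluated as one candidate, so interactions between them influence the search.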
VirAAL: Virtual Adversarial Active Learning Senay, Gregory*; Youbi Idrissi, Badr; Haziza, Marine This paper presents VirAAL, an Active Learning framework based on Virtual Adversarial Training (VAT), a semi-supervised approach that regularizes the model through Local Distributional Smoothness (LDS). VirAAL aims to reduce the effort of annotation in Natural Language Understanding (NLU). Adversarial perturbations are added to the inputs, making the posterior distribution more consistent. Therefore, entropy-based Active Learning (AL) becomes robust by querying more informative samples without requiring additional components. VirAAL is an inexpensive method in terms of AL computation with a positive impact on data sampling. Furthermore, VirAAL decreases annotations in AL by up to 80%. Beyond Supervision for Monocular Depth Estimation Guizilini, Vitor*; Ambruș, Rareș A; Li, Jie; Pillai, Sudeep; Gaidon, Adrien Self-supervised learning enables training predictive models on arbitrarily large amounts of unlabeled data. One of the most successful examples of self-supervised learning is monocular depth estimation, which relies on strong geometric priors to learn from raw monocular image sequences in a structure-from-motion setting. In this work, we present recent breakthroughs in self-supervised monocular depth estimation that establish a new state of the art on standard benchmarks, reaching parity with fully supervised methods. Our contributions center on a novel neural network architecture, PackNet, that is specifically designed for large-scale self-supervised learning on high resolution videos. We also discuss semi-supervised training extensions that can effectively combine the self-supervised objective with partial supervision, whether from very sparse Lidar scans, velocity information, or pretrained segmentation models, while keeping inference monocular. Finally, we introduce a new, diverse, and challenging benchmark: Dense Depth for Automated Driving (DDAD). 
DDAD contains diverse scenes collected using a fleet of autonomous vehicles across the US and Japan. Thanks to long-range Lidar sensors, we expand standard metrics to include (a) evaluation at longer ranges of up to 200m, to properly measure how performance degrades with distance; and (b) fine-grained labels in the validation and test frames that enable per-category and per-instance metrics, thus overcoming the current limitation of uniform per-pixel depth evaluations. Synthetic Health Data for Fostering Reproducibility of Private Research Studies Bhanot, Karan*; Dash, Saloni; Yale, Andrew; Guyon, Isabelle; Erickson, John; Bennett, Kristin The inability to share private health data can severely stifle research and has led to the reproducibility crisis in biomedical research. Recent synthetic data generation methods provide an attractive alternative for making data available for research and education purposes without violating privacy. In this paper, we discuss our novel HealthGAN model that produces high quality synthetic health data and demonstrate its effectiveness by reproducing research studies. To preserve privacy, HealthGAN synthetic data can be released when research papers are published. Approaches can be developed on synthetic data and then evaluated on real data inside secure environments, enabling novel method generation. A Synthetic Data Petri Dish for Studying Mode Collapse in GANs Mangalam, Karttikeya*; Garg, Rohin In this extended abstract, we present a simple yet powerful data generation procedure for studying mode collapse in GANs. We describe a computationally efficient way to obtain visualizable high dimensional data using normalizing flows. We also train GANs (Table 1) on different proposed dataset Levels and find mode collapse to occur even in the most robust GAN formulations. 
We also use the inversion quality of our proposed transformation to visualize both the high-dimensional generated samples in a 2D space and the learnt discriminator's distribution as a heatmap. Such 2D visualizations are ill-defined with other dimensionality reduction methods such as PCA or t-SNE when applied to natural images, since those methods rely on approximations, depend strongly on visualization hyperparameters, and are computationally expensive. We believe our proposed procedure will serve as a petri dish for studying mode collapse in controlled settings and for better understanding the failure modes of proposed robust formulations, thereby propelling research in generative algorithms further. Whitening and second order optimization both destroy information about the dataset, and can make generalization impossible Wadia, Neha*; Duckworth, Daniel; Schoenholz, Samuel S; Dyer, Ethan; Sohl-Dickstein, Jascha We argue that both data whitening and second order optimization can harm or entirely prevent generalization. In general, model training can harness information contained in the sample-sample second moment matrix of the dataset. We show that for models with a fully connected first layer, the information contained in this matrix is the only information which can be used to generalize. Models trained using whitened data, or with certain second order optimization schemes, have less access to this information; in the high-dimensional regime, the training procedure has no access to this information, producing models that either generalize poorly or not at all. We experimentally verify the predicted harmful effects of data whitening and second order optimization on generalization. We further show experimentally that generalization continues to be harmed even when theoretical requirements are relaxed. 
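The whitening operation the abstract above argues against is easy to make concrete: after whitening, the empirical covariance becomes the identity, so any structure a model could have extracted from the second moment matrix is gone. A minimal ZCA-whitening sketch (the eigendecomposition route is one standard implementation choice):

```python
import numpy as np

def whiten(X, eps=1e-8):
    """ZCA-whiten the rows of X.

    The returned data has (approximately) identity covariance: the
    information carried by the second moment matrix has been removed.
    eps guards against division by tiny eigenvalues.
    """
    Xc = X - X.mean(axis=0)                 # center the data
    cov = Xc.T @ Xc / len(Xc)               # feature covariance
    vals, vecs = np.linalg.eigh(cov)        # eigendecomposition (symmetric)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W
```

Checking the covariance of the output against the identity matrix makes the "information destroyed" claim tangible: whatever distinguished directions of high and low variance in the original data is no longer recoverable from the whitened sample.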
**A Deep Learning Pipeline for Patient Diagnosis Prediction Using Electronic Health Records**
Paudel, Bibek*; Shrestha, Yash Raj; Franz, Leopold H
Augmenting disease diagnosis and decision-making in health care with machine learning algorithms has gained much impetus in recent years. In particular, in the current epidemiological situation caused by the COVID-19 pandemic, swift and accurate prediction of disease diagnoses with machine learning algorithms could facilitate the identification and care of vulnerable clusters of the population, such as those with multi-morbidity conditions. In order to build a useful disease diagnosis prediction system, advances in both data representation and machine learning architectures are imperative. First, with respect to data collection and representation, we face severe problems due to the multitude of formats and the lack of coherency prevalent in Electronic Health Records (EHRs), which hinders the extraction of the valuable information they contain. Currently, no universal global data standard has been established. As a useful solution, we develop and publish a Python package to transform public health datasets into an easy-to-access universal format. This transformation to an international health data format allows researchers to easily combine EHR datasets with clinical datasets of diverse formats. Second, machine learning algorithms that predict multiple disease diagnosis categories simultaneously remain underdeveloped. We propose two novel model architectures in this regard: first, DeepObserver, which uses structured numerical data to predict diagnosis categories, and second, ClinicalBERT_Multi, which incorporates the rich information available in clinical notes via natural language processing methods and also provides interpretable visualizations to medical practitioners. We show that both models can predict multiple diagnoses simultaneously with high accuracy.
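
As a rough illustration of what predicting multiple diagnosis categories simultaneously means computationally (a generic multi-label sketch with invented numbers, not the paper's DeepObserver or ClinicalBERT_Multi implementation), each diagnosis gets its own sigmoid output and binary cross-entropy term, thresholded independently:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over all (patient, diagnosis) pairs."""
    p = sigmoid(logits)
    eps = 1e-12
    return -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

logits = np.array([[2.0, -3.0, 0.5],    # patient 1: scores for 3 diagnosis categories
                   [-1.0, 4.0, -2.0]])  # patient 2
targets = np.array([[1, 0, 1],
                    [0, 1, 0]], dtype=float)

preds = (sigmoid(logits) >= 0.5).astype(int)  # independent threshold per label
print(preds.tolist())  # [[1, 0, 1], [0, 1, 0]]
print(multilabel_bce(logits, targets) < multilabel_bce(-logits, targets))  # True
```

Unlike single-label softmax classification, this formulation allows a patient to receive several diagnoses at once, which matches the multi-morbidity setting the abstract describes.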
**Ads Clickthrough Rate Prediction Models For Multi-Datasource Tasks**
Wang, Erzhuo*
Clickthrough rate prediction in online advertisement is a challenging machine learning problem that involves multiple objectives and multiple data sources. For example, at Pinterest we serve both shopping and standard Ads products, where each product has its own unique characteristics of creatives and user behavior patterns. In this work, we address this problem by adopting a multi-task deep neural network model that jointly learns the distinct distributions of data from various sources simultaneously. To tackle the multi-data-source problem, we propose a shared-bottom, multi-tower model architecture. The multi-tower structure can effectively isolate the interference between the distinct data distributions of different sources, while the shared-bottom layers enable us to learn lower-level common signals. Furthermore, we make use of contextual signals on top of the neural networks to calibrate the predictions, so that well-calibrated confidence in the inferred likelihoods is established. In addition, we leverage an automatic machine learning framework that handles feature extraction and feature transformation algorithmically, saving the cost of human feature engineering. The multi-tower model achieves better offline evaluation results on both data sources than either a single-tower structure or separate per-source models. In online A/B testing, the integrated solution realizes a significant CTR gain compared to vanilla multilayer perceptron models.

**Adversarial Learning for Debiasing Knowledge Base Embeddings**
Paudel, Bibek*; Arduini, Mario; Shrestha, Yash Raj; Zhang, Ce; Pirovano, Federico; Noci, Lorenzo
Knowledge Graphs (KG) are gaining increasing attention in both academia and industry. Despite their diverse benefits, recent research has identified social and cultural biases embedded in the representations learned from KGs.
Such biases can have detrimental consequences for different population and minority groups as applications of KGs begin to intersect and interact with social spheres. This paper describes our **work-in-progress**, which aims at identifying and mitigating such biases in Knowledge Graph (KG) embeddings. We explore gender bias in KG embeddings (KGE), and a careful examination of popular KGE algorithms suggests that sensitive attributes like the gender of a person can be predicted from the embedding. This implies that such biases in popular KGs are captured by the structural properties of the embedding. As a preliminary solution to debiasing KGs, we introduce a novel framework to filter the sensitive attribute information out of the KG embeddings, which we call FAN (Filtering Adversarial Network). We also suggest the applicability of FAN to debiasing other network embeddings, which could be explored in future work.

**Meta Attention Networks: Meta Learning Attention to Modulate Information Between Sparsely Interacting Recurrent Modules**
Madan, Kanika*; Ke, Nan Rosemary; Goyal, Anirudh; Bengio, Yoshua
Decomposing knowledge into interchangeable pieces promises a generalization advantage when, at some level of representation, the learner is likely to be faced with situations requiring novel combinations of existing pieces of knowledge or computation. We hypothesize that such a decomposition of knowledge is particularly relevant for higher levels of representation, as we see it at work in human cognition and natural language in the form of systematicity, or systematic generalization. To study these ideas, we propose a particular training framework in which we assume that the pieces of knowledge an agent needs, as well as its reward function, are stationary and can be re-used across tasks and changes in distribution.
As the learner is confronted with variations in experiences, an attention mechanism selects which modules should be adapted; the parameters of the selected modules are adapted quickly, while the parameters of the attention mechanisms are updated slowly as meta-parameters. We find that both the meta-learning and the modular aspects of the proposed system greatly help achieve faster learning in experiments with a reinforcement learning setup involving navigation in a partially observed gridworld.

**Batch Reinforcement Learning Through Continuation Method**
Guo, Yijie*; Chen, Minmin; Lee, Honglak; Chi, Ed H.
Many real-world applications of reinforcement learning (RL) require the agent to learn from a fixed set of trajectories, without collecting new interactions. Policy optimization under this setting is extremely challenging due to distribution shift. In this work, we propose a simple yet effective policy-based approach to batch RL using the global optimization method known as continuation: we constrain the Kullback-Leibler (KL) divergence between the learned policy and the behavior policy that generated the fixed trajectories, and continuously relax the constraint. We theoretically show that policy gradient with KL-divergence regularization converges significantly faster than vanilla policy gradient in the tabular setting, even with the exact gradient. We empirically verify that our method benefits not only from faster convergence but also from reduced noise in the gradient estimate in the batch RL setting with function approximation. We present results on continuous control tasks and tasks with discrete actions to demonstrate the efficacy of our proposed method.
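
The continuation idea can be sketched in a few lines (a toy single-state bandit of our own, not the authors' implementation): maximize expected reward minus lambda times KL(learned || behavior), with the KL weight lambda decayed over training so the constraint is continuously relaxed.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

behavior = np.array([0.25, 0.25, 0.25, 0.25])  # policy that logged the fixed data
rewards  = np.array([1.0, 0.0, 0.0, 0.0])      # arm 0 is best
theta = np.zeros(4)                            # softmax policy parameters

for step in range(500):
    lam = 10.0 * 0.99 ** step          # continuation: relax the KL constraint
    pi = softmax(theta)
    # exact gradient of E_pi[r] - lam * KL(pi || behavior) w.r.t. softmax logits
    grad = pi * (rewards - pi @ rewards) - lam * pi * (np.log(pi / behavior) - kl(pi, behavior))
    theta += 0.5 * grad

print(int(np.argmax(softmax(theta))))  # 0: the policy converges to the best arm
```

Early in training the large KL weight keeps the learned policy close to the behavior policy (where the gradient estimates are trustworthy); as lambda shrinks, the policy is free to exploit the reward signal.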
**Rotation-Invariant Gait Identification with Quaternion Convolutional Neural Networks**
Jing, Bowen*; Prabhu, Vinay Uday; Gu, Angela; Whaley, John
CNN-based accelerometric gait identification systems suffer a catastrophic drop in test accuracy when they encounter new device orientations unobserved during enrollment. In this paper we target this problem by introducing an SO(3)-equivariant quaternion convolutional kernel inside the CNN, and we present some initial promising results.

**Attention-Sampling Graph Convolutional Networks**
Lippoldt, Franziska; Lavin, Alexander*
A principal advantage of Graph Convolutional Networks (GCNs) lies in their ability to cope with irregular data, which we evaluate in the image domain by inspecting both graph downsampling methods and network accuracy with respect to edge connections. We specifically investigate the effects of distance-based vs. feature-attention downsampling, and suggest a method for generalizing pixel-wise attention to the graph setting to better represent distributions and irregularity. Our analysis is especially important for pathological images for carcinoma prediction: due to image size and over-represented cell-graphs, downsampling is naturally required, and simplifying graph assumptions may misrepresent the cellular structures. With principled downsampling within GCNs, we find that graph analysis of cells reveals possible stages of carcinoma development.

**Energy-based View of Retrosynthesis**
Sun, Ruoxi*; Dai, Hanjun; Li, Li; Kearnes, Steven; Dai, Bo
Retrosynthesis---the process of identifying a set of reactants to synthesize a target molecule---is of vital importance to material design and drug discovery. Existing machine learning approaches based on language models and graph neural networks have achieved encouraging results. In this paper, we propose a framework that unifies sequence- and graph-based methods as energy-based models (EBMs) with different energy functions.
This unified perspective provides critical insights about EBM variants through a comprehensive assessment of performance. Additionally, we present a novel "dual" variant within the framework that performs consistent training over Bayesian forward- and backward-prediction by constraining the agreement between the two directions. This model improves state-of-the-art performance by 9.6% for template-free approaches where the reaction type is unknown.

**Neural Interventional GRU-ODEs**
Zhou, Helen*; Xue, Yuan; Dai, Andrew M
Data is often generated as a continually accumulating byproduct of existing systems. This data can be observational, interventional, or a mixture of the two. In hospitals, for example, diagnostic measurements may be taken as needed, and treatments may be administered and recorded over a finite period of time. Leveraging recent advances in continuous-time modeling, we propose Neural Interventional GRU-ODEs (NIGO) to model passive observations alongside active interventions that happen at irregular timepoints. In this model, observations provide information about the underlying state of the system, whereas interventions drive changes in the underlying state. Our model seeks to capture the influence of interventions on the latent state, while also learning the dynamics of the system. Experiments are done on a simulated pendulum dataset with gravity interventions.

**See, Hear, Explore: Curiosity via Audio-Visual Association**
Dean, Victoria*; Tulsiani, Shubham; Gupta, Abhinav
Exploration is one of the core challenges in reinforcement learning. A common formulation of curiosity-driven exploration uses the difference between the real future and the future predicted by a learned model. However, predicting the future is an inherently difficult task which can be ill-posed in the face of stochasticity. In this paper, we introduce an alternative form of curiosity that rewards novel associations between different senses.
Our approach exploits multiple modalities to provide a stronger signal for more efficient exploration. Our method is inspired by the fact that, for humans, both sight and sound play a critical role in exploration. We present results on Habitat (a photorealistic navigation simulator), showing the benefits of using an audio-visual association model for intrinsically guiding learning agents in the absence of external rewards.

**TSGLR: an Adaptive Thompson Sampling for the Switching Multi-Armed Bandit Problem**
ALAMI, Reda*; Azizi, Oussama
The stochastic multi-armed bandit problem has been widely studied under the stationarity assumption. However, in real-world problems and industrial applications, this assumption is often unrealistic because the distributions of rewards may change over time. In this paper, we consider the piece-wise iid non-stationary stochastic multi-armed bandit problem with unknown change-points, and we focus on the change-of-mean setup. To solve it, we propose a Thompson Sampling strategy equipped with a change-point detector based on a well-tuned non-parametric Generalized Likelihood Ratio test (GLR). We call the resulting strategy Thompson Sampling-GLR (TS-GLR). Analytically, in the context of regret minimization for the global switching setting, our proposal achieves a $\mathcal{O}\left( K_T\log T\right)$ regret upper bound, where $K_T$ is the overall number of change-points up to the horizon $T$. This contradicts the lower bound in $\Omega(\sqrt{T})$. This result mainly comes from the order-optimal detection delay of the GLR test for sub-Gaussian distributions and its well-controlled false alarm rate. Experimentally, we demonstrate that TS-GLR outperforms state-of-the-art non-stationary stochastic bandits on synthetic Bernoulli rewards as well as on the Yahoo! User Click Log Dataset.
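
For intuition, a Gaussian GLR change-point statistic can be sketched as follows (our own simplification for unit-variance sub-Gaussian rewards; the threshold and constants here are illustrative, not the authors' exact test): the statistic scans every split point, compares the empirical means before and after it, and flags a change when the best split exceeds a threshold.

```python
import numpy as np

def glr_statistic(x):
    """sup over splits s of s(n-s)/(2n) * (mean_before - mean_after)^2."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    best = 0.0
    for s in range(1, n):
        m1, m2 = x[:s].mean(), x[s:].mean()
        best = max(best, s * (n - s) / (2.0 * n) * (m1 - m2) ** 2)
    return best

no_change = np.zeros(100)                                   # constant reward stream
with_change = np.concatenate([np.zeros(50), np.full(50, 2.0)])  # mean shift at t=50

print(glr_statistic(no_change))                   # 0.0: never triggers
print(glr_statistic(with_change) > np.log(100))   # True: the shift is detected
```

When the detector fires, the Thompson Sampling posterior for the affected arm can be reset, which is the mechanism the abstract's regret analysis relies on.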
**Learning Invariant Representations for Reinforcement Learning without Reconstruction**
Zhang, Amy; McAllister, Rowan*; Calandra, Roberto; Gal, Yarin; Levine, Sergey
We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying on either domain knowledge or pixel reconstruction. Our goal is to learn representations that both provide for effective downstream control and are invariant to task-irrelevant details. Bisimulation metrics quantify behavioral similarity between states in continuous MDPs, and we propose using them to learn robust latent representations that encode only the task-relevant information from observations. Our method trains encoders such that distances in latent space equal bisimulation distances in state space. We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks, where the background is replaced with moving distractors and natural videos, while achieving SOTA performance.

**ChemBERTa: Utilizing Transformer-Based Attention for Understanding Chemistry**
Chithrananda, Seyone*; Ramsundar, Bharath
Despite the success of pre-training methods in NLP and computer vision, such pre-training methods remain scarce and ineffective in applications to chemistry. Many previous graph-based molecular property prediction models have yet to see a strong boost in generalizability or prediction accuracy through pre-training techniques, as they map molecules to a sparse discrete space known as a molecular fingerprint, a numerical representation of molecules. To address this, we present ChemBERTa, a RoBERTa-like transformer model that learns molecular fingerprints through semi-supervised pre-training of a sequence-to-sequence language model, using masked language modelling on a large corpus of 250,000 SMILES strings, a well-known text representation of molecules.
We train the model for 15 epochs, obtaining a mean masked-LM likelihood loss of 0.285. After pre-training, we fine-tune ChemBERTa by benchmarking its performance on Tox21, a multi-task dataset for predicting the toxicities of molecules through various biochemical pathways. We also demonstrate the promise of visualizing ChemBERTa's attention mechanism for interpreting the chemical features of a molecule and evaluating the performance of our model. Our models have been made openly available through Hugging Face's model hub, with over 12,000 downloads, and we provide a tutorial for running masked language modelling, attention visualization, and binary classification experiments with ChemBERTa in the DeepChem library.

**Gradient Descent on Unstable Dynamical Systems**
Nar, Kamil*; Xue, Yuan; Dai, Andrew M
When training the parameters of a linear dynamical model, the gradient descent algorithm is likely to fail to converge if the squared-error loss is used as the training loss function. Restricting the parameter space to a smaller subset and running the gradient descent algorithm within this subset can allow learning stable dynamical systems, but this strategy does not work for unstable systems. In this work, we show that observations taken at different times from the system to be learned influence the dynamics of the gradient descent algorithm to substantially different degrees. We introduce a time-weighted logarithmic loss function to fix this imbalance and demonstrate its effectiveness in learning unstable systems.

**Towards Learning Robots Which Adapt On The Fly**
Julian, Ryan C*; Swanson, Benjamin; Sukhatme, Gaurav; Levine, Sergey; Finn, Chelsea; Hausman, Karol
One of the great promises of robot learning systems is that they will be able to learn from their mistakes and continuously adapt to ever-changing environments.
Despite this potential, most robot learning systems today are deployed as fixed policies and are not adapted after deployment. Can we efficiently adapt previously learned behaviors to new environments, objects, and percepts in the real world? We present a method, and empirical evidence, towards a robot learning framework that facilitates continuous adaptation. In particular, we demonstrate how to adapt vision-based robotic manipulation policies to new variations by fine-tuning via off-policy reinforcement learning, including changes in background, object shape and appearance, lighting conditions, and robot morphology. Further, this adaptation uses less than 0.2% of the data necessary to learn the task from scratch. We find that the simple approach of fine-tuning pre-trained policies leads to substantial performance gains over the course of fine-tuning, and that pre-training via RL is essential: training from scratch or adapting from supervised ImageNet features are both unsuccessful with such small amounts of data. We also find that these positive results hold in a limited continual learning setting, in which we repeatedly fine-tune a single lineage of policies using data from a succession of new tasks. Our empirical conclusions are consistently supported by experiments on simulated manipulation tasks, and by 52 unique fine-tuning experiments on a real robotic grasping system pre-trained on 580,000 grasps.

## Call-for-Submissions

Please submit your proposals via CMT in the form of an abstract as a 2-page PDF in the NeurIPS style by 11:59:59 PM PDT, June 25th, 2020. References can be included on a third page.
Note: Submissions are not blind-reviewed, so please include the authors' names and affiliations in the submissions.

Acceptable material includes work that has already been submitted or published, preliminary results, and controversial findings.
We do not intend to publish proceedings; only abstracts will be shared through an online repository. Our primary goal is to foster discussion!

For examples of previously accepted talks, please watch the paper presentations from BayLearn 2019 or review the complete list of accepted submissions. For examples of abstracts that have been selected in the past, please see the schedule of talks from BayLearn 2018. That page has videos of the talks and links to PDFs of the abstracts for each of the selected talks.