Accepted Submissions

Revisiting Spatial Invariance with Low-Rank Local Connectivity
Elsayed, Gamaleldin F*; Ramachandran, Prajit; Shlens, Jonathon; Kornblith, Simon
Convolutional neural networks are among the most successful architectures in deep learning. This success is at least partially attributable to the efficacy of spatial invariance as an inductive bias. Locally connected layers, which differ from convolutional layers only in their lack of spatial invariance, usually perform poorly in practice. However, these observations still leave open the possibility that some degree of relaxation of spatial invariance may yield a better inductive bias than either convolution or local connectivity. To test this hypothesis, we design a method to relax the spatial invariance of a network layer in a controlled manner. In particular, we create a low-rank locally connected (LRLC) layer, where the kernel applied at each position is constructed as a linear combination of basis kernels with spatially varying combining weights. By varying the number of basis kernels, we can control the degree of relaxation of spatial invariance. In experiments with small convolutional networks, we find that relaxing spatial invariance improves classification accuracy over both convolution and locally connected layers across the MNIST, CIFAR-10, and CelebA datasets. These results suggest that spatial invariance may be an overly restrictive inductive bias.

Neural Additive Models: Interpretable Machine Learning with Neural Nets
Agarwal, Rishabh*; Frosst, Nicholas; Zhang, Xuezhou; Caruana, Rich; Hinton, Geoffrey
The accuracy of deep neural networks (DNNs) comes at the cost of intelligibility: it is usually unclear how they make their decisions. This hinders their applicability to high-stakes decision-making domains such as healthcare.
We propose Neural Additive Models (NAMs), which combine some of the expressivity of DNNs with the inherent intelligibility of generalized additive models. NAMs learn a linear combination of neural networks that each attend to a single input feature. These networks are trained jointly and can learn arbitrarily complex relationships between their input feature and the output. Our experiments on regression and classification datasets show that NAMs are more accurate than widely used intelligible models such as logistic regression and shallow decision trees. They perform similarly to existing state-of-the-art generalized additive models in accuracy, but can be more easily applied to real-world problems.

Bandit-based Monte Carlo Optimization for Nearest Neighbors
Bagaria, Vivek; Baharav, Tavor Z*; Kamath, Govinda; Tse, David
The
celebrated Monte Carlo method estimates an expensive-to-compute quantity by random sampling. Bandit-based Monte Carlo optimization (BMO) is a general technique for computing the minimum of many such expensive-to-compute quantities by adaptive random sampling. The technique converts an optimization problem into a statistical estimation problem, which is then solved via multi-armed bandits. We apply this technique to solve the important problem of high-dimensional k-nearest neighbors. We show that this technique allows us to develop an algorithm which can confer significant gains on real datasets over both exact computation (up to 100x in number of operations and 30x in wall-clock time) and state-of-the-art algorithms such as KGraph, NGT, and LSH. We provide theoretical guarantees and show that, under regularity assumptions, the complexity of this algorithm scales logarithmically with the dimension of the data rather than linearly as in exact computation.

LassoNet: A Neural Network with Feature Sparsity
Lemhadri, Ismael*; Tibshirani, Rob; Ruan, Feng
Much work has been done recently to make neural networks more interpretable, and one obvious approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or L1-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However, the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach enforces a hierarchy: specifically, a feature can participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection directly with the parameter learning. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. In systematic experiments, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent, and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.

Protecting Against Image Translation Deepfakes by Leaking Universal Perturbations from Black-Box Neural Networks
Ruiz, Nataniel*; Bargal, Sarah; Sclaroff, Stan
In this work, we develop efficient disruptions of black-box image translation deepfake generation systems. We are the first to demonstrate black-box deepfake generation disruption by presenting image translation formulations of attacks initially proposed for classification models. Nevertheless, a naive adaptation of classification black-box attacks results in a prohibitive number of queries for image translation systems in the real world.
We present a frustratingly simple yet highly effective algorithm, Leaking Universal Perturbations (LUP), that significantly reduces the number of queries needed to attack an image. LUP consists of two phases: (1) a short leaking phase, where we attack the network using traditional black-box attacks and gather information on successful attacks on a small dataset, and (2) an exploitation phase, where we leverage said information to subsequently attack the network with improved efficiency. Our attack reduces the total number of queries necessary to attack GANimation and StarGAN by 30%.

What is being transferred in transfer learning?
Neyshabur, Behnam; Sedghi, Hanie*; Zhang, Chiyuan
One desired capability for machines is the ability to transfer their understanding of one domain to another domain where data is (usually) scarce. Despite ample adoption of transfer learning in many deep learning applications, we do not yet understand what enables a successful transfer and which part of the network is responsible for it. In this paper, we provide new tools and analysis to address these fundamental questions. Through a series of analyses on transferring to block-shuffled images, we separate the effect of feature reuse from learning high-level statistics of data and show that some benefit of transfer learning comes from the latter. We show that when training from pre-trained weights, the model stays in the same basin in the loss landscape, and different instances of such a model are similar in feature space and close in parameter space.

VPFO: A Variable Projection Method for Training Neural Networks
Sahiner, Arda*; Pauly, John; Pilanci, Mert
We propose a new optimization method for training neural networks for regression problems, built upon the success of the Variable Projection method for separable nonlinear least squares problems.
This Variable Projection approach eliminates the final-layer weights of a network by observing that the optimal values of these weights can be solved for in closed form when the weights of the remaining layers are held fixed. We propose minimizing the Variable Projection loss with first-order optimization methods, which allows for scalability at any network depth and can easily be incorporated into existing neural network training pipelines. We extensively demonstrate the effectiveness of our approach for training neural networks, in both training time and performance, for applications such as image autoencoders.

Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization
Cao, Kaidi*; Chen, Yining; Lu, Junwei; Arechiga, Nikos; Gaidon, Adrien; Ma, Tengyu
Real-world large-scale datasets are heteroskedastic and imbalanced: labels have varying levels of uncertainty and label distributions are long-tailed. Heteroskedasticity and imbalance challenge deep learning algorithms due to the difficulty of distinguishing among mislabeled, ambiguous, and rare examples. Addressing heteroskedasticity and imbalance simultaneously is underexplored. We propose a data-dependent regularization technique for heteroskedastic datasets that regularizes different regions of the input space differently. Inspired by the theoretical derivation of the optimal regularization strength in a one-dimensional nonparametric classification setting, our approach adaptively regularizes the data points in higher-uncertainty, lower-density regions more heavily. We test our method on several benchmark tasks, including a real-world heteroskedastic and imbalanced dataset, WebVision. Our experiments corroborate our theory and demonstrate a significant improvement over other methods in noise-robust deep learning.
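The adaptive-regularization idea above, penalizing points in lower-density regions more heavily, can be sketched with a simple density proxy. This is an illustrative stand-in only: the k-th-nearest-neighbor distance and the `adaptive_reg_strengths` helper are assumptions, not the paper's derivation.

```python
import numpy as np

def adaptive_reg_strengths(X, k=5, base_lambda=0.1):
    """Assign larger regularization strength to points in lower-density
    regions, using each point's distance to its k-th nearest neighbor as
    an inverse-density proxy (hypothetical helper, not the paper's
    exact estimator)."""
    # Pairwise Euclidean distances; after sorting, column 0 is each
    # point's zero distance to itself.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    kth = np.sort(d, axis=1)[:, k]  # distance to k-th nearest neighbor
    # Rescale so the mean strength equals base_lambda.
    return base_lambda * kth / kth.mean()
```

Points far from the bulk of the data would then receive proportionally larger per-example regularization coefficients during training.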
siVAE: interpreting latent dimensions within variational autoencoders
Choi, Yongin*; Quon, Gerald
Interpretation of variational autoencoders (VAEs) to measure contributions of input features to latent dimensions remains challenging because feature contributions are implicit in the trained parameters and choice of architecture of the VAE. Here we propose the scalable, interpretable variational autoencoder (siVAE), a Bayesian extension of VAEs that is interpretable by design: it learns feature embeddings that guide the interpretation of the sample embeddings, in a manner analogous to factor loadings in factor analysis. siVAE is as powerful and nearly as fast to train as the standard VAE, but achieves full interpretability of the latent dimensions, as well as of all hidden layers of the decoder. We also introduce a new interpretability measure, feature awareness, that captures which features each layer of the siVAE model focuses on reconstructing well for each input sample. Training siVAE on a dataset exceeding 1 million samples and 28,000 features is between 12 and 2,292 times faster than applying existing feature attribution methods to a trained VAE.

Learning to grow: control of materials self-assembly using evolutionary reinforcement learning
Whitelam, Stephen*
We show that neural networks trained by evolutionary reinforcement learning can enact efficient molecular self-assembly protocols. Presented with molecular simulation trajectories, networks learn to change temperature and chemical potential in order to promote the assembly of desired structures or to choose between competing polymorphs. In the first case, networks qualitatively reproduce the results of previously known protocols, but faster and with higher fidelity; in the second case they identify strategies previously unknown, from which we can extract physical insight.
Networks that take as input the elapsed time of the simulation or microscopic information from the system are both effective, the latter more so. The evolutionary scheme we have used is simple to implement and can be applied to a broad range of examples of experimental self-assembly, whether or not one can monitor the experiment as it proceeds. Our results have been achieved with no human input beyond the specification of which order parameter to promote, pointing the way to the design of synthesis protocols by artificial intelligence.

Neural Anisotropy Directions
Ortiz-Jimenez, Guillermo*; Modas, Apostolos; Moosavi-Dezfooli, Seyed-Mohsen; Frossard, Pascal
In this work, we analyze the role of the network architecture in shaping the inductive bias of deep classifiers. To that end, we start by focusing on a very simple problem, i.e., classifying a class of linearly separable distributions, and show that, depending on the direction of the discriminative feature of the distribution, many state-of-the-art deep convolutional neural networks (CNNs) have a surprisingly hard time solving this simple task. We then define as neural anisotropy directions (NADs) the vectors that encapsulate the directional inductive bias of an architecture. These vectors, which are specific to each architecture and hence act as a signature, encode the preference of a network to separate the input data based on some particular features. We provide an efficient method to identify NADs for several CNN architectures and thus reveal their directional inductive biases. Furthermore, we show that, for the CIFAR-10 dataset, NADs characterize the features used by CNNs to discriminate between different classes.

Self-supervised Learning for Deep Models in Recommendations
Yao, Tiansheng*; Yi, Xinyang; Cheng, Zhiyuan; Hong, Lichan; Chi, Ed H.; Yu, Felix
Large-scale neural recommender models play a critical role in modern search and recommendation systems.
With millions to billions of items to choose from, the quality of the learned query and item representations is crucial to recommendation quality. Inspired by the recent success of self-supervised representation learning in both computer vision and natural language understanding, we propose a multi-task self-supervised learning (SSL) framework for sparse neural models in recommendations. Furthermore, we propose two highly generalizable SSL tasks within the proposed framework: (i) Feature Masking (FM) and (ii) Feature Dropout (FD). We evaluate our framework using two large-scale datasets with ~500M and 1B training examples, respectively. Our results demonstrate that the proposed framework outperforms baseline models and state-of-the-art spread-out regularization techniques in the context of retrieval. The SSL framework shows larger improvements with less supervision compared to the counterparts.

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization
Matsushima, Tatsuya*; Furuta, Hiroki; Matsuo, Yutaka; Nachum, Ofir; Gu, Shixiang
Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. We observe that naïvely applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm.
We then propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), which is able to achieve impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines.

Learning Multi-granular Quantized Embeddings for Large-Vocab Categorical Features in Recommender Systems
Kang, Wang-Cheng*; Cheng, Zhiyuan; Chen, Ting; Yi, Xinyang; Lin, Dong; Hong, Lichan; Chi, Ed H.
Recommender system models often represent various sparse features like users, items, and categorical features via one-hot embeddings. However, a large vocabulary inevitably leads to a gigantic embedding table, creating two severe problems: (i) making model serving intractable in resource-constrained environments; (ii) causing overfitting problems. We seek to learn highly compact embeddings for large-vocab sparse features in recommender systems (recsys). First, we show that the novel Differentiable Product Quantization (DPQ) approach can generalize to recsys problems. In addition, to better handle the power-law data distribution commonly seen in recsys, we propose a Multi-Granular Quantized Embeddings (MGQE) technique which learns more compact embeddings for infrequent items. Preliminary experiments on three recommendation tasks and two datasets show that we can achieve on-par or better performance with only ~20% of the original model size.

Temperature check: theory and practice for training models with softmax cross-entropy losses
Agarwala, Atish*; Schoenholz, Samuel S; Pennington, Jeffrey; Dauphin, Yann
The softmax cross-entropy loss function is a principled way of modeling probability distributions that has become ubiquitous in deep learning.
While its lone hyperparameter, the temperature, is commonly set to one or regarded as a way to tune the model's confidence after training, less is known about how the temperature impacts training dynamics or generalization performance. In this work, we develop a theory of early learning for models trained with the softmax cross-entropy loss and show that the learning dynamics depend crucially on the inverse temperature $\beta$ as well as the initial training-set logit magnitude $\beta\|z\|_{F}$. Empirically, we find that generalization performance depends strongly on the temperature, even though the model's final confidence does not. It follows that the addition of $\beta$ as a tunable hyperparameter is key to maximizing model performance, which we demonstrate by showing that optimizing $\beta$ increases the performance of a ResNet-50 trained on ImageNet. Together, these results underscore the importance of tuning the softmax temperature and provide qualitative guidance for performing this tuning.

Autofocused oracles for design
Fannjiang, Clara*; Listgarten, Jennifer
Data-driven design is making headway into a number of application areas, including protein, small-molecule, and materials engineering. The design goal is to construct an object with desired properties, such as a material that exhibits superconductivity at higher temperatures than ever previously observed. To that end, costly experimental measurements are being replaced with calls to a high-capacity regression model trained on labeled data, which can be leveraged in an in silico search for promising design candidates. However, the design goal necessitates moving into regions of the input space beyond where such models were trained. Therefore, one can ask: should the regression model be altered as the design algorithm explores the input space, in the absence of new data acquisition? Herein, we answer this question in the affirmative.
In particular, we (i) formalize the data-driven design problem as a non-zero-sum game, (ii) leverage this formalism to develop a strategy for retraining the regression model as the design algorithm proceeds, which we refer to as autofocusing the model, and (iii) demonstrate the promise of autofocusing empirically. A full paper detailing our work can be found at: https://arxiv.org/abs/2006.08052.

Provably Efficient Policy Optimization via Thompson Sampling
Ishfaq, Haque*; Yang, Zhuoran; Lupu, Andrei; Islam, Riashat; Liu, Lewis; Precup, Doina; Wang, Zhaoran
Policy Optimization (PO) methods with function approximation are one of the most popular classes of Reinforcement Learning (RL) algorithms. Despite their popularity, it largely remains a challenge to design provably efficient policy optimization algorithms. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that uses a Thompson-sampling-based exploration strategy \citep{thompson1933likelihood, ThompsonTutorial}. This paper presents a provably efficient policy optimization algorithm that incorporates exploration using Thompson sampling. We prove that, in an episodic linear MDP setting, our algorithm, Thompson Sampling for Policy Optimization (TSPO), achieves $\tilde{\mathcal{O}}(d^{3/2} H^{2} \sqrt{T})$ worst-case (frequentist) regret, where $H$ is the length of each episode, $T$ is the total number of steps, and $d$ is the number of features. Finally, we empirically evaluate TSPO and show that it is competitive with state-of-the-art baselines.

Robustness Analysis of Deep Learning via Implicit Models
Tsai, Alicia Y.*; El Ghaoui, Laurent
Despite the success of deep neural networks (DNNs), it is well known that they can fail significantly in the presence of adversarial perturbations. Starting with Szegedy et al., a large number of works have demonstrated that state-of-the-art DNNs are vulnerable to adversarial samples.
The vulnerability of DNNs has motivated the study of building models that are robust to such perturbations. However, many defense strategies have later been shown to be ineffective. Although a large number of research works have been devoted to improving the robustness of DNNs and to our understanding of their behaviors, many fundamental questions about their vulnerabilities remain unclear. In this work, we introduce the implicit model and formalize its well-posedness properties theoretically. We analyze the robustness of DNNs through the lens of the implicit model and define its sensitivity matrix, which relates perturbations in inputs to those in outputs. Empirically, we show how the sensitivity matrix can be used to generate adversarial attacks effectively on the MNIST and CIFAR-10 datasets.

Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration
Dai, Hanjun*; Singh, Rishabh; Dai, Bo; Sutton, Charles; Schuurmans, Dale
Discrete structures play an important role in applications like program language modeling and software engineering. Current approaches to predicting complex structures typically consider autoregressive models for their tractability, with some sacrifice in flexibility. Energy-based models (EBMs), on the other hand, offer a more flexible and thus more powerful approach to modeling such distributions, but require partition function estimation. In this paper we propose ALOE, a new algorithm for learning conditional and unconditional EBMs for discrete structured data, where parameter gradients are estimated using a learned sampler that mimics local search. We show that the energy function and sampler can be trained efficiently via a new variational form of power iteration, achieving a better trade-off between flexibility and tractability. Experimentally, we show that learning local search leads to significant improvements in challenging application domains.
Most notably, we present an energy-model-guided fuzzer for software testing that achieves comparable performance to well-engineered fuzzing engines like libFuzzer.

Distributed Sketching Methods for Privacy Preserving Regression
Bartan, Burak*; Pilanci, Mert
In this work, we study distributed sketching methods for large-scale regression problems. We leverage multiple randomized sketches for reducing the problem dimensions as well as preserving privacy and improving straggler resilience in asynchronous distributed systems. We derive novel approximation guarantees for classical sketching methods and analyze the accuracy of parameter averaging for distributed sketches. We consider random matrices including Gaussian, randomized Hadamard, uniform sampling, and leverage score sampling in the distributed setting. Moreover, we propose a hybrid approach combining sampling and fast random projections for better computational efficiency. We illustrate the performance of distributed sketches on a serverless computing platform with large-scale experiments.

Exact posteriors of wide Bayesian neural networks
Hron, Jiri*; Bahri, Yasaman; Novak, Roman; Pennington, Jeffrey; Sohl-Dickstein, Jascha
Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) if the width of all layers is large. However, many BNN applications are concerned with the BNN function-space posterior. While some empirical evidence of posterior convergence was provided in the original works of Neal (1996) and Matthews et al. (2018), it is limited to small datasets or architectures due to the notorious difficulty of obtaining and verifying exactness of BNN posterior approximations. We provide the missing proof that the exact BNN posterior converges (weakly) to the one induced by the GP limit of the prior. For empirical validation, we generate samples from the exact finite BNN posterior on a small dataset via rejection sampling.
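For intuition, rejection sampling from an exact finite BNN posterior can be sketched as follows: draw weights from the prior and accept with probability proportional to the likelihood, using the trivial upper bound that the Gaussian log-likelihood of a zero residual is the maximum. The tiny one-hidden-layer network, the standard-normal prior, and all names here are illustrative assumptions, not the authors' setup.

```python
import numpy as np

def bnn_posterior_samples(x, y, width=3, noise=0.1, n_draws=5000, seed=0):
    """Rejection-sample exact posterior predictions of a tiny
    one-hidden-layer BNN: standard-normal prior on weights, Gaussian
    likelihood with std `noise`. Proposal = prior; accept with
    probability exp(loglik), valid because loglik <= 0 here."""
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_draws):
        W1 = rng.normal(size=(width, 1))            # input-to-hidden weights
        b1 = rng.normal(size=width)                 # hidden biases
        W2 = rng.normal(size=(1, width)) / np.sqrt(width)  # output weights
        f = (W2 @ np.tanh(W1 @ x[None, :] + b1[:, None])).ravel()
        loglik = -0.5 * np.sum((y - f) ** 2) / noise**2
        if np.log(rng.uniform()) < loglik:          # accept w.p. exp(loglik)
            accepted.append(f)
    return np.array(accepted)
```

Accepted draws are exact posterior samples, which is what makes such samplers useful for verifying posterior approximations despite their poor scaling with dataset size.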
Selectivity considered harmful: evaluating the causal impact of class selectivity in DNNs
Leavitt, Matthew L*; Morcos, Ari S
The properties of individual neurons are often analyzed in order to understand the biological and artificial neural networks in which they're embedded. Class selectivity, typically defined as how different a neuron's responses are across different classes of stimuli or data samples, is commonly used for this purpose. However, it remains an open question whether class selectivity in individual units is necessary and/or sufficient for deep neural networks (DNNs) to function well. We investigated the causal impact of class selectivity on network function by directly regularizing for or against class selectivity. Using this regularizer to reduce class selectivity across units in convolutional neural networks increased test accuracy by over 2% for ResNet-18 trained on Tiny ImageNet. For ResNet-20 trained on CIFAR-10, we could reduce class selectivity by a factor of 2.5 with no impact on test accuracy, and reduce it nearly to zero with only a small (~2%) drop in test accuracy. In contrast, regularizing to increase class selectivity had rapid and disastrous effects on test accuracy across all models and datasets. These results indicate that class selectivity in individual units is neither sufficient nor strictly necessary, and can even impair DNN performance. They also encourage caution when focusing on the properties of single units as representative of the mechanisms by which DNNs function.

GANs for Continuous Path Keyboard Input Modeling
Mehra, Akash*; Bellegarda, Jerome; Bapat, Ojas; Lal, Partha; Wang, Xin
Continuous path keyboard input has higher inherent ambiguity than standard tapping, because the path trace may exhibit not only local overshoots/undershoots (as in tapping) but also, depending on the user, substantial mid-path excursions.
Deploying a robust solution thus requires a large amount of high-quality training data, which is difficult to collect/annotate. In this work, we address this challenge by using GANs to augment our training corpus with user-realistic synthetic data. Experiments show that, even though GAN-generated data does not capture all the characteristics of real user data, it still provides a substantial boost in accuracy at a 5:1 GAN-to-real ratio. GANs therefore inject more robustness into the model through greatly increased word coverage and path diversity.

Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning
Khadka, Shauharda; Guez Aflalo, Estelle; Marder, Mattias; Ben-David, Avrech; Miret, Santiago; Tang, Hanlin; Mannor, Shie; Hazan, Tamir; Majumdar, Somdeb*
As modern neural networks have grown to billions of parameters, meeting tight latency budgets has become increasingly challenging. Solutions like compression and pruning modify the underlying network. We present Evolutionary Graph RL (EGRL), a complementary approach that optimizes how tensors are mapped to on-chip memory while keeping the network untouched. Since different memory components trade off capacity for bandwidth differently, a suboptimal mapping can result in high latency. We train and validate EGRL directly on the Intel NNP-I chip for inference. EGRL outperforms policy-gradient, evolutionary search, and dynamic programming baselines on ResNet-50, ResNet-101, and BERT, achieving a 28-78% speedup over the native compiler.

Interpretable Planning-Aware Representations for Multi-Agent Trajectory Forecasting
Ivanovic, Boris*; Elhafsi, Amine; Rosman, Guy; Gaidon, Adrien; Pavone, Marco
Reasoning about human motion is an important prerequisite to safe and socially-aware robotic navigation. As a result, multi-agent behavior prediction has become a core component of modern human-robot interactive systems, such as self-driving cars.
In particular, one of the main uses of behavior prediction in autonomous systems is to inform ego-robot motion planning and control. A unifying theme among most human motion prediction approaches is that they produce trajectories (or distributions thereof) for each agent in a scene, an intuitive output representation that matches common evaluation metrics. However, a majority of planning and control algorithms reason about system dynamics rather than future agent tracklets, which can hinder their integration. Towards this end, we investigate Mixtures of Linear Time-Varying Systems as an output representation for trajectory forecasting that is more amenable to downstream planning and control use. Our approach leverages successful ideas from prior probabilistic trajectory forecasting works to learn dynamical system representations that are well-studied in the planning and control literature. We consider an intuitive two-agent interaction scenario to illustrate how our method works and motivate further evaluation on large-scale autonomous driving datasets as well as real-world hardware.

Architecture Compression
Ashok, Anubhav*
In this paper we propose a novel approach to model compression termed Architecture Compression. Instead of operating on the weight or filter space of the network like classical model compression methods, our approach operates on the architecture space. A 1D CNN encoder-decoder is trained to learn a mapping from the discrete architecture space to a continuous embedding and back. Additionally, this embedding is jointly trained to regress accuracy and parameter count in order to incorporate information about the architecture's effectiveness on the dataset. During the compression phase, we first encode the network and then perform gradient descent in the continuous space to optimize a compression objective function that maximizes accuracy and minimizes parameter count. The final continuous feature is then mapped to a discrete architecture using the decoder.
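The compression phase just described, gradient steps in a learned embedding space against an accuracy-minus-parameter-count objective, can be illustrated with toy differentiable surrogates; the quadratic surrogates and every name below are assumptions for illustration, not the paper's trained regressors.

```python
import numpy as np

def compress_in_embedding_space(z0, z_acc, lam=0.5, lr=0.1, steps=200):
    """Gradient ascent on J(z) = acc(z) - lam * params(z) over a
    continuous architecture embedding z, with toy surrogates
    acc(z) = -||z - z_acc||^2  (accuracy peaks at z_acc) and
    params(z) = ||z||^2        (parameter count grows away from origin).
    The optimum is z* = z_acc / (1 + lam)."""
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(steps):
        grad = -2.0 * (z - z_acc) - lam * 2.0 * z  # dJ/dz
        z += lr * grad
    return z
```

In the method's terms, a decoder would then map the optimized continuous point back to a discrete architecture.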
We demonstrate the merits of this approach on visual recognition tasks such as CIFAR-10, CIFAR-100, Fashion-MNIST, and SVHN, and achieve greater than 20x compression on CIFAR-10.

Simultaneous Learning of the Inputs and Parameters in Neural Collaborative Filtering
Raziperchikolaei, Ramin*; Li, Tianyu; Chung, Young Joo
User and item representations have a significant impact on the prediction performance of neural-network-based collaborative filtering systems. Previous works fix the input to the user/item interaction vectors and/or IDs and train neural networks to learn the representations. We argue that this strategy adversely affects the quality of the representations, since the similarities in the users' tastes might not be reflected in the input space. We show that there is an implicit embedding matrix in the first fully connected layer which takes the user/item interaction vectors as the input. The role of the nonzero elements of the input vectors is to choose and combine a subset of the embedding vectors. To learn better representations, instead of fixing the input and relying only on the neural network structure, we propose to learn the values of the nonzero elements of the input jointly with the neural network parameters. Our experiments on two MovieLens datasets and two real-world datasets show that our method outperforms the state-of-the-art methods.

Neural Representations in Hybrid Recommender Systems: Prediction versus Regularization
Raziperchikolaei, Ramin*; Li, Tianyu; Chung, Young Joo
Autoencoder-based hybrid recommender systems have become popular recently because of their ability to learn user and item representations by reconstructing various information sources, including users' feedback on items (e.g., ratings) and side information of users and items (e.g., users' occupations and items' titles).
However, existing systems still use representations learned by matrix factorization (MF) to predict the rating, while using representations learned by neural networks as the regularizer. In our work, we define the neural representation for prediction (NRP) framework and apply it to autoencoder-based recommendation systems. We theoretically analyze how our objective function is related to the previous MF and autoencoder-based methods and explain what it means to use neural representations as the regularizer. We also apply the NRP framework to a direct neural network structure which predicts the ratings without reconstructing the user and item information. Our experimental results confirm that neural representations are better for prediction than regularization and show that the NRP framework, combined with the direct neural network structure, outperforms the state-of-the-art methods in the prediction task.  Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics  Ramasesh, Vinay V*; Dyer, Ethan; Raghu, Maithra  Catastrophic forgetting is a central obstacle to continual learning. Many methods have been proposed to overcome this problem, but fully mitigating forgetting is likely hindered by a lack of understanding of the phenomenon’s fundamental properties. For example, how does catastrophic forgetting affect the hidden representations of neural networks? Are there underlying principles common to methods that mitigate forgetting? How is catastrophic forgetting affected by (semantic) similarities between sequential tasks? And what are good benchmark tasks that capture the essence of how catastrophic forgetting naturally arises in practice? This paper begins to provide answers to these and other questions.  Meta-Learning Requires Meta-Augmentation  Rajendran, Janarthanan*; Irpan, Alex; Jang, Eric  In several areas of machine learning, data augmentation is critical to achieving state-of-the-art generalization performance. 
Examples include computer vision, speech recognition, and natural language processing. It is natural to suspect that data augmentation can play an equally important role in helping meta-learners. In this work, we present a unified framework for meta-data augmentation and an information-theoretic view on how it prevents overfitting. Under this framework, we interpret existing augmentation strategies and propose modifications to handle overfitting. We show the importance of meta-augmentation on current benchmarks and meta-learning algorithms and demonstrate that meta-augmentation produces large complementary benefits to recently proposed meta-regularization techniques.  Automated Utterance Generation  Parikh, Soham*; Tiwari, Mitul; Vohra, Quaizar  Conversational AI assistants are becoming popular and question-answering is an important part of any conversational assistant. Using relevant utterances as features in question-answering has been shown to improve both the precision and recall for retrieving the right answer by a conversational assistant. Hence, utterance generation has become an important problem, with the goal of generating relevant utterances (sentences or phrases) from a knowledge base article that consists of a title and a description. However, generating good utterances usually requires a lot of manual effort, creating the need for automated utterance generation. In this paper, we propose an utterance generation system which 1) uses extractive summarization to extract important sentences from the description, 2) uses multiple paraphrasing techniques to generate a diverse set of paraphrases of the title and summary sentences, and 3) selects good candidate paraphrases with the help of a novel candidate selection algorithm.  Uncovering Task Clusters in Multi-Task Reinforcement Learning  Kumar, Varun; Rakelly, Kate; Majumdar, Somdeb*  Multi-task learning refers to the approach of learning several distinct tasks using a shared representation. 
Such a strategy can be beneficial if the tasks share common structure: in this case, training a model on each individual task would be unnecessarily inefficient, as it would involve learning the common structure repeatedly. By contrast, a shared representation only needs to learn the structure a single time, following which it can be transferred to other tasks. This approach, when paired with deep neural networks, has proven effective in domains such as computer vision and natural language processing. Results in reinforcement learning, however, have been mixed, with multi-task reinforcement learning sometimes proving to be less sample efficient than independent single-task models. We investigate multi-task reinforcement learning in a recently published benchmark, Meta-World MT10. We suggest a method to reduce conflicts in multi-task reinforcement learning by dividing the task space into clusters of related tasks, and show that this method results in improved performance compared to prior work.  ECLIPSE: An Extreme-Scale Linear Program Solver for Web-Applications  Basu, Kinjal; Ghoting, Amol; Pan, Yao*; Keerthi, S. Sathiya; Mazumder, Rahul  Web applications (involving many millions of users and items) based on machine learning often involve global constraints (e.g., budget limits of advertisers) that need to be satisfied during deployment (inference). This problem can usually be formulated as a Linear Program (LP) involving billions to trillions of decision variables and constraints. Despite the appeal of an LP formulation, solving problems at such scales is well beyond the capabilities of existing LP solvers. Often, ad-hoc decomposition rules are used to approximately solve these LPs, which have limited optimality guarantees and lead to suboptimal performance in practice. In this work, we propose a distributed solver that solves the LP problems at scale. We propose a gradient-based algorithm on the smoothed dual of the LP with computational guarantees. 
The main workhorses of our algorithm are distributed matrix-vector multiplications (with load balancing) and efficient projection operations on distributed machines. Experiments on real-world data show that our proposed LP solver, ECLIPSE, can even solve problems with 10^12 decision variables within a few hours, well beyond the capabilities of current generic LP solvers.  Deep Ensembles: A Loss Landscape Perspective  Hu, Huiyi*; Fort, Stanislav; Lakshminarayanan, Balaji  Deep ensembles have been empirically shown to be a promising approach for improving the accuracy, uncertainty, and out-of-distribution robustness of deep learning models. While deep ensembles were theoretically motivated by the bootstrap, non-bootstrap ensembles trained with just random initialization also perform well in practice, which suggests that there could be other explanations for why deep ensembles work well. Bayesian neural networks, which learn distributions over the parameters of the network, are theoretically well-motivated by Bayesian principles, but do not perform as well as deep ensembles in practice, particularly under dataset shift. One possible explanation for this gap between theory and practice is that popular scalable variational Bayesian methods tend to focus on a single mode, whereas deep ensembles tend to explore diverse modes in function space. We investigate this hypothesis by building on recent work on understanding the loss landscape of neural networks and adding our own exploration to measure the similarity of functions in the space of predictions. Our results show that random initializations explore entirely different modes, while functions along an optimization trajectory or sampled from the subspace thereof cluster within a single mode in prediction space, while often deviating significantly in weight space. Developing the concept of the diversity-accuracy plane, we show that the decorrelation power of random initializations is unmatched by popular subspace sampling methods. 
Finally, we evaluate the relative effects of ensembling, subspace-based methods, and ensembles of subspace-based methods, and the experimental results validate our hypothesis.  CoCon: Cooperative-Contrastive Learning  Rai, Nishant*; Adeli, Ehsan; Lee, Kuan-Hui; Gaidon, Adrien; Niebles, Juan Carlos  Labeling videos at scale is impractical. Consequently, self-supervised visual representation learning is key for efficient video analysis. Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge. However, when applied to real-world videos, contrastive learning may unknowingly lead to the separation of instances that contain semantically similar events. In our work, we introduce a cooperative variant of contrastive learning to address this issue. We use data-driven sampling to leverage implicit relationships between multiple input video views, whether observed (e.g. RGB) or inferred (e.g. flow, segmentation masks, poses). We experimentally evaluate our representations on the downstream task of action recognition. Our method sets a new state of the art on standard benchmarks (UCF101, HMDB51, Kinetics-400). Furthermore, qualitative experiments illustrate that our models can capture higher-order class relationships. We will release code and models.  Curriculum and Decentralized Learning in Google Research Football  Domitrz, Witalis; Mikula, Maciej; Opała, Zuzanna*; Pacek, Mikołaj; Rychlicki, Mateusz; Sieniawski, Mateusz; Staniszewski, Konrad; Michalewski, Henryk; Miłoś, Piotr; Osiński, Błażej B  We study various curricula in the game of football (aka soccer). We aim to understand various methodological and architectural choices. Concentrating on football has the advantage of interpretability, but we believe that our observations with regard to decentralized learning are applicable more generally.  
Learning Mixed-Integer Convex Optimization Strategies for Robot Planning and Control  Cauligi, Abhishek*; Culbertson, Preston; Stellato, Bartolomeo; Schwager, Mac; Pavone, Marco  Mixed-integer convex programming (MICP) has seen significant algorithmic and hardware improvements, with several orders of magnitude solve time speedups compared to 25 years ago. Despite these advances, MICP has been rarely applied to real-world robotic control because the solution times are still too slow for online applications. In this work, we extend the machine learning optimizer (MLOPT) framework to solve MICPs arising in robotics at very high speed. MLOPT encodes the combinatorial part of the optimal solution into a strategy. Using data collected from offline problem solutions, we train a multiclass classifier to predict the optimal strategy given problem-specific parameters such as states or obstacles. Compared to existing approaches, we use task-specific strategies and prune redundant ones to significantly reduce the number of classes the predictor has to select from, thereby greatly improving scalability. Given the predicted strategy, the control task becomes a small convex optimization problem that we can solve in milliseconds. Numerical experiments on a free-flying space robot and task-oriented grasps show that our method provides not only 1 to 2 orders of magnitude speedups compared to state-of-the-art solvers but also performance close to the globally optimal MICP solution.  Entity Skeletons as Intermediate Representations for Visual Storytelling  Chandu, Khyathi Raghavi*  We are enveloped by stories of visual interpretations in our everyday lives. Story narration often comprises two stages: forming a central mind map of entities and weaving a story around them. In this paper, we address these two stages of introducing the right entities at seemingly reasonable junctures and also referring to them coherently in the context of visual storytelling. 
The building blocks of this, also known as the entity skeleton, are entity chains including nominal and coreference expressions. We establish a strong baseline for skeleton-informed generation and propose a glocal hierarchical attention model that attends to the skeleton both at the sentence (local) and the story (global) levels. We observe that our proposed models outperform the baseline in terms of the automatic evaluation metric METEOR. We also conduct a human evaluation, which concludes that visual stories generated by our model are preferred 82% of the time.  Exact Polynomial-time Convex Optimization Formulations for Two-Layer ReLU Networks  Pilanci, Mert; Ergen, Tolga*  We develop exact representations of two-layer neural networks with rectified linear units in terms of a single convex program with a number of variables polynomial in the number of training samples and the number of hidden neurons.  Active Online Domain Adaptation  Chen, Yining*; Luo, Haipeng; Ma, Tengyu; Zhang, Chicheng  Online machine learning systems need to adapt to domain shifts. Meanwhile, acquiring labels at every timestep is expensive. We propose a surprisingly simple algorithm that adaptively balances its regret and its number of label queries in settings where the data streams are from a mixture of hidden domains. For online linear regression with oblivious adversaries, we provide a tight tradeoff that depends on the durations and dimensionalities of the hidden domains. Our algorithm can adaptively deal with interleaving spans of inputs from different domains. We also generalize our results to nonlinear regression for hypothesis classes with bounded eluder dimension and adaptive adversaries. Experiments on synthetic and realistic datasets demonstrate that our algorithm achieves lower regret than uniform queries and greedy queries with equal labeling budget.  
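The Active Online Domain Adaptation abstract does not spell out its query rule, but the general idea of trading regret against label queries can be illustrated with a minimal selective-sampling sketch. This is not the authors' algorithm: it is online ridge regression that only requests a label when the incoming input is poorly covered by previously labeled inputs, and all variable names and thresholds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d, T = 5, 2000
lam, tau = 1.0, 0.2          # ridge regularizer, query threshold (assumed)

A = lam * np.eye(d)          # regularized second moment of queried inputs
b = np.zeros(d)
w_true = rng.normal(size=d)  # hidden regressor; domain shift would switch it

queries = 0
for t in range(T):
    x = rng.normal(size=d)
    A_inv = np.linalg.inv(A)
    y_hat = x @ (A_inv @ b)            # predict with current ridge estimate
    y = x @ w_true + 0.1 * rng.normal()
    # query the label only when x is poorly covered by labeled data so far
    if x @ A_inv @ x > tau:
        queries += 1
        A += np.outer(x, x)
        b += y * x

print(queries, T)  # label queries stay a small fraction of the rounds
```

With this rule the query rate decays as labeled data accumulates, which is the qualitative regret-versus-queries balance the abstract describes.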
DisARM: An Antithetic Gradient Estimator for Binary Latent Variables  Dong, Zhe*; Mnih, Andriy; Tucker, George  Training models with discrete latent variables is challenging due to the difficulty of estimating the gradients accurately. Much of the recent progress has been achieved by taking advantage of continuous relaxations of the system, which are not always available or even possible. The Augment-REINFORCE-Merge (ARM) estimator (Yin and Zhou, 2019) provides an alternative that, instead of relaxation, uses continuous augmentation. Applying antithetic sampling over the augmenting variables yields a relatively low-variance and unbiased estimator applicable to any model with binary latent variables. However, while antithetic sampling reduces variance, the augmentation process increases variance. We show that ARM can be improved by analytically integrating out the randomness introduced by the augmentation process, guaranteeing substantial variance reduction. Our estimator, \emph{DisARM}, is simple to implement and has the same computational cost as ARM. We evaluate DisARM on several generative modeling benchmarks and show that it consistently outperforms ARM and a strong independent sample baseline in terms of both variance and log-likelihood.  Boosted Sparse Oblique Decision Trees  Gabidolla, Magzhan*; Zharmagambetov, Arman S; Carreira-Perpinan, Miguel A  Boosted decision trees are widely used machine learning algorithms, achieving state-of-the-art performance in many domains with little effort on hyperparameter tuning. Though much work on boosting has focused on the theoretical properties and empirical variations, there has been little progress on the tree learning procedure itself. To this day, boosting algorithms employ regular axis-aligned trees as base learners, optimized by CART-style greedy top-down induction. 
These trees are known to be highly suboptimal due to their greedy nature, and they are not well-suited to model the correlation of features due to their axis-aligned partitions. In fact, these suboptimality characteristics are commonly believed to be beneficial because of the weak learning criterion in boosting. In this work we consider boosting better-optimized sparse oblique decision trees trained with the recently proposed Tree Alternating Optimization (TAO). TAO generally finds much better approximate optima than CART-type algorithms due to its ability to monotonically decrease a desired objective function over a decision tree. Our extensive experimental results demonstrate that boosted sparse oblique TAO trees improve upon CART trees by a large margin, and achieve better test error than other popular tree ensembles such as gradient boosting (XGBoost) and random forests. Moreover, the resulting TAO ensembles require a far smaller number of trees.  A flexible, extensible software framework for model compression based on the LC algorithm  Idelbayev, Yerlan*; Carreira-Perpinan, Miguel A  We propose a software framework based on the ideas of the Learning-Compression (LC) algorithm that allows a user to compress a neural network or other machine learning model using different compression schemes with minimal effort. Currently, the supported compressions include pruning, quantization, low-rank methods (including automatically learning the layer ranks), and combinations of those, and the user can choose different compression types for different parts of a neural network. The library is written in Python and PyTorch and is available online at https://github.com/UCMerced-ML/LC-model-compression  Safety Aware Reinforcement Learning (SARL)  Miret, Santiago*; Wainwright, Carroll; Majumdar, Somdeb  As reinforcement learning agents become more and more integrated into complex, real-world environments, designing for safety becomes more and more important. 
We specifically focus on scenarios where the agent can cause undesired side effects that may be linked with performing the primary task. The interdependence of side effects with the primary task makes it difficult to define hard constraints for the agent without sacrificing task performance. In order to address this challenge, we propose a novel virtual-agent-embedded co-training framework (SARL). SARL includes a primary reward-based actor and a virtual agent that assesses side effect impacts and influences the behavior of the reward-based actor via loss regularization. The actor loss is regularized with a proper distance metric measuring the difference in action probabilities of the two agents. As such, in addition to optimizing for the task objective, the actor also aims to minimize the disagreement between itself and the safety agent. We apply SARL to tasks and environments in the SafeLife suite, which can generate complex tasks in dynamic environments, and construct performance vs. side-effect Pareto fronts. Preliminary results indicate that SARL is competitive with a reward-based penalty method, which punishes side effects directly in the reward function, while also providing zero-shot generalization of the safety agent across different environments. This zero-shot generalization suggests that through SARL we can obtain a more flexible notion of side effects that is useful for a variety of settings.  Hamming Space Locality Preserving Neural Hashing for Similarity Search  Idelson, Daphna*  We propose a novel method for learning to map a large-scale dataset in the feature representation space to binary hash codes in the Hamming space, for fast and efficient approximate nearest-neighbor similarity search. Our method is composed of a simple neural network and a novel training scheme that aims to preserve the locality relations between the original data points. 
We achieve distance preservation of the original cosine space in the new Hamming space by introducing a loss function that translates the relational similarities in both spaces into probability distributions and optimizes the KL divergence between them. We also introduce a simple data sampling method by representing the database with randomly generated proxies, used as reference points for query points from the training set. Experimenting with three publicly available standard ANN benchmarks, we demonstrate significant improvement over other binary hashing methods, achieving an improvement of between 7% and 17%. As opposed to other methods, we show high performance in both low (64 bits) and high (768 bits) dimensional bit representations, offering increased accuracy when resources are available and flexibility in the choice of ANN strategy.  What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation  Feldman, Vitaly*; Zhang, Chiyuan  Deep learning algorithms are well-known to have a propensity for fitting the training data very well, and often fit even outliers and mislabeled data points. Such fitting requires memorization of training data labels, a phenomenon that has attracted significant research interest but has not been given a compelling explanation so far. A recent work of Feldman (2020) proposes a theoretical explanation for this phenomenon based on a combination of two insights. First, natural image and data distributions are (informally) known to be long-tailed, that is, to have a significant fraction of rare and atypical examples. Second, in a simple theoretical model, such memorization is necessary for achieving close-to-optimal generalization error when the data distribution is long-tailed. However, no direct empirical evidence for this explanation, or even an approach for obtaining such evidence, was given. In this work we design experiments to test the key ideas in this theory. 
The experiments require estimation of the influence of each training example on the accuracy at each test example, as well as memorization values of training examples. Estimating these quantities directly is computationally prohibitive, but we show that closely related subsampled influence and memorization values can be estimated much more efficiently. Our experiments demonstrate the significant benefits of memorization for generalization on several standard benchmarks. They also provide quantitative and visually compelling evidence for the theory put forth in (Feldman, 2020).  Can Neural Networks Learn Non-Verbal Reasoning?  Zhang, Chiyuan*; Raghu, Maithra; Bengio, Samy  Neural networks have demonstrated excellent capabilities in learning generalizable pattern matching: the ability to identify simple properties of the training data and utilize these properties to correctly process unseen (test) instances. These results raise fundamental questions on the reasoning capabilities of neural networks and how they generalize. Can neural networks learn more sophisticated reasoning? Are there insights into how they generalize in pattern matching and sophisticated reasoning settings? In this paper, we introduce a visual reasoning task to help investigate these questions.  Learning to reason by learning on rationales  Piękos, Piotr*; Michalewski, Henryk; Malinowski, Mateusz  For centuries humans have been codifying observed natural or social phenomena in some abstract language. Such a language, which we call mathematics, is at the core of not only modern science but also everyday activity. In this work, we look into the basic algebraic formulations used to solve real, concrete problems, like ``how much money do I need to spend to buy 2 apples, knowing each costs 2 pounds''. We teach children to solve such math word problems very early and universally in our education system. 
The teacher asks a question about some real problem and expects not only answers but also an understanding of the rationale behind them: the consecutive, precise steps that lead to the answer. In this work we are motivated by the same learning process and incorporate rationales during the training of a language model. We also show that by learning to understand the order of steps in rationales, we can improve the overall performance of our model.  Modality-Agnostic Attention Fusion for visual search with text feedback  Dodds, Eric M*; Culpepper, Jack; Herdade, Simao; Zhang, Yang; Boakye, Kofi A  Image retrieval with natural language feedback offers the promise of catalog search based on fine-grained visual features that go beyond objects and binary attributes, facilitating real-world applications such as e-commerce. Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two datasets for visual search with modifying phrases, Fashion IQ and CSS. We also introduce two new challenging benchmarks adapted from Birds-to-Words and Spot-the-Diff, which provide new settings with rich language inputs, and we show that our approach without modification outperforms strong baselines. To better understand our model, we conduct detailed ablations on Fashion IQ and provide visualizations of the surprising phenomenon of words avoiding ``attending'' to the image region they refer to.  MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records  Xu, Zhen*; So, David; Dai, Andrew M  Deep learning models trained on electronic health records (EHR) have demonstrated their potential to improve healthcare quality in a variety of areas, such as predicting diagnoses, reducing healthcare costs and personalizing medicine. However, most model architectures that are commonly employed were originally developed for academic unimodal machine learning datasets, such as ImageNet or WMT. 
In contrast, EHR data is multimodal, containing sparse and irregular longitudinal features with a mix of structured and unstructured data. Such complex data often requires specific modeling for each modality and a good strategy to fuse different representations to reach peak performance. To address this, we propose MUltimodal Fusion Architecture SeArch (MUFASA), the first multimodal Neural Architecture Search (NAS) method for EHR data. Specifically, we reformulate the NAS objective to simultaneously search for several architectures, jointly optimizing multimodal fusion strategies and per-modality model architectures together. We demonstrate empirically that our MUFASA method outperforms established unimodal evolutionary NAS on Medical Information Mart for Intensive Care (MIMIC-III) EHR data with comparable computation costs. What’s more, our experimental results show that MUFASA produces models that outperform the Transformer, and its NAS variant, the Evolved Transformer, on public EHR data. Compared with these baselines on MIMIC CCS diagnosis code prediction, our discovered models improve top-5 recall from 0.88 to 0.91, and demonstrate the ability to generalize to other EHR tasks. Studying our top architecture in depth, we provide empirical evidence that MUFASA’s improvements are derived from its ability to optimize custom modeling for varying input modalities and find effective fusion strategies.  VirAAL: Virtual Adversarial Active Learning  Senay, Gregory*; Youbi Idrissi, Badr; Haziza, Marine  This paper presents VirAAL, an Active Learning framework based on Virtual Adversarial Training (VAT), a semi-supervised approach that regularizes the model through Local Distributional Smoothness (LDS). VirAAL aims to reduce the effort of annotation in Natural Language Understanding (NLU). Adversarial perturbations are added to the inputs, making the posterior distribution more consistent. 
Therefore, entropy-based Active Learning (AL) becomes robust by querying more informative samples without requiring additional components. VirAAL is an inexpensive method in terms of AL computation with a positive impact on data sampling. Furthermore, VirAAL decreases annotations in AL by up to 80%.  Beyond Supervision for Monocular Depth Estimation  Guizilini, Vitor*; Ambruș, Rareș A; Li, Jie; Pillai, Sudeep; Gaidon, Adrien  Self-supervised learning enables training predictive models on arbitrarily large amounts of unlabeled data. One of the most successful examples of self-supervised learning is monocular depth estimation, which relies on strong geometric priors to learn from raw monocular image sequences in a structure-from-motion setting. In this work, we present recent breakthroughs in self-supervised monocular depth estimation that establish a new state of the art on standard benchmarks, reaching parity with fully supervised methods. Our contributions center on a novel neural network architecture, PackNet, that is specifically designed for large-scale self-supervised learning on high-resolution videos. We also discuss semi-supervised training extensions that can effectively combine the self-supervised objective with partial supervision, whether from very sparse Lidar scans, velocity information, or pretrained segmentation models, while keeping inference monocular. Finally, we introduce a new, diverse, and challenging benchmark: Dense Depth for Automated Driving (DDAD). DDAD contains diverse scenes collected using a fleet of autonomous vehicles across the US and Japan. Thanks to long-range Lidar sensors, we expand standard metrics to (a) include evaluation at longer ranges of up to 200m, to properly measure how performance degrades with distance; and (b) provide fine-grained labels in the validation and test frames to enable per-category and per-instance metrics, thus overcoming the current limitation of uniform per-pixel depth evaluations.  
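The longer-range evaluation described for DDAD can be made concrete with a small sketch. The function below is a generic range-bucketed absolute-relative depth error, not the benchmark's official evaluation code; the bucket edges, array names, and the toy noise model are all assumptions for illustration.

```python
import numpy as np

def abs_rel_by_range(pred, gt, buckets=((0, 50), (50, 100), (100, 200))):
    """Absolute relative depth error, reported per ground-truth range bucket (meters)."""
    out = {}
    for lo, hi in buckets:
        mask = (gt > lo) & (gt <= hi)
        if mask.any():
            out[(lo, hi)] = float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))
    return out

# toy example: prediction noise that grows with distance, as is typical for depth
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 200.0, size=10_000)
pred = gt * (1 + 0.05 * (gt / 100.0) * rng.normal(size=gt.shape))
metrics = abs_rel_by_range(pred, gt)
print(metrics)  # error rises with range, which per-bucket reporting exposes
```

Reporting the metric per bucket, rather than averaged over all pixels, is exactly what reveals the degradation with distance that a uniform per-pixel evaluation hides.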
Synthetic Health Data for Fostering Reproducibility of Private Research Studies  Bhanot, Karan*; Dash, Saloni; Yale, Andrew; Guyon, Isabelle; Erickson, John; Bennett, Kristin  The inability to share private health data can severely stifle research and has led to the reproducibility crisis in biomedical research. Recent synthetic data generation methods provide an attractive alternative for making data available for research and education purposes without violating privacy. In this paper, we discuss our novel HealthGAN model, which produces high quality synthetic health data, and demonstrate its effectiveness by reproducing research studies. To preserve privacy, HealthGAN synthetic data can be released when research papers are published. Approaches can be developed on synthetic data and then evaluated on real data inside secure environments, enabling novel method generation.  A Synthetic Data Petri Dish for Studying Mode Collapse in GANs  Mangalam, Karttikeya*; Garg, Rohin  In this extended abstract, we present a simple yet powerful data generation procedure for studying mode collapse in GANs. We describe a computationally efficient way to obtain visualizable high-dimensional data using normalizing flows. We also train GANs (Table 1) on the different proposed dataset Levels and find mode collapse to occur even in the most robust GAN formulations. We also use the inversion quality of our proposed transformation to visualize both the high-dimensional generated samples in a 2D space and the learnt discriminator's distribution as a heatmap. Such 2D visualizations are ill-defined with other dimensionality reduction methods such as PCA or t-SNE when applied to natural images, since they suffer from approximations and strong dependence on visualization hyperparameters, and are computationally expensive. 
We believe our proposed procedure will serve as a petri dish for studying mode collapse in controlled settings and enable a better understanding of the failure modes of proposed robust formulations, thereby propelling research further in generative algorithms.  Whitening and second order optimization both destroy information about the dataset, and can make generalization impossible  Wadia, Neha*; Duckworth, Daniel; Schoenholz, Samuel S; Dyer, Ethan; Sohl-Dickstein, Jascha  We argue that both data whitening and second order optimization can harm or entirely prevent generalization. In general, model training can harness information contained in the sample-sample second moment matrix of the dataset. We show that for models with a fully connected first layer, the information contained in this matrix is the only information which can be used to generalize. Models trained using whitened data, or with certain second order optimization schemes, have less access to this information; in the high-dimensional regime, the training procedure has no access to this information, producing models that either generalize poorly or not at all. We experimentally verify the predicted harmful effects of data whitening and second order optimization on generalization. We further show experimentally that generalization continues to be harmed even when theoretical requirements are relaxed.  A Deep Learning Pipeline for Patient Diagnosis Prediction Using Electronic Health Records  Paudel, Bibek*; Shrestha, Yash Raj; Franz, Leopold H  Augmentation of disease diagnosis and decision-making in health care with machine learning algorithms has been gaining much impetus in recent years. In particular, in the current epidemiological situation caused by the COVID-19 pandemic, swift and accurate prediction of disease diagnosis with machine learning algorithms could facilitate identification and care of vulnerable clusters of the population, such as those having multimorbidity conditions. 
In order to build a useful disease diagnosis prediction system, advances in both data representation and machine learning architectures are imperative. First, with respect to data collection and representation, we face severe problems due to the multitude of formats and lack of coherency prevalent in Electronic Health Records (EHRs). This hinders the extraction of valuable information contained in EHRs. Currently, no universal global data standard has been established. As a useful solution, we develop and publish a Python package to transform public health datasets into an easy-to-access universal format. This data transformation to an international health data format enables researchers to easily combine EHR datasets with clinical datasets of diverse formats. Second, machine learning algorithms that predict multiple disease diagnosis categories simultaneously remain underdeveloped. We propose two novel model architectures in this regard. First, DeepObserver, which uses structured numerical data to predict the diagnosis categories, and second, ClinicalBERT_Multi, which incorporates rich information available in clinical notes via natural language processing methods and also provides interpretable visualizations to medical practitioners. We show that both models can predict multiple diagnoses simultaneously with high accuracy.  Ads Click-through Rate Prediction Models For Multi-Datasource Tasks  Wang, Erzhuo*  Click-through rate prediction in online advertisement is a challenging machine learning problem that involves multiple objectives and multiple data sources. For example, at Pinterest we serve both shopping and standard Ads products, where each product has its own unique characteristics of creatives and user behavior patterns. In this work, we address this problem by adopting a multi-task deep neural network model that jointly learns distinct distributions of the data from various sources simultaneously. 
To tackle the multi-datasource problem, we propose a shared-bottom, multi-tower model architecture. The multi-tower structure can effectively isolate the interference from the distinct data distributions of different sources, while the shared-bottom layers enable us to learn lower-level common signals. Furthermore, we make use of contextual signals on top of the neural networks to calibrate the predictions, such that good confidence in the inferred likelihood is established. In addition, an automatic machine learning framework is leveraged, which handles feature extraction and feature transforms algorithmically, saving the cost of manual feature engineering. The multi-tower model achieves better offline evaluation results on both data sources than either a single-tower structure or separate per-source models. The integrated solution realizes significant CTR gains compared to vanilla multilayer perceptron neural network models in online A/B testing.  Adversarial Learning for Debiasing Knowledge Base Embeddings  Paudel, Bibek*; Arduini, Mario; Shrestha, Yash Raj; Zhang, Ce; Pirovano, Federico; Noci, Lorenzo  Knowledge Graphs (KGs) are gaining increasing attention in both academia and industry. Despite their diverse benefits, recent research has identified social and cultural biases embedded in the representations learned from KGs. Such biases can have detrimental consequences for different population and minority groups as applications of KGs begin to intersect and interact with social spheres. This paper describes our work in progress, which aims at identifying and mitigating such biases in Knowledge Graph (KG) embeddings. We explore gender bias in KG embeddings (KGE), and a careful examination of popular KGE algorithms suggests that sensitive attributes like the gender of a person can be predicted from the embedding. This implies that such biases in popular KGs are captured by the structural properties of the embedding. 
As a preliminary solution to debiasing KGs, we introduce a novel framework to filter out the sensitive attribute information from the KG embeddings, which we call FAN (Filtering Adversarial Network). We also suggest the applicability of FAN to debiasing other network embeddings, which could be explored in future work.  Meta Attention Networks: Meta Learning Attention to Modulate Information Between Sparsely Interacting Recurrent Modules  Madan, Kanika*; Ke, Nan Rosemary; Goyal, Anirudh; Bengio, Yoshua  Decomposing knowledge into interchangeable pieces promises a generalization advantage when, at some level of representation, the learner is likely to be faced with situations requiring novel combinations of existing pieces of knowledge or computation. We hypothesize that such a decomposition of knowledge is particularly relevant for higher levels of representation, as we see this at work in human cognition and natural language in the form of systematicity, or systematic generalization. To study these ideas, we propose a particular training framework in which we assume that the pieces of knowledge an agent needs, as well as its reward function, are stationary and can be reused across tasks and changes in distribution. As the learner is confronted with variations in experiences, the attention selects which modules should be adapted, and the parameters of those selected modules are adapted fast, while the parameters of the attention mechanisms are updated slowly as meta-parameters. We find that both the meta-learning and the modular aspects of the proposed system greatly help achieve faster learning in experiments with a reinforcement learning setup involving navigation in a partially observed gridworld.  Batch Reinforcement Learning Through Continuation Method  Guo, Yijie*; Chen, Minmin; Lee, Honglak; Chi, Ed H.  Many real-world applications of reinforcement learning (RL) require the agent to learn from a fixed set of trajectories, without collecting new interactions. 
Policy optimization under this setting is extremely challenging due to the distribution shift. In this work, we propose a simple yet effective policy-based approach to batch RL using the global optimization method known as continuation, i.e., by constraining the Kullback-Leibler (KL) divergence between the learned policy and the behavior policy that generates the fixed trajectories, and continuously relaxing the constraint. We theoretically show that policy gradient with KL divergence regularization converges significantly faster than vanilla policy gradient under the tabular setting, even with the exact gradient. We empirically verify that our method benefits not only from the faster convergence, but also from reduced noise in the gradient estimate under the batch RL setting with function approximation. We present results on continuous control tasks and tasks with discrete actions to demonstrate the efficacy of our proposed method.  Rotation-Invariant Gait Identification with Quaternion Convolutional Neural Networks  Jing, Bowen*; Prabhu, Vinay Uday; Gu, Angela; Whaley, John  CNN-based accelerometric gait identification systems suffer a catastrophic drop in test accuracy when they encounter new device orientations unobserved during enrollment. In this paper we target this problem by introducing an SO(3)-equivariant quaternion convolutional kernel inside the CNN and disseminate some initial promising results.  Attention-Sampling Graph Convolutional Networks  Lippoldt, Franziska; Lavin, Alexander*  A principal advantage of Graph Convolutional Networks (GCNs) lies in their ability to cope with irregular data, which we evaluate in the image domain by inspecting both graph downsampling methods and network accuracy with respect to edge connections. We specifically investigate the effects of distance-based vs. feature-attention downsampling, and suggest a method of generalizing pixel-wise attention to the graph setting to better represent distributions and irregularity. 
Our analysis is especially important for pathological images for carcinoma prediction: due to image size and over-represented cell-graphs, downsampling is naturally required, and simplifying graph assumptions may misrepresent the cellular structures. With principled downsampling within GCNs, we find that graph analysis of cells reveals possible stages of carcinoma development.  Energy-based View of Retrosynthesis  Sun, Ruoxi*; Dai, Hanjun; Li, Li; Kearnes, Steven; Dai, Bo  Retrosynthesis, the process of identifying a set of reactants to synthesize a target molecule, is of vital importance to material design and drug discovery. Existing machine learning approaches based on language models and graph neural networks have achieved encouraging results. In this paper, we propose a framework that unifies sequence- and graph-based methods as energy-based models (EBMs) with different energy functions. This unified perspective provides critical insights about EBM variants through a comprehensive assessment of performance. Additionally, we present a novel "dual" variant within the framework that performs consistent training over Bayesian forward and backward prediction by constraining the agreement between the two directions. This model improves state-of-the-art performance by 9.6% for template-free approaches where the reaction type is unknown.  Neural Interventional GRU-ODEs  Zhou, Helen*; Xue, Yuan; Dai, Andrew M  Data is often generated as a continually accumulating byproduct of existing systems. This data can be observational, interventional, or a mixture of the two. In hospitals, for example, diagnostic measurements may be taken as needed, and treatments may be administered and recorded over a finite period of time. Leveraging recent advances in continuous-time modeling, we propose Neural Interventional GRU-ODEs (NIGO) to model passive observations alongside active interventions which happen at irregular time points. 
In this model, observations provide information about the underlying state of the system, whereas interventions drive changes in the underlying state. Our model seeks to capture the influence of interventions on the latent state, while also learning the dynamics of the system. Experiments are done on a simulated pendulum dataset with gravity interventions.  See, Hear, Explore: Curiosity via Audio-Visual Association  Dean, Victoria*; Tulsiani, Shubham; Gupta, Abhinav  Exploration is one of the core challenges in reinforcement learning. A common formulation of curiosity-driven exploration uses the difference between the real future and the future predicted by a learned model. However, predicting the future is an inherently difficult task which can be ill-posed in the face of stochasticity. In this paper, we introduce an alternative form of curiosity that rewards novel associations between different senses. Our approach exploits multiple modalities to provide a stronger signal for more efficient exploration. Our method is inspired by the fact that, for humans, both sight and sound play a critical role in exploration. We present results on Habitat (a photorealistic navigation simulator), showing the benefits of using an audio-visual association model for intrinsically guiding learning agents in the absence of external rewards.  TS-GLR: an Adaptive Thompson Sampling for the Switching Multi-Armed Bandit Problem  Alami, Reda*; Azizi, Oussama  The stochastic multi-armed bandit problem has been widely studied under the stationary assumption. However, in real-world problems and industrial applications, this assumption is often unrealistic because the distributions of rewards may change over time. In this paper, we consider the piecewise i.i.d. non-stationary stochastic multi-armed bandit problem with unknown change-points, and we focus on the change-of-mean setup. 
To solve the latter, we propose a Thompson Sampling strategy equipped with a change-point detector based on a well-tuned non-parametric Generalized Likelihood Ratio (GLR) test. We call the resulting strategy Thompson Sampling-GLR (TS-GLR). Analytically, in the context of regret minimization for the global switching setting, our proposal achieves a $\mathcal{O}\left(K_T \log T\right)$ regret upper bound, where $K_T$ is the overall number of change-points up to the horizon $T$. This is in contrast with the lower bound in $\Omega(\sqrt{T})$. This result mainly comes from the order-optimal detection delay of the GLR test for sub-Gaussian distributions and its well-controlled false alarm rate. Experimentally, we demonstrate that TS-GLR outperforms state-of-the-art non-stationary stochastic bandits on synthetic Bernoulli rewards as well as on the Yahoo! User Click Log Dataset.  Learning Invariant Representations for Reinforcement Learning without Reconstruction  Zhang, Amy; McAllister, Rowan*; Calandra, Roberto; Gal, Yarin; Levine, Sergey  We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying either on domain knowledge or pixel reconstruction. Our goal is to learn representations that both provide for effective downstream control and invariance to task-irrelevant details. Bisimulation metrics quantify behavioral similarity between states in continuous MDPs, and we propose using them to learn robust latent representations which encode only the task-relevant information from observations. Our method trains encoders such that distances in latent space equal bisimulation distances in state space. We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks, where the background is replaced with moving distractors and natural videos, while achieving SOTA performance.  
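The encoder objective sketched in the "Learning Invariant Representations for Reinforcement Learning without Reconstruction" abstract above (latent distances trained to match bisimulation distances) can be illustrated in miniature. This is a hypothetical sketch, not the authors' implementation: the function name is ours, and the discounted L1 distance between next-state latent codes stands in for the transition-distribution term of the true bisimulation metric.

```python
import numpy as np

def bisimulation_encoder_loss(z, z_next, r, gamma=0.99, rng=None):
    """Sketch of a bisimulation-style encoder loss: push L1 distances between
    latent codes toward a target built from reward differences plus discounted
    distances between next-state latent codes.
    z, z_next: (batch, d) latent codes for states and next states; r: (batch,) rewards."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(len(z))                    # form random state pairs
    z_dist = np.abs(z - z[perm]).sum(axis=1)          # latent L1 distance per pair
    r_dist = np.abs(r - r[perm])                      # reward difference per pair
    next_dist = np.abs(z_next - z_next[perm]).sum(axis=1)
    target = r_dist + gamma * next_dist               # bisimulation-style target
    return float(np.mean((z_dist - target) ** 2))     # squared error the encoder minimizes
```

Minimizing this with respect to the encoder that produces `z` would drive states with similar rewards and similar futures to nearby latent codes, which is the sense in which task-irrelevant detail is discarded.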
ChemBERTa: Utilizing Transformer-Based Attention for Understanding Chemistry  Chithrananda, Seyone*; Ramsundar, Bharath  Despite the success of pretraining methods in NLP and computer vision, machine-learning-based pretraining methods remain incredibly scarce and ineffective for applications to chemistry. Many previous graph-based molecular property prediction models, which map molecules to a sparse discrete space known as a molecular fingerprint or numerical representation of molecules, have yet to see a strong boost in generalizability or prediction accuracy through pretraining techniques. To solve this, we present ChemBERTa, a RoBERTa-like transformer model that learns molecular fingerprints through semi-supervised pretraining of the sequence-to-sequence language model, using masked-language modelling of a large corpus of 250,000 SMILES strings, a well-known text representation of molecules. We train the model over 15 epochs, obtaining a mean masked LM likelihood loss of 0.285. After pretraining, we fine-tune ChemBERTa by benchmarking its performance on Tox21, a multi-task dataset for predicting the toxicities of molecules through various biochemical pathways. We also present the promise of visualizing the attention mechanism in ChemBERTa for the interpretability of chemical features in a molecule and evaluating the performance of our model. Our models have been made openly available through Hugging Face's model hub with over 12,000 downloads, and we provide a tutorial for running masked language modelling, attention visualization, and binary classification experiments in the DeepChem library with ChemBERTa.  Gradient Descent on Unstable Dynamical Systems  Nar, Kamil*; Xue, Yuan; Dai, Andrew M  When training the parameters of a linear dynamical model, the gradient descent algorithm is likely to fail to converge if the squared-error loss is used as the training loss function. 
Restricting the parameter space to a smaller subset and running the gradient descent algorithm within this subset can allow learning stable dynamical systems, but this strategy does not work for unstable systems. In this work, we show that observations taken at different times from the system to be learned influence the dynamics of the gradient descent algorithm to substantially different degrees. We introduce a time-weighted logarithmic loss function to fix this imbalance and demonstrate its effectiveness in learning unstable systems.  Towards Learning Robots Which Adapt On The Fly  Julian, Ryan C*; Swanson, Benjamin; Sukhatme, Gaurav; Levine, Sergey; Finn, Chelsea; Hausman, Karol  One of the great promises of robot learning systems is that they will be able to learn from their mistakes and continuously adapt to ever-changing environments. Despite this potential, most robot learning systems today are deployed as a fixed policy and are not adapted after deployment. Can we efficiently adapt previously learned behaviors to new environments, objects, and percepts in the real world? We present a method and empirical evidence towards a robot learning framework that facilitates continuous adaptation. In particular, we demonstrate how to adapt vision-based robotic manipulation policies to new variations by fine-tuning via off-policy reinforcement learning, including changes in background, object shape and appearance, lighting conditions, and robot morphology. Further, this adaptation uses less than 0.2% of the data necessary to learn the task from scratch. We find that the simple approach of fine-tuning pretrained policies leads to substantial performance gains over the course of fine-tuning, and that pretraining via RL is essential: training from scratch or adapting from supervised ImageNet features are both unsuccessful with such small amounts of data. 
We also find that these positive results hold in a limited continual learning setting, in which we repeatedly fine-tune a single lineage of policies using data from a succession of new tasks. Our empirical conclusions are consistently supported by experiments on simulated manipulation tasks, and by 52 unique fine-tuning experiments on a real robotic grasping system pretrained on 580,000 grasps. 
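The fine-tuning recipe in the "Towards Learning Robots Which Adapt On The Fly" abstract above (reuse a pretrained policy, then take a few gradient steps on a small batch of new-task data rather than training from scratch) can be sketched in a toy setting. Everything here is an illustrative assumption, not the authors' code: logistic regression stands in for the policy, and all names and hyperparameters are invented for the example.

```python
import numpy as np

def train(X, y, w_init, lr=0.5, steps=100):
    """Gradient descent on the logistic log-loss, starting from w_init."""
    w = w_init.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)    # average-gradient step
    return w

def accuracy(X, y, w):
    return float(np.mean((X @ w > 0) == (y > 0.5)))

def finetune(w_pretrained, X_new, y_new, steps=10):
    """Fine-tuning: start from pretrained weights and adapt with only a
    handful of gradient steps on a small new-task batch."""
    return train(X_new, y_new, w_pretrained, steps=steps)
```

In this toy setting, a few fine-tuning steps from a good initialization typically adapt to a shifted task with far less data than training from scratch, which mirrors (in spirit only) the abstract's point about reusing pretrained policies.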
Call for Submissions
Note: Submissions are not blind-reviewed, so please include authors' names and affiliations in your submissions.
Acceptable material includes work which has already been submitted or published, preliminary results, and controversial findings. We do not intend to publish paper proceedings; only abstracts will be shared through an online repository. Our primary goal is to foster discussion!
For examples of previously accepted talks, please watch the paper presentations from BayLearn 2019 or review the complete list of accepted submissions. For examples of abstracts that have been selected in the past, please see the schedule of talks from BayLearn 2018. That page has videos of the talks, and links to PDFs of the abstracts are provided for each of the selected talks. 
