Models, Algorithms, and Data Science Seminar

Winter 2020 Schedule

Stay tuned: we will resume in Winter 2020, since Fall 2019 is already jam-packed with courses and reading groups.

Spring 2019 Schedule

Date Speaker Title
12 April Sanjoy Dasgupta Interactive Hierarchical Clustering
Abstract

Clustering is a powerful tool in data analysis, but it is often difficult to find a grouping that aligns closely with a user’s needs. We show how user feedback can be interactively solicited and incorporated into hierarchical clustering. Our first step is to address a major failing of popular algorithms for hierarchical clustering: the lack of a clear objective function. We introduce a simple cost function on hierarchies over a set of points, given pairwise similarities between those points. We show that this criterion behaves sensibly in canonical instances and that it admits a top-down construction procedure with a provably good approximation ratio. We then show that this procedure lends itself naturally to an interactive setting in which the user is repeatedly shown snapshots of a hierarchy and makes corrections to them. We also design a Bayesian version of this scheme that maintains a distribution over hierarchies, and repeatedly solicits feedback on portions of the tree that have high uncertainty. This is joint work with Mike Luby, Chris Tosh, and Sharad Vikram.
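
The abstract does not spell out the cost function; the sketch below (not the speaker’s code) assumes a Dasgupta-style objective in which each pair of points pays its similarity times the number of leaves under the lowest common ancestor separating it, so that similar points are rewarded for being merged low in the tree:

    from itertools import product

    def dasgupta_cost(tree, sim):
        """Cost of a binary hierarchy: each pair (i, j) pays sim(i, j) times the
        number of leaves of the subtree rooted at their lowest common ancestor.
        A tree is either a leaf label or a pair (left_subtree, right_subtree)."""
        def walk(node):
            if not isinstance(node, tuple):            # leaf: contributes no cost
                return 0.0, [node]
            cost_l, leaves_l = walk(node[0])
            cost_r, leaves_r = walk(node[1])
            leaves = leaves_l + leaves_r
            # pairs split here have this node as their lowest common ancestor
            split = sum(sim(i, j) for i, j in product(leaves_l, leaves_r))
            return cost_l + cost_r + split * len(leaves), leaves
        return walk(tree)[0]

    # toy example: merging the two similar points first is cheaper
    sim = lambda i, j: 1.0 if {i, j} == {"a", "b"} else 0.1
    print(dasgupta_cost((("a", "b"), "c"), sim))   # 1.0*2 + (0.1+0.1)*3 = 2.6
    print(dasgupta_cost((("a", "c"), "b"), sim))   # 0.1*2 + (1.0+0.1)*3 = 3.5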

19 April Yana Safonova Open Problems in Computational Immunology
Abstract

Rapid development of DNA sequencing technologies opened new avenues for analyzing adaptive immune systems through deep interrogation of antibody repertoires. I will describe some unanticipated features of the immune system revealed by these studies and will discuss new computational problems that have emerged. In particular, I will discuss how these immunoinformatics challenges relate to information theory, evolutionary tree reconstruction, and analysis of variations in the immune system across the human population. This is joint work with Pavel Pevzner (UCSD), Siavash Mirarab (UCSD), Ramesh Rao (UCSD), Massimo Franceschetti (UCSD), Vinnu Bhardwaj (UCSD), Corey Watson (University of Louisville), and Timothy Smith (USDA).

26 April Rayan Saab New and Improved Binary Embeddings of Data
Abstract

We discuss two related problems that arise in the acquisition and processing of high-dimensional data. First, we consider distance-preserving fast binary embeddings. Here we propose fast methods to replace points from a subset of R^N with points in a lower-dimensional cube {±1}^m, which we endow with an appropriate function to approximate Euclidean distances in the original space. Second, we consider a problem in the quantization (i.e., digitization) of compressed sensing measurements. Here, we deal with measurements of approximately sparse signals as well as with measurements of manifold-valued data. Our results apply not only to Gaussian measurement matrices but also to those drawn from bounded orthonormal systems and partial circulant ensembles, both of which arise naturally in applications and admit fast transforms. In all these problems we show state-of-the-art error bounds, and to our knowledge, some of our results are the first of their kind.
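
The talk’s constructions are not given in the abstract; as a baseline for comparison, the sketch below implements the classical one-bit random-projection embedding, where the normalized Hamming distance between sign patterns concentrates around the angle between unit vectors divided by π (an angular rather than Euclidean notion of distance):

    import numpy as np

    def binary_embed(X, m, rng):
        """Map each row of X (a point in R^N) to a sign pattern in {±1}^m via a
        single Gaussian projection: the simplest one-bit embedding."""
        G = rng.standard_normal((X.shape[1], m))   # Gaussian measurement matrix
        return np.sign(X @ G)

    def normalized_hamming(bx, by):
        return np.mean(bx != by)                   # fraction of disagreeing bits

    # For unit vectors, the fraction of disagreeing bits concentrates
    # around angle(x, y) / pi.
    rng = np.random.default_rng(0)
    x = np.array([1.0, 0.0, 0.0])
    y = np.array([0.0, 1.0, 0.0])
    bx, by = binary_embed(np.vstack([x, y]), 10000, rng)
    print(normalized_hamming(bx, by), np.arccos(x @ y) / np.pi)   # both ≈ 0.5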

3 May Shachar Lovett The Power of Asking More Informative Questions
Abstract

In the world of big data, it is often easy to get a large amount of unlabelled data. However, learning algorithms typically require labelled data to train on. Active learning is an approach to bridge this gap, where the learning algorithm may query the labels of a few data points. The hope is that by judiciously choosing which data points to query, the number of queries can be made much smaller than in the classical setting, where all the data is labelled.

Unfortunately, this turns out to be false, even for very simple concept classes. This has led research in active learning to focus on making various assumptions about the underlying data, the distribution, and so on.

I will present an alternative approach: allowing the learning algorithm to ask more informative questions about the data. As I will show, even very simple enriched queries, such as comparisons between two data points, yield exponential improvements in many cases. In addition, the mathematical tools that make this possible turn out to have surprising applications to long-standing open problems in complexity theory and combinatorial geometry.

Based on joint work with Max Hopkins, Daniel Kane, Gaurav Mahajan, Shay Moran and Jiapeng Zhang.
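
To make the promised label savings concrete, here is a standard textbook example (not taken from the talk): actively learning a one-dimensional threshold. Binary search over the sorted unlabelled pool needs only about log2(n) label queries, whereas labelling everything costs n; the enriched queries discussed in the talk target improvements of a similar exponential flavor in richer settings.

    def learn_threshold(points, label):
        """Actively learn a 1-D threshold classifier: points at or above an
        unknown threshold are labelled 1, the rest 0.  Binary search over the
        sorted unlabelled pool uses only about log2(n) label queries."""
        xs = sorted(points)
        lo, hi = 0, len(xs)             # first index with label 1 lies in [lo, hi]
        n_queries = 0
        while lo < hi:
            mid = (lo + hi) // 2
            n_queries += 1
            if label(xs[mid]) == 1:     # one call to the label oracle
                hi = mid
            else:
                lo = mid + 1
        threshold = xs[lo] if lo < len(xs) else None   # None: everything labelled 0
        return threshold, n_queries

    points = range(1000)
    print(learn_threshold(points, lambda x: int(x >= 731)))   # (731, 10)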

10 May Julian McAuley Introduction to the Design, Practices, and Ethics of Recommender Systems
Abstract

In this talk we will discuss the modeling techniques behind personalized recommendation technology on the web. Examples of recommender systems range from simple statistical approaches like Amazon’s “people who bought X also bought Y” links, to complex AI-based approaches that drive feed ranking on sites like Facebook. We’ll discuss the models that drive these systems, look at the research questions that will shape the field in the coming years, and discuss the ethical implications of these systems.
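
As a toy illustration of the co-occurrence counting behind “people who bought X also bought Y” links (not Amazon’s production system), one can build the table directly from purchase baskets:

    from collections import Counter, defaultdict
    from itertools import combinations

    def also_bought(baskets, top_k=3):
        """Toy 'people who bought X also bought Y' table: for each item, rank the
        other items it most often co-occurs with across purchase baskets."""
        co = defaultdict(Counter)
        for basket in baskets:
            for x, y in combinations(set(basket), 2):
                co[x][y] += 1
                co[y][x] += 1
        return {item: [y for y, _ in counts.most_common(top_k)]
                for item, counts in co.items()}

    baskets = [["bread", "butter"], ["bread", "butter"], ["bread", "jam"]]
    print(also_bought(baskets)["bread"])   # ['butter', 'jam']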

17 May No seminar
24 May Cyrus Rashtchian Algorithmic Challenges in DNA Data Storage
Abstract

Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. Due to the specific properties of the datasets in this domain, many new techniques must be invented.

One crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters. Our algorithm converges efficiently on any dataset that satisfies certain randomness and separability properties, such as those coming from DNA data storage systems.
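
For reference, the metric in question is edit (Levenshtein) distance; a minimal dynamic-programming implementation is sketched below, though the distributed algorithm in the talk of course avoids computing it for all pairs of billions of strings:

    def edit_distance(s, t):
        """Levenshtein distance: the minimum number of insertions, deletions,
        and substitutions needed to turn s into t."""
        prev = list(range(len(t) + 1))
        for i, a in enumerate(s, 1):
            cur = [i]
            for j, b in enumerate(t, 1):
                cur.append(min(prev[j] + 1,               # delete a
                               cur[j - 1] + 1,            # insert b
                               prev[j - 1] + (a != b)))   # match or substitute
            prev = cur
        return prev[-1]

    print(edit_distance("ACGT", "AGGTT"))   # 2: substitute C->G, insert T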

Other interesting algorithmic problems arise in the retrieval pipeline as well. For example, after computing the clusters, the next step is to find the most likely cluster center. This can be solved using algorithms for Trace Reconstruction, which is the problem of determining an unknown string using samples from a noisy channel that introduces insertions and deletions. There are also open questions relating to error-correcting codes for insertions and deletions.
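
The abstract does not fix the channel model; the sketch below simulates the simplest variant, a deletion-only channel, just to make the trace reconstruction setup concrete:

    import random

    def deletion_channel(s, p, rng):
        """Pass s through a deletion channel: each symbol is independently
        dropped with probability p."""
        return "".join(c for c in s if rng.random() > p)

    rng = random.Random(0)
    original = "ACCGTTAGCA"
    traces = [deletion_channel(original, 0.2, rng) for _ in range(5)]
    print(traces)
    # Trace reconstruction asks: given only such traces (not `original`),
    # how, and from how many traces, can the original string be recovered?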


About the Seminar

The MAD Science Seminar is a weekly meeting consisting of a high-quality technical talk and a provided lunch. Our focus is broad, with speakers from computer science, statistics, mathematics, and data science. Unlike standard talk series, we aim to foster a collaborative and fun environment, with the goal of inspiring interdisciplinary research across UCSD.


Details

When
Fridays, 11:00–12:00 (talk), 12:00–12:30 (lunch)

Where
Atkinson Hall, 4004

Join the Mailing List
mad-science-l@mailman.ucsd.edu (join)


Organizers