Algorithms for Big Data CSE 291

Algorithms for Big Data

Course Description

UCSD Course CSE 291 - F00 (Fall 2020)

This is an advanced algorithms course. Many data-driven areas (computer vision, AR/VR, recommender systems, computational biology) rely on probabilistic and approximation algorithms to overcome the burden of massive datasets. This course explores a foundational view of these techniques by analyzing them through a geometric lens. The first two weeks will review linear algebra (normed spaces, orthogonality, random matrices) and randomized algorithms (approximation guarantees, concentration inequalities). Then, we dive into designing and analyzing algorithms for big data. The main topics include: sampling/sketching, dimensionality reduction, clustering, nearest neighbor search, and distributed models. Throughout the course, we will discuss motivating applications and current research trends, such as adversarial robustness, explainable AI, and learned embeddings. Students will be exposed to many open research problems.

For detailed course information and policies, see the official Course Syllabus

Update: Project report now due on Wed 12/16 at 5pm

Schedule

Date	Notes	Topics	HWs
10/2	Lecture 1	Course Overview
Supplemental Links Quanta Article on Geometry and Data Science
10/5	Lecture 2	Probability Review	HW 1 Out
Supplemental Links Wikipedia on Prob Inequalities: Markov, Chebyshev, Chernoff Video Lectures from UW: Video 1, Video 2, Video 3 Talk by Gabor Lugosi at MLSS on concentration inequalities
10/7	Lecture 3	Prob. Review Cont.
Supplemental Links 3blue1brown Probability: Part 1 and Part 2
10/9	Lecture 4	Approximate Counting	HW 1 Due
Supplemental Links Wikipedia on Morris' Algorithms Video Lecture by Jelani Nelson at Simons workshop
10/12	Lecture 5	Approx Counting cont.	HW 2 Out
10/14	Lecture 6	Distinct Elements
Supplemental Links Wikipedia on Flajolet-Martin Algorithm Video Lecture by Jelani Nelson at Harvard
10/16	Lecture 7	Finish Distinct Elements	HW 2 Due
Supplemental Links Wikipedia on Streaming algs Count-Min Sketch Paper
10/19	Lecture 8	AMS L2 Sketch	HW 3 Out
Supplemental Links AMS paper
10/21	Lecture 9	JL Dimensionality Reduction
Supplemental Links Jelani Nelson Lecture Notes Kane and Nelson Sparser JL Nir Ailon's survey video on advanced JL topics
10/23	Lecture 10	Finish JL	HW 3 Due
10/26	Lecture 11	Project Overview
10/28	Lecture 12	Approx Nearest Neighbors
Supplemental Links Survey Video on ANNS and LSH Associated Survey Article
10/30	Lecture 13	Hamming Dist LSH	Project Proposal Due
11/2	Lecture 14	Cosine Similarity LSH	HW 4 Out
Supplemental Links Quanta Article on ANNS
11/4	Lecture 15	Euclidean Dist LSH
11/6	Lecture 16	Finish LSH	HW 4 Due
11/9	Lecture 17	Clustering Overview
11/11	No Class	Veteran's Day
11/13	Lecture 18	Clustering k-center
Supplemental Links Sanjoy Dasgupta's Course Notes
11/16	Lecture 19	Clustering k-means
Supplemental Links Sanjoy Dasgupta's Course Notes
11/18	Lecture 20	Clustering k-means++
11/20	Lecture 21	Clustering for DNA Storage
Supplemental Links Paper
11/23	Lecture 22	Explainable Clustering	Proj Progress Due
Supplemental Links Blog Post on the 2-means proof
11/25	No Class	Thanksgiving
11/27	No Class	Thanksgiving
11/30	Lecture 23	Explainable Clustering
Supplemental Links Follow-up experimental ExKMC paper
12/2	Lecture 24	Adversarial Robustness
Supplemental Links Paper 1 Paper 2
12/4	Lecture 25	Vector Matrix Vector Queries
Supplemental Links Animated Video Paper
12/7	Presentations 1
12/9	Presentations 2
12/11	Presentations 3
12/16	Final Project Report Due		End of Course!

Homeworks

All homeworks due 5pm on the day listed.

#	Due:	Submit	Solution
HW 1	Fri. 10/9	Submit via Canvas	HW 1 Solution
HW 2	Fri. 10/16	Submit via Canvas	HW 2 Solution
HW 3	Fri. 10/23	Submit via Canvas	Soln Prob3 Code
HW 4	Fri. 11/6	Submit via Canvas	HW 4 Solution Code

Related Courses

The following courses contain relevant material (from slightly different points of view). Much of the material in this course is inspired by their lectures (although there are many differences as well).

Jelani Nelson's Sketching Algorithms for Big Data at Harvard

Paul Beame's Sublinear (and Streaming) Algorithms at UW

Ilya Razenshteyns's Algorithms Through Geometric Lens at UW

David Woodruff's Algorithms for Big Data at CMU

Greg Valiant's The Modern Algorithmic Toolbox at Stanford

The following books also contain relevant material (and other related topics)

Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman. [ online copy ]

Foundations of Data Science by Avrim Blum, John Hopcroft, and Ravindran Kannan. [ online copy ]

Project Information

The goal of the project is to understand a specific problem/area in more depth. Here is a document about the project with more information. At a high-level, the project involves reading 1-2 papers thoroughly, and also, trying your hand at improving the results in various ways. We will break the project up into 4 separate milestones:

1) Proposal (due Fri 10/30): submit 1 page proposal on what the project will be about, list the relevant paper(s), and briefly outline what new directions you will explore. Link to submit on canvas.

2) Progress (due Fri 11/20): submit 3 page summary of the paper(s), based on what you understand so far. Explain the scope of the project and what you hope to find out. List preliminary results and/or motivating examples and/or fundamental challenges. This progress report should serve (roughly) as the introduction and outline of the final report.

3) Final Presentation (due Mon 12/7, Wed 12/9, or Fri 12/11): Prepare a 10 minute talk for the class on your project, including the relevant background material, the new results, and any suggestions for future work.

4) Final Report (due Wed 12/16): submit 6-10 page report on the full details of your project. The page limit is loose because it will depend on the format and the number of tables/figures. Ideally, it will look like a first draft of a conference submission (although it's okay if you don't achieve the same number of results as a typical conference paper).

Here is a list of relevant papers in no particular order. You may choose from this list, or choose paper(s) on your own (as long as they have a geometric/algorithmic component, and they have a significant theoretical component).

Algorithms with Predictions (survey). Michael Mitzenmacher, Sergei Vassilvitskii, 2020

Finding cliques using few probes. Uriel Feige, David Gamarnik, Joe Neeman, Miklós Z. Rácz, Prasad Tetali, 2020

Tree! I am no Tree! I am a Low Dimensional Hyperbolic Embedding. Rishi Sonthalia, Anna C. Gilbert, 2020

An Illuminating Algorithm for the Light Bulb Problem. Josh Alman, SOSA 2019

Top-down induction of decision trees: rigorous guarantees and inherent limitations. Guy Blanc, Jane Lange, Li-Yang Tan, ITCS 2020 [video]

Robust Communication-Optimal Distributed Clustering Algorithms. Pranjal Awasthi, Ainesh Bakshi, Maria-Florina Balcan, Colin White, David Woodruff, ICALP 2019

An Equivalence Between Private Classification and Online Prediction. Mark Bun, Roi Livni, Shay Moran, FOCS 2020

Provable tradeoffs in adversarially robust classification. Edgar Dobriban, Hamed Hassani, David Hong, Alexander Robey, 2020

The Gradient Complexity of Linear Regression. Mark Braverman, Elad Hazan, Max Simchowitz, Blake Woodworth, COLT 2020 [video]

Polynomial-time trace reconstruction in the smoothed complexity model. Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, Sandip Sinha, 2020 [video]

A Framework for Adversarially Robust Streaming Algorithms. Omri Ben-Eliezer, Rajesh Jayaram, David P. Woodruff, Eylon Yogev, PODS 2020

Individual Fairness for k-Clustering. Sepideh Mahabadi, Ali Vakilian, ICML 2020

How to Solve Fair k-Center in Massive Data Models. Ashish Chiplunkar, Sagar Kale, Siva Natarajan Ramamoorthy, ICML 2020

Accelerating Large-Scale Inference with Anisotropic Vector Quantization. Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, Sanjiv Kumar, ICML 2020

Upper and Lower Bounds on the Cost of a Map-Reduce Computation. Foto N. Afrati, Anish Das Sarma, Semih Salihoglu, Jeffrey D. Ullman, VLDB 2013.

Optimal multiclass overfitting by sequence reconstruction from Hamming queries. Jayadev Acharya, Ananda Theertha Suresh, ALT 2020

Parallel Correlation Clustering on Big Graphs. Xinghao Pan, Dimitris Papailiopoulos, Samet Oymak, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan, NeurIPS 2015

Hierarchical Clustering for Euclidean Data. Moses Charikar, Vaggos Chatziafratis, Rad Niazadeh, Grigory Yaroslavtsev, AISTATS 2019

Making AI Forget You: Data Deletion in Machine Learning. Antonio Ginart, Melody Y. Guan, Gregory Valiant, James Zou, NeurIPS 2019

Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing. Samuel McCauley, 2020

A cost function for similarity-based hierarchical clustering. Sanjoy Dasgupta, SODA 2016 [video]

On Symmetric and Asymmetric LSHs for Inner Product Search. Behnam Neyshabur, Nathan Srebro, ICML 2015

LSH Forest: Practical Algorithms Made Theoretical. Alexandr Andoni, Ilya Razenshteyn, Negev Shekel Nosatzki, SODA 2017

MinJoin: Efficient Edit Similarity Joins via Local Hash Minima. Haoyu Zhang, Qin Zhang, KDD 2019

Performance of Johnson-Lindenstrauss Transform for k-Means and k-Medians Clustering. Konstantin Makarychev, Yury Makarychev, Ilya Razenshteyn, STOC 2019

You may also choose any papers from this other course

Inspiration may also come from talks at this Simons workshop

Course Staff

Instructor
Cyrus Rashtchian
crashtchian@eng.ucsd.edu
Office Hours by appointment

Teaching Assistant
Subrato Chakravorty
suchakra@eng.ucsd.edu
Office Hours
Thurs 11 - 12 PT
Zoom link

Course Details

When
Fall 2020
MWF 11a - 11:50a

Zoom
Meeting ID: 973 3213 4888
Password: bigdata