UCSD Course CSE 291-F00 (Fall 2020)
This is an advanced algorithms course. Many data-driven areas (computer vision, AR/VR, recommender systems, computational biology) rely on probabilistic and approximation algorithms to overcome the burden of massive datasets. This course takes a foundational view of these techniques, analyzing them through a geometric lens. The first two weeks review linear algebra (normed spaces, orthogonality, random matrices) and randomized algorithms (approximation guarantees, concentration inequalities). Then, we dive into designing and analyzing algorithms for big data. The main topics include: sampling/sketching, dimensionality reduction, clustering, nearest neighbor search, and distributed models. Throughout the course, we will discuss motivating applications and current research trends, such as adversarial robustness, explainable AI, and learned embeddings. Students will be exposed to many open research problems.
For detailed course information and policies, see the official Course Syllabus.
Update: The project report is now due on Wed 12/16 at 5pm.
Date  Notes  Topics  HWs 

10/2  Lecture 1  Course Overview  
Supplemental Links: Quanta Article on Geometry and Data Science

10/5  Lecture 2  Probability Review  HW 1 Out 
10/7  Lecture 3  Prob. Review Cont.  
10/9  Lecture 4  Approximate Counting  HW 1 Due 
Supplemental Links: Wikipedia on Morris' Algorithm

10/12  Lecture 5  Approx Counting cont.  HW 2 Out 
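For reference alongside the approximate-counting lectures, here is a minimal Python sketch of Morris' counter (not course material; the function name and trial-averaging are illustrative):

```python
import random

def morris_count(stream_length, trials=500, seed=0):
    """Morris' approximate counter: store only a small register X and
    increment it with probability 2^-X on each stream element; then
    2^X - 1 is an unbiased estimate of the stream length. Averaging
    independent counters reduces the variance."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = 0
        for _ in range(stream_length):
            if rng.random() < 2.0 ** (-x):
                x += 1
        total += 2 ** x - 1
    return total / trials
```

The point is that each counter stores roughly log log n bits, at the price of randomness in the estimate.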
10/14  Lecture 6  Distinct Elements  
Supplemental Links: Wikipedia on Flajolet-Martin Algorithm

10/16  Lecture 7  Finish Distinct Elements  HW 2 Due 
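For reference alongside the distinct-elements lectures, a minimal Flajolet-Martin-style sketch in Python (not course material; the salted-hash averaging is one of several standard variance-reduction choices):

```python
import hashlib

def fm_distinct(items, num_hashes=32):
    """Flajolet-Martin style distinct-elements sketch: hash each item
    and track R, the maximum number of trailing zero bits seen; then
    2^R estimates the number of distinct items up to a small constant
    factor. Averaging R over several salted hashes reduces variance."""
    avg_r = 0.0
    for salt in range(num_hashes):
        r = 0
        for item in items:
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            v = int.from_bytes(digest[:8], "big")
            # number of trailing zeros of v (64 if v == 0)
            tz = (v & -v).bit_length() - 1 if v else 64
            r = max(r, tz)
        avg_r += r
    return 2.0 ** (avg_r / num_hashes)
```

Duplicates do not affect the sketch, since hashing the same item always yields the same trailing-zero count.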
10/19  Lecture 8  AMS L2 Sketch  HW 3 Out 
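For reference alongside this lecture, a minimal Python sketch of the AMS estimator for the second frequency moment (not course material; a real streaming version draws the signs from a 4-wise independent hash family rather than storing them per item):

```python
import random

def ams_f2(stream, trials=200, seed=0):
    """AMS sketch for F2 = sum_i f_i^2: maintain Z, the signed sum of
    the stream under a random sign s(item) in {-1, +1}; then
    E[Z^2] = F2. Averaging independent copies reduces the variance.
    For simplicity, signs are stored explicitly here instead of being
    generated by a 4-wise independent hash."""
    rng = random.Random(seed)
    universe = sorted(set(stream))
    total = 0.0
    for _ in range(trials):
        sign = {x: rng.choice((-1, 1)) for x in universe}
        z = sum(sign[x] for x in stream)
        total += z * z
    return total / trials
```

For the stream a,a,a,b,b,b,b the true value is 3^2 + 4^2 = 25, and the averaged estimate concentrates around it.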
10/21  Lecture 9  JL Dimensionality Reduction  
Supplemental Links: Jelani Nelson Lecture Notes

10/23  Lecture 10  Finish JL  HW 3 Due 
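For reference alongside the JL lectures, a minimal Python sketch of a random Gaussian projection (not course material; pure-Python loops are used to stay dependency-free):

```python
import math
import random

def jl_project(vectors, target_dim, seed=0):
    """Johnson-Lindenstrauss via a random Gaussian map: multiply each
    d-dimensional vector by a k x d matrix with N(0, 1/k) entries.
    With k = O(log n / eps^2), all pairwise distances among n vectors
    are preserved up to a (1 +/- eps) factor with high probability."""
    rng = random.Random(seed)
    d = len(vectors[0])
    proj = [[rng.gauss(0.0, 1.0) / math.sqrt(target_dim) for _ in range(d)]
            for _ in range(target_dim)]
    return [[sum(row[i] * v[i] for i in range(d)) for row in proj]
            for v in vectors]
```

Note that the projection is data-oblivious: the same random matrix works for any point set of the right size.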
10/26  Lecture 11  Project Overview  
10/28  Lecture 12  Approx Nearest Neighbors  
10/30  Lecture 13  Hamming Dist LSH  Project Proposal Due 
11/2  Lecture 14  Cosine Similarity LSH  HW 4 Out 
Supplemental Links: Quanta Article on ANNS

11/4  Lecture 15  Euclidean Dist LSH  
11/6  Lecture 16  Finish LSH  HW 4 Due 
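For reference alongside the LSH lectures, a minimal Python sketch of the random-hyperplane (SimHash) family for cosine similarity (not course material; the function name is illustrative):

```python
import random

def simhash(vector, num_bits=32, seed=0):
    """LSH for cosine similarity: each bit records which side of a
    random Gaussian hyperplane the vector falls on. For two vectors
    at angle theta, each bit agrees with probability 1 - theta/pi,
    so Hamming distance between signatures tracks the angle."""
    rng = random.Random(seed)
    bits = []
    for _ in range(num_bits):
        plane = [rng.gauss(0.0, 1.0) for _ in range(len(vector))]
        dot = sum(p * x for p, x in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return bits
```

Sharing the seed across calls gives all vectors the same hyperplanes, which is what makes the signatures comparable.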
11/9  Lecture 17  Clustering Overview  
11/11  No Class  Veterans Day  
11/13  Lecture 18  Clustering k-center  
Supplemental Links: Sanjoy Dasgupta's Course Notes

11/16  Lecture 19  Clustering k-means  
Supplemental Links: Sanjoy Dasgupta's Course Notes

11/18  Lecture 20  Clustering k-means++  
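For reference alongside this lecture, a minimal Python sketch of k-means++ seeding (not course material; the fallback branch guards against floating-point accumulation issues):

```python
import random

def kmeans_pp_seeds(points, k, seed=0):
    """k-means++ seeding: pick the first center uniformly at random,
    then repeatedly sample the next center with probability
    proportional to its squared distance to the nearest center chosen
    so far (D^2 sampling). This gives an O(log k)-approximation in
    expectation for the k-means objective."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance of each point to its nearest chosen center
        d2 = [min(sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                  for c in centers) for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
        else:
            centers.append(points[-1])  # numerical fallback
    return centers
```

The seeds would then be handed to ordinary Lloyd's iterations.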
11/20  Lecture 21  Clustering for DNA Storage  
11/23  Lecture 22  Explainable Clustering  Proj Progress Due 
Supplemental Links: Blog Post on the 2-means proof

11/25  No Class  Thanksgiving  
11/27  No Class  Thanksgiving  
11/30  Lecture 23  Explainable Clustering  
Supplemental Links: Follow-up experimental ExKMC paper

12/2  Lecture 24  Adversarial Robustness  
12/4  Lecture 25  Vector Matrix Vector Queries  
12/7  Presentations 1  
12/9  Presentations 2  
12/11  Presentations 3  
12/16  Final Project Report Due  End of Course! 
All homeworks are due at 5pm on the day listed.
#  Due  Submit  Solution 

HW 1  Fri. 10/9  Submit via Canvas  HW 1 Solution 
HW 2  Fri. 10/16  Submit via Canvas  HW 2 Solution 
HW 3  Fri. 10/23  Submit via Canvas  HW 3 Solution, Prob. 3 Code 
HW 4  Fri. 11/6  Submit via Canvas  HW 4 Solution, Code 
The following courses contain relevant material (from slightly different points of view). Much of the material in this course is inspired by their lectures (although there are many differences as well).
Jelani Nelson's Sketching Algorithms for Big Data at Harvard
Paul Beame's Sublinear (and Streaming) Algorithms at UW
Ilya Razenshteyn's Algorithms Through Geometric Lens at UW
David Woodruff's Algorithms for Big Data at CMU
Greg Valiant's The Modern Algorithmic Toolbox at Stanford
The following books also contain relevant material (and other related topics):
Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman. [ online copy ]
Foundations of Data Science by Avrim Blum, John Hopcroft, and Ravindran Kannan. [ online copy ]
The goal of the project is to understand a specific problem/area in more depth. Here is a document about the project with more information. At a high level, the project involves reading 1-2 papers thoroughly and also trying your hand at improving the results in various ways. We will break the project up into 4 separate milestones:
1) Proposal (due Fri 10/30): submit a 1-page proposal on what the project will be about, list the relevant paper(s), and briefly outline what new directions you will explore. Link to submit on Canvas.
2) Progress (due Fri 11/20): submit a 3-page summary of the paper(s), based on what you understand so far. Explain the scope of the project and what you hope to find out. List preliminary results, motivating examples, and/or fundamental challenges. This progress report should serve (roughly) as the introduction and outline of the final report.
3) Final Presentation (due Mon 12/7, Wed 12/9, or Fri 12/11): prepare a 10-minute talk for the class on your project, including the relevant background material, the new results, and any suggestions for future work.
4) Final Report (due Wed 12/16): submit a 6-10 page report on the full details of your project. The page limit is loose because it will depend on the format and the number of tables/figures. Ideally, it will look like a first draft of a conference submission (although it's okay if you don't achieve the same number of results as a typical conference paper).
Here is a list of relevant papers in no particular order. You may choose from this list, or choose paper(s) on your own (as long as they have a geometric/algorithmic component and a significant theoretical component).
Algorithms with Predictions (survey). Michael Mitzenmacher, Sergei Vassilvitskii, 2020
Finding cliques using few probes. Uriel Feige, David Gamarnik, Joe Neeman, Miklós Z. Rácz, Prasad Tetali, 2020
Tree! I am no Tree! I am a Low Dimensional Hyperbolic Embedding. Rishi Sonthalia, Anna C. Gilbert, 2020
An Illuminating Algorithm for the Light Bulb Problem. Josh Alman, SOSA 2019
Top-down induction of decision trees: rigorous guarantees and inherent limitations. Guy Blanc, Jane Lange, Li-Yang Tan, ITCS 2020 [video]
Robust Communication-Optimal Distributed Clustering Algorithms. Pranjal Awasthi, Ainesh Bakshi, Maria-Florina Balcan, Colin White, David Woodruff, ICALP 2019
An Equivalence Between Private Classification and Online Prediction. Mark Bun, Roi Livni, Shay Moran, FOCS 2020
Provable tradeoffs in adversarially robust classification. Edgar Dobriban, Hamed Hassani, David Hong, Alexander Robey, 2020
The Gradient Complexity of Linear Regression. Mark Braverman, Elad Hazan, Max Simchowitz, Blake Woodworth, COLT 2020 [video]
Polynomial-time trace reconstruction in the smoothed complexity model. Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, Sandip Sinha, 2020 [video]
A Framework for Adversarially Robust Streaming Algorithms. Omri Ben-Eliezer, Rajesh Jayaram, David P. Woodruff, Eylon Yogev, PODS 2020
Individual Fairness for k-Clustering. Sepideh Mahabadi, Ali Vakilian, ICML 2020
How to Solve Fair k-Center in Massive Data Models. Ashish Chiplunkar, Sagar Kale, Siva Natarajan Ramamoorthy, ICML 2020
Accelerating Large-Scale Inference with Anisotropic Vector Quantization. Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, Sanjiv Kumar, ICML 2020
Upper and Lower Bounds on the Cost of a MapReduce Computation. Foto N. Afrati, Anish Das Sarma, Semih Salihoglu, Jeffrey D. Ullman, VLDB 2013.
Optimal multiclass overfitting by sequence reconstruction from Hamming queries. Jayadev Acharya, Ananda Theertha Suresh, ALT 2020
Parallel Correlation Clustering on Big Graphs. Xinghao Pan, Dimitris Papailiopoulos, Samet Oymak, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan, NeurIPS 2015
Hierarchical Clustering for Euclidean Data. Moses Charikar, Vaggos Chatziafratis, Rad Niazadeh, Grigory Yaroslavtsev, AISTATS 2019
Making AI Forget You: Data Deletion in Machine Learning. Antonio Ginart, Melody Y. Guan, Gregory Valiant, James Zou, NeurIPS 2019
Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing. Samuel McCauley, 2020
A cost function for similarity-based hierarchical clustering. Sanjoy Dasgupta, SODA 2016 [video]
On Symmetric and Asymmetric LSHs for Inner Product Search. Behnam Neyshabur, Nathan Srebro, ICML 2015
LSH Forest: Practical Algorithms Made Theoretical. Alexandr Andoni, Ilya Razenshteyn, Negev Shekel Nosatzki, SODA 2017
MinJoin: Efficient Edit Similarity Joins via Local Hash Minima. Haoyu Zhang, Qin Zhang, KDD 2019
Performance of Johnson-Lindenstrauss Transform for k-Means and k-Medians Clustering. Konstantin Makarychev, Yury Makarychev, Ilya Razenshteyn, STOC 2019
You may also choose any papers from this other course.
Inspiration may also come from talks at this Simons workshop.
Instructor
Cyrus Rashtchian
crashtchian@eng.ucsd.edu
Office Hours by appointment
Teaching Assistant
Subrato Chakravorty
suchakra@eng.ucsd.edu
Office Hours
Thurs 11-12 PT
Zoom link
When
Fall 2020
MWF 11a-11:50a
Zoom
Meeting ID: 973 3213 4888
Password: bigdata