Data Science Through a Geometric Lens

Course Description

UCSD Course CSE 291 - F00 (Fall 2019)

This is an advanced algorithms course, with a focus on geometric algorithms motivated by data science applications. Many applications (computer vision, AR/VR, recommender systems, computational biology) rely on geometric ideas and data representations. In this context, geometry refers to distances between high-dimensional vectors (e.g., Lp distances) or strings (e.g., edit distance). This course will provide the foundations for designing and analyzing algorithms operating on data with geometric structure. We will begin with sketching/streaming algorithms and dimensionality reduction. Then, we will explore many algorithmic problems such as clustering and approximate nearest neighbor search via locality sensitive hashing. Finally, we will address massive data sets by adapting algorithms to distributed models. Students will be exposed to many open research problems.

For detailed course information and policies, see the official Course Syllabus


Schedule

Date Notes Topics HWs
9/30 Lecture 1 Overview; Approximate Counting
Supplemental Links

Wikipedia on Morris' Algorithms
Wikipedia on Prob Inequalities: Markov, Chebyshev, Chernoff
Video Lecture by Jelani Nelson at Harvard

10/2 Lecture 2 Distinct Elements; Hashing
Supplemental Links

Wikipedia on Flajolet-Martin Algorithm
Video Lecture by Jelani Nelson at Harvard
Hashing Lecture Notes by Shachar Lovett at UCSD

10/7 Lecture 3 Finish Distinct Elements
10/9 Lecture 4 AMS L2 Sketch; Majority Alg HW1 Out
10/14 Lecture 5 JL Dimensionality Reduction
Supplemental Links

Kane and Nelson Sparser JL
Nir Ailon's survey video on advanced JL topics

10/16 Lecture 6 Bourgain's Embedding
Supplemental Links

Avner Magen's notes with pics
Luca Trevisan's notes with example metrics

10/21 Lecture 6 (part 2) Bourgain's Embedding Cont. HW1 Due / HW2 Out
10/23 Lecture 7 Approx Nearest Neighbors; Hamming
Supplemental Links

Survey Video on ANNS and LSH
Associated Survey Article

10/28 Lecture 8 Hamming Dist LSH Proj Info Out
Supplemental Links

Quanta Article on Geometry and Data Science

10/30 Lecture 9 Euclidean Dist LSH HW2 Due / HW3 Out
Supplemental Links

Quanta Article on ANNS

11/4 Lecture 10 Clustering; k-center
Supplemental Links

Sanjoy Dasgupta's Course Notes

11/6 Lecture 11 Clustering; k-means Proj Proposal Due
Supplemental Links

Sanjoy Dasgupta's Course Notes

11/11 No Class Veteran's Day
11/13 Lecture 12 Clustering; k-means++ HW3 Due
11/18 No Class
11/20 No Class
11/25 Lecture 13 Distributed Similarity Join Proj Progress Due
11/27 Lecture 14 Distributed Clustering
12/2 Lecture 15 Clustering for DNA Storage
12/4 Presentations
12/11 Final Projects Due

Homeworks

All homeworks due 5pm on the day listed.

# Due: Submit Solution
HW1 Mon. 10/21 Submit by Email Solution + Code
HW2 Wed. 10/30 Submit by Email HW 2 Solution
HW3 Wed. 11/13 Submit by Email HW 3 Solution

Related Courses

The following courses contain relevant material (from slightly different points of view). Much of the material in this course is inspired by their lectures (although there are many differences as well).

Jelani Nelson's Sketching Algorithms for Big Data at Harvard

Paul Beame's Sublinear (and Streaming) Algorithms at UW

Ilya Razenshteyns's Algorithms Through Geometric Lens at UW

David Woodruff's Algorithms for Big Data at CMU

Greg Valiant's The Modern Algorithmic Toolbox at Stanford

The following books also contain relevant material (and other related topics)

Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman. [ online copy ]

Foundations of Data Science by Avrim Blum, John Hopcroft, and Ravindran Kannan. [ online copy ]


Project Information

The goal of the project is to understand a specific problem/area in more depth. This involves reading 1-2 papers thoroughly, and also, trying your hand at improving the results in various ways. We will break the project up into 4 separate milestones:


1) Proposal (due Wed 11/6): submit 1 page proposal on what the project will be about, list the relevant paper(s), and briefly outline what new directions you will explore.

2) Progress (due Mon 11/25): submit 2-3 page summary of the paper(s), based on what you understand so far. Explain the scope of the project and what you hope to find out. List any preliminary results and/or motivating examples and/or fundamental challenges. This progress report should serve (roughly) as the introduction of the final report.

3) Final Presentation (due Mon 12/2 or Wed 12/4): Prepare a 15 minute talk for the class on your project, including the relevant background material, the new results, and any suggestions for future work.

4) Final Report (due Wed 12/11): submit 6-10 page report on the full details of your project. The page limit is loose because it will depend on the format and the number of tables/figures. Ideally, it will look like a first draft of a conference submission (although it's okay if you don't achieve the same number of results as a typical conference paper).

Here is a list of relevant papers. You may choose from this list, or choose paper(s) on your own (as long as they have a geometric/algorithmic component, and they have a significant theoretical component).

Upper and Lower Bounds on the Cost of a Map-Reduce Computation. Foto N. Afrati, Anish Das Sarma, Semih Salihoglu, Jeffrey D. Ullman, VLDB 2013.

Sparser Johnson-Lindenstrauss Transforms Daniel M. Kane, Jelani Nelson, SODA 2012

Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, Pradeep Dubey

Parallel Correlation Clustering on Big Graphs Xinghao Pan, Dimitris Papailiopoulos, Samet Oymak, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan

Hierarchical Clustering for Euclidean Data Moses Charikar, Vaggos Chatziafratis, Rad Niazadeh, Grigory Yaroslavtsev, AISTATS 2019

CoveringLSH: Locality-Sensitive Hashing without False Negatives. Rasmus Pagh, SODA 2016

Set similarity search beyond MinHash. Tobias Christiani, Rasmus Pagh, STOC 2017

A cost function for similarity-based hierarchical clustering. Sanjoy Dasgupta, SODA 2016 [video]

On Symmetric and Asymmetric LSHs for Inner Product Search Behnam Neyshabur, Nathan Srebro, ICML 2015

Near-Optimal (Euclidean) Metric Compression Piotr Indyk, Tal Wagner, SODA 2017

LSH Forest: Practical Algorithms Made Theoretical Alexandr Andoni,Ilya Razenshteyn, Negev Shekel Nosatzki, SODA 2017

MinJoin: Efficient Edit Similarity Joins via Local Hash Minima Haoyu Zhang, Qin Zhang, KDD 2019

Performance of Johnson-Lindenstrauss Transform for k-Means and k-Medians Clustering Konstantin Makarychev, Yury Makarychev, Ilya Razenshteyn, STOC 2019

You may also choose any papers from this other course

Course Staff

Instructor
Cyrus Rashtchian
crashtchian@eng.ucsd.edu
Office Hours
Mon 3:30p - 4:30p
(or by appointment)
Atkinson 4111

Teaching Assistant
Rithesh Ramapura Narasimha Murthy
rramapur@eng.ucsd.edu
Office Hours
Thurs 2:00p - 3:00p
EBU3B Room B215


Course Details

When
Fall 2019
MW 10:30a - 11:50a

Where
CSE Building (EBU3B)
Room 4140