TI2736-B: Big Data Processing

General description

The term “Big Data” describes datasets that are too big, change too fast, or both, to be processed on a single computer.

Big Data Processing provides an introduction to the systems and algorithms used to process Big Data. The main focus of the course is programming and engineering big data systems. Initially, the course explores general programming primitives that span big data systems and touches upon distributed data storage systems. Then, it examines in detail the implementation of data analysis algorithms in Hadoop (Map/Reduce) and Spark, in the context of batch, streaming, and graph processing applications.

The course is also optional for the Minor “Software Design and Application.” Part of the course is thus dedicated to basic data processing.

Learning objectives

[all students] After the end of the course, all students should be able to:

Explain the different dimensions of big data problems
Understand why classical algorithms fail on many big data problems
Understand, explain and apply basic data processing operations (filtering, folding, projecting, etc.)
Understand and explain the major components of the Hadoop framework
Understand and explain the major components of the Spark framework
Create Spark-based algorithms for novel (unseen) practical problems
Explain the difference between iterative/non-iterative algorithms
Design iterative algorithms for simple practical problems
Understand basic graph algorithms and their implementation on Spark
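
The basic data processing operations named in the objectives above (filtering, projecting, folding) can be illustrated with a minimal Python sketch. The records and variable names are hypothetical and are not taken from any course dataset or assignment; the sketch uses only the Python standard library, not Spark.

```python
from functools import reduce

# Hypothetical (key, value) records, made up for illustration only.
records = [("a", 3), ("b", 8), ("a", 5), ("c", 1)]

# Filtering: keep only the records whose value is at least 3.
filtered = list(filter(lambda r: r[1] >= 3, records))

# Projecting: keep only the value field of each remaining record.
values = list(map(lambda r: r[1], filtered))

# Folding: combine all values into a single result, starting from 0.
total = reduce(lambda acc, v: acc + v, values, 0)

print(values)  # [3, 8, 5]
print(total)   # 16
```

Each step takes a collection and a function and returns a new value without modifying its input, which is the property that lets such operations be distributed across machines.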

[BSc students]

Describe the scenarios in which streaming algorithms are most applicable
Apply basic streaming algorithms to practical problems

[minor students]

Design and apply basic data processing pipelines
Understand basic data analysis concepts (such as aggregation, correlation and linear modelling)

Course Organization

5 ECTS: This means that you need to devote at least 140 hours of study to this course.
Lectures: The course consists of 14 two-hour lectures. Attendance is not required, but it is strongly encouraged.
Homework: In the homework assignments, you will have to write code or answer open questions. You will always work in groups of two.
Groups: Students are responsible for forming pairs and communicating them to the course TAs.
Labs: 4 hours per week, designed to help you work together and get support from teaching assistants.
Teaching Assistants: Teaching assistants will be available during lab hours to help you with solving your assignments. Do be active in asking questions, but don’t expect them to provide you with solutions to the exercises.

Assignments

You can find the course assignments linked through this page. All assignments (except one) are mandatory.

Your submission material is a Jupyter notebook including the full assignment text, your solutions and the results of running your solutions on the provided datasets.

Submit your assignments THE DAY BEFORE the deadline. For example, the deadline for the first assignment is 29/11, so you must submit it by 28/11, 23:59.
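
The submission rule above can be sketched as a small date calculation. The function name submission_cutoff is hypothetical, and the year 2017 is taken from the schedule further down this page.

```python
from datetime import date, datetime, time, timedelta

def submission_cutoff(deadline: date) -> datetime:
    """Return the latest submission moment: 23:59 on the day before the deadline."""
    return datetime.combine(deadline - timedelta(days=1), time(23, 59))

# The first assignment's deadline in the schedule is 29/11/2017.
cutoff = submission_cutoff(date(2017, 11, 29))
print(cutoff)  # 2017-11-28 23:59:00
```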

To submit your assignment, export the Jupyter notebook to PDF (Save as…) and upload it to the designated BrightSpace folder. Name your submission file student_id1-student_id2.pdf, replacing student_id1 and student_id2 with your TU Delft IDs. It is enough for one member of each group to upload the assignment.

The assignments are signed off and graded by TAs. You are expected to be at the lab during the timeslot assigned to your group. Timeslots will be announced well in advance. During your timeslot, you must be able to demonstrate a notebook with your solution running live. The TAs will compare your results with the ground truth and grade your solution on the spot.

Late submission: All submissions must be handed in on time, with no exceptions. Any late submission will be discarded and graded with 0. In case of documented illness, please contact the course teacher to arrange a case-specific deadline.

Week	Lecture	Audience	Topic	Teacher	Assignment (Deadline)
13/11	1	All	Course introduction, Big and Fast data, Intro to course PLs	GG
13/11	2	All	Programming for Big Data (1)	GG	Functional programming (jupyter, html, scala, python) (29/11/2017)
20/11	1	All	Programming for Big Data (2)	GG
20/11	2	All	Distributed Systems	GG
27/11	1	All	Distributed Databases	GG	Distributed Databases (jupyter, html, solutions) (6/12/2017)
27/11	2	All	Distributed filesystems	GG
4/12	1	All	Spark: RDDs and Pair RDDs	GG
4/12	2	All	Spark Internals	JR	Spark (jupyter, html, scala, python) (20/12/2017)
11/12	1	All	Spark SQL, Synonyms with Word2Vec	GG
11/12	2	All	Recommending bands, Predicting pull request merges	GG	A stats library for Spark (optional) (Exam day)
18/12	1	BSc	Stream processing	AK	Streaming (html, solutions) (17/1/2018)
18/12	2	BSc	Stream processing systems	AK
18/12	1	Minors	Data Science at the command line	GG	Unix Programming (jupyter, html, solutions) (17/1/2018)
18/12	2	Minors	Introduction to Data Science	GG
8/1	1	All	Recap	GG
8/1	2	All	No lecture	GG

Teachers

GG: Georgios Gousios
AK: Asterios Katsifodimos
JR: Jan Rellermeyer

Assessment

Assignments (50%): The grade is the mean of all assignment grades. There is no minimum grade. An assignment that is missing or submitted late counts as 0.
Written Exam (50%): Closed-book exam. Minimum grade: 6.
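
The weighting above can be worked through with a minimal sketch. It assumes grades on a 0 to 10 scale and an equal 50/50 weighting; the function name final_grade is hypothetical, and which assignments enter the mean (for example, whether the optional one counts) is not specified here, so the caller decides what to pass in. The sketch only reports whether the exam minimum is met, since this page does not state how a failed minimum is recorded.

```python
from statistics import mean

def final_grade(assignment_grades, exam_grade, min_exam=6.0):
    """Return (weighted grade, whether the exam minimum is met).

    Assumption: missing or late assignments are already included as 0.
    """
    assignment_mean = mean(assignment_grades)
    weighted = 0.5 * assignment_mean + 0.5 * exam_grade
    return weighted, exam_grade >= min_exam

# Hypothetical student: one missed assignment (0) and an exam grade of 7.
grade, exam_ok = final_grade([8, 0, 7, 9], 7)
print(grade, exam_ok)  # 6.5 True
```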

Example exam material

Model exam, solutions

Resit policy

There will be an exam-only resit during Q3/4. You are allowed to transfer your assignment grade as a whole. This means that you will not be able to re-submit individual assignments. Effectively, you can only resit your written exam.

The course, by design, touches upon various current technologies; as such, there is no single source of truth. The following is an indicative list of resources where more information can be found.

Bibliography

[1] J. Laskowski, “Mastering Apache Spark 2,” 2017. [Online]. Available: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details.

[2] M. Kleppmann, Designing Data-Intensive Applications. O’Reilly Media, Inc., 2017.

[3] S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced Analytics with Spark: Patterns for Learning from Data at Scale. O’Reilly Media, Inc., 2015.

[4] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media, Inc., 2015.

[5] H. Karau and R. Warren, High Performance Spark. O’Reilly Media, Inc., 2017.

[6] B. Chambers and M. Zaharia, Spark: The Definitive Guide. O’Reilly Media, Inc., 2017.

Course information

Georgios Gousios

09 September 2021