General description

The term “Big Data” describes datasets that are too big, change too fast, or both, to be processed on a single computer.

Big Data Processing provides an introduction to the systems and algorithms used to process Big Data. The main focus of the course is programming and engineering big data systems. Initially, the course explores general programming primitives that span big data systems and touches upon distributed data storage systems. Then, the course examines in detail the implementation of data analysis algorithms in Hadoop (Map/Reduce) and Spark, in the context of batch, streaming, and graph processing applications.

Every week, students will complete an assignment consisting mostly of coding exercises. To stir things up, the last assignment will include an (optional) programming/performance competition, similar in style to the popular TeraSort benchmark.

The course is also optional for the Minor “Software Design and Application”. Part of the course is thus dedicated to basic data processing.

Learning objectives

[all students] After the end of the course, all students should be able to:

[BSc students]

- Describe in which scenarios streaming algorithms are most applicable
- Apply basic streaming algorithms in practical problems

[minor version]

- Design and apply basic data processing pipelines
- Understand basic data analysis concepts (such as aggregation, correlation and linear modelling)

Course Organization

Contents

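| Week | Lecture | Who? | Topic | Teacher | Lecture Notes | Homework |
|------|---------|------|-------|---------|---------------|----------|
| 1 | 1 | All | Course introduction, Big data in the real world | GG | | |
| 1 | 2 | All | Programming techniques for Big Data | GG | Prog. Techniques for Big Data | |
| 2 | 1 | All | Distributed storage | GG | | |
| 2 | 2 | All | Distributed databases | GG | Distributed Databases | |
| 3 | 1 | BSc | Stream processing | JH | | |
| 3 | 2 | BSc | Stream processing systems | JH | | |
| 3 | 1 | Minors | Introduction to Data Processing | GG | | |
| 3 | 2 | Minors | Introduction to Data Science | GG | | |
| 4 | 1 | All | Map/Reduce algorithms | GG | | |
| 4 | 2 | All | Hadoop and friends | GG | | |
| 5 | 1 | All | Spark RDDs | GG | | |
| 5 | 2 | All | Pair RDDs and Dataframes | GG | Spark | |
| 6 | 1 | All | Algorithms on Spark | GG | | |
| 6 | 2 | All | Iterative algorithms on Spark | GG | | |
| 7 | 1 | All | Big Graphs | GG | | |
| 7 | 2 | All | Graph processing systems | GG | | |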

Assessment

Bibliography

The course, by design, touches upon various current technologies; as such, there is no single source of truth. The following is an indicative list of resources where more information can be found.

[1] S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced analytics with Spark: Patterns for learning from data at scale. O’Reilly Media, Inc., 2015.

[2] J. Laskowski, “Mastering Apache Spark 2,” 2017. [Online]. Available: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details.

[3] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-fast big data analysis. O’Reilly Media, Inc., 2015.

[4] D. Miner and A. Shook, MapReduce design patterns: Building effective algorithms and analytics for Hadoop and other systems. O’Reilly Media, Inc., 2012.

[5] B. Chambers and M. Zaharia, Spark: The definitive guide. O’Reilly Media, Inc., 2017.