General description

Big Data describes datasets that are too large, arrive too fast, or both, to be processed on a single computer. “TI2736-B Big Data Processing” provides a practical introduction to the systems and algorithms used to process Big Data.

Learning objectives

[all students] After the end of the course, all students should be able to:

[BSc students]
- Describe in which scenarios streaming algorithms are most applicable
- Identify the correct streaming algorithm for a given streaming problem

[minor version]
- Design and apply basic data processing pipelines
- Understand basic data analysis concepts (such as aggregation, correlation and linear modelling)
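
As a rough illustration of the data analysis concepts named above (aggregation, correlation and linear modelling), the following minimal Python sketch computes each on a tiny, made-up dataset. The records, their meaning and all function names are hypothetical and are not taken from the course material; the sketch uses only the standard library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (city, temperature in degrees C, ice creams sold).
RECORDS = [
    ("Delft", 18.0, 30),
    ("Delft", 24.0, 48),
    ("Leiden", 20.0, 36),
    ("Leiden", 26.0, 54),
]


def aggregate_mean(rows):
    """Aggregation: group rows by city and average the sales per group."""
    groups = defaultdict(list)
    for city, _temperature, sold in rows:
        groups[city].append(sold)
    return {city: mean(values) for city, values in groups.items()}


def pearson(xs, ys):
    """Correlation: Pearson coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_line(xs, ys):
    """Linear modelling: least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


temperatures = [row[1] for row in RECORDS]
sales = [row[2] for row in RECORDS]

per_city = aggregate_mean(RECORDS)
correlation = pearson(temperatures, sales)
slope, intercept = fit_line(temperatures, sales)
```

The made-up data is exactly linear, so the correlation comes out as 1 (up to floating-point error) and the fitted line recovers the relationship used to generate it. On real data the same three steps would typically be run with a data processing library rather than hand-written loops.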

Course Organization

Contents

| Week | Lecture | Who? | Topic | Teacher | Homework |
|---|---|---|---|---|---|
| 1 | 1 | All | Course introduction, Big Data in the real world | GG | |
| 1 | 2 | All | Programming techniques for Big Data | GG | |
| 2 | 1 | All | Distributed storage | GG | |
| 2 | 2 | All | Distributed databases | GG | |
| 3 | 1 | BSc | Stream processing | JH | |
| 3 | 2 | BSc | Stream processing systems | JH | |
| 3 | 1 | Minors | Introduction to Data Processing | GG | |
| 3 | 2 | Minors | Introduction to Data Science | GG | |
| 4 | 1 | All | Map/Reduce algorithms | GG | |
| 4 | 2 | All | Hadoop and friends | GG | |
| 5 | 1 | All | Spark RDDs | GG | |
| 5 | 2 | All | Pair RDDs and Dataframes | GG | |
| 6 | 1 | All | Algorithms on Spark | GG | |
| 6 | 2 | All | Iterative algorithms on Spark | GG | |
| 7 | 1 | All | Big Graphs | GG | |
| 7 | 2 | All | Graph processing systems | GG | |

Assessment

Bibliography

The course, by design, touches upon various current technologies; as such, there is no single source of truth. The following is an indicative list of resources where more information can be found.

[1] S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced Analytics with Spark: Patterns for Learning from Data at Scale. O’Reilly Media, Inc., 2015.

[2] J. Laskowski, “Mastering Apache Spark 2,” 2017. [Online]. Available: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details.

[3] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media, Inc., 2015.

[4] D. Miner and A. Shook, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O’Reilly Media, Inc., 2012.

[5] B. Chambers and M. Zaharia, Spark: The Definitive Guide. O’Reilly Media, Inc., 2017.