Big Data describes datasets that are too big, arrive too fast, or both, to be processed on a single computer. “TI2736-B Big Data Processing” provides a practical introduction to the systems and algorithms used to process Big Data.
[all students] After the end of the course, all students should be able to:
[BSc students]
- Describe in which scenarios streaming algorithms are most applicable
- Identify the correct streaming algorithm for a given streaming problem
[minor version]
- Design and apply basic data processing pipelines
- Understand basic data analysis concepts (such as aggregation, correlation and linear modelling)
5 ECTS: This means that you need to devote at least 140 hours of study to this course.
Lectures: The course consists of 14 two-hour lectures. You are not required, but you are strongly encouraged, to attend.
Homework: In the homework assignments, you will have to write code or answer open questions. You will always work in groups.
Labs: 4 hours per week, designed to help you work together and get support from teaching assistants.
Teaching Assistants: Teaching assistants will be available during lab hours to help you with solving your assignments. Do be active in asking questions, but don’t expect them to provide you with solutions to the exercises.
Late submission: All submissions must be handed in on time, with no exceptions. In case of provable sickness, please contact the course teacher to arrange a case-specific deadline.
|1||1||All||Course introduction, Big data in the real world||GG|
|1||2||All||Programming techniques for Big Data||GG|
|3||2||BSc||Stream processing systems||JH|
|3||1||Minors||Introduction to Data Processing||GG|
|3||2||Minors||Introduction to Data Science||GG|
|4||1||All||Hadoop and friends||GG|
|5||2||All||Pair RDDs and Dataframes||GG|
|6||1||All||Algorithms on Spark||GG|
|6||2||All||Iterative algorithms on Spark||GG|
|7||2||All||Graph processing systems||GG|
The course, by design, touches upon various current technologies; as such, there is no single source of truth. The following is an indicative list of resources where more information can be found.
S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced analytics with Spark: Patterns for learning from data at scale. O’Reilly Media, Inc., 2015.
J. Laskowski, “Mastering Apache Spark 2,” 2017. [Online]. Available: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details.
H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-fast big data analysis. O’Reilly Media, Inc., 2015.
D. Miner and A. Shook, MapReduce design patterns: Building effective algorithms and analytics for Hadoop and other systems. O’Reilly Media, Inc., 2012.
B. Chambers and M. Zaharia, Spark: The definitive guide. O’Reilly Media, Inc., 2017.
This work is (c) 2017 - onwards by Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.