Big and Fast Data

What is big data?

An overloaded, fuzzy term

“Data too large to be efficiently processed on a single computer”

“Massive amounts of diverse, unstructured data produced by high-performance applications”

How big is “Big”?

Typical numbers associated with Big Data

How big is “Big”? – Instagram

Instagram

  • 1B monthly users, clicking around the app
  • 95M photos daily
  • Most followed user: 237M followers, up from 180M in 2018

How big is “Big”? – Facebook

Facebook

The numbers above are from 2014! Today on FB:

  • 2 billion users
  • 1.32 billion active users per day
  • 350 million photos per day (≈243k/min)
  • Every minute: 510k comments, 293k status updates

The many Vs of Big Data

The three main Vs, coined by Doug Laney

  • Volume: large amounts of data
  • Variety: data comes in many different forms from diverse sources
  • Velocity: the content is changing quickly

More Vs

  • Value: data alone is not enough; how can value be derived from it?
  • Veracity: can we trust the data? How accurate is it?
  • Validity: ensure that the interpreted data is sound
  • Visibility: data from diverse sources need to be stitched together

Volume

We call Big Data big because it is really big:

[Figure: data growth rate]

Variety

  • Structured data: SQL tables, images; the format is known in advance
  • Semi-structured data: JSON, XML
  • Unstructured data: mostly free-form text

We often need to combine data sources of different types to produce a result
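To make this concrete, here is a toy sketch (not from the original slides) that stitches together a structured CSV table and semi-structured JSON records; the file contents and field names are invented for the example:

```python
import csv
import io
import json

# Structured data: a CSV table of users (contents invented for the example)
users_csv = io.StringIO("id,name\n1,alice\n2,bob\n")
users = {row["id"]: row["name"] for row in csv.DictReader(users_csv)}

# Semi-structured data: JSON events, one object per line (also invented)
events_jsonl = '{"user_id": "1", "action": "like"}\n{"user_id": "2", "action": "comment"}'
events = [json.loads(line) for line in events_jsonl.splitlines()]

# Stitch the two sources together on the user id
for event in events:
    print(users[event["user_id"]], event["action"])  # alice like / bob comment
```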

Velocity

Data is not just big; it is generated and needs to be processed fast. Think of:

  • Datacenters writing to log files
  • IoT sensors reporting temperatures around the globe
  • Twitter: 500 million tweets a day (or 6k/sec)
  • Stock Markets: high-frequency trading (latency costs money)
  • Online advertising

Data needs to be processed with soft or hard real-time guarantees
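As a minimal illustration of soft real-time processing, the sketch below counts events over a sliding one-second window; the window length and the synthetic event times are assumptions made for the example:

```python
from collections import deque

WINDOW = 1.0       # sliding-window length in seconds (an assumption)
events = deque()   # timestamps of the events currently inside the window

def observe(ts: float) -> int:
    """Record an event at time ts; return the count over the last WINDOW seconds."""
    events.append(ts)
    while events and events[0] <= ts - WINDOW:   # evict expired events
        events.popleft()
    return len(events)

# Synthetic stream: steady events, then a burst just after t = 1s
for ts in [0.0, 0.1, 0.2, 0.3, 1.05, 1.06, 1.07]:
    print(f"t={ts:.2f}s  events in last {WINDOW:.0f}s: {observe(ts)}")
```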

Big Data processing

  • The ETL cycle

    • Extract: Convert raw or semi-structured data into structured data
    • Transform: Convert units, join data sources, clean up, etc.
    • Load: Load the data into another system for further processing
  • Big data engineering is concerned with building pipelines

  • Big data analytics is concerned with discovering patterns
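A toy end-to-end ETL pipeline in plain Python may help fix the idea; the log format, the unit conversion, and the in-memory “warehouse” are all invented for illustration:

```python
import json

raw_log = "2024-01-01T12:00:00 temp=71.6F sensor=a1"  # hypothetical raw log line

def extract(line: str) -> dict:
    """Extract: turn a raw log line into a structured record."""
    ts, *fields = line.split()
    record = dict(f.split("=") for f in fields)
    record["timestamp"] = ts
    return record

def transform(record: dict) -> dict:
    """Transform: convert units (F -> C) and clean up field names."""
    fahrenheit = float(record["temp"].rstrip("F"))
    return {
        "timestamp": record["timestamp"],
        "sensor": record["sensor"],
        "temp_celsius": round((fahrenheit - 32) * 5 / 9, 2),
    }

def load(record: dict, sink: list) -> None:
    """Load: hand the record to the next system (here, just a list)."""
    sink.append(json.dumps(record))

warehouse: list = []
load(transform(extract(raw_log)), warehouse)
print(warehouse)  # temp_celsius: 22.0
```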

How to process all this data?

  • Batch processing: All the data already exists in some data store; a program processes the whole dataset at once
  • Stream processing: Data is processed as it arrives in the system
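A minimal sketch of the contrast (with made-up events): the batch version sees the whole dataset at once, while the streaming version keeps incremental state that is usable after every arrival:

```python
from collections import Counter

data = ["click", "view", "click", "buy"]  # made-up event log

# Batch: the whole dataset is available; process it in one go
batch_counts = Counter(data)

# Stream: events arrive one at a time; maintain incremental state
stream_counts = Counter()
for event in data:             # pretend each iteration is a new arrival
    stream_counts[event] += 1  # the result is usable after every event

assert batch_counts == stream_counts
```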

Two basic approaches to distributing data processing operations over many machines:

  • Divide the data into chunks and apply the same algorithm to every chunk (concurrency)
  • Divide the problem into chunks and run them on a cluster of machines (parallelism)
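The first approach can be sketched with Python’s standard library: split the data into chunks and run the same function over them on a pool of worker processes. The chunk size and the sum-of-squares workload are arbitrary choices for the example:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """The same algorithm applied to every chunk (here: sum of squares)."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunk_size = 100_000
    # Divide the data into fixed-size chunks
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool() as pool:  # one worker process per CPU core by default
        partials = pool.map(process_chunk, chunks)

    # Combine the partial results: same answer as sum(x * x for x in data)
    print(sum(partials))
```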

Large-scale computing

Not a new discipline:

  • The Cray-1 appeared in the late ’70s
  • Physicists used supercomputers for simulations in the ’80s
  • Shared-memory designs are still in large-scale use (e.g., TOP500 supercomputers)

What is new?

Large scale processing on distributed, commodity computers, enabled by advanced software using elastic resource allocation.

Software (not HW!) is what drives the Big Data industry

A brief history of Big Data tech

  • 2003: Google publishes the Google File System paper, describing a large-scale distributed file system [2]
  • 2004: Google publishes the MapReduce paper, a distributed data processing abstraction [1]
  • 2006: Yahoo creates and open-sources Hadoop, inspired by the Google papers
  • 2006: Amazon launches its Elastic Compute Cloud, offering cheap, elastic resources
  • 2007: Amazon publishes the Dynamo paper, sketching the blueprint of a cloud-native database [3]
  • 2009 onwards: The NoSQL movement. Schema-less, distributed databases defy the SQL way of storing data
  • 2010: Matei Zaharia et al. publish the Spark paper, bringing functional programming to in-memory computation [4]
  • 2012: Both Spark Streaming and Apache Flink appear, able to handle very high-volume stream processing
  • 2012: Alex Krizhevsky et al. publish their deep learning image classification paper [5], re-igniting interest in neural networks and solidifying the value of big data

The Big Data Tech Landscape

[Figure: the big data landscape]

Progress is mostly industry-driven

D: Most advancements in Big Data technology came from industry; universities only started contributing late. Why?

Data is the new oil

Typical problems solved with Big Data

  • Modeling: What factors influence particular outcomes/behaviours?
  • Information retrieval: Finding needles in haystacks, aka search engines
  • Collaborative filtering: Recommending items based on items other users with similar tastes have chosen
  • Outlier detection: Discovering transactions that stand out (e.g., fraud)
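As an illustration of the collaborative-filtering idea (the users, items, and “likes” below are all invented), one simple scheme recommends the items liked by the most similar other user:

```python
# Made-up mapping of user -> set of liked items
likes = {
    "ana":   {"dune", "tron", "alien"},
    "bob":   {"dune", "alien", "blade"},
    "carol": {"cars", "up"},
}

def jaccard(a: set, b: set) -> float:
    """Similarity of two users by the overlap of their liked items."""
    return len(a & b) / len(a | b)

def recommend(user: str) -> set:
    """Items liked by the most similar other user, minus what `user` has seen."""
    others = [u for u in likes if u != user]
    nearest = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    return likes[nearest] - likes[user]

print(recommend("ana"))  # {'blade'}: bob has the most similar taste to ana
```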

Image credits

  • “Data is the new oil” picture © The Economist

Bibliography

[1]
J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI ’04), 2004.
[2]
S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP ’03), 2003, pp. 29–43.
[3]
G. DeCandia et al., “Dynamo: Amazon’s highly available key-value store,” ACM SIGOPS Operating Systems Review, vol. 41, no. 6, pp. 205–220, 2007.
[4]
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in Proceedings of the 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud ’10), 2010.
[5]
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[6]
A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009.