Empirical research methods

Quantitative research

Extract answers to research questions using mathematical, statistical or numerical techniques

Hypotheses

Propose an explanation for a phenomenon

Defined in pairs: a null hypothesis (H0) and an alternative hypothesis (H1)

A good hypothesis is readily falsifiable

Hypotheses | p-values
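
A p-value is the probability of observing a result at least as extreme as the measured one, assuming the null hypothesis holds. The sketch below estimates one with a permutation test. The two groups are synthetic and the names group_a, group_b and n_permutations are illustrative assumptions, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical measurements for two groups (synthetic, for illustration only)
group_a = rng.normal(loc=20.0, scale=3.0, size=30)
group_b = rng.normal(loc=23.0, scale=3.0, size=30)

# H0: both groups have the same mean; H1: the means differ
observed = abs(group_a.mean() - group_b.mean())

# Permutation test: shuffle the pooled values and count how often a
# difference at least as large as the observed one appears by chance
pooled = np.concatenate([group_a, group_b])
n_permutations = 5000
count = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = abs(shuffled[:30].mean() - shuffled[30:].mean())
    if diff >= observed:
        count += 1

p_value = count / n_permutations
print(p_value)
```

A small p-value means the observed difference would be rare if H0 were true, which counts as evidence against H0.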

Measurement

Extract samples of data from a running process. Data types:

Python for data science

A brief introduction

  • A general-purpose language with a huge number of domain-specific libraries
  • Lingua franca for data science
  • Usually used in conjunction with IPython / Jupyter
  • 3 base data types: lists, dictionaries, and Pandas DataFrames

Lists

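Lists are ordered, mutable sequences; items can be of any type, but usually share one

a = list(range(1, 5))  # range() is lazy; list() materializes the values
print(a)
## [1, 2, 3, 4]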
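
As a minimal sketch of common list operations (the variable squares is introduced here for illustration only):

```python
# Build a list from a range
a = list(range(1, 5))
print(a)          # [1, 2, 3, 4]

a.append(5)       # add an item at the end
print(a[0])       # first item: 1
print(a[1:3])     # slice with the items at positions 1 and 2: [2, 3]

# A list comprehension builds a new list from an existing one
squares = [x * x for x in a]
print(squares)    # [1, 4, 9, 16, 25]
```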

Vectors

Numpy vectors enable optimized linear algebra operations

import numpy as np
a = np.array([1,2,3])
print(a)
## [1 2 3]

Operations apply to all elements

print(a * 2)
## [2 4 6]

# Matrix-vector product: each row of a is dotted with b
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([1, 2, 3])
print(a.dot(b))
## [14 32]
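
To make the difference between element-wise operations and the dot product explicit, here is a short sketch with two illustrative vectors u and v (names chosen for this example):

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

print(u + v)      # element-wise sum: [5 7 9]
print(u * v)      # element-wise product: [ 4 10 18]
print(u.dot(v))   # dot product: 1*4 + 2*5 + 3*6 = 32
print(u.mean())   # 2.0
```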

Dictionaries

Dictionaries implement the map abstract data type (ADT): each key maps to one value

a = {"foo": 1, "bar": 2}

print(a.keys())
## dict_keys(['foo', 'bar'])
print(a.values())
## dict_values([1, 2])
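
A minimal sketch of the usual map operations (lookup, insertion, membership test, iteration); the keys "baz" and "qux" are made up for the example:

```python
a = {"foo": 1, "bar": 2}

print(a["foo"])          # look up a value by key: 1
a["baz"] = 3             # insert a new key/value pair
print("baz" in a)        # membership test on keys: True
print(a.get("qux", 0))   # default for a missing key: 0

# Iterate over key/value pairs (insertion order is preserved)
for key, value in a.items():
    print(key, value)
```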

Pandas Data frames

The DataFrame is the data science workhorse. It is the equivalent of a relation in relational algebra, with richer data types:

  • Rows are measurements
  • Columns are variables

import pandas as pd

mtcars = pd.read_csv("https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv")
print(mtcars.head())
##                model   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
## 0          Mazda RX4  21.0    6  160.0  110  ...  16.46   0   1     4     4
## 1      Mazda RX4 Wag  21.0    6  160.0  110  ...  17.02   0   1     4     4
## 2         Datsun 710  22.8    4  108.0   93  ...  18.61   1   1     4     1
## 3     Hornet 4 Drive  21.4    6  258.0  110  ...  19.44   1   0     3     1
## 4  Hornet Sportabout  18.7    8  360.0  175  ...  17.02   0   0     3     2
## 
## [5 rows x 12 columns]
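
A data frame can also be built by hand, which makes the row/column structure explicit. The sketch below re-creates the first three rows printed above for a few columns; the name cars is an assumption made for this example.

```python
import pandas as pd

# One dictionary entry per column (variable); list positions are rows (measurements)
cars = pd.DataFrame({
    "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
    "mpg": [21.0, 21.0, 22.8],
    "cyl": [6, 6, 4],
    "hp": [110, 110, 93],
})

print(cars.shape)          # (rows, columns): (3, 4)
print(list(cars.columns))  # ['model', 'mpg', 'cyl', 'hp']
```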

Data frames | information

Summary statistics for the numeric columns of a data frame

print(mtcars.describe())
##              mpg        cyl        disp  ...         am       gear     carb
## count  32.000000  32.000000   32.000000  ...  32.000000  32.000000  32.0000
## mean   20.090625   6.187500  230.721875  ...   0.406250   3.687500   2.8125
## std     6.026948   1.785922  123.938694  ...   0.498991   0.737804   1.6152
## min    10.400000   4.000000   71.100000  ...   0.000000   3.000000   1.0000
## 25%    15.425000   4.000000  120.825000  ...   0.000000   3.000000   2.0000
## 50%    19.200000   6.000000  196.300000  ...   0.000000   4.000000   2.0000
## 75%    22.800000   8.000000  326.000000  ...   1.000000   4.000000   4.0000
## max    33.900000   8.000000  472.000000  ...   1.000000   5.000000   8.0000
## 
## [8 rows x 11 columns]

Data frames | data types

Information about the data types per column

print(mtcars.dtypes)
## model     object
## mpg      float64
## cyl        int64
## disp     float64
## hp         int64
## drat     float64
## wt       float64
## qsec     float64
## vs         int64
## am         int64
## gear       int64
## carb       int64
## dtype: object

Data frames | indexing

Indexing a data frame by column returns a pandas.Series of values, which can be thought of as a typed list.

print(mtcars.mpg[0:5]) # or mtcars["mpg"]
## 0    21.0
## 1    21.0
## 2    22.8
## 3    21.4
## 4    18.7
## Name: mpg, dtype: float64

Data frames | indexing

Indexing a data frame by row position returns a new pandas.DataFrame

print(mtcars.iloc[[1, 3]]) # Second and fourth rows (positions are zero-based)
##             model   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
## 1   Mazda RX4 Wag  21.0    6  160.0  110  ...  17.02   0   1     4     4
## 3  Hornet 4 Drive  21.4    6  258.0  110  ...  19.44   1   0     3     1
## 
## [2 rows x 12 columns]
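
Rows can be addressed by position (iloc) or by label (loc). A minimal sketch, assuming a small hand-built frame named cars that uses the model names as row labels:

```python
import pandas as pd

cars = pd.DataFrame(
    {"mpg": [21.0, 21.0, 22.8], "cyl": [6, 6, 4]},
    index=["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],  # row labels
)

print(cars.iloc[0])                   # first row, selected by position
print(cars.loc["Datsun 710"])         # row selected by label
print(cars.loc["Datsun 710", "mpg"])  # single value by row and column label: 22.8
print(cars.iloc[0:2, 0])              # first two rows of the first column
```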

Data frames | subsetting

Selecting rows of a data frame based on conditions returns a new data frame

print(mtcars[(mtcars.cyl == 6) & (mtcars.gear >= 4)])
##             model   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
## 0       Mazda RX4  21.0    6  160.0  110  ...  16.46   0   1     4     4
## 1   Mazda RX4 Wag  21.0    6  160.0  110  ...  17.02   0   1     4     4
## 9        Merc 280  19.2    6  167.6  123  ...  18.30   1   0     4     4
## 10      Merc 280C  17.8    6  167.6  123  ...  18.90   1   0     4     4
## 29   Ferrari Dino  19.7    6  145.0  175  ...  15.50   0   1     5     6
## 
## [5 rows x 12 columns]
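
The condition inside the brackets is a boolean Series with one entry per row. The sketch below shows that mask on a small hand-built frame (cars is an assumed name), together with two alternative ways to write a selection, DataFrame.query and Series.isin:

```python
import pandas as pd

cars = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout"],
    "cyl": [6, 4, 6, 8],
    "gear": [4, 4, 3, 3],
})

# Build the boolean mask separately to see what it contains
mask = (cars.cyl == 6) & (cars.gear >= 4)
print(mask.tolist())  # [True, False, False, False]

# The same selection written as a query string
six_cyl = cars.query("cyl == 6 and gear >= 4")
print(six_cyl)

# Rows whose value is in a set of allowed values
four_or_eight = cars[cars.cyl.isin([4, 8])]
print(four_or_eight)
```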

Data frames | applying a function

Apply a function to every value of a column and store the result in a new column

mtcars['model_len'] = mtcars.model.apply(len)
print(mtcars.head())
##                model   mpg  cyl   disp   hp  ...  vs  am  gear  carb  model_len
## 0          Mazda RX4  21.0    6  160.0  110  ...   0   1     4     4          9
## 1      Mazda RX4 Wag  21.0    6  160.0  110  ...   0   1     4     4         13
## 2         Datsun 710  22.8    4  108.0   93  ...   1   1     4     1         10
## 3     Hornet 4 Drive  21.4    6  258.0  110  ...   1   0     3     1         14
## 4  Hornet Sportabout  18.7    8  360.0  175  ...   0   0     3     2         17
## 
## [5 rows x 13 columns]
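
Two common variations, sketched on a small hand-built frame (cars and the column hp_per_cyl are names assumed for this example): a vectorized string method that gives the same lengths without apply, and a row-wise apply that combines several columns.

```python
import pandas as pd

cars = pd.DataFrame({
    "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
    "cyl": [6, 6, 4],
    "hp": [110, 110, 93],
})

# Vectorized string method: length of each model name
cars["model_len"] = cars.model.str.len()

# Row-wise apply: axis=1 passes each row to the function
cars["hp_per_cyl"] = cars.apply(lambda row: row["hp"] / row["cyl"], axis=1)

print(cars)
```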

Loading and saving data

Pandas can read tabular data from many data sources (incl. databases). Out of the box, it supports CSV, JSON, Excel, etc.

Data frames can also be written back out, for example to CSV files
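
A minimal round-trip sketch: write a frame to CSV and read it back. An in-memory buffer stands in for a file path here; cars, buffer and restored are names assumed for this example.

```python
import io
import pandas as pd

cars = pd.DataFrame({"model": ["Mazda RX4", "Datsun 710"], "mpg": [21.0, 22.8]})

# Write to CSV; index=False leaves out the row numbers
buffer = io.StringIO()
cars.to_csv(buffer, index=False)
print(buffer.getvalue())

# Read the same CSV text back into a new data frame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)

# JSON export with one object per row
as_json = cars.to_json(orient="records")
print(as_json)
```

With a real file, the buffer would be replaced by a path such as "cars.csv" in both to_csv and read_csv.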

Statistical data visualization

Histogram

Frequency distribution of the values of 1 variable

import matplotlib.pyplot as plt
mtcars.hist(['mpg'], grid=False)
## array([[<matplotlib.axes._subplots.AxesSubplot object at 0x132ad8090>]],
##       dtype=object)
plt.show()

Scatter plot

Actual values of 2 variables on a 2D plot

mtcars.plot.scatter('mpg', 'qsec')
plt.show()

Box plot

Descriptive statistics of 1 variable grouped by another

mtcars.boxplot('mpg', by='cyl', grid=False)
plt.show()
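
The numbers a box plot draws can be inspected directly with a grouped describe. A sketch on the five rows printed earlier (cars and summary are assumed names):

```python
import pandas as pd

cars = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
    "cyl": [6, 6, 4, 6, 8],
})

# Per-group descriptive statistics of mpg, one row per cyl value
summary = cars.groupby("cyl")["mpg"].describe()

# The quartiles and extremes are what the boxes and whiskers are based on
print(summary[["min", "25%", "50%", "75%", "max"]])
```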

(Grouped) Bar plot

Frequencies of one or more groups of data

# Number of cars per gear value
mtcars[['gear', 'vs']].groupby('gear').count().plot.bar()
plt.show()

# Grouped variant: one bar per vs value within each gear value
pd.crosstab(mtcars.gear, mtcars.vs).plot.bar(title="Car Distribution by Gears and VS")
plt.show()

Line/Area chart

Facets

Split a visualization into groups based on factors

# R code (ggplot2); mtcars1 and defaults are not defined in this section
ggplot(mtcars1) + aes(x = hp, y = mpg, shape=am, color=am) + facet_grid(gear~cyl) +
   xlab("Horsepower") + ylab("Miles per Gallon") + geom_point(size = 4) + defaults
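
In Python, a comparable faceted plot can be sketched with matplotlib subplots, one panel per group. This is an approximation under stated assumptions rather than a translation of the code above: it facets by gear only, uses the five rows printed earlier, renders off-screen with the Agg backend, and writes to a hypothetical file name facets.png.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen rendering; omit this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

cars = pd.DataFrame({
    "hp": [110, 110, 93, 110, 175],
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
    "gear": [4, 4, 4, 3, 3],
})

# One (key, sub-frame) pair per distinct gear value
groups = list(cars.groupby("gear"))

# One panel per group, with shared axes so the panels are comparable
fig, axes = plt.subplots(1, len(groups), sharex=True, sharey=True, squeeze=False)
for ax, (gear, subset) in zip(axes[0], groups):
    ax.scatter(subset.hp, subset.mpg)
    ax.set_title(f"gear = {gear}")
    ax.set_xlabel("Horsepower")
axes[0][0].set_ylabel("Miles per Gallon")

fig.savefig("facets.png")  # hypothetical output file name
```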

Distributions

Distributions | Normal

Identified by the characteristic ‘bell curve’ histogram

Distributions | Non-normal

Histograms are left- or right-skewed
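
Skewness can also be measured numerically. The sketch below compares a synthetic normal sample with a synthetic exponential one (both generated here for illustration, not taken from the original data) using pandas' Series.skew:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

normal = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))  # long right tail

print(normal.skew())  # close to 0 for a symmetric distribution
print(skewed.skew())  # clearly positive for a right-skewed distribution

# In a right-skewed sample the mean is pulled toward the tail, above the median
print(skewed.mean() - skewed.median())
```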