Data Analysis and Visualization Using R: Lesson 1

David Robinson
1/27/14

How to Read These Slides

In these slides, we show blocks of R code, which are immediately followed by their output:

print("hello world")
[1] "hello world"

The gray box shows the original R code, which you can copy and paste into your own R console to try yourself. The white box shows the code's output: you can compare it to your own results (or just trust us that that's the output).

Numeric variables

Assigning a variable

You store a value in a variable using the = operator:

x = 42

This gives the variable x a value of 42. You can show the value of x with:

print(x)
[1] 42

You can also assign a variable with <-: this is equivalent.

x <- 42

Variable names

Variable names consist of letters, digits, periods and underscores (_), and cannot start with a digit or an underscore. A common convention is to use periods where you would otherwise put spaces.

Legal variable names include:

  • my.variable
  • my_variable

Illegal names include:

  • my-variable
  • dave's.variable
  • 2ndvariable
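
One way to check a name yourself is to compare it with what the base R function make.names would turn it into. This is a minimal sketch that assumes a name is legal exactly when make.names leaves it unchanged; the variable names candidates and is.legal are made up for this example.

```r
# Candidate names taken from the two lists above
candidates = c("my.variable", "my_variable", "my-variable",
               "dave's.variable", "2ndvariable")

# make.names() rewrites a string into a syntactically valid name,
# so a name that comes back unchanged was already legal
is.legal = candidates == make.names(candidates)
names(is.legal) = candidates
print(is.legal)
```

The first two names should come back TRUE and the last three FALSE.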

Using R like a scientific calculator

You can perform mathematical operations using +, -, *, and /:

x = 6 + 4
print(x)
[1] 10
x / 2
[1] 5
y = 4
x / y
[1] 2.5

Using R like a scientific calculator

You can use exponentiation with ^, or calculate the natural log:

x^2
[1] 100
y^3
[1] 64
log(x)
[1] 2.303
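
A few related base R functions are sketched below for comparison. x is set to 10 again so that the block runs on its own.

```r
x = 10
log(x)             # natural log (base e)
log10(x)           # base-10 log
# [1] 1
log(x, base = 2)   # log with any other base
exp(log(x))        # exp() undoes the natural log
# [1] 10
sqrt(x)            # square root, the same as x^0.5
```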

Assigning variables: FAQ

  • What is the difference between <- and =?
    • In 99% of cases, they act exactly the same, so it's personal preference. They differ only in rare cases, such as inside a function call, where = names an argument rather than assigning a variable.
  • When do you need print(x) to display a variable, and when x?
    • When working in the R interactive terminal, the result of each line is displayed after it is evaluated, so print is unnecessary. When you source a .R file, you need print(x) on the line or the value won't be displayed.
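
As a sketch of one of the rare cases where <- and = differ, compare how they behave inside a function call. The function square and the name n.items are hypothetical, defined only for this example, and the first check assumes you have not already created a variable called n.items.

```r
# square is a hypothetical helper, defined only for this example
square = function(n.items) n.items^2

# '=' inside a call names the argument; no variable is created outside the function
a = square(n.items = 3)
before = exists("n.items")   # FALSE, assuming no such variable was defined earlier
print(before)

# '<-' inside a call assigns the variable first, then passes its value along
b = square(n.items <- 4)
print(n.items)
# [1] 4
```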

Assigning variables: FAQ

  • Why is there a [1] before each result?
    • You'll find out in the next section!

Vectors

You may have noticed the [1] at the start of each result. That's because all numbers in R are stored in vectors: a single number is a vector of length 1. The [1] gives the position of the first value shown on that line of output.

Vectors

For example, you can use : to create a long vector of consecutive integers:

1:60
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
[18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
[52] 52 53 54 55 56 57 58 59 60

The [1], [18], [35] and [52] at the start of each row help keep track of the position within the vector.

Creating and combining vectors

You can also create vectors yourself using c:

v1 = c(1, 2, 5, 7)
v2 = c(8, 6, 3, 2)

You can also use c to combine existing vectors together:

v3 = c(v1, v2)
print(v3)
[1] 1 2 5 7 8 6 3 2

Extracting from vectors

Use square brackets to retrieve a value from a vector, or multiple values:

v3
[1] 1 2 5 7 8 6 3 2
v3[4]
[1] 7
v3[4:7]
[1] 7 8 6 3
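
The positions inside the brackets do not have to be consecutive. The sketch below shows a few other forms of indexing; v3 is redefined so the block runs on its own.

```r
v3 = c(1, 2, 5, 7, 8, 6, 3, 2)
v3[c(1, 8)]          # any vector of positions works, not just a:b
# [1] 1 2
v3[length(v3)]       # last element, whatever the length
# [1] 2
v3[-1]               # a negative index drops that position instead
# [1] 2 5 7 8 6 3 2
```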

Operations on vectors

Mathematical operations on a vector apply to all elements:

v1 = c(1, 2, 5, 7)
v1 + 2
[1] 3 4 7 9
v1 / 2
[1] 0.5 1.0 2.5 3.5
sin(v1)
[1]  0.8415  0.9093 -0.9589  0.6570

Operations on vectors

Similarly, you can perform operations between two vectors:

v1
[1] 1 2 5 7
v2 = c(8, 6, 3, 2)
v1 + v2
[1] 9 8 8 9
v1 / v2
[1] 0.1250 0.3333 1.6667 3.5000

Operations on vectors

You can also easily summarize a vector by calculating the sum, mean, or length:

sum(v3)
[1] 34
mean(v3)
[1] 4.25
length(v3)
[1] 8
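
These summaries are related: the mean is the sum divided by the length. A short sketch, with v3 redefined so the block runs on its own (manual.mean is a name made up for this example):

```r
v3 = c(1, 2, 5, 7, 8, 6, 3, 2)
manual.mean = sum(v3) / length(v3)
manual.mean == mean(v3)
# [1] TRUE
range(v3)   # smallest and largest value
# [1] 1 8
```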

Character vectors

Not all values you could want to store in R are numeric. You could store:

  • subject names
  • gene sequences
  • text for analysis

We represent these as a series of characters (letters, digits, punctuation, etc).

Assigning a character vector

Each value in a character vector is surrounded by either single or double quotation marks.

chv = "hello"
chv2 = 'hi'
chv3 = c("hello", "world")

Like numeric values, they are always vectors, though sometimes they are of length 1.
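
To see that these really are vectors, you can ask for their length. The sketch below also uses nchar and paste, two base R functions for working with text.

```r
chv = "hello"
chv3 = c("hello", "world")
length(chv)    # one string, so a vector of length 1
# [1] 1
length(chv3)
# [1] 2
nchar(chv3)    # number of characters in each string
# [1] 5 5
paste(chv3[1], chv3[2])   # join two strings, separated by a space
# [1] "hello world"
```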

Matrices

Matrices are like two-dimensional vectors, organizing values into rows and columns:

m = matrix(1:9, ncol=3)
m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Attributes of a matrix

You can get the number of rows, the number of columns, or both:

NROW(m)
[1] 3
NCOL(m)
[1] 3
dim(m)
[1] 3 3

Retrieving a value

To extract one value from a matrix, use the structure matrix[row,column].

m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
m[1, 3]
[1] 7

Retrieving a row or column

Leaving the “row” spot or the “column” spot empty will extract, respectively, an entire column or an entire row.

m[1, ]
[1] 1 4 7
m[, 2]
[1] 4 5 6

Matrix arithmetic

You can add a single value to a matrix, or multiply a matrix by a single value. The operation applies to every element:

m + 3
     [,1] [,2] [,3]
[1,]    4    7   10
[2,]    5    8   11
[3,]    6    9   12
m * 2
     [,1] [,2] [,3]
[1,]    2    8   14
[2,]    4   10   16
[3,]    6   12   18

Transpose and diagonal

Use the t function to transpose a matrix:

t(m)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Use diag to extract the diagonal:

diag(m)
[1] 1 5 9

Matrix multiplication

You can also perform traditional matrix multiplication with the %*% operator:

m2 = matrix(21:32, nrow=3)
m %*% m2
     [,1] [,2] [,3] [,4]
[1,]  270  306  342  378
[2,]  336  381  426  471
[3,]  402  456  510  564
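
To see where one entry of the result comes from, you can compute it by hand: entry [1, 1] is the first row of m multiplied element by element with the first column of m2, then summed. A minimal sketch (product is a name made up for this example):

```r
m = matrix(1:9, ncol=3)
m2 = matrix(21:32, nrow=3)
product = m %*% m2

# Row 1 of m times column 1 of m2, summed
sum(m[1, ] * m2[, 1])
# [1] 270
product[1, 1]
# [1] 270

# By contrast, * multiplies matching positions,
# so both matrices must have the same shape
m * m
```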

Logical vectors

Another type of variable is a logical value: TRUE or FALSE. Like numbers, logical values are always stored in vectors (sometimes of length 1).

x = TRUE
y = c(TRUE, FALSE, TRUE)

Logical operators

Logical vectors are useful because they are the result of logical operators, such as:

  • > : greater than
  • < : less than
  • == : equal to
  • != : not equal to
  • & : and
  • | : or

Logical operators: comparison

x = 2  # assignment
x > 0
[1] TRUE
x < 1
[1] FALSE
x != 10
[1] TRUE

Logical operators FAQ

  • Why is the logical operator for equals == and not =?
    • Because = is already reserved for assignment.

Data frames

Data frames store multiple columns of information together. Unlike a matrix, different columns in a data frame can store different kinds of information (numbers, factors, character vectors, etc.).
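
As a sketch of what this looks like, you can build a small data frame yourself with the data.frame function. The name subjects and all of the values are made up purely for illustration.

```r
# Made-up values, purely for illustration
subjects = data.frame(
  name = c("Ann", "Bob", "Cat"),     # text column
  age = c(34, 27, 45),               # numeric column
  smoker = c(FALSE, TRUE, FALSE)     # logical column
)
subjects
str(subjects)   # shows the type of each column
```

Each column is a vector, and all columns have the same length.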

Built-in Datasets

R comes with built-in datasets that can be retrieved by name. You can access one with the data function.

data(mtcars)

mtcars contains statistics about 32 cars, as published in Motor Trend magazine in 1974, including miles per gallon, weight, number of cylinders, etc. Each row is one car, and each column is one piece of information.

View data frame in RStudio

View(mtcars)

See details and documentation about the data with:

?mtcars

or

help(mtcars)

See first rows of data frame

One of the most useful functions is head, which shows the first 6 rows of a data frame (a good way to get an idea of its contents):

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02
Datsun 710        22.8   4  108  93 3.85 2.320 18.61
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02
Valiant           18.1   6  225 105 2.76 3.460 20.22
                  vs am gear carb
Mazda RX4          0  1    4    4
Mazda RX4 Wag      0  1    4    4
Datsun 710         1  1    4    1
Hornet 4 Drive     1  0    3    1
Hornet Sportabout  0  0    3    2
Valiant            1  0    3    1

Information about a data frame

Get the number of rows, columns or both:

nrow(mtcars)
[1] 32
ncol(mtcars)
[1] 11
dim(mtcars)
[1] 32 11

Access a column by name

Use $ to access one column by name:

mtcars$mpg
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

Each column is a vector once it is extracted.
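
Because the extracted column is an ordinary vector, the vector functions from earlier work on it directly. A short sketch:

```r
data(mtcars)
mean(mtcars$mpg)      # average miles per gallon, about 20.09
max(mtcars$mpg)
# [1] 33.9
length(mtcars$mpg)    # one value per car
# [1] 32

# Double square brackets with a quoted name extract the same vector
identical(mtcars$mpg, mtcars[["mpg"]])
# [1] TRUE
```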

Access one row or value

You can use square brackets with a comma to access a single row of a data frame:

mtcars[1, ]
          mpg cyl disp  hp drat   wt  qsec vs am gear
Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4
          carb
Mazda RX4    4

Access one row or value

Or you can give row, column to get a single value at a particular position:

mtcars[3, 2]
[1] 4

Filtering a data frame

One common operation on data is to filter out rows based on some criterion.

Subsetting rows of a data frame

You can get a set of rows using their indices:

mtcars[1:2, ]
              mpg cyl disp  hp drat    wt  qsec vs am
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1
              gear carb
Mazda RX4        4    4
Mazda RX4 Wag    4    4

However, what if you want “all automatic cars” or “all cars with mpg > 20”?

Logical operators on a vector

Just like arithmetic operations, logical operators on a vector apply the test to each element individually:

v = c(1, 3, 12, 5, 2, 20)
v > 4
[1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

Compound logical operators on a vector

You can combine them using & (and) or | (or):

v > 4 & v < 15
[1] FALSE FALSE  TRUE  TRUE FALSE FALSE
v < 6 | v > 15
[1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

Logical operations on a column

This can equally easily be applied to a column of mtcars:

mtcars$mpg > 20
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
 [9]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
[25] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE

Filtering a data frame logically

This logical vector can be used to subset rows of the data frame: TRUE means “keep the row”, FALSE means “drop it”. Place it before the comma in the square brackets:

v = mtcars$mpg > 20
efficient.cars = mtcars[v, ]

or just:

efficient.cars = mtcars[mtcars$mpg > 20, ]
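
A quick way to check a filter is to count the TRUE values and compare that with the number of rows you get back. This sketch relies on the fact that sum treats TRUE as 1 and FALSE as 0.

```r
data(mtcars)
v = mtcars$mpg > 20
sum(v)                       # TRUE counts as 1, so this counts matching rows
# [1] 14
efficient.cars = mtcars[v, ]
nrow(efficient.cars)
# [1] 14
all(efficient.cars$mpg > 20) # every remaining car passes the test
# [1] TRUE
```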

Filtering on multiple conditions

You can combine multiple conditions using & (and) or | (or), such as looking for automatic gearshift cars with mpg > 20:

efficient.auto = mtcars[mtcars$mpg > 20 & mtcars$am == 0, ]
head(efficient.auto, 3)
                mpg cyl  disp  hp drat    wt  qsec vs
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1
               am gear carb
Hornet 4 Drive  0    3    1
Merc 240D       0    4    2
Merc 230        0    4    2

data.table

data.table is a third-party package that improves in many ways on the built-in data.frame.

We'll go over some of its advantages on Wednesday and Friday, but today we will focus on one: how it makes filtering more convenient.

Turn a data.frame into a data.table

Since data.table is a third-party package, you need to install it first. Once it is installed, you still have to load it into R:

library("data.table")

(You'll have to rerun that line each time you reopen R.) Then convert your data.frame to a data.table:

mtcars.dt = as.data.table(mtcars)
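
The install step itself is not shown above. A minimal sketch, assuming you install from CRAN with the standard install.packages function; the name mtcars.named is made up for this example. It also shows the keep.rownames argument, since a data.table does not keep row names and the car names would otherwise be lost.

```r
# Install once per computer; skipped here if the package is already present
if (!requireNamespace("data.table", quietly = TRUE)) {
  install.packages("data.table")
}
library("data.table")

# keep.rownames = TRUE copies the car names into a column called "rn"
mtcars.named = as.data.table(mtcars, keep.rownames = TRUE)
head(mtcars.named$rn, 3)
```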

Filtering a data.table

A data.table looks identical in many ways to a data.frame, but has some useful features. One is that when you're filtering, you don't need to write mtcars$ each time inside the brackets: you can just refer to the column names.

mtcars.dt[mpg > 20 & am == 0, ]
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
2: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
3: 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
4: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1

This doesn't mean that mpg and am exist as variables of their own: you can refer to them by name only within those square brackets.

Next Time