Loading Data and initial exploration

Let’s load a sample dataset that ships with an R package. This first step could be replaced by another statement if you have your own dataset on your computer or on the web.

    library(ggplot2)
    data(diamonds)
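
If your data lives in a file instead, a sketch like the following may help. The file and its contents here are hypothetical: the snippet first writes a tiny CSV to a temporary location so that it runs anywhere, then reads it back with read.csv.

```r
# Hypothetical stand-in for your own file: create a tiny CSV in a temp location.
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(carat = c(0.23, 0.31, 0.70), price = c(326, 335, 2757))
write.csv(toy, csv_path, row.names = FALSE)

# Read it back; read.csv() also accepts a URL string in place of a local path.
my_data <- read.csv(csv_path)
dim(my_data)
```

With a real dataset you would skip the first three statements and pass your own path or URL to read.csv.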

To find out basic information about the dataset, you can run the following commands. Be sure to identify their differences:

dim(diamonds)
names(diamonds)
summary(diamonds)
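
As a small complement, the individual pieces that dim reports can also be obtained one at a time. This is only a sketch; the variable names n_rows, n_cols and col_types are made up for illustration.

```r
library(ggplot2)
data(diamonds)

n_rows <- nrow(diamonds)   # number of data points (rows)
n_cols <- ncol(diamonds)   # number of variables (columns)

# First class of each column, e.g. "numeric", "integer" or "ordered"
col_types <- sapply(diamonds, function(col) class(col)[1])
col_types
```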

How many data points does our dataset have? How many variables are there? What are their value ranges?

To view another textual summary, try

str(diamonds)

To peek at the first few rows or the last few rows of the data, try

head(diamonds)
tail(diamonds)

Or, if you want the 10th through 20th rows (inclusive)

diamonds[10:20, ]
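
The number of rows shown by head and tail can be controlled with the n argument. The short sketch below also confirms that 10:20 includes both endpoints; the variable names are illustrative only.

```r
library(ggplot2)
data(diamonds)

first_three <- head(diamonds, n = 3)   # first 3 rows instead of the default 6
last_three  <- tail(diamonds, n = 3)   # last 3 rows

rows_10_to_20 <- diamonds[10:20, ]
nrow(rows_10_to_20)                    # 11 rows, because both endpoints are included
```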

Start with statistics

You can access variables by name:

summary(diamonds[, c("clarity", "price")])

A useful function for a closer exploration (similar to summary) is the describe function from the Hmisc package

library(Hmisc)
describe(diamonds[, c(2, 5)]) # check columns 2 & 5

What is the information we can gather from this function? How are these connected to the data pre-processing steps discussed in the lecture?

Mean, Median, Range and Quartiles

The following statements provide us with the basic statistics. Pay attention to the fact that they need a single vector as an argument, which means that they can only be applied to one column at a time.

range(diamonds$color)
range(diamonds$depth)

Try it for the rest of the variables. Do you notice the differences between a categorical and a numeric variable?
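
The mean and median work the same way as range: each takes a single vector. To cover several columns at once, one option (a sketch, not the only approach) is to loop over the numeric columns with sapply. The names numeric_cols and col_means are made up for this example.

```r
library(ggplot2)
data(diamonds)

# One column at a time: $ (or [[ ]]) extracts a plain vector
mean(diamonds$price)
median(diamonds[["price"]])

# Several columns: pick the numeric ones, then apply mean to each
numeric_cols <- sapply(diamonds, is.numeric)
col_means <- sapply(diamonds[, numeric_cols], mean)
col_means
```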

By default, quantile returns the standard quantiles (0%, 25%, 50%, 75% and 100%). If you pass a vector of probabilities as a second argument, you get the quantiles at those cut points instead.

quantile(diamonds$table)
quantile(diamonds$table, c(0.1, 0.3, 0.65))
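
To connect the quartiles with other statistics, the sketch below checks that the 50% quantile matches the median and that the distance between the 75% and 25% quantiles matches IQR. It assumes the default quantile type; q and iqr_by_hand are illustrative names.

```r
library(ggplot2)
data(diamonds)

q <- quantile(diamonds$table, c(0.25, 0.5, 0.75))

# Interquartile range computed by hand from the two quartiles
iqr_by_hand <- unname(q["75%"] - q["25%"])

all.equal(iqr_by_hand, IQR(diamonds$table))           # same value as IQR()
all.equal(unname(q["50%"]), median(diamonds$table))   # 50% quantile is the median
```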

Try quantile for the rest of the variables. What is happening with factor variables?

Variance, Histogram/Density plots, Pie/Bar plots

Variance and the histogram both describe the dispersion of a variable. A density plot shows a smoothed estimate of the distribution (a kernel density estimate).

var(diamonds$price)
hist(diamonds$price)
plot(density(diamonds$price))

Try it for other variables. Does it work for factors?
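
Since variance is expressed in squared units, the standard deviation is often easier to read. A brief sketch follows; the choice of 50 breaks for the histogram is arbitrary and only meant to show that the bin count can be changed.

```r
library(ggplot2)
data(diamonds)

price_var <- var(diamonds$price)
price_sd  <- sd(diamonds$price)

# The standard deviation is the square root of the variance
all.equal(price_sd, sqrt(price_var))

# A finer histogram: breaks is a suggestion for the number of bins
hist(diamonds$price, breaks = 50, main = "Price", xlab = "Price (USD)")
```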

For factor variables, you can use table together with pie and bar charts.

table(diamonds$color)
pie(table(diamonds$color))
barplot(table(diamonds$color))
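
Counts can be turned into proportions with prop.table, which may make groups easier to compare. The following is a minimal sketch with illustrative variable names.

```r
library(ggplot2)
data(diamonds)

color_counts <- table(diamonds$color)
color_share  <- prop.table(color_counts)   # counts divided by their total

round(100 * color_share, 1)                # percentages per color
barplot(color_share, ylab = "Share of diamonds")
```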

Multiple-variable exploration

Covariance and correlation are two metrics that show how two numeric variables vary together. For more information on what they represent, try the help pages (?cov) or the web. They can be applied either to two variables or to several columns at once.

cov(diamonds$price, diamonds$carat)
cor(diamonds$price, diamonds$carat)
cov(diamonds[, c(1, 5, 6, 7)])
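
The two metrics are closely related: the (Pearson) correlation is the covariance divided by the product of the two standard deviations. The sketch below verifies this numerically and also shows a correlation matrix for several columns; the variable names are illustrative.

```r
library(ggplot2)
data(diamonds)

cov_pc <- cov(diamonds$price, diamonds$carat)
cor_pc <- cor(diamonds$price, diamonds$carat)

# Correlation = covariance rescaled by both standard deviations
cor_by_hand <- cov_pc / (sd(diamonds$price) * sd(diamonds$carat))
all.equal(cor_pc, cor_by_hand)

# Correlation matrix for several numeric columns at once
cor(diamonds[, c("carat", "depth", "table", "price")])
```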

Some of the variables are discrete or categorical, so a boxplot helps in this case. The bar in the middle is the median. The box around it shows the interquartile range (IQR), which is the range between the 25th and 75th percentiles.

boxplot(diamonds$price ~ diamonds$clarity)

Try it for other pairs of variables. What do you observe?
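
The numbers behind such a boxplot can also be computed directly. As a sketch, tapply applies a function to price within each clarity level, giving the median (the middle bar) and the IQR (the box height) per group. The result names are made up for illustration.

```r
library(ggplot2)
data(diamonds)

# Median price per clarity level: the bar in the middle of each box
median_by_clarity <- tapply(diamonds$price, diamonds$clarity, median)

# IQR per clarity level: the height of each box
iqr_by_clarity <- tapply(diamonds$price, diamonds$clarity, IQR)

median_by_clarity
iqr_by_clarity
```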

We might want a visual summary of some variables as well. You can even try it with more variables, but then it might need extra time to run.

pairs(diamonds[, c("depth", "price")])

# be careful: this draws every pair of the 10 variables for all rows and can be slow
pairs(diamonds)
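
One way to keep the scatterplot matrix fast is to plot a random sample of rows. The sample size of 500 and the seed below are arbitrary choices for this sketch, not recommendations.

```r
library(ggplot2)
data(diamonds)

set.seed(42)                                  # arbitrary seed, for reproducibility
sample_rows <- sample(nrow(diamonds), 500)    # 500 distinct row indices

pairs(diamonds[sample_rows, c("carat", "depth", "table", "price")])
```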

We love plots! And they are always a nice way to stay close to your data. With the following plots you can visualize specific pairs of variables while taking into account extra information from other variables.

with(diamonds, plot(price, depth, col = cut, pch = as.numeric(cut)))

Of course, given the dataset, these plots might be readable or not :)
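
A legend can make such a plot easier to read. The sketch below redraws the same scatterplot with explicit integer codes for the cut levels and then adds a matching legend; the names cut_levels and cut_codes are illustrative.

```r
library(ggplot2)
data(diamonds)

cut_levels <- levels(diamonds$cut)
cut_codes  <- as.integer(diamonds$cut)   # 1 for the first level, 2 for the second, ...

plot(diamonds$price, diamonds$depth,
     col = cut_codes, pch = cut_codes,
     xlab = "price", ylab = "depth")

# The legend uses the same codes, so colors and symbols line up with the points
legend("topright", legend = cut_levels,
       col = seq_along(cut_levels), pch = seq_along(cut_levels))
```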

Another useful plot is the heat map. We first calculate the similarity between different data points with a distance function and then plot it.

Attention! Distance computation (as we will see in clustering) is a very expensive operation so if you have too much data, try to limit it. Also, distance computation is only valid for numeric variables.

dist.matrix <- as.matrix(dist(diamonds[1:100, c(1,5:10)]))
heatmap(dist.matrix)
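
Because price takes much larger values than the other columns, it tends to dominate an unscaled Euclidean distance. A possible variation, shown here only as a sketch, is to standardize the columns with scale before computing distances. The names num_sample, scaled and dist.scaled are assumptions for this example.

```r
library(ggplot2)
data(diamonds)

num_sample <- diamonds[1:100, c(1, 5:10)]   # same 100 rows and numeric columns as above

# Center each column to mean 0 and rescale to standard deviation 1
scaled <- scale(num_sample)

dist.scaled <- as.matrix(dist(scaled))
heatmap(dist.scaled)
```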

Another useful visualization is the quick plot (qplot) provided by the ggplot2 package.

library(ggplot2)
qplot(carat, depth, data = diamonds, facets = cut ~ .)
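
Recent ggplot2 releases mark qplot as deprecated, so it may print a warning. An equivalent plot can be sketched with the full ggplot interface: a scatterplot of depth against carat with one panel row per cut.

```r
library(ggplot2)
data(diamonds)

p <- ggplot(diamonds, aes(x = carat, y = depth)) +
  geom_point() +
  facet_grid(cut ~ .)   # one row of panels per cut level

print(p)
```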

Also, depending on the problem there are other visualization tools that can be used. Some examples are: scatterplot3d (from scatterplot3d package), levelplot (from lattice package), filled.contour and persp (from graphics package), etc.

Save charts to files

You can save your plots to files either through the menu in RStudio or by adding specific statements to your code. Pay attention to graphics.off(), which is needed after plotting in order to write the plot to the file and close the graphics device. It closes all open devices; dev.off() closes only the current one and works equally well here.

# save as a PDF file
pdf("myPlot.pdf")
x <- 1:50
plot(x, log(x))
graphics.off()
# Save as a postscript file
postscript("myPlot2.ps")
x <- -20:20
plot(x, x^2)
graphics.off()
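
The same pattern with dev.off() is sketched below. The file name myPlot3.pdf is hypothetical, and the file is written to R's temporary directory so the example does not clutter your working directory.

```r
# Hypothetical output file inside the session's temporary directory
plot_path <- file.path(tempdir(), "myPlot3.pdf")

pdf(plot_path)        # open a PDF device
x <- 1:50
plot(x, sqrt(x))
dev.off()             # close only this device, which finishes writing the file

file.exists(plot_path)
```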

Playing with conditionals on specific values

You may want all rows of diamonds with a price higher than $4000.

    diamonds_more_than_4000 = diamonds[diamonds$price > 4000, ]
    head(diamonds_more_than_4000)

To extract only the color, clarity and price of these diamonds:

    color_clarity_more_than_4000 = diamonds[diamonds$price > 4000, c("color", "clarity", "price")]
    
    head(color_clarity_more_than_4000)

Or, realizing that color and clarity are the 3rd and 4th columns and price is the 7th, we can find the same data with this command:

    color_clarity_more_than_4000 = diamonds[diamonds$price > 4000, c(3, 4, 7)]
    head(color_clarity_more_than_4000)
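
Conditions can be combined with & (and) or | (or). The sketch below selects diamonds that are both expensive and of "Ideal" cut, once with bracket indexing and once with subset, which lets you refer to columns without the diamonds$ prefix. The result names are illustrative.

```r
library(ggplot2)
data(diamonds)

# Bracket indexing with two conditions joined by &
ideal_expensive <- diamonds[diamonds$price > 4000 & diamonds$cut == "Ideal",
                            c("color", "clarity", "price")]

# The same selection written with subset()
same_with_subset <- subset(diamonds, price > 4000 & cut == "Ideal",
                           select = c(color, clarity, price))

# Counting matches: TRUE counts as 1 when summed
sum(diamonds$price > 4000)
```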