Suppose you are a data scientist interested in modeling delays in flight arrivals at airports around the world, and you suspect that arrival delays are correlated by month. You therefore cluster average arrival delay (in minutes) by month and destination airport.
To illustrate this problem I’m using the dataset flights from Hadley Wickham’s package nycflights13. Per the documentation, flights contains data on all flights in 2013 that departed NYC via JFK, LaGuardia, or Newark. The variables we’re interested in here are month, dest, and arr_delay.
biclustermd() requires data to be in table format, so we first use dplyr to summarize the data to average arrival delay, then reshape it into a table with spread() from tidyr.
library(nycflights13)  # provides the flights dataset
library(dplyr)         # group_by(), summarise()
library(tidyr)         # spread()

flights <- flights %>%
  group_by(month, dest) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  spread(dest, mean_arr_delay) %>%
  as.data.frame()
Now we name the rows using the text version of month, then remove the month column and convert the data to a matrix.
rownames(flights) <- month.name[flights$month]
flights <- as.matrix(flights[, -1])
flights[1:5, 1:5]
## ABQ ACK ALB ANC ATL
## January NA NA 35.17460 NA 4.152047
## February NA NA 17.38889 NA 5.174092
## March NA NA 17.16667 NA 7.029286
## April 12.222222 NA 18.00000 NA 11.724280
## May -6.516129 3.904762 10.19643 NA 8.187036
Now that our data is in the correct form, we can bicluster it.
We first need to determine how many groups of months (r) and how many groups of destination airports (c) to define. Since rows correspond to months in this analysis, setting r = 4 may not be a bad idea: create a group for each season/quarter of the year. For the columns (airports) the number of groups is a bit more ambiguous, just as with k-means clustering. Since flights to each continent of the world may take different amounts of time and be subject to different policies for flights from the US, let’s set c = 6.
Now we’re ready to bicluster. I’ll put the code first and give brief descriptions of selected arguments. This will mostly be a repeat of the documentation for biclustermd().
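The call that produced the output below might look like the following sketch. Only row_clusters = 4 and col_clusters = 6 come from the discussion above; the remaining argument values are assumptions based on the defaults documented in ?biclustermd, not necessarily the exact values used here.

```r
library(biclustermd)

# Fit a bicluster model with 4 month groups (rows) and 6 airport groups (columns).
# miss_val, miss_val_sd, similarity, and max.iter are illustrative values taken
# from the documented defaults.
bc <- biclustermd(data = flights,
                  row_clusters = 4,
                  col_clusters = 6,
                  miss_val = mean(flights, na.rm = TRUE),  # value imputed for missing cells
                  miss_val_sd = 1,
                  similarity = "Rand",
                  max.iter = 100)
bc
```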
##
## Data has 1260 values, 11.75% of which are missing
## 11 Iterations
## Initial SSE = 198251; Final SSE = 83107, a 58.1% reduction
## Rand similarity used; Indices: Columns (P) = 1, Rows (Q) = 1
biclustermd() returns a list of class biclustermd with useful information; execute ?biclustermd to view the output in detail. This list is used to make plots and training data, for example.
To plot SSE by iteration, use autoplot.biclustermd_sse(), which outputs points by default:
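For example, assuming the fitted object is stored in bc, the SSE trace can be plotted from the SSE component of the result (the component name here follows the package documentation):

```r
library(ggplot2)  # provides autoplot()

# Plot SSE by iteration; bc$SSE holds the per-iteration sums of squared errors.
autoplot(bc$SSE)
```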
Next, cluster similarity measures. If a similarity measure for rows equals one, the row shuffling from the last iteration and the second-to-last iteration are identical; if zero, the two shufflings have nothing in common. The same goes for column similarities. autoplot.biclustermd_sim() plots the RIs by iteration.
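Analogously to the SSE plot, a sketch that assumes the similarity indices live in the Similarities component of the fitted object (see ?biclustermd for the exact component names):

```r
library(ggplot2)

# Plot the similarity indices (here, Rand indices) by iteration.
autoplot(bc$Similarities)
```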
autoplot.biclustermd() makes visual analysis of biclustering results easy by drawing a heat map of the clustered data. The algorithm does use randomness, so my results may look different from yours.
?autoplot.biclustermd
autoplot(bc) +
scale_fill_viridis_c(na.value = "white") +
labs(x = "Destination Airport", y = "Month", fill = "Average Delay")
Oftentimes it is helpful to run the data through an S-shaped function before plotting. This is easy with autoplot.biclustermd(): set transform_colors = TRUE and specify the constant you’d like to scale your data by (with c) before it is run through a standard normal CDF:
autoplot(bc, transform_colors = TRUE, c = 1/10) +
scale_fill_viridis_c(na.value = "white", limits = c(0, 1)) +
labs(x = "Destination Airport", y = "Month", fill = "Average Delay")
To make the results really stand out, we can reorder the row and column groups with the reorder argument:
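A sketch of the reordered plot, reusing the transformation settings from the previous call (combining reorder with transform_colors is an assumption based on ?autoplot.biclustermd):

```r
library(ggplot2)

# reorder = TRUE rearranges the row and column groups so the cluster
# structure stands out in the heat map.
autoplot(bc, reorder = TRUE, transform_colors = TRUE, c = 1/10) +
  scale_fill_viridis_c(na.value = "white", limits = c(0, 1)) +
  labs(x = "Destination Airport", y = "Month", fill = "Average Delay")
```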
Since the algorithm uses purposeful randomness, it is recommended that analysts run the biclustering multiple times and keep the one with the smallest SSE. The function rep_biclustermd() does exactly that. Arguments to rep_biclustermd() are the same as those for biclustermd(), with 3 additional arguments:
1. nrep: the number of times to repeat the biclustering.
2. parallel: a logical indicating whether parallelization should be used. By default this is FALSE.
3. ncores: the number of cores to parallelize over. Ignored if parallel = FALSE.
Below we run biclustering 100 times without using parallelization.
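A sketch of that call; arguments beyond nrep mirror the earlier single run and are passed through to biclustermd():

```r
# Repeat the biclustering 100 times (serially, since parallel defaults
# to FALSE) and retain the run with the smallest SSE.
repeated_bc <- rep_biclustermd(flights, nrep = 100,
                               row_clusters = 4, col_clusters = 6)
repeated_bc
```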
## $best_bc
##
## Data has 1260 values, 11.75% of which are missing
## 9 Iterations
## Initial SSE = 182584; Final SSE = 79965, a 56.2% reduction
## Rand similarity used; Indices: Columns (P) = 1, Rows (Q) = 1
##
## $rep_sse
## [1] 87317.68 80467.22 83597.98 86136.64 81709.14 83904.91 83311.03 88874.16
## [9] 89290.83 83149.45 82599.95 80478.32 83678.68 83208.12 81135.95 91052.73
## [17] 82141.15 83522.20 89346.45 82906.23 81622.75 87459.93 84022.41 82663.20
## [25] 81764.85 85712.65 84977.54 87980.69 84836.74 82764.63 82802.40 84977.54
## [33] 86799.56 85883.12 91002.27 82644.91 83678.68 85452.28 83445.97 87348.08
## [41] 84046.88 91160.01 82632.77 85446.45 80001.85 86097.43 81684.58 80513.17
## [49] 84976.47 82154.32 90865.93 83374.63 81793.79 82501.96 81518.55 83708.10
## [57] 81608.50 83491.11 83880.60 86524.91 83930.70 83509.52 84795.75 81634.99
## [65] 85679.59 82813.54 82480.60 85690.52 92757.65 83792.49 82373.11 83067.79
## [73] 87981.39 91026.90 80895.88 84260.72 87962.42 83230.33 84000.47 84285.40
## [81] 85877.01 86654.26 84354.56 83276.25 88562.70 85711.85 83773.49 82563.32
## [89] 90507.00 83695.70 84847.23 81600.47 80388.86 85149.75 79964.57 81466.87
## [97] 88661.07 88450.06 84058.06 90328.87
##
## $runtime
## user system elapsed
## 11.632 0.000 11.632
repeated_bc$best_bc returns the best biclustering and has the same structure as the output of biclustermd(). repeated_bc$rep_sse is the SSE for each repeat, in order. repeated_bc$runtime provides the CPU time for the user and system as well as the elapsed time in seconds. Here, the 100 repeats take roughly 12 seconds to complete.
Parallelization is quicker given sufficient complexity. However, as nrep gets large, parallelization slows down, since all nrep objects have to be written to memory, versus only the minimum-SSE object when processing serially. These remarks are current as of July 17, 2019.
We can plot the best SSE at each repeat; that is, we can take the cumulative minimum of rep_sse and plot it:
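A minimal sketch of that plot in base R:

```r
# Cumulative minimum of the per-repeat SSEs: the best SSE found so far
# after each repeat.
plot(cummin(repeated_bc$rep_sse),
     type = "o",
     xlab = "Repeat", ylab = "Best SSE so far")
```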
We’ll end this tutorial by making a training dataset from the clustered data using gather.biclustermd(), which gives the name of the row and column each data point comes from, as well as the row and column group it belongs to.
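The preview below was produced by a call along these lines (gather() dispatches to the biclustermd method when given the fitted object):

```r
# Build a long-format training data frame from the best biclustering.
training <- gather(repeated_bc$best_bc)
head(training)
```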
## row_name col_name row_cluster col_cluster bicluster_no value
## 1 January ABQ 1 1 1 NA
## 2 March ABQ 1 1 1 NA
## 3 April ABQ 1 1 1 12.22222
## 4 January ACK 1 1 1 NA
## 5 March ACK 1 1 1 NA
## 6 April ACK 1 1 1 NA
Now we’ll plot cell (1, 4) and fit a linear model to it.
autoplot(repeated_bc$best_bc, row_clusts = 1, col_clusts = 4) +
scale_fill_viridis_c(na.value = "white") +
labs(x = "Destination Airport", y = "Month", fill = "Average Delay")
model <- training %>%
filter(row_cluster == 1, col_cluster == 4) %>%
lm(value ~ row_name + col_name, data = .)
summary(model)
##
## Call:
## lm(formula = value ~ row_name + col_name, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.946 -2.760 -0.887 3.195 13.647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.313 3.190 3.233 0.00239 **
## row_nameJanuary -4.229 1.613 -2.623 0.01211 *
## row_nameMarch -3.803 1.586 -2.398 0.02098 *
## col_nameBDL 5.069 4.317 1.174 0.24696
## col_nameBHM 8.352 4.317 1.934 0.05982 .
## col_nameBNA 4.837 4.317 1.120 0.26897
## col_nameBUR -5.389 4.317 -1.248 0.21888
## col_nameBWI 5.064 4.317 1.173 0.24745
## col_nameCAK 7.040 4.317 1.631 0.11045
## col_nameCHO -2.038 4.849 -0.420 0.67636
## col_nameCLE -2.152 4.317 -0.498 0.62076
## col_nameCRW 7.710 4.317 1.786 0.08134 .
## col_nameCVG 11.662 4.317 2.701 0.00992 **
## col_nameIAD 6.931 4.317 1.605 0.11591
## col_nameJAC 1.554 4.849 0.320 0.75027
## col_nameJAX 4.635 4.317 1.073 0.28919
## col_nameMDW 4.756 4.317 1.102 0.27692
## col_nameMEM 3.738 4.317 0.866 0.39154
## col_nameMKE 4.194 4.317 0.971 0.33688
## col_nameORF 1.575 4.317 0.365 0.71701
## col_namePSE -3.899 4.317 -0.903 0.37168
## col_nameROC 4.809 4.317 1.114 0.27164
## col_nameSAT -8.490 4.317 -1.967 0.05587 .
## col_nameSMF 1.877 4.317 0.435 0.66591
## col_nameSTL 4.539 4.317 1.051 0.29913
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.288 on 42 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.5917, Adjusted R-squared: 0.3584
## F-statistic: 2.536 on 24 and 42 DF, p-value: 0.004026
If you have questions, you can email me at [email protected]. If you’ve found an error, please open an issue.