Improving AdaBoost with decision stumps in R

AdaBoost has proven to be one of the most effective class prediction algorithms. It mainly consists of an ensemble of simpler models (known as "weak learners") that, although not very effective individually, are highly performant when combined.

The process by which these weak learners are combined is, however, more complex than simply averaging their results. Very briefly, the AdaBoost training process can be described as follows:

For each weak learner:

1) Train the weak learner so that the weighted classification error is minimised.

2) Update the case weights, so that correctly classified cases have their weights reduced and misclassified cases have their weights increased.

3) Determine the weak learner's weight, i.e., the total contribution of the weak learner's result to the overall score. This is known as alpha and is calculated as 0.5 * ln((1 - error.rate) / error.rate).

As the weights are updated on each iteration, each weak learner tends to focus more on the cases that were misclassified by the previous ones.
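
To make these three steps concrete, here is a minimal sketch of one boosting round. This is generic illustrative code, not the ada package's internals; train_stump() and predict_stump() are hypothetical placeholders for whatever weak learner is used.

#One AdaBoost round (sketch). y is coded as -1/1 and w holds the current case weights.
#train_stump()/predict_stump() are hypothetical placeholders for the weak learner.
boost_round <- function(x, y, w) {
  stump <- train_stump(x, y, w)              #1) fit the weak learner on weighted data
  pred <- predict_stump(stump, x)            #predictions in {-1, 1}
  error.rate <- sum(w * (pred != y)) / sum(w)
  alpha <- 0.5 * log((1 - error.rate) / error.rate)  #3) the weak learner's weight
  w <- w * exp(-alpha * y * pred)            #2) shrink weights of correct cases, grow misclassified ones
  w <- w / sum(w)                            #renormalise
  list(stump = stump, alpha = alpha, weights = w)
}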

For further information about the AdaBoost algorithm, this article by Schapire provides very useful high-level guidance: http://rob.schapire.net/papers/explaining-adaboost.pdf

Decision stumps as weak learners

The most common weak learner used in AdaBoost is known as a decision stump and consists basically of a decision tree of depth 1, i.e., a model that returns an output based on a single condition, which can be summarised as "if (condition) then A else B".
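
As a quick illustration (using R's built-in iris data rather than the dataset of this post), a single stump can be fitted with rpart by restricting the tree depth to 1; the same control settings appear later in the ada call:

library(rpart)

#A decision stump is just a depth-1 tree: a single split/condition
stump <- rpart(Species ~ ., data = iris,
               control = rpart.control(maxdepth = 1, cp = -1, minsplit = 0, xval = 0))
print(stump)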

ada package in R

Although its implementation provides very good results in terms of model performance, the ada package has two main problems:

  • It creates very large objects: even with moderately sized datasets (around 500k x 50), the final model object can be extremely big (5 or 6 GB) and consequently too expensive to keep in memory, which is nonetheless required to perform any kind of prediction with the model. This is because the ada object is an ensemble of rpart objects, each of which holds a lot of additional information that is not actually needed when you just want to train an AdaBoost model with stumps and then predict with it.

  • Its execution time is too high: because ada and rpart objects support many other use cases, they are not optimal for stumps in terms of timings.

Introducing adaStump

Considering that each stump consists of just a condition, two probability estimates (one for when the condition is TRUE and one for when it is FALSE) and a multiplier (alpha), I thought this could be done much better, both in terms of size and timings, even in plain R.
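
To make the idea concrete, here is a minimal sketch (not the actual adaStump internals) of how such an ensemble could be stored and evaluated as a plain data frame; all the names and values below are hypothetical:

#Hypothetical stump table: one row per stump
stumps <- data.frame(
  feature    = c("x2ybr", "xybar"),
  threshold  = c(2.5, 8.5),
  prob_true  = c(0.9, 0.2),   #estimate when the condition holds
  prob_false = c(0.1, 0.7),   #estimate when it does not
  alpha      = c(0.6, 0.4),
  stringsAsFactors = FALSE
)

#Prediction: evaluate each condition once and add up the alpha-weighted contributions
predict_stumps <- function(stumps, newdata) {
  contribs <- lapply(seq_len(nrow(stumps)), function(i) {
    cond <- newdata[[stumps$feature[i]]] >= stumps$threshold[i]
    stumps$alpha[i] * ifelse(cond, stumps$prob_true[i], stumps$prob_false[i])
  })
  Reduce(`+`, contribs)   #combined score per row of newdata
}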

For that reason I developed adaStump, a package that takes advantage of ada's excellent training process implementation but creates a much smaller output object and generates faster predictions.

An Example

Before running the example, you need to install the package from my GitHub repo. In order to do that, you need to have devtools installed:


install.packages("devtools")
devtools::install_github("nivangio/adaStump")

After the package is installed, let's start with the example! First, the letters dataset is loaded, its variable names are assigned and the variable isA is created, which is "Yes" if the letter is an A and "No" otherwise. That will be the class we predict.


library(adaStump)

#Load Letters Dataset

letters.data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data", header = F)

names(letters.data) <- c("lettr", "xbox", "ybox", "width",
                          "high", "onpix", "xbar", "ybar",
                          "x2bar", "y2bar", "xybar", "x2ybr",
                         "xy2br", "xege", "xegvy", "yege", "yegvx")

#Create a binary class

letters.data$isA <- ifelse(letters.data$lettr == "A","Yes","No")

#Create model formula. Exclude letter var for obvious reasons

antec <- paste0(setdiff(names(letters.data),c("isA","lettr")), collapse = "+")

fla <- as.formula(paste("isA", antec, sep = "~"))

Now we are going to create two identical models. As AdaBoost is usually run with bagging, the bag.frac parameter is set to 1 so that all cases are included. Since we want to demonstrate that we can get the same results with much less memory usage and in less time, we need to ensure that both models are identical.


#ada model fit passing the corresponding control function to create Stumps
#(ada() and rpart.control() come from the ada and rpart packages;
#load them explicitly if they are not already attached)

library(ada)
library(rpart)

fit_with.ada <- ada(formula = fla, data = letters.data, type = "real",
                    control = rpart.control(maxdepth = 1, cp = -1, minsplit = 0, xval = 0),
                    iter = 40, nu = 0.05, bag.frac = 1)

#adaStump

fit_with.adaStump <- adaStump(formula = fla, data = letters.data,
                              type = "real", iter = 40, nu = 0.05, bag.frac = 1)

Let us take a look now at the object sizes:

> format(object.size(fit_with.ada), units = "KB")
[1] "56006.3 Kb"
> format(object.size(fit_with.adaStump), units = "KB")
[1] "4.7 Kb"

Looking at these numbers, we can definitely say that the first problem is solved. Besides, the size of an ada object depends on the bag fraction, the sample size and the number of iterations, while the size of an adaStump object depends only on the number of iterations. There is almost no scenario where the adaStump object becomes a resource problem, even on old personal computers. Now, let's see if the new model can also predict faster:

 
> system.time(predictions_ada <- predict(fit_with.ada, letters.data, type = "prob"))
   user  system elapsed
  1.100   0.000   1.114
> system.time(predictions_adaStump <- predict(fit_with.adaStump, letters.data))
   user  system elapsed
   0.42    0.00    0.42

Indeed 🙂 . Let us now check whether the results are the same. Contrary to ada, adaStump's prediction method returns only the probability estimate of the positive class (i.e., the probability that the class variable is 1). Its complement can easily be obtained as 1 minus the returned vector.
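
For instance, using the objects from this example, the probabilities of the negative class are simply:

#Probability of the "No" class
predictions_No <- 1 - predictions_adaStump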

 > summary(abs(predictions_ada[,2] - predictions_adaStump))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.000e+00 0.000e+00 0.000e+00 8.391e-18 1.041e-17 1.110e-16 

Although there is some difference in a few cases, it is minimal. We can now say that we produce almost the same outcome using less than 1% of the memory and more than 2.5 times faster. However, we can do better.

AdaBoost often generates several stumps based on the same condition. In this case, for example, by accessing the model frame (fit_with.adaStump$model) we can see that the condition x2ybr >= 2.5 is repeated many times. Of course, all these rows can be collapsed so that each condition is evaluated only once. For that purpose you can use pruneTree().
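
Conceptually, the collapsing amounts to aggregating the alpha-weighted contributions of the stumps that share a condition; a rough sketch, reusing the hypothetical stump table from above (this is not pruneTree()'s actual code):

#Collapse stumps that share the same condition (feature + threshold),
#pre-summing their weighted contributions so each condition is tested only once
collapsed <- aggregate(
  cbind(score_true = alpha * prob_true, score_false = alpha * prob_false)
  ~ feature + threshold,
  data = stumps, FUN = sum)

The actual call and its effect on the prediction time: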


> fit_with.prunedadaStump <- pruneTree(fit_with.adaStump) 
> system.time(predictions_prunedadaStump <- predict(fit_with.prunedadaStump, letters.data))
 user  system elapsed
 0.124   0.000   0.121  

Now it is more than 9 times faster than the original ada prediction. As a final check, let us see whether the results are still the same:

 > summary(abs(predictions_ada[,2] - predictions_prunedadaStump))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
6.489e-11 3.317e-10 7.411e-10 1.194e-09 8.965e-10 9.125e-09 

Although the difference between the two methods is much greater in this case (there is a pre-calculation step where some decimals are discarded), it is still acceptable for most use cases. So, whether to apply the reduction or not is a decision based on the level of precision needed.

You can download the full script used in this post here

Final considerations

Some observations before closing this post. Consider this a first release; I will be completing and crossing out items of this to-do list:

  • So far, the function only works with binary numeric classes. That is 0 for “No” and 1 for “Yes”
  • From the 3 prediction types in “ada”, this package only covers type “prob”
  • The function name pruneTree() is awful, but changing the name implies re-building and re-uploading. I will include that with other necessary changes
  • Although faster, execution times (and their ratios) can differ slightly from computer to computer and from problem to problem. So, don't take "adaStump is X times faster" as a fixed number.
  • If you find any bug, please write me. I have done several checks but of course problems can arise everywhere.

As usual, feel free to drop me a line with comments, observations, criticism, etc.


From JSON to Tables

“First things first”. Tautological, always true. However, sometimes some data scientists seem to ignore this: you can think of using the most sophisticated and trendy algorithm, come up with brilliant ideas, imagine the most creative visualizations but, if you do not know how to get the data and handle it in the exact way you need it, all of this becomes worthless. In other words, first things first.
In my professional experience, I have heard thousands of potentially brilliant ideas which couldn't even be tested because the data scientist in question did not know how to handle data according to his/her needs. This has become particularly problematic with the popularisation of JSON: despite the undeniable advantages that this data structure has in terms of data storage, replication, etc., it presents a challenge for data scientists, as most algorithms require that the input data is passed in tabular form. I have faced this problem myself and, after creating a couple of one-off solutions, felt it was high time to come up with a general one; of course, the solution I present here is just one approach.
In order to understand it, two things about the structure of JSON must be considered. Firstly, a JSON document is basically an array of objects. This means that, unlike in traditional tabulated databases, there is no indication of the fields an object does not have. This might seem trivial, but if you want to tabulate JSONs it certainly becomes something to be solved. Secondly, this array of objects can contain objects which are arrays themselves. This means that, when transforming the data to a table, such values have to be converted somehow to a format appropriate for a table. The function presented here contemplates both scenarios: it is intended to be used not only to tabulate objects with identical structure but also objects whose internal structure differs.
Before deep diving into the code of the function, let us propose a scenario where it could be useful. Imagine we would like to tabulate certain information about professional football players. In this mini-example, we have chosen Lionel Messi, Cristiano Ronaldo and George Best. The information is retrieved from DBpedia, which exposes the corresponding JSONs of their Wikipedia entries. In the code below, the first steps:

library(rjson)
library(RCurl)

players <- c("Lionel_Messi","George_Best","Cristiano_Ronaldo")

players.info <- lapply(players, function(pl) fromJSON(getURL(paste0("http://dbpedia.org/data/",pl,".json"))))

players.info <- lapply(1:length(players), function(x) players.info[[x]][[paste0("http://dbpedia.org/resource/",players[x])]])

players.unlist <- unlist(players.info)


I have to admit that this is not the simplest example as the JSONs retrieved are extremely complex and need much more pre-processing than usual (for example, than most API calls). The problem is that nowadays most API calls need an API Key and so the example becomes less reproducible.
Back to the example: firstly, the data for each player is retrieved using getURL() from RCurl and then converted to a list using fromJSON() from rjson, which converts JSON strings or files to lists. The whole call is done inside an lapply statement, so the object returned is basically a list of JSONs converted to lists, i.e., a list of lists. After that, from each of the sub-lists (in this case 3, as there were three players) we select only the piece of information that refers to the player himself; in a few words, a bit of cleanup. Finally, the object is unlisted. unlist() turns a list (a list of lists is itself a list) into a named vector of whatever type all the elements can be coerced to; in this case, character.

This is the point of manual intervention, where the variable selection has to be done. At this stage, it is highly recommended to have at least one example loaded in some JSON viewer (JSON Viewer, or any browser's JSONView add-in). The names of the vector generated by unlist() ("players.unlist" in the example) are the concatenation of the names of the lists each value was nested in, separated by ".".
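
A tiny illustration of that naming behaviour:

#Nested list names are concatenated with "."
unlist(list(player = list(name = "Best", goals = list(club1 = 137, club2 = 21))))
#returns a character vector with names:
#  player.name, player.goals.club1, player.goals.club2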

The elements in the character vector are ordered in the same way they appeared in the lists. So, the first thing to do is to establish where each case starts. The challenge is to find a common pattern among the vector names that identifies the beginning of each element. This element does not necessarily have to end up in the final table. In this case I chose "surname\\.value$". However, in most cases I would recommend choosing the first element that appears in each object.

Similarly, the next challenge is to find common patterns in the names of the vector elements that will define each of the variables. In order to do that, the wisest thing is to take a look at the names you have. In this case:


> unique(names(players.unlist))
  [1] "http://www.w3.org/1999/02/22-rdf-syntax-ns#type.type"   
  [2] "http://www.w3.org/1999/02/22-rdf-syntax-ns#type.value"  
  [3] "http://www.w3.org/2002/07/owl#sameAs.type"              
  [4] "http://www.w3.org/2002/07/owl#sameAs.value"             
  [5] "http://www.w3.org/2000/01/rdf-schema#label.type"        
  [6] "http://www.w3.org/2000/01/rdf-schema#label.value"       
  [7] "http://www.w3.org/2000/01/rdf-schema#label.lang"        
  [8] "http://purl.org/dc/terms/subject.type"                  
  [9] "http://purl.org/dc/terms/subject.value"                 
 [10] "http://xmlns.com/foaf/0.1/homepage.type"                
 [11] "http://xmlns.com/foaf/0.1/homepage.value"               
 [12] "http://xmlns.com/foaf/0.1/depiction.type"               
 [13] "http://xmlns.com/foaf/0.1/depiction.value"              
 [14] "http://purl.org/dc/elements/1.1/description.type"       
 [15] "http://purl.org/dc/elements/1.1/description.value"      
 [16] "http://purl.org/dc/elements/1.1/description.lang"       
 [17] "http://xmlns.com/foaf/0.1/givenName.type"               
 [18] "http://xmlns.com/foaf/0.1/givenName.value"              
 [19] "http://xmlns.com/foaf/0.1/givenName.lang"               
 [20] "http://xmlns.com/foaf/0.1/name.type" 
By taking a look at the JSON of any of these examples, we can see that "type" refers to the type of the corresponding "value". So, in this case, we know that we are going to need only the names that end with ".value". However, the JSONs are too large to do a complete scan of all their elements, so the wisest approach is to search for the desired field directly. For example, if we would like to find the date of birth:

> grep("(birth|Birth).*\\.value",unique(names(players.unlist)),value=T)
[1] "http://dbpedia.org/ontology/birthName.value"   
[2] "http://dbpedia.org/property/birthDate.value"   
[3] "http://dbpedia.org/property/birthPlace.value"  
[4] "http://dbpedia.org/property/dateOfBirth.value" 
[5] "http://dbpedia.org/property/placeOfBirth.value"
[6] "http://dbpedia.org/ontology/birthDate.value"   
[7] "http://dbpedia.org/ontology/birthPlace.value"  
[8] "http://dbpedia.org/ontology/birthYear.value"   
[9] "http://dbpedia.org/property/birthName.value"

Now we know that there are several different birth-date fields. In this case, we should take a look manually and, based upon that, choose the one that best fits our needs. In this example, I chose 6 arbitrary variables, extracted their patterns and chose suitable variable names for them. Please notice that, for example, date of death should be empty for Messi and Ronaldo, while number is empty in the case of George Best (players didn't use to have a fixed number in the past). Apart from that, "goals" has multiple entries per player, as the JSON has one value per club. This is the rest of the script:

st.obj <- "^.+surname\\.value$"

columns <- c("fullname\\.value","ontology/height\\.value",
             "dateOfBirth\\.value","dateOfDeath\\.value",
             "ontology/number\\.value","property/goals.value$")

colnames <- c("full.name","height","date.of.birth","date.of.death","number","goals")

players.table <- tabulateJSON(players.unlist,st.obj,columns,colnames)

And this is the final result


> players.table
     full.name                             height  date.of.birth      date.of.death     
[1,] "Lionel Andrés Messi"                 "1.69"  "1987-06-24+02:00" NA                
[2,] "George Best"                         "1.524" "1946-05-22+02:00" "2005-11-25+02:00"
[3,] "Cristiano Ronaldo dos Santos Aveiro" "1.85"  "1985-02-05+02:00" NA                
     number goals                        
[1,] "10"   "6;5;242"                    
[2,] NA     "6;2;3;0;1;15;12;8;33;137;21"
[3,] "7"    "3;84;176"       

Finally, we get to the function. tabulateJSON() expects four parameters: an unlisted JSON (or any character vector with characteristics similar to those produced by unlisting a JSON), a string pattern that identifies the starting position of each element, a vector of character patterns (normally regexes) with the element names to be sought, and finally the names to assign to the generated columns.

Now let's take a look at how tabulateJSON() works. Below is the entire code of the function:


library(plyr) #provides ddply() and summarise(), used below

tabulateJSON <- function (json.un, start.obj, columns, colnames) 
{
  if (length(columns) != length(colnames)) {
    stop("'columns' and 'colnames' must be the same length")
  }
  start.ind <- grep(start.obj, names(json.un))
  
  col.indexes <- lapply(columns, grep, names(json.un))
  col.position <- lapply(1:length(columns), function(x) findInterval(col.indexes[[x]], start.ind))
  
  
    temp.frames <- lapply(1:length(columns), function(x) data.frame(pos = col.position[[x]], ind = json.un[col.indexes[[x]]], stringsAsFactors = F))
    
    collapse.cols <- which(sapply(temp.frames, nrow) > length(start.ind))
  
  if(length(collapse.cols) > 0){
    temp.frames[collapse.cols] <- lapply(temp.frames[collapse.cols], function(x) 
      ddply(.data = x, .(pos), summarise, value = paste0(ind, collapse = ";")))
    
  }
  
  matr <- Reduce(function(...) merge(...,all=T,by="pos"),temp.frames)
  matr$pos <- NULL
  names(matr) <- colnames
  matr <- as.matrix(matr)
  colnames(matr) <- colnames
  return(matr)
}

How does it work? Firstly, it looks for the items that define the start of an object, based upon the pattern passed, and returns their indexes. These will be the delimiters.

  start.ind <- grep(start.obj, names(json.un))

After that, it looks for the names that match the patterns passed in "columns" and returns their indexes. Those indexes have to be assigned to a position (i.e., a row number), which is done with findInterval(). This function maps a number to an interval given a set of cut points.
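
For instance:

#findInterval() returns, for each value, the index of the interval it falls into
findInterval(c(2, 5, 9), c(1, 4, 7))
#[1] 1 2 3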

For each of the variables, a data frame with 2 columns is created, where the first one is the position and the second one is the value itself, which is obtained by indexing the character vector with the column indexes. As will be seen later, this is done because, for a given pattern (which is converted to a variable), there might be more than one match per row, which becomes problematic when trying to generate a tabulated dataset. For that reason, temp.frames may contain data frames with different numbers of rows.

  col.indexes <- lapply(columns, grep, names(json.un))
  col.position <- lapply(1:length(columns), function(x) findInterval(col.indexes[[x]], start.ind))
    temp.frames <- lapply(1:length(columns), function(x) data.frame(pos = col.position[[x]], ind = json.un[col.indexes[[x]]], stringsAsFactors = F))

After this, it is necessary to collapse those variables which contain multiple values per row. Firstly, the function checks whether any of the elements in the list has more rows than there are delimiters (which define the number of rows). If there is any, it uses plyr's ddply() to collapse by row specifier (pos). After this process all the data frames in temp.frames will be of equal length:

    collapse.cols <- which(sapply(temp.frames, nrow) > length(start.ind))
  
  if(length(collapse.cols) > 0){
    temp.frames[collapse.cols] <- lapply(temp.frames[collapse.cols], function(x) 
      ddply(.data = x, .(pos), summarise, value = paste0(ind, collapse = ";")))
    
  }

Finally, all the elements in temp.frames are merged with one another, the column names are assigned and "pos" is dropped, as it does not belong to the final dataset:

  matr <- Reduce(function(...) merge(...,all=T,by="pos"),temp.frames)
  matr$pos <- NULL
  names(matr) <- colnames
  matr <- as.matrix(matr)
  colnames(matr) <- colnames
  return(matr)


Of course, using this function directly is a bit "uncomfortable". For that reason, and depending on your particular needs, you can include tabulateJSON() as part of a higher-level function. In this particular example, the function could receive the names of the players and a list of high-level variables, which are then mapped to particular patterns, for example:

getPlayerInfo <- function(players,variables){
  
  players.info <- lapply(players, function(pl) fromJSON(getURL(paste0("http://dbpedia.org/data/",pl,".json"))))
  
  players.info <- lapply(1:length(players), function(x) players.info[[x]][[paste0("http://dbpedia.org/resource/",players[x])]])
  
  players.unlist <- unlist(players.info)
  
  st.obj <- "^.+surname\\.value$"
  
  columns.to.grep <- paste0(variables,"\\.value$")
  
  #Check if there is a multiple match with different types
  
  col.grep <- lapply(columns.to.grep, grep, x=unique(names(players.unlist)))
  
  columns <- sapply(col.grep, function(x) unique(names(players.unlist))[x[1]])
  
  #Convert names to a regex
  
  columns <- gsub("\\.","\\\\\\.",columns)
  columns <- paste0("^",columns,"$")
  
  players.table <- tabulateJSON(players.unlist,st.obj,columns,variables)
  
  return(players.table)
  
}

So, the call will be:


getPlayerInfo(c("Lionel_Messi","David_Beckham","Zinedine_Zidane","George_Best","George_Weah"),c("fullname","height","dateOfBirth","dateOfDeath","number","goals"))
     fullname                                  height   dateOfBirth       
[1,] "Lionel Andrés Messi"                     "1.69"   "1987-06-24+02:00"
[2,] "David Robert Joseph Beckham"             "1.8288" "1975-05-02+02:00"
[3,] "Zinedine Yazid Zidane"                   "1.85"   "1972-06-23+02:00"
[4,] "George Best"                             "1.524"  "1946-05-22+02:00"
[5,] "George Tawlon Manneh;Oppong Ousman Weah" "1.84"   "1966-10-01+02:00"
     dateOfDeath        number goals                        
[1,] NA                 "10"   "6;5;242"                    
[2,] NA                 NA     "62;2;0;13;18"               
[3,] NA                 NA     "6;37;28;24"                 
[4,] "2005-11-25+02:00" NA     "6;2;3;0;1;15;12;8;33;137;21"
[5,] NA                 NA     "46;47;32;24;14;13;7;5;3;1"  

I hope you enjoyed it and/or found it useful at least 😉

As usual, if you have any comments, suggestions or criticism, please drop me a line.


Quick guide to parallel R with snow

Probably, the most common complaints about R are related to its speed, especially when handling a high volume of information. This is, in principle, true, and stems partly from the fact that R does not run in parallel… unless you tell it to do so!

R offers a wide variety of packages dedicated to parallelisation. After trying several of them, I found snow the best one (for me) to perform classical parallelisable operations. Among the functions that this package provides, I found the parApply family the most useful and the easiest to work with to take advantage of parallelisation.

In order to use these functions, it is first necessary to have a solid knowledge of the apply-like functions in traditional R, i.e., lapply, sapply, vapply and apply. What do they do exactly? In essence, they apply a function over an array of objects.

The example below returns a numeric vector consisting of 3, 4, 5, …, 12 in the variable "result".


my.vec <- 1:10

result <- vapply(my.vec,function(x) x+2,FUN.VALUE=0)

This code returns exactly the same output as the one below


my.vec <- 1:10

result <- numeric(length(my.vec))

for(i in 1:length(my.vec)){

result[i] <- my.vec[i]+2}

Of course, in this example no significant speed difference can be found between the two pieces of code shown. However, even in one-node executions, the first alternative is considerably faster, which can be appreciated when working with larger amounts of data. For a vector of length 1,000,000, the difference is 0.94 vs 3.04 secs.

On the internet, plenty of posts and tutorials about the use of lapply, sapply, vapply and apply can be found, for example here. However, it is not redundant to explain again what each function does:

a) lapply

Applies a function over a vector and returns a list. If lapply were used in the previous example and we wanted to obtain one of the elements, list-like indexing would be needed:


my.vec <- 1:10

result <- lapply(my.vec,function(x) x+2)

#get the third element of result

result[3][[1]]

b) sapply/vapply

These are very similar to lapply but, instead of a list, they return a vector of some type (numeric, character, Date, etc.). sapply will "deduce" the class of the output elements; in vapply, the output type has to be specified. Although sapply is simpler to use, as there is no need to specify the output type, vapply is faster (0.94 secs vs 4.04) and enables the user to control the output type.



my.vec <- 1:1000000

system.time(result <- vapply(my.vec,function(x) x+2,FUN.VALUE=0))

system.time(result <- sapply(my.vec,function(x) x+2))

The FUN.VALUE argument is where the output type of vapply is specified, which is done by passing a "template" the output should conform to. If the output returned by the function does not match the specified type, R throws an error:


#These 2 alternatives are equivalent

vapply(letters[1:20], function(x) paste0(x,"_concat"),FUN.VALUE="")
vapply(letters[1:20], function(x) paste0(x,"_concat"),FUN.VALUE="aaa")

#This throws an error

vapply(letters[1:20], function(x) paste0(x,"_concat"),FUN.VALUE=0)

c) apply

apply() is used to apply a function over a matrix row- or column-wise, which is specified via its MARGIN argument: 1 for rows and 2 for columns.


#Some random dataframe
ex.df <- data.frame(a=seq(1,100,1),b=seq(10,1000,10),c=runif(100))

#Apply rowwise: The first element of each row plus the second, multiplied by the third

aa <- apply(ex.df,1, function(x) (x[1]+x[2])*x[3])

#Apply columnwise. Extract the mean of each column

bb <- apply(ex.df,2, mean)

Tip: You may have noticed that you can write apply-like calls with a function(…) wrapper or without it. Usually you use "function(…) + a function" when the elements of the object you are passing have to be treated differently (like in the row-wise apply example). If that is not the case, you can just pass the name of the function, like in the column-wise apply example. Other functions you could pass in the same way are median, max, min and quantile, among others.

Parallelisation

If the use of the apply functions is clear, then parallelisation is just one small step beyond with snow. The function equivalents are:


Base R               snow
lapply               parLapply
sapply               parSapply
vapply
apply (row-wise)     parRapply, parApply(, 1)
apply (column-wise)  parCapply, parApply(, 2)

The only thing you really have to keep in mind is that, when parallelising, you have to explicitly export/declare to each node everything needed to perform the parallel operation. This is done mainly with the clusterExport() and clusterEvalQ() functions. This is very important because code that runs fine as one-node traditional R may then fail when executed in parallel simply because these objects are missing on the nodes.

From apply() rowwise to parRapply()

Below you will find a small example, very similar to the row-wise apply one above, illustrating the small changes/additions needed in order to run your code in parallel.


#Return if the result of (x[1]+x[2])*x[3] is greater than 20 or not

# The same df as before

ex.df <- data.frame(a=seq(1,100,1),b=seq(10,1000,10),c=runif(100))

# Define the threshold

ths <- 20

# These 2 statements in Base R are equivalent

aa <- apply(ex.df,1, function(x) (x[1]+x[2])*x[3] > 20)
aa <- apply(ex.df,1, function(x) (x[1]+x[2])*x[3] > ths)


### Equivalent parallel execution ###

library(snow)

# Declare the cluster object. Here we use the default settings (SOCK)
# and the number of nodes is specified by the number given

clus <- makeCluster(3)

# The equivalent for the first alternative would be very easy

aa <- parRapply(clus,ex.df, function(x) (x[1]+x[2])*x[3] > 20)

#However, if the variable "ths" needs to be used, a line has to be added

clusterExport(clus,"ths")
aa <- parRapply(clus,ex.df, function(x) (x[1]+x[2])*x[3] > ths)

The clusterExport() function exports an object to each node, enabling them to work in parallel. Its use, as can be appreciated, is extremely simple: you pass the variable name(s) as a character vector (or a single string, as in this case).

To conclude, imagine you need to apply a custom function to each row of your data frame. Following the same example, in base R it would be:


#Declare the function

custom.function <- function(a,b,c){
result <- (a+b)*c
return(result)}


ex.df <- data.frame(a=seq(1,100,1),b=seq(10,1000,10),c=runif(100))

#Apply the declared function

aa <- apply(ex.df,1, function(x) custom.function(x[1],x[2],x[3]))


To perform the same action in parallel with snow, you can declare the function inside a clusterEvalQ() statement or declare it in the global workspace and then export it. If you use clusterEvalQ() you will not see the function in your workspace.


#Create cluster

clus <- makeCluster(3)

#Option 1. Declare the function for each node

clusterEvalQ(clus, custom.function <- function(a,b,c){
result <- (a+b)*c
return(result)})

#Option 2. Export it from the base workspace

custom.function <- function(a,b,c){
result <- (a+b)*c
return(result)}

clusterExport(clus,"custom.function")

ex.df <- data.frame(a=seq(1,100,1),b=seq(10,1000,10),c=runif(100))

#Apply the declared function

aa <- parRapply(clus,ex.df, function(x) custom.function(x[1],x[2],x[3]))

#Stop the cluster when you are done
stopCluster(clus)


Performance differences

Of course, the main goal of parallelisation is to reduce execution times. However, it does not make much sense to use it over a small object (a small list, data frame, etc.), as parallelisation requires time to distribute the task among the nodes and to collect the results from them. This overhead is not significant when working with large volumes of data, but for small datasets the overall execution time of a parallel run might be greater than doing the work on just one core.

Although the execution time decreases considerably, it should not be assumed that it will decrease proportionally to the number of nodes, i.e., if the execution time on one node is 30 seconds, it will not drop to 10 seconds with 3 nodes; it will probably be somewhat higher than that.

These are the execution times for 1 and 3 nodes for the example above with a data frame of 1M and 3M rows.


# of Rows    1 Node        3 Nodes
1M           18.81 secs    10.90 secs
3M           65.67 secs    40.26 secs

Feel free to write any questions, suggestions, comments, etc.! I hope you liked it and found it useful.


Dashboards in R with Shiny and GoogleVis

[Screenshot: the EAHU dashboard]

In a previous post, I briefly introduced some features of the Shiny package. However, I felt the need to write a new post about dashboards for several reasons:

a) Shiny has changed most of its functions, so the previous post is outdated
b) In this case, the outputs are built with googleVis, which is a very interesting package in itself, so it is worth dedicating some lines to it
c) I finally got space on RStudio's hosting service to upload my dashboard, so you can see how it works live here. IMPORTANT: Ignore the error message you get at the beginning

Context:

EAHU is an acronym for "Encuesta Anual de Hogares Urbanos" (in English, Yearly Survey of Urban Homes), a survey carried out by the official statistics institute in Argentina, whose microdata is available here. The dashboard's raw data consists of a processed version of the 2010 to 2012 results.

How it works:

Although the filters are almost self-explanatory, it is necessary to highlight that the first chart (in the Income tab) is not affected by the year filter, as the plot itself "slices" the data by year.

Bugs:

During the set-up, I had several issues with the encoding (Spanish uses many multi-byte characters, such as á, é or ñ) and decided to replace them with ASCII characters. The problem that arose is that Google did not recognise 3 of the 5 provinces that have a special character in their names (Neuquén, Río Negro and Tucumán; the first two are under "Patagonia" and the last one is under "NOA"), so you will never see them plotted on the map, although they definitely appear in the rest of the charts.

Before deep diving into the code, I recommend exploring the panel first.

The code:

The code consists mainly of three files, the standard ones for any dashboard written in Shiny: global.R, ui.R and server.R. You will see that global.R is not mandatory for a dashboard and, as will be explained later, is not strictly necessary in this case.

global.R



library(googleVis)
library(reshape)
library(shiny)
library(reldist)
setwd("~/ShinyApps/EAHU/")


raw.data <- read.csv2("molten data_2.csv")

raw.data <- subset(raw.data,value > 0 )

As is perfectly explained here under "Global Objects", this code could have been inserted in server.R outside the shinyServer() function, as the ui.R file does not have any component that depends on an external source. In this case, global.R is indeed not necessary because the source file is completely static and consequently there is no need for ui.R to reference an external source that must be loaded before the dashboard is displayed. However, global.R can be extremely useful when working, for instance, with data coming from the internet, which might change more often and to which the ui.R file should "adapt" itself, especially when that source is also used by server.R. In this case, as the raw data is completely static, the ui.R filters are completely hard-coded. On the contrary, if your data comes from a SQL query, for example, where your filter values change frequently, the best option is to do all the possible processing of the data in global.R so that ui.R and server.R can use it afterwards.
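
As a rough sketch of that dynamic scenario (the connection, table and column names below are hypothetical and not part of this dashboard):

#global.R (sketch): load shared data once so both ui.R and server.R can use it
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "survey.db")   #hypothetical database
raw.data <- dbGetQuery(con, "SELECT * FROM eahu_survey")
dbDisconnect(con)

#Pre-compute the filter choices that ui.R will offer
region.choices <- sort(unique(raw.data$Region))
year.choices <- sort(unique(raw.data$Anio))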

UI.R


library(shiny)

shinyUI(pageWithSidebar(
  # Application title
  headerPanel("EAHU Panel"),
  
  sidebarPanel( 
    numericInput("minedad", "Minimum Age",-1,-1,99),
    numericInput("maxedad", "Maximum Age",99,-1,99),
    checkboxGroupInput("gender","Gender",c(
      "Male"="Varon",
      "Female"="Mujer")),
    checkboxGroupInput("region","Region",c(
      "NOA"="NOA",
      "NEA"="NEA",
      "Cuyo"="Cuyo",
      "CABA"="CABA",
      "Buenos Aires"="BA",
      "Centro"="Centro",
      "Patagonia"="Patagonia")),
    checkboxGroupInput("year","Year",c(
      "2010"="2010",
      "2011" ="2011",
      "2012" = "2012")),
    submitButton(text="Ready")),
  
  mainPanel(
    tabsetPanel(
      tabPanel("Income",h4("Yearly income"),htmlOutput("motionchart")),
      tabPanel("Demography",h4("Ages by provinces"),htmlOutput("edades")),
      tabPanel("Occupation",h4("Occupation by province (in %)"),htmlOutput("tabla1"),
               h4("Activity of the occupied people (in %)"),htmlOutput("tabla2")),
      tabPanel("Education",htmlOutput("linech"))
    )
  )
))

The ui.R file can always be "split" into two main parts: where you give the input (filters) and where you see the output (charts, tables, etc.).

This is the “input” part


library(shiny)

shinyUI(pageWithSidebar(
  # Application title
  headerPanel("EAHU Panel"),
  
  sidebarPanel( 
    numericInput("minedad", "Minimum Age",-1,-1,99),
    numericInput("maxedad", "Maximum Age",99,-1,99),
    checkboxGroupInput("gender","Gender",c(
      "Male"="Varon",
      "Female"="Mujer")),
    checkboxGroupInput("region","Region",c(
      "NOA"="NOA",
      "NEA"="NEA",
      "Cuyo"="Cuyo",
      "CABA"="CABA",
      "Buenos Aires"="BA",
      "Centro"="Centro",
      "Patagonia"="Patagonia")),
    checkboxGroupInput("year","Year",c(
      "2010"="2010",
      "2011" ="2011",
      "2012" = "2012")),
    submitButton(text="Ready")),
  

and this one is the output:



    mainPanel(
    tabsetPanel(
      tabPanel("Income",h4("Yearly income"),htmlOutput("motionchart")),
      tabPanel("Demography",h4("Ages by provinces"),htmlOutput("edades")),
      tabPanel("Occupation",h4("Occupation by province (in %)"),htmlOutput("tabla1"),
               h4("Activity of the occupied people (in %)"),htmlOutput("tabla2")),
      tabPanel("Education",htmlOutput("linech"))
    )
  )
))

As was explained in this post, Shiny works with an input-output logic based on two files that, unlike global.R, are mandatory for the whole dashboard to work as expected. In this sense, it is essential to keep a clear idea of what a dashboard does and how the information flows across the files to fulfill its purpose. Basically, it can be separated into three steps:

1) Some data is given (i.e., filters are selected, terms/numbers are entered, files are uploaded, etc.)
2) The data is processed according to the input given in 1) and results are produced (outputs)
3) The results are shown

In Shiny, this could be divided in:

1) ui.R ("input" part)
2) server.R
3) ui.R ("output" part)

Parts 1 and 3 were explained before, and they are clearly easier than server.R where the most important work takes place. Firstly, an overview of the whole code.

server.R


library(shiny)


shinyServer(function (input, output) {

data <- reactive({
a <- subset(raw.data, Region %in% input$region & ch06 >= input$minedad &
ch06 <= input$maxedad & ch04 %in% input$gender & Anio %in% input$year)
a <- droplevels(a)
return(a)
})

data.mc <- reactive({
b <- subset(raw.data, Region %in% input$region & ch06 >= input$minedad &
ch06 <= input$maxedad & ch04 %in% input$gender)
b <- droplevels(b)
return(b)
})

output$motionchart <- renderGvis({

cast.1 <- as.data.frame(cast(data.mc(),jurisdiccion + Anio ~ variable, c(mean,sd)))

cast.1 <- cbind(cast.1,CV=cast.1[,4]/cast.1[,3])

ginis <- numeric()
for(i in 1:nrow(cast.1)){
ss <- subset(data.mc(),data.mc()[,1] == cast.1[i,1] & Anio == cast.1[i,2],select=value)
ginis <- append(ginis,gini(ss[,1]))
}


cast.1 <- cbind(cast.1,ginis)

return (gvisMotionChart(cast.1,"jurisdiccion",timevar="Anio",xvar="ipcf_mean",yvar="ginis",date.format="%Y"))

})

output$edades <- renderGvis({

cast.map <- aggregate(data()[,4],list(Jurisdiccion=data()$jurisdiccion),mean)
names(cast.map)[2] <- "promedio"
cast.map$promedio <- round(cast.map$promedio,3)

map <- gvisGeoChart(cast.map,colorvar="promedio",locationvar="Jurisdiccion",
options=list(region="AR",displayMode="regions", resolution="provinces",
title="Average age per Province",
titlePosition='out',
titleTextStyle="{color:'black',fontName:'Courier',fontSize:14}",
height=500, width=400))

quant<-cut(data()$ch06,quantile(data()$ch06,(0:4)/4),include.lowest=TRUE,na.rm=TRUE)
sset <- cbind(data(),quant)
tab <- with(sset,prop.table(table(jurisdiccion,quant),1))
tab <- as.data.frame.matrix(tab)
tab <- tab*100
tab <- round(tab,2)
tab <- cbind(row.names(tab),tab)
colnames(tab)[1] <- "Provincia"
bar.pl <- gvisColumnChart(tab,xvar=colnames(tab)[1],yvar=colnames(tab)[2:5],
options=list(title="Age cuartiles per province (in %)",
titlePosition='out',
hAxis="{slantedText:'true',slantedTextAngle:45}",
titleTextStyle="{color:'black',fontName:'Courier',fontSize:14}",
height=500, width=400))

merge <- gvisMerge(map,bar.pl,horizontal=TRUE,tableOptions="cellspacing=10")

return(merge)

})

output$tabla1 <- renderGvis({

t1 <- with(data(), prop.table(table(jurisdiccion,ocupacion),1))
t1 <- as.data.frame.matrix(t1)
t1 <- t1*100
t1 <- round(t1,1)
t1 <- cbind(row.names(t1),t1)
colnames(t1)[1] <- "Provincia"
t1.pl <- gvisTable(t1,options=list(page='enable',height=300,width=800))
return(t1.pl)
})

output$tabla2 <- renderGvis({
t2 <- with(data(), prop.table(table(jurisdiccion,caes_recode2),1))
t2 <- as.data.frame.matrix(t2)
t2 <- t2*100
t2 <- round(t2,1)
t2 <- cbind(row.names(t2),t2)
colnames(t2) <- tolower(colnames(t2))
colnames(t2)[1] <- "Provincia"
t2.pl <- gvisTable(t2,options=list(page='enable',height=300,width=800))

return(t2.pl)
})

output$linech <- renderGvis({
tab <- with(data(),prop.table(table(jurisdiccion,nivel_ed),1))
tab <- as.data.frame.matrix(tab)
tab <- tab*100
tab <- round(tab,2)
tab <- cbind(row.names(tab),tab)
colnames(tab)[1] <- "Provincia"
tab <- tab[,-grep("Sin instruccion",colnames(tab))]
order <- c("Provincia","Primaria Incompleta","Primaria Completa","Secundaria Incompleta","Secundaria Completa",
"Universitaria Incompleta", "Universitaria Completa")

index <- lapply(order, function(x) grep(x,names(tab)))

tab <- tab[,c(as.numeric(index))]

tab.s <- cbind(tab[,1],sum(tab[,2:7]),sum(tab[,3:7]),sum(tab[,4:7])
,sum(tab[,5:7]),sum(tab[,6:7]),sum(tab[,7]))

tab.2 <- data.frame(rep(0,nrow(tab)),rep(0,nrow(tab)),rep(0,nrow(tab))
,rep(0,nrow(tab)),rep(0,nrow(tab)),rep(0,nrow(tab)))

names(tab.2) <- names(tab)[2:7]

for ( i in 1:6){
for (j in 1:6){
if(i >= j){
tab.2[,j] <- tab.2[,j]+tab[,i+1]
} 
}
}
tab.pl <- as.data.frame(t(tab.2))
colnames(tab.pl) <- tab$Provincia
tab.pl <- cbind(Nivel_ed=names(tab)[2:7],tab.pl)

#### Area under curve ####

areas <- rbind(rep(100,ncol(tab.pl)),tab.pl)

areas[,2:ncol(areas)] <- areas[,2:ncol(areas)]/100

ed.coef <- (areas[1,-c(1)]+areas[2,-c(1)])/2
limit <- nrow(areas)-1

for (i in 2:limit){
j <- i +1
ed.coef <- ed.coef + (areas[i,-c(1)]+areas[j,-c(1)])/2
}

ed.coef <- ed.coef/limit

ed.coef <- t(ed.coef)
ed.coef <- round(ed.coef,4)
ed.coef <- cbind(colnames(areas)[-c(1)],ed.coef)

ed.coef <- as.data.frame(ed.coef)
colnames(ed.coef)<-c("Provincia","Education Completeness")

#### Plots ###

line.pl <- gvisLineChart(tab.pl,xvar=colnames(tab.pl)[1],yvar=colnames(tab.pl)[-1],
options=list(
title="Education Curve",
titlePosition='out',
hAxis="{slantedText:'true',slantedTextAngle:45}",
titleTextStyle="{color:'black',fontName:'Courier'}",
legend="{color:'black',fontName:'Courier'}",
fontSize="10",
chartArea="{left:40,top:30,width:'70%',height:'75%'}",            
height=550, width=600))

t1.ed <- gvisTable(ed.coef,
options=list(page='enable',fontSize="10",height=300,width=275))

ed.output <- gvisMerge(line.pl,t1.ed,horizontal=TRUE)

return(ed.output)
})

outputOptions(output, "motionchart", suspendWhenHidden = FALSE)
outputOptions(output, "edades", suspendWhenHidden = FALSE)
outputOptions(output, "tabla1", suspendWhenHidden = FALSE)
outputOptions(output, "tabla2", suspendWhenHidden = FALSE)
outputOptions(output, "linech", suspendWhenHidden = FALSE)
})

180 lines of code at once is surely too much information to grasp the idea underlying this dashboard and its implementation. Below, you will find a step by step explanation of each part.

1) Data creation:


data <- reactive({
 a <- subset(raw.data, Region %in% input$region & ch06 >= input$minedad &
 ch06 <= input$maxedad & ch04 %in% input$gender & Anio %in% input$year)
 a <- droplevels(a)
 return(a)
 })
 
 data.mc <- reactive({
 b <- subset(raw.data, Region %in% input$region & ch06 >= input$minedad &
 ch06 <= input$maxedad & ch04 %in% input$gender)
 b <- droplevels(b)
 return(b)
 })

Two different datasets are created; the first one is used by all the plots except the motion chart, which uses the second one. The main difference is that the motion chart does not take the year filter into account, because the chart itself can "slice" the data per year. Although it might not be neat to use two different subsets for the plots displayed in the same dashboard, I think that in the case of the motion chart the year filter does not really play any role, because the output itself enables the user to perform the same action.

2) Motionchart:

output$motionchart <- renderGvis({

cast.1 <- as.data.frame(cast(data.mc(),jurisdiccion + Anio ~ variable, c(mean,sd)))

cast.1 <- cbind(cast.1,CV=cast.1[,4]/cast.1[,3])

ginis <- numeric()
for(i in 1:nrow(cast.1)){
ss <- subset(data.mc(),data.mc()[,1] == cast.1[i,1] & Anio == cast.1[i,2],select=value)
ginis <- append(ginis,gini(ss[,1]))
}


cast.1 <- cbind(cast.1,ginis)

return (gvisMotionChart(cast.1,"jurisdiccion",timevar="Anio",xvar="ipcf_mean",yvar="ginis",date.format="%Y"))

})

The first of the plots created with the googleVis package. When I first got in touch with it, it surprised me how it managed to plot 6 variables (i.e. the vertical and horizontal axis, the time, the name of the elements, the size and the colour of the balls) in a very clear way. Moreover, it has a menu where you can change to timeseries or barplots, so I would say it is 3 plots at the same time.

For this output, you can see that the data is processed using the cast() function, which belongs to the reshape package. This is because the uploaded data was previously "molten". Equivalent results could have been obtained with the use of aggregate(), which is in fact what other outputs use. This function is used to obtain a data frame with the mean income and its standard deviation.
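
For instance, a roughly equivalent aggregation (just a sketch, using this dataset's column names) would be:

#Mean and standard deviation of income per province and year with aggregate()
inc.stats <- aggregate(value ~ jurisdiccion + Anio, data = data.mc(),
                       FUN = function(x) c(mean = mean(x), sd = sd(x)))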

Then, the coefficient of variation (standard deviation/mean) is added. Probably, these 3 measures could have been obtained with any visualization-specific software, such as Tableau or QlikView. However, is it possible to calculate the Gini coefficient dynamically? If it is, it probably requires programming the whole integral calculation. In R, however, you can do it in just 4 lines using the reldist package.

Finally, a small data frame with all the info is built and plotted with the gvisMotionChart() function (see the documentation for more details).

Please notice that you cannot call the reactive functions that filter the data, store the result in an object at the top level and then use it across the output functions; you will get an error. This is because reactive expressions can only be evaluated inside a reactive context (such as a render function), even though they are all contained in shinyServer().

DON'T


library(shiny)


shinyServer(function (input, output) {
  
  data <- reactive({
    a <- subset(raw.data, Region %in% input$region & ch06 >= input$minedad &
                  ch06 <= input$maxedad & ch04 %in% input$gender & Anio %in% input$year)
    a <- droplevels(a)
    return(a)
  })

subset1 <- data()

 output$motionchart <- renderPlot({

table(subset1)    
    })
    

3) Map creation

output$edades <- renderGvis({

cast.map <- aggregate(data()[,4],list(Jurisdiccion=data()$jurisdiccion),mean)
names(cast.map)[2] <- "promedio"
cast.map$promedio <- round(cast.map$promedio,3)

map <- gvisGeoChart(cast.map,colorvar="promedio",locationvar="Jurisdiccion",
options=list(region="AR",displayMode="regions", resolution="provinces",
title="Average age per Province",
titlePosition='out',
titleTextStyle="{color:'black',fontName:'Courier',fontSize:14}",
height=500, width=400))

This map displays the average age of each province given the selected filters. gvisGeoChart() enables you to work with names that refer to geographical items (i.e., states, countries, continents, etc.) or with coordinates. In this case, it was considerably easier to work with names. The main drawback, due to the encoding issue mentioned, was that some of the provinces were not identified and were not included in the plot. In addition, I could not manage to add a title to it; it seems the geo chart does not support titles.

4) Quartiles barplot

quant<-cut(data()$ch06,quantile(data()$ch06,(0:4)/4),include.lowest=TRUE,na.rm=TRUE)
sset <- cbind(data(),quant)
tab <- with(sset,prop.table(table(jurisdiccion,quant),1))
tab <- as.data.frame.matrix(tab)
tab <- tab*100
tab <- round(tab,2)
tab <- cbind(row.names(tab),tab)
colnames(tab)[1] <- "Provincia"
bar.pl <- gvisColumnChart(tab,xvar=colnames(tab)[1],yvar=colnames(tab)[2:5],
options=list(title="Age cuartiles per province (in %)",
titlePosition='out',
hAxis="{slantedText:'true',slantedTextAngle:45}",
titleTextStyle="{color:'black',fontName:'Courier',fontSize:14}",
height=500, width=400))

The main advantage of this barplot is that the quartiles are built dynamically based upon the filters applied. The options documentation for the gvis plots is not very good within R itself; in my opinion, the best approach is to go to the Google Charts documentation, where you can see what is editable and what is not. Please note that you will have to adapt what is written there to the R syntax.

5) Tables

output$tabla1 <- renderGvis({

t1 <- with(data(), prop.table(table(jurisdiccion,ocupacion),1))
t1 <- as.data.frame.matrix(t1)
t1 <- t1*100
t1 <- round(t1,1)
t1 <- cbind(row.names(t1),t1)
colnames(t1)[1] <- "Provincia"
t1.pl <- gvisTable(t1,options=list(page='enable',height=300,width=800))
return(t1.pl)
})

output$tabla2 <- renderGvis({
t2 <- with(data(), prop.table(table(jurisdiccion,caes_recode2),1))
t2 <- as.data.frame.matrix(t2)
t2 <- t2*100
t2 <- round(t2,1)
t2 <- cbind(row.names(t2),t2)
colnames(t2) <- tolower(colnames(t2))
colnames(t2)[1] <- "Provincia"
t2.pl <- gvisTable(t2,options=list(page='enable',height=300,width=800))

return(t2.pl)
})

This is the code to generate tables with the googleVis package. For this kind of visualization, it might be more helpful than a normal R table because the rows can be sorted by any of the columns (in ascending or descending order) by just clicking on the column header.

6) Education Curve

output$linech <- renderGvis({
tab <- with(data(),prop.table(table(jurisdiccion,nivel_ed),1))
tab <- as.data.frame.matrix(tab)
tab <- tab*100
tab <- round(tab,2)
tab <- cbind(row.names(tab),tab)
colnames(tab)[1] <- "Provincia"
tab <- tab[,-grep("Sin instruccion",colnames(tab))]
order <- c("Provincia","Primaria Incompleta","Primaria Completa","Secundaria Incompleta","Secundaria Completa",
"Universitaria Incompleta", "Universitaria Completa")

index <- lapply(order, function(x) grep(x,names(tab)))

tab <- tab[,c(as.numeric(index))]

tab.s <- cbind(tab[,1],sum(tab[,2:7]),sum(tab[,3:7]),sum(tab[,4:7])
,sum(tab[,5:7]),sum(tab[,6:7]),sum(tab[,7]))

tab.2 <- data.frame(rep(0,nrow(tab)),rep(0,nrow(tab)),rep(0,nrow(tab))
,rep(0,nrow(tab)),rep(0,nrow(tab)),rep(0,nrow(tab)))

names(tab.2) <- names(tab)[2:7]

for ( i in 1:6){
for (j in 1:6){
if(i >= j){
tab.2[,j] <- tab.2[,j]+tab[,i+1]
} 
}
}
tab.pl <- as.data.frame(t(tab.2))
colnames(tab.pl) <- tab$Provincia
tab.pl <- cbind(Nivel_ed=names(tab)[2:7],tab.pl)

#### Area under curve ####

areas <- rbind(rep(100,ncol(tab.pl)),tab.pl)

areas[,2:ncol(areas)] <- areas[,2:ncol(areas)]/100

ed.coef <- (areas[1,-c(1)]+areas[2,-c(1)])/2
limit <- nrow(areas)-1

for (i in 2:limit){
j <- i +1
ed.coef <- ed.coef + (areas[i,-c(1)]+areas[j,-c(1)])/2
}

ed.coef <- ed.coef/limit

ed.coef <- t(ed.coef)
ed.coef <- round(ed.coef,4)
ed.coef <- cbind(colnames(areas)[-c(1)],ed.coef)

ed.coef <- as.data.frame(ed.coef)
colnames(ed.coef)<-c("Provincia","Education Completeness")

#### Plots ###

line.pl <- gvisLineChart(tab.pl,xvar=colnames(tab.pl)[1],yvar=colnames(tab.pl)[-1],
options=list(
title="Education Curve",
titlePosition='out',
hAxis="{slantedText:'true',slantedTextAngle:45}",
titleTextStyle="{color:'black',fontName:'Courier'}",
legend="{color:'black',fontName:'Courier'}",
fontSize="10",
chartArea="{left:40,top:30,width:'70%',height:'75%'}",            
height=550, width=600))

t1.ed <- gvisTable(ed.coef,
options=list(page='enable',fontSize="10",height=300,width=275))

ed.output <- gvisMerge(line.pl,t1.ed,horizontal=TRUE)

return(ed.output)
})

Finally, the education curve is based upon the idea that the maximum education level achieved implies having completed all prior stages. For instance, someone whose education level is "Complete High School" has also achieved primary incomplete, primary complete, high school incomplete and high school complete.

Based upon this principle, you can plot a descending curve from the lowest to the highest level of education and graphically compare the drops in each province.

In addition, you can build a measure that calculates the area under those curves, which can support a graphical hypothesis with an empiric “hard” measure.

As the calculations are done with the subsetted data, the plot is entirely dynamic. This enables you to see how the curves/measures change over time or with the age of the population.

Conclusion

The last tab is a perfect example of the power of R. Far from being a mere statistics package, R is a tool in constant expansion that can perfectly replace, and even beat, the most widely used commercial specialized software. In this case, for instance, it has been shown that R can perform BI tasks with much more flexibility and potential than most of the software dedicated to that. In this context, it is just a question of time before old and expensive software in the "data world" is definitively replaced by free and open source tools like R, among many others.


A new Sudoku Solver in R. Part 1

Sudoku is nowadays probably the most widespread puzzle game in the world. As such, it has an interesting variety of solving techniques, not just with paper and pencil but also with computers.

Of course, I am not the first one to tackle this problem. On the internet, there is a huge variety of solutions in multiple languages. In R, there is even a package dedicated exclusively to Sudokus.

In this first part, the solution presented only solves puzzles that always have at least one empty cell with just one possible value. However, there are some puzzles where this is not enough, i.e., a point is reached where no empty cell has just one possible value, which results in the need to implement other strategies (to be treated in future posts).

This first part covers just the cases where there is always at least one empty cell with a unique possible value that can be filled. I tried to do it in as "human-like" a way as possible. Consequently, the following approach contains no brute-force code.

Therefore, I evaluated the ways in which I solve the puzzles without a computer and came up with four general ways of identifying "sure" values:

1) A cell has just one possible value

sudoku ex 1

As you can see, the circled cell can only be a 2 as 4 and 7 are already in the same sector, 3 and 5 are present in the same row and 1,6,8 and 9 appear in the same column

2) A specific value is only valid in one cell of a certain row

sudoku ex 3

The red circled cell is the only one in its row that can be a 3, because the other three empty cells in the row belong to a sector where there is already a 3.

3) A specific value is only valid in one cell of a certain column

sudoku ex 2

The red circled cell is the only one in its column that can be a 9, as, from the other three empty cells, the first one from above has a 9 in the same sector and the last two ones have a 9 in the same row.

4) A specific value is only valid in one cell of a certain sector

sudoku ex 4

The red circled cell is the only one in that sector that can be a 1, as the blue-squared 1s above prevent all the empty cells in that column that also belong to the sector in question from being a 1. Additionally, the other squared 1 in the bottom-right sector does not allow the centre-left empty cell of the bottom-left sector to be a 1.

Below, you will find the code with a step-by-step explanation; it basically implements, iteratively, a series of functions that correspond to the four situations mentioned above.

1) Matrix generation

#data entry option 1: Create a 9x9 matrix and enter numbers manually

data <- matrix(nrow=9, ncol=9)

data.entry (data)

#data entry option 2: Fetch a Sudoku from sudoku.org.uk with the sudoku package

library(sudoku)

data <- fetchSudokuUK(date)

#data entry option 3: Generate a Sudoku with the sudoku package, where n is the number of blank cells

data <- generateSudoku(n)


#matrix set-up


datavalue <- array(dim=c(9,9,10))

data[data==0]<-NA

datavalue [,,1] <- data

datavalue [,,2:10] <- 1

As you can see, the puzzle (no matter how it was generated) is put into an array of 9x9x(9+1). As you can deduce from the code, the puzzle lies on the 1st z index, and slices 2 to 10 are filled with the number 1. These z = 2:10 slices store the “status” of a specific number (z-1) in a specific cell, where 1 indicates that the cell can be z-1, 0 that it cannot, and 2 that it is z-1.

For instance, if datavalue[2,3,4] is 1, it means that cell [2,3] in the puzzle can be a 3. If it is 2, it means that cell [2,3] is definitely a 3.

The NA transformation is only necessary if your original input data (i.e., the puzzle) uses 0 for empty cells.
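As a quick illustration of this encoding (my own snippet, not part of the solver), once the update functions of the next step have been run you can read the candidate digits of any empty cell directly from the cube:


#Digits still possible for cell [2,3]: layer z of the cube corresponds to digit z-1
which(datavalue[2,3,2:10] == 1)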

2) Update of the cube (z indexes 2 to 10)

#update cube

#For each cell already filled in the puzzle, set all of its candidate layers to 0 and mark the actual value with 2
updatearray <- function (x){
  for (i in 1:9)
  {
    for (j in 1:9)
    {
      
      if (!is.na(x[i,j,1])){
        x [i,j,2:10] = 0
        x [i,j,x[i,j,1]+1] = 2
      }
      
    }
  }
  
  return (x)
}
#Rule a filled cell's value out for its entire row and column
rowcolelim <- function(x) {
  for (i in 1:9)
  {
    for (j in 1:9)
    {
      
      if (!is.na(x[i,j,1])){
        x [i,,x[i,j,1]+1] = 0
        x [,j,x[i,j,1]+1] = 0
        x [i,j,x[i,j,1]+1] = 2
      }
      
    }
  }
  return (x)
}
sectorelim <- function(x) {
  cuts <- c(1,4,7)
  for (i in 1:9)
  {
    for (j in 1:9)
    {
      
      if (!is.na(x[i,j,1])){
        #Identify the 3x3 sector that contains cell [i,j]
        rindex <- findInterval (i,cuts)
        cindex <- findInterval (j,cuts)
        rows <- (rindex - 1) * 3 + 1:3
        cols <- (cindex - 1) * 3 + 1:3
        #Rule the value out for the whole sector, then mark the cell itself
        x [rows, cols, x[i,j,1]+1] <- 0
        x [i,j,x[i,j,1]+1] <- 2
      }
      
    }
  }
  return (x)
}

These three functions update the underlying cube according to the values in the z = 1 layer, following the convention described above ([x,y,z] = 0 means that [x,y] is NOT z-1, [x,y,z] = 1 that it can be z-1, and [x,y,z] = 2 that it IS z-1).

They are executed at the beginning with the initial puzzle configuration and re-executed whenever an update is made to the z = 1 layer.
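A handy way to track progress at any point after these functions have run is to count how many cells of the z = 1 layer are still empty (a small check of my own, not part of the original script):


#Number of cells still to be filled
sum(is.na(datavalue[,,1]))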

3) Looking for the missing values

a) A cell has just one possible value


uniqueoption <- function (x){
  for (i in 1:9)
  {
    for (j in 1:9)
    {
      
      if (is.na(x[i,j,1]) && sum(x[i,j,2:10]) == 1){
        x[i,j,1] <- grep (1,x[i,j,2:10])
        assign("flag", 1, envir = .GlobalEnv)
      }
      
    }
  }
  
  return (x)
}

b) A specific value is only valid in one cell of a certain row

uniqueinrow <- function (x){
  for (l in 2:10){
    if (length(grep(2,x[,,l])) < 9){
      for (i in 1:9){
        if (length(grep(1,x[i,,l])) == 1){
          x[i,grep(1,x[i,,l]),1] <- (l-1)
          assign("flag", 1, envir = .GlobalEnv)
        }
      }
    }
  }
  return (x)
}

c) A specific value is only valid in one cell of a certain column


uniqueincol <- function (x){
  for (l in 2:10){
    if (length(grep(2,x[,,l])) < 9){
      for (j in 1:9){
        if (length(grep(1,x[,j,l])) == 1){
          x[grep(1,x[,j,l]),j,1] <- (l-1)
          assign("flag", 1, envir = .GlobalEnv)
        }  
      }
    } 
  }
  return (x)
}

d) A specific value is only valid in one cell of a certain sector

getCoord <- function(x){
  #Convert a column-major index (1 to 9) within a 3x3 sector into row/column coordinates
  coord <- c(((x - 1) %% 3) + 1, ((x - 1) %/% 3) + 1)
  return (list(xx=coord[1],y=coord[2]))
}
uniqueinsector1 <- function(x) {
  for (l in 2:10){
    if (length(grep(2,x[,,l])) < 9){
      if (length(grep(1,x[1:3,1:3,l])) == 1){
        grepnr <- grep(1,x[1:3,1:3,l])
        x[getCoord(grepnr)$xx,getCoord(grepnr)$y,1] <- (l-1)
        assign("flag", 1, envir = .GlobalEnv)
      }  
    }
  }
  return (x)}
uniqueinsector2 <- function(x) {
  for (l in 2:10){
    if (length(grep(2,x[,,l])) < 9){
      if (length(grep(1,x[4:6,1:3,l])) == 1){
        grepnr <- grep(1,x[4:6,1:3,l])
        x[getCoord(grepnr)$xx+3,getCoord(grepnr)$y,1] <- (l-1)
        assign("flag", 1, envir = .GlobalEnv)
      }  
    }
  }
  return (x)}
uniqueinsector3 <- function(x) {
  for (l in 2:10){
    if (length(grep(2,x[,,l])) < 9){
      if (length(grep(1,x[7:9,1:3,l])) == 1){
        grepnr <- grep(1,x[7:9,1:3,l])
        x[getCoord(grepnr)$xx+6,getCoord(grepnr)$y,1] <- (l-1)
        assign("flag", 1, envir = .GlobalEnv)
      }  
    }
  }
  return (x)}
uniqueinsector4 <- function(x) {
  for (l in 2:10){
    if (length(grep(2,x[,,l])) < 9){
      if (length(grep(1,x[1:3,4:6,l])) == 1){
        grepnr <- grep(1,x[1:3,4:6,l])
        x[getCoord(grepnr)$xx,getCoord(grepnr)$y+3,1] <- (l-1)
        assign("flag", 1, envir = .GlobalEnv)
      }  
    }
  }
  return (x)}
uniqueinsector5 <- function(x) {
  for (l in 2:10){
    if (length(grep(2,x[,,l])) < 9){
      if (length(grep(1,x[4:6,4:6,l])) == 1){
        grepnr <- grep(1,x[4:6,4:6,l])
        x[getCoord(grepnr)$xx+3,getCoord(grepnr)$y+3,1] <- (l-1)
        assign("flag", 1, envir = .GlobalEnv)
      }  
    }
  }
  return (x)}
uniqueinsector6 <- function(x) {
  for (l in 2:10){
    if (length(grep(2,x[,,l])) < 9){
      if (length(grep(1,x[7:9,4:6,l])) == 1){
        grepnr <- grep(1,x[7:9,4:6,l])
        x[getCoord(grepnr)$xx+6,getCoord(grepnr)$y+3,1] <- (l-1)
        assign("flag", 1, envir = .GlobalEnv)
      }  
    }
  }
  return (x)}
uniqueinsector7 <- function(x) {
  for (l in 2:10){
    if (length(grep(2,x[,,l])) < 9){
      if (length(grep(1,x[1:3,7:9,l])) == 1){
        grepnr <- grep(1,x[1:3,7:9,l])
        x[getCoord(grepnr)$xx,getCoord(grepnr)$y+6,1] <- (l-1)
        assign("flag", 1, envir = .GlobalEnv)
      }  
    }
}
  return (x)}
uniqueinsector8 <- function(x) {
  for (l in 2:10){
    if (length(grep(2,x[,,l])) < 9){
      if (length(grep(1,x[4:6,7:9,l])) == 1){
        grepnr <- grep(1,x[4:6,7:9,l])
        x[getCoord(grepnr)$xx+3,getCoord(grepnr)$y+6,1] <- (l-1)
        assign("flag", 1, envir = .GlobalEnv)
      }  
    }
  }
  return (x)}
uniqueinsector9 <- function(x) {
  for (l in 2:10){
    if (length(grep(2,x[,,l])) < 9){
      if (length(grep(1,x[7:9,7:9,l])) == 1){
        grepnr <- grep(1,x[7:9,7:9,l])
        x[getCoord(grepnr)$xx+6,getCoord(grepnr)$y+6,1] <- (l-1)
        assign("flag", 1, envir = .GlobalEnv)
      }  
    }
  }
  return (x)}

As you can see, the functions to deduce the missing numbers respond exactly to the four different “human-like” techniques explained before.

The first three are really easy to understand.

Regarding the fourth criterion, I must first say that the implementation is not as optimal as it could be; however, it is good enough for its purpose. I might improve it for Part 2.

It consists basically of nine functions (one for each sector) that look for values that are only possible in one cell of the corresponding sector.

Additionally, there is an extra function (getCoord(x)), which transforms the grep index into a primary coordinate reference, which is then shifted in each sector function according to its position.
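For instance, since grep() walks a 3x3 block column by column, getCoord() maps as follows (illustrative calls of mine, not part of the solver):


getCoord(1) #returns xx = 1, y = 1: first row, first column of the sector
getCoord(5) #returns xx = 2, y = 2: centre of the sector
getCoord(7) #returns xx = 1, y = 3: first row, third column of the sector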

4) Loop

#process

datavalue <- updatearray(datavalue)
datavalue <- rowcolelim(datavalue)
datavalue <- sectorelim(datavalue)
flag <- 1

#All the deduction functions, applied in the order described above
solvers <- list(uniqueoption, uniqueinrow, uniqueincol,
                uniqueinsector1, uniqueinsector2, uniqueinsector3,
                uniqueinsector4, uniqueinsector5, uniqueinsector6,
                uniqueinsector7, uniqueinsector8, uniqueinsector9)

while (flag == 1)
{

  flag <- 0

  for (solver in solvers)
  {
    datavalue <- solver(datavalue)

    #Re-run the cube update after every deduction step
    datavalue <- updatearray(datavalue)
    datavalue <- rowcolelim(datavalue)
    datavalue <- sectorelim(datavalue)
  }

}

Once the functions are declared, it is only necessary to insert them in a loop. As you could see before, whenever a value is updated the flag is set to 1, which makes the program loop once more; if nothing changes during a full pass, the loop stops. Please note that the flag is not assigned in the usual way (with "<-" or "=") but through the assign() function, which is how you can create or modify a global variable from inside a function in R.
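As a minimal illustration of that last point (my own sketch, using a throwaway function), assign() with envir = .GlobalEnv and the <<- operator are roughly equivalent here: both end up modifying the global flag from inside a function:


f <- function(){
  #Both lines set the global variable flag to 1 from inside the function body
  assign("flag", 1, envir = .GlobalEnv)
  flag <<- 1
}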

As stated in the introduction of this post, this script does not solve every unique-solution puzzle. In the next post dedicated to Sudokus, new functions will be presented to cover the situations where this script is not enough, basically, when there is no cell with just one possible value.

I hope you enjoyed this post. Any comments, critiques or corrections are welcome!


Social Media Monitoring tools in R with just a few lines

twitter-magnifying-glass

Social Media Analysis has become some kind of new obsession in Marketing. Every company wants to engage existing customers or attract new ones through this communication channel, so they hire designers, editors, community managers, and so on. However, when it comes to measurement, when the impact of all the resources spent on social media communication has to be analysed, they either do it extremely manually (manual counts, etc.) or they spend a considerable amount of money on “pre-built” solutions that might a) not meet their needs exactly, b) be too difficult to master, or c) require contacting Customer Service whenever a problem is found or a change must be made.

In my opinion, underlying this issue there is a misunderstanding of the importance of having the right tools (and people) to analyse these results; although it might seem a field with an exclusively qualitative focus, the work of a data scientist is, undoubtedly, as important as the roles mentioned above for succeeding in Social Media Marketing.

Only by using the right tools and solid, sustainable analysis can we gather good insights from Social Media Marketing. If not, you are being “trendy”, but you are probably losing money and time.

The objective on this occasion is to show how easy it can be to build your own Social Media tool and, at the same time, give a very brief introduction to four very useful packages for the topic in question that can help you with your own developments:

Shiny: A package created by RStudio (http://shiny.rstudio.org/) to build web applications very easily. In this case, you will see the code to run it locally; the linked site also has more info on how to run it on your own or a hosted server (I have not tried this option yet).

I would encourage you to go a step beyond with this package, as only a few features are presented here. As you will see, “front-end” possibilities are barely exploited in this example.

twitteR: A very powerful package for Twitter Monitoring. Simple, easy and very effective.

tm: As you might have already imagined, tm stands for “Text Mining”. Apart from text mining tools, it also provides very useful functions to pre-process text.

wordcloud: Package used to do Wordcloud plots.

Below, what you were all expecting… the code!! If you copy/paste it, you will see an error message saying “Error: You must enter a query” appear twice (it comes from the twitteR package). My apologies, I could not manage to hide it (I have not found any error handler that could do it). If anyone has the solution, I would appreciate it if he/she could share it; I will include it and cite him/her in the post.

The “shiny” package requires two separate scripts, named ui.R and server.R, placed in the same folder. Once they are finished, to run your web application you should enter:

library(shiny)
runApp("the location of the desired folder")

ui.R


library(shiny)
shinyUI(pageWithSidebar(

  # Application title
  headerPanel("Tweets hunter"),

  sidebarPanel( textInput("term", "Enter a term", ""),
                numericInput("cant", "Select a number of tweets",1,0,200),
                radioButtons("lang","Select the language",c(
                  "English"="en",
                  "Castellano"="es",
                  "Deutsch"="de")),
                submitButton(text="Run")),

  mainPanel(
    h4("Last 5 Tweets"),
    tableOutput("table"),
    plotOutput("wordcl"))
))

server.R


library(shiny)
library(twitteR)
library(wordcloud)
library(tm)

shinyServer(function (input, output) {

  rawData <- reactive(function(){
    tweets <- searchTwitter(input$term, n=input$cant, lang=input$lang)
    twListToDF(tweets)
  })

  output$table <- reactiveTable(function () {
    head(rawData()[1], n=5)
  })

  output$wordcl <- reactivePlot(function(){
    tw.text <- enc2native(rawData()$text)
    tw.text <- tolower(tw.text)
    tw.text <- removeWords(tw.text, c(stopwords(input$lang), "rt"))
    tw.text <- removePunctuation(tw.text, TRUE)
    tw.text <- unlist(strsplit(tw.text, " "))

    word <- sort(table(tw.text), TRUE)

    wordc <- head(word, n=15)

    wordcloud(names(wordc), wordc, random.color=TRUE, colors=rainbow(10), scale=c(15,2))
  })
})

As you can see in the scripts, the shiny package works with an input/output logic. In order to build your own applications, you need a very clear idea of what should be given by the user (input), what should be shown (output), and in which file (ui or server) each part of the process belongs. The steps below might help:

1) What: User enters parameters. Where: User Input Menu. File: ui.R
2) What: Parameter values are taken by the function and processed. Where: “R Engine”. File: server.R
3) What: Outputs are returned. Where: “R Engine”. File: server.R
4) What: Outputs are shown. Where: Webpage (wherever you decided to put it). File: ui.R

Before we go on to analyse the code in more detail, this is what the application looks like:

tweets hunter begin

From the user interface perspective, it is quite simple; we just type in a word, select the number of tweets we would like to extract and finally select the language. Below you will see a screenshot of the result of entering the term “music”, selecting 200 tweets and choosing English as the language:

tweets hunter example

Let’s take a look at the code following the 4 steps mentioned above:

1) User enters parameters. ui.R


library(shiny)
shinyUI(pageWithSidebar(

  # Application title
  headerPanel("Tweets hunter"),

  sidebarPanel( textInput("term", "Enter a term", ""),
                numericInput("cant", "Select a number of tweets",1,0,200),
                radioButtons("lang","Select the language",c(
                  "English"="en",
                  "Castellano"="es",
                  "Deutsch"="de")),
                submitButton(text="Run")),

The menu logic is extremely easy to understand. Each of the widgets used to enter parameters takes its id as the first argument; to use it in further actions, you write input$ followed by the widget’s id, and it works like any other variable. In radioButtons, the label of each option is entered first and then the value that the variable will receive if that option is chosen.

The only thing to point out specially is the submitButton() function. Unless you include it, the script will process and output the results every time you change a parameter. submitButton() is particularly useful if the underlying process is expensive or if the user needs to enter more than one parameter for the whole script to run correctly.
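For instance, a hypothetical extra widget (not part of the app above) with the id “maxn” would later be read in server.R as input$maxn:


#Hypothetical widget: id first, then label, then initial value, minimum and maximum
numericInput("maxn", "Max results", value = 10, min = 1, max = 100)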

2) Parameter values are taken by the function and processed. server.R


library(shiny)
library(twitteR)
library(wordcloud)
library(tm)

shinyServer(function (input, output) {

  rawData <- reactive(function(){
    tweets <- searchTwitter(input$term, n=input$cant, lang=input$lang)
    twListToDF(tweets)
  })

Step 2 is a bit more complex. Firstly, all the necessary libraries (in addition to shiny) have to be loaded in server.R. It is always advisable, when possible, to load them all together at the top.

shinyServer(function(input, output) must wrap the code in server.R that works with the “input” variables (the parameters entered by the user). Otherwise, it will not work.

reactive() indicates that whatever is processed inside it will be re-run whenever the parameters change (if you included submitButton() in the UI, whenever you hit it; otherwise, whenever you change the parameters). It is particularly useful for building the raw data that you will process afterwards. You can find a very good explanation of it in the documentation.

In this case, searchTwitter() uses the information entered by the user (the first argument is the term entered, the second the amount of tweets and the third one the language) and gives its output.

As the object returned by searchTwitter() is a bit difficult to handle, it is advisable to turn it into a Data Frame (twListToDF()) if you want to work, for example, with their texts.

Finally, this process gives us what we needed: the raw data.
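One caveat: depending on your version of the twitteR package and on Twitter’s API requirements, searchTwitter() may refuse to return results until you authenticate. A minimal sketch with placeholder credentials (replace them with the keys of your own Twitter app) would be:


library(twitteR)

#Placeholder credentials: replace with your own application's keys
consumer_key <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token <- "YOUR_ACCESS_TOKEN"
access_secret <- "YOUR_ACCESS_SECRET"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)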

3) Outputs are returned. File: server.R

output$table <- reactiveTable(function () {
  head(rawData()[1], n=5)
})

output$wordcl <- reactivePlot(function(){
  tw.text <- enc2native(rawData()$text)
  tw.text <- tolower(tw.text)
  tw.text <- removeWords(tw.text, c(stopwords(input$lang), "rt"))
  tw.text <- removePunctuation(tw.text, TRUE)
  tw.text <- unlist(strsplit(tw.text, " "))

  word <- sort(table(tw.text), TRUE)

  wordc <- head(word, n=15)

  wordcloud(names(wordc), wordc, random.color=TRUE, colors=rainbow(10), scale=c(15,2))
})

reactiveTable() and reactivePlot() are functions that call other functions and return specific “reactive” outputs (in this case, a table and a plot) that will afterwards be displayed in the web interface. It is important to name them output$…, because otherwise, as will be explained in Step 4, you will not be able to display them in the web application.

The reactive Table in this example just returns the first five records of the first column of the data frame and it can be understood by just reading the code.

The reactive plot is a bit more sophisticated. Firstly, it takes the tweets and changes the encoding (enc2native()). I tried this script on 3 different computers and had encoding problems on one of them; this step is just to avoid that issue.

Then, it converts everything to lower case. After that, the removeWords() function from the “tm” package is used to delete common words. As you can see in the example, you can pass whatever word, list of words, regex, etc. you would like to be removed; in this case, the stopwords of the user-entered language (input$lang) are removed, plus the term “rt”. For a wordcloud (our final objective) this is particularly useful, as we would never want “common words” in it. After that, punctuation is also removed.

Finally, all the tweets are split into a list of words, turned into a frequency table, sorted in decreasing order, and the first 15 entries are chosen for the plot.

For the wordcloud function, we enter the labels (the words themselves) of the generated table and the frequencies (the first two arguments in the example). This will determine the size of the words. The last arguments refer to the color order, the palette, and the maximum and minimum size for each word in the plot.
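If you want to see what this cleaning pipeline does outside shiny, here is a small standalone sketch on two made-up tweets (my own example, not part of the app):


library(tm)

#Two invented tweets
txt <- c("RT @user: I love this song! #music", "listening to music, and more music")
txt <- tolower(txt)
txt <- removeWords(txt, c(stopwords("en"), "rt"))
txt <- removePunctuation(txt, TRUE)

#Split into words; empty strings remain where words were removed, just as in the app
words <- unlist(strsplit(txt, " "))
head(sort(table(words), decreasing = TRUE), n = 5)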

4) Outputs are shown. File: ui.R

mainPanel(
  h4("Last 5 Tweets"),
  tableOutput("table"),
  plotOutput("wordcl"))
))

The concluding part is very simple. You just have to specify where (in this case, the main panel) and what (in this case, the table and the plot) to show. Remember that every object generated with a reactive function in the server.R file and correctly declared as output$… must now be called by its name as a string. In this case, the wordcloud was declared as output$wordcl and is called in the UI as plotOutput("wordcl").

That’s it! I hope you enjoyed it, and sorry for the length :D. As usual, if you have any questions, critiques or corrections, please feel free to write.


Animated graphs, another alternative for Data Visualization

peli 8mm

The world of Data Visualization offers infinite ways to display our data. However, there is still some reluctance to exploit all the possibilities that computers give us in this field nowadays, probably not because of a rejection of novelties but due to a certain tendency to stick to a “paper support” vision, i.e. to representations that could be put on paper. Even if interactive options are included (such as menus with filters, pivot tables, etc.), the way the data is finally displayed does not differ very much from what could have been done a hundred years ago: pie charts, bars, axis plots, etc.

This has of course several limitations: there is no possibility of showing something dynamic, which is definitely not merely an aesthetic detail. Animations can sometimes show the effect of time on the data we are presenting considerably more easily than any static model.

Of course, this can be done in R. Below you will find the code for a hypothetical network graph and how the connections (edges) occur throughout time. In this post, the idea is to focus on the animation features, which can be used with any other type of graph. In the future, I will go deeper into social networks and graphs.

Any comments, suggestions, critiques or corrections, please feel free to write. Enjoy!!


#load libraries
library(igraph)
library(animation)

#This is an invented dataset that describes the edges generated over time between the different nodes. For this example, we assume the edges were created in this order, reading the two vectors in parallel: the first edge joins 1 and 2, the second 3 and 4, the third 5 and 6, etc.

a <- c(1, 3, 5, 7, 1, 3, 2, 4, 1, 3, 2, 4, 1, 2, 5, 6)
b <- c(2, 4, 6, 8, 5, 7, 6, 8, 6, 8, 5, 7, 3, 4, 7, 8)

DB <- as.matrix(data.frame(a,b))

#create the graph

graph <- graph.edgelist(DB, directed=FALSE)

#reset the record of frames. In order to ensure that what we are going to animate does not include something else, it is always a good practice to reset before starting

ani.record(reset=TRUE)

# loop created to generate the animation. As there is no way to draw it cumulatively, the same graph (with all its nodes) is plotted every time, but the edges not yet “created” are drawn in white to give the idea of a progressive connection

for ( i in 1:nrow(DB))
{
colour <- c(rep('red', length.out = i), rep('white', length.out = nrow(DB)-i))
plot(graph, layout=layout.circle(graph), edge.color= colour)
ani.record()
}

#There are many possibilities to save the animation created. Below you will find only 2

#Option 1: saveHTML, which is indeed considerably better. If you are going to run the animation locally or you have a server to store the images, I would highly recommend it; I did not use it here only because I have nowhere to host the files.

saveHTML(ani.replay())

#This is the one I used in the post. It creates an mp4 file. Therefore, you will need to download ffmpeg.

saveVideo(ani.replay(), video.name = "graphanimation.mp4", ffmpeg = "(the path to your ffmpeg executable)")
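#A third alternative (assuming you have ImageMagick installed): saveGIF from the same animation package
#writes an animated GIF without needing a server or ffmpeg. This is a sketch, not the option used in this post.

saveGIF(ani.replay(), movie.name = "graphanimation.gif")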

This is the final result, uploaded to YouTube. The quality is definitely not the best.


Working with geographical Data. Part 1: Simple National Infomaps

worldatnight

There is a popular expression in my country, “Gastar polvora en chimangos”, which translates into English as “spending gunpowder on chimangos”. The chimango is a kind of bird whose meat is useless to humans, so the expression stands for spending a lot of money, time or effort on something that is not worth it. This is of course undesirable in any aspect of our lives, but I think it is crucial at work: when a task that should be easy takes more effort than expected, a “snowball effect” begins where the rest of the tasks get delayed as well. As we all know, this ends in staying up late and stressed to finish the tasks planned for an 8-hour working day.

As you can see by googling, there are millions of packages, methods, tutorials, strategies, etc. for working with geographical data in R. In this series of posts, I will present some of them, taken directly from my own experience, trying to follow an order of increasing difficulty. Of course, the more complex methods are more flexible and provide more alternatives.

In this case, we will keep it really simple and draw an infomap of part of South America. Infomaps are a widespread and clear way of presenting data related to geographical zones, and they have a double advantage: they are very easy to understand and, as they are not easy to produce in Excel, they always make an impact when included in a presentation.

Below you will find the R code for a really simple approach. I hope you like it. Any comments, corrections or critiques, please write!!


library('maps')
library('mapdata')
library('RColorBrewer')

DB <- as.matrix(c('Argentina', 'Brazil', 'Chile', 'Uruguay', 'Paraguay', 'Bolivia', 'Peru'))

#add population-density Data

DB <- cbind (DB, c(15,23,22,19,17,10,24))

#create a gradual palette of Reds. Function belongs to RColorBrewer

gama <- brewer.pal(6,"Reds")
countries <- as.character(DB[,1])

# with the cut function you can assign each numeric value to an interval defined by the user (in this case the breaks are 0, 5, 10, 15, 20 and the maximum of the data)

DB <- cbind(DB, cut (as.numeric(DB[,2]),c(0,5,10,15,20,max(as.numeric(DB[,2]))),labels = FALSE, right = TRUE))

#With the ordinal values assigned to each country, we now create a character array with the colour code corresponding to each of them, based upon the palette we have created

col <- character()

for (i in 1:nrow(DB))
{
col <- append(col,gama[as.numeric(DB[i,3])])
}
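#Note (an optional aside of mine): an equivalent, vectorized alternative to the loop above would be
#col <- gama[as.numeric(DB[,3])]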

#We draw the map. Please note that the arrays countries and col need to be maintained in the same order. If not, the colour assigned to each country will be wrong. So, be careful if you need to sort the values of any array before plotting.

map('worldHires',countries,fill=TRUE,col=col,plot=TRUE, cex = 15, exact=TRUE)
legend("bottomright", c("up to 15", "16 - 17", "18 - 19", "20-21", "22-23", "more than 23"),border=gama, fill = gama, cex = 1.3, box.col = "white")

#Although RStudio provides a menu option to export a plot to a file (I do not know about other interfaces), if you have to export the map I would advise doing it by code, as the sizes and proportions are much easier to handle. In this case, it would be as follows:

png(file= (your Path),width = (width in pixels), height = (height in pixels), res = 120)
map('worldHires',countries,fill=TRUE,col=col,plot=TRUE, cex = 15, exact=TRUE)
legend("bottomright", c("up to 15", "16 - 17", "18 - 19", "20-21", "22-23", "more than 23"),border=gama, fill = gama, cex = (the size you want for the box), box.col = "white")
dev.off()

This is the final result

map4blog


Florence Nightingale and the importance of Data Visualization

nightingale

Florence Nightingale is held as a heroine by the British people because of her work during the Crimean War. However, she would not have been so widely recognised had she not also been a superb statistician. A brilliant documentary released by the BBC in December 2010, as part of the series “The Beauty of Diagrams”, depicts the dilemma Nightingale had to face after returning from the Crimean War: after collecting information about the soldiers’ causes of death in the military hospital over two years, she discovered some revealing facts. The majority were dying not because of wounds received in battle but due to infections (typhus and cholera, among others) contracted inside the hospital, triggered by negligent hygiene conditions.

She was convinced that changes similar to those made in the military hospitals during the Crimean War would have the same results in the civilian hospitals in London. When she came back, she faced many difficulties in getting an appointment with the officials responsible for hospital sanitation to present her proposal. After a long struggle, she finally managed to obtain an audience, but a very short one. In this context, she knew her presentation had to be short and concise, but very impactful at the same time.

Challenged by this adversity, she came up with the idea of what we know today as the Nightingale Diagram or Nightingale Rose, a circular graph that shows the number of deaths per cause throughout the months. As we can see, the picture speaks for itself about the issue Nightingale wanted to point out. The clarity of this diagram enabled her to convince the authorities in London of the need for a change in the sanitary conditions, as it left no room for misinterpretation.

Nightingale Diagram: Blue represents deaths occasioned by diseases, red stands for the deaths due to wounds and black for all other causes of death.

This anecdote does not differ very much from what we, today's data workers, have to cope with every day: bosses or managers whom we have to convince in a very short time. Therefore, our presentation has to summarize as much as possible the many hours invested in the task and, at the same time, prove our points. Here is where Data Visualization becomes a great ally: the ability to show results in a convincing way is as important as the long hours of work it took us to arrive at those conclusions.

In our digital era, there are plenty of programmes that can help us with this and, what’s more, many of them are completely free and open source. My personal favourite (and the one I will use most in this blog) is R.

To conclude this first post, below you will find an R script to recreate your own Rose Diagram (as similar as possible; I was not able to adapt the second pie to the scale of the first one).

Any comments, questions, recommendations, please shoot 😉 !


#you will need to have the library plotrix installed
library (plotrix)
#I could not manage to upload the dataset; whenever I am able to, I will add it here
data <- as.matrix(read.table("wherever you get the dataset",header=TRUE))
#Dirty color-binding
data <- cbind (data,'red')
data <- cbind (data, 'blue')
data <- cbind (data, 'green')

colnames(data)[5] <- 'color1'
colnames(data)[6] <- 'color2'
colnames(data)[7] <- 'color3'

#The series have to be added up, so that the diagram superposition respects the area of each case
series1 <- as.numeric(data[1:12,4]) + as.numeric(data[1:12,3]) + as.numeric(data[1:12,2])
series2 <- as.numeric(data[1:12,3]) + as.numeric(data[1:12,2])
series3 <- as.numeric(data[1:12,2])
series4 <- as.numeric(data[13:24,4]) + as.numeric(data[13:24,3]) + as.numeric(data[13:24,2])
series5 <- as.numeric(data[13:24,3]) + as.numeric(data[13:24,2])
series6 <- as.numeric(data[13:24,2])

#Drawing the plot for year 1. Setting the rings in different colors is a bit tricky, which is why it is easier to superimpose three graphs

radial.pie (series1, labels= data[1:12,1], sector.colors = data [1:12,5], clockwise= TRUE, show.grid.labels= FALSE, add = FALSE)
radial.pie (series2,sector.colors = data[1:12,6], labels= data[1:12,1], clockwise= TRUE, add = TRUE, show.grid.labels= FALSE)
radial.pie (series3,sector.colors = data[1:12,7], labels= data[1:12,1], clockwise= TRUE, add = TRUE, show.grid.labels= FALSE)

# legend for the graph
legend (x=3000,y=8,legend = c(colnames(data)[4],colnames(data)[3],colnames(data)[2]), border = data [1,5:7], fill=data [1,5:7])

#Year 2
radial.pie (series4, labels= data[13:24,1], sector.colors = data [1:12,5], clockwise= TRUE, show.grid.labels= FALSE, add = FALSE, radial.lim = max(series1))
radial.pie (series5,sector.colors = data[1:12,6], labels= data[1:12,1], clockwise= TRUE, add = TRUE, show.grid.labels= FALSE)
radial.pie (series6,sector.colors = data[1:12,7], labels= data[1:12,1], clockwise= TRUE, add = TRUE, show.grid.labels= FALSE)

legend (x=1600,y=8,legend = c(colnames(data)[4],colnames(data)[3],colnames(data)[2]), border = data [1,5:7], fill=data [1,5:7])

Plot for year 1

nightingale 1

Plot for year 2

rose year 2
