Load Packages

One of the great features of R is its vast ecosystem of user-built packages that extend the functionality of the core implementation. The packages below allow us to connect to our data source, turn text into numerical features, and visualize the end result. This webpage was generated from this R code using R Markdown.

if(require("RODBC")==FALSE){install.packages("RODBC")}
## Loading required package: RODBC
library(RODBC)
if(require("qdap")==FALSE){install.packages("qdap")}
## Loading required package: qdap
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following object is masked from 'package:base':
## 
##     Filter
library(qdap)
if(require("tm")==FALSE){install.packages("tm")}
## Loading required package: tm
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:qdap':
## 
##     ngrams
## 
## Attaching package: 'tm'
## The following objects are masked from 'package:qdap':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
library(tm)
if(require("stringr")==FALSE){install.packages("stringr")}
## Loading required package: stringr
## 
## Attaching package: 'stringr'
## The following object is masked from 'package:qdap':
## 
##     %>%
library(stringr)
if(require("SnowballC")==FALSE){install.packages("SnowballC")}
## Loading required package: SnowballC
library(SnowballC)
if(require("Tmisc")==FALSE){install.packages("Tmisc")}
## Loading required package: Tmisc
library(Tmisc)
if(require("wordcloud")==FALSE){install.packages("wordcloud")}
## Loading required package: wordcloud
library(wordcloud)

Connect to Database

It’s always best practice to source your database credentials from a separate R file so they stay out of your published code. To do this, put the connection line below into that file and fill in the ODBC_NAME, USERNAME, & PASSWORD for your connection. You’ll also need an ODBC data source configured on your host machine with your database information:

ARKABASE <- odbcConnect("ODBC_NAME", uid="USERNAME", pwd="PASSWORD")

source('C:/Users/plapo/Documents/arkabase_conn.R')

Pull Slack Data

Once you’ve connected, you can pull in all the messages. Note: This database, hosted on Amazon Web Services, has already been loaded with Arkatechture’s Slack data using Slack’s API & Alteryx.

Messages

Using the ODBC connection to the database containing the Slack data, pull all the messages. Note: There are special records with subtypes that correspond to specific user activities rather than raw messages, so we will limit our initial scope to raw messages only (records with an empty SUBTYPE).

slack_messages<-sqlQuery(ARKABASE,
                        "SELECT *
                         FROM ARKATPR.SLAK_TPR.SLACK_MESSAGES")
sqlQuery(ARKABASE,
          "SELECT DISTINCT SUBTYPE 
          FROM ARKATPR.SLAK_TPR.SLACK_MESSAGES")
##              SUBTYPE
## 1      channel_topic
## 2         file_share
## 3       channel_name
## 4        pinned_item
## 5        bot_message
## 6            bot_add
## 7               <NA>
## 8         bot_remove
## 9      channel_leave
## 10      reminder_add
## 11      channel_join
## 12   reply_broadcast
## 13                  
## 14      file_comment
## 15   channel_archive
## 16 slackbot_response
## 17   channel_purpose
## 18      file_mention
slack_messages<-sqlQuery(ARKABASE,
                        "SELECT * 
                         FROM ARKATPR.SLAK_TPR.SLACK_MESSAGES
                         WHERE SUBTYPE=''")
head(slack_messages,1)
##                     MSG_ID           TIMESTAMP    TYPE SUBTYPE CHANNEL
## 1 alteryx1464293931.000009 2016-05-26 20:18:51 message      NA alteryx
##   INVITER                         TEXT   USER_ID
## 1      NA when is the next user group? U0J6Q16NM

Users

We’ll also want to pull the users from our Slack data so that user IDs can be removed from message content at a later stage. To support that, we create exemption variables that act as custom “stop words” to be removed or replaced during text mining. This step turns out to be less critical because of the later decision to remove all Slack links below. We’re also done with our data pulls, so we will close our connection to the database.

slack_users<-sqlQuery(ARKABASE,
                      "SELECT * 
                      FROM ARKATPR.SLAK_TPR.SLACK_USERS")

# create exemption strings of user IDs & usernames
slack_users$exemption <- paste(slack_users$USER_ID,slack_users$NAME,sep="|")
exemptionfull<-slack_users$exemption
exemption<-slack_users$USER_ID

odbcClose(ARKABASE)
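
For illustration only (this isn’t part of the original pipeline, and the link removal later makes it largely redundant), the exemption vector could be applied to strip raw user IDs out of the message text:

# hypothetical usage: build an alternation regex from the user IDs
# and delete any matches from the raw message text
slack_messages$TEXT <- gsub(paste(exemption, collapse = "|"),
                            "", slack_messages$TEXT)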

Text Mining

Create “Corpus”

In order to mine the Slack messages for the top words in each channel, all of a channel’s messages will be concatenated. A new data frame chan will be built using a custom function that loops over each message, concatenating them separated by spaces. This creates the body of text, or corpus, to mine for each channel.

ASCII Conversion

The iconv function converts from the latin1 extended-ASCII encoding used by Slack & R to the basic ASCII character set, eliminating any extended characters by replacing them with the empty string "".

slack_messages$TEXT <- sapply(slack_messages$TEXT,function(row) 
  iconv(row, from="latin1", to="ASCII", sub="")) 
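
As a quick illustration with a made-up string (not from the Slack data), a latin1-encoded accented character is simply dropped:

# "\xe9" is the latin1 byte for an accented e; it has no basic-ASCII
# equivalent, so sub="" deletes it, leaving "caf"
iconv("caf\xe9", from="latin1", to="ASCII", sub="")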

Initialize pre-Corpus

Create a data frame with 3 variables: NAME, Freq, and Text, where NAME is the name of the channel, Freq is the number of Slack messages, and Text is initialized with the placeholder value "textarrdf". Note: The channel & message columns cannot be Factor data types & must be chr data types for text mining.

chan = data.frame(table(slack_messages$CHANNEL), stringsAsFactors = FALSE)
chan = chan[order(chan$Freq, decreasing=TRUE), ]
colnames(chan) = c("NAME", "Freq")
# unfactor() from Tmisc could be used here if the columns came back as factors
# chan <- unfactor(chan)

chan$Text <- "textarrdf"
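
A quick sanity check (our addition, not in the original code) confirms the columns are the right types for the steps ahead:

str(chan)  # NAME and Text should show as chr, Freq as int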

Load pre-Corpus

Create a concatenation function and loop over the chan data frame to combine each channel’s Slack messages into a single Text value, which will then be converted into the wordCorpus for text mining.

create_text<- function( tempchannel )
{
  print(nrow(tempchannel))
  # start from a single-space string and accumulate messages into it
  textarrdf = c(" ")
  
  for(i in 1:nrow(tempchannel))
  {
    slackdf = tempchannel[i, "TEXT"]
    textarrdf = paste(textarrdf, slackdf , sep = " ")
  }
  
  return(textarrdf)
} 

# loop through the channels to aggregate the texts
for(chan_num in 1:nrow(chan))
{
  # store the channel name in variable CHAN
  CHAN <- chan[chan_num,"NAME"]
  
  # print the iteration number and Channel name, 
  # just to know the program is working
  print(paste("Loading Channel num =", chan_num, "| NAME =", CHAN ))
  
  # subset the messages to the channel matching our current selection
  tempheadline <- subset(slack_messages, slack_messages$CHANNEL == CHAN)
  
  # collect all the messages in this channel into a single string
  chan$Text[chan_num] <- create_text(tempheadline)
}
## [1] "Loading Channel num = 1 | NAME = general"
## [1] 4460
## [1] "Loading Channel num = 2 | NAME = envirovantage"
## [1] 2819
## [1] "Loading Channel num = 3 | NAME = smb"
## [1] 2501
## [1] "Loading Channel num = 4 | NAME = random"
## [1] 2487
## [1] "Loading Channel num = 5 | NAME = marketing"
## [1] 1789
## [1] "Loading Channel num = 6 | NAME = development"
## [1] 1172
## [1] "Loading Channel num = 7 | NAME = arkabase"
## [1] 721
## [1] "Loading Channel num = 8 | NAME = vetsfirstchoice"
## [1] 559
## [1] "Loading Channel num = 9 | NAME = funcommittee"
## [1] 498
## [1] "Loading Channel num = 10 | NAME = vizzinquest"
## [1] 464
## [1] "Loading Channel num = 11 | NAME = hacknight"
## [1] 388
## [1] "Loading Channel num = 12 | NAME = product"
## [1] 368
## [1] "Loading Channel num = 13 | NAME = aws"
## [1] 344
## [1] "Loading Channel num = 14 | NAME = vizzers"
## [1] 272
## [1] "Loading Channel num = 15 | NAME = llbean"
## [1] 198
## [1] "Loading Channel num = 16 | NAME = pizzachallenge"
## [1] 157
## [1] "Loading Channel num = 17 | NAME = badges"
## [1] 121
## [1] "Loading Channel num = 18 | NAME = td"
## [1] 110
## [1] "Loading Channel num = 19 | NAME = testing"
## [1] 106
## [1] "Loading Channel num = 20 | NAME = blackbaud"
## [1] 75
## [1] "Loading Channel num = 21 | NAME = jira-troubles"
## [1] 74
## [1] "Loading Channel num = 22 | NAME = light"
## [1] 73
## [1] "Loading Channel num = 23 | NAME = paydiant"
## [1] 59
## [1] "Loading Channel num = 24 | NAME = sqltroubles"
## [1] 58
## [1] "Loading Channel num = 25 | NAME = awstroubles"
## [1] 55
## [1] "Loading Channel num = 26 | NAME = alteryx"
## [1] 54
## [1] "Loading Channel num = 27 | NAME = edx"
## [1] 47
## [1] "Loading Channel num = 28 | NAME = projectdino"
## [1] 46
## [1] "Loading Channel num = 29 | NAME = ev-bitbucket"
## [1] 42
## [1] "Loading Channel num = 30 | NAME = devops"
## [1] 30
## [1] "Loading Channel num = 31 | NAME = gamenight"
## [1] 28
## [1] "Loading Channel num = 32 | NAME = lunchtime"
## [1] 25
## [1] "Loading Channel num = 33 | NAME = rant"
## [1] 23
## [1] "Loading Channel num = 34 | NAME = pie"
## [1] 21
## [1] "Loading Channel num = 35 | NAME = safran-aero"
## [1] 7
## [1] "Loading Channel num = 36 | NAME = tipsandtricks"
## [1] 5
## [1] "Loading Channel num = 37 | NAME = chriss-weird-diet"
## [1] 3
## [1] "Loading Channel num = 38 | NAME = makeachange"
## [1] 3
## [1] "Loading Channel num = 39 | NAME = pranks"
## [1] 2
## [1] "Loading Channel num = 40 | NAME = slackalytics"
## [1] 2
rm(tempheadline,chan_num,CHAN)
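
As an aside, the explicit loop above can be collapsed into a single vectorized call; this sketch (an alternative we’re suggesting, not the original code) produces the same per-channel concatenation apart from create_text’s leading space:

# vectorized equivalent of create_text plus the channel loop
chan$Text <- sapply(chan$NAME, function(ch)
  paste(slack_messages$TEXT[slack_messages$CHANNEL == ch], collapse = " "))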

Text Pre-Processing

Remove Special Slack Encoding

Slack uses special message formatting that replaces the characters &, <, and > with &amp;, &lt;, and &gt; respectively. In addition, all links are wrapped in <...>. Therefore, the first step is to remove any string that’s a link to something else. Note: Later we may come back and quantify these, but we will remove them for now so we can visualize only the words in the messages.

removeSlackLinks <- function(doc) {
    gsub( " *<.*?> *", "", doc)
}

chan$Text <- removeSlackLinks(chan$Text)
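
A quick test on a made-up message (ours, for illustration) shows the pattern at work:

# the wrapped link is stripped, the plain words remain
removeSlackLinks("<https://arkatechture.com> check this out")
# -> "check this out"
# note the pattern also consumes the spaces around a token, so a
# mid-sentence link leaves its neighbors joined: "a <x> b" -> "ab"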

Create Corpus

So far we’ve been using a traditional R dataframe. In order to use many of R’s text mining capabilities, we’ll need to convert the text into a different class: Corpus.

wordCorpus <- Corpus(VectorSource(chan$Text))
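
Each channel becomes one document in the corpus, which we can verify (a check we’ve added for illustration):

length(wordCorpus)  # 40, one document per channel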

Whitespace & Lower Case

Whitespace, such as tabs, multiple spaces between words, and line breaks, may help with the formatting & structure of a document; however, it doesn’t help us extract the meaning behind the words, so we will remove it. Also, upper-case & lower-case characters are encoded differently. This is not a distinction that provides differentiation in our analysis, so we will standardize all letters to lower case.

wordCorpus <- tm_map(wordCorpus, stripWhitespace)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))

Stop Words

Common words don’t help us understand the topic or meaning in a sentence. For example, in the previous sentence, the words don’t, us, the, or, in, and a could all be removed and we’d still understand the topic: Common words help understand topic meaning sentence. The removal of don’t does invert the point the author is making; however, the fundamental concept is retained from the remaining words. More importantly, these common words have no discriminatory power in differentiating one topic or meaning from another.

wordCorpus <- tm_map(wordCorpus, removeWords, stopwords("english"))
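
For reference, tm ships the Snowball English stop word list; peeking at it (our addition) shows the kind of words being dropped:

head(stopwords("english"))  # "i" "me" "my" "myself" "we" "our"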

Remove Numbers & Punctuation

Finally, since this is our initial pass at analyzing the text, we’re going to simplify our scope and stick to just atomic words, eliminating any numbers or punctuation.

wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, removeNumbers)
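
Both transformations also work directly on character vectors, which makes them easy to spot-check (the example strings are ours):

removePunctuation("don't stop... believing!")  # -> "dont stop believing"
removeNumbers("route 66")                      # -> "route "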

Stemming

The final step in processing the text is to “stem” the words. Stemming further standardizes slight variations of the same word. For example, argue, argued, argues, & arguing are all essentially the same verb with different tenses. Stemming reduces these 4 different verbs to the common stem of argu.

wordCorpus <- tm_map(wordCorpus, stemDocument) 
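
SnowballC (loaded earlier) exposes the stemmer directly, so we can confirm the behavior on the verbs above:

wordStem(c("argue", "argued", "argues", "arguing"))
# -> "argu" "argu" "argu" "argu"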

Word Clouds

Now that our corpora for each channel are standardized as much as possible, we can visualize each one as a word cloud.

# loop through the channels to generate a word cloud for each
for(chan_num in 1:nrow(chan))
{
  # store the channel name in variable CHAN
  CHAN <- chan[chan_num,"NAME"]
  
  # print the iteration number and Channel name, 
  # just to know the program is working
  print(paste("Loading Channel num =", chan_num, "| NAME =", CHAN ))
  
  # subset the corpus to this channel's single document
  # (stemming was already applied above, so re-stemming here is unnecessary)
  tempCorpus <- wordCorpus[chan_num]
  
  # build a term-document matrix so we can check that the corpus contains words
  myTdm <- TermDocumentMatrix(tempCorpus)
  
  # create the word cloud from the temporary corpus if it contains words
  if (myTdm$nrow > 0) {
   wordcloud(tempCorpus, scale=c(4, 0.01), random.order=FALSE, 
          rot.per=0.35, use.r.layout=FALSE, colors = brewer.pal(8, "Dark2"))
  }
}
## [1] "Loading Channel num = 1 | NAME = general"

## [1] "Loading Channel num = 2 | NAME = envirovantage"

## [1] "Loading Channel num = 3 | NAME = smb"

## [1] "Loading Channel num = 4 | NAME = random"

## [1] "Loading Channel num = 5 | NAME = marketing"

## [1] "Loading Channel num = 6 | NAME = development"

## [1] "Loading Channel num = 7 | NAME = arkabase"

## [1] "Loading Channel num = 8 | NAME = vetsfirstchoice"

## [1] "Loading Channel num = 9 | NAME = funcommittee"

## [1] "Loading Channel num = 10 | NAME = vizzinquest"

## [1] "Loading Channel num = 11 | NAME = hacknight"

## [1] "Loading Channel num = 12 | NAME = product"

## [1] "Loading Channel num = 13 | NAME = aws"

## [1] "Loading Channel num = 14 | NAME = vizzers"

## [1] "Loading Channel num = 15 | NAME = llbean"

## [1] "Loading Channel num = 16 | NAME = pizzachallenge"

## [1] "Loading Channel num = 17 | NAME = badges"

## [1] "Loading Channel num = 18 | NAME = td"

## [1] "Loading Channel num = 19 | NAME = testing"

## [1] "Loading Channel num = 20 | NAME = blackbaud"

## [1] "Loading Channel num = 21 | NAME = jira-troubles"

## [1] "Loading Channel num = 22 | NAME = light"

## [1] "Loading Channel num = 23 | NAME = paydiant"

## [1] "Loading Channel num = 24 | NAME = sqltroubles"

## [1] "Loading Channel num = 25 | NAME = awstroubles"

## [1] "Loading Channel num = 26 | NAME = alteryx"

## [1] "Loading Channel num = 27 | NAME = edx"

## [1] "Loading Channel num = 28 | NAME = projectdino"

## [1] "Loading Channel num = 29 | NAME = ev-bitbucket"

## [1] "Loading Channel num = 30 | NAME = devops"

## [1] "Loading Channel num = 31 | NAME = gamenight"

## [1] "Loading Channel num = 32 | NAME = lunchtime"

## [1] "Loading Channel num = 33 | NAME = rant"

## [1] "Loading Channel num = 34 | NAME = pie"

## [1] "Loading Channel num = 35 | NAME = safran-aero"

## [1] "Loading Channel num = 36 | NAME = tipsandtricks"

## [1] "Loading Channel num = 37 | NAME = chriss-weird-diet"

## [1] "Loading Channel num = 38 | NAME = makeachange"
## [1] "Loading Channel num = 39 | NAME = pranks"

## [1] "Loading Channel num = 40 | NAME = slackalytics"