One of the great features of R is that there is a vast ecosystem of user-built packages that have extended the functionality built into its core implementation. These packages will allow us to connect to our data source, turn text into numerical features, and visualize the end result. This webpage was made using this R code and RMarkdown.
if(require("RODBC")==FALSE){install.packages("RODBC")}
## Loading required package: RODBC
library(RODBC)
if(require("qdap")==FALSE){install.packages("qdap")}
## Loading required package: qdap
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
##
## Attaching package: 'qdap'
## The following object is masked from 'package:base':
##
## Filter
library(qdap)
if(require("tm")==FALSE){install.packages("tm")}
## Loading required package: tm
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:qdap':
##
## ngrams
##
## Attaching package: 'tm'
## The following objects are masked from 'package:qdap':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
library(tm)
if(require("stringr")==FALSE){install.packages("stringr")}
## Loading required package: stringr
##
## Attaching package: 'stringr'
## The following object is masked from 'package:qdap':
##
## %>%
library(stringr)
if(require("SnowballC")==FALSE){install.packages("SnowballC")}
## Loading required package: SnowballC
library(SnowballC)
if(require("Tmisc")==FALSE){install.packages("Tmisc")}
## Loading required package: Tmisc
library(Tmisc)
if(require("wordcloud")==FALSE){install.packages("wordcloud")}
## Loading required package: wordcloud
library(wordcloud)
It’s always best practice to source your database credentials from another R file, so they are protected. To do this, simply put this connection code into it and fill in the ODBC_NAME
, USERNAME
, & PASSWORD
for your connection. You’ll also need an ODBC connection on your host machine with your database information:
ARKABASE <- odbcConnect("ODBC_NAME",uid="USERNAME",pwd="PASSWORD"")
source('C:/Users/plapo/Documents/arkabase_conn.R')
Once you’ve connected, you can pull in all the messages. Note: This database is hosted at Amazon Web Services has aleady been loaded with Arkatechture’s Slack data using Slack’s API & Alteryx.
Using the ODBC connection to the database containing the Slack data, pull all the messages. Note: There are special records with a subtypes that are associated with specific user acitivites vs. raw messages. Therefore, we will limit our initial scope to those.
slack_messages<-sqlQuery(ARKABASE,
"SELECT *
FROM ARKATPR.SLAK_TPR.SLACK_MESSAGES")
sqlQuery(ARKABASE,
"SELECT DISTINCT SUBTYPE
FROM ARKATPR.SLAK_TPR.SLACK_MESSAGES")
## SUBTYPE
## 1 channel_topic
## 2 file_share
## 3 channel_name
## 4 pinned_item
## 5 bot_message
## 6 bot_add
## 7 <NA>
## 8 bot_remove
## 9 channel_leave
## 10 reminder_add
## 11 channel_join
## 12 reply_broadcast
## 13
## 14 file_comment
## 15 channel_archive
## 16 slackbot_response
## 17 channel_purpose
## 18 file_mention
slack_messages<-sqlQuery(ARKABASE,
"SELECT *
FROM ARKATPR.SLAK_TPR.SLACK_MESSAGES
WHERE SUBTYPE=''")
head(slack_messages,1)
## MSG_ID TIMESTAMP TYPE SUBTYPE CHANNEL
## 1 alteryx1464293931.000009 2016-05-26 20:18:51 message NA alteryx
## INVITER TEXT USER_ID
## 1 NA when is the next user group? U0J6Q16NM
We’ll also want to pull the users from our Slack data for future removal in message content at a later stage by creating some exemption variables used as custom “stop words” that are removed/replaced in the text mining. This section isn’t as critical due to the later decision to remove all Slack links below. We’re also done with our data pulls, so we will close our connection to the database.
slack_users<-sqlQuery(ARKABASE,
"SELECT *
FROM ARKATPR.SLAK_TPR.SLACK_USERS")
#creating list of @user ids
slack_users$exemption <- paste(slack_users$USER_ID,slack_users$NAME,sep="|")
exemptionfull<-slack_users$exemption
exemption<-slack_users$USER_ID
odbcClose(ARKABASE)
In order to mine the Slack messages for the top words in each, all the messages for each channel will be concatenated. A new data frame chan
will be created by creating a custom function that loops over each message concatenating them separated by spaces. This will create the body of text, or Corpus to mine for each Channel.
The iconv
function is used to convert from the latin1
extended-ASCII encoding available in Slack & R to just the basic ASCII
character set eliminating any extended characters by replacing them with the empty string ""
.
slack_messages$TEXT <- sapply(slack_messages$TEXT,function(row)
iconv(row, from="latin1", to="ASCII", sub=""))
Create a data frame with 3 variables: NAME
, Freq
, and Text
where NAME
is the name of the CHANNEL
, Freq
is the number of Slack messages, and Text
is initialized with the default value of "textarrdf"
. Note: The Channel & Messages cannot be Factor
data types & ust be chr
data types for Text Mining.
chan = data.frame(table(slack_messages$CHANNEL), stringsAsFactors = FALSE)
chan = chan[order(chan$Freq, decreasing=TRUE), ]
colnames(chan) = c("NAME", "Freq")
#chan <- unfactor(chan)
chan$Text <- "textarrdf"
Create concatenation function and loop over chan
data frame to concatenate Slack messages for each channel into one Text
and convert it into the wordCorpus
for text mining.
create_text<- function( tempchannel )
{
print(nrow(tempchannel))
# creating empty array
textarrdf = c(" ")
for(i in 1:nrow(tempchannel))
{
slackdf = tempchannel[i, "TEXT"]
textarrdf = paste(textarrdf, slackdf , sep = " ")
}
return(textarrdf)
}
# loop through the channels to aggregate the texts
for(chan_num in 1:nrow(chan))
{
# store the Channel name in variable = "CHANNEL"
CHAN <- chan[chan_num,"NAME"]
# print the iteration number and Channel name,
# just to know the program is working
print(paste("Loading Channel num =", chan_num, "| NAME =", CHAN ))
# from the name matching our current selection.
tempheadline <- subset(slack_messages, slack_messages$CHANNEL == CHAN)
# collect all the headlines by this user into a single array
chan$Text[chan_num] <- create_text(tempheadline)
}
## [1] "Loading Channel num = 1 | NAME = general"
## [1] 4460
## [1] "Loading Channel num = 2 | NAME = envirovantage"
## [1] 2819
## [1] "Loading Channel num = 3 | NAME = smb"
## [1] 2501
## [1] "Loading Channel num = 4 | NAME = random"
## [1] 2487
## [1] "Loading Channel num = 5 | NAME = marketing"
## [1] 1789
## [1] "Loading Channel num = 6 | NAME = development"
## [1] 1172
## [1] "Loading Channel num = 7 | NAME = arkabase"
## [1] 721
## [1] "Loading Channel num = 8 | NAME = vetsfirstchoice"
## [1] 559
## [1] "Loading Channel num = 9 | NAME = funcommittee"
## [1] 498
## [1] "Loading Channel num = 10 | NAME = vizzinquest"
## [1] 464
## [1] "Loading Channel num = 11 | NAME = hacknight"
## [1] 388
## [1] "Loading Channel num = 12 | NAME = product"
## [1] 368
## [1] "Loading Channel num = 13 | NAME = aws"
## [1] 344
## [1] "Loading Channel num = 14 | NAME = vizzers"
## [1] 272
## [1] "Loading Channel num = 15 | NAME = llbean"
## [1] 198
## [1] "Loading Channel num = 16 | NAME = pizzachallenge"
## [1] 157
## [1] "Loading Channel num = 17 | NAME = badges"
## [1] 121
## [1] "Loading Channel num = 18 | NAME = td"
## [1] 110
## [1] "Loading Channel num = 19 | NAME = testing"
## [1] 106
## [1] "Loading Channel num = 20 | NAME = blackbaud"
## [1] 75
## [1] "Loading Channel num = 21 | NAME = jira-troubles"
## [1] 74
## [1] "Loading Channel num = 22 | NAME = light"
## [1] 73
## [1] "Loading Channel num = 23 | NAME = paydiant"
## [1] 59
## [1] "Loading Channel num = 24 | NAME = sqltroubles"
## [1] 58
## [1] "Loading Channel num = 25 | NAME = awstroubles"
## [1] 55
## [1] "Loading Channel num = 26 | NAME = alteryx"
## [1] 54
## [1] "Loading Channel num = 27 | NAME = edx"
## [1] 47
## [1] "Loading Channel num = 28 | NAME = projectdino"
## [1] 46
## [1] "Loading Channel num = 29 | NAME = ev-bitbucket"
## [1] 42
## [1] "Loading Channel num = 30 | NAME = devops"
## [1] 30
## [1] "Loading Channel num = 31 | NAME = gamenight"
## [1] 28
## [1] "Loading Channel num = 32 | NAME = lunchtime"
## [1] 25
## [1] "Loading Channel num = 33 | NAME = rant"
## [1] 23
## [1] "Loading Channel num = 34 | NAME = pie"
## [1] 21
## [1] "Loading Channel num = 35 | NAME = safran-aero"
## [1] 7
## [1] "Loading Channel num = 36 | NAME = tipsandtricks"
## [1] 5
## [1] "Loading Channel num = 37 | NAME = chriss-weird-diet"
## [1] 3
## [1] "Loading Channel num = 38 | NAME = makeachange"
## [1] 3
## [1] "Loading Channel num = 39 | NAME = pranks"
## [1] 2
## [1] "Loading Channel num = 40 | NAME = slackalytics"
## [1] 2
rm(tempheadline,chan_num,CHAN)
Slack uses special message formatting that replaces the characters &
, <
, and >
with &
, <
, and >
respectively. In addition, all links are wrapped in <...>
. Therefore, the first step is to remove any string that’s a link to something else. Note: Later we may come back and quantify these, but we will remove them for now, so we can visualize the words in the message only.
removeSlackLinks <- function(doc) {
gsub( " *<.*?> *", "", doc)
}
chan$Text<-gsub( " *<.*?> *", "", chan$Text)
So far we’ve been using a traditional R dataframe. In order to use many of R’s text mining capabilities, we’ll need to convert the text into a different class: Corpus
.
wordCorpus <- Corpus(VectorSource(chan$Text))
Whitespace, such as tabs, multiple spaces between words, and line breaks many help with the formatting & structure of the document; however, they don’t help us extract the meaning behind the words, so we will remove them. Also, upper case & lower case characters are encoded differently. This is not a distinction that provides differentiation in our analysis, so we will stanndardize the case of all letters to lower.
wordCorpus <- tm_map(wordCorpus, stripWhitespace)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
Common words don’t help us understand the topic or meaning in a sentence. For example, in the previous sentence, the words don’t, us, the, or, in, and a could all be removed and we’d still understand the topic: Common words help understand topic meaning sentence. The removal of don’t does invert the point that the author is making; howerver, the fundamental concept is retained from the remining words. More importantly, these common words have no discriminatory power in differentiating one topic or meaning from another.
wordCorpus <- tm_map(wordCorpus, removeWords, stopwords("english"))
Finally, since this is our initial pass at analyzing the text, we’re going simplify our scope and stick to just atomic words and eliminate any numbers or punctuation.
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, removeNumbers)
The final step in processing the text is to “stem” the words. Stemming further standardizes slight variations of the same word. For example, argue, argued, argues, & arguing are all essentially the same verb with different tenses. Stemming reduces these 4 different verbs to the common stem of argu.
wordCorpus <- tm_map(wordCorpus, stemDocument)
Now that our coprora for each channel are standardized as much as possible, we can visualize each.
# loop through the channels to aggregate the texts
for(chan_num in 1:nrow(chan))
{
# store the Channel name in variable = "CHANNEL"
CHAN <- chan[chan_num,"NAME"]
# print the iteration number and Channel name,
# just to know the program is working
print(paste("Loading Channel num =", chan_num, "| NAME =", CHAN ))
# load the corpus for the channel into a temporary corpus
tempCorpus <- tm_map(wordCorpus[chan_num], stemDocument)
# check that there are words in the corpus
myTdm <- TermDocumentMatrix(tempCorpus)
# create the word cloud from the temporary corpus if it contains words
if (myTdm$nrow > 0) {
wordcloud(tempCorpus, scale=c(4, 0.01), random.order=FALSE,
rot.per=0.35, use.r.layout=FALSE, colors = brewer.pal(8, "Dark2"))
}
}
## [1] "Loading Channel num = 1 | NAME = general"
## [1] "Loading Channel num = 2 | NAME = envirovantage"