22 June 2021 - grep, sub, regular expressions

lesson designed by TJ Boutorwick

some explanations modified by Stephen

Finding text in R with grep

R can be used to manipulate text. It has built-in functions for this, notably grep and sub. Examples of grep:

strings <- c("Hello", "where are you going", "goiing?", "goi9ng")

# grep returns the position of each element that matches; here only element 1 contains "Hello"
grep("Hello", strings)

## [1] 1
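As a small extension of the example above, the sketch below shows two closely related base R calls: grep with value = TRUE, which returns the matching elements rather than their positions, and grepl, which returns TRUE or FALSE for every element. The object names idx, hits, and flags are made up for illustration.

```r
strings <- c("Hello", "where are you going", "goiing?", "goi9ng")

# positions of the elements that contain "goi"
idx <- grep("goi", strings)

# the matching elements themselves
hits <- grep("goi", strings, value = TRUE)

# one TRUE/FALSE per element of strings
flags <- grepl("goi", strings)

idx
hits
flags
```

grepl is handy when you want to use the result to subset something else, because its output always has the same length as the input.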

Replacing text in R with sub

sub is used to find and replace text in a string. Some examples of sub are below.

strings_changed <- sub("Hello", "hi", strings)
# notice "Hello" has changed to "hi"
strings_changed

## [1] "hi"                  "where are you going" "goiing?"            
## [4] "goi9ng"

strings_changed2 <- sub("goiing", "going", strings)

# notice change to "goiing"
strings_changed2

## [1] "Hello"               "where are you going" "going?"             
## [4] "goi9ng"
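One thing worth knowing before going further: sub only replaces the first match in each string, whereas gsub replaces every match. The short sketch below uses a made-up vector called phrases to show the difference.

```r
phrases <- c("going going gone", "Hello")

# sub: only the first "going" in each string is replaced
first_only <- sub("going", "went", phrases)

# gsub: every "going" in each string is replaced
every_one <- gsub("going", "went", phrases)

first_only
every_one
```

Strings with no match (like "Hello") come back unchanged from both functions.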

Regular expressions

In the previous example, we knew that the misspelling “goiing” (with two i’s) existed, so it was easy to change. However, sometimes you may not know what the misspelling is. In this case, regular expressions are convenient. These can be thought of as placeholders that represent different characters. Here is a way to match any single character in the position of the extra ‘i’ in ‘goiing’:

strings_changed3 <- sub("goi.ng", "going", strings)

# the `.` means "match any single character", so it has matched the extra "i" in "goiing" and the "9" in "goi9ng"
strings_changed3

## [1] "Hello"               "where are you going" "going?"             
## [4] "going"

Often you may want to see which character was replaced. You can do that by wrapping that part of the pattern in parentheses (a capture group) and referring to it in the replacement with \\1:

strings_changed4 <- sub("goi(.)ng", "\\1", strings)

# what is different here? The matched text is now replaced by the captured character itself. Text outside the match (the "?" in "goiing?") is kept, and strings with no match are returned unchanged.
strings_changed4

## [1] "Hello"               "where are you going" "i?"                 
## [4] "9"
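If you only want the captured character, and only for the strings that actually matched, one possible approach is sketched below. It filters with grepl first and then lets `.*` on both sides of the pattern swallow the rest of the string. The names has_match and extra are illustrative, not part of the original lesson.

```r
strings <- c("Hello", "where are you going", "goiing?", "goi9ng")

# which strings contain "goi", then any one character, then "ng"?
has_match <- grepl("goi(.)ng", strings)

# ".*" on either side matches the rest of the string,
# so the whole string is replaced by the captured character
extra <- sub(".*goi(.)ng.*", "\\1", strings[has_match])

has_match
extra
```

Note that "where are you going" does not match, because "going" has nothing between "goi" and "ng".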

This should be all we need to start looking at some data (for now).

Course Learning Objectives

The dataset we are looking at consists of course learning objective (CLO) data for all courses in FHSS. The courses were those offered in the first trimester of 2021.

The question we are focusing on is: To what extent is Bloom’s taxonomy present in the CLOs? In a nutshell, we want to make a frequency list of all occurrences of Bloom’s taxonomy levels.

First, import the data into a variable called dat. NOTE: the file is tab separated.

(You can download the data here: FHSS CLOs)

## import whole dataset
# if you have downloaded the file, you can read the local copy instead:
# dat <- read.csv("closDataset.csv", sep = "\t")

dat <- read.csv('https://www.stephenskalicky.com/r_data/closDataset.csv', sep = "\t")

# note how the course code is entered - the last three digits are the number, where the first of these three digits tells us the level of the course (i.e., 1 = first year, 2 = second year, etc.).
dat$COURSE[1:10]

##  [1] "ALIN591" "ALIN592" "ALIN690" "ANTH100" "ANTH101" "ANTH102" "ANTH200"
##  [8] "ANTH201" "ANTH204" "ANTH208"
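To see what sep = "\t" is doing without downloading anything, here is a minimal sketch that reads a tiny tab-separated table from a string. The toy rows are invented for illustration; only the column names COURSE and CLO come from the real dataset. read.delim is a base R shortcut whose default separator is already a tab.

```r
# a made-up two-row table; "\t" is a tab and "\n" is a new line
toy_tsv <- "COURSE\tCLO\nANTH101\tIdentify and analyse ideas\nALIN591\tUndertake research"

# read.csv with an explicit tab separator
dat_toy <- read.csv(text = toy_tsv, sep = "\t", stringsAsFactors = FALSE)

# read.delim uses sep = "\t" by default
dat_alt <- read.delim(text = toy_tsv, stringsAsFactors = FALSE)

dat_toy
dat_alt$COURSE
```

stringsAsFactors = FALSE is already the default from R 4.0 onwards; it is written out here so the sketch behaves the same on older versions.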

We will need to create a new variable year so we know which year a course is. Examine the COURSE variable and see what it looks like. You should see four letters followed by three digits. We want to get the first digit after the four letters. How can we do that? (Remember that the ‘.’ matches any character)

# create a new column: match a string seven characters long and capture the fifth character
dat$year <- sub("....(.)..", "\\1", dat$COURSE)

# compare this to the dat$COURSE output above - we are capturing the fifth character from every course name. 
dat$year[1:10]

##  [1] "5" "5" "6" "1" "1" "1" "2" "2" "2" "2"

It may be good to compare between undergraduate and postgraduate courses. Let’s create a new variable level and have it be “UG” if the year is less than 4, and “PG” otherwise. First convert year to a number.

# convert the column from character to numeric
dat$year <- as.numeric(dat$year)

Use ifelse to make a new column. (Bonus: how could you do this with mutate and a tidyverse approach?)

# use the base R ifelse function - note this is different from the tidyverse if_else but functions almost the same.
dat$level <- ifelse(dat$year < 4, "UG", "PG")

# seems to be working
dat$level[1:10]

##  [1] "PG" "PG" "PG" "UG" "UG" "UG" "UG" "UG" "UG" "UG"

dat$year[1:10]

##  [1] 5 5 6 1 1 1 2 2 2 2
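One detail to keep in mind with ifelse: if a year is missing (NA), the comparison is also NA, so the new value is NA rather than "UG" or "PG". The sketch below uses made-up vectors toy_year and toy_level to show this, and counts the result with table.

```r
# a made-up set of years, including one missing value
toy_year <- c(5, 1, NA, 3, 4)

# NA < 4 is NA, so ifelse returns NA for that element
toy_level <- ifelse(toy_year < 4, "UG", "PG")
toy_level

# useNA = "ifany" makes table report missing values as well
level_counts <- table(toy_level, useNA = "ifany")
level_counts
```

A quick table like this is a cheap check that every course ended up in one of the two groups.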

Now we can move on to the CLOs. The first thing to do is to split each CLO into words and make them lowercase. The “\W” regular expression matches any non-word character in a string: it matches things like whitespace and punctuation, but not letters, numbers, or the underscore. The strsplit function splits strings based on a defined criterion. If we use “\\W”, we are telling strsplit to split the text at every non-word character. In other words, we are effectively splitting on whitespace and punctuation, which is why empty strings ("") appear in the output wherever two non-word characters sit next to each other (for example, a comma followed by a space).

## separate CLOs into their own list

# the backslash must be doubled ("\\W") because "\" is itself an escape character inside an R string
clos <- strsplit(dat$CLO, "\\W")

# check one - note that "Undertake" is capitalized
clos[5]

## [[1]]
##  [1] "1"               "Identify"        "and"             "analyse"        
##  [5] "fundamental"     "ideas"           ""                "concepts"       
##  [9] ""                "and"             "research"        "practices"      
## [13] "of"              "contemporary"    "social"          "and"            
## [17] "cultural"        "anthropology"    ""                ""               
## [21] "2"               "Undertake"       "research"        "utilising"      
## [25] "an"              "anthropological" "perspective"     "and"            
## [29] "communicate"     "your"            "findings"        "in"             
## [33] "verbal"          "and"             "written"         "form"           
## [37] ""                ""                "3"               "Explain"        
## [41] "how"             "different"       "aspects"         "of"             
## [45] "culture"         "relate"          "to"              "one"            
## [49] "another"         "and"             "are"             "integrated"     
## [53] "in"              "a"               "cultural"        "system"
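The empty strings in the output come from places where two non-word characters sit side by side. The sketch below reproduces this on a single made-up sentence (toy_clo) and shows two possible ways to avoid the empty strings: splitting on runs of non-word characters with "\\W+", or dropping empty strings afterwards with nzchar.

```r
# a made-up CLO with punctuation followed by spaces
toy_clo <- "1. Identify and analyse ideas, concepts, and practices."

# split at every single non-word character
pieces <- strsplit(toy_clo, "\\W")[[1]]

# split at runs of one or more non-word characters
words <- strsplit(toy_clo, "\\W+")[[1]]

# or keep the first split and drop the empty strings
pieces_clean <- pieces[nzchar(pieces)]

pieces
words
pieces_clean
```

For this lesson the empty strings are harmless, because they are filtered out later anyway, but "\\W+" keeps the intermediate output easier to read.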

Let’s change all of the words to lowercase using lapply - this function will take a single function and “apply” it to a list of things. This means we will “apply” the function tolower to everything in the list clos.

clos <- lapply(clos, tolower)

# what happened to "Undertake"?
clos[5]

## [[1]]
##  [1] "1"               "identify"        "and"             "analyse"        
##  [5] "fundamental"     "ideas"           ""                "concepts"       
##  [9] ""                "and"             "research"        "practices"      
## [13] "of"              "contemporary"    "social"          "and"            
## [17] "cultural"        "anthropology"    ""                ""               
## [21] "2"               "undertake"       "research"        "utilising"      
## [25] "an"              "anthropological" "perspective"     "and"            
## [29] "communicate"     "your"            "findings"        "in"             
## [33] "verbal"          "and"             "written"         "form"           
## [37] ""                ""                "3"               "explain"        
## [41] "how"             "different"       "aspects"         "of"             
## [45] "culture"         "relate"          "to"              "one"            
## [49] "another"         "and"             "are"             "integrated"     
## [53] "in"              "a"               "cultural"        "system"
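To make the behaviour of lapply concrete, here is a tiny sketch on a made-up list (toy) with two elements. It assumes nothing beyond base R: lapply visits each element in turn, calls the function on it, and returns a list of the same length.

```r
# a made-up list: one element has two words, the other has one
toy <- list(c("Identify", "AND"), "Undertake")

# pass the function by name
lower1 <- lapply(toy, tolower)

# the same thing written with an anonymous function
lower2 <- lapply(toy, function(x) tolower(x))

lower1
identical(lower1, lower2)
length(lower1)
```

The second form looks more verbose here, but it becomes useful when the function needs extra arguments.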

Using what we know so far, we can create stop-lists for each level of Bloom’s taxonomy. There is a guide from the university about writing CLOs. It includes a table at the bottom of common words for each level of Bloom’s taxonomy:

Bloom’s Taxonomy

Generate a set of patterns which include key words associated with each level of Bloom’s taxonomy. The ^ is a regex anchor which means “start of the string” and the | means OR.

Each object is a single string containing several patterns joined by | (or), so one search can check for all of the key words of a level at once. Because the patterns are anchored only at the start, they also match longer words that begin with a key word (e.g. “identifying”).

know <- "^write|^list|^label|^name|^state|^define|^recognise|^characterise|^correct|^establish|^identify|^infer|^match"

comprehend <- "^explain|^summarise|^paraphrase|^describe|^illustrate|^interpret|^classify"

apply <- "^use|^compute|^solve|^demonstrate|^construct|^execute|^implement"

analyse <- "^categorise|^compare|^contrast|^separate|^differentiate|^organise|^attribute"

synthesise <- "^plan|^integrate|^formulate|^theorise|^design|^build"

evaluate <- "^judge|^recommend|^critique|^justify|^check"
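To check what ^ and | actually do, the sketch below builds a small pattern from a vector of key words with paste0 and tries it on a few test words. The names key_words, toy_pattern, and test_words are invented for this illustration; building the pattern with paste0 is just one possible alternative to typing it out by hand.

```r
# two key words, chosen from the apply level above
key_words <- c("use", "solve")

# put "^" in front of each word and join them with "|"
toy_pattern <- paste0("^", key_words, collapse = "|")
toy_pattern

test_words <- c("use", "reuse", "useful", "solve", "resolve")

# TRUE where the word starts with one of the key words
starts_with_key <- grepl(toy_pattern, test_words)
starts_with_key
```

"reuse" and "resolve" are not matched because the key word is not at the start, while "useful" is matched even though it is a different word.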

Using these stop-lists, we can now change all of the matching words in clos to their Bloom’s taxonomy levels. This step introduces a couple of new(ish) concepts. First look at the fifth CLO in clos - notice that the second word is “identify”, which is part of the know pattern above.

clos[5]

## [[1]]
##  [1] "1"               "identify"        "and"             "analyse"        
##  [5] "fundamental"     "ideas"           ""                "concepts"       
##  [9] ""                "and"             "research"        "practices"      
## [13] "of"              "contemporary"    "social"          "and"            
## [17] "cultural"        "anthropology"    ""                ""               
## [21] "2"               "undertake"       "research"        "utilising"      
## [25] "an"              "anthropological" "perspective"     "and"            
## [29] "communicate"     "your"            "findings"        "in"             
## [33] "verbal"          "and"             "written"         "form"           
## [37] ""                ""                "3"               "explain"        
## [41] "how"             "different"       "aspects"         "of"             
## [45] "culture"         "relate"          "to"              "one"            
## [49] "another"         "and"             "are"             "integrated"     
## [53] "in"              "a"               "cultural"        "system"

Again we use lapply to “apply” a single function to a lot of things. In this case, we are using an anonymous function, which is a function that does not have a name and is not saved permanently. Below, the anonymous function runs gsub on x, where x is each element of clos in turn. The pattern to search for is the object know (see above), and the replacement is the word “know”.

## apply each of the above to the clos, changing the words
clos <- lapply(clos, function(x) gsub(know, "know", x))

# identify has been changed to know
clos[5]

## [[1]]
##  [1] "1"               "know"            "and"             "analyse"        
##  [5] "fundamental"     "ideas"           ""                "concepts"       
##  [9] ""                "and"             "research"        "practices"      
## [13] "of"              "contemporary"    "social"          "and"            
## [17] "cultural"        "anthropology"    ""                ""               
## [21] "2"               "undertake"       "research"        "utilising"      
## [25] "an"              "anthropological" "perspective"     "and"            
## [29] "communicate"     "your"            "findings"        "in"             
## [33] "verbal"          "and"             "written"         "form"           
## [37] ""                ""                "3"               "explain"        
## [41] "how"             "different"       "aspects"         "of"             
## [45] "culture"         "relate"          "to"              "one"            
## [49] "another"         "and"             "are"             "integrated"     
## [53] "in"              "a"               "cultural"        "system"
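Note that gsub replaces only the part of the word that the pattern matched. The sketch below uses a shortened, made-up version of the pattern (know_toy) to show what happens to longer words, and compares it with a whole-word pattern that is offered only as a possible alternative, not as the approach used in this lesson.

```r
# a shortened version of the know pattern, for illustration only
know_toy <- "^identify|^infer"

toy_words <- c("identify", "identifying", "identity", "infer", "inference")

# only the matched prefix is replaced, the rest of the word is kept
prefix_replaced <- gsub(know_toy, "know", toy_words)
prefix_replaced

# alternative: "$" requires the word to end right after the key word
whole_word <- gsub("^(identify|infer)$", "know", toy_words)
whole_word
```

With the prefix pattern, "identifying" becomes "knowing" and "inference" becomes "knowence", so these words will not be exactly equal to "know" later on.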

Repeat this for each pattern representing a level of Bloom’s taxonomy.

clos <- lapply(clos, function(x) gsub(comprehend, "comprehend", x))
clos <- lapply(clos, function(x) gsub(apply, "apply", x))
clos <- lapply(clos, function(x) gsub(analyse, "analyse", x))
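# the replacement must be spelled exactly "synthesise"; a misspelling here would be dropped by the keep_words filter below and undercount this level
clos <- lapply(clos, function(x) gsub(synthesise, "synthesise", x))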
clos <- lapply(clos, function(x) gsub(evaluate, "evaluate", x))

Now that we know which words we want to keep, we can create a variable keep_words which will contain them. We know that we have changed all of the words in each level to just one word, so we can search for just those words.

keep_words <- "^know$|^comprehend$|^apply$|^analyse$|^synthesise$|^evaluate$"
keep_words

## [1] "^know$|^comprehend$|^apply$|^analyse$|^synthesise$|^evaluate$"

Next, we simply have to keep the words in clos that are in keep_words.

clos <- lapply(clos, function(x) grep(keep_words, x, value = TRUE))

clos[5]

## [[1]]
## [1] "know"       "analyse"    "comprehend"

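Finally, let’s view the results in table form after saving them to a variable clos_freq. Only exact matches to keep_words are counted, so words that were only partly replaced (e.g. “identifying”, which becomes “knowing”) are left out.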

clos_freq <- table(unlist(clos))

clos_freq

## 
##    analyse      apply comprehend   evaluate       know synthesise 
##        614       1579        548        453        667         36
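The level variable created earlier can be brought back in at this point. The sketch below is one possible way to count taxonomy levels separately for UG and PG courses. It uses small made-up stand-ins (toy_clos and toy_level) rather than the real data, and assumes the list of CLOs is in the same order as the rows of dat.

```r
# made-up stand-ins: three courses with their kept words and their level
toy_clos  <- list(c("know", "apply"), "apply", c("evaluate", "apply"))
toy_level <- c("UG", "UG", "PG")

# repeat each course's level once per kept word, so the two vectors line up
level_per_word <- rep(toy_level, lengths(toy_clos))

# cross-tabulate level against taxonomy word
toy_counts <- table(level = level_per_word, word = unlist(toy_clos))
toy_counts

# overall frequencies, sorted from most to least common
toy_freq <- sort(table(unlist(toy_clos)), decreasing = TRUE)
toy_freq
```

With the real data, the same idea would use dat$level and clos in place of the toy objects.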

first encounters with regular expressions

Stephen Skalicky

28/06/2021