R can be used to manipulate text. It has built-in functions for this, notably grep and sub. An example of grep:
strings <- c("Hello", "where are you going", "goiing?", "goi9ng")
# the output is the position of the matching element ("Hello" is the first element)
grep("Hello", strings)
## [1] 1
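By default grep returns positions rather than the matching text. Here is a minimal sketch of two common variations, value = TRUE and grepl, using the same strings vector (the pattern "goi" is chosen only for illustration):

```r
strings <- c("Hello", "where are you going", "goiing?", "goi9ng")

# positions of the elements that contain "goi"
idx <- grep("goi", strings)

# the matching elements themselves, instead of their positions
vals <- grep("goi", strings, value = TRUE)

# one TRUE/FALSE per element, which is handy for subsetting or counting
flags <- grepl("goi", strings)

print(idx)
print(vals)
print(flags)
```

Every element except "Hello" contains "goi", so all three results point at elements 2 to 4.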
sub is used to find and replace text in a string. Some examples of sub are below.
strings_changed <- sub("Hello", "hi", strings)
# notice "Hello" has changed to "hi"
strings_changed
## [1] "hi" "where are you going" "goiing?"
## [4] "goi9ng"
strings_changed2 <- sub("goiing", "going", strings)
# notice change to "goiing"
strings_changed2
## [1] "Hello" "where are you going" "going?"
## [4] "goi9ng"
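One detail worth knowing: sub only replaces the first match inside each string, while gsub replaces every match. A small sketch with a made-up string that contains the misspelling twice:

```r
# hypothetical example string with two misspellings
x <- "goiing, goiing, gone"

# sub() stops after the first match in the string
first_only <- sub("goiing", "going", x)

# gsub() keeps going until every match has been replaced
all_matches <- gsub("goiing", "going", x)

print(first_only)
print(all_matches)
```

In the strings vector above each element has at most one match, so sub and gsub would give the same result there.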
In the previous example, we knew that the word “goiing” (with two i’s) existed and it was easy to change. However, sometimes you may not know what the misspelling is. In this case, regular expressions are convenient. These can be thought of as placeholders that represent different characters. Here is a way to match any character in the position of the extra letter in ‘goiing’:
strings_changed3 <- sub("goi.ng", "going", strings)
# the `.` means "match any single character", so it has matched the extra "i" and also the "9"
strings_changed3
## [1] "Hello" "where are you going" "going?"
## [4] "going"
Often you may want to see which character was replaced. You can do that by wrapping part of the pattern in parentheses (a capture group) and referring to it with \\1 in the replacement:
strings_changed4 <- sub("goi(.)ng", "\\1", strings)
# what is different here? The matched text is now replaced by the captured character.
# Only the matched part changes, which is why the unmatched "?" is still there in "i?".
strings_changed4
## [1] "Hello" "where are you going" "i?"
## [4] "9"
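Capture groups can also be used to keep the parts around the unwanted character. The following is a sketch of that idea, not part of the original walkthrough: it captures "goi" and "ng" separately and glues them back together, dropping whatever sat between them.

```r
strings <- c("Hello", "where are you going", "goiing?", "goi9ng")

# \\1 is the first group ("goi"), \\2 is the second group ("ng");
# the single character matched by "." is left out of the replacement
fixed <- sub("(goi).(ng)", "\\1\\2", strings)

print(fixed)
```

Strings that do not match the pattern, such as "where are you going", are returned unchanged.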
This should be all we need to start looking at some data (for now).
The dataset we are looking at consists of CLO data for all courses in FHSS. The courses were those offered in the first trimester of 2021.
The question we are focusing on is: To what extent is Bloom’s taxonomy present in the CLOs? In a nutshell, we want to make a frequency list of all occurrences of Bloom’s taxonomy levels.
Load the data into an object named dat.
NOTE: the file is tab-separated. (You can download the data here: FHSS CLOs)
## import whole dataset
#dat <- read.csv("closDataset.csv", sep="\t")
dat <- read.csv('https://www.stephenskalicky.com/r_data/closDataset.csv', sep = "\t")
# note how the course code is entered - the last three digits are the number, where the first of these three digits tells us the level of the course (i.e., 1 = first year, 2 = second year, etc.).
dat$COURSE[1:10]
## [1] "ALIN591" "ALIN592" "ALIN690" "ANTH100" "ANTH101" "ANTH102" "ANTH200"
## [8] "ANTH201" "ANTH204" "ANTH208"
Create a new column named year so we know which year a course is. Examine the COURSE variable and see what it looks like. You should see four letters followed by a series of numbers. We want to get the first number after the four letters. How can we do that? (Remember that the ‘.’ matches any character.)
# create a new column by matching the seven-character course code and capturing its fifth character
dat$year <- sub("....(.)..", "\\1", dat$COURSE)
# compare this to the dat$COURSE output above - we are capturing the fifth character from every course name.
dat$year[1:10]
## [1] "5" "5" "6" "1" "1" "1" "2" "2" "2" "2"
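The dot pattern works because every course code has the same shape. As a sketch of two stricter alternatives (assuming, as in the output above, that codes are four capital letters followed by three digits), you can anchor the pattern and use character classes, or skip regex and take the fifth character by position:

```r
# a few course codes copied from the output above
codes <- c("ALIN591", "ANTH100", "ANTH208")

# anchored pattern: four capital letters, one captured digit, two more digits
year_regex <- sub("^[A-Z]{4}([0-9])[0-9]{2}$", "\\1", codes)

# position-based alternative: characters 5 to 5 of each code
year_substr <- substr(codes, 5, 5)

# a code that does not fit the pattern is returned unchanged by sub()
odd_code <- sub("^[A-Z]{4}([0-9])[0-9]{2}$", "\\1", "ANT10")

print(year_regex)
print(year_substr)
print(odd_code)
```

The anchored version makes malformed codes easy to spot, because they come back untouched instead of being silently shortened.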
Create a new column named level and have it be “UG” if the year is less than 4, and “PG” otherwise. First convert year to a number.
# convert the column to a number (as.numeric returns a double, not an integer)
dat$year <- as.numeric(dat$year)
Use ifelse to make a new column (bonus: how could you do this with mutate and a tidyverse approach?)
# use base R's ifelse() - note this is different from dplyr's if_else(), but it works almost the same way.
dat$level <- ifelse(dat$year < 4, "UG", "PG")
# seems to be working
dat$level[1:10]
## [1] "PG" "PG" "PG" "UG" "UG" "UG" "UG" "UG" "UG" "UG"
dat$year[1:10]
## [1] 5 5 6 1 1 1 2 2 2 2
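For the bonus question, one possible answer is sketched below on a small made-up data frame (toy is hypothetical, not the real dat). It assumes the dplyr package is installed and falls back to the base R approach when it is not.

```r
# toy stand-in for dat, using course codes from the output above
toy <- data.frame(COURSE = c("ALIN591", "ANTH100", "ANTH208"))
toy$year <- as.numeric(substr(toy$COURSE, 5, 5))

if (requireNamespace("dplyr", quietly = TRUE)) {
  # tidyverse approach: mutate() adds the column, if_else() picks the label
  toy <- dplyr::mutate(toy, level = dplyr::if_else(year < 4, "UG", "PG"))
} else {
  # base R fallback, same logic as in the text
  toy$level <- ifelse(toy$year < 4, "UG", "PG")
}

print(toy)
```

dplyr's if_else is stricter than ifelse: both outcomes must be of the same type, which can catch mistakes early.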
The strsplit function splits strings based on a defined criterion. If we use “\\W” we are telling strsplit to split wherever it finds a non-word character, meaning anything other than a letter, digit, or underscore. In other words, we are mostly splitting on whitespace, but punctuation counts too, which is why empty strings appear in the output.
## separate CLOs into their own list
# the backslash in "\W" must itself be escaped inside an R string, so we write "\\W"
clos <- strsplit(dat$CLO, "\\W")
# check one - note that "Undertake" is capitalized
clos[5]
## [[1]]
## [1] "1" "Identify" "and" "analyse"
## [5] "fundamental" "ideas" "" "concepts"
## [9] "" "and" "research" "practices"
## [13] "of" "contemporary" "social" "and"
## [17] "cultural" "anthropology" "" ""
## [21] "2" "Undertake" "research" "utilising"
## [25] "an" "anthropological" "perspective" "and"
## [29] "communicate" "your" "findings" "in"
## [33] "verbal" "and" "written" "form"
## [37] "" "" "3" "Explain"
## [41] "how" "different" "aspects" "of"
## [45] "culture" "relate" "to" "one"
## [49] "another" "and" "are" "integrated"
## [53] "in" "a" "cultural" "system"
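The empty strings in the output come from places where two non-word characters sit next to each other, such as a comma followed by a space. A sketch of one way to avoid them (an alternative, not what the walkthrough does) is to split on one or more non-word characters with "\\W+":

```r
# made-up fragment in the style of a CLO
x <- "ideas, concepts, and research."

# splitting on a single non-word character leaves "" between "," and " "
single <- strsplit(x, "\\W")[[1]]

# splitting on a run of non-word characters avoids the empty strings
run <- strsplit(x, "\\W+")[[1]]

print(single)
print(run)
```

The empty strings are harmless for the analysis here, because they are filtered out later along with every other word that is not a taxonomy level.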
Let’s change all of the words to lowercase using lapply - this function will take a single function and “apply” it to each element of a list. This means we will “apply” the function tolower to everything in the list clos.
clos <- lapply(clos, tolower)
# what happened to "Undertake"?
clos[5]
## [[1]]
## [1] "1" "identify" "and" "analyse"
## [5] "fundamental" "ideas" "" "concepts"
## [9] "" "and" "research" "practices"
## [13] "of" "contemporary" "social" "and"
## [17] "cultural" "anthropology" "" ""
## [21] "2" "undertake" "research" "utilising"
## [25] "an" "anthropological" "perspective" "and"
## [29] "communicate" "your" "findings" "in"
## [33] "verbal" "and" "written" "form"
## [37] "" "" "3" "explain"
## [41] "how" "different" "aspects" "of"
## [45] "culture" "relate" "to" "one"
## [49] "another" "and" "are" "integrated"
## [53] "in" "a" "cultural" "system"
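To see what lapply does on a smaller scale, here is a sketch with a hypothetical two-element list. lapply always returns a list with one result per input element, while sapply tries to simplify the result to a vector.

```r
# toy list: each element is a character vector of words
toy <- list(c("Identify", "AND"), "Explain")

# tolower() is called once per element; the result is again a list
lowered <- lapply(toy, tolower)

# sapply() simplifies the per-element results into a plain vector
n_words <- sapply(toy, length)

print(lowered)
print(n_words)
```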
Generate a set of patterns which include key words associated with each level of Bloom’s taxonomy. The ^ is a regex anchor which means “start of a string” and the | means OR. So each object is a single string of patterns we can search for using regex, and because we have the | (or) operator, we can search for all of them at once.
know <- "^write|^list|^label|^name|^state|^define|^recognise|^characterise|^correct|^establish|^identify|^infer|^match"
comprehend <- "^explain|^summarise|^paraphrase|^describe|^illustrate|^interpret|^classify"
apply <- "^use|^compute|^solve|^demonstrate|^construct|^execute|^implement"
analyse <- "^categorise|^compare|^contrast|^separate|^differentiate|^organise|^attribute"
synthesise <- "^plan|^integrate|^formulate|^theorise|^design|^build"
evaluate <- "^judge|^recommend|^critique|^justify|^check"
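Because these patterns anchor only the start of the string, they also match longer words that begin with a key word. The sketch below shows that behaviour on made-up words, and how a stricter whole-word pattern could be built with paste0 (the know_words vector is a shortened, hypothetical stand-in for the full list):

```r
pattern <- "^list|^name"
words <- c("list", "listen", "enlist", "name")

# "listen" matches because it starts with "list"; "enlist" does not
loose <- grepl(pattern, words)

# build a whole-word pattern by adding "$" (end of string) to each key word
know_words <- c("write", "list", "name")
strict_pattern <- paste0("^", know_words, "$", collapse = "|")
strict <- grepl(strict_pattern, words)

print(loose)
print(strict_pattern)
print(strict)
```

Whether prefix matching is wanted depends on the analysis: it catches inflected forms, but it can also pick up unrelated words.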
Look at clos[5] again - notice that the second word is “identify”, which is part of the know object above.
clos[5]
## [[1]]
## [1] "1" "identify" "and" "analyse"
## [5] "fundamental" "ideas" "" "concepts"
## [9] "" "and" "research" "practices"
## [13] "of" "contemporary" "social" "and"
## [17] "cultural" "anthropology" "" ""
## [21] "2" "undertake" "research" "utilising"
## [25] "an" "anthropological" "perspective" "and"
## [29] "communicate" "your" "findings" "in"
## [33] "verbal" "and" "written" "form"
## [37] "" "" "3" "explain"
## [41] "how" "different" "aspects" "of"
## [45] "culture" "relate" "to" "one"
## [49] "another" "and" "are" "integrated"
## [53] "in" "a" "cultural" "system"
Write a function again using lapply to “apply” a single function to a lot of things. In this case, we are using an anonymous function, which is a function that does not have a name and is not saved permanently. Below, we apply an anonymous function that calls gsub on x, where x is each element of clos in turn. The pattern to search for is the object know (see above), and the replacement is the word “know”.
## apply each of the above to the clos, changing the words
clos <- lapply(clos, function(x) gsub(know, "know", x))
# identify has been changed to know
clos[5]
## [[1]]
## [1] "1" "know" "and" "analyse"
## [5] "fundamental" "ideas" "" "concepts"
## [9] "" "and" "research" "practices"
## [13] "of" "contemporary" "social" "and"
## [17] "cultural" "anthropology" "" ""
## [21] "2" "undertake" "research" "utilising"
## [25] "an" "anthropological" "perspective" "and"
## [29] "communicate" "your" "findings" "in"
## [33] "verbal" "and" "written" "form"
## [37] "" "" "3" "explain"
## [41] "how" "different" "aspects" "of"
## [45] "culture" "relate" "to" "one"
## [49] "another" "and" "are" "integrated"
## [53] "in" "a" "cultural" "system"
Repeat this for each pattern representing a level of Bloom’s taxonomy.
clos <- lapply(clos, function(x) gsub(comprehend, "comprehend", x))
clos <- lapply(clos, function(x) gsub(apply, "apply", x))
clos <- lapply(clos, function(x) gsub(analyse, "analyse", x))
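# the replacement must be spelled exactly as in keep_words, otherwise these words are
# dropped by the later filter (so this spelling affects the synthesise count)
clos <- lapply(clos, function(x) gsub(synthesise, "synthesise", x))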
clos <- lapply(clos, function(x) gsub(evaluate, "evaluate", x))
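Typing the replacement word by hand for every level invites spelling slips. One way around that, sketched here on toy data with shortened hypothetical patterns, is to store the patterns in a named vector and loop over the names so that each label is written only once:

```r
# shortened, hypothetical patterns; the names double as replacement words
patterns <- c(
  know = "^identify|^define",
  comprehend = "^explain|^describe"
)

# toy stand-in for clos
toy <- list(c("identify", "and", "explain"), c("describe", "research"))

for (label in names(patterns)) {
  # replace anything matching this level's pattern with the level's name
  toy <- lapply(toy, function(x) gsub(patterns[[label]], label, x))
}

print(toy)
```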
Create a pattern named keep_words which will contain the words we want to keep. We know that we have changed all of the words in each level to just one word, so we can search for just those words.
keep_words <- "^know$|^comprehend$|^apply$|^analyse$|^synthesise$|^evaluate$"
keep_words
## [1] "^know$|^comprehend$|^apply$|^analyse$|^synthesise$|^evaluate$"
Keep only the words in clos that match keep_words.
clos <- lapply(clos, function(x) grep(keep_words, x, value = TRUE))
clos[5]
## [[1]]
## [1] "know" "analyse" "comprehend"
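Since keep_words matches whole words only, the same filtering can be done without regex. A sketch of that alternative using %in% on a made-up vector:

```r
# toy vector: one CLO after the replacement step
x <- c("1", "know", "and", "analyse", "knowing", "comprehend")
levels_to_keep <- c("know", "comprehend", "apply", "analyse", "synthesise", "evaluate")

# regex version, as in the text
by_regex <- grep("^know$|^comprehend$|^apply$|^analyse$|^synthesise$|^evaluate$",
                 x, value = TRUE)

# exact-match version: keep the elements that appear in levels_to_keep
by_match <- x[x %in% levels_to_keep]

print(by_regex)
print(by_match)
```

Both versions drop "knowing", because only exact matches on the level names are kept.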
Count how often each level occurs and save the result as clos_freq.
clos_freq <- table(unlist(clos))
clos_freq
##
## analyse apply comprehend evaluate know synthesise
## 614 1579 548 453 667 36
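A frequency table is easier to read when sorted, and proportions help when comparing levels. The sketch below uses a small made-up list rather than the real clos, so the numbers are illustrative only:

```r
# toy stand-in for the filtered clos list
toy <- list(c("know", "analyse"), "know", c("apply", "know"))

# unlist() flattens the list into one vector, table() counts each word
freq <- table(unlist(toy))

# most frequent level first
sorted <- sort(freq, decreasing = TRUE)

# share of each level among all matched words
shares <- round(prop.table(freq), 2)

print(sorted)
print(shares)
```

The same three calls (sort, prop.table, round) can be applied directly to clos_freq.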