Exploring the Teaching Excellence Framework (TEF) 2017 submissions using topic modelling to look for language differences between awards. Work for a PGCert assignment, 2019-20. (All the code is available on GitHub and will be CC-BY licensed for the community to adapt and use.)
The Teaching Excellence Framework (TEF) is a major policy initiative in higher education. Moore, Higham, Sanders et al. (2017) explain how the submissions made by providers in the TEF Year Two (TEF2) process provided a detailed source of evidence for excellence in teaching and impact evaluation practice. The Higher Education Academy (HEA) commissioned a research team to devise and undertake a data mining process on the text content of each submission. The results provided initial guidance towards a gold, silver or bronze award, some of which were adjusted by a panel before a final outcome was decided (Kernohan 2017).
There are further data mining techniques and visualisation tools which can be used to study the submission texts in the following ways:
These findings may be able to assist or guide future work. The R programming language has been used to prepare and analyse the data files.
The TEF2 statement texts and award data were downloaded from the Office for Students website. The data files are available in CSV and/or Excel format, and the statements in PDF format. The PDF files had to be converted to plain text, and the data files interpreted, using R.
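As an illustration of the interpretation step, the sketch below splits a submission file name into its UKPRN and provider name and joins the result to an awards table. The helper names, the column names and the award value are hypothetical placeholders, and the use of `pdftools::pdf_text()` for extraction is an assumption, since the package used in `convert_pdf2txt.R` is not stated here.

```r
# Hypothetical helper: split "UKPRN_Provider name_Submission.pdf" into parts.
parse_submission_name <- function(path) {
  stem <- sub("\\.(pdf|txt)$", "", basename(path), ignore.case = TRUE)
  stem <- sub("_Submission$", "", stem)
  data.frame(
    ukprn    = sub("_.*$", "", stem),      # everything before the first underscore
    provider = sub("^[^_]*_", "", stem),   # everything after the first underscore
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: extract the text of one PDF into a .txt file.
# Assumes the 'pdftools' package is installed; it is only needed when called.
pdf_to_txt <- function(pdf_path, txt_dir) {
  pages <- pdftools::pdf_text(pdf_path)
  out <- file.path(txt_dir, sub("\\.pdf$", ".txt", basename(pdf_path), ignore.case = TRUE))
  writeLines(pages, out)
  invisible(out)
}

parsed <- parse_submission_name("10000055_Abingdon and Witney College_Submission.pdf")

# Placeholder awards table; "placeholder" is NOT the provider's real outcome.
awards <- data.frame(ukprn = "10000055", award = "placeholder", stringsAsFactors = FALSE)
joined <- merge(parsed, awards, by = "ukprn")
print(joined)
```

Keeping the UKPRN as a character string avoids losing leading zeros if an identifier ever has them, and gives a stable key for joining statements to awards.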
LDA (Latent Dirichlet Allocation) is a topic modelling technique that has been used to analyse the statements. It has been applied in previous work by Poon, Goloshchapova et al. (2018). The statements (documents) are taken through a series of pre-processing steps, including stemming and removing stop-words, and then the LDA implementation in the ‘text2vec’ package is applied. The pre-processing could be further improved after running the LDA process the first time, by customising the list of stop-words or by lemmatising instead of stemming.
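The pre-processing steps can be sketched in a few lines of R. This is a minimal illustration, not the repository's code: the `preprocess()` helper and the short stop-word list are hypothetical, and stemming is delegated to `SnowballC::wordStem()` only when requested, on the assumption that the package is installed.

```r
# Hypothetical helper: lower-case, strip non-letters, tokenise,
# drop stop-words, and optionally stem with SnowballC.
preprocess <- function(text, stop_words, stem = FALSE) {
  text <- tolower(text)
  text <- gsub("[^a-z]+", " ", text)               # keep letters only
  tokens <- strsplit(trimws(text), " +")[[1]]
  tokens <- tokens[nchar(tokens) > 1 & !tokens %in% stop_words]
  if (stem) {
    # Requires the 'SnowballC' package (an assumption of this sketch).
    tokens <- SnowballC::wordStem(tokens, language = "english")
  }
  tokens
}

# Illustrative stop-word list, not the one used in the project.
stop_words <- c("the", "and", "is", "are", "to", "of", "by")

example <- "The University is supporting students, and students are supported by staff."
tokens <- preprocess(example, stop_words)
print(tokens)
```

Calling `preprocess(example, stop_words, stem = TRUE)` would additionally reduce related word forms to a shared stem. Customising `stop_words` after a first LDA run is the simplest of the refinements mentioned above.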
The LDA process aims to identify topics that are covered in one or more documents by looking at the frequency and co-occurrence of words. A hyper-parameter for this model, ‘number of latent topics’, was set to 50.
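A minimal sketch of fitting an LDA model with ‘text2vec’ follows. The documents are made up, and the priors, the iteration settings and the use of three topics (instead of the 50 used for the TEF statements) are assumptions chosen to keep the example small. The code is guarded so that it is skipped where ‘text2vec’ is not installed.

```r
if (requireNamespace("text2vec", quietly = TRUE)) {
  library(text2vec)
  set.seed(2017)

  # Made-up, already pre-processed documents (not TEF statements).
  docs <- c(
    d1 = "student support learn teach feedback student",
    d2 = "univers research teach student academ studi",
    d3 = "colleg employ skill work student progress",
    d4 = "student learn support develop feedback staff",
    d5 = "univers student research graduat employ academ",
    d6 = "colleg higher educ provis student skill"
  )

  # Tokenise and build a document-term matrix.
  it <- itoken(docs, tokenizer = word_tokenizer, ids = names(docs),
               n_chunks = 1, progressbar = FALSE)
  vocab <- create_vocabulary(it)
  dtm <- create_dtm(it, vocab_vectorizer(vocab))

  # n_topics was 50 in the project; 3 is used here for the toy corpus.
  lda_model <- LDA$new(n_topics = 3, doc_topic_prior = 0.1, topic_word_prior = 0.01)
  doc_topic <- lda_model$fit_transform(dtm, n_iter = 200, convergence_tol = 0.001,
                                       n_check_convergence = 25, progressbar = FALSE)

  # Six highest-probability terms per topic (one column per topic).
  top_words <- lda_model$get_top_words(n = 6, lambda = 1)
  print(dim(doc_topic))
  print(top_words)
}
```

The number of topics is the main setting to experiment with: too few topics merge unrelated themes, while too many split one theme across several near-duplicates.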
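Two output matrices are produced:

- the probability of each topic in each document (`*_doc_topic_probabilities.csv`)
- which words occur in which topics (`*_word_vectors_for_each_topic.csv`)
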
The six most likely keywords are shown in the analyses below. These lists are useful for identifying garbage terms, some of which could be removed if re-running the process. The identified topics may include terms which a human can identify as being related to the same real topic.
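Picking the six most likely keywords for each topic is a simple ranking over a topic-by-term probability matrix. The sketch below uses a made-up matrix and a hypothetical `top_terms()` helper. The values are illustrative and are not taken from the project's output files.

```r
# Hypothetical helper: the n highest-probability terms for each topic (row).
top_terms <- function(topic_word, n = 6) {
  apply(topic_word, 1, function(p) names(sort(p, decreasing = TRUE))[seq_len(n)])
}

# Made-up topic-by-term probabilities; each row sums to 1.
topic_word <- rbind(
  topic_1 = c(0.30, 0.25, 0.15, 0.12, 0.10, 0.05, 0.03),
  topic_2 = c(0.28, 0.02, 0.20, 0.16, 0.08, 0.22, 0.04)
)
colnames(topic_word) <- c("student", "univers", "learn", "support", "teach", "colleg", "ukprn")

top <- top_terms(topic_word)   # one column per topic, six rows
print(top)
```

In this toy matrix the term `ukprn` reaches the top six of the second topic, which is the kind of garbage term that would be added to the stop-word list before re-running the process.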
The ‘LDAvis’ software is used to visualise the topics, list which terms have been identified, and map how the topics relate to each other. (Topics with more terms in common appear in close proximity on the intertopic distance maps.) Interpreting the LDAvis output can help to further refine the LDA process.
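The idea behind an intertopic distance map can be sketched in base R: compute a divergence between every pair of topic-term distributions, then project the topics into two dimensions. The sketch uses Jensen-Shannon divergence followed by classical multidimensional scaling, which is my understanding of the default LDAvis behaviour and should be treated as an assumption. The topic distributions are made up.

```r
# Jensen-Shannon divergence between two probability vectors.
js_divergence <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}

# Made-up topic-term distributions: A and B share their main terms, C does not.
topic_word <- rbind(
  A = c(0.50, 0.30, 0.10, 0.10),
  B = c(0.40, 0.40, 0.10, 0.10),
  C = c(0.05, 0.05, 0.50, 0.40)
)

# Pairwise divergence matrix between topics.
n <- nrow(topic_word)
d <- matrix(0, n, n, dimnames = list(rownames(topic_word), rownames(topic_word)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- js_divergence(topic_word[i, ], topic_word[j, ])
  }
}

# Classical multidimensional scaling down to two plotting coordinates.
coords <- cmdscale(as.dist(d), k = 2)
print(round(d, 4))
print(coords)
```

Topics A and B, which put most of their probability on the same terms, have a much smaller divergence than either has with C, so they sit close together on the resulting map.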
After running the LDA process on the whole set of documents, it was repeated with subsets for each of the gold, silver and bronze awarded institutions, with a visualisation for each. It was also repeated with subsets for higher and further education institutions.
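Creating the subsets amounts to splitting the documents by a grouping column before re-running the same pipeline on each group. In the sketch below the data frame, its column names and its contents are hypothetical. Only the provider type labels `HigherEducationInstitution` and `FurtherEducationCollege` are taken from the repository's script descriptions.

```r
# Made-up stand-in for the table of statements joined to the award data.
submissions <- data.frame(
  ukprn = c("id1", "id2", "id3", "id4"),
  award = c("Gold", "Silver", "Bronze", "Silver"),
  provider_type = c("HigherEducationInstitution", "FurtherEducationCollege",
                    "FurtherEducationCollege", "HigherEducationInstitution"),
  text = c("statement one", "statement two", "statement three", "statement four"),
  stringsAsFactors = FALSE
)

# One character vector of statements per award, and one per provider type.
by_award <- split(submissions$text, submissions$award)
by_type  <- split(submissions$text, submissions$provider_type)

# Each element could then be passed through the same pre-processing and LDA steps.
print(sapply(by_award, length))
print(sapply(by_type, length))
```

Because each subset is modelled separately, topic numbers are not comparable between studies: topic 1 in the gold model has no relationship to topic 1 in the silver model.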
These terms are most important within the identified topics. Note that all terms have been stemmed (the ends of words are removed so that ‘support’, ‘supporting’ and ‘supported’ are treated the same, for example).
Rank | 🇬🇧Whole | 🥇Gold | 🥈Silver | 🥉Bronze | 🏨HEI | 🏫FEC |
---|---|---|---|---|---|---|
1 | student | student | student | student | univers | colleg |
2 | univers | univers | univers | colleg | student | student |
3 | colleg | teach | colleg | univers | teach | learn |
4 | learn | colleg | learn | teach | learn | educ |
5 | support | learn | teach | programm | employ | support |
6 | teach | support | support | support | support | higher |
These terms are most frequent, regardless of topics.
Rank | 🇬🇧Whole | 🥇Gold | 🥈Silver | 🥉Bronze | 🏨HEI | 🏫FEC |
---|---|---|---|---|---|---|
1 | student | student | student | student | student | student |
2 | univers | univers | learn | learn | univers | colleg |
3 | learn | support | univers | colleg | learn | support |
4 | support | teach | develop | support | support | learn |
5 | teach | learn | teach | develop | teach | develop |
6 | colleg | provid | employ | univers | develop | staff |
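A ranking of the most frequent terms, regardless of topics, needs only a count over all tokens in the study. The sketch below uses made-up stemmed tokens, not the TEF statements.

```r
# Made-up stemmed tokens for three documents.
tokens_by_doc <- list(
  doc1 = c("student", "univers", "learn", "student"),
  doc2 = c("student", "colleg", "support", "learn"),
  doc3 = c("teach", "student", "univers")
)

# Count every term across all documents and rank by frequency.
term_counts <- sort(table(unlist(tokens_by_doc)), decreasing = TRUE)
top_six <- head(names(term_counts), 6)
print(term_counts)
```

Unlike the topic-based ranking, this count ignores how terms co-occur, which is why the two tables can order the same terms differently.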
These are the six most salient terms from each of the five most widely distributed topics for each study (the largest circles in the visualisation). The terms for every topic are listed in the `*_word_vectors_for_each_topic.csv` files.

See the 🇬🇧whole visualisation
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|
support | teach | work | learn | univers |
student | learn | research | employ | student |
learn | student | graduat | support | teach |
develop | success | learn | teach | undergradu |
includ | experi | employ | skill | studi |
feedback | demonstr | practic | profession | academ |
See the 🥇gold visualisation
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|
student | univers | student | practic | well |
support | teach | academ | staff | first |
improv | student | support | learn | year |
research | learn | also | feedback | provid |
inform | studi | use | includ | three |
data | subject | person | programm | take |
See the 🥈silver visualisation
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|
develop | learn | student | student | colleg |
learn | develop | provid | metric | student |
teach | programm | programm | tef | part |
profession | student | feedback | employ | progress |
staff | support | learn | data | work |
practic | staff | access | assess | area |
See the 🥉bronze visualisation
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|
employ | student | student | colleg | learn |
work | learn | staff | higher | ukprn |
skill | develop | develop | provis | can |
feedback | project | support | develop | name |
use | support | higher | learn | event |
student | research | innov | degree | offer |
See the 🏨HEI visualisation
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|
student | learn | univers | develop | support |
feedback | develop | student | learn | employ |
academ | work | employ | research | teach |
offer | experi | studi | staff | sector |
work | profession | enhanc | institut | improv |
librari | staff | engag | provid | work |
See the 🏫FEC visualisation
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|
student | student | colleg | employ | student |
support | develop | educ | higher | within |
skill | learn | learn | progress | engag |
use | support | work | programm | academ |
cours | skill | provis | student | develop |
develop | work | curriculum | ensur | support |
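The five most widely distributed topics in each study can be found by ranking topics on their share of the corpus. The sketch below weights each document's topic probabilities by its length, which is my understanding of how LDAvis sizes its circles and should be treated as an assumption. The matrix and the document lengths are made up.

```r
# Made-up document-by-topic probabilities; each row sums to 1.
doc_topic <- rbind(
  doc1 = c(0.7, 0.2, 0.1),
  doc2 = c(0.1, 0.8, 0.1),
  doc3 = c(0.5, 0.4, 0.1)
)
colnames(doc_topic) <- paste0("topic_", 1:3)

# Made-up token counts per document.
doc_length <- c(doc1 = 1000, doc2 = 3000, doc3 = 2000)

# Share of all tokens attributed to each topic (rows are weighted by length).
topic_share <- colSums(doc_topic * doc_length) / sum(doc_length)

# Topics ranked from widest to narrowest distribution.
ranked <- names(sort(topic_share, decreasing = TRUE))
print(round(topic_share, 3))
print(ranked)
```

With 50 topics, `head(ranked, 5)` would give the five topics to report, assuming the columns of the real document-topic matrix are labelled in the same way.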
Moore, J, Higham, L, Sanders, J, Jones, S, Candarli, D, Mountford-Zimdars, A 2017, ‘Evidencing teaching excellence: Analysis of the TEF2 provider submissions’, Higher Education Academy.
Office for Students 2019, TEF data, Office for Students.
Poon, S-H, Goloshchapova, I, Pritchard, M & Reed, P 2018, ‘Corporate Social Responsibility Reports: Topic Analysis and Big Data Approach’, European Journal of Finance.
Kernohan, D 2017, ‘TEF results - Who moved up and who fell down?’, Wonkhe.
- `data/`
  - `raw/` (excluded from version control)
    - `tef_y2_allcontext.csv`: contextual data
    - `tef_y2_allmetrics.csv`: all the metrics
    - `tefyeartwo_awards.xlsx`: all the outcomes/awards
    - `TEFYearTwo_AllSubmissions/` folder
      - `10000055_Abingdon and Witney College_Submission.pdf` (`PROVIDER_TEFUKPRN` underscore `PROVIDER_NAME` underscore `Submission.pdf`)
    - `TEFYearTwo_AllSubmissions_txt/` folder
  - `processed/`
    - `tdm.rds`: text document matrix data for the whole (may use later)
    - `gold_tdm.rds`: text document matrix data for the gold awards
    - `silver_tdm.rds`: text document matrix data for the silver
    - `bronze_tdm.rds`: text document matrix data for the bronze
    - `hei_tdm.rds`: text document matrix data for higher education
    - `fec_tdm.rds`: text document matrix data for further education
- `code/`
  - `convert_pdf2txt.R`: Convert directory in one go from PDF to txt
  - `text2vec_whole.R`: Perform LDA (column for award) and create visualization
  - `text2vec_gold.R`: Just gold awarded institutions (run after ‘whole’)
  - `text2vec_silver.R`: Just silver awarded institutions (run after ‘whole’)
  - `text2vec_bronze.R`: Just bronze awarded institutions (run after ‘whole’)
  - `text2vec_hei.R`: Just ‘HigherEducationInstitution’ (run after ‘whole’)
  - `text2vec_fec.R`: Just ‘FurtherEducationCollege’ (run after ‘whole’)
- `result/`
  - `LDA_plot.Rmd`: The knitR file to launch the visualization
  - `LDA_plot.html`: The knitted visualization
  - `whole_doc_topic_probabilities.csv`: Probability of each topic in each doc
  - `whole_word_vectors_for_each_topic.csv`: Which words occur in which topics
  - `ldavis/` folder
    - `LDA_plot.html` file
  - `LDA_plot_gold.Rmd`
  - `LDA_plot_gold.html`
  - `gold_doc_topic_probabilities.csv`
  - `gold_word_vectors_for_each_topic.csv`
  - `gold_ldavis/` folder
  - `LDA_plot_silver.Rmd`
  - `LDA_plot_silver.html`
  - `silver_doc_topic_probabilities.csv`
  - `silver_word_vectors_for_each_topic.csv`
  - `silver_ldavis/` folder
  - `LDA_plot_bronze.Rmd`
  - `LDA_plot_bronze.html`
  - `bronze_doc_topic_probabilities.csv`
  - `bronze_word_vectors_for_each_topic.csv`
  - `bronze_ldavis/` folder
  - `LDA_plot_hei.Rmd`
  - `LDA_plot_hei.html`
  - `hei_doc_topic_probabilities.csv`
  - `hei_word_vectors_for_each_topic.csv`
  - `hei_ldavis/` folder
  - `LDA_plot_fec.Rmd`
  - `LDA_plot_rec.html`
  - `fec_doc_topic_probabilities.csv`
  - `fec_word_vectors_for_each_topic.csv`
  - `fec_ldavis/` folder
  - `maps/` folder
    - `map1_whole.png`: Intertopic distance map for whole
    - `map2_gold.png`: Intertopic distance map for gold
    - `map3_silver.png`: Intertopic distance map for silver
    - `map4_bronze.png`: Intertopic distance map for bronze
    - `map5_hei.png`: Intertopic distance map for HEI
    - `map6_fec.png`: Intertopic distance map for FEC