INTRODUCTION:
Covid-19 has shifted our day-to-day lives and has had an impact across the globe. Much of what is going on in the world is brought to us through the news, in the form of text.
In this notebook, we will apply Natural Language Processing techniques to a dataset of Covid-19 related news. We will perform tokenization, lemmatization, part-of-speech (POS) tagging, named entity recognition, stopword removal, and more on sentences from news articles and on the entire dataset. You will be using the NLP library spaCy to perform these techniques. This library makes it simple to perform these complex operations on text in a specified language.
This will allow you to explore the NLP Pipeline and analyze the contents of the text within the dataset. You will also be creating word clouds based on the modified text from the dataset. This will allow us to visualize the key words from the articles to better understand which content is most important in them.
For this notebook you will need to install the following packages in addition to libraries previously used:
1) spaCy: pip install -U spacy
2) spaCy's English package (change the command according to your environment, ex: python vs py): py -m spacy download en
3) WordCloud: pip install wordcloud
4) MultiDict: pip install multidict
Note: For some students who already have spaCy installed, an issue may occur when trying to work with the English package mentioned in (2). If this occurs, there is a commented line of code with an alternative way to run the same method. Only if you run into an issue where 'en' is not recognized, run the following command and switch the spacy.load() call in the second code cell below to the commented call.
Change the command according to your environment, ex: python vs py: py -m spacy download en_core_web_sm
HOMEWORK:
Go through the notebook by running each cell, one at a time.
Look for (TO DO) for the tasks that you need to perform. Do not edit the code outside of the questions which you are asked to answer unless specifically asked. Once you're done, sign the notebook (at the end of the notebook) and submit it.
The notebook will be marked out of 25.
Each (TO DO) has a number of points associated with it.
# Before starting we will import every module that we will be using
import pandas as pd
import matplotlib.pyplot as plt
import multidict as multidict
import re
import spacy
from spacy import displacy
from wordcloud import WordCloud
# Download spaCy's English package (only needed once; see the installation notes above)
%run -m spacy download en
# The core spacy object that will be used for tokenization, lemmatization, POS Tagging, ...
# Note that this is specifically for the English language and requires the English package to be installed
# via pip to work as intended.
sp = spacy.load('en')
# If the above causes an error after installing the package described in (2), install the package described
# in the Note section within the introduction and run this line of code instead of the above.
#sp = spacy.load('en_core_web_sm')
PART 1 - Sentence Analysis
In this part, we will use the modules from spaCy to perform the different steps of the NLP pipeline on sentences from the included file of Covid-19 related news articles from CBC News. The dataset is included with this submission, but details regarding it can be found here. The first thing that we will do is load the file into a pandas dataframe.
# Read the dataset, show top ten rows
df = pd.read_csv("news.csv")
df.head(10)
From this dataset, we will start by selecting the sentences that we will be using throughout this section from one article. First, we will display the text of the article and manually copy the sentences that will be used for this section. Notice that there are many tags saved within the dataset, but we will not worry about those for now.
df["text"][1]
From this text, we will select one sentence that will be used by the examples provided within the notebook, sentence_example, and five sentences that you will be using to answer five questions within this section, sentence1, ..., sentence5. Sentences 4 and 5 are identical because that sentence works well for both questions.
# Sentence to be used for running examples
sentence_example = "Government guidelines in Canada recommend that people stay at least two metres away from others as part of physical distancing measures to curb the spread of COVID-19."
# Sentences to be used for future questions
sentence1 = "I think those are provocative and those are hypothesis- generating, but then they need to be tested in the field.\" Loeb is running such a field test himself — a randomized controlled trial of the use of medical versus N95 masks among health care workers to see if there is a difference in the transmission of COVID-19."
sentence2 = "The World Health Organization recommends that people wear masks if they are coughing and sneezing or if they are caring for someone who is sick."
sentence3 = "Will it help if everyone wears a mask?"
sentence4 = "Infection control guidelines do recommend extra personal protective equipment (including N-95 respirators) to protect against airborne transmission for healthcare workers performing procedures that generate high concentrations of aerosolized particles, such as intubations, on COVID-19 patients, McGeer said."
sentence5 = "Infection control guidelines do recommend extra personal protective equipment (including N-95 respirators) to protect against airborne transmission for healthcare workers performing procedures that generate high concentrations of aerosolized particles, such as intubations, on COVID-19 patients, McGeer said."
With the sentences that we will be using defined, we will now explore how spaCy can be applied on a sentence and go through examples of its usage. You will then be asked questions to solve on your own with your understanding of the NLP concepts, with the examples, and with links to relevant documentation.
First, we will pass the example sentence into our spacy object sp to retrieve the tokenization, lemmatization, dependency values, Part-of-speech (POS) tags, and more from the sentence. As you will see, spaCy makes this process very easy.
# Call our spaCy object to retrieve the results of running the sentence through the NLP Pipeline
# Note that we can reuse the sp variable without redefining it.
sentence_example_content = sp(sentence_example)
for token in sentence_example_content:
print("Text: " + str(token.text) + " Lemma: " + str(token.lemma_) + " POS: " + token.pos_ +
" Dependency: " + token.dep_)
# We will take a look at the dependency tree to view how the words relate to each other
displacy.render(sentence_example_content, style="dep")
In the code above, we see that we are able to access the tags from the dependency tree for each token by calling .dep_. However, we can also directly access elements from the dependency tree (as seen in the code below). For more examples regarding how to navigate through dependency trees, you can take a look at some official spaCy examples. However, the details below are enough for this notebook.
Looking at the dependency tree above, we see that the words have arrows to represent the relationships. Each of these arrows has a dependency label to explain the dependency between tokens. For example, 'Government' is a compound of the noun 'guidelines'. Thus, 'Government guidelines' is a noun compound.
In the code, after parsing text with spaCy into tokens, it is possible to access the words to which a token's arrows connect. This is exhibited in the code below.
Note that when accessing a child node, you are able to access the properties in the same way that you would for a regular spaCy token (.pos_, ...).
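In addition to its children, each token also exposes its head, the word that its arrow comes from. Below is a small sketch (added for illustration) that prints each token's head together with the dependency label, which is another way to read the arrows in the tree above.
# For each token, print the dependency label and the head token that its arrow comes from
for token in sentence_example_content:
    print(token.text + " <--" + token.dep_ + "-- " + token.head.text)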
# Display how to access the dependency children within a dependency tree
for token in sentence_example_content:
print("Current token: " + token.text)
print("All children of this token:", list(token.children))
print("Left children of this token:", list(token.lefts))
print("Right children of this token:", list(token.rights))
print()
We can also use sp to retrieve the Named Entity Recognition (NER) tags of terms in a provided text. Below we use the results obtained above after calling spaCy to retrieve the NER tag for each token in the sentence that has been assigned one. We access this by calling .label_ on each element of the iterable returned by calling .ents on the results returned by spaCy.
# Loop through all tokens that contain a NER tag and print the token along with the corresponding NER tag
for token in sentence_example_content.ents:
print("\"" + token.text + "\" is a " + token.label_ )
We'll explore Named Entity Recognition in the next notebook, as this one focuses on linguistic analysis at the sentence and corpus levels.
(TO DO) Q1 - 1 mark
For sentence1, use spaCy to run the sentence through the NLP Pipeline and determine how many tokens are in the sentence.
# TODO: How many tokens
sentence1_content = sp(sentence1)
print(len(sentence1_content))
(TO DO) Q2 - 2 marks
For sentence2, display the dependency tree and determine what the subject is for the verb recommends (the entire name). You do not need to do this automatically, just print the value that you find by looking at the dependency tree.
# Display the dependency tree for sentence2
sentence2_content = sp(sentence2)
displacy.render(sentence2_content, style="dep")
# What is the subject of the verb 'recommends' in sentence2?
print("The subject of verb recommends is The World Health Organization")
(TO DO) Q3 - 2 marks
For sentence3, use spaCy to run the sentence through the NLP Pipeline and print only the words that are VERBs.
# TODO: Find the verbs
sentence3_content = sp(sentence3)
for token in sentence3_content:
if token.pos_ == "VERB":
print("Text: " + str(token.text) + " Lemma: " + str(token.lemma_) + " POS: " + token.pos_ +
" Dependency: " + token.dep_)
(TO DO) Q4
a) For sentence4, use spaCy to run the sentence through the NLP Pipeline and print only the words that are adjectives (ADJs).
b) For each adjective found in (a), find the nouns that the adjective modifies. To do this, you will need to go through the tags from the dependency tree, find the adjectives with the amod tag, and then find the noun that each one modifies.
(TO DO) Q4 (a) - 1 mark
a) For sentence4, use spaCy to run the sentence through the NLP Pipeline and print only the words that are adjectives (ADJs).
# TODO: Find the adjectives
s4 = sp(sentence4)
for token in s4:
if token.pos_ == "ADJ":
print("Text: " + str(token.text) + " Lemma: " + str(token.lemma_) + " POS: " + token.pos_ +
" Dependency: " + token.dep_)
(TO DO) Q4 (b) - 3 marks
b) For each applicable adjective in the sentence, find the noun that the adjective modifies. If several adjectives modify a single noun, each of those adjectives should be printed with that noun (ex: 'extra -> equipment').
To do this, you will need to go through the tags from the dependency tree to find the adjectives with the amod tag to find the following noun that it modifies. Note that not every adjective will have the amod dependency, but many will.
Hint: Recall from the example at the beginning of this part that you are able to select a token and access each arrow leaving the token.
Also note that you can approach this problem in many ways, so feel free to design the approach yourself (as long as it correctly answers the question).
# Display the dependency tree
displacy.render(s4, style="dep")
# TODO: Print the nouns that an adjective modifies with the amod dependency label
# Go through the spaCy tokens, find the adjectives with the amod relation, and print the noun (the head) that each modifies
adj_noun_pairs = []
for token in s4:
    # For an adjective with the amod dependency, the head token is the noun it modifies
    if token.pos_ == "ADJ" and token.dep_ == "amod":
        adj_noun_pairs.append((token.text, token.head.text))
for adj, noun in adj_noun_pairs:
    print(adj + " -> " + noun)
(TO DO) Q5 - 3 marks
For sentence5, use spaCy to run the sentence through the NLP Pipeline and find all noun compounds. A noun compound consists of one or more words with a compound dependency value (that are also NOUNs) followed by a noun (compound, ..., compound, non-compound NOUN).
To view the noun compounds, you can view the dependency tree of the sentence after running it through the NLP pipeline via spaCy.
You must put all of the compounds together in order to get marks. Print the obtained noun compounds.
Note that you can approach this problem in many ways, so feel free to design the approach yourself (as long as it correctly answers the question).
Ex: 'Infection control guidelines' is a noun compound.
Below we parsed the sentence, printed out the compounds, and displayed the dependency tree for you to look at before coding in the following cell.
# Apply spaCy to sentence5
s5 = sp(sentence5)
# Display all compounds within the sentence
for token in s5:
if token.dep_ == "compound":
print(token)
# Display the dependency tree
displacy.render(s5, style="dep")
# TODO: Find, connect, and print all noun compounds
print([chunk.text for chunk in s5.noun_chunks])
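For comparison, here is a small alternative sketch (one of many valid approaches) that follows the definition above more literally, joining each run of compound-tagged tokens with the noun that ends the run:
# Alternative sketch: join each run of compound-tagged tokens with the head noun that follows it
noun_compounds = []
current = []
for token in s5:
    if token.dep_ == "compound":
        current.append(token.text)
    elif current:
        # The first non-compound token after a chain is the head noun of the compound
        noun_compounds.append(" ".join(current + [token.text]))
        current = []
print(noun_compounds)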
(TO DO) Q6 - 1 mark
Using the provided sentence, briefly explain the impact that sentence length has on the dependency tree and on the parsing of the sentence (through a comment or print statement).
# TODO: Complete answer
sentence_parse = "In a surprisingly high turnout, millions of South Korean voters wore masks and moved slowly between lines of tape at polling stations on Wednesday to elect lawmakers in the shadow of the spreading coronavirus."
s_parse = sp(sentence_parse)
for token in s_parse:
print("Text: " + str(token.text) + " Lemma: " + str(token.lemma_) + " POS: " + token.pos_ +
" Dependency: " + token.dep_)
displacy.render(s_parse, style="dep")
# TODO: What is the impact
print("The length of the sentence can have great impact on parsing. The dependency tree grows bigger and deeper as the sentence grows complexity. Trying to parse the sentence and find its noun compounds and adjectives becomes more difficult.")
PART 2 - Corpus Analysis
For the second section of this notebook, we will focus on analyzing the entire corpus by making word clouds based on the corpus content. This will help us identify the key words within the articles based on the criteria that we apply to the data with NLP techniques.
For this section we will be using the WordCloud library, which allows us to create word clouds from raw text or from the frequencies of the words within the text. The code for generating the word clouds based on frequency comes from this WordCloud example.
We will start with a simple example of creating a word cloud based on the titles of the documents in our corpus. Although we could use the descriptions of the articles or the actual text, this would take too long. Thus, we will be working with the titles, which allows each word cloud to be generated in around a minute.
We will make a word cloud based on the frequencies of each term from the titles in our corpus, by calling the getFrequencyDictForText function below, and passing those frequencies to the word cloud via WordCloud's generate_from_frequencies function.
# Code from the example in: https://amueller.github.io/word_cloud/auto_examples/frequency.html
def getFrequencyDictForText(sentence):
fullTermsDict = multidict.MultiDict()
tmpDict = {}
# making dict for counting frequencies
for text in sentence.split(" "):
val = tmpDict.get(text.lower(), 0)  # use the lowercased key so counts accumulate case-insensitively
tmpDict[text.lower()] = val + 1
for key in tmpDict:
fullTermsDict.add(key, tmpDict[key])
return fullTermsDict
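As a quick illustration (this check is not part of the original example), calling the function on a short string returns a MultiDict mapping each lowercased word to its count:
# Quick sanity check of the frequency dictionary on a short string
print(getFrequencyDictForText("Masks and masks and physical distancing"))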
# This function comes from: https://towardsdatascience.com/simple-wordcloud-in-python-2ae54a9f58e5
# Define a function to plot a word cloud
def plot_cloud(wordcloud):
# Set figure size
plt.figure(figsize=(40, 30))
# Display image
plt.imshow(wordcloud)
# No axis details
plt.axis("off");
# This can take about a minute
# Retrieve the frequencies from the titles in the dataframe
frequencies = getFrequencyDictForText(' '.join(df["title"]))
# Create a word cloud based on the frequencies from the titles in the dataframe
word_cloud = WordCloud(width=3000, height=2000, random_state=1, collocations=False).generate_from_frequencies(frequencies)
# Plot the word cloud
plot_cloud(word_cloud)
As you can see, we can easily use the frequency of the terms from the titles to create a word cloud. These word clouds can be customized to fit within image backgrounds, have custom colours, and more.
The above word cloud contains some important terms, but many of the most frequent terms are not very important (or are symbols/numbers). Terms that appear very frequently in many different types of documents, but are not important, are called stopwords. For example, words such as the, to, is, of, ... appear extremely frequently in text but provide no meaningful information when analyzing a document. Often we want to remove these stopwords. For that reason, NLP libraries such as spaCy provide methods for detecting which words are stopwords. Below is an example of how spaCy can be used to determine whether a word is a stopword (based on the sentence used in the first example from Part 1).
# Call our spaCy object to retrieve the results of running the sentence through the NLP Pipeline
# Note that we can reuse the sp variable without redefining it.
sentence_example_content = sp(sentence_example)
for token in sentence_example_content:
print("Text: " + str(token.text) + " Is stopword: " + str(token.is_stop))
Thus, in the next few questions you will be exploring different ways of manipulating the title data before generating the frequencies to create the word clouds. This will result in different word clouds that allow us to view important terms from the text based on certain criteria.
In the next few questions, be sure to recall the example from part 1 which exhibits how spaCy can be used to perform lemmatization and the example above which exhibits how spaCy can be used to perform stopword detection.
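As a reminder of how the two fit together, here is a small sketch (using the example sentence from Part 1) of keeping only the lemmas of non-stopword tokens, which is the kind of filtering you will apply to the titles below.
# Small sketch: keep only the lemmas of tokens that are neither stopwords nor punctuation
filtered_lemmas = [token.lemma_ for token in sp(sentence_example) if not token.is_stop and not token.is_punct]
print(filtered_lemmas)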
(TO DO) Q7 - 2 marks
Make a word cloud based on the frequency of the content from the document titles, where the stopwords are removed (you must use spaCy to find the stopwords).
Ensure that you use random_state=1 when generating the word cloud.
# TO DO
# Get the titles and run them through spaCy
title_content = sp(' '.join(df["title"]))
# Build a list of tokens with the stopwords removed
no_stop_words = []
for t in title_content:
if not t.is_stop:
no_stop_words.append(t.text)
# Get the frequencies
frequencies = getFrequencyDictForText(' '.join(no_stop_words))
# Create the word cloud (with random_state=1)
word_cloud = WordCloud(width=3000, height=2000, random_state=1, collocations=False).generate_from_frequencies(frequencies)
# Plot the word cloud
plot_cloud(word_cloud)
(TO DO) Q8
1) Make a lemma cloud (a lemma is from the lemmatization of a token) based on the frequency of the content from the document titles with the stopwords removed, where the lemmas come from spaCy.
2) Then, compare the resulting word cloud with the word cloud generated in Q7. What is the difference between the two?
Ensure that you use random_state=1 when generating the word cloud.
(TO DO) Q8 (a) - 2 marks
Make a lemma cloud (a lemma is from the lemmatization of a token) based on the frequency of the content from the document titles with the stopwords removed, where the lemmas come from spaCy.
Ensure that you use random_state=1 when generating the word cloud.
# TO DO
no_stop_words = []
for t in title_content:
if not t.is_stop:
no_stop_words.append(t.lemma_)
frequencies = getFrequencyDictForText(' '.join(no_stop_words))
# Create the word cloud (with random_state=1)
word_cloud = WordCloud(width=3000, height=2000, random_state=1, collocations=False).generate_from_frequencies(frequencies)
# Plot the word cloud
plot_cloud(word_cloud)
(TO DO) Q8 (b) - 1 mark
Compare the resulting word cloud with the word cloud generated in Q7. What is the difference between the two (give a specific example)?
TODO (b) By lemmatizing the words of the news data, the word cloud appears to carry less meaning overall. Words like 'restrictions' and 'supplies' seem to have more meaning in the original word cloud; 'supply' is not even part of the second picture. Overall, the lemmatization of this particular data seems to give it less context.
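A quick check (added purely for illustration) of how spaCy lemmatizes the words mentioned above:
# How spaCy lemmatizes the specific words discussed in the answer above
for word in ["restrictions", "supplies"]:
    print(word, "->", sp(word)[0].lemma_)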
(TO DO) Q9 - 3 marks
Build a word cloud based on the content of the titles of the documents where only the adjectives (ADJ) are used AND where all of the stopwords are removed AND where the lemmas are added (rather than the text).
Ensure that you use random_state=1 when generating the word cloud.
# TO DO (adjectives only, remove stopwords, and add only lemmas)
no_stop_words = []
for t in title_content:
if not t.is_stop and t.pos_ == "ADJ":
no_stop_words.append(t.lemma_)
frequencies = getFrequencyDictForText(' '.join(no_stop_words))
# Create the word cloud (with random_state=1)
word_cloud = WordCloud(width=3000, height=2000, random_state=1, collocations=False).generate_from_frequencies(frequencies)
# Plot the word cloud
plot_cloud(word_cloud)
(TO DO) Q10 - 2 marks
Based on your own choice, build a word cloud based on the content of the titles of the documents where only the verbs or the nouns are used (you select one of these to work with) AND where all of the stopwords are removed AND where the lemmas are added (rather than the text).
Ensure that you use random_state=1 when generating the word cloud.
# TO DO (select only one of the POS types above, remove stopwords, and add only lemmas)
no_stop_words = []
for t in title_content:
if not t.is_stop and t.pos_ == "VERB":
no_stop_words.append(t.lemma_)
frequencies = getFrequencyDictForText(' '.join(no_stop_words))
# Create the word cloud (with random_state=1)
word_cloud = WordCloud(width=3000, height=2000, random_state=1, collocations=False).generate_from_frequencies(frequencies)
# Plot the word cloud
plot_cloud(word_cloud)
Now that all of the word clouds have been created, you will answer some questions to analyze how the NLP techniques that have been performed impacted the generated word clouds.
(TO DO) Q11
Answer the following two questions:
1) How does removing the stopwords in Q7 affect the word cloud?
2) Out of the word clouds that you have created, which word cloud do you believe provided the most relevant terms related to Covid-19 and why?
(TO DO) Q11 (a) - 1 mark
How does removing the stopwords in Q7 affect the word cloud?
Removing stopwords in Q7 allows the word cloud to convey more meaningful content. Stopwords do not provide much context and filled up the word cloud with uninformative words, like 'or' for example.
(TO DO) Q11 (b) - 1 mark
Out of the word clouds that you have created, which word cloud do you believe provided the most relevant terms related to Covid-19 and why?
The word cloud that I believe provided the most relevant terms related to Covid-19 is the one from Q7. By removing stopwords but keeping the original text (rather than using the lemmas), I feel the overall message is clearer; after reading some news articles over the months, terms like 'restrictions' and 'supplies', as well as 'case', give much more overall meaning. When we build word clouds using lemmatization, some of the words lose their original meaning.