Natural Language Processing (NLP)

INTRODUCTION:

Covid-19 has shifted our day-to-day lives and has impacted the world globally. Much of what is going on in the world is brought to us through the news, in the form of text.

In this notebook, we will be performing Natural Language Processing techniques on a dataset of Covid-19 related news. We will apply tokenization, lemmatization, Part-of-Speech (POS) tagging, Named Entity Recognition (NER), stopword removal, and more to sentences from news articles and to the entire dataset. You will be using the NLP library spaCy to perform these NLP techniques. This library makes it simple to perform these complex operations on text in a specified language.

This will allow you to explore the NLP pipeline and analyze the contents of the text within the dataset. You will also be creating word clouds based on the modified text from the dataset, which will allow us to visualize the key words from the articles and better understand which content is most important in them.

For this notebook you will need to install the following packages in addition to libraries previously used:
1) spaCy: pip install -U spacy
2) spaCy's English package (change the command according to your environment, ex: python vs py): py -m spacy download en
3) WordCloud: pip install wordcloud
4) MultiDict: pip install multidict

Note: For some students who already have spaCy installed, an issue may occur when trying to work with the English package mentioned in (2). If this occurs, there is a commented line of code with an alternative way to load the same model. Only if you run into an issue where 'en' is not recognized, run the following command and switch the spacy.load() call in the second code cell below to the commented call.
Change the command according to your environment, ex: python vs py: py -m spacy download en_core_web_sm
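
If you would rather have the notebook handle both cases automatically, below is a minimal sketch of a defensive load. It assumes the full package name en_core_web_sm (the standard name of spaCy's small English model) and mirrors the load in the setup cell further down:

import spacy

# Try the legacy 'en' shortcut first; fall back to the full package name.
# spacy.load raises an OSError when the requested model cannot be found.
try:
    sp = spacy.load('en')
except OSError:
    sp = spacy.load('en_core_web_sm')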

HOMEWORK:
Go through the notebook by running each cell, one at a time.
Look for (TO DO) for the tasks that you need to perform. Do not edit the code outside of the questions which you are asked to answer unless specifically asked. Once you're done, sign the notebook (at the end of the notebook) and submit it.

The notebook will be marked out of 25.
Each (TO DO) has a number of points associated with it.


# Before starting we will import every module that we will be using
import pandas as pd
import matplotlib.pyplot as plt
import multidict as multidict
import re
import spacy
from spacy import displacy
from wordcloud import WordCloud
%run -m spacy download en
# The core spacy object that will be used for tokenization, lemmatization, POS Tagging, ...
# Note that this is specifically for the English language and requires the English package to be installed
# via pip to work as intended.

sp = spacy.load('en')

# If the above causes an error after installing the package described in (2), install the package described
# in the Note section within the introduction and run this line of code instead of the above.
#sp = spacy.load('en_core_web_sm')

PART 1 - Sentence Analysis

In this part, we will use the modules from spaCy to perform the different steps of the NLP pipeline on sentences from the included file on Covid-19 related news articles from CBC news. The dataset is included with this submission, but details regarding it can be found here. The first thing that we will do is load the file into a pandas dataframe.

# Read the dataset, show top ten rows
df = pd.read_csv("news.csv")
df.head(10)
Unnamed: 0 authors title publish_date description text url
0 0 [] 'More vital now:' Gay-straight alliances go vi... 2020-05-03 1:30 Lily Overacker and Laurell Pallot start each g... Lily Overacker and Laurell Pallot start each g... https://www.cbc.ca/news/canada/calgary/gay-str...
1 1 [] Scientists aim to 'see' invisible transmission... 2020-05-02 8:00 Some researchers aim to learn more about how t... This is an excerpt from Second Opinion, a week... https://www.cbc.ca/news/technology/droplet-tra...
2 2 ['The Canadian Press'] Coronavirus: What's happening in Canada and ar... 2020-05-02 11:28 Canada's chief public health officer struck an... The latest: The lives behind the numbers: Wha... https://www.cbc.ca/news/canada/coronavirus-cov...
3 3 [] B.C. announces 26 new coronavirus cases, new c... 2020-05-02 18:45 B.C. provincial health officer Dr. Bonnie Henr... B.C. provincial health officer Dr. Bonnie Henr... https://www.cbc.ca/news/canada/british-columbi...
4 4 [] B.C. announces 26 new coronavirus cases, new c... 2020-05-02 18:45 B.C. provincial health officer Dr. Bonnie Henr... B.C. provincial health officer Dr. Bonnie Henr... https://www.cbc.ca/news/canada/british-columbi...
5 5 ['Senior Writer', 'Chris Arsenault Is A Senior... Brazil has the most confirmed COVID-19 cases i... 2020-05-02 8:00 From describing coronavirus as a "little flu,"... With infection rates spiralling, some big city... https://www.cbc.ca/news/world/brazil-has-the-m...
6 6 ['Cbc News'] The latest on the coronavirus outbreak for May 1 2020-05-01 20:43 The latest on the coronavirus outbreak from CB... Coronavirus Brief (CBC) Canada is officiall... https://www.cbc.ca/news/the-latest-on-the-coro...
7 7 ['Cbc News'] Coronavirus: What's happening in Canada and ar... 2020-05-01 11:51 Nova Scotia announced Friday it is immediately... The latest: The lives behind the numbers: Wha... https://www.cbc.ca/news/canada/coronavirus-cov...
8 8 ['Senior Writer', "Adam Miller Is Senior Digit... Did the WHO mishandle the global coronavirus p... 2020-04-30 8:00 The World Health Organization has come under f... The World Health Organization has come under f... https://www.cbc.ca/news/health/coronavirus-who...
9 9 ['Thomson Reuters'] Armed people in Michigan's legislature protest... 2020-04-30 21:37 Hundreds of protesters, some armed, gathered a... Hundreds of protesters, some armed, gathered a... https://www.cbc.ca/news/world/protesters-michi...

From this dataset, we will start by selecting five different sentences from an article; these will be used throughout this section. First, we will display the text of an article and manually copy the sentences that will be used for this section. Notice that there are many tags saved within the dataset, but we will not worry about those for now.

df["text"][1]
'This is an excerpt from\xa0Second Opinion, a\xa0weekly\xa0roundup of eclectic and under-the-radar health and medical science news emailed to subscribers every Saturday morning.\xa0If you haven\'t subscribed yet, you can do that by\xa0clicking here.  The coronavirus that causes COVID-19 spreads through droplets that we spew as we breathe, talk, cough and sneeze —\xa0so tiny that they\'re invisible to the naked eye.\xa0 That\'s why\xa0questions remain about the virus\'s transmission and what precautions need to be taken to curb its spread as governments begin to lift restrictions. Will it help if everyone wears a mask? Is keeping everyone two metres apart far enough? Some researchers aim to learn more about transmission by trying to make invisible sneezes, coughs and breaths more visible. Here\'s a closer look at that research and what it might reveal. How do scientists think COVID-19 is transmitted? According to the World Health Organization, the disease spreads primarily through tiny droplets expelled\xa0when a person infected with SARS-CoV-2\xa0sneezes, coughs, exhales or spits while talking. They can infect another person who:   Comes into contact with those droplets through their eyes, nose or mouth (droplet transmission).   Touches objects or surfaces on which droplets have landed and then touches their eyes, nose or mouth (contact transmission).   The WHO says\xa0it\'s important to stay "more than one metre away" from a person who is sick. But the Public Health Agency of Canada recommends staying a distance of at least two metres or two arms\' lengths\xa0away, not just from people who are sick\xa0but from all people you don\'t live with. Why is 2 metres the recommended distance for preventing transmission? Scientists in the 19th century showed respiratory droplets from a person\'s nose and mouth can carry micro-organisms such as bacteria and viruses. Then, in 1934, W.F. Wells at the\xa0Harvard School of Public Health showed that large droplets (bigger than 0.1 millimetre)\xa0tended to fall and settle on the ground within a distance of two metres, while smaller droplets evaporated and the\xa0virus particles left behind could remain suspended in the air for a long time.  Wells proposed that could explain\xa0how diseases are transmitted. Since then, respiratory diseases have been divided into those transmitted via droplets\xa0(usually from\xa0close contact) and those that are airborne and can spread\xa0over longer distances, such as measles or tuberculosis.\xa0 INTERACTIVECoronavirus tracker: 56,000+ cases in Canada on Saturday Such tiny particles are presumably pushed around by air currents, but can\'t move easily due to air resistance. So their actual movements haven\'t been well modelled or measured, said Lydia Bourouiba, professor and director of the Fluid Dynamics of Disease Transmission Laboratory at the Massachusetts Institute of Technology.\xa0 "And that\'s why the notion of airborne [transmission] is very murky," said Bourouiba, who is Canadian. Why don\'t experts think the virus is airborne? 
A pair of recent studies raised the notion of airborne transmission, but Mark Loeb,\xa0a\xa0professor at\xa0Hamilton\'s McMaster University\xa0who specializes in infectious disease research, cautions against putting too much stock in them.\xa0 Researchers found traces of RNA from SARS-CoV-2 in washrooms and some high-traffic areas in hospitals in Wuhan, China, and in Nebraska, and suggested it got into those areas through the air, though there was no evidence the particles were still infectious.\xa0 Loeb said\xa0that\'s just a "signal" that part of the virus was there.\xa0 "Does it mean that COVID-19 is spreading from person to person through aerosols? I would say definitively not," Loeb said.\xa0 Government guidelines in Canada recommend that people stay at least two metres away from others as part of physical distancing measures to curb the spread of COVID-19. (Gary Moore/CBC) If the virus were airborne, we\'d know by now, said Dr. Allison McGeer, an infectious disease specialist with Sinai Health in Toronto who is leading a national research team studying how COVID-19 is transmitted. "The reason we know that is because all around the world we have hundreds of health-care workers who are taking care of patients wearing regular masks," she\xa0said. "If this were airborne\xa0—\xa0\xa0if this were usually in those small [aerosol] particles\xa0—\xa0all those health-care workers would be getting sick." The fact that they\'re not is a contrast to what would happen if the virus remained infectious in the air for hours even after an infected person left the room\xa0— which is\xa0part of what makes diseases like measles so contagious. Infection control guidelines do recommend extra personal protective equipment (including N-95 respirators)\xa0to protect against airborne transmission for health-care workers performing\xa0procedures that generate high concentrations of aerosolized particles, such as intubations, on COVID-19 patients, McGeer said.\xa0\xa0 Return to normal hinges on immunity, say those pushing for new COVID-19 blood tests But even in that scenario, the degree to which those particles are infectious hasn\'t been yet proven\xa0— and it a very different scenario than what people are facing in the community.\xa0 "You and I don\'t have to worry walking down the street that we\'re going to be breathing the air of somebody who walked down the street five minutes ahead of us who had COVID-19 and didn\'t know it," said McGeer, "That we can be confident about." Is there evidence the virus could be spread farther than 2\xa0metres? Some studies, including Bourouiba\'s, show that droplets from coughs and sneezes can, in fact, travel much farther than expected. Bourouiba\'s high-speed imaging\xa0measurements and modelling show smaller respiratory droplets don\'t behave like individual droplets\xa0but\xa0are in a turbulent gas cloud trapping them and carrying them forward within it. The moist environment reduces evaporation, allowing droplets of many sizes to survive much longer and travel much farther than two metres —\xa0up to seven or eight metres, in the case of a sneeze.\xa0 WATCH | Close-up view of the droplets released by a person sneezing  (Credit Lydia Bourouiba/MIT/JAMA Networks) She said the research "is about revealing what you cannot see with the naked eye." 
A more recent Canadian study used a "cough chamber" to show that if someone coughs without covering their mouth, droplets from the cough are still travelling at a speed of about one kilometre per\xa0hour when they hit the two-metre edge of the chamber. Within the chamber, droplets remained suspended for up to three minutes.\xa0 WATCH | The speed and distance travelled by droplets from a cough  Dr. Samira Mubareka, a virologist at Sunnybrook Hospital in Toronto who co-authored the study, said it\xa0"gives you a sense of what the possibilities are," but noted that the researchers, who were studying flu patients, detected very little virus in the droplets. What does that say about the\xa02-metre guideline? Bourouiba says her research points to the potential for exposure beyond two metres from someone who is coughing and sneezing. As she\xa0wrote in the journal JAMA Insights in March, that means it\'s "vitally important" for health care workers to wear high-grade personal protective equipment in the form of respirators even if they\'re farther than two metres away from infected patients.\xa0\xa0 However, she does think two metres can be\xa0far enough for healthy people in the general public in most environments, since breathing and talking don\'t propel droplets and surrounding cloud too far. Mubareka stands by the two-metre guideline despite the findings of her cough chamber study. Despite dramatic\xa0images of respiratory droplets being propelled from someone\'s nose and mouth, it\'s not yet clear how many of them contain virus\xa0and how many are infectious. Second OpinionShould masks be mandatory in public to stop the spread of COVID-19? "And that\'s really the key variable\xa0— that\'s what really determines your risk," she said. "Those are the kinds of things we haven\'t been able to measure." That may change, she added, with the recent invention of particle samplers designed specifically for viruses. Loeb of McMaster, notes\xa0that a cough chamber and similar laboratory setups are highly artificial settings and controlled environments. "They\'re basically saying what\'s theoretically possible," he said. "I think those are provocative and those are hypothesis- generating, but then they need to be tested in the field." Loeb is running such a field test himself — a randomized controlled trial of the use of medical versus N95 masks among health care workers to see if there is a difference in the transmission of COVID-19. But are coughing and sneezing all we need to worry about? That\'s a question on a lot of people\'s minds, given that a growing number of studies have shown asymptomatic and\xa0pre-symptomatic transmission is possible, especially among those who live with an infected person.\xa0 Even though researchers aren\'t sure exactly how people without symptoms transmit the disease, the new evidence has prompted both U.S. and Canadian officials to suggest apparently healthy people wear masks in public to protect others —\xa0"because it prevents you from breathing or speaking moistly on them," Prime Minister Justin Trudeau famously explained. "People generate particles when they\'re talking, singing, breathing —\xa0so you don\'t have to necessarily be coughing," Mubareka said. "It\'s just that maybe the dispersion is a little bit more limited."\xa0 WATCH | The\xa0droplets produced when someone speaks with, and without, a mask This video, from a study published in April 2020 the New England Journal of Medicine by researchers at the U.S. 
National Institutes of Health, uses laser light scattering to show droplets produced when someone speaks. 0:42\xa0 A recent brief video and report by U.S. National Institutes of Health researchers used lasers to show that droplets projected less than 10 centimetres\xa0when someone says the phrase "Stay healthy." It found the louder someone spoke, the more droplets were emitted. But they were dramatically reduced if a damp washcloth\xa0— a stand-in for a mask\xa0—\xa0was placed over the speaker\'s mouth. So what about using masks to curb the spread of COVID-19? Studies have already provided evidence that the\xa0rate at which sick people shed the virus into their surroundings is reduced when they wear a mask.\xa0 Other studies, such as a 2009 paper in Journal of the Royal Society Interface, use imaging to show how wearing a mask while coughing reduces the jet of air that\'s normally directed forward and down. A surgical mask "effectively blocks the forward momentum of the cough jet and its aerosol content," the study found. Some does leak out the sides, top and bottom, but without much momentum. A 2009 study by researchers in the U.S. and Singapore uses schlieren imaging to show airflow from a person\'s mouth a) without a mask b) with a medical mask and c) with an N95 mask. (Gary S. Settles/Penn State University/Journal of the Royal Society Interface) The World Health Organization recommends that people wear masks if they are coughing and sneezing or if they are caring for someone who is sick. It notes that studies haven\'t been conducted yet on whether or not transmission is reduced when healthy people wear\xa0masks in public, but it encourages countries to look into that.\xa0 Many governments haven\'t waited. Los Angeles, Italy and Austria are among the places that have begun requiring customers to wear masks while shopping.\xa0  To read the entire\xa0Second Opinion\xa0newsletter every Saturday morning, subscribe by\xa0clicking here.'

From this text, we will select one sentence to be used by the examples provided within the notebook, sentence_example, and five sentences that you will use to answer five questions within this section, sentence1, ..., sentence5. Sentences 4 and 5 are the same because that sentence works well for both questions.

# Sentence to be used for running examples
sentence_example = "Government guidelines in Canada recommend that people stay at least two metres away from others as part of physical distancing measures to curb the spread of COVID-19."
# Sentences to be used for future questions
sentence1 = "I think those are provocative and those are hypothesis- generating, but then they need to be tested in the field.\" Loeb is running such a field test himself — a randomized controlled trial of the use of medical versus N95 masks among health care workers to see if there is a difference in the transmission of COVID-19."
sentence2 = "The World Health Organization recommends that people wear masks if they are coughing and sneezing or if they are caring for someone who is sick."
sentence3 = "Will it help if everyone wears a mask?"
sentence4 = "Infection control guidelines do recommend extra personal protective equipment (including N-95 respirators) to protect against airborne transmission for healthcare workers performing procedures that generate high concentrations of aerosolized particles, such as intubations, on COVID-19 patients, McGeer said."
sentence5 = "Infection control guidelines do recommend extra personal protective equipment (including N-95 respirators) to protect against airborne transmission for healthcare workers performing procedures that generate high concentrations of aerosolized particles, such as intubations, on COVID-19 patients, McGeer said."

With the sentences that we will be using defined, we will now explore how spaCy can be applied to a sentence and go through examples of its usage. You will then be asked questions to solve on your own, using your understanding of the NLP concepts, the examples, and links to relevant documentation.

First, we will pass the example sentence into our spaCy object sp to retrieve the tokenization, lemmatization, dependency values, Part-of-Speech (POS) tags, and more from the sentence. As you will see, spaCy makes this process very easy.

# Call our spaCy object to retrieve the results of running the sentence through the NLP Pipeline
# Note that we can reuse the sp variable without redefining it.
sentence_example_content = sp(sentence_example)
for token in sentence_example_content:
    print("Text: " + str(token.text) + " Lemma: " + str(token.lemma_) + " POS: " + token.pos_ + 
          " Dependency: " + token.dep_)
# We will take a look at the dependency tree to view how the words relate to each other
displacy.render(sentence_example_content, style="dep")
Text: Government Lemma: government POS: NOUN Dependency: compound
Text: guidelines Lemma: guideline POS: NOUN Dependency: nsubj
Text: in Lemma: in POS: ADP Dependency: prep
Text: Canada Lemma: Canada POS: PROPN Dependency: pobj
Text: recommend Lemma: recommend POS: VERB Dependency: ROOT
Text: that Lemma: that POS: SCONJ Dependency: mark
Text: people Lemma: people POS: NOUN Dependency: nsubj
Text: stay Lemma: stay POS: VERB Dependency: ccomp
Text: at Lemma: at POS: ADV Dependency: advmod
Text: least Lemma: least POS: ADV Dependency: advmod
Text: two Lemma: two POS: NUM Dependency: nummod
Text: metres Lemma: metre POS: NOUN Dependency: npadvmod
Text: away Lemma: away POS: ADV Dependency: advmod
Text: from Lemma: from POS: ADP Dependency: prep
Text: others Lemma: other POS: NOUN Dependency: pobj
Text: as Lemma: as POS: SCONJ Dependency: prep
Text: part Lemma: part POS: NOUN Dependency: pobj
Text: of Lemma: of POS: ADP Dependency: prep
Text: physical Lemma: physical POS: ADJ Dependency: amod
Text: distancing Lemma: distancing POS: NOUN Dependency: compound
Text: measures Lemma: measure POS: NOUN Dependency: pobj
Text: to Lemma: to POS: PART Dependency: aux
Text: curb Lemma: curb POS: VERB Dependency: relcl
Text: the Lemma: the POS: DET Dependency: det
Text: spread Lemma: spread POS: NOUN Dependency: dobj
Text: of Lemma: of POS: ADP Dependency: prep
Text: COVID-19 Lemma: covid-19 POS: NUM Dependency: pobj
Text: . Lemma: . POS: PUNCT Dependency: punct
[displacy dependency tree rendering for sentence_example]

In the code above, we see that we are able to access the tags from the dependency tree for each token by calling .dep_. However, we can also directly access elements of the dependency tree (as seen in the code below). For more examples regarding how to navigate dependency trees, you can take a look at some official spaCy examples; however, the details below are enough for this notebook.

Looking at the dependency tree above, we see that arrows represent the relationships between the words. Each arrow has a dependency label that explains the dependency between the two tokens. For example, 'Government' is a compound of the noun 'guidelines'. Thus, 'Government guidelines' is a noun compound.
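
Since each token stores both its dependency label (.dep_) and a reference to its head token (.head), a compound relation like this can also be read off programmatically. Below is a small sketch using the parsed example sentence from above:

# Print each compound-tagged token together with the head token it modifies
for token in sentence_example_content:
    if token.dep_ == "compound":
        print(token.text, "->", token.head.text)  # e.g. Government -> guidelines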

In the code, after parsing text with spaCy into tokens, it is possible to access the words that a token's arrows connect to. This is exhibited in the code below.

Note that when accessing a child node, you are able to access the properties in the same way that you would for a regular spaCy token (.pos_, ...).
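
For example, here is a short sketch that prints each child's POS tag and dependency label, reusing the parsed example sentence:

# Child nodes are ordinary tokens, so their attributes are accessed the same way
for token in sentence_example_content:
    for child in token.children:
        print(token.text + " -> " + child.text + " (" + child.pos_ + ", " + child.dep_ + ")")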

# Display how to access the dependency children within a dependency tree
for token in sentence_example_content:
    print("Current token: " + token.text)
    print("All children of this token:", list(token.children))
    print("Left children of this token:", list(token.lefts))
    print("Right children of this token:", list(token.rights))
    print()
Current token: Government
All children of this token: []
Left children of this token: []
Right children of this token: []

Current token: guidelines
All children of this token: [Government, in]
Left children of this token: [Government]
Right children of this token: [in]

Current token: in
All children of this token: [Canada]
Left children of this token: []
Right children of this token: [Canada]

Current token: Canada
All children of this token: []
Left children of this token: []
Right children of this token: []

Current token: recommend
All children of this token: [guidelines, stay, .]
Left children of this token: [guidelines]
Right children of this token: [stay, .]

Current token: that
All children of this token: []
Left children of this token: []
Right children of this token: []

Current token: people
All children of this token: []
Left children of this token: []
Right children of this token: []

Current token: stay
All children of this token: [that, people, away, as]
Left children of this token: [that, people]
Right children of this token: [away, as]

Current token: at
All children of this token: []
Left children of this token: []
Right children of this token: []

Current token: least
All children of this token: [at]
Left children of this token: [at]
Right children of this token: []

Current token: two
All children of this token: [least]
Left children of this token: [least]
Right children of this token: []

Current token: metres
All children of this token: [two]
Left children of this token: [two]
Right children of this token: []

Current token: away
All children of this token: [metres, from]
Left children of this token: [metres]
Right children of this token: [from]

Current token: from
All children of this token: [others]
Left children of this token: []
Right children of this token: [others]

Current token: others
All children of this token: []
Left children of this token: []
Right children of this token: []

Current token: as
All children of this token: [part]
Left children of this token: []
Right children of this token: [part]

Current token: part
All children of this token: [of]
Left children of this token: []
Right children of this token: [of]

Current token: of
All children of this token: [measures]
Left children of this token: []
Right children of this token: [measures]

Current token: physical
All children of this token: []
Left children of this token: []
Right children of this token: []

Current token: distancing
All children of this token: [physical]
Left children of this token: [physical]
Right children of this token: []

Current token: measures
All children of this token: [distancing, curb]
Left children of this token: [distancing]
Right children of this token: [curb]

Current token: to
All children of this token: []
Left children of this token: []
Right children of this token: []

Current token: curb
All children of this token: [to, spread]
Left children of this token: [to]
Right children of this token: [spread]

Current token: the
All children of this token: []
Left children of this token: []
Right children of this token: []

Current token: spread
All children of this token: [the, of]
Left children of this token: [the]
Right children of this token: [of]

Current token: of
All children of this token: [COVID-19]
Left children of this token: []
Right children of this token: [COVID-19]

Current token: COVID-19
All children of this token: []
Left children of this token: []
Right children of this token: []

Current token: .
All children of this token: []
Left children of this token: []
Right children of this token: []

We can also use sp to retrieve the Named Entity Recognition (NER) tags of terms in a provided text. Below we use the results obtained above to print each entity span in the sentence along with its NER tag. We access the tag by calling .label_ on an element of the iterable returned by calling .ents on the results returned by spaCy.

# Loop through all entity spans detected by spaCy and print each along with the corresponding NER tag
for token in sentence_example_content.ents:
    print("\"" + token.text + "\" is a " + token.label_ )
"Canada" is a GPE
"at least two metres" is a QUANTITY

We'll explore Named Entity Recognition further in the next notebook, as this one focuses on linguistic analysis at the sentence and corpus levels.

(TO DO) Q1 - 1 mark
For sentence1, use spaCy to run the sentence through the NLP Pipeline and determine how many tokens are in the sentence.

# TODO: How many tokens
sentence1_content = sp(sentence1)
print(len(sentence1_content))
61

(TO DO) Q2 - 2 marks
For sentence2, display the dependency tree and determine the subject of the verb recommends (the entire name). You do not need to do this automatically; just print the value that you find by looking at the dependency tree.

# Display the dependency tree for sentence2
sentence2_content = sp(sentence2)
displacy.render(sentence2_content, style="dep")
# What is the subject of the verb 'recommends' in sentence2?
print("The subject of verb recommends is The World Health Organization")
[displacy dependency tree rendering for sentence2]
The subject of verb recommends is The World Health Organization

(TO DO) Q3 - 2 marks
For sentence3, use spaCy to run the sentence through the NLP Pipeline and print only the words that are VERBs.

# TODO: Find the verbs
sentence3_content = sp(sentence3)
for token in sentence3_content:
    if token.pos_ == "VERB":
        print("Text: " + str(token.text) + " Lemma: " + str(token.lemma_) + " POS: " + token.pos_ + 
              " Dependency: " + token.dep_)
Text: Will Lemma: Will POS: VERB Dependency: aux
Text: help Lemma: help POS: VERB Dependency: ROOT
Text: wears Lemma: wear POS: VERB Dependency: ccomp

(TO DO) Q4
a) For sentence4, use spaCy to run the sentence through the NLP Pipeline and print only the words that are adjectives (ADJs).
b) For each adjective found in (a), find the nouns that the adjective modifies. To do this, you will need to go through the tags from the dependency tree to find the adjectives with the amod label and the noun that each one modifies.

(TO DO) Q4 (a) - 1 mark
a) For sentence4, use spaCy to run the sentence through the NLP Pipeline and print only the words that are adjectives (ADJs).

# TODO: Find the adjectives
s4 = sp(sentence4)
for token in s4:
    if token.pos_ == "ADJ":
        print("Text: " + str(token.text) + " Lemma: " + str(token.lemma_) + " POS: " + token.pos_ + 
              " Dependency: " + token.dep_)
Text: extra Lemma: extra POS: ADJ Dependency: amod
Text: personal Lemma: personal POS: ADJ Dependency: amod
Text: protective Lemma: protective POS: ADJ Dependency: amod
Text: airborne Lemma: airborne POS: ADJ Dependency: amod
Text: high Lemma: high POS: ADJ Dependency: amod
Text: such Lemma: such POS: ADJ Dependency: amod
Text: COVID-19 Lemma: covid-19 POS: ADJ Dependency: nummod

(TO DO) Q4 (b) - 3 marks
b) For each applicable adjective in the sentence, find the noun that the adjective modifies. If several adjectives modify a single noun, each of those adjectives should be printed with that noun (ex: 'extra -> equipment').

To do this, you will need to go through the tags from the dependency tree to find the adjectives with the amod label and the noun that each one modifies. Note that not every adjective will have the amod dependency, but many will.

Hint: Recall from the example at the beginning of this part that you are able to select a token and access each arrow leaving the token.

Also note that you can approach this problem in many ways, so feel free to design the approach yourself (as long as it correctly answers the question).

# Display the dependency tree       
displacy.render(s4, style="dep")
[displacy dependency tree rendering for sentence4]
# TODO: Print the nouns that an adjective modifies with the amod dependency label
# Go through the spaCy tokens, select the adjectives with the amod dependency,
# and print each adjective with the head noun that it modifies
for token in s4:
    if token.pos_ == "ADJ" and token.dep_ == "amod":
        print(token.text + " -> " + token.head.text)
extra -> equipment
personal -> equipment
protective -> equipment
airborne -> transmission
high -> concentrations
such -> particles

(TO DO) Q5 - 3 marks
For sentence5, use spaCy to run the sentence through the NLP Pipeline and find all noun compounds. A noun compound consists of one or more words with a compound dependency value (that are also NOUNs) followed by a noun (compound, ..., compound, non-compound NOUN).

To view the noun compounds, you can view the dependency tree of the sentence after running it through the NLP pipeline via spaCy.

You must put all of the compounds together in order to get marks. Print the obtained noun compounds.

Note that you can approach this problem in many ways, so feel free to design the approach yourself (as long as it correctly answers the question).

Ex: 'Infection control guidelines' is a noun compound.

Below we parsed the sentence, printed out the compounds, and displayed the dependency tree for you to look at before coding in the following cell.

# Apply spaCy to sentence5
s5 = sp(sentence5)

# Display all compounds within the sentence
for token in s5:
    if token.dep_ == "compound":
        print(token)

# Display the dependency tree
displacy.render(s5, style="dep")
Infection
control
N-95
healthcare
[displacy dependency tree rendering for sentence5]
# TODO: Find, connect, and print all noun compounds
# Group each compound-tagged token under the non-compound noun that heads its chain,
# then print every chain in document order
compound_chains = {}
for token in s5:
    if token.dep_ == "compound":
        head = token.head
        while head.dep_ == "compound":  # climb chained compounds to the final noun
            head = head.head
        compound_chains.setdefault(head, []).append(token)
for head, modifiers in compound_chains.items():
    chain = sorted(modifiers + [head], key=lambda t: t.i)
    print(" ".join(t.text for t in chain))
Infection control guidelines
N-95 respirators
healthcare workers

Note that spaCy also offers s5.noun_chunks as a related convenience, but it returns full noun phrases (such as 'extra personal protective equipment'), which are broader than the compound chains requested here.

(TO DO) Q6 - 1 mark
Using the provided sentence, briefly explain the impact that length has on dependency tree of a sentence and on the parsing of the sentence (through a comment or print statement).

# TODO: Complete answer

sentence_parse = "In a surprisingly high turnout, millions of South Korean voters wore masks and moved slowly between lines of tape at polling stations on Wednesday to elect lawmakers in the shadow of the spreading coronavirus."
s_parse = sp(sentence_parse)
for token in s_parse:
    print("Text: " + str(token.text) + " Lemma: " + str(token.lemma_) + " POS: " + token.pos_ + 
          " Dependency: " + token.dep_)
displacy.render(s_parse, style="dep")
# TODO: What is the impact
print("The length of the sentence can have great impact on parsing. The dependency tree grows bigger and deeper as the sentence grows complexity. Trying to parse the sentence and find its noun compounds and adjectives becomes more difficult.")
Text: In Lemma: in POS: ADP Dependency: prep
Text: a Lemma: a POS: DET Dependency: det
Text: surprisingly Lemma: surprisingly POS: ADV Dependency: advmod
Text: high Lemma: high POS: ADJ Dependency: amod
Text: turnout Lemma: turnout POS: NOUN Dependency: pobj
Text: , Lemma: , POS: PUNCT Dependency: punct
Text: millions Lemma: million POS: NOUN Dependency: nsubj
Text: of Lemma: of POS: ADP Dependency: prep
Text: South Lemma: south POS: ADJ Dependency: amod
Text: Korean Lemma: korean POS: ADJ Dependency: amod
Text: voters Lemma: voter POS: NOUN Dependency: pobj
Text: wore Lemma: wear POS: VERB Dependency: ROOT
Text: masks Lemma: mask POS: NOUN Dependency: dobj
Text: and Lemma: and POS: CCONJ Dependency: cc
Text: moved Lemma: move POS: VERB Dependency: conj
Text: slowly Lemma: slowly POS: ADV Dependency: advmod
Text: between Lemma: between POS: ADP Dependency: prep
Text: lines Lemma: line POS: NOUN Dependency: pobj
Text: of Lemma: of POS: ADP Dependency: prep
Text: tape Lemma: tape POS: NOUN Dependency: pobj
Text: at Lemma: at POS: ADP Dependency: prep
Text: polling Lemma: polling POS: NOUN Dependency: compound
Text: stations Lemma: station POS: NOUN Dependency: pobj
Text: on Lemma: on POS: ADP Dependency: prep
Text: Wednesday Lemma: Wednesday POS: PROPN Dependency: pobj
Text: to Lemma: to POS: PART Dependency: aux
Text: elect Lemma: elect POS: VERB Dependency: advcl
Text: lawmakers Lemma: lawmaker POS: NOUN Dependency: dobj
Text: in Lemma: in POS: ADP Dependency: prep
Text: the Lemma: the POS: DET Dependency: det
Text: shadow Lemma: shadow POS: NOUN Dependency: pobj
Text: of Lemma: of POS: ADP Dependency: prep
Text: the Lemma: the POS: DET Dependency: det
Text: spreading Lemma: spread POS: VERB Dependency: amod
Text: coronavirus Lemma: coronavirus POS: NOUN Dependency: pobj
Text: . Lemma: . POS: PUNCT Dependency: punct
[displacy dependency tree rendering for sentence_parse]
Sentence length has a great impact on parsing: the dependency tree grows bigger and deeper as the sentence grows in length and complexity, so parsing the sentence and finding its noun compounds and adjectives becomes more difficult.

PART 2 - Corpus Analysis

For the second section of this notebook, we will focus on analyzing the entire corpus by making word clouds based on its content. This will help us identify the key words within the articles based on the criteria that we apply to the data with NLP techniques.

For this section we will be using the WordCloud library, which allows us to create word clouds from raw text or from the frequencies of the words within the text. The code for generating word clouds based on frequencies comes from this WordCloud example.

We will start with a simple example of creating a word cloud based on the titles of the documents in our corpus. Although we could use the descriptions of the articles or the actual text, this can take too long. Thus, we will be working with the titles, which allows each word cloud to be generated in around a minute.

We will make a word cloud based on the frequencies of each term from the titles in our corpus, by calling the getFrequencyDictForText function below, and passing those frequencies to the word cloud via WordCloud's generate_from_frequencies function.

# Code from the example in: https://amueller.github.io/word_cloud/auto_examples/frequency.html
def getFrequencyDictForText(sentence):
    fullTermsDict = multidict.MultiDict()
    tmpDict = {}
    # making dict for counting frequencies
    for text in sentence.split(" "):
        val = tmpDict.get(text.lower(), 0)  # look up the lower-cased key so counts accumulate case-insensitively
        tmpDict[text.lower()] = val + 1
    for key in tmpDict:
        fullTermsDict.add(key, tmpDict[key])
    return fullTermsDict
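
As a quick sanity check of the helper (a toy input, not part of the marked work), note that the counting is case-insensitive because every key is lower-cased before being stored:

# 'Masks' and 'masks' are counted together under the single key 'masks'
print(getFrequencyDictForText("Masks and more masks"))
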
# This function comes from: https://towardsdatascience.com/simple-wordcloud-in-python-2ae54a9f58e5
# Define a function to plot a word cloud
def plot_cloud(wordcloud):
    # Set figure size
    plt.figure(figsize=(40, 30))
    # Display image
    plt.imshow(wordcloud) 
    # No axis details
    plt.axis("off");
# This can take about a minute
# Retrieve the frequencies from the titles in the dataframe
frequencies = getFrequencyDictForText(' '.join(df["title"]))
# Create a word cloud based on the frequencies from the titles in the dataframe
word_cloud = WordCloud(width=3000, height=2000, random_state=1, collocations=False).generate_from_frequencies(frequencies)
# Plot the word cloud
plot_cloud(word_cloud)

As you can see, we can easily use the frequency of the terms from the titles to create a word cloud. These word clouds can be customized to fit within image shapes, use custom colours, and more.
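
For instance, here is a minimal sketch of a restyled cloud; the background_color and colormap choices are illustrative, not requirements of this notebook:

# Same frequencies, different styling: white background and a different colour map
styled_cloud = WordCloud(width=3000, height=2000, random_state=1, collocations=False,
                         background_color='white', colormap='viridis'
                         ).generate_from_frequencies(frequencies)
plot_cloud(styled_cloud)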

The above word cloud contains some important terms, but many of the most frequent terms are not very important (or are symbols/numbers). Terms that appear very frequently in many different types of documents without being important are called stopwords. For example, the words the, to, is, of, ... appear extremely frequently in text but provide no meaningful information when analyzing a document. We often want to remove these stopwords, so NLP libraries such as spaCy provide methods for detecting which words are stopwords. Below is an example of how spaCy can be used to determine whether a word is a stopword (based on the sentence used in the first example from Part 1).

# Call our spaCy object to retrieve the results of running the sentence through the NLP Pipeline
# Note that we can reuse the sp variable without redefining it.
sentence_example_content = sp(sentence_example)
for token in sentence_example_content:
    print("Text: " + str(token.text) + " Is stopword: " + str(token.is_stop))
Text: Government Is stopword: False
Text: guidelines Is stopword: False
Text: in Is stopword: True
Text: Canada Is stopword: False
Text: recommend Is stopword: False
Text: that Is stopword: True
Text: people Is stopword: False
Text: stay Is stopword: False
Text: at Is stopword: True
Text: least Is stopword: True
Text: two Is stopword: True
Text: metres Is stopword: False
Text: away Is stopword: False
Text: from Is stopword: True
Text: others Is stopword: True
Text: as Is stopword: True
Text: part Is stopword: True
Text: of Is stopword: True
Text: physical Is stopword: False
Text: distancing Is stopword: False
Text: measures Is stopword: False
Text: to Is stopword: True
Text: curb Is stopword: False
Text: the Is stopword: True
Text: spread Is stopword: False
Text: of Is stopword: True
Text: COVID-19 Is stopword: False
Text: . Is stopword: False

Thus, in the next few questions you will explore different ways of manipulating the title data before generating the frequencies used to create the word clouds. This will result in different word clouds that allow us to view important terms from the text based on certain criteria.

In the next few questions, be sure to recall the example from Part 1, which exhibits how spaCy can be used to perform lemmatization, and the example above, which exhibits how spaCy can be used to perform stopword detection.

(TO DO) Q7 - 2 marks
Make a word cloud based on the frequency of the content from the document titles, where the stopwords are removed (you must use spaCy to find the stopwords).

Ensure that you use random_state=1 when generating the word cloud.

# TO DO
# Get the titles and run them through spaCy
title_content = sp(' '.join(df["title"]))
# Collect the tokens that are not stopwords
no_stop_words = []
for t in title_content:
    if not t.is_stop:
        no_stop_words.append(t.text)
# Get the frequencies
frequencies = getFrequencyDictForText(' '.join(no_stop_words))
# Create the word cloud (with random_state=1)
word_cloud = WordCloud(width=3000, height=2000, random_state=1, collocations=False).generate_from_frequencies(frequencies)
# Plot the word cloud
plot_cloud(word_cloud)

(TO DO) Q8
1) Make a lemma cloud (a lemma is from the lemmatization of a token) based on the frequency of the content from the document titles with the stopwords removed, where the lemmas come from spaCy.
2) Then, compare the resulting word cloud with the word cloud generated in Q7. What is the difference between the two?

Ensure that you use random_state=1 when generating the word cloud.

(TO DO) Q8 (a) - 2 marks
Make a lemma cloud (a lemma is from the lemmatization of a token) based on the frequency of the content from the document titles with the stopwords removed, where the lemmas come from spaCy.

Ensure that you use random_state=1 when generating the word cloud.

# TO DO
no_stop_words = []
for t in title_content:
    if not t.is_stop:
        no_stop_words.append(t.lemma_)

frequencies = getFrequencyDictForText(' '.join(no_stop_words))
# Create the word cloud (with random_state=1)
word_cloud = WordCloud(width=3000, height=2000, random_state=1, collocations=False).generate_from_frequencies(frequencies)
# Plot the word cloud
plot_cloud(word_cloud)

(TO DO) Q8 (b) - 1 mark
Compare the resulting word cloud with the word cloud generated in Q7. What is the difference between the two (give a specific example)?

TODO (b) Lemmatizing the words of the news data appears to give the cloud less meaning overall. Words like 'restrictions' and 'supplies' seem to carry more meaning in the original word cloud; 'supplies' is not even part of the second picture (only its lemma 'supply' is). Overall, the lemmatization of these titles seems to give them less context.

(TO DO) Q9 - 3 marks
Build a word cloud with the content of the titles of the documents where only the adjectives (ADJ) are used AND all of the stopwords are removed AND the lemmas are added (rather than the text).

Ensure that you use random_state=1 when generating the word cloud.

# TO DO (adjectives only, remove stopwords, and add only lemmas)
no_stop_words = []
for t in title_content:
    if not t.is_stop and t.pos_ == "ADJ":
        no_stop_words.append(t.lemma_)

frequencies = getFrequencyDictForText(' '.join(no_stop_words))
# Create the word cloud (with random_state=1)
word_cloud = WordCloud(width=3000, height=2000, random_state=1, collocations=False).generate_from_frequencies(frequencies)
# Plot the word cloud
plot_cloud(word_cloud)

(TO DO) Q10 - 2 marks
Based on your own choice, build a word cloud with the content of the titles of the documents where only the verbs or only the nouns are used (you select one of these to work with) AND all of the stopwords are removed AND the lemmas are added (rather than the text).

Ensure that you use random_state=1 when generating the word cloud.

# TO DO (select only one of the POS types above, remove stopwords, and add only lemmas)
no_stop_words = []
for t in title_content:
    if not t.is_stop and t.pos_ == "VERB":
        no_stop_words.append(t.lemma_)

frequencies = getFrequencyDictForText(' '.join(no_stop_words))
# Create the word cloud (with random_state=1)
word_cloud = WordCloud(width=3000, height=2000, random_state=1, collocations=False).generate_from_frequencies(frequencies)
# Plot the word cloud
plot_cloud(word_cloud)

Now that all of the word clouds have been created, you will answer some questions to analyze how the NLP techniques that have been performed impacted the generated word clouds.

(TO DO) Q11
Answer the following two questions:
1) How does removing the stopwords in Q7 affect the word cloud?
2) Out of the word clouds that you have created, which word cloud do you believe provided the most relevant terms related to Covid-19 and why?

(TO DO) Q11 (a) - 1 mark
How does removing the stopwords in Q7 affect the word cloud?

Removing the stopwords in Q7 allows the word cloud to convey more meaningful content. Stopwords do not provide much context, and they filled up the word cloud with uninformative terms, words like 'or' for example.

(TO DO) Q11 (b) - 1 mark
Out of the word clouds that you have created, which word cloud do you believe provided the most relevant terms related to Covid-19 and why?

The word cloud I believe provided the most relevant terms related to Covid-19 is the one from Q7. By removing the stopwords but keeping the original text (rather than using the lemmas), the overall message stays clear; having read news articles throughout these months, catchphrases like 'restrictions', 'supplies', and 'cases' carry much more overall meaning. When we build word clouds using lemmatization, some of the words lose their original meaning.