Information Extraction from Text Using Python

Information Extraction (IE) is a crucial component of Natural Language Processing (NLP) and linguistics. It is widely used in tasks such as Question Answering, Machine Translation, Entity Extraction, Event Extraction, Named Entity Linking, Coreference Resolution, and Relation Extraction.

In information extraction, there is an important concept of triples.

A triple represents a pair of entities and the relation between them. For example, (Obama, born, Hawaii) is a triple in which ‘Obama’ and ‘Hawaii’ are the related entities, and the relation between them is ‘born’.
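In Python, such a triple maps naturally onto a tuple of the form (subject, relation, object). A minimal illustration (the triples here are hand-written examples, not extracted automatically):

```python
# A triple is simply a (subject, relation, object) tuple
triples = [
    ("Obama", "born", "Hawaii"),
    ("Paris", "capital_of", "France"),
]

# Entities and relations can be pulled apart by position
for subj, rel, obj in triples:
    print(f"{subj} --{rel}--> {obj}")
```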

In this article, we will focus on the extraction of these types of triples from a given text.

In the previous section, we managed to extract triples from a few sentences with ease. However, in the real world, data sizes are huge and manual extraction of structured information is not feasible. Automating this information extraction therefore becomes important.

There are multiple approaches to performing information extraction automatically. Let’s go through them one by one:

  1. Rule-based Approach: We define a set of rules for the syntax and other grammatical properties of a natural language and then use these rules to extract information from text.

  2. Supervised: Let’s say we have a sentence S. It has two entities E1 and E2. Now, the supervised machine learning model has to detect whether there is any relation (R) between E1 and E2. So, in a supervised approach, the task of relation extraction turns into the task of relation detection. The main drawback of this approach is that it needs a lot of labeled data to train a model.

  3. Semi-supervised: When we don’t have enough labeled data, we can use a set of seed examples (triples) to formulate high-precision patterns that can be used to extract more relations from the text.
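The semi-supervised idea can be sketched in a few lines of plain Python. Here, a single seed triple suggests a high-precision surface template (“X was born in Y”), which is then applied to harvest new triples; the corpus, seed, and template are all made up for illustration:

```python
import re

# Toy corpus (made up for illustration)
corpus = [
    "Obama was born in Hawaii.",
    "Einstein was born in Ulm.",
    "Paris is the capital of France.",
]

# The seed triple suggests the surface pattern "<X> was born in <Y>"
seed = ("Obama", "born", "Hawaii")
template = re.compile(r"(\w+) was born in (\w+)")

# Bootstrap: apply the high-precision pattern to find new triples
extracted = [(m.group(1), "born", m.group(2))
             for sent in corpus
             for m in template.finditer(sent)]
print(extracted)
```

Besides re-finding the seed, this also extracts the new triple (Einstein, born, Ulm). In a real system, the newly found triples would in turn be used to induce further patterns.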

Information Extraction using Python and spaCy

In this section, we will use the very popular NLP library spaCy to discover and extract interesting information from text data, such as entity pairs that are connected by some relation.

1. spaCy’s Rule-Based Matching

First, we will import the required libraries:

import re 
import string 
import nltk 
import spacy 
import pandas as pd 
import numpy as np 
import math 
from tqdm import tqdm 

from spacy.matcher import Matcher 
from spacy.tokens import Span 
from spacy import displacy 

pd.set_option('display.max_colwidth', 200)

Next, let’s load a spaCy model.

# load spaCy model
nlp = spacy.load("en_core_web_sm")

We are all set to mine information from text based on some interesting patterns.

Pattern: X such as Y

# sample text 
text = "GDP in developing countries such as Vietnam will continue growing at a high rate." 

# create a spaCy object 
doc = nlp(text)

To be able to pull out the desired information from the above sentence, it is really important to understand its syntactic structure — things like the subject, object, modifiers, and parts-of-speech (POS) in the sentence.

We can easily explore these syntactic details in the sentence by using spaCy:

# print token, dependency, POS tag 
for tok in doc: 
  print(tok.text, "-->",tok.dep_,"-->", tok.pos_)

Output:

GDP --> nsubj --> NOUN 
in --> prep --> ADP 
developing --> amod --> VERB 
countries --> pobj --> NOUN 
such --> amod --> ADJ 
as --> prep --> ADP 
Vietnam --> pobj --> PROPN 
will --> aux --> VERB 
continue --> ROOT --> VERB 
growing --> xcomp --> VERB 
at --> prep --> ADP 
a --> det --> DET 
high --> amod --> ADJ 
rate --> pobj --> NOUN 
. --> punct --> PUNCT

Have a look at the terms “such” and “as”. They are preceded by a noun (“countries”) and followed by a proper noun (“Vietnam”) that acts as the hyponym.

So, let’s create the required pattern using the dependency tags and the POS tags:

# define the pattern
pattern = [{'POS': 'NOUN'},
           {'LOWER': 'such'},
           {'LOWER': 'as'},
           {'POS': 'PROPN'}]  # proper noun

Let’s extract the pattern from the text.

# Matcher class object 
matcher = Matcher(nlp.vocab) 
matcher.add("matching_1", [pattern])  # spaCy v3 API

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 

print(span.text)

Output: “countries such as Vietnam”

Nice! It works perfectly. However, if we could get “developing countries” instead of just “countries”, then the output would make more sense.

So, we will now also capture the modifier of the noun just before “such as” by using the code below:

# Matcher class object
matcher = Matcher(nlp.vocab)

#define the pattern
pattern = [{'DEP':'amod', 'OP':"?"}, # adjectival modifier
           {'POS':'NOUN'},
           {'LOWER': 'such'},
           {'LOWER': 'as'},
           {'POS': 'PROPN'}]

matcher.add("matching_1", [pattern])  # spaCy v3 API
matches = matcher(doc)

span = doc[matches[0][1]:matches[0][2]]
print(span.text)

Output: “developing countries such as Vietnam”

Note: The key ‘OP’: ‘?’ in the pattern above means that the modifier (‘amod’) can occur once or not at all.
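Because of that optional token, the Matcher can return overlapping matches for the same phrase (one span with the modifier, one without), and `spacy.util.filter_spans` is a handy way to keep only the longest non-overlapping match. A small self-contained sketch of this behavior, using a blank pipeline and LOWER-based token patterns so that no trained model is needed (the pattern here mirrors the optional-token idea, not the exact 'amod' pattern above):

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Blank English pipeline: tokenizer only, no trained model required
nlp = spacy.blank("en")
doc = nlp("developing countries such as Vietnam")

matcher = Matcher(nlp.vocab)
# 'developing' is optional ('OP': '?'), like the 'amod' token above
pattern = [{'LOWER': 'developing', 'OP': '?'},
           {'LOWER': 'countries'},
           {'LOWER': 'such'},
           {'LOWER': 'as'},
           {'LOWER': 'vietnam'}]
matcher.add("demo", [pattern])

spans = [doc[start:end] for _, start, end in matcher(doc)]
print([s.text for s in spans])       # both the short and the long match
print(filter_spans(spans)[0].text)   # the longest match wins
```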

In a similar manner, we can get several pairs from any piece of text:

  • Fruits such as apples
  • Cars such as Ferrari
  • Flowers such as rose
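Once a span like these has been matched, turning it into an explicit (hypernym, hyponym) pair is a simple string operation. A minimal helper (the function name is our own, not part of spaCy):

```python
def span_to_pair(span_text):
    """Split a matched 'X such as Y' span into a (hypernym, hyponym) pair."""
    hypernym, hyponym = span_text.split(" such as ")
    return hypernym, hyponym

print(span_to_pair("developing countries such as Vietnam"))
# -> ('developing countries', 'Vietnam')
```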

Now let’s use some other patterns to extract more hypernyms and hyponyms.

Pattern: X and/or Y

doc = nlp("Here is how you can keep your car and other vehicles clean.") 

# print dependency tags and POS tags
for tok in doc: 
  print(tok.text, "-->",tok.dep_, "-->",tok.pos_)

Output:

Here --> advmod --> ADV 
is --> ROOT --> VERB 
how --> advmod --> ADV 
you --> nsubj --> PRON 
can --> aux --> VERB 
keep --> ccomp --> VERB 
your --> poss --> DET 
car --> dobj --> NOUN 
and --> cc --> CCONJ 
other --> amod --> ADJ 
vehicles --> conj --> NOUN 
clean --> oprd --> ADJ 
. --> punct --> PUNCT
# Matcher class object 
matcher = Matcher(nlp.vocab) 

#define the pattern 
pattern = [{'DEP':'amod', 'OP':"?"}, 
           {'POS':'NOUN'}, 
           {'LOWER': 'and', 'OP':"?"}, 
           {'LOWER': 'or', 'OP':"?"}, 
           {'LOWER': 'other'}, 
           {'POS': 'NOUN'}] 
           
matcher.add("matching_1", [pattern])  # spaCy v3 API

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 
print(span.text)

Output: “car and other vehicles”

Let’s try out the same code to capture the “X or Y” pattern.

# replaced 'and' with 'or' 
doc = nlp("Here is how you can keep your car or other vehicles clean.")

The rest of the code will remain the same.

# Matcher class object 
matcher = Matcher(nlp.vocab) 

#define the pattern 
pattern = [{'DEP':'amod', 'OP':"?"}, 
           {'POS':'NOUN'}, 
           {'LOWER': 'and', 'OP':"?"}, 
           {'LOWER': 'or', 'OP':"?"}, 
           {'LOWER': 'other'}, 
           {'POS': 'NOUN'}] 
           
matcher.add("matching_1", [pattern])  # spaCy v3 API

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 
print(span.text)

Output: “car or other vehicles”

Pattern: X, especially Y

doc = nlp("A healthy eating pattern includes fruits, especially whole fruits.") 

# print dependency tags and POS tags
for tok in doc: 
  print(tok.text, "-->", tok.dep_, "-->", tok.pos_)

Output:

A --> det --> DET 
healthy --> amod --> ADJ 
eating --> compound --> NOUN 
pattern --> nsubj --> NOUN 
includes --> ROOT --> VERB 
fruits --> dobj --> NOUN 
, --> punct --> PUNCT 
especially --> advmod --> ADV 
whole --> amod --> ADJ 
fruits --> appos --> NOUN 
. --> punct --> PUNCT

# Matcher class object 
matcher = Matcher(nlp.vocab)

#define the pattern 
pattern = [{'DEP':'nummod','OP':"?"}, 
           {'DEP':'amod','OP':"?"}, 
           {'POS':'NOUN'}, 
           {'IS_PUNCT':True}, 
           {'LOWER': 'especially'}, 
           {'DEP':'nummod','OP':"?"}, 
           {'DEP':'amod','OP':"?"}, 
           {'POS':'NOUN'}] 
           
matcher.add("matching_1", [pattern])  # spaCy v3 API

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 
print(span.text)

Output: “fruits, especially whole fruits”

Conclusion

In this article, we learned about Information Extraction, the concept of relations and triples, and different methods for relation extraction. Although we have covered a lot of ground, we have just scratched the surface of the field of Information Extraction.

I urge you all to implement this code yourself and see if you can come up with some interesting patterns to mine.

Thanks!
