How To Create A Multiple Language Dictionary Using A Pipeline


A language dictionary is the best way to check spelling errors in a document using NLP techniques in Python. How? Simply create a compare function using the recordlinkage module, or write a function using the Jaccard similarity equation. But that is for another time.
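As a quick illustration of the idea, a Jaccard similarity check between two words can be as simple as comparing their sets of character bigrams. The helper below is a hypothetical sketch of that approach, not the compare function discussed later:

# Hypothetical sketch: Jaccard similarity over character bigrams,
# useful for flagging likely misspellings against a dictionary word.
def jaccard_similarity(word_a, word_b):
    bigrams_a = {word_a[i:i + 2] for i in range(len(word_a) - 1)}
    bigrams_b = {word_b[i:i + 2] for i in range(len(word_b) - 1)}
    if not bigrams_a or not bigrams_b:
        return 0.0
    return len(bigrams_a & bigrams_b) / len(bigrams_a | bigrams_b)

print(jaccard_similarity("dictionary", "dictionry"))  # higher scores mean closer matches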

Please import:

%%time
%matplotlib inline
from matplotlib import pyplot as plt
import time
import re, random
import random
import string
import sys, types, os
import numpy as np
import pandas as pd
from textblob import Word
from nltk.tag import pos_tag
from nltk import word_tokenize
from textblob.taggers import PatternTagger
from textblob.decorators import requires_nltk_corpus
from textblob.utils import tree2str, filter_insignificant
from textblob.base import BaseNPExtractor
from textblob.wordnet import VERB
from textblob import Word
from spacy import displacy
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.util import ngrams
from nltk.stem import PorterStemmer
stemming = PorterStemmer()
from nltk import trigrams
nltk.download('punkt')
nltk.download('wordnet')
stop = stopwords.words('english')
import en_core_web_sm
nlp = en_core_web_sm.load()
from spacy.language import Language

from spacy.pipeline import EntityRuler
ruler = EntityRuler(nlp)
nlp.add_pipe(ruler)

from translate import Translator
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token
from autocorrect import Speller
spell = Speller(lang="en")
from textblob import TextBlob
import spacy

To Generate A Word List:

from nltk.corpus import words
# Wrap the NLTK word corpus in a DataFrame so it can be saved with to_csv
word_list = pd.DataFrame(words.words(), columns=['words'])
# To save the modified file to your desktop. This is the easiest way to save a checkpoint when you modify a document in Python.
word_list.to_csv(r'C:\Users\XXXXXXXX\Desktop\dictwordslang.csv', index=False, header=True)

Save the file on your desktop, then upload it into your Jupyter Notebook or JupyterLab and import the word list into your Python code.
Let us prepare the word list: text processing, turning the words into tokens, lemmatization, positions, tags, dependencies, alpha, and stop words.
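As a minimal sketch (assuming the dictwordslang.csv file saved above; the path placeholder is unchanged), the saved word list can be read back in like this:

# Read the saved word list back into a DataFrame named wordlist,
# which the rest of the code in this post builds on.
wordlist = pd.read_csv(r'C:\Users\XXXXXXXX\Desktop\dictwordslang.csv')
print(wordlist.head())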

Word Tokens: the process of segmenting text into words, punctuation marks, and so on

Word Lemmatization: the process of grouping together the inflected forms of a word so they can be analyzed together, identified by the word's lemma or dictionary form

Word Position: the process of categorizing words into their parts of speech

Word Tag: the process of assigning detailed linguistic information to each word

Word Dependency: the process of assigning syntactic dependency labels that describe the relations between individual tokens, such as subject or object

Word Alpha: the process of determining whether a word consists of alphabetic characters

Word Stop: the process of identifying stop words, for example (is, not, this)

You can do this in spaCy:

%%time
tokens = []
lemma = []
pos = []
tag = []
dep = []
alpha = []
stop = []



for doc in nlp.pipe(wordlist['words'].astype('unicode').values, batch_size=100, n_threads=4):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
        tag.append([n.tag_ for n in doc])
        dep.append([n.dep_ for n in doc])
        alpha.append([n.is_alpha for n in doc])
        stop.append([n.is_stop for n in doc])

        
    else:
        # We want to make sure the lists of parsed results have the
        # same number of entries as the original DataFrame, so add some
        # blanks in case the parse fails

        tokens.append(None)
        lemma.append(None)
        pos.append(None)
        tag.append(None)
        dep.append(None)
        alpha.append(None)
        stop.append(None)
        
        
wordlist['tokens'] = tokens
wordlist['lemma'] = lemma
wordlist['pos'] = pos
wordlist['tag'] = tag
wordlist['dep'] = dep 
wordlist['alpha'] = alpha
wordlist['stop'] = stop

This takes me 1 minute and 40 seconds to complete. Note: if you are using this code to review a full document, it will take longer.

What I like to do is group the words by part of speech, each in its own column.

To Get Adjectives:

def get_adjectives(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("JJ")]
wordlist['adjectives'] = wordlist['words'].apply(get_adjectives)

To Get Verbs:

def get_verbs(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("VB")]

wordlist['verbs'] = wordlist['words'].apply(get_verbs)

To Get Adverbs:

def get_adverbs(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("RB")]

wordlist['adverb'] = wordlist['words'].apply(get_adverbs)

To Get Nouns:

def get_nouns(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("NN")]
wordlist['nouns'] = wordlist['words'].apply(get_nouns)

Word Sentiment:
To understand whether a word is negative, positive, or neutral

wordlist[['polarity', 'subjectivity']] = wordlist['words'].apply(lambda word: pd.Series(TextBlob(word).sentiment))
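As a quick sanity check, you can print TextBlob's sentiment for a few sample words first (a small sketch; the example words are mine):

# Polarity ranges from -1 (negative) to 1 (positive);
# subjectivity ranges from 0 (objective) to 1 (subjective).
for sample in ["horrible", "fine", "wonderful"]:
    print(sample, TextBlob(sample).sentiment)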

To Clean a Column of Words:
The code below will remove the brackets around the words.

wordlist['xxxx'] = wordlist['xxxx'].apply(lambda x: ",".join(x) if isinstance(x, list) else x)

Translate the List of Words
What you want to use is %%time on the first line, followed by the code that translates the words. The %%time magic records how long the code takes to run. On my system, translating into Spanish took 4 hours, 5 minutes, and 25 seconds. I would use TextBlob as a translator. Google Translate is excellent when you want to translate words accurately, but it does not work all the time because you need to connect to a server, and the server can be down.
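For reference, the TextBlob route looks roughly like this (a hedged sketch: TextBlob's translate method calls the Google Translate service, so it needs a network connection, and it has been removed from recent TextBlob releases):

# Translate a single word with TextBlob (older TextBlob versions only).
print(TextBlob("dictionary").translate(to="es"))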

To get language codes, go to Language Codes

%%time
# Note: this call follows the googletrans-style API (src/dest keywords and a .text attribute),
# not the translate package imported earlier.
translator = Translator()
wordlist["spanishwords"] = wordlist["words"].map(lambda x: translator.translate(x, src="en", dest="es").text)

You can create a translator pipe in order to translate into a number of languages as a series:

spanishme = Translator(to_lang="es")
frenchme = Translator(to_lang="fr")
italianme = Translator(to_lang="it")
germanme = Translator(to_lang="de")
hindime = Translator(to_lang="hi")
chineseme = Translator(to_lang="zh")
japanme = Translator(to_lang="ja")
korenme = Translator(to_lang="ko")
taglome = Translator(to_lang="tl")
viteme = Translator(to_lang="vi")
thaime = Translator(to_lang="th")
russiame = Translator(to_lang="ru")
afrikaansme = Translator(to_lang="af")
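Each translator can also be called on its own; for example (a small sketch using the translate package's translate() method, with a sample word of my choosing):

# translate() returns the translated string directly.
print(spanishme.translate("dictionary"))  # e.g. "diccionario"
print(frenchme.translate("dictionary"))   # e.g. "dictionnaire"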

%%time

# Use separate result lists so the Translator objects defined above are not overwritten
spanish_words = []
french_words = []
italian_words = []
german_words = []
hindi_words = []
chinese_words = []
japanese_words = []
korean_words = []
tagalog_words = []
vietnamese_words = []
thai_words = []
russian_words = []
afrikaans_words = []

for doc in nlp.pipe(wordlist['words'].astype('unicode').values, batch_size=100, n_threads=4):
    spanish_words.append([spanishme.translate(n.text) for n in doc])
    french_words.append([frenchme.translate(n.text) for n in doc])
    italian_words.append([italianme.translate(n.text) for n in doc])
    german_words.append([germanme.translate(n.text) for n in doc])
    hindi_words.append([hindime.translate(n.text) for n in doc])
    chinese_words.append([chineseme.translate(n.text) for n in doc])
    japanese_words.append([japanme.translate(n.text) for n in doc])
    korean_words.append([korenme.translate(n.text) for n in doc])
    tagalog_words.append([taglome.translate(n.text) for n in doc])
    vietnamese_words.append([viteme.translate(n.text) for n in doc])
    thai_words.append([thaime.translate(n.text) for n in doc])
    russian_words.append([russiame.translate(n.text) for n in doc])
    afrikaans_words.append([afrikaansme.translate(n.text) for n in doc])

wordlist['spanishwords'] = spanish_words
wordlist['frenchwords'] = french_words
wordlist['italianwords'] = italian_words
wordlist['germanwords'] = german_words
wordlist['hindiwords'] = hindi_words
wordlist['chinesewords'] = chinese_words
wordlist['japanesewords'] = japanese_words
wordlist['koreanwords'] = korean_words
wordlist['tagalogwords'] = tagalog_words
wordlist['vietnamesewords'] = vietnamese_words
wordlist['thaiwords'] = thai_words
wordlist['russianwords'] = russian_words
wordlist['afrikaanwords'] = afrikaans_words

Depending on your system, this may take a while to run. On my laptop, translating into all of these languages would take at least 52 hours. My suggestion: use the code above to translate just two languages and add a timer to the code. From that time, you will know how long the full run will take.
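For example, a rough way to time two languages on a small sample before committing to the full run (a sketch that assumes the wordlist DataFrame and the translators defined above):

# Time Spanish and French on the first 100 words to estimate the full run.
start = time.time()
sample = wordlist['words'].astype('unicode').head(100)
spanish_sample = [spanishme.translate(w) for w in sample]
french_sample = [frenchme.translate(w) for w in sample]
print("Two languages, 100 words:", round(time.time() - start, 1), "seconds")

Scale that figure by the size of your word list and the number of languages to estimate the total running time.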

At completion, save the file to your desktop.

wordlist.to_csv(r'C:\Users\XXXXXXXX\Desktop\dictwordslangnew.csv', index=False, header=True)


