How To Create A Multiple Language Dictionary Using A Pipeline
A language dictionary is a great way to verify spelling errors in a document using NLP techniques in Python. How? Simply create a review function using the recordlinkage module, or write a function using the Jaccard similarity equation. But that is for another time.
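As a quick taste of that future post, here is a minimal sketch of a Jaccard similarity check between two words. The character-set formulation and the function name are my own illustration, not the recordlinkage API:

def jaccard_similarity(word_a, word_b):
    # Jaccard similarity over character sets: |A intersect B| / |A union B|
    a, b = set(word_a), set(word_b)
    return len(a & b) / len(a | b)

print(jaccard_similarity("recieve", "receive"))  # 1.0 -- same characters, likely a typo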
Please import:
%%time
%matplotlib inline
from matplotlib import pyplot as plt
import time
import re, random, string
import sys, types, os
import numpy as np
import pandas as pd
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk, RegexpParser, Tree, trigrams
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.util import ngrams
from textblob import TextBlob, Word
from textblob.taggers import PatternTagger
from textblob.decorators import requires_nltk_corpus
from textblob.utils import tree2str, filter_insignificant
from textblob.base import BaseNPExtractor
from textblob.wordnet import VERB
import spacy
from spacy import displacy
from spacy.language import Language
from spacy.pipeline import EntityRuler
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token
from translate import Translator
from autocorrect import Speller

stemming = PorterStemmer()
nltk.download('punkt')
nltk.download('wordnet')
stop = stopwords.words('english')

import en_core_web_sm
nlp = en_core_web_sm.load()
ruler = EntityRuler(nlp)
nlp.add_pipe(ruler)

spell = Speller(lang='en')
To Generate A Word List:
from nltk.corpus import words
nltk.download('words')

# Build the word list as a DataFrame so it can be saved as a checkpoint.
word_list = pd.DataFrame(words.words(), columns=['words'])

# Save the file to your desktop. This is the easiest way to save a checkpoint
# when you have modified a document in Python.
word_list.to_csv(r'C:\Users\XXXXXXXX\Desktop\dictwordslang.csv', index=False, header=True)
Save the file on your desktop, then upload it into your Jupyter Notebook or JupyterLab and import the word list into your Python code.
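A minimal sketch for loading it back in; the path mirrors the save step above, and wordlist is the DataFrame name used in the rest of this post:

wordlist = pd.read_csv(r'C:\Users\XXXXXXXX\Desktop\dictwordslang.csv')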
Let us prepare the word list: text processing, turning the words into tokens, then lemmas, positions, tags, dependencies, alpha flags, and stop-word flags.
Word Tokens: the process of segmenting text into words, punctuation marks, and so on
Word Lemmatization: the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma or dictionary form
Word Position: the process of categorizing the words into their parts of speech
Word Tag: the process of assigning fine-grained linguistic information to the word
Word Dependency: the process of assigning syntactic dependency labels describing the relations between individual tokens, like subject or object
Word Alpha: the process of determining whether the word is alphabetic or not
Word Stop: the process of identifying stop words, for example (is, not, this)
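As an illustrative check (my own example, using the en_core_web_sm model loaded above), here is what those attributes look like for a single token:

# One token's text, lemma, position, tag, dependency, alpha, and stop attributes.
token = nlp("running")[0]
print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
      token.is_alpha, token.is_stop)
# e.g. running run VERB VBG ROOT True False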
You can compute all of these for the whole word list in spaCy:
%%time
tokens = []
lemma = []
pos = []
tag = []
dep = []
alpha = []
stop = []

for doc in nlp.pipe(wordlist['words'].astype('unicode').values, batch_size=100, n_threads=4):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
        tag.append([n.tag_ for n in doc])
        dep.append([n.dep_ for n in doc])
        alpha.append([n.is_alpha for n in doc])
        stop.append([n.is_stop for n in doc])
    else:
        # We want to make sure the lists of parsed results have the same
        # number of entries as the original DataFrame, so add blanks in
        # case the parse fails.
        tokens.append(None)
        lemma.append(None)
        pos.append(None)
        tag.append(None)
        dep.append(None)
        alpha.append(None)
        stop.append(None)

wordlist['tokens'] = tokens
wordlist['lemma'] = lemma
wordlist['pos'] = pos
wordlist['tag'] = tag
wordlist['dep'] = dep
wordlist['alpha'] = alpha
wordlist['stop'] = stop
This takes me 1 minute and 40 seconds to complete. Note: if you use this code to review a full document, it will take longer.
What I like to do is group the words into their own columns.
To Get Adjectives:
def get_adjectives(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("JJ")]

wordlist['adjectives'] = wordlist['words'].apply(get_adjectives)
To Get Verbs:
def get_verbs(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("VB")]

wordlist['verbs'] = wordlist['words'].apply(get_verbs)
To Get Adverbs:
def get_adverbs(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("RB")]

wordlist['adverb'] = wordlist['words'].apply(get_adverbs)
To Get Nouns:
def get_nouns(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("NN")]

wordlist['nouns'] = wordlist['words'].apply(get_nouns)
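Since these four functions differ only in the tag prefix, a single parameterized helper keeps things tidy. The name get_pos is my own; the column names match the ones created above:

def get_pos(text, prefix):
    # Keep words whose Penn Treebank tag starts with the given prefix.
    return [word for (word, tag) in TextBlob(text).tags if tag.startswith(prefix)]

for column, prefix in [('adjectives', 'JJ'), ('verbs', 'VB'),
                       ('adverb', 'RB'), ('nouns', 'NN')]:
    wordlist[column] = wordlist['words'].apply(lambda t: get_pos(t, prefix))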
Word Sentiment:
To understand whether a word is negative, positive, or neutral:
wordlist[['polarity', 'subjectivity']] = wordlist['words'].apply(
    lambda word: pd.Series(TextBlob(word).sentiment))
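As a quick sanity check (my own example), this is what TextBlob's sentiment looks like for a single word:

print(TextBlob("good").sentiment)
# e.g. Sentiment(polarity=0.7, subjectivity=0.6)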
To Clean Column of Words:
The code below will remove the brackets around the words.
wordlist['xxxx'] = wordlist['xxxx'].apply(lambda x: ",".join(x) if isinstance(x, list) else x)
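For example, applied to the adjectives column created earlier, list-valued cells become plain comma-joined strings:

# ['quick', 'brown'] becomes 'quick,brown'; non-list cells pass through unchanged.
wordlist['adjectives'] = wordlist['adjectives'].apply(
    lambda x: ",".join(x) if isinstance(x, list) else x)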
Translate the List of Words
What you can do is put %%time on the first line, followed by the code that translates the words. The %%time magic records how long the code takes to run. On my system, translating into Spanish took 4 hours, 5 minutes, and 25 seconds. I would use TextBlob as a translator. Google Translate is excellent when you want words translated correctly, but it does not work all the time, because you need to connect to a server, and the server may be down.
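For reference, TextBlob's translate() method looks like this. Note it only exists in older TextBlob releases (it was removed in recent versions) and it also calls the Google Translate endpoint, so it needs a connection:

# Works in older TextBlob versions; requires a connection to Google's servers.
print(TextBlob("hello").translate(from_lang='en', to='es'))  # e.g. hola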
To get language codes, go to Language Codes
%%time
# Note: this call signature matches the googletrans Translator; the Translator
# imported from the `translate` package takes to_lang= instead (see below).
translator = Translator()
wordlist["spanishwords"] = wordlist["words"].map(
    lambda x: translator.translate(x, src="en", dest="es").text)
You can create a translator pipe in order to translate into a number of languages as a series:
%%time
spanishme = Translator(to_lang="es")
frenchme = Translator(to_lang="fr")
italianme = Translator(to_lang="it")
germanme = Translator(to_lang="de")
hindime = Translator(to_lang="hi")
chineseme = Translator(to_lang="zh")
japanme = Translator(to_lang="ja")
koreanme = Translator(to_lang="ko")
taglome = Translator(to_lang="tl")
vietme = Translator(to_lang="vi")
thaime = Translator(to_lang="th")
russianme = Translator(to_lang="ru")
afrikaansme = Translator(to_lang="af")

# Run each translator over the word column. Every call hits the translation
# service once per word, which is what makes this step slow.
word_series = wordlist['words'].astype('unicode')
wordlist['spanishwords'] = word_series.map(spanishme.translate)
wordlist['frenchwords'] = word_series.map(frenchme.translate)
wordlist['italianwords'] = word_series.map(italianme.translate)
wordlist['germanwords'] = word_series.map(germanme.translate)
wordlist['hindiwords'] = word_series.map(hindime.translate)
wordlist['chinesewords'] = word_series.map(chineseme.translate)
wordlist['japanesewords'] = word_series.map(japanme.translate)
wordlist['koreanwords'] = word_series.map(koreanme.translate)
wordlist['tagalogwords'] = word_series.map(taglome.translate)
wordlist['vietnamesewords'] = word_series.map(vietme.translate)
wordlist['thaiwords'] = word_series.map(thaime.translate)
wordlist['russianwords'] = word_series.map(russianme.translate)
wordlist['afrikaanswords'] = word_series.map(afrikaansme.translate)
Depending on your system, this may take a while to run. On my laptop, translating into all of these languages would take at least 52 hours. My suggestion: use the code above to translate two languages first and add a timer to the code, as sketched below. From that timing, you will know how long the full run will take.
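A back-of-the-envelope way to add that timer; the sample size and the linear-scaling assumption are mine:

start = time.time()
_ = wordlist['words'].head(100).map(spanishme.translate)  # time a small sample
per_word = (time.time() - start) / 100
# 13 target languages, assuming each costs roughly the same per word
print(f"Estimated full run: {per_word * len(wordlist) * 13 / 3600:.1f} hours")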
At completion, save the file on your desktop.
wordlist.to_csv(r'C:\Users\XXXXXXXX\Desktop\dictwordslangnew.csv', index=False, header=True)