A Magical Introduction to Classification Algorithms


by Bryan Berend

About Bryan: Bryan Berend is a Lead Data Scientist at Nielsen, a large consumer and media research company with employees in over 100 countries. He currently lives in Chicago and loves running, cooking, and pub trivia.


When you first start learning about data science, one of the first things you learn about is classification algorithms. The concept behind these algorithms is pretty simple: take some information about a data point and place the data point in the correct group or class.

A good example is the email spam filter. The goal of a spam filter is to label incoming emails (i.e. data points) as “Spam” or “Not Spam” using information about the email (the sender, number of capitalized words in the message, etc.).
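As a minimal sketch of that idea, the rule below labels an email from two pieces of information. The function name, the blocked-sender list, the reading of “capitalized” as “written in all capitals”, and the threshold of five words are all hypothetical choices made for illustration, not how any real spam filter works.

```python
# Toy binary classifier: every name and threshold here is an illustrative assumption.
BLOCKED_SENDERS = {"prince@example.com"}  # hypothetical blocklist


def label_email(sender, message, max_caps_words=5):
    """Return "Spam" or "Not Spam" using the sender and the count of all-caps words."""
    # Count words written entirely in capital letters (ignoring one-letter words like "I")
    caps_words = [w for w in message.split() if len(w) > 1 and w.isupper()]
    if sender in BLOCKED_SENDERS or len(caps_words) > max_caps_words:
        return "Spam"
    return "Not Spam"


print(label_email("friend@example.com", "Lunch on Friday?"))
print(label_email("prince@example.com", "Hello"))
print(label_email("shop@example.com", "BUY NOW BUY NOW BUY NOW FREE MONEY"))
```

Because there are exactly two possible labels, this is a binary classifier.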

The email spam filter is a good example, but it gets boring after a while. Spam classification is the default example for lectures or conference presentations, so you hear about it over and over again. What if we could talk about a different classification algorithm that was a bit more interesting? Something more nerdy? Something more…magical?

That’s right folks! Today we’ll be talking about the Sorting Hat from the Harry Potter universe. We’ll pull some Harry Potter data from the web, analyze it, and then build a classifier to sort characters into the different houses. Should be fun!


The classifier built below is not incredibly sophisticated. Thus, it should be treated as a “first pass” at the problem in order to demonstrate some basic web-scraping and text-analysis techniques. Also, because of a relatively small sample size, we will not be employing classic training techniques like cross-validation. We are simply gathering some data, building a simple rule-based classifier, and seeing the results.

Side note:

The idea for this blog post came from Brian Lange’s excellent presentation on classification algorithms at PyData Chicago 2016. You can find the video of the talk here and the slides here. Thanks Brian!

Step One: Pulling Data from the Web

In case you have been living under a rock for the last 20 years, the Sorting Hat is a magical hat that places incoming Hogwarts students into the four Hogwarts houses: Gryffindor, Slytherin, Hufflepuff, and Ravenclaw. Each house has certain characteristics, and when the Sorting Hat is placed on a student’s head, it reads their mind and determines which house they would be the best fit for. By this definition, the Sorting Hat is a multiclass classifier (more than two groups) as opposed to a binary classifier (exactly two groups), like a spam filter.

If we’re going to sort students into different houses, we’ll need some information about the students. Thankfully, there’s a lot of information on harrypotter.wikia.com. This website has articles on nearly every facet of the Harry Potter universe, including students and faculty. As an added bonus, Fandom, the company that runs the website, has an easy-to-use API with lots of great documentation. Huzzah!

We’ll start by importing pandas and requests. The former will be used for organizing the data, while the latter will be used to actually make the data requests to the API.

We’ll also need a smart way to loop through all the different students at Hogwarts and record the house they’re sorted into by the Sorting Hat (this will be the “truth” that we’ll compare our results to). By poking around the website, it appears that articles are grouped by “Category”, such as “Hogwarts_students” and “Films_(real-world)”. The Fandom API allows us to list out all of the articles of a given category.

Let’s use Ravenclaw as an example. We’ll get all the data into a variable called info and then we’ll put it into a Pandas DataFrame.

# Import modules
import pandas as pd
import requests

# Get Ravenclaw articles
category = 'Ravenclaws'
url = 'http://harrypotter.wikia.com/api/v1/Articles/List?expand=1&limit=1000&category=' + category
requested_url = requests.get(url)
json_results = requested_url.json()
info = json_results['items']
ravenclaw_df = pd.DataFrame(info)

print('Number of articles: {}'.format(len(info)))

Number of articles: 158
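The article-list requests in this post assemble their URL by string concatenation. As a hedged alternative, here is a minimal sketch that builds the same query string with the standard library. The helper name build_category_url is hypothetical; the endpoint and the expand, limit, and category parameters are the ones used in the loop over houses later in this post. Nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Endpoint used by the article-list requests in this post
BASE_URL = "http://harrypotter.wikia.com/api/v1/Articles/List"


def build_category_url(category, limit=1000):
    """Build the article-list URL for one wiki category (hypothetical helper)."""
    # urlencode escapes any special characters in the category name
    query = urlencode({"expand": 1, "limit": limit, "category": category})
    return BASE_URL + "?" + query


print(build_category_url("Ravenclaws"))
```

The requests library can do the same thing for you if you pass the dictionary as requests.get(BASE_URL, params=...).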

You can follow along with this entire analysis in Rodeo!

Yhat note:

If you are following along in our Python IDE, Rodeo, just copy and paste the code above into the Editor or Terminal tab.
You can view results in either the History or Terminal tab. Bonus: Did you know you can drag and drop the tabs and panes to rearrange and resize?

   abstract                                         comments  id     ns  original_dimensions            revision                                         thumbnail                                        title                         type     url
0  {{Ravenclaw individual…                          0         5080   10  None                           {‘id’: 964956, ‘timestamp’: ‘1460047333’, ‘use…  None                                             Ravenclaw individual infobox  NaN      /wiki/Template:Ravenclaw_individual_infobox
1  Roland Abberley was a Ravenclaw student at Hog…  0         33946  0   None                           {‘id’: 1024340, ‘timestamp’: ‘1479282062’, ‘us…  None                                             Roland Abberley               article  /wiki/Roland_Abberley
2  Stewart Ackerley (born c. 1982-1983) was a wiz…  0         7011   0   None                           {‘id’: 1024309, ‘timestamp’: ‘1479281746’, ‘us…  None                                             Stewart Ackerley              article  /wiki/Stewart_Ackerley
3  Jatin Agarkar was a Ravenclaw student at Hogwa…  0         99467  0   None                           {‘id’: 1039350, ‘timestamp’: ‘1482842767’, ‘us…  None                                             Jatin Agarkar                 article  /wiki/Jatin_Agarkar
4  Alannis was a female Ravenclaw student at Hogw…  0         27126  0   {‘width’: 322, ‘height’: 546}  {‘id’: 1024320, ‘timestamp’: ‘1479281862’, ‘us…  http://vignette3.wikia.nocookie.net/harrypotte…  Alannis                       article  /wiki/Alannis

We can see a few things from this:

  • The first observation in this list is “Ravenclaw individual infobox”. Since this is not a student, we want to filter our results on the “type” column.
  • Unfortunately ravenclaw_df doesn’t have the articles’ contents…just article abstracts. In order to get the contents, we need to use a different API request and query data based on the articles’ ids.
  • Furthermore, we can write a loop to run over all of the houses and get one dataframe with all the data we need.

# Set variables
houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']
mydf = pd.DataFrame()

# Gets article ids, article url, and house
for house in houses:
    url = "http://harrypotter.wikia.com/api/v1/Articles/List?expand=1&limit=1000&category=" + house + 's'
    requested_url = requests.get(url)
    json_results = requested_url.json()
    info = json_results['items']

    house_df = pd.DataFrame(info)
    # Keep only real articles (this drops templates such as the infobox)
    house_df = house_df[house_df['type'] == 'article']
    house_df.reset_index(drop=True, inplace=True)
    house_df.drop(['abstract', 'comments', 'ns', 'original_dimensions', 'revision', 'thumbnail', 'type'], axis=1, inplace=True)
    house_df['house'] = pd.Series([house]*len(house_df))
    mydf = pd.concat([mydf, house_df])

mydf.reset_index(drop=True, inplace=True)

# Print results
print('Number of student articles: {}'.format(len(mydf)))

Number of student articles: 748

      id             title                     url       house
0  33349     Astrix Alixan     /wiki/Astrix_Alixan  Gryffindor
1  33353   Filemina Alchin   /wiki/Filemina_Alchin  Gryffindor
2   7018  Euan Abercrombie  /wiki/Euan_Abercrombie  Gryffindor
3  99282      Sakura Akagi      /wiki/Sakura_Akagi  Gryffindor
4  99036       Zakir Akram       /wiki/Zakir_Akram  Gryffindor

         id             title                     url      house
743  100562  Phylis Whitehead  /wiki/Phylis_Whitehead  Slytherin
744    3153            Wilkes            /wiki/Wilkes  Slytherin
745   35971      Ella Wilkins      /wiki/Ella_Wilkins  Slytherin
746   44393    Rufus Winickus    /wiki/Rufus_Winickus  Slytherin
747     719     Blaise Zabini     /wiki/Blaise_Zabini  Slytherin

Getting article contents

Now that we have the article ids, we can start pulling article contents. But some of these articles are MASSIVE with incredible amounts of detail…just take a look at Harry Potter’s or Voldemort’s articles!

If we look at some of the most important characters, we’ll see that they all have a “Personality and traits” section in their article. This seems like a logical place to extract information that the Sorting Hat would use in its decision. Not all characters have a “Personality and traits” section (such as Zakir Akram), so this step will reduce the number of students in our data by a large amount.

The following code pulls the “Personality and traits” section from each article and computes the length of that section (i.e. number of text characters). Then it merges that data with our initial dataframe mydf by “id” (this takes a little while to run).

# Loops through the articles and pulls the "Personality and traits" section for each student
# If that section does not exist for a student, we just record a blank string
# This takes a few minutes to run
text_dict = {}
for iden in mydf['id']:
    url = 'http://harrypotter.wikia.com/api/v1/Articles/AsSimpleJson?id=' + str(iden)
    requested_url = requests.get(url)
    json_results = requested_url.json()
    sections = json_results['sections']
    contents = [section['content'] for section in sections if section['title'] == 'Personality and traits']

    if contents:
        paragraphs = contents[0]
        texts = [paragraph['text'] for paragraph in paragraphs]
        all_text = ' '.join(texts)
    else:
        all_text = ''
    text_dict[iden] = all_text

# Places the data into a DataFrame and computes the length of the "Personality and traits" section
text_df = pd.DataFrame.from_dict(text_dict, orient='index')
text_df.reset_index(inplace=True)  # moves the ids out of the index so there are two columns to name
text_df.columns = ['id', 'text']
text_df['text_len'] = text_df['text'].map(len)

# Merges our text data back with the info about the students
mydf_all = pd.merge(mydf, text_df, on='id')
mydf_all.sort_values('text_len', ascending=False, inplace=True)

# Creates a new DataFrame with just the students who have a "Personality and traits" section
mydf_relevant = mydf_all[mydf_all['text_len'] > 0]

print('Number of useable articles: {}'.format(len(mydf_relevant)))

Number of useable articles: 94

     id    title               url                       house       text                                             text_len
689  343   Tom Riddle          /wiki/Tom_Riddle          Slytherin   Voldemort was considered by many to be “the mo…  26924
169  13    Harry Potter        /wiki/Harry_Potter        Gryffindor  Harry was an extremely brave, loyal, and selfl…  12987
726  49    Dolores Umbridge    /wiki/Dolores_Umbridge    Slytherin   Dolores Umbridge was nothing short of a sociop…  9668
703  259   Horace Slughorn     /wiki/Horace_Slughorn     Slytherin   Horace Slughorn was described as having a bumb…  7944
54   4178  Albus Dumbledore    /wiki/Albus_Dumbledore    Gryffindor  Considered to be the most powerful wizard of h…  7789
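The section-extraction step can be tried offline, without calling the API. The sketch below runs the same idea against a small hand-made response. The sample data and the helper name extract_section_text are hypothetical, and the shape of the response (a list of sections, each with a title and a list of content items carrying a text field) is an assumption based on how the loop above reads the JSON.

```python
# Hypothetical, hand-made response shaped like the JSON the loop above expects
sample_response = {
    "sections": [
        {"title": "Biography", "content": [{"type": "paragraph", "text": "Born in 1980."}]},
        {"title": "Personality and traits", "content": [
            {"type": "paragraph", "text": "Alice was brave."},
            {"type": "paragraph", "text": "She was also loyal."},
        ]},
    ]
}


def extract_section_text(response, section_title="Personality and traits"):
    """Join the texts of the first section with a matching title; return '' if it is absent."""
    for section in response.get("sections", []):
        if section.get("title") == section_title:
            # Skip content items without a "text" field (an assumption about the data)
            return " ".join(item["text"] for item in section.get("content", []) if "text" in item)
    return ""


text = extract_section_text(sample_response)
print(text, len(text))
```

Returning an empty string for a missing section mirrors the blank-string convention used above, so the length filter that follows still works.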

Step Two: Getting Hogwarts House Characteristics using NLTK

Now that we have data on a number of students, we want to classify students into different houses. In order to do that, we’ll need a list of the characteristics for each house. We will start with the characteristics on harrypotter.wikia.com.

trait_dict = {}
trait_dict['Gryffindor'] = ['bravery', 'nerve', 'chivalry', 'daring', 'courage']
trait_dict['Slytherin'] = ['resourcefulness', 'cunning', 'ambition', 'determination', 'self-preservation', 'fraternity']  # list is cut off here in this copy
trait_dict['Ravenclaw'] = ['intelligence', 'wit', 'wisdom', 'creativity', 'originality', 'individuality', 'acceptance']
trait_dict['Hufflepuff'] = ['dedication', 'diligence', 'fairness', 'patience', 'kindness', 'tolerance', 'persistence']  # list is cut off here in this copy

Notice that all of these traits are nouns, which is a good thing; we want to be consistent with our traits. Some of the traits on the wiki were non-nouns, so I changed them as follows:

  • “ambitious” (an adjective) – this can be easily changed to ‘ambition’
  • “hard work”, “fair play”, and “unafraid of toil” – these multi-word phrases can also be changed to single-word nouns:
      • “hard work” –> ‘diligence’
      • “fair play” –> ‘fairness’
      • “unafraid of toil” –> ‘persistence’

Now that we have a list of characteristics for each house, we can simply scan through the “text” column in our DataFrame and count the number of times a characteristic appears. Sounds simple, right?

Unfortunately we aren’t done yet. Take the following sentences from Neville Longbottom’s “Personality and traits” section:

When he was younger, Neville was clumsy, forgetful, shy, and many considered him ill-suited for Gryffindor house because he seemed timid.

With the support of his friends, to whom he was very loyal, the encouragement of Professor Remus Lupin to face his fears in his third year, and the motivation of knowing his parents’ torturers were on the loose, Neville became braver, more confident, and dedicated to the fight against Lord Voldemort and his Death Eaters.

Words in this passage such as “timid”, “loyal”, “braver”, and “dedicated” should be counted towards one of the houses, but they won’t be because they are adjectives. Similarly, words like “bravely” and “braveness” also wouldn’t count. In order to make our classification algorithm work properly, we need to identify synonyms, antonyms, and word forms.
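To make the problem concrete, here is a minimal sketch that matches a fragment of Neville’s text against the Gryffindor trait nouns. The whitespace tokenizer and the short hand-written list of extra word forms are illustrative assumptions; the real lists are built with nltk further down.

```python
# Trait nouns for Gryffindor, copied from trait_dict above
gryffindor_nouns = ['bravery', 'nerve', 'chivalry', 'daring', 'courage']

sentence = "Neville became braver, more confident, and dedicated to the fight"

# Naive tokenization: split on whitespace, strip surrounding punctuation, lowercase
tokens = [word.strip('.,;:!?"').lower() for word in sentence.split()]

# Matching on the nouns alone finds nothing, even though "braver" clearly signals bravery
noun_matches = [word for word in tokens if word in gryffindor_nouns]
print(noun_matches)

# Hand-written word forms (an illustrative assumption, not the full list built later)
gryffindor_forms = gryffindor_nouns + ['brave', 'braver', 'bravest', 'bravely']
form_matches = [word for word in tokens if word in gryffindor_forms]
print(form_matches)
```

The first list comes back empty and the second contains “braver”, which is exactly the gap that synonyms and word forms are meant to close.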


We can find synonyms of words using the synsets function in WordNet, a lexical database of English words that is included in the nltk module (“NLTK” stands for Natural Language Toolkit). A “synset”, short for “synonym set”, is a set of synonymous words, or “lemmas”. The synsets function returns the “synsets” that are associated with a particular word.

Confused? So was I when I first learned about this material. Let’s run some code and then analyze it.

from nltk.corpus import wordnet as wn

# Synsets of different words
foo1 = wn.synsets('bravery')
print("Synonym sets associated with the word 'bravery': {}".format(foo1))

foo2 = wn.synsets('fairness')
print("Synonym sets associated with the word 'fairness': {}".format(foo2))

foo3 = wn.synsets('wit')
print("Synonym sets associated with the word 'wit': {}".format(foo3))

foo4 = wn.synsets('cunning')
print("Synonym sets associated with the word 'cunning': {}".format(foo4))

# Restricting the part of speech to nouns drops the adjective synsets
foo4 = wn.synsets('cunning', pos=wn.NOUN)
print("Synonym sets associated with the *noun* 'cunning': {}".format(foo4))

# Prints out the synonyms ("lemmas") associated with each synset
foo_list = [foo1, foo2, foo3, foo4]
for foo in foo_list:
    for synset in foo:
        print((synset.name(), synset.lemma_names()))

Synonym sets associated with the word ‘bravery’: [Synset(‘courage.n.01’), Synset(‘fearlessness.n.01’)]

Synonym sets associated with the word ‘fairness’: [Synset(‘fairness.n.01’), Synset(‘fairness.n.02’), Synset(‘paleness.n.02’), Synset(‘comeliness.n.01’)]

Synonym sets associated with the word ‘wit’: [Synset(‘wit.n.01’), Synset(‘brain.n.02’), Synset(‘wag.n.01’)]

Synonym sets associated with the word ‘cunning’: [Synset(‘craft.n.05’), Synset(‘cunning.n.02’), Synset(‘cunning.s.01’), Synset(‘crafty.s.01’), Synset(‘clever.s.03’)]

Synonym sets associated with the *noun* ‘cunning’: [Synset(‘craft.n.05’), Synset(‘cunning.n.02’)]

(‘courage.n.01’, [‘courage’, ‘courageousness’, ‘bravery’, ‘braveness’])
(‘fearlessness.n.01’, [‘fearlessness’, ‘bravery’])
(‘fairness.n.01’, [‘fairness’, ‘equity’])
(‘fairness.n.02’, [‘fairness’, ‘fair-mindedness’, ‘candor’, ‘candour’])
(‘paleness.n.02’, [‘paleness’, ‘blondness’, ‘fairness’])
(‘comeliness.n.01’, [‘comeliness’, ‘fairness’, ‘loveliness’, ‘beauteousness’])
(‘wit.n.01’, [‘wit’, ‘humor’, ‘humour’, ‘witticism’, ‘wittiness’])
(‘brain.n.02’, [‘brain’, ‘brainpower’, ‘learning_ability’, ‘mental_capacity’, ‘mentality’, ‘wit’])
(‘wag.n.01’, [‘wag’, ‘wit’, ‘card’])
(‘craft.n.05’, [‘craft’, ‘craftiness’, ‘cunning’, ‘foxiness’, ‘guile’, ‘slyness’, ‘wiliness’])
(‘cunning.n.02’, [‘cunning’])

Okay, that’s a lot of output, so let’s point out some notes & potential problems:

  • Typing wn.synsets(‘bravery’) yields two synsets: one for ‘courage.n.01’ and one for ‘fearlessness.n.01’. Let’s dive deeper into what this actually means:
      • The first part (‘courage’ and ‘fearlessness’) is the word the synset is centered around…let’s call it the “center” word. This means that the synonyms (“lemmas”) in the synset all mean the same thing as the center word.
      • The second part (‘n’) stands for “noun”. You can see that the synsets associated with the word “cunning” include ‘crafty.s.01’ and ‘clever.s.03’ (adjectives). These are here because the word “cunning” is both a noun and an adjective. To limit our results to just nouns, we can specify wn.synsets(‘cunning’, pos=wn.NOUN).
      • The third part (‘01’) refers to the specific meaning of the center word. For example, ‘fairness’ can mean “conformity with rules or standards” as well as “making judgments free from discrimination or dishonesty”.

We also see that the synsets function gives us some synonym sets that we may not want. The synonym sets associated with the word ‘fairness’ include ‘paleness.n.02’ (“having a naturally light complexion”) and ‘comeliness.n.01’ (“being good looking and attractive”). These are not traits associated with Hufflepuff (although Neville Longbottom grew up to be very handsome), so we need to manually exclude these synsets from our analysis.

Translation: getting synonyms is harder than it looks
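The manual exclusion step can be sketched without nltk by hard-coding the lemma lists that WordNet printed above for the word ‘fairness’. The dictionary layout and the variable names are assumptions made for this illustration; the real code later keeps an explicit list of relevant synsets instead.

```python
# Lemma lists taken from the WordNet output above for the word 'fairness'
fairness_synsets = {
    'fairness.n.01': ['fairness', 'equity'],
    'fairness.n.02': ['fairness', 'fair-mindedness', 'candor', 'candour'],
    'paleness.n.02': ['paleness', 'blondness', 'fairness'],
    'comeliness.n.01': ['comeliness', 'fairness', 'loveliness', 'beauteousness'],
}

# Synsets we decide by hand are about appearance rather than character
excluded = {'paleness.n.02', 'comeliness.n.01'}

# Collect the synonyms from every synset that survives the manual filter
synonyms = set()
for name, lemmas in fairness_synsets.items():
    if name not in excluded:
        synonyms.update(lemmas)

print(sorted(synonyms))
```

Words such as “blondness” and “loveliness” no longer end up in the Hufflepuff list, at the cost of someone having to read through every synset by hand.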

Antonyms and Word Forms

After we get all the synonyms (which we’ll actually do in a moment), we also need to worry about the antonyms (words opposite in meaning) and different word forms (“brave”, “bravely”, and “braver” for “bravery”). We can do a lot of the heavy work in nltk, but we will also have to manually create adverbs and comparative / superlative adjectives.

# Prints the different lemmas (synonyms), antonyms, and derivationally-related word forms for the synsets of "bravery"
foo1 = wn.synsets('bravery')
for synset in foo1:
    for lemma in synset.lemmas():
        print("Synset: {}; Lemma: {}; Antonyms: {}; Word Forms: {}".format(synset.name(), lemma.name(), lemma.antonyms(),
                                                                           lemma.derivationally_related_forms()))

Synset: courage.n.01; Lemma: courage; Antonyms: [Lemma(‘cowardice.n.01.cowardice’)]; Word Forms: [Lemma(‘brave.a.01.courageous’)]

Synset: courage.n.01; Lemma: courageousness; Antonyms: []; Word Forms: [Lemma(‘brave.a.01.courageous’)]

Synset: courage.n.01; Lemma: bravery; Antonyms: []; Word Forms: []

Synset: courage.n.01; Lemma: braveness; Antonyms: []; Word Forms: [Lemma(‘brave.a.01.brave’), Lemma(‘audacious.s.01.brave’)]

Synset: fearlessness.n.01; Lemma: fearlessness; Antonyms: [Lemma(‘fear.n.01.fear’)]; Word Forms: [Lemma(‘audacious.s.01.fearless’), Lemma(‘unafraid.a.01.fearless’)]

Synset: fearlessness.n.01; Lemma: bravery; Antonyms: []; Word Forms: []

Putting it all together

The following code creates a list of the synonyms, antonyms, and word forms for each of the house characteristics described earlier. To make sure we’re exhaustive, some of these will not actually be correctly-spelled English words.

# Manually select the synsets that are relevant to us
relevant_synsets = {}
relevant_synsets['Ravenclaw'] = [wn.synset('intelligence.n.01'), wn.synset('wit.n.01'), wn.synset('brain.n.02'),
                                 wn.synset('wisdom.n.01'), wn.synset('wisdom.n.02'), wn.synset('wisdom.n.03'),
                                 wn.synset('wisdom.n.04'), wn.synset('creativity.n.01'), wn.synset('originality.n.01'),
                                 wn.synset('originality.n.02'), wn.synset('individuality.n.01'), wn.synset('credence.n.01')]  # list is cut off here in this copy
relevant_synsets['Hufflepuff'] = [wn.synset('dedication.n.01'), wn.synset('commitment.n.04'), wn.synset('commitment.n.02'),
                                  wn.synset('diligence.n.01'), wn.synset('diligence.n.02'), wn.synset('application.n.06'),
                                  wn.synset('fairness.n.01'), wn.synset('fairness.n.01'), wn.synset('patience.n.01'),
                                  wn.synset('kindness.n.01'), wn.synset('forgivingness.n.01'), wn.synset('kindness.n.03'),
                                  wn.synset('tolerance.n.03'), wn.synset('tolerance.n.04'), wn.synset('doggedness.n.01'),
                                  wn.synset('loyalty.n.01'), wn.synset('loyalty.n.02')]
relevant_synsets['Gryffindor'] = [wn.synset('courage.n.01'), wn.synset('fearlessness.n.01'), wn.synset('heart.n.03'),
                                  wn.synset('boldness.n.02'), wn.synset('chivalry.n.01'), wn.synset('boldness.n.01')]
relevant_synsets['Slytherin'] = [wn.synset('resourcefulness.n.01'), wn.synset('resource.n.03'), wn.synset('craft.n.05'),
                                 wn.synset('cunning.n.02'), wn.synset('ambition.n.01'), wn.synset('ambition.n.02'),
                                 wn.synset('determination.n.02'), wn.synset('determination.n.04'),
                                 wn.synset('self-preservation.n.01'), wn.synset('brotherhood.n.02'),
                                 wn.synset('inventiveness.n.01'), wn.synset('brightness.n.02'), wn.synset('ingenuity.n.02')]

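# Function that will get the different word forms from a lemma
# NOTE: the branch bodies are missing from this copy of the post. The default, 'e', and 'y' rules
# below are inferred from the printed Gryffindor output; the other three rules are assumptions.
def get_forms(lemma):
    drfs = lemma.derivationally_related_forms()
    output_list = []
    for drf in drfs:
        drf_pos = str(drf).split(".")[1]
        if drf_pos in ['n', 's', 'a']:
            word = drf.name().lower()
            output_list.append(word)
            if drf_pos in ['s', 'a']:
                # Adverbs + "-ness" nouns + comparative & superlative adjectives
                if len(word) == 3:
                    # Three-letter adjectives double their last letter (assumed suffixes)
                    output_list += [word + word[-1] + 'er', word + word[-1] + 'est', word + 'ly', word + 'ness']
                elif word[-4:] in ['able', 'ible']:
                    # Assumed rule: "capable" -> "capabler", "capablest", "capably", "capableness"
                    output_list += [word + 'r', word + 'st', word[:-1] + 'y', word + 'ness']
                elif word[-1:] == 'e':
                    output_list += [word + 'r', word + 'st', word + 'ly', word + 'ness']
                elif word[-2:] == 'ic':
                    # Assumed rule: "heroic" -> "heroicer", "heroicest", "heroically", "heroicness"
                    output_list += [word + 'er', word + 'est', word + 'ally', word + 'ness']
                elif word[-1:] == 'y':
                    output_list += [word[:-1] + 'ier', word[:-1] + 'iest', word[:-1] + 'ily', word[:-1] + 'iness']
                else:
                    output_list += [word + 'er', word + 'est', word + 'ly', word + 'ness']
    return output_list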

# Creates a copy of our trait dictionary
# If we don't do this, then we constantly update the dictionary we're looping through, causing an infinite loop
import copy
new_trait_dict = copy.deepcopy(trait_dict)
antonym_dict = {}

# Add synonyms and word forms to the (new) trait dictionary; also add antonyms (and their word forms) to the antonym dictionary
# NOTE: the append/extend statements are missing from this copy of the post and are reconstructed from the comment above
for house, traits in trait_dict.items():
    antonym_dict[house] = []
    for trait in traits:
        synsets = wn.synsets(trait, pos=wn.NOUN)
        for synset in synsets:
            if synset in relevant_synsets[house]:
                for lemma in synset.lemmas():
                    new_trait_dict[house].append(lemma.name().lower())
                    if get_forms(lemma):
                        new_trait_dict[house].extend(get_forms(lemma))
                    if lemma.antonyms():
                        for ant in lemma.antonyms():
                            antonym_dict[house].append(ant.name().lower())
                            if get_forms(ant):
                                antonym_dict[house].extend(get_forms(ant))
    new_trait_dict[house] = sorted(list(set(new_trait_dict[house])))
    antonym_dict[house] = sorted(list(set(antonym_dict[house])))

# Print some of our results
print("Gryffindor traits: {}".format(new_trait_dict['Gryffindor']))
print("Gryffindor anti-traits: {}".format(antonym_dict['Gryffindor']))

Gryffindor traits: [‘bold’, ‘bolder’, ‘boldest’, ‘boldly’, ‘boldness’, ‘brass’, ‘brassier’, ‘brassiest’, ‘brassily’, ‘brassiness’, ‘brassy’, ‘brave’, ‘bravely’, ‘braveness’, ‘braver’, ‘bravery’, ‘bravest’, ‘cheek’, ‘cheekier’, ‘cheekiest’, ‘cheekily’, ‘cheekiness’, ‘cheeky’, ‘chivalry’, ‘courage’, ‘courageous’, ‘courageouser’, ‘courageousest’, ‘courageously’, ‘courageousness’, ‘daring’, ‘face’, ‘fearless’, ‘fearlesser’, ‘fearlessest’, ‘fearlessly’, ‘fearlessness’, ‘gallantry’, ‘hardihood’, ‘hardiness’, ‘heart’, ‘mettle’, ‘nerve’, ‘nervier’, ‘nerviest’, ‘nervily’, ‘nerviness’, ‘nervy’, ‘politesse’, ‘spunk’, ‘spunkier’, ‘spunkiest’, ‘spunkily’, ‘spunkiness’, ‘spunky’]

Gryffindor anti-traits: [‘cowardice’, ‘fear’, ‘timid’, ‘timider’, ‘timidest’, ‘timidity’, ‘timidly’, ‘timidness’]

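# Tests that the trait dictionary and the antonym dictionary don't have any repeats among houses
from itertools import combinations
def test_overlap(dictionary):
    results = []
    house_combos = combinations(list(dictionary.keys()), 2)
    for combo in house_combos:
        # The loop body is missing from this copy of the post; this line is a reconstruction
        # that records True when the two houses share no words
        results.append(set(dictionary[combo[0]]).isdisjoint(dictionary[combo[1]]))
    return results

# Outputs results from our test; should output "False"
# (four houses give six pairs, so a sum of 6 means that no pair overlaps)
print("Any words overlap in trait dictionary? {}".format(sum(test_overlap(new_trait_dict)) != 6))
print("Any words overlap in antonym dictionary? {}".format(sum(test_overlap(antonym_dict)) != 6))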

Any words overlap in trait dictionary? False
Any words overlap in antonym dictionary? False

Step Three: Sorting Students into Houses

The time has finally come to sort students into their houses! Our classification algorithm will work like this:

  • For each student, go through their “Personality and traits” section word by word
  • If a word appears in a house’s trait list, then we add 1 to that house’s score
  • Similarly, if a word appears in a house’s anti-trait list, then we subtract 1 from that house’s score
  • The house with the highest score is the one we assign the student to
  • If there is a tie, we’ll simply output “Tie!”

For example, if a character’s “Personality and traits” section was just the sentence “Alice was brave”, then Alice would have a score of 1 for Gryffindor and 0 for all other houses; we would sort Alice into Gryffindor.

# Imports "word_tokenize", which breaks up sentences into words and punctuation
from nltk import word_tokenize

# Function that sorts the students
def sort_student(text):
    text_list = word_tokenize(text)
    text_list = [word.lower() for word in text_list]
    score_dict = {}
    houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']
    for house in houses:
        # Score = number of trait words minus number of anti-trait words
        score_dict[house] = (sum([True for word in text_list if word in new_trait_dict[house]]) -
                             sum([True for word in text_list if word in antonym_dict[house]]))

    sorted_house = max(score_dict, key=score_dict.get)
    sorted_house_score = score_dict[sorted_house]
    # Only return a house if exactly one house has the top score
    if sum([True for i in score_dict.values() if i == sorted_house_score]) == 1:
        return sorted_house
    else:
        return "Tie!"

# Test our function
print(sort_student('Alice was brave'))
print(sort_student('Alice was British'))


Our function seems to work, so let’s apply it to our data and see what we get!

# Turns off a warning
pd.options.mode.chained_assignment = None

mydf_relevant['new_house'] = mydf_relevant['text'].map(sort_student)

     id    title               url                       house       text                                             text_len  new_house
689  343   Tom Riddle          /wiki/Tom_Riddle          Slytherin   Voldemort was considered by many to be “the mo…  26924     Hufflepuff
169  13    Harry Potter        /wiki/Harry_Potter        Gryffindor  Harry was an extremely brave, loyal, and selfl…  12987     Ravenclaw
726  49    Dolores Umbridge    /wiki/Dolores_Umbridge    Slytherin   Dolores Umbridge was nothing short of a sociop…  9668      Ravenclaw
703  259   Horace Slughorn     /wiki/Horace_Slughorn     Slytherin   Horace Slughorn was described as having a bumb…  7944      Slytherin
54   4178  Albus Dumbledore    /wiki/Albus_Dumbledore    Gryffindor  Considered to be the most powerful wizard of h…  7789      Hufflepuff
709  33    Severus Snape       /wiki/Severus_Snape       Slytherin   At times, Snape could appear cold, cynical, ma…  6894      Ravenclaw
164  331   Peter Pettigrew     /wiki/Peter_Pettigrew     Gryffindor  Peter Pettigrew was characterised by weakness….  6600      Gryffindor
230  14    Ronald Weasley      /wiki/Ronald_Weasley      Gryffindor  Ron was a very funny person, but often emotion…  6078      Ravenclaw
646  16    Draco Malfoy        /wiki/Draco_Malfoy        Slytherin   Draco was, in general, an arrogant, spiteful b…  5435      Tie!
468  53    Gilderoy Lockhart   /wiki/Gilderoy_Lockhart   Ravenclaw   Gilderoy Lockhart’s defining characteristics w…  5167      Slytherin
84   47    Rubeus Hagrid       /wiki/Rubeus_Hagrid       Gryffindor  Hagrid was an incredibly warm, kind-hearted ma…  4884      Hufflepuff
76   15    Hermione Granger    /wiki/Hermione_Granger    Gryffindor  Hermione was noted for being extremely intelli…  4648      Tie!
114  52    Remus Lupin         /wiki/Remus_Lupin         Gryffindor  Remus was compassionate, intelligent, tolerant…  4321      Hufflepuff
223  26    Arthur Weasley      /wiki/Arthur_Weasley      Gryffindor  While Arthur Weasley was often seen as “fun” i…  4316      Slytherin
679  5091  Albus Potter        /wiki/Albus_Potter        Slytherin   Albus was a quiet, kind, and thoughtful young …  3522      Tie!
23   31    Sirius Black        /wiki/Sirius_Black        Gryffindor  Sirius was true to the ideal of a Gryffindor s…  3483      Hufflepuff
131  32    Minerva McGonagall  /wiki/Minerva_McGonagall  Gryffindor  Minerva almost constantly exuded magnanimity a…  3188      Hufflepuff
227  25    Ginevra Weasley     /wiki/Ginevra_Weasley     Gryffindor  Ginny was a forceful, independent girl who oft…  3113      Tie!
229  30    Percy Weasley       /wiki/Percy_Weasley       Gryffindor  Percy was extremely ambitious and dedicated to…  3099      Ravenclaw
647  313   Lucius Malfoy       /wiki/Lucius_Malfoy       Slytherin   Despite being the embodiment of wealth and inf…  3069      Ravenclaw

print("Match rate: {}".format(sum(mydf_relevant['house'] == mydf_relevant['new_house']) / len(mydf_relevant)))
print("Percentage of ties: {}".format(sum(mydf_relevant['new_house'] == 'Tie!') / len(mydf_relevant)))

Match rate: 0.2553191489361702
Percentage of ties: 0.32978723404255317
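The two printed rates are easier to judge as counts. The sketch below converts them back using the 94 useable articles reported earlier and compares the match rate with a uniform random guess among four houses. The variable names are made up for this illustration, and the “decided” rate is an extra way of slicing the same numbers rather than something the original analysis reports.

```python
n_students = 94                   # "Number of useable articles" printed earlier
match_rate = 0.2553191489361702   # printed match rate
tie_rate = 0.32978723404255317    # printed share of ties

# Convert the rates back into whole numbers of students
n_matches = round(match_rate * n_students)
n_ties = round(tie_rate * n_students)
print(n_matches, n_ties)

# Baseline: guessing uniformly among four houses is right 1 time in 4 on average
baseline = 1 / 4
print(match_rate - baseline)

# Match rate among the students for whom the classifier committed to a single house
decided_rate = n_matches / (n_students - n_ties)
print(round(decided_rate, 3))
```

So 24 students were matched and 31 were ties, which puts the overall match rate about half a percentage point above the 25% baseline.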

Hmmm. Those are not the results we were expecting. Let’s try to investigate why Voldemort was sorted into Hufflepuff.

# Voldemort's text data
tom_riddle = word_tokenize(mydf_relevant['text'].values[0])
tom_riddle = [word.lower() for word in tom_riddle]

# Instead of computing a score, we'll list out the words in the text that match words in our traits and antonyms dictionaries
words_dict = {}
anti_dict = {}
houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']
for house in houses:
    words_dict[house] = [word for word in tom_riddle if word in new_trait_dict[house]]
    anti_dict[house] = [word for word in tom_riddle if word in antonym_dict[house]]

print(words_dict)
print(anti_dict)


{‘Slytherin’: [‘ambition’], ‘Ravenclaw’: [‘intelligent’, ‘intelligent’, ‘mental’, ‘individual’, ‘mental’, ‘intelligent’], ‘Hufflepuff’: [‘kind’, ‘loyalty’, ‘true’, ‘true’, ‘true’, ‘loyalty’], ‘Gryffindor’: [‘brave’, ‘face’, ‘bold’, ‘face’, ‘bravery’, ‘brave’, ‘courageous’, ‘bravery’]}

{‘Slytherin’: [], ‘Ravenclaw’: [‘common’], ‘Hufflepuff’: [], ‘Gryffindor’: [‘fear’, ‘fear’, ‘fear’, ‘fear’, ‘fear’, ‘fear’, ‘cowardice’, ‘fear’, ‘fear’]}

As you can see, Slytherin had a score of (1-0) = 1, Ravenclaw had (6-1) = 5, Hufflepuff had (6-0) = 6, and Gryffindor had (8-9) = -1.

It’s also interesting to note that Voldemort’s “Personality and traits” section, which is the longest of any student, matched with only 31 words in our synonym and antonym dictionaries, which means that other students probably had much lower matched word counts. This means that we’re making our classification decision off very little data, which explains the misclassification rate and the high number of ties.
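The arithmetic can be checked directly from the two dictionaries printed above. This sketch copies the matched words by hand and recomputes each house’s score and the total number of matched words; the variable names scores and total_matches are made up for this illustration.

```python
# Matched words copied from the two dictionaries printed above
words_dict = {
    'Slytherin': ['ambition'],
    'Ravenclaw': ['intelligent', 'intelligent', 'mental', 'individual', 'mental', 'intelligent'],
    'Hufflepuff': ['kind', 'loyalty', 'true', 'true', 'true', 'loyalty'],
    'Gryffindor': ['brave', 'face', 'bold', 'face', 'bravery', 'brave', 'courageous', 'bravery'],
}
anti_dict = {
    'Slytherin': [],
    'Ravenclaw': ['common'],
    'Hufflepuff': [],
    'Gryffindor': ['fear', 'fear', 'fear', 'fear', 'fear', 'fear', 'cowardice', 'fear', 'fear'],
}

# Score = trait matches minus anti-trait matches
scores = {house: len(words_dict[house]) - len(anti_dict[house]) for house in words_dict}
total_matches = sum(len(v) for v in words_dict.values()) + sum(len(v) for v in anti_dict.values())

print(scores)
print(max(scores, key=scores.get), total_matches)
```

Gryffindor actually has the most trait matches (8), but the nine anti-trait hits, almost all of them the word “fear”, push it to the bottom and hand the win to Hufflepuff by a single point.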


The classifier we built is not very successful (we do slightly better than simply guessing), but we have to consider that our approach was pretty simplistic. Modern email spam filters are very sophisticated and don’t just classify based on the presence of certain words, so future improvements to our algorithm should similarly take into account more information. Here’s a short list of ideas for future enhancements:

  • Consider which houses other family members were placed in
  • Use other sections of the Harry Potter wiki articles, like “Early Life” or the abstract at the beginning of the article
  • Instead of taking a small list of traits and their synonyms, create a list of the most frequent words in the “Personality and traits” section for each house and classify based on that.
  • Employ more sophisticated text-analysis techniques like sentiment analysis
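As a rough illustration of the third idea, the sketch below builds a per-house list of frequent words from a tiny made-up corpus and classifies new text against it. Everything here is hypothetical: the toy texts, the stop list, the helper names, and the choice of the top two words per house. A real version would use the actual “Personality and traits” sections and hold out some students for testing.

```python
from collections import Counter

# Hypothetical toy corpus: (house, text) pairs standing in for the real sections
training = [
    ('Gryffindor', 'brave and reckless but brave'),
    ('Gryffindor', 'bold and brave'),
    ('Ravenclaw', 'clever and curious'),
    ('Ravenclaw', 'curious and witty'),
]

STOPWORDS = {'and', 'but'}  # tiny illustrative stop list


def top_words_by_house(pairs, n=2):
    """Return the n most frequent non-stopword tokens for each house."""
    counts = {}
    for house, text in pairs:
        tokens = [w for w in text.lower().split() if w not in STOPWORDS]
        counts.setdefault(house, Counter()).update(tokens)
    return {house: [w for w, _ in c.most_common(n)] for house, c in counts.items()}


def classify(text, vocab):
    """Pick the house whose frequent words appear most often in the text."""
    tokens = text.lower().split()
    scores = {house: sum(t in words for t in tokens) for house, words in vocab.items()}
    # max() returns the first house on a tie; a real version should handle ties explicitly
    return max(scores, key=scores.get)


vocab = top_words_by_house(training)
print(vocab)
print(classify('a curious student', vocab))
```

Unlike the hand-picked trait lists, the vocabulary here is learned from the labelled texts themselves, so it would pick up words that the wiki authors really use for each house.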

However, we did learn a lot about APIs and nltk in the process, so at the end of the day I’m calling it a win. Now that we have these tools in our pocket, we have a solid base for future endeavours and can go out and conquer Python just like Neville conquered Nagini.

