 ## K Nearest Neighbors Application – Practical Machine Learning Tutorial with Python p.14

In the last part we introduced classification, which is a supervised form of machine learning, and explained the intuition behind the K Nearest Neighbors algorithm. In this tutorial, we’re going to apply the algorithm to a simple example using Scikit-Learn, and in the subsequent tutorials we’ll build our own version of the algorithm to learn more about how it works under the hood.
To exemplify classification, we’re going to use a breast cancer dataset that was donated to the University of California, Irvine (UCI) collection by the University of Wisconsin-Madison. UCI maintains a large Machine Learning Repository.

https://pythonprogramming.net


### Comment List

• sentdex
December 8, 2020

Hi, is anyone getting this error?
raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

• sentdex
December 8, 2020

Thanks to sentdex, I am beginning to see the matrix… at least the door.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()

clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)

# count the test samples whose predicted label differs from the true label
i = 0

for predDelta in (clf.predict(X_test) - y_test):

    if predDelta != 0:

        i = i + 1

print(f"the number of inaccurate predictions is {i}")

print(f"based on my accuracy calculation {1 - i/len(y_test)}")

print(f"based on sklearn accuracy calculation {accuracy}")
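The manual count above can also be written as a single vectorized comparison. This is a minimal sketch with NumPy only; the two label arrays are made-up stand-ins for `y_test` and `clf.predict(X_test)`, chosen just to show that both calculations agree.

```python
import numpy as np

# Hypothetical labels for illustration (2 = benign, 4 = malignant)
y_test = np.array([2, 2, 4, 2, 4, 4, 2, 2, 4, 2])
predictions = np.array([2, 2, 4, 4, 4, 2, 2, 2, 4, 2])

# Count mismatches directly instead of looping over the differences
n_wrong = int(np.sum(predictions != y_test))

# Accuracy is the fraction of labels that match
manual_accuracy = float(np.mean(predictions == y_test))

print(f"inaccurate predictions: {n_wrong}")
print(f"manual accuracy: {manual_accuracy}")
```

Comparing with `!=` rather than subtracting also works when the labels are strings instead of numbers.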

• sentdex
December 8, 2020

The below code may help.

import numpy as np

from sklearn.model_selection import cross_validate, train_test_split

from sklearn import preprocessing, neighbors

import pandas as pd

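# df is assumed to have been loaded already, e.g. with pd.read_csv

df.replace('?', -99999, inplace=True)

#df.drop(['id'], axis=1, inplace=True)

# pass axis as a keyword; newer pandas versions reject the positional form
X = np.array(df.drop(['class'], axis=1))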

y = np.array(df['class'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()

clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)

print(accuracy)

example_measures = np.array([4,2,1,1,1,2,3,2,1])

example_measures = example_measures.reshape(1,-1)

prediction = clf.predict(example_measures)

print(prediction)

• sentdex
December 8, 2020

For anyone who gets a ValueError: labels ['class'] not contained in axis, just use df.columns = df.columns.str.replace(' ', '') right after loading the data to remove any spaces from the column names.
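A small sketch of that fix, assuming the error comes from header names that carry a leading space. The frame and its column names below are hypothetical, not the tutorial's actual file.

```python
import pandas as pd

# Hypothetical frame whose headers carry stray leading spaces
df = pd.DataFrame(
    [[1000025, 5, 2], [1002945, 5, 2]],
    columns=["id", " clump_thickness", " class"],
)

# The lookup fails because the real name is " class", not "class"
found_before = "class" in df.columns
print(found_before)

# Remove spaces from every column name, as suggested above
df.columns = df.columns.str.replace(" ", "")
print(list(df.columns))

# Dropping the label column now works
X = df.drop(columns=["class"])
```

If a column name legitimately contains inner spaces, `df.columns.str.strip()` trims only the ends.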

• sentdex
December 8, 2020

example_measures=np.array([[4,2,1,1,1,2,3,2,1]]).reshape(1,-1)
hahahaha I am sure that line came from stackoverflow

• sentdex
December 8, 2020

I'm getting this error when I fit the model:
ValueError: Found input variables with inconsistent numbers of samples: [559, 140]

There seems to be a shape mismatch: X_train is [559, 10] and y_train is [140, 10].
I tried making changes and I get some other error instead.
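The shapes reported above (a `y_train` with 10 columns and 140 rows) are what you get if the four return values of `train_test_split` are unpacked in the wrong order. That cause is an assumption, since the commenter's code is not shown. A sketch with stand-in arrays of 699 rows (559 + 140) and 10 columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 699 rows, 10 feature columns (values are irrelevant here)
X = np.zeros((699, 10))
y = np.zeros(699)

# Correct order: both X splits first, then both y splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, y_train.shape)

# Swapped order (hypothetical mistake): "y_train" silently receives X_test
X_tr, y_tr_wrong, X_te_wrong, y_te = train_test_split(X, y, test_size=0.2)
print(X_tr.shape, y_tr_wrong.shape)
```

Fitting with the second pair of arrays would pass 559 samples alongside 140 "labels", which matches the numbers in the error message.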

• sentdex
December 8, 2020

but we do have this dataset already in sklearn.datasets
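A sketch of that route, with one caveat: the dataset returned by `load_breast_cancer` is the Wisconsin Diagnostic variant (569 samples, 30 features, labels 0 and 1), not the 9-feature file used in this tutorial, so results are not directly comparable. The `random_state` value is an arbitrary choice to make the split repeatable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load features and labels as NumPy arrays
X, y = load_breast_cancer(return_X_y=True)
print(X.shape)

# Same 80/20 split as in the tutorial code
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Default K Nearest Neighbors classifier
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(accuracy)
```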

• sentdex
December 8, 2020

So nearestneighbors and kneighbors are not the same thing?

• sentdex
December 8, 2020

I don't seem to be able to download the dataset. It seems like I am the only person with this problem. Any ideas?

• sentdex
December 8, 2020

Link to the dataset used in the video
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer

• sentdex
December 8, 2020

Better explained than when I was a student in a university.

• sentdex
December 8, 2020

ValueError: Classification metrics can't handle a mix of continuous and binary targets

• sentdex
December 8, 2020

Can anyone help?
I got an error saying "KeyError: "['class'] not found in axis"" and tried doing X = np.array(df.drop('class', 1))
But I am still getting the same error.

• sentdex
December 8, 2020

X = np.array(df.drop('class', 1)) worked for me when I got the "labels ['class'] not contained in axis" error

• sentdex
December 8, 2020

import numpy as np
import pandas as pd

from sklearn import preprocessing, model_selection, neighbors

# df is assumed to have been loaded already, e.g. with pd.read_csv

df.replace('?', -99999, inplace=True)

df.drop(['sample_id'], axis=1, inplace=True)

X = np.array(df.drop(['class'], axis=1))

y = np.array(df['class'])

X_train, X_test, y_train, y_test = model_selection.train_test_split(X,y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()

clf.fit(X_train,y_train)

accuracy = clf.score(X_test,y_test)

example_measures = np.array([[4,2,1,1,1,2,3,2,1]])

prediction = clf.predict(example_measures)

• sentdex
December 8, 2020

Can someone help? I only get a tokenizing error when I print accuracy.

• sentdex
December 8, 2020

Getting this error after I run clf.fit: "ValueError: Input contains NaN, infinity or a value too large for dtype('float64').". Anyone else seen this?

• sentdex
December 8, 2020

My accuracy is : 0.9928571428571429

• sentdex
December 8, 2020

No errors but the prediction is always  or [2 2] or [2 2 …] … I copied data from the sample data expecting the program to classify it as, "something it recognized" and therefore a different value.  was what I was hoping for and what I would have expected as most of the samples are benign but, nope … just .

• sentdex
December 8, 2020

ValueError: Found input variables with inconsistent numbers of samples: [559, 140]

anyone else getting this error??????

• sentdex
December 8, 2020

Hey! Can you make a video on logistic regression. And how to code logistic regression?

• sentdex
December 8, 2020

For anyone who gets an error at reshape, try this instead of reshape:
example_measures=np.array([[4,2,1,1,1,2,3,2,1]]). The features must be a 2D array, that's why the double brackets!
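A short sketch of why both forms work: scikit-learn estimators expect input of shape `(n_samples, n_features)`, and either the double brackets or `reshape(1, -1)` turns the nine measurements into one row. The second sample row below is made up for illustration.

```python
import numpy as np

# A flat list of 9 measurements is 1D
flat = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1])
print(flat.shape)

# reshape(1, -1): one row, and -1 lets NumPy infer the column count (9)
reshaped = flat.reshape(1, -1)

# Double brackets build the same 2D array directly
nested = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1]])
print(reshaped.shape, nested.shape)
print(np.array_equal(reshaped, nested))

# For several samples, len(...) as the row count keeps the reshape generic
two = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1],
                [4, 2, 1, 2, 2, 2, 3, 2, 1]])
print(two.reshape(len(two), -1).shape)
```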

• sentdex
December 8, 2020

Bad choice of -99999 for missing data. Entries like ['?',1,1,1] and ['?',9,9,9] are obviously different in all but one unknown property, while [-99999,1,1,1] and [-99999,9,9,9] are very near to each other due to a large dummy value. Just drop 16 rows with missing values.
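The distance argument can be checked numerically. This sketch uses the two rows from the comment plus a hypothetical complete row `c`, and then shows one way to drop rows containing '?' with pandas. The tiny frame and its `bare_nuclei` column name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def euclidean(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return float(np.linalg.norm(np.array(a, dtype=float) - np.array(b, dtype=float)))

# Rows from the comment, with '?' replaced by -99999
a = [-99999, 1, 1, 1]
b = [-99999, 9, 9, 9]
# Hypothetical complete row that matches `a` on every known feature
c = [1, 1, 1, 1]

d_ab = euclidean(a, b)  # differs in three features, yet looks close
d_ac = euclidean(a, c)  # identical known features, yet looks far away
print(d_ab, d_ac)

# Dropping rows with missing values instead of substituting a dummy
df = pd.DataFrame({"bare_nuclei": ["1", "?", "10"], "class": [2, 2, 4]})
cleaned = df.replace("?", np.nan).dropna()
print(len(cleaned))
```

With the dummy value, `a` ends up far closer to `b` than to `c`, which is the opposite of what the known features suggest.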

• sentdex
December 8, 2020

Why -1 on the reshape?

• sentdex
December 8, 2020

bro, you remind me of thenewboston soooo much!!!

• sentdex
December 8, 2020

This is the code:
df.replace('?', -99999, inplace=True)

df.drop(['id'], 1, inplace=True)

i have some problem here: