## K Nearest Neighbors Application – Practical Machine Learning Tutorial with Python p.14


In the last part we introduced classification, a supervised form of machine learning, and explained the intuition behind the K Nearest Neighbors algorithm. In this tutorial, we're going to apply a simple example of the algorithm using Scikit-Learn, and then in the subsequent tutorials we'll build our own version to learn more about how it works under the hood.

To exemplify classification, we're going to use the Breast Cancer Wisconsin dataset, which was donated to the University of California, Irvine (UCI) collection by the University of Wisconsin-Madison. UCI has a large Machine Learning Repository.
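As a sketch of what loading the raw file can look like (the column names below are assumptions based on the UCI `.names` description, and the two rows are small samples in the file's format, with `?` marking a missing value):

```python
import io
import pandas as pd

# Column names assumed from the UCI "breast-cancer-wisconsin.names" description.
cols = ['id', 'clump_thickness', 'uniform_cell_size', 'uniform_cell_shape',
        'marginal_adhesion', 'single_epithelial_size', 'bare_nuclei',
        'bland_chromatin', 'normal_nucleoli', 'mitoses', 'class']

# Two sample rows in the raw file's comma-separated format; the file has no header.
sample = io.StringIO("1000025,5,1,1,1,2,1,3,1,1,2\n"
                     "1057013,8,4,5,1,2,?,7,3,1,4\n")
df = pd.read_csv(sample, names=cols)
print(df['class'].tolist())  # 2 = benign, 4 = malignant in this dataset
```

In practice you would pass the path of your downloaded copy to `read_csv` instead of the in-memory sample.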

https://pythonprogramming.net

https://twitter.com/sentdex

https://www.facebook.com/pythonprogramming.net/

https://plus.google.com/+sentdex


Hi, is anyone else getting this error?

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
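That error usually means the `'?'` placeholders in the bare-nuclei column were never replaced, so the column stays non-numeric. One way to locate the offending values, sketched on a hypothetical column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with the dataset's '?' placeholder still in place.
s = pd.Series(['5', '1', '?', '3'])

# Coercing to numeric turns '?' into NaN, which KNeighborsClassifier rejects.
coerced = pd.to_numeric(s, errors='coerce')
print(coerced.isna().sum())  # number of rows that would trigger the ValueError
```

Running a check like this on each column narrows down which one still holds non-numeric values.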

Thanks, sentdex. I am beginning to see the matrix… at least the door.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Count predictions that differ from the true labels.
i = 0
for predDelta in (clf.predict(X_test) - y_test):
    if predDelta != 0:
        i = i + 1

print(f"the number of inaccurate predictions is {i}")
print(f"based on my accuracy calculation {1 - i/len(y_test)}")
print(f"based on sklearn's accuracy calculation {accuracy}")
```
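The per-element loop can be replaced with a vectorized comparison; a minimal sketch on stand-in arrays (toy values in place of `clf.predict(X_test)` and `y_test`):

```python
import numpy as np

# Toy predictions vs. true labels.
pred = np.array([2, 4, 2, 2, 4])
true = np.array([2, 2, 2, 4, 4])

misses = np.sum(pred != true)          # count of wrong predictions
print(misses, 1 - misses / len(true))  # prints: 2 0.6
```

`pred != true` gives a boolean array, and summing it counts the mismatches directly.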

The code below may help.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import neighbors

df = pd.read_csv('Bcancer.txt')
df.head()  # inspect the first rows (only useful in an interactive session)

# The dataset marks missing values with '?'.
df.replace('?', -99999, inplace=True)
#df.drop(['id'], axis=1, inplace=True)

X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(accuracy)

example_measures = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1])
example_measures = example_measures.reshape(1, -1)

prediction = clf.predict(example_measures)
print(prediction)
```

Updated site: https://archive.ics.uci.edu/ml/datasets.php

For anyone who gets a `ValueError: labels ['class'] not contained in axis`, just add `df.columns = df.columns.str.replace(' ', '')` at the beginning to strip any spaces from the column names.

example_measures=np.array([[4,2,1,1,1,2,3,2,1]]).reshape(1,-1)

Hahahaha, I am sure that line came from Stack Overflow.

I'm getting this error when I fit the model:

ValueError: Found input variables with inconsistent numbers of samples: [559, 140]

Some problem with reshaping, where X_train is [599, 10] and y_train is [140, 10].

I tried making changes and I get some other error instead.
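For what it's worth, 559 and 140 look like an 80/20 split of the 699-row dataset, which suggests the split's return values were unpacked in the wrong order (or `fit` was called with `X_train` and `X_test`). A minimal sketch of the correct unpacking, on toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)                  # 10 matching labels

# Correct unpacking order: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, y_train.shape)  # (8, 2) (8,)
```

If X and y already have different lengths before the split, the same error appears there instead.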

But we do have this dataset already in sklearn.datasets.
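A related dataset does ship with scikit-learn, though note that `load_breast_cancer` returns the Wisconsin *Diagnostic* dataset (30 continuous features, labels 0/1), not the original file used in the video. A minimal sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled Wisconsin Diagnostic dataset; no download or '?' handling needed.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

Because the features and labels differ, accuracy numbers won't match the ones from the original file.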

I can't download the same datasets.

So `NearestNeighbors` and `KNeighborsClassifier` are not the same thing?

I don't seem to be able to download the dataset. It seems like I am the only person with this problem. Any ideas?

Link to the dataset used in the video

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer

Better explained than when I was a student at university.

ValueError: Classification metrics can't handle a mix of continuous and binary targets

Can anyone help?

I got an error saying `KeyError: "['class'] not found in axis"` and tried `X = np.array(df.drop('class', 1))` instead, but I am still getting the same error.

X = np.array(df.drop('class', 1)) worked for me when I got the "labels ['class'] not contained in axis" error

```python
import numpy as np
import pandas as pd
from sklearn import model_selection, neighbors

df = pd.read_csv('C:/…/breast_cancer_dataset.txt')

# The dataset marks missing values with '?'.
df.replace('?', -99999, inplace=True)
df.drop(['sample_id'], axis=1, inplace=True)

X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

example_measures = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1]])
prediction = clf.predict(example_measures)
```

Can someone help? I only get a tokenizing error when I print the accuracy.

Getting this error after I run clf.fit: "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')." Has anyone else seen this?

My accuracy is : 0.9928571428571429

No errors, but the prediction is always [2] (or [2 2], or [2 2 …]). I copied data from the sample data expecting the program to classify it as "something it recognized" and therefore a different value. [1] was what I was hoping for, and what I would have expected since most of the samples are benign, but nope… just [2].

ValueError: Found input variables with inconsistent numbers of samples: [559, 140]

Is anyone else getting this error?

Hey! Can you make a video on logistic regression, and how to code it?

For anyone who gets an error at the reshape, try `example_measures = np.array([[4,2,1,1,1,2,3,2,1]])` instead of reshaping. `predict` needs a 2-D array; that's why the double brackets!

-99999 is a bad choice for missing data. Entries like ['?',1,1,1] and ['?',9,9,9] obviously differ in all but one unknown property, while [-99999,1,1,1] and [-99999,9,9,9] end up very near each other because of the large dummy value. Just drop the 16 rows with missing values.
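Dropping those rows instead might look like this, sketched on a toy frame with the dataset's `'?'` placeholder:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset; '?' marks a missing value.
df = pd.DataFrame({'bare_nuclei': ['1', '?', '3'], 'class': [2, 4, 2]})

# Turn '?' into NaN, then drop the affected rows entirely.
df = df.replace('?', np.nan).dropna()
print(len(df))  # 2 rows remain
```

On the real file this removes the 16 incomplete rows instead of distorting the distance calculation with a huge dummy value.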

Why -1 on the reshape?
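The `-1` asks NumPy to infer that dimension from the array's size; `reshape(1, -1)` turns a flat feature vector into a single-row 2-D array, which is the shape `predict` expects:

```python
import numpy as np

v = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1])
m = v.reshape(1, -1)     # -1 is inferred as 9 from the 9 elements
print(v.shape, m.shape)  # (9,) (1, 9)
```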

bro, you remind me of thenewboston soooo much!!!

This is the code:

```python
df.replace('?', -99999, inplace=True)
df.drop(['id'], axis=1, inplace=True)
```

and I have a problem here:

KeyError: "['id'] not found in axis"

Everything works fine up to the example_measures point, then nothing… no errors, no beeps, no nothing! No idea what's wrong, even after a full half day of rewriting the code… still nothing!