K Nearest Neighbors Application – Practical Machine Learning Tutorial with Python p.14





In the last part we introduced classification, which is a supervised form of machine learning, and explained the intuition behind the K Nearest Neighbors algorithm. In this tutorial, we’re going to apply the algorithm to a simple example using Scikit-Learn, and in the subsequent tutorials we’ll build our own version of the algorithm to learn more about how it works under the hood.
To exemplify classification, we’re going to use a breast cancer dataset that the University of Wisconsin-Madison donated to the large Machine Learning Repository hosted by the University of California, Irvine (UCI).

https://pythonprogramming.net
https://twitter.com/sentdex
https://www.facebook.com/pythonprogramming.net/
https://plus.google.com/+sentdex


Comment List

  • sentdex
    December 8, 2020

    Hi, is anyone getting this error?
    raise ValueError(
    ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
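
This error usually means at least one feature value is still missing when the data reaches `fit`. In this dataset missing entries are marked with `?`, so one way to end up with NaN is to have those markers converted during loading. Below is a minimal sketch of how to find and drop such rows; the three rows and the column names `clump_thickness` and `bare_nuclei` are made up for illustration and are not taken from the real file.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows in the spirit of the dataset; '?' marks a missing value.
raw = io.StringIO(
    "clump_thickness,bare_nuclei,class\n"
    "5,1,2\n"
    "8,?,4\n"
    "3,2,2\n"
)

# na_values turns every '?' into NaN while the file is read.
df = pd.read_csv(raw, na_values="?")

# Count missing values per column before fitting anything.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Simplest fix: drop the incomplete rows, then build the arrays.
clean = df.dropna()
X = clean.drop(columns=["class"]).to_numpy(dtype=float)
y = clean["class"].to_numpy()

print(X.shape, y.shape)
print(np.isnan(X).any())
```

If the count is zero for every column and the error persists, the problem is more likely an infinite or extremely large value than a missing one.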

  • sentdex
    December 8, 2020

    Thanks to sentdex, I am beginning to see the matrix… at least the door.

    # Assumes X, y, train_test_split and neighbors from the tutorial code.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = neighbors.KNeighborsClassifier()
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)

    # Count the test samples whose predicted label differs from the true label.
    i = 0
    for predDelta in (clf.predict(X_test) - y_test):
        if predDelta != 0:
            i = i + 1

    print(f" the number of inaccurate predictions is {i}")
    print(f" based on my accuracy calculation {1 - i/len(y_test)}")
    print(f" based on sk learn accuracy calculation {accuracy}")
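
The same check can be written without a loop. Here is a small sketch with made-up label arrays (2 and 4 stand in for the two classes); it only illustrates the arithmetic and does not use the tutorial's data. For a classifier, `clf.score` reports the mean accuracy, so the manual figure and the scikit-learn figure should agree.

```python
import numpy as np

# Hypothetical true and predicted labels for eight test samples.
y_test = np.array([2, 4, 2, 2, 4, 2, 4, 2])
y_pred = np.array([2, 4, 2, 4, 4, 2, 2, 2])

# Comparing the arrays element by element gives True for every miss.
n_wrong = int(np.sum(y_pred != y_test))

# Accuracy is the share of test samples predicted correctly.
manual_accuracy = 1 - n_wrong / len(y_test)

print(n_wrong)          # 2
print(manual_accuracy)  # 0.75
```

Comparing with `!=` rather than subtracting also works when the labels are strings.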

  • sentdex
    December 8, 2020

    The below code may help.

    import numpy as np
    import pandas as pd
    from sklearn import neighbors
    from sklearn.model_selection import train_test_split

    df = pd.read_csv('Bcancer.txt')

    # Missing values are marked '?'; swap them for a large outlier value.
    df.replace('?', -99999, inplace=True)

    # Uncomment if your file has an id column; the example below expects 9 features.
    # df.drop(['id'], axis=1, inplace=True)

    # Features are every column except the label.
    # Recent pandas versions need axis passed as a keyword.
    X = np.array(df.drop(['class'], axis=1))
    y = np.array(df['class'])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    clf = neighbors.KNeighborsClassifier()
    clf.fit(X_train, y_train)

    accuracy = clf.score(X_test, y_test)
    print(accuracy)

    # predict expects a 2D array: one row per sample.
    example_measures = np.array([4,2,1,1,1,2,3,2,1])
    example_measures = example_measures.reshape(1,-1)

    prediction = clf.predict(example_measures)
    print(prediction)

  • sentdex
    December 8, 2020

    For anyone who gets a ValueError: labels ['class'] not contained in axis, just use df.columns = df.columns.str.replace(' ', '') at the beginning to remove any spaces in the column names.
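
A short sketch of what that fix is doing, assuming the header line of the file has a space after each comma. The two data rows and the column names are made up for illustration. It uses `str.strip()`, which trims only the edges of each name, where the comment's `str.replace(' ', '')` removes every space.

```python
import io

import pandas as pd

# Hypothetical file whose header has a space after each comma.
raw = io.StringIO("id, clump_thickness, class\n101,5,2\n102,8,4\n")
df = pd.read_csv(raw)

# The names keep their leading spaces, so 'class' is not found.
print(list(df.columns))  # ['id', ' clump_thickness', ' class']

# Trim whitespace from both ends of every column name.
df.columns = df.columns.str.strip()
print(list(df.columns))  # ['id', 'clump_thickness', 'class']

# Now the label column can be dropped by name.
X = df.drop(columns=["class"])
```

Passing `skipinitialspace=True` to `pd.read_csv` is another way to get clean names from a file like this.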

  • sentdex
    December 8, 2020

    example_measures=np.array([[4,2,1,1,1,2,3,2,1]]).reshape(1,-1)
    hahahaha I am sure that line came from stackoverflow

  • sentdex
    December 8, 2020

    I'm getting this error when I Fit the model:
    ValueError: Found input variables with inconsistent numbers of samples: [559, 140]

    It looks like a shape problem: X_train is [559, 10] and y_train is [140, 10].
    I tried making changes and I get some other error instead.
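
The numbers 559 and 140 are exactly the train and test sizes you get from 699 rows with `test_size=0.2`, so one likely cause (an assumption, since the failing code is not shown) is that the four return values of `train_test_split` were unpacked in the wrong order. The sketch below uses placeholder arrays of zeros with the same row count to show both cases.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the error suggests.
X = np.zeros((699, 9))
y = np.zeros(699, dtype=int)

# Correct order: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, y_train.shape)  # (559, 9) (559,)
print(X_test.shape, y_test.shape)    # (140, 9) (140,)

# Swapping the middle two names makes "y_tr" hold the X test rows.
X_tr, y_tr, X_te, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_tr.shape, y_tr.shape)        # (559, 9) (140, 9)
```

Fitting on `X_tr` and `y_tr` from the second call would pair 559 samples with 140 and raise the same ValueError.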

  • sentdex
    December 8, 2020

    but we do have this dataset already in sklearn.datasets
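
For reference, a minimal sketch of using the bundled data. As far as I know, `load_breast_cancer` returns the Wisconsin Diagnostic version (569 samples, 30 features, labels 0 and 1 for malignant and benign), which is not the same file as the nine-feature data the code in these comments works with, so accuracy and example inputs will differ from the tutorial's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled dataset: no download and no missing-value markers to handle.
data = load_breast_cancer()
X, y = data.data, data.target
print(X.shape)  # (569, 30)
print(data.target_names)

# random_state is fixed here only to make the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(accuracy)
```

This is also a workable fallback if the UCI download is unavailable.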

  • sentdex
    December 8, 2020

    I can't download the same datasets.

  • sentdex
    December 8, 2020

    So NearestNeighbors and KNeighborsClassifier are not the same thing?
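
They are related but different. `NearestNeighbors` is an unsupervised neighbor search that only tells you which stored points are closest, while `KNeighborsClassifier` also takes labels and votes among the neighbors to predict a class. A small sketch with made-up one-dimensional points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

# Six made-up points in one dimension, with labels 2 and 4.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([2, 2, 2, 4, 4, 4])
query = np.array([[2.4]])

# NearestNeighbors: no labels involved, returns distances and row indices.
nn = NearestNeighbors(n_neighbors=3).fit(X)
distances, indices = nn.kneighbors(query)
print(indices)    # rows of X closest to the query
print(distances)

# KNeighborsClassifier: finds the same neighbors, then votes on their labels.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
label = clf.predict(query)
print(label)      # [2]
```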

  • sentdex
    December 8, 2020

    I don't seem to be able to download the dataset. It seems like I am the only person with this problem. Any ideas?

  • sentdex
    December 8, 2020

    Link to the dataset used in the video
    https://archive.ics.uci.edu/ml/datasets/Breast+Cancer

  • sentdex
    December 8, 2020

    Better explained than when I was a student in a university.

  • sentdex
    December 8, 2020

    ValueError: Classification metrics can't handle a mix of continuous and binary targets

  • sentdex
    December 8, 2020

    Can anyone help?
    I got an error saying "KeyError: "['class'] not found in axis"" and tried X = np.array(df.drop('class', 1)),
    but I am still getting the same error.
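
A KeyError like this means there is no column literally named `class`, so changing how `drop` is called will not help. One possible cause (an assumption, since the file is not shown) is that the file has no header row, in which case pandas uses the first data row as the column names. The sketch below uses made-up rows and hypothetical column names to show the check and one fix.

```python
import io

import pandas as pd

# Two made-up rows with no header line.
raw = "101,5,1,2\n102,8,10,4\n"

# Without a header, the first data row becomes the column names.
no_header = pd.read_csv(io.StringIO(raw))
print(list(no_header.columns))  # ['101', '5', '1', '2']

# Supplying names explicitly keeps every row as data.
names = ["id", "clump_thickness", "bare_nuclei", "class"]
df = pd.read_csv(io.StringIO(raw), header=None, names=names)
print(list(df.columns))

# With real names in place, dropping by name works.
X = df.drop(columns=["id", "class"])
y = df["class"]
```

Printing `df.columns` right after loading is the quickest way to see which names pandas actually found.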

  • sentdex
    December 8, 2020

    X = np.array(df.drop('class', 1)) worked for me when I got the "labels ['class'] not contained in axis" error

  • sentdex
    December 8, 2020

    import numpy as np
    import pandas as pd
    from sklearn import preprocessing, model_selection, neighbors

    df = pd.read_csv('C:/…/breast_cancer_dataset.txt')

    # Missing values are marked '?'; swap them for a large outlier value.
    df.replace('?', -99999, inplace = True)

    # The sample id carries no predictive information.
    # Recent pandas versions need axis passed as a keyword.
    df.drop(['sample_id'], axis=1, inplace = True)

    X = np.array(df.drop(['class'], axis=1))
    y = np.array(df['class'])

    X_train, X_test, y_train, y_test = model_selection.train_test_split(X,y, test_size=0.2)

    clf = neighbors.KNeighborsClassifier()
    clf.fit(X_train,y_train)

    accuracy = clf.score(X_test,y_test)

    # One sample as a 2D array: one row, nine features.
    example_measures = np.array([[4,2,1,1,1,2,3,2,1]])
    prediction = clf.predict(example_measures)

  • sentdex
    December 8, 2020

    Can someone help? I only get a tokenizing error when I print accuracy.

  • sentdex
    December 8, 2020

    Getting this error after I run clf.fit: "ValueError: Input contains NaN, infinity or a value too large for dtype('float64').". Anyone else seen this?

  • sentdex
    December 8, 2020

    My accuracy is : 0.9928571428571429

  • sentdex
    December 8, 2020

    No errors but the prediction is always [2] or [2 2] or [2 2 …] … I copied data from the sample data expecting the program to classify it as, "something it recognized" and therefore a different value. [1] was what I was hoping for and what I would have expected as most of the samples are benign but, nope … just [2].
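
As I understand the UCI description of this dataset, the class column uses 2 for benign and 4 for malignant, so `[2]` would be the "benign" answer and `[1]` can never appear: a classifier only returns labels it saw during training. A quick sketch with made-up two-feature samples shows how to check which labels a fitted model can produce.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up samples; labels follow the 2 / 4 convention.
X = np.array([[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [10, 10]])
y = np.array([2, 2, 2, 4, 4, 4])

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# classes_ lists every label the fitted model can return.
print(clf.classes_)  # [2 4]

# A sample near the low-valued group and one near the high-valued group.
benign_like = clf.predict(np.array([[1, 1]]))
malignant_like = clf.predict(np.array([[9, 9]]))
print(benign_like, malignant_like)  # [2] [4]
```

If every prediction is 2 even for samples copied from rows labeled 4, the features being passed in are probably not in the same order or scale as the training columns.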

  • sentdex
    December 8, 2020

    ValueError: Found input variables with inconsistent numbers of samples: [559, 140]

    anyone else getting this error??????

  • sentdex
    December 8, 2020

    Hey! Can you make a video on logistic regression and how to code it?

  • sentdex
    December 8, 2020

    For anyone who gets an error at reshape, try this instead:
    example_measures=np.array([[4,2,1,1,1,2,3,2,1]])
    predict needs a 2D array, which is why the brackets are doubled!

  • sentdex
    December 8, 2020

    Bad choice of -99999 for missing data. Entries like ['?',1,1,1] and ['?',9,9,9] are obviously different in all but one unknown property, while [-99999,1,1,1] and [-99999,9,9,9] are very near to each other due to a large dummy value. Just drop 16 rows with missing values.
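
The point is easy to check with the Euclidean distance that K Nearest Neighbors typically relies on. The sketch below uses the commenter's own vectors, plus one extra vector `c` that I added for contrast.

```python
import math

# The commenter's two samples after '?' was replaced by -99999.
a = [-99999, 1, 1, 1]
b = [-99999, 9, 9, 9]

# Added for contrast: identical to a except for a known first value.
c = [1, 1, 1, 1]

# Very different samples end up close because the dummy values cancel out.
print(math.dist(a, b))  # about 13.86

# Nearly identical samples end up far apart because of the dummy value.
print(math.dist(a, c))  # 100000.0
```

So the dummy value effectively groups all rows with a missing entry together, regardless of their other features.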

  • sentdex
    December 8, 2020

    Why -1 on the reshape?
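
In NumPy, `-1` in `reshape` means "work this dimension out from the array's size". A small sketch using the example values from the comments above; the second sample in `two` is made up.

```python
import numpy as np

flat = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1])
print(flat.shape)  # (9,)

# One row, and as many columns as needed to hold all nine values.
row = flat.reshape(1, -1)
print(row.shape)   # (1, 9)

# Writing the sample with double brackets gives the same 2D shape directly.
nested = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1]])
print(nested.shape)  # (1, 9)

# With several samples, len() supplies the row count and -1 the columns.
two = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1], [4, 2, 1, 2, 2, 2, 3, 2, 1]])
two_rows = two.reshape(len(two), -1)
print(two_rows.shape)  # (2, 9)
```

scikit-learn's `predict` expects this samples-by-features layout, which is why a flat array of nine numbers is rejected.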

  • sentdex
    December 8, 2020

    bro, you remind me of thenewboston soooo much!!!

  • sentdex
    December 8, 2020

    This is the code:
    df.replace('?', -99999, inplace=True)

    df.drop(['id'], 1, inplace=True)

    I have a problem here:
    KeyError: "['id'] not found in axis"

  • sentdex
    December 8, 2020

    All working fine up till the example_measures point then nothing… no errors, no beeps, no nothing!! No idea what is wrong, even after a full half day of re-writing the code…still nothing!
