K-Means Clustering – Methods using Scikit-learn in Python – Tutorial 23 in Jupyter Notebook





In this Python for Data Science tutorial, you will learn how to perform K-means clustering using the pandas, SciPy, NumPy, and scikit-learn libraries in a Jupyter notebook.

This is the 23rd video of the Python for Data Science course. In this series I explain Python and data science step by step. Python is widely used for data analysis because of its libraries for manipulating, storing, and gaining understanding from data. Watch this video to learn about the tools that make Python a data science powerhouse.

Jupyter notebooks have become very popular in the last few years, and for good reason. They let you create and share documents that contain live code, equations, visualizations, and markdown text, all of which run directly in the browser. They are an essential tool to learn if you are getting started in data science, and they are useful outside that field as well.

Harvard Business Review named data scientist “the sexiest job of the 21st century.” pandas is a commonly used tool in the industry for cleaning, analyzing, and visualizing data of varying sizes and types. We'll learn how to use pandas, SciPy, scikit-learn, and matplotlib to extract meaningful insights and recommendations from real-world datasets.

Python Data Science Practice Jupyter notebooks and Data Sets: https://github.com/theengineeringworld/Python-data-science
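
As a rough illustration of the workflow described above, here is a minimal sketch of K-means on the iris dataset with scikit-learn. It is not the notebook from the video: the use of StandardScaler, the n_init value, and the variable names (features, scaled, model, cluster_one) are assumptions made for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the iris measurements into a DataFrame with named columns.
iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# Standardize the features so that each one contributes comparably
# to the Euclidean distances K-means relies on.
scaled = StandardScaler().fit_transform(features)

# Fit K-means with three clusters and store each row's cluster label.
model = KMeans(n_clusters=3, n_init=10, random_state=5)
features["cluster"] = model.fit_predict(scaled)

# Show how many rows fell into each cluster.
print(features["cluster"].value_counts().sort_index())

# Select all observations assigned to cluster 1.
cluster_one = features[features["cluster"] == 1]
print(cluster_one.head())
```

The cluster numbers are arbitrary: cluster 1 here does not necessarily correspond to iris species 1. To cluster your own data, features could instead be built with pd.read_csv on a file of numeric columns.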


Comment List

  • TheEngineeringWorld
    November 26, 2020

    Can the K-Means method be used to identify ripeness of fruit?
    Thank you

  • TheEngineeringWorld
    November 26, 2020

    Good explanation! please share the code to email: sujalbhagat97@gmail.com

  • TheEngineeringWorld
    November 26, 2020

    this was very well explained. thanks a lot

  • TheEngineeringWorld
    November 26, 2020

    Very helpful. Meanwhile, why is "Predict" not used here ?

  • TheEngineeringWorld
    November 26, 2020

    Hi, I really appreciated your teaching, thank you. By the way, this class is missing on GitHub. Regards

  • TheEngineeringWorld
    November 26, 2020

    1:35 Shouldn't it be "import numpy as np" ?

  • TheEngineeringWorld
    November 26, 2020

    How to cluster from database or csv

  • TheEngineeringWorld
    November 26, 2020

    Sir if you were asked to print all observations of say cluster 1 then how will you do it?

  • TheEngineeringWorld
    November 26, 2020

    Solve the following problems using Sklearn. Use CardiologyCategorical.csv dataset for the problems.

    1. How many instances does the dataset have?

    2. How many attributes does the dataset have?

    3. How many attributes are nominal?

    4. Build a decision tree that predicts whether a patient has a heart disease. Record the 10-fold cross-validation accuracy of your model as A1. Save the result buffer and Insert the confusion matrix obtained with 10-fold cross-validation in your report.

    5. Create a new attribute coarseBloodPressure, with values: Low if blood pressure is less than or equal to 120, Normal if blood pressure is greater than 120 but less than or equal to 140, and High if blood pressure is greater than 140. Now remove blood pressure attribute and name the new dataset as CardiologyCategoricalNew.csv.

    6. Build a new decision tree that predicts whether a patient has a heart condition, using CardiologyCategoricalNew.csv dataset. Record the 10-fold cross-validation accuracy of your model as A2. Insert the confusion matrix obtained with 10-fold cross-validation in your report

    7. Build Naïve Bayes model (using CardiologyCategorical.csv dataset) that predicts whether a patient has a heart disease. Record the 10-fold cross-validation accuracy of your model as A3. Save the result buffer and Insert the confusion matrix obtained with 10-fold cross-validation in your report

    8. Compare A1, A2 and A3.

    9. Given a new test dataset named CardiologyTestdata.csv, predict the classes for the two instances in the given test data using the Decision tree and Naïve Bayes models.

    help me

  • TheEngineeringWorld
    November 26, 2020

    Thank you, very clear explanation! Appreciated

  • TheEngineeringWorld
    November 26, 2020

    Code:

    import pandas as pd
    import numpy as np

    import matplotlib.pyplot as plt
    import sklearn

    from sklearn.cluster import KMeans

    from mpl_toolkits.mplot3d import Axes3D
    from sklearn.preprocessing import scale
    import sklearn.metrics as sm
    from sklearn import datasets
    from sklearn.metrics import confusion_matrix, classification_report

    # Jupyter magic: render plots inside the notebook
    %matplotlib inline
    plt.rcParams['figure.figsize'] = 7, 4

    # Load iris and standardize the four features
    iris = datasets.load_iris()
    x = scale(iris.data)

    y = pd.DataFrame(iris.target)
    variable_names = iris.feature_names
    x[0:10, ]

    # Fit K-means with three clusters
    clustering = KMeans(n_clusters=3, random_state=5)
    clustering.fit(x)

    # iris.data has four feature columns; the species is in iris.target
    iris_df = pd.DataFrame(iris.data)
    iris_df.columns = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']
    y.columns = ['Targets']

  • TheEngineeringWorld
    November 26, 2020

    Any one send me full code plzzzzzzzz

  • TheEngineeringWorld
    November 26, 2020

    Plz give me all file of data set

  • TheEngineeringWorld
    November 26, 2020

    What if there is no true answer for the observations? What metric should we use to evaluate the cluster results?

  • TheEngineeringWorld
    November 26, 2020

    Sir how can we implement Clustering Methodology Using Artificial Bee Colony Algorithm in Python?

  • TheEngineeringWorld
    November 26, 2020

    Thank you for clear explanation. What to do if data has no labels? Please help me on how to use scikit learn classification report on data with no prior labels(y)

  • TheEngineeringWorld
    November 26, 2020

    Unfortunately the code is not on GitHub

  • TheEngineeringWorld
    November 26, 2020

    please tell me, how can i add my own data ?

    thanks a lot ! Y

  • TheEngineeringWorld
    November 26, 2020

    can we take sepal length and width for plotting scatter plot?

  • TheEngineeringWorld
    November 26, 2020

    Hi, where can I find the Jupyter Notebook ? Thanks!

  • TheEngineeringWorld
    November 26, 2020

    please tell me that how can we add our own dataset here..like if we have it in a notebook file?

  • TheEngineeringWorld
    November 26, 2020

    This was wonderful! thank you 🙂

  • TheEngineeringWorld
    November 26, 2020

    First of all, thank you for sharing.
    I get an error with:
    rcParams['figure.figsize'] = 7, 4
    NameError: name 'rcParams' is not defined
    Any idea why?

  • TheEngineeringWorld
    November 26, 2020

    Thanks for the video!! I wrote the code with few minor changes, hope this helps 🙂

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import sklearn

    from sklearn.cluster import KMeans
    from mpl_toolkits.mplot3d import Axes3D
    from sklearn.preprocessing import scale

    import sklearn.metrics as sm
    from sklearn import datasets
    from sklearn.metrics import confusion_matrix,classification_report

    plt.rc('figure', figsize=(7,4))

    # Load iris and standardize the four features
    iris = datasets.load_iris()
    X = scale(iris.data)
    Y = pd.DataFrame(iris.target)
    variable_name = iris.feature_names
    X[0:10,]

    # Fit K-means with three clusters
    clustering = KMeans(n_clusters=3,random_state=5)
    clustering.fit(X)

    iris_df = pd.DataFrame(iris.data)
    iris_df.columns =['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']
    Y.columns = ['Targets']

    # First figure: true species next to the raw cluster labels
    color_theme = np.array(['darkgray','lightsalmon','powderblue'])
    plt.figure()
    plt.subplot(1,2,1)
    plt.scatter(x=iris_df.Petal_Length,y=iris_df.Petal_Width,c=color_theme[iris.target], s=50)
    plt.title('Ground Truth Classification')
    plt.subplot(1,2,2)
    plt.scatter(x=iris_df.Petal_Length,y=iris_df.Petal_Width,c=color_theme[clustering.labels_], s=50)
    plt.title('K-Means Classification')

    # Cluster numbers are arbitrary. This mapping was chosen by inspecting
    # the first figure; adjust it if your cluster numbering differs.
    relabel = np.choose(clustering.labels_,[2,0,1]).astype(np.int64)

    # Second figure: true species next to the relabeled clusters
    plt.figure()
    plt.subplot(1,2,1)
    plt.scatter(x=iris_df.Petal_Length,y=iris_df.Petal_Width,c=color_theme[iris.target], s=50)
    plt.title('Ground Truth Classification')
    plt.subplot(1,2,2)
    plt.scatter(x=iris_df.Petal_Length,y=iris_df.Petal_Width,c=color_theme[relabel], s=50)
    plt.title('K-Means Classification')

    print(classification_report(Y,relabel))

  • TheEngineeringWorld
    November 26, 2020

    What are the contents of the red cluster and the others, and how do we print them?

  • TheEngineeringWorld
    November 26, 2020

    Why do we do scaling ?

  • TheEngineeringWorld
    November 26, 2020

    Hi I was wondering if you have your code posted anywhere like on github? I'm a beginner at this and it would be super helpful!

  • TheEngineeringWorld
    November 26, 2020

    Thanks, Suppose we have columns as (Server ID, Critical error, non critical) and now we need to have a label as Health with values: Good, Bad and Moderate. so if we use the K means algorithm to cluster and put three labels 0,1 and 2. So here how can we test the accuracy?? since we don't have any labels.. Looking at the graph, we can say cluster as good, bad or moderate.

  • TheEngineeringWorld
    November 26, 2020

    Hi! Off topic question. If I perform the cluster analysis with different algorithms from sklearn how can I check the best one? Thanks in advance!
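
Several comments above ask how to judge clusters when there are no true labels, and how to compare different clustering algorithms. One common option, not covered in the video as far as the description shows, is the silhouette score, which needs only the data and the assigned cluster labels. The sketch below is an assumption-laden example: the candidate models, their parameters, and the names candidates and scores are chosen here for illustration only.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Standardized iris features; the species labels are deliberately not used.
X = StandardScaler().fit_transform(load_iris().data)

# Hypothetical set of clusterings to compare.
candidates = {
    "kmeans_k2": KMeans(n_clusters=2, n_init=10, random_state=5),
    "kmeans_k3": KMeans(n_clusters=3, n_init=10, random_state=5),
    "agglomerative_k3": AgglomerativeClustering(n_clusters=3),
}

# The silhouette score lies in [-1, 1]; higher values mean points sit
# closer to their own cluster than to the nearest other cluster.
scores = {}
for name, estimator in candidates.items():
    labels = estimator.fit_predict(X)
    scores[name] = silhouette_score(X, labels)

# Print the candidates from best to worst score.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

A higher silhouette score suggests tighter, better-separated clusters, but it is only one internal measure. It does not tell you whether the clusters match a meaning such as "Good", "Bad", or "Moderate"; that still requires inspecting the clusters yourself.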
