Machine Learning with Text in scikit-learn (PyData DC 2016)

Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we’ll build and evaluate predictive models from real-world text using scikit-learn. (Presented at PyData DC on October 7, 2016.)
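
As a rough illustration of what "transforming text into data" can look like, here is a minimal sketch using a bag-of-words representation and a Naive Bayes classifier. The toy messages and labels are invented for this example, and the choice of CountVectorizer with MultinomialNB is an assumption about a typical workflow rather than a transcript of the tutorial itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy data invented for illustration (1 = spam, 0 = not spam)
texts = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free reward today",
    "urgent prize waiting for you",
    "are we still meeting for lunch",
    "please send me the report",
    "see you at the office tomorrow",
    "thanks for the notes from class",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the vocabulary is learned from training data only
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels
)

# Learn the vocabulary on the training text and build a document-term matrix
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)

# Reuse the fitted vocabulary for the test text (transform only, no fitting)
X_test_dtm = vect.transform(X_test)

# Train the classifier and evaluate it on the held-out documents
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)
accuracy = accuracy_score(y_test, y_pred)
```

With so few documents the accuracy value itself is not meaningful; the point is the order of the steps: split, fit the vectorizer on training text, transform both sets, then train and evaluate.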

GitHub repository: https://github.com/justmarkham/pydata-dc-2016-tutorial
Enroll in my online course: http://www.dataschool.io/learn/
Subscribe to the Data School newsletter: http://www.dataschool.io/subscribe/

== OTHER RESOURCES ==
My scikit-learn video series: https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
My pandas video series: https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y

== JOIN THE DATA SCHOOL COMMUNITY ==
Blog: https://www.dataschool.io
Twitter: https://twitter.com/justmarkham
Facebook: https://www.facebook.com/DataScienceSchool/
YouTube: https://www.youtube.com/user/dataschool?sub_confirmation=1

Join “Data School Insiders” to receive exclusive rewards! https://www.patreon.com/dataschool

Comment List

  • Data School
    December 1, 2020

    As ever, an excellent run through. Thanks

  • Data School
    December 1, 2020

    Hello Sir!
    I have a question: how can I convert a journal title to its journal abbreviation using NLP, or some method that is easier than NLP?
    Please guide me; I am waiting for your kind response.
    Thanks in anticipation.

  • Data School
    December 1, 2020

    Great lesson. I have a humble request: I would like to use R syntax in Python, but I don't know how to interface the two. If you can share some materials, I will be very happy.

  • Data School
    December 1, 2020

    284

  • Data School
    December 1, 2020

    It’s sklearn.model_selection instead of sklearn.cross_validation
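
The import change mentioned in the comment above can be sketched as follows. The sklearn.cross_validation module was deprecated in scikit-learn 0.18 and removed in 0.20, so code written for older releases needs the newer import path. The toy X and y values are placeholders for illustration.

```python
# Old import path (fails on scikit-learn 0.20 and later):
#   from sklearn.cross_validation import train_test_split
# Current import path:
from sklearn.model_selection import train_test_split

# Placeholder data, only to show that the call itself is unchanged
X = list(range(10))
y = [0, 1] * 5

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
```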

  • Data School
    December 1, 2020

    Excellent!

  • Data School
    December 1, 2020

    Great video, amazing teaching skills…thanks a ton 🙂

  • Data School
    December 1, 2020

    It was a great session.

  • Data School
    December 1, 2020

    Another very good, easy to follow tutorial ^^

  • Data School
    December 1, 2020

    So clear! Big ups.

  • Data School
    December 1, 2020

    Hi, Kevin! I have a question about using count vectorization in CV. Can we just transform before splitting into folds and training the model? In principle, it will not train on features that are not in the training set. Can you please elaborate on this? Thank you!
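
One common way to handle the situation raised in this question is to put the vectorizer and the model into a single scikit-learn Pipeline, so that the vocabulary is re-learned from the training portion of each fold. This is a minimal sketch under that assumption; the toy texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration (1 = spam, 0 = not spam)
texts = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free reward today",
    "urgent prize waiting for you",
    "are we still meeting for lunch",
    "please send me the report",
    "see you at the office tomorrow",
    "thanks for the notes from class",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline fits CountVectorizer on the training folds only,
# then applies the fitted vocabulary to the held-out fold.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# cross_val_score receives the raw text; vectorization happens inside each fold
scores = cross_val_score(pipe, texts, labels, cv=2, scoring="accuracy")
```

Vectorizing the whole dataset before splitting lets the vocabulary include words that appear only in held-out documents, which can make the evaluation slightly optimistic compared with how the model would see genuinely new text.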

  • Data School
    December 1, 2020

    Thank You.

  • Data School
    December 1, 2020

    Amazing course! Thanks.

  • Data School
    December 1, 2020

    Can anyone explain max_df and min_df clearly?
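
As a small sketch of the two CountVectorizer parameters asked about here: min_df drops terms that appear in too few documents, and max_df drops terms that appear in too many. An integer is read as a document count and a float as a proportion of documents. The four toy documents are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents invented for illustration
docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the fish swam",
]

# Document frequencies: "the" appears in 4 docs, "sat" in 2, every other word in 1.
# min_df=2    -> drop terms found in fewer than 2 documents (all the one-off words)
# max_df=0.75 -> drop terms found in more than 75% of documents ("the")
vect = CountVectorizer(min_df=2, max_df=0.75)
vect.fit(docs)

# Only "sat" satisfies both thresholds
vocab = sorted(vect.vocabulary_)
```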

  • Data School
    December 1, 2020

    If I have 100 articles, do I have to create 100 corpora for them, or something else?

  • Data School
    December 1, 2020

    Please do a video on sentiment analysis

  • Data School
    December 1, 2020

    Excellent Explanation! Thanks a lot… Kevin

  • Data School
    December 1, 2020

    This guy. You teach amazingly well. Gifted communicator, looking forward to future content!

  • Data School
    December 1, 2020

    Again, another great tutorial! Recently I've been watching a lot (i.e. almost all haha) of your videos; really love how you explain code line by line so that we can understand the "why" in addition to the "how". There are only a handful of people I've found online that have a similar teaching style. 

    Honestly, I think your content is almost too good to give away for free. Maybe you should consider publishing your videos on Udemy (e.g. Kirill Eremenko is an excellent teacher as well and reaches a huge audience on Udemy – https://www.udemy.com/user/kirilleremenko/ ). Anyway, thanks again really appreciate it!

  • Data School
    December 1, 2020

    Thank you very much for sharing the lecture. It's one of the best and most clearly explained lectures on this topic for beginners like me.

  • Data School
    December 1, 2020

    Thank you, Kevin! I made a male/female classifier using CountVectorizer(). It's working! In my case it was sex detection from first, middle, and last names: https://github.com/Andrew32bit/Machine-learning/blob/master/sex_detection.ipynb . Very useful tutorial.
    P.S. But I vectorized before the train/test split.

  • Data School
    December 1, 2020

    Hi Kevin, I hope you can make videos on deep learning, such as CNN, RNN, and LSTM.

  • Data School
    December 1, 2020

    Sir, in the model we are just telling the machine whether something is desperate or not. If we want more classes, for example to predict whether a comment is positive, negative, or neutral, which scikit-learn commands would we use, or how would we implement this?

  • Data School
    December 1, 2020

    With an imbalanced multi-class text dataset, should I normalize the data with TF-IDF weights or not?

  • Data School
    December 1, 2020

    Thank you Kevin. I really liked the in depth explanation of the concept. Teachers like you inspire me a lot….

  • Data School
    December 1, 2020

    Thank you Kevin for your great video! I have a question: how can I combine the plural and singular words or verbs with different tenses together and just keep one of them (I don't want to differentiate them)?

  • Data School
    December 1, 2020

    Thank you Kevin! Such an informative tutorial, with great detail but no redundancy!

  • Data School
    December 1, 2020

    Easy and informative!

  • Data School
    December 1, 2020

    Good job! Thanks so much!

  • Data School
    December 1, 2020

    Great video Kevin. Is this process an alternative for using NLTK?

  • Data School
    December 1, 2020

    Hi Kevin, much appreciation for the tutorial!

    I have a question about how to merge other features with the vectorized features.

    For example, when we pass a 'text' column to TfidfVectorizer (after fit and transform), how do we properly add other features, such as a 'subject' feature, to each of the train and test instances?

    Thanks in advance. 🙂
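
One possible approach to the question above is to stack the extra columns next to the sparse document-term matrix with scipy.sparse.hstack. This is only a sketch: the texts are invented, and the numeric train_extra and test_extra arrays are a hypothetical stand-in for an already-encoded 'subject' feature.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy text invented for illustration
train_text = [
    "budget meeting moved to friday",
    "win a free cruise today",
    "lunch at noon",
]
test_text = ["free budget cruise"]

# Hypothetical numeric stand-in for an encoded 'subject' feature (one row per document)
train_extra = np.array([[1.0], [0.0], [1.0]])
test_extra = np.array([[0.0]])

# Fit TF-IDF on the training text only, then transform the test text
vect = TfidfVectorizer()
train_dtm = vect.fit_transform(train_text)
test_dtm = vect.transform(test_text)

# Append the extra column(s) to the right of the text features, keeping everything sparse
train_combined = hstack([train_dtm, csr_matrix(train_extra)]).tocsr()
test_combined = hstack([test_dtm, csr_matrix(test_extra)]).tocsr()
```

The same column order has to be used for the training and test matrices. When the data lives in a pandas DataFrame, scikit-learn's ColumnTransformer is another option that applies different transformers to different columns.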

  • Data School
    December 1, 2020

    Hi Kevin, I follow and recommend your tutorials to my friends.
    I have a question about this particular video. Why did you choose the scikit-learn package for feature extraction, text cleaning, etc.? What advantages do I get over NLTK? NLTK can easily interact with scikit-learn for ML classification algorithms.

    I believe NLTK is more mature and has 1000 other features which might not be required while performing basic text mining. But I am just trying to understand the differences w.r.t. performance, ease of use, and interaction with scikit-learn for ML.

  • Data School
    December 1, 2020

    Another great video by Kevin! Thanks a lot for kindly sharing.

  • Data School
    December 1, 2020

    How can I classify an email as a positive or negative response?

  • Data School
    December 1, 2020

    Awesome lecture as always

  • Data School
    December 1, 2020

    Thanks, I downloaded all your videos. Although I did my capstone project in NLP (analyzing tweets), I think your self-paced class will be beneficial to me.

  • Data School
    December 1, 2020

    Did you give a similar lecture at any other Python conference, or is it completely different?

  • Data School
    December 1, 2020

    thank you…
