## Machine Learning with Text in scikit-learn (PyCon 2016)

Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we’ll build and evaluate predictive models from real-world text using scikit-learn. (Presented at PyCon on May 28, 2016.)
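
As a rough illustration of what "transforming text into data" can mean, the sketch below builds a bag-of-words document-term matrix in plain Python. It is not scikit-learn's implementation. The function names (`tokenize`, `fit_vocabulary`, `transform`), the example sentences, and the tokenization rule (lowercase, tokens of two or more word characters) are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumption: a token is a run of two or more word characters, lowercased.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Split text into lowercase tokens of at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Learn a sorted vocabulary (token -> column index) from training documents."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Build a document-term matrix of counts; unseen tokens are ignored."""
    matrix = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:
                row[vocabulary[tok]] = n
        matrix.append(row)
    return matrix


# Illustrative training and test sentences (hypothetical data).
train_docs = ["call you tonight", "Call me a cab", "please call me... PLEASE!"]
vocab = fit_vocabulary(train_docs)
train_dtm = transform(train_docs, vocab)

# The vocabulary is learned from training text only, so "don" is dropped here.
test_dtm = transform(["please don't call me"], vocab)

print(sorted(vocab))
print(train_dtm)
print(test_dtm)
```

Each row is one document and each column is one vocabulary token, so the text becomes a fixed-width numeric matrix that a model can learn from. Fitting the vocabulary on training text alone keeps the training and test matrices the same width.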

GitHub repository: https://github.com/justmarkham/pycon-2016-tutorial

Enroll in my online course: http://www.dataschool.io/learn/

== LET’S CONNECT! ==

JOIN the “Data School Insiders” community and receive exclusive rewards:
https://www.patreon.com/dataschool

### Comment List

• Data School
November 27, 2020

The tutorial is not audible; I can't hear a thing. I saved it locally on my phone, but I don't know what's happening.

• Data School
November 27, 2020

The way he explains every concept is truly didactic. Some teachers use terms to explain other terms, and in the end you do not understand anything; Kevin Markham, however, explains each term precisely without relying on other terms. I admire the way he teaches. Way to go, and greetings from Brazil.

• Data School
November 27, 2020

Thank you 😊

• Data School
November 27, 2020

You are awesome Kevin

• Data School
November 27, 2020

Is there any tutorial to analyse system logs with ML? thanks in advance!

• Data School
November 27, 2020

Hi Kevin, great video content! I just have a question. At 33:23, where you mentioned the 5 interesting things that were observed, stop words are dropped and not included in the token list.

However, during vect.fit(simple_train), the stop_words argument is set to None.

Can I presume that there is a set of standardized stop words and CountVectorizer drops it and the stop_words argument takes in user-specified stop words?

• Data School
November 27, 2020

You make wonderful videos and courses; however, they are very expensive for international students like me.

• Data School
November 27, 2020

Is the audio out?

• Data School
November 27, 2020

I could listen to this voice all day.

• Data School
November 27, 2020

Another fantastic video – thanks Kevin

• Data School
November 27, 2020

To scale features, which should we prefer, standardization or normalization, and why? And when should each be used?

• Data School
November 27, 2020

Thank you for all your helpful videos. I have a question related to vectorization:
At 1:07:36, if we use the words from the test set to fit our model, we could obtain a document-term matrix where some terms would have only zero entries. Would that have negative effects on our classifier?

• Data School
November 27, 2020

I'm an SAP ABAP engineer trying to integrate Python and ABAP. I have seen a few videos on Python ML, but listening to Kevin's video reminds me of a Steve Jobs marketing speech: clear, concise, calm, with rich knowledge embedded in the video. I will be watching this video multiple times because it has rich practical content and, more importantly, Kevin's art of speech really holds one's attention 🙂. Keep guiding us 🙏.

• Data School
November 27, 2020

Hey Kevin, could you please make a video on machine learning pipelining?

• Data School
November 27, 2020

Boy you made the best tutorial. Talking slow is magical!

• Data School
November 27, 2020

The BEST ML tutorials I have come across. Thanks a lot. God bless you.

• Data School
November 27, 2020

This is an amazing video. You are really a great teacher. Can I get the whole course's videos, please?

• Data School
November 27, 2020

Too bad I cannot hear a thing

• Data School
November 27, 2020

print false positive: 1:38:01

• Data School
November 27, 2020

Is it still relevant in 2019? Thanks for letting me know

• Data School
November 27, 2020

Thanks a lot Kevin

• Data School
November 27, 2020

This is a great, great tutorial and in depth explanation on many related topics! Thanks so much!

• Data School
November 27, 2020

@Data School, Again and Again you are the best Kevin. I was scared of the text analytics and web scraping. You can teach in such an intuitive and lucid way. Thanks a ton

• Data School
November 27, 2020

I need to test a Pega system built along with Python for machine learning. I am an automation tester but need to do AI testing. Can you please guide me on how to go about it?

• Data School
November 27, 2020

1.25 speed perfect

• Data School
November 27, 2020

In my case, the shapes of x_train and x_train_dtm are different, and I am getting ValueError: Found input variables with inconsistent numbers of samples: [25, 153]

• Data School
November 27, 2020

Great video. I would like to know if you will be doing videos on tokenizing, stemming, lemmatizing, and other core NLP techniques.

• Data School
November 27, 2020

I really enjoy your structured approach to teaching these classes 🙂

• Data School
November 27, 2020

Hey, you have used 2 classes for classification, right? What if I need more than 2 classes, e.g. contempt, depression, anger, joy, and many such emotions? Do I need to change any of the code here, or is providing a data set with multiple classes enough?

And I have one more doubt: once the model is built and prepared, how can I actually know which class a new text document supplied as input will belong to? E.g., whether the new document is ham or spam?

• Data School
November 27, 2020

Great video!

• Data School
November 27, 2020

Your videos just feel so friendly and inclusive, while being really educational. Your way of teaching is great. I thank you sincerely!

• Data School
November 27, 2020

Thank you so much for this video. It cleared all the doubts I had. Thank you again.

• Data School
November 27, 2020

Awesome video. Would you please make videos on performance metrics, featurization, and feature engineering?

• Data School
November 27, 2020

Two questions about Bag of Words have been bothering me for a while.

First: my source file has 2 columns. One is the email content, in text format; the other is the name of the country (3 different countries) from which the email was sent. I want to label whether each email is spam or not, on the assumption that the country an email is sent from also matters. So besides the bag of words, I want to add country as a feature. Is there a way to implement this in sklearn?

Second: besides Bag of Words, what if I also want to consider the position of the words? For instance, if a word appears in the first sentence, I want to lower its weight; if it appears in the last sentence, I want to increase its weight. Is there a way to implement this in sklearn? Thanks.

• Data School
November 27, 2020

I can't wait until I have watched enough of your content to start on your courses.

• Data School
November 27, 2020

Excellent video. Thank you so much, Kevin Sir, it really helped me a lot.

• Data School
November 27, 2020

Thanks for the detailed information. Is it possible to use Multidimensional?