Scraping Medium Publications – A Python Tutorial for Beginners


Some time ago I was trying to do some analysis on a Medium publication for a personal project. However, data acquisition was a problem, because scraping only the publication’s home page does not guarantee you get all the data you need.

That’s when I found out that every publication has its own archive. You just need to type “/archive” after the publication URL. You can even specify a year, month, and day and find all the stories published on that date. Something like this:

https://publicationURL/archive/year/month/day

And suddenly the problem was solved. A very simple scraper would do the job. In this tutorial, we’ll see how to code a simple but powerful web scraper that can be used on any Medium publication. The concept behind this scraper can also be used to scrape data from lots of different websites, for whatever reason you want.

As we’re scraping a Medium publication, nothing better than using The Startup as the example. According to them, The Startup is the largest active Medium publication, with over 700K followers, and therefore it should be a great source of data. In this article, you’ll see how to scrape all the articles they published in 2019 and how this data can be useful.

Web Scraping

Web scraping is the process of gathering data from websites using automated scripts. It consists of three main steps: fetching the page, parsing the HTML, and extracting the information you need.

The third step is the one that can be a bit tricky at first. It consists mostly of finding the parts of the HTML that contain the information you want. You can find them by opening the page you want to scrape and pressing the F12 key on your keyboard. Then you can select an element of the page to inspect. You can see this in the image below.

Then all you need to do is use the tags and classes in the HTML to tell the scraper where to find the information. You need to do this for every part of the page you want to scrape. You’ll see this better in the code.
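
To make these three steps concrete, here’s a minimal sketch; the class name some-class is just a placeholder, not one of Medium’s real classes:

import requests
from bs4 import BeautifulSoup

# 1. Fetch the page
page = requests.get('https://medium.com/swlh/archive/2019/01/01')

# 2. Parse the HTML
soup = BeautifulSoup(page.text, 'html.parser')

# 3. Extract the information, using the tag and class found while inspecting the page
element = soup.find('div', class_='some-class')  # 'some-class' is a placeholder
if element is not None:
    print(element.text)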

The Code

As this is a simple scraper, we’ll only use requests, BeautifulSoup, and pandas. Requests will be used to get the pages we need, while BeautifulSoup parses the HTML. Pandas will be used to store the data in a DataFrame and then export it as a .csv file. So we’ll begin by importing these libraries and initializing an empty list to store the data. This list will be filled with other lists.

import pandas as pd
from bs4 import BeautifulSoup
import requests

stories_data = []

As mentioned earlier, the Medium archive stores the stories by date of publication. Since we intend to scrape every story published in The Startup in 2019, we need to iterate over every day of every month of that year. We’ll use nested `for` loops to iterate over the months of the year and then over the days of each month. To do that, we need to account for the number of days in each month. It’s also important to make sure all days and months are represented by two-digit numbers.

for month in range(1, 13):
    if month in [1, 3, 5, 7, 8, 10, 12]:
        n_days = 31
    elif month in [4, 6, 9, 11]:
        n_days = 30
    else:
        n_days = 28

    for day in range(1, n_days + 1):

        month, day = str(month), str(day)

        if len(month) == 1:
            month = f'0{month}'
        if len(day) == 1:
            day = f'0{day}'
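
As a side note, the same zero-padding can be written a bit more compactly with str.zfill; a minimal alternative sketch, equivalent to the length checks above:

        # Alternative: zero-pad the strings directly instead of checking their length
        month, day = str(month).zfill(2), str(day).zfill(2)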

And now the scraping begins. We’ll use the month and day to set up the date, which will also be stored along with the scraped data, and, of course, to build the URL for that specific day. Once that is done, we can simply use requests to get the page and parse the HTML with BeautifulSoup.

date = f'{month}/{day}/2019'
url = f'https://medium.com/swlh/archive/2019/{month}/{day}'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

So this is The Startup’s archive page for January 1st, 2019. We can see that each story sits in its own container. What we need to do is grab all these containers. For this, we’ll use the find_all method.

stories = soup.find_all('div', class_='streamItem streamItem--postPreview js-streamItem')

The above code generates a list containing all the story containers on the page. Something like this:

All we need to do now is iterate over it and grab the information we want from each story. We’ll scrape:

  • The author’s URL, from which we can later extract the author’s username if we want to;
  • The reading time;
  • The story title and subtitle;
  • The number of claps and responses;
  • The story URL, taken from the Read more… button.

We’ll first select a box inside the container that I call the author’s box. From this box, we’ll extract the author’s URL and the reading time. And here is our only condition in this scraper: if the container doesn’t show a reading time, we won’t scrape that story and will move on to the next one. That’s because such stories contain only images and one or two lines of text; we’re not interested in those, as we can consider them outliers. We’ll use try and except blocks to handle that.

Besides that, we need to be prepared for a story not having a title or a subtitle (yes, that happens) and, of course, not having claps or responses. The if clauses will do the job of preventing an error from being raised in such situations.

All this scraped information will later be appended to the each_story list, which is initialized inside the loop. This is the code for all of this:

for story in stories:
    each_story = []

    author_box = story.find('div', class_='postMetaInline u-floatLeft u-sm-maxWidthFullWidth')
    author_url = author_box.find('a')['href']

    try:
        reading_time = author_box.find('span', class_='readingTime')['title']
    except:
        continue

    title = story.find('h3').text if story.find('h3') else '-'
    subtitle = story.find('h4').text if story.find('h4') else '-'

    if story.find('button', class_='button button--chromeless u-baseColor--buttonNormal'
                            ' js-multirecommendCountButton u-disablePointerEvents'):
        claps = story.find('button', class_='button button--chromeless u-baseColor--buttonNormal'
                                     ' js-multirecommendCountButton u-disablePointerEvents').text
    else:
        claps = 0

    if story.find('a', class_='button button--chromeless u-baseColor--buttonNormal'):
        responses = story.find('a', class_='button button--chromeless u-baseColor--buttonNormal').text
    else:
        responses = '0 responses'

    story_url = story.find('a', class_='button button--smaller button--chromeless u-baseColor--buttonNormal')['href']
    

Cleaning some data

Before we move on to scraping the text of the story, let’s first do a little cleaning on the reading_time and responses data. Instead of storing these variables as “5 min read” and “5 responses”, we’ll keep only the numbers. These two lines of code will get that done:

reading_time = reading_time.split()[0]
responses = responses.split()[0]
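
For instance, this is what the cleaning does to typical raw values:

# A quick illustration of what the split keeps
print('5 min read'.split()[0])    # '5'
print('11 responses'.split()[0])  # '11'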
    

Back to scraping…

We’ll now scrape the article page. We’ll use requests once more to get the story_url page and BeautifulSoup to parse the HTML. From the article page, we need to find all the section tags, which are where the text of the article lives. We’ll also initialize two new lists, one to store the article’s paragraphs and the other to store the title of each section in the article.

story_page = requests.get(story_url)
story_soup = BeautifulSoup(story_page.text, 'html.parser')

sections = story_soup.find_all('section')
story_paragraphs = []
section_titles = []
    

And now we only need to loop through the sections, and for each section we’ll:

  • Find all paragraphs and append them to the paragraphs list;
  • Find all section titles and append them to the section titles list;
  • Use these two lists to calculate the number of paragraphs and the number of sections in the article, as this might be some useful data to have.

for section in sections:
    paragraphs = section.find_all('p')
    for paragraph in paragraphs:
        story_paragraphs.append(paragraph.text)

    subs = section.find_all('h1')
    for sub in subs:
        section_titles.append(sub.text)

number_sections = len(section_titles)
number_paragraphs = len(story_paragraphs)
    

This will significantly increase the time it takes to scrape everything, but it will also make the final dataset much more valuable.

Storing and exporting the data

The scraping is now done. Everything will now be appended to the each_story list, which in turn will be appended to the stories_data list.

each_story.append(date)
each_story.append(title)
each_story.append(subtitle)
each_story.append(claps)
each_story.append(responses)
each_story.append(author_url)
each_story.append(story_url)
each_story.append(reading_time)
each_story.append(number_sections)
each_story.append(section_titles)
each_story.append(number_paragraphs)
each_story.append(story_paragraphs)

stories_data.append(each_story)
    

As stories_data is now a list of lists, we can easily transform it into a DataFrame and then export the DataFrame to a .csv file. For this last step, as we have a lot of text data, it’s recommended to set the separator to '\t' (tab).

columns = ['date', 'title', 'subtitle', 'claps', 'responses',
           'author_url', 'story_url', 'reading_time (mins)',
           'number_sections', 'section_titles', 'number_paragraphs', 'paragraphs']

df = pd.DataFrame(stories_data, columns=columns)
df.to_csv('1.csv', sep='\t', index=False)
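
As a quick sanity check, the exported file can be read back with the same tab separator (assuming the 1.csv file produced above):

# Load the exported file back, using the same tab separator
df_check = pd.read_csv('1.csv', sep='\t')
print(df_check.shape)             # one row per story, 12 columns
print(df_check.columns.tolist())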
    

The Data

This is what the data looks like:

As you can see, we have scraped data from 21,616 Medium articles. That’s a lot! It also means our scraper accessed almost 22 thousand Medium pages. In fact, counting one archive page for each day of the year, we accessed (21,616 + 365 =) 21,981 pages!

This huge number of requests can be a problem, though. The website we’re scraping may notice the interactions aren’t being made by a human, and we can easily have our IP blocked. There are some workarounds for this. One solution is to insert small pauses in your code, to make the interactions with the server more human-like. We can use the randint function from NumPy and the sleep function to achieve this:

# Import this
import numpy as np
from time import sleep

# Put a few of these lines in different places around the code
sleep(np.random.randint(1, 15))
    

This code will randomly choose a number of seconds between 1 and 14 (the upper bound of np.random.randint is exclusive) for the scraper to pause.

If you’re scraping a very large amount of data, however, even these pauses may not be enough. In that case, you can either build your own infrastructure of IP addresses or, if you want to keep it simple, get in touch with a proxy provider, such as Infatica or others, and they will deal with this problem for you by constantly rotating your IP address so you don’t get blocked.

But what to do with this data?

That’s something you might be asking yourself. Well, there’s always a lot to learn from data. We can perform some analysis to answer simple questions such as:

  • Is the number of publications in The Startup increasing over time?
  • What’s the average length of a story in The Startup?

The charts below can help with these. Notice how the number of stories published per month skyrocketed in the second half of 2019. Also, the stories became around five paragraphs shorter, on average, over the course of the year. And I’m talking about paragraphs, but one could also look at the average number of words or even characters per story. A sketch of how such numbers can be computed from the dataset is shown below.
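
As an illustration, here’s a minimal sketch of how the monthly story counts and average story length could be computed from the exported file (assuming the 1.csv file and column names defined above):

import pandas as pd

# Load the dataset exported earlier
df = pd.read_csv('1.csv', sep='\t')

# Parse the date column (stored as month/day/2019) and count stories per month
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
print(df.groupby(df['date'].dt.month).size())

# Average number of paragraphs per story, by month
print(df.groupby(df['date'].dt.month)['number_paragraphs'].mean())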

And of course, there’s Natural Language Processing (NLP). Yes, we have a lot of text data that we can use for NLP. It’s possible to analyze the kind of stories that are usually published in The Startup, to investigate what makes a story receive more or fewer claps, or even to predict the number of claps and responses a new post may receive.

So yes, there’s a lot to do with the data, but don’t miss the point here. This article is about the scraper, not the scraped data. The main goal is to share how powerful a tool it can be.

Also, this same concept of web scraping can be used for lots of different activities. For example, you can scrape Amazon to keep track of prices, or you can build a dataset of job opportunities by scraping a job search website if you’re looking for a job. The possibilities are endless, so it’s up to you!

If you liked this and think it might be useful to you, you can find the complete code here. If you have any questions or suggestions, or just want to be in touch, feel free to contact me via Twitter or LinkedIn.


