Introducing AutoScraper: A Smart, Fast and Lightweight Web Scraper For Python

Over the past couple of years, web scraping has been one of my daily and most frequently needed tasks. I wondered whether I could make it smart and automatic to save a lot of time. So I made AutoScraper!

The project code is available here. It became the number one trending project on GitHub.

This project is made for automatic web scraping, to make scraping easy.
It takes a URL or the HTML content of a web page, plus a list of sample data that we want to scrape from that page. This data can be text, a URL, or any HTML tag value of that page. It learns the scraping rules and returns the similar elements. You can then use this learned object with new URLs to get similar content, or the exact same element, from those new pages.
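
To make the idea of learning a rule from a sample more concrete, here is a minimal standard-library sketch. It is not AutoScraper's actual algorithm, which this post does not describe. The helper names (`learn_rule`, `apply_rule`) and the two tiny HTML pages are made up for illustration. The sketch records the chain of tags and classes that leads to the sample text, then returns every text node reached by the same chain.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Record every text node together with the tag path leading to it."""

    VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open tags as (tag, class) pairs
        self.items = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return  # void tags never get a closing tag
        self._stack.append((tag, dict(attrs).get("class", "")))

    def handle_endtag(self, tag):
        # Pop back to (and including) the nearest matching open tag.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((tuple(self._stack), text))


def collect(html):
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def learn_rule(html, wanted_text):
    """Return the tag path of the first text node equal to wanted_text."""
    for path, text in collect(html):
        if text == wanted_text:
            return path
    raise ValueError(f"{wanted_text!r} not found in page")


def apply_rule(html, rule):
    """Return every text node whose tag path matches the learned rule."""
    return [text for path, text in collect(html) if path == rule]


# Hypothetical pages, standing in for two pages of the same site.
TRAINING_PAGE = """
<html><body>
  <h1 class="title">Web scraping with Python</h1>
  <ul class="related">
    <li><a class="post">How to call an external command?</a></li>
    <li><a class="post">What are metaclasses in Python?</a></li>
  </ul>
</body></html>
"""
NEW_PAGE = """
<html><body>
  <h1 class="title">Convert bytes to a string</h1>
  <ul class="related">
    <li><a class="post">Does Python have a ternary conditional operator?</a></li>
  </ul>
</body></html>
"""

rule = learn_rule(TRAINING_PAGE, "How to call an external command?")
similar = apply_rule(TRAINING_PAGE, rule)    # all related titles on the sample page
on_new_page = apply_rule(NEW_PAGE, rule)     # same rule reused on another page
print(similar)
print(on_new_page)
```

One sample title is enough here because every related title sits under the same chain of tags; the page heading is skipped because its path differs.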

Installation

It’s compatible with Python 3.

  • Install latest version from git repository using pip:
$ pip install git+https://github.com/alirezamika/autoscraper.git
  • Install from PyPI:
$ pip install autoscraper
  • Install from source:
$ python setup.py install

How to use

Getting similar results

Say we want to fetch all related post titles on a Stack Overflow page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["How to call an external command?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

Here’s the output:

[
    'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?', 
    'How to call an external command?', 
    'What are metaclasses in Python?', 
    'Does Python have a ternary conditional operator?', 
    'How do you remove duplicates from a list whilst preserving order?', 
    'Convert bytes to a string', 
    'How to get line count of a large file cheaply in Python?', 
    "Does Python have a string 'contains' substring method?", 
    'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]

Now you can use the scraper object to get related topics of any Stack Overflow page:

scraper.get_result_similar('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')

Getting exact result

Say we want to scrape live stock prices from Yahoo Finance:

from autoscraper import AutoScraper

url = 'https://finance.yahoo.com/quote/AAPL/'

wanted_list = ["124.81"]

scraper = AutoScraper()

# Here we can also pass html content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)
print(result)

Note that you should update the wanted_list if you want to copy this code, as the content of the page changes dynamically.

You can also pass any custom requests module parameter. For example, you may want to use proxies or custom headers:

proxies = {
    "http": 'http://127.0.0.1:8001',
    "https": 'https://127.0.0.1:8001',
}

result = scraper.build(url, wanted_list, request_args=dict(proxies=proxies))
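
Custom headers can be passed the same way as proxies. The sketch below assumes, as stated above, that `request_args` is handed on to the requests module. The helper names, the `timeout` value, and the User-Agent string are illustrative choices, not part of AutoScraper.

```python
def make_request_args(user_agent, timeout=10):
    """Return a fresh dict of keyword arguments for the requests module."""
    return {
        "headers": {"User-Agent": user_agent},
        "timeout": timeout,  # seconds; a standard requests parameter
    }


def build_with_headers(scraper, url, wanted_list, user_agent):
    """Call scraper.build with custom headers; scraper is an AutoScraper()."""
    return scraper.build(
        url,
        wanted_list,
        request_args=make_request_args(user_agent),
    )


# Usage (needs network access and the autoscraper package):
# from autoscraper import AutoScraper
# result = build_with_headers(
#     AutoScraper(),
#     'https://finance.yahoo.com/quote/AAPL/',
#     ["124.81"],                    # update to the value currently on the page
#     user_agent="my-scraper/0.1",   # placeholder value
# )
```

Building a new dict on every call keeps separate requests from sharing mutable state.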

Now we can get the price of any symbol:

scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/')

You may want to get other info as well. For example, if you want the market cap too, you can just append it to the wanted list. The get_result_exact method retrieves the data in exactly the same order as the wanted list.
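
Because the exact results come back in wanted-list order, they can be paired with labels. The following is a small sketch of that idea; the field names and the helper functions are hypothetical, and the length check is there in case a page yields fewer values than expected.

```python
def label_exact_results(fields, values):
    """Pair field names with values returned in wanted-list order."""
    if len(fields) != len(values):
        raise ValueError(
            f"expected {len(fields)} values, got {len(values)}"
        )
    return dict(zip(fields, values))


def fetch_quote(scraper, url, fields=("price", "market_cap")):
    """Scrape one page with a built scraper and label the results."""
    values = scraper.get_result_exact(url)
    return label_exact_results(fields, values)


# Usage (needs network access and the autoscraper package):
# from autoscraper import AutoScraper
# scraper = AutoScraper()
# # Both sample values are placeholders; copy the current ones from the page.
# scraper.build('https://finance.yahoo.com/quote/AAPL/', ["124.81", "2.1T"])
# print(fetch_quote(scraper, 'https://finance.yahoo.com/quote/MSFT/'))
```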

Another example: say we want to scrape the about text, the number of stars, and the link to the issues page of GitHub repo pages:

from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'

wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '2.5k', 'https://github.com/alirezamika/autoscraper/issues']

scraper = AutoScraper()
scraper.build(url, wanted_list)

Simple, right?

Saving the model

We can now save the built model to use it later. To save:

# Give it a file path
scraper.save('yahoo-finance')

And to load:

scraper.load('yahoo-finance')
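
Saving and loading can be combined so that the model is built only once. This is a sketch of one possible pattern, using only the `build`, `save` and `load` calls shown above; the `load_or_build` helper is a hypothetical name.

```python
import os


def load_or_build(scraper, model_path, url, wanted_list):
    """Load a saved model if one exists, otherwise build and save it."""
    if os.path.exists(model_path):
        scraper.load(model_path)
        return scraper
    scraper.build(url, wanted_list)
    scraper.save(model_path)
    return scraper


# Usage (needs network access and the autoscraper package):
# from autoscraper import AutoScraper
# scraper = load_or_build(
#     AutoScraper(),
#     'yahoo-finance',
#     'https://finance.yahoo.com/quote/AAPL/',
#     ["124.81"],  # update to the value currently on the page
# )
# print(scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/'))
```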

Tutorials

  • See this gist for more advanced usages.
  • To demonstrate the power of AutoScraper, I wrote a tutorial for creating an API from any website in less than 5 minutes and fewer than 20 lines of code. You can read it here.

Thanks

Feel free to contact me if you have any questions.

Happy Coding ♥️
