Web scraping Python BeautifulSoup
In this post, you will learn about the Python web scraping tool BeautifulSoup.
The web is a huge source of data that we can collect using web scraping and Python.
Web scraping is the process of extracting data from websites. The extracted data can be content, URLs, contact information, etc., which we can store in a local file or a database. This process can be done manually, by a program called a scraper, or by automated software such as a bot or a web crawler. Web scraping is not always legal, and some sites disallow it in their 'robots.txt' file. Some popular sites provide APIs to access their data in a structured way, but not all websites do. So, we need a web scraper to extract and mine the data and store it in a structured way.
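Before scraping a site, it can help to check what its 'robots.txt' file permits. The following is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the "my-scraper" user agent and the example.com URLs are made-up values so that the example runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-scraper" and the example.com URLs are placeholder values.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/products"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

For a real site, the parser can instead be pointed at the live file with set_url() followed by read(). Note that robots.txt only states the site's crawling rules; it does not settle whether scraping is legal.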
Python is one of the most popular programming languages for web scraping. It provides many libraries that handle crawling-related tasks smoothly. BeautifulSoup is one of the most widely used among them.
BeautifulSoup module
The BeautifulSoup library makes it easy to scrape information from HTML or XML files. Beautiful Soup 4, or bs4, works on Python 3. It supports third-party parsers like html5lib and lxml in addition to Python's built-in html.parser, and parsing is typically faster with lxml. The following command installs the BeautifulSoup module using the pip tool.
pip install bs4
On successful installation, pip prints the following -
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.8.2 bs4-0.0.1 soupsieve-2.0
For web scraping, we first import this module and pass it the content fetched from the URL to create a soup object. The library provides a find_all method to filter data from the web content.
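To see how the soup object and find_all fit together without touching a live site, here is a small sketch that parses an inline HTML string. The HTML snippet and its 'name' and 'price' class names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fetched page (an assumption).
SAMPLE_HTML = """
<html><body>
  <span class="name">Watch A</span>
  <span class="price">100</span>
  <span class="name">Watch B</span>
</body></html>
"""

# Create the soup object using Python's built-in HTML parser.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every tag that matches the tag name and the class filter.
name_spans = soup.find_all("span", class_="name")

# get_text(strip=True) returns the tag's text without surrounding whitespace.
names = [span.get_text(strip=True) for span in name_spans]
print(names)
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.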
Python BeautifulSoup: Scrape the smartwatches from Amazon
Suppose we want to get the names of all the smartwatches, which are in 'span' tags, from the requested URL. The code will be -
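import requests
from bs4 import BeautifulSoup
# Search results page to scrape
URL = 'https://www.amazon.in/s?k=smartwatch&ref=nb_sb_noss'
page = requests.get(URL)
# Parse the fetched HTML with Python's built-in parser
soup = BeautifulSoup(page.content, 'html.parser')
# Find every span whose class attribute matches the product title styling
titles = soup.find_all('span', class_='a-size-medium a-color-base a-text-normal')
for title in titles:
    print(title.text.strip())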
The above code prints the text of each matching span, one smartwatch name per line.
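A somewhat more defensive variant of the script above is sketched below. It assumes that the site may reject requests without a browser-like User-Agent and that the class names on the page may change over time, so it sets a header, a timeout and an error check. The header value and the helper names extract_names and fetch_html are hypothetical choices, not part of the original script.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.amazon.in/s?k=smartwatch&ref=nb_sb_noss"
# Assumed header value; any descriptive User-Agent string could be used.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def extract_names(html):
    """Return the text of every span carrying all three title classes."""
    soup = BeautifulSoup(html, "html.parser")
    # A CSS selector matches the classes regardless of their order.
    spans = soup.select("span.a-size-medium.a-color-base.a-text-normal")
    return [span.get_text(strip=True) for span in spans]


def fetch_html(url):
    """Download the page, raising an exception on HTTP error statuses."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    try:
        for name in extract_names(fetch_html(URL)):
            print(name)
    except requests.RequestException as error:
        print(f"Request failed: {error}")
```

Keeping the parsing in its own function means it can be tried on a saved HTML file without sending a request each time.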