Manually collecting information from a website, whether for commercial purposes or analysis, can keep you awake all night. You have to gather data from several sources and then convert the unstructured content into structured data. With the overwhelming amount of data available on the internet, you can get discouraged before you even start.
You can avoid collecting data manually by using robust web scrapers to automate the extraction of content from websites. Read on to find out more about web scraping and how to create bots with Selenium and BeautifulSoup in Python for this purpose.
Table of Contents
- What is Web Scraping?
- What Kind of Information is Available for Web Scraping?
- Web Scraping Laws You Must Know
- Getting Started With Web Scraping Google Search Results with Selenium and BeautifulSoup Using Python
- Tools You Need to Scrape the Web
- Sample Codes for Web Scraping With Selenium and BeautifulSoup in Python
- Conclusion
What is Web Scraping?
Web scraping (or data scraping) is the automated collection of information from the internet. In particular, it means developing or using software that gathers data from a single website or a small collection of pages, after which a program processes and cleans the scraped data.
Web scraping captures unstructured data and stores it in a structured format, typically a local file that can be modified and examined as required. It speeds up data collection considerably and yields accurate, up-to-date data.
At its simplest, web scraping is a larger-scale version of copying and pasting content from a webpage into an Excel spreadsheet. When people use the phrase “web scraper,” though, they usually mean software that can gather information from numerous websites at once.
Web scraping software (sometimes known as “bots”) is designed to browse websites, scrape the pertinent pages, and extract meaningful data. These bots can quickly retrieve enormous volumes of data by automating this procedure.
For instance, search engine bots will crawl a website, examine its content, and then assign it a ranking.
Additionally, price comparison websites use bots to automatically retrieve product prices and descriptions from affiliated seller websites. In the digital age, where big data—which is continually updating and changing—plays such a significant role, this has obvious advantages.
Web scraping is a technique that may be used to scrape a lot of data from websites. But what kind of data can you scrape, and why do you need to get such vast amounts of data from websites?
What Kind of Information is Available for Web Scraping?
Depending on the kind of business you run, you may need to find pertinent information about your niche to succeed. Web scraping has many applications, notably in the field of data analytics.
Market research firms employ scrapers to get data from social media and internet forums for things like customer sentiment analysis. Some people scrape information from product websites like Amazon or eBay to enhance competitive analysis.
You can scrape:
- Prices
- Tweets from a specific Twitter account or hashtag.
- Contact details such as phone numbers and email addresses.
- Google Maps company listings to produce leads for marketing
- Images and text
- Videos
- Product information, customer sentiments, and reviews
Theoretically, anything that is data can be scraped, especially if it is on a website!
The question of whether web scraping is even legal, though, may be on your mind. There can be legal restrictions on the information you can scrape from specific websites.
Web Scraping Laws You Must Know
Theft of trade secrets, fraud, breach of contract, copyright infringement, and similar issues are among the concerns of the owners of scraped websites. Therefore, you should make sure you adhere to the law if you intend to build a bot for this purpose.
Before 2015, you could easily get away with crawling and obtaining sensitive information from your competitors’ websites.
The Irish airline Ryanair, however, later brought a lawsuit over alleged “screen-scraping” of its website. The case made it to the Court of Justice of the European Union (CJEU), Europe’s top court.
On September 9, 2020, the 9th Circuit Court of Appeals in the United States ruled that the country’s Computer Fraud and Abuse Act (CFAA) does not apply to the scraping of publicly accessible websites.
The American court essentially decided that it is not “theft” for a company to scrape data such as product lots, public user profiles, and ticket prices. Scraping data from a website is acceptable as long as the data is publicly accessible (that is, visible while visiting the site). A company’s terms and conditions may still place some restrictions on scraping activity, however.
Even though it is entirely legal to scrape publicly accessible data, there are two sorts of information you should be wary about.
These are:
1. Copyrighted Data
Data protected by copyright cannot be used without the owner’s consent. This means that while it is not always unlawful to scrape copyrighted content, using that content in a specific way might be.
Remember that national laws might not completely resolve this issue. For example, some jurisdictions might allow you to use only a small portion of the copyrighted content you scraped, while others might not allow any use at all.
2. Personal Information or Personally Identifiable Information (PII)
Many data privacy and protection laws, such as Europe’s General Data Protection Regulation (GDPR) and those of many American states, now cover this subject in great detail.
Without the owner’s express permission, it is generally against the law to acquire, use, or store PII, so scraping such data is effectively unlawful. PII includes:
- Name
- Contact information
- Medical information
- Financial information
- Address
- Date of birth
- Sexual preference
- Ethnicity
Now that you know the essential rules for web scraping, how do you begin?
Getting Started With Web Scraping Google Search Results with Selenium and BeautifulSoup Using Python
Numerous Python libraries offer functions and methods suitable for data transformation and web crawling. Python is simple to program with because it is dynamically typed and lets you accomplish a lot with short, concise code.
As a result, you save time for other demanding tasks.
Even though the precise procedure varies based on the software or tools you’re using, online scraping bots all adhere to three fundamental principles.
- Sending a server an HTTP request
- Downloading and analyzing (or dissecting) the website’s code
- Locating the necessary data
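To make these three steps concrete, here is a minimal sketch using the Requests and BeautifulSoup packages covered later (the URL is just a placeholder, not a real target):
import requests
from bs4 import BeautifulSoup

# 1. send an HTTP request to the server
response = requests.get("https://example.com")

# 2. download and parse (dissect) the page's HTML code
soup = BeautifulSoup(response.text, "html.parser")

# 3. locate the necessary data -- here, the text of every paragraph tag
for paragraph in soup.find_all("p"):
    print(paragraph.text.strip())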
While the bots adhere to those principles, you, as the programmer, must also do the following:
- Choose the website or websites that you want to scrape.
- Inspect the page to see what needs to be scraped. Right-clicking anywhere on a webpage will let you “inspect element” or “view page source.”
- Decide what information you want to extract. Your goal is to locate the distinctive tags that enclose (or “nest”) the pertinent content, such as the “div” tags.
- Write the code that tells the bot where to look and what to extract, using Python packages that handle much of the labor-intensive work.
- The next step after writing the code is to run it.
- After extracting and processing the data, you must store it. You can tell your program to do this with a few additional lines of code, in any format you choose; Excel and CSV formats are by far the most popular (see the sketch below).
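For example, a minimal sketch of that last step, assuming the results were already gathered into a hypothetical list of (title, year) pairs, could use Python’s built-in csv module (CSV files open directly in Excel):
import csv

# hypothetical results gathered earlier by the scraper
rows = [("Movie A", "2001"), ("Movie B", "2005")]

# store the scraped data in a CSV file for later analysis
with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "year"])  # header row
    writer.writerows(rows)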
Tools You Need to Scrape the Web
Web scraping against the DOM (Document Object Model) is implemented with the help of powerful packages such as BeautifulSoup, Selenium, and Requests.
BeautifulSoup
BeautifulSoup is a Python module that makes it easy to extract data from XML and HTML documents.
To swiftly extract DOM elements, BeautifulSoup parses HTML into a tree representation that is easy for a program to traverse. It enables the extraction of particular paragraph and table elements with specific HTML IDs, classes, and CSS selectors.
Large amounts of data are much easier to browse and search through, thanks to BeautifulSoup. For many data analysts, it is their go-to tool.
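As a quick illustration, here is a minimal sketch of that idea with a made-up HTML snippet (the IDs and class names are hypothetical):
from bs4 import BeautifulSoup

# a made-up HTML snippet standing in for a real page
html = """
<html><body>
  <h1 id="page-title">Top Products</h1>
  <div class="product">Laptop</div>
  <div class="product">Phone</div>
</body></html>
"""

soup = BeautifulSoup(html, "lxml")
print(soup.find(id="page-title").text)            # Top Products
for div in soup.find_all("div", class_="product"):
    print(div.text)                               # Laptop, Phone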
Selenium
“Selenium” refers collectively to several open-source browser automation projects. The Selenium API can drive Chrome, Firefox, Safari, and other web browsers, which makes the framework well suited to scraping websites with dynamic content.
Python programs written with Selenium automate web browser interaction. By automating button clicks with Selenium, content loaded by JavaScript can be made available and then collected with BeautifulSoup, as in the sketch below.
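Here is a minimal sketch of that pattern. The URL, button ID, and result class are hypothetical, and chromedriver is assumed to be available on your PATH:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://example.com")

# click a hypothetical button that loads extra content via JavaScript
driver.find_element(By.ID, "load-more").click()

# hand the rendered HTML to BeautifulSoup for parsing
soup = BeautifulSoup(driver.page_source, "lxml")
for item in soup.find_all("div", class_="result"):
    print(item.text.strip())

driver.quit()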
Selenium-stealth
This Python module helps Selenium avoid bot detection, for example when logging into Google accounts, and it also helps keep the reCAPTCHA v3 score consistent.
Note: It is strongly recommended that you practice website inspection before creating a web scraping bot. This is because knowing a website’s structure is essential to understanding what you need to get started. This is crucial since, in the majority of cases, you’ll need the HTML code from the website you want to scrape to create a web scraping application.
With Chrome open, right-click the page and select Inspect. Alternatively, pressing F12 on a PC or Command + Option + I on a Mac produces the same result.
Sample Codes for Web Scraping With Selenium and BeautifulSoup in Python
Install all the necessary dependencies before we begin creating these programs. We’ll use various techniques to build web scraping bots so that you understand different approaches to scraping websites for things like:
- News
- Images
- Video
- Maps
To commence, open your terminal and install the following:
pip install selenium
pip install beautifulsoup4
pip install selenium-stealth
pip install requests
pip install webdriver-manager
pip install lxml
Note: Keep in mind that the details in the examples below may not match your system. You must substitute your own values in some places, such as your browser version and driver path.
Example One
In this example, we will scrape information from Instagram, Twitter, and YouTube.
from bs4 import BeautifulSoup
from selenium import webdriver

# create a function to fetch the accounts following you and those you follow
def get_twitter_stats():
    # you might have to use your own Twitter account address
    twitter_url = 'https://twitter.com/nitrotutorials'
    driver = webdriver.Chrome()
    driver.get(twitter_url)
    content = driver.page_source.encode('utf-8').strip()
    twitter_soup = BeautifulSoup(content, "lxml")
    # note: these generated class names change often; re-inspect the page if this fails
    stats = twitter_soup.findAll("span", class_="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0")
    following = stats[0].text.strip()
    followers = stats[1].text.strip()
    print("Twitter stats: {} followers and {} following".format(followers, following))

# create a function to scrape posts and follower information
def get_instagram_stats():
    # you might have to use your own Instagram account address
    instagram_url = 'https://www.instagram.com/nitrotutorials'
    driver = webdriver.Chrome()
    driver.get(instagram_url)
    content = driver.page_source.encode('utf-8').strip()
    instagram_soup = BeautifulSoup(content, "lxml")
    stats = instagram_soup.findAll("span", class_="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0")
    posts = stats[1].text.strip()
    followers = stats[2].text.strip()
    following = stats[3].text.strip()
    print("Instagram stats: {} posts, {} followers and {} following".format(posts, followers, following))

# create a function to scrape "about" information from a YouTube channel
def get_youtube_stats():
    # you might have to use your own YouTube channel
    youtube_url = 'https://www.youtube.com/channel/UCLMdmCCRFGWt7rkx5tMErq/about?view_as=subscriber'
    driver = webdriver.Chrome()
    driver.get(youtube_url)
    content = driver.page_source.encode('utf-8').strip()
    youtube_soup = BeautifulSoup(content, "lxml")
    views = youtube_soup.findAll(
        "yt-formatted-string", class_="style-scope ytd-channel-about-me"
    )
    for view in views:
        if "views" in view.text:
            views = view.text.strip()
    # find() returns a single element, so .text works here
    subscribers = youtube_soup.find(id="subscriber-count").text.strip()
    print("YouTube stats: {} and {}".format(subscribers, views))

get_twitter_stats()
get_instagram_stats()
get_youtube_stats()
Example Two
from selenium import webdriver
from selenium_stealth import stealth
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
# options.add_argument("start-maximized")
# headless means the browser runs in the background without a window
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# make sure the path points to your chromedriver executable and that its
# version matches your installed Chrome browser
driver = webdriver.Chrome(options=options, executable_path=r"C:\path\to\chromedriver.exe")
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True
        )

# define a search query
query = 'python tutorial for beginners'
links = []
titles = []
# define the number of result pages you want to scrape
n_pages = 15
for page in range(1, n_pages):
    url = "http://www.google.com/search?q=" + \
        query + "&start=" + str((page - 1) * 10)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # look for the links and titles of the result pages
    # note: Google's generated class names change regularly; re-inspect if this breaks
    search = soup.find_all('div', class_="yuRUbf")
    for lt in search:
        links.append(lt.a.get('href'))
        titles.append(lt.a.h3.text)
        # you can append these to an Excel sheet or CSV file as shown in example three

for link in links:
    print(link)
for title in titles:
    print(title)

driver.quit()
Example Three
In this example, we use BeautifulSoup and Requests to show how to scrape a movie website.
from bs4 import BeautifulSoup
import requests, openpyxl

# define an Excel workbook and worksheet
excel = openpyxl.Workbook()
sheet = excel.active
sheet.append(['Movie rank', 'Movie name', 'Year of Release', 'IMDB Rating'])

try:
    # use the requests library to fetch the URL you want to scrape
    source = requests.get('https://www.imdb.com/chart/top/')
    source.raise_for_status()
    soup = BeautifulSoup(source.text, 'html.parser')
    movies = soup.find('tbody', class_="lister-list").find_all('tr')
    print(len(movies))
    for movie in movies:
        name = movie.find('td', class_="titleColumn").a.text
        rank = movie.find('td', class_="titleColumn").get_text(strip=True).split('.')[0]
        year = movie.find('td', class_="titleColumn").span.text.strip('()')
        rating = movie.find('td', class_="ratingColumn imdbRating").strong.text
        print(rank, name, year, rating)
        sheet.append([rank, name, year, rating])
except Exception as e:
    print(e)

excel.save('Top IMDB Ratings.xlsx')
Conclusion
You can use the above examples to collect data from shopping websites and gain insight from Google Maps, images, videos, and other helpful sources. Remember, you must understand how to inspect website content and know your way around HTML classes, IDs, divs, and tags.
Once you can establish a connection with your browser using Selenium, fetching the information you need with BeautifulSoup is a straightforward process.
Tanner Abraham
Data Scientist and Software Engineer with a focus on experimental projects in new budding technologies that incorporate machine learning and quantum computing into web applications.