Modern Web Scraping With Beautiful Soup and Selenium

HTML is almost intuitive. CSS is a great advancement that cleanly separates the structure of a page from its look and feel. JavaScript adds some pizazz. That's the theory. The real world is a little different.

In this tutorial, you'll learn how the content you see in the browser actually gets rendered and how to go about scraping it when necessary. In particular, you'll learn how to scrape dynamic comments. Our tools will be Python and awesome packages like requests, Beautiful Soup, and Selenium.

When Should You Use Web Scraping?

Web scraping is the practice of automatically fetching the content of web pages designed for interaction with human users, parsing them, and extracting some information (possibly navigating links to other pages). It is sometimes necessary if there is no other way to extract the necessary information. Ideally, the application provides a dedicated API for accessing its data programmatically. There are several reasons web scraping should be your last resort:

It is fragile (the web pages you're scraping might change frequently).
It might be forbidden (some web apps have policies against scraping).
It might be slow and expansive (if you need to fetch and wade through a lot of noise).

Understanding Real-World Web Pages

Let's understand what we are up against, by looking at the output of some common web application code. We'll use the article Introduction to Vagrant as an example.

To scrape any content on this page, we need to find the HTML elements containing the content on the page first.

View Page Source

Every browser since the dawn of time (the 1990s) has supported the ability to view the HTML of the current page. Here is a snippet from the view source of Introduction to Vagrant that starts with a huge chunk of minified and uglified JavaScript unrelated to the article itself. Here is a small portion of it:

Here is some actual HTML from the page:

Static Scraping vs. Dynamic Scraping

Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in "view page source", and then you slice and dice it. If the content you're looking for is available, you need to go no further. However, if the content is in a src URL, you need dynamic scraping.

Dynamic scraping uses an actual browser (or a headless browser) and lets JavaScript do its thing. Then, it queries the DOM to extract the content it's looking for. Sometimes you need to automate the browser by simulating a user to get the content you need.

Static Scraping With Requests and Beautiful Soup

Let's see how static scraping works using two awesome Python packages: requests for fetching web pages and Beautiful Soup for parsing HTML pages.

Installing Requests and Beautiful Soup

Install pipenv first, and then: pipenv install requests beautifulsoup4

This will create a virtual environment for you too. If you're using the code from gitlab, you can just pipenv install.

Fetching Pages

Fetching a page with requests is a one-liner: r = requests.get(url)

The response object has a lot of attributes. The most important ones are ok and content. If the request fails then r.ok will be False and r.content will contain the error. The content is a stream of bytes. It is usually better to decode it to utf-8 when dealing with text:

>>> r = requests.get('https://www.c2.com/no-such-page')
>>> r.ok
False
>>> print(r.content.decode('utf-8'))
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /ggg was not found on this server.</p>
<hr>
<address>
Apache/2.0.52 (CentOS) Server at www.c2.com Port 80
</address>
</body></html>

If everything is OK then r.content will contain the requested web page (same as view page source).

Finding Elements With Beautiful Soup

The get_page() function below fetches a web page by URL, decodes it to UTF-8, and parses it into a Beautiful Soup object using the HTML parser.

1	def get_page(url):
2	r = requests.get(url)
3	content = r.content.decode('utf-8')
4	return BeautifulSoup(content, 'html.parser')

Once we have a Beautiful Soup object, we can start extracting information from the page. Beautiful Soup provides many find functions to locate elements inside the page and drill down deep nested elements.

Envato Tuts+ author pages contain multiple tutorials. Here is my author page. On each page, there are up to 12 tutorials. If you have more than 12 tutorials then you can navigate to the next page. The HTML for each article is enclosed in an <article> tag. The following function finds all the article elements on the page, drills down to their links, and extracts the href attribute to get the URL of the tutorial:

1	def get_page_articles(page):
2	elements = page.findAll('article')
3	articles = [e.a.attrs['href'] for e in elements]
4	return articles

The following code gets all the articles from my page and prints them (without the common prefix):

page = get_page('https://tutsplus.com/authors/gigi-sayfan')
articles = get_page_articles(page)
prefix = 'https://code.tutsplus.com/tutorials'
for a in articles:
    print(a[len(prefix):])
    
Output:

docker-from-the-ground-up-building-images--cms-28166
how-to-implement-your-own-data-structure-in-python--cms-28723
professional-error-handling-with-python--cms-25950
write-professional-unit-tests-in-python--cms-25835
how-to-write-package-and-distribute-a-library-in-python--cms-28693
serialization-and-deserialization-of-python-objects-part-1--cms-26183
understand-how-much-memory-your-python-objects-use--cms-25609
fetching-data-in-your-react-application--cms-30670
rest-vs-grpc-battle-of-the-apis--cms-30711
regular-expressions-with-go-part-2--cms-30406
regular-expressions-with-go-part-1--cms-30403
8-things-that-make-jest-the-best-react-testing-framework--cms-30534

Dynamic Scraping With Selenium

Static scraping was good enough to get the list of articles, but if we need to automate the browser and interact with the DOM interactively, one of the best tools for the job is Selenium.

Selenium is primarily geared towards automated testing of web applications, but it is great as a general-purpose browser automation tool.

For those looking to enhance their web scraping capabilities by incorporating proxies, understanding the use of a proxy in Selenium is paramount. This involves configuring your Selenium setup to route requests through different network proxies, which can be crucial for accessing geo-restricted content or simulating requests from various locations around the globe. Learn how to effectively utilize proxy configurations within your Selenium WebDriver to boost scraping efficiency and overcome common web scraping obstacles.

Installing Selenium

Type this command to install Selenium: pipenv install selenium

Choose Your Web Driver

Selenium needs a web driver (the browser it automates). For web scraping, it usually doesn't matter which driver you choose. I prefer the Chrome driver. Follow the instructions in this Selenium guide.

Chrome vs. PhantomJS

In some cases you may prefer to use a headless browser, which means no UI is displayed. Theoretically, PhantomJS is just another web driver. But, in practice, people reported incompatibility issues where Selenium works properly with Chrome or Firefox and sometimes fails with PhantomJS. I prefer to remove this variable from the equation and use an actual browser web driver.

Scraping Dynamic Comments

Let's do some dynamic scraping and use Selenium to scrap the comments from a CodeCanyon JavaScript plugin. Here are the necessary imports.

1	from selenium import webdriver

The next step is to create a webdriver instance and get the comments URL.

1	driver = webdriver.Chrome()
2	driver.get('https://codecanyon.net/item/whatshelp-whatsapp-help-and-support-plugin-for-javascript/42202303/comments')

Once we get the URL, we have to find the element where comments are located. To do that, right-click on the web page and click on Inspect, as shown below.

The comments are enclosed in the commentList id. Right-click on the element and copy the XPath as shown below.

Next, use Selenium to extract the contents of the element containing the comments.

driver = webdriver.Chrome()
driver.get('https://codecanyon.net/item/whatshelp-whatsapp-help-and-support-plugin-for-javascript/42202303/comments')
modules = driver.find_elements('xpath','//*[@id="content"]/div/div[1]/div[4]')
for comment in comments:
    print(comment.text)

Here is the output:

sunsuk
20 days ago
How to integrate to CI scripts?
3 other replies
ThemeAtelier AUTHOR
20 days ago
Yes. You can add 5 staffs. We have proper documentation how can you add it in your website. Please follow the instruction. You have to call usual js and CSS files then just need to put actual markup in the footer of your site.
sunsuk
20 days ago
How to add wa number? Any admin dashboard ?
ThemeAtelier AUTHOR
20 days ago
No admin. It just a script. So everything is code. You just need to change demo values to your own value.
HTML knowledge is require for installing it.
Deezia12
29 days ago
Hi Author,
Does it work for a php website?
ThemeAtelier AUTHOR
24 days ago
Yes. As It will work in php websites.
gigantium
about 1 month ago
This is support for multiuser or for SAAS function?
ThemeAtelier AUTHOR
about 1 month ago
It’s not a SAAS product to use for multiple users. You can use this in your single website with for giving chat support to your customers with single and multi user agents.
gigantium
about 1 month ago
Thanks for your kind response. Can you provide a demo to access the admin dashboard?
ThemeAtelier AUTHOR
about 1 month ago

There are some scenarios where the browser has to load some dynamic content; this can make the scraping process a bit more complicated. Luckily, the Selenium driver provides wait() objects that wait for a certain amount of time before finding the element programmatically.

To use the wait functionality, you first need the following imports:

1	from selenium import webdriver
2	from selenium.webdriver.common.by import By
3	from selenium.webdriver.support.wait import WebDriverWait
4	from selenium.webdriver.support import expected_conditions as EC

Then get the URL and apply the wait() object to the element.

driver = webdriver.Chrome()
url = 'http://www.c2.com/loading-page'
driver.get(url)

element = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((By.ID, "loaded_element"))
)

Conclusion

Web scraping is a useful practice when the information you need is accessible through a web application that doesn't provide an appropriate API. It takes some work to extract data from modern web applications, but mature and well-designed tools like requests, Beautiful Soup, and Selenium make it worthwhile.

This post has been updated with contributions from Esther Vaati. Esther is a software developer and writer for Envato Tuts+.