Scraping Webpages in Python With Beautiful Soup: The Basics

In a previous tutorial, I showed you how to use the Requests module to access webpages using Python. The tutorial covered a lot of topics, like making GET/POST requests and downloading things like images or PDFs programmatically. The one thing missing from that tutorial was a guide to scraping webpages you accessed using Requests to extract the information that you need.

In this tutorial, you will learn about Beautiful Soup, which is a Python library for extracting data from HTML and XML files. The focus in this tutorial will be on learning the basics of the library, and more advanced topics will be covered in the next tutorial. Please note that this tutorial uses Beautiful Soup 4 for all the examples.

Installation

You can install Beautiful Soup 4 using pip. The package name is beautifulsoup4. Current releases support only Python 3; the last release that also works on Python 2 is 4.9.3.

$ pip install beautifulsoup4

If you don’t have pip installed on your system, you can directly download the Beautiful Soup 4 source tarball and install it using setup.py.

$ python setup.py install

Older releases of Beautiful Soup were packaged as Python 2 code. When you install one of them for use with Python 3, the code is automatically converted to Python 3 code. The conversion happens only when you install the package. Here are a few common errors that you might notice:

The “No module named HTMLParser” ImportError occurs when you are running the Python 2 version of the code under Python 3.
The “No module named html.parser” ImportError occurs when you are running the Python 3 version of the code under Python 2.

Both the errors above can be corrected by uninstalling and reinstalling Beautiful Soup.
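After reinstalling, you can quickly confirm that the import works and see which version is active. This is only a small sanity check, and the version string it prints will depend on your environment.

```python
import bs4

# Print the installed Beautiful Soup version. Some features used later in
# this tutorial need a reasonably recent 4.x release.
print(bs4.__version__)
```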

Installing a Parser

Before discussing the differences between different parsers that you can use with Beautiful Soup, let’s write the code to create a soup.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "html.parser")

The BeautifulSoup constructor accepts two main arguments. The first argument is the actual markup, and the second argument is the name of the parser that you want to use. The available parsers are html.parser, lxml, and html5lib. The lxml parser has two versions: an HTML parser and an XML parser.

The html.parser is a built-in parser, so it needs no extra installation, but it is less lenient in older versions of Python (before 2.7.3 and 3.2.2). You can install the other parsers using the following commands:

$ pip install lxml
$ pip install html5lib

The lxml parser is very fast and can be used to quickly parse given HTML. On the other hand, the html5lib parser is very slow, but it is also extremely lenient. Here is an example of using each of these parsers:

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "html.parser")
print(soup)
# <html><p>This is <b>invalid HTML</b></p></html>

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "lxml")
print(soup)
# <html><body><p>This is <b>invalid HTML</b></p></body></html>

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "xml")
print(soup)
# <?xml version="1.0" encoding="utf-8"?>
# <html><p>This is <b>invalid HTML</b></p></html>

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "html5lib")
print(soup)
# <html><head></head><body><p>This is <b>invalid HTML</b></p></body></html>

The differences outlined by the above example only matter when you are parsing invalid HTML. However, most of the HTML on the web is malformed, and knowing these differences will help you in debugging some parsing errors and deciding which parser you want to use in a project. Generally, the lxml parser is a very good choice.

Objects in Beautiful Soup

Beautiful Soup parses the given HTML document into a tree of Python objects. There are four main Python objects that you need to know about: Tag, NavigableString, BeautifulSoup, and Comment.

The Tag object refers to an actual XML or HTML tag in the document. You can access the name of a tag using tag.name. You can also set a tag’s name to something else. The name change will be visible in the markup generated by Beautiful Soup.

You can access different attributes like the class and id of a tag using tag['class'] and tag['id'] respectively. You can also access the whole dictionary of attributes using tag.attrs. You can also add, remove, or modify a tag’s attributes. Attributes like an element’s class, which can take multiple values, are stored as a list.
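Here is a minimal sketch of reading and changing a tag's name and attributes. The markup and the new values are made up purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup('<p class="intro lead" id="first">Hello</p>', "html.parser")
tag = soup.p

print(tag.name)      # p
print(tag["class"])  # ['intro', 'lead']  (multi-valued, so it is a list)
print(tag["id"])     # first

# Rename the tag, modify one attribute, and remove another.
tag.name = "div"
tag["id"] = "renamed"
del tag["class"]

print(soup)          # <div id="renamed">Hello</div>
```

The changes are made on the parsed tree, so they show up as soon as you print the soup again.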

The text within a tag is stored as a NavigableString in Beautiful Soup. It has a few useful methods like replace_with("string") to replace the text within a tag. You can also convert a NavigableString to a regular Python string using str() (or unicode() on Python 2).
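The short sketch below shows a NavigableString being converted to a plain string and then replaced. It assumes Python 3, where the built-in str() does the conversion, and the markup is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>old text</b>", "html.parser")

text = soup.b.string
print(type(text).__name__)   # NavigableString

# Convert to a plain Python string that no longer references the tree.
plain = str(text)
print(type(plain).__name__)  # str

# Replace the text inside the <b> tag.
text.replace_with("new text")
print(soup.b)                # <b>new text</b>
```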

Beautiful Soup also allows you to access the comments in a webpage. These comments are stored as a Comment object, which is also basically a NavigableString.
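As a small illustration, the following sketch pulls a comment out of some made-up markup. Filtering with isinstance() is just one possible way to collect the comments.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p><!-- a hidden note -->Visible text</p>", "html.parser")

# The first child of the paragraph is the comment itself.
first = soup.p.contents[0]
print(isinstance(first, Comment))  # True
print(first)                       # prints the comment text without the <!-- --> markers

# Collect every comment in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(len(comments))               # 1
```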

You have already learned about the BeautifulSoup object in the previous section. It is used to represent the document as a whole. Since it does not correspond to an actual tag, it does not have a real name or any attributes.
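You can check this yourself with a tiny document. In the sketch below, the .name of the BeautifulSoup object is the special placeholder "[document]" rather than a real tag name.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hi</p>", "html.parser")

print(soup.name)   # [document]
print(soup.attrs)  # {}
```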

Getting the Title, Headings, and Links

You can extract the page title and other such data very easily using Beautiful Soup. Let’s scrape the Wikipedia page about Python. First, you will have to get the markup of the page using the following code based on the Requests module tutorial to access webpages.

import requests
from bs4 import BeautifulSoup

req = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
soup = BeautifulSoup(req.text, "lxml")

Now that you have created the soup, you can get the title of the webpage using the following code:

soup.title
# <title>Python (programming language) - Wikipedia</title>

soup.title.name
# 'title'

soup.title.string
# 'Python (programming language) - Wikipedia'

You can also scrape the webpage for other information like the main heading or the first paragraph, their classes, or the id attribute.

soup.h1
# <h1 class="firstHeading" id="firstHeading" lang="en">Python (programming language)</h1>

soup.h1.string
# 'Python (programming language)'

soup.h1['class']
# ['firstHeading']

soup.h1['id']
# 'firstHeading'

soup.h1.attrs
# {'class': ['firstHeading'], 'id': 'firstHeading', 'lang': 'en'}

soup.h1['class'] = ['firstHeading', 'mainHeading']
soup.h1.string.replace_with("Python - Programming Language")
del soup.h1['lang']
del soup.h1['id']

soup.h1
# <h1 class="firstHeading mainHeading">Python - Programming Language</h1>

Similarly, you can iterate through all the links or subheadings in a document using the following code:

for sub_heading in soup.find_all('h2'):
    print(sub_heading.text)

# all the sub-headings like Contents, History[edit]...
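Links can be collected in the same way. The sketch below uses a small, made-up document instead of the live Wikipedia page so that it runs offline. It reads href with .get(), which returns None for anchors that have no href instead of raising a KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for a real page.
markup = """
<h2>History</h2>
<p><a href="/wiki/ABC">ABC</a> and <a name="anchor-only">a bare anchor</a></p>
<h2>Features</h2>
<p><a href="/wiki/Readability">readability</a></p>
"""

soup = BeautifulSoup(markup, "html.parser")

# .get() avoids a KeyError when an <a> tag has no href attribute.
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)     # ['/wiki/ABC', None, '/wiki/Readability']

headings = [heading.get_text() for heading in soup.find_all("h2")]
print(headings)  # ['History', 'Features']
```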

Handling Multi-Valued and Duplicate Attributes

Different elements in an HTML document use a variety of attributes for different purposes. For example, you can add class or id attributes to style, group, or identify elements. Similarly, you can use data attributes to store any additional information. Not all attributes can accept multiple values, but a few can. The HTML specification has a clear set of rules for these situations, and Beautiful Soup tries to follow them all. However, it also allows you to specify how you want to handle the data returned by multi-valued attributes. This feature was added in version 4.8, so make sure that you have installed the right version before using it.

By default, attributes like class which can have multiple values will return a list, but ones like id will return a single string value. You can pass an argument called multi_valued_attributes in the BeautifulSoup constructor with its value set to None. This will make sure that the value returned by all the attributes is a string.

Here is an example:

from bs4 import BeautifulSoup

markup = '''
<a class="notice light" id="recent-posts" data-links="1 5 20" href="/recent-posts/">Recent Posts</a>
'''

soup = BeautifulSoup(markup, 'html.parser')
print(soup.a['class'])
print(soup.a['id'])
print(soup.a['data-links'] + "\n")
''' 
Output:
['notice', 'light']
recent-posts
1 5 20
'''


soup = BeautifulSoup(markup, 'html.parser', multi_valued_attributes=None)

print(soup.a['class'])
print(soup.a['id'])
print(soup.a['data-links'] + "\n")
'''
Output:
notice light
recent-posts
1 5 20
'''

There is no guarantee that the HTML you get from different websites will always be completely valid. It could have many different issues, like duplicated attributes. Starting from version 4.9.1, Beautiful Soup allows you to specify what should be done in such situations by setting a value for the on_duplicate_attribute argument. Different parsers handle this issue differently, and you will need to use the built-in html.parser to force a specific behavior.

from bs4 import BeautifulSoup

markup = '''
<a class="notice light" href="/recent-posts/" class="important dark">Recent Posts</a>
'''

soup = BeautifulSoup(markup, 'lxml')
print(soup.a['class'])
# ['notice', 'light']

soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='ignore')
print(soup.a['class'])
# ['notice', 'light']

soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='replace')
print(soup.a['class'])
# ['important', 'dark']

Navigating the DOM

You can navigate through the DOM tree using regular tag names. Chaining those tag names can help you navigate the tree more deeply. For example, you can get the first link in the first paragraph of a given Wikipedia page by using soup.p.a. All the links in the first paragraph can be accessed by using soup.p.find_all('a').
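Here is a minimal sketch of this kind of chained navigation on a small, made-up snippet rather than a full Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two paragraphs.
markup = "<div><p>Read <a href='/a'>A</a> or <a href='/b'>B</a>.</p><p><a href='/c'>C</a></p></div>"
soup = BeautifulSoup(markup, "html.parser")

# soup.p is the first <p>, and soup.p.a is the first <a> inside it.
first_link = soup.p.a
print(first_link)  # <a href="/a">A</a>

# find_all() on a tag only searches inside that tag.
links_in_first_p = soup.p.find_all("a")
print([a["href"] for a in links_in_first_p])  # ['/a', '/b']
```

Note that the link in the second paragraph is not included, because the search starts from the first paragraph only.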

You can also access all the children of a tag as a list by using tag.contents. To get the child at a specific index, you can use tag.contents[index]. You can also iterate over a tag's children by using the .children attribute.

Both .children and .contents are useful only when you want to access the direct or first-level descendants of a tag. To get all the descendants, you can use the .descendants attribute.

print(soup.p.contents)
# [<b>Python</b>, ' is a widely used ',.....the full list]

print(soup.p.contents[10])
# <a href="/wiki/Readability" title="Readability">readability</a>

for child in soup.p.children:
    print(child.name)
# b
# None
# a
# None
# a
# None
# ... and so on.
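The difference between .children and .descendants is easier to see on a small nested snippet. The markup below is invented for this comparison, and the isinstance() filter is only there to keep the output short by skipping text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    "<p><b>Python</b> is <a href='#'><i>readable</i></a></p>", "html.parser"
)

# Direct children only: the nested <i> is not included.
child_tags = [c.name for c in soup.p.children if isinstance(c, Tag)]
print(child_tags)       # ['b', 'a']

# All descendants, at any depth.
descendant_tags = [d.name for d in soup.p.descendants if isinstance(d, Tag)]
print(descendant_tags)  # ['b', 'a', 'i']

# Text nodes count as descendants too: 3 tags + 3 strings.
print(len(list(soup.p.descendants)))  # 6
```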

You can also access the parent of an element using the .parent attribute. Similarly, you can access all the ancestors of an element using the .parents attribute. The parent of the top-level <html> tag is the BeautifulSoup object itself, and its parent is None.

print(soup.p.parent.name)
# div

for parent in soup.p.parents:
    print(parent.name)
# div
# div
# div
# body
# html
# [document]

You can access the previous and next sibling of an element using the .previous_sibling and .next_sibling attributes.

For two elements to be siblings, they should have the same parent. This means that the first child of an element will not have a previous sibling. Similarly, the last child of the element will not have a next sibling. In actual webpages, the previous and next siblings of an element will most probably be a newline character.

You can also iterate over all the siblings of an element using .previous_siblings and .next_siblings.

soup.head.next_sibling
# '\n'

soup.p.a.next_sibling
# ' for '

soup.p.a.previous_sibling
# ' is a widely used '

print(soup.p.b.previous_sibling)
# None
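Iterating over siblings works along the same lines. The following sketch uses a made-up list with no whitespace between the items, so every sibling is a tag rather than a newline string.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser")

first = soup.li
last = soup.find_all("li")[-1]

# Siblings that come after the first item, in document order.
following = [sibling.get_text() for sibling in first.next_siblings]
print(following)  # ['two', 'three']

# Siblings before the last item, starting with the nearest one.
preceding = [sibling.get_text() for sibling in last.previous_siblings]
print(preceding)  # ['two', 'one']

print(first.previous_sibling)  # None
```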

You can go to the element that comes immediately after the current element using the .next_element attribute. To access the element that comes immediately before the current element, use the .previous_element attribute.

Similarly, you can iterate over all the elements that come before and after the current element using .previous_elements and .next_elements respectively.
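The difference from the sibling attributes is that these follow the order in which the parser encountered things, regardless of nesting. A minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b> tail</p><p>second</p>", "html.parser")
b = soup.b

# The next sibling is the text that follows the closing </b> tag...
print(repr(b.next_sibling))   # ' tail'

# ...but the next element is whatever was parsed right after the opening
# <b> tag, which is the string inside it.
print(repr(b.next_element))   # 'bold'

# The element parsed just before <b> is its parent, the first <p> tag.
print(b.previous_element is soup.p)  # True

# Everything that was parsed after <b>, in order.
print([str(element) for element in b.next_elements])
# ['bold', ' tail', '<p>second</p>', 'second']
```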

Parsing Only Part of a Document

Let's say that you need to process a large amount of data when looking for something specific, and it's important for you to save some processing time or memory. In that case, you can take advantage of the SoupStrainer class in Beautiful Soup. This class allows you to only focus on specific elements, while ignoring the rest of the document. For example, you can use it to ignore everything else on the webpage besides images by passing the appropriate filters to the SoupStrainer constructor. These are the same kind of filters that find_all() accepts, such as a tag name or attribute values.

Keep in mind that SoupStrainer will not work with the html5lib parser. However, you can use it with both lxml and the built-in parser. Here's an example where we parse the Wikipedia page for the United States and get all the images with the class thumbimage.

import requests
from bs4 import BeautifulSoup, SoupStrainer

req = requests.get('https://en.wikipedia.org/wiki/United_States')

thumb_images = SoupStrainer(class_="thumbimage")

soup = BeautifulSoup(req.text, "lxml", parse_only=thumb_images)

for image in soup.find_all("img"):
    print(image['src'])
'''
Output:
//upload.wikimedia.org/wikipedia/commons/thumb/7/7b/Mesa_Verde_National_Park_-_Cliff_Palace.jpg/220px-Mesa_Verde_National_Park_-_Cliff_Palace.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Map_of_territorial_growth_1775.svg/260px-Map_of_territorial_growth_1775.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Declaration_of_Independence_%281819%29%2C_by_John_Trumbull.jpg/220px-Declaration_of_Independence_%281819%29%2C_by_John_Trumbull.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/9/94/U.S._Territorial_Acquisitions.png/310px-U.S._Territorial_Acquisitions.png
...and many more images
'''

You should note that I used class_ instead of class to get these elements because class is a reserved keyword in Python.
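If you want to try SoupStrainer without making a network request, here is a minimal offline sketch. The markup, class names, and image paths are all made up, and it uses the built-in html.parser so that no extra parser needs to be installed. It also passes a tag name along with class_ to narrow the match to img tags.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page with two thumbnails and one unrelated icon.
markup = """
<html><body>
<h1>Gallery</h1>
<img class="thumbimage" src="/img/a.jpg">
<p>Some text <img class="icon" src="/img/icon.png"></p>
<img class="thumbimage" src="/img/b.jpg">
</body></html>
"""

# Keep only <img> tags whose class is "thumbimage" while parsing.
only_thumbs = SoupStrainer("img", class_="thumbimage")
soup = BeautifulSoup(markup, "html.parser", parse_only=only_thumbs)

sources = [img["src"] for img in soup.find_all("img")]
print(sources)  # ['/img/a.jpg', '/img/b.jpg']

# Nothing else from the page made it into the tree.
print(soup.find("h1"))  # None
```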

Final Thoughts

After completing this tutorial, you should now have a good understanding of the main differences between different HTML parsers. You should now also be able to navigate through a webpage and extract important data. This can be helpful when you want to analyze all the headings or links on a given website.

In the next part of the series, you will learn how to use the Beautiful Soup library to search and modify the DOM.