BeautifulSoup vs lxml: A Practical Performance Comparison
Introduction
When it comes to HTML parsing in Python, two libraries dominate the ecosystem: BeautifulSoup (commonly referred to as bs4) and lxml. Both are powerful and easy to use, but performance-wise they are worlds apart. So why does bs4, known to be significantly slower, remain so popular?
From my experience working on large-scale web scraping projects, I often start with bs4 because of its gentle learning curve and forgiving syntax. But once performance becomes critical, say when scraping tens of thousands of pages, I switch to lxml. In one real-world project, this change alone cut my total scrape time from nearly 3 hours to about 20 minutes, a saving of well over two hours, simply by swapping the parser.
Let’s walk through a simple benchmark that demonstrates this difference.
Benchmark Setup
We'll extract rows from a Wikipedia table using both bs4 and lxml, and measure how long each method takes when repeated 100 times. The only requirements are the requests, beautifulsoup4, lxml, and html5lib packages.
Step 1: Get HTML from a Wikipedia page
import requests

def get_html():
    url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
    r = requests.get(url)
    return r.text
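Since the benchmark parses the same page over and over, it can be polite to cache the HTML locally so Wikipedia is downloaded only once. A minimal sketch; the cache filename wiki_states.html is my own choice, not part of the original code:

import requests
from pathlib import Path

def get_html_cached(cache_path='wiki_states.html'):
    # Reuse a local copy if the page has already been downloaded
    cached = Path(cache_path)
    if cached.exists():
        return cached.read_text(encoding='utf-8')
    url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
    text = requests.get(url).text
    cached.write_text(text, encoding='utf-8')
    return text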
Step 2: Define two parsing functions
Using BeautifulSoup:
from bs4 import BeautifulSoup as BSoup

def bs_scraping(page_source, parser):
    bs_obj = BSoup(page_source, parser)
    rows = bs_obj.find_all('table')[0].find_all('tr')
    return rows
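To see what the function actually returns, you can print the cell text of the first few rows. This is just a quick sketch; it assumes page_source was fetched with get_html() from Step 1 and makes no assumption about the table's column layout:

page_source = get_html()
rows = bs_scraping(page_source, 'lxml')
for row in rows[:3]:
    # Each row holds header (<th>) and/or data (<td>) cells
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    print(cells)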
Using lxml:
from lxml import html

def lxml_scraping(page_source):
    tree = html.fromstring(page_source)
    table = tree.xpath('//*[@id="mw-content-text"]/div/table[1]')[0]
    # './/tr' finds rows nested inside <tbody> too, matching the BeautifulSoup version
    rows = table.findall('.//tr')
    return rows
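If you prefer a single query, the same rows can be pulled out with one XPath expression, and text_content() flattens a cell's text. This is only a sketch that rests on the same page-structure assumption as the XPath above (and it reuses get_html() from Step 1):

def lxml_scraping_xpath(page_source):
    tree = html.fromstring(page_source)
    # One expression: first table in the article body, then every descendant row
    return tree.xpath('//*[@id="mw-content-text"]/div/table[1]//tr')

rows = lxml_scraping_xpath(get_html())
# Iterating a row yields its <th>/<td> cells; text_content() returns their text
print([cell.text_content().strip() for cell in rows[0]])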
Step 3: Benchmark each parser
from datetime import datetime

# Number of times to parse the HTML
repeats = 100

# Load HTML only once
page_source = get_html()

# List of parsers to test with BeautifulSoup
bs_parsers = ['lxml', 'html.parser', 'html5lib']

# Run BeautifulSoup benchmarks
for parser in bs_parsers:
    bs_start = datetime.now()
    for _ in range(repeats):
        bs_result = bs_scraping(page_source, parser)
    bs_duration = datetime.now() - bs_start
    print(f'BeautifulSoup ({parser}) time: {bs_duration}')

# Run lxml benchmark
lxml_start = datetime.now()
for _ in range(repeats):
    lxml_result = lxml_scraping(page_source)
lxml_duration = datetime.now() - lxml_start
print(f'lxml time: {lxml_duration}')
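datetime.now() is fine for a rough comparison, but the standard-library timeit module handles the repetition for you and is less sensitive to one-off hiccups. A sketch that times the same two functions under the same assumptions (page_source and repeats defined above):

import timeit

# Average seconds per parse, reusing the already-loaded page_source
lxml_seconds = timeit.timeit(lambda: lxml_scraping(page_source), number=repeats)
bs_seconds = timeit.timeit(lambda: bs_scraping(page_source, 'lxml'), number=repeats)
print(f'lxml: {lxml_seconds / repeats:.4f} s per parse')
print(f'BeautifulSoup (lxml): {bs_seconds / repeats:.4f} s per parse')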
Results
Here's what we get from the benchmark:
BeautifulSoup (lxml) time: 0:00:07.552425
BeautifulSoup (html.parser) time: 0:00:11.793648
BeautifulSoup (html5lib) time: 0:00:22.352544
lxml time: 0:00:00.658722
Conclusion
While BeautifulSoup shines in its simplicity and tolerance for poorly formed HTML, it's not the best choice for high-volume scraping tasks. Its abstraction comes at a steep performance cost: even with the fast lxml parser as its backend, it is roughly an order of magnitude slower than using lxml directly, because it still builds its own tree of Python objects on top of the parse.
For quick scripts, learning, or dealing with broken HTML, bs4 is great. But if you're processing tens of thousands of pages? Switch to lxml. You'll save hours, literally.
Let your tools work for you, not against you.
Full Code
from datetime import datetime

import requests
from bs4 import BeautifulSoup as BSoup
from lxml import html


def get_html():
    url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
    r = requests.get(url)
    return r.text


def bs_scraping(page_source, parser):
    bs_obj = BSoup(page_source, parser)
    rows = bs_obj.find_all('table')[0].find_all('tr')
    return rows


def lxml_scraping(page_source):
    tree = html.fromstring(page_source)
    table = tree.xpath('//*[@id="mw-content-text"]/div/table[1]')[0]
    # './/tr' finds rows nested inside <tbody> too, matching the BeautifulSoup version
    rows = table.findall('.//tr')
    return rows


if __name__ == '__main__':
    repeats = 100
    page_source = get_html()
    bs_parsers = ['lxml', 'html.parser', 'html5lib']

    for parser in bs_parsers:
        bs_start = datetime.now()
        for _ in range(repeats):
            bs_result = bs_scraping(page_source, parser)
        bs_finish = datetime.now() - bs_start
        print(f'BeautifulSoup ({parser}) time: {bs_finish}')

    lxml_start = datetime.now()
    for _ in range(repeats):
        lxml_result = lxml_scraping(page_source)
    lxml_finish = datetime.now() - lxml_start
    print(f'lxml time: {lxml_finish}')