BeautifulSoup vs lxml: A Practical Performance Comparison

Introduction

When it comes to HTML parsing in Python, two libraries dominate the ecosystem: BeautifulSoup (commonly referred to as bs4) and lxml. Both are powerful and easy to use—but performance-wise, they're worlds apart.

So why does bs4, known to be significantly slower, remain so popular?

From my experience working on large-scale web scraping projects, I often start with bs4 because of its gentle learning curve and forgiving syntax. But once performance becomes critical, say when scraping tens of thousands of pages, I switch to lxml. In one real-world project, this change alone cut my total scrape time from nearly 3 hours to about 20 minutes, a saving of well over two and a half hours, simply by swapping the parser.

Let’s walk through a simple benchmark that demonstrates this difference.

Benchmark Setup

We'll extract rows from a Wikipedia table using both bs4 and lxml, and measure how long each method takes when repeated 100 times.

Step 1: Get HTML from a Wikipedia page

import requests

def get_html():
    # Fetch the page once; every benchmark iteration reuses this string
    url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
    r = requests.get(url)
    return r.text
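
One practical note: network latency can dwarf parse time, so for a fair benchmark you want to fetch the page exactly once. If you plan to rerun the benchmark, a small disk cache helps too. Here is a minimal sketch of that idea (the cache file name page.html is an arbitrary choice, not part of the original script):

import os
import requests

CACHE_PATH = 'page.html'  # arbitrary local cache file

def get_html_cached():
    # Reuse the local copy if we already downloaded the page
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, encoding='utf-8') as f:
            return f.read()
    url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
    r = requests.get(url)
    r.raise_for_status()  # fail loudly on HTTP errors
    with open(CACHE_PATH, 'w', encoding='utf-8') as f:
        f.write(r.text)
    return r.text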

Step 2: Define two parsing functions

Using BeautifulSoup:

from bs4 import BeautifulSoup as BSoup

def bs_scraping(page_source, parser):
    bs_obj = BSoup(page_source, parser)
    # First <table> on the page; find_all('tr') searches all descendants,
    # so rows nested inside <tbody> are included
    rows = bs_obj.find_all('table')[0].find_all('tr')
    return rows
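
To sanity-check what the function returns, you can pull the cell text out of a row. A quick illustration (row index 1 is just an example; row 0 is typically the header row):

rows = bs_scraping(page_source, 'lxml')
# get_text(strip=True) trims whitespace inside each cell
cells = [cell.get_text(strip=True) for cell in rows[1].find_all(['th', 'td'])]
print(cells[:3])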

Using lxml:

from lxml import html

def lxml_scraping(page_source):
    tree = html.fromstring(page_source)
    table = tree.xpath('//*[@id="mw-content-text"]/div/table[1]')[0]
    # './/tr' searches descendants, matching bs4's find_all('tr');
    # a bare 'tr' would miss rows nested inside <tbody>, which Wikipedia tables use
    rows = table.findall('.//tr')
    return rows
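
The lxml equivalent of the cell-text check above uses text_content() where bs4 uses get_text(); the XPath union './th|./td' picks up the row's cells in document order:

rows = lxml_scraping(page_source)
# text_content() concatenates all text inside the cell
cells = [cell.text_content().strip() for cell in rows[1].xpath('./th|./td')]
print(cells[:3])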

Step 3: Benchmark each parser

from datetime import datetime

# Number of times to parse the HTML
repeats = 100

# Load HTML only once
page_source = get_html()

# List of parsers to test with BeautifulSoup
bs_parsers = ['lxml', 'html.parser', 'html5lib']

# Run BeautifulSoup benchmarks
for parser in bs_parsers:
    bs_start = datetime.now()
    for _ in range(repeats):
        bs_result = bs_scraping(page_source, parser)
    bs_duration = datetime.now() - bs_start
    print(f'BeautifulSoup ({parser}) time: {bs_duration}')

# Run lxml benchmark
lxml_start = datetime.now()
for _ in range(repeats):
    lxml_result = lxml_scraping(page_source)
lxml_duration = datetime.now() - lxml_start
print(f'lxml time: {lxml_duration}')
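
Timing with datetime works, but the standard library's timeit module is built for exactly this job: it uses a high-resolution clock and handles the repetition loop for you. A sketch of the same measurement using timeit (an alternative, not part of the original script):

import timeit

for parser in bs_parsers:
    seconds = timeit.timeit(lambda: bs_scraping(page_source, parser), number=repeats)
    print(f'BeautifulSoup ({parser}): {seconds:.3f}s total')

seconds = timeit.timeit(lambda: lxml_scraping(page_source), number=repeats)
print(f'lxml: {seconds:.3f}s total')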

Results

Here's what we get from the benchmark:

BeautifulSoup (lxml) time:        0:00:07.552425
BeautifulSoup (html.parser) time: 0:00:11.793648
BeautifulSoup (html5lib) time:    0:00:22.352544
lxml time:                        0:00:00.658722

Conclusion

While BeautifulSoup shines in its simplicity and tolerance for poorly formed HTML, it's not the best choice for high-volume scraping tasks. Its abstraction comes at a steep performance cost: even with the fast lxml parser underneath, bs4 was roughly 11× slower than raw lxml in this benchmark.

For quick scripts, learning, or dealing with broken HTML, bs4 is great. But if you're processing tens of thousands of pages? Switch to lxml. You'll save hours—literally.
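
If what keeps you on bs4 is the selector ergonomics rather than the leniency, lxml can meet you halfway: with the third-party cssselect package installed, lxml elements gain a cssselect() method that accepts CSS selectors and compiles them to XPath under the hood, so the speed stays close to raw lxml. A small sketch under that assumption:

from lxml import html

tree = html.fromstring(page_source)
# First table inside the content container, queried with CSS instead of XPath
table = tree.cssselect('#mw-content-text table')[0]
rows = table.findall('.//tr')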

Let your tools work for you, not against you.

Full Code

from datetime import datetime

import requests
from bs4 import BeautifulSoup as BSoup
from lxml import html


def get_html():
    url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
    r = requests.get(url)
    return r.text


def bs_scraping(page_source, parser):
    bs_obj = BSoup(page_source, parser)
    rows = bs_obj.find_all('table')[0].find_all('tr')
    return rows


def lxml_scraping(page_source):
    tree = html.fromstring(page_source)
    table = tree.xpath('//*[@id="mw-content-text"]/div/table[1]')[0]
    rows = table.findall('.//tr')
    return rows


if __name__ == '__main__':
    repeats = 100
    page_source = get_html()
    bs_parsers = ['lxml', 'html.parser', 'html5lib']
    for parser in bs_parsers:
        bs_start = datetime.now()
        for _ in range(repeats):
            bs_result = bs_scraping(page_source, parser)
        bs_duration = datetime.now() - bs_start
        print(f'BeautifulSoup ({parser}) time: {bs_duration}')

    lxml_start = datetime.now()
    for _ in range(repeats):
        lxml_result = lxml_scraping(page_source)
    lxml_duration = datetime.now() - lxml_start
    print(f'lxml time: {lxml_duration}')