beautifulsoup 模块

BeautifulSoup4 详解

BeautifulSoup4（简称 BS4）是一个 Python 库，用于从 HTML 和 XML 文件中提取数据。它提供了简单易用的 API，可以轻松地导航、搜索和修改解析树。BeautifulSoup4 是进行网页抓取和数据解析的常用工具。

1. 安装 BeautifulSoup4

要使用 BeautifulSoup4，首先需要安装它和一个解析器库 lxml 或 html.parser：

pip install beautifulsoup4 lxml

2. 创建 BeautifulSoup 对象

你可以通过传入 HTML 或 XML 文本来创建 BeautifulSoup 对象：

from bs4 import BeautifulSoup

html_doc = """
<html>
  <head><title>Test Page</title></head>
  <body>
    <h1>This is a heading</h1>
    <p class="description">This is a paragraph.</p>
    <p class="description">Another paragraph.</p>
    <a href="http://example.com">Link</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

在上面的代码中，我们使用 lxml 解析器来解析 HTML 文本。如果你没有安装 lxml，也可以使用内置的 html.parser 解析器：

soup = BeautifulSoup(html_doc, 'html.parser')

3. 常用操作

BeautifulSoup 提供了多种方式来查找和操作 HTML 文档中的元素。

3.1 查找元素

3.1.1 `find` 和 `find_all`

find 方法用于查找第一个匹配的元素，find_all 方法用于查找所有匹配的元素。

# 查找第一个 <h1> 元素
h1 = soup.find('h1')
print(h1.text)  # 输出: This is a heading

# 查找所有 class 为 description 的 <p> 元素
descriptions = soup.find_all('p', class_='description')
for desc in descriptions:
    print(desc.text)

3.1.2 通过 CSS 选择器查找

你可以使用 select 方法通过 CSS 选择器查找元素：

# 查找所有 <p> 元素
paragraphs = soup.select('p')
for p in paragraphs:
    print(p.text)

# 查找带有 class 的元素
descriptions = soup.select('.description')
for desc in descriptions:
    print(desc.text)

# 查找特定的 <a> 标签
link = soup.select_one('a')
print(link['href'])  # 输出: http://example.com

3.2 操作元素

3.2.1 获取属性

可以通过字典风格的访问获取 HTML 标签的属性值：

link = soup.find('a')
print(link['href'])  # 输出: http://example.com

3.2.2 修改属性

同样可以使用字典风格的访问来修改属性值：

link['href'] = 'http://newlink.com'
print(link)  # 输出: <a href="http://newlink.com">Link</a>

3.2.3 获取和修改文本内容

text 属性可以用来获取元素的文本内容，string 属性可以用来修改文本内容：

# 获取文本内容
print(h1.text)  # 输出: This is a heading

# 修改文本内容
h1.string = "New Heading"
print(h1)  # 输出: <h1>New Heading</h1>

3.3 遍历文档树

BeautifulSoup 允许你遍历文档树，获取元素的父节点、子节点、兄弟节点等。

# 获取父节点
parent = h1.parent
print(parent.name)  # 输出: body

# 获取所有子节点
children = parent.children
for child in children:
    print(child.name)  # 输出: h1, p, p, a

# 获取下一个兄弟节点
next_sibling = h1.find_next_sibling()
print(next_sibling.name)  # 输出: p

4. 处理不完整的 HTML

BeautifulSoup 的一个强大功能是它可以处理不完整或格式错误的 HTML。例如，如果 HTML 文档中缺少闭合标签，BeautifulSoup 仍然可以解析并生成完整的树结构：

broken_html = "<html><head><title>Test"
soup = BeautifulSoup(broken_html, 'html.parser')
print(soup.prettify())

5. 搜索文档树

除了基本的 find 和 find_all，BeautifulSoup 还提供了许多高级搜索功能，如通过函数、正则表达式、列表、布尔值等搜索元素。

import re

# 通过正则表达式查找以 't' 开头的标签
tags = soup.find_all(re.compile('^t'))
for tag in tags:
    print(tag.name)

# 通过函数查找特定条件的元素
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

tags = soup.find_all(has_class_but_no_id)
for tag in tags:
    print(tag.name)

6. 修改文档树

你可以使用 BeautifulSoup 轻松地修改文档树，例如添加或删除标签、插入文本、包装元素等。

# 插入新标签
new_tag = soup.new_tag('span')
new_tag.string = "This is a span"
h1.insert_after(new_tag)

# 删除标签
h1.decompose()

# 替换标签
new_h1 = soup.new_tag('h1')
new_h1.string = "Replaced Heading"
h1.replace_with(new_h1)

7. 输出文档

可以使用 prettify 方法将解析的文档以美观的格式输出，或使用 str() 方法将其转换为字符串。

# 美观格式化输出
print(soup.prettify())

# 直接转换为字符串
html_string = str(soup)

8. 总结

BeautifulSoup4 是一个强大且易于使用的 HTML/XML 解析工具，适用于网络爬虫和数据提取等任务。它的灵活性和强大的搜索、遍历、修改功能使其成为处理网页数据的理想选择。通过掌握 BeautifulSoup4，你可以轻松地从各种复杂的网页中提取所需的信息，并进行进一步的分析和处理。