[Crawler Notes] A First Look at the BeautifulSoup Library | Rabbit House

So "soup" really means soup.. I thought it was soap qwq

# Basic usage

A BeautifulSoup object, an HTML document, and its tag tree (a string) can be treated as equivalent.

bs4 converts every HTML file or string it reads into Unicode; UTF-8 is the default when the document is encoded back to bytes.

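	# pip install beautifulsoup4
	from bs4 import BeautifulSoup  # mind the capitalization
	soup = BeautifulSoup("<html>data</html>", 'html.parser')  # the HTML to parse, then the parser
	with open("D:/demo.html", encoding="utf-8") as f:  # assumes the file is UTF-8; closes it after parsing
		soup = BeautifulSoup(f, 'html.parser')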
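
To see the encoding behaviour for yourself, here is a minimal sketch. The tiny HTML snippet is made up for illustration, and `from_encoding` is passed only to make the input encoding explicit; normally bs4 tries to detect it.

```python
from bs4 import BeautifulSoup

# Hypothetical input: UTF-8 bytes rather than a str.
raw = "<html><body><p>café</p></body></html>".encode("utf-8")

soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

text = soup.p.string            # text inside the tree is Unicode (a str subclass)
encoded = soup.encode("utf-8")  # encoding the tree again gives bytes

print(isinstance(text, str))
print(isinstance(encoded, bytes))
```

Whatever the input encoding was, the strings you pull out of the tree are ordinary Python `str` values.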

Example:

	>>> from bs4 import BeautifulSoup
	>>> soup = BeautifulSoup(demo, "html.parser")  # demo is an HTML document (a string), obtained via demo = requests.get("https://python123.io/ws/demo.html").text
	>>> tag = soup.a
	>>> tag.name
	'a'
	>>> tag.parent.name
	'p'
	>>> tag.attrs
	{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
	>>> tag.attrs['href']
	'http://www.icourse163.org/course/BIT-268001'
	>>> type(tag.attrs)
	<class 'dict'>
	>>> type(tag)
	<class 'bs4.element.Tag'>
	>>> tag.string
	'Basic Python'
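
The same tag properties can be tried without a network connection. A small sketch, assuming a one-line stand-in for the demo page (the HTML string below is mine, trimmed down from the output above):

```python
from bs4 import BeautifulSoup

# Stand-in for the demo page, so that nothing has to be downloaded.
html = (
    '<p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

print(tag.name)          # tag name
print(tag.parent.name)   # name of the enclosing tag
print(tag["class"])      # class is multi-valued, so it comes back as a list
print(tag.get("title"))  # .get() returns None for a missing attribute
print(tag.string)        # the text inside the tag
```

Indexing with `tag["title"]` would raise `KeyError` here, so `.get()` is the safer choice when an attribute may be absent.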

Only now did I learn that HTML can be viewed as a tree of tags qwq

Ways to traverse: downward (parent -> child), upward (child -> parent), and sideways between siblings (children of the same parent, not every node at the same depth)

Downward traversal:

	>>> soup = BeautifulSoup(demo, "html.parser")
	>>> soup.head
	<head><title>This is a python demo page</title></head>
	>>> soup.head.contents
	[<title>This is a python demo page</title>]
	>>> soup.body.contents[1]
	<p class="title"><b>The demo python introduces several python courses.</b></p>
	# iterate over the direct children
	for child in soup.body.children:
		print(child)
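
One thing that tripped me up: the newline text between tags counts as a child too, which is why the first `<p>` above sits at `contents[1]` and not `contents[0]`. A small sketch, assuming a made-up body with two paragraphs:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up HTML; the "\n" between tags becomes text nodes in the tree.
html = '<body>\n<p class="title"><b>Hello</b></p>\n<p class="course">World</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a list of direct children, whitespace strings included.
print(len(body.contents))

# .children iterates over the same direct children; keep only real tags.
child_tags = [c.name for c in body.children if isinstance(c, Tag)]

# .descendants walks the whole subtree, so the nested <b> shows up as well.
descendant_tags = [d.name for d in body.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```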

Upward traversal (continuing from above):

	>>> soup.title.parent
	<head><title>This is a python demo page</title></head>
	>>> soup.html.parent
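	# prints the whole HTML document
	>>> soup.parent
	# None, so nothing is printed
	# iterate over the ancestors
	for parent in soup.a.parents:
		if parent is None:
			print(parent)
		else:
			print(parent.name)  # the last ancestor is the BeautifulSoup object, named '[document]'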
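
A compact way to see the whole chain of ancestors at once. This is only a sketch on a made-up snippet, not the demo page:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk from <a> up to the root and collect each ancestor's name.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# The BeautifulSoup object is the root of the tree and has no parent.
print(soup.parent is None)  # True
```

The final `'[document]'` entry is the `BeautifulSoup` object itself, which is why `soup.html.parent` prints the whole document.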

Sibling traversal (continuing from above):

	>>> soup.a.next_sibling
	' and '
	>>> soup.a.next_sibling.next_sibling
	<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
	>>> soup.a.previous_sibling
	'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
	>>> soup.a.previous_sibling.previous_sibling
	# None
	# iterating works the same way as above, via .next_siblings and .previous_siblings qwq
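
Since siblings can be plain strings (like `' and '` above), it helps to filter when you only want tags. A minimal sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>Intro: <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Every following sibling, text nodes included.
following = list(first.next_siblings)

# Only the siblings that are real tags.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

# Shortcut: jump straight to the next sibling tag with a given name.
next_link = first.find_next_sibling("a")

print(following)
print(tag_siblings)
print(next_link["id"])
```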

How to display an HTML page in a more "friendly" way

Usage: soup.prettify() # adds newlines and indentation so the printed output is easier to read
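
A quick sketch of what that looks like, using a made-up one-line document instead of the demo page:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str with one tag per line, indented by nesting depth.
pretty = soup.prettify()
print(pretty)

# It also works on a single tag, not only on the whole document.
print(soup.p.prettify())
```

The added whitespace is only meant for reading; to get at the data, keep working with the parsed tree.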

	>>> import requests as rq
	>>> from bs4 import BeautifulSoup
	>>> url ="https://python123.io/ws/demo.html"
	>>> r = rq.get(url)
	>>> demo = r.text
	>>> soup = BeautifulSoup(demo, "html.parser")
	>>> for link in soup.find_all('a'):
	...     print(link.get('href'))
	http://www.icourse163.org/course/BIT-268001
	http://www.icourse163.org/course/BIT-1001870001
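
`find_all` can also filter by attribute. The sketch below runs offline on a stand-in string that I trimmed from the demo page, so treat the HTML as an assumption:

```python
from bs4 import BeautifulSoup

# Stand-in for the relevant part of the demo page.
html = (
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

# All links, as in the session above.
hrefs = [a.get("href") for a in soup.find_all("a")]

# Filter by CSS class; the keyword is class_ because class is reserved in Python.
py2_links = soup.find_all("a", class_="py2")

# find() returns the first match only, or None when nothing matches.
first_link = soup.find(id="link1")
missing = soup.find("table")

print(hrefs)
print(py2_links[0].string)
print(first_link.string, missing)
```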