How can the BeautifulSoup package be used to parse data from a webpage in Python?

BeautifulSoup is a third-party Python library used to parse HTML and XML documents, such as the source of a web page. It is commonly used for web scraping, which is the process of extracting, using, and manipulating data from various web resources.

Web scraping can also be used to extract data for research purposes, to understand or compare market trends, to perform SEO monitoring, and so on.

Run the following command to install BeautifulSoup on Windows (the same command works on any platform where pip is available):

pip install beautifulsoup4
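
As a quick sanity check after installing, the minimal sketch below parses a small inline HTML string with Python's built-in "html.parser" backend. The HTML snippet and the variable names are made up for illustration; if the script runs without an ImportError, the package is available.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only to confirm that the import works.
sample_html = "<p>Hello, <b>world</b></p>"

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(sample_html, "html.parser")

# get_text() concatenates the text of all nested tags.
installed_check = soup.get_text()
print(installed_check)
```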

Let us look at an example:

Example

from urllib.request import urlopen
from bs4 import BeautifulSoup
# Page to download and parse
url = 'https://en.wikipedia.org/wiki/Algorithm'
print("Reading the webpage...")
html = urlopen(url).read()
print("Parsing the webpage...")
soup = BeautifulSoup(html, features="html.parser")
# Remove <script> and <style> elements so their contents are not treated as text
for script in soup(["script", "style"]):
    script.extract()
print("Extracting text from the webpage...")
text = soup.get_text()
print("Data cleaning...")
# Strip leading and trailing whitespace from every line
lines = (line.strip() for line in text.splitlines())
# Split each line on double spaces so that separate phrases land on separate lines
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# Drop empty chunks and join the rest, one per line
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)

Output

Reading the webpage...
Parsing the webpage...
Extracting text from the webpage...
Data cleaning...
Recursive C implementation of Euclid's algorithm from the above flowchart
Recursion
A recursive algorithm is one that invokes (makes reference to) itself repeatedly until a certain condition (also known as termination condition) matches, which is a method common to functional programming….
…..
Developers
Statistics
Cookie statement
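
The example above downloads the page with urlopen and its default settings. Some sites reject requests that lack a descriptive User-Agent header, and a request without a timeout can hang indefinitely. The sketch below shows one possible way to handle both using only the standard library. The helper names build_request and fetch_html, the User-Agent string, and the 10-second timeout are assumptions chosen for illustration, not part of the original example.

```python
from urllib.request import Request, urlopen

# Hypothetical identifier; replace it with something that describes your own tool.
DEFAULT_USER_AGENT = "example-scraper/0.1"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the raw bytes of a page.

    Raises urllib.error.URLError (or its subclass HTTPError) on failure.
    """
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network, so it is safe to inspect.
request = build_request("https://en.wikipedia.org/wiki/Algorithm")
print(request.get_header("User-agent"))
```

With these helpers, html = fetch_html(url) could stand in for html = urlopen(url).read() in the example.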

Explanation

The required packages are imported.
The URL of the webpage is defined.
The URL is opened, the HTML is parsed, and the 'script' and 'style' tags are removed so that their contents do not appear in the text.
The 'get_text' function is used to extract the text from the parsed webpage.
Extra whitespace and empty chunks are removed.
The text is printed on the console.
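
The steps listed above can also be tried without a network connection. The sketch below wraps them in a function and runs it on a small inline HTML document. The function name html_to_text and the sample markup are hypothetical, decompose() is used in place of extract() because the removed tags are not needed afterwards, and splitting on two consecutive spaces is an assumption about the intended phrase separator.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Remove <script>/<style> elements and return cleaned text, one chunk per line."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # like extract(), but the removed tag is destroyed
    text = soup.get_text()
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split on double spaces so separate phrases land on separate lines.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and join the rest with newlines.
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical page used only for demonstration.
sample_html = """
<html>
  <head>
    <title>Demo page</title>
    <style>p { color: red; }</style>
  </head>
  <body>
    <script>console.log("ignored");</script>
    <h1>Algorithms</h1>
    <p>An algorithm is a finite sequence of steps.</p>
    <p>Input  Output</p>
  </body>
</html>
"""

cleaned = html_to_text(sample_html)
print(cleaned)
```

The printed text contains the title, the heading, and the paragraph text, with the script and style contents left out.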