Implementing Web Scraping in Python with BeautifulSoup

BeautifulSoup is a class in Python's bs4 module. It was built to parse HTML and XML documents.

Installing bs4 (beautifulsoup for short)

beautifulsoup is easy to install with pip. Just run the command below in your command shell.

pip install bs4

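Running the command above in your terminal produces output similar to the following. As the log shows, bs4 is a small wrapper package that depends on beautifulsoup4, which contains the library itself.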

C:\Users\rajesh>pip install bs4
Collecting bs4
Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied: beautifulsoup4 in c:\python\python361\lib\site-packages (from bs4) (4.6.0)
Building wheels for collected packages: bs4
Building wheel for bs4 (setup.py) ... done
Stored in directory: C:\Users\rajesh\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1

To verify that BeautifulSoup was installed successfully on your machine, start Python in the same terminal and import the class:

C:\Users\rajesh>python
Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>>

If the import returns without an error, as above, the installation succeeded.
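
For a slightly stronger check than a bare import, the short sketch below prints the installed version and parses a one-line document. The sample markup is made up for illustration; only the bs4 import comes from the steps above.

```python
import bs4
from bs4 import BeautifulSoup

# bs4 exposes the installed Beautiful Soup version as a string.
print("Beautiful Soup version:", bs4.__version__)

# Parse a tiny, made-up document to confirm the parser works end to end.
check_soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(check_soup.p.get_text())  # prints: hello
```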

Example 1

Find all the links in an HTML document

Suppose we have an HTML document and we want to collect all the links it references. First, we store the document as a string, as shown below:

html_doc='''<a href='www.Tutorialspoint.com'></a>
<a href='www.nseindia.com.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.linkedin.com'></a>
<a href='www.finance.google.com'></a>'''

Next, we create a soup object by passing the html_doc variable above to the BeautifulSoup constructor, together with the name of the parser to use.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
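
To get a feel for what the soup object holds, the sketch below parses a hypothetical one-link document (sample_doc is not part of the tutorial's data) and inspects the first tag's name, attributes, and text.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link document, used only for illustration.
sample_doc = "<a href='www.python.org' title='Python'>Python home</a>"
sample_soup = BeautifulSoup(sample_doc, "html.parser")

first_link = sample_soup.a       # the first <a> tag in the document
print(first_link.name)           # a
print(first_link.attrs)          # {'href': 'www.python.org', 'title': 'Python'}
print(first_link["href"])        # www.python.org
print(first_link.get_text())     # Python home
```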

Now that we have the soup object, we can call the methods of the BeautifulSoup class on it. In particular, we can find every tag in html_doc and read the values of its attributes.

for tag in soup.find_all('a'):
    print(tag.get('href'))

The code above loops over every <a> tag in the html_doc string and prints the value of its href attribute, which gives us all the links in the document.
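
One detail worth seeing in isolation is how tag.get('href') behaves when a tag has no href. The following is a minimal sketch on a made-up two-tag snippet (snippet and the variable names are assumptions for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second anchor has no href attribute.
snippet = "<a href='www.python.org'>Python</a><a id='top'>Top</a>"
snippet_soup = BeautifulSoup(snippet, "html.parser")

# tag.get('href') returns None for a missing attribute instead of raising.
all_hrefs = [tag.get("href") for tag in snippet_soup.find_all("a")]
print(all_hrefs)  # ['www.python.org', None]

# href=True keeps only the tags that actually carry the attribute.
real_hrefs = [tag["href"] for tag in snippet_soup.find_all("a", href=True)]
print(real_hrefs)  # ['www.python.org']
```

Indexing with tag['href'] would raise KeyError on the second tag, so get() or the href=True filter is the safer choice when the markup is not under your control.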

Below is the complete program for getting all the links from an html_doc string, using a slightly different list of sample links.

from bs4 import BeautifulSoup

# Sample document: each <a> tag carries one link in its href attribute.
html_doc='''<a href='www.Tutorialspoint.com'></a>
<a href='www.nseindia.com.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.rediff.com'></a>'''

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the href value of every <a> tag.
for tag in soup.find_all('a'):
    print(tag.get('href'))

Output

www.Tutorialspoint.com
www.nseindia.com.com
www.codesdope.com
www.google.com
www.facebook.com
www.wikipedia.org
www.twitter.com
www.microsoft.com
www.github.com
www.nytimes.com
www.youtube.com
www.reddit.com
www.python.org
www.stackoverflow.com
www.amazon.com
www.rediff.com

Example 2

Print all the links from a website whose URL mentions a specific term (for example, python).

The program below prints every URL on a given web page that contains "python" in the link. The exact links printed depend on the page's content at the time you run it.

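from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# Download the page; this needs network access.
html = urlopen("https://www.python.org")
content = html.read()
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(content, 'html.parser')
# href=True skips <a> tags that have no href attribute.
for a in soup.find_all('a', href=True):
    if re.search('python', a['href']):
        print("Python URL:", a['href'])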

Output

Python URL: https://docs.python.org
Python URL: https://pypi.python.org/
Python URL: https://www.facebook.com/pythonlang?fref=ts
Python URL: https://brochure.getpython.info/
Python URL: https://docs.python.org/3/license.html
Python URL: https://wiki.python.org/moin/BeginnersGuide
Python URL: https://devguide.python.org/
Python URL: https://docs.python.org/faq/
Python URL: https://wiki.python.org/moin/Languages
Python URL: https://python.org/dev/peps/
Python URL: https://wiki.python.org/moin/PythonBooks
Python URL: https://wiki.python.org/moin/
Python URL: https://www.python.org/psf/codeofconduct/
Python URL: https://planetpython.org/
Python URL: /events/python-events
Python URL: /events/python-user-group/
Python URL: /events/python-events/past/
Python URL: /events/python-user-group/past/
Python URL: https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
Python URL: //docs.python.org/3/tutorial/controlflow.html#defining-functions
Python URL: //docs.python.org/3/tutorial/introduction.html#lists
Python URL: https://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
Python URL: //docs.python.org/3/tutorial/
Python URL: //docs.python.org/3/tutorial/controlflow.html
Python URL: /downloads/release/python-373/
Python URL: https://docs.python.org
Python URL: //jobs.python.org
Python URL: https://blog.python.org
Python URL: https://feedproxy.google.com/~r/PythonInsider/~3/Joo0vg55HKo/python-373-is-now-available.html
Python URL: https://feedproxy.google.com/~r/PythonInsider/~3/N5tvkDIQ47g/python-3410-is-now-available.html
Python URL: https://feedproxy.google.com/~r/PythonInsider/~3/n0mOibtx6_A/python-3.html
Python URL: /events/python-events/805/
Python URL: /events/python-events/817/
Python URL: /events/python-user-group/814/
Python URL: /events/python-events/789/
Python URL: /events/python-events/831/
Python URL: /success-stories/building-an-open-source-and-cross-platform-azure-cli-with-python/
Python URL: /success-stories/building-an-open-source-and-cross-platform-azure-cli-with-python/
Python URL: https://wiki.python.org/moin/TkInter
Python URL: https://www.wxpython.org/
Python URL: https://ipython.org
Python URL: #python-network
Python URL: https://brochure.getpython.info/
Python URL: https://docs.python.org/3/license.html
Python URL: https://wiki.python.org/moin/BeginnersGuide
Python URL: https://devguide.python.org/
Python URL: https://docs.python.org/faq/
Python URL: https://wiki.python.org/moin/Languages
Python URL: https://python.org/dev/peps/
Python URL: https://wiki.python.org/moin/PythonBooks
Python URL: https://wiki.python.org/moin/
Python URL: https://www.python.org/psf/codeofconduct/
Python URL: https://planetpython.org/
Python URL: /events/python-events
Python URL: /events/python-user-group/
Python URL: /events/python-events/past/
Python URL: /events/python-user-group/past/
Python URL: https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
Python URL: https://devguide.python.org/
Python URL: https://bugs.python.org/
Python URL: https://mail.python.org/mailman/listinfo/python-dev
Python URL: #python-network
Python URL: https://github.com/python/pythondotorg/issues
Python URL: https://status.python.org/
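
The output above mixes absolute links, relative paths such as /events/python-events, scheme-relative links such as //docs.python.org, and several repeats. One possible way to tidy this up is sketched below. It runs offline on a hypothetical fragment (page) rather than the live site, and BASE_URL is an assumed base address. It passes a compiled pattern directly as the href filter to find_all, resolves each link with urllib.parse.urljoin, and skips duplicates.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL and a hypothetical fragment standing in for the fetched page.
BASE_URL = "https://www.python.org"
page = """
<a href="/events/python-events">Events</a>
<a href="//docs.python.org/3/tutorial/">Tutorial</a>
<a href="https://docs.python.org">Docs</a>
<a href="https://docs.python.org">Docs again</a>
<a href="/about/">About</a>
"""

page_soup = BeautifulSoup(page, "html.parser")

seen = set()
python_urls = []
# A compiled pattern passed as href= keeps tags whose href contains a match.
for a in page_soup.find_all("a", href=re.compile("python")):
    absolute = urljoin(BASE_URL, a["href"])  # resolve relative and //host links
    if absolute not in seen:                 # skip links already collected
        seen.add(absolute)
        python_urls.append(absolute)

for url in python_urls:
    print("Python URL:", url)
```

With this fragment, the /about/ link is filtered out, the repeated docs link is printed once, and the two non-absolute links come out as full https URLs.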