How do you extract required data from structured strings in Python?

Introduction

I will show you a few methods for extracting required data or fields from structured strings. These approaches are useful when the input follows a known format.

How to do it

1. Let us create a dummy format to understand the approach.

Report: <> - Time: <> - Player: <> - Titles: <> - Country: <>

Report: Daily_Report - Time: 2020-10-16T01:01:01.000001 - Player: Federer - Titles: 20 - Country: Switzerland

report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'

2. The first thing to notice in the report is that the fields are separated by " - " (a hyphen with a space on each side). We will go ahead and split the report on " - ".

fields = report.split(' - ')
name, time, player, titles, _ = fields

print(f"Output \n *** The report name {name} generated on {time} has {titles} titles for {player}. ")

Output

*** The report name Report: Daily_Report generated on Time: 2020-10-10T12:30:59.000000 has Titles: 20 titles for Player: Federer.

3. The output is not what we want, because it still contains labels such as Report:, Time: and Player: that are not required. Each label is followed by a colon and a space, so we split each field on ': ' and keep the part after it.

# extract only the report name
formatted_name = name.split(': ')[1]

# extract only the player
formatted_player = player.split(': ')[1]

# extract only the titles, converted to an integer
formatted_titles = int(titles.split(': ')[1])

# extract only the timestamp (its own colons are not followed by a space)
new_time = time.split(': ')[1]

print(f"Output \n{formatted_name} , {new_time}, {formatted_player} , {formatted_titles}")

Output

Daily_Report , 2020-10-10T12:30:59.000000, Federer , 20
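
The per-field splits above can also be generalized. The following is a minimal sketch, assuming every field has the form "Label: value" and that no value contains " - ". The helper name fields_to_dict is hypothetical and not part of the original walkthrough. It uses str.partition, which splits at the first ': ' only, so the colons inside the timestamp are left alone.

```python
def fields_to_dict(line, sep=' - '):
    """Map each 'Label: value' field of the line to a dict entry."""
    result = {}
    for field in line.split(sep):
        # partition splits at the first ': ' only
        label, found, value = field.partition(': ')
        if not found:
            raise ValueError(f"field without a label: {field!r}")
        result[label] = value
    return result


report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'
parsed = fields_to_dict(report)
print(parsed['Report'], parsed['Time'], parsed['Player'], int(parsed['Titles']))
```

Looking fields up by label avoids depending on their position in the line, at the cost of keeping every value as a string until you convert it.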

4. The timestamp is in ISO format. You can leave it as a string or convert it into a datetime object. Let me show you how to parse the timestamp field.

from datetime import datetime
formatted_date = datetime.fromisoformat(new_time)

print(f"Output \n{formatted_date}")

Output

2020-10-10 12:30:59
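
Once the timestamp is a datetime object, it can be broken into its parts. This is a small sketch using only the standard datetime API. The variable name stamp is just for illustration.

```python
from datetime import datetime

stamp = datetime.fromisoformat('2020-10-10T12:30:59.000000')

# Individual components are available as attributes
print(stamp.year, stamp.month, stamp.day)
print(stamp.hour, stamp.minute, stamp.second)

# date() and time() return the two halves as separate objects
print(stamp.date())   # 2020-10-10
print(stamp.time())   # 12:30:59

# strftime renders a custom layout
print(stamp.strftime('%d/%m/%Y'))   # 10/10/2020
```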

5. Now we will combine all these steps into a single function.

def parse_function(log):
    """
    Parse a log line in the format
    Report: <> - Time: <> - Player: <> - Titles: <> - Country: <>

    Args   : log, the report line to parse
    Return : a string with the report name, time, player and titles
    """
    fields = log.split(' - ')
    name, time, player, titles, _ = fields

    # extract only the report name
    formatted_name = name.split(': ')[1]

    # extract only the player
    formatted_player = player.split(': ')[1]

    # extract only the titles, converted to an integer
    formatted_titles = int(titles.split(': ')[1])

    # extract only the timestamp
    new_time = time.split(': ')[1]

    return f"{formatted_name} , {new_time}, {formatted_player} , {formatted_titles}"


if __name__ == '__main__':
    report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'
    data = parse_function(report)
    print(f"Output \n{data}")

Output

Daily_Report , 2020-10-10T12:30:59.000000, Federer , 20
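
The function above returns one formatted string. If the caller needs the individual values, a typed record may be more convenient. This is a sketch under stated assumptions: ReportRecord and parse_record are hypothetical names introduced here, and the line is assumed to have exactly five " - " separated fields.

```python
from datetime import datetime
from typing import NamedTuple


class ReportRecord(NamedTuple):
    name: str
    time: datetime
    player: str
    titles: int
    country: str


def parse_record(log):
    """Parse one report line into a ReportRecord."""
    fields = log.split(' - ')
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    # Keep only the text after the first ': ' in each field
    values = [field.partition(': ')[2] for field in fields]
    name, time_text, player, titles, country = values
    # fromisoformat and int raise ValueError on malformed values
    return ReportRecord(name, datetime.fromisoformat(time_text),
                        player, int(titles), country)


report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'
record = parse_record(report)
print(record.player, record.titles, record.time.year)
```

Checking the field count first turns a confusing unpacking error into a clear message when a line does not match the expected format.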

6. We can use the parse module to make this a bit simpler. Since we know the format, we can describe it as a template and let parse do the extraction.

First, install the parse module with: pip install parse

from parse import parse
report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'

# Looking at the report, create a template
template = 'Report: {name} - Time: {time} - Player: {player} - Titles: {titles} - Country: {country}'

# Run parse and check the results
data = parse(template, report)
print(f"Output \n{data}")

Output

<Result () {'name': 'Daily_Report', 'time': '2020-10-10T12:30:59.000000', 'player': 'Federer', 'titles': '20', 'country': 'Switzerland'}>

7. With a simple one-liner, we extracted the data from the log just by defining a template. Now let us extract the individual values.

print(f"Output \n {data['name']} - {data['time']} - {data['player']} - {data['titles']} - {data['country']}")

Output

Daily_Report - 2020-10-10T12:30:59.000000 - Federer - 20 - Switzerland
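
If installing a third-party package is not an option, the same template idea can be approximated with the standard library. Note that this swaps in a different technique, regular expressions with named groups from the re module, rather than the parse module used above. The name PATTERN is illustrative.

```python
import re

# Each named group plays the role of a {placeholder} in the parse template
PATTERN = re.compile(
    r'Report: (?P<name>.+?) - Time: (?P<time>.+?) - Player: (?P<player>.+?)'
    r' - Titles: (?P<titles>\d+) - Country: (?P<country>.+)'
)

report = 'Report: Daily_Report - Time: 2020-10-10T12:30:59.000000 - Player: Federer - Titles: 20 - Country: Switzerland'

match = PATTERN.fullmatch(report)
if match is None:
    raise ValueError('line does not match the expected format')

# groupdict() returns the captured values keyed by group name
data = match.groupdict()
print(data['name'], data['player'], int(data['titles']), data['country'])
```

The regular expression is harder to read than the parse template, but it needs no extra dependency and rejects lines whose Titles field is not a number.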

Conclusion:

You have seen a few methods for parsing required data out of a structured log line. Where possible, prefer defining a template and using the parse module to extract the data you need.