Làm thế nào để xử lý dữ liệu tệp excel theo khối với Python?

Giới thiệu

Có vẻ như thế giới được cai trị bởi Excel. Tôi đã rất ngạc nhiên trong công việc kỹ thuật dữ liệu của mình khi thấy có bao nhiêu đồng nghiệp của tôi đang sử dụng Excel như một công cụ quan trọng để đưa ra quyết định. Mặc dù tôi không phải là một fan hâm mộ lớn của MS Office và các bảng tính excel của họ, tôi vẫn sẽ chỉ cho bạn một mẹo nhỏ để xử lý các bảng tính lớn excel một cách hiệu quả.

Cách thực hiện ..

Trước khi tham gia trực tiếp vào chương trình, chúng ta hãy hiểu một số điều cơ bản về cách xử lý bảng tính excel với Pandas.

1. Cài đặt. Tiếp tục và cài đặt openpyxl và xlwt. Nếu bạn không chắc liệu nó có được cài đặt hay không, chỉ có các gói có sẵn bằng cách sử dụng pip đóng băng hoặc danh sách pip từ thiết bị đầu cuối python.

Đầu tiên chúng ta sẽ tạo một bảng tính excel bằng cách chuyển nhiều dữ liệu, sau đó chúng ta sẽ tải dữ liệu vào khung dữ liệu gấu trúc. Cuối cùng, chúng tôi sẽ ghi dữ liệu khung dữ liệu vào một sổ làm việc mới.

import xlsxwriter
import pandas as pd

2.Tạo một bảng tính Excel với dữ liệu nhỏ. chúng ta sẽ có một hàm nhỏ để ghi dữ liệu từ điển vào bảng tính excel. Tất cả logic mã được xác định ở mỗi bước.

# Function : write_data_to_files
def write_data_to_files(inp_data, inp_file_name):
"""
function : create a csv file with the data passed to this code
args : inp_data : tuple data to be written to the target file
file_name : target file name to store the data
return : none
assumption : File to be created and this code are in same directory.
"""
print(f" *** Writing the data to - {inp_file_name}")

# Create a Workbook.
workbook = xlsxwriter.Workbook(inp_file_name)

# add a worksheet.
worksheet = workbook.add_worksheet()

# Start from the first cell. Rows and columns are zero indexed.
row = 0
col = 0

# read the input data and write them in rows and columns
for player, titles in inp_data:
worksheet.write(row, col, player)
worksheet.write(row, col + 1, titles)
row += 1

# close the workbook.
workbook.close()
print(f" *** Completed writing the data to - {inp_file_name}")

# Function : excel_functions_with_pandas
def excel_functions_with_pandas(inp_file_name):
"""
function : Quick overview of functions you can apply on excel with pandas
args : inp_file_name : input excel spread sheet.
return : none
assumption : Input excel spreadsheet and this code are in same directory.
"""
data = pd.read_excel(inp_file_name)

# print top 2 rows
print(f" *** Displaying top 2 rows of - {inp_file_name} \n {data.head()} ")

# look at the data types
print(f" *** Displaying info about {inp_file_name} - {data.info()}")

# Create a new spreadsheet "Sheet2" and write data into it.
new_players_info = pd.DataFrame(data=[
{"players": "new Roger Federer", "titles": 20},
{"players": "new Rafael Nadal", "titles": 20},
{"players": "new Novak Djokovic", "titles": 17},
{"players": "new Andy Murray", "titles": 3}], columns=["players", "titles"])

new_data = pd.ExcelWriter(inp_file_name)
new_players_info.to_excel(new_data, sheet_name="Sheet2")
if __name__ == '__main__':
# Define your file name and data
file_name = "temporary_file.xlsx"

# tuple data for storage
file_data = (['player', 'titles'], ['Federer', 20], ['Nadal', 20], ['Djokovic', 17], ['Murray', 3])

# write the file_data to file_name
# write_data_to_files(file_data, file_name)

# # Read excel file into pandas and apply functions.
# excel_functions_with_pandas(file_name)

if __name__ == '__main__':
# Define your file name and data
file_name = "temporary_file.xlsx"

# tuple data for storage
file_data = (['player', 'titles'], ['Federer', 20], ['Nadal', 20], ['Djokovic', 17], ['Murray', 3])

# write the file_data to file_name
# write_data_to_files(file_data, file_name)

# # Read excel file into pandas and apply functions.
# excel_functions_with_pandas(file_name)

Đầu ra

*** Writing the data to - temporary_file.xlsx
*** Completed writing the data to - temporary_file.xlsx
*** Displaying top 2 rows of - temporary_file.xlsx
player titles
0 Federer 20
1 Nadal 20
2 Djokovic 17
3 Murray 3
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 player 4 non-null object
1 titles 4 non-null int64
dtypes: int64(1), object(1)
memory usage: 192.0+ bytes
*** Displaying info about temporary_file.xlsx - None

Giờ đây, khi xử lý các tệp csv lớn, chúng tôi có khá nhiều tùy chọn bao gồm các phần để xử lý chúng theo phần, tuy nhiên đối với bảng tính excel, Pandas không cung cấp tùy chọn theo phần theo mặc định.

Vì vậy, chương trình dưới đây khá hữu ích nếu bạn muốn xử lý bảng tính excel theo từng phần.

Ví dụ

Hàm

def global_excel_to_db_chunks(file_name, nrows):
"""
function : handle excel spreadsheets in chunks
args : inp_file_name : input excel spread sheet.
return : none
assumption : Input excel spreadsheet and this code are in same directory.
"""
chunks = []
i_chunk = 0

# The first row is the header. We have already read it, so we skip it.
skiprows = 1
df_header = pd.read_excel(file_name, nrows=1)

while True:
df_chunk = pd.read_excel(
file_name, nrows=nrows, skiprows=skiprows, header=None)
skiprows += nrows

# When there is no data, we know we can break out of the loop.
if not df_chunk.shape[0]:
break
else:
print(
f" ** Reading chunk number {i_chunk} with {df_chunk.shape[0]} Rows")
# print(f" *** Reading chunk {i_chunk} ({df_chunk.shape[0]} rows)")
chunks.append(df_chunk)
i_chunk += 1

df_chunks = pd.concat(chunks)

# Rename the columns to concatenate the chunks with the header.
columns = {i: col for i, col in enumerate(df_header.columns.tolist())}
df_chunks.rename(columns=columns, inplace=True)
df = pd.concat([df_header, df_chunks])

print(f' *** Reading is Completed in chunks...')

if __name__ == '__main__':
print(f" *** Gathering & Displaying Stats on the excel spreadsheet ***")
file_name = 'Sample-sales-data-excel.xls'
stats = pd.read_excel(file_name)
print(f" ** Total rows in the spreadsheet are - {len(stats.index)} Rows")

# process the excel file in chunks of 1000 rows at a time.
global_excel_to_db_chunks(file_name, 1000)

*** Gathering & Displaying Stats on the excel spreadsheet ***
** Total rows in the spreadsheet are - 9994 Rows
** Reading chunk number 0 with 1000 Rows
** Reading chunk number 1 with 1000 Rows
** Reading chunk number 2 with 1000 Rows
** Reading chunk number 3 with 1000 Rows
** Reading chunk number 4 with 1000 Rows
** Reading chunk number 5 with 1000 Rows
** Reading chunk number 6 with 1000 Rows
** Reading chunk number 7 with 1000 Rows
** Reading chunk number 8 with 1000 Rows
** Reading chunk number 9 with 994 Rows
*** Reading is Completed in chunks...

Đầu ra

*** Gathering & Displaying Stats on the excel spreadsheet ***
** Total rows in the spreadsheet are - 9994 Rows
** Reading chunk number 0 with 1000 Rows
** Reading chunk number 1 with 1000 Rows
** Reading chunk number 2 with 1000 Rows
** Reading chunk number 3 with 1000 Rows
** Reading chunk number 4 with 1000 Rows
** Reading chunk number 5 with 1000 Rows
** Reading chunk number 6 with 1000 Rows
** Reading chunk number 7 with 1000 Rows
** Reading chunk number 8 with 1000 Rows
** Reading chunk number 9 with 994 Rows
*** Reading is Completed in chunks...