How to use Boto3 to paginate through all tables present in AWS Glue

Problem Statement: Use the boto3 library in Python to paginate through all the tables in the AWS Glue Data Catalog created in your account.

Approach/Algorithm to solve this problem

Step 1: Import boto3 and the botocore exceptions needed to handle errors.
Step 2: max_items, page_size and starting_token are optional parameters for this function, while database_name is required.
- max_items denotes the total number of records to return. If more records are available than max_items, a NextToken is provided so that pagination can be resumed (boto3 exposes it as the resume_token attribute of the page iterator).
- page_size denotes the number of records fetched per page, that is, per API request.
- starting_token tells the paginator where to resume; it takes the NextToken from a previous response.
Step 3: Create an AWS session using the boto3 library. Make sure region_name is set in the default profile. If it is not, pass region_name explicitly when creating the session.
Step 4: Create an AWS client for Glue.
Step 5: Create a paginator object for the get_tables operation; it yields the details of all tables.
Step 6: Call the paginate function, passing database_name as DatabaseName, and max_items, page_size and starting_token as the PaginationConfig parameter.
Step 7: It returns the records, limited in total by max_items and fetched page_size records at a time.
Step 8: Handle the generic exception if something goes wrong while paginating.
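
The steps above can be sketched as one helper that receives an already created Glue client, so the caller decides on the session and region. This is a minimal sketch, not a definitive implementation: the name fetch_tables and its (tables, next_token) return shape are illustrative choices, and the region in the docstring is only an example value. The get_paginator and paginate calls and the resume_token attribute are part of boto3.

```python
def fetch_tables(glue_client, database_name, max_items=None, page_size=None,
                 starting_token=None):
    """Return (tables, next_token) for one Glue database.

    glue_client is assumed to be a boto3 Glue client, for example
    boto3.session.Session(region_name="us-east-1").client("glue")
    (the region here is only an example value).
    """
    paginator = glue_client.get_paginator("get_tables")
    pages = paginator.paginate(
        DatabaseName=database_name,
        PaginationConfig={
            "MaxItems": max_items,        # cap on the total records returned
            "PageSize": page_size,        # records requested per API call
            "StartingToken": starting_token,  # token from an earlier call
        },
    )
    tables = []
    for page in pages:  # requests are sent lazily, one page per iteration
        tables.extend(page.get("TableList", []))
    # resume_token stays None when nothing was left to fetch
    return tables, pages.resume_token
```

Passing the returned token back as starting_token on a later call should continue the listing where the previous call stopped.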

Example code

Use the following code to paginate through the tables of a database created in the user account:

import boto3
from botocore.exceptions import ClientError

def paginate_through_tables(database_name, max_items: int = None, page_size: int = None, starting_token: str = None):
   # Uses the region from the default profile; pass region_name to Session() otherwise
   session = boto3.session.Session()
   glue_client = session.client('glue')
   try:
      paginator = glue_client.get_paginator('get_tables')
      # paginate() returns a lazy page iterator; requests are sent on iteration
      response = paginator.paginate(DatabaseName=database_name, PaginationConfig={
         'MaxItems': max_items,
         'PageSize': page_size,
         'StartingToken': starting_token}
      )
      return response
   except ClientError as e:
      raise Exception("boto3 client error in paginate_through_tables: " + str(e)) from e
   except Exception as e:
      raise Exception("Unexpected error in paginate_through_tables: " + str(e)) from e

# Return at most 2 tables from test_db, requesting 5 records per page
a = paginate_through_tables("test_db", 2, 5)
# Unpacking iterates over the pages and prints each page dictionary
print(*a)
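
Because paginate() returns a lazy page iterator, print(*a) dumps each page as one raw dictionary, and any API error is raised while the pages are being iterated rather than inside the function. A minimal sketch of a more readable way to consume the result follows; print_table_names is a hypothetical helper, and it only assumes that each page is a dictionary with an optional 'TableList' key, as in the output shown in this article.

```python
def print_table_names(pages):
    """Print one line per table and return how many tables were seen.

    pages is any iterable of get_tables response pages, such as the
    object returned by paginate_through_tables().
    """
    count = 0
    for page_number, page in enumerate(pages, start=1):
        for table in page.get("TableList", []):
            print(f"page {page_number}: {table['DatabaseName']}.{table['Name']}")
            count += 1
    return count
```

Wrapping the call to print_table_names in a try/except for ClientError is one way to catch errors raised during iteration.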

Output

{'TableList': [
{'Name': 'temp_table', 'DatabaseName': 'test_db', 'Owner': 'abc', 'CreateTime': datetime.datetime(2020, 9, 10, 20, 44, 29, tzinfo=tzlocal()), 'UpdateTime': datetime.datetime(2020, 9, 10, 20, 44, 29, tzinfo=tzlocal()), 'LastAccessTime': datetime.datetime(1970, 1, 1, 5, 30, tzinfo=tzlocal()), 'Retention': 0, 'StorageDescriptor':
{'Columns': [{'Name': 'keyname', 'Type': 'string', 'Comment': ''}, {'Name': 'amount', 'Type': 'string', 'Comment': ''}, {'Name': 'effectivedate', 'Type': 'string', 'Comment': ''}, {'Name': 'clientname', 'Type': 'string', 'Comment': ''}, {'Name': 'accoutname', 'Type': 'varchar(5)', 'Comment': ''}, {'Name': 'clientid', 'Type': 'varchar(6)', 'Comment': ''}], 'Location': 's3://test/', 'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat', 'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat', 'Compressed': False, 'NumberOfBuckets': 0, 'SerdeInfo': {'Name': 'test', 'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe', 'Parameters': {}}, 'BucketColumns': [], 'SortColumns': [], 'Parameters': {}, 'StoredAsSubDirectories': False}, 'PartitionKeys': [], 'ViewOriginalText': '', 'ViewExpandedText': '', 'TableType': 'EXTERNAL_TABLE', 'Parameters': {'EXTERNAL': 'TRUE', 'has_encrypted_data': 'false', 'parquet.compression': 'SNAPPY'}, 'CreatedBy': 'arn:aws:sts::782258485841:assumed-role/IVZ-ADFS-NorthBayLead/Hari.Porandla@invesco.com'},
{'Name': 'test_3', 'DatabaseName': 'test_db', 'Owner': 'abc', 'CreateTime': datetime.datetime(2020, 9, 10, 21, 54, 39, tzinfo=tzlocal()), 'UpdateTime': datetime.datetime(2020, 9, 10, 21, 54, 39, tzinfo=tzlocal()), 'LastAccessTime': datetime.datetime(1970, 1, 1, 5, 30, tzinfo=tzlocal()), 'Retention': 0, 'StorageDescriptor': {'Columns': [{'Name': 'keyname', 'Type': 'string', 'Comment': ''}, {'Name': 'amount', 'Type': 'string', 'Comment': ''}, {'Name': 'effectivedate', 'Type': 'string', 'Comment': ''}, {'Name': 'clientname', 'Type': 'string', 'Comment': ''}, {'Name': 'accoutname', 'Type': 'varchar(5)', 'Comment': ''}, {'Name': 'clientid', 'Type': 'varchar(6)', 'Comment': ''}], 'Location': 's3://test3/', 'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat', 'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat', 'Compressed': False, 'NumberOfBuckets': 0, 'SerdeInfo': {'Name': 'test_3', 'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe', 'Parameters': {}}, 'BucketColumns': [], 'SortColumns': [], 'Parameters': {}, 'StoredAsSubDirectories': False}, 'PartitionKeys': [], 'ViewOriginalText': '', 'ViewExpandedText': '', 'TableType': 'EXTERNAL_TABLE', 'CreatedBy': 'arn:aws:sts::***********:assumed-role/abc'}], 'ResponseMetadata': {'RequestId': 'dd35e6c5-*********************1', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 02 Apr 2021 13:42:48 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '10301', 'connection': 'keep-alive', 'x-amzn-requestid': '*******************'}, 'RetryAttempts': 0}}
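
Each entry of 'TableList' is a deeply nested dictionary, so it is often easier to reduce it to the few fields of interest. The sketch below assumes the keys visible in the output above ('Name', 'DatabaseName', 'TableType', 'StorageDescriptor', 'Location', 'Columns'); summarize_table is a hypothetical helper, and it reads everything except 'Name' defensively in case a table does not carry every key.

```python
def summarize_table(table):
    """Reduce one get_tables entry to its name, type, location and columns."""
    storage = table.get("StorageDescriptor", {})
    return {
        "name": table["Name"],
        "database": table.get("DatabaseName"),
        "type": table.get("TableType"),
        "location": storage.get("Location"),
        # (column name, column type) pairs in their declared order
        "columns": [(c["Name"], c["Type"]) for c in storage.get("Columns", [])],
    }
```

For the tables shown above, this would give, for instance, the S3 location and the list of column names and types without the SerDe and format details.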