Data products, also known as datasets, lie at the core of the KeyCore Enterprise Data Lake (KEDL). This article covers what a data product is, which attributes it carries, how it is managed centrally, and which standard and customer-specified configurations shape its functionality and accessibility.
1. Definition of Data Products
In the context of KEDL, a data product (or dataset) is a unique entity that contains data and metadata. Datasets are managed centrally within the Data Lake, which gives a unified view of data assets across the entire Data Mesh. A dataset can be created in any account that is part of the mesh by choosing the AWS account and region from the dropdowns.
2. Dataset Attributes and Standard Options
Each dataset in KEDL contains essential information and standard options that define its characteristics and accessibility:
Region and Account: Specifies the region and AWS account where the dataset is to be created.
Owner: Indicates the owner of the dataset, who has administrative rights over it.
Unique Name: Assigns a distinct name to the dataset for easy identification and reference.
Crawler and Crawler Lineage: Refers to the AWS Glue crawler responsible for discovering the dataset and its lineage.
Optional Crawler Schedule and Classifier: Specifies additional settings for the Glue crawler, such as a schedule for regular updates and a classifier for data classification.
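The attributes listed above can be pictured as a single record per dataset. The following is a minimal sketch, not KEDL's actual data model: the `DatasetSpec` class, its field names, the `register` helper, and all sample values are hypothetical and chosen only to mirror the list.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical record of the standard dataset options (not KEDL's schema)."""
    name: str                                  # unique name of the dataset
    owner: str                                 # owner with administrative rights
    account_id: str                            # AWS account that hosts the dataset
    region: str                                # AWS region of the dataset
    crawler: str                               # AWS Glue crawler that discovers the data
    crawler_lineage: bool = False              # whether crawler lineage is enabled
    crawler_schedule: Optional[str] = None     # optional schedule for the crawler
    crawler_classifier: Optional[str] = None   # optional classifier for the crawler


def register(catalog: Dict[str, DatasetSpec], spec: DatasetSpec) -> None:
    """Add a dataset to an in-memory catalog, rejecting duplicate names."""
    if spec.name in catalog:
        raise ValueError(f"dataset name already in use: {spec.name}")
    catalog[spec.name] = spec


catalog: Dict[str, DatasetSpec] = {}
register(catalog, DatasetSpec(
    name="sales_orders",
    owner="alice@example.com",
    account_id="123456789012",
    region="eu-west-1",
    crawler="sales_orders_crawler",
    crawler_schedule="cron(0 2 * * ? *)",  # Glue cron syntax: daily at 02:00 UTC
))
```

Registering a second dataset named `sales_orders` in the same catalog raises `ValueError`, which is one simple way to model the unique-name requirement.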
3. Customer-Specified Configurations
KEDL allows customers to define two crucial configurations for their datasets:
Compliance: A map of boolean values representing compliance checks, such as GDPR (General Data Protection Regulation), GxP (Good Practice), and PII (Personally Identifiable Information). These checks determine whether a dataset qualifies as self-service or requires steward approval for access.
Metadata: The metadata map enables customers to specify additional fields for datasets, which can be mandatory or optional. These metadata fields can include details like cost center, department, or any other relevant information that aids in data classification and organization.
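The two configurations above can be sketched as plain dictionaries. This is an illustration under stated assumptions, not KEDL's implementation: the rule that any compliance flag set to true triggers steward approval is assumed here, and the function names, flag keys, and field names are hypothetical.

```python
from typing import Dict, List


def requires_steward_approval(compliance: Dict[str, bool]) -> bool:
    """Assumed rule: any flag set to True means access needs steward approval."""
    return any(compliance.values())


def missing_mandatory_fields(metadata: Dict[str, str],
                             schema: Dict[str, bool]) -> List[str]:
    """Return mandatory fields that are absent or empty in the metadata.

    `schema` maps a field name to True if the field is mandatory.
    """
    return sorted(
        field for field, mandatory in schema.items()
        if mandatory and not metadata.get(field)
    )


# Hypothetical customer configuration
compliance = {"gdpr": False, "gxp": False, "pii": True}
metadata_schema = {"cost_center": True, "department": False}

needs_approval = requires_steward_approval(compliance)
missing = missing_mandatory_fields({"department": "Finance"}, metadata_schema)
```

With these sample values, `needs_approval` is `True` because the `pii` flag is set, and `missing` is `["cost_center"]` because that mandatory field was not supplied.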
4. Tags for Enhanced Dataset Search
All datasets in KEDL are associated with tags, which are searchable terms relevant to the dataset. Tags let users quickly locate datasets based on specific attributes. For example, tags could include keywords related to the dataset's subject, data source, or any other information that helps with dataset discovery and use.
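Tag-based lookup can be illustrated with a small in-memory search. This is only a sketch: how KEDL actually indexes and matches tags is not described here, so the case-insensitive, match-all-tags behavior and the `find_by_tags` helper are assumptions, as are the sample dataset names.

```python
from typing import Dict, Iterable, List, Set


def find_by_tags(datasets: Dict[str, Set[str]], wanted: Iterable[str]) -> List[str]:
    """Return names of datasets that carry every requested tag (case-insensitive)."""
    wanted_norm = {tag.strip().lower() for tag in wanted}
    return sorted(
        name for name, tags in datasets.items()
        if wanted_norm <= {tag.lower() for tag in tags}
    )


# Hypothetical datasets and their tags
datasets = {
    "sales_orders": {"sales", "erp", "orders"},
    "sales_forecast": {"sales", "forecast"},
    "web_clicks": {"marketing", "clickstream"},
}

sales_datasets = find_by_tags(datasets, ["Sales"])
erp_sales = find_by_tags(datasets, ["sales", "ERP"])
```

Here `sales_datasets` contains `sales_forecast` and `sales_orders`, while `erp_sales` narrows the result to `sales_orders` only.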
5. Accessibility of Datasets through Athena and SageMaker
A key advantage of KEDL is that datasets are accessible through Amazon Athena. By assuming their personal role, users can query and analyze data stored in the Data Lake using standard SQL. Users can also be granted access to Amazon SageMaker, if specified, to perform advanced analytics and machine learning tasks.
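Assuming a role and submitting an Athena query might look like the following sketch. It uses the standard boto3 calls `assume_role` (STS) and `start_query_execution` (Athena); everything KEDL-specific is an assumption, including the role ARN format, the database name `kedl_sales`, the table name, and the S3 results location.

```python
from typing import Any, Dict


def athena_request(sql: str, database: str, output_location: str) -> Dict[str, Any]:
    """Build the keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query_as_role(role_arn: str, region: str, request: Dict[str, Any]) -> str:
    """Assume the given role, start the query, and return its execution ID."""
    import boto3  # third-party; imported here so the sketch loads without it

    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName="kedl-athena-query",  # hypothetical session name
    )["Credentials"]
    athena = boto3.client(
        "athena",
        region_name=region,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    return athena.start_query_execution(**request)["QueryExecutionId"]


# Hypothetical query, database, and results bucket
request = athena_request(
    "SELECT * FROM sales_orders LIMIT 10",
    database="kedl_sales",
    output_location="s3://example-athena-results/",
)
```

Calling `run_query_as_role` with the ARN of the personal role, the dataset's region, and `request` would submit the query. It needs valid AWS credentials and is therefore not executed in this sketch.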
Conclusion
Data products, as the building blocks of KEDL, streamline data management, analysis, and collaboration within organizations. With their centrally managed attributes and customer-specified configurations, datasets in KEDL can be efficiently accessed and used across the Data Mesh. In the following articles, we will explore user management, dataset creation, data querying, and other essential aspects of using KEDL in practice.