Usage Instructions for KEDL
This article shows how to use the KeyCore Enterprise Data Lake (KEDL) effectively in daily data operations. From accessing the KEDL web application to querying data and creating ETL jobs, we cover the essential steps for leveraging KEDL's capabilities.
1. Accessing the KEDL Web Application
To begin using KEDL, users need to access the web application. Follow these steps to get started:
Open a web browser and navigate to the KEDL web application URL provided by your organization (e.g., https://dsc.marsdenmark.dk).
Log in using your organization's identity-provider credentials (Cognito User Pool or Azure AD).
Once logged in, you will have access to the KEDL dashboard, where you can start working with data products.
2. Viewing and Searching Datasets
After logging in, you can explore and search for datasets within KEDL. Here's how:
Navigate to the "Datasets" page on the KEDL dashboard.
You can view a list of available datasets along with their essential details such as name, owner, and tags.
To search for specific datasets, use the search functionality and filter datasets based on tags or names.
3. Creating a New Dataset
If you need to create a new dataset in KEDL, follow these steps:
From the "Datasets" page, click on the "New Dataset" button.
Provide the required information for the dataset, including region, account, owner, and a unique name.
Optionally, specify crawler details, schedule, and classifier if needed.
Enter customer-specified configurations like compliance and metadata fields.
Click "Create" to create the dataset.
The system will process the dataset creation, and a confirmation message will appear, reporting success or failure.
4. Verifying Dataset Access and Subscribing to Datasets
As a member, you may need access to specific datasets. Here's how you can verify access and subscribe to datasets:
Visit the "Datasets" page.
Click on the dataset you want to access.
Under "Members," verify that you have access to the dataset.
If you need access to a dataset, click on the "Subscribe" button.
In the popup, choose the profile you need for this dataset (e.g., Consumer or Developer).
Click "Ok" to subscribe to the dataset.
For non-self-service datasets, the request will go to the dataset owner or stewards for approval.
5. Managing Dataset Access as an Owner/Steward
As an owner or steward, you have the authority to grant or revoke access to datasets. Here's how to manage dataset access:
Go to the "Datasets" page.
Click on the dataset you wish to administer.
From the list of users, choose the user you want to add as a dataset member.
Click the arrow pointing right and choose their profile from the popup.
Accept the changes in the popup to grant access to the user.
To revoke access, choose the member from the list and click the arrow pointing left.
Accept the changes in the popup to remove their access.
6. Uploading Data to S3 and Creating ETL Jobs
KEDL allows you to upload data to Amazon S3 and create ETL (Extract, Transform, Load) jobs to process and analyze the data. Follow these steps to upload data and create ETL jobs:
Go to the "Credentials" page and either copy your credentials into your terminal or access the AWS Console.
If using AWS CLI, use the "aws s3 cp" command to upload data from your local machine to the designated S3 bucket and dataset.
In the AWS Console, go to the AWS Glue service.
Choose "ETL jobs" and create the desired type of ETL job (e.g., Visual job).
Design the transformation, specify the source and target datasets, and choose the format and compression settings.
Fill out the job details, ensuring to use the correct S3 bucket names generated by KEDL.
Turn off "Use Glue Data Catalog as the Hive metastore".
Choose the appropriate role (e.g., KEDLDatarole-<your email>) for the job to run.
Save the job and run it. Be prepared for potential errors and refine the transformation as needed.
Consider adding a schedule and enabling job bookmarks so the job only processes new data on subsequent runs.
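As a rough sketch, the upload step with the AWS CLI might look like the following. The bucket and dataset names here are placeholders, not real KEDL names; substitute the bucket and dataset names KEDL generated for you, as shown in the web application:

```shell
#!/bin/sh
# Placeholder names -- replace with the bucket and dataset KEDL created for you.
BUCKET="kedl-example-bucket"
DATASET="mydataset"

# Upload a single file into the dataset's prefix:
#   aws s3 cp ./data.csv "s3://${BUCKET}/${DATASET}/data.csv"
# Or sync an entire local directory:
#   aws s3 sync ./export/ "s3://${BUCKET}/${DATASET}/"

# Print the target URI so you can verify it before uploading:
echo "s3://${BUCKET}/${DATASET}/data.csv"
```

Remember to load your KEDL credentials into the terminal (from the "Credentials" page) before running the upload, or the CLI will be rejected by S3.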
7. Using JDBC and ODBC for Data Access
KEDL provides support for accessing data through JDBC and ODBC drivers. Here's how to use them:
Install the necessary driver for JDBC or ODBC based on the provided documentation.
Go to the "Credentials" page on the KEDL web application.
Choose the region you want to access data in.
Copy the JDBC or ODBC connect string from the page and insert it into your client configuration.
Additionally, copy the associated profile and insert it into your AWS credentials file.
Be cautious of any hidden characters that may cause issues with the connection.
You can now access data using standard SQL queries in your client.
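One way to guard against hidden characters is to pass the copied connect string through a filter before pasting it into your client configuration. A minimal sketch in POSIX shell (the connect-string fragment below is just an example value):

```shell
#!/bin/sh
# Carriage returns and other control characters sometimes sneak in when
# copying a connect string from a browser, and they silently break
# JDBC/ODBC connections. tr strips them out:
printf 'AwsRegion=eu-west-1;\r\n' | tr -d '[:cntrl:]'
```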
NB: Depending on your local ODBC installation, the connect string may need to contain a DRIVER field. To add one, create an SSM Parameter named /kedl/odbc/driver with the value DRIVER=Simba Athena ODBC Driver; adjusting the driver name between = and ; to match your installation. Log out and log back in, and the DRIVER field should now appear in your connect string.
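For illustration only, a connect string with the DRIVER field prepended, and a matching profile in the AWS credentials file, might look like the fragment below. All values are placeholders, and the key names assume the Simba Athena ODBC driver; always copy the real connect string and profile from the "Credentials" page rather than typing them by hand:

```
# Example ODBC connect string (placeholder values):
DRIVER=Simba Athena ODBC Driver;AwsRegion=eu-west-1;S3OutputLocation=s3://example-results/;AWSProfile=kedl-example;

# Matching profile in ~/.aws/credentials (placeholder values):
[kedl-example]
aws_access_key_id     = EXAMPLEACCESSKEYID
aws_secret_access_key = examplesecretaccesskey
aws_session_token     = examplesessiontoken
```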
8. Querying Data with Standard SQL
Whether using Athena, JDBC, or ODBC, querying data in KEDL is done using standard SQL. Follow these steps:
Open your preferred SQL client or the Athena console, depending on your method of access.
If using Athena, ensure you have chosen the correct workgroup, typically "KEDL-workgroup."
Prefix the table name with the database name to access datasets (e.g., select * from viking_curated.viking limit 10;).
Execute your SQL query to retrieve the required data.
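For example, building on the guide's viking_curated.viking dataset (the column name in the second query is a placeholder for one of your own columns):

```sql
-- Always prefix the table with its database name:
SELECT * FROM viking_curated.viking LIMIT 10;

-- Aggregations use the same prefix; "some_column" is a placeholder:
SELECT some_column, COUNT(*) AS n
FROM viking_curated.viking
GROUP BY some_column
ORDER BY n DESC
LIMIT 20;
```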
Querying data in KEDL with SQL provides you with a powerful tool to explore and analyze datasets effectively.