This content originally appeared on DEV Community and was authored by GargeeBhatnagar
“I have checked the AWS documentation on filtering and normalizing data using AWS Glue DataBrew. AWS Glue DataBrew makes it easy and secure to create a recipe and publish it in versions. In terms of cost, the solution is inexpensive and secure.”
AWS Glue DataBrew is a visual data preparation tool that enables users to clean and normalize data without writing any code. Using DataBrew helps reduce the time it takes to prepare data for analytics and machine learning by up to 80 percent compared to custom-developed data preparation. You can choose from over 250 ready-made transformations to automate data preparation tasks, such as filtering anomalies, converting data to standard formats, and correcting invalid values.
In this post, you will walk through filtering and normalizing data using AWS Glue DataBrew. I created a project and a recipe, and published the recipe in versions. I also created a profile job with an S3 bucket as the job output location.
Architecture Overview
The architecture diagram shows the overall deployment architecture, including the data flow between the S3 bucket and AWS Glue DataBrew.
Solution Overview
The blog post consists of the following phases:
- Creation of Project and Recipe Using AWS Glue DataBrew
- Creation of a Profile Job with the Required Inputs and an S3 Job Output Location
- Output of Dataset Preview, Data Profile Overview, Column Statistics and Data Lineage
Phase 1: Creation of Project and Recipe Using AWS Glue DataBrew
- Open the AWS Glue DataBrew console and create a project, providing the project and recipe details: a project name, a dataset (select one and give it a name), and a role name. Once the project is created, add a filter-by-condition step: choose the source column and its condition. Once the recipe is ready, publish the recipe version with the required description. You can also use the Group transformation to collect multiple values into a single list. A minimal boto3 sketch of these steps follows this list.
- The console then shows the dataset, the S3 location of the input file, the project, the recipe, and the data lineage.
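For readers who prefer the API over the console, here is a minimal boto3 sketch of the same steps. The dataset name, bucket, key, column, and role ARN are placeholder assumptions, and the filter step shown (removing rows where a column has missing values) is just one example of a filter-by-condition recipe step from the DataBrew recipe actions reference.

```python
import boto3

# All resource names, the bucket/key, and the role ARN are placeholders.
databrew = boto3.client("databrew")

# Register the source file in S3 as a DataBrew dataset.
databrew.create_dataset(
    Name="sales-dataset",
    Format="CSV",
    Input={"S3InputDefinition": {"Bucket": "my-input-bucket", "Key": "raw/sales.csv"}},
)

# A one-step recipe: remove rows where the "amount" column is missing a value
# (one example of a filter-by-condition step).
databrew.create_recipe(
    Name="sales-filter-recipe",
    Steps=[
        {
            "Action": {"Operation": "REMOVE_VALUES", "Parameters": {"sourceColumn": "amount"}},
            "ConditionExpressions": [{"Condition": "IS_MISSING", "TargetColumn": "amount"}],
        }
    ],
)

# Tie the dataset and recipe together in a project.
databrew.create_project(
    Name="sales-project",
    DatasetName="sales-dataset",
    RecipeName="sales-filter-recipe",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewRole",
)

# Publish the recipe as an immutable version with a description.
databrew.publish_recipe(
    Name="sales-filter-recipe",
    Description="Filter out rows with missing amounts",
)
```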
Phase 2: Creation of a Profile Job with the Required Inputs and an S3 Job Output Location
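This phase can also be scripted. Below is a minimal boto3 sketch that creates a profile job over the dataset from Phase 1 and points its output at an S3 location; the names, bucket, and role ARN are the same placeholders as above.

```python
import boto3

databrew = boto3.client("databrew")

# Create a profile job that analyzes the dataset and writes its
# statistics to the given S3 output location (placeholders throughout).
databrew.create_profile_job(
    Name="sales-profile-job",
    DatasetName="sales-dataset",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewRole",
    OutputLocation={"Bucket": "my-output-bucket", "Key": "profiles/"},
)

# Start the job; DataBrew runs it asynchronously.
run = databrew.start_job_run(Name="sales-profile-job")
print("Started run:", run["RunId"])
```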
Phase 3: Output of Dataset Preview, Data Profile Overview, Column Statistics and Data Lineage
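The dataset preview, data profile overview, column statistics, and data lineage views live in the console, but the run state of the profile job, and the S3 prefix where it writes its profile JSON, can be checked programmatically. A minimal sketch, using the placeholder names from the earlier snippets:

```python
import boto3

databrew = boto3.client("databrew")

# List the runs of the profile job and report their state.
runs = databrew.list_job_runs(Name="sales-profile-job")
for run in runs["JobRuns"]:
    print(run["RunId"], run["State"])  # e.g. RUNNING, SUCCEEDED, FAILED

# Once a run has SUCCEEDED, the profile statistics are written as JSON
# under the job's S3 output location and can be listed with boto3.
s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket="my-output-bucket", Prefix="profiles/")
for obj in listing.get("Contents", []):
    print(obj["Key"])
```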
Clean-up
To avoid ongoing charges, delete the AWS Glue DataBrew resources (project, recipe, and profile job) and the S3 bucket.
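As a sketch with the same placeholder names, the DataBrew resources can be deleted through boto3 as well; the S3 bucket and its objects are removed separately (for example, from the S3 console).

```python
import boto3

databrew = boto3.client("databrew")

# Delete in dependency order: the job and project first, then the
# recipe versions, then the dataset (placeholder names throughout).
databrew.delete_job(Name="sales-profile-job")
databrew.delete_project(Name="sales-project")

# Each published recipe version is deleted individually;
# "LATEST_WORKING" removes the unpublished working copy.
databrew.delete_recipe_version(Name="sales-filter-recipe", RecipeVersion="1.0")
databrew.delete_recipe_version(Name="sales-filter-recipe", RecipeVersion="LATEST_WORKING")

databrew.delete_dataset(Name="sales-dataset")
```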
Pricing
I review the pricing and estimated cost of this example.
Cost of AWS Glue DataBrew:
- DataBrew job: $0.48 per node-hour × 0.218 node-hours = $0.10
- AWS Glue Data Catalog requests: $0.00 × 9 requests = $0.00
- DataBrew interactive sessions: $0.00 × 1 session = $0.00
Cost of Amazon Simple Storage Service (S3) = $0.03
Total cost = $0.10 + $0.00 + $0.00 + $0.03 = $0.13
Summary
In this post, I showed how to filter and normalize data using AWS Glue DataBrew.
For more details, check out Getting started with AWS Glue DataBrew or open the AWS Glue DataBrew console. To learn more, read the AWS Glue DataBrew documentation.
Thanks for reading!
Connect with me: LinkedIn