This content originally appeared on DEV Community and was authored by soul-o mutwiri
- Build and manage data infrastructure and platforms:
  - databases
  - cloud data warehouses and related services, such as Amazon S3, AWS Glue, and Amazon Redshift
- Ingest data from various sources:
  - Use tools like AWS Glue jobs or AWS Lambda functions to ingest data from databases, applications, files, and streaming devices into a centralized data platform.
- Prepare ingested data for analytics:
  - Use AWS Glue, Apache Spark, or Amazon EMR to prepare data by cleaning, transforming, and enriching it.
- Catalog and document curated datasets:
  - Use AWS Glue crawlers to determine the format and schema of the data, group it into tables, and write the metadata to the AWS Glue Data Catalog.
  - Use metadata tagging in the Data Catalog for data governance, compliance, and discoverability.
- Automate regular data workflows and pipelines:
  - Simplify and accelerate data processing using services like AWS Glue workflows, AWS Lambda, or AWS Step Functions.
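
The prepare, catalog, and automate responsibilities above can be sketched in plain Python. This is only an illustration under stated assumptions, not AWS code: it uses the standard library instead of AWS Glue or Spark, and the sample records, the `REGION_BY_COUNTRY` lookup, and every function name are hypothetical. A real crawler or Glue job does far more, but the shape of each stage is similar.

```python
from datetime import datetime

# Hypothetical raw records, as they might arrive from an ingestion job.
RAW_ORDERS = [
    {"order_id": "1001", "amount": " 25.50 ", "country": "ke", "ordered_at": "2024-03-01"},
    {"order_id": "1002", "amount": "", "country": "US", "ordered_at": "2024-03-02"},
    {"order_id": "1003", "amount": "12", "country": "ug", "ordered_at": "2024-03-02"},
]

# Hypothetical lookup table used to enrich each record.
REGION_BY_COUNTRY = {"KE": "East Africa", "UG": "East Africa", "US": "North America"}


def clean(records):
    """Cleaning: drop records that have no usable amount."""
    return [r for r in records if r["amount"].strip()]


def transform(record):
    """Transforming: cast strings to proper types and normalize casing."""
    return {
        "order_id": int(record["order_id"]),
        "amount": float(record["amount"].strip()),
        "country": record["country"].upper(),
        "ordered_at": datetime.strptime(record["ordered_at"], "%Y-%m-%d").date(),
    }


def enrich(record):
    """Enriching: add a region column derived from the country code."""
    return {**record, "region": REGION_BY_COUNTRY.get(record["country"], "Unknown")}


def infer_schema(records):
    """Cataloging (toy version): derive column names and types from the data."""
    if not records:
        return {}
    return {column: type(value).__name__ for column, value in records[0].items()}


def run_pipeline(raw_records):
    """Automation: chain the stages so one call runs the whole workflow."""
    curated = [enrich(transform(r)) for r in clean(raw_records)]
    return curated, infer_schema(curated)


if __name__ == "__main__":
    curated, schema = run_pipeline(RAW_ORDERS)
    print(schema)
    for row in curated:
        print(row)
```

Keeping each stage as a small function means the same steps could later be moved into a managed service and scheduled, without changing what each step is responsible for.
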
The data engineer builds the system that delivers usable data to the data analyst, who queries and analyzes the data to produce business insights, reports, and visualizations.
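
The handoff between the two roles might look roughly like the sketch below. It is a minimal illustration that assumes an in-memory SQLite database as a stand-in for a real warehouse such as Amazon Redshift; the `orders` table, its rows, and the function names are hypothetical.

```python
import sqlite3

# Hypothetical curated rows, standing in for the output of the engineer's pipeline.
CURATED_ORDERS = [
    (1001, 25.5, "East Africa"),
    (1003, 12.0, "East Africa"),
    (1004, 40.0, "North America"),
]


def load_curated(conn, rows):
    """Engineer side: publish curated data as a queryable table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, region TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def revenue_by_region(conn):
    """Analyst side: aggregate the curated table to answer a business question."""
    query = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_curated(conn, CURATED_ORDERS)
    for region, total in revenue_by_region(conn):
        print(region, total)
```

The analyst only needs the table name and its columns; how the data was ingested and cleaned stays on the engineer's side of the boundary.
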
Before a data engineer begins, these questions must be answered:
- Which data should be analyzed? What is its value to the business or organization?
- Who owns the data? Where is it located?
- Is the data usable in its current state? What transformations are required?
- Who needs to see the data?
- After the data is curated and ready for consumption, how should it be presented?