AWS Glue

Serverless data prep and pipelines on AWS with visual and code-first workflows
Website: aws.amazon.com

Open your first project by wiring up sources and a destination, then let Glue do the legwork. Point a crawler at S3 buckets, JDBC databases, or streaming inputs to infer columns, partition keys, and formats. Organize discovered tables in the central catalog, add tags for ownership and sensitivity, and set permissions through IAM/Lake Formation so teams see only what they should. Schedule the crawler to refresh schemas as data lands, and you’ve built a reliable inventory you can query from Athena, Redshift, or notebooks within minutes.
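The crawler setup above can be sketched as a boto3-style call. The bucket, role ARN, database, and crawler names below are placeholders, not values from any real account; only the parameter shapes follow the Glue `create_crawler` API.

```python
# Sketch of the crawler setup described above, using boto3-style parameters.
# Bucket, role, and names are illustrative placeholders.
crawler_params = {
    "Name": "sales-raw-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "sales_catalog",
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/raw/sales/"}]},
    # Refresh schemas daily as new data lands (cron expression in UTC).
    "Schedule": "cron(0 6 * * ? *)",
    # Update existing tables when columns change; log deletions rather than
    # dropping tables from the catalog.
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
}

# With AWS credentials configured, this would register and start the crawler:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_params)
# glue.start_crawler(Name=crawler_params["Name"])
```

Once the crawler has run, the discovered tables appear in the catalog database named above and are immediately queryable from Athena.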

Next, turn cataloged data into usable datasets. In Glue Studio, sketch the flow: read nodes for inputs, transforms for joins, filters, and type casts, and a target node for S3, Redshift, or Snowflake via connectors. Prefer writing code? Author Spark jobs in Python or Scala, import libraries, and parameterize paths and dates for reuse. Enable incremental loads with bookmarks, partition outputs by date or keys, compress with Parquet/ORC, and set worker type and autoscaling to balance speed and cost. Chain steps with triggers, handle retries on failure, and emit metrics and logs to CloudWatch so you can trace bottlenecks and fix them fast.
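Launching a parameterized Spark job with bookmarks and worker sizing, as described above, might look like the following. The job name, paths, and date are hypothetical; the argument keys (`--job-bookmark-option` and the `WorkerType`/`NumberOfWorkers` fields) follow the Glue `start_job_run` API.

```python
# One way to launch the Spark job described above, with incremental bookmarks
# and worker sizing set. Job name and S3 paths are illustrative placeholders.
job_run_params = {
    "JobName": "sales-transform",
    "Arguments": {
        # Enable job bookmarks so reruns skip already-processed input.
        "--job-bookmark-option": "job-bookmark-enable",
        # Parameterized paths and dates, read inside the script for reuse.
        "--source_path": "s3://example-bucket/raw/sales/",
        "--target_path": "s3://example-bucket/curated/sales/",
        "--run_date": "2024-01-15",
    },
    # Right-size workers to balance speed and cost.
    "WorkerType": "G.1X",
    "NumberOfWorkers": 10,
}

# With AWS credentials configured:
# import boto3
# boto3.client("glue").start_job_run(**job_run_params)
```

Inside the job script, the same `--source_path` and `--run_date` arguments are resolved at runtime, so one script serves every daily partition.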

For analyst-driven cleanup, launch DataBrew. Sample a dataset, profile it for missing values and outliers, and build a point-and-click recipe: split columns, standardize formats, deduplicate, fuzzy-match names, and validate with rules (e.g., email regex, value ranges). Preview on a sample, then run the recipe at scale and publish results to S3 or push into Redshift. Store recipes in versioned projects, share them across teams, and attach jobs to a schedule so refreshed, trustworthy data is always waiting for dashboards and ML features.
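The validation rules mentioned above (email regex, value ranges) can be mocked up in plain Python. This is a local stand-in for the kind of checks a DataBrew recipe applies, not DataBrew's own API; the rule set and sample rows are invented for illustration.

```python
import re

# Plain-Python stand-in for DataBrew-style validation rules.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_row(row):
    """Return a list of rule violations for one record."""
    errors = []
    # Rule 1: email must match a basic address pattern.
    if not EMAIL_RE.match(row.get("email", "")):
        errors.append("invalid email")
    # Rule 2: value-range check, e.g. order totals must be 0..100,000.
    if not (0 <= row.get("order_total", -1) <= 100_000):
        errors.append("order_total out of range")
    return errors

rows = [
    {"email": "ana@example.com", "order_total": 120.50},
    {"email": "not-an-email", "order_total": -5},
]
# Split passing rows from rows that would be quarantined.
clean = [r for r in rows if not validate_row(r)]
quarantined = [r for r in rows if validate_row(r)]
```

At scale, the same split drives the quarantine pattern described below: clean rows continue to the target, failing rows land in a separate path with their violation list attached.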

Finally, run it like production. Group jobs into a workflow that ingests, transforms, checks quality, and loads to analytics stores. Use Data Quality checks to block bad data, notify on drift, and write diagnostics to a quarantine path. Secure connections in a VPC, secrets in Secrets Manager, and audit access through the catalog. Trigger pipelines from events (new file in S3) or CI/CD (CodePipeline) and manage infrastructure as code with CloudFormation or Terraform. Keep costs predictable by right-sizing DPUs, using idle timeouts, and pruning unnecessary repartitions. With these practices, you can ship dependable pipelines quickly—without managing servers or bespoke schedulers.
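The event-driven trigger above (new file in S3 starts a pipeline) is commonly wired through a small Lambda handler. The sketch below assumes a hypothetical workflow name and S3 prefix; the event shape follows the standard S3 notification format, and the actual Glue call is shown as a comment.

```python
# Sketch of an event-driven trigger: an S3 "new object" notification starts a
# Glue workflow. Workflow name and prefix are assumptions for illustration.
def handler(event, context=None):
    started = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Only react to landing-zone files, not our own curated outputs.
        if key.startswith("raw/"):
            # With AWS credentials configured:
            # boto3.client("glue").start_workflow_run(Name="sales-pipeline")
            started.append((bucket, key))
    return {"started": started}

# Example S3 notification payload (abridged to the fields used above).
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "raw/sales/2024-01-15.csv"}}}
    ]
}
```

Filtering on the prefix inside the handler (or in the S3 notification configuration itself) avoids retrigger loops when the pipeline writes its outputs back to the same bucket.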

Review Summary

Features

  • No-server execution for Spark-based jobs
  • Visual pipeline builder and code authoring
  • Automated schema discovery and table cataloging
  • Central metadata catalog with access controls
  • Incremental processing with job bookmarks
  • Built-in data quality rules and profiling
  • Workflow orchestration with event triggers
  • Connectors for S3, JDBC, Redshift, Snowflake, and more
  • Parameterization and reusable job templates
  • Monitoring via metrics, logs, and alerts

How It’s Used

  • Batch ingest from S3 and relational sources into partitioned analytics tables
  • Data cleansing and standardization for BI dashboards using DataBrew
  • Feature engineering pipelines feeding SageMaker training jobs
  • CDC-style incremental merges from OLTP databases into a data lake
  • Join and reconcile customer records from multiple systems for a 360 view
  • Quality gates that quarantine bad data and alert data owners
  • Event-driven pipelines that start when new files arrive in S3
  • SQL access to curated datasets via Athena and Redshift Spectrum

Plans & Pricing

ETL Jobs and Development Endpoints

$0.44

$0.44 per DPU-Hour, billed per second, with a 1-minute minimum (Glue version 2.0) or 10-minute minimum (Glue version 0.9/1.0) for each ETL job of type Apache Spark
$0.44 per DPU-Hour, billed per second, with a 1-minute minimum for each ETL job of type Python shell
$0.44 per DPU-Hour, billed per second, with a 10-minute minimum for each provisioned development endpoint
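As a worked example of the Spark ETL rate above ($0.44 per DPU-hour, billed per second, 1-minute minimum on Glue 2.0 and 10-minute minimum on 0.9/1.0), a small cost helper:

```python
# Worked example of the Spark ETL pricing above.
def etl_job_cost(dpus, runtime_seconds, glue_version="2.0"):
    # 1-minute minimum on Glue 2.0; 10-minute minimum on 0.9/1.0.
    minimum = 60 if glue_version == "2.0" else 600
    billed_seconds = max(runtime_seconds, minimum)
    return dpus * (billed_seconds / 3600) * 0.44

# A 10-DPU job running 5 minutes on Glue 2.0:
# 10 * (300 / 3600) * 0.44 ≈ $0.37
cost = etl_job_cost(10, 300)
```

Note how the minimum dominates for very short jobs: a 30-second run on Glue 0.9/1.0 is still billed as 10 minutes.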

Data Catalog Storage and Requests

$1.00 per month

Storage:

  • Free for the first million objects stored
  • $1.00 per 100,000 objects stored above 1M, per month

Requests:

  • Free for the first million requests per month
  • $1.00 per million requests above 1M in a month

Crawlers and DataBrew Interactive Sessions

$0.44

$0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run

DataBrew Jobs

$0.48

Billed per minute, with a 1-minute minimum per job run
