AWS Glue

Serverless data prep and pipelines on AWS with visual and code-first workflows
Website: aws.amazon.com

Open your first project by wiring up sources and a destination, then let Glue do the legwork. Point a crawler at S3 buckets, JDBC databases, or streaming inputs to infer columns, partition keys, and formats. Organize discovered tables in the central catalog, add tags for ownership and sensitivity, and set permissions through IAM/Lake Formation so teams see only what they should. Schedule the crawler to refresh schemas as data lands, and you’ve built a reliable inventory you can query from Athena, Redshift, or notebooks within minutes.
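The crawler setup above can be sketched as a boto3-style call. The bucket, role ARN, database, and crawler names below are placeholders, not values from any real account; only the parameter shapes follow the Glue `create_crawler` API.

```python
# Sketch of the crawler setup described above, using boto3-style parameters.
# Bucket, role, and names are illustrative placeholders.
crawler_params = {
    "Name": "sales-raw-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "sales_catalog",
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/raw/sales/"}]},
    # Refresh schemas daily as new data lands (cron expression in UTC).
    "Schedule": "cron(0 6 * * ? *)",
    # Update existing tables when columns change; log deletions rather than
    # dropping tables from the catalog.
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
}

# With AWS credentials configured, this would register and start the crawler:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_params)
# glue.start_crawler(Name=crawler_params["Name"])
```

Once the crawler has run, the discovered tables appear in the catalog database named above and are immediately queryable from Athena.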

Next, turn cataloged data into usable datasets. In Glue Studio, sketch the flow: read nodes for inputs, transforms for joins, filters, and type casts, and a target node for S3, Redshift, or Snowflake via connectors. Prefer writing code? Author Spark jobs in Python or Scala, import libraries, and parameterize paths and dates for reuse. Enable incremental loads with bookmarks, partition outputs by date or keys, compress with Parquet/ORC, and set worker type and autoscaling to balance speed and cost. Chain steps with triggers, handle retries on failure, and emit metrics and logs to CloudWatch so you can trace bottlenecks and fix them fast.
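Launching a parameterized Spark job with bookmarks and worker sizing, as described above, might look like the following. The job name, paths, and date are hypothetical; the argument keys (`--job-bookmark-option` and the `WorkerType`/`NumberOfWorkers` fields) follow the Glue `start_job_run` API.

```python
# One way to launch the Spark job described above, with incremental bookmarks
# and worker sizing set. Job name and S3 paths are illustrative placeholders.
job_run_params = {
    "JobName": "sales-transform",
    "Arguments": {
        # Enable job bookmarks so reruns skip already-processed input.
        "--job-bookmark-option": "job-bookmark-enable",
        # Parameterized paths and dates, read inside the script for reuse.
        "--source_path": "s3://example-bucket/raw/sales/",
        "--target_path": "s3://example-bucket/curated/sales/",
        "--run_date": "2024-01-15",
    },
    # Right-size workers to balance speed and cost.
    "WorkerType": "G.1X",
    "NumberOfWorkers": 10,
}

# With AWS credentials configured:
# import boto3
# boto3.client("glue").start_job_run(**job_run_params)
```

Inside the job script, the same `--source_path` and `--run_date` arguments are resolved at runtime, so one script serves every daily partition.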

For analyst-driven cleanup, launch DataBrew. Sample a dataset, profile it for missing values and outliers, and build a point-and-click recipe: split columns, standardize formats, deduplicate, fuzzy-match names, and validate with rules (e.g., email regex, value ranges). Preview on a sample, then run the recipe at scale and publish results to S3 or push into Redshift. Store recipes in versioned projects, share them across teams, and attach jobs to a schedule so refreshed, trustworthy data is always waiting for dashboards and ML features.
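The validation rules mentioned above (email regex, value ranges) can be mocked up in plain Python. This is a local stand-in for the kind of checks a DataBrew recipe applies, not DataBrew's own API; the rule set and sample rows are invented for illustration.

```python
import re

# Plain-Python stand-in for DataBrew-style validation rules.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_row(row):
    """Return a list of rule violations for one record."""
    errors = []
    # Rule 1: email must match a basic address pattern.
    if not EMAIL_RE.match(row.get("email", "")):
        errors.append("invalid email")
    # Rule 2: value-range check, e.g. order totals must be 0..100,000.
    if not (0 <= row.get("order_total", -1) <= 100_000):
        errors.append("order_total out of range")
    return errors

rows = [
    {"email": "ana@example.com", "order_total": 120.50},
    {"email": "not-an-email", "order_total": -5},
]
# Split passing rows from rows that would be quarantined.
clean = [r for r in rows if not validate_row(r)]
quarantined = [r for r in rows if validate_row(r)]
```

At scale, the same split drives the quarantine pattern described below: clean rows continue to the target, failing rows land in a separate path with their violation list attached.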

Finally, run it like production. Group jobs into a workflow that ingests, transforms, checks quality, and loads to analytics stores. Use Data Quality checks to block bad data, notify on drift, and write diagnostics to a quarantine path. Secure connections in a VPC, secrets in Secrets Manager, and audit access through the catalog. Trigger pipelines from events (new file in S3) or CI/CD (CodePipeline) and manage infrastructure as code with CloudFormation or Terraform. Keep costs predictable by right-sizing DPUs, using idle timeouts, and pruning unnecessary repartitions. With these practices, you can ship dependable pipelines quickly—without managing servers or bespoke schedulers.
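The event-driven trigger above (new file in S3 starts a pipeline) is commonly wired through a small Lambda handler. The sketch below assumes a hypothetical workflow name and S3 prefix; the event shape follows the standard S3 notification format, and the actual Glue call is shown as a comment.

```python
# Sketch of an event-driven trigger: an S3 "new object" notification starts a
# Glue workflow. Workflow name and prefix are assumptions for illustration.
def handler(event, context=None):
    started = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Only react to landing-zone files, not our own curated outputs.
        if key.startswith("raw/"):
            # With AWS credentials configured:
            # boto3.client("glue").start_workflow_run(Name="sales-pipeline")
            started.append((bucket, key))
    return {"started": started}

# Example S3 notification payload (abridged to the fields used above).
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "raw/sales/2024-01-15.csv"}}}
    ]
}
```

Filtering on the prefix inside the handler (or in the S3 notification configuration itself) avoids retrigger loops when the pipeline writes its outputs back to the same bucket.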

Review Summary

Features

  • No-server execution for Spark-based jobs
  • Visual pipeline builder and code authoring
  • Automated schema discovery and table cataloging
  • Central metadata catalog with access controls
  • Incremental processing with job bookmarks
  • Built-in data quality rules and profiling
  • Workflow orchestration with event triggers
  • Connectors for S3, JDBC, Redshift, Snowflake, and more
  • Parameterization and reusable job templates
  • Monitoring via metrics, logs, and alerts

How It’s Used

  • Batch ingest from S3 and relational sources into partitioned analytics tables
  • Data cleansing and standardization for BI dashboards using DataBrew
  • Feature engineering pipelines feeding SageMaker training jobs
  • CDC-style incremental merges from OLTP databases into a data lake
  • Join and reconcile customer records from multiple systems for a 360 view
  • Quality gates that quarantine bad data and alert data owners
  • Event-driven pipelines that start when new files arrive in S3
  • SQL access to curated datasets via Athena and Redshift Spectrum

Plans & Pricing

ETL Jobs and Development Endpoints

$0.44

$0.44 per DPU-Hour, billed per second, with a 1-minute minimum (Glue version 2.0) or 10-minute minimum (Glue version 0.9/1.0) for each ETL job of type Apache Spark
$0.44 per DPU-Hour, billed per second, with a 1-minute minimum for each ETL job of type Python shell
$0.44 per DPU-Hour, billed per second, with a 10-minute minimum for each provisioned development endpoint
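As a worked example of the Spark ETL rate above ($0.44 per DPU-hour, billed per second, 1-minute minimum on Glue 2.0 and 10-minute minimum on 0.9/1.0), a small cost helper:

```python
# Worked example of the Spark ETL pricing above.
def etl_job_cost(dpus, runtime_seconds, glue_version="2.0"):
    # 1-minute minimum on Glue 2.0; 10-minute minimum on 0.9/1.0.
    minimum = 60 if glue_version == "2.0" else 600
    billed_seconds = max(runtime_seconds, minimum)
    return dpus * (billed_seconds / 3600) * 0.44

# A 10-DPU job running 5 minutes on Glue 2.0:
# 10 * (300 / 3600) * 0.44 ≈ $0.37
cost = etl_job_cost(10, 300)
```

Note how the minimum dominates for very short jobs: a 30-second run on Glue 0.9/1.0 is still billed as 10 minutes.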

Data Catalog Storage and Requests

$1.00 per month

Storage:

  • Free for the first million objects stored
  • $1.00 per 100,000 objects stored above 1M, per month

Requests:

  • Free for the first million requests per month
  • $1.00 per million requests above 1M in a month

Crawlers and DataBrew Interactive Sessions

$0.44

$0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run

DataBrew Jobs

$0.48

Billed per minute, with a 1-minute minimum per job run
