🥉 Bronze layer: ingestion & landing
Deep-dive companion to Medallion architecture with ELTMaestro. The Bronze layer captures source data as-is and lands it for downstream processing. Goal: faithful, append-only, replayable raw history — minimal transformation.
Ingestion patterns
Pick the pattern that matches the source and the change rate:
| Pattern | Steps | When |
|---|---|---|
| Full / snapshot load | Jdbcsource / Datasource → land | Small/reference tables, or sources with no reliable change marker. |
| Incremental (watermark) | Watermark + Jdbcsource/Datasource | Source has a monotonic column (id / updated_at). The Watermark step persists the last position between runs and pulls only newer rows. |
| CDC / upsert | isCDC + $QUERY_CDC_DELETE / $QUERY_CDC_INSERT | Source emits inserts/updates/deletes you want to apply as a delete-then-insert merge. |
| File drop | Local_file, Sftp, Pullfile | Files delivered to a folder / SFTP / S3. |
| Event-driven | FileWatch, FileScanner, RemoteFileScanner, JdbcWatch, ScriptWatch | Trigger a landing run when a file appears or a source condition is met. |
| Connector extract | Mongo / Salesforce extract paths | NoSQL / SaaS sources. |
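The delete-then-insert merge named in the CDC / upsert row can be illustrated outside the product. The sketch below is a generic Python/SQLite illustration, not ELTMaestro's implementation of isCDC or the $QUERY_CDC_DELETE / $QUERY_CDC_INSERT queries. The function name, the `(op, row)` change format, and the table and column names are all assumptions made for the example.

```python
import sqlite3


def apply_cdc_batch(conn, target, key, changes):
    """Apply a change batch to `target` as delete-then-insert.

    `changes` is a list of (op, row) pairs, where op is "I", "U" or "D"
    and row is a dict that includes the key column. This format is
    hypothetical; a real CDC feed will look different.
    """
    # Keep only the last change per key so repeated updates do not duplicate rows.
    latest = {}
    for op, row in changes:
        latest[row[key]] = (op, row)

    with conn:  # one transaction: both steps commit together or not at all
        # Step 1: delete every key touched by the batch.
        conn.executemany(
            f"DELETE FROM {target} WHERE {key} = ?",
            [(k,) for k in latest],
        )
        # Step 2: re-insert the current image of rows that still exist.
        for op, row in latest.values():
            if op == "D":
                continue
            cols = ", ".join(row)
            marks = ", ".join("?" for _ in row)
            conn.execute(
                f"INSERT INTO {target} ({cols}) VALUES ({marks})",
                tuple(row.values()),
            )
```

Deleting first and inserting second means an update never needs its own code path: it is a delete of the old image followed by an insert of the new one.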
Incremental is the default for Bronze. A Watermark keeps each run cheap and the lake append-only. Reserve full loads for small/reference data.
Landing targets
Where raw data comes to rest:
| Target | Step | Notes |
|---|---|---|
| Object storage (S3 / Azure Blob) | staged via $OBJECT_STORAGE (S3_CONNECTION / BLOB_CONNECTION) | The lake. See Administration ▸ cloud connections. |
| HDFS | Jdbctargethdfs | On-cluster lake storage. |
| Hive external tables | Onstagegroup | Registers landed files as Spark/Hive external tables ($HIVE_CONNECTION/$HIVE_DATABASE/$HIVE_SCHEMA) so Silver can query them as tables. |
| Onstage staging tables | Onstagegroup / file loaders | Staging tables in the MPP for the next merge. |
Append-only discipline
Bronze must stay immutable and replayable:
- Insert-only targets: never update or delete in Bronze; corrections happen in Silver.
- Watermark every incremental source so reruns don't double-land and only new rows arrive.
- Keep raw fidelity: land columns as received (string/raw types are fine); defer type-casting and cleaning to Silver. Base64-encode awkward payloads if a staging format needs it (decoded later via $B64EXPRESSION).
- Partition by load time / batch ($GUID scopes a run's staged objects) so you can replay or audit a specific landing.
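The raw-fidelity and batch-partitioning rules above can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the object-key layout, the `_batch_id` / `_loaded_at` envelope fields, and the `b64:` prefix are hypothetical, and the code does not reproduce what $GUID or $B64EXPRESSION actually do.

```python
import base64
import json
import uuid
from datetime import datetime, timezone


def _as_raw(value):
    """Keep values as received: bytes become Base64 text, the rest become strings."""
    if isinstance(value, bytes):
        return "b64:" + base64.b64encode(value).decode("ascii")
    return None if value is None else str(value)


def land_batch(records, source, now=None, batch_id=None):
    """Return (object_key, lines) for one landing batch.

    The key is partitioned by load date and batch id so a single landing
    can be replayed or audited on its own.
    """
    now = now or datetime.now(timezone.utc)
    batch_id = batch_id or uuid.uuid4().hex
    key = f"bronze/{source}/load_date={now:%Y-%m-%d}/batch={batch_id}/part-0000.jsonl"

    lines = []
    for rec in records:
        envelope = {
            "_batch_id": batch_id,
            "_loaded_at": now.isoformat(),
            # No casting or cleaning here; typing is deferred to Silver.
            "data": {k: _as_raw(v) for k, v in rec.items()},
        }
        lines.append(json.dumps(envelope, sort_keys=True))
    return key, lines
```

Because each batch writes to its own key, a rerun creates a new object rather than overwriting an earlier one, which keeps the landing area insert-only.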
Gotchas
- Late-arriving data: a high-water mark on updated_at can skip rows committed out of order; prefer an id-based or overlap-window watermark when the source allows.
- Idempotency: a rerun of the same window must not duplicate rows; rely on the Watermark position, not wall-clock time.
- Schema drift: new or renamed source columns should land without breaking; keep Bronze permissive and resolve shape in Silver.
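One way to combine the first two points is an overlap-window watermark with a deduplication key. The sketch below is a plain-Python illustration under stated assumptions, not a description of how the Watermark step behaves. The function name, the five-minute overlap, and the `(id, updated_at)` dedup key are hypothetical choices.

```python
from datetime import datetime, timedelta


def pull_with_overlap(source_rows, landed_keys, watermark, overlap=timedelta(minutes=5)):
    """Return (new_rows, new_watermark).

    source_rows: iterable of dicts with "id" and "updated_at" (datetime).
    landed_keys: set of (id, updated_at) pairs already in Bronze; updated in place.
    """
    # Re-read a little before the watermark to catch rows committed out of order.
    window_start = watermark - overlap

    new_rows = []
    for row in source_rows:
        if row["updated_at"] <= window_start:
            continue  # older than the overlap window
        key = (row["id"], row["updated_at"])
        if key in landed_keys:
            continue  # already landed, so a rerun does not duplicate it
        landed_keys.add(key)
        new_rows.append(row)

    # The watermark never moves backwards, even if only late rows arrived.
    new_watermark = max([watermark] + [r["updated_at"] for r in new_rows])
    return new_rows, new_watermark
```

The overlap trades a small amount of re-reading for safety against late commits, and the dedup key is what keeps that re-reading from producing duplicates. Rows that arrive later than the overlap allows are still missed, so the window has to be sized to the source's commit behavior.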
Next
Promote to 🥈 Silver — clean, conform, validate. Back to the Medallion overview.