In this section, we dive deep into Change Data Capture (CDC) with Delta Live Tables, Spark memory management, Databricks Workflows routing, and secure data sharing.
Let’s keep crushing it!
🔄 Part 16: Delta Live Tables (DLT) & CDC
Question 31: Handling Change Data Capture (CDC)
A data engineer is building a Delta Live Tables (DLT) pipeline to ingest database changes (inserts, updates, deletes) from a Kafka stream. Which DLT SQL command must they use to process these CDC events and automatically update the target table based on a primary key?
- A.
MERGE INTO - B.
APPLY CHANGES INTO - C.
UPDATE SET - D.
INSERT OVERWRITE
✅ Correct Answer: B
💡 Why it’s correct: APPLY CHANGES INTO is the native DLT declarative command for handling CDC. It automatically manages out-of-sequence records, deduplicates updates, and handles row-level inserts, updates, and deletes based on the sequence/timestamp column and primary key you provide.
❌ Why the others are wrong:
- A: While
MERGE INTOis the standard Delta Lake SQL command for upserts, DLT uses its own declarative syntax (APPLY CHANGES INTO) to simplify CDC pipelines. - C & D:
UPDATE SETandINSERT OVERWRITEare static SQL operations that cannot automatically handle a mixed stream of CDC events (like processing a delete followed by an insert).
🧠 Part 17: Spark Architecture & Memory Management
Question 32: Diagnosing Out of Memory (OOM) Errors
A data engineer runs a PySpark notebook on a cluster with 1 Driver node (16GB RAM) and 10 Worker nodes (32GB RAM each). They execute the command df.collect() on a DataFrame containing 50GB of data. The job immediately fails with an Out of Memory (OOM) error. Which node crashed, and why?
- A. A Worker node crashed because a single partition was larger than 32GB (Data Skew).
- B. The Driver node crashed because
collect()forces all data to be sent to the driver, exceeding its 16GB memory limit. - C. All nodes crashed because Spark does not support DataFrames larger than the total cluster RAM.
- D. The Driver node crashed because the broadcast hash join failed.
✅ Correct Answer: B
💡 Why it’s correct: The collect() action in Spark gathers all the distributed data from the worker nodes and pulls it into the memory of the single Driver node. Since the dataset is 50GB and the driver only has 16GB of RAM, it will instantly run out of memory and crash.
❌ Why the others are wrong:
- A: While data skew can cause worker OOMs, the specific trigger here is
collect(), which impacts the driver, not the workers. - C: Spark is specifically designed to process datasets larger than cluster memory by spilling to disk.
- D: There is no join operation mentioned in this scenario.
🔐 Part 18: Unity Catalog & Secure Sharing
Question 33: Sharing Data Externally
A company uses Databricks and Unity Catalog. They need to securely share a massive daily_sales Delta table with an external vendor. The external vendor does not use Databricks; they use pandas on their local machines and a Snowflake warehouse. What is the most secure and efficient way to share this data without duplicating it?
- A. Export the table to CSV files and host them on an FTP server.
- B. Grant the vendor a Databricks Personal Access Token (PAT) to query the workspace.
- C. Use Delta Sharing to create a Share and a Recipient, providing the vendor with an activation link.
- D. Create an External Location in Unity Catalog and give the vendor IAM access to the cloud bucket.
✅ Correct Answer: C
💡 Why it’s correct: Delta Sharing is an open standard built into Unity Catalog that allows you to securely share live data with any computing platform (Pandas, Snowflake, PowerBI, etc.) without the recipient needing a Databricks account. It shares a secure, short-lived, read-only token to access the underlying Parquet files directly, without copying the data.
❌ Why the others are wrong:
- A: Exporting to CSV creates stale, duplicate data and lacks modern security/governance.
- B: Giving external vendors active tokens to your internal workspace violates zero-trust security policies and consumes your compute resources.
- D: Granting direct IAM access to the storage bucket bypasses Unity Catalog’s fine-grained governance and auditing.
⚙️ Part 19: Cost Control & Cluster Policies
Question 34: Enforcing Compute Constraints
A Databricks workspace administrator wants to prevent data scientists from spinning up massively expensive clusters. They want to enforce a rule where no user can create a cluster with more than 4 worker nodes or use GPU instance types. Which Databricks feature should they configure?
- A. Instance Pools
- B. Unity Catalog Resource Quotas
- C. Cluster Policies
- D. Workspace Access Controls (ACLs)
✅ Correct Answer: C
💡 Why it’s correct: Cluster Policies allow administrators to define strict JSON rules that limit how clusters can be configured. You can use policies to enforce a maximum number of workers, restrict specific EC2/VM instance types, mandate auto-termination times, and enforce mandatory tagging for cost tracking.
❌ Why the others are wrong:
- A: Instance Pools are used to maintain a ready supply of warm VMs to speed up cluster start times; they do not enforce user restrictions.
- B: Unity Catalog manages data governance (tables, files, models), not compute resources.
- D: ACLs determine who can attach to or restart an existing cluster, but do not control the configuration limits of new clusters.
⚡ Part 20: Delta Lake Advanced Operations
Question 35: Forcing an Aggressive VACUUM
A data engineer realizes that sensitive PII was accidentally written to a Delta table and immediately runs a DELETE statement. To comply with GDPR, they must physically remove the old data files from cloud storage immediately. However, running VACUUM users RETAIN 0 HOURS throws an error. How can they force this operation to run?
- A. They must run
OPTIMIZEbefore runningVACUUM. - B. They must set
spark.databricks.delta.retentionDurationCheck.enabled = falsefor the session. - C. They must convert the table to a Managed Unity Catalog table first.
- D. They cannot do this; Delta Lake strictly forbids vacuuming data younger than 7 days.
✅ Correct Answer: B
💡 Why it’s correct: By default, Delta Lake prevents you from running a VACUUM with a retention period of less than 168 hours (7 days) to prevent accidentally corrupting active read streams or concurrent transactions. To bypass this safety mechanism for emergency compliance reasons, you must explicitly disable the Spark session property retentionDurationCheck.enabled.
❌ Why the others are wrong:
- A:
OPTIMIZEcompacts files but does not delete old, unreferenced files. - C: Unity Catalog managed tables still enforce the 7-day retention safety check by default.
- D: It is entirely possible to vacuum data younger than 7 days, provided the safety check is disabled.
🛤️ Part 21: Orchestration & Dependencies
Question 36: Workflows Condition Logic
In a Databricks Workflow, Task C depends on the completion of both Task A and Task B. During a scheduled run, Task A succeeds, but Task B fails due to a data quality error. By default, what will happen to Task C?
- A. Task C will run partially using only the outputs from Task A.
- B. Task C will wait indefinitely for Task B to be repaired.
- C. Task C will be skipped because its upstream dependency failed.
- D. Task C will execute normally, but any references to Task B will return NULL.
✅ Correct Answer: C
💡 Why it’s correct: By default, a task in Databricks Workflows uses the all_success trigger rule. This means the task will only execute if all of its upstream dependencies finish successfully. Because Task B failed, Task C is automatically skipped (it becomes an “Upstream Failed” status).
❌ Why the others are wrong:
- A & D: Databricks tasks are atomic; they either run entirely or do not run at all. They do not run “partially” or inject NULLs for failed dependencies.
- B: Workflows do not hang indefinitely waiting for manual intervention. The DAG will fail and conclude.
🔎 Part 22: Databricks SQL & Profiling
Question 37: Troubleshooting Slow Queries
A data analyst complains that a specific dashboard query running on a Databricks Serverless SQL Warehouse is taking 5 minutes to complete. The data engineer needs to see exactly how much time the query spent performing I/O versus CPU processing, and identify if there is a bottleneck in a specific JOIN. Where should they look?
- A. The Spark UI on the driver node
- B. The Databricks SQL Query Profile
- C. The Cluster Metrics (Ganglia) tab
- D. The Delta Live Tables Event Log
✅ Correct Answer: B
💡 Why it’s correct: In Databricks SQL, the Query Profile is the dedicated tool for deeply analyzing query performance. It provides a visual, node-by-node execution plan showing exactly how many rows were processed, where time was spent (I/O, CPU, network), and which specific operators (like a SortMergeJoin or HashAggregate) caused bottlenecks.
❌ Why the others are wrong:
- A: Serverless SQL Warehouses abstract away the underlying Spark UI. The Query Profile is the modern, Databricks-native replacement for SQL optimization.
- C: Ganglia shows hardware-level metrics (CPU/RAM spikes over time), not query-level physical execution plans.
- D: This query is running on a SQL Warehouse for a dashboard, not within a DLT pipeline.
🌊 Part 23: Structured Streaming Resilience
Question 38: Checkpointing & Exactly-Once Semantics
A Structured Streaming job ingesting JSON files from cloud storage crashes due to a temporary network outage. When the cluster restarts and the job resumes, how does Spark know exactly which files it has already processed to guarantee exactly-once processing and avoid duplicates?
- A. By reading the Watermark state
- B. By checking the Delta Lake transaction log of the target table
- C. By reading the Streaming Checkpoint directory
- D. By inferring the state from the
_rescued_datacolumn
✅ Correct Answer: C
💡 Why it’s correct: When you define a streaming query, you must provide a checkpointLocation. Spark writes Write-Ahead Logs (WAL) and state information to this highly durable cloud storage directory. Upon restart, Spark reads this directory to determine exactly which micro-batch it was processing when it crashed, ensuring no data is dropped or processed twice.
❌ Why the others are wrong:
- A: Watermarks handle late-arriving data in memory; they do not persist the state of file ingestion offsets.
- B: The target Delta table’s log tracks what was written, but the stream itself needs its checkpoint to know where to read from the source.
- D:
_rescued_datais an Auto Loader feature for schema mismatch, not a state-tracking mechanism for stream resilience.
📂 Part 24: Unity Catalog Unstructured Data
Question 39: Securing Raw Files
A data engineering team receives daily raw .mp4 video files and .csv extracts from an external ad agency. These files are dropped into an Azure ADLS Gen2 bucket. The team needs to govern access to these raw files using Unity Catalog permissions (GRANT READ) before any ingestion pipelines run. Which UC object should they create?
- A. An External Table
- B. A Managed Table
- C. An External Volume
- D. A Delta Live Table
✅ Correct Answer: C
💡 Why it’s correct: Volumes are Unity Catalog’s solution for governing non-tabular, unstructured, or semi-structured data. An External Volume maps a Unity Catalog namespace (catalog.schema.volume) directly to a path in your cloud storage, allowing you to use SQL GRANT statements to secure access to the raw files (like videos or CSVs) without converting them to a table first.
❌ Why the others are wrong:
- A, B, & D: All of these require the data to be structured (rows and columns) and processed by Spark/Delta. You cannot create a table out of raw
.mp4video files.
🔄 Part 25: Delta Architecture Fundamentals
Question 40: The Medallion Architecture
In the Databricks Medallion Architecture, which layer is strictly responsible for storing data in its original, immutable format, maintaining an exact historical archive of the source system?
- A. The Bronze Layer
- B. The Silver Layer
- C. The Gold Layer
- D. The Platinum Layer
✅ Correct Answer: A
💡 Why it’s correct: The Bronze layer (raw data) is the landing zone. Its primary purpose is to capture data from source systems exactly as it arrives, in an append-only, immutable fashion. This ensures that if a transformation logic error occurs later, you can always replay the pipeline from the untampered source truth.
❌ Why the others are wrong:
- B: The Silver layer contains filtered, cleaned, and deduplicated data (often applying CDC).
- C: The Gold layer contains highly refined, aggregated data tailored for specific business use cases or BI dashboards.
- D: “Platinum” is not a standard tier in the Medallion architecture.