Ace Your Databricks Data Engineer Exam: 10 Essential Practice Questions Explained Part 5

In this section, we are focusing on advanced Delta Lake optimization, streaming-static joins, Unity Catalog security features, and Databricks SQL alerting.

🛡️ Part 26: Unity Catalog Row & Column Security

Question 41: Dynamic Data Masking

A company policy dictates that only members of the hr_team group are allowed to see the actual values in the salary column of the employees table. All other users should see the word "REDACTED". What is the most efficient, native Unity Catalog way to enforce this?

  • A. Create a View that uses an IF statement and grant access to the view instead of the table.
  • B. Encrypt the column in the Parquet files using a customer-managed key.
  • C. Apply a Unity Catalog Column Masking function to the salary column.
  • D. Use Workspace Access Control Lists (ACLs) to hide the column.

Correct Answer: C

💡 Why it’s correct: Unity Catalog natively supports Column Masking functions. You can create a SQL UDF (User Defined Function) that checks the user’s group via is_account_group_member(). If they are in hr_team, it returns the salary; otherwise, it returns "REDACTED". You then apply this mask directly to the table column, ensuring the rule applies globally across all workspaces and endpoints.

❌ Why the others are wrong:

  • A: While creating views was the legacy way to handle this, it causes “view sprawl” (managing hundreds of duplicate views for different permissions). Masking functions eliminate this need.
  • B: File-level encryption protects data at rest but does not dynamically mask data for different users running SQL queries in the same workspace.
  • D: Workspace ACLs control access to workspace objects (like notebooks), not column-level data in tables.

⚡ Part 27: Delta Lake Optimization

Question 42: Speeding Up Multi-Column Queries

A 2TB Delta table named web_events is currently partitioned by date. Analysts frequently run queries that filter on both date and customer_id. However, queries filtering for a specific customer_id within a 30-day date range are still scanning too many files. How can you optimize this table?

  • A. Repartition the table by both date and customer_id.
  • B. Run OPTIMIZE web_events ZORDER BY (customer_id).
  • C. Run VACUUM web_events RETAIN 0 HOURS.
  • D. Change the partition column to customer_id.

Correct Answer: B

💡 Why it’s correct: Z-Ordering is a technique that physically sorts and collocates related data within the same data files. By Z-Ordering on customer_id, Delta Lake clusters identical customer IDs into the same Parquet files within each date partition. When an analyst queries for a specific customer_id, Spark uses file statistics to skip 99% of the files in that date partition, drastically speeding up the query.

❌ Why the others are wrong:

  • A & D: customer_id is a high-cardinality column (millions of unique values). Partitioning by it will create millions of tiny directories, destroying storage metadata performance (the small file problem).
  • C: VACUUM deletes old file versions. It does not reorganize or sort active data to improve query read performance.

🌊 Part 28: Advanced Structured Streaming

Question 43: Stream-Static Joins

A data engineer is processing a real-time Kafka stream of transactions. They need to enrich this stream by joining it with a Delta table named customer_profiles, which receives occasional updates (like address changes). How does Spark handle joining a continuous stream to a Delta table?

  • A. Spark caches the Delta table in memory on day one and never updates it.
  • B. Spark cannot join a streaming DataFrame to a static Delta table.
  • C. Spark automatically reads the latest version of the Delta table at the start of every micro-batch.
  • D. The engineer must write a custom Python loop to pause the stream and refresh the static table.

Correct Answer: C

💡 Why it’s correct: Databricks and Delta Lake seamlessly support Stream-Static joins. When a streaming DataFrame joins against a static Delta table, the Spark engine inherently checks the Delta transaction log at the beginning of each micro-batch. If customer_profiles has been updated, Spark uses the newest version of the table for that specific micro-batch’s join operations.

❌ Why the others are wrong:

  • A: Spark does not permanently freeze the static table state; it polls for updates dynamically.
  • B: Stream-static joins are fully supported and are a core pattern in the Medallion architecture.
  • D: No manual loops or pausing are required; the engine handles the state evaluation automatically.

🏗️ Part 29: DLT Architecture & Publishing

Question 44: Querying DLT Tables Externally

A team built a robust Delta Live Tables (DLT) pipeline that cleanses raw data into a set of Gold-level tables. A separate business intelligence team wants to query these Gold tables using Databricks SQL and Tableau. How can you make DLT tables available to other tools?

  • A. DLT tables are strictly internal and cannot be queried outside the pipeline.
  • B. Specify a Target schema (or Target catalog and schema in Unity Catalog) in the DLT pipeline settings.
  • C. Use a Databricks Workflow to copy the data from DLT to a separate SQL warehouse.
  • D. Grant the BI team access to the underlying DBFS storage path.

Correct Answer: B

💡 Why it’s correct: By default, DLT creates temporary/internal tables. However, if you define a Target Schema (e.g., catalog_name.gold_sales_schema) in the pipeline settings, DLT automatically publishes all tables and views defined in your pipeline to the Databricks Metastore / Unity Catalog. Once published, they can be queried like any standard Delta table by Databricks SQL, Tableau, PowerBI, or other notebooks.

❌ Why the others are wrong:

  • A: DLT tables are fully queryable once published.
  • C: Copying the data creates unnecessary duplication. DLT manages the tables natively.
  • D: Granting direct path access bypasses Unity Catalog governance and is an anti-pattern.

🔔 Part 30: Observability & Alerting

Question 45: Targeted Task Notifications

A Databricks Workflow job consists of 10 tasks. The final task, Update_Dashboards, occasionally fails due to external API timeouts. The data engineer wants to send a Slack message only when this specific task fails, without receiving alerts for the rest of the job. How can they achieve this?

  • A. Configure a Job-level email alert for failures.
  • B. Add a DLT expectation with expect_or_fail.
  • C. Configure a webhook notification on the specific Update_Dashboards task for the “Task fails” event.
  • D. Use a Databricks SQL Alert.

Correct Answer: C

💡 Why it’s correct: Databricks Workflows allows for highly granular notifications. You can attach email or Webhook notifications (which connect easily to Slack, Teams, or PagerDuty) at both the Job level and the individual Task level. Selecting “Task fails” on the specific task ensures alerts are only triggered for that exact point of failure.

❌ Why the others are wrong:

  • A: A Job-level alert triggers if any part of the job fails, which violates the requirement to only alert on the specific task.
  • B: DLT expectations evaluate data quality inside a pipeline, not external API timeouts in a Workflow task.
  • D: Databricks SQL Alerts trigger based on the results of a SQL query (e.g., “count > 10”), not task execution states.

💰 Part 31: Compute Economics

Question 46: Strategic Use of Spot Instances

A data engineering team is trying to reduce cloud compute costs. They have four different workloads. Which workload is the MOST appropriate candidate to run entirely on Spot instances?

  • A. An ad-hoc data exploration cluster used by analysts to test queries.
  • B. A 24/7 low-latency Structured Streaming production job.
  • C. A critical financial reporting pipeline with a strict 8:00 AM SLA.
  • D. A Databricks SQL Serverless Warehouse serving live BI dashboards.

Correct Answer: A

💡 Why it’s correct: Spot instances utilize spare cloud capacity at a massive discount (up to 80%), but the cloud provider can reclaim (terminate) them at any time with a 2-minute warning. Ad-hoc exploration clusters are perfect for this because if a node is lost, the analyst simply re-runs the query—there is no critical business outage.

❌ Why the others are wrong:

  • B & C: SLA-critical and 24/7 streaming pipelines should use On-Demand instances (at least for the driver node) to guarantee uptime and meet deadlines without interruption.
  • D: Serverless compute is managed entirely by Databricks; users do not configure underlying Spot/On-Demand ratios for it.

🔄 Part 32: Schema Evolution in Delta Lake

Question 47: Merging New Columns

A DataFrame df_new contains a new column named campaign_id that does not currently exist in the target Delta table marketing_events. The engineer wants to append this DataFrame to the table and automatically add the new column to the table’s schema. Which PySpark write command should they use?

  • A. df_new.write.format("delta").mode("append").saveAsTable("marketing_events")
  • B. df_new.write.format("delta").option("mergeSchema", "true").mode("append").saveAsTable("marketing_events")
  • C. df_new.write.format("delta").option("inferSchema", "true").mode("overwrite").saveAsTable("marketing_events")
  • D. df_new.write.format("delta").option("overwriteSchema", "true").mode("append").saveAsTable("marketing_events")

Correct Answer: B

💡 Why it’s correct: By default, Delta Lake strictly enforces schemas to prevent accidental data corruption. To safely append a DataFrame that contains new, valid columns, you must explicitly declare .option("mergeSchema", "true"). Delta will then update the table’s metadata to include campaign_id and insert the new data.

❌ Why the others are wrong:

  • A: Without mergeSchema, Delta Lake will throw a AnalysisException stating the schemas do not match.
  • C: inferSchema is an option used when reading raw files (like CSV or JSON), not when writing to a Delta table.
  • D: overwriteSchema is used in combination with mode("overwrite") to completely replace a table’s schema and data, not for appending.

📊 Part 33: Databricks SQL & Business Logic

Question 48: Automated Data Monitoring

A business stakeholder wants to receive an automated email immediately if the total daily sales in the gold_sales table drops below $10,000. What is the most straightforward, no-code way to build this in Databricks?

  • A. Write a Python script in a Databricks Notebook that uses the smtplib library.
  • B. Configure a Unity Catalog Data Quality rule.
  • C. Write a query in Databricks SQL and configure a Databricks SQL Alert based on a threshold.
  • D. Use Delta Live Tables expect_or_drop function.

Correct Answer: C

💡 Why it’s correct: Databricks SQL Alerts are built exactly for this scenario. You write a standard SQL query (e.g., SELECT SUM(amount) FROM gold_sales WHERE date = current_date()), set a trigger condition (e.g., < 10000), and configure an email or webhook destination. Databricks handles the scheduling and notification routing automatically.

❌ Why the others are wrong:

  • A: Requires custom infrastructure, handling SMTP credentials, and managing a separate scheduled job.
  • B: Unity Catalog does not have a native “Data Quality rule” alerting engine (it relies on DLT for pipeline expectations).
  • D: DLT expectations handle row-level filtering during ingestion/processing, not post-processing business metric alerts.

🔗 Part 34: Lineage & Metastore Limitations

Question 49: Unity Catalog Lineage Boundaries

Unity Catalog automatically captures data lineage across tables, columns, and dashboards. However, which of the following scenarios will Unity Catalog NOT be able to track?

  • A. A Spark SQL INSERT INTO query executed in a Databricks Notebook.
  • B. A PySpark DataFrame transformation using .withColumn().
  • C. A Databricks SQL query executing on an external table backed by AWS S3.
  • D. A transformation that occurs inside an on-premise PostgreSQL database before the data is ingested into Databricks via JDBC.

Correct Answer: D

💡 Why it’s correct: Unity Catalog can only track lineage for executions that happen within the Databricks compute plane. It parses the execution plans of Databricks clusters and SQL warehouses. If a transformation happens entirely inside an external system (like a stored procedure running inside PostgreSQL), Databricks has no visibility into it.

❌ Why the others are wrong:

  • A, B, & C: Unity Catalog natively tracks lineage for all Databricks-native workloads, whether they are SQL, Python, or querying External Tables on cloud storage.

🥇 Part 35: The Medallion Architecture

Question 50: The Role of the Gold Layer

In the Databricks Medallion Architecture, what is the primary defining characteristic of the Gold layer?

  • A. It stores data in its original, immutable, raw JSON format.
  • B. It contains highly denormalized, aggregated data tailored for BI dashboards and business-level reporting.
  • C. It acts as an enterprise repository of cleansed, filtered, and deduplicated historical data.
  • D. It serves exclusively as the landing zone for real-time Kafka streams.

Correct Answer: B

💡 Why it’s correct: The Gold layer represents the final, business-ready state of the data. Instead of raw events or normalized tables, Gold tables are typically highly denormalized (e.g., star schemas) and aggregated (e.g., daily sales per region) so that BI tools like PowerBI, Tableau, and Databricks SQL Dashboards can query them at lightning speed.

❌ Why the others are wrong:

  • A & D: These describe the Bronze layer (raw, immutable landing zone).
  • C: This describes the Silver layer (cleansed, deduplicated, enriched enterprise source of truth).

Leave a Reply

Discover more from Geeky Codes

Subscribe now to keep reading and get access to the full archive.

Continue reading