🚀 Ace Your Databricks Data Engineer Exam: 10 Essential Practice Questions Explained

Preparing for a Databricks Data Engineering certification? Mastering the concepts is only half the battle; you also need to know how to navigate tricky multiple-choice scenarios.

In this post, we’ll break down 10 real-world exam-style questions covering PySpark, Unity Catalog, Databricks Workflows, and Performance Optimization. More importantly, we won’t just give you the answers—we’ll explain why the correct choice is right and why the others fall short.

Let’s dive in!

🛠️ Part 1: PySpark & Databricks SQL

Question 1: Handling Null Values in PySpark

Which PySpark code fills nulls in the email column with unknown@example.com in a DataFrame df?

  • A. df.na.replace("", "unknown@example.com", subset=["email"])
  • B. df.na.replace(None, "unknown@example.com", subset=["email"])
  • C. df.na.drop(subset=["email"])
  • D. df.na.fill({"email": "unknown@example.com"})

Correct Answer: D

💡 Why it’s correct: In PySpark, na.fill() is the native, null-safe method specifically designed to replace NULL values. Passing a dictionary {"column_name": "replacement_value"} allows you to target specific columns efficiently.

❌ Why the others are wrong:

  • A: replace targets specific literal values (like empty strings ""), not standard SQL NULLs.
  • B: Passing Python’s None to .replace() is not the reliable or idiomatic way to handle nulls in Spark DataFrames.
  • C: na.drop() removes the entire row if a null is found, which results in data loss rather than filling the value.

Question 2: Safe Type Casting in SQL

Which Databricks Lakehouse SQL function should be used to convert a STRING column to INTEGER, where invalid values must return NULL instead of raising an error?

  • A. TRY_CAST(column_name AS INT)
  • B. CAST(column_name AS INT)
  • C. INT(column_name)
  • D. COALESCE(CAST(column_name AS INT), 0)

Correct Answer: A

💡 Why it’s correct: TRY_CAST safely attempts the conversion. If the string contains non-numeric characters (e.g., 'abc'), it gracefully returns NULL instead of failing the entire query. This is essential for robust data pipelines handling messy source data.

❌ Why the others are wrong:

  • B & C: Both standard CAST and the INT() function will throw a hard error and fail the pipeline if they encounter an invalid string.
  • D: COALESCE handles existing nulls by replacing them with 0, but the inner CAST will still fail and crash the query on invalid strings before COALESCE can even evaluate.

🔐 Part 2: Unity Catalog & Governance

Question 3: Upgrading to Unity Catalog

A data engineer needs to support a Unity Catalog Delta table catalog.schema.sales that is currently registered externally. Requirements state: automatic file cleanup after a table drop, no downtime during migration, and enforced name-based access. Which action should they take?

  • A. Run ALTER TABLE catalog.schema.sales SET MANAGED
  • B. Run DROP TABLE catalog.schema.sales PURGE
  • C. Run ALTER TABLE catalog.schema.sales SET LOCATION
  • D. Create a new managed table with CTAS and migrate workloads

Correct Answer: A

💡 Why it’s correct: Running ALTER TABLE ... SET MANAGED performs an in-place conversion from an external table to a Unity Catalog managed table. This preserves the table’s identity, permissions, and history, drastically reducing migration downtime. Once managed, the platform automatically handles the storage lifecycle (including file cleanup on drop).

❌ Why the others are wrong:

  • B: DROP TABLE ... PURGE deletes the table and its underlying files entirely, violating the requirement to migrate workloads seamlessly.
  • C: SET LOCATION is used to change where external data points to, but it does not convert the table to a managed lifecycle state.
  • D: Using Create Table As Select (CTAS) requires moving data, recreating permissions, and updating downstream jobs, causing downtime and breaking table history.

Question 4: Unity Catalog Permissions

An engineer must let the growth-analysts group read email_stats from corp_marketing.campaigns, but not create, alter, or delete any objects in the schema. Which action sequence meets the requirement?

  • A. Grant USE SCHEMA on schema, grant SELECT and MODIFY on table
  • B. Grant USE SCHEMA on schema, grant SELECT on table
  • C. Grant USE CATALOG and CREATE SCHEMA on catalog, USE SCHEMA on schema, SELECT on table
  • D. Grant USE CATALOG on catalog, USE SCHEMA on schema, and SELECT on table

Correct Answer: D

💡 Why it’s correct: Unity Catalog uses a strict hierarchical permission model. To read a table, a user must have USE CATALOG on the parent catalog, USE SCHEMA on the parent schema, and SELECT on the table itself.

❌ Why the others are wrong:

  • A: Includes MODIFY, which violates the read-only requirement.
  • B: Missing the required USE CATALOG permission at the top level.
  • C: Grants CREATE SCHEMA, which violates the rule against creating objects.

⚙️ Part 3: Architecture & Orchestration

Question 5: Automating Table Maintenance

A data engineer currently manually schedules OPTIMIZE and VACUUM jobs for Unity Catalog Delta tables. They want the platform to automatically determine when and how to run these operations. What should they do?

  • A. Configure Auto Loader with schema evolution enabled
  • B. Set delta.autooptimize.autoCompact and delta.autooptimize.optimizeWrite
  • C. Enable predictive optimization at the account, catalog, or schema level
  • D. Enable Liquid Clustering with CLUSTER BY on each table

Correct Answer: C

💡 Why it’s correct: Predictive Optimization is a Unity Catalog feature that uses AI to analyze your tables’ read/write patterns. It automatically schedules and executes OPTIMIZE (compaction) and VACUUM (cleanup) only when it is cost-effective and necessary, completely removing the need for manual orchestration.

❌ Why the others are wrong:

  • A: Auto Loader is an ingestion tool for raw files, not a maintenance utility.
  • B: Auto-optimize properties help manage file sizes during write operations, but they do not automate VACUUM (stale file cleanup).
  • D: Liquid Clustering improves data layout and query performance, but does not automate maintenance scheduling.

Question 6: Incremental Ingestion & Time Travel

Lakeflow Connect ingests customer transactions nightly. Users need 30-day historical data access, but the current REPLACE mode prevents time-travel. How can the engineer retain historical data and prevent duplicates?

  • A. Use APPEND mode
  • B. Implement COPY INTO
  • C. Enable MERGE with a unique key
  • D. Use DROP and REPLACE

Correct Answer: C

💡 Why it’s correct: MERGE acts as an “upsert” (update + insert). By using a unique key, it inserts new records and updates existing ones without duplicating data. Crucially, because Delta Lake natively tracks MERGE transactions, it preserves the table history, allowing users to run time-travel queries.

❌ Why the others are wrong:

  • A: APPEND will blindly insert records. If a transaction is re-processed, it creates duplicates.
  • B: COPY INTO tracks which files have been loaded to prevent double-loading files, but it doesn’t deduplicate at the record level if source data contains updates.
  • D: DROP and REPLACE destroys the table history entirely, breaking time travel.

Question 7: Job Control Flow

A notebook task calculates daily data volume and writes it to a task value record_count. If record_count > 1,000,000, the job should run high_volume_pipeline; otherwise, it runs normal_pipeline. Which feature should be used?

  • A. Configure additional retries until the count exceeds the threshold
  • B. Configure “Run If” conditions based on task status
  • C. Add an If/else condition task that reads the task value and routes downstream
  • D. Use a For each task over the record count

Correct Answer: C

💡 Why it’s correct: Databricks Workflows supports an If/else condition task. This task type can explicitly read dynamic values (like record_count) generated by previous tasks and use boolean logic to branch the DAG (Directed Acyclic Graph) to different downstream tasks.

❌ Why the others are wrong:

  • A: Retries are for handling failures, not logical routing.
  • B: “Run If” conditions (e.g., All Success, At Least One Failed) evaluate the success/failure state of previous tasks, not their output values.
  • D: “For each” is used to iterate over arrays/lists in parallel, not for boolean routing.

Question 8: Databricks Asset Bundles (DABs)

A deployed Databricks Asset Bundle job in prod uses catalog: $(var.catalog_name). The engineer runs: databricks bundle run -t prod aggregate_job --var="catalog_name=temp_catalog". However, the job still uses prod_catalog. Why?

  • A. --var is only supported for bundle validate
  • B. Variables are resolved at deployment time; run-time overrides don’t affect deployed jobs
  • C. Variable substitution only works in targets, not resources
  • D. --var updates the job but requires a cluster restart

Correct Answer: B

💡 Why it’s correct: In Databricks Asset Bundles, variables are statically resolved and baked into the job definition during the deployment phase (bundle deploy). When you use bundle run, you are triggering the already-deployed asset on the workspace. You cannot hot-swap pipeline configuration variables at runtime via the CLI.

❌ Why the others are wrong:

  • A, C, & D: Are factually incorrect regarding how DABs compile and validate YAML configurations.

⚡ Part 4: Performance & Troubleshooting

Question 9: Cost-Efficient Job Reruns

A 15-task Databricks job fails at Task #10. Tasks 1–9 succeeded. How can you finish the job with minimal compute cost?

  • A. Run the entire job again
  • B. Delete successful tasks and rerun
  • C. Use Repair Run
  • D. Manually run notebooks on interactive clusters

Correct Answer: C

💡 Why it’s correct: Repair Run is a specific Databricks Job feature that allows you to resume a failed job. It intelligently skips all successfully completed upstream tasks and only executes the task that failed along with its downstream dependencies, saving significant time and compute costs.

❌ Why the others are wrong:

  • A: Wastes compute by re-processing data in Tasks 1-9.
  • B & D: Both are manual, error-prone interventions that break the automated DAG lineage.

Question 10: Diagnosing Data Skew

In a Spark stage with 200 tasks, 195 finish in seconds but 5 take 15+ minutes. What is the most likely issue?

  • A. Network latency
  • B. Driver memory issue
  • C. Small file problem
  • D. Data skew

Correct Answer: D

💡 Why it’s correct: This is the classic symptom of Data Skew. Spark partitions data across tasks based on a key (e.g., grouping by country). If 95% of your users are in one country, the specific tasks handling that partition will receive a massive amount of data compared to the others, resulting in a few “straggler” tasks that take exponentially longer to finish.

❌ Why the others are wrong:

  • A: Network latency would likely affect all tasks unpredictably, not consistently isolate 5 tasks.
  • B: Driver memory issues result in Out Of Memory (OOM) errors crashing the whole application, not slow worker tasks.
  • C: The small file problem causes overhead during the planning/reading phase and usually results in thousands of tiny tasks, not a few massive ones.

📝 Final Thoughts

Preparing for Databricks exams requires shifting your mindset from “how do I write the code?” to “what is the most scalable, cost-effective, and governed way to build this on the Lakehouse?”

Bookmark this page for your revision, and good luck on your exam!

Leave a Reply

Discover more from Geeky Codes

Subscribe now to keep reading and get access to the full archive.

Continue reading