This is from the practice exam for the Databricks Data Engineer Associate certification.
The question is:
A data engineering team has created a series of tables using Parquet data stored in an external system. The team is noticing that after appending new rows to the data in the external system, their queries within Databricks are not returning the new rows. They identify the caching of the previous data as the cause of this issue. Which of the following approaches will ensure that the data returned by queries is always up-to-date?
The options are:
A. The tables should be converted to the Delta format
B. The tables should be stored in a cloud-based external system
C. The tables should be refreshed in the writing cluster before the next query is run
D. The tables should be altered to include metadata to not cache
E. The tables should be updated before the next query is run
The correct answer is given as A, but I chose D.
My understanding is that an external data source cannot guarantee ACID, and that queried data is first fetched from the cache. So it seems the fix should be either to disable the cache or to move the data; merely converting the table format should not help.
Can anyone explain why converting the format to Delta solves the problem?
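For context, here is a sketch of what the two approaches would look like in Spark SQL (the table and path names are hypothetical). As I understand it, `REFRESH TABLE` only invalidates the cache for that one table and must be rerun after every external write, whereas Delta's transaction log is consulted on each query, so Spark can detect newly committed files automatically:

```sql
-- Workaround in the spirit of option D: manually invalidate cached
-- metadata/data before querying (must be repeated after each external append):
REFRESH TABLE my_parquet_table;

-- Option A: convert the external Parquet data in place to Delta format;
-- subsequent queries read the transaction log, so new rows appear
-- without any manual refresh:
CONVERT TO DELTA parquet.`/mnt/external/my_table_path`;
```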