Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Can anyone help me to understand one question in PracticeExam-DataEngineerAssociate?

self-employed
Contributor

This is from the practice exam for the Data Engineer Associate certification.

The question is:

A data engineering team has created a series of tables using Parquet data stored in an external system. The team is noticing that after appending new rows to the data in the external system, their queries within Databricks are not returning the new rows. They identify the caching of the previous data as the cause of this issue. Which of the following approaches will ensure that the data returned by queries is always up-to-date?

The options are

A. The tables should be converted to the Delta format

B. The tables should be stored in a cloud-based external system

C. The tables should be refreshed in the writing cluster before the next query is run

D. The tables should be altered to include metadata to not cache

E. The tables should be updated before the next query is run

The correct answer is given as A, while I chose D.

My understanding is that an external data source cannot guarantee ACID, and the data is served from the cache first. So the fix should be either to disable the cache or to move the data; merely converting the table format should not help.

Can anyone explain why converting the format solves the problem?
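For context, here is a minimal sketch of what option A looks like in practice (the path is hypothetical): converting an external Parquet table to Delta means every subsequent query resolves the current snapshot from the Delta transaction log, so there is no stale file-listing cache to invalidate.

```sql
-- Hypothetical path; a sketch of option A.
-- CONVERT TO DELTA scans the existing Parquet files once and writes a
-- _delta_log directory describing the current set of data files.
CONVERT TO DELTA parquet.`/mnt/external/sales`;

-- From here on, each query reads the transaction log first to find
-- the latest committed snapshot, so appends committed through Delta
-- are visible immediately, with no manual refresh step.
SELECT COUNT(*) FROM delta.`/mnt/external/sales`;
```

Note that this only helps if subsequent writes also go through Delta: files dropped into the directory by an external writer outside the transaction log are invisible to Delta readers.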

1 ACCEPTED SOLUTION


Hubert-Dudek
Esteemed Contributor III

I think it all hinges on the phrase in the question, "will ensure that the data returned by queries is always up-to-date". The only fix I have used for this issue with an external Parquet table is REFRESH TABLE, and it is not among the options. Even then, freshness is not guaranteed, because you can forget to refresh the table after the source data changes.
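For readers, a sketch of the REFRESH TABLE workaround mentioned above (table name hypothetical): it invalidates Spark's cached metadata and file listing for the table, but it must be run manually after every external append, which is why it cannot "ensure" freshness.

```sql
-- Hypothetical table name. REFRESH TABLE invalidates the cached
-- metadata and file listing, forcing the next query to re-list the
-- Parquet files and pick up externally appended rows.
REFRESH TABLE sales_parquet;
SELECT COUNT(*) FROM sales_parquet;  -- re-lists files, sees new rows
```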


2 REPLIES

suny
New Contributor II

Not an answer, just asking the Databricks folks to clarify:

I would also like to understand this. If the external Parquet source emits no event (push), and nothing on the Databricks side actively polls or refreshes (pull), how is the table supposed to know that new rows arrived? This practice exam question seriously confuses me, because REFRESH is never mentioned and it sounds as if some magic happens.
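To make the push/pull distinction concrete, here is a sketch (table names hypothetical) of where the "magic" actually lives: a Parquet table relies on a cached file listing, while a Delta table pulls the current snapshot from its _delta_log at the start of every query.

```sql
-- Hypothetical table names; contrasting the two behaviours.

-- Parquet external table: Spark caches the file listing, so files
-- appended by the external system are not seen until an explicit pull:
SELECT COUNT(*) FROM parquet_events;  -- may return a stale count
REFRESH TABLE parquet_events;         -- explicit pull: re-list files
SELECT COUNT(*) FROM parquet_events;  -- now includes the new rows

-- Delta table: no separate refresh step, because every query begins
-- by reading the _delta_log to resolve the latest committed snapshot.
SELECT COUNT(*) FROM delta_events;
```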
