Can anyone help me to understand one question in PracticeExam-DataEngineerAssociate?

self-employed
Contributor

It is from the practice exam for Data Engineer Associate.

The question is:

A data engineering team has created a series of tables using Parquet data stored in an external system. The team is noticing that after appending new rows to the data in the external system, their queries within Databricks are not returning the new rows. They identify the caching of the previous data as the cause of this issue. Which of the following approaches will ensure that the data returned by queries is always up-to-date?

The options are:

A. The tables should be converted to the Delta format

B. The tables should be stored in a cloud-based external system

C. The tables should be refreshed in the writing cluster before the next query is run

D. The tables should be altered to include metadata to not cache

E. The tables should be updated before the next query is run

The correct answer is given as A, while I chose D.

My understanding is that an external data source cannot guarantee ACID, and the data is fetched from the cache first. So the options are either to disable the cache or to move the data; just converting the table format should not help.

Can anyone explain why converting the format will solve the problem?
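For concreteness, the conversion in option A would be a one-time operation along these lines (the storage path here is hypothetical):

    -- In-place conversion of a directory of Parquet files into a Delta table:
    CONVERT TO DELTA parquet.`/mnt/external/sales`;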

3 REPLIES

Hubert-Dudek
Esteemed Contributor III
(Accepted Solution)

I think the key is in the phrase "will ensure that the data returned by queries is always up-to-date." The only fix I have used for this issue with an external Parquet table is REFRESH TABLE, and it is not among the options. Even that does not guarantee the data is always up-to-date, since you can forget to refresh the table. A Delta table avoids the problem: every query reads the table's transaction log to find the current files, so rows committed through Delta appear without any manual refresh.
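For reference, the workaround looks like this (table name hypothetical):

    -- Invalidate Spark's cached metadata and data for the table so the
    -- next query re-lists the files in the external location:
    REFRESH TABLE sales_parquet;

You have to remember to run it after every external write, which is exactly why it cannot "ensure" freshness.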

Kaniz
Community Manager

Hi @lawrance Zhang, we haven't heard from you since the last response from @Hubert Dudek, and I was checking back to see if his suggestions helped you.

Otherwise, if you have a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

suny
New Contributor II

Not an answer, just asking the Databricks folks to clarify:

I would also like to understand this. If no event is emitted from the external Parquet table (push), and there is no active pulling or refreshing from the Delta table side (pull), how is the unfortunate Delta table supposed to know that new rows have arrived? This practice exam question seriously confuses me, because refresh is not mentioned, and it sounds as if some magic happens.
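To make the confusion concrete, here is the contrast as I understand it (table names hypothetical):

    -- Parquet external table: files written directly to storage by another
    -- system stay invisible to Spark until the cached listing is invalidated:
    SELECT COUNT(*) FROM events_parquet;   -- may be stale
    REFRESH TABLE events_parquet;

    -- Delta table: an append is a commit to the _delta_log, and every query
    -- reads the log first, so the new rows appear immediately:
    INSERT INTO events_delta VALUES (42, current_timestamp());
    SELECT COUNT(*) FROM events_delta;     -- includes the new row

So presumably the assumption is that, after conversion, the external system appends through the Delta protocol rather than dropping raw Parquet files next to the table?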
