Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Can anyone help me to understand one question in PracticeExam-DataEngineerAssociate?

self-employed
Contributor

It is the practice exam for data engineer associate

The question is:

A data engineering team has created a series of tables using Parquet data stored in an external system. The team is noticing that after appending new rows to the data in the external system, their queries within Databricks are not returning the new rows. They identify the caching of the previous data as the cause of this issue. Which of the following approaches will ensure that the data returned by queries is always up-to-date?

The options are

A. The tables should be converted to the Delta format

B. The tables should be stored in a cloud-based external system

C. The tables should be refreshed in the writing cluster before the next query is run

D. The tables should be altered to include metadata to not cache

E. The tables should be updated before the next query is run

The correct answer is given as A, but I chose D.

My understanding is that an external data source cannot guarantee ACID, and the data is first fetched from the cache. So the options are either to disable the cache or to move the data; just converting the table format will not help.

Can anyone explain why converting the format will solve the problem?

1 ACCEPTED SOLUTION

Accepted Solutions

Hubert-Dudek
Esteemed Contributor III

I think the key is in this part of the question: "will ensure that the data returned by queries is always up-to-date". The only fix I have used for this issue with external Parquet tables is REFRESH TABLE, and it is not among the options. Even then, it does not guarantee the data is always up-to-date, because you can forget to refresh the table after the external source changes.
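For anyone comparing the two behaviours, here is a sketch in Databricks SQL; the table name and path are hypothetical:

```sql
-- Workaround for a Parquet external table: manually invalidate the cached
-- file listing before querying. This must be re-run after every external
-- write, which is why it cannot guarantee "always up-to-date".
REFRESH TABLE my_parquet_table;

-- Option A: convert the table to Delta. The Delta transaction log becomes
-- the source of truth for which files belong to the table, so readers pick
-- up committed appends without any manual refresh.
CONVERT TO DELTA parquet.`/path/to/table`;
```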


3 REPLIES


Kaniz_Fatma
Community Manager
Community Manager

Hi @lawrance Zhang​, we haven't heard from you since the last response from @Hubert Dudek​, and I was checking back to see if his suggestions helped you.

If you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

suny
New Contributor II

Not an answer, just asking the Databricks folks to clarify:

I would also like to understand this. If no event is emitted from the external Parquet source (push), and there is no active pulling or refreshing from the Databricks side (pull), how is the unfortunate table supposed to know that new rows arrived? This practice exam question is seriously confusing me, because refresh is not mentioned and it sounds as if some magic happens.
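For what it's worth, once the table is Delta, no push or pull mechanism is needed: every reader resolves the table's file list from the transaction log at query time. A sketch with a hypothetical table:

```sql
-- With Delta, an append is an atomic commit to the table's _delta_log:
INSERT INTO events VALUES (1, 'click');

-- A later query reads the latest log snapshot to discover the table's
-- files, so the new row is visible immediately, with no REFRESH TABLE:
SELECT COUNT(*) FROM events;
```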
