Data Engineering

Forum Posts

by thedatacrew • Databricks Partner

2 weeks ago

248 Views
3 replies
3 kudos

Adhoc Table Refresh in Lakeflow Spark Declarative Pipelines (SDP)

Hi, It is currently not possible to specify a list of tables to refresh and their refresh policies (full/normal) in a Lakeflow Job. It can be done via the REST API, but it's messy. For example, if you need some tables or views refreshed more regularly, ...

Latest Reply

Yogasathyandrun
New Contributor II

Tuesday

3 kudos

This is a real limitation in the current Lakeflow / DLT job model. Today, a pipeline is treated as the unit of refresh, not individual tables inside it. That means: you can run or fully refresh a pipeline, but you cannot define different refresh policies...

2 More Replies
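
For anyone landing on this thread, the REST route mentioned above can be sketched roughly as follows. This is a minimal sketch, assuming the pipeline "start update" endpoint with its `refresh_selection` and `full_refresh_selection` fields; the workspace URL, token, pipeline ID and table names are all hypothetical placeholders, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_refresh_request(host, token, pipeline_id, refresh=(), full_refresh=()):
    """Build (but do not send) a start-update request for selected tables."""
    # A table should get one policy only, so reject overlaps up front.
    overlap = set(refresh) & set(full_refresh)
    if overlap:
        raise ValueError(f"tables listed in both selections: {sorted(overlap)}")
    body = {
        "refresh_selection": list(refresh),            # normal refresh
        "full_refresh_selection": list(full_refresh),  # reset and recompute
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/pipelines/{pipeline_id}/updates",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


refresh_req = build_refresh_request(
    "https://example.cloud.databricks.com",  # hypothetical workspace URL
    "dapi-REDACTED",                         # hypothetical token
    "1234-abcd",                             # hypothetical pipeline ID
    refresh=["orders_silver"],               # hypothetical table names
    full_refresh=["customers_dim"],
)
# To actually trigger the update: urllib.request.urlopen(refresh_req)
```

A small wrapper like this could be called from a notebook task on whatever schedule each group of tables needs, which is one way to approximate per-table policies until jobs support them directly.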

by Databrickissue • New Contributor

Monday

131 Views
1 reply
0 kudos

DLT Issue

I have one DLT pipeline in Databricks. When I schedule the pipeline, the data is not showing. However, when I run the pipeline manually, the data is displayed properly.

Latest Reply

Yogasathyandrun
New Contributor II

Tuesday

0 kudos

A few details would help narrow this down. When the scheduled run executes: Does the pipeline update show Succeeded or Failed? In the pipeline Event Log, do you see rows being processed/written? Is your manual run a normal update or a Full Refresh? Is the...

by Ericsson • New Contributor II

12-01-2021 8:45:17 AM

7061 Views
4 replies
1 kudos

SQL week format issue its not showing result as 01(ww)

Hi folks, I have a requirement to show the week number in ww format. Please see the code below: select weekofyear(date_add(to_date(current_date, 'yyyyMMdd'), +35)). Also, please refer to the screenshot for the result.

Latest Reply

Aidutchinso
New Contributor

Tuesday

1 kudos

"I've been exploring different communities lately, and honestly, connecting with people who share your interests makes all the difference. Whether it's diving deep into data engineering discussions or just having random conversations on platforms lik...

3 More Replies
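
On the original question: `weekofyear` returns an integer, so week 1 comes back as `1` rather than `01`, and the two-digit form needs explicit zero-padding. The sketch below shows the idea in plain Python; it assumes the post date (2021-12-01) as the starting day, and the Spark SQL equivalent in the comment (`lpad` around `weekofyear`) is a suggestion rather than something confirmed in the thread.

```python
from datetime import date, timedelta


def week_ww(d: date) -> str:
    """Return the ISO-8601 week number of d, zero-padded to two digits."""
    return f"{d.isocalendar()[1]:02d}"


# Same shift as in the question: 35 days after the starting date.
shifted = date(2021, 12, 1) + timedelta(days=35)  # 2022-01-05
print(week_ww(shifted))  # prints "01"

# Assumed Spark SQL equivalent:
#   select lpad(weekofyear(date_add(current_date, 35)), 2, '0')
```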

by samgon • New Contributor III

06-17-2025 2:21:01 AM

7536 Views
5 replies
6 kudos

Resolved! study materials for Certified Data Engineer Professional Certification?

Can anyone recommend high-quality study materials or resources (courses, documentation, practice exams, etc.) that helped you prepare for the Professional-level exam?

dataengineering

Latest Reply

williamandrew
New Contributor II

Tuesday

6 kudos

Recently achieved this certification and it feels great to see all the hard work pay off. Consistent practice, hands-on learning, and quality study resources made a huge difference. For anyone preparing, I found this resource helpful: https://linkly....

4 More Replies

by deepak05 • Contributor

01-22-2024 11:02:11 PM

43655 Views
12 replies
13 kudos

Resolved! I Got 70.00% on Databricks Certified Data Engineer Professional Exam but Failed....

Hi everyone, today I took the Databricks exam. I got 64 questions and my result was exactly 70.00% (per Databricks, the pass percentage is 70 or above), but the status still showed Failed and I couldn't get certified. Can anyone help me on...

Latest Reply

halliekohler
New Contributor

Tuesday

13 kudos

Congratulations on this achievement! Reaching this milestone feels incredibly rewarding. I had a similar experience, and quality practice resources from https://linkly.link/2l2Hb were very helpful throughout my preparation journey.

11 More Replies

by Navinkumar_K • New Contributor

Monday

52 Views
0 replies
0 kudos

DeltaFileStatistics on a nested column (`created date.shipment`) cause filtering issues

Environment
- Databricks Runtime version: 17.3 LTS
- Cloud: Azure
- Catalog: Unity Catalog
- Table format: Delta

Summary
We have a Delta table named `shipment` with a column `created date.shipment` (a column whose name contains a dot). Delta collects delta...

by genie • New Contributor

a week ago

142 Views
1 reply
0 kudos

Genie Code hallucinates CLI commands

I wanted to run some SQL commands programmatically and decided to use Genie Code to help me, but it came up with unsupported and non-existent commands.

Latest Reply

Yogasathyandrun
New Contributor II

a week ago

0 kudos

The command shown in the screenshot appears to be hallucinated. `databricks sql-statements execute` is not a valid Databricks CLI command. It looks like Genie combined concepts from the SQL Statement Execution API with CLI syntax that doesn't actually e...
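
As a rough illustration of the API the reply refers to, here is a minimal sketch of a SQL Statement Execution request, assuming the `/api/2.0/sql/statements` endpoint with `warehouse_id`, `statement` and `wait_timeout` in the body. The host, token and warehouse ID are hypothetical placeholders, and the request is built but not sent.

```python
import json
import urllib.request


def build_statement_request(host, token, warehouse_id, statement, wait_timeout="30s"):
    """Build (but do not send) a SQL Statement Execution API request."""
    body = {
        "warehouse_id": warehouse_id,  # SQL warehouse that runs the statement
        "statement": statement,        # the SQL text itself
        "wait_timeout": wait_timeout,  # how long the call waits for a result
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/sql/statements",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


stmt_req = build_statement_request(
    "https://example.cloud.databricks.com",  # hypothetical workspace URL
    "dapi-REDACTED",                         # hypothetical token
    "abc123def456",                          # hypothetical warehouse ID
    "SELECT 1 AS ok",
)
# To actually run it: urllib.request.urlopen(stmt_req), then parse the JSON response.
```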

by Maxrb • New Contributor III

a week ago

350 Views
4 replies
3 kudos

Resolved! Autoloader [FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH]

Hi, I am using Auto Loader to load Parquet files into my Unity Catalog with the following settings: .option("cloudFiles.format", "parquet") .option("cloudFiles.inferColumnTypes", "true") .option("cloudFiles.schemaEvolutionMode", "addNewColumnsWithTypeWi...

Latest Reply

Yogasathyandrun
New Contributor II

a week ago

3 kudos

What you're seeing comes down to where the type mismatch is detected. For Parquet, some mismatches can be handled at the Auto Loader layer and end up in _rescued_data, while others fail earlier inside the Parquet reader itself. In your example, the exi...

3 More Replies

by shan-databricks • Databricks Partner

a week ago

195 Views
3 replies
1 kudos

How to store credentials in Databricks and assign them to job parameters

I am using SQL Server, Postgres, and MongoDB as data sources, connecting through Spark and the JDBC connector. I would like to store the credentials and connection details in Databricks, pass them as job parameters, and need guidance on possible approach...

Latest Reply

Yogasathyandrun
New Contributor II

a week ago

1 kudos

I'd think about this as a separation of concerns: Secrets are for sensitive values (usernames, passwords, tokens, connection URIs). Job parameters are for runtime values (connection name, database, schema, table, query, collection, source system). In mo...

2 More Replies
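
The split described in the latest reply (secrets for sensitive values, job parameters for runtime values) might look something like the sketch below. The secret scope name and the key naming scheme (`<source>-jdbc-url` and so on) are made up for illustration; in a notebook the lookup function would be `dbutils.secrets.get`, and a stand-in is used here so the example runs anywhere.

```python
def jdbc_options(get_secret, scope, source, table):
    """Combine secret values with non-sensitive job parameters.

    get_secret: callable taking (scope, key), e.g. dbutils.secrets.get
    source, table: plain job parameters, safe to show in the job UI
    """
    return {
        "url": get_secret(scope, f"{source}-jdbc-url"),       # hypothetical key names
        "user": get_secret(scope, f"{source}-user"),
        "password": get_secret(scope, f"{source}-password"),
        "dbtable": table,
    }


# Stand-in for the secret store, for local illustration only.
fake_store = {
    ("etl-scope", "postgres-jdbc-url"): "jdbc:postgresql://db.example.com:5432/sales",
    ("etl-scope", "postgres-user"): "etl_reader",
    ("etl-scope", "postgres-password"): "not-a-real-password",
}
opts = jdbc_options(lambda s, k: fake_store[(s, k)], "etl-scope", "postgres", "public.orders")

# In a Databricks notebook the same idea would read roughly:
#   source = dbutils.widgets.get("source")   # job parameter
#   opts = jdbc_options(dbutils.secrets.get, "etl-scope", source, "public.orders")
#   df = spark.read.format("jdbc").options(**opts).load()
```

The job then only ever carries the connection name and table, while the credentials stay in the secret scope.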

by Nick_Hughes • New Contributor III

05-16-2023 3:43:03 AM

17532 Views
5 replies
1 kudos

Best way to generate fake data using underlying schema

Hi, we are trying to generate fake data to run our tests. For example, we have a pipeline that creates a gold layer fact table from 6 underlying source tables in our silver layer. We want to generate the data in a way that recognises the relationships ...

Latest Reply

savlahanish27
Databricks Partner

a week ago

1 kudos

The core problem you're facing is that Delta Lake doesn't enforce foreign key constraints, so most datagen tools generate each table independently and your joins produce no meaningful overlap. The solution is to generate a shared key pool first - a si...

4 More Replies
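
The "shared key pool" idea from the latest reply can be sketched in plain Python as below. This is only an illustration under assumptions: the table and column names (`customers`, `orders`, `customer_id`) are hypothetical, and a real setup would write the rows out as DataFrames or use a datagen library instead of lists of dicts.

```python
import random


def make_key_pool(n, seed=0):
    """Generate n unique surrogate keys that every table will draw from."""
    rng = random.Random(seed)
    return rng.sample(range(1, 10 * n + 1), n)


def make_customers(keys):
    """Parent table: exactly one row per key in the pool."""
    return [{"customer_id": k, "name": f"customer_{k}"} for k in keys]


def make_orders(keys, n_orders, seed=1):
    """Child table: every foreign key is picked from the same pool."""
    rng = random.Random(seed)
    return [
        {"order_id": i, "customer_id": rng.choice(keys)}
        for i in range(1, n_orders + 1)
    ]


pool = make_key_pool(50)
customers = make_customers(pool)
orders = make_orders(pool, n_orders=500)

# Every order joins to a customer because both sides share the pool.
known_ids = {c["customer_id"] for c in customers}
orphans = [o for o in orders if o["customer_id"] not in known_ids]
```

Because the keys are generated once and reused, joins between the generated tables behave like joins on real data even though nothing enforces the relationship.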

by ConnorK • Databricks Partner

a week ago

246 Views
3 replies
2 kudos

Databricks Standard SharePoint Connector Performance Issues

I've recently started using the Databricks Standard SharePoint connector within my workspace and have run into some significant performance issues. My notebook does a straightforward read using the following: lakeflow_connection_name = 'sharepoint_dev'...

Latest Reply

Yogasathyandrun
New Contributor II

a week ago

2 kudos

I think your diagnosis is likely correct. One thing that stands out is that you’re only reading A1:Z2 from each workbook. Given that the operation is still taking 40+ minutes, the bottleneck is unlikely to be the Excel parsing itself and more likely t...

2 More Replies

by AlexSantiago • New Contributor II

09-10-2022 11:40:01 PM

19048 Views
26 replies
4 kudos

spotify API get token - raw_input was called, but this frontend does not support input requests.

Hello everyone, I'm trying to use Spotify's API to analyse my music data, but I'm receiving an error during authentication, specifically when I try to get the token. My code is below. Is it a Databricks bug? pip install spotipy from spotipy.oauth2 import SpotifyO...

Latest Reply

abdullahbinali
New Contributor II

a week ago

4 kudos

To get a Spotify API token, create an app in the Spotify Developer Dashboard and get your Client ID and Client Secret. Send a POST request to Spotify Accounts API using the Client Credentials Flow to receive an access token. For local services in Jed...

25 More Replies
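
The Client Credentials request described in the latest reply can be sketched as follows. It needs no interactive prompt, which is what triggers the `raw_input` error in a notebook. The client ID and secret are placeholders, the request is built but not sent, and note that tokens from this flow do not grant access to a user's private data, so it may not cover everything the original poster wants to analyse.

```python
import base64
import urllib.parse
import urllib.request

TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_token_request(client_id, client_secret):
    """Build (but do not send) a Client Credentials token request."""
    # The app credentials go into an HTTP Basic Authorization header.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    form = urllib.parse.urlencode({"grant_type": "client_credentials"})
    return urllib.request.Request(
        TOKEN_URL,
        data=form.encode("utf-8"),
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


token_req = build_token_request("my-client-id", "my-client-secret")  # placeholders
# To fetch a token: json.load(urllib.request.urlopen(token_req))["access_token"]
```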

by Nidhig631 • Databricks MVP

a week ago

617 Views
10 replies
0 kudos

DISTINCT is the major bottleneck because of the heavy shuffle.

Need some advice from the community. I am processing around 100 million records using: df.select(required_cols).distinct().write.saveAsTable(...). The source has 1000+ columns, but I'm selecting only 20 columns before applying DISTINCT. I have already ena...

Latest Reply

Yogasathyandrun
New Contributor II

a week ago

0 kudos

I think you’re looking at this the right way. For an exact dedup, there’s no way around a global shuffle somewhere. A separate hash column doesn’t fundamentally change that, since Spark is already hashing the grouping keys internally for distinct(). ...

9 More Replies
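
To make the "no way around a global shuffle" point concrete, here is a toy model in plain Python, not Spark code. Lists stand in for partitions, and the function names are invented for the illustration. Deduplicating inside each partition still leaves duplicates across partitions; only after rows are redistributed by a hash of the full row can a second local pass be exact.

```python
def local_distinct(partitions):
    """Dedup inside each partition only (no data movement)."""
    return [list(dict.fromkeys(part)) for part in partitions]


def shuffle_by_key(partitions, n_out):
    """Route every row to an output partition chosen by its hash,
    so equal rows always end up in the same partition."""
    out = [[] for _ in range(n_out)]
    for part in partitions:
        for row in part:
            out[hash(row) % n_out].append(row)
    return out


def exact_distinct(partitions, n_out=2):
    pre = local_distinct(partitions)       # partial dedup shrinks the shuffle
    shuffled = shuffle_by_key(pre, n_out)  # the unavoidable global step
    return local_distinct(shuffled)        # now a local pass is sufficient


parts = [[(1, "a"), (2, "b"), (1, "a")], [(1, "a"), (3, "c")]]
local_only = [row for p in local_distinct(parts) for row in p]
exact = [row for p in exact_distinct(parts) for row in p]
# local_only still contains (1, "a") twice; exact contains each row once.
```

Adding a separate hash column changes which value is hashed, not whether the redistribution happens, which matches the point made in the reply.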

by AmitDECopilot • New Contributor III

a week ago

248 Views
1 reply
1 kudos

Resolved! Legacy Modernization Isn’t a Technology Problem

After working on multiple modernization initiatives, I’ve noticed a pattern: organizations spend months discussing Databricks vs Snowflake, Spark vs SQL, Batch vs Streaming, and Airflow vs Managed Orchestration. But the biggest challenge is usually somewhere els...

Latest Reply

Yogasathyandrun
New Contributor II

a week ago

1 kudos

I completely agree that teams often underestimate the metadata challenge during modernization. One thing I’ve seen repeatedly, though, is that the hardest part isn’t always the metadata itself, it’s the business intent behind it. We can extract mapping...

by Sam500 • New Contributor III

a week ago

1527 Views
4 replies
1 kudos

Resolved! Databricks Serverless Costs

Our Power BI reports consume real-time data, and for that the only remaining option is Databricks serverless, but serverless is an expensive option. How can we control the costs for serverless, and are there any other alternatives? Thank you.

Latest Reply

Yogasathyandrun
New Contributor II

a week ago

1 kudos

Serverless is often the preferred option for Power BI DirectQuery workloads because it starts in seconds and scales automatically. However, it’s not always the only option, and there are several ways to reduce costs. A few high-impact optimizations: Se...

3 More Replies
