Data Engineering

Forum Posts

by genie • Visitor

8 hours ago

45 Views
1 reply
0 kudos

Genie Code hallucinates CLI commands

I wanted to run some SQL commands programmatically and decided to use Genie Code to help me, but it came up with unsupported and non-existent commands.

Latest Reply

Yogasathyandrun
New Contributor

7 hours ago

0 kudos

The command shown in the screenshot appears to be hallucinated. "databricks sql-statements execute" is not a valid Databricks CLI command. It looks like Genie combined concepts from the SQL Statement Execution API with CLI syntax that doesn't actually e...
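For context, the SQL Statement Execution API mentioned in this reply is a REST API rather than a CLI subcommand. Below is a minimal sketch of building such a request in Python, assuming the `POST /api/2.0/sql/statements/` endpoint. The host, token, and warehouse ID are placeholders, and `build_statement_request` is a hypothetical helper written for illustration, not part of any Databricks library.

```python
import json
import urllib.request


def build_statement_request(host, token, warehouse_id, statement):
    """Build (but do not send) a POST request for the SQL Statement Execution API."""
    body = {
        "warehouse_id": warehouse_id,  # ID of the SQL warehouse that runs the statement
        "statement": statement,        # SQL text to execute
        "wait_timeout": "30s",         # wait up to 30 seconds for an inline result
    }
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/sql/statements/",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only (assumptions, not real credentials).
request = build_statement_request(
    host="example.cloud.databricks.com",
    token="<personal-access-token>",
    warehouse_id="<warehouse-id>",
    statement="SELECT 1",
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

Building the request separately from sending it makes it easy to inspect the payload before anything reaches the workspace.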

by Maxrb • New Contributor III

10 hours ago

97 Views
4 replies
3 kudos

Resolved! Autoloader [FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH]

Hi, I am using Auto Loader to load Parquet files into my Unity Catalog with the following settings:
.option("cloudFiles.format", "parquet")
.option("cloudFiles.inferColumnTypes", "true")
.option("cloudFiles.schemaEvolutionMode", "addNewColumnsWithTypeWi...

Latest Reply

Yogasathyandrun
New Contributor

9 hours ago

3 kudos

What you're seeing comes down to where the type mismatch is detected. For Parquet, some mismatches can be handled at the Auto Loader layer and end up in _rescued_data, while others fail earlier inside the Parquet reader itself. In your example, the exi...

by shan-databricks • Databricks Partner

yesterday

67 Views
3 replies
0 kudos

How to store credentials in Databricks and assign them to job parameters

I am using SQL Server, Postgres, and MongoDB as data sources, connecting through Spark and JDBC connector. I would like to store the credentials and connection details in Databricks, pass them as job parameters, and need guidance on possible approach...

Latest Reply

Yogasathyandrun
New Contributor

9 hours ago

0 kudos

I'd think about this as a separation of concerns:
Secrets are for sensitive values (usernames, passwords, tokens, connection URIs).
Job parameters are for runtime values (connection name, database, schema, table, query, collection, source system).
In mo...
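The separation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `build_jdbc_options`, the parameter names, and the secret key names (`username`, `password`) are hypothetical, and a fake in-memory secret store stands in for a real secret scope so the example runs anywhere.

```python
def build_jdbc_options(job_params, get_secret):
    """Combine non-sensitive job parameters with secrets looked up at run time."""
    scope = job_params["secret_scope"]  # only the scope NAME travels as a job parameter
    url = "jdbc:postgresql://{host}:{port}/{database}".format(**job_params)
    return {
        "url": url,
        "dbtable": job_params["table"],
        "user": get_secret(scope, "username"),      # sensitive: fetched, never passed in
        "password": get_secret(scope, "password"),  # sensitive: fetched, never passed in
    }


# Stand-in secret store (assumption for the sketch); values are fake.
FAKE_SECRETS = {
    ("pg-prod", "username"): "etl_user",
    ("pg-prod", "password"): "not-a-real-password",
}


def fake_get_secret(scope, key):
    return FAKE_SECRETS[(scope, key)]


# Runtime values as they might arrive from job parameters (all hypothetical).
job_params = {
    "secret_scope": "pg-prod",
    "host": "db.example.com",
    "port": "5432",
    "database": "sales",
    "table": "public.orders",
}
options = build_jdbc_options(job_params, fake_get_secret)
```

In a Databricks notebook, the getter could instead wrap `dbutils.secrets.get(scope=scope, key=key)`, and the resulting dictionary could be passed to `spark.read.format("jdbc").options(**options)`.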

by Nick_Hughes • New Contributor III

05-16-2023 3:43:03 AM

17323 Views
5 replies
1 kudos

Best way to generate fake data using underlying schema

Hi, we are trying to generate fake data to run our tests. For example, we have a pipeline that creates a gold layer fact table from 6 underlying source tables in our silver layer. We want to generate the data in a way that recognises the relationships ...

Latest Reply

savlahanish27
Databricks Partner

9 hours ago

1 kudos

The core problem you're facing is that Delta Lake doesn't enforce foreign key constraints, so most datagen tools generate each table independently and your joins produce no meaningful overlap. The solution is to generate a shared key pool first - a si...
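The shared key pool idea can be sketched in plain Python, without any particular datagen library. The table names, column names, and helper functions below are all hypothetical; the only point is that the dimension and the fact table draw their keys from the same pool.

```python
import random


def make_key_pool(size, seed=42):
    """Return `size` unique integer keys that every generated table will share."""
    rng = random.Random(seed)
    return rng.sample(range(1, size * 10 + 1), size)


def make_dimension(key_pool):
    """One dimension row per key, so every key in the pool has a match."""
    return [{"customer_id": key, "name": f"customer_{key}"} for key in key_pool]


def make_fact(key_pool, n_rows, seed=7):
    """Fact rows draw their foreign key from the shared pool, never from fresh random values."""
    rng = random.Random(seed)
    return [
        {
            "order_id": i,
            "customer_id": rng.choice(key_pool),
            "amount": round(rng.uniform(1, 500), 2),
        }
        for i in range(1, n_rows + 1)
    ]


key_pool = make_key_pool(size=100)
customers = make_dimension(key_pool)
orders = make_fact(key_pool, n_rows=1000)

# Count fact rows whose key has no matching dimension row.
customer_ids = {row["customer_id"] for row in customers}
unmatched = [row for row in orders if row["customer_id"] not in customer_ids]
print(len(unmatched))  # prints 0: every order joins to a customer
```

Fixing the seeds keeps the generated data reproducible between test runs, which tends to make failing joins easier to debug.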

by ConnorK • Databricks Partner

yesterday

121 Views
3 replies
2 kudos

Databricks Standard SharePoint Connector Performance Issues

I've recently started using the Databricks Standard SharePoint connector within my workspace and have run into some significant performance issues. My notebook does a straightforward read using the following:
lakeflow_connection_name = 'sharepoint_dev'...

Latest Reply

Yogasathyandrun
New Contributor

12 hours ago

2 kudos

I think your diagnosis is likely correct. One thing that stands out is that you’re only reading A1:Z2 from each workbook. Given that the operation is still taking 40+ minutes, the bottleneck is unlikely to be the Excel parsing itself and more likely t...

by AlexSantiago • New Contributor II

09-10-2022 11:40:01 PM

18935 Views
26 replies
4 kudos

spotify API get token - raw_input was called, but this frontend does not support input requests.

Hello everyone, I'm trying to use Spotify's API to analyse my music data, but I'm receiving an error during authentication, specifically when I try to get the token. My code is below. Is it a Databricks bug?
pip install spotipy
from spotipy.oauth2 import SpotifyO...

Latest Reply

abdullahbinali
New Contributor

yesterday

4 kudos

To get a Spotify API token, create an app in the Spotify Developer Dashboard and get your Client ID and Client Secret. Send a POST request to Spotify Accounts API using the Client Credentials Flow to receive an access token. For local services in Jed...
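The POST request described above might look like the following sketch, which builds the Client Credentials token request with the Python standard library and stops short of sending it. The client ID and secret are placeholders, and `build_token_request` is a hypothetical helper, not part of spotipy.

```python
import base64
import urllib.parse
import urllib.request

TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_token_request(client_id, client_secret):
    """Build (but do not send) a Client Credentials token request."""
    # The client ID and secret are sent as HTTP Basic credentials.
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(credentials).decode("ascii")
    form = urllib.parse.urlencode({"grant_type": "client_credentials"})
    return urllib.request.Request(
        url=TOKEN_URL,
        data=form.encode("ascii"),
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Placeholder credentials for illustration only.
request = build_token_request("my-client-id", "my-client-secret")
# urllib.request.urlopen(request) would return JSON containing "access_token";
# the call is omitted so the sketch runs without network access.
```

Because this flow involves no interactive prompt, it avoids the raw_input call from the thread title. The trade-off is that a token obtained this way is not tied to a user, so it cannot read user-specific data such as a personal listening history.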

by Nidhig631 • Databricks MVP

Saturday

336 Views
11 replies
0 kudos

DISTINCT is the major bottleneck because of the heavy shuffle.

Need some advice from the community. I am processing around 100 million records using:
df.select(required_cols).distinct().write.saveAsTable(...)
The source has 1000+ columns, but I'm selecting only 20 columns before applying DISTINCT. I have already ena...

Latest Reply

kim533
New Contributor

yesterday

0 kudos

For 100M+ records, `DISTINCT` will almost always be shuffle-heavy because Spark must compare records across partitions. If you truly need exact deduplication on 20 columns, consider using `dropDuplicates(required_cols)` instead of `distinct` the executio...

by A0s01gy • New Contributor II

yesterday

135 Views
1 reply
1 kudos

Legacy Modernization Isn’t a Technology Problem

After working on multiple modernization initiatives, I’ve noticed a pattern:
Organizations spend months discussing:
Databricks vs Snowflake
Spark vs SQL
Batch vs Streaming
Airflow vs Managed Orchestration
But the biggest challenge is usually somewhere els...

Latest Reply

Yogasathyandrun
New Contributor

yesterday

1 kudos

I completely agree that teams often underestimate the metadata challenge during modernization. One thing I’ve seen repeatedly, though, is that the hardest part isn’t always the metadata itself; it’s the business intent behind it. We can extract mapping...

by Sam500 • New Contributor III

Saturday

903 Views
4 replies
0 kudos

Resolved! Databricks Serverless Costs

Our Power BI reports consume real-time data, and for that the only remaining option is Databricks serverless, but serverless is an expensive option. How can we control serverless costs, and are there any other alternatives? Thank you.

Latest Reply

Yogasathyandrun
New Contributor

yesterday

0 kudos

Serverless is often the preferred option for Power BI DirectQuery workloads because it starts in seconds and scales automatically. However, it’s not always the only option, and there are several ways to reduce costs. A few high-impact optimizations:
Se...

by RGSLCA • New Contributor II

Friday

148 Views
1 reply
0 kudos

Selective overwrite on Partition and Liquid clustered tables

Hi, I have created 2 identical tables, but one is partitioned and the other is Liquid Clustered with Auto Clustering. I inserted 30M rows x 2 (60M) for two dates, date 1 = 2026-06-01 and date 2 = 2026-06-02, then I overwrite the date 2026-06-02 with a s...

Latest Reply

balajij8
Contributor III

Saturday

0 kudos

Hi, the current way is not optimal. You can follow the points below. The INSERT query ran with mostly 43 tasks, creating 43 output files. Since the Liquid Clustered table has no organization (clusterBy "[]"), dates are randomly scattered across files. Partition table...

by Yogasathyandrun • New Contributor

Saturday

132 Views
0 replies
0 kudos

Detecting Photon fallback in-cluster + safe right-sizing from system tables

I'm prototyping a cluster cost / right-sizing advisor and wanted to get a reality-check from people running Databricks at real scale before I sink more time into it. The main thing I'm chasing is Photon fallback. Photon quietly drops to the JVM on uns...

by Ramana • Valued Contributor II

09-10-2025 7:20:59 AM

2810 Views
6 replies
4 kudos

Resolved! Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()

Hello Community,
We have been trying to migrate our jobs from Classic Compute to Serverless Compute. As part of this process, we face several challenges, and this is one of them. When we read CSV or JSON files with multiLine=true, the load becomes sing...

Latest Reply

Ramana
Valued Contributor II

Friday

4 kudos

spark_partition_id is the closest and most performant function available as an alternative, and I migrated to use this function. So far, no issues.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.spark_p...

by Ramana • Valued Contributor II

09-10-2025 7:36:16 AM

1402 Views
3 replies
0 kudos

Resolved! Serverless Compute - Python - Custom Emails via SMTP (smtplib.SMTP(host_name)) - Any alternative?

Hello Community,
We have been trying to migrate our jobs from Classic Compute to Serverless Compute. As part of this process, we face several challenges, and this is one of them. We have several scenarios where we need to send an inline email via Pytho...

Latest Reply

Ramana
Valued Contributor II

Friday

0 kudos

The solution we implemented as an alternative for email sending from Serverless is via the Microsoft Graph API.
https://learn.microsoft.com/en-us/graph/api/user-sendmail?view=graph-rest-1.0&tabs=python
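As a rough illustration of that approach (not the poster's actual implementation), the sketch below builds the JSON body that the Graph `sendMail` endpoint expects, using only the standard library. `build_send_mail_payload` is a hypothetical helper, the addresses are placeholders, and acquiring the OAuth access token is assumed to happen elsewhere.

```python
import json


def build_send_mail_payload(subject, html_body, recipients):
    """Build the JSON body for Microsoft Graph's user sendMail action."""
    return {
        "message": {
            "subject": subject,
            "body": {"contentType": "HTML", "content": html_body},
            "toRecipients": [
                {"emailAddress": {"address": address}} for address in recipients
            ],
        }
    }


# Placeholder content and recipients for illustration only.
payload = build_send_mail_payload(
    subject="Nightly job finished",
    html_body="<p>All tasks succeeded.</p>",
    recipients=["data-team@example.com", "oncall@example.com"],
)
encoded = json.dumps(payload)
# The encoded body would be POSTed to
# https://graph.microsoft.com/v1.0/users/<user-id-or-upn>/sendMail
# with an "Authorization: Bearer <token>" header; sending is omitted here.
```

Since this goes over HTTPS instead of SMTP, it does not depend on reaching an SMTP host from the compute, which fits the situation described in the thread.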

by RGSLCA • New Contributor II

2 weeks ago

470 Views
7 replies
1 kudos

Sizing Tables and delta logs/CDF

Hi, I need to compare the sizes of my Delta tables. What's the correct approach? The table size reported by the ANALYZE command? But how do I check the Delta log size? If I enable CDF, how do I know the CDF log size (the overhead it adds)? Kind of l...

Latest Reply

Vikram10
New Contributor II

a week ago

1 kudos

Hi @RGSLCA, DESCRIBE DETAIL is the best starting point if you're comparing Delta table sizes, but it's important to understand what it reports. The sizeInBytes value represents only the latest active snapshot of the table, not the total storage consum...

by nidhin • New Contributor III

Thursday

143 Views
2 replies
1 kudos

Lakeflow SDP (DLT) produce external tables, or only UC-managed

As I understand it, streaming tables and materialized views produced by Lakeflow Spark Declarative Pipelines (DLT) are always Unity Catalog managed tables; there's no LOCATION/path option on create_streaming_table or apply_changes. Is that correct? A...

Latest Reply

Ashwin_DSA
Databricks Employee

Thursday

1 kudos

Hi @nidhin, What you’re saying is basically correct for a Unity Catalog-enabled Lakeflow Spark Declarative Pipelines setup. In that model, pipelines publish streaming tables and materialized views into the target catalog and schema, the data is store...

Databricks Community
