Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

lachu
by Visitor
  • 77 Views
  • 2 replies
  • 0 kudos

SDP continuous mode

Hi, I was doing a POC using open source Spark and Kafka in a Docker container and got it working. The sample code ingests data from Kafka, but it runs only in batch mode; I am not able to continuously ingest the Kafka stream. Question: Can w...

Latest Reply
bala_sai
New Contributor
  • 0 kudos

Yes, we can build a continuous streaming pipeline using open source Spark. The main thing is to use Spark Structured Streaming, not a normal batch read. For Kafka streaming, we need to use spark.readStream, then write using writeStream, and keep the ...
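A minimal sketch of the readStream/writeStream pattern this reply describes, assuming PySpark with the spark-sql-kafka connector available. The broker address, topic name, and output and checkpoint paths are hypothetical placeholders, and the Parquet sink is only one possible choice.

```python
"""Sketch: continuous Kafka ingestion with Spark Structured Streaming."""


def kafka_source_options(bootstrap_servers: str, topic: str) -> dict:
    """Options for Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",  # read from the beginning on first start
    }


def start_kafka_stream(spark, bootstrap_servers, topic, output_path, checkpoint_path):
    """Start a streaming query that keeps appending Kafka records to Parquet files."""
    events = (
        spark.readStream  # readStream (not read) makes the source unbounded
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    )
    return (
        events.writeStream
        .format("parquet")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)  # lets the query resume after a restart
        .outputMode("append")
        .start()
    )


# Usage (placeholder values; requires pyspark and the Kafka connector package):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("kafka-poc").getOrCreate()
#   query = start_kafka_stream(spark, "kafka:9092", "events", "/tmp/out", "/tmp/chk")
#   query.awaitTermination()  # block so the driver keeps the stream running
```

Without the final awaitTermination() call a standalone script would exit right after start(), which can look like a one-off batch run.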

1 More Replies
ConnorK
by Databricks Partner
  • 158 Views
  • 3 replies
  • 2 kudos

Databricks Standard SharePoint Connector Performance Issues

I've recently started using the Databricks Standard SharePoint connector within my workspace and have run into some significant performance issues. My notebook does a straightforward read using the following: lakeflow_connection_name = 'sharepoint_dev'...

Latest Reply
Yogasathyandrun
New Contributor
  • 2 kudos

I think your diagnosis is likely correct. One thing that stands out is that you’re only reading A1:Z2 from each workbook. Given that the operation is still taking 40+ minutes, the bottleneck is unlikely to be the Excel parsing itself and more likely t...

2 More Replies
nidhin
by New Contributor III
  • 147 Views
  • 2 replies
  • 1 kudos

Lakeflow SDP (DLT) produce external tables, or only UC-managed

As I understand it, streaming tables and materialized views produced by Lakeflow Spark Declarative Pipelines (DLT) are always Unity Catalog managed tables; there's no LOCATION/path option on create_streaming_table or apply_changes. Is that correct? A...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 1 kudos

Hi @nidhin, What you’re saying is basically correct for a Unity Catalog-enabled Lakeflow Spark Declarative Pipelines setup. In that model, pipelines publish streaming tables and materialized views into the target catalog and schema, the data is store...

1 More Replies
amirabedhiafi
by Contributor
  • 670 Views
  • 4 replies
  • 5 kudos

Resolved! json file existing in volume but not showing in UI

I have some JSON files in a specific volume. When I search for them they don't appear, but when I query the volume using Python I am able to get them and read their content. Any help?

Latest Reply
Vikram10
New Contributor II
  • 5 kudos

Hi, the global workspace search won't return results for files stored in Unity Catalog Volumes. Its indexing is focused on workspace assets and catalog-managed objects, rather than the underlying files within a Volume. To locate files in a Volume, navi...
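Since the original poster can already read the files from Python, a small helper along these lines can stand in for search. This is a sketch only: it assumes the code runs on Databricks compute where the Volume is exposed under a /Volumes/<catalog>/<schema>/<volume> path, and the catalog, schema, and volume names in the usage comment are hypothetical.

```python
import json
from pathlib import Path


def find_json_files(volume_root: str, name_contains: str = "") -> list[Path]:
    """Recursively list *.json files under a directory, optionally filtered by name."""
    root = Path(volume_root)
    return sorted(
        p for p in root.rglob("*.json")
        if name_contains.lower() in p.name.lower()
    )


def load_json(path: Path):
    """Read one JSON file and return the parsed content."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


# Usage on Databricks (hypothetical catalog/schema/volume names):
#   for p in find_json_files("/Volumes/main/raw/landing", name_contains="orders"):
#       print(p)
```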

3 More Replies
CG29
by New Contributor II
  • 582 Views
  • 5 replies
  • 2 kudos

Resolved! Databricks unable to list ADLS folder and files

Hi Databricks Community, I am able to list the container from my Databricks workspace but unable to list the folders and files beneath it. If I try to access the same files and folder from the Databricks UI, external location path, I am able to see all fil...

Latest Reply
ashukasma
New Contributor II
  • 2 kudos

The following may be the causes:
1. Different authentication methods
- The UI's external location uses Unity Catalog credentials
- Your dbutils.fs.ls() command uses the compute's Spark configurations
- These may be using different credentials with differe...

4 More Replies
Sainath368
by Contributor
  • 330 Views
  • 1 replies
  • 2 kudos

Resolved! DESCRIBE HISTORY Performance Issue for Large Scale Tables (22K Tables)

Hi everyone, I’m working with around 22,000 Unity Catalog external Delta tables, and my requirement is to execute DESCRIBE HISTORY table_name LIMIT 1 for each table and append the latest record into a single consolidated table. I’ve already tried mul...

Latest Reply
ShamenParis
New Contributor III
  • 2 kudos

Hi, the reason your performance degrades so badly (4 mins for 2k tables, but 50 mins for 12k) is the Spark driver. When you run spark.sql("DESCRIBE HISTORY...") inside a ThreadPoolExecutor, every single one of those 22,000 queries has to be...
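For reference, the fan-out pattern under discussion looks roughly like the sketch below. It only illustrates the ThreadPoolExecutor approach the question describes, with bounded concurrency and per-table error capture; it does not remove the driver bottleneck this reply points to. The run_query callable is a hypothetical stand-in that on a cluster could wrap spark.sql(...).collect().

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def latest_history(tables, run_query, max_workers=8):
    """Run `DESCRIBE HISTORY <table> LIMIT 1` per table with bounded concurrency.

    run_query: callable taking a SQL string and returning a list of rows
    (hypothetical; supplied by the caller).
    Returns (results, failures), both keyed by table name.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(run_query, f"DESCRIBE HISTORY {t} LIMIT 1"): t
            for t in tables
        }
        for fut in as_completed(futures):
            table = futures[fut]
            try:
                rows = fut.result()
                results[table] = rows[0] if rows else None
            except Exception as exc:  # one bad table should not stop the sweep
                failures[table] = repr(exc)
    return results, failures
```

Collecting failures separately makes it possible to retry only the tables that errored instead of rerunning the whole sweep.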

naveenayalla
by New Contributor II
  • 349 Views
  • 1 replies
  • 3 kudos

Why We Moved Our Operational Database Into Databricks — And Stopped Managing Two Stacks

Lakebase just went GA. Here's what a production migration actually looks like. For most of the last decade, our data infrastructure lived in two separate worlds. On one side, a transactional database handling operational workloads: the writes, the loo...

Data Engineering
Architecture
Community articles
Database
DIAS2026
lakebase
Latest Reply
Mailendiran
New Contributor III
  • 3 kudos

Great write-up, and it felt useful. Thanks for sharing the real experience!

Albertino
by New Contributor
  • 227 Views
  • 1 replies
  • 1 kudos

databricks-connect library for python and pandas 3

Hello, databricks-connect pins pandas during installation. Since we're moving towards pandas 3, can you please add support for the newest version as well?

Data Engineering
databricks-connect
Pandas
python
Latest Reply
Brahmareddy
Esteemed Contributor II
  • 1 kudos

Hi @Albertino, how are you doing today? Thanks for calling this out. As per my understanding, at the moment, the pandas pin is there intentionally. Databricks Connect release notes say supported pandas versions are currently limited to 1.0.5<=pandas<...

naveenayalla
by New Contributor II
  • 311 Views
  • 1 replies
  • 0 kudos

From RAG Demo to Production on Databricks: 7 Things Teams Should Validate First

By Naveen Ayalla. Many teams can build a RAG demo quickly. Upload documents, create embeddings, connect a model, ask a question, and show an answer. But production is differen...

Latest Reply
naveenayalla
New Contributor II
  • 0 kudos

Thanks for reading. I’m especially interested in hearing from people who have worked on real RAG or GenAI workflows. Which one has been the biggest challenge for your team?
1. Choosing the right source data
2. Access control and governance
3. Improving r...

ccsalt
by New Contributor II
  • 461 Views
  • 4 replies
  • 1 kudos

Inconsistent Cluster Log Persistence to Volume/S3 (stderr, stdout, log4j-active.log)

Saving logs from an all-purpose cluster to a Volume or S3 is not consistent, because stderr, stdout, and log4j-active.log get overwritten when the cluster is restarted between minutes 01 and 59. Tested case: A job is configured to start every 20 minutes,...

Latest Reply
aleksandra_ch
Databricks Employee
  • 1 kudos

Hi @ccsalt, this is a known limitation. Log rotation (renaming to log4j-YYYY-MM-DD-HH.log.gz) only happens on the hour boundary. The active log file log4j-active.log always has the same name and is overwritten if a cluster restart happens within one...
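To make the at-risk window concrete, here is a small sketch. It assumes rotation happens exactly on the hour, as this reply states, and that anything written to log4j-active.log since the later of the cluster start and the last hour boundary has not yet been archived at restart time. The function name is illustrative, not part of any Databricks API.

```python
from datetime import datetime


def unrotated_window(cluster_start: datetime, restart: datetime):
    """Return the (start, end) span of log lines still in log4j-active.log at restart.

    Assumption: rotation runs only on the hour boundary, so lines written after
    the later of (cluster start, last hour boundary) are not yet archived.
    """
    if restart < cluster_start:
        raise ValueError("restart must not precede cluster_start")
    last_boundary = restart.replace(minute=0, second=0, microsecond=0)
    return max(cluster_start, last_boundary), restart


# Usage: a cluster started at 10:05 and restarted at 10:25 never crossed an
# hour boundary, so the whole 10:05 to 10:25 span sits in the active file.
#   unrotated_window(datetime(2025, 1, 1, 10, 5), datetime(2025, 1, 1, 10, 25))
```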

3 More Replies
Alessio_F
by New Contributor
  • 234 Views
  • 1 replies
  • 0 kudos

Extract SQL function in SQL Server federated database

Hi everyone, I'm using Azure Databricks with a customer who has a SQL Server database federated in Unity Catalog. It seems that, while converting some date functions to the SQL Server dialect, Databricks uses the function "extract", which is not re...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @Alessio_F, this happens because in Databricks SQL both the year and month functions are just aliases for the following patterns:
- extract(YEAR FROM expr)
- extract(MONTH FROM expr)
When Databricks pushes a predicate or expression down to the remote SQL ...

Raj_DB
by Contributor
  • 293 Views
  • 1 replies
  • 1 kudos

Resolved! Automating Job Permission Updates in Databricks Using a Notebook

Hi everyone, I am looking to create a notebook that, when executed by a user, performs the following actions:
- Retrieves all Databricks jobs created by the current user
- Checks whether a specific role already has permissions on those jobs
- Automatically add...

Latest Reply
ziafazal
Databricks Partner
  • 1 kudos

Hi @Raj_DB, you can use the Databricks SDK to retrieve all jobs and filter them, selecting only those whose owner is the current user. Something like this:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
# Specify the user email/username you want...
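A hedged sketch of how the full flow could look, not the code from this reply (which is truncated above). The filtering logic is kept in a plain function so it can be checked without a workspace. The SDK wiring assumes the databricks-sdk jobs API (jobs.list, jobs.get_permissions, jobs.update_permissions) and filters on creator_user_name; the CAN_VIEW level and the group name in the usage comment are assumptions, so verify the names against your installed SDK version.

```python
def jobs_needing_grant(jobs, acls_by_job, owner, principal):
    """Return IDs of jobs created by `owner` whose ACL does not mention `principal`.

    jobs: iterable of (job_id, creator_user_name) pairs
    acls_by_job: dict mapping job_id to the set of principal names on that job
    """
    return [
        job_id
        for job_id, creator in jobs
        if creator == owner and principal not in acls_by_job.get(job_id, set())
    ]


def grant_group_on_my_jobs(group_name):
    """Wire the helper to the Databricks SDK (needs workspace authentication)."""
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.jobs import JobAccessControlRequest, JobPermissionLevel

    w = WorkspaceClient()
    me = w.current_user.me().user_name
    jobs = [(j.job_id, j.creator_user_name) for j in w.jobs.list()]

    acls = {}
    for job_id, creator in jobs:
        if creator != me:
            continue  # only inspect jobs created by the current user
        perms = w.jobs.get_permissions(job_id=str(job_id))
        acls[job_id] = {
            name
            for acl in (perms.access_control_list or [])
            for name in (acl.group_name, acl.user_name, acl.service_principal_name)
            if name
        }

    for job_id in jobs_needing_grant(jobs, acls, me, group_name):
        # update_permissions adds to the existing ACL rather than replacing it
        w.jobs.update_permissions(
            job_id=str(job_id),
            access_control_list=[
                JobAccessControlRequest(
                    group_name=group_name,
                    permission_level=JobPermissionLevel.CAN_VIEW,  # assumed level
                )
            ],
        )


# Usage in a notebook (hypothetical group name):
#   grant_group_on_my_jobs("data-readers")
```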

vedanth
by New Contributor
  • 228 Views
  • 1 replies
  • 0 kudos

Salesforce Connector - Lakeflow Connect 400 Error

Hi all, I have been trying to set up Salesforce using Lakeflow Connect and followed the instructions in the docs (https://docs.databricks.com/aws/en/connect/managed-ingestion#sfdc). However, I run into an invalid_grant error. Yet the login history on Salesforce sh...

Latest Reply
GaneshI
New Contributor III
  • 0 kudos

Hi Vedanth, the invalid_grant error usually occurs due to authentication or OAuth configuration issues between Salesforce and Databricks Lakeflow Connect. Could you please verify the following points: Ensure the Salesforce user account is not locked and...
