Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Abdul-Mannan
by New Contributor III
  • 151 Views
  • 1 reply
  • 0 kudos

Notifications have file information but DataFrame is empty using Auto Loader file notification mode

Using DBR 13.3, I'm ingesting data from one ADLS storage account using Auto Loader with file notification mode enabled, and writing to a container in another ADLS storage account. This is older code that uses a foreachBatch sink to process the data ...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Here are some potential steps and considerations to troubleshoot and resolve the issue: Permissions and Configuration: Ensure that the necessary permissions are correctly set up for file notification mode. This includes having the appropriate roles ...
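
For readers who land here, a minimal sketch of the setup under discussion, assuming Auto Loader's file notification option and a foreachBatch sink (the paths, the cloudFiles.format value, and the batch handler are placeholders, not the poster's code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def process_batch(batch_df, batch_id):
    # Hypothetical per-batch handler; replace with the real transformation.
    (batch_df.write.format("delta").mode("append")
        .save("abfss://target@otheraccount.dfs.core.windows.net/bronze"))

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")  # file notification mode
    .load("abfss://source@sourceaccount.dfs.core.windows.net/landing")
    .writeStream
    .option("checkpointLocation",
            "abfss://target@otheraccount.dfs.core.windows.net/_checkpoints/ingest")
    .foreachBatch(process_batch)
    .start())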

thecodecache
by New Contributor II
  • 1745 Views
  • 2 replies
  • 0 kudos

Transpile a SQL Script into PySpark DataFrame API equivalent code

Input SQL Script (assume any dialect): SELECT b.se10, b.se3, b.se_aggrtr_indctr, b.key_swipe_ind FROM (SELECT se10, se3, se_aggrtr_indctr, ROW_NUMBER() OVER (PARTITION BY SE10 ...

Latest Reply
MathieuDB
Databricks Employee
  • 0 kudos

Hello @thecodecache, have a look at the SQLGlot project: https://github.com/tobymao/sqlglot?tab=readme-ov-file#faq It can easily transpile SQL to Spark SQL, like this: import sqlglot from pyspark.sql import SparkSession # Initialize Spark session spar...
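
Since the snippet above is cut off, here is a self-contained sketch of the transpilation step SQLGlot performs (the input query is shortened from the original post, and the table name and source dialect are assumptions):

import sqlglot

sql = """
SELECT se10, se3, se_aggrtr_indctr,
       ROW_NUMBER() OVER (PARTITION BY se10 ORDER BY se3) AS rn
FROM some_table
"""
# transpile() returns a list of statements rendered in the target dialect.
print(sqlglot.transpile(sql, read="tsql", write="spark")[0])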

1 More Reply
William_Scardua
by Valued Contributor
  • 8134 Views
  • 2 replies
  • 1 kudos

PySpark or Scala?

Hi guys, many people use PySpark to develop their pipelines. In your opinion, in which cases is it better to use one or the other? Or is it better to choose a single language? Thanks

Latest Reply
hari-prasad
Valued Contributor II
  • 1 kudos

Hi @William_Scardua, it is advisable to consider using Python (or PySpark) due to Spark's comprehensive API support for Python. Furthermore, Databricks currently supports Delta Live Tables (DLT) with Python, but does not support Scala at this time. Ad...

1 More Reply
JrV
by New Contributor
  • 55 Views
  • 1 reply
  • 0 kudos

SPARQL and RDF data

Hello Databricks Community, does anyone have experience with running SPARQL (https://en.wikipedia.org/wiki/SPARQL) queries in Databricks? Make a connection to the Community SolidServer https://github.com/CommunitySolidServer/CommunitySolidServer and que...

Latest Reply
User16502773013
Databricks Employee
  • 0 kudos

Hello @JrV, for this use case Databricks currently supports the Bellman SPARQL engine, which can run on Databricks as a Scala library operating on a DataFrame of triples (S, P, O). Integration is also available for Stardog through Databricks Partner Conne...
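
For illustration only, the "DataFrame of triples (S, P, O)" shape the reply refers to can be mocked up in PySpark like this (the data and prefixes are invented; Bellman's own API is not shown here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

triples = spark.createDataFrame(
    [("ex:alice", "foaf:knows", "ex:bob"),
     ("ex:bob", "foaf:name", "Bob")],
    ["s", "p", "o"])

# The SPARQL pattern (?who foaf:knows ex:bob) expressed as a plain filter:
triples.filter("p = 'foaf:knows' AND o = 'ex:bob'").select("s").show()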

Gajju
by New Contributor
  • 57 Views
  • 1 reply
  • 0 kudos

[Deprecation Marker Required]: MERGE INTO Clause

Dear friends: Considering that MERGE INTO may generate wrong results (The APPLY CHANGES APIs: Simplify change data capture with Delta Live Tables | Databricks on AWS), may I ask why its API is still floating around in the technical documentation without a "Deprec...

Latest Reply
User16502773013
Databricks Employee
  • 0 kudos

Hello @Gajju, MERGE INTO is not being deprecated. APPLY CHANGES should be seen as an enhanced merge process in Delta Live Tables that handles out-of-sequence records automatically, as shown in the example in the documentation shared. The notion of wr...
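
For comparison, a hedged sketch of the APPLY CHANGES Python API the reply mentions (the table and column names are illustrative):

import dlt
from pyspark.sql import functions as F

dlt.create_streaming_table("customers_target")

dlt.apply_changes(
    target="customers_target",
    source="customers_cdc_feed",     # hypothetical CDC source table
    keys=["customer_id"],
    sequence_by=F.col("event_ts"),   # this ordering handles out-of-sequence records
    stored_as_scd_type=1)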

milind2000
by New Contributor
  • 60 Views
  • 1 reply
  • 0 kudos

Question about Data Management for Supply-Demand Allocation

I have a scenario where I am trying to parallelize supply-demand allotment between sellers and buyers with many-to-many links. I am unsure whether I can parallelize the calculation using PySpark operations. I have two columns to keep track of in...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Parallelizing supply-demand allotment in PySpark can be challenging due to the need for sequential updates to supply and demand values across rows. However, it is possible to achieve this using PySpark operations, though it may require a different ap...
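
One way that "different approach" can look, sketched as a greedy, priority-ordered fill using a window cumulative sum (every table and column name here is hypothetical):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

demand = spark.createDataFrame(
    [("s1", "b1", 10), ("s1", "b2", 30), ("s2", "b1", 20)],
    ["seller", "buyer", "demand_qty"])
supply = spark.createDataFrame([("s1", 25), ("s2", 50)], ["seller", "supply_qty"])

# Running total of demand per seller in priority order (here: by buyer id).
w = Window.partitionBy("seller").orderBy("buyer")
allocated = (demand
    .withColumn("cum_demand", F.sum("demand_qty").over(w))
    .join(supply, "seller")
    # Each request gets whatever supply remains before it, capped at its demand.
    .withColumn("allocated_qty", F.greatest(F.lit(0), F.least(
        F.col("demand_qty"),
        F.col("supply_qty") - F.col("cum_demand") + F.col("demand_qty")))))

allocated.show()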

glevine
by New Contributor
  • 141 Views
  • 1 reply
  • 0 kudos

Resolved! DNSResolve Error while establishing JDBC connection to Azure Databricks

I am using the Databricks JDBC driver (https://databricks.com/spark/jdbc-drivers-download) to connect to Azure Databricks through a VPN. I am connecting through a SaaS low-code platform, Appian, so I don't have access to any more logs. We have set up ...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

It seems that DNS is not able to resolve the domain name of your workspace. From a browser over the VPN connection, are you able to access it?

eballinger
by New Contributor III
  • 231 Views
  • 6 replies
  • 0 kudos

Resolved! DLT Pipeline Event Logs

There seems to be an issue now with our DLT pipeline event logs. I am not sure if this is a recent bug or not (they were OK in December), but the issue is in dev, QC, and prod, and we only have a couple of days of history logs now visible in the UI. From wha...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Great to hear your issue got resolved.

5 More Replies
Costas96
by New Contributor III
  • 119 Views
  • 1 reply
  • 1 kudos

Resolved! Delta Live Tables: Add sequential column

Hello everyone, I have a DLT table (examp_table) and I want to add a sequential column whose values are incremented every time a record gets ingested. I tried to do that with the monotonically_increasing_id and Window.orderBy("a column") functions...

Latest Reply
Alberto_Umana
Databricks Employee
  • 1 kudos

Hi @Costas96, thanks for your question. You can use the identity column feature: https://www.databricks.com/blog/2022/08/08/identity-columns-to-generate-surrogate-keys-are-now-available-in-a-lakehouse-near-you.html
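
For anyone skimming, a minimal sketch of the identity-column approach from that post (run in a notebook where spark is available; the table and column names are illustrative):

spark.sql("""
    CREATE TABLE IF NOT EXISTS examp_table_with_id (
        seq_id  BIGINT GENERATED ALWAYS AS IDENTITY,
        payload STRING
    ) USING DELTA
""")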

BenceCzako
by New Contributor II
  • 275 Views
  • 5 replies
  • 0 kudos

Databricks mount bug

Hello, I have a weird problem in Databricks for which I hope you can suggest some solutions. I have an Azure ML blob storage mounted to Databricks with a folder structure that can be accessed from a notebook as /dbfs/mnt/azuremount/foo/bar/something.t...

Latest Reply
BenceCzako
New Contributor II
  • 0 kudos

Hello, can you figure out the issue?

4 More Replies
drag7ter
by Contributor
  • 104 Views
  • 3 replies
  • 0 kudos

Disable caching in Serverless SQL Warehouse

I have a Serverless SQL Warehouse cluster, and I run my SQL code in the SQL editor. When I run a query for the first time, it takes 30 secs of total time, but on every subsequent run I see in the query profile that it gets the result set from the cache and takes 1-2 secs total...

Latest Reply
drag7ter
Contributor
  • 0 kudos

As I mentioned above, this setting doesn't work for a SQL warehouse cluster: SET use_cached_result = false

2 More Replies
dbx-user7354
by New Contributor III
  • 3238 Views
  • 5 replies
  • 1 kudos

PySpark DataFrame orderBy only orders within partitions when having multiple workers

I came across a PySpark issue when sorting a DataFrame by a column. It seems like PySpark only orders the data within partitions when there are multiple workers, even though it shouldn't. from pyspark.sql import functions as F import matplotlib.pyplot...

Latest Reply
NemesisMF
New Contributor II
  • 1 kudos

@NandiniN Did you try with a multi-worker cluster? Which Runtime and which Spark version did you use? Maybe it would be good to test with Runtime 13.3; then we would know whether it was fixed in the meantime. I found this on Stack Overflow. Seems someo...
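
A minimal repro sketch for anyone who wants to check this on a multi-worker cluster; after orderBy, collect() should return a globally sorted result, not per-partition order:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Enough partitions that the sort has to shuffle across workers.
df = spark.range(1_000_000).withColumn("x", F.rand(seed=42)).repartition(64)

values = [row.x for row in df.orderBy("x").select("x").collect()]
assert values == sorted(values), "global ordering violated"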

4 More Replies
Costas96
by New Contributor III
  • 426 Views
  • 7 replies
  • 0 kudos

Resolved! Delta Live Tables: Creating table with spark.sql and everything gets ingested into the first column

Hello everyone. I am new to DLT and I am trying to practice with it by doing some basic ingestions. I have a query like the following where I am getting data from two tables using UNION. I have noticed that everything gets ingested into the first colum...

Latest Reply
Costas96
New Contributor III
  • 0 kudos

Actually I found the solution: I used spark.readStream to read the external tables a and b into two DataFrames and then just did combined_df = df_a.union(df_b) to create my DLT table. Thank you!
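
In DLT Python terms, the fix described above would look roughly like this (the target table name is illustrative, and spark here is the session the DLT runtime provides):

import dlt

@dlt.table(name="examp_table")
def examp_table():
    df_a = spark.readStream.table("a")
    df_b = spark.readStream.table("b")
    # union matches columns by position, so both tables must share column order
    return df_a.union(df_b)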

6 More Replies
udara_zure
by New Contributor II
  • 176 Views
  • 3 replies
  • 0 kudos

Resolved! What is the best way to deploy workflows with different notebooks to execute in different workspaces

I have a workflow in the QA workspace with one notebook attached. I need to deploy the same workflow to the PRD workspace, with all the notebooks in the Azure DevOps repo, and attach and run a different notebook in the PRD workflow.

Latest Reply
ashraf1395
Valued Contributor
  • 0 kudos

Databricks Asset Bundles can be a great solution for this. Clear and straightforward. https://docs.databricks.com/en/dev-tools/bundles/index.html
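
To make that concrete, a hedged sketch of a databricks.yml in which each target attaches a different notebook (hostnames, job names, and paths are placeholders; deploy with databricks bundle deploy -t prd):

bundle:
  name: my_workflow

resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./notebooks/qa_notebook.py

targets:
  qa:
    workspace:
      host: https://adb-1111111111111111.11.azuredatabricks.net
  prd:
    workspace:
      host: https://adb-2222222222222222.22.azuredatabricks.net
    resources:
      jobs:
        my_job:
          tasks:
            - task_key: main
              notebook_task:
                notebook_path: ./notebooks/prd_notebook.py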

2 More Replies
