Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

franc_bomb
by New Contributor II
  • 2758 Views
  • 7 replies
  • 0 kudos

Cluster creation issue

Hello, I just started using the Databricks Community Edition for learning purposes. I have been trying to create a cluster, but the first attempt failed with a message asking me to retry or contact support, and now it just runs forever. What could be the problem?

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Can you please run one test: check whether you are able to start a node directly on the cloud provider?

6 More Replies
leymariv
by New Contributor
  • 840 Views
  • 1 reply
  • 0 kudos

Performance issue writing an extract of a huge unpartitioned single-column DataFrame

I have a huge DataFrame (40 billion rows), shared via Delta Sharing, that has only one column, 'payload', which contains JSON, and that is not partitioned. Even though the payloads are not all the same, they share a common field, sessionId, that I need to extract to be a...

Latest Reply
hari-prasad
Valued Contributor II
  • 0 kudos

Hi @leymariv, You can check the schema of the data in the Delta Sharing table using df.printSchema() to better understand the JSON structure. Use the from_json function to flatten or normalize the data into the respective columns. Additionally, you can understand how dat...
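As a rough illustration of the from_json suggestion above, the sketch below parses only the sessionId field out of the payload column. The function name `extract_session_id`, the `STRING` type for sessionId, and the single-field schema are assumptions made for the example; they are not taken from the original post.

```python
def extract_session_id(df, payload_col="payload"):
    """Return df with a top-level sessionId column parsed from the JSON payload.

    Hypothetical helper: the schema below is an assumption about the data.
    """
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.sql import functions as F

    # Assumed schema: declare only the field we need; other JSON keys are ignored.
    schema = "sessionId STRING"
    parsed = F.from_json(F.col(payload_col), schema)
    # Pull the parsed field out of the struct into its own column.
    return df.withColumn("sessionId", parsed.getField("sessionId"))
```

Declaring only the needed field keeps the parsed struct small; whether this helps with the write performance described in the question would still have to be measured.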

Maksym
by New Contributor III
  • 13866 Views
  • 5 replies
  • 7 kudos

Resolved! Databricks Autoloader is getting stuck and does not pass to the next batch

I have a simple job scheduled every 5 minutes. It reads new files from a storage account with cloudFiles and writes them into a Delta table, extremely simple. The code is something like this: df = (spark .readStream .format("cloudFiles") .option('cloudFil...

Latest Reply
lassebe
New Contributor II
  • 7 kudos

I had the same issue: files would randomly not be loaded. Setting `.option("cloudFiles.useIncrementalListing", False)` seemed to do the trick!
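A minimal sketch of how the option from this reply might be wired into an Auto Loader stream. The helper names `autoloader_options` and `build_stream`, the JSON default format, and the schema-location parameter are illustrative assumptions; only the `cloudFiles.useIncrementalListing` setting comes from the reply itself.

```python
def autoloader_options(file_format="json", schema_location=None):
    """Build Auto Loader reader options with incremental listing turned off."""
    options = {
        "cloudFiles.format": file_format,
        # The setting reported in the reply above; passed as a string here.
        "cloudFiles.useIncrementalListing": "false",
    }
    if schema_location is not None:
        # Where Auto Loader stores the inferred schema (assumed to be needed
        # when no explicit schema is supplied).
        options["cloudFiles.schemaLocation"] = schema_location
    return options


def build_stream(spark, source_path, file_format="json", schema_location=None):
    """Return a streaming DataFrame over source_path; spark is an existing session."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(file_format, schema_location))
        .load(source_path)
    )
```

Keeping the options in a plain dictionary makes it easy to log or diff the exact settings a job ran with when files appear to be skipped.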

4 More Replies
kasiviss42
by New Contributor III
  • 3161 Views
  • 3 replies
  • 0 kudos

Predicate pushdown query

Does predicate pushdown work when we filter a DataFrame that reads a Delta table against 2 lakh (200,000) values? That is, the filter condition is `column IN (list)`, where the list contains 2 lakh elements. I need to get n columns from a table. I am currently using joi...

Latest Reply
hari-prasad
Valued Contributor II
  • 0 kudos

Hi @kasiviss42, This might sound like a rhetorical question, but let's delve into the complexity of joins and filters and examine how generating a list of 2 lakh values affects it. Let's assume we have a fact table with 1 billion records and a dimension tab...

2 More Replies
NavyaSinghvi
by Databricks Partner
  • 5771 Views
  • 6 replies
  • 2 kudos

Resolved! File_arrival trigger in Workflow

I am using "job.trigger.file_arrival.location" in job parameters to get the triggered file location, but I am getting the error "job.trigger.file_arrival.location is not allowed". How can I get the triggered file location in a workflow?

Latest Reply
raghu2
Databricks Partner
  • 2 kudos

The parameters are passed as widgets to the job. After defining the parameters in the job definition, I was able to access the data associated with each parameter with the following code: widget_names = ["loc1", "loc2", "loc3"]  # Add all expected paramete...
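The truncated snippet above appears to loop over widget names; a minimal sketch of that idea follows. The helper `read_job_parameters` is hypothetical, and it takes `dbutils` as an argument (in a notebook the global `dbutils` object would be passed in) so that it can be exercised outside Databricks. Returning `None` for a missing widget is a design choice of this sketch, not something stated in the reply.

```python
def read_job_parameters(dbutils, names):
    """Read job parameters exposed as widgets; missing ones map to None."""
    values = {}
    for name in names:
        try:
            # dbutils.widgets.get returns the widget's current value as a string.
            values[name] = dbutils.widgets.get(name)
        except Exception:
            # The widget is not defined for this run.
            values[name] = None
    return values


# Parameter names taken from the reply; extend with all expected parameters.
widget_names = ["loc1", "loc2", "loc3"]
```

In a notebook task this would be called as `read_job_parameters(dbutils, widget_names)`.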

5 More Replies
Harish2122
by Contributor
  • 30105 Views
  • 10 replies
  • 13 kudos

Databricks SQL string_agg

Migrating some on-premises SQL views to Databricks and struggling to find conversions for some functions. The main one is the string_agg function: string_agg(field_name, ', '). Does anyone know how to convert that to Databricks SQL? Thanks in advance.

Latest Reply
smueller
New Contributor II
  • 13 kudos

If not grouping by something else: SELECT array_join(collect_set(field_name), ',') field_list FROM table
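One detail worth keeping in mind with the query above: `collect_set` drops duplicate values, whereas `string_agg` in SQL Server or PostgreSQL keeps them, so `collect_list` is the closer match when duplicates matter. The sketch below builds both variants as SQL text; the helper name `string_agg_sql` and the `_list` alias suffix are illustrative assumptions, and the element order produced by either collector is not guaranteed.

```python
def string_agg_sql(column, table, separator=", ", distinct=False):
    """Build a Spark SQL query that approximates string_agg(column, separator)."""
    # collect_list keeps duplicates (like string_agg); collect_set removes them.
    collector = "collect_set" if distinct else "collect_list"
    return (
        f"SELECT array_join({collector}({column}), '{separator}') AS {column}_list "
        f"FROM {table}"
    )


# Example: the generated text can be passed to spark.sql(...) in a notebook.
query = string_agg_sql("field_name", "my_table")
```

The identifiers are interpolated directly into the SQL text, so this is only suitable for trusted, hard-coded column and table names.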

9 More Replies
Abdul-Mannan
by New Contributor III
  • 1194 Views
  • 1 reply
  • 0 kudos

Notifications have file information but dataframe is empty using autoloader file notification mode

Using DBR 13.3, I'm ingesting data from one ADLS storage account using Auto Loader with file notification mode enabled, and writing to a container in another ADLS storage account. This is older code that uses a foreachBatch sink to process the data ...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Here are some potential steps and considerations to troubleshoot and resolve the issue: Permissions and Configuration: Ensure that the necessary permissions are correctly set up for file notification mode. This includes having the appropriate roles ...

thecodecache
by New Contributor II
  • 6460 Views
  • 2 replies
  • 0 kudos

Transpile a SQL Script into PySpark DataFrame API equivalent code

Input SQL Script (assume any dialect) : SELECT b.se10, b.se3, b.se_aggrtr_indctr, b.key_swipe_ind FROM (SELECT se10, se3, se_aggrtr_indctr, ROW_NUMBER() OVER (PARTITION BY SE10 ...

Latest Reply
MathieuDB
Databricks Employee
  • 0 kudos

Hello @thecodecache, Have a look at the SQLGlot project: https://github.com/tobymao/sqlglot?tab=readme-ov-file#faq It can easily transpile SQL to Spark SQL, like this: import sqlglot from pyspark.sql import SparkSession # Initialize Spark session spar...

1 More Replies
William_Scardua
by Valued Contributor
  • 13292 Views
  • 2 replies
  • 1 kudos

PySpark or Scala?

Hi guys, Many people use PySpark to develop their pipelines. In your opinion, in which cases is it better to use one or the other? Or is it better to choose a single language? Thanks

Latest Reply
hari-prasad
Valued Contributor II
  • 1 kudos

Hi @William_Scardua, It is advisable to consider using Python (or PySpark) due to Spark's comprehensive API support for Python. Furthermore, Databricks currently supports Delta Live Tables (DLT) with Python, but does not support Scala at this time. Ad...

1 More Replies
Gajju
by Databricks Partner
  • 753 Views
  • 1 reply
  • 0 kudos

[Deprecation Marker Required] : MERGE INTO Clause

Dear Friends: Considering that MERGE INTO may generate wrong results (see "The APPLY CHANGES APIs: Simplify change data capture with Delta Live Tables | Databricks on AWS"), may I ask why its API is still present in the technical documentation without a "Deprec...

Latest Reply
User16502773013
Databricks Employee
  • 0 kudos

Hello @Gajju, MERGE INTO is not being deprecated. APPLY CHANGES should be seen as an enhanced merge process in Delta Live Tables that handles out-of-sequence records automatically, as shown in the example in the documentation you shared. The notion of wr...

milind2000
by New Contributor
  • 584 Views
  • 1 reply
  • 0 kudos

Question about Data Management for Supply-Demand Allocation

I have a scenario where I am trying to parallelize supply-demand allotment between sellers and buyers with many-to-many links. I am unsure whether I can parallelize the calculation using PySpark operations. I have two columns to keep track of in...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Parallelizing supply-demand allotment in PySpark can be challenging due to the need for sequential updates to supply and demand values across rows. However, it is possible to achieve this using PySpark operations, though it may require a different ap...

glevine
by New Contributor II
  • 1339 Views
  • 1 reply
  • 0 kudos

Resolved! DNSResolve Error while establishing JDBC connection to Azure Databricks

I am using the Databricks JDBC driver (https://databricks.com/spark/jdbc-drivers-download) to connect to Azure Databricks through a VPN. I am connecting through a SaaS low-code platform, Appian, so I don't have access to any more logs. We have set up ...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

It seems that DNS is unable to resolve the domain name of your workspace. Are you able to access the workspace from a browser over the VPN connection?

eballinger
by Contributor
  • 3075 Views
  • 6 replies
  • 2 kudos

Resolved! DLT Pipeline Event Logs

There seems to be an issue with our DLT pipeline event logs. I am not sure whether this is a recent bug (they were OK in December), but the issue occurs in dev, qc, and prod, and only a couple of days of log history are now visible in the UI. From wha...

Latest Reply
Walter_C
Databricks Employee
  • 2 kudos

Great to hear your issue got resolved.

5 More Replies
Costas96
by New Contributor III
  • 1585 Views
  • 1 reply
  • 1 kudos

Resolved! Delta Live Tables: Add sequential column

Hello everyone, I have a DLT table (examp_table) and I want to add a sequential column whose values are incremented every time a record is ingested. I tried to do that with the monotonically_increasing_id and Window.orderBy("a column") functions...

Latest Reply
Alberto_Umana
Databricks Employee
  • 1 kudos

Hi @Costas96, Thanks for your question. You can use the identity column feature: https://www.databricks.com/blog/2022/08/08/identity-columns-to-generate-surrogate-keys-are-now-available-in-a-lakehouse-near-you.html

BenceCzako
by New Contributor II
  • 2967 Views
  • 5 replies
  • 0 kudos

Databricks mount bug

Hello, I have a weird problem in Databricks for which I hope you can suggest some solutions. I have an Azure ML blob storage mounted to Databricks with a folder structure that can be accessed from a notebook as /dbfs/mnt/azuremount/foo/bar/something.t...

Latest Reply
BenceCzako
New Contributor II
  • 0 kudos

Hello, can you figure out the issue?

4 More Replies