Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

garciargs
by New Contributor
  • 29 Views
  • 1 reply
  • 0 kudos

Incremental load from two tables

Hi, I am looking to build an ETL process for an incrementally loaded silver table. This silver table, let's say "contracts_silver", is built by joining two bronze tables, "contracts_raw" and "customer". contracts_silver: CONTRACT_ID | STATUS | CUSTOMER_NAME; 1 | SIGNED | Pet...

Latest Reply
hari-prasad
Valued Contributor II
  • 0 kudos

Hi @garciargs, yes, in Databricks you can do it using DLT (Delta Live Tables) and Spark Structured Streaming, where you have to enable CDF (Change Data Feed) on both contracts_raw and customer_raw, which would track all DML changes over the raw tables. -- N...
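A minimal sketch of the CDF-plus-streaming pattern this reply describes, using the thread's table names; the CUSTOMER_ID join key and checkpoint path are assumptions, not code from the answer:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One-time DDL: enable Change Data Feed on both bronze tables.
spark.sql("ALTER TABLE contracts_raw SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")
spark.sql("ALTER TABLE customer SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Stream the change feed of the driving table and join it to the
# customer dimension to maintain the silver table incrementally.
contracts_changes = (
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .table("contracts_raw")
    .filter("_change_type IN ('insert', 'update_postimage')")
)
customers = spark.read.table("customer")  # static lookup side

silver = (
    contracts_changes
    .join(customers, "CUSTOMER_ID")  # assumed join key
    .select("CONTRACT_ID", "STATUS", "CUSTOMER_NAME")
)

(silver.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/contracts_silver")
    .trigger(availableNow=True)
    .toTable("contracts_silver"))
```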

kyrrewk
by New Contributor II
  • 21 Views
  • 2 replies
  • 0 kudos

Monitor progress when using databricks-connect

When using databricks-connect, how can you monitor progress? Ideally, we want something similar to what you get in a Databricks notebook, i.e., information about the jobs/stages. We are using Python.

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

When you refer to progress, do you mean the Spark job information shown for each cell during notebook execution?

1 More Replies
leymariv
by Visitor
  • 70 Views
  • 1 reply
  • 0 kudos

Performance issue writing an extract of a huge unpartitioned single-column dataframe

I have a huge df (40 billion rows) shared via Delta Sharing that has only one column, 'payload', which contains JSON and is not partitioned. Even if all those payloads are not the same, they have a common col sessionId that I need to extract to be a...

Latest Reply
hari-prasad
Valued Contributor II
  • 0 kudos

Hi @leymariv, you can check the schema of the data in the Delta Sharing table using df.printSchema() to better understand the JSON structure. Use the from_json function to flatten or normalize the data into their respective columns. Additionally, you can understand how dat...
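A hedged sketch of that suggestion; the table names are placeholders, and the sessionId key comes from the question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import get_json_object

spark = SparkSession.builder.getOrCreate()

# Delta Sharing table with a single JSON 'payload' column (hypothetical name).
df = spark.read.table("share_catalog.share_schema.events")

# Extract just the key you need; get_json_object avoids defining a full
# schema for heterogeneous payloads (switch to from_json once a schema is known).
with_session = df.withColumn("sessionId", get_json_object("payload", "$.sessionId"))

# Persist with a layout keyed on sessionId so later per-session extracts
# prune files instead of scanning all 40B rows. Partitioning only makes
# sense if session cardinality is modest; otherwise prefer clustering.
(with_session.write
    .mode("overwrite")
    .partitionBy("sessionId")
    .saveAsTable("my_catalog.my_schema.events_by_session"))
```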

Maksym
by New Contributor III
  • 8547 Views
  • 5 replies
  • 7 kudos

Resolved! Databricks Autoloader is getting stuck and does not pass to the next batch

I have a simple job scheduled every 5 min. Basically, it listens to cloudFiles on a storage account and writes them into a delta table; extremely simple. The code is something like this: df = (spark .readStream .format("cloudFiles") .option('cloudFil...

Latest Reply
lassebe
New Contributor II
  • 7 kudos

I had the same issue: files would randomly not be loaded. Setting `.option("cloudFiles.useIncrementalListing", False)` seemed to do the trick!
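For reference, a minimal Auto Loader stream with that workaround applied; the path, source format, and checkpoint location are placeholders:

```python
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Workaround from this thread: force full directory listings instead
    # of incremental listing, at the cost of slower file discovery.
    .option("cloudFiles.useIncrementalListing", "false")
    .load("abfss://landing@account.dfs.core.windows.net/events/")
)

(df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .trigger(availableNow=True)
    .toTable("bronze_events"))
```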

4 More Replies
kasiviss42
by New Contributor II
  • 46 Views
  • 3 replies
  • 0 kudos

Predicate pushdown query

Does predicate pushdown work when we provide a filter on a dataframe reading a Delta table with 2 lakh (200,000) values? I.e., the filter condition is column isin(list), where the list contains 2 lakh elements. I need to get n columns from a table; I am currently using joi...

Latest Reply
hari-prasad
Valued Contributor II
  • 0 kudos

Hi @kasiviss42, this might sound like a rhetorical question, but let's delve into the complexity of joins and filters and examine how generating a list of 2 lakh values affects it. Let's assume we have a fact table with 1 billion records and a dimension tab...
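A hedged illustration of the trade-off being discussed, with made-up table and column names:

```python
from pyspark.sql.functions import broadcast

fact = spark.read.table("fact_table")   # ~1 billion rows
keys = spark.read.table("key_table")    # ~2 lakh (200,000) keys

# Option 1: isin() with a 200k-element Python list. The literals are
# embedded in the query plan; very large IN-lists bloat the plan and
# are unlikely to be pushed down efficiently.
key_list = [row.key for row in keys.select("key").collect()]
filtered = fact.filter(fact.key.isin(key_list)).select("col1", "col2")

# Option 2: a broadcast semi join. The keys stay as data, the small side
# is shipped once to each executor, and Delta can apply dynamic file pruning.
joined = fact.join(broadcast(keys), "key", "left_semi").select("col1", "col2")
```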

2 More Replies
AlexSantiago
by New Contributor II
  • 3136 Views
  • 16 replies
  • 4 kudos

Spotify API get token - raw_input was called, but this frontend does not support input requests.

Hello everyone, I'm trying to use Spotify's API to analyse my music data, but I'm receiving an error during authentication, specifically when I try to get the token; my code is below. Is it a Databricks bug? pip install spotipy from spotipy.oauth2 import SpotifyO...

Latest Reply
avamax44
Visitor
  • 4 kudos

How do top-followed accounts balance personal and professional content? Authenticity & Relatability: Personal content, such as behind-the-scenes moments, personal stories, and daily life updates, helps influencers connect with their audience on a huma...
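Back on the original question in this thread: the raw_input error occurs because SpotifyOAuth tries to prompt for the redirect URL, and notebook frontends cannot answer interactive prompts. A hedged sketch of a non-interactive alternative, the client-credentials flow, with placeholder credentials; note it cannot reach user-scoped endpoints, for which you would pre-generate a token outside the notebook:

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Client-credentials flow: no browser redirect, no input() prompt.
auth = SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID",          # placeholder
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
)
sp = spotipy.Spotify(auth_manager=auth)

results = sp.search(q="artist:Radiohead", type="artist", limit=1)
print(results["artists"]["items"][0]["name"])
```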

15 More Replies
franc_bomb
by New Contributor
  • 49 Views
  • 4 replies
  • 0 kudos

Cluster creation issue

Hello, I just started using the Databricks Community Edition for learning purposes. I have been trying to create a cluster, but the first time it failed, asking me to retry or contact support, and now it just runs forever. What could be the problem?

Latest Reply
franc_bomb
New Contributor
  • 0 kudos

I've been trying again but I still face the same problem.

  • 0 kudos
3 More Replies
NavyaSinghvi
by New Contributor III
  • 2131 Views
  • 6 replies
  • 2 kudos

Resolved! File_arrival trigger in Workflow

I am using "job.trigger.file_arrival.location" in job parameters to get the triggered file location, but I am getting the error "job.trigger.file_arrival.location is not allowed". How can I get the triggered file location in a workflow?

Latest Reply
raghu2
New Contributor III
  • 2 kudos

The parameters are passed as widgets to the job. After defining the parameters in the job definition, I was able to access the data associated with each parameter with the following code: widget_names = ["loc1", "loc2", "loc3"]  # Add all expected paramete...
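A hedged reconstruction of that pattern; the widget names are whatever parameter names you defined on the job:

```python
widget_names = ["loc1", "loc2", "loc3"]  # add all expected parameters

params = {}
for name in widget_names:
    try:
        # Job parameters surface as notebook widgets at run time.
        params[name] = dbutils.widgets.get(name)
    except Exception:
        params[name] = None  # parameter not supplied on this run

print(params)
```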

5 More Replies
adam_mich
by New Contributor II
  • 440 Views
  • 10 replies
  • 0 kudos

How to Pass Data to a Databricks App?

I am developing a Databricks application using the Streamlit package. I was able to get a "hello world" app deployed successfully, but now I am trying to pass data that exists in DBFS on the same instance. I try to read a CSV saved to DBFS bu...

Latest Reply
txti
New Contributor III
  • 0 kudos

I have the identical problem in Databricks Apps. I have tried: reading from a DBFS path using the mount form `/dbfs/myfolder/myfile` and the protocol form `dbfs:/myfolder/myfile`; reading from Unity Catalog volumes `/Volumes/mycatalog/mydatabase/myfolder/myfile`. Also mad...
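For apps that only need tabular data, the commonly recommended route is to query a SQL warehouse through the databricks-sql-connector rather than reading DBFS paths directly; a hedged sketch with placeholder environment variables (the app's identity needs access to both the warehouse and the table):

```python
import os
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname=os.getenv("DATABRICKS_HOST"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),  # SQL warehouse HTTP path
    access_token=os.getenv("DATABRICKS_TOKEN"),
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM mycatalog.mydatabase.mytable LIMIT 100")
        rows = cursor.fetchall()
```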

9 More Replies
TX-Aggie-00
by New Contributor III
  • 841 Views
  • 6 replies
  • 2 kudos

Installing linux packages on cluster

Hey everyone! We need to utilize LibreOffice in one of our automated tasks via a notebook. I have tried to install it via an init script that I attach to the cluster, but sometimes the program gets installed and sometimes it doesn't. For obviou...

Latest Reply
TX-Aggie-00
New Contributor III
  • 2 kudos

Thanks Alberto! There were 42 deb files, so I just changed my script to: sudo dpkg -i /dbfs/Volumes/your_catalog/your_schema/your_volume/*.deb. The init script log shows that it unpacks everything, sets them up, and the process triggers, but the packa...

5 More Replies
Harish2122
by Contributor
  • 14981 Views
  • 10 replies
  • 13 kudos

Databricks SQL string_agg

Migrating some on-premises SQL views to Databricks and struggling to find conversions for some functions. The main one is the string_agg function: string_agg(field_name, ', '). Anyone know how to convert that to Databricks SQL? Thanks in advance.

Latest Reply
smueller
New Contributor II
  • 13 kudos

If not grouping by something else: SELECT array_join(collect_set(field_name), ',') AS field_list FROM table
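Two hedged variants for reference, with placeholder table and column names; collect_set de-duplicates values, while collect_list keeps duplicates and is closer to SQL Server's string_agg:

```python
# Ungrouped: equivalent to string_agg over the whole table.
ungrouped = spark.sql("""
    SELECT array_join(collect_list(field_name), ', ') AS field_list
    FROM my_table
""")

# Grouped: equivalent to string_agg(...) ... GROUP BY group_col.
grouped = spark.sql("""
    SELECT group_col,
           array_join(collect_list(field_name), ', ') AS field_list
    FROM my_table
    GROUP BY group_col
""")
```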

9 More Replies
Abdul-Mannan
by New Contributor III
  • 39 Views
  • 1 reply
  • 0 kudos

Notifications have file information but dataframe is empty using autoloader file notification mode

Using DBR 13.3, I'm ingesting data from one ADLS storage account using Auto Loader with file notification mode enabled, and writing to a container in another ADLS storage account. This is older code which uses a foreachBatch sink to process the data ...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Here are some potential steps and considerations to troubleshoot and resolve the issue: Permissions and Configuration: Ensure that the necessary permissions are correctly set up for file notification mode. This includes having the appropriate roles ...
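A hedged sketch of a minimal file-notification reader, useful for isolating the problem described above; all paths and option values are placeholders, and the Azure-specific options apply only when Auto Loader manages the Event Grid setup itself:

```python
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")  # file notification mode
    # Needed when Auto Loader creates the Event Grid subscription and queue:
    .option("cloudFiles.subscriptionId", "<subscription-id>")
    .option("cloudFiles.tenantId", "<tenant-id>")
    .option("cloudFiles.clientId", "<sp-client-id>")
    .option("cloudFiles.clientSecret", "<sp-client-secret>")
    .option("cloudFiles.resourceGroup", "<resource-group>")
    # Notifications only cover files that arrive after the stream starts;
    # a periodic backfill listing catches anything missed or pre-existing.
    .option("cloudFiles.backfillInterval", "1 day")
    .load("abfss://source@account1.dfs.core.windows.net/data/")
)
```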

thecodecache
by New Contributor II
  • 1698 Views
  • 2 replies
  • 0 kudos

Transpile a SQL Script into PySpark DataFrame API equivalent code

Input SQL script (assume any dialect): SELECT b.se10, b.se3, b.se_aggrtr_indctr, b.key_swipe_ind FROM (SELECT se10, se3, se_aggrtr_indctr, ROW_NUMBER() OVER (PARTITION BY SE10 ...

Latest Reply
MathieuDB
Databricks Employee
  • 0 kudos

Hello @thecodecache, have a look at the SQLGlot project: https://github.com/tobymao/sqlglot?tab=readme-ov-file#faq It can easily transpile SQL to Spark SQL, like this: import sqlglot from pyspark.sql import SparkSession # Initialize Spark session spar...
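A hedged, self-contained version of that suggestion, using the thread's column names and an assumed completion of the truncated window clause:

```python
import sqlglot

tsql = """
SELECT b.se10, b.se3, b.se_aggrtr_indctr, b.key_swipe_ind
FROM (SELECT se10, se3, se_aggrtr_indctr, key_swipe_ind,
             ROW_NUMBER() OVER (PARTITION BY se10 ORDER BY se3 DESC) AS rn
      FROM my_table) b
WHERE b.rn = 1
"""

# transpile() returns a list of statements in the target dialect.
spark_sql = sqlglot.transpile(tsql, read="tsql", write="spark")[0]
print(spark_sql)
# spark.sql(spark_sql)  # then execute on Databricks
```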

1 More Replies
William_Scardua
by Valued Contributor
  • 6884 Views
  • 2 replies
  • 0 kudos

Pyspark or Scala ?

Hi guys, many people use PySpark to develop their pipelines. In your opinion, in which cases is it better to use one or the other? Or is it better to choose a single language? Thanks

Latest Reply
hari-prasad
Valued Contributor II
  • 0 kudos

Hi @William_Scardua, it is advisable to consider using Python (or PySpark) due to Spark's comprehensive API support for Python. Furthermore, Databricks currently supports Delta Live Tables (DLT) with Python, but does not support Scala at this time. Ad...

1 More Replies
JrV
by New Contributor
  • 31 Views
  • 1 reply
  • 0 kudos

Sparql and RDF data

Hello Databricks Community, does anyone have experience with running SPARQL (https://en.wikipedia.org/wiki/SPARQL) queries in Databricks? Make a connection to the Community SolidServer https://github.com/CommunitySolidServer/CommunitySolidServer and que...

Latest Reply
User16502773013
Databricks Employee
  • 0 kudos

Hello @JrV, for this use case, Databricks currently supports the Bellman SPARQL engine, which can run on Databricks as a Scala library operating on a DataFrame of triples (S, P, O). Integration is also available for Stardog through Databricks Partner Conne...


Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group