Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

noorbasha534
by Valued Contributor II
  • 1410 Views
  • 1 reply
  • 0 kudos

Error handling - SQL states

Dear all, a few questions please: 1. Has anyone successfully used the below way of dealing with error handling in PySpark (example: that contains data frames) as well as in SQL-code-based notebooks? from pyspark.errors import PySparkException try: spa...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @noorbasha534, the approach you mentioned for error handling in PySpark using PySparkException is a valid method. It allows you to catch specific exceptions related to PySpark operations and handle them accordingly. Logging errors into tables ...
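A minimal sketch of that pattern (assumes PySpark >= 3.4, where pyspark.errors exposes getErrorClass() and getSqlState(); the helper name is my own):

```python
def run_with_error_capture(fn):
    """Run a callable; on a PySpark error, return its structured details
    instead of raising, so they can be logged to a table later."""
    from pyspark.errors import PySparkException  # PySpark >= 3.4
    try:
        return fn(), None
    except PySparkException as e:
        return None, {
            "error_class": e.getErrorClass(),  # e.g. "TABLE_OR_VIEW_NOT_FOUND"
            "sql_state": e.getSqlState(),      # e.g. "42P01"
            "message": str(e),
        }
```

On a cluster you would call it as `result, err = run_with_error_capture(lambda: spark.sql("SELECT ..."))` and append `err` to a logging table when it is not None.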

subhas_hati
by New Contributor
  • 2313 Views
  • 1 reply
  • 0 kudos

Distinguishing stream workloads from batch workloads

Is it possible to use the same data source for batch data as well as stream data? Please find the following code that I got from the internet. The following code handles both stream and batch workloads. Please find attached the corresponding PDF file. I am f...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @subhas_hati, Thanks for your question: Batch Workload: The availableNow trigger is used for batch processing. When you set the trigger to availableNow, it processes all available data as a single batch and then stops. This is useful for scenarios...
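A sketch of the availableNow pattern described above (Spark 3.3+; the Auto Loader source, paths, and table name are placeholders):

```python
def drain_available_data(spark, source_path, checkpoint_dir, target_table):
    """Start a stream that processes everything currently available as one
    batch and then stops (Trigger.AvailableNow). The same source definition
    serves continuous streaming if you swap the trigger."""
    return (
        spark.readStream
             .format("cloudFiles")                      # Databricks Auto Loader
             .option("cloudFiles.format", "json")
             .load(source_path)
             .writeStream
             .option("checkpointLocation", checkpoint_dir)
             .trigger(availableNow=True)                # batch-style: drain, then stop
             .toTable(target_table)
    )
```

Removing `.trigger(availableNow=True)` (or using `processingTime`) turns the same pipeline into a continuously running stream, which is what makes one source definition serve both workloads.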

ijaza0489
by New Contributor
  • 1673 Views
  • 1 reply
  • 0 kudos

Best Strategy for Ingesting PostgreSQL Data into Bronze Layer in Databricks

I am designing a data ingestion strategy for ingesting 10 tables from a PostgreSQL 10 database into the Bronze layer using Databricks only (without ADF or other external tools). Full Load: 7 tables will be fully loaded in each run. Incremental Load: 3 ...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @ijaza0489, here are key points to keep in mind: Tracking and Implementing Incremental Loads: Delta Lake: Utilize Delta Lake for managing incremental loads. Delta Lake supports ACID transactions and allows you to perform upserts and merges eff...
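For the incremental tables, a typical Delta Lake upsert looks like the following (table and column names are hypothetical):

```python
# Hypothetical names throughout: bronze.customers, staging_customers,
# customer_id. On Databricks you would run this via spark.sql(merge_sql)
# after loading the PostgreSQL delta into the staging view.
merge_sql = """
MERGE INTO bronze.customers AS t
USING staging_customers AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
```

The staging view would normally be filtered on a watermark (for example, a last-modified timestamp tracked per table) so each run reads only rows changed since the previous run.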

mgallagher
by New Contributor
  • 1190 Views
  • 1 reply
  • 0 kudos

Limit access to certain pages of a dashboard

Hello, I would like to know if it is possible to restrict / limit access to certain pages of a multipage dashboard based on the user's group membership. In other words, the dashboard itself is able to be accessed by all, with some pages visible to all...

Data Engineering
access
dashboard
filter
group
possible
Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @mgallagher, Databricks does not natively support page-level access control within a single dashboard; instead, you can create separate dashboards for different user groups and control access at the dashboard level. This means creating a main dashboard acc...

John_Rotenstein
by New Contributor II
  • 28599 Views
  • 10 replies
  • 5 kudos

Retrieve job-level parameters in Python

Parameters can be passed to Tasks and the values can be retrieved with: dbutils.widgets.get("parameter_name"). More recently, we have been given the ability to add parameters to Jobs. However, the parameters cannot be retrieved like Task parameters. Quest...

Latest Reply
lprevost
Contributor III
  • 5 kudos

The only thing that has worked for me consistently in python is params = dbutils.widgets.getAll() where an empty dictionary is returned if I'm in interactive mode and the job/task params are returned if they are present.
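Outside Databricks (for example in a local IDE), dbutils is not defined at all, so a small wrapper around the same call keeps notebooks portable (the helper name is my own):

```python
def get_job_params():
    """Return job/task parameters on Databricks, or {} elsewhere.
    dbutils is injected by the Databricks runtime; in interactive mode
    getAll() itself already returns an empty dictionary."""
    try:
        return dbutils.widgets.getAll()  # noqa: F821 - defined on Databricks
    except NameError:
        return {}  # not running on Databricks at all
```

Notebook code can then read `params = get_job_params()` unconditionally and fall back to defaults when the dictionary is empty.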

9 More Replies
msgrac
by New Contributor II
  • 1851 Views
  • 2 replies
  • 0 kudos

Can't remove file on ADLS using dbutils.fs.rm because URL contains illegal character

The URL contains a "[" within, and I've tried to encode the path from "[" to "%5B%27", but it didn't work: from urllib.parse import quote; path = ""; encoded_path = quote(path)

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Try https://community.databricks.com/t5/data-engineering/how-can-i-delete-a-file-in-dbfs-with-illegal-character/td-p/9755 https://community.databricks.com/t5/data-engineering/using-dbutils-fs-ls-on-uri-with-square-brackets-results-in-error/td-p/6928
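The linked threads discuss workarounds; one commonly suggested approach is percent-encoding the bracket characters before handing the path on (the example URI below is made up, and whether dbutils.fs.rm accepts the encoded form should be verified against those threads):

```python
from urllib.parse import quote

raw = "abfss://container@account.dfs.core.windows.net/dir/file[1].json"
# Keep scheme/path separators readable; brackets become %5B and %5D.
encoded = quote(raw, safe=":/@.")
```

The earlier attempt encoded "[" as "%5B%27", but "%27" is an apostrophe; the bracket pair should map to "%5B" and "%5D" only.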

1 More Reply
JrV
by New Contributor
  • 1999 Views
  • 2 replies
  • 1 kudos

Sparql and RDF data

Hello Databricks Community, does anyone have experience with running SPARQL (https://en.wikipedia.org/wiki/SPARQL) queries in Databricks? Make a connection to the Community SolidServer https://github.com/CommunitySolidServer/CommunitySolidServer and que...

Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

You can use the rdflib library to connect to the Community SolidServer and execute SPARQL queries. from rdflib import Graph  
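Continuing the reply's rdflib suggestion, a hedged sketch (assumes rdflib is installed on the cluster, e.g. via %pip install rdflib; the endpoint URL and query are placeholders):

```python
def query_solid_server(resource_url, sparql):
    """Fetch RDF from a URL and run a SPARQL query over it with rdflib.
    The function name and arguments are illustrative."""
    from rdflib import Graph  # pip install rdflib
    g = Graph()
    g.parse(resource_url)           # fetch and parse RDF (Turtle/RDF-XML/...)
    return list(g.query(sparql))    # rows of SPARQL SELECT results
```

Note this parses the remote RDF locally and queries it; querying a remote SPARQL endpoint directly would instead use rdflib's SPARQLStore or a plain HTTP request to the endpoint.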

1 More Reply
databrick3
by New Contributor
  • 708 Views
  • 2 replies
  • 0 kudos

R model deployment

Unable to serve an R model on Databricks; even tried with pyfunc.

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

You can try Posit Connect. https://posit.co/blog/databricks-udfs/ Blogs - https://www.databricks.com/blog/databricks-and-posit-announce-new-integrations

1 More Reply
shreya_20202
by New Contributor II
  • 9388 Views
  • 1 reply
  • 1 kudos

Copy file structure including files from one storage to another incrementally using PySpark

I have a storage account dexflex and two containers, source and destination. The source container has directories and files as below: results search 03 Module19111.json Module19126.json 04 Module11291...

Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

Is this directory structure a partitioned table? 
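While the answer to the partitioning question matters for the Spark-native approach, the incremental copy itself can be sketched in plain Python (a local-filesystem sketch with throwaway directories; on Databricks the same walk can be done with dbutils.fs against the two containers):

```python
import pathlib, shutil, tempfile

def copy_new_files(src_root, dst_root):
    """Copy only files not yet present under dst_root, preserving the
    directory structure. 'New' here means 'missing at the destination';
    a modification-time check could be added for changed files."""
    src_root, dst_root = pathlib.Path(src_root), pathlib.Path(dst_root)
    copied = []
    for f in sorted(src_root.rglob("*")):
        if f.is_file():
            target = dst_root / f.relative_to(src_root)
            if not target.exists():
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(f, target)
                copied.append(target.name)
    return copied

# Tiny demo with made-up names mirroring the post's layout.
src, dst = pathlib.Path(tempfile.mkdtemp()), pathlib.Path(tempfile.mkdtemp())
(src / "03").mkdir()
(src / "03" / "Module19111.json").write_text("{}")
first_run = copy_new_files(src, dst)    # copies the new file
second_run = copy_new_files(src, dst)   # nothing new, copies nothing
```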

AtanuC
by New Contributor
  • 13064 Views
  • 1 reply
  • 0 kudos

OOP programming in PySpark on the Databricks platform

Hello Experts, I have a doubt, so I need your advice and opinion on the query below. Is OOP a good choice of programming for distributed data processing, like PySpark on the Databricks platform? If not, then what is, and what kind of challenges could b...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Functional programming is generally better suited for distributed data processing with PySpark on Databricks due to its emphasis on immutability, stateless operations, and higher-order functions. These features align well with Spark's execution model...
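The contrast can be illustrated in plain Python: the pipeline below uses only pure functions and builds a new collection rather than mutating one, which is exactly the shape that maps onto Spark transformations:

```python
# Functional style mirrors a Spark filter/map chain: no shared mutable
# state, each step is a pure function of its input, so the same logic
# distributes across partitions without coordination.
records = [1, 2, 3, 4, 5, 6]
evens_doubled = list(map(lambda x: x * 2,
                         filter(lambda x: x % 2 == 0, records)))
```

An OOP version that mutated a shared accumulator object inside a loop would not translate to Spark's execution model, since each executor would mutate its own copy.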

kazinahian
by New Contributor III
  • 5494 Views
  • 2 replies
  • 1 kudos

How can I create a new calculated field in Databricks using PySpark?

Hello, great people. I am new to Databricks and PySpark. How can I create a new column called "sub_total", where I group by "category", "subcategory", and "monthly" sales value? Appreciate your empathic solution.

Data Engineering
calculation
Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

I want to group by "category", "subcategory" and "monthly" sales value: sub_total_df = df.groupBy("category", "subcategory", "monthly").agg(sum("sales_value").alias("sub_total")). You could always type in your query in the Databricks notebook, by clic...
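For intuition, the same grouping can be sketched in plain Python with made-up rows; the PySpark groupBy/agg above computes exactly this per-(category, subcategory, monthly) sum:

```python
from collections import defaultdict

# Sample rows are illustrative, not taken from the poster's data.
rows = [
    {"category": "A", "subcategory": "x", "monthly": "Jan", "sales_value": 10},
    {"category": "A", "subcategory": "x", "monthly": "Jan", "sales_value": 5},
    {"category": "B", "subcategory": "y", "monthly": "Jan", "sales_value": 7},
]
sub_total = defaultdict(int)
for r in rows:
    key = (r["category"], r["subcategory"], r["monthly"])
    sub_total[key] += r["sales_value"]   # the "sub_total" aggregate per group
```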

1 More Reply
Divyanshu
by New Contributor
  • 7361 Views
  • 1 reply
  • 0 kudos

java.lang.ArithmeticException: long overflow Exception while writing to table | pyspark

Hey, I am trying to fetch data from Mongo and write to a Databricks table. I have read data from Mongo using the pymongo library, then flattened nested struct objects along with renaming columns (since there were a few duplicates), and then writing to Databrick...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Sorry, I am not going through the entire schema and code, but in general the error "java.lang.ArithmeticException: long overflow" typically occurs when a calculation exceeds the range that can be represented by a long data type in Java. This issue ca...
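The Java long range is fixed, so suspect columns can be pre-checked before the write; a small illustration (the microseconds example is a common culprit in timestamp conversions, not taken from the poster's schema):

```python
JAVA_LONG_MAX = 2**63 - 1
JAVA_LONG_MIN = -2**63

def fits_java_long(n):
    """True if n is representable as a Java/Spark LongType value."""
    return JAVA_LONG_MIN <= n <= JAVA_LONG_MAX

# A classic trigger: a value already in epoch microseconds is multiplied
# by 1_000_000 again during a mistaken unit conversion.
micros = 1_700_000_000_000_000   # a plausible epoch-microseconds value
```

In PySpark, filtering a Python-side sample of the offending column with a check like this can pinpoint which field overflows before the table write fails.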

sanjay
by Valued Contributor II
  • 18386 Views
  • 2 replies
  • 0 kudos

pyspark dropDuplicates performance issue

Hi, I am trying to delete duplicate records found by key, but it's very slow. It's a continuously running pipeline, so data is not that huge, but it still takes time to execute this command: df = df.dropDuplicates(["fileName"]). Is there any better approach to d...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Before dropDuplicates, ensure that your DataFrame operations are optimized by caching intermediate results if they are reused multiple times. This can help reduce the overall execution time. We could use some aggregates and grouping like df_deduped ...
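The aggregate-based alternative mentioned at the end can be sketched like this (the ingest_time column is illustrative; which row survives each fileName group depends on the aggregates you choose):

```python
def dedupe_by_filename(df):
    """Keep one row per fileName via groupBy/agg instead of dropDuplicates.
    A sketch: extend the agg() with first()/max() for every column you
    need to carry through."""
    from pyspark.sql import functions as F
    return (
        df.groupBy("fileName")
          .agg(F.max("ingest_time").alias("ingest_time"))
    )
```

Unlike dropDuplicates, this form makes the choice of surviving row explicit, and the shuffle it triggers can benefit from pre-partitioning the stream by the same key.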

1 More Reply
Phani1
by Databricks MVP
  • 5906 Views
  • 2 replies
  • 2 kudos

Execute Pyspark cells concurrently

Hi Team, is it feasible to run PySpark cells concurrently in Databricks notebooks? If so, kindly provide instructions on how to accomplish this. We aim to execute the intermediate steps simultaneously. The given scenario entails the simultaneou...

Latest Reply
NandiniN
Databricks Employee
  • 2 kudos

Databricks also supports executing SQL cells in parallel. While a command is running and your notebook is attached to an interactive cluster, you can run a SQL cell simultaneously with the current command. The SQL cell is executed in a new, parallel ...
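Within a single cell, independent steps can also be fanned out with threads; Spark actions submitted from separate threads of one driver are scheduled concurrently (a sketch with stand-in steps, where each step would normally trigger a Spark action):

```python
from concurrent.futures import ThreadPoolExecutor

def step_a():
    return "a done"   # placeholder for e.g. a table write

def step_b():
    return "b done"   # placeholder for an independent intermediate step

# Submit both steps at once; results are collected in submission order.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(step_a), pool.submit(step_b)]
    results = [f.result() for f in futures]
```

This only helps when the steps are truly independent; steps that feed each other still have to run sequentially.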

1 More Reply
CM2
by New Contributor
  • 4013 Views
  • 1 reply
  • 0 kudos

Data transfer from AWS/Databricks to GEO repository via FTP

Does anyone have a Python script that runs in Databricks to transfer RNAseq data stored in AWS bucket to a public repository (GEO)? All my attempts failed, it looks like the connection between dbx and geo isn't working as expected.

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

What is the exact error that you face? We can debug from there. I see there are some steps shared on GEO submissions https://www.ncbi.nlm.nih.gov/geo/info/submissionftp.html   
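For reference, a minimal FTP upload sketch with the standard library (host, credentials, and paths are placeholders; GEO's submission page linked above defines the real values):

```python
import ftplib

def upload_to_geo(host, user, password, local_path, remote_dir):
    """Upload one file over FTP in binary mode. All arguments are
    placeholders to be filled from GEO's submission instructions."""
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        with open(local_path, "rb") as fh:
            name = local_path.rsplit("/", 1)[-1]
            ftp.storbinary(f"STOR {name}", fh)
```

One common gotcha from Databricks: the file must be on a driver-local path (e.g. copied from the AWS bucket or DBFS to /tmp first), since ftplib reads an ordinary local file handle; also confirm outbound FTP ports are open from the workspace's network.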
