Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

noorbasha534
by Valued Contributor II
  • 1410 Views
  • 1 reply
  • 0 kudos

Error handling - SQL states

Dear all, a few questions please: 1. Has anyone successfully used the below way of dealing with error handling in PySpark (example: that contains data frames) as well as in SQL-code-based notebooks? from pyspark.errors import PySparkException try: spa...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @noorbasha534, the approach you mentioned for error handling in PySpark using PySparkException is a valid method. It allows you to catch specific exceptions related to PySpark operations and handle them accordingly. Logging errors into tables ...
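A minimal sketch of that pattern (assumes PySpark >= 3.4, where pyspark.errors exposes getErrorClass() and getSqlState(); the helper name is my own):

```python
def run_with_error_capture(fn):
    """Run a callable; on a PySpark error, return its structured details
    instead of raising, so they can be logged to a table later."""
    from pyspark.errors import PySparkException  # PySpark >= 3.4
    try:
        return fn(), None
    except PySparkException as e:
        return None, {
            "error_class": e.getErrorClass(),  # e.g. "TABLE_OR_VIEW_NOT_FOUND"
            "sql_state": e.getSqlState(),      # e.g. "42P01"
            "message": str(e),
        }
```

On a cluster you would call it as `result, err = run_with_error_capture(lambda: spark.sql("SELECT ..."))` and append `err` to a logging table when it is not None.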

subhas_hati
by New Contributor
  • 2313 Views
  • 1 reply
  • 0 kudos

Distinguishing stream workloads from batch workloads

Is it possible to use the same data source for batch data as well as stream data? Please find the following code that I got from the internet. The following code handles both stream and batch workloads. Please find attached the corresponding PDF file. I am f...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @subhas_hati, Thanks for your question: Batch Workload: The availableNow trigger is used for batch processing. When you set the trigger to availableNow, it processes all available data as a single batch and then stops. This is useful for scenarios...
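A sketch of the availableNow pattern described above (Spark 3.3+; the Auto Loader source, paths, and table name are placeholders):

```python
def drain_available_data(spark, source_path, checkpoint_dir, target_table):
    """Start a stream that processes everything currently available as one
    batch and then stops (Trigger.AvailableNow). The same source definition
    serves continuous streaming if you swap the trigger."""
    return (
        spark.readStream
             .format("cloudFiles")                      # Databricks Auto Loader
             .option("cloudFiles.format", "json")
             .load(source_path)
             .writeStream
             .option("checkpointLocation", checkpoint_dir)
             .trigger(availableNow=True)                # batch-style: drain, then stop
             .toTable(target_table)
    )
```

Removing `.trigger(availableNow=True)` (or using `processingTime`) turns the same pipeline into a continuously running stream, which is what makes one source definition serve both workloads.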

ijaza0489
by New Contributor
  • 1673 Views
  • 1 reply
  • 0 kudos

Best Strategy for Ingesting PostgreSQL Data into Bronze Layer in Databricks

I am designing a data ingestion strategy for ingesting 10 tables from a PostgreSQL 10 database into the Bronze layer using Databricks only (without ADF or other external tools). Full Load: 7 tables will be fully loaded in each run. Incremental Load: 3 ...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @ijaza0489, here are key points to keep in mind: Tracking and Implementing Incremental Loads: Delta Lake: Utilize Delta Lake for managing incremental loads. Delta Lake supports ACID transactions and allows you to perform upserts and merges eff...
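For the incremental tables, a typical Delta Lake upsert looks like the following (table and column names are hypothetical):

```python
# Hypothetical names throughout: bronze.customers, staging_customers,
# customer_id. On Databricks you would run this via spark.sql(merge_sql)
# after loading the PostgreSQL delta into the staging view.
merge_sql = """
MERGE INTO bronze.customers AS t
USING staging_customers AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
```

The staging view would normally be filtered on a watermark (for example, a last-modified timestamp tracked per table) so each run reads only rows changed since the previous run.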

mgallagher
by New Contributor
  • 1190 Views
  • 1 reply
  • 0 kudos

Limit access to certain pages of a dashboard

Hello, I would like to know if it is possible to restrict / limit access to certain pages of a multipage dashboard based on the user's group membership. In other words, the dashboard itself is able to be accessed by all, with some pages visible to all...

Data Engineering
access
dashboard
filter
group
possible
Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @mgallagher, Databricks does not natively support page-level access control within a single dashboard; instead, you can create separate dashboards for different user groups and control access at the dashboard level. This means creating a main dashboard acc...

John_Rotenstein
by New Contributor II
  • 28599 Views
  • 10 replies
  • 5 kudos

Retrieve job-level parameters in Python

Parameters can be passed to Tasks and the values can be retrieved with: dbutils.widgets.get("parameter_name"). More recently, we have been given the ability to add parameters to Jobs. However, the parameters cannot be retrieved like Task parameters. Quest...

Latest Reply
lprevost
Contributor III
  • 5 kudos

The only thing that has worked for me consistently in python is params = dbutils.widgets.getAll() where an empty dictionary is returned if I'm in interactive mode and the job/task params are returned if they are present.
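Outside Databricks (for example in a local IDE), dbutils is not defined at all, so a small wrapper around the same call keeps notebooks portable (the helper name is my own):

```python
def get_job_params():
    """Return job/task parameters on Databricks, or {} elsewhere.
    dbutils is injected by the Databricks runtime; in interactive mode
    getAll() itself already returns an empty dictionary."""
    try:
        return dbutils.widgets.getAll()  # noqa: F821 - defined on Databricks
    except NameError:
        return {}  # not running on Databricks at all
```

Notebook code can then read `params = get_job_params()` unconditionally and fall back to defaults when the dictionary is empty.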

9 More Replies
msgrac
by New Contributor II
  • 1851 Views
  • 2 replies
  • 0 kudos

Can't remove file on ADLS using dbutils.fs.rm because URL contains illegal character

The URL contains a "[" within, and I've tried to encode the path from "[" to "%5B%27", but it didn't work: from urllib.parse import quote; path = ""; encoded_path = quote(path)

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Try https://community.databricks.com/t5/data-engineering/how-can-i-delete-a-file-in-dbfs-with-illegal-character/td-p/9755 https://community.databricks.com/t5/data-engineering/using-dbutils-fs-ls-on-uri-with-square-brackets-results-in-error/td-p/6928
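The linked threads discuss workarounds; one commonly suggested approach is percent-encoding the bracket characters before handing the path on (the example URI below is made up, and whether dbutils.fs.rm accepts the encoded form should be verified against those threads):

```python
from urllib.parse import quote

raw = "abfss://container@account.dfs.core.windows.net/dir/file[1].json"
# Keep scheme/path separators readable; brackets become %5B and %5D.
encoded = quote(raw, safe=":/@.")
```

The earlier attempt encoded "[" as "%5B%27", but "%27" is an apostrophe; the bracket pair should map to "%5B" and "%5D" only.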

1 More Reply
JrV
by New Contributor
  • 1999 Views
  • 2 replies
  • 1 kudos

Sparql and RDF data

Hello Databricks Community, does anyone have experience with running SPARQL (https://en.wikipedia.org/wiki/SPARQL) queries in Databricks? Make a connection to the Community SolidServer https://github.com/CommunitySolidServer/CommunitySolidServer and que...

Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

You can use the rdflib library to connect to the Community SolidServer and execute SPARQL queries. from rdflib import Graph  
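Continuing the reply's rdflib suggestion, a hedged sketch (assumes rdflib is installed on the cluster, e.g. via %pip install rdflib; the endpoint URL and query are placeholders):

```python
def query_solid_server(resource_url, sparql):
    """Fetch RDF from a URL and run a SPARQL query over it with rdflib.
    The function name and arguments are illustrative."""
    from rdflib import Graph  # pip install rdflib
    g = Graph()
    g.parse(resource_url)           # fetch and parse RDF (Turtle/RDF-XML/...)
    return list(g.query(sparql))    # rows of SPARQL SELECT results
```

Note this parses the remote RDF locally and queries it; querying a remote SPARQL endpoint directly would instead use rdflib's SPARQLStore or a plain HTTP request to the endpoint.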

1 More Reply
databrick3
by New Contributor
  • 708 Views
  • 2 replies
  • 0 kudos

R model deployment

Unable to serve an R model on Databricks; even tried with pyfunc.

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

You can try Posit Connect. https://posit.co/blog/databricks-udfs/ Blogs - https://www.databricks.com/blog/databricks-and-posit-announce-new-integrations

1 More Reply
shreya_20202
by New Contributor II
  • 9388 Views
  • 1 reply
  • 1 kudos

Copy file structure including files from one storage to another incrementally using PySpark

I have a storage account dexflex and two containers, source and destination. The source container has directories and files as below: results search 03 Module19111.json Module19126.json 04 Module11291...

Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

Is this directory structure a partitioned table? 
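While the answer to the partitioning question matters for the Spark-native approach, the incremental copy itself can be sketched in plain Python (a local-filesystem sketch with throwaway directories; on Databricks the same walk can be done with dbutils.fs against the two containers):

```python
import pathlib, shutil, tempfile

def copy_new_files(src_root, dst_root):
    """Copy only files not yet present under dst_root, preserving the
    directory structure. 'New' here means 'missing at the destination';
    a modification-time check could be added for changed files."""
    src_root, dst_root = pathlib.Path(src_root), pathlib.Path(dst_root)
    copied = []
    for f in sorted(src_root.rglob("*")):
        if f.is_file():
            target = dst_root / f.relative_to(src_root)
            if not target.exists():
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(f, target)
                copied.append(target.name)
    return copied

# Tiny demo with made-up names mirroring the post's layout.
src, dst = pathlib.Path(tempfile.mkdtemp()), pathlib.Path(tempfile.mkdtemp())
(src / "03").mkdir()
(src / "03" / "Module19111.json").write_text("{}")
first_run = copy_new_files(src, dst)    # copies the new file
second_run = copy_new_files(src, dst)   # nothing new, copies nothing
```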

AtanuC
by New Contributor
  • 13064 Views
  • 1 reply
  • 0 kudos

OOP programming in PySpark on the Databricks platform

Hello Experts, I have a doubt, so I need your advice and opinion on the query below. Is OOP a good choice of programming for distributed data processing, like PySpark on the Databricks platform? If not, then what is, and what kind of challenges could b...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Functional programming is generally better suited for distributed data processing with PySpark on Databricks due to its emphasis on immutability, stateless operations, and higher-order functions. These features align well with Spark's execution model...
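The contrast can be illustrated in plain Python: the pipeline below uses only pure functions and builds a new collection rather than mutating one, which is exactly the shape that maps onto Spark transformations:

```python
# Functional style mirrors a Spark filter/map chain: no shared mutable
# state, each step is a pure function of its input, so the same logic
# distributes across partitions without coordination.
records = [1, 2, 3, 4, 5, 6]
evens_doubled = list(map(lambda x: x * 2,
                         filter(lambda x: x % 2 == 0, records)))
```

An OOP version that mutated a shared accumulator object inside a loop would not translate to Spark's execution model, since each executor would mutate its own copy.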

kazinahian
by New Contributor III
  • 5494 Views
  • 2 replies
  • 1 kudos

How can I create a new calculated field in Databricks using PySpark?

Hello, great people. I am new to Databricks and PySpark. How can I create a new column called "sub_total", where I group by "category", "subcategory", and "monthly" sales value? Appreciate your empathic solution.

Data Engineering
calculation
Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

I want to group by "category", "subcategory" and "monthly" sales value: sub_total_df = df.groupBy("category", "subcategory", "monthly").agg(sum("sales_value").alias("sub_total")). You could always type in your query in the Databricks notebook, by clic...
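For intuition, the same grouping can be sketched in plain Python with made-up rows; the PySpark groupBy/agg above computes exactly this per-(category, subcategory, monthly) sum:

```python
from collections import defaultdict

# Sample rows are illustrative, not taken from the poster's data.
rows = [
    {"category": "A", "subcategory": "x", "monthly": "Jan", "sales_value": 10},
    {"category": "A", "subcategory": "x", "monthly": "Jan", "sales_value": 5},
    {"category": "B", "subcategory": "y", "monthly": "Jan", "sales_value": 7},
]
sub_total = defaultdict(int)
for r in rows:
    key = (r["category"], r["subcategory"], r["monthly"])
    sub_total[key] += r["sales_value"]   # the "sub_total" aggregate per group
```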

1 More Reply
Divyanshu
by New Contributor
  • 7361 Views
  • 1 reply
  • 0 kudos

java.lang.ArithmeticException: long overflow Exception while writing to table | pyspark

Hey, I am trying to fetch data from Mongo and write to a Databricks table. I have read data from Mongo using the pymongo library, then flattened nested struct objects along with renaming columns (since there were a few duplicates), and then writing to Databrick...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Sorry, I am not going through the entire schema and code, but in general the error "java.lang.ArithmeticException: long overflow" typically occurs when a calculation exceeds the range that can be represented by a long data type in Java. This issue ca...
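The Java long range is fixed, so suspect columns can be pre-checked before the write; a small illustration (the microseconds example is a common culprit in timestamp conversions, not taken from the poster's schema):

```python
JAVA_LONG_MAX = 2**63 - 1
JAVA_LONG_MIN = -2**63

def fits_java_long(n):
    """True if n is representable as a Java/Spark LongType value."""
    return JAVA_LONG_MIN <= n <= JAVA_LONG_MAX

# A classic trigger: a value already in epoch microseconds is multiplied
# by 1_000_000 again during a mistaken unit conversion.
micros = 1_700_000_000_000_000   # a plausible epoch-microseconds value
```

In PySpark, filtering a Python-side sample of the offending column with a check like this can pinpoint which field overflows before the table write fails.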

sanjay
by Valued Contributor II
  • 18386 Views
  • 2 replies
  • 0 kudos

pyspark dropDuplicates performance issue

Hi, I am trying to delete duplicate records found by key, but it's very slow. It's a continuously running pipeline, so data is not that huge, but it still takes time to execute this command: df = df.dropDuplicates(["fileName"]). Is there any better approach to d...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Before dropDuplicates, ensure that your DataFrame operations are optimized by caching intermediate results if they are reused multiple times. This can help reduce the overall execution time. We could use some aggregates and grouping like df_deduped ...
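The aggregate-based alternative mentioned at the end can be sketched like this (the ingest_time column is illustrative; which row survives each fileName group depends on the aggregates you choose):

```python
def dedupe_by_filename(df):
    """Keep one row per fileName via groupBy/agg instead of dropDuplicates.
    A sketch: extend the agg() with first()/max() for every column you
    need to carry through."""
    from pyspark.sql import functions as F
    return (
        df.groupBy("fileName")
          .agg(F.max("ingest_time").alias("ingest_time"))
    )
```

Unlike dropDuplicates, this form makes the choice of surviving row explicit, and the shuffle it triggers can benefit from pre-partitioning the stream by the same key.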

1 More Reply
Phani1
by Databricks MVP
  • 5906 Views
  • 2 replies
  • 2 kudos

Execute Pyspark cells concurrently

Hi Team, is it feasible to run PySpark cells concurrently in Databricks notebooks? If so, kindly provide instructions on how to accomplish this. We aim to execute the intermediate steps simultaneously. The given scenario entails the simultaneou...

Latest Reply
NandiniN
Databricks Employee
  • 2 kudos

Databricks also supports executing SQL cells in parallel. While a command is running and your notebook is attached to an interactive cluster, you can run a SQL cell simultaneously with the current command. The SQL cell is executed in a new, parallel ...
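Within a single cell, independent steps can also be fanned out with threads; Spark actions submitted from separate threads of one driver are scheduled concurrently (a sketch with stand-in steps, where each step would normally trigger a Spark action):

```python
from concurrent.futures import ThreadPoolExecutor

def step_a():
    return "a done"   # placeholder for e.g. a table write

def step_b():
    return "b done"   # placeholder for an independent intermediate step

# Submit both steps at once; results are collected in submission order.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(step_a), pool.submit(step_b)]
    results = [f.result() for f in futures]
```

This only helps when the steps are truly independent; steps that feed each other still have to run sequentially.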

1 More Reply
CM2
by New Contributor
  • 4013 Views
  • 1 reply
  • 0 kudos

Data transfer from AWS/Databricks to GEO repository via FTP

Does anyone have a Python script that runs in Databricks to transfer RNAseq data stored in AWS bucket to a public repository (GEO)? All my attempts failed, it looks like the connection between dbx and geo isn't working as expected.

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

What is the exact error that you face? We can debug from there. I see there are some steps shared on GEO submissions https://www.ncbi.nlm.nih.gov/geo/info/submissionftp.html   
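For reference, a minimal FTP upload sketch with the standard library (host, credentials, and paths are placeholders; GEO's submission page linked above defines the real values):

```python
import ftplib

def upload_to_geo(host, user, password, local_path, remote_dir):
    """Upload one file over FTP in binary mode. All arguments are
    placeholders to be filled from GEO's submission instructions."""
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        with open(local_path, "rb") as fh:
            name = local_path.rsplit("/", 1)[-1]
            ftp.storbinary(f"STOR {name}", fh)
```

One common gotcha from Databricks: the file must be on a driver-local path (e.g. copied from the AWS bucket or DBFS to /tmp first), since ftplib reads an ordinary local file handle; also confirm outbound FTP ports are open from the workspace's network.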
