Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

ijaza0489
by New Contributor
  • 603 Views
  • 1 reply
  • 0 kudos

Best Strategy for Ingesting PostgreSQL Data into Bronze Layer in Databricks

I am designing a data ingestion strategy for ingesting 10 tables from a PostgreSQL 10 database into the Bronze layer using Databricks only (without ADF or other external tools). Full Load: 7 tables will be fully loaded in each run. Incremental Load: 3 ...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @ijaza0489, here are key points to keep in mind. Tracking and implementing incremental loads with Delta Lake: utilize Delta Lake for managing incremental loads; it supports ACID transactions and allows you to perform upserts and merges eff...
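For illustration, a minimal PySpark sketch of the merge pattern described above; the table name bronze.customers, key column id, and JDBC connection values are placeholders, not details from the thread:

from delta.tables import DeltaTable

incremental_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<db>")  # placeholder connection details
    .option("dbtable", "public.customers")                # one of the 3 incremental tables
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

# Upsert the new batch into the Bronze Delta table by primary key
bronze = DeltaTable.forName(spark, "bronze.customers")
(
    bronze.alias("t")
    .merge(incremental_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

In practice the incremental read would also filter on a watermark column (e.g. an updated_at timestamp) so only changed rows are pulled from PostgreSQL.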

mgallagher
by New Contributor
  • 483 Views
  • 1 reply
  • 0 kudos

Limit access to certain pages of a dashboard

Hello, I would like to know if it is possible to restrict or limit access to certain pages of a multipage dashboard based on the user's group membership. In other words, the dashboard itself can be accessed by all, with some pages visible to all...

Data Engineering
access
dashboard
filter
group
possible
Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @mgallagher, Databricks does not natively support page-level access control within a single dashboard, but you can create separate dashboards for different user groups and control access at the dashboard level. This means creating a main dashboard acc...

John_Rotenstein
by New Contributor II
  • 22560 Views
  • 10 replies
  • 5 kudos

Retrieve job-level parameters in Python

Parameters can be passed to Tasks and the values can be retrieved with dbutils.widgets.get("parameter_name"). More recently, we have been given the ability to add parameters to Jobs. However, the parameters cannot be retrieved like Task parameters. Quest...

Latest Reply
lprevost
Contributor II
  • 5 kudos

The only thing that has worked for me consistently in Python is params = dbutils.widgets.getAll(), where an empty dictionary is returned if I'm in interactive mode and the job/task params are returned if they are present.
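A short usage sketch of that pattern; the parameter name my_job_param is hypothetical:

# Returns {} in interactive mode, or a dict of job/task parameters in a job run
params = dbutils.widgets.getAll()
value = params.get("my_job_param", "default_value")  # fall back when running interactively
print(value)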

9 More Replies
msgrac
by New Contributor II
  • 1365 Views
  • 2 replies
  • 0 kudos

Can't remove file on ADLS using dbutils.fs.rm because URL contains illegal character

The URL contains a "[", and I've tried to encode the path from "[" to "%5B%27", but it didn't work:

from urllib.parse import quote
path = ""
encoded_path = quote(path)

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Try these earlier threads: https://community.databricks.com/t5/data-engineering/how-can-i-delete-a-file-in-dbfs-with-illegal-character/td-p/9755 and https://community.databricks.com/t5/data-engineering/using-dbutils-fs-ls-on-uri-with-square-brackets-results-in-error/td-p/6928

1 More Replies
JrV
by New Contributor
  • 1000 Views
  • 2 replies
  • 0 kudos

SPARQL and RDF data

Hello Databricks Community, does anyone have experience with running SPARQL (https://en.wikipedia.org/wiki/SPARQL) queries in Databricks? Make a connection to the Community SolidServer (https://github.com/CommunitySolidServer/CommunitySolidServer) and que...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

You can use the rdflib library to connect to the Community SolidServer and execute SPARQL queries:

from rdflib import Graph
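Expanding that fragment into a minimal, self-contained sketch; the resource URL and the query are hypothetical, and rdflib must be installed on the cluster first (e.g. %pip install rdflib):

from rdflib import Graph

g = Graph()
# Fetch and parse an RDF resource; replace with a resource on your Solid server
g.parse("https://example.org/profile/card")

# Run a simple SPARQL query over the parsed graph
results = g.query("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
""")
for row in results:
    print(row)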

1 More Replies
databrick3
by New Contributor
  • 368 Views
  • 2 replies
  • 0 kudos

R model deployment

Unable to serve an R model on Databricks; even tried with pyfunc.

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

You can try Posit Connect (https://posit.co/blog/databricks-udfs/). See also https://www.databricks.com/blog/databricks-and-posit-announce-new-integrations

1 More Replies
shreya_20202
by New Contributor II
  • 4620 Views
  • 1 reply
  • 1 kudos

Copy file structure including files from one storage to another incrementally using PySpark

I have a storage account dexflex and two containers, source and destination. The source container has directories and files as below:
results
  search
    03
      Module19111.json
      Module19126.json
    04
      Module11291...

Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

Is this directory structure a partitioned table? 
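For anyone attempting this while the thread is open, a minimal sketch of one possible approach (an assumption, not part of this reply): recursively copy only the files missing at the destination with dbutils.fs, preserving the directory structure. The container paths follow the question above; the helper functions are hypothetical.

src = "abfss://source@dexflex.dfs.core.windows.net/results"
dst = "abfss://destination@dexflex.dfs.core.windows.net/results"

def exists(path):
    # dbutils.fs.ls raises if the path does not exist
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

def copy_new_files(src_dir, dst_dir):
    for entry in dbutils.fs.ls(src_dir):
        target = dst_dir.rstrip("/") + "/" + entry.name
        if entry.isDir():
            copy_new_files(entry.path, target)   # recurse into subdirectories
        elif not exists(target):
            dbutils.fs.cp(entry.path, target)    # copy only files not yet present

copy_new_files(src, dst)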

AtanuC
by New Contributor
  • 11782 Views
  • 1 reply
  • 0 kudos

OOP programming in PySpark on the Databricks platform

Hello experts, I have a doubt, so I need your advice and opinion on the query below. Is OOP a good choice of programming for distributed data processing, like PySpark on the Databricks platform? If not, then what is, and what kind of challenges could b...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Functional programming is generally better suited for distributed data processing with PySpark on Databricks due to its emphasis on immutability, stateless operations, and higher-order functions. These features align well with Spark's execution model...
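A small illustration of that functional style: stateless, chained transformations over immutable DataFrames rather than mutable objects. The table and column names below come from the Databricks sample datasets (assumed available in your workspace) and serve only as an example:

from pyspark.sql import functions as F

result = (
    spark.read.table("samples.nyctaxi.trips")
    .filter(F.col("trip_distance") > 0)
    .withColumn("fare_per_mile", F.col("fare_amount") / F.col("trip_distance"))
    .groupBy("pickup_zip")
    .agg(F.avg("fare_per_mile").alias("avg_fare_per_mile"))
)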

kazinahian
by New Contributor III
  • 4167 Views
  • 2 replies
  • 1 kudos

How can I create a new calculated field in Databricks using PySpark?

Hello, great people. I am new to Databricks and learning PySpark. How can I create a new column called "sub_total", where I group by "category", "subcategory", and "monthly" sales value? Appreciate your help.

Data Engineering
calculation
Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

"I want to group by "category", "subcategory" and "monthly" sales value."

sub_total_df = df.groupBy("category", "subcategory", "monthly").agg(sum("sales_value").alias("sub_total"))

You could always type your query in the Databricks notebook, by clic...
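Note that the snippet above needs sum imported from pyspark.sql.functions (otherwise Python's built-in sum is used and the call fails). A complete version:

from pyspark.sql import functions as F

sub_total_df = df.groupBy("category", "subcategory", "monthly").agg(
    F.sum("sales_value").alias("sub_total")
)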

1 More Replies
Divyanshu
by New Contributor
  • 5186 Views
  • 1 reply
  • 0 kudos

java.lang.ArithmeticException: long overflow exception while writing to table | PySpark

Hey, I am trying to fetch data from Mongo and write to a Databricks table. I have read the data from Mongo using the pymongo library, then flattened the nested struct objects along with renaming columns (since there were a few duplicates), and then writing to databrick...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Sorry, I am not going through the entire schema and code, but in general the error "java.lang.ArithmeticException: long overflow" typically occurs when a calculation exceeds the range that can be represented by a long data type in Java. This issue ca...
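One common mitigation (a sketch, not specific to this thread's schema): cast any column whose values can exceed the 64-bit long range to a decimal before writing.

from pyspark.sql import functions as F

# "big_number" is a hypothetical column name
df = df.withColumn("big_number", F.col("big_number").cast("decimal(38,0)"))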

sanjay
by Valued Contributor II
  • 15340 Views
  • 2 replies
  • 0 kudos

PySpark dropDuplicates performance issue

Hi, I am trying to delete duplicate records found by key, but it's very slow. It's a continuously running pipeline, so the data is not that huge, but it still takes time to execute this command: df = df.dropDuplicates(["fileName"]). Is there any better approach to d...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Before dropDuplicates, ensure that your DataFrame operations are optimized by caching intermediate results if they are reused multiple times. This can help reduce the overall execution time. We could use some aggregates and grouping, like df_deduped ...
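For reference, a sketch of that grouping approach (an assumption about the truncated code, not a verbatim continuation): keep one row per fileName with a window function, ordering by whichever column decides which duplicate wins.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# "ingest_time" is a hypothetical ordering column; replace with your own
w = Window.partitionBy("fileName").orderBy(F.col("ingest_time").desc())
df_deduped = (
    df.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)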

1 More Replies
Phani1
by Valued Contributor II
  • 3229 Views
  • 2 replies
  • 2 kudos

Execute Pyspark cells concurrently

Hi Team, is it feasible to run PySpark cells concurrently in Databricks notebooks? If so, kindly provide instructions on how to accomplish this. We aim to execute the intermediate steps simultaneously. The given scenario entails the simultaneou...

Latest Reply
NandiniN
Databricks Employee
  • 2 kudos

Databricks also supports executing SQL cells in parallel. While a command is running and your notebook is attached to an interactive cluster, you can run a SQL cell simultaneously with the current command. The SQL cell is executed in a new, parallel ...
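For Python cells specifically, one common workaround (an assumption, not part of this reply) is to launch independent Spark actions from threads inside a single cell; Spark schedules the resulting jobs concurrently.

from concurrent.futures import ThreadPoolExecutor

def step(table_name):
    # Each call triggers its own Spark job
    return spark.read.table(table_name).count()

tables = ["bronze.orders", "bronze.customers"]  # hypothetical table names
with ThreadPoolExecutor(max_workers=2) as pool:
    counts = list(pool.map(step, tables))
print(counts)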

1 More Replies
CM2
by New Contributor
  • 1902 Views
  • 1 reply
  • 0 kudos

Data transfer from AWS/Databricks to GEO repository via FTP

Does anyone have a Python script that runs in Databricks to transfer RNAseq data stored in an AWS bucket to a public repository (GEO)? All my attempts failed; it looks like the connection between Databricks and GEO isn't working as expected.

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

What is the exact error that you face? We can debug from there. I see there are some steps shared on GEO submissions https://www.ncbi.nlm.nih.gov/geo/info/submissionftp.html   
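As a starting point, a minimal ftplib sketch for the upload step; the host, credentials, folder, and file names are placeholders to be replaced with the values from GEO's submission instructions linked above, and the file is assumed to have been copied from the AWS bucket to DBFS first.

import ftplib

with ftplib.FTP("<geo_ftp_host>") as ftp:
    ftp.login(user="<geo_username>", passwd="<geo_password>")
    ftp.cwd("uploads/<your_submission_folder>")
    # Read the file via the /dbfs fuse mount and upload it in binary mode
    with open("/dbfs/tmp/sample_rnaseq.fastq.gz", "rb") as f:
        ftp.storbinary("STOR sample_rnaseq.fastq.gz", f)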

drag7ter
by Contributor
  • 2279 Views
  • 1 reply
  • 0 kudos

Bootstrap cluster timeout for job pipeline - Databricks bug?

From time to time we have these errors in scheduled PROD runs. It happens when the job starts and tries to create a one-time cluster. It happens once in every 10-20 runs, and we are not able to identify the root cause, as all network connectivity is fine, some...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

The error message "BOOTSTRAP_TIMEOUT (SERVICE_FAULT)" indicates that the cluster was terminated because it took too long to initialize. This can happen due to various reasons, including network connectivity issues between the data plane and the contr...

Mike_Szklarczyk
by Contributor
  • 387 Views
  • 1 reply
  • 2 kudos

Resolved! Retrieve information about table clustering from information_schema

Hi guys, I wonder if and when it will be possible to extract from information_schema how a table is clustered. I know that analogous information can be obtained when a table is partitioned, using this query: SELECT * FROM cdl_dev.information_schema.c...

Latest Reply
NandiniN
Databricks Employee
  • 2 kudos

It is not possible to extract clustering information of a table directly from information_schema. The information_schema.columns table can provide details about partitioning, but similar information for clustering is not available through the i...
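One possible workaround (an assumption, not stated in the reply): for Delta tables, DESCRIBE DETAIL exposes a clusteringColumns field that can be queried from a notebook.

# "cdl_dev.my_schema.my_table" is a hypothetical table name
detail = spark.sql("DESCRIBE DETAIL cdl_dev.my_schema.my_table")
display(detail.select("clusteringColumns"))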

