Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

semsim
by Contributor
  • 5468 Views
  • 9 replies
  • 1 kudos

Resolved! Init Script Failing

I am getting an error when I try to run the cluster-scoped init script. The script itself is as follows:
#!/bin/bash
sudo apt update && sudo apt upgrade -y
sudo apt install libreoffice-common libreoffice-java-common libreoffice-writer openjdk-8-jre-head...

Latest Reply
zmsoft
New Contributor III
  • 1 kudos

Hi @semsim, @jacovangelder, I added the code you mentioned at the beginning of the script, but I still got errors.
#!/bin/bash
sudo rm -r /var/lib/apt/lists/*
sudo apt clean && sudo apt update --fix-missing -y
if ! [[ "18.04 20.04 22.04 23.04 24.04...

8 More Replies
pragarwal
by New Contributor II
  • 3474 Views
  • 6 replies
  • 1 kudos

Adding Member to group using account databricks rest api

Hi All, I want to add a member to a group at the Databricks account level using the REST API (https://docs.databricks.com/api/azure/account/accountgroups/patch) as mentioned in this link. I am able to authenticate but not able to add a member while using belo...
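The account-level group endpoints follow SCIM 2.0, so a minimal sketch of the PatchOp request body for adding a member might look like the following. The user ID is a hypothetical placeholder, and the exact endpoint path and field names should be verified against the linked docs:

```python
import json

# Sketch of a SCIM 2.0 PatchOp body for adding a member to an account-level
# group. The user ID is a hypothetical placeholder; check the request shape
# against the Databricks accountgroups/patch documentation before relying on it.
def build_add_member_body(user_id: str) -> dict:
    return {
        "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
        "Operations": [
            {
                "op": "add",
                "path": "members",
                "value": [{"value": user_id}],  # ID of the user to add
            }
        ],
    }

body = build_add_member_body("12345")
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of the PATCH request to the account groups endpoint, authenticated the same way as the other endpoints that already work.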

Latest Reply
Nikos
New Contributor II
  • 1 kudos

Does the above work? I still can't quite figure it out. Any help would be much appreciated. I know authentication is not an issue, as I can use a lot of the other endpoints. I just can't figure out the correct body syntax to add a member to a group. url...

5 More Replies
sashikanth
by New Contributor II
  • 623 Views
  • 2 replies
  • 0 kudos

Streaming or Batch Processing

How do you decide whether to go for streaming or batch processing when the upstream is a Delta table? Please share suggestions to optimize the load timings.

Latest Reply
gchandra
Databricks Employee
  • 0 kudos

Structured Streaming is one of the options: spark.readStream.format("delta")
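Building on that reply, a common middle ground when strict low latency is not required is a streaming read with an availableNow trigger, which drains whatever is new and then stops. This is a sketch only, assuming a Databricks/Spark session and hypothetical paths:

```python
# Sketch: incremental processing of an upstream Delta table with Structured
# Streaming. All paths and table names are hypothetical placeholders; `spark`
# is assumed to be an existing SparkSession on a Databricks cluster.
def run_incremental(spark, source_path, target_table, checkpoint_path):
    return (
        spark.readStream.format("delta")
        .load(source_path)                              # upstream Delta table
        .writeStream
        .option("checkpointLocation", checkpoint_path)  # remembers progress between runs
        .trigger(availableNow=True)                     # process new data, then stop
        .toTable(target_table)
    )
```

Run on a schedule, this gives batch-like cost with streaming's checkpoint bookkeeping; drop the trigger for a continuously running query when latency matters more.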

1 More Reply
priyanananthram
by New Contributor II
  • 7824 Views
  • 4 replies
  • 1 kudos

Delta live tables for large number of tables

Hi there, I am hoping for some guidance. I have some 850 tables that I need to ingest using a DLT pipeline. When I do this, my event log shows that the driver node becomes unresponsive, likely due to GC. Can DLT be used to ingest a large number of tables? I...

Latest Reply
Sidhant07
Databricks Employee
  • 1 kudos

Delta Live Tables (DLT) can indeed be used to ingest a large number of tables. However, if you're experiencing issues with the driver node becoming unresponsive due to garbage collection (GC), it might be a sign that the resources allocated to the dr...

3 More Replies
badari_narayan
by New Contributor II
  • 1221 Views
  • 6 replies
  • 1 kudos

How to create SQL functions using PySpark on a local machine

I am trying to create a Spark SQL function in a particular schema, i.e. spark.sql("CREATE OR REPLACE FUNCTION <spark_catalog>.<schema_name>.<function_name()> RETURNS STRING RETURN <value>"). This works perfectly fine on Databricks using notebooks. But I n...
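For illustration, the DDL from the post can be assembled as a plain string before handing it to spark.sql. The catalog, schema, and function names below are hypothetical placeholders:

```python
# Sketch: build the CREATE FUNCTION statement used in the post.
# All identifiers are hypothetical; executing the statement still requires a
# SparkSession whose catalog supports SQL UDFs.
def make_function_ddl(catalog: str, schema: str, name: str, value: str) -> str:
    return (
        f"CREATE OR REPLACE FUNCTION {catalog}.{schema}.{name}() "
        f"RETURNS STRING RETURN '{value}'"
    )

ddl = make_function_ddl("spark_catalog", "my_schema", "greeting", "hello")
# The statement would then be run with spark.sql(ddl).
```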

Latest Reply
filipniziol
Contributor III
  • 1 kudos

Hi @badari_narayan, in general you may run a PySpark project locally, but with limitations:
  • Create a virtual environment
  • Install PySpark in your virtual environment (the same version you have on your cluster)
Since Spark version 2.x you do not even need to ...

5 More Replies
meret
by New Contributor II
  • 999 Views
  • 2 replies
  • 0 kudos

Trouble Accessing Trust Store for Oracle JDBC Connection on Shared Compute Cluster

Hi, I am trying to read data from an Oracle DB using the Oracle JDBC driver:
df = (spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCPS)(PORT=xxx)(HOST=xxx))(CONNECT_DATA=(SID=xxx)))")
    .option("dbTable", "schema...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

The trust store file needs to be accessible from all nodes in the shared compute cluster. You can achieve this by storing the trust store file in a location that is accessible to all nodes, such as a mounted volume or a distributed file system. Here'...
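As an illustration of that reply, the reader options might look like the sketch below. All paths and passwords are hypothetical placeholders; the Oracle thin driver accepts the standard JDK javax.net.ssl.* names as connection properties, but this should be verified against the driver version in use:

```python
# Sketch: JDBC reader options pointing the Oracle driver at a trust store kept
# in a location visible to every node (e.g. a Unity Catalog volume).
# All values below are hypothetical placeholders.
def oracle_jdbc_options(url, table, truststore_path, truststore_password):
    return {
        "url": url,
        "dbtable": table,
        "driver": "oracle.jdbc.OracleDriver",
        # Standard JDK SSL properties, passed through as connection properties:
        "javax.net.ssl.trustStore": truststore_path,
        "javax.net.ssl.trustStoreType": "JKS",
        "javax.net.ssl.trustStorePassword": truststore_password,
    }

opts = oracle_jdbc_options(
    "jdbc:oracle:thin:@(DESCRIPTION=...)",        # truncated URL from the post
    "schema.table",
    "/Volumes/main/certs/files/truststore.jks",   # node-visible location
    "changeit",
)
# The options would be applied with spark.read.format("jdbc").options(**opts).load()
```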

1 More Reply
sms101
by New Contributor
  • 874 Views
  • 1 reply
  • 0 kudos

Table lineage visibility in Databricks

I’ve observed differences in table lineage visibility in Databricks based on how data is referenced, and I would like to confirm whether this is the expected behavior. 1. When referencing a Delta table as the source in a query (e.g., df = spark.table("cata...

Latest Reply
Brahmareddy
Honored Contributor
  • 0 kudos

Hi @sms101, how are you doing today? As per my understanding, it is correct that lineage tracking in Databricks works primarily at the table level, meaning when you reference a Delta table directly, the lineage is properly captured. However, when you u...

Bilel
by New Contributor
  • 989 Views
  • 1 reply
  • 1 kudos

Python library not installed when compute is resized

Hi, I have a Python notebook workflow that uses a job cluster. The cluster lost at least one node (due to spot instance termination) and did an upsize. After that I got an error in my job, "Module not found", but the Python module was being used before ...

Latest Reply
Brahmareddy
Honored Contributor
  • 1 kudos

Hi @Bilel, how are you doing today? As per my understanding, consider installing the library at the cluster level to ensure it's automatically applied across all nodes when a new one is added. You could also try using init scripts to guarantee the requ...

fperry
by New Contributor II
  • 524 Views
  • 1 reply
  • 0 kudos

Question about stateful processing

I'm experiencing an issue that I don't understand. I am using Python's arbitrary stateful processing with structured streaming to calculate metrics for each item/ID. A timeout is set, after which I clear the state for that item/ID and display each ID...

Latest Reply
Brahmareddy
Honored Contributor
  • 0 kudos

Hi @fperry, how are you doing today? As per my understanding, consider checking for any differences in how the stateful streaming function is writing and persisting data. It's possible that while the state is cleared after the timeout, some state might...

gabrieleladd
by New Contributor II
  • 2313 Views
  • 3 replies
  • 1 kudos

Clearing data stored by pipelines

Hi everyone! I'm new to Databricks and moving my first steps with Delta Live Tables, so please forgive my inexperience. I'm building my first DLT pipeline and there's something that I can't really grasp: how to clear all the objects generated or upda...

Data Engineering
Data Pipelines
Delta Live Tables
Latest Reply
ChKing
New Contributor II
  • 1 kudos

To clear all objects generated or updated by the DLT pipeline, you can drop the tables manually using the DROP command as you've mentioned. However, to get a completely clean slate, including metadata like the tracking of already processed files in t...

2 More Replies
aniruth1000
by New Contributor II
  • 1362 Views
  • 3 replies
  • 2 kudos

Resolved! Delta Live Tables - CDC - Batching - Delta Tables

Hey folks, I'm trying to implement CDC (apply changes) from one Delta table to another. The source is a Delta table named table_latest and the target is another Delta table named table_old. Both are Delta tables in Databricks. I'm trying to cascade the incre...

Latest Reply
filipniziol
Contributor III
  • 2 kudos

Hi @aniruth1000, when using Delta Live Table pipelines, only the source table can be a plain Delta table. The target table must be fully managed by the DLT pipeline, including its creation and lifecycle. Let's say that you modified the code as suggested by...

2 More Replies
vishwanath_1
by New Contributor III
  • 2962 Views
  • 4 replies
  • 1 kudos

Reading a 130 GB CSV file with multiLine=true takes 4 hours

Reading the 130 GB file without multiLine=true takes 6 minutes, but my file has multi-line data. How can I speed up the reading time here? I am using the below command:
InputDF = spark.read.option("delimiter","^").option("header",false).option("encoding","UTF-8"...

Latest Reply
Lakshay
Databricks Employee
  • 1 kudos

Hi @vishwanath_1, can you try setting the below config to see if it resolves the issue? set spark.databricks.sql.csv.edgeParserSplittable=true;
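Putting the reply together with the original read, a sketch (assuming a Databricks session available as `spark`; the input path is a hypothetical placeholder, and the edgeParserSplittable flag is a Databricks-specific setting from the reply above):

```python
# Sketch: set the config suggested in the reply, then repeat the multi-line
# CSV read from the question. The input path is a hypothetical placeholder.
def read_multiline_csv(spark, path):
    # Config from the reply; intended to let the CSV parser split work
    # across tasks even when multiLine is enabled.
    spark.conf.set("spark.databricks.sql.csv.edgeParserSplittable", "true")
    return (
        spark.read.option("delimiter", "^")
        .option("header", False)
        .option("encoding", "UTF-8")
        .option("multiLine", True)   # needed because records span lines
        .csv(path)
    )
```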

3 More Replies
vishu4rall
by New Contributor II
  • 606 Views
  • 4 replies
  • 0 kudos

Copy files from Azure file share to S3 bucket

Kindly help us with code to upload a text/CSV file from an Azure file share to an S3 bucket.

Latest Reply
gchandra
Databricks Employee
  • 0 kudos

Did you try using azcopy?  https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10?tabs=dnf

3 More Replies
LeoGaller
by New Contributor II
  • 4069 Views
  • 3 replies
  • 1 kudos

What are the options for "spark_conf.spark.databricks.cluster.profile"?

Hey guys, I'm trying to find out what options we can pass to spark_conf.spark.databricks.cluster.profile. I know from looking around that some of the available values are singleNode and serverless, but are there others? Where is the documentation for it?...

lprevost
by Contributor
  • 1081 Views
  • 5 replies
  • 0 kudos

Large/complex Incremental Autoloader Job -- Seeking Experience on approach

I'm experimenting with several approaches to implement an incremental Auto Loader query, either in DLT or in a pipeline job. The complexities:
- Moving approximately 30B records from a nasty set of nested folders on S3 in several thousand CSV files. ...

Latest Reply
lprevost
Contributor
  • 0 kudos

Crickets....

4 More Replies
