Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

KacperG
by New Contributor III
  • 1467 Views
  • 1 reply
  • 0 kudos

%sh fails to install mdbtools locally

Hi, I have a notebook that uses mdbtools: %sh sudo apt-get -y -S install mdbtools. However, when I want to run it locally, it returns an error: sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure...

Latest Reply
filipniziol
Esteemed Contributor
  • 0 kudos

Hi @KacperG, most likely the job is running with different privileges than you. In your command, -S needs to be specified before apt-get. Have you tried running: %sh sudo -S apt-get -y install mdbtools? If it is just a typo, here are the options: 1. Try to run t...

semsim
by Contributor
  • 16142 Views
  • 9 replies
  • 1 kudos

Resolved! Init Script Failing

I am getting an error when I try to run the cluster-scoped init script. The script itself is as follows:
#!/bin/bash
sudo apt update && sudo apt upgrade -y
sudo apt install libreoffice-common libreoffice-java-common libreoffice-writer openjdk-8-jre-head...

Latest Reply
zmsoft
Contributor
  • 1 kudos

Hi @semsim, @jacovangelder, I added the code you mentioned at the beginning of the script, but I still got errors.
#!/bin/bash
sudo rm -r /var/lib/apt/lists/*
sudo apt clean && sudo apt update --fix-missing -y
if ! [[ "18.04 20.04 22.04 23.04 24.04...

8 More Replies
pragarwal
by Databricks Partner
  • 9662 Views
  • 6 replies
  • 1 kudos

Adding a member to a group using the Databricks account REST API

Hi all, I want to add a member to a group at the Databricks account level using the REST API (https://docs.databricks.com/api/azure/account/accountgroups/patch), as mentioned in this link. I am able to authenticate but not able to add a member while using belo...

Latest Reply
Nikos
New Contributor II
  • 1 kudos

Does the above work? I still can't quite figure it out. Any help would be much appreciated. I know authentication is not an issue, as I can use a lot of the other endpoints. I just can't figure out the correct body syntax to add a member to a group. url...
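
On the body-syntax question: the account-level group endpoints follow SCIM 2.0, so a PATCH request normally carries a PatchOp document. The sketch below only builds the URL and JSON body with the standard library and does not send anything. The host, account ID, group ID, and user ID are placeholders, `build_add_member_request` is a hypothetical helper, and the exact path is an assumption that should be checked against the linked API reference.

```python
import json

# Schema URN defined by the SCIM 2.0 protocol for PATCH requests.
SCIM_PATCH_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"


def build_add_member_request(host, account_id, group_id, user_id):
    """Return (url, json_body) for adding one member to an account-level group.

    The URL layout is an assumption based on the account SCIM API; verify it
    against the API reference before use.
    """
    url = f"{host}/api/2.0/accounts/{account_id}/scim/v2/Groups/{group_id}"
    body = {
        "schemas": [SCIM_PATCH_SCHEMA],
        "Operations": [
            {
                "op": "add",
                "path": "members",
                # Each member is referenced by its SCIM ID, not by email or name.
                "value": [{"value": user_id}],
            }
        ],
    }
    return url, json.dumps(body)


# Placeholder values only; substitute real IDs and send the request with an
# HTTP client of your choice, using the PATCH method and a bearer token.
url, payload = build_add_member_request(
    "https://accounts.azuredatabricks.net", "<account-id>", "<group-id>", "<user-id>"
)
print(url)
print(payload)
```

A common stumbling block is passing a user name where the member's ID is expected, so it may be worth looking up the ID through the users endpoint first.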

5 More Replies
sashikanth
by Databricks Partner
  • 1504 Views
  • 2 replies
  • 0 kudos

Streaming or Batch Processing

How do I decide between streaming and batch processing when the upstream is a Delta table? Please share suggestions to optimize the load timings.

Latest Reply
gchandra
Databricks Employee
  • 0 kudos

Structured Streaming is one of the options, spark.readStream.format("delta")
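
One way to read that suggestion is as a middle ground between streaming and batch: a streaming read from the Delta table, run with an "available now" trigger so each run processes only new data and then stops. The sketch below assumes a Spark session is passed in and that the runtime supports `trigger(availableNow=True)` (Spark 3.3 or later); `start_incremental_copy`, the table names, and the checkpoint path are hypothetical.

```python
def start_incremental_copy(spark, source_table, target_table, checkpoint_path):
    """Start a streaming query that copies new rows from one Delta table to another.

    The checkpoint records which source data has already been processed, so
    repeated runs pick up only what arrived since the previous run.
    """
    stream = spark.readStream.format("delta").table(source_table)
    return (
        stream.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        # Process everything currently available, then stop (batch-like scheduling).
        .trigger(availableNow=True)
        .toTable(target_table)
    )


# Usage in a notebook or job where `spark` already exists (names are placeholders):
# query = start_incremental_copy(spark, "bronze.events", "silver.events", "/tmp/checkpoints/events")
# query.awaitTermination()
```

Dropping the trigger line turns the same query into a continuously running stream, which lets the choice between the two modes become a scheduling decision rather than a rewrite.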

1 More Replies
priyanananthram
by New Contributor II
  • 10757 Views
  • 4 replies
  • 1 kudos

Delta Live Tables for a large number of tables

Hi there, I am hoping for some guidance. I have some 850 tables that I need to ingest using a DLT pipeline. When I do this, my event log shows that the driver node dies or becomes unresponsive, likely due to GC. Can DLT be used to ingest a large number of tables? I...

Latest Reply
Sidhant07
Databricks Employee
  • 1 kudos

Delta Live Tables (DLT) can indeed be used to ingest a large number of tables. However, if you're experiencing issues with the driver node becoming unresponsive due to garbage collection (GC), it might be a sign that the resources allocated to the dr...

3 More Replies
badari_narayan
by Databricks Partner
  • 3858 Views
  • 6 replies
  • 1 kudos

How to create SQL functions using PySpark on a local machine

I am trying to create a Spark SQL function in a particular schema, i.e. spark.sql(" CREATE OR REPLACE FUNCTION <spark_catalog>.<schema_name>.<function_name()> RETURNS STRING RETURN <value>"). This works perfectly fine on Databricks using notebooks. But, I n...

Latest Reply
filipniziol
Esteemed Contributor
  • 1 kudos

Hi @badari_narayan, in general you can run a PySpark project locally, but with limitations:
  • Create a virtual environment.
  • Install pyspark in your virtual environment (the same version you have on your cluster).
  • Since Spark version 2.x you even do not need to ...

5 More Replies
sms101
by New Contributor
  • 2127 Views
  • 1 reply
  • 0 kudos

Table lineage visibility in Databricks

I’ve observed differences in table lineage visibility in Databricks based on how data is referenced, and I would like to confirm if this is the expected behavior. 1. When referencing a Delta table as the source in a query (e.g., df = spark.table("cata...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 0 kudos

Hi @sms101, how are you doing today? As per my understanding, it is correct that lineage tracking in Databricks works primarily at the table level, meaning when you reference a Delta table directly, the lineage is properly captured. However, when you u...

Bilel
by New Contributor II
  • 1919 Views
  • 1 reply
  • 2 kudos

Python library not installed when compute is resized

Hi, I have a Python notebook workflow that uses a job cluster. The cluster lost at least one node (due to spot instance termination) and did an upsize. After that I got a "Module not found" error in my job, but the Python module was being used before ...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 2 kudos

Hi @Bilel, how are you doing today? As per my understanding, consider installing the library at the cluster level to ensure it's automatically applied across all nodes when a new one is added. You could also try using init scripts to guarantee the requ...

fperry
by New Contributor III
  • 1188 Views
  • 1 reply
  • 0 kudos

Question about stateful processing

I'm experiencing an issue that I don't understand. I am using Python's arbitrary stateful processing with structured streaming to calculate metrics for each item/ID. A timeout is set, after which I clear the state for that item/ID and display each ID...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 0 kudos

Hi @fperry, how are you doing today? As per my understanding, consider checking for any differences in how the stateful streaming function is writing and persisting data. It's possible that while the state is cleared after the timeout, some state might...

gabrieleladd
by New Contributor II
  • 4377 Views
  • 3 replies
  • 1 kudos

Clearing data stored by pipelines

Hi everyone! I'm new to Databricks and taking my first steps with Delta Live Tables, so please forgive my inexperience. I'm building my first DLT pipeline and there's something that I can't really grasp: how to clear all the objects generated or upda...

Data Engineering
Data Pipelines
Delta Live Tables
Latest Reply
ChKing
New Contributor II
  • 1 kudos

To clear all objects generated or updated by the DLT pipeline, you can drop the tables manually using the DROP command as you've mentioned. However, to get a completely clean slate, including metadata like the tracking of already processed files in t...

2 More Replies
aniruth1000
by New Contributor II
  • 5674 Views
  • 3 replies
  • 2 kudos

Resolved! Delta Live Tables - CDC - Batching - Delta Tables

Hey folks, I'm trying to implement CDC: apply changes from one Delta table to another. The source is a Delta table named table_latest and the target is another Delta table named table_old. Both are Delta tables in Databricks. I'm trying to cascade the incre...

Latest Reply
filipniziol
Esteemed Contributor
  • 2 kudos

Hi @aniruth1000, when using Delta Live Tables pipelines, only the source table can be an existing Delta table. The target table must be fully managed by the DLT pipeline, including its creation and lifecycle. Let's say that you modified the code as suggested by...

2 More Replies
vishwanath_1
by New Contributor III
  • 5393 Views
  • 4 replies
  • 1 kudos

Reading a 130 GB CSV file with multiLine set to true takes 4 hours

Reading the 130 GB file without multiLine set to true takes 6 minutes, but my file has multi-line data. How can I speed up the read time here? I am using the command below: InputDF=spark.read.option("delimiter","^").option("header",false).option("encoding","UTF-8"...

Latest Reply
Lakshay
Databricks Employee
  • 1 kudos

Hi @vishwanath_1, can you try setting the config below to see if it resolves the issue? set spark.databricks.sql.csv.edgeParserSplittable=true;
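
In a Python notebook, the same setting could be applied through the session configuration before the read. This is a minimal sketch, assuming a Databricks runtime that recognises the config key quoted in the reply. `read_multiline_csv` and the path are hypothetical, and only the reader options visible in the question (plus `multiLine`) are included, since the original command is cut off.

```python
def read_multiline_csv(spark, path):
    """Read a '^'-delimited, multi-line CSV after enabling the suggested config.

    The config key is taken verbatim from the reply; whether it speeds up the
    read depends on the runtime and on the file itself.
    """
    spark.conf.set("spark.databricks.sql.csv.edgeParserSplittable", "true")
    return (
        spark.read
        .option("delimiter", "^")
        .option("header", False)
        .option("encoding", "UTF-8")
        # Allows quoted fields to span several physical lines.
        .option("multiLine", True)
        .csv(path)
    )


# Usage where `spark` already exists (placeholder path):
# input_df = read_multiline_csv(spark, "/mnt/raw/large_file.csv")
```

Comparing the number of input partitions with and without the setting (for example via `input_df.rdd.getNumPartitions()`) is one way to see whether the file is actually being split across tasks.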

3 More Replies
vishu4rall
by New Contributor II
  • 1959 Views
  • 4 replies
  • 0 kudos

Copy files from Azure file share to S3 bucket

Kindly help us with code to upload a text/CSV file from an Azure file share to an S3 bucket.

Latest Reply
gchandra
Databricks Employee
  • 0 kudos

Did you try using azcopy? https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10?tabs=dnf

3 More Replies
lprevost
by Contributor III
  • 2924 Views
  • 5 replies
  • 0 kudos

Large/complex Incremental Autoloader Job -- Seeking Experience on approach

I'm experimenting with several approaches to implement an incremental Auto Loader query, either in DLT or in a pipeline job. The complexities:
  • Moving approximately 30B records from a nasty set of nested folders on S3 in several thousand CSV files. ...

Latest Reply
lprevost
Contributor III
  • 0 kudos

Crickets....

4 More Replies
lprevost
by Contributor III
  • 851 Views
  • 1 reply
  • 0 kudos

Using GraphFrames on DLT job

I am trying to run a DLT job that uses GraphFrames, which is in the ML standard image. I am using it successfully in my job compute instances. Here are my overrides for the standard job compute policy: {"spark_version": {"type": "unlimited","defau...

Latest Reply
lprevost
Contributor III
  • 0 kudos

Crickets ....
