Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

vishwanath_1
by New Contributor III
  • 3282 Views
  • 4 replies
  • 1 kudos

Reading a 130 GB CSV file with multiLine=true takes 4 hours

Reading the 130 GB file without multiLine=true takes 6 minutes, but my file has multi-line records. How can I speed up the read here? I am using the command below: InputDF=spark.read.option("delimiter","^").option("header",false).option("encoding","UTF-8"...
Latest Reply
Lakshay
Databricks Employee
  • 1 kudos

Hi @vishwanath_1, can you try setting the config below and see if it resolves the issue? set spark.databricks.sql.csv.edgeParserSplittable=true;
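For reference, a minimal sketch of the read with the suggested config applied (the path is a placeholder; the config name comes from the reply above and its availability may depend on the DBR version):

```python
# Sketch based on this thread: multiLine=true normally makes a CSV file
# non-splittable, so a single task ends up reading the whole 130 GB file.
# The suggested config is meant to let the parser split multi-line files.
spark.sql("SET spark.databricks.sql.csv.edgeParserSplittable=true")

input_df = (
    spark.read
    .option("delimiter", "^")
    .option("header", "false")
    .option("encoding", "UTF-8")
    .option("multiLine", "true")          # records contain embedded newlines
    .csv("s3://my-bucket/path/to/file.csv")  # hypothetical path
)
```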
3 More Replies
vishu4rall
by New Contributor II
  • 721 Views
  • 4 replies
  • 0 kudos

Copy files from an Azure file share to an S3 bucket

Kindly help us with code to upload a text/CSV file from an Azure file share to an S3 bucket.
Latest Reply
gchandra
Databricks Employee
  • 0 kudos

Did you try using azcopy?  https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10?tabs=dnf

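Note that AzCopy is generally documented for copying from S3 into Azure rather than the reverse, so an SDK-based sketch may be closer to what was asked. A minimal version, assuming the azure-storage-file-share and boto3 packages and placeholder names throughout:

```python
import io

import boto3
from azure.storage.fileshare import ShareFileClient

# Sketch: stream one file from an Azure file share into an S3 bucket.
# Connection string, share/file names, bucket, and key are all placeholders.
file_client = ShareFileClient.from_connection_string(
    conn_str="<azure-storage-connection-string>",
    share_name="myshare",
    file_path="exports/data.csv",
)

buffer = io.BytesIO()
file_client.download_file().readinto(buffer)  # fine for small/medium files
buffer.seek(0)

s3 = boto3.client("s3")  # uses your configured AWS credentials
s3.upload_fileobj(buffer, Bucket="my-bucket", Key="exports/data.csv")
```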
3 More Replies
LeoGaller
by New Contributor II
  • 5078 Views
  • 3 replies
  • 1 kudos

What are the options for "spark_conf.spark.databricks.cluster.profile"?

Hey guys, I'm trying to find out what options we can pass to spark_conf.spark.databricks.cluster.profile. Looking around, I know some of the available values are singleNode and serverless, but are there others? Where is the documentation for it?...
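For illustration, here is how the singleNode profile is typically set on a cluster spec; the other values asked about are not well documented, and the names below are examples:

```python
# Sketch of a Clusters API payload for a single-node cluster, where the
# profile is set alongside the other settings single-node mode requires.
cluster_spec = {
    "cluster_name": "single-node-demo",   # hypothetical name
    "spark_version": "15.4.x-scala2.12",  # example runtime
    "node_type_id": "i3.xlarge",          # example node type
    "num_workers": 0,                     # single node: no workers
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",       # required with singleNode
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```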
lprevost
by Contributor II
  • 1290 Views
  • 5 replies
  • 0 kudos

Large/complex Incremental Autoloader Job -- Seeking Experience on approach

I'm experimenting with several approaches to implement an incremental Auto Loader query, either in DLT or in a pipeline job. The complexities: moving approximately 30B records from a nasty set of nested folders on S3 in several thousand CSV files. ...
Latest Reply
lprevost
Contributor II
  • 0 kudos

Crickets....

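Since the thread got no answers, here is a minimal sketch of the incremental Auto Loader pattern being described (bucket, glob, schema/checkpoint paths, and table name are placeholders; `spark` is provided by the Databricks runtime):

```python
# Sketch: incremental CSV ingestion with Auto Loader, run as a triggered
# ("available now") batch so each run picks up only files not yet processed.
(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/my_table")
    .load("s3://my-bucket/nested/folders/*/*.csv")   # glob over nested dirs
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/my_table")
    .trigger(availableNow=True)                      # batch-style incremental run
    .toTable("main.default.my_table")
)
```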
4 More Replies
lprevost
by Contributor II
  • 390 Views
  • 1 reply
  • 0 kudos

Using GraphFrames on DLT job

I am trying to run a DLT job that uses GraphFrames, which is in the ML standard image. I am using it successfully in my job compute instances. Here are my overrides for the standard job compute policy: {"spark_version": {"type": "unlimited","defau...
Latest Reply
lprevost
Contributor II
  • 0 kudos

Crickets ....

lprevost
by Contributor II
  • 620 Views
  • 2 replies
  • 0 kudos

GraphFrames and DLT

I am trying to run a DLT job that uses GraphFrames, which is in the ML standard image. I am using it successfully in my job compute instances but I'm running into problems trying to use it in a DLT job. Here are my overrides for the standard job c...
Latest Reply
lprevost
Contributor II
  • 0 kudos

Crickets .....

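One unverified thing that might be worth trying: DLT supports notebook-scoped %pip installs, though GraphFrames also needs its JVM package on the cluster, which the DLT runtime does not ship, so this alone may not be enough:

```python
# Unverified sketch: notebook-scoped install of the GraphFrames Python
# bindings in a DLT pipeline notebook. The matching Scala/JVM package is
# not included in the DLT runtime, so imports may still fail at run time.
%pip install graphframes
```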
1 More Replies
Valentin1
by New Contributor III
  • 8436 Views
  • 6 replies
  • 3 kudos

Delta Live Tables Incremental Batch Loads & Failure Recovery

Hello Databricks community, I'm working on a pipeline and would like to implement a common use case using Delta Live Tables. The pipeline should include the following steps: incrementally load data from Table A as a batch; if the pipeline has previously...
Latest Reply
lprevost
Contributor II
  • 3 kudos

I totally agree that this is a gap in the Databricks solution. This gap exists between a static read and real-time streaming. My problem (and I suspect there are many use cases) is that I have slowly changing data coming into structured folders via ...
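For what it's worth, a minimal sketch of the incremental-batch pattern being discussed, using a streaming read in a triggered DLT pipeline so each run processes only new data and a failed run resumes from its checkpoint (table names are placeholders):

```python
import dlt

# Sketch: a streaming read of Table A gives incremental, exactly-once batch
# semantics when the pipeline runs in triggered mode; checkpoints handle
# failure recovery automatically.
@dlt.table(name="table_b")
def table_b():
    return spark.readStream.table("main.default.table_a")  # hypothetical source
```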
5 More Replies
Octavian1
by Contributor
  • 1187 Views
  • 2 replies
  • 1 kudos

Path of artifacts not found error in pyfunc.load_model using pyfunc wrapper

Hi, for a PySpark model, which also involves a pipeline, and which I want to register with MLflow, I am using a pyfunc wrapper. Steps I followed: 1. Pipeline and model serialization and logging (using a Volume locally; the logging will be performed in dbfs...
Latest Reply
pikapika
New Contributor II
  • 1 kudos

I was stuck with the same issue, but I managed to load it (I was looking to serve it using Model Serving as well). One thing I noticed is that we can use mlflow.create_experiment() at the beginning and specify the default artifact location parameter as D...
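A sketch of that suggestion (experiment name and artifact location are placeholders, and the wrapper below is a trivial stand-in for the one in the thread):

```python
import mlflow


class Passthrough(mlflow.pyfunc.PythonModel):
    # Hypothetical stand-in for the thread's pyfunc wrapper.
    def predict(self, context, model_input):
        return model_input


# Create the experiment first so its default artifact location is somewhere
# pyfunc.load_model can resolve later.
experiment_id = mlflow.create_experiment(
    name="/Shared/pyspark-pyfunc-demo",                       # placeholder
    artifact_location="dbfs:/Volumes/main/default/mlflow_artifacts",  # placeholder
)
mlflow.set_experiment(experiment_id=experiment_id)

with mlflow.start_run():
    mlflow.pyfunc.log_model(artifact_path="model", python_model=Passthrough())
```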
1 More Replies
KristiLogos
by Contributor
  • 1966 Views
  • 9 replies
  • 4 kudos

Resolved! Load parent columns and not unnest using pyspark? Found invalid character(s) ' ,;{}()\n' in schema

I'm not sure I'm doing this correctly, but I'm having some issues with the column names when I try to load to a table in our Databricks catalog. I have multiple .json.gz files in our blob container that I want to load to a table: df = spark.read.opti...
Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 4 kudos

Hi @KristiLogos, check whether your JSON keys contain any of the characters listed in the error message.
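A sketch of one way to work around the error by renaming top-level columns before writing (path and table name are placeholders; the character set comes from the error message):

```python
import re

# Sketch: replace the characters Delta rejects in column names
# (' ,;{}()\n\t=') with underscores, then save as a table.
df = spark.read.option("multiline", "true").json("dbfs:/mnt/container/*.json.gz")
clean = df.toDF(*[re.sub(r"[ ,;{}()\n\t=]", "_", c) for c in df.columns])
clean.write.mode("overwrite").saveAsTable("main.default.events")  # placeholder
```

Note this only renames parent (top-level) columns; nested struct fields would need a recursive rename or an explicit select with aliases.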
8 More Replies
wendyl
by New Contributor II
  • 1000 Views
  • 3 replies
  • 0 kudos

Connection Refused: [Databricks][JDBC](11640) Required Connection Key(s): PWD;

Hey, I'm trying to connect to Databricks using a client ID and secret. I'm using JDBC 2.6.38 and the following connection URL: jdbc:databricks://<server-hostname>:443;httpPath=<http-path>;AuthMech=11;Auth_Flow=1;OAuth2ClientId=<service-principal-...
Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @wendyl, could you answer the following questions?
  • Does your workspace have Private Link?
  • Do you use a Microsoft Entra ID managed service principal?
  • If you used an Entra ID managed SP, did you use a secret from Entra ID, or Azure Da...
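For reference, an OAuth M2M (client credentials) URL for the Databricks JDBC driver carries the client secret in OAuth2Secret rather than PWD; a sketch with placeholders:

```python
# Sketch: Databricks JDBC URL for OAuth machine-to-machine auth
# (AuthMech=11, Auth_Flow=1). Angle-bracket values are placeholders;
# no PWD key should be needed with this auth mechanism.
jdbc_url = (
    "jdbc:databricks://<server-hostname>:443;"
    "httpPath=<http-path>;"
    "AuthMech=11;"
    "Auth_Flow=1;"
    "OAuth2ClientId=<service-principal-application-id>;"
    "OAuth2Secret=<service-principal-oauth-secret>"
)
```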
2 More Replies
Himanshu4
by New Contributor II
  • 2238 Views
  • 5 replies
  • 2 kudos

Inquiry Regarding Enabling Unity Catalog in Databricks Cluster Configuration via API

Dear Databricks Community,I hope this message finds you well. I am currently working on automating cluster configuration updates in Databricks using the API. As part of this automation, I am looking to ensure that the Unity Catalog is enabled within ...

Latest Reply
Himanshu4
New Contributor II
  • 2 kudos

Hi Raphael, can we fetch job details from one workspace and create a new job in a new workspace with the same "job id" and configuration?
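On that follow-up question: job IDs are assigned by the workspace that creates the job, so they can't be carried over, but the configuration can. A sketch using the Jobs 2.1 API (hosts, tokens, and the job ID are placeholders):

```python
import requests

# Sketch: copy a job's settings between workspaces via the Jobs API.
SRC = "https://src-workspace.cloud.databricks.com"   # placeholder
DST = "https://dst-workspace.cloud.databricks.com"   # placeholder
src_headers = {"Authorization": "Bearer <src-token>"}
dst_headers = {"Authorization": "Bearer <dst-token>"}

job = requests.get(
    f"{SRC}/api/2.1/jobs/get", headers=src_headers, params={"job_id": 123}
).json()  # 123 is a placeholder job_id

created = requests.post(
    f"{DST}/api/2.1/jobs/create", headers=dst_headers, json=job["settings"]
).json()
print(created["job_id"])  # new ID assigned by the target workspace
```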
4 More Replies
mayur_05
by New Contributor II
  • 838 Views
  • 3 replies
  • 0 kudos

Access cluster executor logs

Hi Team, I want to get real-time logs for the cluster executor and driver (stderr/stdout) while performing data operations, and save those logs in a catalog Volume.
Latest Reply
gchandra
Databricks Employee
  • 0 kudos

You can create it for job cluster compute too. The specific cluster's log folder will be under /dbfs/cluster-logs (or whatever you change it to).
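For reference, log delivery is configured per cluster via cluster_log_conf; a sketch of the relevant fragment (the DBFS destination is the documented one shown here, and newer workspaces may also accept a Unity Catalog volume destination):

```python
# Sketch: cluster spec fragment that delivers driver and executor
# stdout/stderr to DBFS. Logs land under
# dbfs:/cluster-logs/<cluster-id>/driver and .../executor.
cluster_spec_fragment = {
    "cluster_log_conf": {
        "dbfs": {"destination": "dbfs:/cluster-logs"}  # placeholder path
    }
}
```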
2 More Replies
TheManOfSteele
by New Contributor III
  • 1375 Views
  • 2 replies
  • 0 kudos

Resolved! Databricks-connect Configure a connection to serverless compute Not working

Following the instructions at https://docs.databricks.com/en/dev-tools/databricks-connect/python/install.html#configure-a-connection-to-serverless-compute, there seems to be an issue with the example code: from databricks.connect import DatabricksSe...
Latest Reply
TheManOfSteele
New Contributor III
  • 0 kudos

Worked! Thank you!

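For anyone landing here, a minimal serverless connection with recent databricks-connect versions looks roughly like this (assuming your workspace credentials are already configured via a profile or environment variables):

```python
from databricks.connect import DatabricksSession

# Sketch: build a Spark session against serverless compute; authentication
# comes from your configured Databricks profile or environment variables.
spark = DatabricksSession.builder.serverless(True).getOrCreate()
print(spark.range(3).collect())
```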
1 More Replies
Dave_Nithio
by Contributor
  • 787 Views
  • 1 reply
  • 0 kudos

Delta Table Log History not Updating

I am running into an issue related to my Delta log and an old version. I currently have default Delta settings for delta.checkpointInterval (10 commits, as this table was created prior to DBR 11.1), delta.deletedFileRetentionDuration (7 days), and del...
Latest Reply
jennie258fitz
New Contributor III
  • 0 kudos

@Dave_Nithio wrote: I am running into an issue related to my Delta Log and an old version. I currently have default delta settings for delta.checkpointInterval (10 commits as this table was created prior to DBR 11.1), delta.deletedFileRetentionDuratio...
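For context, the settings mentioned are Delta table properties and can be inspected or adjusted like this (table name is a placeholder; the values shown are the defaults the post describes):

```python
# Sketch: inspect recent commits, then set the checkpoint/retention
# properties discussed above on a Delta table.
spark.sql("DESCRIBE HISTORY main.default.my_table LIMIT 10").show(truncate=False)

spark.sql("""
    ALTER TABLE main.default.my_table SET TBLPROPERTIES (
        'delta.checkpointInterval' = '10',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")
```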
hpant
by New Contributor III
  • 716 Views
  • 1 reply
  • 0 kudos

" ResourceNotFound" error is coming on connecting devops repo to databricks workflow(job).

I have a .py file in a repo in Azure DevOps. I want to add it to a workflow in Databricks, and these are the values I have provided. I have provided all the values correctly but am getting this error: "ResourceNotFound". Can someon...
Latest Reply
nicole_lu_PM
Databricks Employee
  • 0 kudos

Can you try cloning the DevOps repo as a Git folder? The git folder clone interface should ask you to set up a Git credential if it's not already there.

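For reference, a job that pulls a task's source from Azure DevOps is defined with a git_source block; a sketch of the Jobs API payload (URL, branch, path, and cluster ID are placeholders):

```python
# Sketch: Jobs API payload fragment pointing a task at a Python file in an
# Azure DevOps repo. A ResourceNotFound error often means the path, branch,
# or Git credential doesn't resolve.
job_spec = {
    "name": "devops-repo-job",
    "git_source": {
        "git_url": "https://dev.azure.com/<org>/<project>/_git/<repo>",
        "git_provider": "azureDevOpsServices",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "run_script",
            "spark_python_task": {"python_file": "path/to/script.py"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}
```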
