Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Issue: Spark structured streaming application. After adding the listener jar file in the cluster init script, the listener is working (from what I see in the stdout/log4j logs). But when I try to hit, with 'Content-Type: application/json', http://host:port/api/v1/applications/app-id/streaming/st...
Hi, I have a Spark job which is processing a large data set, and it's taking too long to process the data. In the Spark UI, I can see it's running 1 task out of 9 tasks. Not sure how to run this in parallel. I have already configured auto scaling and am providing up to...
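A minimal sketch of spreading that work across more tasks by repartitioning before the heavy step, assuming the slow stage operates on a DataFrame read as below; the paths, partition count, and column name are illustrative:

df = spark.read.parquet("/mnt/raw/large_dataset")   # illustrative source path
print(df.rdd.getNumPartitions())                    # how many tasks the next stage can use

# Repartition so the wide transformation below runs as many parallel tasks
df = df.repartition(64, "customer_id")

result = df.groupBy("customer_id").count()
result.write.mode("overwrite").parquet("/mnt/curated/customer_counts")

If only one task out of nine is doing the work, it usually means one partition holds most of the data, so repartitioning (or salting the key) tends to help more than adding nodes.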
Could anyone tell me what could be wrong with my command to submit a Spark job with params? (If I don't have --spark-submit-params, it's fine.) Please see the attached snapshot.
I want to overwrite a PostgreSQL table, transactionStats, which is used by customer-facing dashboards. This table needs to be updated every 30 mins. I am writing an AWS Glue Spark job via a JDBC connection to perform this operation. Spark dataframe writ...
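A sketch of the JDBC overwrite from Spark, assuming the result sits in a DataFrame called transaction_stats_df and with placeholder connection details; the "truncate" option makes the overwrite TRUNCATE the existing transactionStats table rather than drop and recreate it, which preserves the grants and indexes the dashboards rely on:

(transaction_stats_df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")
    .option("dbtable", "transactionStats")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("truncate", "true")   # TRUNCATE instead of DROP on overwrite
    .mode("overwrite")
    .save())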
Hi @Siddharth Kanojiya, we haven't heard from you since the last response from @werners (Customer). Kindly share the information with us, and in return, we will provide you with the necessary solution. Thanks and Regards
I uploaded a csv data file and used it in a Spark job three months back. I am now running the same Spark job with a newly created cluster. The program is running properly. I want to know where I can see the previously uploaded csv data file.
@Pranay Gupta, you can see that in the DBFS root directory, based on the path you provided in the job. Please check: go to Data Explorer and select the option shown in the screenshot below.
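To double-check from a notebook, a quick listing of the usual DBFS upload location; the path below is the default used by UI uploads and is illustrative, so adjust it to the path referenced in the job:

# List the default DBFS upload folder
display(dbutils.fs.ls("dbfs:/FileStore/tables/"))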
I am trying to parallelise the execution of file copy in Databricks. Making use of multiple executors is one way. So, this is the piece of code that I wrote in PySpark:
def parallel_copy_execution(src_path: str, target_path: str):
    files_in_path = db...
If you have a Spark session, you can use the Hadoop FileSystem hidden behind it:
# Get FileSystem from SparkSession
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Get Path class to convert string path to FS path
path = spark._...
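A fuller driver-side sketch of the same idea, assuming spark is the notebook's SparkSession; the copies are parallelised with a Python thread pool on the driver (the py4j gateway used by spark._jvm is only available there), and the paths and pool size are illustrative:

from concurrent.futures import ThreadPoolExecutor

def parallel_copy_execution(src_path: str, target_path: str, max_workers: int = 8):
    jvm = spark._jvm
    conf = spark._jsc.hadoopConfiguration()
    Path = jvm.org.apache.hadoop.fs.Path
    FileUtil = jvm.org.apache.hadoop.fs.FileUtil

    src_fs = Path(src_path).getFileSystem(conf)
    dst_fs = Path(target_path).getFileSystem(conf)

    # List the files directly under the source directory
    statuses = src_fs.listStatus(Path(src_path))

    def copy_one(status):
        src = status.getPath()
        dst = Path(target_path + "/" + src.getName())
        # FileUtil.copy(srcFS, src, dstFS, dst, deleteSource, conf)
        return FileUtil.copy(src_fs, src, dst_fs, dst, False, conf)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(copy_one, statuses))

parallel_copy_execution("dbfs:/mnt/landing/raw", "dbfs:/mnt/curated/raw")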
Hi @Paras Gadhiya, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Than...
In Spark we can get the Spark Application ID inside the Task programmatically using:
SparkEnv.get.blockManager.conf.getAppId
and we can get the Stage ID and Task Attempt ID of the running Task using:
TaskContext.get.stageId
TaskContext.get.taskAttemptId...
Hi @Gaurav Rupnar, I have Spark SQL UDFs (implemented as Scala methods) in which I want to get the details of the Spark SQL query that called the UDF, especially a unique query ID, which in Spark SQL is the Spark Job ID. That's why I wanted a way to...
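For reference, the PySpark equivalents of those accessors, as a small sketch (this exposes the application, stage, and task attempt IDs, not a Spark SQL query ID); the RDD and its contents are illustrative:

from pyspark import TaskContext

app_id = spark.sparkContext.applicationId      # application ID, read on the driver

def tag_partition(index, rows):
    tc = TaskContext.get()                     # only valid inside a running task
    for row in rows:
        yield (app_id, tc.stageId(), tc.taskAttemptId(), row)

rdd = spark.sparkContext.parallelize(range(8), 4)
print(rdd.mapPartitionsWithIndex(tag_partition).collect())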
I'm trying to read a file from a Google Cloud Storage bucket. The filename starts with a period, so Spark assumes the file is hidden and won't let me read it. My code is similar to this:
from pyspark.sql import SparkSession
spark = SparkSession.build...
I don't think there is an easy way to do this. You will also break very basic functionality (like being able to read Delta tables) if you were able to get around these constraints. I suggest you employ a rename job and then read.
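A sketch of that rename-then-read workaround using the Hadoop FileSystem bound to the bucket; the bucket and file names are illustrative, and the GCS connector is assumed to be configured on the cluster:

jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path

hidden = Path("gs://my-bucket/.my_hidden_file.csv")
visible = Path("gs://my-bucket/my_hidden_file.csv")

fs = hidden.getFileSystem(conf)
fs.rename(hidden, visible)   # drop the leading dot so Spark no longer treats the file as hidden

df = spark.read.csv("gs://my-bucket/my_hidden_file.csv", header=True)
df.show()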
I am trying to migrate a Spark job from an on-premises Hadoop cluster to Databricks on Azure. Currently, we keep many values in a properties file. When executing spark-submit, we pass the parameter --properties /prop.file.txt, and inside t...
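One way to carry that pattern over is to upload the properties file to DBFS and parse it at the start of the job; a minimal sketch, with an illustrative path and key:

def load_properties(path: str) -> dict:
    # Parse a Java-style key=value properties file, skipping blanks and comments
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

props = load_properties("/dbfs/FileStore/config/prop.file.txt")   # /dbfs is the local FUSE mount
input_path = props.get("input.path")                              # illustrative key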
Hi all, I am reading data, caching it, and then performing a count action to get the data in memory, but still, in the DAG I found that every time it reads data from the SOURCE.
A few things off the top of my mind:
1) Check the Spark UI and see which stage is taking more time.
2) Check for data skew.
3) Data skew can severely degrade query performance; Spark SQL accepts skew hints in queries. Also make sure to use proper join h...
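On the caching question above: a minimal sketch of persisting and materialising a DataFrame, with illustrative paths and columns; later actions only skip the source scan when they reuse the very same persisted reference, since re-deriving the DataFrame from spark.read builds a fresh plan that goes back to the source:

from pyspark.storagelevel import StorageLevel

df = spark.read.parquet("/mnt/source/events")          # illustrative path
df = df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()                                             # action that materialises the cache
print(df.storageLevel)                                 # confirm the persistence level

# Reuse the same reference: this should read cached partitions, not the source
df.groupBy("event_type").count().show()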