Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

swetha
by New Contributor III
  • 2987 Views
  • 4 replies
  • 1 kudos

Error: "no streaming listener attached to the spark app" is the error we are observing after accessing the streaming statistics API. Please help us with this issue ASAP. Thanks.

Issue: Spark Structured Streaming application. After adding the listener jar file in the cluster init script, the listener is working (from what I see in the stdout/log4j logs), but when I try to hit the 'Content-Type: application/json' http://host:port/...

Latest Reply
INJUSTIC
New Contributor II

Have you found the solution? Thanks

  • 1 kudos
3 More Replies
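For Structured Streaming jobs, progress statistics are usually collected with a StreamingQueryListener registered on the session, rather than the /streaming/statistics REST endpoint (which targets the older DStream API). A minimal PySpark sketch, assuming a recent runtime with the Python listener API; the class name QueryProgressLogger is illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener

spark = SparkSession.builder.getOrCreate()

class QueryProgressLogger(StreamingQueryListener):
    # Illustrative listener: logs progress for every active streaming query.
    def onQueryStarted(self, event):
        print(f"query started: {event.id}")
    def onQueryProgress(self, event):
        # event.progress carries input rate, batch duration, state metrics, etc.
        print(event.progress)
    def onQueryTerminated(self, event):
        print(f"query terminated: {event.id}")

spark.streams.addListener(QueryProgressLogger())

Registered from the application itself this way (instead of an init-script jar), the progress events show up without going through the REST API.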
swetha
by New Contributor III
  • 2596 Views
  • 3 replies
  • 1 kudos

I am unable to attach a streaming listener to a Spark streaming job. Error: "no streaming listener attached to the spark application" is the error we are observing after accessing the streaming statistics API. Please help us with this issue ASAP. Thanks.

Issue: After adding the listener jar file in the cluster init script, the listener is working (from what I see in the stdout/log4j logs), but when I try to hit the 'Content-Type: application/json' http://host:port/api/v1/applications/app-id/streaming/st...

Latest Reply
INJUSTIC
New Contributor II

Have you found the solution? Thanks

  • 1 kudos
2 More Replies
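If the goal is just to read the statistics programmatically, each active Structured Streaming query also exposes them directly on the driver; a small sketch (no custom listener or REST call needed; the field names below are standard progress fields):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for q in spark.streams.active:
    print(q.name, q.id, q.status)
    # lastProgress is a dict with numInputRows, inputRowsPerSecond, durationMs, ...
    print(q.lastProgress)
    # recentProgress keeps the last few progress updates
    for p in q.recentProgress:
        print(p.get("batchId"), p.get("numInputRows"))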
sanjay
by Valued Contributor II
  • 11812 Views
  • 13 replies
  • 10 kudos

Spark tasks too slow and not doing parallel processing

Hi, I have a Spark job which is processing a large data set, and it's taking too long to process the data. In the Spark UI, I can see it's running 1 task out of 9 tasks. Not sure how to run this in parallel. I have already mentioned auto scaling and providing up to...

Latest Reply
plondon
New Contributor II

Will it be any different if using Spark within Azure, i.e. faster?

  • 10 kudos
12 More Replies
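A single long-running task out of many usually means the data has collapsed into one (or one very large) partition; repartitioning before the expensive step spreads the work across executor cores. A minimal sketch; the paths, the partition count, and the column "id" are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/mnt/raw/large_dataset")   # placeholder input path

print(df.rdd.getNumPartitions())                    # how parallel is it right now?

# Spread rows across more partitions (tune the number to the total executor cores).
df = df.repartition(64, "id")
df.write.mode("overwrite").parquet("/mnt/curated/large_dataset")   # placeholder output path

Autoscaling adds executors but cannot split a stage that only has one task, so the partition count is the first thing to check.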
Paul_Seattle
by New Contributor
  • 7107 Views
  • 1 reply
  • 0 kudos

A Quick Question on Running a job from CLI

Could anyone tell me what could be wrong with my command to submit a Spark job with params (if I don't have --spark-submit-params, it's fine)? Please see the attached snapshot.

Latest Reply
User16539034020
Databricks Employee

Yes, there is no need for --spark-submit-params: databricks jobs run-now --job-id *** Reference: https://docs.databricks.com/dev-tools/cli/jobs-cli.html

  • 0 kudos
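For reference, the same run-now call can also be made through the Jobs REST API; a hedged sketch using Python requests (the host/token environment variables and job ID 123 are placeholders, and spark_submit_params only applies to jobs defined with a spark-submit task):

import os
import requests

host = os.environ["DATABRICKS_HOST"]     # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,
        "spark_submit_params": ["--conf", "spark.foo=bar", "my_arg"],
    },
)
resp.raise_for_status()
print(resp.json()["run_id"])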
siddharthk
by New Contributor II
  • 1612 Views
  • 2 replies
  • 2 kudos

Resolved! Reduce downtime of Postgres table - JDBC overwrite job

I want to overwrite a PostgreSQL table, transactionStats, which is used by the customer-facing dashboards. This table needs to be updated every 30 mins. I am writing an AWS Glue Spark job via a JDBC connection to perform this operation. Spark dataframe writ...

Latest Reply
Anonymous
Not applicable

Hi @Siddharth Kanojiya, we haven't heard from you since the last response from @werners (Customer). Kindly share the information with us, and in return, we will provide you with the necessary solution. Thanks and Regards

  • 2 kudos
1 More Replies
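A common way to keep the dashboard readable during the refresh is to overwrite a staging table over JDBC and then swap it in with a short rename transaction; a sketch under those assumptions (the JDBC URL, credentials, and table names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("prepared_stats")                        # placeholder source DataFrame

(df.write.format("jdbc")
   .option("url", "jdbc:postgresql://host:5432/mydb")     # placeholder URL
   .option("dbtable", "transactionstats_staging")         # write to a staging table
   .option("user", "user").option("password", "password")
   .mode("overwrite")
   .save())

# Then swap the staging table in with a short transaction (run via psycopg2 or any
# SQL client), so readers never see a truncated table:
#   BEGIN;
#   ALTER TABLE transactionstats RENAME TO transactionstats_old;
#   ALTER TABLE transactionstats_staging RENAME TO transactionstats;
#   DROP TABLE transactionstats_old;
#   COMMIT;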
Prannu
by New Contributor II
  • 1817 Views
  • 2 replies
  • 1 kudos

Location of files previously uploaded on DBFS

I uploaded a CSV data file and used it in a Spark job three months back. I am now running the same Spark job with a newly created cluster. The program is running properly. I want to know where I can see the previously uploaded CSV data file.

Latest Reply
karthik_p
Esteemed Contributor

@Pranay Gupta you can see that in the DBFS root directory, based on the path you provided in the job. Please check: go to Data Explorer and select the option shown in the screenshot below.

  • 1 kudos
1 More Replies
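Files uploaded through the workspace UI typically land under /FileStore on the DBFS root, so listing that path from a notebook shows them; a minimal sketch (dbutils and spark are the handles predefined in Databricks notebooks, and the filename is a placeholder):

for f in dbutils.fs.ls("dbfs:/FileStore/tables/"):
    print(f.path, f.size)

# Read it back the same way the original job did (placeholder filename).
df = spark.read.csv("dbfs:/FileStore/tables/my_data.csv", header=True)
df.show(5)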
Nandini
by New Contributor II
  • 11849 Views
  • 10 replies
  • 7 kudos

Pyspark: You cannot use dbutils within a spark job

I am trying to parallelise the execution of file copy in Databricks. Making use of multiple executors is one way. So, this is the piece of code that I wrote in pyspark:
def parallel_copy_execution(src_path: str, target_path: str):
    files_in_path = db...

Latest Reply
Etyr
Contributor

If you have a Spark session, you can use Spark's hidden (JVM) Hadoop FileSystem handle:
# Get FileSystem from SparkSession
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Get Path class to convert string path to FS path
path = spark._...

  • 7 kudos
9 More Replies
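A hedged completion of that approach (an untested sketch: it leans on the private _jvm / _jsc handles of the SparkSession, and the paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

hadoop = spark._jvm.org.apache.hadoop
conf = spark._jsc.hadoopConfiguration()

src = hadoop.fs.Path("dbfs:/mnt/source/dir")     # placeholder source
dst = hadoop.fs.Path("dbfs:/mnt/target/dir")     # placeholder target
fs = src.getFileSystem(conf)                     # resolves the filesystem for the path's scheme

# Driver-side copy through the Hadoop FileSystem API, no dbutils involved,
# so it can also be fanned out with plain Python threads for concurrency.
hadoop.fs.FileUtil.copy(fs, src, fs, dst, False, conf)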
Paras
by New Contributor II
  • 2620 Views
  • 4 replies
  • 7 kudos
Latest Reply
Anonymous
Not applicable

Hi @Paras Gadhiya, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Than...

  • 7 kudos
3 More Replies
FRG96
by New Contributor III
  • 22067 Views
  • 4 replies
  • 7 kudos

Resolved! How to programmatically get the Spark Job ID of a running Spark Task?

In Spark we can get the Spark Application ID inside the Task programmatically using SparkEnv.get.blockManager.conf.getAppId, and we can get the Stage ID and Task Attempt ID of the running Task using TaskContext.get.stageId and TaskContext.get.taskAttemptId...

Latest Reply
FRG96
New Contributor III

Hi @Gaurav Rupnar, I have Spark SQL UDFs (implemented as Scala methods) in which I want to get the details of the Spark SQL query that called the UDF, especially a unique query ID, which in Spark SQL is the Spark Job ID. That's why I wanted a way to...

  • 7 kudos
3 More Replies
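For reference, the PySpark counterparts of those IDs; a minimal sketch (the application ID is read on the driver, TaskContext only returns values inside a running task, and the Spark job ID itself is not exposed through TaskContext):

from pyspark import TaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.applicationId)      # application ID, driver side

def task_ids(partition):
    ctx = TaskContext.get()                  # non-None only inside a running task
    yield (ctx.stageId(), ctx.partitionId(), ctx.taskAttemptId())

print(spark.sparkContext.parallelize(range(4), 2).mapPartitions(task_ids).collect())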
Lincoln_Bergeso
by New Contributor II
  • 7186 Views
  • 8 replies
  • 4 kudos

Resolved! How do I read the contents of a hidden file in a Spark job?

I'm trying to read a file from a Google Cloud Storage bucket. The filename starts with a period, so Spark assumes the file is hidden and won't let me read it. My code is similar to this:
from pyspark.sql import SparkSession
spark = SparkSession.build...

Latest Reply
Dan_Z
Databricks Employee

I don't think there is an easy way to do this. You will also break very basic functionality (like being able to read Delta tables) if you were able to get around these constraints. I suggest you employ a rename job and then read.

  • 4 kudos
7 More Replies
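A sketch of the rename-then-read suggestion (the bucket and filenames are placeholders; it assumes the cluster already has credentials for the GCS bucket, and dbutils is the notebook-provided handle):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = "gs://my-bucket/.hidden_input.csv"     # leading dot: Spark skips it as hidden
dst = "gs://my-bucket/hidden_input.csv"

dbutils.fs.cp(src, dst)                      # copy to a non-hidden name first
df = spark.read.csv(dst, header=True)
df.show(5)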
dataguy73
by New Contributor
  • 2840 Views
  • 1 reply
  • 1 kudos

Resolved! spark properties files

I am trying to migrate a Spark job from an on-premises Hadoop cluster to Databricks on Azure. Currently, we are keeping many values in the properties file. When executing spark-submit we pass the parameter --properties /prop.file.txt, and inside t...

Latest Reply
-werners-
Esteemed Contributor III

I use JSON files and .conf files which reside on the data lake or in the FileStore of DBFS, then read those files using Python/Scala.

  • 1 kudos
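A minimal sketch of that pattern (the /dbfs FUSE mount exposes DBFS as a local path on the driver; the file location and keys are placeholders):

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# /dbfs exposes DBFS through the driver's local filesystem.
with open("/dbfs/FileStore/configs/my_job.json") as f:     # placeholder path
    conf = json.load(f)

spark.conf.set("spark.sql.shuffle.partitions", conf.get("shuffle_partitions", 200))
input_path = conf["input_path"]                            # placeholder key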
User16826994223
by Honored Contributor III
  • 1107 Views
  • 1 reply
  • 0 kudos

spark is reading data from source even I am persisting the data

Hi all, I am reading data, caching it, and then performing a count action to get the data into memory, but still, in the DAG I found out that it reads data from the SOURCE every time.

Latest Reply
User16826994223
Honored Contributor III

It looks like the Spark memory is not sufficient to cache all the data, so it always reads from the source.

  • 0 kudos
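One thing to try, sketched with a placeholder table name: persist with MEMORY_AND_DISK so partitions that do not fit in executor memory are spilled to local disk instead of being recomputed from the source, then check the Storage tab of the Spark UI (or storageLevel) to see how much is actually cached:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")                   # placeholder source

df = df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()                                   # materializes the cache
print(df.storageLevel)                       # confirms what Spark actually kept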
User16826992666
by Valued Contributor
  • 1752 Views
  • 1 reply
  • 0 kudos

Resolved! What should I be looking for when evaluating the performance of a Spark job?

Where do I start when performance tuning my queries? Are there particular things I should be looking out for?

Latest Reply
Srikanth_Gupta_
Valued Contributor

A few things off the top of my mind: 1) Check the Spark UI and see which stage is taking more time. 2) Check for data skew. 3) Data skew can severely degrade query performance; Spark SQL accepts skew hints in queries. Also make sure to use proper join h...

  • 0 kudos
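To make the skew and join-hint points concrete, a small sketch (the table and column names are placeholders): check how evenly the join key is distributed, and broadcast the small side so the join avoids a skewed shuffle:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
facts = spark.table("facts")                 # large table, placeholder
dims = spark.table("dims")                   # small dimension table, placeholder

# Skew check: a handful of keys holding most of the rows is a red flag.
facts.groupBy("key").agg(F.count("*").alias("rows")).orderBy(F.desc("rows")).show(10)

# Join hint: broadcast the small side instead of shuffling both tables.
facts.join(F.broadcast(dims), "key").explain()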