Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Paul_Seattle
by New Contributor
  • 1676 Views
  • 1 replies
  • 0 kudos

A Quick Question on Running a job from CLI

Could anyone tell me what could be wrong with my command to submit a Spark job with params? (If I don't include --spark-submit-params, it works fine.) Please see the attached snapshot.

Latest Reply
User16539034020
Contributor II
  • 0 kudos

Yes, there is no need for spark-submit-params: databricks jobs run-now --job-id ***
Reference: https://docs.databricks.com/dev-tools/cli/jobs-cli.html

siddharthk
by New Contributor II
  • 848 Views
  • 2 replies
  • 2 kudos

Resolved! Reduce downtime of Postgres table - JDBC overwrite job

I want to overwrite a PostgreSQL table, transactionStats, which is used by the customer-facing dashboards. This table needs to be updated every 30 minutes. I am writing an AWS Glue Spark job that uses a JDBC connection to perform this operation. Spark dataframe writ...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Siddharth Kanojiya, we haven't heard from you since the last response from @werners (Customer). Kindly share the information with us, and in return, we will provide you with the necessary solution. Thanks and regards

1 More Replies
Prannu
by New Contributor II
  • 1096 Views
  • 2 replies
  • 1 kudos

Location of files previously uploaded on DBFS

I uploaded a CSV data file and used it in a Spark job three months ago. I am now running the same Spark job on a newly created cluster, and the program runs properly. I want to know where I can see the previously uploaded CSV data file.

Latest Reply
karthik_p
Esteemed Contributor
  • 1 kudos

@Pranay Gupta you can see it in the DBFS root directory, based on the path you provided in the job. Please check by going to Data Explorer and selecting the option shown in the screenshot.

1 More Replies
Nandini
by New Contributor II
  • 7738 Views
  • 10 replies
  • 7 kudos

Pyspark: You cannot use dbutils within a spark job

I am trying to parallelise the execution of file copy in Databricks. Making use of multiple executors is one way. So, this is the piece of code that I wrote in PySpark:
def parallel_copy_execution(src_path: str, target_path: str):
    files_in_path = db...

Latest Reply
Etyr
Contributor
  • 7 kudos

If you have a Spark session, you can use Spark's hidden file system:
# Get FileSystem from SparkSession
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Get Path class to convert string path to FS path
path = spark._...
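Since dbutils is only available on the driver, a different option from the one in the reply is to parallelise the copies with driver-side threads instead of executors. The sketch below is a minimal illustration under that assumption: `parallel_copy` and `demo` are hypothetical helper names, and the copy function is passed in so the example can run locally with `shutil.copy`. In a notebook, `dbutils.fs.cp` could be passed instead; that substitution is an assumption and has not been tested here.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def parallel_copy(
    pairs: Iterable[Tuple[str, str]],
    copy_fn: Callable[[str, str], object],
    max_workers: int = 8,
) -> List[str]:
    """Copy (source, target) pairs concurrently using threads on the driver.

    copy_fn is injected: shutil.copy works locally, and (as an assumption)
    dbutils.fs.cp could be supplied inside a Databricks notebook.
    Returns the target paths in input order and re-raises the first error.
    """
    pairs = list(pairs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(copy_fn, src, dst) for src, dst in pairs]
        for future in futures:
            future.result()  # propagate any exception raised by copy_fn
    return [dst for _, dst in pairs]


def demo() -> List[str]:
    """Local demonstration using temporary files and shutil.copy."""
    src_dir = Path(tempfile.mkdtemp())
    dst_dir = Path(tempfile.mkdtemp())
    pairs = []
    for i in range(5):
        src = src_dir / f"file_{i}.txt"
        src.write_text(f"payload {i}")
        pairs.append((str(src), str(dst_dir / src.name)))
    return parallel_copy(pairs, shutil.copy, max_workers=3)
```

Threads suit this case because each copy is I/O-bound, so the work can overlap even though it all starts from a single driver process.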

9 More Replies
sanjay
by Valued Contributor II
  • 5067 Views
  • 12 replies
  • 10 kudos

Spark tasks too slow and not doing parallel processing

Hi, I have a Spark job that is processing a large data set, and it is taking too long to process the data. In the Spark UI, I can see it is running 1 task out of 9. I am not sure how to run this in parallel. I have already mentioned auto scaling and providing upto...

Latest Reply
Anonymous
Not applicable
  • 10 kudos

Hi @Sanjay Jain, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so w...

11 More Replies
Paras
by New Contributor II
  • 1474 Views
  • 4 replies
  • 6 kudos
Latest Reply
Anonymous
Not applicable
  • 6 kudos

Hi @Paras Gadhiya, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Than...

3 More Replies
swetha
by New Contributor III
  • 1760 Views
  • 4 replies
  • 1 kudos

Error: no streaming listener attached to the spark app is the error we are observing post accessing streaming statistics API. Please help us with this issue ASAP. Thanks.

Issue: Spark Structured Streaming application. After adding the listener JAR file in the cluster init script, the listener is working (from what I see in the stdout/log4j logs). But when I try to hit the 'Content-Type: application/json' http://host:port/...

Latest Reply
Vidula
Honored Contributor
  • 1 kudos

Hi @swetha kadiyala, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Th...

3 More Replies
swetha
by New Contributor III
  • 1559 Views
  • 2 replies
  • 1 kudos

I am unable to attach a streaming listener to a spark streaming job. Error: no streaming listener attached to the spark application is the error we are observing post accessing streaming statistics API. Please help us with this issue ASAP. Thanks.

Issue: After adding the listener JAR file in the cluster init script, the listener is working (from what I see in the stdout/log4j logs). But when I try to hit the 'Content-Type: application/json' http://host:port/api/v1/applications/app-id/streaming/st...

Latest Reply
Vidula
Honored Contributor
  • 1 kudos

Hi @swetha kadiyala, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Th...

1 More Replies
FRG96
by New Contributor III
  • 8463 Views
  • 6 replies
  • 7 kudos

Resolved! How to programmatically get the Spark Job ID of a running Spark Task?

In Spark we can get the Spark Application ID inside the Task programmatically using:
SparkEnv.get.blockManager.conf.getAppId
and we can get the Stage ID and Task Attempt ID of the running Task using:
TaskContext.get.stageId
TaskContext.get.taskAttemptId...

Latest Reply
FRG96
New Contributor III
  • 7 kudos

Hi @Gaurav Rupnar​ , I have Spark SQL UDFs (implemented as Scala methods) in which I want to get the details of the Spark SQL query that called the UDF, especially a unique query ID, which in SparkSQL is the Spark Job ID. That's why I wanted a way to...

5 More Replies
Lincoln_Bergeso
by New Contributor II
  • 4438 Views
  • 10 replies
  • 5 kudos

Resolved! How do I read the contents of a hidden file in a Spark job?

I'm trying to read a file from a Google Cloud Storage bucket. The filename starts with a period, so Spark assumes the file is hidden and won't let me read it. My code is similar to this:
from pyspark.sql import SparkSession
spark = SparkSession.build...

Latest Reply
Kaniz
Community Manager
  • 5 kudos

Hi @Lincoln Bergeson, did @Dan Zafar's response help you solve your problem?

9 More Replies
dataguy73
by New Contributor
  • 1861 Views
  • 2 replies
  • 2 kudos

Resolved! spark properties files

I am trying to migrate a Spark job from an on-premises Hadoop cluster to Databricks on Azure. Currently, we keep many values in the properties file. When executing spark-submit, we pass the parameter --properties /prop.file.txt. and inside t...

Latest Reply
Kaniz
Community Manager
  • 2 kudos

Hi @vishal dutt, were you able to implement the properties file in the Databricks notebook?
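One possible way to carry a properties file over into a notebook is to parse it into a dictionary and apply the values yourself. The sketch below is only an illustration: `parse_properties` is a hypothetical helper, and it assumes each line is either `key=value` or `key value` (separated by whitespace), with `#` or `!` starting a comment. The thread does not show the actual file format.

```python
import re
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple 'key=value' or 'key value' lines into a dict.

    Assumed format (not confirmed by the thread): one setting per line,
    blank lines ignored, lines starting with '#' or '!' are comments.
    """
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        # Split once, at the first '=' or run of whitespace, whichever comes first.
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        props[key] = value
    return props
```

The resulting dictionary could then be applied with `spark.conf.set(key, value)` for settings that can be changed at runtime. Settings that must be fixed at startup would need to go into the cluster's Spark configuration instead.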

1 More Replies
Kaniz
by Community Manager
  • 1088 Views
  • 1 replies
  • 1 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

Basically, all that is needed is to create an API token in Databricks and then use the Jobs API as described here: https://docs.databricks.com/dev-tools/api/latest/jobs.html
The following endpoints are available:
POST https://<databricks-instance>/api/2.1/jobs/crea...
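As a rough illustration of the token-plus-Jobs-API approach, the sketch below builds a request for the `run-now` endpoint of Jobs API 2.1 using only the standard library. The helper names, host, token, and job ID are placeholders chosen for this example; check the request body against the linked documentation before relying on it.

```python
import json
import urllib.request


def build_run_now_request(host: str, token: str, job_id: int) -> urllib.request.Request:
    """Build (but do not send) a POST request that triggers an existing job.

    host, token and job_id are placeholders supplied by the caller.
    """
    url = f"https://{host}/api/2.1/jobs/run-now"
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # personal access token
            "Content-Type": "application/json",
        },
    )


def run_now(host: str, token: str, job_id: int) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_run_now_request(host, token, job_id)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and body before any network call is made.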

User16826994223
by Honored Contributor III
  • 749 Views
  • 1 replies
  • 0 kudos

Spark is reading data from source even though I am persisting the data

Hi all, I am reading data, caching it, and then performing a count action to load the data into memory, but in the DAG I still see that it reads the data from the source every time.

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

It looks like the Spark memory is not sufficient to cache all the data, so it always reads from the source.

User16826992666
by Valued Contributor
  • 1203 Views
  • 1 replies
  • 0 kudos

Resolved! What should I be looking for when evaluating the performance of a Spark job?

Where do I start when performance tuning my queries? Are there particular things I should be looking out for?

Latest Reply
Srikanth_Gupta_
Valued Contributor
  • 0 kudos

A few things off the top of my head:
1) Check the Spark UI to see which stage is taking the most time.
2) Check for data skew.
3) Data skew can severely degrade query performance. Spark SQL accepts skew hints in queries; also make sure to use proper join h...
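To make the skew check a little more concrete, one rough heuristic is to compare the largest partition with the median partition. The sketch below assumes you have already collected per-partition row counts (for example by grouping on `spark_partition_id()` from `pyspark.sql.functions`). `skew_ratio` is a hypothetical helper, and what counts as "too skewed" is a judgment call that depends on the workload.

```python
from statistics import median
from typing import Sequence


def skew_ratio(partition_rows: Sequence[int]) -> float:
    """Return the largest partition size divided by the median partition size.

    A value near 1.0 suggests evenly sized partitions; a large value suggests
    that a few partitions hold most of the rows.
    """
    if not partition_rows:
        raise ValueError("partition_rows must not be empty")
    largest = max(partition_rows)
    mid = median(partition_rows)
    if mid == 0:
        # Most partitions are empty: skewed if any partition has rows at all.
        return float("inf") if largest > 0 else 1.0
    return largest / mid
```

A stage in which one task runs far longer than the others in the Spark UI often corresponds to a high ratio here.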
