Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Issue: Spark structured streaming application. After adding the listener jar file in the cluster init script, the listener is working (from what I see in the stdout/log4j logs). But when I try to hit, with 'Content-Type: application/json', http://host:port/api/v1/applications/app-id/streaming/st...
Hi, I have a Spark job which is processing a large data set, and it's taking too long to process the data. In the Spark UI, I can see it's running 1 task out of 9 tasks. Not sure how to run this in parallel. I have already configured auto scaling and am providing up to...
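A minimal sketch of spreading that work across more tasks by repartitioning before the heavy step, assuming the slow stage operates on a DataFrame read as below; the paths, partition count, and column name are illustrative:

df = spark.read.parquet("/mnt/raw/large_dataset")   # illustrative source path
print(df.rdd.getNumPartitions())                    # how many tasks the next stage can use

# Repartition so the wide transformation below runs as many parallel tasks
df = df.repartition(64, "customer_id")

result = df.groupBy("customer_id").count()
result.write.mode("overwrite").parquet("/mnt/curated/customer_counts")

If only one task out of nine is doing the work, it usually means one partition holds most of the data, so repartitioning (or salting the key) tends to help more than adding nodes.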
Could anyone tell me what could be wrong with my command to submit a Spark job with params? (If I don't have --spark-submit-params, it's fine.) Please see the attached snapshot.
I want to overwrite a PostgreSQL table, transactionStats, which is used by customer-facing dashboards. This table needs to be updated every 30 mins. I am writing an AWS Glue Spark job via a JDBC connection to perform this operation. Spark dataframe writ...
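A sketch of the JDBC overwrite from Spark, assuming the result sits in a DataFrame called transaction_stats_df and with placeholder connection details; the "truncate" option makes the overwrite TRUNCATE the existing transactionStats table rather than drop and recreate it, which preserves the grants and indexes the dashboards rely on:

(transaction_stats_df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")
    .option("dbtable", "transactionStats")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("truncate", "true")   # TRUNCATE instead of DROP on overwrite
    .mode("overwrite")
    .save())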
Hi @Siddharth Kanojiya, we haven't heard from you since the last response from @werners (Customer). Kindly share the information with us, and in return, we will provide you with the necessary solution. Thanks and Regards
I uploaded a csv data file and used it in a Spark job three months back. I am now running the same Spark job with a newly created cluster. The program is running properly. I want to know where I can see the previously uploaded csv data file.
@Pranay Gupta, you can see that in the DBFS root directory, based on the path you provided in the job. Please check: go to Data Explorer and select the option shown in the screenshot below.
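To double-check from a notebook, a quick listing of the usual DBFS upload location; the path below is the default used by UI uploads and is illustrative, so adjust it to the path referenced in the job:

# List the default DBFS upload folder
display(dbutils.fs.ls("dbfs:/FileStore/tables/"))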
I am trying to parallelise the execution of file copy in Databricks. Making use of multiple executors is one way. So, this is the piece of code that I wrote in PySpark:
def parallel_copy_execution(src_path: str, target_path: str):
    files_in_path = db...
If you have a Spark session, you can use the Hadoop FileSystem hidden behind it:
# Get FileSystem from SparkSession
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Get Path class to convert string path to FS path
path = spark._...
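A fuller driver-side sketch of the same idea, assuming spark is the notebook's SparkSession; the copies are parallelised with a Python thread pool on the driver (the py4j gateway used by spark._jvm is only available there), and the paths and pool size are illustrative:

from concurrent.futures import ThreadPoolExecutor

def parallel_copy_execution(src_path: str, target_path: str, max_workers: int = 8):
    jvm = spark._jvm
    conf = spark._jsc.hadoopConfiguration()
    Path = jvm.org.apache.hadoop.fs.Path
    FileUtil = jvm.org.apache.hadoop.fs.FileUtil

    src_fs = Path(src_path).getFileSystem(conf)
    dst_fs = Path(target_path).getFileSystem(conf)

    # List the files directly under the source directory
    statuses = src_fs.listStatus(Path(src_path))

    def copy_one(status):
        src = status.getPath()
        dst = Path(target_path + "/" + src.getName())
        # FileUtil.copy(srcFS, src, dstFS, dst, deleteSource, conf)
        return FileUtil.copy(src_fs, src, dst_fs, dst, False, conf)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(copy_one, statuses))

parallel_copy_execution("dbfs:/mnt/landing/raw", "dbfs:/mnt/curated/raw")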
Hi @Paras Gadhiya, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Than...
In Spark we can get the Spark Application ID inside the Task programmatically using:
SparkEnv.get.blockManager.conf.getAppId
and we can get the Stage ID and Task Attempt ID of the running Task using:
TaskContext.get.stageId
TaskContext.get.taskAttemptId...
Hi @Gaurav Rupnar, I have Spark SQL UDFs (implemented as Scala methods) in which I want to get the details of the Spark SQL query that called the UDF, especially a unique query ID, which in Spark SQL is the Spark Job ID. That's why I wanted a way to...
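For reference, the PySpark equivalents of those accessors, as a small sketch (this exposes the application, stage, and task attempt IDs, not a Spark SQL query ID); the RDD and its contents are illustrative:

from pyspark import TaskContext

app_id = spark.sparkContext.applicationId      # application ID, read on the driver

def tag_partition(index, rows):
    tc = TaskContext.get()                     # only valid inside a running task
    for row in rows:
        yield (app_id, tc.stageId(), tc.taskAttemptId(), row)

rdd = spark.sparkContext.parallelize(range(8), 4)
print(rdd.mapPartitionsWithIndex(tag_partition).collect())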
I'm trying to read a file from a Google Cloud Storage bucket. The filename starts with a period, so Spark assumes the file is hidden and won't let me read it. My code is similar to this:
from pyspark.sql import SparkSession
spark = SparkSession.build...
I don't think there is an easy way to do this. You will also break very basic functionality (like being able to read Delta tables) if you were able to get around these constraints. I suggest you employ a rename job and then read.
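A sketch of that rename-then-read workaround using the Hadoop FileSystem bound to the bucket; the bucket and file names are illustrative, and the GCS connector is assumed to be configured on the cluster:

jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path

hidden = Path("gs://my-bucket/.my_hidden_file.csv")
visible = Path("gs://my-bucket/my_hidden_file.csv")

fs = hidden.getFileSystem(conf)
fs.rename(hidden, visible)   # drop the leading dot so Spark no longer treats the file as hidden

df = spark.read.csv("gs://my-bucket/my_hidden_file.csv", header=True)
df.show()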
I am trying to migrate a Spark job from an on-premises Hadoop cluster to Databricks on Azure. Currently, we keep many values in a properties file. When executing spark-submit, we pass the parameter --properties /prop.file.txt, and inside t...
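One way to carry that pattern over is to upload the properties file to DBFS and parse it at the start of the job; a minimal sketch, with an illustrative path and key:

def load_properties(path: str) -> dict:
    # Parse a Java-style key=value properties file, skipping blanks and comments
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

props = load_properties("/dbfs/FileStore/config/prop.file.txt")   # /dbfs is the local FUSE mount
input_path = props.get("input.path")                              # illustrative key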
Hi all, I am reading data, caching it, and then performing a count action to get the data in memory, but still, in the DAG I found that every time it reads data from the SOURCE.
A few things off the top of my mind:
1) Check the Spark UI and see which stage is taking more time.
2) Check for data skew.
3) Data skew can severely degrade query performance; Spark SQL accepts skew hints in queries. Also make sure to use proper join h...
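On the caching question above: a minimal sketch of persisting and materialising a DataFrame, with illustrative paths and columns; later actions only skip the source scan when they reuse the very same persisted reference, since re-deriving the DataFrame from spark.read builds a fresh plan that goes back to the source:

from pyspark.storagelevel import StorageLevel

df = spark.read.parquet("/mnt/source/events")          # illustrative path
df = df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()                                             # action that materialises the cache
print(df.storageLevel)                                 # confirm the persistence level

# Reuse the same reference: this should read cached partitions, not the source
df.groupBy("event_type").count().show()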