Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Error: missing application resource. I am getting this error while running a job with spark-submit. I have given the following parameters while creating the job: --conf spark.yarn.appMasterEnv.PYSAPRK_PYTHON=databricks/path/python3 --py-files dbfs/path/to/.egg job_m...
Hi, we tried to simulate the question on our end; what we did was package a module inside a whl file. Then, to access the wheel file, we created another Python file, test_whl_locally.py. Inside test_whl_locally.py, to access the content of the wheel file...
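A minimal sketch of that setup, assuming placeholder names (my_module and greet are not from the original thread): install the wheel onto the cluster, then import the packaged module from the driver script.

# In a notebook cell, install the wheel (path is a placeholder):
# %pip install /dbfs/path/to/my_package-0.1.0-py3-none-any.whl

# test_whl_locally.py
from my_module import greet   # hypothetical module packaged inside the wheel

if __name__ == "__main__":
    # call a function exposed by the wheel to confirm the import works
    print(greet("Databricks"))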
For a Nexus repo, from a notebook you can use notebook-scoped libraries: use %pip install with the --index-url option. Secret management is available; see the example below. From the UI (cluster libraries) it is not supported.
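A minimal sketch of that %pip pattern, assuming a placeholder index URL and package name (neither comes from the original post); credentials for the private index can be kept in a Databricks secret scope rather than hard-coded.

# Notebook-scoped install from a private Nexus PyPI index (URL and package are placeholders):
%pip install my-internal-package --index-url=https://nexus.example.com/repository/pypi-internal/simple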
I am using Spark Structured Streaming to read protobuf-encoded messages from Event Hub. We use a lot of Delta tables, but there isn't a simple way to integrate this. We are currently using K-SQL to transform into Avro on the fly and then use Dat...
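For reference, a hedged sketch of decoding protobuf inside the streaming query itself with from_protobuf (available in Spark 3.4+ / recent Databricks runtimes), writing straight to Delta. The Event Hubs endpoint, descriptor file, message name, and paths are all placeholders, and the SASL auth options for the Kafka-compatible endpoint are omitted.

from pyspark.sql.protobuf.functions import from_protobuf

# Read the raw binary payloads from the Event Hubs Kafka-compatible endpoint
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
       .option("subscribe", "<eventhub-name>")
       .load())

# Decode the protobuf value column using a compiled descriptor file
decoded = raw.select(
    from_protobuf("value", "MyMessage", descFilePath="/dbfs/schemas/my_message.desc").alias("event")
)

# Write the decoded stream to a Delta table
(decoded.writeStream
 .format("delta")
 .option("checkpointLocation", "/delta/_checkpoints/protobuf_events")
 .start("/delta/protobuf_events"))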
Hi team, I'm getting a weird error in one of my jobs when connecting to Snowflake. All my other jobs (I've got plenty) work fine. The current one also works fine when I have only one coding step (aside from installing the needed libraries in my very first step...
What Jose said. If you cannot use Delta, or do not want to: the use of coalesce and repartition/partitioning is the way to define the file size. There is no one ideal file size; it all depends on the use case, available cluster size, data flow downstrea...
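A minimal sketch of that approach; the paths and partition count are placeholders and would need to be tuned to the actual data volume.

df = spark.read.parquet("/mnt/raw/events")          # placeholder source

# Roughly: number_of_output_files ~= number_of_partitions, so pick a count
# that yields the file sizes you want for the downstream readers.
(df.repartition(64)                                  # full shuffle, evens out skew
   .write.mode("overwrite")
   .parquet("/mnt/curated/events"))

# coalesce(n) avoids a full shuffle but can only reduce the partition count,
# and may produce uneven file sizes.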
We have streaming data written into Delta. We will not write all the partitions every day, hence I am thinking of running a compaction Spark job only on the partitions that have been modified yesterday. Is it possible to query the partitionsValues wr...
Hi @Gnanasoundari Soundarajan, based on the details you provided, you are not overwriting all the partitions every day, which means you might be using append mode while writing the data on day 1. On day 2, you want to access those partition values and...
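One way to limit the compaction to yesterday's partitions is to restrict OPTIMIZE with a WHERE clause on the partition column; a hedged sketch follows, where the table name and partition column are placeholders.

from datetime import date, timedelta

# compute yesterday's date string for the partition filter
yesterday = (date.today() - timedelta(days=1)).isoformat()

# compact only the files in yesterday's partition(s)
spark.sql(f"""
  OPTIMIZE events_delta
  WHERE event_date = '{yesterday}'
""")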
I got the below error when running a streaming workload from a source Delta table: Caused by: java.lang.RuntimeException: Cannot reserve additional contiguous bytes in the vectorized reader (requested xxxxxxxxx bytes). As a workaround, you can reduce ...
This is happening because the Delta/Parquet source has one or more of the following: a huge number of columns; huge strings in one or more columns; huge arrays/maps, possibly nested in each other. In order to mitigate this issue, could you please reduce spar...
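The reply above is cut off, so the exact setting it refers to is an assumption on my part; one knob commonly lowered for this kind of vectorized-reader memory error is the Parquet reader batch size.

# Assumption: reduce the number of rows per vectorized batch (Spark default is 4096)
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", 512)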
Hi, I am wondering what documentation exists on query pushdown in Snowflake. I noticed that a single function (monotonically_increasing_id()) prevented the entire query from being pushed down to Snowflake during an ETL process. Is pushdown coming from the S...
Hi Sam, the Spark Connector applies predicate and query pushdown by capturing and analyzing the Spark logical plans for SQL operations. When the data source is Snowflake, the operations are translated into a SQL query and then executed in Snowflake to...
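A hedged sketch of the behaviour described above; connection options, table, and column names are placeholders. Operators the connector can translate are pushed into the generated Snowflake SQL, while a function it cannot translate (such as monotonically_increasing_id) is evaluated in Spark, which limits how much of the plan is pushed down.

from pyspark.sql import functions as F

sf_options = {
    "sfUrl": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<db>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<wh>",
}

df = (spark.read.format("snowflake")
      .options(**sf_options)
      .option("dbtable", "ORDERS")
      .load())

pushed = df.filter(F.col("STATUS") == "OPEN")                              # filter can be pushed down
not_pushed = pushed.withColumn("row_id", F.monotonically_increasing_id())  # computed in Spark instead

not_pushed.explain()   # inspect the physical plan to see how much reached Snowflake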
In my pipeline I'm using Azure Data Factory to trigger Databricks notebooks as a linked service. I want to use spot instances for my job clusters. Is there a way to achieve this? I didn't find a way to do this in the GUI. Thanks for your help! Marco
Hi @Werner Stinckens, just a quick follow-up question. Does it make sense to you that you can select the following options in Azure Data Factory? To my understanding, "cluster version", "Python Version" and the "Worker options" are defined when I crea...
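For reference, a hedged sketch of what the spot configuration looks like at the Databricks Jobs/Clusters API level, since the thread is about whether the ADF GUI exposes it. The job name, node types, notebook path, workspace URL, and token are all placeholders, and this does not show how (or whether) the ADF linked service surfaces the same fields.

import requests

new_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE",  # fall back to on-demand if spot is unavailable
        "spot_bid_max_price": -1,                    # -1 = bid up to the on-demand price
    },
}

payload = {
    "name": "adf-triggered-notebook",
    "tasks": [{
        "task_key": "main",
        "new_cluster": new_cluster,
        "notebook_task": {"notebook_path": "/Repos/project/notebook"},
    }],
}

# create the job with a spot-backed job cluster via the Jobs API
requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <token>"},
    json=payload,
)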
Hi all, I currently have a cluster configured in Databricks with spark-xml (version com.databricks:spark-xml_2.12:0.13.0), which was installed using Maven. The spark-xml library itself works fine with PySpark when I am using it in a notebook within th...
@Sean Owen, I do not believe I have. Do you have any documentation on how to install spark-xml locally? I have tried the following with no luck. Is this what you are referring to? PYSPARK_HOME/bin/pyspark --packages com.databricks:spark-xml_2.12:0.13....
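An alternative sketch for a local setup: pull the same Maven coordinates in through spark.jars.packages on the SparkSession builder (requires internet access so the JAR can be resolved at session start). The XML file path and rowTag below are placeholders.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("spark-xml-local")
         .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.13.0")
         .getOrCreate())

# read an XML file with the spark-xml data source; adjust rowTag to the actual structure
df = (spark.read.format("xml")
      .option("rowTag", "record")
      .load("/path/to/file.xml"))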
How do we deploy a real-time model on Databricks at scale? Right now, Model Serving is limited to 20 requests per second. Also, there is no model monitoring framework/graphs like the ones provided with the AzureML or SageMaker frameworks.
I believe the next update to serving will include 1, not 2 (this is still within a Databricks workspace in a region). I don't think multi-model endpoints are on the roadmap next. How does Airflow integration relate?
Embed Google Slides (PowerPoint) into Databricks Interactive Notebooks. Use the following code to embed your slides:
slide_id = '1CYEVsDqsdfg343fwg42MtXqGd68gffP-Y16CR59c'
slide_number = 'id.p9'
displayHTML(f'''
<iframe
src="https://docs.google.com/...
I'm new to Spark and trying to understand how some of its components work. I understand that once the data is loaded into the memory of separate nodes, they process partitions in parallel, within their own memory (RAM). But I'm wondering whether the in...
@Narek Margaryan, normally the reading is done in parallel because the underlying file system is already distributed (if you use HDFS-based storage or something like it, e.g. a data lake). The number of partitions in the file itself also matters. This l...
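A small sketch for inspecting that read parallelism; the path is a placeholder. Each input split becomes a task, so the partition count tells you how many tasks can read the data in parallel across the cluster.

# how many partitions (and therefore parallel read tasks) the source produces
df = spark.read.parquet("/mnt/datalake/events")
print(df.rdd.getNumPartitions())

# the split size for splittable formats is governed mainly by this setting (default 128 MB)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))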