Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

ItsMe
by New Contributor II
  • 3816 Views
  • 3 replies
  • 7 kudos

Resolved! Run Pyspark job of Python egg package using spark submit on databricks

Error: missing application resource. Getting this error while running a job with spark-submit. I have given the following parameters while creating the job: --conf spark.yarn.appMasterEnv.PYSAPRK_PYTHON=databricks/path/python3 --py-files dbfs/path/to/.egg job_m...

Latest Reply
User16752246494
Contributor
  • 7 kudos

Hi, we tried to simulate the question on our end: we packaged a module inside a whl file. Then, to access the wheel file, we created another Python file, test_whl_locally.py. Inside test_whl_locally.py, to access the content of the wheel file...

2 More Replies
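
A minimal sketch of the approach described in the reply above, with hypothetical paths, package names, and functions: the job gets a small driver .py file as its primary application resource (which is what the "missing application resource" error complains about), and the packaged egg/wheel is shipped via --py-files and imported from that driver.

# job_main.py, passed to spark-submit as the primary application resource.
# Illustrative spark-submit parameters for the Databricks job:
#   --py-files dbfs:/path/to/my_package-0.1-py3-none-any.whl dbfs:/path/to/job_main.py
from pyspark.sql import SparkSession

# my_package is assumed to be the module packaged inside the egg/wheel on --py-files
from my_package.transforms import clean_data  # hypothetical module and function

if __name__ == "__main__":
    spark = SparkSession.builder.appName("egg-wheel-job").getOrCreate()
    df = spark.range(100)        # placeholder input data
    clean_data(df).show()        # call into the packaged code
    spark.stop()
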
BorislavBlagoev
by Valued Contributor III
  • 3864 Views
  • 1 replies
  • 5 kudos

Resolved! Get package from Nexus repo.

I want to pull a package from a Nexus repo, both in a notebook and in a job. If anyone has experience with this, please answer me here!

Latest Reply
User16855813973
Databricks Employee
  • 5 kudos

For a Nexus repo in a notebook you can use notebook-scoped libraries: %pip install with the --index-url option. Secret management is available for the repository credentials (see the example in the docs). From the UI it is not supported for cluster libraries.

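
A hedged sketch of the notebook-scoped route the reply describes. The Nexus host, repository path, and package name are illustrative placeholders, and the credentials placeholders stand for values kept in Databricks secret management rather than typed into the notebook.

# Notebook cell, installed as a notebook-scoped library.
# <user> and <token> stand for credentials retrieved via dbutils.secrets.get(),
# not hard-coded values; host, repo path, and package name are illustrative.
%pip install my-internal-package --index-url=https://<user>:<token>@nexus.example.com/repository/pypi-internal/simple
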
User16868770416
by Contributor
  • 4344 Views
  • 1 replies
  • 0 kudos

What is the best way to decode protobuf using pyspark?

I am using Spark Structured Streaming to read protobuf-encoded messages from Event Hub. We use a lot of Delta tables, but there isn't a simple way to integrate this. We are currently using KSQL to transform into Avro on the fly and then use Dat...

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Hi @Will Block, I think a related question was asked in the past; I think it was this one. I also found this library, I hope it helps.

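
The library linked in the reply is one route; as a hedged alternative sketch, newer Spark / Databricks Runtime versions (Spark 3.4+ / DBR 12.1+) ship a built-in from_protobuf function. The Event Hubs options, descriptor file, message name, and paths below are illustrative, and the snippet assumes a Databricks notebook where spark is predefined and the Azure Event Hubs connector is installed.

from pyspark.sql.protobuf.functions import from_protobuf

event_hub_conf = {"eventhubs.connectionString": "<encrypted-connection-string>"}  # hypothetical config

raw = (spark.readStream
       .format("eventhubs")                  # Azure Event Hubs connector
       .options(**event_hub_conf)
       .load())

decoded = raw.select(
    from_protobuf("body",                    # Event Hubs delivers the payload in the binary `body` column
                  "MyEvent",                 # hypothetical protobuf message name
                  descFilePath="/dbfs/schemas/my_event.desc").alias("event"))

(decoded.writeStream
 .format("delta")
 .option("checkpointLocation", "/tmp/checkpoints/protobuf_demo")
 .start("/tmp/delta/protobuf_demo"))
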
marchello
by New Contributor III
  • 6344 Views
  • 8 replies
  • 3 kudos

Resolved! error on connecting to Snowflake

Hi team, I'm getting a weird error in one of my jobs when connecting to Snowflake. All my other jobs (I've got plenty) work fine. The current one also works fine when I have only one coding step (besides installing the needed libraries in my very first step...

Latest Reply
Dan_Z
Databricks Employee
  • 3 kudos

@marchello I suggest you contact Snowflake to move forward on this one.

7 More Replies
William_Scardua
by Valued Contributor
  • 3799 Views
  • 4 replies
  • 4 kudos

Resolved! Small/big file problem, how do you fix it ?

How do you go about fixing the small/big file problem? What do you suggest?

Latest Reply
-werners-
Esteemed Contributor III
  • 4 kudos

What Jose said. If you cannot use Delta or do not want to: coalesce and repartition/partitioning are the way to control the file size. There is no one ideal file size; it all depends on the use case, available cluster size, data flow downstrea...

3 More Replies
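
A hedged sketch of both options from the reply above; the paths, partition column, and file counts are illustrative, and it assumes a Databricks notebook where spark is predefined.

# Option 1: with Delta, compact small files in place (optionally only the
# recently written partitions, assuming event_date is a partition column).
spark.sql("""
  OPTIMIZE delta.`/mnt/datalake/events`
  WHERE event_date >= date_sub(current_date(), 1)
""")

# Option 2: without Delta, choose the number of output files at write time.
df = spark.read.parquet("/mnt/raw/events")           # hypothetical input
(df.repartition(64)                                   # or coalesce(n) to merge partitions without a full shuffle
   .write.mode("overwrite")
   .parquet("/mnt/datalake/events_compacted"))
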
soundari
by New Contributor
  • 2261 Views
  • 1 replies
  • 1 kudos

Resolved! Identify the partitionValues written yesterday from delta

We have streaming data written into Delta. We will not write all the partitions every day, hence I am thinking of running a compaction Spark job only on the partitions that have been modified yesterday. Is it possible to query the partitionValues wr...

Latest Reply
Deepak_Bhutada
Contributor III
  • 1 kudos

Hi @Gnanasoundari Soundarajan, based on the details you provided, you are not overwriting all the partitions every day, which means you might be using append mode while writing the data on day 1. On day 2, you want to access those partition values and...

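
A hedged sketch of one way to answer the question, assuming a Databricks notebook with spark predefined: read the add actions from the table's _delta_log commit files and keep the partition values whose files were written since yesterday. The table path and partition column name are illustrative, and this leans on the low-level Delta transaction-log layout rather than a public API.

from pyspark.sql import functions as F

# Commit files are JSON; checkpoints are parquet, so the glob skips them.
log_df = spark.read.json("/mnt/delta/events/_delta_log/*.json")

recent_partitions = (log_df
    .where(F.col("add").isNotNull())
    .select(F.col("add.partitionValues.event_date").alias("event_date"),          # hypothetical partition column
            (F.col("add.modificationTime") / 1000).cast("timestamp").alias("modified"))
    .where(F.col("modified") >= F.date_sub(F.current_date(), 1))
    .select("event_date")
    .distinct())

recent_partitions.show(truncate=False)   # feed these values into an OPTIMIZE ... WHERE or compaction job
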
shan_chandra
by Databricks Employee
  • 8299 Views
  • 1 replies
  • 3 kudos

Resolved! Cannot reserve additional contiguous bytes in the vectorized reader (requested xxxxxxxxx bytes).

I got the below error when running a streaming workload from a source Delta table Caused by: java.lang.RuntimeException: Cannot reserve additional contiguous bytes in the vectorized reader (requested xxxxxxxxx bytes). As a workaround, you can reduce ...

Latest Reply
shan_chandra
Databricks Employee
  • 3 kudos

This is happening because the Delta/Parquet source has one or more of the following: a huge number of columns; huge strings in one or more columns; huge arrays/maps, possibly nested in each other. In order to mitigate this issue, could you please reduce spar...

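
A hedged sketch of the kind of mitigation the error message itself points at: shrink the vectorized reader's batch size (default 4096 rows) and, only if that is not enough, fall back to disabling the vectorized Parquet reader. The values are illustrative and assume a notebook where spark is predefined.

# Fewer rows per vectorized batch means less contiguous memory per column batch.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", 1024)

# Heavier fallback if shrinking the batch size is not enough:
# spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
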
jsaddam28
by New Contributor III
  • 49266 Views
  • 24 replies
  • 15 kudos

How to import local python file in notebook?

For example, I have one.py and two.py in Databricks and I want to use a module from one.py in two.py. Usually I do this on my local machine with an import statement, e.g. in two.py: from one import module1 ... How do I do this in Databricks?...

Latest Reply
StephanieAlba
Databricks Employee
  • 15 kudos

Use Repos! A notebook in Repos can call a function that is in a file in the same GitHub repo, as long as Files in Repos is enabled in the admin panel. So if I have utils.py with: import pandas as pd def clean_data(): # Load wine data data = pd.read_csv("/dbfs/da... (a cleaned-up sketch of this pattern follows below the thread).

23 More Replies
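
A hedged sketch of the Repos pattern from the accepted answer; the file contents, function body, and data path are illustrative. With Files in Repos enabled, a notebook can import a plain .py file that lives alongside it in the same repo.

# utils.py (a regular file in the repo, next to the notebook)
import pandas as pd

def clean_data(path: str) -> pd.DataFrame:
    # hypothetical helper: load a CSV and drop incomplete rows
    return pd.read_csv(path).dropna()

# notebook cell in the same repo
from utils import clean_data

df = clean_data("/dbfs/FileStore/my_data.csv")   # illustrative path
display(df)                                      # display() is available in Databricks notebooks
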
Sam
by New Contributor III
  • 3700 Views
  • 1 replies
  • 1 kudos

Resolved! Query Pushdown in Snowflake

Hi, I am wondering what documentation exists on query pushdown in Snowflake. I noticed that a single function (monotonically_increasing_id()) prevented the entire query from being pushed down to Snowflake during an ETL process. Is pushdown coming from the S...

Latest Reply
siddhathPanchal
Databricks Employee
  • 1 kudos

Hi Sam, the Spark connector applies predicate and query pushdown by capturing and analyzing the Spark logical plans for SQL operations. When the data source is Snowflake, the operations are translated into a SQL query and then executed in Snowflake to...

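
A hedged sketch related to the reply above: the Snowflake connector exposes an autopushdown option, and expressions it cannot translate (such as monotonically_increasing_id()) can be applied on the Spark side after the pushed-down read. Connection details, secret scope/keys, and the query are illustrative placeholders; a Databricks notebook with spark and dbutils predefined is assumed.

from pyspark.sql import functions as F

sf_options = {
    "sfUrl": "myaccount.snowflakecomputing.com",                 # hypothetical account URL
    "sfUser": dbutils.secrets.get("snowflake", "user"),
    "sfPassword": dbutils.secrets.get("snowflake", "password"),
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH",
    "autopushdown": "on",                                        # "off" forces Spark-side execution
}

pushed = (spark.read.format("snowflake")
          .options(**sf_options)
          .option("query", "SELECT id, amount FROM orders WHERE amount > 100")   # this filter runs in Snowflake
          .load())

# Non-translatable functions are applied on the Spark side, after the read.
with_ids = pushed.withColumn("surrogate_id", F.monotonically_increasing_id())
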
MarcoCaviezel
by New Contributor III
  • 5666 Views
  • 5 replies
  • 3 kudos

Resolved! Use Spot Instances with Azure Data Factory Linked Service

In my pipeline I'm using Azure Data Factory to trigger Databricks notebooks as a linked service. I want to use spot instances for my job clusters. Is there a way to achieve this? I didn't find a way to do this in the GUI. Thanks for your help! Marco

Latest Reply
MarcoCaviezel
New Contributor III
  • 3 kudos

Hi @Werner Stinckens, just a quick follow-up question. Does it make sense to you that you can select the following options in Azure Data Factory? To my understanding, the "cluster version", "Python Version" and "Worker options" are defined when I crea...

4 More Replies
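
A hedged sketch of where the spot preference actually lives on the Databricks side: the cluster spec's azure_attributes. One workaround when the ADF linked-service GUI does not expose it is to create the job with such a spec via the Jobs API (or to point ADF at an instance pool already configured for spot instances). The workspace URL, secret scope/key, node type, and notebook path are illustrative.

import requests

token = dbutils.secrets.get("adf", "databricks_pat")        # hypothetical secret holding a personal access token

new_cluster = {
    "spark_version": "9.1.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE",          # use spot VMs, fall back to on-demand if none are available
        "first_on_demand": 1,                                 # keep the driver on an on-demand VM
        "spot_bid_max_price": -1,                             # -1 = pay up to the current on-demand price
    },
}

resp = requests.post(
    "https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/jobs/create",   # placeholder workspace URL
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "adf-triggered-job",
        "new_cluster": new_cluster,
        "notebook_task": {"notebook_path": "/Repos/demo/my_notebook"},
    },
)
resp.raise_for_status()
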
brendan-b
by New Contributor II
  • 10302 Views
  • 2 replies
  • 3 kudos

spark-xml not working with Databricks Connect and Pyspark

Hi all, I currently have a cluster configured in Databricks with spark-xml (version com.databricks:spark-xml_2.12:0.13.0), which was installed using Maven. The spark-xml library itself works fine with PySpark when I am using it in a notebook within th...

Latest Reply
brendan-b
New Contributor II
  • 3 kudos

@Sean Owen, I do not believe I have. Do you have any documentation on how to install spark-xml locally? I have tried the following with no luck. Is this what you are referring to? PYSPARK_HOME/bin/pyspark --packages com.databricks:spark-xml_2.12:0.13...

1 More Replies
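
A hedged sketch of reading XML with spark-xml once the library is visible to the session; on Databricks the cluster-installed Maven library covers it, while a plain local PySpark session can pull the same package via spark.jars.packages. The file path and rowTag are illustrative.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-xml-demo")
         # Only needed for a plain local session; on Databricks the Maven
         # library installed on the cluster provides the data source.
         .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.13.0")
         .getOrCreate())

books = (spark.read.format("xml")
         .option("rowTag", "book")        # XML element that becomes one row
         .load("/tmp/books.xml"))         # illustrative path
books.printSchema()
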
Maverick1
by Valued Contributor II
  • 4865 Views
  • 8 replies
  • 14 kudos

Resolved! Real-time model serving and monitoring on Databricks at scale

How do you deploy a real-time model on Databricks at scale? Right now, model serving is limited to 20 requests per second. Also, there is no model monitoring framework/graphs like the ones provided with Azure ML or SageMaker.

Latest Reply
sean_owen
Databricks Employee
  • 14 kudos

I believe the next update to serving will include 1, not 2 (this is still within a Databricks workspace in a region). I don't think multi-model endpoints are next on the roadmap. How does Airflow integration relate?

7 More Replies
Artem_Y
by Databricks Employee
  • 1953 Views
  • 1 replies
  • 4 kudos

Embed Google Slides (PowerPoint) into Databricks Interactive Notebooks

Use the following code to embed your slides: slide_id = '1CYEVsDqsdfg343fwg42MtXqGd68gffP-Y16CR59c' slide_number = 'id.p9' displayHTML(f''' <iframe src="https://docs.google.com/...

Latest Reply
Anonymous
Not applicable
  • 4 kudos

@Artem Yevtushenko - Thank you for sharing this solution.

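
A hedged, completed version of the pattern from the post above; the slide ID, slide anchor, iframe size, and the exact embed URL parameters are illustrative placeholders.

# Run in a Databricks notebook cell, where displayHTML() is available.
slide_id = "1AbCdEfGhIjKlMnOpQrStUvWxYz1234567890"   # hypothetical presentation ID
slide_number = "id.p9"                               # anchor of the slide to open on

displayHTML(f"""
<iframe
  src="https://docs.google.com/presentation/d/{slide_id}/embed?slide={slide_number}"
  frameborder="0"
  width="960"
  height="569">
</iframe>
""")
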

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group