Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

soundari
by New Contributor
  • 2268 Views
  • 1 reply
  • 1 kudos

Resolved! Identify the partitionValues written yesterday from delta

We have streaming data written into Delta. We do not write to all the partitions every day, so I am thinking of running a compaction Spark job only on the partitions that were modified yesterday. Is it possible to query the partitionValues wr...

Latest Reply
Deepak_Bhutada
Contributor III

Hi @Gnanasoundari Soundarajan, based on the details you provided, you are not overwriting all the partitions every day, which means you might be using append mode while writing the data on day 1. On day 2, you want to access those partition values and...
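
A minimal sketch of one way to get at this (the table path is hypothetical, and FileInfo.modificationTime requires a reasonably recent DBR): every Delta commit file records an "add" action containing the partitionValues of each data file written, so yesterday's commits can be scanned directly:

import json
from datetime import datetime, timedelta, timezone

log_dir = "/mnt/delta/events/_delta_log/"  # hypothetical table path
yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).date()

touched = set()
for f in dbutils.fs.ls(log_dir):
    # Commit files are named <version>.json; modificationTime is epoch millis
    if not f.name.endswith(".json"):
        continue
    if datetime.fromtimestamp(f.modificationTime / 1000, timezone.utc).date() != yesterday:
        continue
    for row in spark.read.text(f.path).collect():
        action = json.loads(row.value)
        if "add" in action:  # one "add" per data file written in the commit
            touched.add(tuple(sorted(action["add"]["partitionValues"].items())))

print(touched)  # e.g. {(('event_date', '2022-03-01'),)}

The collected partition values can then feed a targeted OPTIMIZE ... WHERE predicate for the compaction job.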

shan_chandra
by Databricks Employee
  • 8323 Views
  • 1 reply
  • 3 kudos

Resolved! Cannot reserve additional contiguous bytes in the vectorized reader (requested xxxxxxxxx bytes).

I got the below error when running a streaming workload from a source Delta table: Caused by: java.lang.RuntimeException: Cannot reserve additional contiguous bytes in the vectorized reader (requested xxxxxxxxx bytes). As a workaround, you can reduce ...

Latest Reply
shan_chandra
Databricks Employee

This is happening because the Delta/Parquet source has one or more of the following:
  • a huge number of columns
  • huge strings in one or more columns
  • huge arrays/maps, possibly nested in each other
In order to mitigate this issue, could you please reduce spar...
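
The truncated reply points at a Spark config; as a hedged sketch, these are the two settings commonly adjusted for this error:

# Fewer rows per batch means less contiguous memory per column vector
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", 512)  # default 4096

# Last resort: fall back to the non-vectorized Parquet reader
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")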

jsaddam28
by New Contributor III
  • 49409 Views
  • 24 replies
  • 15 kudos

How to import local python file in notebook?

For example, I have one.py and two.py in Databricks and I want to use one of the modules from one.py in two.py. On my local machine I usually do this with an import statement, like the one below in two.py: from one import module1 . . . How do I do this in Databricks?...

Latest Reply
StephanieAlba
Databricks Employee

Use Repos! Repos can call a function that is in a file in the same GitHub repo, as long as Files is enabled in the admin panel. So if I have utils.py with:
import pandas as pd

def clean_data():
    # Load wine data
    data = pd.read_csv("/dbfs/da...
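
For completeness, a minimal sketch of the calling side, using the module and function names from the reply (a notebook in the same repo, with Files enabled):

# Any notebook in the same repo can import the file like a normal module
from utils import clean_data

df = clean_data()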

23 More Replies
Sam
by New Contributor III
  • 3717 Views
  • 1 reply
  • 1 kudos

Resolved! Query Pushdown in Snowflake

Hi, I am wondering what documentation exists on query pushdown in Snowflake. I noticed that a single function (monotonically_increasing_id()) prevented the entire query from being pushed down to Snowflake during an ETL process. Is pushdown coming from the S...

Latest Reply
siddhathPanchal
Databricks Employee

Hi Sam, the Spark Connector applies predicate and query pushdown by capturing and analyzing the Spark logical plans for SQL operations. When the data source is Snowflake, the operations are translated into a SQL query and then executed in Snowflake to...
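
A minimal sketch for observing this behaviour (all connection options are placeholders; autopushdown is the connector's documented toggle):

sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",  # placeholder account
    "sfUser": "user", "sfPassword": "***",
    "sfDatabase": "DB", "sfSchema": "PUBLIC", "sfWarehouse": "WH",
    "autopushdown": "on",  # set to "off" to force Spark-side execution
}

df = (spark.read.format("snowflake")
      .options(**sf_options)
      .option("dbtable", "ALBUM")  # placeholder table
      .load())

# With pushdown active, explain() shows the filter folded into a single
# Snowflake query; adding monotonically_increasing_id() breaks that fold.
df.filter("ArtistId > 100").explain()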

MarcoCaviezel
by New Contributor III
  • 5680 Views
  • 5 replies
  • 3 kudos

Resolved! Use Spot Instances with Azure Data Factory Linked Service

In my pipeline I'm using Azure Data Factory to trigger Databricks notebooks as a linked service. I want to use spot instances for my job clusters. Is there a way to achieve this? I didn't find a way to do this in the GUI. Thanks for your help! Marco

Latest Reply
MarcoCaviezel
New Contributor III

Hi @Werner Stinckens, just a quick follow-up question. Does it make sense to you that you can select the following options in Azure Data Factory? To my understanding, "cluster version", "Python Version" and the "Worker options" are defined when I crea...
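
One workaround that fits this thread (an assumption on my part, not a route exposed in the ADF GUI): create a Databricks instance pool backed by spot VMs and point the ADF linked service at it via its instance pool option. A sketch against the Instance Pools API, with placeholder workspace URL and token:

import requests

resp = requests.post(
    "https://adb-1234567890.12.azuredatabricks.net/api/2.0/instance-pools/create",  # placeholder
    headers={"Authorization": "Bearer <personal-access-token>"},  # placeholder
    json={
        "instance_pool_name": "adf-spot-pool",
        "node_type_id": "Standard_DS3_v2",
        "azure_attributes": {
            # Evicted spot nodes fall back to on-demand VMs
            "availability": "SPOT_WITH_FALLBACK_AZURE",
            "spot_bid_max_price": -1,  # -1 = pay up to the on-demand price
        },
    },
)
print(resp.json())  # the returned instance_pool_id goes into the linked service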

4 More Replies
brendan-b
by New Contributor II
  • 10338 Views
  • 2 replies
  • 3 kudos

spark-xml not working with Databricks Connect and PySpark

Hi all, I currently have a cluster configured in Databricks with spark-xml (version com.databricks:spark-xml_2.12:0.13.0), which was installed using Maven. The spark-xml library itself works fine with PySpark when I am using it in a notebook within th...

Latest Reply
brendan-b
New Contributor II

@Sean Owen, I do not believe I have. Do you have any documentation on how to install spark-xml locally? I have tried the following with no luck. Is this what you are referring to? PYSPARK_HOME/bin/pyspark --packages com.databricks:spark-xml_2.12:0.13....
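
A minimal sketch of the local-session equivalent (assuming a plain local PySpark session; with Databricks Connect the Maven library must also be installed on the cluster, as it is here):

from pyspark.sql import SparkSession

# Resolve spark-xml from Maven when the local session starts
spark = (SparkSession.builder
         .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.13.0")
         .getOrCreate())

df = (spark.read.format("xml")
      .option("rowTag", "record")  # hypothetical row tag
      .load("/tmp/books.xml"))     # hypothetical path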

1 More Replies
Maverick1
by Valued Contributor II
  • 4880 Views
  • 8 replies
  • 14 kudos

Resolved! Real-time model serving and monitoring on Databricks at scale

How do you deploy a real-time model on Databricks at scale? Right now, model serving is limited to 20 requests per second. Also, there is no model monitoring framework/graphs like the ones provided with AzureML or SageMaker.

Latest Reply
sean_owen
Databricks Employee

I believe the next update to serving will include 1, not 2 (this is still within a Databricks workspace in a region). I don't think multi-model endpoints are next on the roadmap. How does Airflow integration relate?

7 More Replies
Artem_Y
by Databricks Employee
  • 1960 Views
  • 1 reply
  • 4 kudos

Embed Google Slides (PowerPoint) into Databricks Interactive Notebooks

Use the following code to embed your slides:
slide_id = '1CYEVsDqsdfg343fwg42MtXqGd68gffP-Y16CR59c'
slide_number = 'id.p9'

displayHTML(f'''
  <iframe src="https://docs.google.com/...

Latest Reply
Anonymous
Not applicable

@Artem Yevtushenko - Thank you for sharing this solution.

narek_margaryan
by New Contributor II
  • 2825 Views
  • 1 reply
  • 3 kudos

Resolved! Do Spark nodes read data from storage in a sequence?

I'm new to Spark and trying to understand how some of its components work. I understand that once the data is loaded into the memory of separate nodes, they process partitions in parallel, within their own memory (RAM). But I'm wondering whether the in...

Latest Reply
-werners-
Esteemed Contributor III

@Narek Margaryan, normally the reading is done in parallel because the underlying file system is already distributed (if you use HDFS-based storage or something like a data lake, for example). The number of partitions in the file itself also matters. This l...
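
A quick way to check this yourself (the path is hypothetical): the partition count of the loaded DataFrame is the number of read tasks Spark can run in parallel:

df = spark.read.parquet("/mnt/lake/events")  # hypothetical path

# Each partition is an independent unit of work; given enough executor
# cores, this many read tasks run concurrently.
print(df.rdd.getNumPartitions())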

Kotofosonline
by New Contributor III
  • 5354 Views
  • 2 replies
  • 3 kudos

Resolved! Query with distinct sort and alias produces error column not found

I’m trying to use a SQL query on Azure Databricks with a distinct sort and aliases:
SELECT DISTINCT album.ArtistId AS my_alias FROM album ORDER BY album.ArtistId
The problem is that if I add an alias, then I cannot use the non-aliased name in the ORDER BY ...
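
For reference, a sketch of the usual fix (order by the alias, so DISTINCT and ORDER BY agree on the projected column); table and column names follow the post:

spark.sql("""
    SELECT DISTINCT album.ArtistId AS my_alias
    FROM album
    ORDER BY my_alias
""").show()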

Latest Reply
Kotofosonline
New Contributor III

The code from above worked in both cases.

1 More Replies
dataslicer
by Contributor
  • 2944 Views
  • 1 reply
  • 1 kudos

Resolved! upgraded R package rlang to 0.4.11 on DBR 8.3 SC, but sessionInfo() still shows rlang as 0.4.9

I am using Azure Databricks Runtime (DBR) 8.3 ML with a Python notebook and R cells together. I want to use "tidyverse", and one of its dependencies is rlang >= 0.4.10, while the base DBR 8.3 ML provides rlang 0.4.9. I successfully upgraded the R package t...

Latest Reply
Sivaprasad1
Valued Contributor II

@Jim Huang: Could you please try restarting the session and running tidyverse again? It looks like the older version of rlang is loaded in the session. Error: package or namespace load failed for ‘tidyverse’ in loadNamespace(i, c(lib.loc, .libPaths()), versionC...

amichel
by New Contributor III
  • 3226 Views
  • 2 replies
  • 5 kudos

Resolved! Is there a stable, ideally official JMS/ActiveMQ connector for Spark?

We're delivering pipelines that are mostly based on Databricks Spark Streaming, Delta Lake and Azure Event Hubs, and there's a requirement to integrate with AMQ/JMS endpoints (Request and Response queues in ActiveMQ). Is there a proven way to integrat...

Latest Reply
Anonymous
Not applicable

@amichel, we have a feature request to add Structured Streaming support for Tibco EMS and JMS. Unfortunately, it's yet to be prioritized for the roadmap. I would suggest filing a feature request in our ideas portal: https://ideas.databricks.com/ideas/...

1 More Replies
vasanthvk
by New Contributor III
  • 9036 Views
  • 7 replies
  • 3 kudos

Resolved! Is there a way to automate Table creation in Databricks SQL based on a ADLS storage location which contains multiple Parquet files?

We have an ADLS container location which contains several (100+) data subject folders, each containing Parquet files with a partition column, and we want to expose each data subject folder as a table in Databricks SQL. Is there any way to au...

Latest Reply
User16857282152
Contributor

Updating dazfuller's suggestion, but including code for one level of partitioning. Of course, if you have deeper partitions then you will have to write a function and make a recursive call to get to the final directory containing Parquet files. Parquet wil...
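
A minimal sketch of the non-recursive, single-level version described above (container path and target schema are hypothetical):

base = "abfss://data@myaccount.dfs.core.windows.net/subjects/"  # hypothetical

for folder in dbutils.fs.ls(base):
    if not folder.isDir():
        continue
    table = folder.name.strip("/")
    # Register each folder as an external Parquet table in Databricks SQL
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS analytics.`{table}`
        USING PARQUET
        LOCATION '{folder.path}'
    """)
    # Pick up partition directories, if the folder is partitioned
    spark.sql(f"MSCK REPAIR TABLE analytics.`{table}`")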

6 More Replies
MartinB
by Contributor III
  • 10520 Views
  • 4 replies
  • 3 kudos

Resolved! Interoperability Spark ↔ Pandas: can't convert Spark dataframe to Pandas dataframe via df.toPandas() when it contains a datetime value in the distant future

Hi, I have multiple datasets in my data lake that feature valid_from and valid_to columns indicating the validity of rows. If a row is currently valid, this is indicated by valid_to=9999-12-31 00:00:00. Example: Loading this into a Spark dataframe works fine...

Latest Reply
shan_chandra
Databricks Employee

Currently, out-of-bound timestamps are not supported in PyArrow/pandas. Please refer to the associated JIRA issue: https://issues.apache.org/jira/browse/ARROW-5359?focusedCommentId=17104355&page=com.atlassian.jira.plugin.system.issuetabpanels%3...
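
Two hedged workarounds follow from that (the column name comes from the post; the cutoff is pandas' maximum nanosecond-precision timestamp, 2262-04-11):

import pyspark.sql.functions as F

# 1) Clamp the far-future sentinel before converting
cutoff = F.lit("2262-04-11").cast("timestamp")
pdf = (df.withColumn("valid_to",
                     F.when(F.col("valid_to") > cutoff, cutoff)
                      .otherwise(F.col("valid_to")))
         .toPandas())

# 2) Or disable Arrow so conversion goes through plain Python datetimes,
#    which do support year 9999 (slower for large frames)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf = df.toPandas()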

3 More Replies
