Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

r0nald
by New Contributor II
  • 3327 Views
  • 3 replies
  • 1 kudos

UDF not working inside transform() & lambda (SQL)

Below is a toy example of what I'm trying to achieve, but I don't understand why it fails. Can anyone explain why, and suggest a fix or a workaround that isn't overly bloated?

%sql
create or replace function status_map(status int)
returns string
return map(10, "STATU...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

The transform function in SQL is not the same as the Scala/PySpark counterpart; it is in fact a map(). Here is some interesting info. I agree that functions are essential for code modularity, hence I prefer not to use SQL but Scala/PySpark instead.
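A minimal PySpark sketch of that approach (the status codes and labels are assumptions, since the original example is truncated): instead of calling a UDF inside the lambda, express the mapping directly as column expressions, which higher-order functions like transform() can evaluate.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: arrays of integer status codes.
df = spark.createDataFrame([([10, 20],)], ["statuses"])

# Inline the mapping as column expressions inside the lambda; UDFs
# cannot be applied to the lambda variable of a higher-order function.
df = df.withColumn(
    "status_names",
    F.transform("statuses", lambda s: F.when(s == 10, "STATUS_A")
                                       .when(s == 20, "STATUS_B")),
)
df.show(truncate=False)
```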

2 More Replies
SaraCorralLou
by New Contributor III
  • 3407 Views
  • 7 replies
  • 2 kudos

Bad performance with UDF functions

Hello, I am contacting you because I am having a problem with the performance of my notebooks on Databricks. My notebook is written in Python (PySpark); in it I read a Delta table that I copy to a dataframe and do several transformations and create sever...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

Looping over records is a performance killer, to be avoided at all costs. See: beware the for-loop (databricks.com)
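To illustrate the point, a small sketch (the column name and arithmetic are hypothetical) contrasting a driver-side loop with the equivalent distributed column expression:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "amount")

# Anti-pattern: collect() pulls every row to the driver and the loop
# then runs single-threaded in Python.
# total = sum(row["amount"] * 1.21 for row in df.collect())

# Preferred: the same logic as a column expression runs distributed.
total = df.select(F.sum(F.col("amount") * 1.21)).first()[0]
print(total)
```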

6 More Replies
ricard98
by New Contributor II
  • 4814 Views
  • 6 replies
  • 5 kudos

How to integrate SAP ERP with Databricks

Is there a way to integrate SAP ERP with a Databricks notebook through Python?

Latest Reply
Kong
New Contributor II
  • 5 kudos

I've connected Databricks directly to S4/HANA ABAP layers, but will reiterate that it is extremely challenging if you do not have a background in system administration, networking, DevOps, programming, and SAP.
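For a first experiment, a common entry point is plain JDBC against the underlying HANA database. A hedged sketch (host, port, schema/table, and secret names are placeholders; it assumes the SAP HANA JDBC driver, ngdbc.jar, is installed on the cluster, and `spark`/`dbutils` are the Databricks notebook built-ins):

```python
# All connection details below are placeholders for illustration only.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sap://sap-host.example.com:30015")   # hypothetical host
    .option("driver", "com.sap.db.jdbc.Driver")               # requires ngdbc.jar
    .option("dbtable", "SAPABAP1.MARA")                       # hypothetical table
    .option("user", dbutils.secrets.get("sap-scope", "user"))
    .option("password", dbutils.secrets.get("sap-scope", "password"))
    .load()
)
df.display()
```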

5 More Replies
Chris_Shehu
by Valued Contributor III
  • 1949 Views
  • 2 replies
  • 1 kudos

Resolved! Custom Libraries (Unity Catalog Enabled Clusters)

I'm trying to use a custom library that I created from a .whl file in the workspace/shared location. The library attaches to the cluster without any issues and I can see it when I list the modules using pip. When I try to call the module I get an error t...

Latest Reply
Szpila
New Contributor II
  • 1 kudos

Hello guys, I am working on a project where we need to use the spark-excel library (Maven) in order to ingest data from Excel files. As those 3rd-party libraries are not allowed on shared clusters, do you have any workaround other than using pandas, for exa...
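For reference, the pandas workaround mentioned above can be as small as this sketch (file path, sheet name, and target table are placeholders; openpyxl must be pip-installed, and `spark` is the notebook built-in):

```python
import pandas as pd

# Read the workbook on the driver, then hand it to Spark. Suitable for
# files that fit in driver memory; path and names are hypothetical.
pdf = pd.read_excel(
    "/dbfs/FileStore/data/report.xlsx", sheet_name="Sheet1", engine="openpyxl"
)
df = spark.createDataFrame(pdf)
df.write.format("delta").mode("append").saveAsTable("bronze.report")
```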

1 More Replies
User15986662700
by New Contributor III
  • 3378 Views
  • 4 replies
  • 1 kudos
Connecting Databricks to a kerberized HBase cluster
Latest Reply
User15986662700
New Contributor III
  • 1 kudos

Yes, it is possible to connect Databricks to a kerberized HBase cluster. The attached article explains the steps. It consists of setting up a Kerberos client using a keytab on the cluster nodes, installing the hbase-spark integration library, and set...

3 More Replies
naga_databricks
by Contributor
  • 1977 Views
  • 2 replies
  • 0 kudos

Reading BigQuery data using a query

To read BigQuery data using spark.read, I'm using a query. This query executes and creates a table on the materializationDataset.

df = spark.read.format("bigquery") \
    .option("query", query) \
    .option("materializationProject", materializationProject) \
    ...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @naga_databricks, The Databricks documentation does not explicitly state that spark.read BigQuery format will create a Materialized View. Instead, it mentions that it can read from a BigQuery table or the result of a BigQuery SQL query. When you ...
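A sketch of the query-based read for comparison (project, dataset, and query are placeholders): the connector materializes the query result as a temporary table in materializationDataset before reading it, and viewsEnabled must be true for query reads.

```python
# GCP project, dataset, and SQL below are hypothetical; the service
# account needs permission to create the temporary materialization table.
df = (
    spark.read.format("bigquery")
    .option("viewsEnabled", "true")
    .option("materializationProject", "my-gcp-project")
    .option("materializationDataset", "tmp_dataset")
    .option("query", "SELECT id, amount FROM sales.orders WHERE amount > 0")
    .load()
)
```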

1 More Replies
EDDatabricks
by Contributor
  • 828 Views
  • 2 replies
  • 2 kudos

Appropriate storage account type for reference data (Azure)

Hello, we are using a reference dataset for our Production applications. We would like to create a Delta table for this dataset to be used by our applications. Currently, manual updates occur on this dataset through a script on a weekly basis. ...

Labels: Data Engineering, Delta Live Table, Storage account
Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

+1 for ADLS: hierarchical namespace and hot/cold/premium storage tiers, things not possible in plain blob storage.

1 More Replies
irispan
by New Contributor II
  • 2576 Views
  • 4 replies
  • 1 kudos

Recommended Hive metastore pattern for Trino integration

Hi, I have several questions regarding Trino integration: Is it recommended to use an external Hive metastore or leverage the Databricks-maintained Hive metastore when it comes to enabling external query engines such as Trino? When I tried to use ex...

Latest Reply
JunlinZeng
New Contributor II
  • 1 kudos

> Is it recommended to use an external Hive metastore or leverage the Databricks-maintained Hive metastore when it comes to enabling external query engines such as Trino?

The Databricks-maintained Hive metastore is not suggested to be used externally. ...

3 More Replies
Agus1
by New Contributor III
  • 3321 Views
  • 3 replies
  • 3 kudos

Update destination table when using Spark Structured Streaming and Delta tables

I’m trying to implement a streaming pipeline that will run hourly using Spark Structured Streaming, Scala, and Delta tables. The pipeline will process different items with their details. The sources are Delta tables that already exist, written hourly u...

Latest Reply
Tharun-Kumar
Honored Contributor II
  • 3 kudos

@Agus1 Could you try using CDC in Delta? You could use readChangeFeed to read only the changes that got applied on the source table. This is also explained here: https://learn.microsoft.com/en-us/azure/databricks/delta/delta-change-data-feed
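A minimal sketch of reading the change feed (table name is a placeholder; the source table must have delta.enableChangeDataFeed = true before changes are recorded):

```python
# Stream only the changes from the source Delta table.
changes = (
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .table("source_db.items")  # hypothetical table name
)
# Each row carries _change_type, _commit_version, and _commit_timestamp
# columns, which can drive a MERGE into the destination table.
```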

2 More Replies
Eric_Kieft
by New Contributor III
  • 1886 Views
  • 2 replies
  • 1 kudos

Unity Catalog Table/View Column Data Type Changes

When changing a Delta table column data type in Unity Catalog, we noticed that a view referencing that table did not automatically update to reflect the new data type. Is there a way to update the Delta table column data type so that it also update...

Latest Reply
Lakshay
Esteemed Contributor
  • 1 kudos

Can you try refreshing the view by running the command: REFRESH TABLE <viewname>

1 More Replies
suresh1122
by New Contributor III
  • 10024 Views
  • 11 replies
  • 7 kudos

Dataframe takes an unusually long time (around 2 hrs) to save as a Delta table for a very small dataset with 30k rows. Is there a solution for this problem?

I am trying to save a dataframe, after a series of data manipulations using UDF functions, to a Delta table. I tried using this code:

df.write \
    .format('delta') \
    .mode('overwrite') \
    .option('overwriteSchema', 'true') \
    .saveAsTable('output_table')

but this...

Latest Reply
Lakshay
Esteemed Contributor
  • 7 kudos

You should also look into the SQL plan to verify whether the writing phase is indeed the part that is taking the time. Since Spark works on lazy evaluation, some other phase might be the actual bottleneck.
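One way to isolate the slow phase, as a sketch (assumes `df` is the dataframe from the question):

```python
# Materialize the full lineage once; if this count is already slow, the
# upstream UDF transformations are the bottleneck, not the Delta write.
df.cache()
df.count()
df.explain(mode="formatted")  # inspect the physical plan for expensive stages
```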

10 More Replies
NDK
by New Contributor II
  • 1287 Views
  • 1 reply
  • 0 kudos

Soft stop a Streaming Job

I have an Auto Loader streaming job with continuous run; I want to stop that job on weekends for some time and restart it again.

Vibhor
by Contributor
  • 2740 Views
  • 5 replies
  • 4 kudos

Resolved! Cluster Performance

Facing an issue with cluster performance; in the event log we can see that the cluster is not responsive, likely due to GC. The number of pipelines (Databricks notebooks) running and the cluster configuration are the same as before, but we started seeing this issue sin...

Latest Reply
jose_gonzalez
Moderator
  • 4 kudos

Hi @Vibhor Sethi, do you see any other error messages? Did your data volume increase? What kind of job are you running?

4 More Replies
ajain80
by New Contributor III
  • 11894 Views
  • 6 replies
  • 10 kudos

Resolved! SFTP Connect

How can I connect to an SFTP server from Databricks, so I can write files into tables directly?

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 10 kudos

The classic solution is to copy the data from the (S)FTP server to ADLS storage using Azure Data Factory, and after the copy is done in the ADF pipeline, trigger the Databricks notebook.
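If you want to stay inside Databricks instead, a hedged sketch using paramiko (host, credentials, paths, and table name are placeholders; paramiko must be pip-installed, and `dbutils`/`spark` are the notebook built-ins):

```python
import paramiko

# Connection details are hypothetical; credentials come from a secret scope.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(
    username="loader", password=dbutils.secrets.get("sftp-scope", "password")
)
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.get("/outbound/data.csv", "/tmp/data.csv")  # pull file to local disk
sftp.close()
transport.close()

# Read the landed file and write it into a table.
df = spark.read.csv("file:/tmp/data.csv", header=True)
df.write.mode("append").saveAsTable("bronze.sftp_data")
```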

5 More Replies