Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Brad
by Contributor II
  • 1053 Views
  • 3 replies
  • 0 kudos

Will MERGE incur a lot of driver memory?

Hi team, we have a job that runs MERGE on a target table with around 220 million rows. We found it needs a lot of driver memory (just for the MERGE itself). From the job metrics we can see the MERGE needs at least 46 GB of memory. Is there some special thing to mak...

Latest Reply
filipniziol
Esteemed Contributor
  • 0 kudos

Hi @Brad, could you try applying some very standard optimization practices and check the outcome: 1. If your runtime is 15.2 or greater, could you implement liquid clustering on the source and target tables using the JOIN columns? ALTER TABLE <table_name> CL...

2 More Replies
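The truncated reply above appears to continue into Delta liquid clustering DDL. A minimal sketch of what that likely looks like — table and column names are placeholders, and Databricks Runtime 15.2+ is assumed, as the reply states:

```sql
-- Cluster both sides of the MERGE on the join key (placeholder names).
ALTER TABLE target_table CLUSTER BY (join_key);
ALTER TABLE source_table CLUSTER BY (join_key);
-- Recluster data already written to the target.
OPTIMIZE target_table;
```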
hcord
by New Contributor II
  • 1407 Views
  • 1 replies
  • 2 kudos

Resolved! Trigger a workflow from a different databricks environment

Hello everyone, at the company where I work we have a lot of different Databricks environments, and now we need deeper integration of processes between environments X and Y. There's a workflow in Y that runs a process that, when finished, we would like ...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 2 kudos

Hi @hcord, you can use the REST API in the last task to trigger a workflow in a different workspace.

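To make the suggestion concrete: a hedged Python sketch of the Jobs API `run-now` call the reply points to. The host URL, token, and job ID are placeholders for the target workspace; the request is built but deliberately not sent here.

```python
import json
import urllib.request

def build_run_now_request(host, token, job_id, params=None):
    """Build (but do not send) a Databricks Jobs API 2.1 run-now request.

    host/token/job_id are placeholders for your target workspace."""
    body = json.dumps({"job_id": job_id, "job_parameters": params or {}}).encode()
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/run-now",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

# In the last task of workflow Y, urllib.request.urlopen(req) (or the
# databricks-sdk) would fire the job in workspace X.
req = build_run_now_request("https://adb-target.example.net", "dapi-XXX", 123)
```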
sshynkary
by New Contributor
  • 2547 Views
  • 1 replies
  • 0 kudos

Loading data from spark dataframe directly to Sharepoint

Hi guys! I am trying to load data directly from a PySpark DataFrame to a SharePoint folder and I cannot find a solution for it. I wanted to implement a workaround using volumes and Logic Apps, but there are a few issues. I need to partition the df in a few f...

Data Engineering
SharePoint
spark
Latest Reply
ChKing
New Contributor II
  • 0 kudos

One approach could involve using Azure Data Lake as an intermediary. You can partition your PySpark DataFrames and load them into Azure Data Lake, which is optimized for large-scale data storage and integrates well with PySpark. Once the data is in A...

dpc
by New Contributor III
  • 11316 Views
  • 4 replies
  • 2 kudos

Resolved! Remove Duplicate rows in tables

Hello
I've seen posts that show how to remove duplicates, something like this:
MERGE INTO [deltatable] AS target
USING (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY [primary keys] ORDER BY [date] DESC) AS rn
  FROM [deltatable]
  QUALIFY rn > 1
) AS source
ON ...

Latest Reply
filipniziol
Esteemed Contributor
  • 2 kudos

Hi @dpc, if you like using SQL:
1. Test data:
# Sample data
data = [("1", "A"), ("1", "A"), ("2", "B"), ("2", "B"), ("3", "C")]
# Create DataFrame
df = spark.createDataFrame(data, ["id", "value"])
# Write to Delta table
df.write.format("delta").mode(...

3 More Replies
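The pattern discussed in this thread — number the rows per key, keep one, drop the rest — can be illustrated in plain Python, independent of Delta. Column names here are invented for the example:

```python
# Keep the newest row per key, mirroring
# ROW_NUMBER() OVER (PARTITION BY key ORDER BY date DESC) = 1.
rows = [
    {"id": "1", "value": "A",  "date": "2024-01-01"},
    {"id": "1", "value": "A2", "date": "2024-02-01"},  # newer duplicate of id 1
    {"id": "2", "value": "B",  "date": "2024-01-15"},
]

def latest_per_key(rows, key, order_by):
    best = {}
    for r in rows:
        k = r[key]
        # Replace the stored row only if this one sorts later.
        if k not in best or r[order_by] > best[k][order_by]:
            best[k] = r
    return list(best.values())

deduped = latest_per_key(rows, "id", "date")
```

In Spark the same effect comes from a window function plus a filter on the row number, or `dropDuplicates` when ordering does not matter.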
397973
by New Contributor III
  • 845 Views
  • 1 replies
  • 0 kudos

First time to see "Databricks is experiencing heavy load" message. What does it mean really?

Hi, I just went to run a Databricks PySpark notebook and saw this message. This is a notebook I've run before, but I never saw this message. Is it referring to my cluster? The Databricks infrastructure? My notebook ran normally, just wondering though. Google sea...

Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

I never saw that message, but my guess is it's not your cluster but the Databricks platform in your region. status.databricks.com perhaps has some info.

MustangR
by New Contributor
  • 2195 Views
  • 2 replies
  • 0 kudos

Delta Table Upsert fails when source attributes are missing

Hi all, I am trying to merge JSON into a Delta table. Since the JSON comes from MongoDB, which does not enforce a schema, there is a chance of missing attributes that the Delta table's schema validation expects. Schema evolution is enabled as well. H...

Latest Reply
JohnM256
New Contributor II
  • 0 kudos

How do I set Existing Optional Columns?

1 More Replies
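One common pre-MERGE workaround for the missing-attribute problem (not necessarily the poster's final fix): pad each incoming document with None for every column the target schema expects before writing. Field names here are invented for illustration:

```python
# Columns the Delta table expects (placeholder schema).
target_schema = ["_id", "name", "email", "updated_at"]

def conform(doc, schema):
    """Return a dict with exactly the schema's keys; absent fields become None."""
    return {col: doc.get(col) for col in schema}

# MongoDB-style documents with differing shapes.
docs = [{"_id": 1, "name": "a"}, {"_id": 2, "email": "b@x"}]
conformed = [conform(d, target_schema) for d in docs]
```

After this step every record has the same shape, so schema validation on the MERGE no longer trips over absent source attributes.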
Paul_Poco
by New Contributor II
  • 77238 Views
  • 5 replies
  • 6 kudos

Asynchronous API calls from Databricks

Hi, I have to send thousands of API calls from a Databricks notebook to an API to retrieve some data. Right now, I am using a sequential approach with the Python requests package. As the performance is not acceptable anymore, I need to send my API c...

Latest Reply
adarsh8304
New Contributor II
  • 6 kudos

Hey @Paul_Poco, what about using ProcessPoolExecutor or ThreadPoolExecutor from the concurrent.futures module? Have you tried them?

4 More Replies
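A minimal sketch of the concurrent.futures approach suggested in the reply. The network call is stubbed out so the example is self-contained; in practice `fetch` would wrap `requests.get` with error handling and rate limiting, and the URLs are invented:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Stand-in for requests.get(url).json(); I/O-bound work suits threads.
    return f"payload for {url}"

urls = [f"https://api.example.com/item/{i}" for i in range(100)]

# Issue calls concurrently instead of one by one.
with ThreadPoolExecutor(max_workers=16) as pool:
    futures = {pool.submit(fetch, u): u for u in urls}
    results = [f.result() for f in as_completed(futures)]
```

ThreadPoolExecutor fits I/O-bound API calls; ProcessPoolExecutor only helps when the per-call work is CPU-bound, since processes add serialization overhead.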
priyansh
by New Contributor III
  • 2293 Views
  • 3 replies
  • 0 kudos

How does Photon acceleration actually work?

Hey folks! I would like to know how Photon acceleration actually works. I have tested it on samples of 219 MB, 513 MB, 2.7 GB, and 4.1 GB of data, and the difference in seconds between normal and Photon-accelerated compute was not much. So my questi...

Latest Reply
arch_db
New Contributor III
  • 0 kudos

Try checking a MERGE operation on tables over 200 GB.

2 More Replies
EricCournarie
by New Contributor III
  • 966 Views
  • 2 replies
  • 0 kudos

Metadata on a prepared statement returns uppercase column names

Hello, using the JDBC driver, when I check the metadata of a prepared statement, the column names are all uppercase. This does not happen when running a DESCRIBE on the same SELECT. Are there any properties to set, or is it a known issue? Or a workaro...

Latest Reply
gchandra
Databricks Employee
  • 0 kudos

Looks like a bug. Can you try using double quotes, SELECT "ColumnName", instead of backticks?

1 More Replies
shsalami
by New Contributor III
  • 1317 Views
  • 2 replies
  • 0 kudos

Sample streaming table fails

Running the following Databricks sample code in the pipeline: CREATE OR REFRESH STREAMING TABLE customers AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv") I got the error: org.apache.spark.sql.catalyst.ExtendedAnalysisExcep...

Latest Reply
shsalami
New Contributor III
  • 0 kudos

There is no table with that name. Also, only the following file exists in that folder: dbfs:/databricks-datasets/retail-org/customers/customers.csv

1 More Replies
shsalami
by New Contributor III
  • 2774 Views
  • 2 replies
  • 1 kudos

Resolved! Materialized view creation fails

I have ALL_PRIVILEGES and USE_SCHEMA on the lhdev.gld_sbx schema, but the following command failed with the error 'DriverException: Unable to process statement for Table customermvx': CREATE MATERIALIZED VIEW customermvx AS SELECT * FROM lhdev.gl...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @shsalami, according to the documentation snippet below, you also need the USE CATALOG privilege on the parent catalog: "The user who creates a materialized view (MV) is the MV owner and needs to have the following permissions: SELECT privilege over the ba...

1 More Replies
shinaushin
by New Contributor II
  • 6299 Views
  • 15 replies
  • 3 kudos

Session expired, cannot log back in to Community Edition

Whenever I am logged into my Community Edition account and leave it idle for a bit, it says that my session has expired, which is understandable. However, when I try logging back in with the exact same credentials, I receive an error saying that a Co...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 3 kudos

Hi All, Between 22:00 UTC on August 14, 2024, and 14:43 UTC on August 19, 2024, you may have been a user of Databricks’ Community Edition who experienced errors citing “Your session has expired, please authenticate again” within a few minutes of logg...

14 More Replies
TCK
by New Contributor II
  • 1606 Views
  • 2 replies
  • 0 kudos

Embedding external content and videos via IFrame

Hello there, I'm currently creating a notebook which contains a training course for data engineering. For certain topics it would be nice to embed external resources like YouTube videos so participants do not have to leave the notebook to watch the vid...

Latest Reply
gchandra
Databricks Employee
  • 0 kudos

Did you try using displayHTML()? https://docs.databricks.com/en/visualizations/html-d3-and-svg.html

1 More Replies
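A sketch of the displayHTML() route the reply suggests: build an iframe snippet in Python, then hand it to displayHTML inside a notebook. The video ID is just an example, and displayHTML itself exists only in the Databricks notebook environment:

```python
def youtube_iframe(video_id, width=640, height=360):
    """Return the HTML for an embedded YouTube player (example markup)."""
    return (
        f'<iframe width="{width}" height="{height}" '
        f'src="https://www.youtube.com/embed/{video_id}" '
        f'frameborder="0" allowfullscreen></iframe>'
    )

html = youtube_iframe("dQw4w9WgXcQ")
# displayHTML(html)  # uncomment inside a Databricks notebook
```

Note that some sites block embedding via X-Frame-Options/CSP headers, so not every external page will render this way.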
johnb1
by Contributor
  • 2927 Views
  • 5 replies
  • 0 kudos

Resolved! SQL UDF vs. Python UDF, SQL UDF vs. Pandas UDF

I would like to understand how (1) SQL UDFs compare to Python UDFs and (2) SQL UDFs compare to Pandas UDFs, especially in terms of performance. I cannot find any documentation on these topics, not even in the official Databricks documentation (which unfortunate...

Latest Reply
gchandra
Databricks Employee
  • 0 kudos

The first sublink covers SQL UDFs, where you can write the UDF body in either SQL or Python. This Python implementation is different from the one mentioned above. https://docs.databricks.com/en/udf/unity-catalog.html

4 More Replies
techie001
by New Contributor
  • 805 Views
  • 1 replies
  • 0 kudos

Delta Live Tables vs. Azure SQL DB for a read-intensive application

Hi, I am looking for some advice comparing cost and performance between Delta Live Tables and Azure SQL DB (files in Azure Blob) for building the backend of a web application. There would be very frequent read operations (multiple searches every second) an...

Latest Reply
gchandra
Databricks Employee
  • 0 kudos

As far as I know, Azure SQL DB is an RDBMS, whereas DLT is for building data pipelines.

