Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Karthik_2
by New Contributor
  • 715 Views
  • 1 reply
  • 0 kudos

Query on SQL Warehouse Concurrency in Azure Databricks

Hi, We are planning to migrate the backend of our web application, currently hosted on App Service with an Azure SQL Database, to Azure Databricks as the data source. For this, we intend to use the SQL Warehouse in Databricks to execute queries and in...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Hello Karthik, many thanks for your question. Databricks SQL Warehouses use dynamic concurrency to handle varying demands. Unlike static-capacity warehouses, Databricks SQL adjusts compute resources in real time to manage concurrent loads and maximiz...

tseader
by New Contributor III
  • 2164 Views
  • 3 replies
  • 1 kudos

Resolved! Python SDK clusters.create_and_wait - Sourcing from cluster-create JSON

I am attempting to create a compute cluster using the Python SDK while sourcing a cluster-create configuration JSON file, which is how it's done for the databricks-cli and what Databricks provides through the GUI. Reading in the JSON as a Dict fails...

Latest Reply
tseader
New Contributor III
  • 1 kudos

@Retired_mod The structure of the `cluster-create.json` is perfectly fine. The issue, as stated above, is that the SDK does not allow nested structures from the JSON file to be used; instead they need to be cast to spec...
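A minimal sketch of that casting step, assuming the databricks-sdk Python package and a hypothetical cluster-create.json whose nested "autoscale" section is converted to the SDK's typed class before calling create_and_wait:

  import json

  from databricks.sdk import WorkspaceClient
  from databricks.sdk.service import compute

  w = WorkspaceClient()

  # Load the same cluster-create JSON used with the CLI (path is a placeholder).
  with open("cluster-create.json") as f:
      cfg = json.load(f)

  # Nested sections such as "autoscale" cannot be passed through as plain dicts;
  # cast them to the SDK's typed classes first.
  autoscale = compute.AutoScale(**cfg.pop("autoscale")) if "autoscale" in cfg else None

  cluster = w.clusters.create_and_wait(
      cluster_name=cfg["cluster_name"],
      spark_version=cfg["spark_version"],
      node_type_id=cfg["node_type_id"],
      autoscale=autoscale,
  )
  print(cluster.cluster_id)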

2 More Replies
praful
by New Contributor II
  • 2443 Views
  • 5 replies
  • 1 kudos

Recover Lost Notebook

Hi Team, I was using Databricks Community Edition for learning purposes. I had an account https://community.cloud.databricks.com/?o=6822095545287159 where I stored all my learning notebooks. Unfortunately, this account suddenly stopped working, and I ...

Latest Reply
Walter_C
Databricks Employee
  • 1 kudos

The workspace ID you have shared seems to belong to a workspace that is still in a running state. If you have lost login access to this workspace, the team you have reached over email will be able to assist. I will add the following doc for s...

4 More Replies
minhhung0507
by Contributor III
  • 875 Views
  • 7 replies
  • 4 kudos

Resolved! How to reduce cost of "Regional Standard Class A Operations"

Hi Databricks experts, We're experiencing unexpectedly high costs from Regional Standard Class A Operations in GCS while running a Databricks pipeline. The costs seem related to frequent metadata queries, possibly tied to Delta table operations. In las...

Latest Reply
VZLA
Databricks Employee
  • 4 kudos

@minhhung0507 it's hard to say without having more direct insight, but generally speaking many streaming jobs with very frequent intervals will likely contribute; 300 jobs triggered continuously will also contribute depending on the use case of these j...

6 More Replies
pacman
by New Contributor
  • 11201 Views
  • 6 replies
  • 0 kudos

How to run a saved query from a Notebook (PySpark)

Hi Team! Noob to Databricks, so apologies if I ask a dumb question. I have created a relatively large series of queries that fetch and organize the data I want. I'm ready to drive all of these from a Notebook (likely PySpark). An example query is save...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

You can also get the query IDs by listing the queries through the API: https://docs.databricks.com/api/workspace/queries/list
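A minimal sketch of that call, assuming a personal access token and that the endpoint behind that doc page is GET /api/2.0/sql/queries returning a results list (host, token, and field names are placeholders):

  import requests

  host = "https://<workspace-host>"
  token = "<personal-access-token>"

  resp = requests.get(f"{host}/api/2.0/sql/queries",
                      headers={"Authorization": f"Bearer {token}"})
  resp.raise_for_status()
  for q in resp.json().get("results", []):
      print(q.get("id"), q.get("display_name"))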

5 More Replies
143260
by New Contributor
  • 8675 Views
  • 2 replies
  • 1 kudos

Convert SQL Query to Dataframe

Hello, Being relatively new to the Databricks world, I'm hoping someone can show me how to take a SQL query and put the results into a dataframe. As part of a data validation project, I'd like to cross join two dataframes.

Latest Reply
Antoine_B
Contributor
  • 1 kudos

From a PySpark notebook, you could do: df = spark.sql("SELECT * FROM my_table WHERE ..."). Then you can use this df and cross join it to another DataFrame. If you are new to Databricks, I suggest you follow some of the self-paced lessons in Databric...
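A minimal sketch of that approach (table names are placeholders):

  # Run SQL and get the results back as DataFrames.
  df1 = spark.sql("SELECT id, val FROM my_catalog.my_schema.table_a")
  df2 = spark.sql("SELECT code FROM my_catalog.my_schema.table_b")

  # Cross join: every row of df1 paired with every row of df2.
  crossed = df1.crossJoin(df2)
  display(crossed)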

1 More Replies
zed
by New Contributor III
  • 983 Views
  • 6 replies
  • 0 kudos

Resolved! ConcurrentAppendException in Feature Engineering write_table

I am using the Feature Engineering client when writing to a time series feature table. Then I have created two Databricks jobs with the below code. I am running with different run_dates (e.g. '2016-01-07' and '2016-01-08'). When they run concurrently,...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@zed Clustering by your date column can indeed help avoid the ConcurrentAppendException without incurring the strict partitioning constraints that a “time series feature table” normally disallows. Unlike partitioning, CLUSTER BY does not create physi...
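A minimal sketch of applying liquid clustering on the date column as suggested (table and column names are placeholders; whether your feature table allows altering its clustering should be verified against the Feature Engineering docs):

  spark.sql("ALTER TABLE my_catalog.fs.my_feature_table CLUSTER BY (run_date)")
  spark.sql("OPTIMIZE my_catalog.fs.my_feature_table")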

5 More Replies
Einsatz
by New Contributor II
  • 895 Views
  • 4 replies
  • 2 kudos

Resolved! Photon-enabled UC cluster has less executor memory (1/4th) compared to normal cluster.

I have a Unity Catalog enabled cluster with node type Standard_DS4_v2 (28 GB Memory, 8 Cores). When the "Use Photon Acceleration" option is disabled, spark.executor.memory is 18409m. But if I enable Photon Acceleration, it shows spark.executor.memory as 46...

Latest Reply
Walter_C
Databricks Employee
  • 2 kudos

The memory allocated to the Photon engine is not fixed; it is based on a percentage of the node’s total memory. To calculate the value of spark.executor.memory based on a specific node type, you can use the following formula: container_size = (vm_si...
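Plugging the quoted constants into the first step of that formula for a Standard_DS4_v2 node (28 GB) gives roughly the container size below; the truncated remainder of the reply (how the container is split between Photon and the JVM heap) is not reproduced here:

  vm_size_mb = 28 * 1024                        # Standard_DS4_v2 memory in MB
  container_size_mb = vm_size_mb * 0.97 - 4800  # formula quoted in the reply
  print(round(container_size_mb))               # ~23012 MB for the container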

3 More Replies
TejeshS
by New Contributor III
  • 1159 Views
  • 1 reply
  • 1 kudos

How to identify which columns we need to consider for liquid clustering from a table of 200+ columns

In Databricks, when working with a table that has a large number of columns (e.g., 200), it can be challenging to determine which columns are most important for liquid clustering. Objective: The goal is to determine which columns to select based on th...

Latest Reply
Alberto_Umana
Databricks Employee
  • 1 kudos

Hi @TejeshS, Thanks for your post! To determine which columns are most important for liquid clustering in a table with a large number of columns, you should focus on the columns that are most frequently used in query filters and those that can signif...
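A minimal sketch of applying and verifying a choice of clustering keys once the most frequently filtered columns have been shortlisted (table and column names are hypothetical; liquid clustering currently supports up to four columns):

  spark.sql("ALTER TABLE my_catalog.my_schema.wide_table CLUSTER BY (customer_id, event_date)")
  # Confirm which clustering columns are now in effect.
  display(spark.sql("DESCRIBE DETAIL my_catalog.my_schema.wide_table").select("clusteringColumns"))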

guiferviz
by New Contributor III
  • 1770 Views
  • 7 replies
  • 3 kudos

Resolved! How to Determine if Materialized View is Performing Full or Incremental Refresh?

I'm currently testing materialized views and I need some help understanding the refresh behavior. Specifically, I want to know if my materialized view is querying the full table (performing a full refresh) or just doing an incremental refresh. From so...

Latest Reply
TejeshS
New Contributor III
  • 3 kudos

To validate the status of your materialized view (MV) refresh, run a DESCRIBE EXTENDED command and check the row corresponding to the "last refresh status type." RECOMPUTE indicates a full load execution was completed. NO_OPERATION means no operation w...
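A minimal sketch of that check from a notebook (the MV name is a placeholder):

  info = spark.sql("DESCRIBE EXTENDED my_catalog.my_schema.my_materialized_view")
  # Look for the refresh-related rows mentioned above.
  display(info.filter("lower(col_name) LIKE '%refresh%'"))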

6 More Replies
PiotrM
by New Contributor III
  • 518 Views
  • 2 replies
  • 0 kudos

Canceling long-running queries on UC-enabled all-purpose clusters

Hey, as in the subject. Is it possible to set a timeout for long-running queries on all-purpose clusters that are UC enabled? I know there is such a setting for SQL Warehouses and Workflows, but I was unable to find one for all-purpose clusters. The issu...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@PiotrM thanks for your question! Adding to @Alberto_Umana's comment, could you please clarify what you mean by "I tried things like spark.task.reaper.killTimeout, but it seems like UC clusters won't accept it"? Is it throwing an error or is it ...

1 More Replies
berserkersap
by Contributor
  • 6022 Views
  • 4 replies
  • 1 kudos

Speed Up JDBC Write from Databricks Notebook to MS SQL Server

Hello Everyone, I have a use case where I need to write a Delta table from Databricks to a SQL Server table using PySpark / Python / Spark SQL. The Delta table I am writing contains around 3 million records and the SQL Server table is neither partitione...

Labels: Data Engineering, JDBC, MS SQL Server, pyspark, Table Write
Latest Reply
VZLA
Databricks Employee
  • 1 kudos

@berserkersap have you had time to identify where the bottleneck is? e.g. sequential writes, network latency/throughput, or maybe a connection pool in the target that is much smaller than the number of connection threads in the source?
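If the writes turn out to be sequential, a minimal sketch of the knobs that usually matter for plain JDBC writes (connection details are placeholders; values should be tuned to the SQL Server's sizing):

  df = spark.table("my_catalog.my_schema.source_delta_table")   # placeholder source table

  (df.repartition(8)                                   # number of parallel write streams
     .write.format("jdbc")
     .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
     .option("dbtable", "dbo.target_table")
     .option("user", "<user>")
     .option("password", "<password>")
     .option("batchsize", 10000)                       # rows per JDBC round trip
     .mode("append")
     .save())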

3 More Replies
guangyi
by Contributor III
  • 653 Views
  • 2 replies
  • 0 kudos

How to identify the mandatory fields of the create clusters API

After several attempts I found some mandatory fields for the cluster creation API: num_workers, spark_version, node_type_id. I am not finding these fields documented directly against the API, but rather via the job cluster definition in the asset bundle YAML file. I ask the Chat...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@guangyi thanks for your question! I understand your concerns. Looking through the docs I could only find a few with the "required" metadata tag, while most seem to be implicitly assumed, e.g.: singleNode with num_workers 0, and similar requirements....
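For reference, a minimal sketch of a payload exercising just those implicitly required fields against the clusters create endpoint (values and runtime string are placeholders; additional fields may still be required depending on workspace policies):

  import requests

  host = "https://<workspace-host>"
  token = "<personal-access-token>"

  payload = {
      "cluster_name": "api-field-test",
      "spark_version": "15.4.x-scala2.12",     # any supported runtime string
      "node_type_id": "Standard_DS3_v2",
      "num_workers": 1,                        # exactly one of num_workers / autoscale
  }
  resp = requests.post(f"{host}/api/2.1/clusters/create",
                       headers={"Authorization": f"Bearer {token}"},
                       json=payload)
  print(resp.status_code, resp.json())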

1 More Replies
vivek_cloudde
by New Contributor III
  • 1690 Views
  • 8 replies
  • 2 kudos

Resolved! Issue while creating on-demand cluster in azure databricks using pyspark

Hello, I am trying to create an on-demand cluster in Azure Databricks using the below code and I am getting the error message {"error_code":"INVALID_PARAMETER_VALUE","message":"Exactly 1 of virtual_cluster_size, num_workers or autoscale must be specified."...

Latest Reply
VZLA
Databricks Employee
  • 2 kudos

@vivek_cloudde I still find it interesting that for all these different misconfigurations or wrong cluster definitions you got the same error message, but anyway, happy to hear it worked! If it helps, next time and to make things simpler, ...

7 More Replies
nikhil_kumawat
by New Contributor II
  • 1103 Views
  • 8 replies
  • 2 kudos

Not able to retain precision while reading data from source file

Hi, I am trying to read a CSV file located in an S3 bucket folder. The CSV file contains around 50 columns, one of which is "litre_val", which contains values like "60211.952", "59164.608". Up to 3 decimal points. Now to read this CSV we ...

Latest Reply
VZLA
Databricks Employee
  • 2 kudos

@nikhil_kumawat can you provide more details so we can reproduce this and better help you? e.g. sample data set, DBR version, reproducer code, etc. I'm using this sample data: csv_content = """column1,column2,litre_val,another_decimal_column 1,TypeA,60211...
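In the meantime, a minimal sketch of reading the column with an explicit decimal schema so the three decimal places are retained (path and column list are placeholders based on the names mentioned in the thread):

  from pyspark.sql.types import DecimalType, StringType, StructField, StructType

  schema = StructType([
      StructField("column1", StringType()),
      StructField("column2", StringType()),
      StructField("litre_val", DecimalType(18, 3)),             # keeps 60211.952 exact
      StructField("another_decimal_column", DecimalType(18, 3)),
  ])
  df = spark.read.csv("s3://<bucket>/<folder>/", header=True, schema=schema)
  df.show(truncate=False)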

7 More Replies
