Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

pacman
by New Contributor
  • 10966 Views
  • 6 replies
  • 0 kudos

How to run a saved query from a Notebook (PySpark)

Hi Team! Noob to Databricks, so apologies if I ask a dumb question. I have created a relatively large series of queries that fetch and organize the data I want. I'm ready to drive all of these from a Notebook (likely PySpark). An example query is save...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

You can also get the query IDs by listing the queries through an API call: https://docs.databricks.com/api/workspace/queries/list
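A minimal sketch of that approach, assuming the databricks-sdk Python package is installed on the cluster and workspace authentication is already configured (attribute names may differ between API versions):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
# List the saved queries and print each query's ID next to its name,
# so the ID can then be referenced from the notebook.
for q in w.queries.list():
    print(q.id, getattr(q, "display_name", None) or getattr(q, "name", None))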

5 More Replies
143260
by New Contributor
  • 8366 Views
  • 2 replies
  • 1 kudos

Convert SQL Query to Dataframe

Hello, being relatively new to the Databricks world, I'm hoping someone can show me how to take a SQL query and put the results into a dataframe. As part of a data validation project, I'd like to cross join two dataframes.

Latest Reply
Antoine_B
Contributor
  • 1 kudos

From a PySpark notebook, you could do: df = spark.sql("SELECT * FROM my_table WHERE ..."). Then you can use this df and cross join it to another DataFrame. If you are new to Databricks, I suggest you follow some of the self-paced lessons in Databric...
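A short sketch of what that could look like in a notebook cell (table and column names are placeholders):

# Run a SQL statement into a DataFrame...
df = spark.sql("SELECT * FROM my_table WHERE some_column = 'some_value'")
other_df = spark.sql("SELECT * FROM my_other_table")

# ...then cross join it with the second DataFrame for the validation step.
validation_df = df.crossJoin(other_df)
display(validation_df)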

1 More Replies
zed
by New Contributor III
  • 948 Views
  • 6 replies
  • 0 kudos

Resolved! ConcurrentAppendException in Feature Engineering write_table

I am using the Feature Engineering client when writing to a time series feature table. Then I have created two Databricks jobs with the below code. I am running with different run_dates (e.g. '2016-01-07' and '2016-01-08'). When they run concurrently,...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@zed Clustering by your date column can indeed help avoid the ConcurrentAppendException without incurring the strict partitioning constraints that a “time series feature table” normally disallows. Unlike partitioning, CLUSTER BY does not create physi...
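As a rough illustration of that suggestion (catalog, schema, table, and column names below are placeholders), liquid clustering on the date column could be enabled with something like:

# Cluster the feature table by its date column so concurrent jobs writing
# different run_dates are less likely to conflict on the same files.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.my_feature_table
    CLUSTER BY (run_date)
""")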

5 More Replies
Einsatz
by New Contributor II
  • 858 Views
  • 4 replies
  • 2 kudos

Resolved! Photon-enabled UC cluster has less executor memory (1/4th) compared to a normal cluster.

I have a Unity Catalog enabled cluster with Node type Standard_DS4_v2 (28 GB Memory, 8 Cores). When the "Use Photon Acceleration" option is disabled, spark.executor.memory is 18409m. But if I enable Photon Acceleration, it shows spark.executor.memory as 46...

Latest Reply
Walter_C
Databricks Employee
  • 2 kudos

The memory allocated to the Photon engine is not fixed; it is based on a percentage of the node’s total memory. To calculate the value of spark.executor.memory based on a specific node type, you can use the following formula: container_size = (vm_si...

3 More Replies
TejeshS
by New Contributor III
  • 1117 Views
  • 1 reply
  • 1 kudos

How to identify which columns we need to consider for liquid clustering from a table of 200+ columns

In Databricks, when working with a table that has a large number of columns (e.g., 200), it can be challenging to determine which columns are most important for liquid clustering. Objective: The goal is to determine which columns to select based on th...

Latest Reply
Alberto_Umana
Databricks Employee
  • 1 kudos

Hi @TejeshS, Thanks for your post! To determine which columns are most important for liquid clustering in a table with a large number of columns, you should focus on the columns that are most frequently used in query filters and those that can signif...
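Once the most frequently filtered columns have been identified, a hedged example of applying them as clustering keys (all names below are placeholders) might look like:

# Apply the chosen filter columns as liquid clustering keys on the wide table.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.wide_table
    CLUSTER BY (event_date, customer_id)
""")

# Re-cluster existing data after changing the keys.
spark.sql("OPTIMIZE my_catalog.my_schema.wide_table")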

guiferviz
by New Contributor III
  • 1693 Views
  • 7 replies
  • 3 kudos

Resolved! How to Determine if Materialized View is Performing Full or Incremental Refresh?

I'm currently testing materialized views and I need some help understanding the refresh behavior. Specifically, I want to know if my materialized view is querying the full table (performing a full refresh) or just doing an incremental refresh. From so...

Latest Reply
TejeshS
New Contributor III
  • 3 kudos

To validate the status of your materialized view (MV) refresh, run a DESCRIBE EXTENDED command and check the row corresponding to the "last refresh status type." RECOMPUTE indicates a full load execution was completed. NO_OPERATION means no operation w...
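A hedged sketch of that check from a notebook (the view name is a placeholder, and the exact row label may vary slightly by DBR version):

# Pull the materialized view's extended metadata and keep the rows that mention
# the refresh, then check whether the type is RECOMPUTE (full) or not.
details = spark.sql("DESCRIBE EXTENDED my_catalog.my_schema.my_materialized_view")
display(details.filter("col_name ILIKE '%refresh%'"))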

6 More Replies
PiotrM
by New Contributor III
  • 502 Views
  • 2 replies
  • 0 kudos

Canceling long-running queries on UC-enabled all-purpose clusters

Hey, as in the subject. Is it possible to set a timeout for long-running queries on all-purpose clusters that are UC enabled? I know there is such a setting for SQL Warehouses and Workflows, but I was unable to find one for all-purpose clusters. The issu...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@PiotrM thanks for your question! Adding to @Alberto_Umana's comment, could you please clarify what you mean by: "I tried thing like spark.task.reaper.killTimeout, but it seems like UC clusters won't accept it."? Is it throwing an error or is it ...

1 More Replies
berserkersap
by Contributor
  • 5966 Views
  • 4 replies
  • 1 kudos

Speed Up JDBC Write from Databricks Notebook to MS SQL Server

Hello everyone, I have a use case where I need to write a Delta table from Databricks to a SQL Server table using PySpark / Python / Spark SQL. The Delta table I am writing contains around 3 million records and the SQL Server table is neither partitione...

Data Engineering
JDBC
MS SQL Server
pyspark
Table Write
Latest Reply
VZLA
Databricks Employee
  • 1 kudos

@berserkersap have you had time to identify where the bottleneck is? e.g., sequential writes, network latency/throughput, or maybe a connection pool in the target that is much smaller than the number of connection threads in the source?
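For reference, a typical tuned JDBC write of the kind this thread discusses might look roughly like this (connection details are placeholders, and the repartition/batchsize values need tuning against the target server):

# Write the Delta table's DataFrame to SQL Server over JDBC.
(
    delta_df  # placeholder: DataFrame read from the Delta table
    .repartition(8)  # number of parallel JDBC connections
    .write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
    .option("dbtable", "dbo.target_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("batchsize", 10000)  # rows sent per batch insert
    .mode("append")
    .save()
)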

3 More Replies
guangyi
by Contributor III
  • 640 Views
  • 2 replies
  • 0 kudos

How to identify the mandatory fields of the create clusters API

After several attempts I found some mandatory fields for the cluster creation API: num_workers, spark_version, node_type_id. I'm not finding these fields directly against the API but via the job cluster definition in the asset bundle YAML file. I ask the Chat...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@guangyi thanks for your question! I understand your concerns. Looking through the docs I could only find a few with the "required" metadata tag, while most seem to be implicitly assumed, e.g.: singleNode with num_workers 0, and similar requirements....
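For what it's worth, a minimal clusters/create payload built only from the fields this thread identifies as effectively required could look like the following (host, token, and all values are placeholders):

import requests

# Hypothetical minimal payload: spark_version, node_type_id, and num_workers.
payload = {
    "cluster_name": "example-cluster",
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
}
resp = requests.post(
    "https://<workspace-host>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <token>"},
    json=payload,
)
print(resp.status_code, resp.json())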

1 More Replies
vivek_cloudde
by New Contributor III
  • 1630 Views
  • 8 replies
  • 2 kudos

Resolved! Issue while creating on-demand cluster in Azure Databricks using PySpark

Hello, I am trying to create an on-demand cluster in Azure Databricks using the below code and I am getting the error message {"error_code":"INVALID_PARAMETER_VALUE","message":"Exactly 1 of virtual_cluster_size, num_workers or autoscale must be specified."...

Latest Reply
VZLA
Databricks Employee
  • 2 kudos

@vivek_cloudde I still find it interesting that for all these different misconfigurations or wrong cluster definitions, you got the same error message, but anyway, happy to hear it worked! If it helps, next time and to make things simpler, ...

7 More Replies
nikhil_kumawat
by New Contributor II
  • 1072 Views
  • 8 replies
  • 2 kudos

Not able to retain precision while reading data from source file

Hi, I am trying to read a csv file located in an S3 bucket folder. The csv file contains around 50 columns, out of which one of the columns is "litre_val", which contains values like "60211.952", "59164.608". Up to 3 decimal points. Now to read this csv we ...

precision.png
Latest Reply
VZLA
Databricks Employee
  • 2 kudos

@nikhil_kumawat can you provide more details to reproduce this so we can better help you? e.g.: sample data set, DBR version, reproducer code, etc. I have this sample data: csv_content = """column1,column2,litre_val,another_decimal_column 1,TypeA,60211...
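A hedged reproducer along those lines, using an explicit schema so the three decimal places are kept rather than inferred as a float (the path and extra columns are placeholders):

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Declare litre_val as DecimalType(18, 3) so values like 60211.952 keep their
# exact precision instead of being read as doubles.
schema = StructType([
    StructField("column1", StringType()),
    StructField("column2", StringType()),
    StructField("litre_val", DecimalType(18, 3)),
])

df = spark.read.csv("s3://<bucket>/<folder>/", header=True, schema=schema)
df.show(truncate=False)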

7 More Replies
AlbertWang
by Valued Contributor
  • 2494 Views
  • 6 replies
  • 3 kudos

Resolved! Azure Databricks Unity Catalog - cannot access managed volume in notebook

We have set up Azure Databricks with Unity Catalog (Metastore). Used Managed Identity (Databricks Access Connector) for connection from workspace(s) to ADLS Gen2. ADLS Gen2 storage account has Storage Blob Data Contributor and Storage Queue Data Contrib...

Latest Reply
VAMSaha22
New Contributor II
  • 3 kudos

Hi @AlbertWang, did you find a solution to this issue? I am facing the exact same issue.

5 More Replies
Algocrat
by New Contributor II
  • 4562 Views
  • 2 replies
  • 2 kudos

Resolved! Discover and redact PII

Hi! What is the best way to discover and redact PII? Does Databricks offer any frameworks, set of methods, or processes that we may follow?

Latest Reply
viswesh
New Contributor II
  • 2 kudos

Hey @Algocrat @szymon_dybczak, just wanted to let you know that Databricks is currently working on a product to tackle PII / sensitive data classification. If you're a current customer, we recommend you reach out to your account representative to l...

1 More Replies
semsim
by Contributor
  • 3200 Views
  • 6 replies
  • 0 kudos

Resolved! Installing LibreOffice on Databricks

Hi, I need to install LibreOffice to do a document conversion from .docx to .pdf. The requirement is no use of containers. Any idea on how I should go about this? Environment: Databricks 13.3 LTS. Thanks, Sem

Latest Reply
furkan
New Contributor II
  • 0 kudos

Hi @semsim, I'm attempting to install LibreOffice for converting DOCX files to PDF and tried running your shell commands from a notebook. However, I encountered the 404 errors shown below. Do you have any suggestions on how to resolve this issue? I real...

5 More Replies
soumiknow
by Contributor II
  • 2221 Views
  • 10 replies
  • 2 kudos

Resolved! How to resolve 'connection refused' error while using a google-cloud lib in Databricks Notebook?

I want to use the google-cloud-bigquery library in my PySpark code, though I know that the spark-bigquery-connector is available. The reason I want to use it is that the Databricks cluster 15.4 LTS comes with the 0.22.2-SNAPSHOT version of spark-bigquery-connector, wh...

Latest Reply
VZLA
Databricks Employee
  • 2 kudos

@soumiknow sounds good! Please let me know if you need some internal assistance with the communication process.

9 More Replies
