Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

data-engineer-d
by Contributor
  • 1992 Views
  • 3 replies
  • 4 kudos

Parametrize the DLT pipeline for dynamic loading of many tables

I am trying to ingest hundreds of tables with CDC, and I want to create a generic/dynamic pipeline that accepts parameters (e.g. table_name, schema, file path) and runs the logic on them. However, I am not able to find a way to pass parameters to p...

Data Engineering
Delta Live Tables
Latest Reply
Gilg
Contributor II
  • 4 kudos

If you have different folders for each of your source tables, you can leverage Python loops to iterate over the folders. To do this, define a create_pipeline function that takes table_name, schema, and path as parameters. Inside t...

2 More Replies
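The factory-function pattern Gilg describes can be sketched in plain Python. The table names and paths below are hypothetical, and in a real DLT pipeline the inner function would be decorated with `@dlt.table` and return a `spark.readStream` of the source path; the sketch only shows why a factory function is needed (it binds each table's parameters at call time, avoiding Python's late binding of loop variables inside closures):

```python
# Hypothetical metadata describing the tables to ingest.
TABLES = [
    {"table_name": "orders", "schema": "sales", "path": "/mnt/raw/orders"},
    {"table_name": "customers", "schema": "sales", "path": "/mnt/raw/customers"},
]

def create_pipeline(table_name, schema, path):
    """Factory that captures its arguments per call, so each generated
    function keeps its own table_name/schema/path."""
    def load():
        # In a real pipeline this body would be decorated with
        # @dlt.table(name=f"{schema}_{table_name}") and return
        # spark.readStream.format("cloudFiles").load(path).
        return f"{schema}.{table_name} <- {path}"
    load.__name__ = f"load_{schema}_{table_name}"
    return load

# One generated loader per metadata entry.
loaders = [create_pipeline(**cfg) for cfg in TABLES]
```

Defining `load()` directly inside a `for` loop instead would make every closure see only the last loop value; routing each iteration through `create_pipeline(...)` is what makes the generated tables independent.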
felix_counter
by New Contributor III
  • 11585 Views
  • 4 replies
  • 0 kudos

How to authenticate databricks provider in terraform using a system-managed identity?

Hello, I want to authenticate the databricks provider using a system-managed identity in Azure. The identity resides in a different subscription than the Databricks workspace. According to the "authentication" section of the databricks provider docume...

Data Engineering
authentication
databricks provider
managed identity
Terraform
Latest Reply
FarBo
New Contributor III
  • 0 kudos

@felix_counter I think I have your answer. To create a databricks provider to manage your workspace using an SPN, you need to create the provider like this: provider "databricks" { alias = "workspace" host = <your workspace URL> azure_...

3 More Replies
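FarBo's truncated fragment, expanded into a minimal sketch for the managed-identity case. This assumes the `azure_use_msi` flag of the Databricks Terraform provider; the host URL and the workspace resource reference are placeholders you would replace with your own values:

```hcl
provider "databricks" {
  alias = "workspace"
  host  = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL

  # Placeholder reference; point this at the workspace's Azure resource ID,
  # which may live in a different subscription than the identity.
  azure_workspace_resource_id = azurerm_databricks_workspace.this.id

  # Authenticate with the system-assigned managed identity instead of an SPN secret.
  azure_use_msi = true
}
```

With a system-assigned identity there is no client secret to configure; the identity just needs an RBAC role on the workspace resource.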
jura
by New Contributor II
  • 1323 Views
  • 2 replies
  • 1 kudos

SQL Identifier clause

Hi, I was trying to prepare some dynamic SQL to create a table using the IDENTIFIER clause and a WITH ... AS clause, but it seems I'm stuck on a bug. Could someone verify it or tell me that I am doing something wrong? The code is running on a SQL Warehouse T...

Data Engineering
identifier
Latest Reply
raphaelblg
Honored Contributor
  • 1 kudos

Hello @jura, I'm Raphael and I'll be helping you out. The approach below should work: USE CATALOG dev; CREATE OR REPLACE TABLE IDENTIFIER("bronze.jura_test") AS SELECT ... Please let me know the outcome and feel free to ask any further questions. ...

1 More Replies
FarBo
by New Contributor III
  • 4781 Views
  • 4 replies
  • 5 kudos

Spark issue handling data from json when the schema DataType mismatch occurs

Hi, I have encountered a problem using Spark when creating a DataFrame from a raw JSON source. I have defined a schema for my data, and the problem is that when there is a mismatch between one of the column values and its defined schema, Spark not onl...

Latest Reply
Anonymous
Not applicable
  • 5 kudos

@Farzad Bonabi: Thank you for reporting this issue. It seems to be a known bug in Spark when dealing with malformed decimal values. When a decimal value in the input JSON data is not parseable by Spark, it sets not only that column to null but also ...

3 More Replies
Tam
by New Contributor III
  • 10606 Views
  • 2 replies
  • 2 kudos

Delta Table on AWS Glue Catalog

I have set up a Databricks cluster to work with the AWS Glue Catalog by setting spark.databricks.hive.metastore.glueCatalog.enabled to true. However, when I create a Delta table in the Glue Catalog, the schema reflected in the AWS Glue Catalog is incorrec...

Latest Reply
monometa
New Contributor II
  • 2 kudos

Hi, could you please refer to something or explain in more detail your point about querying Delta Lake files directly instead of through the AWS Glue catalog and why it was highlighted as a best practice?

1 More Replies
NDK_1
by New Contributor II
  • 896 Views
  • 1 reply
  • 0 kudos

I would like to create a schedule in Databricks that runs a job on the 1st working day of every month

I would like to create a schedule in Databricks that runs a job on the first working day of every month (working days referring to Monday through Friday). I tried using Cron syntax but didn't have any luck. Is there any way we can schedule this in Da...

Latest Reply
shan_chandra
Esteemed Contributor
  • 0 kudos

@NDK_1 - Cron syntax won't allow a combination of day-of-month and day-of-week. You can try creating two different schedules - one for the first day and one for the second day of the month - and then add custom logic to check whether it is a working day and then trigg...

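The two-schedule idea can also be collapsed into one job: schedule it for the first three calendar days of the month (e.g. a Quartz expression like `0 0 6 1-3 * ?`) and have the job exit early unless today is actually the first weekday. A minimal sketch of that guard; the first working day is the 1st if it is a weekday, otherwise the following Monday, which is at latest the 3rd:

```python
from datetime import date

def is_first_working_day(d: date) -> bool:
    """True only on the first Mon-Fri day of d's month."""
    if d.weekday() >= 5:      # Saturday/Sunday are never working days
        return False
    if d.day == 1:            # the 1st itself is a weekday
        return True
    # If the 1st fell on a weekend, the first working day is the
    # following Monday, which can be at latest the 3rd.
    return d.weekday() == 0 and d.day <= 3

# Example: June 2024 starts on a Saturday, so Monday June 3rd qualifies.
print(is_first_working_day(date(2024, 6, 3)))
```

The job body then becomes `if not is_first_working_day(date.today()): return`, and a single daily-window schedule covers every month, including ones that start on a weekend.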
Constantine
by Contributor III
  • 11068 Views
  • 3 replies
  • 6 kudos

Resolved! CREATE TEMP TABLE FROM CTE

I have written a CTE in Spark SQL WITH temp_data AS (   ......   )   CREATE VIEW AS temp_view FROM SELECT * FROM temp_view; I get a cryptic error. Is there a way to create a temp view from CTE using Spark SQL in databricks?

Latest Reply
-werners-
Esteemed Contributor III
  • 6 kudos

In the CTE you can't do a CREATE. It expects an expression in the form expression_name [ ( column_name [ , ... ] ) ] [ AS ] ( query ), where expression_name specifies a name for the common table expression. If you want to create a view from a CTE, y...

2 More Replies
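The key point is the ordering: the WITH clause goes inside the CREATE VIEW body, never the other way around. The same shape works in Spark SQL (`CREATE OR REPLACE TEMP VIEW temp_view AS WITH temp_data AS (...) SELECT ...`); the runnable sketch below uses Python's built-in sqlite3 only to demonstrate the placement with standard SQL, with a made-up table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount INT)")
conn.executemany("INSERT INTO sales VALUES (?)", [(10,), (20,)])

# The CTE lives *inside* the CREATE VIEW body; a CREATE statement
# cannot appear inside a WITH clause.
conn.execute("""
    CREATE VIEW temp_view AS
    WITH temp_data AS (SELECT amount * 2 AS doubled FROM sales)
    SELECT doubled FROM temp_data
""")
print(conn.execute("SELECT SUM(doubled) FROM temp_view").fetchone()[0])  # 60
```

The original snippet failed because it put CREATE VIEW after the WITH clause, where the parser expects a query expression.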
test_123
by New Contributor
  • 586 Views
  • 1 replies
  • 0 kudos

Autoloader not detecting changes/updated values for xml file

If I update a value in the XML, Autoloader does not detect the change; the same happens when I delete/remove a column or property in the XML. Could you please help me fix this issue?

Latest Reply
Walter_C
Honored Contributor
  • 0 kudos

It seems that the issue you're experiencing with Autoloader not detecting changes in XML files might be related to how Autoloader handles schema inference and evolution. Autoloader can automatically detect the schema of loaded XML data, allowing you...

SyedGhouri
by New Contributor III
  • 7388 Views
  • 2 replies
  • 0 kudos

Cannot create jobs with jobs api - Azure databricks - private network

Hi, I'm trying to deploy the Databricks jobs from the dev to the prod environment. I have jobs in the dev environment and, using Azure DevOps, I deployed the jobs in code format to the prod environment. Now when I use the POST method to create the job programmatica...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

@SyedGhouri You need to setup self-hosted Azure DevOps Agent inside your VNET.

1 More Replies
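In pipeline-config form, daniel_sahal's suggestion amounts to pointing the deployment stage at an agent pool that lives inside the VNET, so the Jobs API call reaches the private workspace URL. A hedged sketch; the pool name is a placeholder, and the `databricks jobs create --json` invocation assumes the newer Databricks CLI:

```yaml
# azure-pipelines.yml (fragment)
# Run deployment steps on a self-hosted agent inside the VNET so the
# private workspace endpoint is reachable; Microsoft-hosted agents are not.
pool:
  name: SelfHostedVNetPool   # placeholder: your self-hosted agent pool

steps:
  - script: databricks jobs create --json @job.json
    displayName: Create Databricks job via Jobs API
```

Microsoft-hosted agents sit outside your network, which is why the POST to a private-link workspace times out until the agent is moved inside the VNET (or a peered network).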
pshuk
by New Contributor III
  • 2139 Views
  • 2 replies
  • 0 kudos

Copying files from dev environment to prod environment

Hi, is there a quick and easy way to copy files between different environments? I have copied a large number of files to my dev environment (Unity Catalog) and want to copy them over to the production environment. Instead of doing it from scratch, can I j...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

If you want to copy files in Azure, ADF is usually the fastest option (for example, TBs of CSV or Parquet files). If you want to copy tables, just use CLONE. If it is files with code, just use Repos and branches.

1 More Replies
MarinD
by New Contributor II
  • 1572 Views
  • 2 replies
  • 0 kudos

Asset bundle pipelines - target schema and catalog

Do asset bundles support Unity Catalog as a destination for DLT pipelines? How do I specify the catalog and target schema?

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @MarinD, Delta Live Tables (DLT) pipelines can indeed use Unity Catalog as a destination. Here’s how you can specify the catalog and target schema: Create a DLT Pipeline with Unity Catalog: When creating a DLT pipeline, in the UI, select “Uni...

1 More Replies
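The UI steps in the reply can also be expressed directly in the bundle. A minimal sketch with placeholder names, assuming the `catalog` and `target` fields of the DLT pipeline spec in `databricks.yml`:

```yaml
# databricks.yml (fragment) -- names and paths are placeholders
resources:
  pipelines:
    my_dlt_pipeline:
      name: my_dlt_pipeline
      catalog: main        # Unity Catalog catalog for pipeline output
      target: bronze       # target schema inside that catalog
      libraries:
        - notebook:
            path: ./pipelines/ingest.py
```

Setting `catalog` switches the pipeline to Unity Catalog mode (instead of `storage` for Hive-metastore pipelines), and `target` names the schema the tables are published into.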
William_Scardua
by Valued Contributor
  • 1796 Views
  • 1 reply
  • 1 kudos

Resolved! How to format decimals without rounding

Hi guys, I need to format the decimal values but I can't round them. Any idea? Thank you.

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @William_Scardua, In Databricks, you can format decimal values without rounding them using a couple of approaches. Let’s explore some options: Using substring: You can use the substring function to extract a specific number of decimal places f...

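The substring approach in the reply amounts to truncating toward zero rather than rounding. A driver-side sketch of the same idea using Python's decimal module (the function name is made up; in Spark SQL you would get the equivalent with a floor/power-of-ten expression, since format_number rounds):

```python
from decimal import Decimal, ROUND_DOWN

def truncate(value, places: int) -> Decimal:
    """Cut a value to `places` decimal places without rounding."""
    q = Decimal(1).scaleb(-places)  # e.g. places=2 -> Decimal('0.01')
    return Decimal(str(value)).quantize(q, rounding=ROUND_DOWN)

print(truncate(3.14159, 2))  # 3.14
print(truncate(2.999, 2))    # 2.99 -- plain rounding would give 3.00
```

Going through `Decimal(str(value))` avoids binary-float surprises like `Decimal(2.999)` carrying extra digits, and ROUND_DOWN guarantees the displayed value is never larger than the original.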
dbx-user7354
by New Contributor III
  • 2192 Views
  • 3 replies
  • 1 kudos

PySpark DataFrames orderBy only orders within partitions when using multiple workers

I came across a PySpark issue when sorting a DataFrame by a column. It seems that PySpark only orders the data within partitions when using multiple workers, even though it shouldn't. from pyspark.sql import functions as F import matplotlib.pyplot...

Latest Reply
MarkusFra
New Contributor III
  • 1 kudos

@Kaniz_Fatma Sorry if I have to ask again, but I am a bit confused by this. I thought that PySpark's `orderBy()` and `sort()` do a shuffle operation before the sorting for exactly this reason. There is another command, `sortWithinPartitions()`, that does ...

2 More Replies
ac0
by New Contributor III
  • 976 Views
  • 1 reply
  • 0 kudos

Get size of metastore specifically

Currently my Databricks Metastore is in the same location as the data for my production catalog. We are moving the data to a separate storage account. In advance of this, I'm curious whether there is a way to determine the size of the metastore itself...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @ac0,  Let’s explore how you can determine the size of your Databricks Metastore and estimate the storage requirements for the Azure Storage Account hosting the metastore. Metastore Size: The metastore in Unity Catalog is the top-level contain...

