Data Engineering

Forum Posts

Sorted by:

by FarBo • New Contributor III

01-05-2023 6:57:40 AM

2474 Views
4 replies
5 kudos

Spark issue handling data from json when the schema DataType mismatch occurs

Hi,I have encountered a problem using spark, when creating a dataframe from a raw json source.I have defined an schema for my data and the problem is that when there is a mismatch between one of the column values and its defined schema, spark not onl...

Data Engineering

2474 Views
4 replies
5 kudos

01-05-2023 6:57:40 AM

View Replies

Latest Reply

Anonymous
Not applicable

04-10-2023 6:11:47 AM

5 kudos

@Farzad Bonabi :Thank you for reporting this issue. It seems to be a known bug in Spark when dealing with malformed decimal values. When a decimal value in the input JSON data is not parseable by Spark, it sets not only that column to null but also ...

5 kudos

04-10-2023 6:11:47 AM

3 More Replies

by Tam • New Contributor III

11-16-2023 9:56:46 AM

5084 Views
2 replies
2 kudos

Delta Table on AWS Glue Catalog

I have set up Databricks cluster to work with AWS Glue Catalog by enabling the spark.databricks.hive.metastore.glueCatalog.enabled to true. However, when I create a Delta table on Glue Catalog, the schema reflected in the AWS Glue Catalog is incorrec...

Data Engineering

5084 Views
2 replies
2 kudos

11-16-2023 9:56:46 AM

View Replies

Latest Reply

monometa
New Contributor II

03-21-2024 1:43:22 PM

2 kudos

Hi, could you please refer to something or explain in more detail your point about querying Delta Lake files directly instead of through the AWS Glue catalog and why it was highlighted as a best practice?

2 kudos

03-21-2024 1:43:22 PM

1 More Replies

by NDK_1 • New Contributor II

03-21-2024 11:13:22 AM

397 Views
1 replies
0 kudos

I would like to Create a schedule in Databricks that runs a job on 1st working day of every month

I would like to create a schedule in Databricks that runs a job on the first working day of every month (working days referring to Monday through Friday). I tried using Cron syntax but didn't have any luck. Is there any way we can schedule this in Da...

Data Engineering

397 Views
1 replies
0 kudos

03-21-2024 11:13:22 AM

View Replies

Latest Reply

shan_chandra
Honored Contributor III

03-21-2024 1:20:17 PM

0 kudos

@NDK_1 - Cron syntax won't allow the combination of day of month and day of week. you can try creating two different schedules - one for the first day, second day of the month and then add custom logic to check if it is an working day and then trigg...

0 kudos

03-21-2024 1:20:17 PM

by bchaubey • Contributor II

01-05-2022 12:45:24 AM

19307 Views
18 replies
11 kudos

Resolved! To read data from Azure Storage

Hi Team,May i know how to read Azure storage data in Databricks through Python.

Data Engineering

19307 Views
18 replies
11 kudos

01-05-2022 12:45:24 AM

View Replies

Latest Reply

bchaubey
Contributor II

01-10-2022 6:22:05 AM

11 kudos

@Kaniz Fatma need full syllabus of Azure Databricks

11 kudos

01-10-2022 6:22:05 AM

17 More Replies

by Constantine • Contributor III

11-02-2021 7:46:29 PM

8488 Views
3 replies
6 kudos

Resolved! CREATE TEMP TABLE FROM CTE

I have written a CTE in Spark SQL WITH temp_data AS ( ...... ) CREATE VIEW AS temp_view FROM SELECT * FROM temp_view; I get a cryptic error. Is there a way to create a temp view from CTE using Spark SQL in databricks?

Data Engineering

8488 Views
3 replies
6 kudos

11-02-2021 7:46:29 PM

View Replies

Latest Reply

-werners-
Esteemed Contributor III

11-03-2021 12:08:56 AM

6 kudos

In the CTE you can't do a CREATE. It expects an expression in the form of expression_name [ ( column_name [ , ... ] ) ] [ AS ] ( query )where expression_name specifies a name for the common table expression.If you want to create a view from a CTE, y...

6 kudos

11-03-2021 12:08:56 AM

2 More Replies

by cpayne_vax • New Contributor III

01-17-2024 2:17:00 PM

3809 Views
5 replies
2 kudos

Resolved! Delta Live Tables: dynamic schema

Does anyone know if there's a way to specify an alternate Unity schema in a DLT workflow using the @Dlt.table syntax? In my case, I’m looping through folders in Azure datalake storage to ingest data. I’d like those folders to get created in different...

Data Engineering

3809 Views
5 replies
2 kudos

01-17-2024 2:17:00 PM

View Replies

Latest Reply

data-engineer-d
New Contributor III

03-21-2024 10:45:09 AM

2 kudos

@cpayne_vax now that we are at end of Q1-24, do we have the ability to write to any schema dynamically?

2 kudos

03-21-2024 10:45:09 AM

4 More Replies

by test_123 • New Contributor

03-21-2024 3:03:25 AM

283 Views
1 replies
0 kudos

Autoloader not detecting changes/updated values for xml file

if i update the value in xml then autoloader not detecting the changes.same for delete/remove column or property in xml. So request to you please help me to fix this issue

Data Engineering

283 Views
1 replies
0 kudos

03-21-2024 3:03:25 AM

View Replies

Latest Reply

Walter_C
Valued Contributor II

03-21-2024 9:50:07 AM

0 kudos

It seems that the issue you're experiencing with Autoloader not detecting changes in XML files might be related to how Autoloader handles schema inference and evolution. Autoloader can automatically detect the schema of loaded XML data, allowing you...

0 kudos

03-21-2024 9:50:07 AM

by SyedGhouri • New Contributor III

03-20-2024 4:29:10 PM

2633 Views
2 replies
0 kudos

Cannot create jobs with jobs api - Azure databricks - private network

HiI'm trying to deploy the databricks jobs from dev to prod environment. I have jobs in dev environment and using azure devops, I deployed the jobs in the code format to prod environment. Now when I use the post method to create the job programmatica...

Data Engineering

2633 Views
2 replies
0 kudos

03-20-2024 4:29:10 PM

View Replies

Latest Reply

daniel_sahal
Esteemed Contributor

03-21-2024 1:14:58 AM

0 kudos

@SyedGhouri You need to setup self-hosted Azure DevOps Agent inside your VNET.

0 kudos

03-21-2024 1:14:58 AM

1 More Replies

by pshuk • New Contributor II

03-20-2024 12:18:20 PM

737 Views
2 replies
0 kudos

Copying files from dev environment to prod environment

Hi,Is there a quick and easy way to copy files between different environments? I have copied a large number of files on my dev environment (unity catalog) and want to copy them over to production environment. Instead of doing it from scratch, can I j...

Data Engineering

737 Views
2 replies
0 kudos

03-20-2024 12:18:20 PM

View Replies

Latest Reply

Hubert-Dudek
Esteemed Contributor III

03-20-2024 12:53:20 PM

0 kudos

If you want to copy files in Azure, ADF is usually the fastest option (for example TB of csvs, parquets). If you want to copy tables, just use CLONE. If it is files with code just use Repos and branches.

0 kudos

03-20-2024 12:53:20 PM

1 More Replies

by MarinD • New Contributor II

03-20-2024 3:05:37 PM

834 Views
2 replies
0 kudos

Asset bundle pipelines - target schema and catalog

Do asset bundles support DLT pipelines unity catalog as a destination? How to specify catalog and target schema?

Data Engineering

834 Views
2 replies
0 kudos

03-20-2024 3:05:37 PM

View Replies

Latest Reply

Kaniz
Community Manager

03-21-2024 5:13:30 AM

0 kudos

Hi @MarinD, Delta Live Tables (DLT) pipelines can indeed use Unity Catalog as a destination. Here’s how you can specify the catalog and target schema: Create a DLT Pipeline with Unity Catalog: When creating a DLT pipeline, in the UI, select “Uni...

0 kudos

03-21-2024 5:13:30 AM

1 More Replies

by William_Scardua • Valued Contributor

03-20-2024 6:03:04 AM

367 Views
1 replies
1 kudos

Resolved! How to no round formating

Hy guys,I need to format the decimal values but I can`t round thenhave any idea ?thank you

Data Engineering

367 Views
1 replies
1 kudos

03-20-2024 6:03:04 AM

View Replies

Latest Reply

Kaniz
Community Manager

03-21-2024 6:02:21 AM

1 kudos

Hi @William_Scardua, In Databricks, you can format decimal values without rounding them using a couple of approaches. Let’s explore some options: Using substring: You can use the substring function to extract a specific number of decimal places f...

1 kudos

03-21-2024 6:02:21 AM

by dbx-user7354 • New Contributor III

03-21-2024 2:46:09 AM

694 Views
3 replies
1 kudos

Pyspark Dataframes orderby only orders within partition when having multiple worker

I came across a pyspark issue when sorting the dataframe by a column. It seems like pyspark only orders the data within partitions when having multiple worker, even though it shouldn't. from pyspark.sql import functions as F import matplotlib.pyplot...

Data Engineering

694 Views
3 replies
1 kudos

03-21-2024 2:46:09 AM

View Replies

Latest Reply

MarkusFra
New Contributor II

03-21-2024 5:52:25 AM

1 kudos

@Kaniz Sorry if I have to ask again, but I am a bit confused by this.I thought, that pysparks `orderBy()` and `sort()` do a shuffle operation before the sorting for exact this reason. There is another command `sortWithinPartitions()` that does not do...

1 kudos

03-21-2024 5:52:25 AM

2 More Replies

by ac0 • New Contributor III

03-20-2024 8:16:27 AM

322 Views
1 replies
0 kudos

Get size of metastore specifically

Currently my Databricks Metastore is in the the same location as the data for my production catalog. We are moving the data to a separate storage account. In advance of this, I'm curious if there is a way to determine the size of the metastore itself...

Data Engineering

322 Views
1 replies
0 kudos

03-20-2024 8:16:27 AM

View Replies

Latest Reply

Kaniz
Community Manager

03-21-2024 5:45:17 AM

0 kudos

Hi @ac0, Let’s explore how you can determine the size of your Databricks Metastore and estimate the storage requirements for the Azure Storage Account hosting the metastore. Metastore Size: The metastore in Unity Catalog is the top-level contain...

0 kudos

03-21-2024 5:45:17 AM

by Dikshant • New Contributor

03-20-2024 10:08:31 AM

476 Views
1 replies
0 kudos

SchemaEvolutionMode exception in Databricks 14.2

I am unable to display the below stream after reading it.df= spark.readStream.format("cloudFiles")\.option("cloudFiles.format", "csv")\.option("header", "true")\.option("delimiter", "\t")\.option("inferSchema", "true")\.option("cloudFiles.connectionS...

Data Engineering

schemaEvolutionMode

476 Views
1 replies
0 kudos

03-20-2024 10:08:31 AM

View Replies

Latest Reply

Kaniz
Community Manager

03-21-2024 5:36:38 AM

0 kudos

Hi @Dikshant, Unfortunately, stateful streaming queries do not support schema evolution. This means that once a query starts with a particular schema, you cannot change it during query restarts.To resolve this issue, you can set the cloudFiles.schem...

0 kudos

03-21-2024 5:36:38 AM

by IshaBudhiraja • New Contributor II

03-20-2024 10:56:34 AM

393 Views
1 replies
0 kudos

Installation of external libraries(wheel file) in Data bricks through synapse using new job cluster

Aim-Installation of external libraries(wheel file) in Data bricks through synapse using new job clusterSolution- I have followed the below steps:I have created a pipeline in synapse that consists of a notebook activity that is using a new job cluster...

Data Engineering

393 Views
1 replies
0 kudos

03-20-2024 10:56:34 AM

View Replies

Latest Reply

Kaniz
Community Manager

03-21-2024 5:32:36 AM

0 kudos

Hi @IshaBudhiraja, Confirm that the library version matches the one you intended to install.Ensure that the library is installed in the same Python environment where your notebook or script is running.

0 kudos

03-21-2024 5:32:36 AM

User

Count

1602

736

343

284

247

Databricks

Forum Posts

Spark issue handling data from json when the schema DataType mismatch occurs

Delta Table on AWS Glue Catalog

I would like to Create a schedule in Databricks that runs a job on 1st working day of every month

Resolved! To read data from Azure Storage

Resolved! CREATE TEMP TABLE FROM CTE

Resolved! Delta Live Tables: dynamic schema

Autoloader not detecting changes/updated values for xml file

Cannot create jobs with jobs api - Azure databricks - private network

Copying files from dev environment to prod environment

Asset bundle pipelines - target schema and catalog

Resolved! How to no round formating

Pyspark Dataframes orderby only orders within partition when having multiple worker

Get size of metastore specifically

SchemaEvolutionMode exception in Databricks 14.2

Installation of external libraries(wheel file) in Data bricks through synapse using new job cluster

Best way to parse Google Analytics data in Databri...

DELTA_EXCEED_CHAR_VARCHAR_LIMIT

Not able to set run_as service_principal_name

Pyspark operations slowness in CLuster 14.3LTS as ...

[Databricks Assets Bundles] Workflow trigger on fi...