Data Engineering

Forum Posts

by Braxx • Contributor II

02-14-2022 2:40:38 AM

1718 Views
2 replies
3 kudos

Resolved! issue with rounding selected column in "for in" loop

This must be trivial, but I must have missed something. I have a dataframe (test1) and want to round all the columns listed in a list of columns (col_list). Here is the code I am running: col_list = ['measure1', 'measure2', 'measure3'] for i in col_list:...

Latest Reply

Braxx
Contributor II

02-14-2022 6:50:07 AM

3 kudos

You're absolutely right. Thanks!

1 More Replies

by alejandrofm • Valued Contributor

02-12-2022 1:35:42 PM

2455 Views
2 replies
3 kudos

Resolved! Running vacuum on each table

Hi, in line with my question about OPTIMIZE, this is the next step. With a retention of 7 days I could execute VACUUM on all tables once a week. Is this a recommended procedure? How can I know if I'll be getting any benefit from VACUUM, without DRY RU...

Latest Reply

AmanSehgal
Honored Contributor III

02-14-2022 5:22:57 AM

3 kudos

Ideally 7 days is recommended, but discuss with data stakeholders to identify what's suitable: 7, 14 or 28 days. To use VACUUM, first run some analytics on the behaviour of your data. Identify the % of operations that perform updates and deletes vs insert operati...
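
The analysis suggested in this reply could be sketched roughly as below. This is a minimal sketch, not the reply author's method. It assumes the operation names have already been collected into a list (for example from the `operation` column of a table's history). The grouping of UPDATE, DELETE and MERGE as "rewriting" operations, the function name `rewrite_share` and the sample history are all illustrative assumptions.

```python
from collections import Counter

# Operation names as they typically appear in a Delta table's history.
# Treating exactly these three as "rewriting" operations is an assumption.
REWRITE_OPS = {"UPDATE", "DELETE", "MERGE"}


def rewrite_share(operations):
    """Return the fraction of operations that are updates, deletes or merges."""
    counts = Counter(op.upper() for op in operations)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    rewrites = sum(counts[op] for op in REWRITE_OPS)
    return rewrites / total


# Hypothetical history for one table: mostly appends, a few rewrites.
history = ["WRITE", "WRITE", "UPDATE", "WRITE", "DELETE", "MERGE", "WRITE", "WRITE"]
share = rewrite_share(history)
print(f"{share:.0%} of operations rewrite or remove files")
```

A table whose share stays near zero is mostly append-only, so a weekly VACUUM would probably find little to remove. A high share suggests more stale files to clean up.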

1 More Replies

by NOOR_BASHASHAIK • Contributor

02-07-2022 1:55:47 AM

838 Views
2 replies
0 kudos

Resolved! Databricks PAT (personal access token) with access to databases selectively

Hi all, I am establishing a connection to Databricks from Collibra through the Spark driver. Collibra expects these details for the connection (for token-based connections): personal access token (PAT), server/workspace name, httpPath. Upon successful connection, Collibra d...

Latest Reply

Atanu
Esteemed Contributor

02-12-2022 8:53:30 AM

0 kudos

A PAT is integrated with the workspace, so it gets access to all of Hive. Is there any way you can filter this out in Collibra?

1 More Replies

by jeffreym9 • New Contributor III

12-16-2021 1:55:05 PM

1948 Views
5 replies
0 kudos

Resolved! Hive version after Upgrade Azure Databricks from 6.4 (Spark 2) to 9.1 (Spark 3)

I have upgraded Azure Databricks from 6.4 to 9.1, which enables me to use Spark 3. As far as I know, the Hive version has to be upgraded to 2.3.7 as well, as discussed in: https://community.databricks.com/s/question/0D53f00001HKHy2CAH/how-to-upgrade-...

Latest Reply

jeffreym9
New Contributor III

01-26-2022 11:09:04 AM

0 kudos

I'm asking about Databricks version 9.1. I've followed the URL given (https://docs.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore). Do you mind letting me know where in the table it mentions the supported Hive version fo...

4 More Replies

by thushar • Contributor

12-22-2021 3:55:12 AM

2187 Views
9 replies
6 kudos

Resolved! Compile all the scripts under the workspace folder

In one workspace folder I have 100+ PySpark scripts, all of which need to be compiled before running the main program. To compile these files, we are using the %run magic command, like %run ../prod/netSales. Since we have 100+...

Latest Reply

Hubert-Dudek
Esteemed Contributor III

12-22-2021 6:58:14 AM

6 kudos

The problem is that you can list all files in the workspace only via an API call, and then you can run every one of them using: dbutils.notebook.run(). This is the script to list files from the workspace (you probably need to add some filtering): import requests ctx = ...
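
Since the script in this reply is cut off, here is a minimal sketch of the same idea under stated assumptions. It is not the reply author's script. It uses the standard library instead of `requests`, assumes the Workspace API list endpoint (`/api/2.0/workspace/list`) with a bearer token, and the helper names `list_workspace` and `notebook_paths`, the sample response and the 600-second timeout are all hypothetical.

```python
import json
import urllib.parse
import urllib.request


def list_workspace(host, token, folder):
    """Call the Workspace API list endpoint and return the parsed JSON body."""
    query = urllib.parse.urlencode({"path": folder})
    request = urllib.request.Request(
        f"{host}/api/2.0/workspace/list?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def notebook_paths(listing):
    """Keep only notebook entries; folders and other objects are skipped."""
    return [
        obj["path"]
        for obj in listing.get("objects", [])
        if obj.get("object_type") == "NOTEBOOK"
    ]


# Hypothetical response body, shaped like the list endpoint's output.
sample = {
    "objects": [
        {"path": "/prod/netSales", "object_type": "NOTEBOOK"},
        {"path": "/prod/archive", "object_type": "DIRECTORY"},
        {"path": "/prod/grossSales", "object_type": "NOTEBOOK"},
    ]
}
paths = notebook_paths(sample)

# Inside a Databricks notebook, each path could then be run in turn:
# for path in paths:
#     dbutils.notebook.run(path, 600)  # timeout in seconds, chosen arbitrarily
```

Note that dbutils.notebook.run() executes each notebook as a separate run, so functions defined there are not imported into the calling notebook the way they are with %run.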

8 More Replies

by thushar • Contributor

01-19-2022 3:07:37 AM

2314 Views
4 replies
3 kudos

Resolved! Deploy tar.gz package from private git hub

We created a Python package (.tar.gz) and kept it in a private Git repository. We are able to connect to that repository (using a PAT) from the Azure Databricks notebook. Our requirement is to install that package from the .tar.gz file for that notebook: "pip install https://USE...

Latest Reply

Rahul_Samant
Contributor

01-20-2022 2:23:29 AM

3 kudos

To install the package using pip, you need to package the repo using setup.py. Check this link for more details: https://packaging.python.org/en/latest/tutorials/packaging-projects/ Alternatively, you can pass the tar.gz using --py-files while submi...

3 More Replies

by Vibhor • Contributor

02-08-2022 10:13:14 AM

1582 Views
5 replies
1 kudos

Resolved! Notebook level automated pipeline monitoring or failure notif

Hi, is there any way, other than ADF monitoring, to get notebook-level execution details in an automated way, without having to go to each pipeline and check?

Latest Reply

Anonymous
Not applicable

02-10-2022 7:17:01 AM

1 kudos

@Vibhor Sethi - Would you be happy to mark @Werner Stinckens' answer as best if it resolved your question?

4 More Replies

by Scouty • New Contributor

02-10-2022 2:21:16 AM

4022 Views
2 replies
3 kudos

Resolved! How to reset an autoloader?

Hi, I'm using Auto Loader with Azure Databricks: df = (spark.readStream.format("cloudFiles") .options(**cloudfile) .load("abfss://dev@std******.dfs.core.windows.net/**/*****")) At my target checkpointLocation folder there are some files and subdirs...

Latest Reply

Anonymous
Not applicable

02-10-2022 6:55:58 AM

3 kudos

@Aman Sehgal - My name is Piper, and I'm one of the moderators for Databricks. I wanted to jump in real quick to thank you for being so generous with your knowledge.

1 More Replies

by ckwan48 • New Contributor III

02-03-2022 1:38:15 PM

9390 Views
5 replies
3 kudos

Resolved! How to prevent my cluster to shut down after inactivity

Currently, I am running a cluster that is set to terminate after 60 minutes of inactivity. However, in one of my notebooks, one of the cells is still running. How can I prevent this from happening if I want my notebook to run overnight without monito...

Latest Reply

AmanSehgal
Honored Contributor III

02-10-2022 6:13:10 AM

3 kudos

If a cell is already running (I assume it's a streaming operation), then I don't think the cluster counts as inactive. The cluster should keep running while a cell is running on it. On the other hand, if you want to keep running your clusters for ...

4 More Replies

by irfanaziz • Contributor II

01-17-2022 7:49:47 AM

5200 Views
3 replies
2 kudos

Resolved! Issue in reading parquet file in pyspark databricks.

From time to time, one of the source systems generates a Parquet file that is only 220 KB in size, but reading it fails: "java.io.IOException: Could not read or convert schema for file: 1-2022-00-51-56.parquet Caused by: org.apache.spark.sql.AnalysisExce...

Latest Reply

Anonymous
Not applicable

02-09-2022 8:13:04 AM

2 kudos

@nafri A - Howdy! My name is Piper, and I'm a community moderator for Databricks. Would you be happy to mark @Hubert Dudek's answer as best if it solved the problem? That will help other members find the answer more quickly. Thanks

2 More Replies

by SailajaB • Valued Contributor III

02-09-2022 3:38:51 AM

6518 Views
1 reply
5 kudos

Resolved! Best practices for implementing Unit Test cases in databricks and Azure devops

Hello, please suggest best practices or ways to implement unit test cases in Databricks Python so that they pass code coverage in Azure DevOps.

Latest Reply

User16753725182
Contributor III

02-09-2022 3:54:53 AM

5 kudos

Hi, the process is like traditional software development practices. Docs to refer to: https://docs.microsoft.com/en-us/azure/databricks/dev-tools/ci-cd/ci-cd-azure-devops#unit-tests-in-azure-databricks-notebooks Azure DevOps Best Practices: https://docs.m...
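
One possible reading of "like traditional software development practices" is to keep transformation logic in plain functions and test them with a standard test framework. The sketch below assumes exactly that. The function `add_net_amount`, its fields and the test names are hypothetical and do not come from the linked docs.

```python
import unittest


def add_net_amount(rows, tax_rate):
    """Return new row dicts with a 'net' field computed from 'gross'."""
    return [{**row, "net": round(row["gross"] / (1 + tax_rate), 2)} for row in rows]


class AddNetAmountTest(unittest.TestCase):
    def test_net_is_computed(self):
        result = add_net_amount([{"gross": 120.0}], tax_rate=0.2)
        self.assertEqual(result, [{"gross": 120.0, "net": 100.0}])

    def test_input_is_not_mutated(self):
        rows = [{"gross": 50.0}]
        add_net_amount(rows, tax_rate=0.0)
        self.assertEqual(rows, [{"gross": 50.0}])


def run_tests():
    """Run the suite programmatically, as a notebook cell or CI step might."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddNetAmountTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the function takes and returns plain Python data, the same tests can run in a build pipeline without a cluster, which is usually where coverage is measured.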

by mayuri18kadam • New Contributor II

01-24-2022 8:45:25 PM

2805 Views
3 replies
0 kudos

Resolved! com.databricks.sql.io.FileReadException Caused by: com.microsoft.azure.storage.StorageException: Blob hash mismatch

Hi, I am getting the following error: com.databricks.sql.io.FileReadException: Error while reading file wasbs:REDACTED_LOCAL_PART@blobStorageName.blob.core.windows.net/cook/processYear=2021/processMonth=12/processDay=30/processHour=18/part-00003-tid-4...

Latest Reply

mayuri18kadam
New Contributor II

01-26-2022 10:05:45 AM

0 kudos

Yes, I can read from a notebook with DBR 6.4 when I specify this path: wasbs:REDACTED_LOCAL_PART@blobStorageName.blob.core.windows.net/cook/processYear=2021/processMonth=12/processDay=30/processHour=18 but the same using DBR 6.4 from spark-submit, it f...

2 More Replies

by Ian • New Contributor III

01-03-2022 10:49:20 AM

2704 Views
6 replies
0 kudos

Resolved! Databricks-Connect and Change Data Feed query error

I have installed Databricks-Connect (9.1 LTS). I am able to send queries to the cluster. However, when the query includes a call to the 'table_changes' function that is a part of Change Data Feed, I get the following error: AnalysisException("could ...

Latest Reply

Ian
New Contributor III

01-21-2022 11:00:36 AM

0 kudos

Hi @Kaniz Fatma, the table_changes function is an internal Databricks function used in Change Data Feed (CDF). Please refer to the article below, which discusses the table_changes function: https://docs.databricks.com/delta/delta-change-data-feed.html
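
For readers unfamiliar with the function, a query using table_changes over a range of table versions might be built as in the sketch below. This only illustrates the shape of the call and does not address the Databricks-Connect error in this thread. The helper `table_changes_query` and the table name `sales.orders` are hypothetical.

```python
def table_changes_query(table, start_version, end_version=None):
    """Build a SELECT over table_changes for a range of table versions."""
    args = [f"'{table}'", str(int(start_version))]
    if end_version is not None:
        args.append(str(int(end_version)))
    return f"SELECT * FROM table_changes({', '.join(args)})"


# `sales.orders` is a hypothetical table with Change Data Feed enabled.
query = table_changes_query("sales.orders", 2, 5)

# On a cluster, this string would be passed to spark.sql(query).
```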

5 More Replies
