Data Engineering

Forum Posts

SepidehEb
by New Contributor III
  • 1942 Views
  • 6 replies
  • 7 kudos

Resolved! How to get a minor DBR image?

In short, we aim to add a step to a CI job that runs tests in a container that mimics the DBR of our clusters – currently we use 7.3. We are considering one of the databricksruntime images (possibly standard:7.x for now, https://hub...

Latest Reply
Atanu
Esteemed Contributor
  • 7 kudos

Hi @Sepideh Ebrahimi, since the Databricks Runtime is proprietary, you cannot run it locally. As @Werner Stinckens said, you can build your own image, but it has to run on a cluster. There is also Databricks Connect (https://docs.databricks.com/dev-...

5 More Replies
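As a sketch of the Databricks Connect route mentioned in the reply (not from the thread itself): assuming a `databricks-connect` client matching the cluster's DBR version is installed and configured with `databricks-connect configure`, a CI step can run tests against the real cluster instead of a local container. The test name and assertion below are illustrative.

```python
# test_smoke.py – run with pytest from the CI job (illustrative sketch)
from pyspark.sql import SparkSession

def test_cluster_roundtrip():
    # With databricks-connect configured, getOrCreate() returns a session
    # whose jobs execute on the remote Databricks cluster.
    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)  # stand-in for real pipeline logic
    assert df.count() == 100
```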
sunil_smile
by Contributor
  • 3423 Views
  • 5 replies
  • 6 kudos

Apart from notebooks, is it possible to deploy an application (PySpark or R+Spark) as a package or file and execute it in Databricks?

Hi, with the help of Databricks Connect I was able to connect the cluster to my local IDE (PyCharm and the RStudio desktop version), develop the application, and commit the code to Git. When I try to add that repo to the Databricks workspac...

Latest Reply
Atanu
Esteemed Contributor
  • 6 kudos

Maybe you would be interested in Databricks Connect – not sure if it resolves your issue of connecting a third-party tool and setting up a supported IDE: https://docs.databricks.com/dev-tools/databricks-connect.html

4 More Replies
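One common pattern beyond Databricks Connect (a sketch, not something proposed in the thread) is to package the application as a wheel and submit it with the Jobs API's `python_wheel_task`. The workspace URL, token, package name, entry point, and cluster spec below are all placeholders.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

job_spec = {
    "name": "run-packaged-app",
    "tasks": [{
        "task_key": "main",
        # Runs a console entry point from an uploaded wheel.
        "python_wheel_task": {
            "package_name": "my_app",   # hypothetical package
            "entry_point": "main",      # hypothetical entry point
        },
        "libraries": [{"whl": "dbfs:/FileStore/wheels/my_app-0.1-py3-none-any.whl"}],
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
        },
    }],
}

resp = requests.post(f"{HOST}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=job_spec)
print(resp.json())  # {"job_id": ...} on success
```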
Abela
by New Contributor III
  • 4690 Views
  • 3 replies
  • 7 kudos

Resolved! Databricks drop and remove s3 storage files safely

After dropping a Delta table using the DROP command in Databricks, is there a way to drop the S3 files in Databricks without using the rm command? Looking for a solution where junior developers can safely drop a table without messing with the rm command where...

Latest Reply
jose_gonzalez
Moderator
  • 7 kudos

Hi @Alina Bella, like @Hubert Dudek mentioned, we have a best-practice guide for dropping managed tables. You can find the docs here.

2 More Replies
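The key distinction behind that guide is managed vs. external tables: dropping a managed table removes its underlying files along with the metadata, so no manual rm is needed. A minimal sketch, assuming a Databricks notebook where `spark` is predefined; the table name is illustrative.

```python
# Managed table: no LOCATION clause, so Databricks owns the storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE)
    USING DELTA
""")

# Dropping a managed table deletes the metadata and the underlying files,
# so junior developers never need to touch dbutils.fs.rm or aws s3 rm.
spark.sql("DROP TABLE sales_managed")
```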
itay
by New Contributor II
  • 1093 Views
  • 2 replies
  • 1 kudos

Streaming with runOnce and groupBy window queries

I have a streaming job running a groupBy query with a window of 3 days. The query is searching for different types of events. The stream is configured with runOnce and there is a job scheduled for every hour. Now, I'm not sure what data is processed ea...

Latest Reply
jose_gonzalez
Moderator
  • 1 kudos

Hi @itay k, you will need to take a look at the Progress Reporter, which shows the micro-batch JSON metrics. For example, the metric "numInputRows" displays the number of input rows processed in the micro-batch. You will...

1 More Replies
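A minimal sketch of reading those metrics, assuming a Databricks notebook where `query` is the handle returned by `writeStream.start()`; `recentProgress` and `lastProgress` are the standard Structured Streaming progress APIs.

```python
# After the run-once trigger finishes, inspect what each micro-batch did.
for progress in query.recentProgress:   # list of per-micro-batch dicts
    print(progress["batchId"], progress["numInputRows"])

# Or just the most recent micro-batch:
print(query.lastProgress["numInputRows"])
```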
kmartin62
by New Contributor III
  • 2592 Views
  • 9 replies
  • 4 kudos

Resolved! Configure Databricks (spark) context from PyCharm

Hello. I'm trying to connect to Databricks from my IDE (PyCharm) and then run delta table queries from there. However, the cluster I'm trying to access has to give me permission. In this case, I'd go to my cluster, run the cell which gives me permiss...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 4 kudos

"I'm trying to connect to Databricks from my IDE (PyCharm) and then run delta table queries from there."If you are going to deploy later your code to databricks the only solutions which I see is to use databricks-connect or just make development envi...

8 More Replies
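A minimal sketch of the databricks-connect route, assuming the client has been installed (`pip install databricks-connect==7.3.*`, matching the cluster's DBR) and configured via `databricks-connect configure`; the table name is illustrative.

```python
from pyspark.sql import SparkSession

# With databricks-connect configured, this session executes on the
# remote Databricks cluster, not on the local machine.
spark = SparkSession.builder.getOrCreate()

# Delta table queries now run from PyCharm against the cluster.
df = spark.sql("SELECT * FROM my_delta_table LIMIT 10")  # hypothetical table
df.show()
```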
prasadvaze
by Valued Contributor
  • 11940 Views
  • 8 replies
  • 3 kudos

Resolved! How to make delta table column values case-insensitive?

We have many Delta tables with string columns as the unique key (the PK in a traditional relational DB), and we don't want to insert a new row when the key value differs only in case. It's a lot of code change to use upper/lower functions on column value compares (in ...

Latest Reply
lizou
Contributor II
  • 3 kudos

Well, the unintended benefit is that I am now using int/bigint surrogate keys for all tables (preferred in a DW). All joins are made on integer data types, and query efficiency has improved. The string matching using upper() is done only in ETL when com...

7 More Replies
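A minimal sketch of the case-insensitive key compare described above, assuming a Databricks notebook and illustrative table/column names. upper() is applied only inside the MERGE condition, so stored values keep their original case.

```python
# Upsert where the string business key must match case-insensitively.
spark.sql("""
    MERGE INTO target t
    USING updates u
    ON upper(t.business_key) = upper(u.business_key)
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```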
Anonymous
by Not applicable
  • 897 Views
  • 1 reply
  • 1 kudos

Resolved! Access to Cluster Logs for non-admins

Suppose I have a DevOps team that needs near real-time access to cluster logs to troubleshoot job failures. What is the best way for me to grant access to view logs without granting them admin access?

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

Please use the logging option in the cluster settings and set a destination for sending logs to another Azure Blob or S3 storage location (it needs to be mounted first):

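One way to implement that is a sketch using the `cluster_log_conf` field of the Clusters API when creating (or editing) the cluster; the workspace URL, token, and destination path below are placeholders.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

cluster_spec = {
    "cluster_name": "jobs-cluster",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    # Logs are delivered to this path every few minutes; grant the DevOps
    # team read access there instead of making them workspace admins.
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/mnt/cluster-logs"}},
}

resp = requests.post(f"{HOST}/api/2.0/clusters/create",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=cluster_spec)
print(resp.json())
```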
User16857281869
by New Contributor II
  • 1227 Views
  • 1 reply
  • 1 kudos

Resolved! Why do I see a cost explosion in my blob storage account (DBFS storage, blob storage, ...) for my Structured Streaming job?

It's usually one or more of the following reasons: 1) If you are streaming into a table, you should be using the .trigger option to specify the frequency of checkpointing. Otherwise, the job will call the storage API every 10 ms to log the transaction data...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

A few recommendations:
  • Mount cheaper storage (LRS) as a custom mount and keep checkpoints there.
  • Clear data regularly.
  • If you are using foreach/foreachBatch in the stream, it will save every DataFrame to DBFS.
  • Remember not to use display() in production.
  • If on th...

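A minimal sketch of the .trigger fix from the original question, assuming a Databricks notebook with an existing streaming DataFrame `df`; the paths and interval are illustrative.

```python
# Checkpoint on a cheap mounted location, and trigger on an explicit
# interval instead of the default ~10 ms micro-batch cadence.
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/cheap-lrs/checkpoints/my_stream")
   .trigger(processingTime="10 minutes")   # checkpoint every 10 minutes
   .start("/mnt/datalake/tables/events"))  # illustrative output path
```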
User16857281869
by New Contributor II
  • 872 Views
  • 1 reply
  • 1 kudos

Resolved! What is the best way to do time series analysis and forecasting with Spark?

We have developed a library on Spark which makes typical operations on time series much simpler. You can check the repo on GitHub for more info. You could also check out one of our blogs, which demos an implementation of a forecasting use case with S...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

Currently on Databricks there is MLflow with a forecasting option – please check it out.

User16869510359
by Esteemed Contributor
  • 764 Views
  • 1 reply
  • 0 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

This is a list of configuration keys to enable or alter the blacklist mechanism:
  • spark.blacklist.enabled – set to true
  • spark.blacklist.task.maxTaskAttemptsPerExecutor (1 by default)
  • spark.blacklist.task.maxTaskAttemptsPerNode (2 by default)
  • spark.blacklis...

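A minimal sketch of setting those keys. On Databricks they would normally go into the cluster's Spark config, since they must be in place before the SparkContext starts; they are shown as builder configs here only to keep the sketch self-contained. Values shown are the defaults listed above.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.blacklist.enabled", "true")
         # Retries allowed on the same executor before it is blacklisted:
         .config("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
         # Retries allowed on the same node before it is blacklisted:
         .config("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
         .getOrCreate())
```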
DievanB
by New Contributor
  • 1168 Views
  • 1 reply
  • 0 kudos

pyspark: How to run selenium in UDF

Hi all, I am building a web scraper to get prices for certain EANs from the Amazon website, so I use Selenium to get the product links. I wrote the following function to get the product links based on an EAN: def getProductLinkAmazonPY(EAN): st...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

UDFs are serialized and then executed on the executors. I don't think this will be possible with Selenium.

User16752244127
by Contributor
  • 9609 Views
  • 3 replies
  • 5 kudos
Latest Reply
Atanu
Esteemed Contributor
  • 5 kudos

Currently supported data sources with Databricks: https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/ – and maybe this blog will have more insight: https://blogs.sap.com/2019/10/24/your-sap-on-azure-part-22-read-sap-hana-data-from-azu...

2 More Replies
Emre
by New Contributor II
  • 745 Views
  • 1 reply
  • 2 kudos

Resolved! The license of JDBC connector for BI vendors

Hey all, we would like to support Databricks in our BI tool, which is an open-source Java application (see https://github.com/metriql/metriql). In order to connect to Databricks, we need to use the JDBC connector, similar to other BI tools such as Look...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

It doesn't look so bad after all (meaning the terms and conditions at https://databricks.com/jdbc-odbc-driver-license), but I think the best solution is to open a ticket via https://databricks.com/company/contact
