I know you can set "spark.sql.shuffle.partitions" and "spark.sql.adaptive.advisoryPartitionSizeInBytes". The former does not take effect when adaptive query execution is enabled, and the latter only works for the first shuffle for some reason, after which it just uses...
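If it helps to see the settings side by side, here is a minimal sketch (the values are placeholders, not recommendations):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enables AQE, which re-plans partitioning at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# With AQE enabled this acts only as the initial shuffle partition count;
# AQE may coalesce partitions afterwards based on runtime statistics.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Target size AQE aims for when coalescing shuffle partitions.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
```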
When I aggregate over more data, I get the error message below. I've tried multiple ways of diagnosing it, such as going back to a version I know was working fine (but I still got the same error below). Please advise, as this is a critical report where the b...
@Jeff Wu: The error message suggests that there is a syntax error in a SQL statement, specifically near the end of the input. Without the full SQL statement or additional information, it's difficult to pinpoint the exact cause of the error. However,...
I am posting this on behalf of my customer. They are currently working on the deployment & config of their workspace on AWS via Terraform. Is it possible to set some configs in the Admin/workspace settings via TF? According to the Terraform module, it...
Hi! I am training a Random Forest (pyspark.ml.classification.RandomForestClassifier) on Databricks with 1,000,000 training examples and 25 features. I employ a cluster with one driver (16 GB Memory, 4 Cores), 2-6 workers (32-96 GB Memory, 8-24 Cores),...
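For reference, a minimal sketch of the training setup described above (the column names and hyperparameters are hypothetical, since the post does not show them):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Assume raw_df holds the 1,000,000 rows with 25 feature columns f0..f24
# plus a "label" column; these names are made up for illustration.
assembler = VectorAssembler(inputCols=[f"f{i}" for i in range(25)], outputCol="features")
train = assembler.transform(raw_df).select("features", "label")

rf = RandomForestClassifier(
    labelCol="label",
    featuresCol="features",
    numTrees=100,   # placeholder; tree count and maxDepth drive memory usage
    maxDepth=10,
)
model = rf.fit(train)
```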
Hi @John B, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can...
Hi Team, good morning. As of now, for the deployment of our code to Databricks, dbx is configured by providing parameters such as cloud provider, git provider, etc. Say I have a code repository in any one of the git providers. Can this process of co...
Hi @Arunsundar Muthumanickam, hope all is well! Just wanted to check in if you were able to resolve your issue; if so, would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear fr...
I'm not able to get the SET command to work when using SQL in a DLT pipeline. I am copying the code from this documentation https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-sql-ref.html#sql-spec (relevant code below). When I ru...
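While that gets sorted out, one hedged alternative is to parameterize the pipeline from its Configuration settings and read the value in a Python DLT notebook via spark.conf.get (the table and key names below are hypothetical):

```python
import dlt
from pyspark.sql.functions import col

@dlt.table
def filtered_data():
    # "startDate" would be defined under the pipeline's Configuration settings.
    start_date = spark.conf.get("startDate")
    return spark.read.table("my_source_table").where(col("date") > start_date)
```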
Hi @Oliver Teng, hope all is well! Just wanted to check in if you were able to resolve your issue; if so, would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Thanks...
I have a cluster with a configuration of 400 GB RAM and 160 cores. Which of the following would be the ideal configuration to use in case of one or more VM failures?
Cluster A: Total RAM 400 GB, Total Cores 160, Total VMs: 1, 400 GB/Exec & 160 c...
@Santhosh Raj: Can you please confirm whether the cluster sizes you list refer to the driver and worker nodes? How much do you want to allocate to the driver and to the workers? Once we are sure about the type of driver and worker we would like to pick, we need to enable au...
Hi, is it possible to let regular users see all running notebooks (in the notebook panel of the cluster) on a specific cluster they can use (attach and restart)? By default, admins can see all running notebooks while users can see only their own notebo...
Hi @Philippe CRAVE, a user can see a notebook only if they have permission on that notebook; otherwise they won't be able to see it. Unfortunately, there is no way for a normal user to see the notebooks attached to a cluster if they do not have per...
May I know a suggested way to handle different environment variables for the same code base? For example, the mount point of the Data Lake for DEV, UAT, and PROD. Any recommendations or best practices? Moreover, how should this be handled in Azure DevOps?
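One common pattern is to key all environment-specific values off a single parameter (an environment variable, job parameter, or widget) and keep the code base identical across DEV/UAT/PROD. A hedged sketch with hypothetical names, assuming a Databricks notebook where spark is predefined:

```python
import os

# Central map of per-environment values; extend with whatever differs per stage.
ENV_CONFIG = {
    "dev":  {"mount_point": "/mnt/datalake-dev"},
    "uat":  {"mount_point": "/mnt/datalake-uat"},
    "prod": {"mount_point": "/mnt/datalake-prod"},
}

# DEPLOY_ENV could be set per cluster (environment variable), per job
# (parameter), or injected by an Azure DevOps release pipeline.
env = os.environ.get("DEPLOY_ENV", "dev")
mount_point = ENV_CONFIG[env]["mount_point"]

df = spark.read.parquet(f"{mount_point}/some/table")
```

On the Azure DevOps side, the same idea usually means one pipeline with a stage per environment, where each stage sets DEPLOY_ENV (or a variable group) rather than branching the code.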
Could you please suggest the best cluster configuration for the use case stated below, along with tips to resolve the errors shown below?
Use case: There could be 4 or 5 Spark jobs that run concurrently. Each job reads 40 input files and spits out 120 output files ...
I've created other mount points and am now trying to use the OAuth method. I'm able to define the mount point using the OAuth mount to ADLS Gen2 storage. I've created an App Registration with a secret, and added the App Registration as Contributor to the ...
Also check whether you set the right permissions for the app on the container's ACL: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control
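For comparison, the documented OAuth mount pattern for ADLS Gen2 looks roughly like this (all angle-bracket values are placeholders; the client secret should come from a Databricks secret scope):

```python
# OAuth configuration for mounting ADLS Gen2 via a service principal.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<scope-name>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)
```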
Hello, I am trying to use MLflow on a new high concurrency cluster but I get the error below. Does anyone have any suggestions? It was working before on a standard cluster. Thanks.

py4j.security.Py4JSecurityException: Method public int org.apache.spar...
@Tom Soto: We have a workaround for this. The following cluster Spark configuration setting will disable Py4J security while still enabling passthrough:
```
spark.databricks.pyspark.enablePy4JSecurity false
```
Hi all, I am new to Azure Databricks and I am using PySpark. We need to configure mail alerts for when a notebook fails or succeeds. Could someone please help me with mail configuration in Azure Databricks? Thanks
The easiest way to schedule notebooks in Azure is to use Data Factory. In Data Factory you can schedule the notebooks and define the alerts you want to send. The other option is the one Hubert mentioned.
On a regular cluster, you can use:
```
spark.sparkContext._jsc.hadoopConfiguration().set(key, value)
```
These values are then available on the executors through the Hadoop configuration. However, on a high concurrency cluster, attempting to do so results ...
I am not sure why you are getting that error on a high concurrency cluster, as I am able to set the configuration as you show above. Can you try the following code instead?
```
sc._jsc.hadoopConfiguration().set(key, value)
```
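As a quick sanity check, a value set this way should round-trip on the driver (the key below is made up for illustration):

```python
# Set and read back a Hadoop configuration value on the driver.
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("my.custom.key", "some-value")
print(sc._jsc.hadoopConfiguration().get("my.custom.key"))  # -> some-value
```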
This is an old post; however, is this still accurate for the latest version of Databricks in 2019? If so, how would one approach the following?
1. Connect to many MongoDBs.
2. Connect to MongoDB when the connection string information is dynamic (i.e. stored in s...
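For the dynamic-connection case, a hedged sketch of how it might look with the MongoDB Spark connector (option names follow the 2.x/3.x connector; the 10.x connector uses format "mongodb" and "connection.uri" instead; all identifiers are hypothetical):

```python
def read_collection(uri: str, database: str, collection: str):
    # uri can be built at runtime, e.g. from values stored in a secret scope.
    return (
        spark.read.format("mongo")
        .option("uri", uri)
        .option("database", database)
        .option("collection", collection)
        .load()
    )

# Connecting to many MongoDB instances by looping over their URIs:
for uri in ["mongodb://host-a:27017", "mongodb://host-b:27017"]:
    df = read_collection(uri, "mydb", "events")
```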