Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Jiri_Koutny
by Databricks Partner
  • 10536 Views
  • 5 replies
  • 4 kudos

Resolved! Programmatic access to Files in Repos

Hi, we are testing the new Files support in Databricks Repos. Is there a way to programmatically read notebooks? Thanks

Latest Reply
User16871418122
Databricks Employee
  • 4 kudos

Hi @Jiri Koutny​, these files should be synced to your remote repository (Git, Bitbucket, GitLab, etc.) anyway. The APIs from version-control tools (the Git API, for example) might help you achieve what you want. https://stackoverflow.com/questions/38491722/r...
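
As a minimal sketch of that Git-API route (assuming the repo is hosted on GitHub; the owner, repo, path, and token below are placeholders, not values from this thread):

```python
import base64
import requests

# Placeholders: replace with your own repo coordinates and token.
OWNER, REPO, PATH = "my-org", "my-repo", "notebooks/etl.py"
TOKEN = "<github-personal-access-token>"

# GitHub's contents API returns file metadata plus base64-encoded content.
resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/contents/{PATH}",
    headers={"Authorization": f"token {TOKEN}"},
)
resp.raise_for_status()
notebook_source = base64.b64decode(resp.json()["content"]).decode("utf-8")
print(notebook_source)
```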

4 More Replies
Anonymous
by Not applicable
  • 1583 Views
  • 1 reply
  • 0 kudos

Is there an equivalent of the %debug from Jupyter notebooks in Databricks notebooks for debugging Python notebooks?

Latest Reply
Dileep_Vidyadar
New Contributor III
  • 0 kudos

Hi @Nathan Tong​, you can go through the two articles below that I found online on debugging in Databricks:
1. 7 Tips to Debug Apache Spark Code Faster with Databricks
2. Easier Spark Code Debugging
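
For what it's worth, no direct %debug equivalent is cited in this thread; a runtime-agnostic fallback (an assumption, not an official Databricks feature) is to capture the post-mortem traceback yourself:

```python
import traceback

def cell_logic():
    # Hypothetical failing code you want to inspect.
    return 1 / 0

try:
    cell_logic()
except Exception:
    # Prints the full stack trace, i.e. the starting point %debug would give you.
    traceback.print_exc()
```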

ashu208
by New Contributor
  • 3916 Views
  • 4 replies
  • 0 kudos

I am not able to create a cluster

Hi, I am new to the Databricks platform. A few weeks ago I created a Community Edition account and it worked perfectly until two days ago; now I cannot create a cluster anymore. It times out after a few minutes whenever I try to create a new cluster...

Latest Reply
Dileep_Vidyadar
New Contributor III
  • 0 kudos

Hi @Ashwinkumar Jayakumar​ and @Prabakar Ammeappin​, I have been facing the same issue for 3-4 days. Is there something wrong with Community Edition right now, or is my account facing some issue?

3 More Replies
brickster_2018
by Databricks Employee
  • 3752 Views
  • 2 replies
  • 0 kudos

Resolved! External metastore version

I am setting up an external metastore to connect to my Databricks cluster. Which Hive metastore version is preferred and recommended? Also, are there any preferences or recommendations on the database instance size/type?

Latest Reply
prasadvaze
Valued Contributor II
  • 0 kudos

@Harikrishnan Kunhumveettil​, we use Databricks Runtime 7.3 LTS and 9.1 LTS, with an external Hive metastore hosted on Azure SQL DB. Using a global init script I have set spark.sql.hive.metastore.version to 2.3.7 and downloaded spark.sql.hive.metastore.jars f...
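
A small sketch to verify from a notebook that the init script took effect (only the 2.3.7 version comes from the reply above; the "maven" jars source is an assumption):

```python
# Check the effective external-metastore settings on the running cluster.
expected = {
    "spark.sql.hive.metastore.version": "2.3.7",
    "spark.sql.hive.metastore.jars": "maven",  # assumption: jars fetched from Maven
}

for key, want in expected.items():
    got = spark.conf.get(key, None)  # `spark` is predefined in Databricks notebooks
    print(f"{key}: got {got!r}, expected {want!r}")
```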

1 More Replies
sarvesh
by Contributor III
  • 1767 Views
  • 0 replies
  • 0 kudos

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark config: spark.executor.memory;

I am trying to read a 16 MB Excel file and was getting a GC overhead limit exceeded error. To resolve it, I tried to increase my executor memory with spark.conf.set("spark.executor.memory", "8g"), but I got the following stack trace: Using Spark's default l...
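
The error is expected: spark.executor.memory is fixed when the executors launch, so it cannot be changed with spark.conf.set on a running session. A sketch for a standalone script (on Databricks itself, set it in the cluster's Spark config instead):

```python
from pyspark.sql import SparkSession

# Executor memory must be set before the session (and its executors) start,
# so pass it to the builder instead of calling spark.conf.set later.
spark = (
    SparkSession.builder
    .appName("excel-read")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```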

amichel
by New Contributor III
  • 9578 Views
  • 3 replies
  • 2 kudos

Resolved! Is there a way to refresh tokens issued on behalf of service principal?

I want to be able to refresh tokens generated on behalf of a service principal via the Token Management API, just like with any other service where OAuth is used and a refresh-token endpoint is available. Allowing indefinite or very long expiration for acc...

Latest Reply
Hubert-Dudek
Databricks MVP
  • 2 kudos

A refresh option would be useful. In Azure you could use Azure Automation to make a "refresh" script: delete the token if it still exists, create a new one via "databricks tokens create", and put it in Azure Key Vault with an expiration date.
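
A minimal sketch of that rotation script using the Token API over REST (the host and admin token are placeholders; pushing the result to Key Vault is omitted):

```python
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
ADMIN_TOKEN = "<token-with-token-management-rights>"    # placeholder

# Create a short-lived PAT; run this on a schedule (e.g. from Azure Automation)
# and store the new value in Key Vault before the old token expires.
resp = requests.post(
    f"{HOST}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={"lifetime_seconds": 86400, "comment": "rotated service token"},
)
resp.raise_for_status()
new_token = resp.json()["token_value"]
```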

2 More Replies
AzureDatabricks
by New Contributor III
  • 12594 Views
  • 7 replies
  • 2 kudos

Resolved! Can we store 300 million records and what is the preferable compute type and config?

How can we persist 300 million records? What is the best option to persist the data: the Databricks Hive metastore, Azure Storage, or a Delta table? What limitations do Databricks Delta tables have in terms of data volume? We have a use case where testers should be...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

You can certainly store 300 million records without any problem. The best option kinda depends on the use case. If you want to do a lot of online querying on the table, I suggest using Delta Lake, which is optimized (using Z-ordering, bloom filters, par...
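
For illustration, the Delta Lake optimizations mentioned look roughly like this (the table and column names are made up for the example):

```python
# Persist the records as a Delta table (`df` is assumed to exist)...
df.write.format("delta").mode("overwrite").saveAsTable("events")

# ...then co-locate rows on the lookup key so selective queries read fewer files.
spark.sql("OPTIMIZE events ZORDER BY (customer_id)")
```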

6 More Replies
AzureDatabricks
by New Contributor III
  • 7594 Views
  • 8 replies
  • 4 kudos

Resolved! Need to see all the records in DeltaTable. Exception - java.lang.OutOfMemoryError: GC overhead limit exceeded

truncate=False is not working on a Delta table: df_delta.show(df_delta.count(), False). Compute: Single Node - Standard_F4S - 8 GB memory, 4 cores. How much data can we persist in a Delta table (as Parquet files), and how fast can we retrieve it?
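
For context, df_delta.show(df_delta.count(), False) asks the driver to render every row, which is what exhausts an 8 GB single node. A hedged sketch of safer alternatives:

```python
# Inspect a bounded sample instead of printing the whole table on the driver.
df_delta.show(100, truncate=False)

# For a full copy, write distributed output instead of displaying it
# (the target path is a placeholder).
df_delta.write.format("delta").mode("overwrite").save("/mnt/tmp/df_delta_copy")
```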

Latest Reply
AzureDatabricks
New Contributor III
  • 4 kudos

thank you !!!

7 More Replies
Hola1801
by New Contributor
  • 3000 Views
  • 3 replies
  • 3 kudos

Resolved! Float values change when loaded with Spark? Full path?

Hello, I have created my table in Databricks, and at this point everything is perfect: I get the same values as in my CSV. For my column "Exposure" I have:
0 0,00
1 0,00
2 0,00
3 0,00
4 0,00
...
But when I load my fi...

Latest Reply
jose_gonzalez
Databricks Employee
  • 3 kudos

Hi @Anis Ben Salem​, how do you read your CSV file? Do you use the Pandas or the PySpark APIs? Also, how did you create your table? Could you share more details on the code you are trying to run?
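
One hedged guess, given the comma decimal separator in "0,00": read the column as a string and convert it explicitly. A sketch (the file path is a placeholder; only the "Exposure" column name comes from the post):

```python
from pyspark.sql import functions as F

# Read everything as strings first, then turn the comma-decimal text into a double.
raw = spark.read.option("header", "true").csv("/mnt/data/exposure.csv")  # placeholder path
df = raw.withColumn("Exposure", F.regexp_replace("Exposure", ",", ".").cast("double"))
```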

2 More Replies
Abela
by New Contributor III
  • 3604 Views
  • 3 replies
  • 3 kudos

Resolved! Specify cluster name in notebook

Is there any way to specify that a particular cluster should be used in the Python cell of a notebook?

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@Alina Bella​ - If werners' answer solved the issue, would you be happy to mark their answer as best? That will help others find the solution more easily in the future.

2 More Replies
sarvesh
by Contributor III
  • 8163 Views
  • 3 replies
  • 6 kudos

Resolved! Can we use Spark Streaming to read/write data from MySQL? I can't find an example.

If someone can link me to an example where a stream is used to read from or write to MySQL, please do.

Latest Reply
Hubert-Dudek
Databricks MVP
  • 6 kudos

Writing (as a sink) is possible without problems via foreachBatch. I use it in production: I stream-autoload CSVs from the data lake and write via foreachBatch to SQL (inside the foreachBatch function you have a temporary DataFrame with the records, and you just use w...
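
A minimal sketch of that pattern (Auto Loader source, JDBC sink; every connection detail and path is a placeholder, and `schema` is assumed to be defined):

```python
def write_to_mysql(batch_df, batch_id):
    # Each micro-batch is a normal DataFrame, so the plain JDBC writer works.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:mysql://host:3306/db")  # placeholder
        .option("dbtable", "events")                 # placeholder
        .option("user", "user")
        .option("password", "pw")                    # placeholder
        .mode("append")
        .save())

stream = (spark.readStream
    .format("cloudFiles")                            # Databricks Auto Loader
    .option("cloudFiles.format", "csv")
    .schema(schema)                                  # `schema` assumed defined
    .load("/mnt/landing/"))                          # placeholder path

query = (stream.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/mysql")  # placeholder
    .foreachBatch(write_to_mysql)
    .start())
```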

2 More Replies
AzureDatabricks
by New Contributor III
  • 9110 Views
  • 5 replies
  • 1 kudos

Parallel processing of json files in databricks pyspark

How can we read files from Azure Blob Storage and process them in parallel in Databricks using PySpark? As of now we are reading all 10 files at a time into a DataFrame and flattening it. Thanks & Regards, Sujata

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

spark.read.json("/mnt/dbfs/<ENTER PATH OF JSON DIR HERE>/*.json"). You first have to mount your blob storage to Databricks; I assume that is already done. https://spark.apache.org/docs/latest/sql-data-sources-json.html
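
Spelled out as runnable code (the mount path and the nested column name are placeholders): a single read over the directory lets Spark assign the files to tasks in parallel by itself.

```python
# One call over the whole directory; Spark reads the files in parallel.
df = spark.read.json("/mnt/<container>/json-dir/*.json")  # placeholder mount path

# Flatten one level of nesting, if the schema has a struct column (name assumed).
flat = df.select("id", "payload.*")
```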

4 More Replies
Anonymous
by Not applicable
  • 5368 Views
  • 5 replies
  • 0 kudos

Resolved! How can a standalone Spark JAR run from IntelliJ IDEA use libraries installed on a Databricks cluster (DBR)?

Hello, I have tried without success to use several libraries we installed on the Databricks 9.1 cluster (not provided by default in DBR) from a standalone Spark application run from IntelliJ IDEA. For instance, for connecting to Redshift it works onl...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Unfortunately, I did not find any solution. We have to package a JAR and run it from a Databricks job for test/debug. Not efficient, but no solution for remote debugging has been found/provided.

4 More Replies
Vibhor
by Contributor
  • 8825 Views
  • 5 replies
  • 13 kudos

Resolved! ADF Pipeline - Notebook Run time

In an ADF pipeline, can we specify to exit a notebook and proceed to another notebook after some threshold value like 15 minutes? For example, I have a pipeline with notebooks scheduled in sequence and want the pipeline to keep running that notebook for a cert...

Latest Reply
jose_gonzalez
Databricks Employee
  • 13 kudos

Hi @Vibhor Sethi​, there is a global timeout in Azure Data Factory (ADF) that you can use to stop the pipeline. In addition, you can use the notebook timeout in case you want to control it from your Databricks job.
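
A sketch of the notebook-side timeout (the child notebook path is a placeholder); dbutils.notebook.run raises once the timeout elapses, so the parent can move on:

```python
# Run a child notebook but give up on it after 15 minutes (900 seconds).
try:
    result = dbutils.notebook.run("/Repos/project/child_notebook", 900)
except Exception as e:
    # A timeout (or failure) lands here; proceed to the next notebook.
    print(f"Child notebook did not finish in time: {e}")
```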

4 More Replies