Data Engineering

Forum Posts

by Karankaran_alan • New Contributor

08-17-2021 8:19:33 AM

1777 Views
1 replies
0 kudos

cluster not getting created, timing out

Hello - I've been using Databricks notebooks (for PySpark or Scala/Spark development), and recently cluster creation has been taking a long time, often timing out. Any ideas on how to resolve this?

Latest Reply

jose_gonzalez
Databricks Employee

09-29-2021 10:54:34 AM

0 kudos

Hi Karankaran.alang, what is the error message you are getting? Did you get this error while creating/starting a cluster in CE? Sometimes these errors are intermittent and go away after a few retries. Thank you.

by RajaLakshmanan • New Contributor

08-20-2021 5:55:04 AM

4404 Views
2 replies
1 kudos

Resolved! Spark StreamingQuery not processing all data from source directory

Hi, I have set up a streaming process that consumes files from an HDFS staging directory and writes them to a target location. The input directory continuously receives files from another process. Let's say the file producer produces 5 million records and sends them to hdfs sta...

Latest Reply

User16763506586
Contributor

09-29-2021 4:12:37 AM

1 kudos

If it helps, you can try running a left anti join between source and sink to identify the missing records, and then check whether those records match the provided schema.
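
A minimal PySpark sketch of the left anti join suggested in this reply. The function name and the key column list are hypothetical; it assumes both the source and the sink can be read back as DataFrames that share the key columns.

```python
def find_missing_records(source_df, sink_df, key_columns):
    """Return the rows of source_df whose key has no match in sink_df.

    Both arguments are expected to be Spark DataFrames. key_columns is a
    list of column names (hypothetical here) that identify a record.
    """
    # A left anti join keeps only the left-side rows that have no
    # matching row on the right side for the given join keys.
    return source_df.join(sink_df, on=key_columns, how="left_anti")


# Example call (paths and the "id" column are placeholders):
# missing = find_missing_records(spark.read.parquet(src_path),
#                                spark.read.parquet(sink_path), ["id"])
# missing.printSchema()  # compare against the schema given to the stream
```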

1 More Replies

by User15787040559 • Databricks Employee

06-07-2021 9:02:46 AM

2770 Views
1 replies
1 kudos

How can I get Databricks notebooks to stop cutting off the explain plans?

(Since Spark 3.0) Dataset.queryExecution.debug.toFile will dump the full plan to a file, without concatenating the output as a fully materialized Java string in memory.
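
The toFile call above is a Scala-side API. From a Python notebook, one possible alternative (a different mechanism, not the same call) is to capture what DataFrame.explain() prints and write it to a file. This sketch does build the plan as a string in memory, and Spark's own plan-string limits may still apply; the function name and path are hypothetical.

```python
import contextlib
import io


def dump_plan_to_file(df, path):
    """Write the extended explain output of a DataFrame to a text file."""
    buffer = io.StringIO()
    # PySpark's DataFrame.explain() prints the plan to stdout, so the
    # output can be captured by temporarily redirecting stdout.
    with contextlib.redirect_stdout(buffer):
        df.explain(extended=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(buffer.getvalue())
    return path
```

The resulting file can then be opened outside the notebook, which avoids the notebook cell output being cut off.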

Latest Reply

dazfuller
Contributor III

09-28-2021 12:16:03 PM

1 kudos

Notebooks really aren't the best method of viewing large files. Two methods you could employ are:
- Save the file to DBFS and then use the Databricks CLI to download the file
- Use the web terminal
In the web terminal option you can do something like "cat my_lar...

by YuvSaha • New Contributor

08-04-2021 7:44:23 AM

1535 Views
1 replies
0 kudos

Auto Loader for the Shape File?

Hello, as you can see from the link below, Auto Loader supports 7 file formats. I am dealing with geospatial shapefiles and I want to know whether Auto Loader can support them. Any help on this is greatly appreciated.
avro: Avro file
binaryFile: Binary f...

Latest Reply

dbkent
Databricks Employee

09-28-2021 11:19:01 AM

0 kudos

Hi @Yuv Saha, currently, shapefiles are not a supported file type when using Auto Loader. Would you be willing to share more about your use case? I am the Product Manager responsible for Geospatial in Databricks, and I need help from customers like ...

by delta_lake • New Contributor

09-17-2021 2:36:53 AM

5790 Views
1 replies
0 kudos

Delta Lake Python

I have set up a virtual environment inside my existing Hadoop cluster. Since the current cluster does not have Spark >3, I installed delta spark in the virtual environment. While trying to access HDFS, which is Kerberized, I am getting the error below...

Latest Reply

User16753724663
Valued Contributor

09-28-2021 10:24:25 AM

0 kudos

Hi @Vasanth P, could you please confirm the DBR version that you are using in the cluster? Multiple DBR versions are available, with different Spark and Scala versions, which can be used directly in a Databricks cluster.

by EricOX • New Contributor

09-28-2021 2:24:33 AM

5748 Views
1 replies
3 kudos

Resolved! How to handle configuration for different environments (e.g. DEV, PROD)?

May I know any suggested way to handle different environment variables for the same code base? For example, the mount point of Data Lake for DEV, UAT, and PROD. Any recommendations or best practices? Moreover, how to handle Azure DevOps?

Latest Reply

-werners-
Esteemed Contributor III

09-28-2021 3:05:03 AM

3 kudos

@Eric Yeung, you can put all your configuration parameters in a file (JSON, CONF, YAML, whatever you like) and read that file at the beginning of each program. I like to use the ConfigFactory in Scala, for example. You only have to make sure the file c...
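
A minimal sketch of the config-file approach described in this reply, written in Python with a JSON file rather than the Scala ConfigFactory the author mentions. The file layout, the APP_ENV variable name, and the mount_point key are all assumptions made for illustration.

```python
import json
import os


def load_config(config_path, env=None):
    """Load the settings for one environment from a JSON config file.

    Assumed file layout (hypothetical):
    {"DEV": {"mount_point": "/mnt/dev"}, "PROD": {"mount_point": "/mnt/prod"}}
    """
    # APP_ENV is a hypothetical variable name; fall back to DEV if unset.
    env = env or os.environ.get("APP_ENV", "DEV")
    with open(config_path, encoding="utf-8") as handle:
        all_settings = json.load(handle)
    if env not in all_settings:
        raise KeyError(f"No configuration for environment {env!r}")
    return all_settings[env]
```

Each program would call load_config once at startup and read values such as the mount point from the returned dictionary, so the same code base runs unchanged in every environment.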

by DouglasLinder • New Contributor III

09-21-2021 10:32:19 PM

11366 Views
4 replies
1 kudos

Is it possible to pass configuration to a job on a high concurrency cluster?

On a regular cluster, you can use:
```
spark.sparkContext._jsc.hadoopConfiguration().set(key, value)
```
These values are then available on the executors using the Hadoop configuration. However, on a high concurrency cluster, attempting to do so results ...

Latest Reply

Ryan_Chynoweth
Esteemed Contributor

09-22-2021 1:21:31 PM

1 kudos

I am not sure why you are getting that error on a high concurrency cluster, as I am able to set the configuration as you show above. Can you try the following code instead?
sc._jsc.hadoopConfiguration().set(key, value)

3 More Replies

by Nazar • New Contributor II

09-23-2021 3:06:15 PM

7262 Views
3 replies
4 kudos

Resolved! Incremental write

Hi all, I have a daily Spark job that reads and joins 3-4 source tables and writes the df in Parquet format. This data frame consists of 100+ columns. As this job runs daily, our deduplication logic identifies the latest record from each of source t...

Latest Reply

Nazar
New Contributor II

09-27-2021 2:55:33 PM

4 kudos

Thanks werners

2 More Replies

by William_Scardua • Valued Contributor

09-23-2021 8:21:30 AM

10048 Views
5 replies
3 kudos

Resolved! Read just the new file?

Hi guys, how can I read just the new file in a batch process? Can you help me, please? Thank you.

Latest Reply

Ryan_Chynoweth
Esteemed Contributor

09-23-2021 8:41:47 AM

3 kudos

What type of file? Is the file stored in a storage account? Typically, you would read and write data with something like the following code:
# read a parquet file
df = spark.read.format("parquet").load("/path/to/file")
# write the data as a file
df...
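
One possible way to pick up only new files in a batch job is to keep a small manifest of paths that have already been processed. This is a sketch under assumptions, not the approach the thread settled on: it assumes the input directory is reachable as a local filesystem path, and the manifest location and function names are hypothetical.

```python
import json
import os


def _read_manifest(manifest_path):
    """Return the set of already processed paths (empty if no manifest yet)."""
    if not os.path.exists(manifest_path):
        return set()
    with open(manifest_path, encoding="utf-8") as handle:
        return set(json.load(handle))


def list_new_files(input_dir, manifest_path):
    """Return files in input_dir that are not yet recorded in the manifest."""
    current = {os.path.join(input_dir, name) for name in os.listdir(input_dir)}
    return sorted(current - _read_manifest(manifest_path))


def mark_processed(paths, manifest_path):
    """Add paths to the manifest once they have been loaded successfully."""
    processed = _read_manifest(manifest_path)
    processed.update(paths)
    with open(manifest_path, "w", encoding="utf-8") as handle:
        json.dump(sorted(processed), handle)
```

The manifest should live outside the input directory so it is not listed as a data file, and mark_processed should only be called after the batch write has succeeded.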

4 More Replies

by Meaz10 • New Contributor III

09-24-2021 6:00:20 AM

2545 Views
3 replies
2 kudos

Resolved! Current DBR is not yet available to this notebook

Does anyone have an idea why I am getting this error: "The current DBR is not yet available to this notebook. Give it a second and try again!"

Latest Reply

Anonymous
Not applicable

09-27-2021 9:50:28 AM

2 kudos

@Meysam az - Thank you for letting us know that the issue has been resolved and for the extra information.

2 More Replies

by SreedharVengala • New Contributor III

09-20-2021 6:34:20 PM

12865 Views
2 replies
1 kudos

Parsing deeply nested XML in Databricks

Hi guys, can someone point me to libraries to parse XML files in Databricks using Python / Scala? Any link to a blog or documentation will be helpful. I looked into https://docs.databricks.com/data/data-sources/xml.html. I want to parse XSDs; it seems this is exp...

Latest Reply

Anonymous
Not applicable

09-27-2021 9:32:19 AM

1 kudos

@Sreedhar Vengala - I heard back from the team. As you noted, the feature is still experimental and not supported at this time. I would like to assure you that the team is aware of this. I have no information about a time frame to make this a support...
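
For plain XML parsing (as opposed to the experimental XSD support discussed in this thread), a minimal PySpark sketch might look like the following. It assumes the spark-xml library is installed on the cluster; the function name, path, and row tag are placeholders.

```python
def read_xml(spark, path, row_tag):
    """Read an XML file into a DataFrame, one row per row_tag element.

    Assumes the spark-xml data source is available on the cluster.
    """
    return (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", row_tag)  # element that delimits one record
        .load(path)
    )


# Example call (placeholder values):
# books = read_xml(spark, "/path/to/books.xml", "book")
# books.printSchema()  # nested elements appear as struct/array columns
```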

1 More Replies

by PraveenKumar188 • New Contributor

09-21-2021 5:49:16 AM

4124 Views
2 replies
2 kudos

Resolved! Is it possible to mount multiple ADLS Gen2 storage paths in a single workspace?

Hello experts, we are looking into the feasibility of mounting more than one ADLS Gen2 storage on a single Databricks workspace. Best regards, Praveen

Latest Reply

Erik
Valued Contributor III

09-27-2021 5:28:44 AM

2 kudos

Yes, it's possible; we are doing it. Just mount them to different folders, as @Werner Stinckens is saying.
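
A minimal sketch of mounting several ADLS Gen2 locations under different folders, as this reply suggests. The function name, mount points, and source URLs are hypothetical, and for simplicity it assumes one set of authentication settings (extra_configs) works for every source; with several storage accounts you would likely pass a separate set per mount.

```python
def mount_containers(dbutils, mounts, extra_configs):
    """Mount each source under its own folder, skipping existing mounts.

    mounts maps a mount point (e.g. "/mnt/sales") to an abfss:// source URL.
    extra_configs holds the authentication settings (assumed to be shared).
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    mounted = []
    for mount_point, source in mounts.items():
        if mount_point in existing:
            continue  # already mounted, nothing to do
        dbutils.fs.mount(
            source=source,
            mount_point=mount_point,
            extra_configs=extra_configs,
        )
        mounted.append(mount_point)
    return mounted


# Example call from a notebook (placeholder names):
# mount_containers(dbutils, {
#     "/mnt/sales": "abfss://sales@account1.dfs.core.windows.net/",
#     "/mnt/hr": "abfss://hr@account2.dfs.core.windows.net/",
# }, configs)
```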

1 More Replies

by User15787040559 • Databricks Employee

06-07-2021 9:07:05 AM

2367 Views
2 replies
0 kudos

What subset of MySQL SQL syntax do we support in Spark SQL?

https://spark.apache.org/docs/latest/sql-ref-syntax.html

Latest Reply

brickster_2018
Databricks Employee

06-24-2021 1:45:34 AM

0 kudos

Spark 3 has experimental support for ANSI. Read more here: https://spark.apache.org/docs/3.0.0/sql-ref-ansi-compliance.html
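
As a small sketch of how the ANSI mode mentioned here can be switched on for a session from PySpark: the helper name is hypothetical, while spark.sql.ansi.enabled is the setting described on the linked compliance page.

```python
def enable_ansi_mode(spark):
    """Turn on ANSI SQL mode for the current session and return the value set."""
    spark.conf.set("spark.sql.ansi.enabled", "true")
    return spark.conf.get("spark.sql.ansi.enabled")
```

With the setting enabled, Spark SQL follows ANSI behaviour more closely, which can matter when porting queries written for another SQL dialect.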

1 More Replies

by HafidzZulkifli • New Contributor II

11-13-2017 1:51:02 AM

17075 Views
8 replies
0 kudos

How to import data and apply multiline and charset UTF8 at the same time?

I'm running Spark 2.2.0 at the moment. Currently I'm facing an issue when importing data of Mexican origin, where the text can contain special characters and certain columns span multiple lines. Ideally, this is the command I'd like to run: T_new_...

Latest Reply

DianGermishuize
New Contributor II

09-25-2021 4:18:12 AM

0 kudos

You could also potentially use the .withColumns() function on the data frame, and use the pyspark.sql.functions.encode function to convert the character set to the one you need. Convert the Character Set/Encoding of a String field in a PySpark DataFr...

7 More Replies

by AndreStarker • New Contributor III

07-26-2021 12:14:36 PM

2699 Views
3 replies
2 kudos

Certification status

I've passed the "Databricks Certified Associate Developer for Apache Spark 3.0 - Scala" certification exam on 7/17/2021. The Webassessor record says I should receive certification status from Databricks within a week. I have not received any communi...

Latest Reply

Anonymous
Not applicable

09-24-2021 2:39:21 PM

2 kudos

@Andre Starker - Congratulations!!!

2 More Replies
