Data Engineering

Forum Posts

Sorted by:

by vasanthvk • New Contributor III

10-01-2021 10:28:11 AM

5064 Views
7 replies
3 kudos

Resolved! Is there a way to automate Table creation in Databricks SQL based on a ADLS storage location which contains multiple Parquet files?

We have ADLS container location which contains several (100+) different data subjects folders which contain Parquet files with partition column and we want to expose each of the data subject folder as a table in Databricks SQL. Is there any way to au...

Data Engineering

5064 Views
7 replies
3 kudos

10-01-2021 10:28:11 AM

View Replies

Latest Reply

User16857282152
Contributor

10-06-2021 9:56:15 AM

3 kudos

Updating dazfuller suggestion, but including code for one level of partitioning, of course if you have deeper partitions then you will have to make a function and do a recursive call to get to the final directory containing parquet files. Parquet wil...

3 kudos

10-06-2021 9:56:15 AM

6 More Replies

by MartinB • Contributor III

09-11-2021 3:34:17 AM

6192 Views
4 replies
3 kudos

Resolved! Interoperability Spark ↔ Pandas: can't convert Spark dataframe to Pandas dataframe via df.toPandas() when it contains datetime value in distant future

Hi,I have multiple datasets in my data lake that feature valid_from and valid_to columns indicating validity of rows.If a row is valid currently, this is indicated by valid_to=9999-12-31 00:00:00.Example:Loading this into a Spark dataframe works fine...

Data Engineering

6192 Views
4 replies
3 kudos

09-11-2021 3:34:17 AM

View Replies

Latest Reply

shan_chandra
Honored Contributor III

10-06-2021 7:42:15 AM

3 kudos

Currently, out of bound timestamps are not supported in pyArrow/pandas. Please refer to the below associated JIRA issue. https://issues.apache.org/jira/browse/ARROW-5359?focusedCommentId=17104355&page=com.atlassian.jira.plugin.system.issuetabpanels%3...

3 kudos

10-06-2021 7:42:15 AM

3 More Replies

by User16783852686 • New Contributor II

10-05-2021 9:52:50 AM

1666 Views
5 replies
2 kudos

Resolved! Slow first time run, jar based jobs

When running a jar-based job, I've noticed that the 1st run always takes the extra time to complete the job and consecutive runs take less time to finish the job. This behavior is reproducible on an interactive cluster. What's causing this? Is this e...

Data Engineering

1666 Views
5 replies
2 kudos

10-05-2021 9:52:50 AM

View Replies

Latest Reply

User16783852686
New Contributor II

10-06-2021 5:58:18 AM

2 kudos

@Sandeep Katta , this is a fat jar that does read-transform-write. @DD Sharma response matches @Werner Stinckens & I intuition that there was efficiency on the second job due to jar already being loaded. I would not have noticed this had job run...

2 kudos

10-06-2021 5:58:18 AM

4 More Replies

by BorislavBlagoev • Valued Contributor III

10-05-2021 3:12:04 AM

1284 Views
4 replies
7 kudos

Resolved! Visualization of Structured Streaming in job.

Does Databricks have feature or good pattern to visualize the data from Structured Streaming? Something like display in the notebook.

Data Engineering

1284 Views
4 replies
7 kudos

10-05-2021 3:12:04 AM

View Replies

Latest Reply

BorislavBlagoev
Valued Contributor III

10-06-2021 1:59:53 AM

7 kudos

I didn't know about that. Thanks!

7 kudos

10-06-2021 1:59:53 AM

3 More Replies

by User16752246002 • New Contributor II

10-01-2021 10:56:32 AM

1149 Views
2 replies
6 kudos

Resolved! New Bill Inmon Book, What are your thoughts?

Have you checked out the new Bill Inmon Book, Building the Data Lakehouse? https://dbricks.co/3uxCXjO What were your thoughts if you read it?

Data Engineering

1149 Views
2 replies
6 kudos

10-01-2021 10:56:32 AM

View Replies

Latest Reply

-werners-
Esteemed Contributor III

10-06-2021 1:31:10 AM

6 kudos

The quality of the book depends on the audience IMO. For people who have no background in data warehousing it will be interesting to read. For the others the book is too general and descriptive. The 'HOW do you do x' is missing.

6 kudos

10-06-2021 1:31:10 AM

1 More Replies

by FMendez • New Contributor III

09-10-2021 4:29:06 AM

9283 Views
4 replies
7 kudos

Resolved! How can you mount an Azure Data Lake (gen2) using abfss and Shared Key?

I wanted to mount a ADLG2 on databricks and take advantage on the abfss driver which should be better for large analytical workloads (is that even true in the context of DB?).Setting an OAuth is a bit of a pain so I wanted to take the simpler approac...

Data Engineering

9283 Views
4 replies
7 kudos

09-10-2021 4:29:06 AM

View Replies

Latest Reply

User16753724663
Valued Contributor

10-04-2021 2:57:28 AM

7 kudos

Hi @Fernando Mendez ,The below document will help you to mount the ADLS gen2 using abfss:https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-get-started.htmlCould you please check if this helps?

7 kudos

10-04-2021 2:57:28 AM

3 More Replies

by del1000 • New Contributor III

09-14-2021 8:19:28 AM

15191 Views
6 replies
3 kudos

Resolved! Is it possible to passthrough job's parameters to variable?

Scenario:I tried to run notebook_primary as a job with same parameters' map. This notebook is orchestrator for notebooks_sec_1, notebooks_sec_2, and notebooks_sec_3 and next. I run them by dbutils.notebook.run(path, timeout, arguments) function.So ho...

Data Engineering

15191 Views
6 replies
3 kudos

09-14-2021 8:19:28 AM

View Replies

Latest Reply

del1000
New Contributor III

10-03-2021 11:56:09 AM

3 kudos

@Balbir Singh , I'm newbie in Databricks but the manual says you can use a python cell and transfer variables to scala's cell by temp tables.https://docs.databricks.com/notebooks/notebook-workflows.html#pass-structured-data

3 kudos

10-03-2021 11:56:09 AM

5 More Replies

by User16789201666 • Contributor II

06-07-2021 3:57:12 PM

855 Views
2 replies
0 kudos

What are some guidelines for migrating to DBR 7/Spark 3?

Data Engineering

855 Views
2 replies
0 kudos

06-07-2021 3:57:12 PM

View Replies

Latest Reply

shan_chandra
Honored Contributor III

10-02-2021 7:11:34 PM

0 kudos

Please refer to the below reference for switching to DBR 7.xWe have extended our DBR 6.4 support until December 2021, DBR 6.4 extended support - Release notes: https://docs.databricks.com/release-notes/runtime/6.4x.htmlMigration guide to DBR 7.x: htt...

0 kudos

10-02-2021 7:11:34 PM

1 More Replies

by MGH1 • New Contributor III

09-09-2021 8:52:57 PM

3139 Views
8 replies
3 kudos

Resolved! how to log the KerasClassifier model in a sklearn pipeline in mlflow?

I have a set of pre-processing stages in a sklearn `Pipeline` and an estimator which is a `KerasClassifier` (`from tensorflow.keras.wrappers.scikit_learn import KerasClassifier`).My overall goal is to tune and log the whole sklearn pipeline in `mlflo...

Data Engineering

3139 Views
8 replies
3 kudos

09-09-2021 8:52:57 PM

View Replies

Latest Reply

shan_chandra
Honored Contributor III

10-02-2021 7:02:02 PM

3 kudos

could you please share the full error stack trace?

3 kudos

10-02-2021 7:02:02 PM

7 More Replies

by brij • New Contributor III

09-23-2021 2:11:13 AM

3068 Views
8 replies
3 kudos

Resolved! Databricks snowflake dataframe.toPandas() taking more space and time

I have 2 exactly same table(rows and schema). One table recides in AZSQL server data base and other one is in snowflake database. Now we have some existing code which we want to migrate from azsql to snowflake but when we are trying to create a panda...

Data Engineering

3068 Views
8 replies
3 kudos

09-23-2021 2:11:13 AM

View Replies

Latest Reply

Anonymous
Not applicable

10-01-2021 12:17:50 PM

3 kudos

@Brijan Elwadhi - That's wonderful. Thank you for sharing your solution.

3 kudos

10-01-2021 12:17:50 PM

7 More Replies

by krishnachaitany • New Contributor II

10-01-2021 7:44:02 AM

584 Views
1 replies
2 kudos

Spot Instances in Azure Databricks

The above screen shot is from AWS Databricks cluster .Similarly, in Azure Databricks - Is there a specific way to determine how many of worker nodes are using spot instances and on-demand instances when it is running/completed a job.Likewise, ...

Compute level spot instances and on demand instances

Data Engineering

584 Views
1 replies
2 kudos

10-01-2021 7:44:02 AM

View Replies

Latest Reply

Anonymous
Not applicable

10-01-2021 12:14:35 PM

2 kudos

Hello!My name is Piper and I'm one of the community moderators. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Or else I will follow up with the team. Thanks for your p...

2 kudos

10-01-2021 12:14:35 PM

by Databricks2005 • New Contributor II

09-16-2021 2:08:13 PM

1439 Views
4 replies
3 kudos

Resolved! Cosine similarity between all rows pairwise on a dataset of 100million rows

Hello everyone,I am facing performance issue while calculating cosine similarity in pyspark on a dataframe with around 100 million records.I am trying to do a cross self join on the dataframe to calculate it.The executors are all having same number ...

Data Engineering

1439 Views
4 replies
3 kudos

09-16-2021 2:08:13 PM

View Replies

Latest Reply

Sonal
New Contributor II

10-01-2021 10:00:26 AM

3 kudos

Is there a way to hash the record attributes so that the cartesian join can be avoided? I work on record similarity and fuzzy matching and we do a learning based blocking alorithm which hashes the records into smaller buckets and then the hashes are ...

3 kudos

10-01-2021 10:00:26 AM

3 More Replies

by Quan • New Contributor III

09-22-2021 5:42:40 PM

10382 Views
9 replies
6 kudos

Resolved! How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

Hello all, I'm trying to pull table data from databricks tables that contain foreign language characters in UTF-8 into an ETL tool using a JDBC connection. I'm using the latest Simba Spark JDBC driver available from the Databricks website.The issue i...

Data Engineering

10382 Views
9 replies
6 kudos

09-22-2021 5:42:40 PM

View Replies

Latest Reply

Anonymous
Not applicable

10-01-2021 1:56:27 AM

6 kudos

Can you try setting UseUnicodeSqlCharacterTypes=1 in the driver, and also make sure 'file.encoding' is set to UTF-8 in jvm and see if the issue still persists?

6 kudos

10-01-2021 1:56:27 AM

8 More Replies

by Abhendu • New Contributor II

09-23-2021 10:01:50 AM

872 Views
3 replies
0 kudos

Resolved! CICD Databricks

Hi TeamI was wondering if there is a document or step by step process to promote code in CICD across various environments of code repository (GIT/GITHUB/BitBucket/Gitlab) with DBx support? [Without involving code repository merging capability of the ...

Data Engineering

872 Views
3 replies
0 kudos

09-23-2021 10:01:50 AM

View Replies

Latest Reply

Anonymous
Not applicable

09-30-2021 11:40:23 PM

0 kudos

Please refer this related thread on CICD in Databricks https://community.databricks.com/s/question/0D53f00001GHVhMCAX/what-are-some-best-practices-for-cicd

0 kudos

09-30-2021 11:40:23 PM

2 More Replies

by Kaniz • Community Manager

09-29-2021 8:31:57 PM

777 Views
1 replies
0 kudos

What's the difference between Pig and Hive?

Data Engineering

777 Views
1 replies
0 kudos

09-29-2021 8:31:57 PM

View Replies

Latest Reply

Kaniz
Community Manager

09-30-2021 11:11:21 PM

0 kudos

The differences are as follows:-Pig operates on the client-side of a cluster whereas Hive operates on the server-side of a cluster.Pig uses pig-Latin language whereas Hive uses HiveQL language.Pig is a Procedural Data Flow Language whereas Hive is a ...

0 kudos

09-30-2021 11:11:21 PM

User

Count

1602

736

344

284

247

Databricks

Forum Posts

Resolved! Is there a way to automate Table creation in Databricks SQL based on a ADLS storage location which contains multiple Parquet files?

Resolved! Interoperability Spark ↔ Pandas: can't convert Spark dataframe to Pandas dataframe via df.toPandas() when it contains datetime value in distant future

Resolved! Slow first time run, jar based jobs

Resolved! Visualization of Structured Streaming in job.

Resolved! New Bill Inmon Book, What are your thoughts?

Resolved! How can you mount an Azure Data Lake (gen2) using abfss and Shared Key?

Resolved! Is it possible to passthrough job's parameters to variable?

What are some guidelines for migrating to DBR 7/Spark 3?

Resolved! how to log the KerasClassifier model in a sklearn pipeline in mlflow?

Resolved! Databricks snowflake dataframe.toPandas() taking more space and time

Spot Instances in Azure Databricks

Resolved! Cosine similarity between all rows pairwise on a dataset of 100million rows

Resolved! How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

Resolved! CICD Databricks

What's the difference between Pig and Hive?

Best way to parse Google Analytics data in Databri...

DELTA_EXCEED_CHAR_VARCHAR_LIMIT

Not able to set run_as service_principal_name

Pyspark operations slowness in CLuster 14.3LTS as ...

[Databricks Assets Bundles] Workflow trigger on fi...