Never miss a beat! Stay up to date with all Databricks news, including product updates and helpful product tips, by signing up for our monthly newsletter. Note: The newsletter is currently available only to AWS & GCP customers.
Learn the basics with these resources: Register for an AWS Onboarding Webinar or an Azure Quickstart Lab - learn the fundamentals from a Customer Success Engineer and get all your onboarding questions answered live. Started using Databricks, but have que...
Welcome to Databricks! Here you will find resources for a successful onboarding experience. In this group you can ask quick questions and have them answered by experts to unblock and accelerate your ramp up with Databricks.
Hi, I have problems with displaying and saving a table in Databricks. A simple command can run for hours without any progress. Before that I am not doing any rocket science - the code runs in less than a minute, and I have one join at the end. I am using 7.3 ...
Hi @Just Magy​, what is your data source? What lazy transformations and actions do you have in your code? Do you partition your data? Please provide more details.
I am using a framework and I have a query where I am doing df = seg_df.select("*").write.option("compression", "gzip") and I am getting the error below. When I don't do the write.option I am not getting the error. Why is it giving me a repartition error? Wh...
Currently, we are investigating how to effectively incorporate Databricks' latest feature for orchestration of tasks - Multi-task Jobs. The default behaviour is that a downstream task would not be executed if the previous one has failed for some reason...
Hi @Stefan V​, my name is Jan and I'm a product manager working on job orchestration. Thank you for your question. At the moment this is not something directly supported yet; it is, however, on our radar. If you are interested in having a short conve...
Dear community, I have the following problem: I have uploaded an ML-model file and transferred it to the directory with %fs mv '/FileStore/Tree_point_classification-1.dlpk' '/dbfs/mnt/group22/Tree_point_classification-1.dlpk' When I now check ...
There is dbfs:/dbfs/ displayed, so maybe the file is in the /dbfs/dbfs directory? Please check it and try to open it with open('/dbfs/dbfs. You can also use "Data" from the left menu to check what is in the DBFS file system more easily.
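The confusion above comes from the same files being addressable two ways on Databricks: as dbfs:/ URIs (used by %fs and Spark) and as /dbfs/ FUSE-mount paths (used by plain Python file APIs). A minimal sketch of that mapping, assuming the standard FUSE mount; the helper function is illustrative, not a Databricks API:

```python
# Hypothetical helper illustrating how a 'dbfs:/...' URI maps to the local
# '/dbfs/...' FUSE-mount path that plain Python open() expects on Databricks.
def dbfs_to_local(path: str) -> str:
    """Convert a 'dbfs:/...' URI to its '/dbfs/...' FUSE-mount equivalent."""
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    return path

print(dbfs_to_local("dbfs:/mnt/group22/Tree_point_classification-1.dlpk"))
# -> /dbfs/mnt/group22/Tree_point_classification-1.dlpk
```

This also suggests why the file above ended up under dbfs:/dbfs/: %fs paths are already DBFS-relative, so a %fs mv target of '/dbfs/mnt/...' is interpreted as dbfs:/dbfs/mnt/..., whereas '/mnt/...' or 'dbfs:/mnt/...' would land where intended.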
Note, I've tested with the same connection variable: locally with Scala - works (via the same prod schema registry); in the cluster with Python - works; in the cluster with Scala - fails with a 401 auth error. def setupSchemaRegistry(schemaRegistryUrl: String...
Found the issue: it's the uber package mangling some dependency resolution, which I fixed. Another issue is that currently you can't use the 6.* branch of the Confluent schema registry client in Databricks, because the Avro version is different than the one su...
We are using Databricks. How do we know which default libraries are installed in Databricks and what versions are being installed? I ran pip list, but couldn't find pyspark in the returned list.
Hi @karthick J​, if you would like to see all the libraries installed in your cluster and their versions, I recommend checking the "Environment" tab. There you will be able to find all the libraries installed in your cluster. Please follow t...
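Besides the Environment tab, you can list installed packages programmatically from a notebook cell. A sketch using only the standard library (Python 3.8+), so it reflects whatever environment the cluster provides; note that pyspark may be supplied by the Databricks runtime rather than as a pip-installed distribution, which would explain its absence from pip list:

```python
# List installed Python distributions and their versions using only the
# standard library. On Databricks this can be run in a notebook cell.
from importlib import metadata

def installed_packages() -> dict:
    """Return {distribution_name: version} for the current environment."""
    return {dist.metadata["Name"]: dist.version
            for dist in metadata.distributions()}

pkgs = installed_packages()
for name in sorted(pkgs, key=str.lower)[:10]:  # print a small sample
    print(f"{name}=={pkgs[name]}")
```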
(This is a copy of a question I asked on Stack Overflow here, but maybe this community is a better fit for the question.) Setting: Delta Lake, Databricks SQL compute used by Power BI. I am wondering about the following scenario: we have a column `timest...
In the query I would first filter by date (generated from the timestamp we want to query) and then by the exact timestamp, so it will get the benefit of partitioning.
There are API endpoints to manage clusters. Official documentation: https://docs.databricks.com/dev-tools/api/latest/clusters.html. Here is example code which can be run from a notebook: ctx = dbutils.notebook.entry_point.getDbutils().notebook().getCon...
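The Clusters API can also be called over plain HTTPS with a personal access token. A minimal sketch using only the standard library; the host and token below are placeholders (in a notebook they can be derived from the notebook context as above):

```python
# Build an authenticated request to the Databricks Clusters API
# (GET /api/2.0/clusters/list). Host and token are placeholder values.
import json
import urllib.request

def build_clusters_request(host: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a clusters/list request."""
    url = host.rstrip("/") + "/api/2.0/clusters/list"
    return urllib.request.Request(url,
                                  headers={"Authorization": f"Bearer {token}"})

req = build_clusters_request("https://example.cloud.databricks.com", "dapiXXXX")
# Sending it requires a real workspace and token:
# with urllib.request.urlopen(req) as resp:
#     clusters = json.load(resp).get("clusters", [])
```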
When deleting a workspace from the Databricks Accounts Console, I noticed the AWS resources (VPC, NAT, etc.) are not removed. Should they be? And if not, is there a clean/simple way of cleaning up the residual AWS resources?
Thank you Prabakar - that's what I figured, but I didn't know if there was documentation on resource cleanup. I'll just go through and find everything the CF stack created and remove it. Regards, Brad
1. I have data x; I would like to create a new column with the condition that the values are 1, 2 or 3. 2. The name of the column is SHIFT; this SHIFT column will be filled automatically if the TIME_CREATED column meets the conditions. 3. The conditi...
You can do something like this in pandas. Note there could be a more performant way to do this too.

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, 3, 4]})
df.head()
#    a
# 0  1
# 1  2
# 2  3
# 3  4
conditions = [(df['a'] <= 2...
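Applied to the SHIFT question above, the same conditions/np.select pattern could look like this. The shift boundaries (06:00, 14:00, 22:00) are assumed purely for illustration and are not from the original post:

```python
import pandas as pd
import numpy as np

# Hypothetical shift boundaries: 1 = 06:00-13:59, 2 = 14:00-21:59, 3 = otherwise.
df = pd.DataFrame({
    "TIME_CREATED": pd.to_datetime([
        "2022-01-01 07:30", "2022-01-01 15:10", "2022-01-01 23:45",
    ])
})
hour = df["TIME_CREATED"].dt.hour
conditions = [(hour >= 6) & (hour < 14),
              (hour >= 14) & (hour < 22)]
# np.select picks the first matching condition's value, else the default.
df["SHIFT"] = np.select(conditions, [1, 2], default=3)
print(df)
```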
Are there any plans or capabilities in place, or approaches people are using, for writing (logging) records failing constraint requirements to separate tables when using Delta Live Tables? Also, are there any plans / capabilities in place or approaches ...
According to the language reference documentation, I do not believe quarantining records is possible right now out of the box. But there are a few workarounds under the current functionality. Create a second table with the inverse of the expectations...
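The "inverse of the expectations" workaround can be sketched with the Delta Live Tables Python API. This is a hedged illustration (source table and rule names are made up), and it only runs inside a DLT pipeline, not as standalone Python:

```python
import dlt
from pyspark.sql import functions as F

# Hypothetical quality rules for illustration.
rules = {"valid_id": "id IS NOT NULL", "valid_amount": "amount >= 0"}
quarantine_predicate = " OR ".join(f"NOT ({r})" for r in rules.values())

@dlt.table
@dlt.expect_all_or_drop(rules)
def clean_orders():
    # Rows passing every expectation.
    return dlt.read("raw_orders")

@dlt.table
def quarantined_orders():
    # Inverse of the expectations: keep only rows failing at least one rule.
    return dlt.read("raw_orders").where(F.expr(quarantine_predicate))
```

Reading the source twice keeps the two tables' logic independent; the trade-off is a second scan of "raw_orders" per update.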
If you are familiar with Scala you can use Tika. Tika is a wrapper around PDFBox. If you want to use it in Databricks, I suggest you go through this blog and Git repo. For Python-based code you may want to use PyPDF2 as a pandas UDF in S...