Data Engineering

Forum Posts

by User16826994223 • Honored Contributor III

06-08-2021 5:03:12 AM

618 Views
1 replies
0 kudos

Stream is not getting started from Kafka after 2 hours of cluster start

Hi Team, I am setting up the Kafka cluster on Databricks to ingest data into Delta. The cluster has been running for the last 2 hours, but the stream has still not started and I am not seeing any failure either.

Latest Reply

User16826994223
Honored Contributor III

06-08-2021 5:04:13 AM

0 kudos

This type of issue happens if you have a firewall on your cloud account and your IP is not whitelisted. Please whitelist the IP and the issue will resolve.
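
One way to test the firewall theory before changing any rules is to check, from a notebook on the cluster, whether the broker ports are reachable at all. The snippet below is only a rough sketch using the Python standard library. The helper names and the broker host in the usage comment are made up for illustration, and a successful TCP connection does not prove that Kafka authentication or TLS will also work.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_brokers(brokers, timeout: float = 5.0) -> dict:
    """Map 'host:port' to a reachability flag for every (host, port) pair."""
    return {f"{host}:{port}": can_reach(host, port, timeout) for host, port in brokers}


# Usage (hypothetical broker address, replace with your own):
# print(check_brokers([("broker-1.example.com", 9092)]))
```

If every broker comes back False while the same check succeeds from a machine that is already whitelisted, a firewall rule is a likely cause.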

by User16783853032 • New Contributor II

06-07-2021 2:42:59 PM

848 Views
1 replies
0 kudos

Databricks notebook command gets cancelled: This generally happens when the cluster has init script or library issues while starting. The exact error can be found in the driver logs.

Latest Reply

User16826994223
Honored Contributor III

06-08-2021 12:15:46 AM

0 kudos

Awesome knowledge

by User16789201666 • Contributor II

06-07-2021 4:13:29 PM

1749 Views
2 replies
2 kudos

Resolved! How do I get access to cost related information for my cluster?

Latest Reply

User16826994223
Honored Contributor III

06-08-2021 12:14:43 AM

2 kudos

You can tag your cluster. The tags get propagated to billing management, where you can see the cost.
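
As a rough illustration of the tagging idea, cluster definitions sent to the Clusters API can carry a custom_tags map. The sketch below only builds such a definition in plain Python. The helper name, the tag keys and the placeholder values are assumptions made for the example, not something taken from this thread.

```python
import json


def with_cost_tags(cluster_spec: dict, tags: dict) -> dict:
    """Return a copy of the cluster spec with the given tags merged into custom_tags."""
    merged = dict(cluster_spec)
    merged["custom_tags"] = {**cluster_spec.get("custom_tags", {}), **tags}
    return merged


# Placeholder cluster definition; every value here is hypothetical.
base_spec = {
    "cluster_name": "etl-nightly",
    "spark_version": "<runtime-version>",
    "node_type_id": "<node-type>",
    "num_workers": 2,
}

tagged_spec = with_cost_tags(base_spec, {"team": "data-eng", "cost_center": "1234"})
print(json.dumps(tagged_spec, indent=2))
```

The same tags can also be entered in the cluster configuration UI. The point is only that a consistent tag set per team or project makes the cost breakdown easier to read.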

by User16826994223 • Honored Contributor III

06-07-2021 11:55:08 PM

1053 Views
1 replies
0 kudos

Azure Databricks with Storage Account as data layer and DBFS understanding

What is the difference between ADLS mounted on Databricks and DBFS? Does mounting ADLS on Databricks give any performance benefit? Does the mounted ADLS still behave as object storage, or does it become simple storage?

Latest Reply

User16826994223
Honored Contributor III

06-07-2021 11:55:31 PM

0 kudos

DBFS is just an abstraction on cloud storage. By default when you create a workspace, you get an instance of DBFS, the so-called DBFS Root. Plus you can mount additional storage accounts under the /mnt folder. Data written to mount point paths (/mnt) is...

by User16826994223 • Honored Contributor III

06-07-2021 11:47:26 PM

4039 Views
1 replies
0 kudos

How to convert Dataframe into JSON on Databricks?

Can I convert my JDBC DataFrame into JSON? When I tried it, I got an error. I'm using the pandas DataFrame function df.to_json().

Latest Reply

User16826994223
Honored Contributor III

06-07-2021 11:48:15 PM

0 kudos

df.toJSON()

by User16783855534 • New Contributor III

06-07-2021 10:59:24 AM

3066 Views
3 replies
1 kudos

How many IPs do Databricks nodes use?

Latest Reply

sajith_appukutt
Honored Contributor II

06-07-2021 10:25:36 PM

1 kudos

The answer varies depending on the cloud provider (as of June 2021). In GCP, since the architecture is based on GKE, there are additional IP requirements. For more details see

by Anonymous • Not applicable

06-07-2021 8:07:58 PM

651 Views
0 replies
0 kudos

Escaped quotes mess up table records

When table content is dumped from an RDBMS (e.g. Oracle), some column values may contain escaped double quotes (\"), which may cause the values from multiple columns to be concatenated into one value and result in corrupted reco...
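
The root cause can be reproduced outside Spark with Python's standard csv module, as a small sketch. The sample line is made up, and the exact way the record gets corrupted differs from parser to parser. What carries over is that a reader which is not told about the escape character loses the column boundaries. Spark's CSV reader exposes quote and escape options for the same purpose.

```python
import csv

# One exported line with three columns; the middle value contains escaped quotes.
line = r'1,"He said \"hi\", then left",3'

# Default dialect: the backslash is not treated as an escape character,
# so the embedded quotes are read as field terminators.
default_row = next(csv.reader([line]))

# Escape-aware dialect: the backslash escapes the following quote character.
escaped_row = next(csv.reader([line], escapechar="\\"))

print(len(default_row))  # column count is wrong when the escape is ignored
print(escaped_row)       # ['1', 'He said "hi", then left', '3']
```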

by JustinMills • New Contributor III

01-22-2018 6:55:35 AM

29718 Views
6 replies
0 kudos

Resolved! Job fails with "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."

No other output is available, not even output from cells that did run successfully. Also, I'm unable to connect to the Spark UI or view the logs. It makes an attempt to load each of them, but after some time an error message appears saying it's unable ...

Latest Reply

lzlkni
New Contributor II

06-07-2021 6:33:51 PM

0 kudos

Most of the time it's out of memory on the driver node. Check all the driver logs and the data node logs in the Spark UI. Also check whether you are collecting a huge amount of data to the driver node, e.g. with collect().

by Anonymous • Not applicable

06-07-2021 5:34:14 PM

599 Views
1 replies
0 kudos

Delta - open source?

Delta is open source but certain features such as OPTIMIZE, ZORDER are only available on managed DBR. So how open sourced is it really?

Latest Reply

User16826994223
Honored Contributor III

06-07-2021 3:01:00 PM

0 kudos

Some of the features were added exclusively by Databricks on top of Delta, not by the community, so the company has the right to decide whether it wants to open source them or not.

by Anonymous • Not applicable

06-04-2021 10:26:36 AM

2735 Views
2 replies
1 kudos

Resolved! How is Z-ORDER different from bucketing in Hive?

Latest Reply

User16826994223
Honored Contributor III

06-07-2021 5:08:00 PM

1 kudos

Bucketing is a physical partitioning of the table, whereas Z-ordering is an arrangement of the records within a file in an optimal manner.

by User16789201666 • Contributor II

06-07-2021 4:11:26 PM

836 Views
1 replies
0 kudos

For our streaming jobs, I am planning to purge data older than 60 days from our raw layer, where we dump the data from a Kinesis stream into a Delta table in JSON format. Do you see any problem with purging the older data in parallel while streaming?

Latest Reply

User16789201666
Contributor II

06-07-2021 4:12:06 PM

0 kudos

There isn’t a problem purging old data. When using Auto Loader, it’ll take into account new data being added.

by User16789201666 • Contributor II

06-07-2021 4:04:40 PM

891 Views
1 replies
2 kudos

What is the best practice for generating jobs in an automated fashion?

Latest Reply

User16789201666
Contributor II

06-07-2021 4:04:54 PM

2 kudos

There are several approaches here. You can write an automation script that programmatically accesses Databricks APIs to generate configured jobs. You can also utilize the Databricks Terraform provider. The benefit of the latter approach is that Terr...
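
For the first approach, a script along these lines could prepare a job-creation call against the Jobs REST API. This is a minimal sketch and not an official client. It assumes the /api/2.1/jobs/create endpoint is available on your workspace, and the function name, workspace URL, token, notebook path and cluster ID are all placeholders. The request is only built here and not sent.

```python
import json
import urllib.request


def build_create_job_request(host: str, token: str, job_name: str,
                             notebook_path: str, cluster_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a single-task notebook job."""
    payload = {
        "name": job_name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                "existing_cluster_id": cluster_id,
            }
        ],
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage (placeholder values; urlopen would actually send the request):
# req = build_create_job_request("https://<workspace-url>", "<token>",
#                                "nightly-etl", "/Repos/etl/main", "<cluster-id>")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Looping over a list of job definitions and calling a helper like this for each one is enough to generate jobs in bulk. Keeping the definitions in version control gives a reviewable history of what was created.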

by sajith_appukutt • Honored Contributor II

06-07-2021 3:40:55 PM

406 Views
1 replies
0 kudos

How can I reduce the risk of data exfiltration while using Databricks?

Latest Reply

sajith_appukutt
Honored Contributor II

06-07-2021 3:44:27 PM

0 kudos

Databricks enterprise security and admin features allow customers to deploy Databricks using their own managed VPC/VNET. This enables them to have greater flexibility and control over the configuration of their deployment architecture. For Azure follo...

by tj-cycyota • New Contributor III

06-07-2021 3:15:58 PM

503 Views
0 replies
0 kudos

Is there an API to query E2 usage details by workspace? I want to query to see how many DBUs a specific workspace consumed in a certain time period.

by Anonymous • Not applicable

06-07-2021 3:13:06 PM

610 Views
0 replies
0 kudos

Newline characters mess up the table records

When creating tables from text files containing newline characters in the middle of the lines, the table records will have null column values because the newline characters break each line into two different records and fill up ...
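
The effect described above is easy to reproduce with the Python standard library, as a small sketch on made-up data. It assumes the embedded newline sits inside a quoted field. In that situation a line-by-line reader splits the record, while a quote-aware reader keeps it together. Spark's CSV reader has a multiLine option for quoted values that span lines. Whether that matches the fix this post goes on to describe is not visible in the truncated snippet.

```python
import csv
import io

# Header plus ONE logical record whose middle column contains a newline.
raw = 'id,comment,status\n1,"first line\nsecond line",ok\n'

# Naive approach: every physical line becomes a record, so the record is
# broken in two and each half is missing columns.
naive_rows = [line.split(",") for line in raw.splitlines()]

# Quote-aware approach: the reader sees the whole stream and keeps the
# quoted newline inside the field.
aware_rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_rows), [len(r) for r in naive_rows])
print(len(aware_rows), aware_rows[1])
```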
