Data Engineering

Forum Posts

Sorted by:

Start a conversation

by User16826994223 • Honored Contributor III

06-21-2021 5:57:04 AM

621 Views
1 replies
0 kudos

How do we manage data recency in Databricks

I want to know how databricks maintain data recency in databricks

Data Engineering

621 Views
1 replies
0 kudos

06-21-2021 5:57:04 AM

View Replies

Latest Reply

sajith_appukutt
Honored Contributor II

06-22-2021 5:43:42 PM

0 kudos

When using delta tables in databricks, you have the advantage of delta cache which accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. At the beginning of each query delta tables au...

0 kudos

06-22-2021 5:43:42 PM

by MoJaMa • Valued Contributor II

06-22-2021 5:26:58 PM

655 Views
1 replies
0 kudos

Since Databricks manages the runtime on SQL Endpoints, how do I know which version I'm on?

Data Engineering

655 Views
1 replies
0 kudos

06-22-2021 5:26:58 PM

View Replies

Latest Reply

MoJaMa
Valued Contributor II

06-22-2021 5:28:39 PM

0 kudos

Start an endpointRun a queryGo to Query HistoryClick Details, Go to the Environment tabSearch sparkVersion.

0 kudos

06-22-2021 5:28:39 PM

by User16826994223 • Honored Contributor III

06-22-2021 2:25:35 AM

797 Views
1 replies
0 kudos

Why NPIP is an optional and not mandatory

Even though the NPIP is more secure as the network traffic travel through Microsoft backbone network why it is optional , it should be mandatory, is there some limitataion or a case where we may not able to use NPIP .

Data Engineering

797 Views
1 replies
0 kudos

06-22-2021 2:25:35 AM

View Replies

Latest Reply

sajith_appukutt
Honored Contributor II

06-22-2021 5:26:34 PM

0 kudos

NPIP / secure cluster connectivity requires a NAT gateway (or similar appliance) for outbound traffic from your workspace’s subnets to the Azure backbone and public network. This incurs a small additional cost. Also, it is worth mentioning that ne...

0 kudos

06-22-2021 5:26:34 PM

by MoJaMa • Valued Contributor II

06-22-2021 5:22:11 PM

660 Views
1 replies
0 kudos

Databricks on GCP. How many partitions of local ssd does Databricks need per VM?

Data Engineering

660 Views
1 replies
0 kudos

06-22-2021 5:22:11 PM

View Replies

Latest Reply

MoJaMa
Valued Contributor II

06-22-2021 5:24:10 PM

0 kudos

Each local disk is 375 GB.So, for example, for n2-standard-4, it is 2 local disks. (0.75TB /2)https://databricks.com/wp-content/uploads/2021/05/GCP-Pricing-Estimator-v2.pdf?_ga=2.241263109.66068867.1623086616-828667513.1602536526

0 kudos

06-22-2021 5:24:10 PM

by MoJaMa • Valued Contributor II

06-22-2021 5:20:18 PM

513 Views
1 replies
0 kudos

Databricks on GCP. For the persistent storage with each node what's the specific type Databricks uses?

Data Engineering

513 Views
1 replies
0 kudos

06-22-2021 5:20:18 PM

View Replies

Latest Reply

MoJaMa
Valued Contributor II

06-22-2021 5:20:50 PM

0 kudos

They are Zonal SSD Persistent Diskhttps://cloud.google.com/compute/docs/disks#introduction

0 kudos

06-22-2021 5:20:50 PM

by User16826994223 • Honored Contributor III

06-22-2021 4:53:45 AM

887 Views
2 replies
0 kudos

Don't want checkpoint in delta

Suppose I am not interested in checkpoints, how can I disable Checkpoints write in delta

Data Engineering

887 Views
2 replies
0 kudos

06-22-2021 4:53:45 AM

View Replies

Latest Reply

sajith_appukutt
Honored Contributor II

06-22-2021 5:13:57 PM

0 kudos

Writing statistics in a checkpoint has a cost which is visible usually only for very large tables. However it is worth mentioning that, this statistics would be very useful for data skipping which speeds up subsequent operations. In Databricks Runti...

0 kudos

06-22-2021 5:13:57 PM

1 More Replies

by Digan_Parikh • Valued Contributor

06-22-2021 4:50:40 PM

846 Views
1 replies
0 kudos

Resolved! Delta Live Table - landing database?

Where do you specify what database the DLT tables land in?

Data Engineering

846 Views
1 replies
0 kudos

06-22-2021 4:50:40 PM

View Replies

Latest Reply

Digan_Parikh
Valued Contributor

06-22-2021 4:53:02 PM

0 kudos

The target key, when creating the pipeline specifies the database that the tables get published to. Documented here - https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-user-guide.html#publish-tables

0 kudos

06-22-2021 4:53:02 PM

by User16826992666 • Valued Contributor

06-16-2021 3:33:39 PM

1750 Views
1 replies
0 kudos

Is it possible to see the cost or DBU's of a particular job?

Data Engineering

1750 Views
1 replies
0 kudos

06-16-2021 3:33:39 PM

View Replies

Latest Reply

Digan_Parikh
Valued Contributor

06-22-2021 4:42:54 PM

0 kudos

you can see the DBUs during cluster creation

0 kudos

06-22-2021 4:42:54 PM

by Anonymous • Not applicable

06-22-2021 11:38:15 AM

1170 Views
1 replies
0 kudos

Resolved! Questions on using Docker image with Databricks Container Service

Specifically, we have in mind:* Create a Databricks job for testing API changes (the API library is built in a custom Jar file)* When we want to test an API change, build a Docker image with the relevant changes in a Jar file* Update the job configur...

Data Engineering

1170 Views
1 replies
0 kudos

06-22-2021 11:38:15 AM

View Replies

Latest Reply

sajith_appukutt
Honored Contributor II

06-22-2021 4:32:11 PM

0 kudos

>Where do we put custom Jar files when building the Docker image? /databricks/jars>How do we update the job configuration so that the job’s cluster will be built with this new Docker image, and how long do we expect this re-configuring process to tak...

0 kudos

06-22-2021 4:32:11 PM

by User16869510359 • Esteemed Contributor

06-22-2021 4:25:48 PM

4624 Views
1 replies
2 kudos

Resolved! How to find the Databricks Platform version

Data Engineering

4624 Views
1 replies
2 kudos

06-22-2021 4:25:48 PM

View Replies

Latest Reply

User16869510359
Esteemed Contributor

06-22-2021 4:27:06 PM

2 kudos

Use the below endpoint on your workspace. https://your-workspace-name.cloud.databricks.com/version

2 kudos

06-22-2021 4:27:06 PM

by User16869510359 • Esteemed Contributor

06-22-2021 4:16:50 PM

1192 Views
1 replies
0 kudos

Resolved! Z-order or Partitioning? Which is better for Data skipping?

For Delta tables, among Z-order and Partioning which is recommended technique for efficient Data Skipping

Data Engineering

1192 Views
1 replies
0 kudos

06-22-2021 4:16:50 PM

View Replies

Latest Reply

User16869510359
Esteemed Contributor

06-22-2021 4:19:13 PM

0 kudos

Partition pruning is the most efficient way to ensure Data skipping. However, choosing the right column for partitioning is very important. It's common to see choosing the wrong column for partitioning can cause a large number of small file problems ...

0 kudos

06-22-2021 4:19:13 PM

by Srikanth_Gupta_ • Valued Contributor

06-22-2021 7:56:54 AM

767 Views
2 replies
0 kudos

I have several thousands of Delta tables in my Production, what is the best way to get counts

if I might need a dashboard to see increase in number of rows on day to day basis, also a dashboard that shows size of Parquet/Delta files in my Lake?

Data Engineering

767 Views
2 replies
0 kudos

06-22-2021 7:56:54 AM

View Replies

Latest Reply

User16869510359
Esteemed Contributor

06-22-2021 3:53:13 PM

0 kudos

val db = "database_name" spark.sessionState.catalog.listTables(db).map(table=>spark.sessionState.catalog.externalCatalog.getTable(table.database.get,table.table)).filter(x=>x.provider.toString().toLowerCase.contains("delta"))The above code snippet wi...

0 kudos

06-22-2021 3:53:13 PM

1 More Replies

by User16826992666 • Valued Contributor

06-22-2021 8:24:22 AM

2588 Views
2 replies
0 kudos

Resolved! Can I reset the checkpoint of a streaming job if I want to do a full reload of a table?

Data Engineering

2588 Views
2 replies
0 kudos

06-22-2021 8:24:22 AM

View Replies

Latest Reply

sajith_appukutt
Honored Contributor II

06-22-2021 3:44:42 PM

0 kudos

If the read stream definition has something similar to val df = spark .read .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("subscribePattern", "topic.*") .option("startingOffsets", "earliest")resettin...

0 kudos

06-22-2021 3:44:42 PM

1 More Replies

by Anonymous • Not applicable

06-21-2021 2:46:41 PM

893 Views
2 replies
0 kudos

Changing default Delta behavior in DBR 8.x for writes

Is there anyway to add a Spark Config that reverts the default behavior when doing tables writes from Delta to Parquet in DBR 8.0+? I know you can simply specify .format("parquet") but that could involve a decent amount of code change for some client...

Data Engineering

893 Views
2 replies
0 kudos

06-21-2021 2:46:41 PM

View Replies

Latest Reply

Anonymous
Not applicable

06-22-2021 3:26:30 PM

0 kudos

Thanks @Ryan Chynoweth !

0 kudos

06-22-2021 3:26:30 PM

1 More Replies

by User15761966159 • New Contributor

06-22-2021 12:17:19 PM

588 Views
1 replies
0 kudos

Does removing a User from the workspace automatically invalidate their tokens

If you have a user that is removed from the workspace, are the tokens they've created automatically invalidated?

Data Engineering

588 Views
1 replies
0 kudos

06-22-2021 12:17:19 PM

View Replies

Latest Reply

Ryan_Chynoweth
Honored Contributor III

06-22-2021 3:13:55 PM

0 kudos

Yes, PAT tokens will be invalid if a user is removed since those tokens are attached to their current credentials and access.

0 kudos

06-22-2021 3:13:55 PM

User

Count

1601

736

343

284

247

Databricks

Forum Posts

How do we manage data recency in Databricks

Since Databricks manages the runtime on SQL Endpoints, how do I know which version I'm on?

Why NPIP is an optional and not mandatory

Databricks on GCP. How many partitions of local ssd does Databricks need per VM?

Databricks on GCP. For the persistent storage with each node what's the specific type Databricks uses?

Don't want checkpoint in delta

Resolved! Delta Live Table - landing database?

Is it possible to see the cost or DBU's of a particular job?

Resolved! Questions on using Docker image with Databricks Container Service

Resolved! How to find the Databricks Platform version

Resolved! Z-order or Partitioning? Which is better for Data skipping?

I have several thousands of Delta tables in my Production, what is the best way to get counts

Resolved! Can I reset the checkpoint of a streaming job if I want to do a full reload of a table?

Changing default Delta behavior in DBR 8.x for writes

Does removing a User from the workspace automatically invalidate their tokens

DELTA_EXCEED_CHAR_VARCHAR_LIMIT

Not able to set run_as service_principal_name

Pyspark operations slowness in CLuster 14.3LTS as ...

[Databricks Assets Bundles] Workflow trigger on fi...

Addressing Pipeline Error Handling in Databricks b...