Data Engineering

Forum Posts

by aladda • Databricks Employee

06-18-2021 12:02:17 PM

12068 Views
2 replies
1 kudos

Resolved! What is a good way to ingest Google Analytics data into Databricks

Latest Reply

aladda
Databricks Employee

06-22-2021 8:06:22 PM

1 kudos

Thanks @Digan Parikh. Credit to Tahir Fayyaz. I found a couple of different paths, depending on whether you're looking to bring in raw GA data or aggregated GA data. 1) For raw data, you can bring in data from GA Universal Analytics 360 Paid version or GA ...

by Anonymous • Not applicable

06-22-2021 7:24:36 PM

1556 Views
0 replies
0 kudos

What are the resulting steps when two pyspark dataframes are co-grouped by a common key & a function is applied to each co-group?

by User16826994223 • Databricks Employee

06-18-2021 4:00:26 AM

2214 Views
1 reply
0 kudos

What is databricks-sync?

I am trying to migrate my workload to another workspace (from ST to E2). I am planning to use databricks-sync, but I am not sure whether it will migrate everything, such as currents, users, groups, jobs, notebooks, etc., or whether it has some limitations which I s...

Latest Reply

sajith_appukutt
Databricks Employee

06-22-2021 5:55:11 PM

0 kudos

Here is the support matrix for import/export operations for databricks-sync. Also check out https://github.com/databrickslabs/migrate

by User16826994223 • Databricks Employee

06-21-2021 5:57:04 AM

2239 Views
1 reply
0 kudos

How do we manage data recency in Databricks?

I want to know how Databricks maintains data recency.

Latest Reply

sajith_appukutt
Databricks Employee

06-22-2021 5:43:42 PM

0 kudos

When using Delta tables in Databricks, you have the advantage of the Delta cache, which accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. At the beginning of each query, Delta tables au...

by MoJaMa • Databricks Employee

06-22-2021 5:26:58 PM

2003 Views
1 reply
0 kudos

Since Databricks manages the runtime on SQL Endpoints, how do I know which version I'm on?

Latest Reply

MoJaMa
Databricks Employee

06-22-2021 5:28:39 PM

0 kudos

1. Start an endpoint.
2. Run a query.
3. Go to Query History.
4. Click Details, then go to the Environment tab.
5. Search for sparkVersion.

by User16826994223 • Databricks Employee

06-22-2021 2:25:35 AM

2207 Views
1 reply
0 kudos

Why is NPIP optional and not mandatory?

NPIP is more secure because the network traffic travels through the Microsoft backbone network, so why is it optional rather than mandatory? Is there some limitation, or a case where we may not be able to use NPIP?

Latest Reply

sajith_appukutt
Databricks Employee

06-22-2021 5:26:34 PM

0 kudos

NPIP / secure cluster connectivity requires a NAT gateway (or similar appliance) for outbound traffic from your workspace’s subnets to the Azure backbone and public network. This incurs a small additional cost. Also, it is worth mentioning that ne...

by MoJaMa • Databricks Employee

06-22-2021 5:22:11 PM

1670 Views
1 reply
0 kudos

Databricks on GCP. How many partitions of local ssd does Databricks need per VM?

Latest Reply

MoJaMa
Databricks Employee

06-22-2021 5:24:10 PM

0 kudos

Each local disk is 375 GB. So, for example, n2-standard-4 has 2 local disks (0.75 TB / 2).
https://databricks.com/wp-content/uploads/2021/05/GCP-Pricing-Estimator-v2.pdf?_ga=2.241263109.66068867.1623086616-828667513.1602536526
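The arithmetic in the reply above can be sketched as follows. This is only an illustration: it assumes the 0.75 TB figure is decimal (1 TB = 1000 GB), and the function name `local_ssd_count` is hypothetical, not part of any Databricks or GCP API.

```python
LOCAL_SSD_GB = 375  # size of one local disk, per the reply


def local_ssd_count(total_local_ssd_tb: float) -> int:
    """Return how many 375 GB local disks make up the given total (decimal TB)."""
    total_gb = total_local_ssd_tb * 1000  # assumption: 1 TB = 1000 GB
    count, remainder = divmod(total_gb, LOCAL_SSD_GB)
    if remainder:
        raise ValueError("total is not a whole multiple of 375 GB")
    return int(count)


# n2-standard-4 example from the reply: 0.75 TB of local SSD
print(local_ssd_count(0.75))  # 2
```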

by MoJaMa • Databricks Employee

06-22-2021 5:20:18 PM

1336 Views
1 reply
0 kudos

Databricks on GCP. For the persistent storage with each node what's the specific type Databricks uses?

Latest Reply

MoJaMa
Databricks Employee

06-22-2021 5:20:50 PM

0 kudos

They are Zonal SSD Persistent Disks.
https://cloud.google.com/compute/docs/disks#introduction

by User16826994223 • Databricks Employee

06-22-2021 4:53:45 AM

2661 Views
2 replies
0 kudos

Don't want checkpoints in Delta

Suppose I am not interested in checkpoints. How can I disable checkpoint writes in Delta?

Latest Reply

sajith_appukutt
Databricks Employee

06-22-2021 5:13:57 PM

0 kudos

Writing statistics in a checkpoint has a cost that is usually visible only for very large tables. However, it is worth mentioning that these statistics are very useful for data skipping, which speeds up subsequent operations. In Databricks Runti...

by Digan_Parikh • Databricks Employee

06-22-2021 4:50:40 PM

2483 Views
1 reply
0 kudos

Resolved! Delta Live Table - landing database?

Where do you specify what database the DLT tables land in?

Latest Reply

Digan_Parikh
Databricks Employee

06-22-2021 4:53:02 PM

0 kudos

The target key, set when creating the pipeline, specifies the database that the tables get published to. Documented here: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-user-guide.html#publish-tables
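A minimal sketch of what pipeline settings with a target key might look like, built here as a Python dictionary and serialized to JSON. Only the "target" key comes from the reply; the pipeline name, database name, and notebook path are made-up placeholders, and the linked documentation is the place to check the full settings schema.

```python
import json

# Hypothetical pipeline settings; only the "target" key comes from the reply.
pipeline_settings = {
    "name": "example_pipeline",  # assumed pipeline name
    "target": "analytics_db",    # database the tables get published to
    "libraries": [
        {"notebook": {"path": "/Repos/example/dlt_notebook"}}  # assumed path
    ],
}

# Serialize to JSON, the form in which pipeline settings are usually edited.
settings_json = json.dumps(pipeline_settings, indent=2)
print(settings_json)
```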

by Anonymous • Not applicable

06-22-2021 11:38:15 AM

3490 Views
1 reply
0 kudos

Resolved! Questions on using Docker image with Databricks Container Service

Specifically, we have in mind:
* Create a Databricks job for testing API changes (the API library is built in a custom Jar file)
* When we want to test an API change, build a Docker image with the relevant changes in a Jar file
* Update the job configur...

Latest Reply

sajith_appukutt
Databricks Employee

06-22-2021 4:32:11 PM

0 kudos

> Where do we put custom Jar files when building the Docker image?
/databricks/jars
> How do we update the job configuration so that the job’s cluster will be built with this new Docker image, and how long do we expect this re-configuring process to tak...

by brickster_2018 • Databricks Employee

06-22-2021 4:25:48 PM

13788 Views
1 reply
2 kudos

Resolved! How to find the Databricks Platform version

Latest Reply

brickster_2018
Databricks Employee

06-22-2021 4:27:06 PM

2 kudos

Use the following endpoint on your workspace: https://your-workspace-name.cloud.databricks.com/version
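For scripting, the endpoint above could be queried roughly as follows. This is a sketch using only the Python standard library. The helper names are hypothetical, and the reply does not say whether the endpoint needs authentication, so the plain GET request here is an assumption.

```python
from urllib.parse import urlunsplit
from urllib.request import urlopen


def version_url(workspace_host: str) -> str:
    """Build the /version URL for a workspace host name."""
    return urlunsplit(("https", workspace_host, "/version", "", ""))


def fetch_version(workspace_host: str, timeout: float = 10.0) -> str:
    """Fetch the endpoint body as text (needs network access to the workspace)."""
    with urlopen(version_url(workspace_host), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


# Building the URL does not touch the network; fetch_version() does.
print(version_url("your-workspace-name.cloud.databricks.com"))
```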

by brickster_2018 • Databricks Employee

06-22-2021 4:16:50 PM

3665 Views
1 reply
0 kudos

Resolved! Z-order or Partitioning? Which is better for Data skipping?

For Delta tables, which technique is recommended for efficient data skipping: Z-order or partitioning?

Latest Reply

brickster_2018
Databricks Employee

06-22-2021 4:19:13 PM

0 kudos

Partition pruning is the most efficient way to ensure data skipping. However, choosing the right column for partitioning is very important. Choosing the wrong column for partitioning commonly causes a large number of small file problems ...
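To make the two options concrete, here is a small sketch that builds the SQL statements involved. The table and column names are hypothetical. Partitioning on a low-cardinality date column and Z-ordering on a high-cardinality ID column is a common rule of thumb rather than something the reply states. On Databricks the resulting strings would be passed to spark.sql.

```python
def create_partitioned_table_sql(table: str, columns: dict, partition_col: str) -> str:
    """Build a CREATE TABLE statement for a Delta table partitioned by one column."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"CREATE TABLE {table} ({cols}) USING DELTA PARTITIONED BY ({partition_col})"


def zorder_sql(table: str, zorder_cols: list) -> str:
    """Build an OPTIMIZE ... ZORDER BY statement for an existing Delta table."""
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_cols)})"


# Hypothetical table: partition by date, Z-order by user ID.
schema = {"event_date": "DATE", "user_id": "BIGINT", "payload": "STRING"}
print(create_partitioned_table_sql("events", schema, "event_date"))
print(zorder_sql("events", ["user_id"]))
```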

by Srikanth_Gupta_ • Databricks Employee

06-22-2021 7:56:54 AM

2330 Views
2 replies
0 kudos

I have several thousand Delta tables in production. What is the best way to get counts?

I might need a dashboard to see the increase in the number of rows on a day-to-day basis, and also a dashboard that shows the size of Parquet/Delta files in my lake.

Latest Reply

brickster_2018
Databricks Employee

06-22-2021 3:53:13 PM

0 kudos

val db = "database_name"
spark.sessionState.catalog.listTables(db)
  .map(table => spark.sessionState.catalog.externalCatalog.getTable(table.database.get, table.table))
  .filter(x => x.provider.toString().toLowerCase.contains("delta"))

The above code snippet wi...

by User16826992666 • Databricks Employee

06-22-2021 8:24:22 AM

8114 Views
2 replies
0 kudos

Resolved! Can I reset the checkpoint of a streaming job if I want to do a full reload of a table?

Latest Reply

sajith_appukutt
Databricks Employee

06-22-2021 3:44:42 PM

0 kudos

If the read stream definition has something similar to

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")

resettin...
