Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User15787040559
by Databricks Employee
  • 4643 Views
  • 1 replies
  • 0 kudos

What's the difference between Normalization and Standardization?

Normalization typically means rescaling the values into a range of [0, 1]. Standardization typically means rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance).

Latest Reply
User16826994223
Databricks Employee
  • 0 kudos

Normalization typically means rescaling the values into a range of [0, 1]. Standardization typically means rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance). A link which explains this better is https://towardsdatascience.com...
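To make the distinction concrete, here is a minimal PySpark ML sketch (toy data, hypothetical column names) showing both transformations: MinMaxScaler performs the [0, 1] normalization and StandardScaler the mean-0 / unit-variance standardization.

```python
from pyspark.ml.feature import MinMaxScaler, StandardScaler, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (5.0,), (10.0,)], ["value"])

# The scalers operate on vector columns, so assemble the raw column first
assembled = VectorAssembler(inputCols=["value"], outputCol="features").transform(df)

# Normalization: rescale each feature into the range [0, 1]
normalized = (MinMaxScaler(inputCol="features", outputCol="normalized")
              .fit(assembled)
              .transform(assembled))

# Standardization: rescale to mean 0 and standard deviation 1
standardized = (StandardScaler(inputCol="features", outputCol="standardized",
                               withMean=True, withStd=True)
                .fit(normalized)
                .transform(normalized))

standardized.select("value", "normalized", "standardized").show(truncate=False)
```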

User16826994223
by Databricks Employee
  • 1100 Views
  • 1 replies
  • 2 kudos

Issue: Your account {email} does not have the owner or contributor role on the Databricks workspace resource in the Azure portal 

Issue: Your account {email} does not have the owner or contributor role on the Databricks workspace resource in the Azure portal

Latest Reply
sajith_appukutt
Databricks Employee
  • 2 kudos

https://docs.microsoft.com/en-us/azure/databricks/scenarios/frequently-asked-questions-databricks#solution-1

User16826994223
by Databricks Employee
  • 2459 Views
  • 1 replies
  • 0 kudos

Streaming from Kafka with the same group ID

A Kafka topic has 300 partitions, and I see two clusters running with the same group ID. Will the data be duplicated in my Delta bronze layer?

Latest Reply
sajith_appukutt
Databricks Employee
  • 0 kudos

By default, each streaming query generates a unique group ID for reading data, ensuring its own consumer group. In scenarios where you'd want to specify it (authz, etc.), it is not recommended to have two streaming applications specify ...
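As a sketch of what that looks like in practice (the broker, topic, and paths below are placeholders), a Structured Streaming query reading Kafka into a Delta bronze table manages its own consumer group unless kafka.group.id is set explicitly:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bronze_stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    # .option("kafka.group.id", "bronze-ingest")          # only when required (e.g. authz);
    #                                                     # avoid sharing it across two apps
    .load())

(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/events")  # placeholder path
    .start("/mnt/bronze/events"))                                     # placeholder path
```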

User16826994223
by Databricks Employee
  • 6301 Views
  • 3 replies
  • 0 kudos

Resolved! Delta Lake checkpoints storage concept

In which format are the checkpoints stored, and how do they help Delta improve performance?

Latest Reply
aladda
Databricks Employee
  • 0 kudos

Great points above on how checkpointing helps with performance. In addition, Delta Lake also provides other data organization strategies, such as compaction and Z-ordering, to help with both read and write performance of Delta tables. Additional details ...
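On the original question: Delta checkpoints are Parquet snapshots of the JSON transaction log stored under the table's _delta_log directory, so readers can load the latest checkpoint instead of replaying every commit file. The compaction and Z-ordering mentioned above can be run from a notebook, for example (hypothetical table and column names):

```python
# `spark` is the SparkSession predefined in a Databricks notebook.
# OPTIMIZE compacts small files; ZORDER BY co-locates related values to speed up reads.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
```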

2 More Replies
Srikanth_Gupta_
by Databricks Employee
  • 3760 Views
  • 2 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Temp Views and Global Temp Views are the most common way of sharing data across languages within a Notebook/Cluster
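For example (hypothetical view names), a DataFrame registered from a Python cell can be read back from a cell in any other language on the same cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

# Register views from a Python cell
df.createOrReplaceTempView("shared_view")          # scoped to this notebook's SparkSession
df.createOrReplaceGlobalTempView("shared_global")  # visible to other notebooks on the same cluster

# Any other language cell can read them back; shown here in Python
spark.table("shared_view").show()
spark.sql("SELECT * FROM global_temp.shared_global").show()  # global temp views live in global_temp
```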

1 More Replies
User15787040559
by Databricks Employee
  • 5080 Views
  • 1 replies
  • 0 kudos

How many records does Spark use to infer the schema? entire file or just the first "X" number of records?

It depends. If you specify the schema, it will be zero; otherwise it will do a full file scan, which doesn't work well when processing Big Data at large scale. CSV files DataFrame Reader: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark...

Latest Reply
aladda
Databricks Employee
  • 0 kudos

As indicated, there are ways to manage the amount of data being sampled for inferring the schema. However, as a best practice for production workloads, it's always best to define the schema explicitly for consistency, repeatability, and robustness of the pipe...
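A short sketch of both approaches (the file path and columns are hypothetical): an explicit schema skips inference entirely, while the samplingRatio option limits how many rows the CSV/JSON readers scan when inferring.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Explicit schema: zero records are scanned for inference (recommended for production)
schema = StructType([
    StructField("id", StringType(), True),
    StructField("amount", DoubleType(), True),
])
df = spark.read.csv("/mnt/raw/input.csv", header=True, schema=schema)

# Inference with a cap: sample roughly 10% of rows instead of scanning the whole file
df_inferred = (spark.read
               .option("header", "true")
               .option("inferSchema", "true")
               .option("samplingRatio", "0.1")
               .csv("/mnt/raw/input.csv"))
```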

aladda
by Databricks Employee
  • 1536 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Yes, Convert to Delta allows for converting a Parquet table into Delta format in place by adding a transaction log, inferring the schema, and also collecting stats to improve query performance - https://docs.databricks.com/spark/latest/spark-sql/languag...
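A minimal sketch of the in-place conversion (the path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In-place conversion: writes a _delta_log alongside the existing Parquet files
spark.sql("CONVERT TO DELTA parquet.`/mnt/raw/events`")

# Equivalent Python API:
# from delta.tables import DeltaTable
# DeltaTable.convertToDelta(spark, "parquet.`/mnt/raw/events`")
```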

Anonymous
by Not applicable
  • 2238 Views
  • 3 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

And to the earlier comment about Delta being an extension of Parquet: you can start with a dataset in Parquet format in S3 and do an in-place conversion to Delta without having to duplicate the data. See - https://docs.databricks.com/spark/latest/spark-...

2 More Replies
Anonymous
by Not applicable
  • 2854 Views
  • 2 replies
  • 1 kudos
Latest Reply
aladda
Databricks Employee
  • 1 kudos

You can also use tags to set up a chargeback mechanism within your organization for distributed billing - https://docs.databricks.com/administration-guide/account-settings/usage-detail-tags-aws.html
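As an illustration (the workspace URL, token, runtime, and tag values below are placeholders), custom tags can be attached when a cluster is created through the Clusters API, and those tags then show up in the usage reports used for chargeback:

```python
import requests

payload = {
    "cluster_name": "etl-nightly",                              # placeholder
    "spark_version": "13.3.x-scala2.12",                        # placeholder runtime
    "node_type_id": "i3.xlarge",                                # placeholder instance type
    "num_workers": 2,
    "custom_tags": {"Team": "data-eng", "CostCenter": "1234"},  # tags drive chargeback reporting
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",          # placeholder workspace URL
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
print(resp.json())
```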

1 More Replies
Anonymous
by Not applicable
  • 3501 Views
  • 2 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Per the comment above, the cluster deletion mechanism is designed to keep your cluster configuration experience organized and avoid a proliferation of cluster configs. It's also a good idea to set up cluster policies and leverage those as a guide for what kind...
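For instance, a policy definition (the names and limits below are hypothetical) can pin the runtime, cap cluster size, and force auto-termination; it is then uploaded via the Cluster Policies API or the admin console:

```python
import json

# Hypothetical policy definition for the Cluster Policies API / admin console
policy_definition = json.dumps({
    "spark_version": {"type": "fixed", "value": "13.3.x-scala2.12", "hidden": True},
    "num_workers": {"type": "range", "maxValue": 10, "defaultValue": 2},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 120, "defaultValue": 60},
})
```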

1 More Replies
User16830818469
by Databricks Employee
  • 5461 Views
  • 2 replies
  • 0 kudos

Databricks SQL Visualizations - export/embed

Is it possible to embed Databricks SQL Dashboards or specific widgets/visualization into a webpage?

Latest Reply
aladda
Databricks Employee
  • 0 kudos

Databricks SQL also integrates with several popular BI tools over JDBC/ODBC which you can use as a mechanism to embed visualizations into a webpage
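Another option is pulling query results into your own web application with the databricks-sql-connector Python package (the hostname, HTTP path, token, and table below are placeholders):

```python
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname="<workspace-hostname>",           # placeholder
    http_path="/sql/1.0/warehouses/<warehouse-id>",   # placeholder SQL warehouse path
    access_token="<personal-access-token>",           # placeholder
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM samples.nyctaxi.trips LIMIT 10")  # placeholder query
        for row in cursor.fetchall():
            print(row)
```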

1 More Replies
Anonymous
by Not applicable
  • 2314 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

You can use libraries such as Seaborn, Bokeh, Matplotlib, and Plotly for visualization inside of Python notebooks. See https://docs.databricks.com/notebooks/visualizations/index.html#visualizations-in-python. Also, Databricks has its own built-in visualiza...
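A small sketch of both options in a Databricks Python notebook cell, where spark and display() are predefined (toy data):

```python
import matplotlib.pyplot as plt

spark_df = spark.range(0, 10).toDF("x")
pdf = spark_df.toPandas()

# Matplotlib (similarly Seaborn/Plotly/Bokeh): figures render inline in the notebook
plt.plot(pdf["x"], pdf["x"] ** 2)
plt.xlabel("x")
plt.ylabel("x squared")
plt.show()

# Built-in visualization: display() renders an interactive table with chart options
display(spark_df)
```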

aladda
by Databricks Employee
  • 11252 Views
  • 2 replies
  • 1 kudos
Latest Reply
aladda
Databricks Employee
  • 1 kudos

Thanks @Digan Parikh. Credit to Tahir Fayyaz. There are a couple of different paths depending on whether you're looking to bring in raw GA data vs. aggregated GA data. 1) For raw, you can bring in data from GA Universal Analytics 360 Paid version or GA ...

1 More Replies
