Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Anonymous
by Not applicable
  • 1576 Views
  • 2 replies
  • 1 kudos

What Databricks Runtime will I have to use if I want to leverage Python 2?

I have some code that depends on Python 2, and I am not able to use Python 2 with Databricks Runtime 6.0.

Latest Reply
User16826994223
Honored Contributor III
  • 1 kudos

When you create a Databricks Runtime 5.5 LTS cluster by using the workspace UI, the default is Python 3. You have the option to specify Python 2. If you use the Databricks REST API to create a cluster using Databricks Runtime 5.5 LTS, the default is ...
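For illustration, a hedged sketch of the kind of cluster payload the reply refers to; the runtime string, node type, and the PYSPARK_PYTHON path are assumptions based on the legacy Databricks Runtime 5.5 documentation, not details from this thread:

```python
# Illustrative payload for the Clusters API (POST /api/2.0/clusters/create).
# All values are placeholders; spark_env_vars / PYSPARK_PYTHON was the
# legacy-documented way to pin the Python version on DBR 5.x clusters.
payload = {
    "cluster_name": "legacy-python-cluster",
    "spark_version": "5.5.x-scala2.11",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "spark_env_vars": {"PYSPARK_PYTHON": "/databricks/python3/bin/python3"},
}
```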

1 More Replies
User16826994223
by Honored Contributor III
  • 1159 Views
  • 1 replies
  • 0 kudos

How is an ETL batch job different from a trigger-once stream?

I am a little confused about whether to use a Structured Streaming query with a trigger-once trigger or a batch ETL job. Can someone help me understand on what basis I should make my decision?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

In Structured Streaming, triggers are used to specify how often a streaming query should produce results. A RunOnce trigger will fire only once and then stop the query, effectively running it like a batch job. Now, if your source data is a strea...
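A minimal sketch of the trigger-once pattern described above, assuming a JSON source landing in cloud storage; the paths and schema are illustrative, not from the original thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read new files incrementally from a (hypothetical) landing directory.
events = (
    spark.readStream
    .format("json")
    .schema("id STRING, ts TIMESTAMP, payload STRING")
    .load("/mnt/raw/events/")
)

# Trigger once processes everything that is available and then stops,
# so the query behaves like an incremental batch job.
(
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events_bronze")
    .trigger(once=True)
    .start("/mnt/bronze/events/")
)
```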

User15787040559
by Databricks Employee
  • 3321 Views
  • 1 replies
  • 0 kudos

What's the difference between Normalization and Standardization?

Normalization typically means rescaling the values into a range of [0,1]. Standardization typically means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance).

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

Normalization typically means rescaling the values into a range of [0,1]. Standardization typically means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance). A link which explains this better is https://towardsdatascience.com...
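A small numeric illustration of the difference (the values are synthetic, not from the thread):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Normalization (min-max scaling): rescale values into [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): rescale to mean 0 and standard deviation 1.
standardized = (x - x.mean()) / x.std()

print(normalized)    # [0.   0.25 0.5  0.75 1.  ]
print(standardized)  # mean ~0, standard deviation ~1
```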

User16826994223
by Honored Contributor III
  • 788 Views
  • 1 replies
  • 2 kudos

Issue: Your account {email} does not have the owner or contributor role on the Databricks workspace resource in the Azure portal 

Issue: Your account {email} does not have the owner or contributor role on the Databricks workspace resource in the Azure portal

Latest Reply
sajith_appukutt
Honored Contributor II
  • 2 kudos

https://docs.microsoft.com/en-us/azure/databricks/scenarios/frequently-asked-questions-databricks#solution-1

User16826994223
by Honored Contributor III
  • 1889 Views
  • 1 replies
  • 0 kudos

Streaming with Kafka with the same groupid

A Kafka topic has 300 partitions, and I see two clusters running with the same group ID. Will the data be duplicated in my Delta bronze layer?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

By default, each streaming query generates a unique group ID for reading data (ensuring its own consumer group). In scenarios where you'd want to specify it (authz etc.), it is not recommended to have two streaming applications specify ...
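For reference, a hedged sketch of how a Kafka source is typically configured in Structured Streaming; the broker address, topic, and group ID are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    # By default, Spark generates a unique consumer group per query.
    # An explicit group id (Spark 3.0+) is possible, but as noted above,
    # two streaming applications should not share the same one:
    # .option("kafka.group.id", "my-authz-group")
    .load()
)
```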

User16826994223
by Honored Contributor III
  • 4393 Views
  • 3 replies
  • 0 kudos

Resolved! Delta lake Check points storage concept

In which format are checkpoints stored, and how do they help Delta increase performance?

Latest Reply
aladda
Databricks Employee
  • 0 kudos

Great points above on how checkpointing helps with performance. In addition, Delta Lake also provides other data organization strategies, such as compaction and Z-ordering, to help with both read and write performance of Delta tables. Additional details ...
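An illustrative example of the compaction and Z-ordering mentioned above; the table and column names are placeholders:

```python
# `spark` is the SparkSession available in a Databricks notebook.
# Compact small files and co-locate data on a commonly filtered column.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")

# Separately from OPTIMIZE, Delta Lake periodically writes a Parquet
# checkpoint of the transaction log so readers can reconstruct the table
# state without replaying every JSON commit file.
```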

2 More Replies
Srikanth_Gupta_
by Databricks Employee
  • 2907 Views
  • 2 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Temp Views and Global Temp Views are the most common way of sharing data across languages within a Notebook/Cluster
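A minimal sketch of that pattern; the DataFrame and view names are illustrative:

```python
# `spark` is the SparkSession available in a Databricks notebook.
df = spark.range(10)

# Visible to other languages within the same notebook / Spark session.
df.createOrReplaceTempView("numbers")

# Visible across notebooks attached to the same cluster, qualified with
# the global_temp database, e.g. SELECT * FROM global_temp.numbers_global
df.createOrReplaceGlobalTempView("numbers_global")
```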

1 More Replies
User15787040559
by Databricks Employee
  • 4043 Views
  • 1 replies
  • 0 kudos

How many records does Spark use to infer the schema? entire file or just the first "X" number of records?

It depends. If you specify the schema, it will be zero; otherwise it will do a full file scan, which doesn't work well when processing Big Data at a large scale. CSV files DataFrame Reader: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark...

Latest Reply
aladda
Databricks Employee
  • 0 kudos

As indicated, there are ways to manage the amount of data being sampled for inferring the schema. However, as a best practice for production workloads, it's always best to define the schema explicitly for consistency, repeatability, and robustness of the pipe...
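A hedged sketch of defining the schema explicitly instead of inferring it; the file path and columns are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("ts", TimestampType(), True),
])

df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .schema(schema)          # no sampling / inference pass over the data
    .load("/mnt/raw/transactions/*.csv")
)

# By contrast, .option("inferSchema", "true") triggers an extra read, and a
# samplingRatio option (where supported) limits how much data inference scans.
```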

aladda
by Databricks Employee
  • 1063 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Yes, Convert to Delta allows for converting a Parquet table into Delta format in place by adding a transaction log, inferring the schema, and also collecting stats to improve query performance - https://docs.databricks.com/spark/latest/spark-sql/languag...
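An illustrative example of the in-place conversion; the path is a placeholder:

```python
# `spark` is the SparkSession available in a Databricks notebook.
spark.sql("CONVERT TO DELTA parquet.`/mnt/data/events/`")

# For a partitioned Parquet layout, the partition columns must be declared:
# spark.sql("CONVERT TO DELTA parquet.`/mnt/data/events/` PARTITIONED BY (event_date DATE)")
```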

Anonymous
by Not applicable
  • 1639 Views
  • 3 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

To add to the earlier comment about Delta being an extension of Parquet: you can start with a dataset in Parquet format in S3 and do an in-place conversion to Delta without having to duplicate the data. See - https://docs.databricks.com/spark/latest/spark-...

2 More Replies
Anonymous
by Not applicable
  • 2129 Views
  • 2 replies
  • 1 kudos
Latest Reply
aladda
Databricks Employee
  • 1 kudos

You can also use tags to set up a chargeback mechanism within your organization for distributed billing - https://docs.databricks.com/administration-guide/account-settings/usage-detail-tags-aws.html
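A hedged sketch of attaching such tags when creating a cluster through the Clusters API; the workspace URL, token, runtime version, and tag values are placeholders, not taken from this thread:

```python
import requests

# All values below are placeholders.
payload = {
    "cluster_name": "etl-team-cluster",
    "spark_version": "9.1.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    # custom_tags are propagated to the underlying cloud resources and show
    # up in usage reports, which is what enables per-team chargeback.
    "custom_tags": {"team": "data-eng", "cost-center": "1234"},
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
resp.raise_for_status()
```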

1 More Replies
Anonymous
by Not applicable
  • 2443 Views
  • 2 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Per the comment above, the cluster deletion mechanism is designed to keep your cluster configuration experience organized and avoid a proliferation of cluster configs. It's also a good idea to set up cluster policies and leverage those as a guide for what kind...
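A hedged sketch of a cluster policy definition of the kind suggested above; the attribute names follow the documented policy definition format, but the specific limits and tag values are illustrative assumptions:

```python
# Illustrative policy definition; limits and tag values are assumptions.
policy_definition = {
    "spark_version": {"type": "fixed", "value": "9.1.x-scala2.12"},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 120},
    "num_workers": {"type": "range", "maxValue": 8},
    "custom_tags.team": {"type": "fixed", "value": "data-eng"},
}
# This dictionary is the JSON "definition" you would supply when creating a
# policy in the admin console or via the Cluster Policies API.
```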

1 More Replies
User16830818469
by New Contributor
  • 4811 Views
  • 2 replies
  • 0 kudos

Databricks SQL Visualizations - export/embed

Is it possible to embed Databricks SQL Dashboards or specific widgets/visualization into a webpage?

Latest Reply
aladda
Databricks Employee
  • 0 kudos

Databricks SQL also integrates with several popular BI tools over JDBC/ODBC, which you can use as a mechanism to embed visualizations into a webpage.
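One way to query the same SQL endpoint programmatically, assuming the databricks-sql-connector package; the hostname, HTTP path, and token are placeholders:

```python
from databricks import sql

connection = sql.connect(
    server_hostname="<workspace-hostname>",
    http_path="<sql-endpoint-http-path>",
    access_token="<personal-access-token>",
)
cursor = connection.cursor()
cursor.execute("SELECT current_date()")
print(cursor.fetchall())
cursor.close()
connection.close()
```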

1 More Replies
Anonymous
by Not applicable
  • 1721 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

You can use libraries such as Seaborn, Bokeh, Matplotlib, and Plotly for visualization inside of Python notebooks. See the Python visualizations docs (https://docs.databricks.com/notebooks/visualizations/index.html#visualizations-in-python). Also, Databricks has its own built-in visualiza...
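A small Matplotlib example of the kind referenced above (the data is synthetic):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("Example plot rendered inline in a notebook")
plt.show()
```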


Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group