Data Engineering

Forum Posts

User16826994223
by Honored Contributor III
  • 2256 Views
  • 3 replies
  • 0 kudos

Resolved! Delta lake Check points storage concept

In which format are checkpoints stored, and how do they help Delta improve performance?

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Great points above on how checkpointing helps with performance. In addition, Delta Lake also provides other data organization strategies, such as compaction and Z-ordering, to help with both read and write performance of Delta tables. Additional details ...

  • 0 kudos
2 More Replies
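A minimal sketch of the maintenance statements the reply mentions. The table name `events` and column `event_date` are hypothetical; in a notebook you would run each statement with `spark.sql(...)`.

```python
# Hypothetical table/column names; run each with spark.sql(...) in a notebook.

# Compaction: rewrite many small files into fewer, larger ones.
compact_sql = "OPTIMIZE events"

# Z-ordering: co-locate related column values so data skipping prunes more
# files on read.
zorder_sql = "OPTIMIZE events ZORDER BY (event_date)"
```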
Srikanth_Gupta_
by Valued Contributor
  • 1695 Views
  • 2 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Temp Views and Global Temp Views are the most common way of sharing data across languages within a Notebook/Cluster

  • 0 kudos
1 More Replies
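A sketch of the pattern the reply describes, assuming a hypothetical DataFrame `df` in a Python cell; global temp views are resolved through the `global_temp` schema.

```python
# In a Python cell (hypothetical DataFrame `df`):
#   df.createOrReplaceTempView("sales_tmp")   # visible in this SparkSession only
#   df.createGlobalTempView("sales_gbl")      # visible across sessions on the cluster
# A SQL, Scala, or R cell can then read them:
local_query = "SELECT * FROM sales_tmp"
global_query = "SELECT * FROM global_temp.sales_gbl"  # global views live in global_temp
```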
User15787040559
by New Contributor III
  • 2531 Views
  • 1 replies
  • 0 kudos

How many records does Spark use to infer the schema? entire file or just the first "X" number of records?

It depends. If you specify the schema, zero records are read; otherwise Spark does a full file scan, which doesn't work well when processing Big Data at large scale. CSV files DataFrame Reader: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark...

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

As indicated, there are ways to manage the amount of data being sampled for inferring the schema. However, as a best practice for production workloads, it's always best to define the schema explicitly for consistency, repeatability, and robustness of the pipe...

  • 0 kudos
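A sketch of both options discussed above: defining the schema up front, or limiting how much data inference samples. The path and column names are hypothetical; the commented lines show how the reader would be invoked in a notebook.

```python
# Defining the schema up front avoids an inference pass over the file.
# Path and column names are hypothetical.
schema_ddl = "id INT, name STRING, ts TIMESTAMP"
# df = spark.read.csv("/data/events.csv", schema=schema_ddl, header=True)

# When inference is unavoidable, the CSV reader's samplingRatio option
# limits the fraction of rows sampled:
# df = spark.read.csv("/data/events.csv", header=True,
#                     inferSchema=True, samplingRatio=0.1)
```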
aladda
by Honored Contributor II
  • 620 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Yes, Convert to Delta allows for converting a Parquet table into Delta format in place by adding a transaction log, inferring the schema, and also collecting stats to improve query performance - https://docs.databricks.com/spark/latest/spark-sql/languag...

  • 0 kudos
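A minimal sketch of the in-place conversion statement; the path is hypothetical, and the statement would be run via `spark.sql(...)`. It leaves the Parquet files where they are and writes a transaction log next to them.

```python
# Hypothetical path; run via spark.sql(...). The existing Parquet files are
# not rewritten; a _delta_log transaction log is added alongside them.
convert_sql = "CONVERT TO DELTA parquet.`/mnt/raw/events`"
```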
Anonymous
by Not applicable
  • 770 Views
  • 3 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

And to the earlier comment about Delta being an extension of Parquet: you can start with a dataset in Parquet format in S3 and do an in-place conversion to Delta without having to duplicate the data. See - https://docs.databricks.com/spark/latest/spark-...

  • 0 kudos
2 More Replies
Anonymous
by Not applicable
  • 1152 Views
  • 2 replies
  • 1 kudos
Latest Reply
aladda
Honored Contributor II
  • 1 kudos

You can also use tags to set up a chargeback mechanism within your organization for distributed billing - https://docs.databricks.com/administration-guide/account-settings/usage-detail-tags-aws.html

  • 1 kudos
1 More Replies
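A sketch of where such tags would live in a cluster definition. The `custom_tags` section below is the relevant part of a Clusters API request body; the cluster name, tag keys, and values are hypothetical examples for chargeback reporting.

```python
# Hypothetical cluster spec fragment; the custom_tags map is propagated to
# the underlying cloud resources and surfaces in usage/billing reports.
cluster_spec = {
    "cluster_name": "etl-nightly",
    "custom_tags": {
        "team": "data-eng",
        "cost_center": "cc-1234",
    },
}
```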
Anonymous
by Not applicable
  • 1129 Views
  • 2 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Per the comment above, the cluster deletion mechanism is designed to keep your cluster configuration experience organized and avoid a proliferation of cluster configs. It's also a good idea to set up cluster policies and leverage those as a guide for what kind...

  • 0 kudos
1 More Replies
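A sketch of what a cluster policy definition can look like, assuming hypothetical limits and node types; a policy constrains the options users see when creating clusters.

```python
import json

# Hypothetical policy definition: bound auto-termination and restrict
# selectable node types. Values here are illustrative only.
policy = {
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 120},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
}
policy_json = json.dumps(policy)
```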
User16830818469
by New Contributor
  • 3057 Views
  • 2 replies
  • 0 kudos

Databricks SQL Visualizations - export/embed

Is it possible to embed Databricks SQL Dashboards or specific widgets/visualization into a webpage?

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Databricks SQL also integrates with several popular BI tools over JDBC/ODBC, which you can use as a mechanism to embed visualizations into a webpage.

  • 0 kudos
1 More Replies
Anonymous
by Not applicable
  • 1023 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

You can use libraries such as Seaborn, Bokeh, Matplotlib, and Plotly for visualization inside Python notebooks. See https://docs.databricks.com/notebooks/visualizations/index.html#visualizations-in-python. Also, Databricks has its own built-in visualiza...

  • 0 kudos
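A minimal Matplotlib sketch of the kind of plot you might build in a Python notebook cell; the data is a toy stand-in for query results.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook the figure renders inline
import matplotlib.pyplot as plt

# Toy data standing in for query results
x = [1, 2, 3, 4]
y = [v * v for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
# In a Databricks notebook, ending the cell with the figure (or calling
# display(fig)) renders the plot.
```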
aladda
by Honored Contributor II
  • 5368 Views
  • 2 replies
  • 1 kudos
Latest Reply
aladda
Honored Contributor II
  • 1 kudos

Thanks @Digan Parikh​. Credit to Tahir Fayyaz. Found a couple of different paths depending on whether you're looking to bring in raw GA data vs. aggregated GA data. 1) For raw: you can bring in data from GA Universal Analytics 360 (paid version) or GA ...

  • 1 kudos
1 More Replies
User16776431030
by New Contributor III
  • 920 Views
  • 1 replies
  • 1 kudos

Can you use the Databricks API from a notebook?

I want to test out different APIs directly from a Databricks notebook instead of using Postman or CURL. Is this possible?

Latest Reply
Mooune_DBU
Valued Contributor
  • 1 kudos

If your question is about using the Databricks API from within a Databricks notebook, then the answer is yes, of course: you can orchestrate anything and invoke the REST API from a Python notebook using the `requests` library already bake...

  • 1 kudos
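A sketch of the pattern, assuming a hypothetical workspace URL and token (in practice you would read the token from a secret scope rather than hard-code it). The request is built but not sent, to show its shape.

```python
import requests

# Hypothetical host and token; fetch the real token from a secret scope,
# e.g. dbutils.secrets.get(...), rather than hard-coding it.
host = "https://example.cloud.databricks.com"
token = "dapiXXXXXXXX"  # placeholder personal access token

req = requests.Request(
    "GET",
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
).prepare()

# resp = requests.Session().send(req)  # uncomment inside a real workspace
```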
User16826994223
by Honored Contributor III
  • 725 Views
  • 1 replies
  • 0 kudos

What is Databricks Sync

I am trying to migrate my workload to another workspace (from ST to E2). I am planning to use databricks-sync, but I am still not sure: will it migrate everything, like clusters, users, groups, jobs, notebooks, etc., or does it have limitations which I s...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Here is the support matrix for import/export operations for databricks-sync. Also check out https://github.com/databrickslabs/migrate

  • 0 kudos
User16826994223
by Honored Contributor III
  • 621 Views
  • 1 replies
  • 0 kudos

How do we manage data recency in Databricks

I want to know how Databricks maintains data recency.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

When using Delta tables in Databricks, you have the advantage of the Delta cache, which accelerates data reads by creating copies of remote files in nodes' local storage using a fast intermediate data format. At the beginning of each query, Delta tables au...

  • 0 kudos
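For reference, the Delta cache is controlled by a Spark configuration key; a sketch of toggling it in a notebook (the `spark.conf.set` call is commented out since it needs a live session).

```python
# Documented Spark conf key controlling the Delta/disk cache:
cache_conf_key = "spark.databricks.io.cache.enabled"
# In a notebook, enable or disable it per session:
# spark.conf.set(cache_conf_key, "true")
```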
Labels
Top Kudoed Authors