Data Engineering

Forum Posts

User16826992666
by Valued Contributor
  • 790 Views
  • 1 reply
  • 0 kudos

Resolved! How much space does the metadata for a Delta table take up?

If you have a lot of transactions in a table, it seems like the Delta log keeping track of all those transactions would get pretty large. Does the size of the metadata become a problem over time?

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

Yes, the size of the metadata can become a problem over time, though because of storage costs rather than performance. Delta performance will not degrade due to the size of the metadata, but your cloud storage bill can increase. By default Delta h...
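As a rough, hedged sketch of how you might keep that metadata growth in check (the table name `events` is hypothetical; `delta.logRetentionDuration` is a standard Delta table property that defaults to 30 days, but verify the default for your runtime):

    # Sketch: cap how long Delta keeps transaction-log history for a table.
    # Assumes a Databricks notebook/cluster where `spark` is already defined.
    spark.sql("""
        ALTER TABLE events SET TBLPROPERTIES (
          'delta.logRetentionDuration' = 'interval 14 days'
        )
    """)

    # Inspect the table's properties to confirm the retention override
    spark.sql("SHOW TBLPROPERTIES events").show(truncate=False)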

Anonymous
by Not applicable
  • 571 Views
  • 1 reply
  • 0 kudos

Resolved! Delta Sharing internally?

If we don't have any datasets to be shared with external companies, does that mean Delta Sharing is not valid for our org? Is there any use case to use it internally?

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

Delta Sharing can be done externally and internally. One use case for sharing internally would be if two separate business units would like to share data with each other without exposing their entire Lakehouse to the other unit.
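As a hedged illustration of the internal use case, here is roughly what the receiving business unit could do with the open-source delta-sharing Python client (the profile path and share/schema/table names are hypothetical):

    # Sketch: one business unit reads a table another unit shared via Delta Sharing,
    # without the provider exposing the rest of its Lakehouse.
    import delta_sharing

    profile = "/dbfs/FileStore/shares/finance.share"                 # credential file from the providing unit
    table_url = profile + "#finance_share.reporting.daily_revenue"   # share.schema.table

    df = delta_sharing.load_as_pandas(table_url)
    print(df.head())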

User16830818524
by New Contributor II
  • 620 Views
  • 1 reply
  • 0 kudos

Is it possible to read a Delta table directly using Koalas?

Can I read a Delta table directly using Koalas or do I need to read using Spark and then convert the Spark dataframe to a Koalas dataframe?

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

Yes, you can use the "read_delta" function. See the Koalas read_delta documentation for details.
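For example, a minimal sketch (the table path is hypothetical; on newer runtimes Koalas functionality has moved to pyspark.pandas):

    # Sketch: read a Delta table directly into a Koalas DataFrame,
    # no Spark-to-Koalas conversion step needed.
    import databricks.koalas as ks

    kdf = ks.read_delta("/mnt/datalake/silver/customers")  # hypothetical path
    print(kdf.head())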

sajith_appukutt
by Honored Contributor II
  • 780 Views
  • 1 reply
  • 2 kudos

Resolved! Unable to get mlflow central model registry to work with dbconnect.

I'm working on setting up tooling to allow team members to easily register and load models from a central MLflow model registry via dbconnect. However, after following the instructions in the public docs, I'm hitting this error: raise _NoDbutilsError mlfl...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 2 kudos

You could monkey-patch MLflow's _get_dbutils() with something similar to this to get it working while connecting from dbconnect:

    spark = SparkSession.builder.getOrCreate()
    # monkey-patch MLflow's _get_dbutils()
    def _get_dbutils():
        return DBUtils(...
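A more complete, hedged version of that workaround is sketched below; note that mlflow.utils.databricks_utils._get_dbutils is an internal MLflow helper, so the exact module path and behavior can change between MLflow versions:

    # Sketch: let MLflow find dbutils when running over databricks-connect
    # by patching its internal helper. Treat this as a workaround, not an API.
    from pyspark.sql import SparkSession
    from pyspark.dbutils import DBUtils
    import mlflow.utils.databricks_utils as databricks_utils

    spark = SparkSession.builder.getOrCreate()

    def _get_dbutils():
        # Build a DBUtils handle from the databricks-connect SparkSession
        return DBUtils(spark)

    # Monkey-patch so MLflow registry calls stop raising _NoDbutilsError
    databricks_utils._get_dbutils = _get_dbutils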

aladda
by Honored Contributor II
  • 635 Views
  • 1 reply
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

Generally, interactive clusters and jobs are better suited for data engineering and transformations as they support more than just SQL. However, if you are using pure SQL, then endpoints can be used for data transformations. All of the Spark SQL fun...

aladda
by Honored Contributor II
  • 602 Views
  • 1 reply
  • 0 kudos

Resolved! Does the Jobs API allow executing an older version of a Notebook using version history?

I see the revision_timestamp parameter on NotebookTask: https://docs.databricks.com/dev-tools/api/latest/jobs.html#jobsnotebooktask. An example of how to invoke it would be helpful.

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

You can use the Databricks built-in version control feature, coupled with the NotebookTask Jobs API, to run a specific version of the notebook based on the timestamp of the save, expressed in Unix timestamp format:

    curl -n -X POST -H 'Content-Type: app...
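For illustration, here is a hedged Python equivalent of that call using the Jobs 2.0 runs/submit endpoint documented above (the host, token, cluster ID, notebook path, and timestamp are all placeholders):

    # Sketch: submit a one-time run pinned to a specific notebook revision.
    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
    token = os.environ["DATABRICKS_TOKEN"]

    payload = {
        "run_name": "run-notebook-at-revision",
        "existing_cluster_id": "0615-123456-abcd123",                    # placeholder
        "notebook_task": {
            "notebook_path": "/Users/someone@example.com/etl-notebook",  # placeholder
            "revision_timestamp": 1625060460,  # Unix timestamp of the saved revision
        },
    }

    resp = requests.post(f"{host}/api/2.0/jobs/runs/submit",
                         headers={"Authorization": f"Bearer {token}"},
                         json=payload)
    resp.raise_for_status()
    print(resp.json()["run_id"])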

User16826992666
by Valued Contributor
  • 758 Views
  • 1 reply
  • 0 kudos

How do I know if the number of files is causing performance issues?

I have read and heard that having too many small files can cause performance problems when reading large data sets. But how do I know if that is an issue I am facing?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

The Databricks SQL endpoint has a query history section which provides additional information to debug and tune queries. One such metric under execution details is the number of files read. For ETL/data science workloads, you could use the Spark UI of the ...
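As a quick, hedged way to check from a notebook whether a Delta table is fragmented into many small files (the table name is hypothetical; OPTIMIZE is a Databricks Delta command):

    # Sketch: compare file count to total size; many files for little data
    # usually indicates a small-files problem worth compacting.
    detail = spark.sql("DESCRIBE DETAIL events").select("numFiles", "sizeInBytes").first()
    print(f"{detail['numFiles']} files, {detail['sizeInBytes'] / 1e9:.2f} GB total")

    # Compact small files into larger ones (Databricks Delta)
    spark.sql("OPTIMIZE events")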

User16765131552
by Contributor III
  • 1471 Views
  • 1 reply
  • 1 kudos

Displaying Spark job progress in a dashboard

In Databricks, is there a way to display Spark job progress in a dashboard? I have a simple dashboard that displays a table, but the main Spark job behind it takes 15 minutes to run. Is there a way to show the Spark job progress bar in a dashboard?

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 1 kudos

The best way to do so would be to collect data about the job run using the REST API (the runs get endpoint). This endpoint provides detailed metadata about the run. You may need to use other endpoints to look up the job or run IDs in order to get the correct in...
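A hedged sketch of that polling loop against the Jobs runs get endpoint (host, token, and run_id are placeholders):

    # Sketch: poll a run's state so a dashboard can surface its progress.
    import os
    import time
    import requests

    host = os.environ["DATABRICKS_HOST"]
    token = os.environ["DATABRICKS_TOKEN"]
    run_id = 12345  # placeholder run id

    while True:
        resp = requests.get(f"{host}/api/2.0/jobs/runs/get",
                            headers={"Authorization": f"Bearer {token}"},
                            params={"run_id": run_id})
        resp.raise_for_status()
        state = resp.json()["state"]
        print(state.get("life_cycle_state"), state.get("result_state", ""))
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            break
        time.sleep(30)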

User16826992666
by Valued Contributor
  • 1135 Views
  • 1 reply
  • 0 kudos

Resolved! When running a MERGE, if records are deleted from the table, are the underlying files that contain those records deleted as well?

I know I have the option to delete rows from a Delta table when running a merge. But I'm confused about how that would actually affect the files that contain the deleted records. Are those files deleted, or are they rewritten, or what?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Delta implements MERGE by physically rewriting existing files. It is implemented in two steps:
1. Perform an inner join between the target table and source table to select all files that have matches.
2. Perform an outer join between the selected files in t...
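One hedged way to see this rewrite behavior for yourself is through the table history; the metric names in operationMetrics (e.g. numTargetFilesRemoved/numTargetFilesAdded on Databricks) may vary by runtime version, and the table name is hypothetical:

    # Sketch: inspect the most recent MERGE commit and its file-level metrics.
    from pyspark.sql import functions as F

    hist = (spark.sql("DESCRIBE HISTORY events")
                 .where(F.col("operation") == "MERGE")
                 .select("version", "operationMetrics")
                 .limit(1))
    hist.show(truncate=False)  # shows files removed and files (re)written by the MERGE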

User16826992666
by Valued Contributor
  • 869 Views
  • 1 reply
  • 0 kudos

Resolved! Are Delta tables able to support GDPR compliance?

I know that when deletes are made from a Delta table the underlying files are not actually removed. For compliance reasons I need to be able to truly delete the records. How can I know which files need to be removed, and is there a way to remove them ot...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Here is a document explaining best practices for GDPR and CCPA compliance using Delta Lake. Specifically, for cleaning up stale data, you can use the VACUUM command to remove files that are no longer referenced by a Delta table and are older than a s...
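A hedged sketch of that cleanup flow (the table name, predicate, and retention window are placeholders; shortening retention below the 7-day default requires disabling Delta's safety check and should be done carefully):

    # Sketch: delete the records, then physically remove the files that still
    # contain them once they are no longer referenced by the current table version.
    spark.sql("DELETE FROM customers WHERE customer_id = 'gdpr-request-42'")

    # Allow a retention window shorter than the 7-day default (use with care)
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    spark.sql("VACUUM customers RETAIN 24 HOURS")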

User16765131552
by Contributor III
  • 2001 Views
  • 0 replies
  • 0 kudos

Dataframe.write to a table containing always-generated and auto-generated columns is failing (SQL Server + sql-spark-connector)

Dataframe write to a SQL Server table containing an always-autogenerated column fails. I am using the Apache Spark Connector for SQL Server and Azure SQL. When the autogenerated fields are not included in the dataframe, I encounter a "No key found" error. If auto-gene...

jose_gonzalez
by Moderator
  • 1174 Views
  • 1 reply
  • 0 kudos

Resolved! Can I use DBconnect to connect to any DBR version?

I would like to know if I can use DBconnect to connect to any DBR version, or if only the supported versions will work.

Latest Reply
jose_gonzalez
Moderator
  • 0 kudos

Only the following Databricks Runtime versions are supported:
  • Databricks Runtime 8.1 ML, Databricks Runtime 8.1
  • Databricks Runtime 7.3 LTS ML, Databricks Runtime 7.3 LTS
  • Databricks Runtime 6.4 ML, Databricks Runtime 6.4
  • Databricks Runtime 5.5 LTS ML, Databricks Runtime 5.5 LTS

MoJaMa
by Valued Contributor II
  • 600 Views
  • 1 reply
  • 0 kudos
Latest Reply
MoJaMa
Valued Contributor II
  • 0 kudos

Currently there is no concept of "Cluster Owner" (see https://docs.databricks.com/security/access-control/cluster-acl.html#cluster-level-permissions). So you have to clone the cluster, thus making the person who cloned it the creator of the new cluster. The...

jose_gonzalez
by Moderator
  • 718 Views
  • 1 reply
  • 0 kudos

Resolved! How can I connect my favorite IDE, like PyCharm, to a Databricks cluster?

I would like to know if there is a way to connect to a Databricks cluster using my IDE.

Latest Reply
jose_gonzalez
Moderator
  • 0 kudos

Databricks Connect allows you to connect your favorite IDE to Databricks clusters. You can find more details on how to set it up and install the required libraries at https://docs.databricks.com/dev-tools/databricks-connect.html.
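A minimal, hedged sketch of what that looks like from the IDE side (the databricks-connect version must match your cluster's DBR version; the configuration values prompted for are not shown here):

    # One-time setup from a terminal (shown as comments):
    #   pip install -U "databricks-connect==7.3.*"   # match the cluster's DBR version
    #   databricks-connect configure                  # prompts for host, token, cluster id, port

    # Then Spark code in PyCharm (or any IDE) executes on the remote cluster:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # resolves to the configured Databricks cluster
    spark.range(100).selectExpr("sum(id) AS total").show()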
