Data Engineering

Forum Posts

by aladda • Honored Contributor II

06-21-2021 1:09:36 PM

938 Views
1 reply
0 kudos

Resolved! I read that Delta supports concurrent writes to separate partitions of the table but I'm getting an error when doing so

I’m running 3 separate dbt processes in parallel. All of them read data from different Databricks databases and create different staging tables by using dbt alias, but at the end they all update/insert into the same target table. The 3 processes r...

Latest Reply

aladda
Honored Contributor II

06-21-2021 1:10:01 PM

0 kudos

You’re likely running into the issue described here, which also comes with a solution. While Delta does support concurrent writers to separate partitions of a table, depending on your query structure (join/filter/where in particular), there may still be a n...
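The reply above is cut off, so the following is only a guess at the kind of fix it points to. A commonly documented way to keep concurrent writers from conflicting is to state the target partition explicitly in the MERGE condition instead of leaving it implied by the join. The sketch below only builds the SQL text; the table, view, and column names (analytics.target, staging_a, id, source_system) are hypothetical and do not come from the thread.

```python
def build_merge_sql(target: str, source: str, key: str,
                    partition_col: str, partition_value: str) -> str:
    """Return a MERGE statement whose ON clause pins a single target partition."""
    # Double any single quotes so the partition value stays a valid SQL literal.
    literal = partition_value.replace("'", "''")
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key} AND t.{partition_col} = '{literal}'\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical names: each dbt process would pass its own partition value.
sql = build_merge_sql("analytics.target", "staging_a", "id",
                      "source_system", "system_a")
print(sql)

# On a cluster the statement could then be run with: spark.sql(sql)
```

Each of the three processes would use a different partition value, so the partition each writer touches is visible in the statement itself.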

by aladda • Honored Contributor II

06-21-2021 1:05:23 PM

3773 Views
1 reply
1 kudos

Resolved! Does Databricks integrate with Splunk? What are some ways to send metrics/logs to Splunk?

Latest Reply

aladda
Honored Contributor II

06-21-2021 1:05:50 PM

1 kudos

The Databricks Add-on for Splunk, built as part of Databricks Labs, can be leveraged for Splunk integration. It’s a bi-directional framework that allows for in-place querying of data in Databricks from within Splunk by running queries, notebooks or jobs ...

by Anonymous • Not applicable

06-05-2021 10:00:14 PM

4068 Views
1 reply
1 kudos

Resolved! Jobs - Delta Live tables difference

Can you please explain the difference between Jobs and Delta Live tables?

Latest Reply

aladda
Honored Contributor II

06-21-2021 1:03:21 PM

1 kudos

Jobs are designed for automated execution (scheduled or manual) of Databricks Notebooks, JARs, spark-submit jobs etc. It's essentially a generic framework to run any kind of Data Engineering, Data Analysis or Data Science workload. Delta Live Tables on the...

by Anonymous • Not applicable

06-18-2021 2:12:44 PM

9673 Views
1 reply
0 kudos

Resolved! Tuning shuffle partitions

Is the best practice for tuning shuffle partitions to have the config "autoOptimizeShuffle.enabled" on? I see it is not switched on by default. Why is that?

Latest Reply

sajith_appukutt
Honored Contributor II

06-21-2021 1:02:37 PM

0 kudos

AQE (enabled by default from 7.3 LTS onwards) adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. So as data size grows or shrinks over different stages, the task size wi...
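For anyone who wants to try these settings in a session, here is a minimal sketch. The thread only gives the short name "autoOptimizeShuffle.enabled"; the full configuration keys below are my assumption of the corresponding Spark and Databricks settings, and whether switching the last one on is the right call for a given workload is not something the truncated reply settles.

```python
# Assumed full key names; the thread itself only mentions "autoOptimizeShuffle.enabled".
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.databricks.adaptive.autoOptimizeShuffle.enabled": "true",
}


def apply_settings(spark, settings):
    """Set each key/value pair on the given session's runtime configuration."""
    for key, value in settings.items():
        spark.conf.set(key, value)


# Print the settings so the sketch can be checked without a cluster.
for key, value in AQE_SETTINGS.items():
    print(f"{key} = {value}")

# In a notebook, where `spark` is already defined: apply_settings(spark, AQE_SETTINGS)
```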

by aladda • Honored Contributor II

06-18-2021 11:43:59 AM

1866 Views
1 reply
0 kudos

Resolved! Where does Databricks store its Notebooks? Are they on a file system in the control plane or RDS/data management system of some kind?

Latest Reply

aladda
Honored Contributor II

06-21-2021 12:59:21 PM

0 kudos

Notebooks in Databricks are part of the WebApp, which is run and managed by Databricks from the Control Plane. See the high-level architecture here for details: https://docs.databricks.com/getting-started/overview.html

by User16776431030 • New Contributor III

06-21-2021 12:53:48 PM

1098 Views
1 reply
0 kudos

Resolved! How can I make a cluster start up in the availability-zone (AZ) with the most available IPs?

I see the default in the UI is to always create clusters in a single AZ (e.g. us-west-2a), but I want to distribute workloads across all available AZs.

Latest Reply

User16776431030
New Contributor III

06-21-2021 12:55:13 PM

0 kudos

Found the answer: it is not available in the UI, but via the API you can submit the cluster definition with "aws_attributes": { "zone_id": "auto" }. This is documented in the Cluster API: https://docs.databricks.com/dev-tools/api/latest/clusters.html#aw...
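As a rough illustration of the answer above, this sketch builds a cluster definition that carries the "zone_id": "auto" attribute. Only the aws_attributes fragment comes from the reply; the cluster name, runtime version, node type, and worker count are placeholder values I picked for the example, and the body would still need to be sent to the clusters create endpoint with your own workspace URL and token.

```python
import json


def cluster_spec_with_auto_az(name, spark_version, node_type_id, num_workers):
    """Build a cluster definition that leaves the availability zone choice to Databricks."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # The one setting taken from the reply above.
        "aws_attributes": {"zone_id": "auto"},
    }


# Placeholder values, not recommendations.
spec = cluster_spec_with_auto_az("auto-az-demo", "7.3.x-scala2.12", "i3.xlarge", 2)
print(json.dumps(spec, indent=2))
```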

by User16137833804 • New Contributor III

06-18-2021 2:58:29 PM

5617 Views
2 replies
0 kudos

How can I monitor the costs of an Azure Databricks cluster via PowerBI?

Latest Reply

aladda
Honored Contributor II

06-21-2021 12:53:46 PM

0 kudos

There is a native Cost Management Connector in Power BI that lets you build powerful, customized visualizations and cost/usage reports. I also recommend reviewing the Chargeback/Cost Analysis section of the ADB Best Practices guide here - https://...

by Ryan_Chynoweth • Honored Contributor III

05-28-2021 11:42:18 AM

1623 Views
1 reply
1 kudos

Can you execute an Azure Databricks Notebook from Azure Data Factory? Can you return values back to the Data Factory from the Notebook?

Latest Reply

Ryan_Chynoweth
Honored Contributor III

06-21-2021 12:45:33 PM

1 kudos

Yes, Azure Data Factory can execute code on Azure Databricks. The best way to return values from the notebook to Data Factory is to call the dbutils.notebook.exit() function at the end of your notebook, or wherever you want to terminate execution.
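Since dbutils.notebook.exit() takes a single string, one way to hand back several values is to serialize them as JSON first. This is a small sketch of that idea; the field names (status, row_count) are made up for the example, and the exit call itself is left as a comment because dbutils only exists inside a Databricks notebook.

```python
import json


def build_exit_value(status, row_count):
    """Pack the values to return into one JSON string."""
    return json.dumps({"status": status, "row_count": row_count})


# Hypothetical values a notebook might report back.
exit_value = build_exit_value("ok", 1250)
print(exit_value)

# In a Databricks notebook, where dbutils is predefined, the final cell would be:
# dbutils.notebook.exit(exit_value)
```

On the Data Factory side, my understanding is that the returned string shows up in the Notebook activity's output (as runOutput), where a later activity can parse it; check the Data Factory docs for the exact expression.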

by Anonymous • Not applicable

06-02-2021 5:30:18 PM

2985 Views
1 reply
0 kudos

Resolved! When to use High Concurrency clusters? What are the benefits?

Latest Reply

Ryan_Chynoweth
Honored Contributor III

06-21-2021 12:44:00 PM

0 kudos

The key benefits of High Concurrency clusters are that they provide fine-grained sharing for maximum resource utilization and minimum query latencies. Note that a Standard cluster is recommended for a single user. Standard clusters can run workloads d...

by University_RobR • New Contributor

06-11-2021 12:41:49 PM

626 Views
1 reply
0 kudos

What Databricks resources are available for university faculty members?

I would like to use Databricks to teach large-scale analytics in my classroom; does Databricks have any resources or community assets that can help me out?

Latest Reply

Ryan_Chynoweth
Honored Contributor III

06-21-2021 12:36:33 PM

0 kudos

For folks who are looking to leverage Databricks as a teaching asset, please contact us about the Databricks University Alliance: https://databricks.com/p/teach

by University_RobR • New Contributor

06-11-2021 12:42:47 PM

645 Views
1 reply
0 kudos

What Databricks resources are available for university students?

I want to learn how to use Databricks for my courses at university, and maybe to get a Databricks Certification. Can you help me out?

Latest Reply

Ryan_Chynoweth
Honored Contributor III

06-21-2021 12:34:05 PM

0 kudos

We have a ton of great resources available for people who want to learn Databricks, specifically for university students. Check out our university page to learn more about Databricks Community Edition, free workshops, and self-paced course...

by User16826994223 • Honored Contributor III

06-14-2021 6:26:52 AM

564 Views
1 reply
0 kudos

Efficient data retrieval process between Azure Blob storage and Azure databricks

I am trying to design a streaming data analytics project using functions --> event hub --> storage --> Azure Data Factory --> Databricks --> SQL Server. What I am struggling with at the moment is the idea about how to optimize "data retrieval" to feed m...

Latest Reply

Ryan_Chynoweth
Honored Contributor III

06-21-2021 12:31:54 PM

0 kudos

Check out our Auto Loader capabilities, which can automatically track and process the files that need to be processed. There are two options: directory listing, which is essentially completing the same steps that you have listed above but in a sl...

by User16826992666 • Valued Contributor

06-15-2021 9:13:25 PM

1229 Views
1 reply
0 kudos

Resolved! Can you implement fine-grained access controls on Delta tables?

I would like to provide row- and column-level security on the tables I have created in my workspace. Is there any way to do this?

Latest Reply

Ryan_Chynoweth
Honored Contributor III

06-21-2021 12:08:32 PM

0 kudos

Databricks includes two user functions that allow you to express column- and row-level permissions dynamically in the body of a view definition.
current_user(): return the current user name.
is_member(): determine if the current user is a member of a s...
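To make the two functions above a bit more concrete, here is a sketch that builds a view definition using both: is_member() gates a sensitive column and current_user() restricts rows to their owner. The view, table, group, and column names are all hypothetical, and the 'REDACTED' placeholder is just one possible masking choice, not something from the reply.

```python
def build_secure_view_sql(view, table, group, owner_col, sensitive_col):
    """Return a CREATE VIEW statement with row- and column-level checks."""
    return (
        f"CREATE OR REPLACE VIEW {view} AS\n"
        "SELECT\n"
        f"  {owner_col},\n"
        # Column-level check: only group members see the real value.
        f"  CASE WHEN is_member('{group}') THEN {sensitive_col} "
        f"ELSE 'REDACTED' END AS {sensitive_col}\n"
        f"FROM {table}\n"
        # Row-level check: group members see every row, others only their own.
        f"WHERE is_member('{group}') OR {owner_col} = current_user()"
    )


# Hypothetical object names for illustration only.
view_sql = build_secure_view_sql(
    "sales_secure", "sales_raw", "auditors", "owner_email", "customer_email"
)
print(view_sql)

# In a notebook the view could then be created with: spark.sql(view_sql)
```

Users would then be granted access to the view rather than to the underlying table.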

by User16826987838 • Contributor

06-18-2021 5:50:25 PM

553 Views
1 reply
1 kudos

We are trying to migrate from Dask/Pandas to Databricks. Any gotchas we need to be aware of?

Latest Reply

Mooune_DBU
Valued Contributor

06-21-2021 11:45:10 AM

1 kudos

With Koalas, which is a pandas API on top of Spark DataFrames, there should be minimal code changes required. Please refer to this blog for more info.

by User16765131552 • Contributor III

06-18-2021 12:39:40 PM

4674 Views
1 reply
0 kudos

Read excel files and append to make one data frame in Databricks from azure data lake without specific file names

I am storing Excel files in Azure Data Lake (Gen 1). The filenames follow the same pattern "2021-06-18T09_00_07ONR_Usage_Dataset", "2021-06-18T09_00_07DSS_Usage_Dataset", etc. depending on the date and time. I want to read all the files in th...

Latest Reply

Ryan_Chynoweth
Honored Contributor III

06-21-2021 11:33:00 AM

0 kudos

If you are attempting to read all the files in a directory, you should be able to use a wildcard and filter using the extension. For example:
df = (spark
  .read
  .format("com.crealytics.spark.excel")
  .option("header", "True")
  .option("inferSchema", "tr...
