Data Engineering

Forum Posts

User16765131552
by Contributor III
  • 1362 Views
  • 1 replies
  • 0 kudos

Resolved! Connect to MicroStrategy

Can MicroStrategy be connected to Azure Databricks?

Latest Reply
User16765131552
Contributor III
  • 0 kudos

Found this: Azure Databricks to Microstrategy JDBC/ODBC Setup Tips
Purpose: This is a quick reference for common Microstrategy configuration tips, tricks, and common pitfalls when setting up a connection to Databricks.
Networking: For Azure, we recommend...

User16826994223
by Honored Contributor III
  • 1134 Views
  • 1 replies
  • 1 kudos

File path not recognisable for notebook jobs in DBFS

We are working in IDEs, and once the code is developed we put the .py file in DBFS. I am using that DBFS path (dbfs:/artifacts/kg/bootstrap.py) to create a job, but I get a "notebook not found" error. What could be the is...

Latest Reply
User16826994223
Honored Contributor III
  • 1 kudos

The notebooks you create are not stored in the data plane; they are stored in the control plane. You can import notebooks through the import option in the Databricks UI or using the API. A file placed in DBFS cannot be used to create a notebook job.
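As a hedged sketch of that API route (the workspace URL, token, and paths below are hypothetical placeholders), a locally developed .py file can be imported into the workspace as a source-format notebook via the Workspace Import REST API, and the resulting workspace path can then be used when creating the notebook job:

import base64
import requests

HOST = "https://<databricks-instance>"     # hypothetical workspace URL
TOKEN = "<personal-access-token>"          # hypothetical token
headers = {"Authorization": f"Bearer {TOKEN}"}

# Read the locally developed .py file and import it as a Python source notebook.
with open("bootstrap.py", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

requests.post(
    f"{HOST}/api/2.0/workspace/import",
    headers=headers,
    json={
        "path": "/Shared/kg/bootstrap",    # workspace path (not DBFS) to reference in the job
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    },
)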

User16826994223
by Honored Contributor III
  • 707 Views
  • 1 replies
  • 1 kudos

How do I see all the DataFrame columns if I have more than 1000 columns in a DataFrame

I tried printSchema() on a DataFrame in Databricks. The DataFrame has more than 1500 columns, and apparently the printSchema function truncates the result and displays only 1000 items. How do I see all the columns?

Latest Reply
User16826994223
Honored Contributor III
  • 1 kudos

Databricks also shows the schema of a DataFrame when it is created: click the icon next to the name of the variable that holds the DataFrame. If the output exceeds the display limit, I would suggest writing the schema out to a file.
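A minimal sketch of that idea (assuming an existing DataFrame `df` on Databricks; the DBFS output path is a hypothetical example), writing one line per column so the 1000-item display limit does not apply:

# Build one line per column and write the whole schema to DBFS.
schema_lines = [f"{field.name}: {field.dataType.simpleString()}" for field in df.schema.fields]
dbutils.fs.put("dbfs:/tmp/full_schema.txt", "\n".join(schema_lines), True)  # True = overwrite
print(f"{len(schema_lines)} columns written")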

Srikanth_Gupta_
by Valued Contributor
  • 687 Views
  • 1 replies
  • 0 kudos
Latest Reply
Srikanth_Gupta_
Valued Contributor
  • 0 kudos

Yes, we can, using the below code snippet:
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .load()

User16826994223
by Honored Contributor III
  • 294 Views
  • 0 replies
  • 0 kudos

VM bootstrap and authentication When a VM boots up, it automatically authenticates with Databricks control plane using Managed Identity (MI), a per-VM...

VM bootstrap and authentication
When a VM boots up, it automatically authenticates with the Databricks control plane using Managed Identity (MI), a per-VM credential signed by Azure AD. Once authenticated, the VM fetches secrets from the control plane, in...

User16869510359
by Esteemed Contributor
  • 647 Views
  • 1 replies
  • 0 kudos

Resolved! Can I give partition filter conditions for the VACUUM command similar to OPTIMIZE

For the OPTIMIZE command, I can give predicates, and it's easy to optimize the partitions where the data is added. Similarly, can I specify a "WHERE" clause on the partition for a VACUUM command?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

This is by design: the VACUUM command does not support filters on the partition columns, because partially removing old files could impact the time travel feature.
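As a hedged illustration of the difference (the table name `events` and partition column `event_date` are hypothetical), OPTIMIZE accepts a partition predicate while VACUUM only takes a retention period:

# OPTIMIZE can be limited to specific partitions with a WHERE clause.
spark.sql("OPTIMIZE events WHERE event_date >= '2021-06-01'")

# VACUUM has no WHERE clause; it always considers the whole table, controlled only by retention.
spark.sql("VACUUM events RETAIN 168 HOURS")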

User16826994223
by Honored Contributor III
  • 413 Views
  • 0 replies
  • 0 kudos

Best practices: Hyperparameter tuning with Hyperopt Bayesian approaches can be much more efficient than grid search and random search. Hence, with the...

Best practices: Hyperparameter tuning with Hyperopt
Bayesian approaches can be much more efficient than grid search and random search. Hence, with the Hyperopt Tree of Parzen Estimators (TPE) algorithm, you can explore more hyperparameters and larger ...
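A minimal Hyperopt TPE sketch along these lines (the objective function and search space below are purely illustrative stand-ins for real model training):

from hyperopt import fmin, tpe, hp, STATUS_OK

def objective(params):
    # Illustrative loss only; in practice this would train and evaluate a model.
    loss = (params["x"] - 3.0) ** 2
    return {"loss": loss, "status": STATUS_OK}

search_space = {"x": hp.uniform("x", -10, 10)}

# TPE explores the space adaptively instead of exhaustively, so fewer evaluations are needed.
best = fmin(fn=objective, space=search_space, algo=tpe.suggest, max_evals=50)
print(best)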

User16869510359
by Esteemed Contributor
  • 1233 Views
  • 1 replies
  • 0 kudos

Resolved! How to restart the cluster with new instances?

Whenever I restart a Databricks cluster, new instances are not launched, because Databricks re-uses the instances. However, sometimes it's necessary to launch new instances, for example to mitigate a bad VM issue or maybe to get a patch fr...

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Currently, there is no direct option to restart a cluster with new instances. An easy hack to ensure new instances are launched is to add cluster tags to your cluster. This will ensure that new instances have to be acquired, as it's not possible to ...
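A rough, hedged sketch of that hack using the Clusters REST API (the workspace URL, token, cluster id, and tag value are hypothetical; note that clusters/edit replaces the cluster spec, so in real use copy every configured attribute returned by clusters/get into the edit payload):

import requests

HOST = "https://<databricks-instance>"   # hypothetical workspace URL
TOKEN = "<personal-access-token>"        # hypothetical token
CLUSTER_ID = "<cluster-id>"              # hypothetical cluster id
headers = {"Authorization": f"Bearer {TOKEN}"}

# 1) Read the current cluster spec.
spec = requests.get(f"{HOST}/api/2.0/clusters/get",
                    headers=headers, params={"cluster_id": CLUSTER_ID}).json()

# 2) Change a custom tag so the previous instances can no longer be reused.
payload = {
    "cluster_id": CLUSTER_ID,
    "cluster_name": spec["cluster_name"],
    "spark_version": spec["spark_version"],
    "node_type_id": spec["node_type_id"],
    "num_workers": spec.get("num_workers", 0),
    "custom_tags": {**spec.get("custom_tags", {}), "instance_refresh": "v2"},
    # NOTE: copy any other attributes (autoscale, spark_conf, init_scripts, ...) from `spec` too.
}

# 3) Push the edited spec, then restart so fresh instances are acquired.
requests.post(f"{HOST}/api/2.0/clusters/edit", headers=headers, json=payload)
requests.post(f"{HOST}/api/2.0/clusters/restart", headers=headers,
              json={"cluster_id": CLUSTER_ID})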

User16826994223
by Honored Contributor III
  • 939 Views
  • 1 replies
  • 0 kudos
Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

Output operations on DStreams push the DStream's data to external systems like a database or a file system. The following are the key output operations that can be performed on DStreams:
saveAsTextFiles() - saves the DStream's data as text files.
saveAsObjectFil...
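A minimal DStream output sketch (the socket source, host/port, and output prefix below are hypothetical; it assumes an existing SparkContext `sc`):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=10)       # 10-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)     # hypothetical text source

# Each micro-batch is written out as a set of text files named <prefix>-<time_in_ms>.
lines.saveAsTextFiles("dbfs:/tmp/dstream-output/lines")

ssc.start()
ssc.awaitTermination()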

User16869510359
by Esteemed Contributor
  • 2389 Views
  • 1 replies
  • 0 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Off-heap memory is managed outside the executor JVM. Spark has native support for off-heap memory: it is managed by Spark rather than controlled by the executor JVM, so GC cycles on the executor do not clean up off-heap memory. Databr...
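A small sketch of the relevant Spark settings (the size value is illustrative; on Databricks these would typically be set in the cluster's Spark config rather than in a notebook):

from pyspark.sql import SparkSession

# Enable Spark-managed off-heap memory and cap it at an illustrative 2 GB.
spark = (SparkSession.builder
         .config("spark.memory.offHeap.enabled", "true")
         .config("spark.memory.offHeap.size", "2g")
         .getOrCreate())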

User16869510359
by Esteemed Contributor
  • 622 Views
  • 1 replies
  • 1 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 1 kudos

At a high level, a VACUUM operation on a Delta table has 2 steps:
1) Identifying the stale files based on the VACUUM command triggered.
2) Deleting the files identified in Step 1.
Step 1 is performed by triggering a Spark job and hence utilizes the resource o...
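As a hedged illustration of the two steps (the table name `events` is hypothetical), DRY RUN performs only the identification step, while a plain VACUUM also performs the deletion:

# Step 1 only: list the stale files that would be removed, without deleting anything.
spark.sql("VACUUM events RETAIN 168 HOURS DRY RUN").show(truncate=False)

# Steps 1 and 2: identify and then actually delete the stale files.
spark.sql("VACUUM events RETAIN 168 HOURS")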

User16826994223
by Honored Contributor III
  • 714 Views
  • 1 replies
  • 0 kudos

Even an unfinished experiment in MLflow is getting saved as finished

When I start the experiment with mlflow.start_run(), even if my script is interrupted or fails before executing mlflow.end_run(), the run gets tagged as finished instead of unfinished. Can anyone help explain why this is happening?

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

In a notebook, MLflow tracks the run as the commands execute, and once a command fails or exits, it logs and finishes the run there, even if the notebook fails. However, if you want to continue logging metrics or artifacts to that run, you just need to use...
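A minimal sketch of that pattern (the parameter and metric names are illustrative): using start_run as a context manager marks the run FAILED when an exception escapes, and the same run can be reopened later by its run_id to keep logging:

import mlflow

# The context manager sets the run status to FAILED if an exception escapes the block,
# instead of leaving the run marked FINISHED.
with mlflow.start_run() as run:
    mlflow.log_param("lr", 0.01)     # illustrative parameter
    # ... training code ...

# Reopen the same run later and continue logging to it.
with mlflow.start_run(run_id=run.info.run_id):
    mlflow.log_metric("val_accuracy", 0.91)   # illustrative metric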

User16869510359
by Esteemed Contributor
  • 580 Views
  • 1 replies
  • 0 kudos

Resolved! Why is my streaming job not resuming even though I specified checkpoint directory

I have provided the checkpointLocation as below; however, I see the config is ignored for my streaming query:
option("checkpointLocation", "path/to/checkpoint/dir")

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

This is a common question from many users. If the streaming checkpoint directory is specified correctly, then this behavior is expected. Below is an example of specifying the checkpoint correctly:
df.writeStream
  .format("parquet")
  .option("checkpo...
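A hedged end-to-end sketch (the streaming DataFrame `df` and the DBFS paths are hypothetical): the checkpointLocation has to be set on the writer, and the same path must be reused across restarts for the query to resume:

query = (df.writeStream
           .format("parquet")
           .option("checkpointLocation", "dbfs:/checkpoints/my_query")  # keep stable across restarts
           .option("path", "dbfs:/output/my_query")
           .start())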
