Data Engineering

Forum Posts

by User16826994223 • Honored Contributor III

06-25-2021 8:43:48 AM

840 Views
0 replies
0 kudos

How do we decide between Avro and Parquet? Which file format would help more?

I know that most of the time Parquet files are great for different workloads, but I still see Avro files in use. In what kind of scenario would Avro be a better choice than Parquet?

by User16826994223 • Honored Contributor III

06-25-2021 8:35:28 AM

336 Views
0 replies
0 kudos

Avro file (June 11, 2021)

Apache Avro is a data serialization system. Avro provides: rich data structures; a compact, fast, binary data format; a container file, to store persistent data; remote procedure call (RPC); simple integration with dynamic languages. ...

by User16765131552 • Contributor III

06-25-2021 8:29:35 AM

1805 Views
1 replies
0 kudos

Resolved! Connect to Microstrategy

Can Azure Databricks be connected through Microstrategy?

Latest Reply

User16765131552
Contributor III

06-25-2021 8:30:29 AM

0 kudos

Found this:

Azure Databricks to Microstrategy JDBC/ODBC Setup Tips

Purpose: This is a quick reference for common Microstrategy configuration tips, tricks, and common pitfalls when setting up a connection to Databricks.

Networking: For Azure, we recommend...

by User16826994223 • Honored Contributor III

06-25-2021 8:26:39 AM

1349 Views
1 replies
1 kudos

File path not recognisable for notebook jobs in DBFS

We work in IDEs, and once the code is developed we put the .py file in DBFS. I am using that DBFS path to create a job, but for dbfs:/artifacts/kg/bootstrap.py I get a "notebook not found" error. What could be the is...

Latest Reply

User16826994223
Honored Contributor III

06-25-2021 8:27:15 AM

1 kudos

The notebooks that you create are not stored in the data plane; they are stored in the control plane. You can import notebooks through the import option in the Databricks UI or by using the API. A notebook placed in DBFS cannot be used to create a job.

by User16826994223 • Honored Contributor III

06-25-2021 8:07:43 AM

919 Views
1 replies
1 kudos

How do I see all the DataFrame columns if I have more than 1000 columns in the DataFrame?

I tried printSchema() on a DataFrame in Databricks. The DataFrame has more than 1500 columns, and apparently the printSchema function truncates the results and displays only 1000 items. How can I see all the columns?

Latest Reply

User16826994223
Honored Contributor III

06-25-2021 8:10:19 AM

1 kudos

Databricks also shows the schema of the DataFrame when it's created: click on the icon next to the name of the variable that holds the DataFrame. If the output exceeds the display limit, I would suggest writing the schema to a file.
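One way to write the schema to a file, as a minimal sketch rather than a definitive recipe: PySpark's df.dtypes returns a list of (column name, type string) pairs, so a small helper can write one line per column. The helper name write_schema and the sample column names below are hypothetical, and the sample list only stands in for df.dtypes on a real wide DataFrame.

```python
from pathlib import Path
from typing import Iterable, Tuple


def write_schema(dtypes: Iterable[Tuple[str, str]], path: str) -> int:
    """Write one 'name: type' line per column and return the column count."""
    lines = [f"{name}: {dtype}" for name, dtype in dtypes]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Stand-in for df.dtypes on a wide DataFrame (hypothetical column names).
sample_dtypes = [(f"col_{i}", "string") for i in range(1500)]
```

On a real DataFrame the call would look like write_schema(df.dtypes, "/tmp/schema.txt"), where the output path is an assumption. Because the file is written by plain Python, no notebook display limit applies to it.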

by Srikanth_Gupta_ • Valued Contributor

06-25-2021 8:06:07 AM

921 Views
1 replies
0 kudos

Can we subscribe to a pattern of topics (Kafka) from Structured Streaming?

Latest Reply

Srikanth_Gupta_
Valued Contributor

06-25-2021 8:06:58 AM

0 kudos

Yes, we can, using the code snippet below:

    spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribePattern", "topic.*")
      .load()
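To get a feel for which topic names a pattern such as "topic.*" would select, here is a small sketch that does not need a cluster. It assumes the pattern is applied as a regular expression that must match the whole topic name, and it uses Python's re module as an approximation of the Java regex the Kafka consumer uses. The helper name matching_topics and the topic names are hypothetical.

```python
import re
from typing import List


def matching_topics(pattern: str, topics: List[str]) -> List[str]:
    """Return the topics whose full name matches the regex pattern."""
    compiled = re.compile(pattern)
    return [t for t in topics if compiled.fullmatch(t)]


# Hypothetical topic names, for illustration only.
topics = ["topic.orders", "topic.payments", "audit.events", "my.topic.orders"]
selected = matching_topics("topic.*", topics)
print(selected)
```

Note that "." in a regex matches any character, so "topic.*" selects every topic whose name starts with "topic". To require a literal dot, the pattern would be "topic\..*".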

by User16826994223 • Honored Contributor III

06-25-2021 8:01:27 AM

427 Views
0 replies
0 kudos

VM bootstrap and authentication

When a VM boots up, it automatically authenticates with the Databricks control plane using Managed Identity (MI), a per-VM credential signed by Azure AD. Once authenticated, the VM fetches secrets from the control plane, in...

by brickster_2018 • Esteemed Contributor

06-25-2021 7:15:58 AM

940 Views
1 replies
0 kudos

Resolved! Can I give partition filter conditions for the VACUUM command, similar to OPTIMIZE?

For the OPTIMIZE command, I can give predicates, which makes it easy to optimize only the partitions where data was added. Similarly, can I specify a "WHERE" clause on the partition for a VACUUM command?

Latest Reply

brickster_2018
Esteemed Contributor

06-25-2021 7:16:17 AM

0 kudos

This is by design: the VACUUM command does not support filters on the partition columns, because removing old files only partially can impact the time travel feature.

by User16826994223 • Honored Contributor III

06-25-2021 7:10:24 AM

616 Views
0 replies
0 kudos

Best practices: Hyperparameter tuning with Hyperopt

Bayesian approaches can be much more efficient than grid search and random search. Hence, with the Hyperopt Tree of Parzen Estimators (TPE) algorithm, you can explore more hyperparameters and larger ...

by brickster_2018 • Esteemed Contributor

06-25-2021 7:07:10 AM

1549 Views
1 replies
0 kudos

Resolved! How to restart the cluster with new instances?

Whenever I restart a Databricks cluster, new instances are not launched, because Databricks reuses the existing instances. However, sometimes new instances are needed, for example to mitigate a bad VM issue or maybe to get a patch fr...

Latest Reply

brickster_2018
Esteemed Contributor

06-25-2021 7:07:34 AM

0 kudos

Currently, there is no direct option to restart the cluster with new instances. An easy hack to ensure new instances are launched is to add cluster tags to your cluster. This ensures that new instances have to be acquired, as it's not possible to ...

by User16826994223 • Honored Contributor III

06-25-2021 7:01:43 AM

1202 Views
1 replies
0 kudos

What are the output operations that can be performed on DStreams?

Latest Reply

User16826994223
Honored Contributor III

06-25-2021 7:02:11 AM

0 kudos

Output operations on DStreams push the DStream's data to external systems such as a database or a file system. The following are the key operations that can be performed on DStreams:

saveAsTextFiles() - Saves the DStream's data as text files.
saveAsObjectFil...

by brickster_2018 • Esteemed Contributor

06-25-2021 6:55:16 AM

2837 Views
1 replies
0 kudos

Resolved! What is off-heap memory? For which instances is off-heap memory enabled by default?

Latest Reply

brickster_2018
Esteemed Contributor

06-25-2021 6:55:42 AM

0 kudos

Off-heap memory is managed outside the executor JVM. Spark has native support for off-heap memory, which is managed by Spark itself and not controlled by the executor JVM. Hence, GC cycles on the executor do not clean up off-heap memory. Databr...

by brickster_2018 • Esteemed Contributor

06-25-2021 6:46:19 AM

834 Views
1 replies
0 kudos

Resolved! Is there a way to see if Auto Optimize is activated in the Delta history?

Latest Reply

brickster_2018
Esteemed Contributor

06-25-2021 6:47:46 AM

0 kudos

An OPTIMIZE performed by Auto Optimize shows up in the output of DESC HISTORY. The parameter to look for is auto=true.
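As a rough sketch of that check, the helper below scans history rows for OPTIMIZE operations whose operationParameters contain auto=true. It assumes the rows have already been collected into dictionaries with the DESCRIBE HISTORY columns version, operation, and operationParameters. The helper name auto_optimize_versions and the sample rows are hypothetical, not real history output.

```python
from typing import Any, Dict, List


def auto_optimize_versions(history: List[Dict[str, Any]]) -> List[int]:
    """Return the table versions written by an OPTIMIZE with auto=true."""
    versions = []
    for row in history:
        params = row.get("operationParameters") or {}
        is_auto = str(params.get("auto", "")).lower() == "true"
        if row.get("operation") == "OPTIMIZE" and is_auto:
            versions.append(row["version"])
    return versions


# Hypothetical history rows, for illustration only.
sample_history = [
    {"version": 3, "operation": "OPTIMIZE", "operationParameters": {"auto": "true"}},
    {"version": 2, "operation": "OPTIMIZE", "operationParameters": {"auto": "false"}},
    {"version": 1, "operation": "WRITE", "operationParameters": {"mode": "Append"}},
]
print(auto_optimize_versions(sample_history))
```

In a notebook, the rows could be obtained with something like [r.asDict() for r in spark.sql("DESCRIBE HISTORY my_table").collect()], where my_table is a placeholder table name.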

by User16826994223 • Honored Contributor III

06-25-2021 6:46:07 AM

1430 Views
1 replies
0 kudos

Resolved! Can we have multiple MLflow runs in parallel?

Latest Reply

User16826994223
Honored Contributor III

06-25-2021 6:46:29 AM

0 kudos

I think you can find a solution on the MLflow GitHub page. Code examples are here: https://github.com/mlflow/mlflow/issues/3592

by brickster_2018 • Esteemed Contributor

06-25-2021 6:42:42 AM

874 Views
1 replies
1 kudos

Resolved! Is VACUUM performed in a distributed manner, utilizing my cluster resources?

Latest Reply

brickster_2018
Esteemed Contributor

06-25-2021 6:43:18 AM

1 kudos

At a high level, a VACUUM operation on a Delta table has 2 steps:

1) Identifying the stale files based on the VACUUM command triggered.
2) Deleting the files identified in Step 1.

Step 1 is performed by triggering a Spark job and hence utilizes the resource o...
