Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

MoJaMa
by Databricks Employee
  • 1222 Views
  • 1 reply
  • 0 kudos
Latest Reply
MoJaMa
Databricks Employee
  • 0 kudos

That’s only available on the Premium and Enterprise SKUs in AWS. See the "Enterprise Security" section here: https://databricks.com/product/aws-pricing

User16783853501
by Databricks Employee
  • 2156 Views
  • 1 reply
  • 0 kudos

What types of files does Auto Loader support for streaming ingestion? I see good support for CSV and JSON; how can I ingest files like XML, Avro, Parquet, etc.? Would XML rely on Spark-XML?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Please raise a feature request via the ideas portal for XML support in Auto Loader. As a workaround, you could look at reading the files with wholeTextFiles (which loads the data into a PairRDD with one record per input file) and parsing them with from_xml from ...
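A minimal sketch of that workaround, assuming a hypothetical /mnt/raw/xml/ landing folder and a simple <record> element schema; it parses each file with Python's xml.etree rather than from_xml, but the wholeTextFiles step is the same:

```python
import xml.etree.ElementTree as ET

# (path, full file content) pairs, one record per input file
raw = spark.sparkContext.wholeTextFiles("/mnt/raw/xml/*.xml")

def parse(content):
    root = ET.fromstring(content)
    return [(rec.findtext("id"), rec.findtext("name")) for rec in root.iter("record")]

# flatten the per-file record lists into one DataFrame
df = raw.flatMap(lambda kv: parse(kv[1])).toDF(["id", "name"])
df.show()
```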

User16790091296
by Contributor II
  • 2508 Views
  • 1 reply
  • 1 kudos

Using Databricks Connect (DBConnect)

I'd like to edit Databricks notebooks locally using my favorite editor, and then use Databricks Connect to run the notebook remotely on a Databricks cluster that I usually access via the web interface. I run "databricks-connect configure", as suggest...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

Here is the link to the configuration properties: https://docs.databricks.com/dev-tools/databricks-connect.html#step-2-configure-connection-properties
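Once the connection properties are in place, a quick sanity check from a local Python session confirms the remote cluster is reachable (the client also ships a databricks-connect test command for this):

```python
from pyspark.sql import SparkSession

# Databricks Connect routes this session to the remote cluster
# configured by `databricks-connect configure`
spark = SparkSession.builder.getOrCreate()
print(spark.range(10).count())  # prints 10 if the cluster is reachable
```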

User16790091296
by Contributor II
  • 9282 Views
  • 1 reply
  • 0 kudos

Azure Databricks: How to add Spark configuration in Databricks cluster?

I am using a Spark Databricks cluster and want to add a customized Spark configuration. There is Databricks documentation on this, but I am not getting any clue about how and what changes I should make. Can someone please share an example to configure the Da...

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

You can set the configurations on the Databricks cluster UI: https://docs.databricks.com/clusters/configure.html#spark-configuration
To see the default configuration, run the following code in a notebook:
%sql set;
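Cluster-level settings go in the Spark config box of that UI, one key-value pair per line. Session-level settings can also be changed from a notebook; the key and value below are just examples:

```python
# set and read back a session-level Spark configuration value
spark.conf.set("spark.sql.shuffle.partitions", "100")
print(spark.conf.get("spark.sql.shuffle.partitions"))
```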

User16790091296
by Contributor II
  • 10945 Views
  • 1 reply
  • 0 kudos

How to list the notebooks in a workspace - Databricks?

I want to list the notebooks in a folder in Databricks. I tried to use utilities like dbutils.fs.ls("/path") -> it shows the path of the storage folder. I also tried dbutils.notebook.help() - nothing useful. Let's say there is a fol...

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

Notebooks are not stored in DBFS, so they cannot be listed directly from the file system. You should use the Databricks REST API to list them and get the details: https://docs.databricks.com/dev-tools/api/latest/workspace.html#list
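A sketch of that API call; the host, token, and folder path are placeholders you must supply:

```python
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder
token = "<personal-access-token>"                        # placeholder

resp = requests.get(
    f"{host}/api/2.0/workspace/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/Users/someone@example.com/my-folder"},  # folder to list
)
resp.raise_for_status()
for obj in resp.json().get("objects", []):
    if obj.get("object_type") == "NOTEBOOK":
        print(obj["path"])
```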

User16826992666
by Valued Contributor
  • 2406 Views
  • 1 reply
  • 0 kudos
Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

To time travel to a particular version, it's necessary to have the JSON file for that particular version. The JSON files in the _delta_log have a default retention of 30 days, so by default we can time travel only up to 30 days back. The retention of the D...
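For reference, reading an older version of a Delta table looks like this; the path and version number are hypothetical:

```python
# version-based time travel; timestampAsOf works the same way
df = (spark.read.format("delta")
      .option("versionAsOf", 5)
      .load("/mnt/delta/events"))
```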

User16826992666
by Valued Contributor
  • 5329 Views
  • 1 reply
  • 0 kudos

How do I choose which column to partition by?

I am in the process of building my data pipeline, but I am unsure of how to choose which fields in my data I should use for partitioning. What should I be considering when choosing a partitioning strategy?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The important factors when deciding on partition columns are:
  • Even distribution of data.
  • Choose a column that is commonly or widely accessed or queried.
  • Do not create multiple levels of partitioning, as you can end up with a large number of small files.
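For example, writing a Delta table partitioned by one frequently filtered, evenly distributed column; the DataFrame, column, and path are hypothetical:

```python
(df.write.format("delta")
   .partitionBy("event_date")  # a single level of partitioning
   .mode("overwrite")
   .save("/mnt/delta/events"))
```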

User16826992666
by Valued Contributor
  • 1943 Views
  • 1 reply
  • 0 kudos

If I delete a table through the UI, does it also delete the underlying files?

I am using the UI in the workspace. I can use the Data tab to see my tables, then use the delete option through the UI. But I know there are underlying files that contain the table's data. Are these files also being deleted?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

If the table is external, the files are not deleted. For a managed table, the underlying files get deleted. Essentially, a "DROP TABLE" command is submitted under the hood.
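To illustrate the difference; the table names and S3 location are hypothetical:

```python
spark.sql("CREATE TABLE managed_demo (id INT)")
spark.sql("CREATE TABLE external_demo (id INT) LOCATION 's3://my-bucket/external_demo'")

spark.sql("DROP TABLE managed_demo")   # removes metadata AND the underlying files
spark.sql("DROP TABLE external_demo")  # removes metadata only; files at LOCATION remain
```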

Srikanth_Gupta_
by Databricks Employee
  • 1951 Views
  • 1 reply
  • 0 kudos
Latest Reply
Srikanth_Gupta_
Databricks Employee
  • 0 kudos

Presto and Athena support reading from external tables using a manifest file, which is a text file containing the list of data files to read when querying a table. This doc explains how to generate the manifest file: https://docs.databricks.com/delta/presto-...
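Generating the manifest with the Delta Lake Python API looks like this; the table path is hypothetical:

```python
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "s3://my-bucket/delta/events")
delta_table.generate("symlink_format_manifest")  # writes the manifest alongside the table
```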

User16790091296
by Contributor II
  • 4825 Views
  • 1 reply
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Partitioning is a way of distributing the data by keys so that you can restrict the amount of data scanned by each query and improve performance / avoid conflicts. General rules of thumb for choosing the right partition columns: Cardinality of a colu...
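Since cardinality is the first rule of thumb, a quick check before committing to a partition column helps; the DataFrame and column name are hypothetical:

```python
# very high cardinality usually means too many small files
distinct_count = df.select("country").distinct().count()
print(f"{distinct_count} distinct values in candidate partition column")
```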

Joseph_B
by Databricks Employee
  • 2531 Views
  • 1 reply
  • 0 kudos

How can I use Databricks to "automagically" distribute scikit-learn model training?

Is there a way to automatically distribute training and model tuning across a Spark cluster, if I want to keep using scikit-learn?

Latest Reply
Joseph_B
Databricks Employee
  • 0 kudos

It depends on what you mean by "automagically." If you want to keep using scikit-learn, there are ways to distribute parts of training and tuning with minimal effort. However, there is no "magic" way to distribute training of an individual model in scik...
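One such minimal-effort route is distributing the hyperparameter search across the cluster, sketched here under the assumption that the joblibspark package is installed:

```python
from joblib import parallel_backend
from joblibspark import register_spark
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

register_spark()  # register the "spark" joblib backend

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}, cv=3)

with parallel_backend("spark", n_jobs=6):
    search.fit(X, y)  # each cross-validation fit runs as a Spark task

print(search.best_params_)
```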

User16790091296
by Contributor II
  • 2185 Views
  • 1 reply
  • 0 kudos

How to read a Databricks table via the Databricks API in Python?

Using Python 3, I am trying to compare an Excel (xlsx) sheet to an identical Spark table in Databricks. I want to avoid doing the comparison in Databricks, so I am looking for a way to read the Spark table via the Databricks API. Is this possible? How c...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

What is the format of the table? If it is Delta, you could use the Python bindings for the native Rust API, read the table from your Python code, and do the comparison there, bypassing the metastore.
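A sketch, assuming the reply refers to the deltalake package (the Python bindings for the delta-rs Rust implementation); the table path and file name are hypothetical:

```python
import pandas as pd
from deltalake import DeltaTable

# read the Delta table directly from storage, no metastore involved
table_df = DeltaTable("s3://my-bucket/delta/events").to_pandas()
excel_df = pd.read_excel("local_copy.xlsx")

# naive comparison: same columns and same sorted contents
cols = sorted(table_df.columns)
same = (table_df[cols].sort_values(cols).reset_index(drop=True)
        .equals(excel_df[cols].sort_values(cols).reset_index(drop=True)))
print(same)
```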

