Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16826992666
by Valued Contributor
  • 2239 Views
  • 1 replies
  • 0 kudos
Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

To time travel to a particular version, it's necessary to have the JSON file for that particular version. The JSON files in the _delta_log have a default retention of 30 days, so by default we can time travel back only up to 30 days. The retention of the D...
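
For example, a time-travel read in PySpark looks like this (a minimal sketch; the table path and version number are placeholders):

# Read the table as of a specific version (its JSON file must still exist in the _delta_log)
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/mnt/demo/events")
# Or read it as of a timestamp
df_old = spark.read.format("delta").option("timestampAsOf", "2021-06-01").load("/mnt/demo/events")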

User16826992666
by Valued Contributor
  • 5075 Views
  • 1 replies
  • 0 kudos

How do I choose which column to partition by?

I am in the process of building my data pipeline, but I am unsure of how to choose which fields in my data I should use for partitioning. What should I be considering when choosing a partitioning strategy?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The important factors when deciding on partition columns are: even distribution of data; choosing a column that is commonly or widely accessed or queried; and not creating multiple levels of partitioning, as you can end up with a large number of small files.
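
A minimal PySpark sketch of a single-level partitioned write (the path and column name are illustrative):

# Partition by one evenly distributed, commonly filtered column; avoid nested partitions
(df.write
   .format("delta")
   .partitionBy("event_date")
   .mode("overwrite")
   .save("/mnt/demo/events"))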

User16826992666
by Valued Contributor
  • 1841 Views
  • 1 replies
  • 0 kudos

If I delete a table through the UI, does it also delete the underlying files?

I am using the UI in the workspace. I can use the Data tab to see my tables, then use the delete option through the UI. But I know there are underlying files that contain the table's data. Are these files also being deleted?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

If the table is external, the files are not deleted. For a managed table, the underlying files do get deleted. Essentially, a "DROP TABLE" command is submitted under the hood.
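
A minimal sketch of checking the table type before dropping it (the table name is a placeholder):

# The "Type" row shows MANAGED or EXTERNAL; only managed tables lose their files on DROP
spark.sql("DESCRIBE TABLE EXTENDED my_db.my_table").show(truncate=False)
spark.sql("DROP TABLE my_db.my_table")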

Srikanth_Gupta_
by Databricks Employee
  • 1836 Views
  • 1 replies
  • 0 kudos
Latest Reply
Srikanth_Gupta_
Databricks Employee
  • 0 kudos

Presto and Athena support reading from external tables using a manifest file, which is a text file containing the list of data files to read for querying a table. This doc explains how to generate the manifest file: https://docs.databricks.com/delta/presto-...
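
Generating the manifest is a single command (a sketch; the table path is a placeholder):

# Writes a _symlink_format_manifest folder under the table path for Presto/Athena to read
spark.sql("GENERATE symlink_format_manifest FOR TABLE delta.`/mnt/demo/events`")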

User16790091296
by Contributor II
  • 4497 Views
  • 1 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Partitioning is a way of distributing the data by keys so that you can restrict the amount of data scanned by each query and improve performance / avoid conflicts. General rules of thumb for choosing the right partition columns: cardinality of a colu...
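
A quick sketch of checking the cardinality of candidate columns before partitioning (column names are illustrative):

from pyspark.sql import functions as F

# Low-cardinality, frequently filtered columns are good partition candidates; high-cardinality ones are not
df.select(F.countDistinct("country"), F.countDistinct("user_id")).show()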

Joseph_B
by Databricks Employee
  • 2383 Views
  • 1 replies
  • 0 kudos

How can I use Databricks to "automagically" distribute scikit-learn model training?

Is there a way to automatically distribute training and model tuning across a Spark cluster, if I want to keep using scikit-learn?

Latest Reply
Joseph_B
Databricks Employee
  • 0 kudos

It depends on what you mean by "automagically." If you want to keep using scikit-learn, there are ways to distribute parts of training and tuning with minimal effort. However, there is no "magic" way to distribute training of an individual model in scik...
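
One low-effort option for the tuning part is the joblib Spark backend (a minimal sketch; it assumes the joblib-spark package is installed on the cluster, and X_train / y_train are placeholder datasets):

from joblib import parallel_backend
from joblibspark import register_spark
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

register_spark()  # register Spark as a joblib backend

search = GridSearchCV(RandomForestClassifier(), {"n_estimators": [50, 100, 200]}, cv=3)
with parallel_backend("spark", n_jobs=3):
    search.fit(X_train, y_train)  # each parameter/fold fit runs as a Spark task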

User16790091296
by Contributor II
  • 2089 Views
  • 1 replies
  • 0 kudos

How to read a Databricks table via the Databricks API in Python?

Using Python-3, I am trying to compare an Excel (xlsx) sheet to an identical spark table in Databricks. I want to avoid doing the compare in Databricks. So I am looking for a way to read the spark table via the Databricks api. Is this possible? How c...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

What is the format of the table? If it is Delta, you could use the Python bindings for the native Rust API (delta-rs) and read the table from your Python code to do the compare, bypassing the metastore.
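
A minimal sketch of that approach, assuming the deltalake package (the delta-rs Python bindings) is installed and the table sits at an illustrative path:

import pandas as pd
from deltalake import DeltaTable

# Read the Delta table straight from storage, without Spark or the metastore
# (cloud paths may additionally need storage_options / credentials)
table_df = DeltaTable("s3://my-bucket/path/to/table").to_pandas()
excel_df = pd.read_excel("my_sheet.xlsx")

# Compare the two frames (requires matching columns/index; pandas >= 1.1)
diff = table_df.compare(excel_df)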

brickster_2018
by Databricks Employee
  • 4331 Views
  • 1 replies
  • 0 kudos
Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

One solution is to get the runId and jobId details using the notebook context in the child notebook and return these values to the parent notebook using dbutils.notebook.exit.
%scala
val jobId = dbutils.notebook.getContext.tags("jobId").toString()
val runId = dbutils...
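
On the parent side, picking those values up could look like this (a Python sketch; the notebook path is a placeholder and it assumes the child exits with a JSON string):

import json

result = dbutils.notebook.run("/path/to/child_notebook", 600)
info = json.loads(result)  # the child returned dbutils.notebook.exit(<json string>)
print(info["jobId"], info["runId"])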

brickster_2018
by Databricks Employee
  • 2837 Views
  • 1 replies
  • 0 kudos

Resolved! Scheduled job did not trigger the job run

I have a job that is scheduled to run every hour, but on rare occasions I see that job runs are skipped.

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

If you choose a timezone with daylight savings, this issue can happen. We recommend choosing the UTC timezone to avoid it. If you select a zone that observes daylight saving time, an hourly job will be skipped or may appear to not fire for an hour...

brickster_2018
by Databricks Employee
  • 4418 Views
  • 1 replies
  • 0 kudos

Resolved! Unable to overwrite the schema of a Delta table

As per the docs, I can overwrite the schema of a Delta table using the "overwriteSchema" option. But I am unable to overwrite the schema for a Delta table.

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

When Table ACLs are enabled, we can't change the schema of a table through a write operation: a write requires only MODIFY permissions, whereas schema changes require OWN permissions. Hence overwriting the schema is not supported when Table ACLs are enabled for the D...
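
Where Table ACLs are not in the way, a schema-replacing overwrite is just an extra option on the write (a minimal sketch; the path is a placeholder):

(df.write
   .format("delta")
   .mode("overwrite")
   .option("overwriteSchema", "true")  # replaces the table schema along with the data
   .save("/mnt/demo/events"))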

brickster_2018
by Databricks Employee
  • 8527 Views
  • 1 replies
  • 0 kudos
Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The below code can be used to get the number of records in a Delta table without querying it:
%scala
import com.databricks.sql.transaction.tahoe.DeltaLog
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql...
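
A rough PySpark alternative that avoids the internal Scala API is to sum the per-file record counts Delta keeps in the transaction log (a sketch only: it assumes statistics collection is enabled, uses a placeholder path, and ignores remove actions, so it is accurate only for append-only tables):

from pyspark.sql import functions as F

log_df = spark.read.json("/mnt/demo/events/_delta_log/*.json")
(log_df
 .where(F.col("add").isNotNull())
 .select(F.get_json_object(F.col("add.stats"), "$.numRecords").cast("long").alias("n"))
 .agg(F.sum("n").alias("numRecords"))
 .show())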

brickster_2018
by Databricks Employee
  • 2328 Views
  • 1 replies
  • 1 kudos

Resolved! Cluster logs missing

On the Databricks cluster UI, when I click on the Driver logs, sometimes I see historic logs and sometimes I see logs for the last few hours. Why do we see this inconsistency?

Latest Reply
brickster_2018
Databricks Employee
  • 1 kudos

This is working per design! This is the expected behavior. When the cluster is in a terminated state, the logs are serviced by the Spark History server hosted on the Databricks control plane. When the cluster is up and running, the logs are serviced by ...

User16790091296
by Contributor II
  • 3182 Views
  • 2 replies
  • 1 kudos

Database within a Database in Databricks

Is it possible to have a folder or database within a database in Azure Databricks? I know you can use "create database if not exists xxx" to get a database, but I want to have folders within that database where I can put tables.

Latest Reply
brickster_2018
Databricks Employee
  • 1 kudos

The default location of a database will be /user/hive/warehouse/<databasename>.db. Irrespective of the location of the database, the tables in the database can have different locations, and these can be specified at the time of creation. Databas...
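
A minimal sketch of giving the database and each of its tables their own folders (names and paths are illustrative):

spark.sql("CREATE DATABASE IF NOT EXISTS sales_db LOCATION '/mnt/lake/sales_db'")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders (order_id INT, amount DOUBLE)
    USING DELTA
    LOCATION '/mnt/lake/sales_db/orders'
""")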

User16790091296
by Contributor II
  • 1174 Views
  • 1 replies
  • 0 kudos

How do we get logs on read queries from delta lake in Databricks?

I've tried with:
df.write.mode("overwrite").format("com.databricks.spark.csv").option("header","true").csv(dstPath)
and
df.write.format("csv").mode("overwrite").save(dstPath)
but now I have 10 csv files, and I need one file and to name it.

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

The header question seems different from your body question. I am assuming that you are asking how to get only a single CSV file when writing? To do so you should use coalesce:
df.coalesce(1).write.format("csv").mode("overwrite").save(dstPath)
This...
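
To end up with a single file under a specific name, one possible pattern (paths are placeholders; only sensible for small outputs) is:

tmp_path = "/mnt/demo/report_tmp"
df.coalesce(1).write.format("csv").mode("overwrite").option("header", "true").save(tmp_path)

# Spark still writes a part-xxxxx file inside the folder; copy it out under the desired name
part_file = [f.path for f in dbutils.fs.ls(tmp_path) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, "/mnt/demo/report.csv")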

