Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Anonymous
by Not applicable
  • 2170 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16783855117
Databricks Employee
  • 0 kudos

It really depends on your business intentions! You can remove files that are no longer referenced by a Delta table and are older than the retention threshold by running the VACUUM command on the table. VACUUM is not triggered automatically. The default retent...
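To illustrate the reply above, here is a minimal sketch that assembles a Delta `VACUUM` statement with an explicit retention window. The helper `vacuum_statement`, the table name `events`, and the 240-hour value are all assumptions made for this example, not values from the thread. `DRY RUN` asks Delta to list the files it would delete without removing them.

```python
def vacuum_statement(table_name, retain_hours=None, dry_run=False):
    """Build a Delta Lake VACUUM statement as a SQL string (hypothetical helper)."""
    parts = ["VACUUM", table_name]
    if retain_hours is not None:
        if retain_hours < 0:
            raise ValueError("retain_hours must not be negative")
        # An explicit retention window overrides the table's configured threshold.
        parts.append(f"RETAIN {retain_hours} HOURS")
    if dry_run:
        # DRY RUN only reports the files that would be removed.
        parts.append("DRY RUN")
    return " ".join(parts)


# "events" and 240 are illustrative values only.
stmt = vacuum_statement("events", retain_hours=240, dry_run=True)
print(stmt)  # VACUUM events RETAIN 240 HOURS DRY RUN
```

In a notebook the resulting string could be passed to `spark.sql(stmt)`. Since VACUUM is not triggered automatically, a call like this would typically live in a scheduled job.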

Anonymous
by Not applicable
  • 2951 Views
  • 2 replies
  • 0 kudos

Resolved! Best practices to query logs

We dump our logs in S3 currently. Can you give us best practices to make these logs easier to query?

Latest Reply
sajith_appukutt
Databricks Employee
  • 0 kudos

And if these are generic logs that land on S3, it'd be worth taking a look at Auto Loader. Here is a blog post on processing CrowdStrike logs in a similar way.

1 More Replies
Anonymous
by Not applicable
  • 4823 Views
  • 1 reply
  • 0 kudos

Resolved! Backfill Delta table

What is the recommended way to backfill a Delta table using a series of smaller, date-partitioned jobs?

Latest Reply
User16783855117
Databricks Employee
  • 0 kudos

Another approach you might consider is creating a template notebook to query a known date range with widgets. For example, two date widgets, start time and end time. Then from there you could use Databricks Jobs to update these parameters for each ru...
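As a rough sketch of the widget-driven approach described above, the snippet below splits a backfill range into smaller windows and produces one parameter set per job run. The function `backfill_windows` and the widget names `start_date` and `end_date` are hypothetical, and the dates are made up for illustration. Each window is half-open (start inclusive, end exclusive) so that consecutive runs do not overlap.

```python
from datetime import date, timedelta


def backfill_windows(start, end, days_per_run):
    """Split [start, end) into consecutive windows of at most days_per_run days."""
    if days_per_run < 1:
        raise ValueError("days_per_run must be at least 1")
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + timedelta(days=days_per_run), end)
        # Keys are hypothetical widget names; values are ISO strings because
        # notebook parameters are passed as text.
        windows.append({"start_date": cursor.isoformat(),
                        "end_date": upper.isoformat()})
        cursor = upper
    return windows


# Illustrative range: nine days processed in runs of at most four days each.
runs = backfill_windows(date(2021, 1, 1), date(2021, 1, 10), days_per_run=4)
for params in runs:
    print(params)
```

Inside the template notebook, each value could then be read with `dbutils.widgets.get("start_date")` and `dbutils.widgets.get("end_date")` and used to filter the source query for that run.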

User16776430979
by Databricks Employee
  • 1889 Views
  • 0 replies
  • 0 kudos

How to optimize conversion between PySpark and Arrow?

It seems you can convert between DataFrames and Arrow objects by using pandas as an intermediary, but there are some limitations (e.g. it collects all records in the DataFrame to the driver and should be done on a small subset of the data, you hit ...

User16790091296
by Databricks Employee
  • 1604 Views
  • 0 replies
  • 5 kudos

Some Tips & Tricks for Optimizing costs and performance (Clusters and Ganglia)

Some Tips & Tricks for Optimizing costs and performance (Clusters and Ganglia): [Note: This list is not exhaustive] Leverage the DataFrame or Spark SQL APIs first. They use the same execution process, resulting in performance parity, but they also com...

Anonymous
by Not applicable
  • 4287 Views
  • 1 reply
  • 0 kudos

Resolved! Delta vs parquet

When does it make sense to use Delta over Parquet? Are there any instances when you would rather use Parquet?

Latest Reply
Ryan_Chynoweth
Databricks Employee
  • 0 kudos

Users should almost always choose Delta over Parquet. Keep in mind that Delta is a storage format that sits on top of Parquet, so the performance of writing to both formats is similar. However, reading data and transforming data with Delta is almost a...

Anonymous
by Not applicable
  • 19229 Views
  • 1 reply
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Databricks Employee
  • 0 kudos

An action in Spark is any operation that does not return an RDD. Evaluation is executed when an action is taken. Actions trigger the scheduler, which builds a directed acyclic graph (DAG) as a plan of execution. The plan of execution is created by wor...

Anonymous
by Not applicable
  • 1851 Views
  • 1 reply
  • 0 kudos

Resolved! Converting between pandas and Koalas

When and why should I convert between a pandas and a Koalas DataFrame? What are the implications?

Latest Reply
Ryan_Chynoweth
Databricks Employee
  • 0 kudos

Koalas is distributed across a Databricks cluster, similar to how Spark DataFrames are distributed. A pandas DataFrame lives only in memory on the Spark driver. If you are a pandas user and are using a multi-node cluster then you should use Koalas to p...

Anonymous
by Not applicable
  • 1635 Views
  • 0 replies
  • 0 kudos

Append subset of columns to target Snowflake table

I’m using the databricks-snowflake connector to load data into a Snowflake table. Can someone point me to an example of how we can append only a subset of columns to a target Snowflake table (for example some columns in the target Snowflake table ar...

Anonymous
by Not applicable
  • 1114 Views
  • 0 replies
  • 0 kudos

Detailed logs for R process

We have a user notebook in R that reliably crashes the driver. Are detailed logs from the R process stored somewhere on drivers/workers?
