Data Engineering

Forum Posts

Anonymous
by Not applicable
  • 706 Views
  • 2 replies
  • 0 kudos

Resolved! Best practices to query logs

We currently dump our logs in S3. What are some best practices for making these logs easier to query?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

And if these are generic logs that get landed in S3, it'd be worth taking a look at Auto Loader. Here is a blog post on processing CrowdStrike logs in a similar way.

1 More Replies
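For reference, a minimal Auto Loader sketch for the suggestion above, assuming JSON-formatted logs; the bucket paths and table name are placeholders:

```python
# Incrementally ingest JSON logs from S3 into a Delta table with Auto Loader.
# All paths and the table name below are hypothetical placeholders.
df = (spark.readStream
      .format("cloudFiles")                                          # Auto Loader source
      .option("cloudFiles.format", "json")                           # assuming JSON logs
      .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/logs")
      .load("s3://my-bucket/raw-logs/"))                             # log landing path

(df.writeStream
   .option("checkpointLocation", "s3://my-bucket/_checkpoints/logs")
   .trigger(once=True)          # run as an incremental batch; only new files are processed
   .toTable("logs_bronze"))     # the resulting Delta table is easy to query with SQL
```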
Anonymous
by Not applicable
  • 2537 Views
  • 1 reply
  • 0 kudos

Resolved! Backfill Delta table

What is the recommended way to backfill a Delta table using a series of smaller date-partitioned jobs?

Latest Reply
User16783855117
Contributor II
  • 0 kudos

Another approach you might consider is creating a template notebook that queries a known date range via widgets. For example, two date widgets: start time and end time. From there you could use Databricks Jobs to update these parameters for each run.
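A minimal sketch of that template-notebook pattern, assuming a date-partitioned Delta target; the table names and date column are hypothetical:

```python
# Two date widgets that a Databricks Job can override on each run.
dbutils.widgets.text("start_date", "2021-01-01")
dbutils.widgets.text("end_date", "2021-01-07")

start_date = dbutils.widgets.get("start_date")
end_date = dbutils.widgets.get("end_date")

# Backfill only the requested range; replaceWhere overwrites just those
# partitions of the target Delta table, leaving the rest untouched.
(spark.read.table("source_events")                 # hypothetical source table
    .where(f"event_date >= '{start_date}' AND event_date < '{end_date}'")
    .write.format("delta")
    .mode("overwrite")
    .option("replaceWhere",
            f"event_date >= '{start_date}' AND event_date < '{end_date}'")
    .saveAsTable("target_events"))                 # hypothetical target table
```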

User16790091296
by Contributor II
  • 425 Views
  • 0 replies
  • 5 kudos

Some Tips & Tricks for Optimizing costs and performance (Clusters and Ganglia)

Some Tips & Tricks for Optimizing costs and performance (Clusters and Ganglia): [Note: This list is not exhaustive] Leverage the DataFrame or SparkSQL APIs first. They use the same execution process, resulting in parity in performance, but they also com...
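A quick way to see the parity claim in the first tip for yourself: build the same query with SQL and with the DataFrame API and compare the plans. The events table name here is hypothetical:

```python
# Both queries compile to the same optimized physical plan, which you can
# confirm by comparing the explain() output. "events" is a hypothetical table.
sql_df = spark.sql("SELECT country, count(*) AS n FROM events GROUP BY country")
api_df = spark.table("events").groupBy("country").count().withColumnRenamed("count", "n")

sql_df.explain()   # prints the physical plan for the SQL version
api_df.explain()   # prints an equivalent plan for the DataFrame version
```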

Anonymous
by Not applicable
  • 1793 Views
  • 1 reply
  • 0 kudos

Resolved! Delta vs parquet

When does it make sense to use Delta over Parquet? Are there any instances when you would rather use Parquet?

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

Users should almost always choose Delta over Parquet. Keep in mind that Delta is a storage format that sits on top of Parquet, so the performance of writing to both formats is similar. However, reading and transforming data with Delta is almost a...
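To make the comparison concrete, a small sketch: the write path differs only in the format string, while Delta adds a transaction log and features like time travel on top of the Parquet files. The paths are placeholders:

```python
# Writing the same data as plain Parquet and as Delta; only the format differs.
df = spark.range(1000).withColumnRenamed("id", "value")

df.write.format("parquet").mode("overwrite").save("/tmp/demo_parquet")
df.write.format("delta").mode("overwrite").save("/tmp/demo_delta")

# Reads look the same, but the Delta path also supports time travel:
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_delta").count()
```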

Anonymous
by Not applicable
  • 6444 Views
  • 1 reply
  • 0 kudos

What is an action in Spark?

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

An action in Spark is any operation that does not return an RDD. Evaluation happens when an action is taken. Actions trigger the scheduler, which builds a directed acyclic graph (DAG) as a plan of execution. The plan of execution is created by wor...
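A minimal illustration of the lazy-evaluation point: transformations only extend the plan, and the action is what triggers execution:

```python
# Transformations are lazy; nothing executes until an action is called.
df = spark.range(10_000_000)             # no job runs yet
doubled = df.selectExpr("id * 2 AS v")   # still lazy: just extends the plan

# count() is an action: it triggers the scheduler, which builds the DAG
# and executes the plan across the cluster.
print(doubled.count())
```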

Anonymous
by Not applicable
  • 605 Views
  • 1 reply
  • 0 kudos

Resolved! Converting between Pandas to Koalas

When and why should I convert between a pandas and a Koalas DataFrame? What are the implications?

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

Koalas is distributed on a Databricks cluster, similar to how Spark DataFrames are distributed. pandas DataFrames live in memory only on the Spark driver. If you are a pandas user and are using a multi-node cluster, then you should use Koalas to p...
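A small sketch of the conversion in both directions, using the Koalas API (databricks.koalas):

```python
# from_pandas() distributes the data across the cluster; to_pandas() collects
# it back to the driver, so it only makes sense for small results.
import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"x": range(10)})   # lives in driver memory only
kdf = ks.from_pandas(pdf)              # distributed, pandas-like API
print(kdf.x.mean())                    # computed on the cluster
small_pdf = kdf.to_pandas()            # back to a plain pandas DataFrame
```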

Anonymous
by Not applicable
  • 560 Views
  • 0 replies
  • 0 kudos

Append subset of columns to target Snowflake table

I’m using the databricks-snowflake connector to load data into a Snowflake table. Can someone point me to an example of how we can append only a subset of columns to a target Snowflake table (for example, some columns in the target Snowflake table ar...
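Since this question has no replies yet, here is one hedged sketch of an approach: select only the columns you want to append and write with the Spark-Snowflake connector. Columns omitted from the DataFrame would need to be nullable or have defaults in Snowflake. The connection values, column names, and table name below are placeholders:

```python
# Hypothetical sketch: append only a subset of columns to a Snowflake table.
# Omitted target columns must be nullable or have DEFAULT values in Snowflake.
sf_options = {
    "sfUrl": "<account>.snowflakecomputing.com",   # placeholder credentials
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<db>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

(df.select("id", "name")                 # only the columns to append
   .write.format("snowflake")
   .options(**sf_options)
   .option("dbtable", "target_table")    # hypothetical target table
   .mode("append")
   .save())
```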

Anonymous
by Not applicable
  • 523 Views
  • 0 replies
  • 0 kudos

Detailed logs for R process

We have a user notebook in R that reliably crashes the driver. Are detailed logs from the R process stored somewhere on drivers/workers?

User16790091296
by Contributor II
  • 1618 Views
  • 1 reply
  • 0 kudos

Resolved! How can I use a Python function defined in my git-repo module within the DB notebook?

I have a function within a module in my git repo. I want to import it into my Databricks notebook - how can I do that?

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Databricks Repos allows you to sync your work in Databricks with a remote Git repository. This makes it easier to implement development best practices. Databricks supports integrations with GitHub, Bitbucket, and GitLab. Using Repos you can bring you...
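A minimal sketch of the import itself, assuming the repo has been added under /Workspace/Repos; the path, module, and function names are hypothetical:

```python
# Make a module from a Databricks Repo importable in a notebook.
# The repo path, module, and function names are placeholders.
import sys
sys.path.append("/Workspace/Repos/<user>/<repo>")  # add the repo root to the path

from my_module import my_function  # defined in my_module.py at the repo root
my_function()
```

Note that notebooks running inside the repo itself typically have the repo root on sys.path already; the append is for notebooks that live elsewhere in the workspace.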
