Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

vida
by Contributor II
  • 10444 Views
  • 8 replies
  • 0 kudos

My Spark SQL join is very slow - what can I do to speed it up?

It's taking 10-12 minutes - can I make it faster?

Latest Reply
vida
Contributor II
  • 0 kudos

ANALYZE is not needed for Parquet tables that use the Databricks Parquet package. That package is now the default when you use .saveAsTable(), but if you use a different output format, ANALYZE may not work yet.

7 More Replies
t_ras
by New Contributor
  • 5878 Views
  • 1 replies
  • 0 kudos

java.lang.OutOfMemoryError: GC overhead limit exceeded

I get java.lang.OutOfMemoryError: GC overhead limit exceeded when running a count action on a file. The file is a 217 GB CSV. I'm using 10 r3.8xlarge (Ubuntu) machines with CDH 5.3.6 and Spark 1.2.0. Configuration: spark.app.id:local-1443956477103 s...

Latest Reply
miklos
Contributor
  • 0 kudos

Looks like the following property is pretty high, which consumes a lot of memory on your executors when you cache the dataset. "spark.storage.memoryFraction:0.9" This could likely be solved by changing the configuration. Take a look at the upstream...
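
To make the effect of that setting concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the legacy (pre-1.6) Spark memory model, in which the cache is sized as heap × spark.storage.memoryFraction × spark.storage.safetyFraction, with defaults of 0.6 and 0.9. The 8 GiB executor heap and the helper name are hypothetical, since the original post's full configuration is truncated.

```python
def storage_memory_bytes(heap_bytes, memory_fraction=0.6, safety_fraction=0.9):
    """Approximate bytes set aside for cached blocks under the legacy memory model.

    Assumption: cache size = heap * spark.storage.memoryFraction
                                  * spark.storage.safetyFraction
    """
    return int(heap_bytes * memory_fraction * safety_fraction)


GIB = 1024 ** 3
heap = 8 * GIB  # hypothetical executor heap; not taken from the original post

# Compare the poster's setting (0.9) with the default (0.6).
for fraction in (0.9, 0.6):
    cache = storage_memory_bytes(heap, memory_fraction=fraction)
    remaining = heap - cache
    print(
        f"spark.storage.memoryFraction={fraction}: "
        f"cache ~{cache / GIB:.2f} GiB, "
        f"left for everything else ~{remaining / GIB:.2f} GiB"
    )
```

Under these assumptions, a higher fraction leaves less heap for task execution and temporary objects, which is one plausible way to end up with heavy garbage-collection pressure.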

Gabriela_DeQuer
by New Contributor
  • 7364 Views
  • 1 replies
  • 0 kudos
Latest Reply
rlgarris
New Contributor III
  • 0 kudos

There is no hardcoded limit; we just call pandas.DataFrame.from_records with a collection of fields to instantiate a new pandas DataFrame. The only limit is memory. See http://stackoverflow.com/questions/15455722/pandas-is-there-a-max-size-max-no-of-columns-max-r...

cfregly
by Contributor
  • 10082 Views
  • 1 replies
  • 0 kudos
Latest Reply
cfregly
Contributor
  • 0 kudos

Sorted Data: If your data is sorted using either sort() or ORDER BY, these operations will be deterministic and return either the 1st element using first()/head() or the top-n using head(n)/take(n). show()/show(n) return Unit (void) and will print up to...

__Databricks_Su
by Contributor
  • 9020 Views
  • 1 replies
  • 1 kudos
Latest Reply
__Databricks_Su
Contributor
  • 1 kudos

Hover between two cells, midway across the page, and a + sign will appear. This is how you insert cells in the middle of a notebook. You can also move cells by hovering in the upper left of each cell. A cross-hairs will...
