Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

washim
by New Contributor III
  • 10138 Views
  • 1 replies
  • 0 kudos
Latest Reply
washim
New Contributor III
  • 0 kudos

Got it. Use:
features = dataset.map(lambda row: row[0:])
from pyspark.mllib.stat import Statistics
corr_mat = Statistics.corr(features, method="pearson")
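
For readers wondering what the matrix returned by Statistics.corr(features, method="pearson") contains, here is a minimal plain-Python sketch of a Pearson correlation matrix. It is not the MLlib implementation. The helper names pearson and corr_matrix and the sample rows are made up for illustration, and the sketch assumes no column is constant (a constant column would cause a division by zero).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Assumes neither column is constant; otherwise this divides by zero.
    return cov / sqrt(var_x * var_y)


def corr_matrix(rows):
    """Column-by-column Pearson correlation matrix for a list of numeric rows."""
    columns = list(zip(*rows))  # transpose: rows -> columns
    return [[pearson(a, b) for b in columns] for a in columns]


# Hypothetical sample data: column 1 is twice column 0,
# and column 2 falls as column 0 rises.
rows = [(1.0, 2.0, 9.0), (2.0, 4.0, 7.0), (3.0, 6.0, 5.0), (4.0, 8.0, 3.0)]
matrix = corr_matrix(rows)
```

Entry matrix[i][j] is the correlation between columns i and j, so the diagonal is 1 and the matrix is symmetric.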
lau_thiamkok
by New Contributor II
  • 13465 Views
  • 5 replies
  • 0 kudos

Spark + Python - Java gateway process exited before sending the driver its port number?

Why do I get this error on my browser screen, <type 'exceptions.Exception'>: Java gateway process exited before sending the driver its port number args = ('Java gateway process exited before sending the driver its port number',) message = 'Java gat...

Latest Reply
EricaLi
New Contributor II
  • 0 kudos

I'm facing the same problem. Does anybody know how to connect Spark in an IPython notebook? I created an issue for it: https://github.com/jupyter/notebook/issues/743

4 More Replies
Anonymous
by Not applicable
  • 11426 Views
  • 2 replies
  • 1 kudos

How can I use display() in a python notebook with pyspark.sql.Row Objects, e.g. after calling the first() operation on a DataFrame?

I'm trying to display() the results from calling first() on a DataFrame, but display() doesn't work with pyspark.sql.Row objects. How can I display this result?

Latest Reply
dnchari
New Contributor II
  • 1 kudos

Use take()

1 More Replies
vida
by Contributor II
  • 11589 Views
  • 8 replies
  • 0 kudos

My Spark SQL join is very slow - what can I do to speed it up?

It's taking 10-12 minutes - can I make it faster?

Latest Reply
vida
Contributor II
  • 0 kudos

ANALYZE is not needed for Parquet tables that use the Databricks Parquet package, which is now the default when you use .saveAsTable(). If you use a different output format, ANALYZE may not work yet.

7 More Replies
t_ras
by New Contributor
  • 6170 Views
  • 1 replies
  • 0 kudos

java.lang.OutOfMemoryError: GC overhead limit exceeded

I get java.lang.OutOfMemoryError: GC overhead limit exceeded when running a count action on a file. The file is a 217 GB CSV file. I'm using 10 r3.8xlarge (Ubuntu) machines with CDH 5.3.6 and Spark 1.2.0. Configuration: spark.app.id:local-1443956477103 s...

Latest Reply
miklos
Contributor
  • 0 kudos

It looks like the following property is set quite high, which consumes a lot of memory on your executors when you cache the dataset: "spark.storage.memoryFraction:0.9". This could likely be solved by changing the configuration. Take a look at the upstream...
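
To make the effect of that setting concrete, here is a rough arithmetic sketch. It assumes the legacy static memory model of Spark 1.x and what I understand to be its defaults (spark.storage.safetyFraction = 0.9, spark.shuffle.memoryFraction = 0.2, spark.shuffle.safetyFraction = 0.8, and a default spark.storage.memoryFraction of 0.6). The function name heap_split is made up for illustration, and the poster's actual settings beyond memoryFraction are not known.

```python
# Assumed defaults of the legacy Spark 1.x memory model (not taken from the post).
STORAGE_SAFETY_FRACTION = 0.9   # spark.storage.safetyFraction
SHUFFLE_MEMORY_FRACTION = 0.2   # spark.shuffle.memoryFraction
SHUFFLE_SAFETY_FRACTION = 0.8   # spark.shuffle.safetyFraction


def heap_split(storage_memory_fraction):
    """Approximate share of executor heap per region for a given
    spark.storage.memoryFraction, under the assumptions above."""
    storage = storage_memory_fraction * STORAGE_SAFETY_FRACTION
    shuffle = SHUFFLE_MEMORY_FRACTION * SHUFFLE_SAFETY_FRACTION
    other = 1.0 - storage - shuffle  # headroom for everything else
    return {"storage": storage, "shuffle": shuffle, "other": other}


configured = heap_split(0.9)  # value quoted in the question
default = heap_split(0.6)     # assumed Spark 1.x default

for name, split in (("memoryFraction=0.9", configured), ("memoryFraction=0.6", default)):
    print(name, {region: round(share, 2) for region, share in split.items()})
```

Under these assumptions, 0.9 reserves about 81% of the heap for cached blocks and leaves only about 3% of headroom outside the storage and shuffle regions, compared with about 30% at 0.6.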

Gabriela_DeQuer
by New Contributor
  • 8003 Views
  • 1 replies
  • 0 kudos
Latest Reply
rlgarris
Contributor
  • 0 kudos

There is no hardcoded limit. We just call pandas.DataFrame.from_records with a collection of fields to instantiate a new pandas DataFrame, so the only limit is memory. See http://stackoverflow.com/questions/15455722/pandas-is-there-a-max-size-max-no-of-columns-max-r...

cfregly
by Contributor
  • 10595 Views
  • 1 replies
  • 0 kudos
Latest Reply
cfregly
Contributor
  • 0 kudos

Sorted Data: If your data is sorted using either sort() or ORDER BY, these operations will be deterministic and return either the 1st element using first()/head() or the top-n using head(n)/take(n). show()/show(n) return Unit (void) and will print up to...

__Databricks_Su
by Contributor
  • 9478 Views
  • 1 replies
  • 1 kudos
Latest Reply
__Databricks_Su
Contributor
  • 1 kudos

Hover between the cells in the side-to-side middle and you will see a + sign appear. This is how you can insert cells into the top-to-bottom middle of a notebook. You can also move cells by hovering in the upper left of each cell. A cross-hairs will...

