Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

cfregly
by Contributor
  • 4736 Views
  • 3 replies
  • 0 kudos
Latest Reply
easimadi
New Contributor II
  • 0 kudos

Hello, please help (not an answer): how do I download the complete CSV (>1000) result file in FileStore onto my laptop? I was trying to follow this instruction set: SQL tutorial (Download All SQL - scala)

2 More Replies
Mallesh
by New Contributor
  • 10074 Views
  • 1 replies
  • 0 kudos

How can I read a Parquet file compressed with Snappy?

Hi all, I want to read a Parquet file compressed with Snappy into a Spark RDD. The input file name is part-m-00000.snappy.parquet. I have used sqlContext.setConf("spark.sql.parquet.compression.codec.", "snappy") val inputRDD=sqlContext.parqetFile(args(0)) whe...

Latest Reply
raela
New Contributor III
  • 0 kudos

Have you tried sqlContext.read.parquet("/filePath/") ?

longcao
by New Contributor III
  • 12512 Views
  • 5 replies
  • 0 kudos

Resolved! Writing DataFrame to PostgreSQL via JDBC extremely slow (Spark 1.6.1)

Hi there, I'm just getting started with Spark and I've got a moderately sized DataFrame created from collating CSVs in S3 (88 columns, 860k rows) that seems to be taking an unreasonable amount of time to insert (using SaveMode.Append) into Postgres. I...

Latest Reply
longcao
New Contributor III
  • 0 kudos

In case anyone was curious how I worked around this, I ended up dropping down to Postgres JDBC and using CopyManager to COPY rows in directly from Spark: https://gist.github.com/longcao/bb61f1798ccbbfa4a0d7b76e49982f84
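
For readers wondering what the COPY route looks like in outline, here is a minimal sketch in Python rather than the Scala of the linked gist. It covers only the serialization half: turning rows into an in-memory CSV buffer that a COPY ... FROM STDIN call can consume. The function name rows_to_copy_buffer and the table name in the closing comment are hypothetical, and the hand-off to Postgres (psycopg2's cursor.copy_expert) appears only as a comment because it needs a live database.

```python
import csv
import io


def rows_to_copy_buffer(rows):
    """Serialize an iterable of row tuples into a CSV text buffer.

    None is written as an empty, unquoted field, which Postgres reads as
    NULL in CSV mode by default. Empty strings are written the same way,
    so the two are not distinguished in this sketch.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    buf.seek(0)  # rewind so a consumer reads from the start
    return buf


rows = [(1, "plain", 3.5), (2, "has,comma", None)]
buf = rows_to_copy_buffer(rows)
print(buf.getvalue())

# With a live connection (assumption: psycopg2 and a table named "events"),
# the buffer would be streamed in a single COPY statement, for example:
#   cursor.copy_expert("COPY events FROM STDIN WITH (FORMAT csv)", buf)
```

In a Spark job this kind of serialization would presumably run once per partition, so that each executor streams its own rows instead of issuing one INSERT per row.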

4 More Replies
UmeshKacha
by New Contributor II
  • 8958 Views
  • 3 replies
  • 0 kudos

How to avoid empty/null keys in DataFrame groupby?

Hi, I have a Spark job that does a group by, and I can't avoid it because of my use case. I have a large dataset, around 1 TB, which I need to process/update in a DataFrame. My job shuffles a huge amount of data, and the shuffling and groupBy slow things down. One r...

Latest Reply
silvio
New Contributor II
  • 0 kudos

Hi Umesh, if you want to completely ignore the null/empty values then you could simply filter before you do the groupBy, but do you want to keep those values? If you want to keep the null values and avoid the skew, you could try splitting the DataF...
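
The first suggestion, dropping null/empty keys before grouping, can be sketched without Spark. The plain-Python stand-in below uses hypothetical names (count_by_key and the sample records); on a DataFrame the equivalent step would be a filter on the key column ahead of the groupBy.

```python
from collections import Counter


def count_by_key(records, key):
    """Count records per key, skipping rows whose key is None or blank."""
    counts = Counter()
    for rec in records:
        value = rec.get(key)
        if value is None or str(value).strip() == "":
            continue  # the null/empty bucket never reaches the grouping
        counts[value] += 1
    return dict(counts)


records = [
    {"user": "a"}, {"user": None}, {"user": ""}, {"user": "a"}, {"user": "b"},
]
print(count_by_key(records, "user"))
```

Filtering first means the oversized null/empty group is never shuffled at all, which is why it only fits the case where those rows can be discarded.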

2 More Replies
johnmcauley
by New Contributor II
  • 10014 Views
  • 2 replies
  • 0 kudos

How do I escape a query string in Spark SQL?

Hey all, I am trying to filter on a string but the string has a single quote. How do I escape the string in Scala? I have tried an old version of StringEscapeUtils but no luck. Sorry if this is a silly question, I'm new to Scala. import org.apache.commons.lan...

Latest Reply
antoniosarco
New Contributor II
  • 0 kudos

Generally, when you deal with an apostrophe, you replace the single quote (') with two single quotes (''). More about: handling single quotes. Antonio
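
As a small illustration of the quote-doubling convention, here is a sketch in Python using the standard-library sqlite3 module as a stand-in SQL engine (an assumption made so the example runs without Spark). The helper name quote_sql_literal is hypothetical. Spark SQL's own literal syntax also documents a backslash escape (\'), so it is worth checking which form your version accepts.

```python
import sqlite3


def quote_sql_literal(value):
    """Wrap a string in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.execute("INSERT INTO people VALUES (?)", ("O'Brien",))  # bound parameter

# Escaped literal spliced into the query text.
query = "SELECT name FROM people WHERE name = " + quote_sql_literal("O'Brien")
escaped_rows = conn.execute(query).fetchall()

# Same lookup with a bound parameter, which avoids manual escaping.
bound_rows = conn.execute(
    "SELECT name FROM people WHERE name = ?", ("O'Brien",)
).fetchall()

print(escaped_rows == bound_rows)
```

Where the client API offers bound parameters, they are usually the safer choice, since the value never has to be spliced into the query text.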

1 More Replies
MarcLimotte
by New Contributor II
  • 22893 Views
  • 12 replies
  • 0 kudos

Why do I get 'java.io.IOException: File already exists' for saveAsTable with Overwrite mode?

I have a fairly small, simple DataFrame, month: month.schema org.apache.spark.sql.types.StructType = StructType(StructField(month,DateType,true), StructField(real_month,TimestampType,true), StructField(month_millis,LongType,true)) The month DataFrame i...

Latest Reply
ReKa
New Contributor III
  • 0 kudos

Your schema is tight, but make sure that the conversion to it does not throw an exception. Try with Memory Optimized Nodes, you may be fine. My problem was parsing a lot of data from sequence files containing 10K xml files and saving them as a table...

11 More Replies
RobertWalsh
by New Contributor II
  • 18235 Views
  • 11 replies
  • 0 kudos

Dataframe Write Append to Parquet Table - Partition Issue

Hello, I am attempting to append new JSON files into an existing Parquet table defined in Databricks. Using a dataset defined by this command (dataframe initially added to a temp table): val output = sql("select headers.event_name, to_date(from_unix...

Latest Reply
anil_s_langote
New Contributor II
  • 0 kudos

We came across a similar situation. We are using Spark 1.6.1 and have a daily load process that pulls data from Oracle and writes it as Parquet files. This works fine for 18 days of data (through the 18th run); the problem comes after the 19th run, where the data frame l...

10 More Replies
jpalbeza
by New Contributor II
  • 6949 Views
  • 3 replies
  • 0 kudos

Resolved! How to see the textbox input from getArgument() or dbutils.widgets.text() or dbutils.widgets.dropdown()

getArgument() has been deprecated. I don't see the text box for me to type in any input anymore. What I actually see though is the following error: Deprecation warning: Use dbutils.widgets.text() or dbutils.widgets.dropdown() to create a widget and...

Latest Reply
RyanJohnson
New Contributor II
  • 0 kudos

So shouldn't it be removed from the tutorial notebook showing how to connect to S3? I'm trying to connect to S3 for the first time and a deprecation warning isn't a pleasant first experience with a tool I am paying for.

2 More Replies
Sri1
by New Contributor II
  • 10202 Views
  • 5 replies
  • 0 kudos

Create an in-memory table in Spark and insert data into it

Hi, my requirement is that I need to create a Spark in-memory table (not pushing a Hive table into memory), insert data into it, and finally write it back to a Hive table. The idea here is to avoid the disk IO while writing into the target Hive table. There are a lot ...

Latest Reply
vida
Contributor II
  • 0 kudos

Got it - how about using a UnionAll? I believe this code snippet does what you'd want: from pyspark.sql import Row array = [Row(value=1), Row(value=2), Row(value=3)] df = sqlContext.createDataFrame(sc.parallelize(array)) array2 = [Row(value=4), Ro...

4 More Replies
dan11
by New Contributor II
  • 3837 Views
  • 1 replies
  • 1 kudos

SQL: how to convert the datatype of a column?

Bricklayers, I want to port this sql statement from sqlite to databricks: select cast(myage as number) as my_integer_age from ages; Does databricks allow me to do something like this?

Latest Reply
raela
New Contributor III
  • 1 kudos

@dan11 We don't support number in Spark SQL. Try using int, double, float, and your query should be fine. To run SQL in a notebook, just prepend any cell with %sql. %sql select cast(myage as double) as my_integer_age from ages;

washim
by New Contributor III
  • 10177 Views
  • 1 replies
  • 0 kudos
Latest Reply
washim
New Contributor III
  • 0 kudos

Got it. Use: features = dataset.map(lambda row: row[0:]) from pyspark.mllib.stat import Statistics corr_mat=Statistics.corr(features, method="pearson")
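
To make clear what a Pearson correlation matrix over row features contains, here is a plain-Python sketch that computes the same kind of column-by-column matrix on a tiny in-memory dataset. It is a stand-in for illustration, not the MLlib implementation; the function names pearson and correlation_matrix and the sample rows are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (a zero variance would divide by zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(rows):
    """Column-by-column Pearson correlation matrix for a list of rows."""
    columns = list(zip(*rows))  # transpose rows into columns
    return [[pearson(a, b) for b in columns] for a in columns]


rows = [(1.0, 2.0, 3.0), (2.0, 4.0, 2.0), (3.0, 6.0, 1.0)]
for line in correlation_matrix(rows):
    print([round(v, 6) for v in line])
```

Each entry (i, j) is the correlation between column i and column j, so the diagonal is 1 and the matrix is symmetric.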

lau_thiamkok
by New Contributor II
  • 13550 Views
  • 5 replies
  • 0 kudos

Spark + Python - Java gateway process exited before sending the driver its port number?

Why do I get this error on my browser screen, <type 'exceptions.Exception'>: Java gateway process exited before sending the driver its port number args = ('Java gateway process exited before sending the driver its port number',) message = 'Java gat...

Latest Reply
EricaLi
New Contributor II
  • 0 kudos

I'm facing the same problem. Does anybody know how to connect Spark in an IPython notebook? The issue I created: https://github.com/jupyter/notebook/issues/743

4 More Replies
Anonymous
by Not applicable
  • 11467 Views
  • 2 replies
  • 1 kudos

How can I use display() in a python notebook with pyspark.sql.Row Objects, e.g. after calling the first() operation on a DataFrame?

I'm trying to display() the results from calling first() on a DataFrame, but display() doesn't work with pyspark.sql.Row objects. How can I display this result?

Latest Reply
dnchari
New Contributor II
  • 1 kudos

Use take()

1 More Replies
vida
by Contributor II
  • 11709 Views
  • 8 replies
  • 0 kudos

My Spark SQL join is very slow - what can I do to speed it up?

It's taking 10-12 minutes - can I make it faster?

Latest Reply
vida
Contributor II
  • 0 kudos

Analyze is not needed with parquet tables that use the databricks parquet package. That is the default now when you use .saveAsTable(), but if you use a different output format - it's possible that analyze may not work yet.

7 More Replies
t_ras
by New Contributor
  • 6190 Views
  • 1 replies
  • 0 kudos

java.lang.OutOfMemoryError: GC overhead limit exceeded

I get java.lang.OutOfMemoryError: GC overhead limit exceeded when trying a count action on a file. The file is a CSV file, 217 GB in size. I'm using 10 r3.8xlarge (Ubuntu) machines, CDH 5.3.6, and Spark 1.2.0. Configuration: spark.app.id:local-1443956477103 s...

Latest Reply
miklos
Contributor
  • 0 kudos

Looks like the following property is pretty high, which consumes a lot of memory on your executors when you cache the dataset. "spark.storage.memoryFraction:0.9" This could likely be solved by changing the configuration. Take a look at the upstream...

