Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by SohelKhan, New Contributor II
  • 18947 Views
  • 5 replies
  • 0 kudos

Resolved! Pyspark DataFrame: Converting one column from string to float/double

Pyspark 1.6 DataFrame: converting one column from string to float/double. I have two columns in a DataFrame, both of which are loaded as strings. DF = rawdata.select('house name', 'price') I want to convert DF.price to float. DF = rawdata.select('hous...

Latest Reply
AidanCondron
New Contributor II
  • 0 kudos

Slightly simpler:
df_num = df.select(df.employment.cast("float"), df.education.cast("float"), df.health.cast("float"))
This works with multiple columns; three are shown here.

4 More Replies
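
A minimal PySpark sketch of the accepted approach, shown self-contained (the column names come from the question; the sample rows are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    DF = spark.createDataFrame(
        [("house a", "120000.5"), ("house b", "98000.0")],
        ["house name", "price"],
    )
    # cast() parses the strings; unparseable values become null instead of raising
    DF = DF.withColumn("price", DF["price"].cast("float"))
    DF.printSchema()  # price: float
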
by richard1_558848, New Contributor II
  • 9361 Views
  • 3 replies
  • 0 kudos

How to set the size of Parquet output files?

Hi, I'm using Parquet format to store raw data. The part files are stored on S3. I would like to control the file size of each Parquet part file. I tried this: sqlContext.setConf("spark.parquet.block.size", SIZE.toString) sqlContext.setCon...

Latest Reply
manjeet_chandho
New Contributor II
  • 0 kudos

Hi all, can anyone tell me the default row group size when writing via Spark SQL?

2 More Replies
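
On the default: Parquet's row group size (parquet.block.size) defaults to 128 MB. A hedged sketch of two levers on Spark 2.x+ (the question used the older sqlContext API; the S3 path is a placeholder):

    # fewer, larger files: control the partition count before writing
    (df.repartition(8)
       .write
       .option("parquet.block.size", 128 * 1024 * 1024)  # row group size, in bytes
       .parquet("s3://bucket/raw-data/"))

    # alternatively, cap the number of records per output file (Spark 2.2+)
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)
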
by dshosseinyousef, New Contributor II
  • 10118 Views
  • 2 replies
  • 0 kudos

How to calculate a quantile on grouped data in a Spark DataFrame

I have the following Spark DataFrame (agent_id / payment_amount): a/1000, b/1100, a/1100, a/1200, b/1200, b/1250, a/10000, b/9000. My desired output would be something like agent_id, 95_quantile: a, whatever the 95th quantile is for a...

Latest Reply
Weiluo__David_R
New Contributor II
  • 0 kudos

For those of you who haven't run into this SO thread http://stackoverflow.com/questions/39633614/calculate-quantile-on-grouped-data-in-spark-dataframe, it's pointed out there that one workaround is to use the Hive UDF percentile_approx. Please see th...

1 More Replies
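
A sketch of that workaround through the DataFrame API, using the column names from the question; F.expr() calls percentile_approx without dropping to a full SQL string:

    from pyspark.sql import functions as F

    result = df.groupBy("agent_id").agg(
        F.expr("percentile_approx(payment_amount, 0.95)").alias("95_quantile")
    )
    result.show()
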
by dshosseinyousef, New Contributor II
  • 7548 Views
  • 2 replies
  • 0 kudos

How to extract the year and week number from a column in a Spark DataFrame?

I have the following Spark DataFrame (sale_id / created_at): 1/2016-05-28T05:53:31.042Z, 2/2016-05-30T12:50:58.184Z, 3/2016-05-23T10:22:18.858Z, 4/2016-05-27T09:20:15.158Z, 5/2016-05-21T08:30:17.337Z, 6/2016-05-28T07:41:14.361Z. I need to add a year-wee...

Latest Reply
theodondre
New Contributor II
  • 0 kudos

This is how the documentation looks.

1 More Replies
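
A sketch using the built-in year() and weekofyear() functions, assuming the created_at strings from the question are parseable as timestamps:

    from pyspark.sql import functions as F

    df = df.withColumn("ts", F.to_timestamp("created_at"))
    df = df.withColumn("year_week",
                       F.concat_ws("-", F.year("ts"), F.weekofyear("ts")))
    # e.g. 2016-05-28T05:53:31.042Z -> "2016-21"
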
by ChristianKeller, New Contributor II
  • 16684 Views
  • 6 replies
  • 0 kudos

Two stage join fails with java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

Sometimes the error is part of "org.apache.spark.SparkException: Exception thrown in awaitResult:". The error source is the step where we extract, for the second time, the rows where the data is updated. We can count the rows, but we cannot display or w...

Latest Reply
activescott
New Contributor III
  • 0 kudos

Thanks Lleido. I eventually found that I had inadvertently changed the schema of a partitioned DataFrame, narrowing a column's type from a long to an integer. While a rather obvious cause of the problem in hindsight, it was terribly di...

5 More Replies
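
This dictionary error typically surfaces when files in the same Parquet dataset disagree on a column's physical type (INT64 in older files, INT32 in newer ones, as in the reply above). A hedged repair sketch: load the offending files on their own, widen the column, and rewrite them; the paths and the column name are placeholders:

    from pyspark.sql import functions as F

    # read the narrower (int) files separately, before any merge or join
    bad = spark.read.parquet("/data/events/part=2016-06")
    fixed = bad.withColumn("user_id", F.col("user_id").cast("long"))
    fixed.write.mode("overwrite").parquet("/data/events_fixed/part=2016-06")
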
by FrancisLau, New Contributor
  • 5169 Views
  • 2 replies
  • 0 kudos

Resolved! agg function not working for multiple aggregations

Data has 2 columns: |requestDate|requestDuration| | 2015-06-17| 104| Here is the code: avgSaveTimesByDate = gridSaves.groupBy(gridSaves.requestDate).agg({"requestDuration": "min", "requestDuration": "max","requestDuration": "avg"}) avgSaveTimesBy...

Latest Reply
ReKa
New Contributor III
  • 0 kudos

My guess is that the reason this doesn't work is that the dictionary input does not have unique keys: with this syntax, column names are the keys, and a Python dict keeps only the last entry per key, so if you have two or more aggregations for the same column, some internal loops may forget the no...

1 More Replies
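
The dict form cannot express this because a Python dict keeps only the last of several identical keys. A sketch of the list-of-columns form, which has no such limit (names taken from the question):

    from pyspark.sql import functions as F

    avgSaveTimesByDate = gridSaves.groupBy("requestDate").agg(
        F.min("requestDuration").alias("min_duration"),
        F.max("requestDuration").alias("max_duration"),
        F.avg("requestDuration").alias("avg_duration"),
    )
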
by Jean-FrancoisRa, New Contributor
  • 4949 Views
  • 2 replies
  • 0 kudos

Resolved! Select dataframe columns from a sequence of string

Is there a simple way to select columns from a dataframe with a sequence of string? Something like val colNames = Seq("c1", "c2") df.select(colNames)

Latest Reply
vEdwardpc
New Contributor II
  • 0 kudos

Thanks. I needed to modify the final lines. val df_new = df.select(column_names_col:_*) df_new.show() Edward

1 More Replies
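
The Scala fix above splats a Seq of columns into select's varargs. For comparison, a PySpark sketch of the same idea:

    col_names = ["c1", "c2"]
    df.select(*col_names)   # unpack the list into varargs
    df.select(col_names)    # PySpark's select also accepts a list directly
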
by dheeraj, New Contributor II
  • 6552 Views
  • 3 replies
  • 0 kudos

How to calculate the percentile of a column in a DataFrame in Spark?

I am trying to calculate the percentile of a column in a DataFrame. I can't find any percentile_approx function among Spark's aggregation functions. E.g. in Hive we have percentile_approx and we can use it in the following way: hiveContext.sql("select per...

Latest Reply
amandaphy
New Contributor II
  • 0 kudos

You can try using df.registerTempTable("tmp_tbl"), then val newDF = sql("/* do something with tmp_tbl */") and continue using newDF.

2 More Replies
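
Since Spark 2.0 the DataFrame API exposes this directly as approxQuantile, and percentile_approx is a native Spark SQL function (no HiveContext needed). A sketch; the column name is a placeholder:

    # relative error 0.01; pass 0.0 for the exact (but expensive) quantile
    p95 = df.approxQuantile("duration", [0.95], 0.01)[0]

    # or the SQL route from the question
    df.createOrReplaceTempView("tmp_tbl")
    spark.sql("SELECT percentile_approx(duration, 0.95) FROM tmp_tbl").show()
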
by cfregly, Contributor
  • 7062 Views
  • 3 replies
  • 0 kudos
Latest Reply
easimadi
New Contributor II
  • 0 kudos

Hello, please help (not an answer): how do I download a complete CSV (>1000 rows) result file from FileStore onto my laptop? I was trying to follow this instruction set: SQL tutorial (Download All SQL - Scala).

2 More Replies
by Mallesh, New Contributor
  • 11960 Views
  • 1 replies
  • 0 kudos

How can I read a Parquet file compressed with Snappy?

Hi all, I wanted to read a Parquet file compressed with Snappy into a Spark RDD. The input file name is part-m-00000.snappy.parquet. I have used sqlContext.setConf("spark.sql.parquet.compression.codec.", "snappy") val inputRDD=sqlContext.parqetFile(args(0)) whe...

Latest Reply
raela
Databricks Employee
  • 0 kudos

Have you tried sqlContext.read.parquet("/filePath/")?

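
Two details worth noting in the original snippet: parqetFile is a misspelling of parquetFile, and the conf key has a stray trailing period. Neither setting is needed for reading, because the codec is recorded in each Parquet file's footer. A sketch:

    # compression is detected from the file metadata; just read the file
    df = spark.read.parquet("/path/part-m-00000.snappy.parquet")

    # the codec setting only matters when writing
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
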
by longcao, New Contributor III
  • 19063 Views
  • 5 replies
  • 0 kudos

Resolved! Writing DataFrame to PostgreSQL via JDBC extremely slow (Spark 1.6.1)

Hi there, I'm just getting started with Spark and I've got a moderately sized DataFrame created from collating CSVs in S3 (88 columns, 860k rows) that seems to be taking an unreasonable amount of time to insert (using SaveMode.Append) into Postgres. I...

Latest Reply
longcao
New Contributor III
  • 0 kudos

In case anyone was curious how I worked around this, I ended up dropping down to Postgres JDBC and using CopyManager to COPY rows in directly from Spark: https://gist.github.com/longcao/bb61f1798ccbbfa4a0d7b76e49982f84

4 More Replies
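
The gist above drops down to PostgreSQL's COPY protocol, which is the fastest route. If you stay on plain JDBC, batching is the usual lever; a hedged sketch (URL, table, and credentials are placeholders, and reWriteBatchedInserts is a PostgreSQL JDBC driver option):

    (df.write
       .format("jdbc")
       .option("url", "jdbc:postgresql://host:5432/mydb?reWriteBatchedInserts=true")
       .option("dbtable", "public.my_table")
       .option("user", "username")
       .option("password", "password")
       .option("batchsize", 10000)  # rows per JDBC batch; the default is 1000
       .mode("append")
       .save())
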
by UmeshKacha, New Contributor II
  • 12315 Views
  • 3 replies
  • 0 kudos

How to avoid empty/null keys in DataFrame groupby?

Hi, I have a Spark job which does a group by, and I can't avoid it because of my use case. I have a large dataset, around 1 TB, which I need to process/update in a DataFrame. Now my job shuffles huge amounts of data and slows down because of the shuffling and groupby. One r...

Latest Reply
silvio
New Contributor II
  • 0 kudos

Hi Umesh, if you want to completely ignore the null/empty values then you could simply filter before you do the groupBy, but are you wanting to keep those values? If you want to keep the null values and avoid the skew, you could try splitting the DataF...

2 More Replies
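
A sketch of the filter-first suggestion; the key column name is a placeholder:

    from pyspark.sql import functions as F

    # drop null/empty keys before the shuffle so they never reach the groupBy
    non_null = df.filter(F.col("key").isNotNull() & (F.col("key") != ""))
    counts = non_null.groupBy("key").count()
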
by johnmcauley, New Contributor II
  • 14065 Views
  • 2 replies
  • 0 kudos

How do I escape a query string in Spark SQL?

Hey all, I am trying to filter on a string, but the string has a single quote. How do I escape the string in Scala? I have tried an old version of StringEscapeUtils but no luck. Sorry if it's a silly question; I'm new to Scala. import org.apache.commons.lan...

Latest Reply
antoniosarco
New Contributor II
  • 0 kudos

Generally, when you deal with an apostrophe you replace the single quote (') with two (''). More about handling single quotes. Antonio

1 More Replies
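
A sketch of both routes in PySpark: doubling the quote inside a SQL string, or using the Column API, which needs no escaping at all (table and column names are placeholders):

    from pyspark.sql import functions as F

    name = "O'Brien"
    # SQL string: double every embedded single quote
    spark.sql("SELECT * FROM people WHERE name = '%s'" % name.replace("'", "''"))

    # Column API: no escaping needed
    df.filter(F.col("name") == name)
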
by MarcLimotte, New Contributor II
  • 28148 Views
  • 12 replies
  • 0 kudos

Why do I get 'java.io.IOException: File already exists' for saveAsTable with Overwrite mode?

I have a fairly small, simple DataFrame, month: month.schema org.apache.spark.sql.types.StructType = StructType(StructField(month,DateType,true), StructField(real_month,TimestampType,true), StructField(month_millis,LongType,true)). The month DataFrame i...

Latest Reply
ReKa
New Contributor III
  • 0 kudos

Your schema is tight, but make sure that the conversion to it does not throw an exception. Try with memory-optimized nodes; you may be fine. My problem was parsing a lot of data from sequence files containing 10K XML files and saving them as a table...

11 More Replies
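
The "File already exists" error is often secondary: a task throws (for example, during a conversion like the one described above), Spark retries it, and the retry collides with the partial output of the first attempt. Fix the underlying exception first; disabling speculative execution is also commonly suggested, since duplicate speculative attempts can race on the same output file. A sketch (the table name is a placeholder):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.speculation", "false")  # avoid duplicate task attempts
             .getOrCreate())
    month.write.mode("overwrite").saveAsTable("month_tbl")
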
