Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

prachicsa
by New Contributor
  • 2233 Views
  • 3 replies
  • 0 kudos

Filtering records for all values of an array in Spark

I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all of these token values. I tried the following way: va...

Latest Reply
__max
New Contributor III
  • 0 kudos

Actually, the intersection transformation does deduplication. If you don't need it, you can just slightly modify your code: val filteredRdd = rddAll.filter(line => line.contains(token)) and send the data of the RDD to your program by calling an act...

2 More Replies
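
A minimal sketch of the approach in that reply, generalized from a single token to the whole token list with exists; the input path is hypothetical and sc is assumed to be an existing SparkContext:

val listofECtokens: Array[String] = Array("EC-17A5206955089011B", "EC-17A5206955089011A")
val rddAll = sc.textFile("/path/to/input")  // hypothetical input path

// Keep every line that contains at least one of the tokens.
val filteredRdd = rddAll.filter(line => listofECtokens.exists(token => line.contains(token)))

// filter is lazy; an action such as collect() actually materializes the result.
filteredRdd.collect().foreach(println)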
NarwshKumar
by New Contributor
  • 6173 Views
  • 3 replies
  • 0 kudos

calculate median and inter quartile range on spark dataframe

I have a Spark DataFrame of 5 columns and I want to calculate the median and interquartile range on all of them. I am not able to figure out how to write a UDF and call it on the columns.

Latest Reply
jmwilli25
New Contributor II
  • 0 kudos

Here is the easiest way to calculate this... https://stackoverflow.com/questions/37032689/scala-first-quartile-third-quartile-and-iqr-from-spark-sqlcontext-dataframe No Hive or windowing necessary.

2 More Replies
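
For a DataFrame route without UDFs, Spark 2.0+ offers approxQuantile; a minimal sketch for one numeric column (the column name "price" is illustrative, and relativeError = 0.0 requests exact quantiles at extra cost):

// Returns the 25th, 50th and 75th percentiles of the column in one pass.
val Array(q1, median, q3) = df.stat.approxQuantile("price", Array(0.25, 0.5, 0.75), 0.0)
val iqr = q3 - q1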
pmezentsev
by New Contributor
  • 4638 Views
  • 1 replies
  • 0 kudos

What is the difference between createTempView, createGlobalTempView and registerTempTable

Hi, friends! I have a question about the difference between these three functions: dataframe.createTempView, dataframe.createGlobalTempView, and dataframe.registerTempTable. All of them create intermediate tables. How do I decide which one to choose in c...

Latest Reply
KeshavP
New Contributor II
  • 0 kudos

From my understanding, createTempView (or more appropriately createOrReplaceTempView) was introduced in Spark 2.0 to replace registerTempTable, which was deprecated in 2.0. createTempView creates an in-memory reference to the DataFrame in ...

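
A short sketch of the scoping difference described in that reply (Spark 2.x; the view names are illustrative):

// Session-scoped: visible only in the current SparkSession, dropped when it ends.
df.createOrReplaceTempView("my_view")
spark.sql("SELECT * FROM my_view")

// Application-scoped: lives in the reserved global_temp database and is visible
// to every SparkSession in the same application.
df.createOrReplaceGlobalTempView("my_global_view")
spark.sql("SELECT * FROM global_temp.my_global_view")

// registerTempTable is the deprecated pre-2.0 name for createOrReplaceTempView.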
WenLin
by New Contributor II
  • 6761 Views
  • 3 replies
  • 0 kudos

data.write.format('com.databricks.spark.csv') added additional quotation marks

I am using the following code (PySpark) to export my data frame to csv: data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath') Note that I use d...

Latest Reply
chaotic3quilibr
New Contributor III
  • 0 kudos

To turn off the default escaping of the double quote character (") with the backslash character (\) - i.e. to avoid escaping entirely for all characters - you must add an .option() method call with just the right parameters after the .write() ...

2 More Replies
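
One commonly cited form of that .option() call is to replace the quote character with the null character so the writer never emits or escapes quotes; this is a sketch only, so verify the behavior against your spark-csv/Spark version:

data.write
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .option("quote", "\u0000")  // null character: effectively disables quoting
  .save("s3a://myBucket/myPath")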
supriya
by New Contributor II
  • 11648 Views
  • 12 replies
  • 0 kudos

How to append new column values to a dataframe based on unique IDs

I need to create a new column with data in a dataframe. Example: val test = sqlContext.createDataFrame(Seq( (4L, "spark i j k"), (5L, "l m n"), (6L, "mapreduce spark"), (7L, "apache hadoop"), (11L, "a b c d e spark"), (12L, "b d"), (13L, "spark f g h"), ...

Latest Reply
raela
Databricks Employee
  • 0 kudos

@supriya you will have to do a join. import org.apache.spark.sql.functions._ val joined = test.join(tuples, col("id") === col("tupleid"), "inner").select("id", "text", "average")

11 More Replies
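
A self-contained sketch of the suggested join, with a hypothetical tuples DataFrame holding one average value per id:

import org.apache.spark.sql.functions.col

val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n")
)).toDF("id", "text")

// Hypothetical lookup table keyed by the same ids.
val tuples = sqlContext.createDataFrame(Seq(
  (4L, 0.5),
  (5L, 0.7)
)).toDF("tupleid", "average")

// Inner join on the id, keeping the original columns plus the new one.
val joined = test.join(tuples, col("id") === col("tupleid"), "inner")
  .select("id", "text", "average")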
SohelKhan
by New Contributor II
  • 15667 Views
  • 5 replies
  • 0 kudos

Resolved! Pyspark DataFrame: Converting one column from string to float/double

Pyspark 1.6: DataFrame: Converting one column from string to float/double. I have two columns in a dataframe, both of which are loaded as strings. DF = rawdata.select('house name', 'price') I want to convert DF.price to float. DF = rawdata.select('hous...

Latest Reply
AidanCondron
New Contributor II
  • 0 kudos

Slightly simpler: df_num = df.select(df.employment.cast("float"), df.education.cast("float"), df.health.cast("float")) This works with multiple columns, three shown here.

4 More Replies
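
The same cast pattern written out in Scala (the PySpark form is analogous); "price" is the string column from the question:

import org.apache.spark.sql.functions.col

// Cast the string column to double; values that don't parse become null instead of failing.
val DF = rawdata.select(col("house name"), col("price").cast("double").as("price"))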
richard1_558848
by New Contributor II
  • 7032 Views
  • 3 replies
  • 0 kudos

How to set the size of Parquet output files?

Hi, I'm using the Parquet format to store raw data. The part files are stored on S3. I would like to control the file size of each Parquet part file. I tried this: sqlContext.setConf("spark.parquet.block.size", SIZE.toString) sqlContext.setCon...

Latest Reply
manjeet_chandho
New Contributor II
  • 0 kudos

Hi all, can anyone tell me what the default row group size is when writing via Spark SQL?

2 More Replies
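
Two knobs that are usually relevant here, sketched for Spark 2.x: the Parquet row group size is a Hadoop configuration (parquet.block.size, 128 MB by default in parquet-mr), while the number and size of part files is driven mostly by the number of partitions at write time. The path and partition count below are illustrative:

// Row group size in bytes for the Parquet writer.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

// One part file per partition, so repartition controls per-file size most directly.
df.repartition(10).write.parquet("s3a://myBucket/rawData")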
dshosseinyousef
by New Contributor II
  • 8652 Views
  • 2 replies
  • 0 kudos

How to calculate quantiles on grouped data in a Spark DataFrame

I have the following Spark DataFrame (agent_id / payment_amount): a/1000, b/1100, a/1100, a/1200, b/1200, b/1250, a/10000, b/9000. My desired output would be something like (agent_id / 95_quantile): a / whatever the 95th quantile is for a...

Latest Reply
Weiluo__David_R
New Contributor II
  • 0 kudos

For those of you who haven't run into this SO thread http://stackoverflow.com/questions/39633614/calculate-quantile-on-grouped-data-in-spark-dataframe, it's pointed out there that one work-around is to use HIVE UDF "percentile_approx". Please see th...

1 More Replies
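
A minimal sketch of that percentile_approx work-around for the question's two columns (Spark 2.x syntax; the view name is illustrative):

df.createOrReplaceTempView("payments")

// percentile_approx is available in Spark SQL (via Hive); 0.95 requests the 95th percentile.
val quantiles = spark.sql("""
  SELECT agent_id, percentile_approx(payment_amount, 0.95) AS p95
  FROM payments
  GROUP BY agent_id
""")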
dshosseinyousef
by New Contributor II
  • 6082 Views
  • 2 replies
  • 0 kudos

How to extract the year and week number from a column in a Spark DataFrame?

I have the following Spark DataFrame (sale_id / created_at): 1 / 2016-05-28T05:53:31.042Z, 2 / 2016-05-30T12:50:58.184Z, 3 / 2016-05-23T10:22:18.858Z, 4 / 2016-05-27T09:20:15.158Z, 5 / 2016-05-21T08:30:17.337Z, 6 / 2016-05-28T07:41:14.361Z. I need to add a year-wee...

Latest Reply
theodondre
New Contributor II
  • 0 kudos

This is how the documentation looks.

1 More Replies
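
A sketch using the built-in year and weekofyear functions, assuming created_at can be cast to a timestamp:

import org.apache.spark.sql.functions.{col, concat_ws, weekofyear, year}

val withYearWeek = df
  .withColumn("ts", col("created_at").cast("timestamp"))
  .withColumn("year_week", concat_ws("-", year(col("ts")), weekofyear(col("ts"))))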
ChristianKeller
by New Contributor II
  • 14512 Views
  • 6 replies
  • 0 kudos

Two stage join fails with java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

Sometimes the error is wrapped in "org.apache.spark.SparkException: Exception thrown in awaitResult:". The error source is the step where we extract, for the second time, the rows where the data is updated. We can count the rows, but we cannot display or w...

Latest Reply
activescott
New Contributor III
  • 0 kudos

Thanks Lleido. I eventually found that I had inadvertently changed the schema of a partitioned DataFrame, narrowing a column's type from a long to an integer. While the cause of the problem was rather obvious in hindsight, it was terribly di...

5 More Replies
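
A small illustration of the kind of fix that diagnosis implies: cast the column back to its original type before appending, so every partition of the Parquet dataset keeps the same schema. The DataFrame name, column name, and path below are hypothetical:

import org.apache.spark.sql.functions.col

// Mixing long and integer for the same column across partitions of one Parquet
// dataset can surface as PlainLongDictionary/UnsupportedOperationException at read time.
val consistent = updatedRows.withColumn("event_id", col("event_id").cast("long"))
consistent.write.mode("append").parquet("s3a://myBucket/events")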
FrancisLau
by New Contributor
  • 3494 Views
  • 2 replies
  • 0 kudos

Resolved! agg function not working for multiple aggregations

The data has 2 columns: |requestDate|requestDuration| |2015-06-17|104| Here is the code: avgSaveTimesByDate = gridSaves.groupBy(gridSaves.requestDate).agg({"requestDuration": "min", "requestDuration": "max", "requestDuration": "avg"}) avgSaveTimesBy...

Latest Reply
ReKa
New Contributor III
  • 0 kudos

My guess is that the reason this does not work is that the dictionary input does not have unique keys. With this syntax, column names are keys, and if you have two or more aggregations for the same column, some internal loops may forget the no...

1 More Replies
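
The same point in code: a dict can hold only one value per key, so two of the three aggregations are silently dropped; passing explicit aggregation expressions avoids the collision (Scala shown, pyspark.sql.functions works the same way):

import org.apache.spark.sql.functions.{avg, max, min}

val avgSaveTimesByDate = gridSaves
  .groupBy("requestDate")
  .agg(
    min("requestDuration").as("minDuration"),
    max("requestDuration").as("maxDuration"),
    avg("requestDuration").as("avgDuration")
  )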
Jean-FrancoisRa
by New Contributor
  • 3799 Views
  • 2 replies
  • 0 kudos

Resolved! Select dataframe columns from a sequence of string

Is there a simple way to select columns from a dataframe with a sequence of strings? Something like val colNames = Seq("c1", "c2") df.select(colNames)

Latest Reply
vEdwardpc
New Contributor II
  • 0 kudos

Thanks. I needed to modify the final lines. val df_new = df.select(column_names_col:_*) df_new.show() Edward

1 More Replies
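
The complete pattern that reply refers to: map the strings to Column objects and splat them into select:

import org.apache.spark.sql.functions.col

val colNames = Seq("c1", "c2")

// select(cols: Column*) accepts the expanded sequence via : _*
val df_new = df.select(colNames.map(col): _*)
df_new.show()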
dheeraj
by New Contributor II
  • 5289 Views
  • 3 replies
  • 0 kudos

How to calculate Percentile of column in a DataFrame in spark?

I am trying to calculate the percentile of a column in a DataFrame. I can't find any percentile_approx function among Spark's aggregation functions. For example, in Hive we have percentile_approx and we can use it in the following way: hiveContext.sql("select per...

Latest Reply
amandaphy
New Contributor II
  • 0 kudos

You can try using df.registerTempTable("tmp_tbl") and then val newDF = sql(/* do something with tmp_tbl */), and continue using newDF.

2 More Replies
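
Filled out, that temp-table route looks roughly like this; the column name "nums" is illustrative, and on Spark 2.0+ df.stat.approxQuantile is a DataFrame-native alternative:

df.registerTempTable("tmp_tbl")

// percentile_approx needs Hive support on older Spark versions; it is a native
// Spark SQL function from Spark 2.1 onward.
val newDF = hiveContext.sql("SELECT percentile_approx(nums, 0.95) AS p95 FROM tmp_tbl")
newDF.show()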
