Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

kelleyrw
by New Contributor II
  • 12032 Views
  • 7 replies
  • 0 kudos

Resolved! How do I register a UDF that returns an array of tuples in scala/spark?

I'm relatively new to Scala. In the past, I was able to do the following in Python: def foo(p1, p2): import datetime as dt dt.datetime(2014, 4, 17, 12, 34) result = [ (1, "1", 1.1, dt.datetime(2014, 4, 17, 1, 0)), (2, "2", 2...

Latest Reply
__max
New Contributor III
  • 0 kudos

Hello, Just in case, here is an example for proposed solution above: import org.apache.spark.sql.functions._ import org.apache.spark.sql.expressions._ import org.apache.spark.sql.types._ val data = Seq(("A", Seq((3,4),(5,6),(7,10))), ("B", Seq((-1,...

  • 0 kudos
6 More Replies
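For reference, a fuller sketch of the approach outlined in the reply above, assuming Spark 2.x on a Databricks cluster where a SparkSession named spark is available; the column names and the pairUdf helper are illustrative only, not part of the original thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// A Scala function returning Seq[(Int, Int)] maps to an array<struct<_1:int,_2:int>> column.
val pairUdf = udf((n: Int) => Seq.tabulate(n)(i => (i + 1, (i + 1) * (i + 1))))

val df = Seq(("A", 2), ("B", 3)).toDF("key", "n")
df.select($"key", pairUdf($"n").as("pairs")).show(false)

// To call the same function from SQL, register it on the session instead:
spark.udf.register("pair_udf", (n: Int) => Seq.tabulate(n)(i => (i + 1, (i + 1) * (i + 1))))

The element type comes out as a struct with fields _1 and _2; returning a case class instead of a tuple gives the struct named fields.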
samalexg
by New Contributor III
  • 19780 Views
  • 13 replies
  • 1 kudos

How to add environment variable

Instead of setting the AWS access key and secret key in hadoopConfiguration, I would like to set them in the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. How can I do that in Databricks?

Latest Reply
jric
New Contributor II
  • 1 kudos

It is possible! I was able to confirm that the following post's "Best" answer works: https://forums.databricks.com/questions/11116/how-to-set-an-environment-variable.html. FYI for @Miklos Christine and @Mike Trewartha.

  • 1 kudos
12 More Replies
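As the linked answer describes, such environment variables are normally defined per cluster in the cluster configuration rather than in code. A minimal sketch for checking them from a notebook and handing them to the Hadoop configuration, assuming a Databricks notebook where sc exists and the S3A filesystem is in use (the fs.s3a.* keys are an assumption, not something stated in the thread):

// Read the cluster-level environment variables and pass them to Hadoop.
val accessKey = sys.env.getOrElse("AWS_ACCESS_KEY_ID", sys.error("AWS_ACCESS_KEY_ID is not set"))
val secretKey = sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", sys.error("AWS_SECRET_ACCESS_KEY is not set"))

sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)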
KiranRastogi
by New Contributor
  • 40444 Views
  • 2 replies
  • 2 kudos

Pandas dataframe to a table

I want to write a pandas DataFrame to a table. How can I do this? The write command is not working; please help.

Latest Reply
amy_wang
New Contributor II
  • 2 kudos

Hey Kiran, Just taking a stab in the dark but do you want to convert the Pandas DataFrame to a Spark DataFrame and then write out the Spark DataFrame as a non-temporary SQL table? import pandas as pd ## Create Pandas Frame pd_df = pd.DataFrame({u'20...

  • 2 kudos
1 More Replies
letsflykite
by New Contributor II
  • 18826 Views
  • 2 replies
  • 1 kudos

How to increase spark.kryoserializer.buffer.max

when I join two dataframes, I got the following error. org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1 Serialization trace: values (org.apache.spark.sql.catalyst.expressions.GenericRow) otherEle...

Latest Reply
Jose_Maria_Tala
New Contributor II
  • 1 kudos

val conf = new SparkConf() ... conf.set("spark.kryoserializer.buffer.max.mb", "512") ...

  • 1 kudos
1 More Replies
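For context, spark.kryoserializer.buffer.max.mb is the legacy name of this setting; newer Spark versions use spark.kryoserializer.buffer.max with a size suffix, capped at 2g. A sketch of both ways to set it (on Databricks it usually belongs in the cluster's Spark config rather than in notebook code, since the SparkContext already exists when a notebook attaches):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "512m")   // newer key; hard ceiling is 2g

// Equivalent cluster-level Spark config lines:
// spark.serializer org.apache.spark.serializer.KryoSerializer
// spark.kryoserializer.buffer.max 512m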
cfregly
by Contributor
  • 6155 Views
  • 4 replies
  • 0 kudos
Latest Reply
TianziCai
New Contributor II
  • 0 kudos

sample = (spark.read .format("com.databricks.spark.redshift") .option("url", jdbcUrl) .option("dbtable", "xx.xxx") # schema, table .option("forward_spark_s3_credentials", True) .option("tempdir", tem...

  • 0 kudos
3 More Replies
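The reply above is truncated; a hedged Scala version of the same read, using the option names of the com.databricks.spark.redshift data source shown in the reply, with placeholder connection values (jdbcUrl, the tempdir bucket, and the table name are hypothetical):

val jdbcUrl = "jdbc:redshift://example-host:5439/dev?user=USER&password=PASS"  // placeholder
val tempDir = "s3a://my-bucket/redshift-temp/"                                 // placeholder staging area

val sample = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("dbtable", "my_schema.my_table")           // schema.table to read
  .option("forward_spark_s3_credentials", "true")    // reuse the cluster's S3 credentials for UNLOAD
  .option("tempdir", tempDir)
  .load()

sample.show(5)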
prachicsa
by New Contributor
  • 2729 Views
  • 3 replies
  • 0 kudos

Filtering records for all values of an array in Spark

I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all of these token values. I tried the following way: va...

Latest Reply
__max
New Contributor III
  • 0 kudos

Actually, the intersection transformation does deduplication. If you don't need it, you can just slightly modify your code: val filteredRdd = rddAll.filter(line => line.contains(token)) and send data of the rdd to your program by calling of an act...

  • 0 kudos
2 More Replies
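Putting the suggestion above together, a minimal sketch (the input path is a placeholder) that keeps every line containing at least one of the tokens and then triggers the computation with an action:

val listofECtokens: Array[String] = Array("EC-17A5206955089011B", "EC-17A5206955089011A")

val rddAll = sc.textFile("/path/to/input")   // placeholder path
val filteredRdd = rddAll.filter(line => listofECtokens.exists(token => line.contains(token)))

// Nothing runs until an action is called:
filteredRdd.take(10).foreach(println)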
NarwshKumar
by New Contributor
  • 6957 Views
  • 3 replies
  • 0 kudos

calculate median and interquartile range on a spark dataframe

I have a Spark dataframe of 5 columns and I want to calculate the median and interquartile range on all of them. I am not able to figure out how to write a UDF and call it on the columns.

Latest Reply
jmwilli25
New Contributor II
  • 0 kudos

Here is the easiest way to calculate this... https://stackoverflow.com/questions/37032689/scala-first-quartile-third-quartile-and-iqr-from-spark-sqlcontext-dataframe No Hive or windowing necessary.

  • 0 kudos
2 More Replies
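An alternative that needs neither Hive nor a UDF is DataFrame.stat.approxQuantile (Spark 2.0+). A sketch assuming df is the questioner's dataframe; the five column names are placeholders:

val cols = Seq("c1", "c2", "c3", "c4", "c5")   // placeholder column names

val stats = cols.map { c =>
  // 0.01 relative error is a common trade-off between accuracy and cost.
  val Array(q1, median, q3) = df.stat.approxQuantile(c, Array(0.25, 0.5, 0.75), 0.01)
  (c, median, q3 - q1)                         // (column, median, interquartile range)
}

stats.foreach { case (c, m, iqr) => println(s"$c -> median=$m, IQR=$iqr") }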
pmezentsev
by New Contributor
  • 5351 Views
  • 1 reply
  • 2 kudos

What is the difference between createTempView, createGlobalTempView and registerTempTable

Hi, friends! I have a question about the difference between these three functions: dataframe.createTempView, dataframe.createGlobalTempView, dataframe.registerTempTable. All of them create intermediate tables. How to decide which I have to choose in c...

Latest Reply
KeshavP
New Contributor II
  • 2 kudos

From my understanding, createTempView (or more appropriately createOrReplaceTempView) has been introduced in Spark 2.0 to replace registerTempTable, which has been deprecated in 2.0. CreateTempView creates an in memory reference to the Dataframe in ...

  • 2 kudos
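A short sketch of the three calls side by side (df stands for any dataframe):

// Session-scoped: visible only to this SparkSession, dropped when the session ends.
df.createOrReplaceTempView("my_temp_view")
spark.sql("SELECT COUNT(*) FROM my_temp_view").show()

// Application-scoped: shared across SparkSessions, always qualified with global_temp.
df.createOrReplaceGlobalTempView("my_global_view")
spark.sql("SELECT COUNT(*) FROM global_temp.my_global_view").show()

// registerTempTable is the pre-2.0 name for the session-scoped variant and is
// deprecated in favour of createOrReplaceTempView.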
WenLin
by New Contributor II
  • 7468 Views
  • 3 replies
  • 0 kudos

data.write.format('com.databricks.spark.csv') added additional quotation marks

I am using the following code (pyspark) to export my data frame to csv: data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath') Note that I use d...

Latest Reply
chaotic3quilibr
New Contributor III
  • 0 kudos

The way to turn off the default escaping of the double quote character (") with the backslash character (\) - i.e. to avoid escaping for all characters entirely - is to add an .option() method call with just the right parameters after the .write() ...

  • 0 kudos
2 More Replies
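For reference, with the CSV writer built into Spark 2.x+ (the standalone com.databricks.spark.csv package exposes similar options) the quoting behaviour the reply refers to can be relaxed roughly like this; data is the questioner's dataframe, and which option applies depends on the Spark version:

data.write
  .option("sep", "\t")                 // tab-delimited, as in the question
  .option("compression", "gzip")
  .option("escapeQuotes", "false")     // don't force-quote values that contain a quote character
  // .option("quote", "\u0000")        // more drastic: effectively disables quoting altogether
  .csv("s3a://myBucket/myPath")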
supriya
by New Contributor II
  • 13263 Views
  • 12 replies
  • 0 kudos

How to append new column values in a dataframe based on unique IDs

I need to create new column with data in dataframe. Example:val test = sqlContext.createDataFrame(Seq( (4L, "spark i j k"), (5L, "l m n"), (6L, "mapreduce spark"), (7L, "apache hadoop"), (11L, "a b c d e spark"), (12L, "b d"), (13L, "spark f g h"), ...

Latest Reply
raela
Databricks Employee
  • 0 kudos

@supriya you will have to do a join. import org.apache.spark.sql.functions._ val joined = test.join(tuples, col("id") === col("tupleid"), "inner").select("id", "text", "average")

  • 0 kudos
11 More Replies
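Expanding the reply into a runnable sketch: test is the dataframe from the question (assumed to have columns id and text), and tuples is a hypothetical dataframe holding the new values keyed by tupleid; spark.implicits is assumed available as in a Databricks notebook:

import org.apache.spark.sql.functions.col
import spark.implicits._

val tuples = Seq((4L, 0.5), (5L, 0.7), (6L, 0.9)).toDF("tupleid", "average")  // illustrative values

val joined = test
  .join(tuples, col("id") === col("tupleid"), "inner")
  .select("id", "text", "average")

joined.show()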
SohelKhan
by New Contributor II
  • 17535 Views
  • 5 replies
  • 0 kudos

Resolved! Pyspark DataFrame: Converting one column from string to float/double

Pyspark 1.6: DataFrame: Converting one column from string to float/double I have two columns in a dataframe both of which are loaded as string. DF = rawdata.select('house name', 'price') I want to convert DF.price to float. DF = rawdata.select('hous...

Latest Reply
AidanCondron
New Contributor II
  • 0 kudos

Slightly simpler: df_num = df.select(df.employment.cast("float"), df.education.cast("float"), df.health.cast("float")) This works with multiple columns, three shown here.

  • 0 kudos
4 More Replies
richard1_558848
by New Contributor II
  • 7938 Views
  • 3 replies
  • 0 kudos

How to set the size of Parquet output files?

Hi, I'm using Parquet as the format to store raw data. The part files are stored on S3. I would like to control the file size of each Parquet part file. I tried this: sqlContext.setConf("spark.parquet.block.size", SIZE.toString) sqlContext.setCon...

Latest Reply
manjeet_chandho
New Contributor II
  • 0 kudos

Hi all, can anyone tell me what the default row group size is when writing via Spark SQL?

  • 0 kudos
2 More Replies
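Two settings that are commonly combined to control Parquet output size, sketched below; df and the output path are placeholders. parquet.block.size controls the row group size inside each file (its default is 128 MB), while the maxRecordsPerFile write option (Spark 2.2+) bounds how many rows go into each part file:

// Row group size inside each Parquet file.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

// Number of output files, plus an upper bound on rows per file.
df.repartition(8)
  .write
  .option("maxRecordsPerFile", "1000000")
  .parquet("s3a://my-bucket/raw-data/")   // placeholder path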
dshosseinyousef
by New Contributor II
  • 9429 Views
  • 2 replies
  • 0 kudos

How to calculate a quantile on grouped data in a Spark dataframe

I have the following Spark dataframe: agent_id / payment_amount: a/1000, b/1100, a/1100, a/1200, b/1200, b/1250, a/10000, b/9000. My desired output would be something like agent_id / 95_quantile: a / whatever the 95th quantile is for a...

Latest Reply
Weiluo__David_R
New Contributor II
  • 0 kudos

For those of you who haven't run into this SO thread http://stackoverflow.com/questions/39633614/calculate-quantile-on-grouped-data-in-spark-dataframe, it's pointed out there that one work-around is to use HIVE UDF "percentile_approx". Please see th...

  • 0 kudos
1 More Replies
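The workaround the reply describes, sketched in Scala; df is the questioner's dataframe with columns agent_id and payment_amount, and percentile_approx must be resolvable by the SQL engine (it is the Hive UDAF the reply refers to, also available as a built-in in recent Spark versions):

import org.apache.spark.sql.functions.expr

val quantiles = df
  .groupBy("agent_id")
  .agg(expr("percentile_approx(payment_amount, 0.95)").as("95_quantile"))

quantiles.show()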
