Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
I'm relatively new to Scala. In the past, I was able to do the following in Python:
def foo(p1, p2):
    import datetime as dt
    dt.datetime(2014, 4, 17, 12, 34)
    result = [
        (1, "1", 1.1, dt.datetime(2014, 4, 17, 1, 0)),
        (2, "2", 2...
Hello,
Just in case, here is an example of the proposed solution above:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
val data = Seq(("A", Seq((3,4),(5,6),(7,10))), ("B", Seq((-1,...
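Relating this back to the original question, here is a minimal sketch of how a DataFrame of tuples with timestamps can be built in Scala. It assumes a SparkSession named spark is in scope (as in a Databricks notebook); the column names are illustrative.
import java.sql.Timestamp
import spark.implicits._
// Build rows as plain Scala tuples; java.sql.Timestamp plays the role of Python's datetime.
val rows = Seq(
  (1, "1", 1.1, Timestamp.valueOf("2014-04-17 01:00:00")),
  (2, "2", 2.2, Timestamp.valueOf("2014-04-17 12:34:00"))
)
// toDF turns the local collection into a DataFrame with the given column names.
val df = rows.toDF("id", "label", "value", "ts")
df.printSchema()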
Instead of setting the AWS accessKey and secretKey in hadoopConfiguration, I would like to set them as the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
How can I do that in Databricks?
It is possible! I was able to confirm that the following post's "Best" answer works: https://forums.databricks.com/questions/11116/how-to-set-an-environment-variable.html
FYI for @Miklos Christine and @Mike Trewartha
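As a quick sanity check once the variables are configured on the cluster (for example via the approach in the linked post), you can read them back from a Scala notebook. This sketch only verifies visibility on the driver and assumes nothing beyond the standard library.
// Check that the environment variables are visible to the driver JVM.
val accessKeySet = sys.env.contains("AWS_ACCESS_KEY_ID")
val secretKeySet = sys.env.contains("AWS_SECRET_ACCESS_KEY")
println(s"AWS_ACCESS_KEY_ID set: $accessKeySet, AWS_SECRET_ACCESS_KEY set: $secretKeySet")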
Hey Kiran,
Just taking a stab in the dark, but do you want to convert the Pandas DataFrame to a Spark DataFrame and then write out the Spark DataFrame as a non-temporary SQL table?
import pandas as pd
## Create Pandas Frame
pd_df = pd.DataFrame({u'20...
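The pandas conversion itself is Python-specific, but the second half of the suggestion, writing the resulting Spark DataFrame out as a permanent (non-temporary) table, looks like this in Scala. The DataFrame name and table name are illustrative.
// Persist a Spark DataFrame as a permanent table in the metastore.
// `df` stands for the DataFrame produced by the conversion step above.
df.write.mode("overwrite").saveAsTable("my_permanent_table")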
When I join two DataFrames, I get the following error.
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1 Serialization trace: values (org.apache.spark.sql.catalyst.expressions.GenericRow) otherEle...
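A common remedy for this particular Kryo error is to raise the serializer's maximum buffer size. The sketch below shows the relevant settings; the values are illustrative, and because they must be in place before the SparkContext starts, on Databricks they are normally entered in the cluster's Spark config rather than in notebook code.
import org.apache.spark.SparkConf
// Increase the maximum Kryo buffer (default 64m) so large rows can be serialized.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "1m")
  .set("spark.kryoserializer.buffer.max", "512m")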
I am very new to Spark.
I have a very basic question. I have an array of values:
listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A)
I want to filter an RDD for all of these token values. I tried the following:
va...
Actually, the intersection transformation does deduplication. If you don't need it, you can just slightly modify your code:
val filteredRdd = rddAll.filter(line => line.contains(token))
and send the data of the RDD to your program by calling an act...
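To filter on every token in the array in a single pass, a minimal sketch along these lines should work (the names reuse those from the question; the sample data is made up):
// Keep every line that contains at least one of the tokens.
val listofECtokens: Array[String] = Array("EC-17A5206955089011B", "EC-17A5206955089011A")
val rddAll = sc.parallelize(Seq(
  "order EC-17A5206955089011B shipped",
  "order EC-0000000000000000X pending"
))
val filteredRdd = rddAll.filter(line => listofECtokens.exists(token => line.contains(token)))
filteredRdd.collect().foreach(println)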
I have a Spark DataFrame with 5 columns and I want to calculate the median and interquartile range for all of them. I am not able to figure out how to write a UDF and call it on the columns.
Here is the easiest way to calculate this... https://stackoverflow.com/questions/37032689/scala-first-quartile-third-quartile-and-iqr-from-spark-sqlcontext-dataframe
No Hive or windowing necessary.
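For anyone on Spark 2.x, a minimal sketch along the lines of that answer uses DataFrame.stat.approxQuantile; df and the column name are placeholders, and a relative error of 0.0 requests exact quantiles at the cost of more computation. Repeat (or map over the column names) for the remaining columns.
// Compute Q1, median and Q3 for one column, then derive the IQR.
val Array(q1, median, q3) = df.stat.approxQuantile("price", Array(0.25, 0.5, 0.75), 0.0)
val iqr = q3 - q1
println(s"median=$median, IQR=$iqr")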
Hi, friends!
I have a question about the difference between these three functions:
dataframe.createTempView, dataframe.createGlobalTempView, dataframe.registerTempTable
All of them create intermediate tables.
How do I decide which one I have to choose in c...
From my understanding, createTempView (or more appropriately createOrReplaceTempView) was introduced in Spark 2.0 to replace registerTempTable, which was deprecated in 2.0. createTempView creates an in-memory reference to the DataFrame in ...
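A short sketch of the three calls side by side (Spark 2.x, assuming a SparkSession named spark; df is any DataFrame and the view names are illustrative):
// Session-scoped view; createOrReplaceTempView overwrites an existing view of the same name.
df.createOrReplaceTempView("my_view")
spark.sql("SELECT COUNT(*) FROM my_view").show()
// Application-scoped view, shared across sessions; it lives in the global_temp database.
df.createGlobalTempView("my_global_view")
spark.sql("SELECT COUNT(*) FROM global_temp.my_global_view").show()
// Deprecated since 2.0; kept only for backwards compatibility with old code.
df.registerTempTable("my_old_view")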
I am using the following code (PySpark) to export my DataFrame to CSV:
data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath')
Note that I use d...
To turn off the default escaping of the double-quote character (") with the backslash character (\), i.e. to avoid escaping for all characters entirely, you must add an .option() method call with just the right parameters after the .write() ...
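The full answer is truncated above, so the exact parameters are not reproduced here. Purely as an illustration of where .option() calls sit in the write chain, here are the documented CSV writer options for the quote and escape characters (shown in Scala; the PySpark chain is analogous, and the path comes from the question):
data.write
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .option("quote", "\"")   // quote character, default "
  .option("escape", "\\")  // escape character, default \
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save("s3a://myBucket/myPath")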
I need to create a new column with data in a DataFrame.
Example:
val test = sqlContext.createDataFrame(Seq( (4L, "spark i j k"), (5L, "l m n"), (6L, "mapreduce spark"), (7L, "apache hadoop"), (11L, "a b c d e spark"), (12L, "b d"), (13L, "spark f g h"), ...
@supriya
you will have to do a join.
import org.apache.spark.sql.functions._
val joined = test.join(tuples, col("id") === col("tupleid"), "inner").select("id", "text", "average")
PySpark 1.6: DataFrame: Converting one column from string to float/double
I have two columns in a DataFrame, both of which are loaded as strings.
DF = rawdata.select('house name', 'price')
I want to convert DF.price to float.
DF = rawdata.select('hous...
Slightly simpler:
df_num = df.select(df.employment.cast("float"), df.education.cast("float"), df.health.cast("float"))
This works with multiple columns, three shown here.
Hi
I'm using the Parquet format to store raw data. The part files are stored on S3.
I would like to control the file size of each Parquet part file.
I tried this:
sqlContext.setConf("spark.parquet.block.size", SIZE.toString)
sqlContext.setCon...
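As a hedged sketch (not necessarily the setting the truncated snippet was reaching for), two levers commonly used to influence Parquet output file size are the number of partitions at write time and the Parquet row-group size on the Hadoop configuration. Here sc is the SparkContext available in the notebook, and the path and numbers are illustrative.
// Row-group size inside each file: parquet.block.size is a Parquet/Hadoop property in bytes.
sc.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)
// Fewer, larger part files: each output partition becomes at least one part file.
df.repartition(16).write.parquet("s3a://myBucket/rawData/")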
I have the following Spark DataFrame:
agent_id / payment_amount
a / 1000
b / 1100
a / 1100
a / 1200
b / 1200
b / 1250
a / 10000
b / 9000
My desired output would be something like:
agent_id 95_quantile
a whatever the 95th quantile is for a...
For those of you who haven't run into this SO thread http://stackoverflow.com/questions/39633614/calculate-quantile-on-grouped-data-in-spark-dataframe, it points out that one work-around is to use the Hive UDF "percentile_approx". Please see th...
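A minimal sketch of that work-around, calling percentile_approx through a SQL expression (the DataFrame is assumed to be named df; the column names follow the question):
import org.apache.spark.sql.functions.expr
// 0.95 asks for the 95th percentile of payment_amount within each agent_id group.
val quantiles = df
  .groupBy("agent_id")
  .agg(expr("percentile_approx(payment_amount, 0.95)").as("95_quantile"))
quantiles.show()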