Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

kkarthik
by New Contributor
  • 3661 Views
  • 1 reply
  • 0 kudos

I want to split a dataframe by one-week date ranges, with each week's data in a different column.

DF:
Q    Date (yyyy-mm-dd)
q1   2017-10-01
q2   2017-10-03
q1   2017-10-09
q3   2017-10-06
q2   2017-10-01
q1   2017-10-13
Q1   2017-10-02
Q3   2017-10-21
Q4   2017-10-17
Q5   2017-10-20
Q4   2017-10-31
Q2   2017-10-27
Q5   2017-10-01
Dataframe: ...

Latest Reply
User16857281974
Contributor
  • 0 kudos

It should just be a matter of applying the correct set of transformations: you can start by adding the week-of-year to each record with pyspark.sql.functions.weekofyear(..) and naming it something like weekOfYear. See https://spark.apache.or...
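A minimal Scala sketch of that approach (weekofyear is also in org.apache.spark.sql.functions); the sample rows and column names below are hypothetical stand-ins for the question's DF, and spark is the usual notebook SparkSession:

import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical stand-in for the question's DF: one label and one date per row.
val df = Seq(
  ("q1", "2017-10-01"), ("q2", "2017-10-03"), ("q1", "2017-10-09"),
  ("q3", "2017-10-06"), ("q1", "2017-10-13"), ("q4", "2017-10-17")
).toDF("Q", "Date").withColumn("Date", to_date($"Date"))

// Tag each record with its week of year, then pivot so each week becomes its own column.
val byWeek = df
  .withColumn("weekOfYear", weekofyear($"Date"))
  .groupBy($"Q")
  .pivot("weekOfYear")
  .agg(collect_list($"Date"))

byWeek.show(false)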

XinZodl
by New Contributor III
  • 10627 Views
  • 3 replies
  • 1 kudos

Resolved! How to parse a file with newline character, escaped with \ and not quoted

Hi! I am facing an issue when reading and parsing a CSV file. Some records have a newline symbol, "escaped" by a \, and the record is not quoted. The file might look like this:
Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;...

Latest Reply
XinZodl
New Contributor III
  • 1 kudos

The solution is "sparkContext.wholeTextFiles".
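A hedged Scala sketch of that idea, assuming ';'-separated fields and a trailing backslash as the line-break escape, as described in the question; the input path is a placeholder:

// Read each file as one (path, content) pair so the escaped line breaks survive,
// then splice continued lines back together before splitting records and fields.
val raw = spark.sparkContext.wholeTextFiles("/path/to/input") // placeholder path

val rows = raw.flatMap { case (_, content) =>
  content
    .replace("\\\n", " ") // join lines that were broken with a trailing backslash
    .split("\n")          // each remaining element is one logical record
}.map(_.split(";"))

rows.take(5).foreach(r => println(r.mkString(" | ")))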

2 More Replies
kelleyrw
by New Contributor II
  • 8155 Views
  • 7 replies
  • 0 kudos

Resolved! How do I register a UDF that returns an array of tuples in scala/spark?

I'm relatively new to Scala. In the past, I was able to do the following in Python:
def foo(p1, p2):
    import datetime as dt
    dt.datetime(2014, 4, 17, 12, 34)
    result = [
        (1, "1", 1.1, dt.datetime(2014, 4, 17, 1, 0)),
        (2, "2", 2...

Latest Reply
__max
New Contributor III
  • 0 kudos

Hello, just in case, here is an example of the proposed solution above:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
val data = Seq(("A", Seq((3,4),(5,6),(7,10))), ("B", Seq((-1,...
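To spell out the general pattern this thread is about: a Scala function returning a Seq of tuples can be registered directly, and Spark exposes it to SQL as an array of structs. The function body below is invented purely for illustration, and spark is the notebook SparkSession:

import java.sql.Timestamp

// Returning Seq[(Int, String, Double, Timestamp)] surfaces in SQL as array<struct<...>>.
spark.udf.register("foo", (n: Int) =>
  Seq.tabulate(n) { i =>
    (i, i.toString, i * 1.1, Timestamp.valueOf("2014-04-17 12:34:00"))
  }
)

spark.range(1, 4)
  .selectExpr("id", "foo(cast(id AS int)) AS tuples")
  .show(false)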

6 More Replies
samalexg
by New Contributor III
  • 14176 Views
  • 13 replies
  • 1 kudos

How to add an environment variable

Instead of setting the AWS accessKey and secretKey in hadoopConfiguration, I would like to provide them as the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. How can I do that in Databricks?

Latest Reply
jric
New Contributor II
  • 1 kudos

It is possible! I was able to confirm that the following post's "Best" answer works: https://forums.databricks.com/questions/11116/how-to-set-an-environment-variable.html
FYI for @Miklos Christine and @Mike Trewartha
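In practice that answer boils down to setting the variables in the cluster configuration (or an init script) so they are present on the driver and executors; a hedged Scala sketch of reading them back, with an optional Hadoop-config bridge for libraries that still expect it:

// Once AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are defined as cluster environment
// variables, driver code can read them like any other environment variable.
val accessKey = sys.env("AWS_ACCESS_KEY_ID")
val secretKey = sys.env("AWS_SECRET_ACCESS_KEY")

// Optional: some code paths still read credentials from the Hadoop configuration.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)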

12 More Replies
KiranRastogi
by New Contributor
  • 29550 Views
  • 2 replies
  • 1 kudos

Pandas dataframe to a table

I want to write a pandas dataframe to a table. How can I do this? The write command is not working; please help.

Latest Reply
amy_wang
New Contributor II
  • 1 kudos

Hey Kiran, just taking a stab in the dark, but do you want to convert the pandas DataFrame to a Spark DataFrame and then write out the Spark DataFrame as a non-temporary SQL table?
import pandas as pd
## Create Pandas Frame
pd_df = pd.DataFrame({u'20...
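The pandas-to-Spark conversion itself is Python-only (roughly spark.createDataFrame(pd_df)); the table-writing half of that reply looks like this in Scala, with sparkDf and the table name as placeholders:

// Persist an existing Spark DataFrame as a non-temporary managed table.
sparkDf.write
  .mode("overwrite")        // or "append"
  .saveAsTable("my_table")  // placeholder table name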

1 More Replies
RobertWalsh
by New Contributor II
  • 6468 Views
  • 6 replies
  • 0 kudos

Resolved! Hive Table Creation - Parquet does not support Timestamp Datatype?

Good afternoon. Attempting to run this statement:
%sql
CREATE EXTERNAL TABLE IF NOT EXISTS dev_user_login (
  event_name STRING,
  datetime TIMESTAMP,
  ip_address STRING,
  acting_user_id STRING
)
PARTITIONED BY (date DATE)
STORED AS PARQUET ...

Latest Reply
SirChokolate
New Contributor II
  • 0 kudos

How can I apply the solution above in a Spark script?
package com.neoris.spark
import java.text.SimpleDateFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.sql.types.{Dat...
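The accepted solution is not visible in this preview, but a common workaround when an older Hive/Parquet combination rejects TIMESTAMP is to store the column as a formatted string and cast it back on read. A hedged Scala sketch, with df and the table name as placeholders:

import org.apache.spark.sql.functions.{col, date_format, to_timestamp}

// Write: keep the event time as a string so the Parquet/Hive schema is accepted.
val forHive = df.withColumn("datetime", date_format(col("datetime"), "yyyy-MM-dd HH:mm:ss"))
forHive.write.mode("overwrite").format("parquet").saveAsTable("dev_user_login")

// Read: cast the string back to a proper timestamp when querying.
val restored = spark.table("dev_user_login")
  .withColumn("datetime", to_timestamp(col("datetime"), "yyyy-MM-dd HH:mm:ss"))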

5 More Replies
letsflykite
by New Contributor II
  • 15262 Views
  • 2 replies
  • 1 kudos

How to increase spark.kryoserializer.buffer.max

When I join two dataframes, I get the following error:
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1
Serialization trace:
values (org.apache.spark.sql.catalyst.expressions.GenericRow)
otherEle...

Latest Reply
Jose_Maria_Tala
New Contributor II
  • 1 kudos

val conf = new SparkConf()
...
conf.set("spark.kryoserializer.buffer.max.mb", "512")
...
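Note that spark.kryoserializer.buffer.max.mb is the older spelling of this setting; more recent Spark versions expect spark.kryoserializer.buffer.max with a size suffix, and on Databricks it is usually set in the cluster's Spark config rather than in code. A small sketch:

import org.apache.spark.SparkConf

// Newer key: the value carries its own unit ("512m") instead of the ".mb" suffix.
val conf = new SparkConf()
  .set("spark.kryoserializer.buffer.max", "512m")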

1 More Replies
cfregly
by Contributor
  • 4619 Views
  • 4 replies
  • 0 kudos
Latest Reply
TianziCai
New Contributor II
  • 0 kudos

sample = (spark.read
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("dbtable", "xx.xxx")  # schema, table
  .option("forward_spark_s3_credentials", True)
  .option("tempdir", tem...
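For reference, a hedged Scala version of the same read; the JDBC URL, table name, and tempdir bucket are placeholders:

// spark-redshift unloads data through S3, so a tempdir and S3 credentials are required.
val jdbcUrl = "jdbc:redshift://host:5439/db?user=...&password=..." // placeholder
val sample = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("dbtable", "schema.table")               // placeholder schema.table
  .option("forward_spark_s3_credentials", "true")
  .option("tempdir", "s3a://my-bucket/tmp/")       // placeholder bucket
  .load()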

3 More Replies
prachicsa
by New Contributor
  • 1617 Views
  • 3 replies
  • 0 kudos

Filtering records for all values of an array in Spark

I am very new to Spark. I have a very basic question. I have an array of values:
listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A)
I want to filter an RDD for all of these token values. I tried the following way:
va...

Latest Reply
__max
New Contributor III
  • 0 kudos

Actually, the intersection transformation does deduplication. If you don't need it, you can just slightly modify your code:
val filteredRdd = rddAll.filter(line => line.contains(token))
and send the data of the RDD to your program by calling an act...
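Extending that reply's single-token filter to the whole array is a one-liner with exists; a hedged sketch reusing the question's rddAll and token values:

val tokens = Array("EC-17A5206955089011B", "EC-17A5206955089011A")

// Keep every line that mentions at least one of the tokens.
val filteredRdd = rddAll.filter(line => tokens.exists(t => line.contains(t)))

// Nothing runs until an action is called, e.g. bring the matches back to the driver.
val matches = filteredRdd.collect()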

2 More Replies
NarwshKumar
by New Contributor
  • 5340 Views
  • 3 replies
  • 0 kudos

Calculate median and interquartile range on a Spark dataframe

I have a Spark dataframe with 5 columns and I want to calculate the median and interquartile range for all of them. I am not able to figure out how to write a UDF and call it on the columns.

Latest Reply
jmwilli25
New Contributor II
  • 0 kudos

Here is the easiest way to calculate this: https://stackoverflow.com/questions/37032689/scala-first-quartile-third-quartile-and-iqr-from-spark-sqlcontext-dataframe
No Hive or windowing necessary.
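On Spark 2.x there is also DataFrame.stat.approxQuantile, which needs neither a UDF nor a window; a hedged sketch over hypothetical column names, with df as the question's dataframe:

// Approximate 25th/50th/75th percentiles per column; the last argument is the relative error.
val cols = Seq("c1", "c2", "c3", "c4", "c5") // placeholder column names
val stats = cols.map { c =>
  val Array(q1, median, q3) = df.stat.approxQuantile(c, Array(0.25, 0.5, 0.75), 0.01)
  (c, median, q3 - q1) // (column, median, IQR)
}
stats.foreach { case (c, m, iqr) => println(s"$c: median=$m IQR=$iqr") }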

2 More Replies
pmezentsev
by New Contributor
  • 3911 Views
  • 1 reply
  • 0 kudos

What is the difference between createTempView, createGlobalTempView and registerTempTable

Hi, friends! I have a question about the difference between these three functions:
dataframe.createTempView
dataframe.createGlobalTempView
dataframe.registerTempTable
All of them create intermediate tables. How do I decide which one to choose in c...

Latest Reply
KeshavP
New Contributor II
  • 0 kudos

From my understanding, createTempView (or, more appropriately, createOrReplaceTempView) was introduced in Spark 2.0 to replace registerTempTable, which was deprecated in 2.0. createTempView creates an in-memory reference to the DataFrame in ...
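A small sketch contrasting the two current calls (registerTempTable is just the deprecated predecessor of createOrReplaceTempView); df is a placeholder DataFrame:

// Session-scoped: visible only in this SparkSession and dropped when it ends.
df.createOrReplaceTempView("my_view")
spark.sql("SELECT * FROM my_view")

// Application-scoped: shared across sessions and resolved via the global_temp database.
df.createOrReplaceGlobalTempView("my_global_view")
spark.sql("SELECT * FROM global_temp.my_global_view")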

WenLin
by New Contributor II
  • 5628 Views
  • 3 replies
  • 0 kudos

data.write.format('com.databricks.spark.csv') added additional quotation marks

I am using the following code (pyspark) to export my data frame to csv:
data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath')
Note that I use d...

Latest Reply
chaotic3quilibr
New Contributor III
  • 0 kudos

To turn off the default escaping of the double quote character (") with the backslash character (\), i.e. to avoid escaping any characters at all, you must add an .option() method call with just the right parameters after the .write() ...
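The exact option name is cut off in this preview; with the com.databricks.spark.csv writer the relevant knobs are quote, escape, and quoteMode, so a hedged Scala sketch of suppressing quoting altogether might look like the following (worth verifying against your spark-csv version):

// Hypothetical: write tab-separated, gzip-compressed output without added quoting.
data.write
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .option("quoteMode", "NONE") // do not wrap fields in quotes at all
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save("s3a://myBucket/myPath")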

2 More Replies
supriya
by New Contributor II
  • 9056 Views
  • 12 replies
  • 0 kudos

How to append new column values to a dataframe based on unique IDs

I need to create a new column with data in a dataframe. Example:
val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop"),
  (11L, "a b c d e spark"),
  (12L, "b d"),
  (13L, "spark f g h"),
...

Latest Reply
raela
New Contributor III
  • 0 kudos

@supriya you will have to do a join.
import org.apache.spark.sql.functions._
val joined = test.join(tuples, col("id") === col("tupleid"), "inner").select("id", "text", "average")
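To make that runnable end to end, a hedged sketch where tuples is a hypothetical (tupleid, average) DataFrame being attached to the question's test DataFrame (assumed to have columns id and text, as in the reply):

import org.apache.spark.sql.functions.col
import spark.implicits._

// Hypothetical per-id values to attach as the new column.
val tuples = Seq((4L, 0.7), (5L, 0.2), (6L, 0.9)).toDF("tupleid", "average")

val joined = test
  .join(tuples, col("id") === col("tupleid"), "inner")
  .select("id", "text", "average")

joined.show()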

11 More Replies
