Data Engineering

Forum Posts

by kkarthik • New Contributor

11-13-2017 9:09:37 PM

3661 Views
1 reply
0 kudos

I want to split a DataFrame into one-week date ranges, with each week's data in a different column.

DF

Q    Date (yyyy-mm-dd)
q1   2017-10-01
q2   2017-10-03
q1   2017-10-09
q3   2017-10-06
q2   2017-10-01
q1   2017-10-13
Q1   2017-10-02
Q3   2017-10-21
Q4   2017-10-17
Q5   2017-10-20
Q4   2017-10-31
Q2   2017-10-27
Q5   2017-10-01

Dataframe: ...

Latest Reply

User16857281974
Contributor

11-28-2017 4:24:00 PM

0 kudos

It should just be a matter of applying the correct set of transformations: You can start by adding the week-of-year to each record with the function pyspark.sql.functions.weekofyear(..) and naming the new column something like weekOfYear. See https://spark.apache.or...
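The week-bucketing step described in this reply can be sketched in plain Python, without Spark, to show the grouping logic. This is only a minimal sketch under assumptions: the helper name group_by_iso_week is hypothetical, the sample rows are the first four rows of the question's data, and weeks are numbered per ISO 8601 (weeks start on Monday), which is also how Spark's weekofyear numbers weeks.

```python
from collections import defaultdict
from datetime import date


def group_by_iso_week(rows):
    """Group (label, 'yyyy-mm-dd') pairs by ISO week number (hypothetical helper)."""
    weeks = defaultdict(list)
    for label, day in rows:
        # isocalendar() returns (ISO year, ISO week number, ISO weekday).
        week = date.fromisoformat(day).isocalendar()[1]
        weeks[week].append(label)
    return dict(weeks)


# Sample rows taken from the question; each week becomes one bucket.
rows = [
    ("q1", "2017-10-01"),
    ("q2", "2017-10-03"),
    ("q1", "2017-10-09"),
    ("q3", "2017-10-06"),
]
by_week = group_by_iso_week(rows)
```

In PySpark the analogous step would presumably be to add the week column with weekofyear and then group or pivot on it; the sketch above only illustrates how dates fall into week buckets.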

by XinZodl • New Contributor III

11-03-2017 12:01:16 AM

10627 Views
3 replies
1 kudos

Resolved! How to parse a file with newline character, escaped with \ and not quoted

Hi! I am facing an issue when reading and parsing a CSV file. Some records contain a newline character "escaped" by a \, and those records are not quoted. The file might look like this:

    Line1field1;Line1field2.1 \
    Line1field2.2;Line1field3;
    Line2FIeld1;...

Latest Reply

XinZodl
New Contributor III

11-07-2017 11:59:09 PM

1 kudos

Solution is "sparkContext.wholeTextFiles"
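wholeTextFiles returns each file as a single string, so the escaped line breaks can be repaired before the text is split into records. The repair step might look like the following plain-Python sketch. The function name parse_escaped_records, the ";" delimiter default, and the sample text are assumptions for illustration, loosely modeled on the example in the question; this is not the poster's actual code.

```python
import re


def parse_escaped_records(text, delimiter=";"):
    """Split text into records, keeping backslash-escaped line breaks inside a field.

    Hypothetical helper: joins each "\\<newline>" into a single space first.
    """
    # A backslash followed by a line break (plus surrounding whitespace)
    # marks a field that continues on the next line.
    joined = re.sub(r"\s*\\\r?\n\s*", " ", text)
    # What remains is one record per line; skip blank lines.
    return [line.split(delimiter) for line in joined.splitlines() if line.strip()]


# Two records; the second field of the first record spans two physical lines.
sample = "a1;a2.1 \\\na2.2;a3\nb1;b2;b3\n"
records = parse_escaped_records(sample)
```

Replacing the escaped break with a space is a design choice made here for readability; dropping it or keeping a literal newline would work just as well, depending on what the downstream consumer expects.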

2 More Replies

by kelleyrw • New Contributor II

06-30-2016 1:28:05 PM

8155 Views
7 replies
0 kudos

Resolved! How do I register a UDF that returns an array of tuples in scala/spark?

I'm relatively new to Scala. In the past, I was able to do the following in Python:

    def foo(p1, p2):
        import datetime as dt
        dt.datetime(2014, 4, 17, 12, 34)
        result = [
            (1, "1", 1.1, dt.datetime(2014, 4, 17, 1, 0)),
            (2, "2", 2...

Latest Reply

__max
New Contributor III

10-18-2017 5:40:07 PM

0 kudos

Hello, Just in case, here is an example of the solution proposed above:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions._
    import org.apache.spark.sql.types._

    val data = Seq(("A", Seq((3,4),(5,6),(7,10))), ("B", Seq((-1,...

6 More Replies

by samalexg • New Contributor III

09-03-2015 9:24:07 PM

14176 Views
13 replies
1 kudos

How to add environment variable

Instead of setting the AWS accessKey and secret Key in hadoopConfiguration, I would like to add those in environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. How can I do that in databricks?

Latest Reply

jric
New Contributor II

10-16-2017 4:16:15 PM

1 kudos

It is possible! I was able to confirm that the following post's "Best" answer works: https://forums.databricks.com/questions/11116/how-to-set-an-environment-variable.html FYI for @Miklos Christine and @Mike Trewartha

12 More Replies

by KiranRastogi • New Contributor

05-07-2017 11:55:01 PM

29550 Views
2 replies
1 kudos

Pandas dataframe to a table

I want to write a pandas DataFrame to a table. How can I do this? The write command is not working; please help.

Latest Reply

amy_wang
New Contributor II

09-27-2017 11:13:12 AM

1 kudos

Hey Kiran, Just taking a stab in the dark, but do you want to convert the Pandas DataFrame to a Spark DataFrame and then write out the Spark DataFrame as a non-temporary SQL table?

    import pandas as pd

    ## Create Pandas Frame
    pd_df = pd.DataFrame({u'20...

1 More Replies

by RobertWalsh • New Contributor II

09-06-2015 1:07:57 PM

6468 Views
6 replies
0 kudos

Resolved! Hive Table Creation - Parquet does not support Timestamp Datatype?

Good afternoon, Attempting to run this statement:

    %sql
    CREATE EXTERNAL TABLE IF NOT EXISTS dev_user_login (
      event_name STRING,
      datetime TIMESTAMP,
      ip_address STRING,
      acting_user_id STRING
    )
    PARTITIONED BY (date DATE)
    STORED AS PARQUET
    ...

Latest Reply

SirChokolate
New Contributor II

09-11-2017 9:39:18 AM

0 kudos

How can I apply the solution above in a Spark script?

    package com.neoris.spark

    import java.text.SimpleDateFormat
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}
    import org.apache.spark.sql.types.{Dat...

5 More Replies

by cfregly • Contributor

04-30-2015 2:58:41 PM

3676 Views
4 replies
0 kudos

How do I replace nulls with 0's in a DataFrame?

Latest Reply

GauravKhare
New Contributor II

09-04-2017 5:21:27 AM

0 kudos

df.na.replace(df.columns,Map("" -> "0")).show() // to convert from blank strings to zero

3 More Replies

by letsflykite • New Contributor II

07-31-2015 10:25:03 PM

15262 Views
2 replies
1 kudos

How to increase spark.kryoserializer.buffer.max

When I join two DataFrames, I get the following error:

    org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1
    Serialization trace:
    values (org.apache.spark.sql.catalyst.expressions.GenericRow)
    otherEle...

Latest Reply

Jose_Maria_Tala
New Contributor II

08-03-2017 6:34:42 AM

1 kudos

val conf = new SparkConf() ... conf.set("spark.kryoserializer.buffer.max.mb", "512") ...

1 More Replies

by cfregly • Contributor

05-09-2015 2:35:31 PM

4619 Views
4 replies
0 kudos

SSL connection java.sql.SQLException with Redshift

Latest Reply

TianziCai
New Contributor II

06-14-2017 2:09:16 PM

0 kudos

    sample = (spark.read
        .format("com.databricks.spark.redshift")
        .option("url", jdbcUrl)
        .option("dbtable", "xx.xxx")  # schema, table
        .option("forward_spark_s3_credentials", True)
        .option("tempdir", tem...

3 More Replies

by prachicsa • New Contributor

09-09-2015 2:54:36 AM

1617 Views
3 replies
0 kudos

Filtering records for all values of an array in Spark

I am very new to Spark. I have a very basic question. I have an array of values:

    listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A)

I want to filter an RDD for all of these token values. I tried the following way: va...

Latest Reply

__max
New Contributor III

06-13-2017 7:42:48 AM

0 kudos

Actually, the intersection transformation does deduplication. If you don't need it, you can just slightly modify your code: val filteredRdd = rddAll.filter(line => line.contains(token)) and send data of the rdd to your program by calling of an act...
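The filter in this reply tests a single token. To keep lines that contain any of the values in the array, the predicate can loop over the tokens. The following is a plain-Python sketch of that predicate, not the poster's code: contains_any is a hypothetical helper and the sample lines are made up. The same function could be passed to an RDD's filter in PySpark.

```python
def contains_any(line, tokens):
    """Return True if the line contains at least one of the tokens (hypothetical helper)."""
    return any(token in line for token in tokens)


# Token values from the question; the log lines are invented for illustration.
tokens = ["EC-17A5206955089011B", "EC-17A5206955089011A"]
lines = [
    "2015-09-09 checkout token=EC-17A5206955089011A status=ok",
    "2015-09-09 checkout token=EC-00000000000000000 status=ok",
    "2015-09-09 refund token=EC-17A5206955089011B status=failed",
]

# Unlike an intersection, a filter keeps duplicates and preserves order.
filtered = [line for line in lines if contains_any(line, tokens)]
```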

2 More Replies

by NarwshKumar • New Contributor

02-06-2016 7:11:12 AM

5340 Views
3 replies
0 kudos

calculate median and inter quartile range on spark dataframe

I have a Spark DataFrame with 5 columns and I want to calculate the median and interquartile range of each. I cannot figure out how to write UDFs and call them on the columns.

Latest Reply

jmwilli25
New Contributor II

05-23-2017 3:28:53 PM

0 kudos

Here is the easiest way to calculate this... https://stackoverflow.com/questions/37032689/scala-first-quartile-third-quartile-and-iqr-from-spark-sqlcontext-dataframe No Hive or windowing necessary.
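For reference, the quantities being asked for can be sketched in plain Python with the standard statistics module. This is not the code from the linked Stack Overflow answer; the helper name median_and_iqr and the sample values are illustrative, and statistics.quantiles is used with its default "exclusive" method, so results may differ slightly from other quartile definitions. On a Spark DataFrame, approxQuantile would be the closest analogous call.

```python
import statistics


def median_and_iqr(values):
    """Return (median, interquartile range) for a sequence of numbers (hypothetical helper)."""
    # With n=4, quantiles() returns the three cut points Q1, Q2 (median), Q3.
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q2, q3 - q1


# One column's worth of made-up values; repeat per column as needed.
column = [1, 2, 3, 4, 5, 6, 7, 8, 9]
med, iqr = median_and_iqr(column)
```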

2 More Replies

by pmezentsev • New Contributor

04-01-2017 8:03:26 AM

3911 Views
1 reply
0 kudos

What is the difference between createTempView, createGlobalTempView and registerTempTable

Hi, friends! I have a question about the difference between these three functions: dataframe.createTempView, dataframe.createGlobalTempView, and dataframe.registerTempTable. All of them create intermediate tables. How to decide which I have to choose in c...

Latest Reply

KeshavP
New Contributor II

04-11-2017 3:39:49 PM

0 kudos

From my understanding, createTempView (or more appropriately createOrReplaceTempView) was introduced in Spark 2.0 to replace registerTempTable, which was deprecated in 2.0. createTempView creates an in-memory reference to the DataFrame in ...

by WenLin • New Contributor II

06-06-2016 11:40:17 AM

5628 Views
3 replies
0 kudos

data.write.format('com.databricks.spark.csv') added additional quotation marks

I am using the following code (pyspark) to export my data frame to csv:

    data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath')

Note that I use d...

Latest Reply

chaotic3quilibr
New Contributor III

03-30-2017 5:37:47 PM

0 kudos

To turn off the default escaping of the double quote character (") with the backslash character (\), i.e. to avoid escaping any characters at all, you must add an .option() method call with just the right parameters after the .write() ...

2 More Replies

by supriya • New Contributor II

01-22-2016 1:47:07 AM

9056 Views
12 replies
0 kudos

How to append new column values to a DataFrame based on unique IDs

I need to create a new column with data in a DataFrame. Example:

    val test = sqlContext.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "mapreduce spark"),
      (7L, "apache hadoop"),
      (11L, "a b c d e spark"),
      (12L, "b d"),
      (13L, "spark f g h"), ...

Latest Reply

raela
New Contributor III

02-02-2016 5:08:32 PM

0 kudos

@supriya you will have to do a join.

    import org.apache.spark.sql.functions._

    val joined = test.join(tuples, col("id") === col("tupleid"), "inner").select("id", "text", "average")

11 More Replies

by cfregly • Contributor

05-04-2015 5:44:49 PM

6085 Views
4 replies
0 kudos

Resolved! How do I import a CSV file (local or remote) into Databricks Cloud?

Latest Reply

Bill_Chambers
Contributor II

01-11-2017 11:01:01 AM

0 kudos

Please see this guide on how to import data into Databricks.

3 More Replies

Databricks Community
