Data Engineering

Forum Posts

by User16301467532 • New Contributor II

07-15-2015 11:45:24 AM

18179 Views
9 replies
1 kudos

How can I change the parquet compression algorithm from gzip to something else?

Spark, by default, uses gzip to store parquet files. I would like to change the compression algorithm from gzip to snappy or lz4.

Latest Reply

ZhenZeng
New Contributor II

10-01-2019 2:10:05 AM

1 kudos

spark.sql("set spark.sql.parquet.compression.codec=gzip");

8 More Replies

by Venkata_Krishna • New Contributor

01-13-2020 11:04:43 AM

8122 Views
1 replies
0 kudos

convert string dataframe column MM/dd/yyyy hh:mm:ss AM/PM to timestamp MM-dd-yyyy hh:mm:ss

How do I convert the string 6/3/2019 5:06:00 AM to a timestamp in 24-hour format MM-dd-yyyy hh:mm:ss in PySpark?

Latest Reply

lee
Contributor

01-13-2020 5:17:35 PM

0 kudos

You would use a combination of the functions: pyspark.sql.functions.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss') (documentation) and pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss') (documentation) from pysp...
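The format patterns involved can be checked outside Spark. The following is a minimal sketch that uses Python's standard datetime module rather than the pyspark functions named in the reply; the function name to_24_hour is hypothetical, and the result is a formatted string, not a Spark timestamp column. In Spark's pattern syntax the same input would typically be described as M/d/yyyy h:mm:ss a and the 24-hour output as MM-dd-yyyy HH:mm:ss (HH is the 24-hour field, hh the 12-hour one).

```python
from datetime import datetime


def to_24_hour(value: str) -> str:
    """Convert 'M/d/yyyy h:mm:ss AM/PM' text to 'MM-dd-yyyy HH:mm:ss' text."""
    # %I is the 12-hour clock hour and %p is the AM/PM marker.
    parsed = datetime.strptime(value, "%m/%d/%Y %I:%M:%S %p")
    # %H is the 24-hour clock hour.
    return parsed.strftime("%m-%d-%Y %H:%M:%S")


print(to_24_hour("6/3/2019 5:06:00 AM"))  # 06-03-2019 05:06:00
print(to_24_hour("6/3/2019 5:06:00 PM"))  # 06-03-2019 17:06:00
```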

by MithuWagh • New Contributor

12-24-2019 4:14:09 AM

5692 Views
1 replies
0 kudos

How to deal with column name with .(dot) in pyspark dataframe??

We are streaming JSON data from a Kafka source, but some of the column names contain a . (dot). Streaming JSON data: df1 = df.selectExpr("CAST(value AS STRING)") {"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...

Latest Reply

shyam_9
Valued Contributor

12-30-2019 3:27:03 AM

0 kudos

Hi @Mithu Wagh, you can use backticks to enclose the column name: df.select("`col0.1`")
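When the column names are only known at run time, the backticks can be added programmatically. This is a small sketch in plain Python; quote_column is a hypothetical helper, not a Spark API, and the doubling of embedded backticks is an assumption about how Spark SQL escapes them inside a quoted identifier.

```python
def quote_column(name: str) -> str:
    """Wrap a column name in backticks so that a dot is read literally."""
    # Assumption: a backtick inside the name is escaped by doubling it.
    return "`" + name.replace("`", "``") + "`"


print(quote_column("col0.1"))  # `col0.1`
```

With this helper, df.select(quote_column("col0.1")) would pass the same string as the df.select("`col0.1`") call above. Without the backticks, the dot is normally read as access to a field inside a struct column.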

by KrisMusial • New Contributor

07-31-2016 11:07:50 AM

5552 Views
2 replies
0 kudos

Resolved! Saving to parquet with SaveMode.Overwrite throws exception

Hello, I'm trying to save a DataFrame in Parquet with SaveMode.Overwrite, with no success. I minimized the code and reproduced the issue with the following two cells:

> case class MyClass(val fld1: Integer, val fld2: Integer)
>
> val lst1 = sc.paralle...

Latest Reply

Guru421421
New Contributor II

12-20-2019 1:59:16 PM

0 kudos

results.select("ValidationTable", "Results","Description","CreatedBy","ModifiedBy","CreatedDate","ModifiedDate").write.mode('overwrite').save("

1 More Replies

by NandhaKumar • New Contributor II

11-14-2019 2:34:57 AM

3914 Views
3 replies
0 kudos

How to specify multiple files in --py-files in spark-submit command for databricks job? All the files to be specified in --py-files present in dbfs: .

I have set up Databricks in Azure and created a cluster for Python 3. I am creating a job using spark-submit parameters. How do I specify multiple files in --py-files in the spark-submit command for a Databricks job? All the files to be specified in ...

Latest Reply

shyam_9
Valued Contributor

11-17-2019 9:46:20 PM

0 kudos

Hi @Nandha Kumar, please go through the docs below on passing Python files as a job: https://docs.databricks.com/dev-tools/api/latest/jobs.html#sparkpythontask

2 More Replies

by cfregly • Contributor

05-26-2015 11:38:48 AM

3099 Views
4 replies
0 kudos

How do I group my dataset by a key or combination of keys without doing any aggregations using RDDs, DataFrames, and SQL?

Latest Reply

GeethGovindSrin
New Contributor II

12-19-2019 2:47:04 AM

0 kudos

@cfregly: For DataFrames, you can use the following code to call groupBy without aggregations: Df.groupBy(Df["column_name"]).agg({})

3 More Replies

by tourist_on_road • New Contributor

12-12-2019 4:47:16 PM

4984 Views
1 replies
0 kudos

How to read binary data in pyspark

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using pyspark.

from io import StringIO
import array
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)
def mapper(featur...

Latest Reply

shyam_9
Valued Contributor

12-16-2019 10:00:26 PM

0 kudos

Hi @tourist_on_road, please go through the Spark docs below: https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles

by naveenreddy1 • New Contributor II

11-21-2019 8:40:58 PM

16930 Views
3 replies
0 kudos

Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. Driver stacktrace

We are using a 3-node Databricks cluster with 32 GB of memory. It works fine, but sometimes it throws the error: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues.

Latest Reply

RodrigoDe_Freit
New Contributor II

12-10-2019 11:55:58 AM

0 kudos

If your job fails, follow this: According to https://docs.databricks.com/jobs.html#jar-job-tips: "Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger size, the run will be canceled and ma...

2 More Replies

by MikeK_ • New Contributor II

11-29-2019 11:32:28 AM

13360 Views
1 replies
0 kudos

Resolved! SQL variables in a notebook

Hi, in an SQL notebook, using this link: https://docs.databricks.com/spark/latest/spark-sql/language-manual/set.html I managed to figure out how to set values and how to get the value.

SET my_val=10; //saves the value 10 for key my_val
SET my_val; //dis...

Latest Reply

shyam_9
Valued Contributor

12-01-2019 11:38:37 PM

0 kudos

Hi @Mike K.., you can do this with widgets and getArgument. Here's a small example of what that might look like: https://community.databricks.com/s/feed/0D53f00001HKHZfCAP

by kruhly • New Contributor II

05-12-2015 3:29:18 AM

28540 Views
12 replies
0 kudos

Resolved! Is there a better method to join two dataframes and not have a duplicated column?

I would like to keep only one of the columns used to join the dataframes. Using select() after the join does not seem straightforward because the real data may have many columns or the column names may not be known. A simple example is below: llist = [(...

Latest Reply

TejuNC
New Contributor II

01-23-2017 1:55:52 AM

0 kudos

This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this: SELECT * FROM a JOIN b ON joinExprs. If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate you c...

11 More Replies

by Pierrek20 • New Contributor

10-11-2018 4:59:22 AM

13434 Views
2 replies
0 kudos

How to loop over spark dataframe with scala ?

Hello! I'm a rookie with Spark and Scala. Here is my problem (thanks in advance for your help). My input dataframe looks like this:

index  bucket  time   ap  station  rssi
0      1       00:00  1   1        -84.0
1      1       00:00  1   3        -67.0
2      1       00:00  1   4        -82.0
3      1       00:00  1   2        -68.0
4      1       00:00...

Latest Reply

Eve
New Contributor III

11-19-2019 1:53:57 AM

0 kudos

Looping is not always necessary, I always use this foreach method, something like the following: aps.collect().foreach(row => <do something>)

1 More Replies

by 1stcommander • New Contributor II

11-11-2019 6:10:40 AM

7122 Views
2 replies
0 kudos

Parquet partitionBy - date column to nested folders

Hi, when writing a DataFrame to parquet using partitionBy(<date column>), the resulting folder structure looks like this:

root
|----------------- day1
|----------------- day2
|----------------- day3

Is it possible to create a structure like the foll...

Latest Reply

Saphira
New Contributor II

11-13-2019 6:09:41 AM

0 kudos

Hey @1stcommander, you'll have to create those columns yourself. If it's something you will have to do often, you could always write a function. In any case, imho it's not that much work. I'm not sure what your problem is with the partition pruning. It...
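A rough illustration of "create those columns yourself", in plain Python so the resulting folder layout is easy to inspect. The year/month/day split and both function names are assumptions made for this sketch, since the target structure in the question is cut off. In PySpark the same columns would typically be derived with pyspark.sql.functions.year, month and dayofmonth and then passed to partitionBy("year", "month", "day").

```python
from datetime import date


def partition_columns(day: date) -> dict:
    """Split one date into the separate columns used for partitioning."""
    return {"year": day.year, "month": day.month, "day": day.day}


def partition_path(day: date) -> str:
    """Build the nested folder path, one 'column=value' level per column."""
    columns = partition_columns(day)
    return "/".join(f"{name}={value}" for name, value in columns.items())


print(partition_path(date(2019, 11, 11)))  # year=2019/month=11/day=11
```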

1 More Replies

by paourissi • New Contributor

11-22-2015 1:03:31 PM

7990 Views
2 replies
1 kudos

When to persist and when to unpersist RDD in Spark

Let's say I have the following:

val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)
val dataset3 = dataset2.map(.....)

1) If you do a transformation on dataset2, then you have to persist it and pass it to dataset3 and unpersist ...

Latest Reply

Arun_KumarPT
New Contributor II

11-24-2015 10:10:50 PM

1 kudos

It is well documented here : http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

1 More Replies

by AnandJ_Kadhi • New Contributor II

08-18-2017 5:47:44 AM

4856 Views
2 replies
1 kudos

Handle comma inside cell of CSV

We are using spark-csv_2.10 > version 1.5.0 and reading a CSV file in which a column contains a comma (",") as one of its characters. The problem is that the rest of the line after the comma is treated as a new column, and the data is not interpre...

Latest Reply

User16857282152
Contributor

11-01-2019 10:27:53 AM

1 kudos

Take a look here for options: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader.csv If a CSV file has commas, the convention is to quote the string that contains the comma. In ...
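The quoting convention can be seen with Python's standard csv module. This is a sketch with made-up sample data, not the Spark reader itself; Spark's CSV reader exposes a comparable quote option, which is worth checking against the linked documentation for the version in use.

```python
import csv
import io

# Made-up sample: the name field contains a comma, so it is wrapped in quotes.
raw = 'id,name,city\n1,"Doe, Jane",Pune\n'

# A quote-aware parser keeps the quoted field together as one column.
rows = list(csv.reader(io.StringIO(raw), quotechar='"'))
print(rows[1])  # ['1', 'Doe, Jane', 'Pune']

# A naive split on commas breaks the same line into four pieces.
naive = raw.splitlines()[1].split(",")
print(len(naive))  # 4
```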

1 More Replies

by SwapanSwapandee • New Contributor II

10-26-2019 8:28:02 PM

6870 Views
2 replies
0 kudos

How to pass column names in selectExpr through one or more string parameters in spark using scala?

I am using a script for CDC merge in Spark Streaming. I wish to pass the column names in selectExpr through a parameter, as the column names change for each table. When I pass the columns and struct field through a string variable, I get an error as...

Latest Reply

shyam_9
Valued Contributor

10-28-2019 10:40:48 PM

0 kudos

Hi @Swapan Swapandeep Marwaha, can you pass them as a Seq, as in the code below? keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols")

1 More Replies

Databricks Community