Topics with Label: Spark--dataframe

Forum Posts

by DineshKumar • New Contributor III

08-24-2020 9:52:19 AM

33341 Views
5 replies
2 kudos

Spark Read CSV doesn't preserve the double quotes while reading!

Hi, I am trying to read a CSV file in which one column contains double quotes, like below:
James,Butt,"Benton, John B Jr",6649 N Blue Gum St
Josephine,Darakjy,"Chanay, Jeffrey A Esq",4 B Blue Ridge Blvd
Art,Venere,"Chemel, James L Cpa",8 W Cerritos Ave #54...

Data Engineering

Latest Reply

LearningAj
New Contributor II

08-10-2023 12:08:21 PM

2 kudos

Hi Team, I am also facing the same issue, and I have applied all the options mentioned in the posts above. I will just post my dataset here: attached is my input data with 3 different columns, of which the comment column contains text values with double quotes...

by hamzatazib96 • New Contributor III

08-18-2021 9:11:46 AM

88087 Views
21 replies
12 kudos

Resolved! Read file from dbfs with pd.read_csv() using databricks-connect

Hello all, As described in the title, here's my problem:
1. I'm using databricks-connect in order to send jobs to a Databricks cluster
2. The "local" environment is an AWS EC2
3. I want to read a CSV file that is in DBFS (Databricks) with pd.read_cs...

Data Engineering

Latest Reply

so16
New Contributor II

07-19-2023 1:13:17 PM

12 kudos

Please, I need your help: I still have the same issue after reading all your comments. I am using Databricks Connect (version 13.1) in PyCharm and trying to load files that are in DBFS storage. spark = DatabricksSession.builder.remote( host=host...

by NithinTiruveedh • New Contributor II

06-20-2016 11:59:28 AM

29098 Views
12 replies
0 kudos

How can I split a Spark Dataframe into n equal Dataframes (by rows)? I tried to add a Row ID column to achieve this but was unsuccessful.

I have a dataframe that has 5M rows. I need to split it up into 5 dataframes of ~1M rows each. This would be easy if I could create a column that contains Row ID. Is that possible?
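
A minimal sketch of one possible approach in PySpark, assuming `df` is an existing PySpark DataFrame. It uses `DataFrame.randomSplit` with equal weights, which yields pieces of approximately (not exactly) equal size and assigns rows randomly rather than in order. The helper name `split_into_n` and the default seed are hypothetical, chosen only for illustration.

```python
def split_into_n(df, n, seed=42):
    """Split a PySpark DataFrame into n DataFrames of roughly equal size.

    Assumption: `df` is a pyspark.sql.DataFrame. The resulting sizes are
    only approximately equal, and row order is not preserved.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Equal weights give each piece the same share of the rows;
    # Spark normalizes the weights if they do not sum to 1.
    return df.randomSplit([1.0] * n, seed=seed)


# Usage (hypothetical): five pieces of about 1M rows each from a 5M-row df
# parts = split_into_n(df, 5)
```

If the pieces must be exactly equal or follow the original row order, a Row ID column would still be needed; this sketch does not cover that case.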

Data Engineering

Latest Reply

Anonymous
Not applicable

07-12-2023 1:19:12 AM

0 kudos

Hi @NithinTiruveedh Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answ...

by Shridhar • New Contributor

10-17-2018 6:24:35 PM

18148 Views
2 replies
2 kudos

Resolved! Load multiple csv files into a dataframe in order

I can load multiple csv files by doing something like:
paths = ["file_1", "file_2", "file_3"]
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(paths)
But this doesn't seem to preserve the...

Data Engineering

Latest Reply

Jaswanth_Saniko
New Contributor III

01-12-2022 4:43:10 AM

2 kudos

val diamonds = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/11.csv", "/FileStore/tables/12.csv", "/FileStore/tables/13.csv")
display(diamonds)
This is working for me @Shridhar

by Nazar • New Contributor II

09-23-2021 3:06:15 PM

7273 Views
3 replies
4 kudos

Resolved! Incremental write

Hi All, I have a daily Spark job that reads and joins 3-4 source tables and writes the df in Parquet format. This data frame consists of 100+ columns. As this job runs daily, our deduplication logic identifies the latest record from each of source t...

Data Engineering

Latest Reply

Nazar
New Contributor II

09-27-2021 2:55:33 PM

4 kudos

Thanks werners

by User16790091296 • Contributor II

06-24-2021 8:07:43 AM

9115 Views
1 reply
0 kudos

How to add a new datetime column to a spark dataFrame from existing timestamp column

I have a data frame in Spark that has a column timestamp. I want to add a new column to this data frame that has the DateTime in the format below, created from this existing timestamp column: “YYYY-MM-DD HH:MM:SS”
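
A minimal PySpark sketch of one way to do this, assuming `df` is an existing DataFrame whose `timestamp` column has a timestamp type. It uses the Spark SQL function `date_format` through `selectExpr`. The helper name and the new column name `datetime_str` are hypothetical. Note that Spark's pattern letters are case-sensitive: the format asked for above corresponds to `yyyy-MM-dd HH:mm:ss`, since `MM` means month and `mm` means minutes, while uppercase `DD` would mean day of year.

```python
def add_formatted_datetime(df, ts_col="timestamp", new_col="datetime_str"):
    """Return df with an extra string column formatted as yyyy-MM-dd HH:mm:ss.

    Assumption: `df` is a pyspark.sql.DataFrame and `ts_col` holds timestamps.
    """
    # Backticks guard against column names that contain special characters.
    expr = f"date_format(`{ts_col}`, 'yyyy-MM-dd HH:mm:ss') AS `{new_col}`"
    # "*" keeps every existing column; the expression appends the new one.
    return df.selectExpr("*", expr)


# Usage (hypothetical):
# df2 = add_formatted_datetime(df)
```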

Data Engineering

Latest Reply

Srikanth_Gupta_
Databricks Employee

06-25-2021 6:07:59 AM

0 kudos

val df = Seq(("2021-11-05 02:46:47.154410"),("2019-10-05 2:46:47.154410")).toDF("old_column")
display(df)
import org.apache.spark.sql.functions._
val df2 = df.withColumn("new_column", from_unixtime(unix_timestamp(col("old_column"), "yyyy-MM-dd HH:mm:ss....

by User16826992666 • Valued Contributor

06-22-2021 7:15:48 PM

36083 Views
3 replies
1 kudos

Resolved! When I save a Spark dataframe using df.write.format("csv"), I end up with multiple csv files. Why is this happening?

Data Engineering

Latest Reply

brickster_2018
Databricks Employee

06-23-2021 12:12:11 PM

1 kudos

This is by design and working as expected. Spark writes data in a distributed fashion, with each task producing its own output file. Using coalesce(1) can help to generate a single file; however, this solution does not scale to large datasets, because it brings all the data into one single task.
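
As a rough illustration of the coalesce(1) approach in PySpark, assuming `df` is an existing DataFrame: the helper name, the header option, and the overwrite mode below are assumptions made for the example, not part of the reply. Spark still writes a directory at the given path; with a single partition that directory contains one part file.

```python
def write_single_csv(df, output_dir):
    """Write df as CSV using a single partition, giving one part file.

    Assumption: `df` is a pyspark.sql.DataFrame small enough for one task.
    """
    (
        df.coalesce(1)              # collapse to one partition, so one task writes
        .write.format("csv")
        .option("header", "true")   # assumption: a header row is wanted
        .mode("overwrite")          # assumption: replacing old output is acceptable
        .save(output_dir)           # output_dir is a directory, not a file name
    )


# Usage (hypothetical path):
# write_single_csv(df, "/tmp/single_csv_output")
```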

by User16776430979 • New Contributor III

06-07-2021 9:51:14 AM

3455 Views
0 replies
0 kudos

How to optimize and convert a Spark DataFrame to Arrow?

Example use case: When connecting a sample Plotly Dash application to a large dataset, in order to test the performance, I need the file format to be in either hdf5 or arrow. According to this doc: Optimize conversion between PySpark and pandas DataF...

Data Engineering

by HarisKhan • New Contributor

04-12-2020 5:32:03 AM

12850 Views
2 replies
0 kudos

Escape Backslash(/) while writing spark dataframe into csv

I am using spark version 2.4.0. I know that Backslash is default escape character in spark but still I am facing below issue. I am reading a csv file into a spark dataframe (using pyspark language) and writing back the dataframe into csv. I have so...

Data Engineering

Latest Reply

sean_owen
Databricks Employee

04-17-2020 2:19:09 PM

0 kudos

I'm confused - you say the escape is backslash, but you show forward slashes in your data. Don't you want the escape to be forward slash?

by bhosskie • New Contributor

05-13-2016 1:33:41 PM

19992 Views
9 replies
0 kudos

How to merge two data frames column-wise in Apache Spark

I have the following two data frames which have just one column each and have exact same number of rows. How do I merge them so that I get a new data frame which has the two columns and all rows from both the data frames. For example, df1: +-----+...

Data Engineering

Latest Reply

AmolZinjade
New Contributor II

12-16-2020 9:36:04 AM

0 kudos

@bhosskie
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlDF1 = spark.sql("select count(*) as Total FROM user_summary")
sqlDF2 = sp...

by MithuWagh • New Contributor

12-24-2019 4:14:09 AM

9235 Views
1 reply
0 kudos

How to deal with column name with .(dot) in pyspark dataframe??

We are streaming data from a Kafka source as JSON, but some columns have a . (dot) in their names. Streaming JSON data: df1 = df.selectExpr("CAST(value AS STRING)") {"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...

Data Engineering

Latest Reply

shyam_9
Databricks Employee

12-30-2019 3:27:03 AM

0 kudos

Hi @Mithu Wagh, you can use backticks to enclose the column name: df.select("`col0.1`")
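
The backtick suggestion can be wrapped in a small helper, sketched here under the assumption that column names arrive as plain strings. The function name `quote_column` is hypothetical. It also doubles any backtick already present in the name, which is how Spark SQL escapes a backtick inside a quoted identifier.

```python
def quote_column(name):
    """Wrap a column name in backticks so a dot is read literally.

    Without the backticks, Spark treats "col0.1" as field "1" of a
    struct column named "col0".
    """
    return "`" + name.replace("`", "``") + "`"


# Usage (hypothetical DataFrame `df`):
# df.select(quote_column("col0.1"))
```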

by SwapanSwapandee • New Contributor II

10-26-2019 8:28:02 PM

9180 Views
2 replies
0 kudos

How to pass column names in selectExpr through one or more string parameters in spark using scala?

I am using a script for CDC Merge in Spark streaming. I wish to pass column values in selectExpr through a parameter, as the column names for each table would change. When I pass the columns and struct field through a string variable, I am getting error as...

Data Engineering

Latest Reply

shyam_9
Databricks Employee

10-28-2019 10:40:48 PM

0 kudos

Hi @Swapan Swapandeep Marwaha, can you pass them as a Seq, as in the code below? keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols")
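
The same idea, sketched in PySpark rather than Scala and assuming `df` is an existing DataFrame: keep each expression as its own string in a list and unpack the lists into `selectExpr`, instead of joining everything into one comma-separated string. The helper name is hypothetical. In Scala, the equivalent call would pass the combined Seq as varargs, for example `df.selectExpr((keyCols ++ structCols): _*)`.

```python
def select_with_params(df, key_cols, struct_cols):
    """Call selectExpr with expressions supplied as two lists of strings.

    Assumption: `df` is a pyspark.sql.DataFrame; each list element is one
    complete SQL expression.
    """
    # Unpacking passes every expression as a separate argument.
    return df.selectExpr(*key_cols, *struct_cols)


# Usage, with the values from the reply above:
# key_cols = ["col1", "col2"]
# struct_cols = ["struct(offset,KAFKA_TS) as otherCols"]
# result = select_with_params(df, key_cols, struct_cols)
```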
