cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Forum Posts

DineshKumar
by New Contributor III
  • 19020 Views
  • 5 replies
  • 2 kudos

Spark Read CSV doesn't preserve the double quotes while reading!

Hi , I am trying to read a csv file with one column has double quotes like below. James,Butt,"Benton, John B Jr",6649 N Blue Gum St Josephine,Darakjy,"Chanay, Jeffrey A Esq",4 B Blue Ridge Blvd Art,Venere,"Chemel, James L Cpa",8 W Cerritos Ave #54...

  • 19020 Views
  • 5 replies
  • 2 kudos
Latest Reply
LearningAj
New Contributor II
  • 2 kudos

Hi Team,I am also facing same issue and i have applied all the option mentioned from above posts:I will just post my dataset here:Attached is the my input data with 3 different column out of which comment column contains text value with double quotes...

  • 2 kudos
4 More Replies
hamzatazib96
by New Contributor III
  • 44523 Views
  • 28 replies
  • 12 kudos

Resolved! Read file from dbfs with pd.read_csv() using databricks-connect

Hello all, As described in the title, here's my problem: 1. I'm using databricks-connect in order to send jobs to a databricks cluster 2. The "local" environment is an AWS EC2 3. I want to read a CSV file that is in DBFS (databricks) with pd.read_cs...

  • 44523 Views
  • 28 replies
  • 12 kudos
Latest Reply
so16
New Contributor II
  • 12 kudos

Please guys I need your help, I got the same issue still after readed all your comments.I am using Databricks-connect(version 13.1) on pycharm and trying to load file that are on the dbfs storage.spark = DatabricksSession.builder.remote( host=host...

  • 12 kudos
27 More Replies
NithinTiruveedh
by New Contributor II
  • 15981 Views
  • 12 replies
  • 0 kudos

How can I split a Spark Dataframe into n equal Dataframes (by rows)? I tried to add a Row ID column to acheive this but was unsuccessful.

I have a dataframe that has 5M rows. I need to split it up into 5 dataframes of ~1M rows each. This would be easy if I could create a column that contains Row ID. Is that possible?

  • 15981 Views
  • 12 replies
  • 0 kudos
Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @NithinTiruveedh  Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answ...

  • 0 kudos
11 More Replies
Shridhar
by New Contributor
  • 12643 Views
  • 2 replies
  • 2 kudos

Resolved! Load multiple csv files into a dataframe in order

I can load multiple csv files by doing something like: paths = ["file_1", "file_2", "file_3"] df = sqlContext.read .format("com.databricks.spark.csv") .option("header", "true") .load(paths) But this doesn't seem to preserve the...

  • 12643 Views
  • 2 replies
  • 2 kudos
Latest Reply
Jaswanth_Saniko
New Contributor III
  • 2 kudos

val diamonds = spark.read.format("csv") .option("header", "true") .option("inferSchema", "true") .load("/FileStore/tables/11.csv","/FileStore/tables/12.csv","/FileStore/tables/13.csv")   display(diamonds)This is working for me @Shridhar​ 

  • 2 kudos
1 More Replies
Nazar
by New Contributor II
  • 3815 Views
  • 5 replies
  • 5 kudos

Resolved! Incremental write

Hi All,I have a daily spark job that reads and joins 3-4 source tables and writes the df in a parquet format. This data frame consists of 100+ columns. As this job run daily, our deduplication logic identifies the latest record from each of source t...

  • 3815 Views
  • 5 replies
  • 5 kudos
Latest Reply
Nazar
New Contributor II
  • 5 kudos

Thanks werners

  • 5 kudos
4 More Replies
User16776430979
by New Contributor III
  • 1635 Views
  • 1 replies
  • 0 kudos

How to optimize and convert a Spark DataFrame to Arrow?

Example use case: When connecting a sample Plotly Dash application to a large dataset, in order to test the performance, I need the file format to be in either hdf5 or arrow. According to this doc: Optimize conversion between PySpark and pandas DataF...

  • 1635 Views
  • 1 replies
  • 0 kudos
Latest Reply
Kaniz
Community Manager
  • 0 kudos

Hi @ josephine.ho! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the Forum have an answer to your questions first. Or else I will follow up shortly with a response.

  • 0 kudos
User16790091296
by Contributor II
  • 4996 Views
  • 3 replies
  • 0 kudos

How to add a new datetime column to a spark dataFrame from existing timestamp column

I have a data frame in Spark that has a column timestamp. I want to add a new column to this data frame that has the DateTime in the below format created from this existing timestamp column.“YYYY-MM-DD HH:MM:SS”

  • 4996 Views
  • 3 replies
  • 0 kudos
Latest Reply
Kaniz
Community Manager
  • 0 kudos

from pyspark import SparkContextfrom pyspark.sql import SQLContextfrom functools import reduceimport pyspark.sql.functions as Fsc = SparkContext.getOrCreate()sql = SQLContext(sc)input_list = [ (1,"2019-11-07 10:30:00")  ,(1,"2019-11-08 10:30:00")  ,(...

  • 0 kudos
2 More Replies
User16826992666
by Valued Contributor
  • 12511 Views
  • 3 replies
  • 0 kudos
  • 12511 Views
  • 3 replies
  • 0 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

This is by design and working as expected. Spark writes the data distributedly. use of coalesce (1) can help to generate one file, however this solution is not scalable for large data set as it involves bringing the data to one single task.

  • 0 kudos
2 More Replies
HarisKhan
by New Contributor
  • 8671 Views
  • 2 replies
  • 0 kudos

Escape Backslash(/) while writing spark dataframe into csv

I am using spark version 2.4.0. I know that Backslash is default escape character in spark but still I am facing below issue. I am reading a csv file into a spark dataframe (using pyspark language) and writing back the dataframe into csv. I have so...

  • 8671 Views
  • 2 replies
  • 0 kudos
Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

I'm confused - you say the escape is backslash, but you show forward slashes in your data. Don't you want the escape to be forward slash?

  • 0 kudos
1 More Replies
bhosskie
by New Contributor
  • 8604 Views
  • 9 replies
  • 0 kudos

How to merge two data frames column-wise in Apache Spark

I have the following two data frames which have just one column each and have exact same number of rows. How do I merge them so that I get a new data frame which has the two columns and all rows from both the data frames. For example, df1: +-----+...

  • 8604 Views
  • 9 replies
  • 0 kudos
Latest Reply
AmolZinjade
New Contributor II
  • 0 kudos

@bhosskie from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate() sc = spark.sparkContext sqlDF1 = spark.sql("select count(*) as Total FROM user_summary") sqlDF2 = sp...

  • 0 kudos
8 More Replies
MithuWagh
by New Contributor
  • 5325 Views
  • 1 replies
  • 0 kudos

How to deal with column name with .(dot) in pyspark dataframe??

We are streaming data from kafka source with json but in some column we are getting .(dot) in column names.streaming json data: df1 = df.selectExpr("CAST(value AS STRING)") {"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...

  • 5325 Views
  • 1 replies
  • 0 kudos
Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Mithu Wagh you can use backticks to enclose the column name.df.select("`col0.1`")

  • 0 kudos
SwapanSwapandee
by New Contributor II
  • 6623 Views
  • 2 replies
  • 0 kudos

How to pass column names in selectExpr through one or more string parameters in spark using scala?

I am using script for CDC Merge in spark streaming. I wish to pass column values in selectExpr through a parameter as column names for each table would change. When I pass the columns and struct field through a string variable, I am getting error as...

  • 6623 Views
  • 2 replies
  • 0 kudos
Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Swapan Swapandeep Marwaha, Can you pass them as a Seq as in below code, keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols")

  • 0 kudos
1 More Replies
Labels