Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

DineshKumar
by New Contributor III
  • 25153 Views
  • 5 replies
  • 2 kudos

Spark Read CSV doesn't preserve the double quotes while reading!

Hi, I am trying to read a CSV file in which one column contains double quotes, like below:
James,Butt,"Benton, John B Jr",6649 N Blue Gum St
Josephine,Darakjy,"Chanay, Jeffrey A Esq",4 B Blue Ridge Blvd
Art,Venere,"Chemel, James L Cpa",8 W Cerritos Ave #54...

Latest Reply
LearningAj
New Contributor II
  • 2 kudos

Hi Team, I am also facing the same issue, and I have applied all the options mentioned in the posts above. I will just post my dataset here. Attached is my input data with 3 different columns, of which the comment column contains text values with double quotes...
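For anyone landing here, a minimal PySpark sketch consolidating the options discussed in this thread (quote and escape both set to the double-quote character); the file path is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Treat " as both the quote and the escape character, so embedded
# quotes like "Benton, John B Jr" survive the read intact.
df = (spark.read
      .option("header", "true")
      .option("quote", '"')
      .option("escape", '"')
      .csv("/FileStore/tables/people.csv"))  # hypothetical path

df.show(truncate=False)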

4 More Replies
hamzatazib96
by New Contributor III
  • 68391 Views
  • 21 replies
  • 12 kudos

Resolved! Read file from dbfs with pd.read_csv() using databricks-connect

Hello all, as described in the title, here's my problem:
1. I'm using databricks-connect in order to send jobs to a databricks cluster
2. The "local" environment is an AWS EC2
3. I want to read a CSV file that is in DBFS (databricks) with pd.read_cs...

Latest Reply
so16
New Contributor II
  • 12 kudos

Please, I need your help; I still get the same issue after reading all your comments. I am using Databricks-connect (version 13.1) on PyCharm and trying to load files that are on DBFS storage. spark = DatabricksSession.builder.remote( host=host...
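One workaround that fits this setup (a sketch, assuming Databricks Connect v13+; the path is hypothetical): read the file with Spark, which executes on the cluster where dbfs:/ is visible, then convert to pandas locally.

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# The read runs on the cluster (where dbfs:/ resolves); toPandas()
# ships the resulting rows back to the local pandas process.
pdf = (spark.read
       .option("header", "true")
       .csv("dbfs:/FileStore/tables/my_file.csv")  # hypothetical path
       .toPandas())

print(pdf.head())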

20 More Replies
NithinTiruveedh
by New Contributor II
  • 21818 Views
  • 12 replies
  • 0 kudos

How can I split a Spark Dataframe into n equal Dataframes (by rows)? I tried to add a Row ID column to achieve this but was unsuccessful.

I have a dataframe that has 5M rows. I need to split it up into 5 dataframes of ~1M rows each. This would be easy if I could create a column that contains Row ID. Is that possible?
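One way to do this (a sketch, not necessarily the thread's accepted answer): add a row-number-style bucket over a global window with ntile. Variable names are hypothetical.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(5_000_000).toDF("value")  # stand-in for the 5M-row frame

n = 5
# ntile(n) assigns each row to one of n near-equal buckets.  The global
# window (no partitionBy) funnels all rows through one task, so this
# works for millions of rows but is not arbitrarily scalable.
w = Window.orderBy(F.monotonically_increasing_id())
bucketed = df.withColumn("bucket", F.ntile(n).over(w))
parts = [bucketed.where(F.col("bucket") == i).drop("bucket")
         for i in range(1, n + 1)]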

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @NithinTiruveedh  Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answ...

11 More Replies
Shridhar
by New Contributor
  • 15693 Views
  • 2 replies
  • 2 kudos

Resolved! Load multiple csv files into a dataframe in order

I can load multiple csv files by doing something like:

paths = ["file_1", "file_2", "file_3"]
df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load(paths)

But this doesn't seem to preserve the...

Latest Reply
Jaswanth_Saniko
New Contributor III
  • 2 kudos

val diamonds = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/11.csv", "/FileStore/tables/12.csv", "/FileStore/tables/13.csv")

display(diamonds)

This is working for me @Shridhar
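On the original ordering question, a PySpark sketch (paths taken from the reply above): tag each row with its source file via input_file_name() and sort on that column, since Spark does not otherwise guarantee file order.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
paths = ["/FileStore/tables/11.csv", "/FileStore/tables/12.csv", "/FileStore/tables/13.csv"]

# Record the originating file for every row, then sort to restore order.
df = (spark.read
      .option("header", "true")
      .csv(paths)
      .withColumn("source_file", F.input_file_name()))

df.orderBy("source_file").show()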

1 More Replies
Nazar
by New Contributor II
  • 5640 Views
  • 3 replies
  • 4 kudos

Resolved! Incremental write

Hi All, I have a daily Spark job that reads and joins 3-4 source tables and writes the df in parquet format. This data frame consists of 100+ columns. As this job runs daily, our deduplication logic identifies the latest record from each of the source t...
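For context, a minimal sketch of one common incremental-write pattern on Databricks, a Delta MERGE that upserts the latest record per key (table and column names are hypothetical; the thread's accepted answer is not shown here):

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

updates = spark.table("staging_daily_batch")        # hypothetical deduplicated batch
target = DeltaTable.forName(spark, "target_table")  # hypothetical Delta target

# Upsert: update matched keys with the newest record, insert new keys.
(target.alias("t")
   .merge(updates.alias("s"), "t.id = s.id")        # hypothetical join key
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())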

Latest Reply
Nazar
New Contributor II
  • 4 kudos

Thanks werners

2 More Replies
User16790091296
by Contributor II
  • 7300 Views
  • 1 replies
  • 0 kudos

How to add a new datetime column to a spark dataFrame from existing timestamp column

I have a data frame in Spark that has a timestamp column. I want to add a new column to this data frame containing the DateTime in the format below, created from the existing timestamp column: "YYYY-MM-DD HH:MM:SS"

Latest Reply
Srikanth_Gupta_
Valued Contributor
  • 0 kudos

val df = Seq(("2021-11-05 02:46:47.154410"), ("2019-10-05 2:46:47.154410")).toDF("old_column")
display(df)

import org.apache.spark.sql.functions._
val df2 = df.withColumn("new_column", from_unixtime(unix_timestamp(col("old_column"), "yyyy-MM-dd HH:mm:ss....
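The same idea in PySpark, using to_timestamp plus date_format (a sketch; to_timestamp's default parser handles the fractional seconds in Spark 3.x):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2021-11-05 02:46:47.154410",)], ["old_column"])

# Parse the string into a timestamp, then render it in the target format.
df2 = df.withColumn(
    "new_column",
    F.date_format(F.to_timestamp("old_column"), "yyyy-MM-dd HH:mm:ss"))

df2.show(truncate=False)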

User16826992666
by Valued Contributor
  • 26059 Views
  • 3 replies
  • 1 kudos
Latest Reply
brickster_2018
Databricks Employee
  • 1 kudos

This is by design and working as expected: Spark writes the data in a distributed manner. Using coalesce(1) can help generate one file; however, this solution is not scalable for large datasets, as it brings all the data into a single task.
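A minimal sketch of the coalesce(1) approach described above (the output path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)  # stand-in DataFrame

# coalesce(1) funnels all rows through a single task, producing one
# output file; acceptable for small results, not for large datasets.
(df.coalesce(1)
   .write
   .mode("overwrite")
   .csv("/FileStore/tables/single_file_out"))  # hypothetical path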

2 More Replies
User16776430979
by New Contributor III
  • 2457 Views
  • 0 replies
  • 0 kudos

How to optimize and convert a Spark DataFrame to Arrow?

Example use case: When connecting a sample Plotly Dash application to a large dataset, in order to test the performance, I need the file format to be in either hdf5 or arrow. According to this doc: Optimize conversion between PySpark and pandas DataF...
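A sketch of the Arrow-backed path that doc describes (the conf name is per Spark 3.x; the output path is hypothetical):

import pyarrow as pa
import pyarrow.feather as feather
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Arrow-accelerated conversion between Spark and pandas.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(1_000_000)
table = pa.Table.from_pandas(df.toPandas())  # Spark -> pandas -> Arrow Table
feather.write_feather(table, "/tmp/sample.feather")  # hypothetical output path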

HarisKhan
by New Contributor
  • 10931 Views
  • 2 replies
  • 0 kudos

Escape Backslash(/) while writing spark dataframe into csv

I am using Spark version 2.4.0. I know that backslash is the default escape character in Spark, but I am still facing the issue below. I am reading a CSV file into a Spark dataframe (using PySpark) and writing the dataframe back to CSV. I have so...
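For reference, a PySpark sketch of setting quote/escape explicitly on both read and write (paths and characters are hypothetical; adjust to the character actually in your data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# State the escape character explicitly on both sides, so embedded
# quotes round-trip instead of picking up extra escape characters.
df = (spark.read
      .option("header", "true")
      .option("quote", '"')
      .option("escape", "\\")
      .csv("/tmp/in_csv"))      # hypothetical input path

(df.write
   .option("quote", '"')
   .option("escape", "\\")
   .mode("overwrite")
   .csv("/tmp/out_csv"))        # hypothetical output path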

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

I'm confused - you say the escape is backslash, but you show forward slashes in your data. Don't you want the escape to be forward slash?

1 More Replies
bhosskie
by New Contributor
  • 14406 Views
  • 9 replies
  • 0 kudos

How to merge two data frames column-wise in Apache Spark

I have the following two data frames, which have just one column each and have the exact same number of rows. How do I merge them so that I get a new data frame which has the two columns and all rows from both data frames? For example, df1: +-----+...

Latest Reply
AmolZinjade
New Contributor II
  • 0 kudos

@bhosskie

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()
sc = spark.sparkContext

sqlDF1 = spark.sql("select count(*) as Total FROM user_summary")
sqlDF2 = sp...
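Since the reply above is truncated, here is a sketch of one common column-wise merge: give both frames a positional index, then join on it (names hypothetical).

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1,), (2,), (3,)], ["a"])
df2 = spark.createDataFrame([("x",), ("y",), ("z",)], ["b"])

# Row numbers over a global window act as a positional join key; the
# global window serializes rows through one task, so use on modest data.
w = Window.orderBy(F.monotonically_increasing_id())
df1i = df1.withColumn("rid", F.row_number().over(w))
df2i = df2.withColumn("rid", F.row_number().over(w))

merged = df1i.join(df2i, "rid").drop("rid")
merged.show()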

8 More Replies
MithuWagh
by New Contributor
  • 7162 Views
  • 1 replies
  • 0 kudos

How to deal with column name with .(dot) in pyspark dataframe??

We are streaming data from a Kafka source as JSON, but in some columns we are getting a .(dot) in the column names.
Streaming JSON data:
df1 = df.selectExpr("CAST(value AS STRING)")
{"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Mithu Wagh, you can use backticks to enclose the column name: df.select("`col0.1`")
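Building on that, a small sketch: select with backticks, or rename the dotted columns once so downstream code doesn't need quoting (column names hypothetical).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "IN")], ["payload.TARGET", "payload.COUNTRY"])

df.select("`payload.TARGET`").show()  # backtick-quoted select

# Rename the dots away once, then reference columns normally.
clean = df.toDF(*[c.replace(".", "_") for c in df.columns])
clean.printSchema()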

SwapanSwapandee
by New Contributor II
  • 7843 Views
  • 2 replies
  • 0 kudos

How to pass column names in selectExpr through one or more string parameters in spark using scala?

I am using a script for CDC Merge in Spark streaming. I wish to pass column values to selectExpr through a parameter, since the column names for each table will change. When I pass the columns and struct field through a string variable, I am getting an error as...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Swapan Swapandeep Marwaha, can you pass them as a Seq, as in the code below?
keyCols = Seq("col1", "col2")
structCols = Seq("struct(offset,KAFKA_TS) as otherCols")
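The PySpark equivalent (a sketch with hypothetical column names): build the expression strings as an ordinary list and unpack it into selectExpr.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 10, "ts")], ["col1", "col2", "offset", "KAFKA_TS"])

key_cols = ["col1", "col2"]                           # hypothetical key columns
struct_cols = ["struct(offset, KAFKA_TS) as otherCols"]

# selectExpr accepts varargs, so unpack the combined list.
result = df.selectExpr(*(key_cols + struct_cols))
result.show()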

1 More Replies