- 30304 Views
- 5 replies
- 2 kudos
Hi, I am trying to read a CSV file in which one column contains double quotes, like below.
James,Butt,"Benton, John B Jr",6649 N Blue Gum St
Josephine,Darakjy,"Chanay, Jeffrey A Esq",4 B Blue Ridge Blvd
Art,Venere,"Chemel, James L Cpa",8 W Cerritos Ave #54...
Latest Reply
Hi Team, I am also facing the same issue and I have applied all the options mentioned in the posts above. I will just post my dataset here: attached is my input data with 3 different columns, of which the comment column contains text values with double quotes...
4 More Replies
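For the quoted-comma data in the question above, a minimal PySpark sketch (the file path and option values are assumptions, not taken from the thread) would be:
# headerless sample rows like the ones above; the path is hypothetical
df = (spark.read
      .format("csv")
      .option("header", "false")
      .option("quote", '"')   # keeps "Benton, John B Jr" as a single field
      .option("escape", '"')  # handles doubled quotes inside quoted fields
      .load("/FileStore/tables/customers.csv"))
df.show(truncate=False)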
- 74475 Views
- 21 replies
- 12 kudos
Hello all,
As described in the title, here's my problem:
1. I'm using databricks-connect in order to send jobs to a Databricks cluster
2. The "local" environment is an AWS EC2 instance
3. I want to read a CSV file that is in DBFS (Databricks) with pd.read_cs...
Latest Reply
Please, I need your help; I still get the same issue after reading all your comments. I am using Databricks Connect (version 13.1) on PyCharm and am trying to load files that are on DBFS storage. spark = DatabricksSession.builder.remote( host=host...
20 More Replies
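One common workaround for the DBFS question above, sketched under the assumption that Databricks Connect (13.x) is already configured: pd.read_csv runs locally and cannot see dbfs:/ paths, so read the file with Spark on the cluster and convert the result to pandas.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()  # connection details come from your config profile
sdf = spark.read.csv("dbfs:/FileStore/tables/my_data.csv", header=True, inferSchema=True)  # hypothetical path
pdf = sdf.toPandas()  # materializes the data locally as a pandas DataFrame
print(pdf.head())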
- 23997 Views
- 12 replies
- 0 kudos
I have a dataframe that has 5M rows. I need to split it up into 5 dataframes of ~1M rows each.
This would be easy if I could create a column that contains Row ID. Is that possible?
Latest Reply
Hi @NithinTiruveedh
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answ...
11 More Replies
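One way to approach the row-ID question above, as a sketch only (df is the 5M-row frame from the question; "id_col" is a hypothetical ordering column):
from pyspark.sql import functions as F, Window

w = Window.orderBy("id_col")  # any deterministic ordering column works
df_with_id = df.withColumn("row_id", F.row_number().over(w))

n = df_with_id.count()
chunk = (n + 4) // 5  # ~1M rows per chunk for a 5M-row frame
splits = [
    df_with_id.filter((F.col("row_id") > i * chunk) & (F.col("row_id") <= (i + 1) * chunk))
    for i in range(5)
]
If the pieces do not need to be exactly equal, df.randomSplit([1.0] * 5) is a simpler alternative.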
- 16687 Views
- 2 replies
- 2 kudos
I can load multiple csv files by doing something like:
paths = ["file_1", "file_2", "file_3"]
df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load(paths)
But this doesn't seem to preserve the...
Latest Reply
val diamonds = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/FileStore/tables/11.csv", "/FileStore/tables/12.csv", "/FileStore/tables/13.csv")
display(diamonds)
This is working for me @Shridhar
1 More Replies
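For completeness, a PySpark version of the reply above (same hypothetical paths) might look like:
diamonds = (spark.read.format("csv")
            .option("header", "true")
            .option("inferSchema", "true")
            .load(["/FileStore/tables/11.csv", "/FileStore/tables/12.csv", "/FileStore/tables/13.csv"]))
display(diamonds)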
by Nazar • New Contributor II
- 6089 Views
- 3 replies
- 4 kudos
Hi All, I have a daily Spark job that reads and joins 3-4 source tables and writes the df in Parquet format. This data frame consists of 100+ columns. As this job runs daily, our deduplication logic identifies the latest record from each of the source t...
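A common pattern for the "latest record per key" logic described above, sketched with hypothetical column names (key_col, updated_at) and output path:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("key_col").orderBy(F.col("updated_at").desc())
deduped = (df.withColumn("rn", F.row_number().over(w))  # rank records per key, newest first
             .filter(F.col("rn") == 1)
             .drop("rn"))
deduped.write.mode("overwrite").parquet("/mnt/output/daily_table")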
- 8021 Views
- 1 reply
- 0 kudos
I have a data frame in Spark that has a timestamp column. I want to add a new column to this data frame that holds the DateTime, created from the existing timestamp column, in the format “YYYY-MM-DD HH:MM:SS”.
Latest Reply
val df = Seq(("2021-11-05 02:46:47.154410"), ("2019-10-05 2:46:47.154410")).toDF("old_column")
display(df)
import org.apache.spark.sql.functions._
val df2 = df.withColumn("new_column", from_unixtime(unix_timestamp(col("old_column"), "yyyy-MM-dd HH:mm:ss....
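A PySpark sketch of the same idea, with illustrative values:
from pyspark.sql import functions as F

df = spark.createDataFrame([("2021-11-05 02:46:47.154410",), ("2019-10-05 02:46:47.154410",)], ["old_column"])
df2 = df.withColumn("new_column", F.date_format(F.col("old_column").cast("timestamp"), "yyyy-MM-dd HH:mm:ss"))
df2.show(truncate=False)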
- 2804 Views
- 0 replies
- 0 kudos
Example use case: When connecting a sample Plotly Dash application to a large dataset, in order to test the performance, I need the file format to be in either hdf5 or arrow. According to this doc: Optimize conversion between PySpark and pandas DataF...
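One possible approach for the Arrow question above, assuming the data fits in driver memory (the table name and output path are hypothetical):
import pyarrow.feather as feather

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # Arrow-accelerated toPandas
pdf = spark.table("my_large_table").toPandas()
feather.write_feather(pdf, "/dbfs/tmp/my_large_table.feather")  # Arrow/Feather file the Dash app can read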
- 11687 Views
- 2 replies
- 0 kudos
I am using Spark version 2.4.0. I know that backslash is the default escape character in Spark, but I am still facing the issue below.
I am reading a CSV file into a Spark dataframe (using the PySpark API) and writing the dataframe back to CSV. I have so...
Latest Reply
I'm confused - you say the escape is backslash, but you show forward slashes in your data. Don't you want the escape to be forward slash?
1 More Replies
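A sketch that often helps with round-tripping quoted data in this situation; the paths are hypothetical and the escape choice is an assumption, not taken from the thread:
df = (spark.read
      .option("header", "true")
      .option("quote", '"')
      .option("escape", '"')   # set explicitly instead of relying on the backslash default
      .csv("/mnt/input/source.csv"))

(df.write
   .option("header", "true")
   .option("quote", '"')
   .option("escape", '"')
   .mode("overwrite")
   .csv("/mnt/output/result"))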
- 16745 Views
- 9 replies
- 0 kudos
I have the following two data frames, which have just one column each and exactly the same number of rows. How do I merge them so that I get a new data frame with the two columns and all rows from both data frames? For example,
df1:
+-----+...
Latest Reply
@bhosskie
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlDF1 = spark.sql("select count(*) as Total FROM user_summary")
sqlDF2 = sp...
8 More Replies
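Another way to zip two equally sized single-column frames together, sketched with the df1/df2 names from the question; note that it relies on row order, which Spark does not strictly guarantee after shuffles:
from pyspark.sql import functions as F, Window

w = Window.orderBy(F.monotonically_increasing_id())
a = df1.withColumn("row_id", F.row_number().over(w))
b = df2.withColumn("row_id", F.row_number().over(w))
merged = a.join(b, "row_id").drop("row_id")
merged.show()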
- 7929 Views
- 1 reply
- 0 kudos
We are streaming data from a Kafka source with JSON, but in some columns we are getting a .(dot) in the column names. Streaming JSON data:
df1 = df.selectExpr("CAST(value AS STRING)")
{"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...
Latest Reply
Hi @Mithu Wagh, you can use backticks to enclose the column name: df.select("`col0.1`")
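Building on that reply, a small PySpark sketch (the column name follows the reply's example):
from pyspark.sql import functions as F

df.select(F.col("`col0.1`")).show()  # backticks make Spark treat the dotted name as one column
df_renamed = df.withColumnRenamed("col0.1", "col0_1")  # rename once to avoid backticks downstream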
- 8307 Views
- 2 replies
- 0 kudos
I am using a script for CDC merge in Spark streaming. I wish to pass column values to selectExpr through a parameter, as the column names would change for each table. When I pass the columns and struct field through a string variable, I am getting an error as...
Latest Reply
Hi @Swapan Swapandeep Marwaha, can you pass them as a Seq as in the code below? keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols")
1 More Replies
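A PySpark analogue of the Seq-based reply above, with illustrative column names:
key_cols = ["col1", "col2"]
other_cols = ["struct(offset, KAFKA_TS) as otherCols"]
projected = df.selectExpr(*(key_cols + other_cols))  # selectExpr accepts multiple expression strings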