- 18382 Views
- 5 replies
- 2 kudos
Hi, I am trying to read a CSV file where one column contains double quotes, like below.
James,Butt,"Benton, John B Jr",6649 N Blue Gum St
Josephine,Darakjy,"Chanay, Jeffrey A Esq",4 B Blue Ridge Blvd
Art,Venere,"Chemel, James L Cpa",8 W Cerritos Ave #54...
Latest Reply
Hi Team, I am also facing the same issue and have applied all the options mentioned in the posts above. I will just post my dataset here: attached is my input data with 3 different columns, of which the comment column contains text values with double quotes...
4 More Replies
- 42463 Views
- 28 replies
- 12 kudos
Hello all,
As described in the title, here's my problem:
1. I'm using databricks-connect in order to send jobs to a databricks cluster
2. The "local" environment is an AWS EC2
3. I want to read a CSV file that is in DBFS (databricks) with pd.read_cs...
Latest Reply
Please, I need your help; I still have the same issue after reading all your comments. I am using Databricks Connect (version 13.1) on PyCharm and trying to load files that are on DBFS storage.
spark = DatabricksSession.builder.remote( host=host...
27 More Replies
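A sketch of the usual workaround, assuming Databricks Connect 13+ (the host, token, and cluster id below are placeholders, not values from the thread). The key point is that `pd.read_csv` runs on the local machine and cannot see `dbfs:/` paths; read the file with Spark on the cluster, then convert to pandas locally:

```python
# Placeholders only -- fill in your own workspace details.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
    host="https://<workspace>.cloud.databricks.com",
    token="<personal-access-token>",
    cluster_id="<cluster-id>",
).getOrCreate()

# Read on the cluster (which can see DBFS), then pull the result down
# to the local driver as a pandas DataFrame.
pdf = (spark.read
       .option("header", "true")
       .csv("dbfs:/FileStore/tables/data.csv")  # hypothetical path
       .toPandas())
```

This avoids pointing local pandas at a DBFS path it cannot resolve; only the collected result crosses the wire.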
- 15021 Views
- 12 replies
- 0 kudos
I have a dataframe that has 5M rows. I need to split it up into 5 dataframes of ~1M rows each.
This would be easy if I could create a column that contains Row ID. Is that possible?
Latest Reply
Hi @NithinTiruveedh
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answ...
11 More Replies
- 12146 Views
- 2 replies
- 2 kudos
I can load multiple csv files by doing something like:
paths = ["file_1", "file_2", "file_3"]
df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load(paths)
But this doesn't seem to preserve the...
Latest Reply
val diamonds = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/FileStore/tables/11.csv","/FileStore/tables/12.csv","/FileStore/tables/13.csv")
display(diamonds)
This is working for me @Shridhar
1 More Replies
by
Nazar
• New Contributor II
- 3544 Views
- 5 replies
- 5 kudos
Hi All, I have a daily Spark job that reads and joins 3-4 source tables and writes the df in parquet format. This data frame consists of 100+ columns. As this job runs daily, our deduplication logic identifies the latest record from each of source t...
- 1514 Views
- 1 replies
- 0 kudos
Example use case: When connecting a sample Plotly Dash application to a large dataset, in order to test the performance, I need the file format to be in either hdf5 or arrow. According to this doc: Optimize conversion between PySpark and pandas DataF...
Latest Reply
Hi @josephine.ho! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the Forum have an answer to your question first. Otherwise, I will follow up shortly with a response.
- 4642 Views
- 3 replies
- 0 kudos
I have a data frame in Spark that has a timestamp column. I want to add a new column to this data frame containing the DateTime in the format below, created from the existing timestamp column:
“YYYY-MM-DD HH:MM:SS”
Latest Reply
from pyspark import SparkContext
from pyspark.sql import SQLContext
from functools import reduce
import pyspark.sql.functions as F

sc = SparkContext.getOrCreate()
sql = SQLContext(sc)
input_list = [ (1,"2019-11-07 10:30:00")
  ,(1,"2019-11-08 10:30:00")
  ,(...
2 More Replies
- 8394 Views
- 2 replies
- 0 kudos
I am using Spark version 2.4.0. I know that backslash is the default escape character in Spark, but I am still facing the issue below.
I am reading a csv file into a spark dataframe (using pyspark language) and writing back the dataframe into csv. I have so...
Latest Reply
I'm confused - you say the escape is backslash, but you show forward slashes in your data. Don't you want the escape to be forward slash?
1 More Replies
- 8010 Views
- 9 replies
- 0 kudos
I have the following two data frames, which each have just one column and exactly the same number of rows. How do I merge them so that I get a new data frame with the two columns and all rows from both data frames? For example,
df1:
+-----+...
Latest Reply
@bhosskie
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlDF1 = spark.sql("select count(*) as Total FROM user_summary")
sqlDF2 = sp...
8 More Replies
- 5063 Views
- 1 replies
- 0 kudos
We are streaming JSON data from a Kafka source, but in some columns we are getting a .(dot) in the column names.
Streaming JSON data:
df1 = df.selectExpr("CAST(value AS STRING)")
{"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...
Latest Reply
Hi @Mithu Wagh, you can use backticks to enclose the column name:
df.select("`col0.1`")
- 6450 Views
- 2 replies
- 0 kudos
I am using a script for CDC Merge in Spark streaming. I wish to pass column values to selectExpr through a parameter, as the column names would change for each table. When I pass the columns and struct fields through a string variable, I am getting an error as...
Latest Reply
Hi @Swapan Swapandeep Marwaha, can you pass them as a Seq, as in the code below?
keyCols = Seq("col1", "col2")
structCols = Seq("struct(offset,KAFKA_TS) as otherCols")
1 More Replies