DataFrame to CSV write has issues due to multiple commas inside a row value

sai_sathya · 04-11-2024

Hi all

I am working on converting data that contains JSON fields with embedded commas into CSV format. I am facing challenges because the commas within the JSON are being misinterpreted as column delimiters during the conversion.

I tried several ways to modify the data and to escape the commas inside the row value, but none of them worked. At the same time, I need to make sure the JSON keeps correct syntax, because it will be used elsewhere in the project.
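For illustration, here is a minimal sketch of how CSV quoting is generally expected to protect embedded commas without altering the JSON itself. It uses only Python's standard `csv` module rather than Spark, and the shortened `payload` is a made-up stand-in for the real column value, so treat it as a picture of the idea rather than of the actual pipeline.

```python
import csv
import io
import json

# Hypothetical, shortened stand-in for the JSON column value.
payload = json.dumps(
    {"Locale": "en", "SegmentCountry": [{"Country": "US", "IndustrySegment": "IP"}]}
)
row = [1, 2, payload, "https://abc/something/API/2021v3/nothing/data"]

# Write one CSV line; the writer wraps the JSON field in double quotes
# and doubles any double quote inside it.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)
line = buf.getvalue()

# A naive split on commas breaks the JSON field into several pieces.
naive = line.strip().split(",")

# A quote-aware parser recovers the original four fields.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), len(parsed))
print(json.loads(parsed[2]) == json.loads(payload))
```

The point of the sketch is that the commas do not need to be escaped or removed: as long as the writer quotes the field and the reader honours the same quoting rules, the JSON should round-trip unchanged.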

Here is the sample PySpark code and data that I worked with:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType


schema = StructType([
    StructField("IDcol1", IntegerType(), True),
    StructField("IDcol2", IntegerType(), True),
    StructField("AdditionalRequestParameters", StringType(), True),
    StructField("RequestURL", StringType(), True)
])

data = [
    (1, 2, "{'Locale':'en','KnowledgeType':[{'Name':'IndustryKnowledge'},{'Name':'KnowledgeIndustry'}],'SegmentCountry':[{'Country':'US','IndustrySegment':'IP'}],'Setversion':'VersionValue','Flags':null,'ScalingID':0,'VersionInfo':{'PracticeSubType':'SubPracticeValue','Version':'VersionValue'}}", 'https://abc/something/API/2021v3/nothing/data'),
    
]

df = spark.createDataFrame(data, schema=schema)
df.display()

Displayed as a DataFrame, the data looks fine, and that is how the expected output should look.

However, when I write it to a CSV file in my ADLS, it misbehaves and creates a new column for every comma inside the JSON column.
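One possible direction, sketched under assumptions: Spark's CSV writer and reader accept `quote` and `escape` options (the default escape character is a backslash), the writer accepts `quoteAll`, and the reader accepts `multiLine`. The helper names `write_json_safe_csv` and `read_json_safe_csv` and the chosen option values are hypothetical and have not been checked against the actual ADLS data.

```python
# Assumed option values: quote fields with '"' and escape embedded quotes
# by doubling them, which is the convention most CSV readers expect.
WRITE_OPTIONS = {
    "header": "true",
    "quote": '"',
    "escape": '"',
    "quoteAll": "true",  # quote every field, not only those containing commas
}

READ_OPTIONS = {
    "header": "true",
    "quote": '"',
    "escape": '"',
    "multiLine": "true",  # tolerate line breaks inside quoted fields
}


def write_json_safe_csv(df, path):
    """Write a Spark DataFrame to CSV with explicit quoting (hypothetical helper)."""
    df.write.mode("overwrite").options(**WRITE_OPTIONS).csv(path)


def read_json_safe_csv(spark, path):
    """Read the CSV back with the same quoting rules (hypothetical helper)."""
    return spark.read.options(**READ_OPTIONS).csv(path)
```

With the DataFrame from the sample above, usage would look like `write_json_safe_csv(df, path)` followed by `read_json_safe_csv(spark, path).display()`, where `path` is the ADLS location. Whether this resolves the issue depends on how the file is read downstream, since the reader has to use the same quote and escape settings as the writer.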

In any case, I am unable to read the CSV data back; I tried display() and show(). When I looked at the CSV file generated in the container, this is what I found.

Please help me handle these commas. Thanks.
