DataFrame to CSV write has issues due to multiple commas inside a row value
04-11-2024 09:06 AM
Hi all
I am working on data that contains JSON fields with embedded commas, and I need to write it out in CSV format. I am facing challenges because the commas within the JSON are being misinterpreted as column delimiters during the conversion.
I tried several methods to modify the data and to escape the commas inside the row value, but none of them worked. At the same time, I have to make sure the JSON stays syntactically correct, because it will be used elsewhere in the project.
Here is the sample PySpark code and data that I worked with:
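from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# On Databricks `spark` already exists; getOrCreate() simply returns that session.
spark = SparkSession.builder.getOrCreate()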
schema = StructType([
    StructField("IDcol1", IntegerType(), True),
    StructField("IDcol2", IntegerType(), True),
    StructField("AdditionalRequestParameters", StringType(), True),
    StructField("RequestURL", StringType(), True)
])
data = [
(1, 2, "{'Locale':'en','KnowledgeType':[{'Name':'IndustryKnowledge'},{'Name':'KnowledgeIndustry'}],'SegmentCountry':[{'Country':'US','IndustrySegment':'IP'}],'Setversion':'VersionValue','Flags':null,'ScalingID':0,'VersionInfo':{'PracticeSubType':'SubPracticeValue','Version':'VersionValue'}}", 'https://abc/something/API/2021v3/nothing/data'),
]
df = spark.createDataFrame(data, schema=schema)
df.display()
Displayed as a DataFrame, it works fine, and that is how the expected data should look. However, when I write it to a CSV file in my ADLS, it misbehaves and creates a new column for every comma inside the JSON column.
I am also unable to read the CSV data back. I tried display() and show(), and when I looked at the CSV file generated in the container, this is what I found.
Please help me handle these commas. Thanks.
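For reference, the delimiter problem described above can be reproduced without Spark. The sketch below uses Python's standard `csv` module and a shortened, double-quoted stand-in for the `AdditionalRequestParameters` value (the stand-in and the helper variable names are assumptions, not the original data). It shows that a field wrapped in quotes keeps its commas as long as the reader honours the same quote character, while a naive split on commas breaks the field apart. Spark's CSV writer and reader accept `quote` and `escape` options that play the same role, though whether they resolve this particular case has not been verified here.

```python
import csv
import io
import json

# Shortened stand-in for the JSON column (assumption: valid, double-quoted JSON).
json_value = (
    '{"Locale":"en","KnowledgeType":[{"Name":"IndustryKnowledge"},'
    '{"Name":"KnowledgeIndustry"}],"Flags":null}'
)
row = [1, 2, json_value, "https://abc/something/API/2021v3/nothing/data"]

# Write one row with every field quoted; embedded double quotes are doubled ("").
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(row)
csv_line = buffer.getvalue()

# A naive split treats every comma as a delimiter, including those inside the JSON.
naive_fields = csv_line.strip().split(",")

# A quote-aware reader keeps the JSON in a single column.
parsed_fields = next(csv.reader(io.StringIO(csv_line)))

print(len(naive_fields), "fields from a naive split")
print(len(parsed_fields), "fields from a quote-aware reader")

# The JSON survives the round trip unchanged and is still parseable.
restored = json.loads(parsed_fields[2])
```

A text editor or a viewer that ignores quoting will show the file the way the naive split does, so it is worth checking how the file is being read before changing how it is written.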