Databricks Community

JissMathew · ‎11-14-2024

I am trying to read a CSV file into a DataFrame in Databricks, but the load fails with formatting and column errors: the data does not end up in the right columns. This is the code I used:

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("quote", '"') \
    .option("delimiter", ",") \
    .option("nullValue", "") \
    .option("emptyValue", "NULL") \
    .schema(schema) \
    .load(f"{bronze_folder_path}/Test.csv")

This is the actual data format.

Jiss Mathew
India .

MuthuLakshmi · ‎11-14-2024

@JissMathew What is the error that you are getting when trying to load?

JissMathew · ‎11-15-2024

@MuthuLakshmi Actually, "kochi" should be in the "address" column, but the columns are mismatched and it ends up in the "name" column. That is the error.

szymon_dybczak · ‎11-15-2024

Hi @JissMathew,

Could you also provide a sample CSV file?

JissMathew · ‎11-17-2024

Hi @szymon_dybczak, I only have the option to upload files in PNG or JPG format.

holly · ‎11-18-2024

Hey, what's the schema you're referencing? The dates are very inconsistent and unlikely to be loaded in as anything useful. It also looks like the comma delimiter is causing you issues, since commas also appear inside the body of the text and are not always quoted. If this is a CSV you want to use for a one-off instance, you could export it to a tab-delimited file (or use another delimiter of your choice), and that should go some way to fixing the issue.
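To illustrate the delimiter point, here is a minimal sketch using Python's standard `csv` module rather than Spark. The sample rows are hypothetical and not taken from the original file. It shows how an unquoted comma inside a value produces an extra column, while the same value survives intact in a tab-delimited export.

```python
import csv
import io

# Hypothetical row: the address contains a comma but is not quoted,
# so a comma-delimited parser sees four fields instead of three.
comma_text = "id,name,address\n1,Jiss,MG Road, Kochi\n"
comma_rows = list(csv.reader(io.StringIO(comma_text)))

# The same data exported with a tab delimiter: the comma is now just text.
tab_text = "id\tname\taddress\n1\tJiss\tMG Road, Kochi\n"
tab_rows = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))

print(len(comma_rows[1]))  # number of fields parsed from the comma-delimited row
print(tab_rows[1])         # fields parsed from the tab-delimited row
```

With Spark, the equivalent change would be passing a tab as the value of the existing `delimiter` option.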

JissMathew · ‎11-18-2024

Hey @holly,

Actually, the .option("quote", '"') option in the code should fix this error, but it is not working. Is there a standard file format for CSV files?

Lakshay · ‎11-18-2024

Because "kochi" is on a new line, it is causing the issue. Ideally, I would suggest avoiding CSV files that have line breaks inside column data. But if you need to handle this scenario, you probably need to put explicit quotes around every column value in your file, so that a line break inside a value is not treated as the start of a new row.
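As a rough sketch of the quoting suggestion, the following uses Python's standard `csv` module (not Spark) to write every field in quotes and then read the file back. The row values are hypothetical. A record-aware parser keeps the quoted line break inside the value instead of starting a new row.

```python
import csv
import io

# Hypothetical data: the address value contains a line break.
rows = [
    ["id", "name", "address"],
    ["1", "Jiss", "MG Road\nKochi"],
]

# Write every field wrapped in double quotes.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
quoted_text = buffer.getvalue()

# Read it back: the quoted line break stays inside the address field.
parsed = list(csv.reader(io.StringIO(quoted_text)))

print(quoted_text)
print(len(parsed))  # number of records, header included
```

Quoting alone is only half of the fix: the reader also has to be told that a record may span several physical lines.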

JissMathew · ‎11-19-2024

Is there an option to handle this scenario through the file format, or do we have to manually edit the source file?

Mike_Szklarczyk · ‎11-20-2024

You can try adding the multiline option:

df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("quote", '"')
    .option("delimiter", ",")
    .option("nullValue", "")
    .option("emptyValue", "NULL")
    .option("multiline", True)
    .schema(schema)
    .load(f"{bronze_folder_path}/Test.csv")
)

https://spark.apache.org/docs/3.5.1/sql-data-sources-csv.html
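To show why a multiline setting matters, here is a small illustration with Python's standard `csv` module instead of Spark, using a hypothetical sample. Treating each physical line as one record (roughly what a line-based reader does) breaks a quoted value that contains a line break, while a record-aware parser keeps it in one field.

```python
import csv
import io

# Hypothetical sample: the quoted address spans two physical lines.
text = 'id,name,address\n1,Jiss,"MG Road\nKochi"\n'

# Line-based parsing: every physical line is treated as its own record,
# so the second half of the address becomes a separate, shifted row.
naive_rows = [line.split(",") for line in text.splitlines()]

# Record-aware parsing: the quoted line break stays inside one field.
aware_rows = list(csv.reader(io.StringIO(text)))

print(len(naive_rows), naive_rows[-1])
print(len(aware_rows), aware_rows[-1])
```

In the line-based result the trailing part of the address lands in the first column of an extra row, which is the kind of column shift described earlier in this thread.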

I also encourage you to use this syntax:

df = (
  spark.read
  .some_transformation
)

rather than:

df = spark.read \
  .some_transformation

It improves readability and lets you comment out selected lines.
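A tiny runnable illustration of the commenting-out point, using plain string methods in place of Spark calls (the input value is made up). Inside parentheses a whole step can be commented out and the chain still parses. Commenting out a line in the middle of a backslash-continued chain breaks the statement.

```python
raw = "  id,name,address  "

header = (
    raw
    .strip()
    # .upper()  # temporarily disabled; the chain still parses
    .replace(",", " | ")
)

print(header)
```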

JissMathew · ‎11-21-2024

@Mike_Szklarczyk Thank you! The issue has been successfully resolved. I sincerely appreciate your guidance and support throughout this process. Your assistance was invaluable. 😊
