Can anyone please suggest how we can effectively loop through a PySpark DataFrame?

Ancil
Contributor II

Scenario: I have a dataframe with more than 1000 rows, each row having a file path column and a result data column. I need to loop through each row and write a file to the file path, with the data from the result column.

What is the easiest and most time-effective way to do this?

I tried with collect() and it's taking a long time.

I also tried a UDF approach, but I am getting the error below.

[error screenshot]

11 REPLIES

UmaMahesh1
Honored Contributor III

Hi @Ancil P A​ 

Is the data in your result column a JSON value, or what format is it in?

From your question, I understand that you have two columns in your df: one column is the file path and the other column is the data.

Also, please post the UDF you are trying to build so that, if your approach is workable, it can be fixed.

Cheers..

Hi @Uma Maheswara Rao Desula​ 

The result column holds JSON result data, but the column type is string.

Please find below a screenshot of the UDF.

[UDF screenshot]

Once I call the line below, I get the error shown:

input_data_df = input_data_df.withColumn("is_file_created",write_files_udf(col("file_path"),col("data_after_grammar_correction")))

[error screenshot]
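For reference only (the actual UDF is in the screenshot above and is not readable here), a UDF of this shape is typically written along the lines of the sketch below; the body is entirely an assumption, and it only works if the executors can reach the target paths through ordinary file I/O (for example /dbfs/... FUSE paths). A common cause of errors in this pattern is referencing dbutils or the SparkSession inside the UDF, since neither is available on the executors.

```python
import os
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

@udf(returnType=BooleanType())
def write_files_udf(file_path, data):
    # Hypothetical reconstruction: write this row's data to its own path
    # and report success. Requires file_path to be reachable from the
    # executor's local filesystem (e.g. a /dbfs/... FUSE path).
    try:
        os.makedirs(os.path.dirname(file_path), exist_ok=True)
        with open(file_path, "w") as f:
            f.write(data)
        return True
    except Exception:
        return False
```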

-werners-
Esteemed Contributor III

Is it an option to write it as a single parquet file, but partitioned?

That way the physical paths of the partitions are different, but they all belong to the same parquet dataset.

The key is to avoid loops.

Hi @Werner Stinckens​ 

My use case is to write as many text files as there are rows in the dataframe.

For example, if I have 100 rows, then I need to write 100 files in the specified location.

-werners-
Esteemed Contributor III

Yes, exactly, that is what partitioning does.

All you need is a common path where you will write all those files, and then you partition on the part that is not common.

e.g.

/path/to/file1|<data>

/path/to/file2|<data>

The common part (/path/to) you use as the target location.

The changing part (file1, file2) you use as the partition column.

So it will become:

df.write.partitionBy(<fileCol>).parquet(<commonPath>)

Spark will write a file (or even more than 1) per partition.

If you want only a single file per partition, you also have to repartition by the file column first, as in the sketch below.
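To make this concrete, here is a minimal sketch assuming the paths share a common root and reusing the column names file_path and data_after_grammar_correction from the earlier snippet; common_root and the way the file name is split off the path are assumptions.

```python
from pyspark.sql import functions as F

# Assumption: every file_path looks like <common_root>/<file_name>,
# e.g. /mnt/output/file_0001.txt, so the trailing file name can act as
# the partition column and the shared root as the write target.
common_root = "/mnt/output"  # hypothetical common prefix

df_with_name = input_data_df.withColumn(
    "file_name", F.element_at(F.split(F.col("file_path"), "/"), -1)
)

(
    df_with_name
    .repartition("file_name")           # one output file per partition value
    .write
    .partitionBy("file_name")
    .mode("overwrite")
    .parquet(common_root)
)
```

Note the output lands in file_name=<value> subdirectories as parquet, not as plain text files at the original paths; that is the trade-off of avoiding the loop.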

Ancil
Contributor II

Hi @Werner Stinckens​ 

In my case there is no common path; the file path column has different paths in a storage container.

Do we have any other way?

-werners-
Esteemed Contributor III

AFAIK partitioning is the only way to write to multiple locations in parallel.

This SO thread perhaps has a way.
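For completeness, one pattern that is often used when the target paths share no common root (which may or may not be what the linked SO thread suggests) is to perform the writes inside foreachPartition, so the work stays distributed but each row goes to its own path. This sketch assumes the paths are reachable from the workers as ordinary files (e.g. /dbfs/mnt/... FUSE paths) and reuses the column names from the earlier snippet.

```python
import os

def write_partition(rows):
    # Runs on the executors; each row carries its own target path and payload.
    for row in rows:
        path = row["file_path"]  # assumed worker-accessible, e.g. a /dbfs/... path
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(row["data_after_grammar_correction"])

(
    input_data_df
    .select("file_path", "data_after_grammar_correction")
    .foreachPartition(write_partition)
)
```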

Ancil
Contributor II

Thanks a lot, let me check

Ancil
Contributor II

Hi @Werner Stinckens​ 

After partitioning I am still getting the error below. Do you have any idea about this error?

[error screenshot]

NhatHoang
Valued Contributor II

Hi,

I agree with Werners: try to avoid looping over a PySpark DataFrame.

If your dataframe is small (as you said, only about 1000 rows), you may consider using pandas.

Thanks.​
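As a concrete version of this suggestion, here is a minimal sketch assuming the dataframe fits in driver memory and the paths are reachable from the driver (e.g. /dbfs/... FUSE paths); the column names are carried over from the earlier snippet.

```python
import os

# Pull the (small) dataframe to the driver and loop with pandas.
pdf = input_data_df.select("file_path", "data_after_grammar_correction").toPandas()

for row in pdf.itertuples(index=False):
    os.makedirs(os.path.dirname(row.file_path), exist_ok=True)
    with open(row.file_path, "w") as f:
        f.write(row.data_after_grammar_correction)
```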

Hi @Nhat Hoang​ 

The size may vary; it may be up to 1 lakh (100,000) rows. I will check with pandas.
