<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19408#M12993</link>
    <description>&lt;P&gt;Hi @Werner Stinckens&amp;nbsp;&lt;/P&gt;&lt;P&gt;In my case there is no common path; the file path column contains different paths within a storage container.&lt;/P&gt;&lt;P&gt;Is there any other way?&lt;/P&gt;</description>
    <pubDate>Thu, 01 Dec 2022 13:44:04 GMT</pubDate>
    <dc:creator>Ancil</dc:creator>
    <dc:date>2022-12-01T13:44:04Z</dc:date>
    <item>
      <title>Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19402#M12987</link>
      <description>&lt;P&gt;Scenario: I have a dataframe with more than 1000 rows, each row having a file path and a result data column. I need to loop through each row and write a file to the file path, with the data from the result column.&lt;/P&gt;&lt;P&gt;What is the easiest and most time-effective way to do this?&lt;/P&gt;&lt;P&gt;I tried collect() and it is taking a long time.&lt;/P&gt;&lt;P&gt;I also tried UDF methods, but I am getting the error below.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1070i0FA54B9743651792/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 12:59:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19402#M12987</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-01T12:59:35Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19403#M12988</link>
      <description>&lt;P&gt;Hi @Ancil P A&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is the data in your result column a JSON value, or how is it structured?&lt;/P&gt;&lt;P&gt;From your question, I understood that you have two columns in your df: one column is the file path and the other is the data.&lt;/P&gt;&lt;P&gt;Also, please post the UDF you are trying to build, so that if your approach is workable, it can be fixed.&lt;/P&gt;&lt;P&gt;Cheers.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:06:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19403#M12988</guid>
      <dc:creator>UmaMahesh1</dc:creator>
      <dc:date>2022-12-01T13:06:50Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19404#M12989</link>
      <description>&lt;P&gt;Is it an option to write it as a single parquet file, but partitioned?&lt;/P&gt;&lt;P&gt;That way the physical paths of the partitions are different, but they all belong to the same parquet file.&lt;/P&gt;&lt;P&gt;The key is to avoid loops.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:06:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19404#M12989</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-12-01T13:06:56Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19405#M12990</link>
      <description>&lt;P&gt;Hi @Uma Maheswara Rao Desula&amp;nbsp;&lt;/P&gt;&lt;P&gt;The result column holds JSON data, but the column type is string.&lt;/P&gt;&lt;P&gt;Please find below a screenshot of the UDF.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1061iFCA8251CC8D5BE35/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;Once I call the line below, I get the error shown:&lt;/P&gt;&lt;P&gt;input_data_df = input_data_df.withColumn("is_file_created", write_files_udf(col("file_path"), col("data_after_grammar_correction")))&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1083i6BCAC130E1DAC180/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:25:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19405#M12990</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-01T13:25:47Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19406#M12991</link>
      <description>&lt;P&gt;Hi @Werner Stinckens&amp;nbsp;&lt;/P&gt;&lt;P&gt;My use case is to write one text file per row of the dataframe.&lt;/P&gt;&lt;P&gt;For example, if I have 100 rows, then I need to write 100 files to the specified locations.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:28:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19406#M12991</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-01T13:28:43Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19407#M12992</link>
      <description>&lt;P&gt;Yes, exactly; that is what partitioning does.&lt;/P&gt;&lt;P&gt;All you need is a common path where you will write all those files, and you partition on the part that is not common.&lt;/P&gt;&lt;P&gt;E.g.&lt;/P&gt;&lt;P&gt;/path/to/file1|&amp;lt;data&amp;gt;&lt;/P&gt;&lt;P&gt;/path/to/file2|&amp;lt;data&amp;gt;&lt;/P&gt;&lt;P&gt;The common part (/path/to) you use as the target location.&lt;/P&gt;&lt;P&gt;The changing part (file1, file2) you use as the partition column.&lt;/P&gt;&lt;P&gt;So it becomes:&lt;/P&gt;&lt;P&gt;df.write.partitionBy(&amp;lt;fileCol&amp;gt;).parquet(&amp;lt;commonPath&amp;gt;)&lt;/P&gt;&lt;P&gt;Spark will write a file (or even more than one) per partition.&lt;/P&gt;&lt;P&gt;If you want only a single file per partition, you also have to repartition by fileCol.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:37:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19407#M12992</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-12-01T13:37:50Z</dc:date>
    </item>
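    <!-- The partitioning advice above hinges on splitting each full path into a shared base plus a per-row partition value. A minimal pure-Python sketch of that split (the helper name is hypothetical; in Spark the resulting base and partition column would then feed df.write.partitionBy(partCol).parquet(base)):

```python
import os.path

def split_for_partitioning(paths):
    """Split full file paths into a common base path plus a per-row
    remainder, the shape df.write.partitionBy(...) expects."""
    base = os.path.commonpath(paths)             # shared prefix, e.g. /path/to
    # the remainder of each path becomes the partition-column value
    parts = [os.path.relpath(p, base) for p in paths]
    return base, parts

base, parts = split_for_partitioning(["/path/to/file1", "/path/to/file2"])
# base == "/path/to", parts == ["file1", "file2"]
```

With a partition column built this way, Spark writes one directory per distinct value under the common base, which is the closest built-in analogue to "one output location per row" without an explicit loop. -->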
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19408#M12993</link>
      <description>&lt;P&gt;Hi @Werner Stinckens&amp;nbsp;&lt;/P&gt;&lt;P&gt;In my case there is no common path; the file path column contains different paths within a storage container.&lt;/P&gt;&lt;P&gt;Is there any other way?&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:44:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19408#M12993</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-01T13:44:04Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19409#M12994</link>
      <description>&lt;P&gt;AFAIK partitioning is the only way to write to multiple locations in parallel.&lt;/P&gt;&lt;P&gt;This &lt;A href="https://stackoverflow.com/questions/73409103/can-i-write-multiple-dataframes-in-parallel-in-spark" alt="https://stackoverflow.com/questions/73409103/can-i-write-multiple-dataframes-in-parallel-in-spark" target="_blank"&gt;SO thread&lt;/A&gt; perhaps has a way.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 13:51:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19409#M12994</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-12-01T13:51:37Z</dc:date>
    </item>
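    <!-- When the paths really share no common base, one approach sometimes used (not from this thread; a sketch under stated assumptions) is to collect the rows to the driver and fan the writes out over a thread pool. All names here are illustrative, and the rows are faked under a temp directory instead of coming from df.collect():

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_one(row):
    """Write a single (path, data) pair; returns the path on success."""
    path, data = row
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(data)
    return path

# In Spark this list would come from df.collect() or df.toLocalIterator();
# here we fabricate a few rows under a temp directory for illustration.
root = tempfile.mkdtemp()
rows = [(os.path.join(root, f"dir{i}", f"file{i}.txt"), f"data {i}")
        for i in range(3)]

with ThreadPoolExecutor(max_workers=8) as pool:
    written = list(pool.map(write_one, rows))
```

This keeps the I/O concurrent even though the loop runs on the driver; it is only sensible when the row count is modest, since everything is collected to one machine first. -->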
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19410#M12995</link>
      <description>&lt;P&gt;Thanks a lot, let me check.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 14:02:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19410#M12995</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-01T14:02:37Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19411#M12996</link>
      <description>&lt;P&gt;Hi @Werner Stinckens&amp;nbsp;&lt;/P&gt;&lt;P&gt;Even after partitioning, I am getting the error below. Do you have any idea about this error?&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1075i25ED6F06A4B2A6C7/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 16:36:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19411#M12996</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-01T16:36:19Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19412#M12997</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I agree with Werners: try to avoid looping over a PySpark DataFrame.&lt;/P&gt;&lt;P&gt;If your dataframe is small (as you said, only about 1000 rows), you may consider using pandas.&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 02 Dec 2022 03:28:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19412#M12997</guid>
      <dc:creator>NhatHoang</dc:creator>
      <dc:date>2022-12-02T03:28:07Z</dc:date>
    </item>
    <item>
      <title>Re: Can anyone please suggest how we can effectively loop through a PySpark DataFrame?</title>
      <link>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19413#M12998</link>
      <description>&lt;P&gt;Hi @Nhat Hoang&amp;nbsp;&lt;/P&gt;&lt;P&gt;The size may vary; it may be up to 1 lakh (100,000) rows. I will check with pandas.&lt;/P&gt;</description>
      <pubDate>Fri, 02 Dec 2022 04:34:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/any-on-please-suggest-how-we-can-effectively-loop-through/m-p/19413#M12998</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2022-12-02T04:34:42Z</dc:date>
    </item>
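    <!-- The pandas route suggested above might look like the sketch below. The toPandas() call and the column names (taken from the UDF snippet earlier in the thread) are assumptions; here a stand-in frame is built under a temp directory so the example is self-contained:

```python
import os
import tempfile
import pandas as pd

# pdf = input_data_df.toPandas()  # on Databricks; below is a stand-in frame
root = tempfile.mkdtemp()
pdf = pd.DataFrame({
    "file_path": [os.path.join(root, f"f{i}.txt") for i in range(2)],
    "data_after_grammar_correction": ["hello", "world"],
})

# itertuples is an efficient way to loop over a pandas DataFrame row by row
for row in pdf.itertuples(index=False):
    with open(row.file_path, "w") as f:
        f.write(row.data_after_grammar_correction)
```

A plain driver-side loop like this sidesteps the UDF error entirely, at the cost of losing Spark's parallelism, so it only fits the small end of the 1000-to-100,000-row range discussed above. -->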
  </channel>
</rss>

