Data Engineering

by mimezzz • Contributor

11-02-2022 6:46:44 PM

7792 Views
8 replies
10 kudos

Resolved! Dataframe rows missing after write_to_delta and read_from_delta

Hi, I am trying to load MongoDB data into S3 using PySpark 3.1.1 by reading it into Parquet. My code snippets are like:
df = spark.read.format("mongo").options(**read_options).load(schema=schema)
df = df.coalesce(64)
write_df_to_del...

Latest Reply

mimezzz
Contributor

01-26-2023 9:45:26 PM

10 kudos

So I think I have solved the mystery here: it was to do with the retention config. By setting retentionEnabled to True and the retention hours to 0, we somewhat lose a few rows in the first file, as they were mistaken for files from the last session and ...
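For readers hitting the same symptom, here is a minimal sketch of one way to guard against a zero-hour retention. It assumes the retention setting in the reply ends up driving Delta Lake's VACUUM command, which the truncated post does not confirm. The helper names `vacuum_statement` and `safe_vacuum` are hypothetical and not from the thread.

```python
# Delta Lake's default retention for VACUUM is 7 days (168 hours).
DEFAULT_RETENTION_HOURS = 168


def vacuum_statement(table_path, retention_hours=DEFAULT_RETENTION_HOURS):
    """Build a VACUUM statement, refusing retention windows below the default.

    Hypothetical helper: a short retention can delete data files that a
    concurrent or very recent write still depends on.
    """
    if retention_hours < DEFAULT_RETENTION_HOURS:
        raise ValueError(
            f"retention_hours={retention_hours} is below the "
            f"{DEFAULT_RETENTION_HOURS}-hour default; recent files could be removed"
        )
    return f"VACUUM delta.`{table_path}` RETAIN {retention_hours} HOURS"


def safe_vacuum(spark, table_path, retention_hours=DEFAULT_RETENTION_HOURS):
    """Run the guarded VACUUM through an existing SparkSession (passed in)."""
    return spark.sql(vacuum_statement(table_path, retention_hours))
```

Calling `safe_vacuum(spark, path)` after the write keeps the default window, while passing `retention_hours=0` raises an error instead of silently dropping recent files.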

7 More Replies

by SRK • Contributor III

12-21-2022 8:29:40 AM

8620 Views
2 replies
0 kudos

How to get the count of dataframe rows when reading through spark.readstream using batch jobs?

I am trying to read messages from a Kafka topic using spark.readStream. I am using the following code to read it. My code:
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "192.1xx.1.1xx:9xx").option("subscr...

Latest Reply

daniel_sahal
Esteemed Contributor

12-22-2022 5:13:54 AM

0 kudos

You can try this approach: https://stackoverflow.com/questions/57568038/how-to-see-the-dataframe-in-the-console-equivalent-of-show-for-structured-st/62161733#62161733
readStream runs a thread in the background, so there's no easy way like df.show().
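Another option, not necessarily what the linked answer proposes, is to count rows per micro-batch with `foreachBatch`, where each batch arrives as a regular DataFrame that supports `count()`. The sketch below assumes an existing streaming DataFrame is passed in. The names `batch_counts`, `record_batch_count` and `start_count_query` are made up for illustration, and the run-once trigger is an assumption based on the "batch jobs" wording in the question.

```python
# Collected row counts, keyed by micro-batch id (hypothetical module-level store).
batch_counts = {}


def record_batch_count(batch_df, batch_id):
    """Invoked once per micro-batch; batch_df is a non-streaming DataFrame."""
    row_count = batch_df.count()
    batch_counts[batch_id] = row_count
    print(f"batch {batch_id}: {row_count} rows")


def start_count_query(stream_df):
    """Attach the counter to a streaming DataFrame and start the query.

    trigger(once=True) processes the data available at start and then stops,
    which suits a scheduled batch job.
    """
    return (
        stream_df.writeStream
        .foreachBatch(record_batch_count)
        .trigger(once=True)
        .start()
    )
```

With the Kafka `df` from the question, `start_count_query(df).awaitTermination()` would block until the run finishes, after which `batch_counts` holds the per-batch totals.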

1 More Replies

Databricks Community
