- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
04-14-2023 10:00 AM
@Artem Sachuk :
Yes, to keep track of which files were loaded by Autoloader, you can add the input_file_name() function as a new column in your data frame during the loading process. This way, you can know exactly which file each row of data came from.
Alternatively, you can also use the Autoloader's checkpoint mechanism to keep track of which files have been successfully loaded. The checkpoint mechanism stores the file metadata (such as file name, size, and modified time) in a checkpoint file. You can query this checkpoint file to see which files have been processed and loaded successfully.
To enable checkpointing, you need to specify a checkpoint directory when you create the Autoloader stream. For example:
checkpoint_dir = "/mnt/checkpoints"
autoloader = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.checkpointLocation", checkpoint_dir) \
.load(path)The checkpoint directory can be any path in a file system that supports write access. Make sure that the checkpoint directory is on a durable file system (such as HDFS or Azure Blob Storage) and not on a local disk, as the checkpoint information needs to survive driver or executor failures.
Once the checkpoint directory is set, the Autoloader stream will create checkpoint files in that directory to keep track of its progress. You can query the checkpoint files using the StreamingQuery.recentProgress method to get information about the latest batch of processed files.
For Example:
query = autoloader.writeStream \
.outputMode("append") \
.format("console") \
.start()
while not query.isActive:
time.sleep(1)
while query.isActive:
for progress in query.recentProgress:
if "numInputRows" in progress:
print("Loaded {} rows from file {}"
.format(progress["numInputRows"], progress["inputFiles"]))
time.sleep(1)
query.awaitTermination()This code will print the number of rows loaded and the file names for each processed batch of files.