11-18-2022 08:25 AM
Currently I load multiple parquet files with this code:
df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
(Inside the Voucher folder there is one folder per date, each containing one parquet file.)
How can I add a column to this DataFrame that contains the creation date of each parquet file?
Thanks
- Labels:
  - Dataframe
  - Parquet File
  - Pyspark Dataframe
  - Python
Accepted Solutions
11-18-2022 08:43 AM
Hi,
You can use the file metadata column: https://docs.databricks.com/ingestion/file-metadata-column.html
This way you can access the file_path, file_name, file_size, and file_modification_time of the underlying data file from each DataFrame row. No need to do it manually!
I found it useful 🙂
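For illustration, a minimal sketch of how this could look, based on the linked docs. It assumes a Databricks notebook (so `spark` is already defined) and a runtime where the hidden `_metadata` column is available for file-based sources; note there is no true "creation date" field, so `file_modification_time` is used as the closest proxy:

```python
from pyspark.sql.functions import col

# Read all date folders under Voucher, then surface fields from the
# hidden _metadata struct as regular columns
df = (
    spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
    .select(
        "*",
        col("_metadata.file_name").alias("source_file"),
        # Timestamp of the file's last modification; closest proxy
        # for the parquet file's creation date in this layout
        col("_metadata.file_modification_time").alias("file_modification_time"),
    )
)
```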
11-18-2022 12:46 PM
Thanks @Michail Karamanos