Hi @Tim Tremper, The specific dataset you mentioned, "training/ecommerce/events/events.parquet", is in Parquet format, but you can easily convert it to CSV using Apache Spark™ on Databricks.
Here's a step-by-step guide to convert the Parquet dataset into a CSV file and download it locally:
- First, load the Parquet file into a DataFrame:
parquet_df = spark.read.parquet("dbfs:/databricks-datasets/ecommerce/events/events.parquet")
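Optionally, confirm the load before converting anything:
# Quick check: inspect the schema and row count of the loaded DataFrame
parquet_df.printSchema()
print(parquet_df.count())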
- Next, write the DataFrame out as a temporary CSV under DBFS. Spark normally writes a directory of part files, so coalesce to a single partition first so the directory contains one CSV file:
parquet_df.coalesce(1).write.csv("dbfs:/tmp/events.csv", mode="overwrite", header=True)
- Now, copy the CSV output directory from DBFS to the local file system of the driver node:
%fs cp -r dbfs:/tmp/events.csv file:/tmp/events.csv
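If you want to sanity-check the copy before publishing it, you can list the directory on the driver:
%fs ls file:/tmp/events.csv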
- Finally, publish the CSV to the FileStore area of DBFS, which is served over HTTP so you can download it from a browser. Because the output is a directory, copy the single part file inside it to a stable name:
# Locate the one part file written above and copy it to FileStore under a friendly name
part = [f.path for f in dbutils.fs.ls("file:/tmp/events.csv") if f.name.endswith(".csv")][0]
dbutils.fs.cp(part, "dbfs:/FileStore/events.csv")
You can now download the CSV file from your browser by navigating to:
https://<your-databricks-instance>/files/events.csv
Replace <your-databricks-instance> with the URL of your Databricks workspace.
Once you have the CSV file, you can upload it to your company's Databricks environment and use it as a data source for the "Apache Spark™ Programming with Databricks" course.
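For example, after uploading you could read it back like this (the path below assumes the upload UI's default /FileStore/tables/ location — adjust it to wherever your file actually lands):
# Read the uploaded CSV back into a DataFrame; inferSchema guesses the column types
events_df = spark.read.csv("dbfs:/FileStore/tables/events.csv", header=True, inferSchema=True)
events_df.printSchema()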
Remember that converting the Parquet dataset to CSV may increase the file size, and CSV loses Parquet-specific features such as the embedded schema (column types) and efficient columnar compression. However, it should be sufficient for the purposes of the course.
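If exact column types matter for the course exercises, you can also pin them with an explicit schema instead of relying on inference; the fields below are illustrative placeholders, not the dataset's real columns:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
# Hypothetical schema for illustration only; substitute the real columns from printSchema()
events_schema = StructType([
    StructField("event_name", StringType(), True),
    StructField("event_timestamp", TimestampType(), True),
])
events_df = spark.read.csv("dbfs:/FileStore/tables/events.csv", header=True, schema=events_schema)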