07-02-2023 10:49 PM - edited 07-02-2023 10:50 PM
I am using the Unity Catalog Cluster. I have a requirement to read the files placed by the source team in a specific location (landing) in S3. I am already using a metastore pointing to a different bucket. Do I need to use an external location pointing to the landing bucket in S3? Additionally, how can I read the data from those files?
07-04-2023 10:30 PM
You have a couple of options to consider:
External location: You can create an external location in your Unity Catalog metastore that points to the landing bucket in S3. This lets Unity Catalog access the files in place, without copying or moving them into the metastore's managed storage. You can create the external location from Catalog Explorer in the Databricks UI or with SQL (CREATE EXTERNAL LOCATION), backed by a storage credential that has access to the bucket.
When creating it, specify the S3 bucket and prefix (folder) where the files live, and grant READ FILES on the location to the users or groups who need it. Unity Catalog can then read the data directly from that S3 location without any data movement.
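As a sketch of that setup (the location name `landing_loc`, credential name `my_storage_cred`, bucket, prefix, and group are all placeholders to substitute with your own), the SQL statements are built here as strings you would pass to `spark.sql()` on the cluster:

```python
# Hypothetical names throughout; replace with your own bucket, credential, and group.
landing_url = "s3://landing-bucket/raw/"

# Create the external location on top of an existing storage credential.
create_location_sql = f"""
CREATE EXTERNAL LOCATION IF NOT EXISTS landing_loc
URL '{landing_url}'
WITH (STORAGE CREDENTIAL my_storage_cred)
"""

# Grant read access on the files under that location.
grant_sql = "GRANT READ FILES ON EXTERNAL LOCATION landing_loc TO `data_engineers`"

# On a Databricks cluster you would run:
#   spark.sql(create_location_sql)
#   spark.sql(grant_sql)
```

The storage credential must already exist and its IAM role must be allowed to read the landing bucket; the external location only maps the path to that credential.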
Direct read: You can also query the files in the S3 landing bucket directly by path, using SQL or Spark, provided the path is covered by an external location you hold READ FILES on (or the cluster has its own AWS credentials, such as an instance profile). Spark's query engine then performs the distributed read straight from the S3 files; no table registration is needed first.
To read the data this way, point spark.read (or a path-based SQL query) at the S3 location, and then filter, aggregate, or join the resulting DataFrame as usual.
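For the direct read, a minimal PySpark sketch (the path and format below are placeholders, and `spark` is the session already available in a Databricks notebook):

```python
def read_landing(spark, path, file_format="json", **options):
    """Read files under an S3 path covered by a readable external location.

    `path`, `file_format`, and any options are placeholders; adjust them
    to match the files your source team drops in the landing bucket.
    """
    reader = spark.read.format(file_format)
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load(path)

# On the cluster, for example (hypothetical path):
#   df = read_landing(spark, "s3://landing-bucket/raw/", "csv", header="true")
#   df.filter("amount > 0").groupBy("source").count().show()
```

The helper is just a thin wrapper so the format and parser options stay in one place; calling `spark.read.format(...).load(path)` inline works just as well.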
07-05-2023 10:43 AM
If you could share an example of reading the files for both cases, it would be really helpful.
07-12-2023 02:54 AM
Hi @Databricks3
Hope you are well. Just wanted to check whether you were able to find an answer to your question; if so, would you mind marking it as best? It would be really helpful for the other members too.
Cheers!
08-19-2024 02:03 PM - edited 08-19-2024 02:04 PM
Did anyone find a solution to this? I am also facing challenges reading files from S3 using boto3 on a Unity Catalog-enabled cluster. I created the S3 external location and granted sufficient access. Any help on this?
The same path and data are accessible using PySpark without any issues.
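One likely explanation, offered as a hedged note rather than a confirmed diagnosis: Unity Catalog external-location grants are enforced when you go through Spark or dbutils, which obtain short-lived credentials from the metastore. A plain boto3 client bypasses that path entirely and resolves credentials from its own AWS chain (instance profile, environment variables, explicit keys), so those credentials must grant bucket access independently of the external location. A sketch (bucket and prefix are placeholders):

```python
def list_landing_keys(bucket, prefix, session=None):
    """List object keys under an S3 prefix with boto3.

    Note: boto3 does NOT use Unity Catalog grants; it authenticates via its
    own credential chain, which must be allowed to read the bucket even
    though the external location already covers the same path for Spark.
    """
    import boto3  # imported inside so the sketch loads even where boto3 is absent

    s3 = (session or boto3.Session()).client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# Example call (hypothetical bucket): list_landing_keys("landing-bucket", "raw/")
```

If you only need to list or read the files, `dbutils.fs.ls("s3://…")` or a Spark read goes through Unity Catalog and avoids the separate credential setup altogether.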