Copy Into: Pattern for sub-folders

Anup
New Contributor III

While trying to ingest data from an S3 bucket, we are running into a situation where the data is organized into sub-folders of varying depths.

Is there a good way of specifying patterns for this case?

We tried the following for a depth of 4, and it works:

 

%sql
COPY INTO table_name
FROM 's3://bucket_name'
FILEFORMAT = BINARYFILE
PATTERN = '/*/*/*/*'
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
 
However, it is not always possible to know the exact depth in advance for a large amount of arbitrarily organized data.
Has anyone run into this problem and been able to solve it?
2 REPLIES

Kaniz
Community Manager

Hi @Anup, when dealing with data in S3 buckets that is organized into sub-folders of varying depths, specifying patterns can be challenging.

However, there are some approaches you can consider:

Wildcard Patterns:

  • You’ve already used a wildcard pattern in your example: '/*/*/*/*'. This works well when you know the exact depth, but as you mentioned, it’s not always feasible.
  • Note that a fixed pattern such as '/*/*/*/*/*' still matches only files at that exact depth. If you know the range of possible depths, you can enumerate them in a single glob using brace alternation, as in the sketch after this list.
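A minimal sketch of that idea in a notebook cell, assuming your runtime's PATTERN glob supports brace alternation ('{a,b}' matches either alternative) and reusing the placeholder table and bucket names from the question:

# Sketch: one COPY INTO covering depths 2 through 4.
# Assumes PATTERN accepts brace alternation; verify on your runtime.
spark.sql("""
    COPY INTO table_name
    FROM 's3://bucket_name'
    FILEFORMAT = BINARYFILE
    PATTERN = '{*/*,*/*/*,*/*/*/*}'
    COPY_OPTIONS ('mergeSchema' = 'true')
""")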

Programmatic Approaches:

  • Write a custom script or use an AWS SDK (such as Boto3 for Python) to explore the bucket structure programmatically and determine the depth dynamically; a sketch follows this list.
  • You can recursively list objects in a bucket and process them accordingly.
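A minimal sketch of that approach, assuming boto3 credentials are already configured and reusing the placeholder bucket and table names from the question (the helper name max_key_depth is hypothetical):

import boto3

def max_key_depth(bucket: str, prefix: str = "") -> int:
    """Return the deepest folder nesting among object keys under prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    depth = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Each '/' in the key marks one folder level.
            depth = max(depth, obj["Key"][len(prefix):].count("/"))
    return depth

bucket = "bucket_name"  # placeholder from the question
# One '*' per folder level, plus one more for the file name itself.
pattern = "/".join(["*"] * (max_key_depth(bucket) + 1))

spark.sql(f"""
    COPY INTO table_name
    FROM 's3://{bucket}'
    FILEFORMAT = BINARYFILE
    PATTERN = '{pattern}'
    COPY_OPTIONS ('mergeSchema' = 'true')
""")

Note that this pattern matches only files at the maximum depth; if files can sit at several levels, combine it with the brace-alternation idea above.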

Flat File Structure:

  • Remember that S3 itself has a flat structure with no inherent hierarchy like a typical file system. It uses key name prefixes for objects.
  • While the Amazon S3 console presents folders for organizational purposes, these are not actual directories but key prefixes; the short listing example after this list illustrates the difference.
  • Consider organizing your data in a way that minimizes the need for deep nesting.
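To make that concrete, a small illustration (placeholder bucket name; assumes boto3 credentials are configured):

import boto3

s3 = boto3.client("s3")
# With a delimiter, S3 groups keys by shared prefix; the "folders" come
# back as CommonPrefixes entries, not as real directory objects.
resp = s3.list_objects_v2(Bucket="bucket_name", Delimiter="/")
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])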

Remember that flexibility and scalability are essential when dealing with large amounts of data. Experiment with different approaches to find the one that best fits your specific use case. 🚀

Anup
New Contributor III
(Accepted Solution)

Thanks @Kaniz,

We ended up implementing the programmatic approach to calculate the depth (using boto3).
