
Copy Into: Pattern for sub-folders

Anup
New Contributor III

While trying to ingest data from an S3 bucket, we are running into a situation where the data is spread across sub-folders of varying depth.

Is there a good way of specifying patterns for the above case?

We tried using the following for a depth of 4, and it works.

 

%sql
COPY INTO table_name
FROM 's3://bucket_name'
FILEFORMAT = BINARYFILE
PATTERN = '/*/*/*/*'
FORMAT_OPTIONS ('mergeSchema' = 'true',
'header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
 
However, it is not always possible to know the exact depth for a large amount of random data.
Has anyone run into this problem and been able to solve it?
2 REPLIES

Kaniz_Fatma
Community Manager

Hi @Anup, when data in S3 buckets is organized into sub-folders of varying depth, specifying patterns can be challenging.

 

However, there are some approaches you can consider:

 

Wildcard Patterns:

  • You’ve already used a wildcard pattern in your example: '/*/*/*/*'. This works well when you know the exact depth, but as you mentioned, it’s not always feasible.
  • Keep in mind that each '*' matches a single path level, so a longer pattern such as '/*/*/*/*/*' only matches files at that exact depth. If you have a rough idea of the maximum depth, you can issue the command once per depth up to that maximum (see the sketch after this list).
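
Because each '*' only matches one path level, a simple workaround when the exact depth is unknown is to run the command once per depth up to an assumed maximum; COPY INTO skips files it has already loaded, so the repeated runs are safe. A minimal sketch from a Databricks notebook (Python with spark.sql), reusing the table and bucket names from the post, mirroring its leading-slash pattern style, and assuming a maximum depth of 6:

# Run COPY INTO once per candidate depth. Each '*' matches a single path
# level, so '/*' covers files one level below the bucket, '/*/*' two levels,
# and so on. MAX_DEPTH is an assumed upper bound on the nesting.
MAX_DEPTH = 6

for depth in range(1, MAX_DEPTH + 1):
    pattern = "/*" * depth
    spark.sql(f"""
        COPY INTO table_name
        FROM 's3://bucket_name'
        FILEFORMAT = BINARYFILE
        PATTERN = '{pattern}'
        FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
        COPY_OPTIONS ('mergeSchema' = 'true')
    """)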

Recursive Wildcard Patterns:

Programmatic Approaches:

  • Write custom scripts or use AWS SDKs (such as Boto3 for Python) to programmatically explore the bucket structure and determine the depth dynamically (see the sketch after this list).
  • You can recursively list objects in a bucket and process them accordingly.
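
As a rough sketch of that programmatic approach (the bucket name is a placeholder and boto3 credentials are assumed to be configured), you can page through the keys and record the depths that actually occur:

import boto3

def key_depths(bucket, prefix=""):
    """Return the set of depths (path components per key) found under prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    depths = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            relative = obj["Key"][len(prefix):].lstrip("/")
            # Number of path components (folders plus file name); this equals
            # the number of '*' segments the COPY INTO PATTERN needs.
            depths.add(len(relative.split("/")))
    return depths

# e.g. key_depths("bucket_name") might return {3, 4}

With the observed depths in hand, you can build one PATTERN per depth, or take the maximum if the layout is uniform.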

Flat File Structure:

  • Remember that S3 itself has a flat structure with no inherent hierarchy like a typical file system. It uses key name prefixes for objects.
  • While the Amazon S3 console supports the concept of folders for organizational purposes, these are not actual folders but rather key name prefixes (the sketch after this list shows how they surface through the API).
  • Consider organizing your data in a way that minimizes the need for deep nesting.
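
To make the prefix idea concrete, here is a small boto3 sketch (placeholder bucket name): asking S3 to group keys on the '/' delimiter returns the apparent "folders" as CommonPrefixes, which are nothing more than key name prefixes.

import boto3

s3 = boto3.client("s3")

# Group keys on '/' so S3 reports the first-level "folders" as CommonPrefixes.
# No real directories are involved -- only key name prefixes.
response = s3.list_objects_v2(Bucket="bucket_name", Delimiter="/")

for cp in response.get("CommonPrefixes", []):
    print(cp["Prefix"])      # e.g. 'raw/' or 'landing/' (hypothetical prefixes)
for obj in response.get("Contents", []):
    print(obj["Key"])        # objects sitting directly at the bucket root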

Bucket Design:

Service Limit Increase:

Remember that flexibility and scalability are essential when dealing with large amounts of data. Experiment with different approaches to find the one that best fits your specific use case. 🚀

Anup
New Contributor III (accepted solution)

Thanks @Kaniz_Fatma,

We ended up implementing the programmatic approach to calculate the depth (using boto3).
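
For anyone landing on this thread later, a hedged sketch of how the two pieces could fit together; the key_depths() helper is the hypothetical one sketched in the reply above, and this assumes all files sit at a single depth (otherwise loop over every depth found):

# Hypothetical end-to-end sketch: derive the depth with boto3, then build
# the matching PATTERN and run COPY INTO from a Databricks notebook.
depth = max(key_depths("bucket_name"))   # e.g. 4
pattern = "/*" * depth                   # e.g. '/*/*/*/*'

spark.sql(f"""
    COPY INTO table_name
    FROM 's3://bucket_name'
    FILEFORMAT = BINARYFILE
    PATTERN = '{pattern}'
    FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")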
