Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Archive of legacy system into Databricks with structured and semi-structured data

Mathew-Vesely
Visitor

We are currently exploring using Databricks to store and archive data from a legacy system. The governance features of Unity Catalog will give us the capabilities we need to meet our legal, statutory and policy requirements for data retention.

We have about 100 tables exported from the source systems; these are structured tables from a CRM and a finance system. We have also exported PDF attachments using a custom program, along with outbound emails that conform to the RFC standard for multipart MIME, i.e. each email is a .eml file that can be opened in Outlook.
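
To make the email side concrete, here is a rough sketch of how the linkable fields could be pulled out of a single .eml file using only the Python standard library. The function name `parse_eml` and the choice of fields are our own assumptions for illustration, not an agreed design, and we have not yet run this against the full export.

```python
from email import policy
from email.parser import BytesParser
from email.utils import getaddresses


def parse_eml(raw: bytes) -> dict:
    """Extract linkable metadata from one MIME message (hypothetical helper)."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)

    # Collect To and Cc addresses, lower-cased so they can be matched later.
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = [addr.lower() for _, addr in getaddresses(headers) if addr]

    # Prefer the plain-text body, fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))

    # Record the file names of any PDF attachments carried by the email.
    pdf_names = [
        part.get_filename()
        for part in msg.iter_attachments()
        if part.get_content_type() == "application/pdf"
    ]

    return {
        "message_id": str(msg["Message-ID"] or ""),
        "sent": str(msg["Date"] or ""),
        "subject": str(msg["Subject"] or ""),
        "recipients": recipients,
        "body": body_part.get_content() if body_part else "",
        "pdf_attachments": pdf_names,
    }
```

The idea would be to run something like this once per file and keep the result as a row in a metadata table, with the original .eml left untouched in storage.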

We propose storing the PDF attachments and emails in Databricks. The current thinking is to use an external volume in the schema, alongside the tables, that points to a storage account on Azure (17 million emails and 5 million PDFs).

There is a view that a product like Agent Bricks, or similar, could be used to create an end-user (internal only) application that would allow employees to query the data contained within. An example scenario, and the reason for including emails and attachments, is:

A customer rings in and wishes to query an interaction regarding the delivery of an item from 6 months ago. We have structured data for customer information: names, delivery addresses and email addresses. We also have the emails that were sent and their contents. We include any attachments, such as contracts, policies or other artefacts, that are stored against this customer and were exported as PDF.

By having the .eml files and PDFs stored in Databricks, and being able to link them using, for example, email address and/or customer number, the documents can be joined to the structured data. Queries for interactions or data for 'this customer' would then retrieve not only structured data but also the related emails and attachments.
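
As a minimal sketch of the linking we have in mind, the snippet below joins document metadata to customer records on a normalised email address. It is plain Python purely to show the logic; in practice we would expect this to be a join between tables. The field names (`customer_no`, `email`, `recipients`, `path`) and the volume path in the usage example are hypothetical.

```python
from collections import defaultdict


def normalise_email(address: str) -> str:
    """Trim and lower-case an address so that case differences do not block a match."""
    return address.strip().lower()


def link_documents(customers: list[dict], documents: list[dict]) -> dict:
    """Group document paths by customer number, matching on email address."""
    by_email = {normalise_email(c["email"]): c["customer_no"] for c in customers}

    linked = defaultdict(list)
    for doc in documents:
        for addr in doc["recipients"]:
            customer_no = by_email.get(normalise_email(addr))
            # Skip unmatched addresses and avoid listing a document twice.
            if customer_no is not None and doc["path"] not in linked[customer_no]:
                linked[customer_no].append(doc["path"])
    return dict(linked)


# Usage with made-up data; the path assumes a volume named "emails"
# in a catalog "archive" and schema "crm".
customers = [{"customer_no": "C001", "email": "Jane.Doe@example.com"}]
documents = [
    {"path": "/Volumes/archive/crm/emails/0001.eml",
     "recipients": ["jane.doe@EXAMPLE.com"]},
]
print(link_documents(customers, documents))
```

Matching on email address alone will miss customers who changed address over time, which is why we would also want customer number as a second key wherever the export carries it.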

 

Has anyone implemented a similar use case where semi-structured data is linked to structured data? We are also keen to get feedback on best practices for this scenario.

 
