cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Archive of legacy system into Databricks with structure and semi-structured data

Mathew-Vesely
New Contributor

We are currently exploring using Data Bricks to store and archive data from a legacy syste. The governance features of Unity Catalogue will give us the required capabilities to ensure we meet our legal, statutory and policy requirements for data retention.

 

We have about 100 tables exported from the source systems, these are structured tables from a CRM and Finance system. Additionally we have exported PDF attachments using a custom program, in addition we have exported  outbound emails that conform to the RFC standard for multi part Mime - i.e each email is a .eml and can be opened in Outlook

We propose having the PDF attachment's and Emails stored in Data Bricks, the current thinking is as an external volume in the schema as a table, this points towards a storage account on Azure (17mio emails and 5mio PDF's)

There is a view that a product like agent bricks, or similar, could be used to create an end user (internal only) application that could use allow employees to query data contained within. An example scenario, and the reason for including emails and attachments is;

A customer rings in a wishes to query an interaction that occurred in regards to the delivery of a item from 6 months ago. We have structured data for customer information, names, delivery addresses, including email address, we also have emails that were sent and the contents of them. We include any email contract attachments like policies or other artefacts stored against this customer but exported as PDF. 

 

By having the .eml and DPF's stored in DB, and being able to link them using for example email address and/or customer number the documents can be joined. Resulting queries for interactions or data for 'this customer' would not only retrieve structure data, but we could also include emails and attachments

 

Has anyone used a similar use case where semi structure data is linked to structured data? Also keen to get feedback on best practices for this scenario?

 

4 REPLIES 4

Coffee77
Contributor III

Could be a solution to create a delta table to store your structured data about your files (metadata) along with links or references to files (PDFs, emails, etc.) within the volume in a new column? You can return this info in "light" list sets and then, when user select one record or a subset of them, load content from files.

This is an approach, I used in some of my pipelines. In my case, no sense to store directly semi-structured data directly in delta tables.


Lifelong Solution Architect Learner | Coffee & Data

saurabh18cs
Honored Contributor II

Hi @Coffee77 In my case, no sense to store directly semi-structured data directly in delta tables.?? even if someone is storing semi-structured data directly in delta table , how is he/she storing? parsing whole file into a single column cell?

Not saying that at all. My approach was to convert JSON attributes in structured and non-structured columns in delta tables which, as said, it did not make sense as I didn't have the need of processing multiple files at once, only one file at a time before moving that raw info to silver layer.


Lifelong Solution Architect Learner | Coffee & Data

Raman_Unifeye
Contributor III

Classic 360 Custoemr View case and Databricks is certainly the right platform to do so.

Strcutred Data - Stores in the Delta Tables

Email and PDFs - stored in Volumes, however, metadata as path to the volumes stored in delta table against customer-id

In order to search through the unstrctured data content (.eml or PDF), you will have to write an extraction process to extract the info such as To, From, Subject, Date, and the Body text for emails.

If you need to search through the PDF content as well (not just the names), you may want to use ai_parse_document to extract the PPDF content and store in a delta table. This steps very much depends on the how spohisiticated search you want to build.


RG #Driving Business Outcomes with Data Intelligence