08-21-2021 11:58 AM
I have an existing data pipeline that looks like this: a small MySQL data source (around 250 GB) whose data passes through Debezium/Kafka and a custom data redactor, into Glue ETL jobs, and finally lands in Redshift. The scale of that data is small.

I am trying to build a new pipeline from the following current state:
- 3 MySQL replicas (source) in 1 cluster (total data size 1 TB)
- Data scientists directly query the data in a hacky way, resulting in hours of delay.
- For this new pipeline the data is much larger (TBs), so the existing pipeline implementation fails.

I am brainstorming architectural ideas at this point. A lot of these concepts are quite new to me, so I have some questions that are unclear:
1. Is it possible to create a CDC pipeline inside the SQL cluster? Is Delta Lake a good fit with MySQL replicas?
2. Is there a good way to redact/hide PII inside the SQL cluster? I was thinking of creating views that don't read the PII fields. What is an alternative to Debezium/Kafka if I am using Databricks and want to hide PII fields upstream?
3. I am looking at AWS Databricks as an option to perform ETL. In that case, should the source be the CDC tables before ETL starts? Or is it advisable to build another CDC pipeline followed by another one that reads from its output? (This portion is quite unclear to me.)
4. What would be good data ingestion options in this scenario? I am thinking of a pull-based architecture where Spark reads the data source at batch intervals. Is there a better approach?
5. Do I really need Redshift if the data source is just 1 TB? Would a DW be overkill at this point? Can I not just create structured tables in Databricks itself?
09-09-2021 01:57 PM
There is a lot in this question, so generally speaking I suggest you reach out to the sales team at Databricks. You can talk to a solutions architect who can get into more detail. Here are my general thoughts, having seen a lot of customer architectures:
Generally, you can do all of this in Databricks; you do not need to use Redshift or MySQL here. You can store the data and results in a Delta table and have your users query it. The new Databricks SQL is a SQL workspace where you can build dashboards or connect your other BI tools (Tableau, etc.). This replaces Redshift and has 1-2 second latency for queries. The great part of doing this is that you keep the Parquet files with your data on S3; you don't need to lock it into Redshift or MySQL formats, or replicate it.
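To make that concrete, here is a minimal PySpark sketch of writing a curated result to a Delta table backed by S3 and querying it in place instead of loading it into Redshift. The S3 paths, table names, and columns are placeholders I made up, not anything from your environment:

```python
# Minimal sketch: persist a curated result as a Delta table on S3 and query it directly.
# All paths, table names, and columns below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read an already-landed raw table from the lake.
orders = spark.read.format("delta").load("s3://my-lake/bronze/orders")

# Build a small aggregate that analysts would otherwise query in Redshift.
daily_revenue = orders.groupBy("order_date").agg(F.sum("amount").alias("revenue"))

# Write it as an external Delta table; Databricks SQL and BI tools query it where it lives.
(daily_revenue.write
    .format("delta")
    .mode("overwrite")
    .save("s3://my-lake/gold/daily_revenue"))

spark.sql(
    "CREATE TABLE IF NOT EXISTS daily_revenue "
    "USING DELTA LOCATION 's3://my-lake/gold/daily_revenue'"
)
spark.sql("SELECT * FROM daily_revenue ORDER BY order_date DESC LIMIT 10").show()
```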
As for PII and redaction, the best bet is going to be saving your data into a Delta table that has an external path on S3. Then you can manage permissions to the file location on S3. As for redacting and hiding just certain columns, you might need to wait for Unity Catalog to come out; I'm not sure that will be available just yet.
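Until column-level controls are available, one rough workaround (all paths, table, and column names here are illustrative assumptions) is to lock down the external S3 location and expose only a view that never selects the PII columns:

```python
# Sketch: full table lives in a restricted S3 prefix; analysts only get a PII-free view.
# Table, view, path, and column names are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# External Delta table on a locked-down prefix (access controlled via IAM/bucket policy).
spark.sql("""
    CREATE TABLE IF NOT EXISTS customers_raw
    USING DELTA
    LOCATION 's3://my-lake/restricted/customers'
""")

# The view simply omits the PII columns; downstream users query the view, not the table.
spark.sql("""
    CREATE OR REPLACE VIEW customers_redacted AS
    SELECT customer_id, country, signup_date, plan
    FROM customers_raw
""")
```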
08-23-2021 02:07 AM
A proven pattern is to first land your new/changed data in a data lake (ADLS, S3, ...), e.g. in Parquet format. For this you need an ETL tool that can either read from the CDC tool or pull the changes with queries (based on a timestamp or date column). The main goal is to get the new data into your data lake as fast as possible. This can be done with multiple tools (Databricks being one of them, or AWS Glue); see the sketch below for the query-based variant.
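A minimal PySpark sketch of that query-based extraction could look like this; the JDBC connection details, the `orders` table, and the `updated_at` watermark column are assumptions about your schema, and a real job would persist the watermark between runs:

```python
# Sketch of query-based incremental extraction: pull rows changed since the last run
# over JDBC and land them untouched in the lake. All connection details, table names,
# and the watermark handling are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

last_watermark = "2021-08-20 00:00:00"  # in practice, read this from a small control table

changed_rows = (
    spark.read.format("jdbc")
        .option("url", "jdbc:mysql://replica-host:3306/shop")
        .option("dbtable", f"(SELECT * FROM orders WHERE updated_at > '{last_watermark}') AS src")
        .option("user", "etl_user")
        .option("password", "***")
        .load()
)

# Land the raw extract as-is; transformation happens in a separate downstream pipeline.
changed_rows.write.mode("append").parquet("s3://my-lake/raw/orders/")
```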
So I would advise first having a pipeline that lands the raw data, and then a second one (and maybe even more) that processes that data. You could put the whole flow into one single pipeline, but the downside is that you would not have a clear distinction between extracting the raw data and actually processing it.
A good guideline is to think about which parts of the whole data flow should be able to run separately (multiple pipelines might use the same data, but extracting this data should not run multiple times, so the extraction is a separate step). Because you use incremental updates, Delta Lake is certainly a good option!
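For example, an incremental batch can be upserted into a Delta table with a merge; the paths and the `order_id` key in this sketch are hypothetical:

```python
# Sketch of applying an incremental batch of changes to a Delta table with MERGE,
# which is what makes Delta Lake convenient for CDC-style loads. Paths and key are made up.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

updates = spark.read.parquet("s3://my-lake/raw/orders/")            # freshly landed extract
target = DeltaTable.forPath(spark, "s3://my-lake/silver/orders")    # curated Delta table

# Update existing rows and insert new ones in a single atomic operation.
(target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```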
Concerning the PII data: IMO the easiest approach is to tackle this as soon as possible in the data flow, e.g. directly in the ETL tool (by means of a select query or a view that does not expose the PII data); see the sketch below. If you have no need for streaming data, I'd start with batch processing and get the hang of it (Databricks/Spark). If the need for streaming arises later on, you can migrate from batch to streaming.
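As a rough example of redacting early in the batch flow (column names invented for illustration), you can drop the PII columns right after extraction and hash any identifier you still need for joins:

```python
# Sketch: drop PII columns as soon as the data enters the pipeline and replace the
# natural key with a hash so it can still be joined on. Column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.parquet("s3://my-lake/raw/customers/")

redacted = (
    raw.drop("email", "phone", "ssn")
       .withColumn("customer_key", F.sha2(F.col("customer_id").cast("string"), 256))
       .drop("customer_id")
)

redacted.write.format("delta").mode("overwrite").save("s3://my-lake/silver/customers")
```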
Redshift or not... it depends. You can certainly serve data from Databricks itself (certainly with the new SQL analytical environment), so I think it comes down to price/performance.
I myself use Azure Synapse (comparable to Redshift), but I am quite sure the same could be achieved with Databricks, so we might change this in the future; it will depend on price/performance. Hope you can do something with this answer.