Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Jobs and Pipeline input parameter

Ritesh-Dhumne
New Contributor II

I want to extract all the files from a volume I have uploaded in Notebook 1, and then in Notebook 2 perform basic transformations on every file, such as handling missing values and nulls. I also want to store the null/dirty records separately and keep a clean DataFrame separately for all the files. This is in Community Edition, using Jobs and Pipelines.

2 REPLIES

BS_THE_ANALYST
Esteemed Contributor II

Hey @Ritesh-Dhumne  👋,

Firstly, have you thought about moving across to the Free Edition instead of the Community Edition? https://docs.databricks.com/aws/en/getting-started/free-edition 

Based on your query, here are a couple of options you could consider:

1. If you're considering setting up a Job with two Tasks, each as a notebook, then you could create a Job Parameter or a Task Parameter. I'd have a read up on these in the documentation, the docs are great: https://docs.databricks.com/aws/en/jobs/parameters (there's a small sketch of reading a parameter below this list)

[screenshot: BS_THE_ANALYST_1-1759821809948.png]

2. If you didn't want to create a job, you could make use of the magic command %run, which allows you to run another notebook from within your current notebook. You could configure some widgets, which are basically your parameters 🙂 (see the second sketch below).
https://docs.databricks.com/aws/en/notebooks/widgets#use-databricks-widgets-with-run

[screenshot: BS_THE_ANALYST_0-1759821702054.png]
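Here's a minimal sketch of option 1 (my own illustration, not from the docs), assuming a hypothetical Job/Task parameter named volume_path that the notebook reads through a widget:

```python
# Notebook 1: read a Job/Task parameter via a widget.
# "volume_path" is a hypothetical parameter name you'd define on the Job or Task.
dbutils.widgets.text("volume_path", "/Volumes/main/default/my_volume")  # default for interactive runs
volume_path = dbutils.widgets.get("volume_path")

# List the files currently sitting in the volume.
files = [f.path for f in dbutils.fs.ls(volume_path)]
print(files)
```

And a sketch of option 2, assuming Notebook 2 lives at a hypothetical path and defines an input_path widget that Notebook 1 fills in via %run:

```python
# Notebook 2 (hypothetical path /Workspace/Users/you/Notebook2) defines and reads a widget:
dbutils.widgets.text("input_path", "")
input_path = dbutils.widgets.get("input_path")
df = spark.read.option("header", True).csv(input_path)

# Notebook 1 then calls it in a cell of its own, passing the widget value:
# %run /Workspace/Users/you/Notebook2 $input_path="/Volumes/main/default/my_volume/file1.csv"
```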

Feel free to ask any follow up questions 💪.

Please keep us updated on the project, it sounds really cool! 🙂 To level up the project, perhaps consider using Notebook 1 to create a table rather than outputting to a file. This way, you can leverage Delta Lake and get features like lineage and table history. Lots of cool features to explore! 😀
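For example, just a sketch with hypothetical table and DataFrame names, writing to a managed Delta table instead of a file could look like this:

```python
# Write the cleaned DataFrame (hypothetical variable clean_df) to a managed Delta table.
(clean_df.write
    .mode("overwrite")
    .saveAsTable("main.default.clean_records"))

# Afterwards you can inspect the table history, e.g. in SQL:
# DESCRIBE HISTORY main.default.clean_records;
```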

All the best,
BS

szymon_dybczak
Esteemed Contributor III


Hi @Ritesh-Dhumne ,

I'm assuming you mistakenly referred to the Free Edition as Community Edition, since you're using volumes, which aren't available in Community Edition.

I’m not sure if I’ve understood your approach correctly, but at first glance it seems incorrect - you can’t pass a DataFrame between tasks. What you can do is load all the files from the volume into a bronze table in Notebook1. You can use the special _metadata column to add information about the file_path from which each particular row originates. Here’s an example of how to use it:

[screenshot: szymon_dybczak_0-1759826052216.png]
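In plain code, the idea is roughly this (my own sketch assuming CSV files and example catalog/schema names, not a transcription of the screenshot above):

```python
from pyspark.sql import functions as F

# Read every file in the volume (assuming CSVs here) and keep the source path
# from the _metadata column so each row can be traced back to its file.
raw_df = (spark.read
    .option("header", True)
    .csv("/Volumes/main/default/my_volume/")          # example volume path
    .select("*", F.col("_metadata.file_path").alias("source_file")))

raw_df.write.mode("overwrite").saveAsTable("main.default.bronze_files")
```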


Then, in Notebook2, you can apply your transformations based on this bronze table. You can count nulls, handle dirty data, and benefit from the fact that you can relate all these issues to a particular file, since this information is added to the bronze table through the _metadata special column.
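A minimal sketch of that step, reusing the example bronze table and source_file column from above (table names are just placeholders):

```python
from pyspark.sql import functions as F

bronze = spark.table("main.default.bronze_files")
data_cols = [c for c in bronze.columns if c != "source_file"]

# Treat a row as "dirty" if any data column is null.
is_dirty = F.lit(False)
for c in data_cols:
    is_dirty = is_dirty | F.col(c).isNull()

dirty_df = bronze.filter(is_dirty)
clean_df = bronze.filter(~is_dirty)

# Null counts per column, broken down by the file each row came from.
null_counts = bronze.groupBy("source_file").agg(
    *[F.sum(F.col(c).isNull().cast("int")).alias(f"{c}_nulls") for c in data_cols]
)

dirty_df.write.mode("overwrite").saveAsTable("main.default.dirty_records")
clean_df.write.mode("overwrite").saveAsTable("main.default.clean_records")
null_counts.display()
```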

From what I see, you're still in the learning process, so I won't introduce the concept of Auto Loader, which is pretty handy for ingesting files 🙂