
Jobs and Pipeline input parameter

Ritesh-Dhumne — Tue, 07 Oct 2025 05:05:25 GMT

In Notebook 1, I want to extract all the files I have uploaded to the volume, and then in Notebook 2 perform basic transformations on every file, such as handling missing values and nulls. I also want to store the null/dirty records separately and a clean DataFrame separately, for all the files. I'm doing this with Jobs and Pipelines in the Community Edition.

Re: Jobs and Pipeline input parameter

BS_THE_ANALYST — Tue, 07 Oct 2025 07:34:50 GMT

Hey @Ritesh-Dhumne 👋,

Firstly, have you thought about moving across to the Free Edition instead of the Community Edition? https://docs.databricks.com/aws/en/getting-started/free-edition

Based on your query, here are a couple of options you could consider:

1. If you're considering setting up a Job with two Tasks, each as a notebook, then you could create a Job Parameter or a Task Parameter. I'd read up on these in the documentation; the docs are great: https://docs.databricks.com/aws/en/jobs/parameters

2. If you don't want to create a job, you could use the magic command %run, which lets you run another notebook from within your current notebook. You could configure some Widgets, which are basically your parameters 🙂.
https://docs.databricks.com/aws/en/notebooks/widgets#use-databricks-widgets-with-run
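
To make both options a bit more concrete, here's a minimal sketch of reading a parameter inside a notebook. The parameter name `volume_path`, its default value, and the helper `get_param` are placeholders I've made up for illustration; `dbutils.widgets.text` and `dbutils.widgets.get` are the standard widget calls. When the notebook runs as a Job task, a job or task parameter with the same name should override the widget default.

```python
def get_param(name: str, default: str) -> str:
    """Return a notebook parameter, falling back to `default` outside Databricks."""
    try:
        # `dbutils` is predefined in Databricks notebooks; in plain Python
        # the name does not exist and a NameError is raised.
        dbutils.widgets.text(name, default)  # create the widget with a default
        return dbutils.widgets.get(name)     # read the current value
    except NameError:
        # Not running inside a Databricks notebook (e.g. local testing).
        return default


# Hypothetical parameter name and path; replace with your own volume.
volume_path = get_param("volume_path", "/Volumes/my_catalog/my_schema/my_volume")
print(volume_path)
```

With %run, the widgets docs linked above show the value being passed on the same line, along the lines of `%run ./Notebook2 $volume_path="/Volumes/..."` (the notebook path here is just an example).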

Feel free to ask any follow up questions 💪.

Please keep us updated on the project, it sounds really cool! 🙂 To level up the project, perhaps consider using Notebook 1 to create a table rather than outputting to a file. This way, you can leverage Delta Lake and get things like lineage and table history. Lots of cool features to explore! 😀

All the best,
BS

Re: Jobs and Pipeline input parameter

szymon_dybczak — Tue, 07 Oct 2025 08:36:47 GMT

Hi @Ritesh-Dhumne ,

I'm assuming you meant Free Edition rather than Community Edition, since you're using volumes, which are not available in Community Edition.

I’m not sure if I’ve understood your approach correctly, but at first glance it seems incorrect - you can’t pass a DataFrame between tasks. What you can do is load all the files from the volume into a bronze table in Notebook1. You can use the special _metadata column to add information about the file_path from which each particular row originates. Here’s an example of how to use it:
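
A minimal sketch of such a load, assuming the volume holds CSV files with a header row and compatible columns. The helper names and the catalog, schema, volume and table names are placeholders of mine, so adjust them to your setup:

```python
def volume_dir(catalog: str, schema: str, volume: str) -> str:
    """Build the /Volumes/<catalog>/<schema>/<volume> path of a Unity Catalog volume."""
    return f"/Volumes/{catalog}/{schema}/{volume}"


def load_bronze(spark, source_dir: str, table_name: str) -> None:
    """Read every CSV file under `source_dir` and save the rows as one bronze table."""
    bronze_df = (
        spark.read.format("csv")
        .option("header", "true")   # assumption: files have a header row
        .load(source_dir)
        # Adds a `file_path` column holding the source file of each row.
        .select("*", "_metadata.file_path")
    )
    bronze_df.write.mode("overwrite").saveAsTable(table_name)


# In Notebook1, where `spark` is predefined (all names below are placeholders):
# load_bronze(
#     spark,
#     volume_dir("my_catalog", "my_schema", "my_volume"),
#     "my_catalog.my_schema.bronze_files",
# )
```

If your files are JSON or Parquet, or have different columns per file, the reader options would need to change accordingly.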

Then, in Notebook2, you can apply your transformations based on this bronze table. You can count nulls, handle dirty data, and benefit from the fact that you can relate all these issues to a particular file, since this information is added to the bronze table through the _metadata special column.
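
As a rough sketch of that second step, the snippet below splits the bronze table into clean and dirty rows. It assumes "dirty" simply means "at least one data column is NULL" and that the bronze table carries a `file_path` column taken from `_metadata`; the function names and table names are hypothetical, and you would likely add your own rules on top.

```python
def any_null_condition(columns) -> str:
    """Build a SQL predicate that is true when any of the given columns is NULL."""
    cols = list(columns)
    if not cols:
        raise ValueError("need at least one column to check")
    return " OR ".join(f"`{c}` IS NULL" for c in cols)


def split_clean_dirty(bronze_df, metadata_cols=("file_path",)):
    """Return (clean_df, dirty_df); metadata columns are ignored in the null check."""
    data_cols = [c for c in bronze_df.columns if c not in metadata_cols]
    condition = any_null_condition(data_cols)
    dirty_df = bronze_df.filter(condition)             # rows with at least one NULL
    clean_df = bronze_df.filter(f"NOT ({condition})")  # rows with no NULLs
    return clean_df, dirty_df


# In Notebook2, where `spark` is predefined (table names are placeholders):
# bronze_df = spark.table("my_catalog.my_schema.bronze_files")
# clean_df, dirty_df = split_clean_dirty(bronze_df)
# dirty_df.groupBy("file_path").count().show()  # dirty rows per source file
# clean_df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.silver_clean")
# dirty_df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.silver_dirty")
```

Because both outputs keep `file_path`, you can still trace every clean or dirty record back to the file it came from.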

From what I can see, you're still learning, so I won't introduce Auto Loader, which is pretty handy for ingesting files 🙂