Using AutoML to predict completion dates of a project management dataset

ProtonMix
New Contributor II

Hello! I am fairly new to Databricks. I'm trying to do a proof of concept with AutoML in Databricks at my organization, and the dataset I am using is a project management dataset. Here's a sample:

 

project_id | market | general_contractor | project_type | permit_date | permit_status | construction_date | construction_status | completion_date | completion_status
project_1 | NY | acme inc | rehab | 2/1/2024 | complete | 3/1/2024 | projected | 4/1/2024 | projected
project_2 | LA | xyz inc | build to suit | 1/1/2020 | complete | 2/2/2023 | complete | 3/4/2023 | complete

So based on this dataset, I want to see how I can shorten the time it takes to reach completion_date. For example, if I use acme inc in LA, will that bring my completion date forward, and if so, by how much? Or, if I move my permit_date 2 days earlier, how big an impact will that have on completion_date? Of course, I can only rely on historical data, so all the status fields must be set to "complete".

How do I go about doing this? Also, is there a way to output the result so that stakeholders can analyze it in a visual tool like Tableau or Power BI?

Thanks!

 

2 REPLIES

Kaniz
Community Manager

Hi @ProtonMix, Let’s break down your requirements and tackle them step by step.

  1. Reducing Completion Date Period:

    • To understand how different factors impact the completion date, you can use regression analysis. Specifically, you want to predict how long a project takes to complete from other features like market, general contractor, and project type.
    • Since you’re relying on historical data, filter your dataset to include only records with a “complete” status for all relevant fields (permit, construction, and completion).
    • Create a new numeric column that represents the time difference between permit date and completion date (i.e., the duration of the project in days).
    • Train a regression model using AutoML in Databricks. Configure the target variable as this duration column, because a regression target must be numeric and a raw date is not. Use market, general contractor, and project type as input features, and leave the completion date itself out of the inputs, since the duration is computed from it and including it would leak the answer to the model.
    • The model will provide insights into how each factor is associated with the duration, and therefore with the completion date. For example, you can compare the predicted duration for the same project under different general contractors, or see how the predicted completion date (permit date plus predicted duration) moves when the permit date is adjusted.
  2. Visualizing Results for Stakeholders:

    • Once you have the model, you can visualize the results using tools like Tableau or Power BI.
    • Export the model predictions along with the original dataset (including features and actual completion dates).
    • In Tableau or Power BI:
      • Create a scatter plot or line chart showing the predicted completion dates against the actual completion dates.
      • Use filters to explore specific scenarios (e.g., using acme inc in LA, adjusting permit dates).
      • Add tooltips to display additional information (e.g., market, general contractor).
      • Stakeholders can interact with the visualizations to understand the impact of different factors on completion dates.
  3. Using Databricks AutoML:

    • To train a regression model using AutoML in Databricks, follow these steps:
      • In the sidebar, select New > AutoML Experiment.
      • Configure the AutoML experiment by specifying the dataset, problem type (regression), target column (the numeric duration column derived from the completion date), and other relevant settings.
      • Run the experiment and monitor the results.
      • Once you have the best model, export the predictions for visualization.
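
The data preparation described in step 1 can be sketched in plain Python as follows. This is only an illustration using the two sample rows from the question: the helper name `build_training_rows`, the output column name `duration_days`, and the month/day/year date format are assumptions, not something the original post specifies.

```python
from datetime import datetime

# Sample rows copied from the question; column names follow the original post.
rows = [
    {"project_id": "project_1", "market": "NY", "general_contractor": "acme inc",
     "project_type": "rehab",
     "permit_date": "2/1/2024", "permit_status": "complete",
     "construction_date": "3/1/2024", "construction_status": "projected",
     "completion_date": "4/1/2024", "completion_status": "projected"},
    {"project_id": "project_2", "market": "LA", "general_contractor": "xyz inc",
     "project_type": "build to suit",
     "permit_date": "1/1/2020", "permit_status": "complete",
     "construction_date": "2/2/2023", "construction_status": "complete",
     "completion_date": "3/4/2023", "completion_status": "complete"},
]

STATUS_COLUMNS = ("permit_status", "construction_status", "completion_status")
DATE_FORMAT = "%m/%d/%Y"  # assumption: dates are written month/day/year


def parse_date(text):
    """Parse a date string such as '3/4/2023' into a date object."""
    return datetime.strptime(text, DATE_FORMAT).date()


def build_training_rows(rows):
    """Keep fully completed projects and add a numeric duration_days column."""
    training = []
    for row in rows:
        # Skip projects where any milestone is still only projected.
        if not all(row[col] == "complete" for col in STATUS_COLUMNS):
            continue
        duration = parse_date(row["completion_date"]) - parse_date(row["permit_date"])
        training.append({
            "project_id": row["project_id"],
            "market": row["market"],
            "general_contractor": row["general_contractor"],
            "project_type": row["project_type"],
            "duration_days": duration.days,  # numeric regression target
        })
    return training


training_rows = build_training_rows(rows)
print(training_rows)
```

Only project_2 survives the filter, because project_1 still has projected milestones. In a Databricks notebook, a table prepared this way could then be passed to AutoML, for example through `databricks.automl.regress` with `target_col` set to the duration column; the exact settings depend on your workspace and data.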

Remember to adjust the experiment settings (e.g., stopping conditions) based on your dataset size an...1. Good luck with your proof of concept! 🚀
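
As a rough sketch of the export mentioned in step 2, the snippet below writes actual and predicted durations side by side as CSV text, which Tableau or Power BI can read. The `predicted_days` value is a made-up placeholder standing in for model output, and the column names and the `to_csv_text` helper are hypothetical.

```python
import csv
import io

# Hypothetical scored rows: "predicted_days" is a placeholder for model output
# and is invented purely to show the file layout.
scored_rows = [
    {"project_id": "project_2", "market": "LA", "general_contractor": "xyz inc",
     "actual_days": 1158, "predicted_days": 1100.0},
]


def to_csv_text(rows):
    """Return CSV text with one row per project plus a prediction error column."""
    fieldnames = ["project_id", "market", "general_contractor",
                  "actual_days", "predicted_days", "error_days"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        out = dict(row)
        # Positive error means the model predicted a longer project than actual.
        out["error_days"] = row["predicted_days"] - row["actual_days"]
        writer.writerow(out)
    return buffer.getvalue()


csv_text = to_csv_text(scored_rows)
print(csv_text)
```

Keeping market and general contractor in the export is what lets stakeholders filter the predicted-versus-actual chart by scenario in the BI tool.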

 

ProtonMix
New Contributor II

Hello Kaniz, thank you so much for your reply!! I am trying to follow your steps, but I cannot seem to select Completion Date as my target. It is only showing SYS_ID's (which are numeric in nature):

[Image: DBricks1.jpg]

What should I add to "Time Column for Training/Validation/Testing Split"? Is that where completion_date goes?

One more question if you don't mind. On the right side, it lists all the columns that I have available with an "Impute with" function:

[Image: DBricks2.jpg]

Is this where I select which columns I need in my dataset? I was not sure what "Impute with" means here.

I appreciate all your help. Thank you so much 🙂 🙂 
