Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Delta Jira data import to Databricks

greengil
New Contributor III

We need to import a large amount of Jira data into Databricks, importing only the delta changes. What's the best approach: the Fivetran Jira connector, or developing our own Python scripts/pipeline code? Thanks.

4 REPLIES

Ashwin_DSA
Databricks Employee

Hi @greengil,

Have you considered Lakeflow Connect?  Databricks now has a native Jira connector in Lakeflow Connect that can achieve what you are looking for. It's in beta, but something you may want to consider. 

It ingests Jira into Delta with incremental (delta) loads out of the box, supports SCD1/SCD2, handles deletes via audit logs, and runs fully managed on serverless with Unity Catalog governance. This is lower-effort and better integrated than either Fivetran or custom Python, and directly addresses your large-volume, changes-only requirement.
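As a rough mental model of what SCD1 vs SCD2 means for an ingested issues table, here is a pure-Python sketch (illustrative only; the connector actually implements this with Delta MERGE, and the function and field names below are my own, not the connector's API):

```python
# SCD1: overwrite in place -- only the latest state of each issue survives.
# SCD2: keep history -- end-date the old version and append a new one.

def scd1_apply(table: dict, change: dict) -> None:
    """Upsert the change, discarding the previous state (no history)."""
    table[change["key"]] = change

def scd2_apply(history: list, change: dict, ts: str) -> None:
    """Close the currently-open version of the row, then append a new
    version stamped with valid_from/valid_to columns."""
    for row in history:
        if row["key"] == change["key"] and row["valid_to"] is None:
            row["valid_to"] = ts  # end-date the old version
    history.append({**change, "valid_from": ts, "valid_to": None})
```

With SCD2 you can answer "what was this issue's status in January?", which SCD1 cannot.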

If you can't use the Databricks Jira connector, prefer Fivetran Jira --> Databricks over custom code for a managed, low-maintenance ELT path. Only build custom Python pipelines if you have very specific requirements that neither managed option can meet.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

greengil
New Contributor III

Hi @Ashwin_DSA - Thank you for the information, appreciate it. Regarding the built-in Lakeflow Connect, I see that it will ingest all the Jira tables into Databricks. Is there a way to ingest only a subset of the data? For example, instead of all issues, I want only a subset. Thanks.

Ashwin_DSA
Databricks Employee

Hi @greengil,

Yes, you can restrict what Lakeflow Connect for Jira ingests, both at the table level and (partially) at the row level.

In the UI, on the Source step, you can select only the tables you care about (for example, just issues, or issues + projects) instead of all source tables. In DABs/API, only list the tables you want under objects.

The Jira connector supports filtering by Jira project/space via jira_options.include_jira_spaces (list of project keys). In the UI, this is exposed as an option to filter the data by Jira spaces or projects (you enter project keys, not names or IDs).
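To make the two knobs above concrete, a sketch of what the pipeline spec might look like in a Databricks Asset Bundle. Note this is illustrative only: the thread confirms the `objects` list and `jira_options.include_jira_spaces`, but the surrounding field names are my assumptions; check the Lakeflow Connect Jira docs for the exact spec.

```yaml
# Hypothetical DABs fragment: ingest only two tables, and only two projects.
ingestion_definition:
  objects:
    - table:
        source_table: issues
    - table:
        source_table: projects
  jira_options:
    include_jira_spaces:   # project KEYS, not names or IDs
      - PROJ
      - OPS
```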

If you are looking for anything more granular than project/space (e.g. specific issue types, statuses, or labels), that isn't supported as of now. The connector ingests all matching issues for those projects/spaces, and you then filter downstream in silver/gold tables. More general row-level filtering for Jira is on the backlog but not yet available.

Refer to the documentation pages on the Jira pipeline and its limitations.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

abhi_dabhi
Databricks Partner

Hi @greengil - good question. I went through something similar recently, so I'm sharing what I found.

My instinct was also to build it in Python, but once I dug in, the "just write a script" path hides a lot of pain:

  • Deletions are invisible. Jira's REST API doesn't return deleted issues. Without webhooks, you'll have ghost records in Delta forever.
  • Field history isn't free. The API gives you current state, not change history. Reporting usually needs history, which means building and maintaining it yourself.
  • Archived issues aren't returned in JQL queries, only by ID.
  • Rate limits, pagination, and schema drift for custom fields are all real work.
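To give a feel for the "just write a script" path, here is a minimal sketch of the incremental-pull core: a JQL query that captures only changes since the last sync, plus a pagination loop over Jira's startAt/maxResults scheme. Function names are illustrative, and this deliberately omits the hard parts listed above (deletes, field history, rate-limit backoff):

```python
from datetime import datetime

def build_incremental_jql(last_sync: datetime, project_keys=None) -> str:
    """JQL that fetches only issues updated since the last sync."""
    ts = last_sync.strftime("%Y-%m-%d %H:%M")
    jql = f'updated >= "{ts}" ORDER BY updated ASC'
    if project_keys:
        jql = f"project IN ({', '.join(project_keys)}) AND " + jql
    return jql

def paginate(fetch_page, page_size=100):
    """Iterate all issues via Jira's startAt/maxResults pagination.
    fetch_page is any callable wrapping the REST search endpoint."""
    start = 0
    while True:
        page = fetch_page(start_at=start, max_results=page_size)
        issues = page.get("issues", [])
        yield from issues
        start += len(issues)
        if not issues or start >= page.get("total", 0):
            break
```

Even this toy version needs a persisted `last_sync` watermark per table, which is exactly the bookkeeping the managed connectors do for you.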

Fivetran's Jira connector handles all of this natively: JQL-based incremental sync, webhook-based deletion capture, auto-populated ISSUE_FIELD_HISTORY tables, schema drift detection, and MERGE into Delta. It's available through Databricks Partner Connect for quick setup, and there's also a free dbt package (fivetran/dbt_jira) with pre-built analytics models.
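To see why deletion capture matters, a pure-Python sketch of what a sync with delete events does, versus a plain incremental pull that never learns about deletions (this is a mental model only, not Fivetran's actual Delta MERGE implementation; the `_deleted` soft-delete flag is my own convention):

```python
def apply_sync(table: dict, upserts: list, delete_keys: list) -> None:
    """Apply one sync batch: upsert changed rows, then soft-delete rows
    whose keys were reported by deletion webhooks. Without delete_keys,
    deleted issues would linger as ghost records forever."""
    for row in upserts:
        table[row["key"]] = row            # insert or update (SCD1-style)
    for key in delete_keys:
        row = table.get(key)
        if row is not None:
            row["_deleted"] = True         # soft delete keeps an audit trail
```

A hard delete (`del table[key]`) is the other option; soft deletes are common because downstream history tables can still join against the row.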

My take: go with Fivetran unless you have a specific reason not to (high-volume cost concerns, a need for archived issues, or data residency restrictions). Custom Python makes sense for narrow use cases, but it's weeks of build plus ongoing maintenance.

I did some research before landing on this conclusion; please take a look, I think you'll find it helpful.

Happy to dig in further if you're leaning one way.