Hi @spjti,
Incorrect manager mappings in an employee hierarchy typically come from a few common root causes. Since the question is fairly open-ended, here is a checklist of the most likely issues and how to diagnose them in Databricks.
COMMON REASONS FOR INCORRECT MANAGER MAPPINGS
1. Stale or delayed source data: If the employee/manager data is being ingested from an HR system (Workday, SAP SuccessFactors, ADP, etc.), the mapping may be out of date because the pipeline has not picked up the latest changes. Check the last modified timestamp on your source records and compare it to the last successful pipeline run.
2. Self-referencing or circular references: A manager_id that points back to the same employee_id (or creates a cycle) will break hierarchy traversal. You can detect this with a simple query:
SELECT employee_id, manager_id
FROM employee_table
WHERE employee_id = manager_id;
3. NULL or missing manager_id values: Employees with a NULL manager_id who should have a manager assigned will appear unlinked. Only the top-level executive(s) should have a NULL manager. Check for unexpected NULLs:
SELECT employee_id, employee_name
FROM employee_table
WHERE manager_id IS NULL
ORDER BY employee_name;
4. Orphan references (dangling foreign keys): A manager_id value that does not match any employee_id in the table means the manager record is missing. Find orphans with:
SELECT e.employee_id, e.employee_name, e.manager_id
FROM employee_table e
LEFT JOIN employee_table m ON e.manager_id = m.employee_id
WHERE e.manager_id IS NOT NULL
AND m.employee_id IS NULL;
5. Duplicate employee records: If the source data contains duplicates for the same employee (e.g., multiple active records from different effective dates), a join can fan out and produce incorrect mappings. Deduplicate by selecting only the most recent record per employee:
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY employee_id
ORDER BY effective_date DESC
) AS rn
FROM employee_table
)
SELECT * FROM ranked WHERE rn = 1;
6. Data type mismatches: If employee_id is stored as an integer in one column but manager_id is stored as a string (or vice versa), joins can silently fail or produce unexpected results. Verify with:
DESCRIBE TABLE employee_table;
Make sure employee_id and manager_id share the same data type.
7. Incorrect join logic in the transformation: If you are building the hierarchy via a self-join or recursive CTE, double-check the join condition. The standard pattern is:
SELECT
e.employee_id,
e.employee_name,
m.employee_name AS manager_name
FROM employee_table e
LEFT JOIN employee_table m ON e.manager_id = m.employee_id;
BUILDING A FULL HIERARCHY WITH A RECURSIVE CTE
If you need to traverse multiple levels of the hierarchy (e.g., to find the chain of command), you can use a recursive common table expression in Databricks SQL:
WITH RECURSIVE hierarchy AS (
SELECT employee_id, employee_name, manager_id, 1 AS level
FROM employee_table
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.employee_name, e.manager_id, h.level + 1
FROM employee_table e
INNER JOIN hierarchy h ON e.manager_id = h.employee_id
)
SELECT * FROM hierarchy ORDER BY level, employee_name;
Recursive CTEs are supported on Databricks Runtime 13.3 LTS and above. Documentation:
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-cte
NEXT STEPS
If you can share more details about your setup, such as the source system, your table schema, and the specific incorrect mappings you are seeing, the community can give you a more targeted answer. In particular, it would help to know:
- Where the employee data is coming from (HR system, flat file, API)
- How the data is being ingested (batch, streaming, Lakeflow Connect)
- Whether the issue affects all employees or only a subset
- A sample of the incorrect rows (with any sensitive data masked)
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.