I’ve spent the last few years jumping between insurance, healthcare, and retail, and I’ve come to a very painful conclusion: we should never have let humans type their own addresses into a text box. 😁
For a pet project, I’m currently looking at a dataset where the "City" column is a ghost town, but "Address Line 4" is a chaotic party of cities and counties. It’s making geographical analysis for healthcare accessibility basically impossible. You’re trying to map out deprived areas to actually help people, but the data is so scrambled it looks like everyone lives in a different dimension.
Retail seems to be the absolute Wild West of this. People just put things in whatever order their heart desires at 11 PM while buying shoes.
I’m curious to know how the rest of you are actually surviving this in your Databricks pipelines. Are we all just suffering through thousand-line SQL CASE statements that we update every time a new typo is invented? Or has anyone actually found a "magic bullet" tool or API that plays nice with Spark and doesn't cost a fortune? I’ve thought about throwing an LLM at it, but I’m worried it’ll just hallucinate everyone onto a private island.
What’s the most "creative" address you’ve had to clean up lately?
Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***