Databricks Community

Dan_Z · 11-07-2024

Echoing what we said in Part 1: Test Data Curation, when your team is migrating script code and pipeline code to Databricks, there are three main steps: (1) look at the original code and understand what it's doing; (2) convert the code to run on Databricks (conve...

Dan_Z · 06-28-2024

Introduction to the Series · Overview of Part 1: Test Data Curation · Solution Approach · Required · Considerations · Script Modification · Parsing all touched tables · Parsing modified tables · Augment Scripts · Job script modification · Edge cases handled · Re-use of scripts · Dealin...

Dan_Z · 05-17-2024

Introduction · Requirements of a great historical data load · Options · Solution Overview · Types of Activities · Pipeline Parameters · Performance · Activity Details · Copy activity · Load to tables · Validate tables · Optimize tables · General Considerations

Introduction: When migrati...

Dan_Z · 05-13-2024

Introduction · Logging in Azure Data Factory and Databricks Notebooks · Solution Requirements · Proposed Solution · Custom Logging Package · Set Up · Prerequisites · Steps · Usage · ADF Activity Parameters · Databricks Notebook Widgets · Example Log Analytics Query · Conclusion

Introdu...

Dan_Z · 10-22-2021

mapInPandas is one of the most powerful Spark functions. It uses an Arrow-like in-memory data structure to split Spark DataFrames into chunks and feed them to a function that takes pandas DataFrames as input and output. Check it out here: https://spa...
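For anyone who wants to see the shape of this, here is a minimal sketch, not taken from the original post. The function passed to `mapInPandas` receives an iterator of pandas DataFrames (one per chunk) and yields pandas DataFrames back. The names `keep_adults` and `run_on_spark` and the `id`/`age` columns are made up for illustration, and `run_on_spark` assumes you already have a SparkSession.

```python
from typing import Iterator

import pandas as pd


def keep_adults(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Receive pandas DataFrames one chunk at a time and yield filtered chunks."""
    for pdf in batches:
        # Ordinary pandas code runs on each chunk independently.
        yield pdf[pdf["age"] >= 18]


def run_on_spark(spark):
    """Apply keep_adults via mapInPandas; `spark` is an existing SparkSession (assumed)."""
    df = spark.createDataFrame([(1, 21), (2, 15), (3, 40)], ["id", "age"])
    # The schema argument describes the DataFrames that keep_adults yields.
    return df.mapInPandas(keep_adults, schema="id long, age long")


# The function is plain Python, so it can be checked locally without a cluster.
local_chunks = [
    pd.DataFrame({"id": [1, 2], "age": [21, 15]}),
    pd.DataFrame({"id": [3], "age": [40]}),
]
result = pd.concat(list(keep_adults(iter(local_chunks))), ignore_index=True)
```

Because the function only sees one chunk at a time, anything that needs all rows of a group together would be a better fit for a grouped API than for this pattern.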

Dan_Z · 05-04-2022

Hello @Cristobal Berger, I could not reproduce this using DBR 10; I think you may be doing something wrong.

Dan_Z · 05-04-2022

Hey @Ben Ben, Spark-XML is not a package maintained by Databricks, and it seems the community doesn't have any input here. I'd suggest you reach out to the package maintainers via an Issue on their GitHub here: https://github.com/databricks/sp...

Dan_Z · 05-04-2022

Yes, dbutils is only available on the driver.

Dan_Z · 05-04-2022

@Drew Ringo, what's happening here is that the directory is so large that the second batch has to do a full scan, which takes time. That scan should be parallelized in DBR 9.1+. I think what you need is IncrementalListing in your directory. I...

Dan_Z · 05-04-2022

@Franklin George, honestly, there is no easy way to do this. Your only option is to set up cluster log delivery, which will give you access to the cluster's event log file. This event log file is JSON and contains all of the info that the SparkUI u...

User Activity

Data Migration Decoded - Part 2: Creating and Running Tests

Data Migration Decoded - Part 1: Crafting Test Data for Automated Validation

A guide to quick and scalable historical data loads into Databricks using Azure Data Factory

Unified Logging for Databricks Notebooks and ADF with Azure Log Analytics

spark.apache.org

Re: SELECT * FROM delta doesn't work on Spark 3.2

Re: Databricks Spark XML parser : support for namespace declared at the ancestor level.

Re: Error executing mlflow's mlflow.register_model() in applyInPandas to register Model to another workspace.

Re: Spark streaming autoloader slow second batch - checkpoint issues?

Re: How to programmatically get the Spark Job ID of a running Spark Task?