Hi, I have multiple datasets in my data lake that feature valid_from and valid_to columns indicating the validity of rows. If a row is currently valid, this is indicated by valid_to = 9999-12-31 00:00:00. Example: Loading this into a Spark dataframe works fine...
Currently, out-of-bounds timestamps are not supported in PyArrow/pandas. Please refer to the associated JIRA issue below. https://issues.apache.org/jira/browse/ARROW-5359?focusedCommentId=17104355&page=com.atlassian.jira.plugin.system.issuetabpanels%3...
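A common workaround (a sketch, not an official fix) is to remap the 9999-12-31 sentinel to a value inside pandas' representable range before converting, since pandas' nanosecond-resolution Timestamp cannot go past 2262-04-11. The replacement value below is an arbitrary choice:

```python
from datetime import datetime

# pandas/Arrow store timestamps as 64-bit nanoseconds since the epoch,
# so the largest representable instant is 2262-04-11. The 9999-12-31
# "currently valid" sentinel therefore overflows on toPandas().
SENTINEL = datetime(9999, 12, 31)
SAFE_MAX = datetime(2262, 1, 1)   # arbitrary in-range replacement

def cap_valid_to(ts: datetime) -> datetime:
    """Replace the out-of-range sentinel with an in-range maximum."""
    return SAFE_MAX if ts >= SENTINEL else ts
```

In PySpark the same remapping can be done column-wise on `valid_to` with `F.when(...)` before calling `toPandas()`.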
When running a jar-based job, I've noticed that the first run always takes extra time to complete and subsequent runs finish faster. This behavior is reproducible on an interactive cluster. What's causing this? Is this e...
@Sandeep Katta​, this is a fat jar that does read-transform-write. @DD Sharma​'s response matches @Werner Stinckens​'s and my intuition that the second job was more efficient because the jar was already loaded. I would not have noticed this had the job run...
The quality of the book depends on the audience, IMO. For people who have no background in data warehousing, it will be interesting to read. For others, the book is too general and descriptive; the 'HOW do you do x' is missing.
I wanted to mount ADLS Gen2 on Databricks and take advantage of the abfss driver, which should be better for large analytical workloads (is that even true in the context of Databricks?). Setting up OAuth is a bit of a pain, so I wanted to take the simpler approac...
Hi @Fernando Mendez​, the document below will help you mount ADLS Gen2 using abfss: https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-get-started.html Could you please check if this helps?
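For reference, the OAuth mount from that guide looks roughly like the sketch below. The `<...>` values are placeholders, and `dbutils` only exists on a cluster, so the mount call is shown as a comment:

```python
# Hadoop ABFS OAuth settings as shown in the linked guide; fill in the
# <...> placeholders with your own application id, secret, and tenant.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": "<service-credential>",
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}

def abfss_url(container: str, account: str) -> str:
    """Build the abfss:// source URL for a container in a storage account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/"

# On a Databricks cluster you would then mount it like:
# dbutils.fs.mount(source=abfss_url("<container>", "<storage-account>"),
#                  mount_point="/mnt/<mount-name>", extra_configs=configs)
```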
Scenario: I tried to run notebook_primary as a job with the same parameter map. This notebook is an orchestrator for notebooks_sec_1, notebooks_sec_2, notebooks_sec_3, and so on. I run them with the dbutils.notebook.run(path, timeout, arguments) function. So ho...
@Balbir Singh​, I'm a newbie in Databricks, but the manual says you can use a Python cell and transfer variables to a Scala cell via temp tables. https://docs.databricks.com/notebooks/notebook-workflows.html#pass-structured-data
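The same page also shows returning structured data from a child notebook as a JSON string. A minimal sketch of that pattern, with the `dbutils` calls shown as comments since they only exist on a cluster:

```python
import json

# Child notebook: serialize structured data to a JSON string.
payload = {"status": "OK", "rows_written": 123}
returned = json.dumps(payload)   # child would call: dbutils.notebook.exit(returned)

# Parent notebook: parse the string that dbutils.notebook.run returned.
result = json.loads(returned)    # parent: json.loads(dbutils.notebook.run(...))
```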
Please refer to the references below for switching to DBR 7.x. We have extended our DBR 6.4 support until December 2021. DBR 6.4 extended support release notes: https://docs.databricks.com/release-notes/runtime/6.4x.html Migration guide to DBR 7.x: htt...
I have a set of pre-processing stages in an sklearn `Pipeline` and an estimator which is a `KerasClassifier` (`from tensorflow.keras.wrappers.scikit_learn import KerasClassifier`). My overall goal is to tune and log the whole sklearn pipeline in `mlflo...
I have two tables that are exactly the same (rows and schema). One table resides in an Azure SQL Server database and the other in a Snowflake database. Now we have some existing code that we want to migrate from Azure SQL to Snowflake, but when we try to create a panda...
The above screenshot is from an AWS Databricks cluster. Similarly, in Azure Databricks, is there a specific way to determine how many worker nodes are using spot instances versus on-demand instances while a job is running or after it has completed? Likewise, ...
Hello! My name is Piper and I'm one of the community moderators. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Otherwise, I will follow up with the team. Thanks for your p...
Hello everyone, I am facing a performance issue while calculating cosine similarity in PySpark on a dataframe with around 100 million records. I am trying to do a cross self-join on the dataframe to calculate it. The executors all have the same number ...
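For context, cosine similarity between two vectors is their dot product divided by the product of their norms. A plain-Python reference implementation (the PySpark version computes the same quantity per pair, which is why the cross join is so expensive at 100M rows):

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```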
Is there a way to hash the record attributes so that the cartesian join can be avoided? I work on record similarity and fuzzy matching, and we use a learning-based blocking algorithm which hashes the records into smaller buckets, and then the hashes are ...
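A toy illustration of the blocking idea (the key function here is an invented stand-in, not the learned hashing described above): records are hashed into buckets, and pairwise comparison happens only within a bucket, so the full cartesian product shrinks to the sum of per-bucket products:

```python
from collections import defaultdict
from itertools import combinations

def block_key(record: str) -> str:
    # Invented stand-in key: first letter plus a coarse length bucket.
    # Real blocking systems use learned or LSH-style keys instead.
    return f"{record[0].lower()}-{len(record) // 4}"

def candidate_pairs(records):
    """Yield pairs only within a block, avoiding the full cross join."""
    buckets = defaultdict(list)
    for r in records:
        buckets[block_key(r)].append(r)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)

pairs = list(candidate_pairs(["alice", "alicia", "bob", "bobby"]))
# Trade-off: "bob" and "bobby" land in different buckets with this crude
# key, so that candidate pair is missed; recall depends on the key quality.
```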
Hello all, I'm trying to pull table data from Databricks tables that contain foreign-language characters in UTF-8 into an ETL tool using a JDBC connection. I'm using the latest Simba Spark JDBC driver available from the Databricks website. The issue i...
Can you try setting UseUnicodeSqlCharacterTypes=1 in the driver, making sure 'file.encoding' is set to UTF-8 in the JVM, and seeing if the issue still persists?
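A sketch of how those two settings might be wired up when building the connection string from Python. The host and HTTP path are placeholders; only `UseUnicodeSqlCharacterTypes` comes from the suggestion above:

```python
# Placeholder endpoint values. The JVM consuming the driver should also
# be started with -Dfile.encoding=UTF-8, per the suggestion above.
base = "jdbc:spark://<workspace-host>:443/default"
options = {
    "transportMode": "http",
    "ssl": "1",
    "httpPath": "<http-path>",
    "UseUnicodeSqlCharacterTypes": "1",  # report Unicode SQL types (e.g. NVARCHAR)
}
jdbc_url = base + ";" + ";".join(f"{k}={v}" for k, v in options.items())
```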
Hi Team, I was wondering if there is a document or step-by-step process to promote code in CI/CD across various environments of a code repository (Git/GitHub/Bitbucket/GitLab) with Databricks support? [Without involving the code repository's merging capability of the ...
Please refer to this related thread on CI/CD in Databricks: https://community.databricks.com/s/question/0D53f00001GHVhMCAX/what-are-some-best-practices-for-cicd
The differences are as follows:
- Pig operates on the client side of a cluster, whereas Hive operates on the server side.
- Pig uses the Pig Latin language, whereas Hive uses the HiveQL language.
- Pig is a procedural data-flow language, whereas Hive is a ...