- 2185 Views
- 1 replies
- 0 kudos
I have 10+ columns and want to select distinct rows taking multiple columns into consideration. How can I achieve this using PySpark DataFrame functions?
Latest Reply
You can use dropDuplicates, which accepts the subset of columns to consider:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=distinct#pyspark.sql.DataFrame.dropDuplicates
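For example, a minimal sketch (the DataFrame and column names here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with a few of the 10+ columns.
df = spark.createDataFrame(
    [(1, "a", "x"), (1, "a", "y"), (2, "b", "z")],
    ["id", "col1", "col2"],
)

# Keep one row per distinct (id, col1) pair; values for the remaining
# columns come from an arbitrary surviving row in each group.
deduped = df.dropDuplicates(["id", "col1"])
deduped.show()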
- 12929 Views
- 4 replies
- 0 kudos
I am trying to display the HTML output, or read in an HTML file to display in a Databricks notebook, from pandas-profiling.
import pandas as pd
import pandas_profiling
df = pd.read_csv("/dbfs/FileStore/tables/my_data.csv", header='infer', parse_dates=Tru...
Latest Reply
What eventually worked for me was displayHTML(profile.to_html()) for pandas_profiling and displayHTML(profile.html) for spark_profiling.
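A minimal sketch of that approach, assuming the ProfileReport API from pandas_profiling (the file path comes from the question above):

import pandas as pd
import pandas_profiling

df = pd.read_csv("/dbfs/FileStore/tables/my_data.csv")
profile = pandas_profiling.ProfileReport(df)

# displayHTML is a Databricks notebook built-in that renders an HTML string.
displayHTML(profile.to_html())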
3 More Replies
- 4369 Views
- 4 replies
- 0 kudos
Editing notebooks on Databricks is rather cumbersome because it lacks a lot of features that IDEs like PyCharm have.
Another problem is that a Databricks notebook comes with some local state that is not present on my computer.
How can I edit notebooks ...
Latest Reply
The documentation is out for databricks-connect: https://docs.azuredatabricks.net/user-guide/dev-tools/db-connect.html
I've also written up about a few limitations I have found - some with workarounds: https://datathirst.net/blog/2019/3/7/databricks-co...
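A minimal sketch of the databricks-connect workflow (cluster details are placeholders you supply during configuration):

# In a local shell, not in a notebook:
#   pip install databricks-connect
#   databricks-connect configure   # prompts for workspace URL, token, cluster id

from pyspark.sql import SparkSession

# getOrCreate() picks up the databricks-connect settings, so this script,
# edited in a local IDE, executes against the remote Databricks cluster.
spark = SparkSession.builder.getOrCreate()
print(spark.range(10).count())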
3 More Replies
- 8898 Views
- 12 replies
- 0 kudos
Hi, I have files hosted on an Azure Data Lake Store which I can connect to from Azure Databricks, configured as per the instructions here. I can read JSON files fine; however, I'm getting the following error when I try to read an Avro file. spark.read.format("c...
Latest Reply
Taras's answer is correct. Because spark-avro is based on the RDD APIs, the properties must be set in the hadoopConfiguration options.
Please note these docs for configuration using the RDD API: https://docs.azuredatabricks.net/spark/latest/data-sou...
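A hedged PySpark sketch of putting ADLS Gen1 OAuth settings on the Hadoop configuration (the dfs.adls.oauth2.* keys follow the linked docs; the service-principal values and paths are placeholders):

# spark-avro goes through the RDD API, so the credentials must be set on
# the Hadoop configuration rather than only on the Spark session config.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
hconf.set("dfs.adls.oauth2.client.id", "<service-principal-client-id>")
hconf.set("dfs.adls.oauth2.credential", "<service-principal-secret>")
hconf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# "com.databricks.spark.avro" is the package-era source name; Spark 2.4+
# also accepts the built-in "avro" format.
df = spark.read.format("com.databricks.spark.avro").load("adl://<store>.azuredatalakestore.net/path/file.avro")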
11 More Replies
- 5599 Views
- 4 replies
- 0 kudos
I am trying to split my Date column, which is currently a string type, into 3 columns: Year, Month and Date. I use (PySpark):
split_date = pyspark.sql.functions.split(df['Date'], '-')
df = df.withColumn('Year', split_date.getItem(0))
df = df.wit...
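A completed sketch of that pattern (assuming the Date strings look like yyyy-mm-dd, and calling the third column Day to avoid clobbering the source column):

from pyspark.sql import functions as F

split_date = F.split(df["Date"], "-")
df = df.withColumn("Year", split_date.getItem(0))
df = df.withColumn("Month", split_date.getItem(1))
df = df.withColumn("Day", split_date.getItem(2))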
by
dan11
• New Contributor II
- 2462 Views
- 4 replies
- 1 kudos
Hello Databricks people, I started working with Databricks today. I have a SQL script which I developed with sqlite3 on a laptop. I want to port the script to Databricks. I started with two SQL statements: select count(prop_id) from prop0; del...
Latest Reply
Hey Dan, good to hear you're getting started with Databricks. This is not a limitation of Databricks; it's a restriction built into Spark itself. Spark is not a data store, it's a distributed computation framework. Therefore deleting data would be un...
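Since plain Spark has no DELETE, one common workaround is to filter out the unwanted rows and write the result somewhere new; a sketch with a hypothetical predicate:

from pyspark.sql import functions as F

df = spark.table("prop0")

# Keep everything that does NOT match the would-be DELETE condition.
kept = df.filter(F.col("prop_id").isNotNull())  # hypothetical condition

kept.write.mode("overwrite").saveAsTable("prop0_cleaned")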
3 More Replies
- 4059 Views
- 1 replies
- 0 kudos
I have two files, and I created two DataFrames, prod1 and prod2, out of them. I need to find the records with column names and values that do not match in both DataFrames.
id_sk is the primary key; all the columns are string datatype.
dataframe 1 (prod1)
id_...
Latest Reply
Use a full outer join in Spark SQL.
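A minimal sketch of that comparison (all column names except id_sk are hypothetical):

from pyspark.sql import functions as F

joined = prod1.alias("p1").join(prod2.alias("p2"), on="id_sk", how="full_outer")

# eqNullSafe treats two NULLs as equal, so rows missing from one side
# (all-NULL after the outer join) also surface as mismatches.
mismatches = joined.filter(~F.col("p1.col_a").eqNullSafe(F.col("p2.col_a")))
mismatches.show()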
- 13359 Views
- 2 replies
- 0 kudos
I am using markdown to include link URLs, with the below markdown syntax:
[link text](http://example.com)
The issue is that each time I click the linked text, it opens the URL in the same tab as the notebook. I want the URL to open in a new ta...
Latest Reply
Hi @Ariel Herrera,
You can just put an HTML anchor tag in a Databricks notebook cell. It will open a new tab when you click it.
Please try the example below. It works for me in databricks notebook.
%md <a href="https://google.com" target="_blank">google ...
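Completing that pattern, the whole cell would look something like this:

%md <a href="https://google.com" target="_blank">google</a>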
1 More Replies
- 14273 Views
- 1 replies
- 0 kudos
I have a table in HBase with 1 billion records. I want to filter the records based on a certain condition (by date).
For example:
Dataframe.filter(col(date) === todayDate)
The filter will be applied after all records from the table are loaded into me...
Latest Reply
Hello @senthil kumar​, to pass external values to the filter (or where) transformations you can use the "lit" function in the following way:
Dataframe.filter(col(date) == lit(todayDate))
Don't know if that helps. Be careful with the schema inferred by th...
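A runnable version of that sketch (the DataFrame and column names are hypothetical):

import datetime
from pyspark.sql import functions as F

today = datetime.date.today().isoformat()  # e.g. "2019-06-01"

# lit() wraps the Python value in a Column literal so it can be compared
# against the date column inside the filter.
filtered = df.filter(F.col("date") == F.lit(today))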
- 8826 Views
- 4 replies
- 0 kudos
Can someone please offer some insight - I've spent days trying to solve this issue.
We have the task of loading in hundreds of tab-separated text files encoded in UTF-16 little endian with a tab delimiter. Our organisation is an international one and...
Latest Reply
You can also always read in the file as a textFile, and then run a UTF-16 decoder/encoder library as a UDF on the text.
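As an alternative to the UDF route, Spark's CSV reader also accepts an encoding option; a sketch under the assumption that the files are UTF-16LE with a tab delimiter (the path is a placeholder):

df = (spark.read
      .option("sep", "\t")
      .option("encoding", "UTF-16LE")
      .option("header", "true")
      .csv("/mnt/raw/*.txt"))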
3 More Replies
by
Tamara
• New Contributor III
- 8967 Views
- 8 replies
- 1 kudos
I'd like to access a table on an MS SQL Server (Microsoft). Is it possible from Databricks?
To my understanding, the syntax is something like this (in a SQL Notebook):
CREATE TEMPORARY TABLE jdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS ( url...
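For the PySpark equivalent, a hedged sketch of reading a SQL Server table over JDBC (all connection details are placeholders):

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://<server>:1433;database=<db>")
      .option("dbtable", "dbo.<table>")
      .option("user", "<user>")
      .option("password", "<password>")
      .load())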
7 More Replies
- 10864 Views
- 2 replies
- 0 kudos
It happens that I am manipulating some data using Azure Databricks. Such data is in an Azure Data Lake Storage Gen1. I mounted the data into DBFS, but now, after transforming the data I would like to write it back into my data lake.
To mount the dat...
Latest Reply
I am new to Azure Databricks, and I am trying to write the DataFrame to a mounted ADLS file with the below command:
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header","true").csv("/mnt/<mount-name>")
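For what it's worth, .format("com.databricks.spark.csv") is redundant when you also call .csv(), since .csv() already selects the CSV source; a slightly more idiomatic sketch (mount name and output folder are placeholders):

(dfGPS.write
      .mode("overwrite")
      .option("header", "true")
      .csv("/mnt/<mount-name>/output"))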
1 More Replies
- 8251 Views
- 4 replies
- 0 kudos
I am getting the below error only with a large dataset (i.e. 15 TB compressed). If my dataset is small (1 TB) I am not getting this error.
It looks like it fails at the shuffle stage. The approximate number of mappers is 150,000.
Spark config: spark.sql.warehouse.dir hdfs:...
Latest Reply
@Satheessh Chinnusamy, how did you solve the above issue?
3 More Replies