<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Bamboolib with databricks, low-code programming is now available on #databricks - Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/bamboolib-with-databricks-low-code-programming-is-now-available/m-p/36161#M26063</link>
    <description>&lt;PRE&gt;I have tried to load a parquet file using the bamboolib menu, and I am getting the error below that the path does not exist.&lt;BR /&gt;I can load the same file without any problem using Spark or pandas with the following path:&lt;BR /&gt;citi_pdf = pd.read_parquet(f'/dbfs/mnt/orbify-sales-raw/WideWorldImportersDW/Dimension_City_new.parquet', engine='pyarrow')&lt;BR /&gt;&lt;BR /&gt;Does it work already, or does it still have some bugs?&lt;/PRE&gt;&lt;PRE&gt;AnalysisException: [PATH_NOT_FOUND] Path does not exist: dbfs:/dbfs/mnt/orbify-sales-raw/WideWorldImportersDW/Dimension_City_new.parquet.



Full stack trace:
-----------------------------
Traceback (most recent call last):
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/bamboolib/helper/gui_outlets.py", line 346, in safe_execution
hide_outlet = execute_function(self, *args, **kwargs)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/bamboolib/setup/module_view.py", line 365, in open_parquet
df = exec_code(code, symbols=self.symbols, result_name=df_name)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/bamboolib/helper/utils.py", line 446, in exec_code
exec(code, exec_symbols, exec_symbols)
File "", line 1, in
File "/databricks/spark/python/pyspark/instrumentation_utils.py", line 48, in wrapper
res = func(*args, **kwargs)
File "/databricks/spark/python/pyspark/sql/readwriter.py", line 533, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/databricks/spark/python/pyspark/errors/exceptions.py", line 234, in deco
raise converted from None
pyspark.errors.exceptions.AnalysisException: [PATH_NOT_FOUND] Path does not exist: dbfs:/dbfs/mnt/orbify-sales-raw/WideWorldImportersDW/Dimension_City_new.parquet.&lt;/PRE&gt;</description>
    <pubDate>Thu, 29 Jun 2023 10:52:12 GMT</pubDate>
    <dc:creator>Palkers</dc:creator>
    <dc:date>2023-06-29T10:52:12Z</dc:date>
    <item>
      <title>Bamboolib with databricks, low-code programming is now available on #databricks Now you can prepare your databricks code without ... coding. Low code ...</title>
      <link>https://community.databricks.com/t5/data-engineering/bamboolib-with-databricks-low-code-programming-is-now-available/m-p/26527#M18560</link>
      <description>&lt;P&gt;Bamboolib with Databricks, low-code programming is now available on #databricks&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Now you can prepare your Databricks code without ... coding. A low-code solution is now available on Databricks. Install and import bamboolib to start (requires DBR 11 for Azure and AWS, 11.1 for GCP). You can install it with %pip or from the cluster settings -&amp;gt; “Libraries” tab:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Picture2"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1332i6769199CA5736BE4/image-size/large?v=v2&amp;amp;px=999" role="button" title="Picture2" alt="Picture2" /&gt;&lt;/span&gt;As we can see on the screen above, we have a few options:&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;read CSV files,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;read Parquet files,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;read a table from the metastore (I bet it will be the most popular option),&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;or use some example dataset with ***** data&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We will use the example Titanic dataset.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Picture3"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1334iD9EF47066FC03770/image-size/large?v=v2&amp;amp;px=999" role="button" title="Picture3" alt="Picture3" /&gt;&lt;/span&gt;Now we can perform transformations and actions using the wizard. 
Below we can see the auto-generated code:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="bamboolib"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1341i0B3FAE1E248457C3/image-size/large?v=v2&amp;amp;px=999" role="button" title="bamboolib" alt="bamboolib" /&gt;&lt;/span&gt;So, let’s assume that we select only two fields:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Picture4"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1347iB397608A46CEA619/image-size/large?v=v2&amp;amp;px=999" role="button" title="Picture4" alt="Picture4" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Put age into bins (0-10 years old, 10-20, 20-30, etc.):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Picture5"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1353i5E4D2FF60EF6FA9C/image-size/large?v=v2&amp;amp;px=999" role="button" title="Picture5" alt="Picture5" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Group by and see the result together with the code:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Picture6"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1345iCCC38AAC99DA2946/image-size/large?v=v2&amp;amp;px=999" role="button" title="Picture6" alt="Picture6" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now we can copy our code and use it in our projects. 
Remember to replace pandas with pandas on Spark so it will run in a distributed way.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;These are the example transformations available:&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Select or drop columns,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Filter rows,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Sort rows,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Group by and aggregate,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Join / merge,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Change data types,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Change names,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Find and replace,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Conditional replace / if else,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Change DateTime frequency,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Extract DateTime,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Move column,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Bin 
column,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Concatenate,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Pivot,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Unpivot,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Window functions,&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Plot creators&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks to the plot creator, we can visualize our data easily.&lt;/P&gt;&lt;P&gt;In the example below, we used a&amp;nbsp;bar plot.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Picture7"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1342i1EDBC25B46AFBD77/image-size/large?v=v2&amp;amp;px=999" role="button" title="Picture7" alt="Picture7" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The auto-generated code from the above example is shown below:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pandas as pd; import numpy as np
df = pd.read_csv(bam.titanic_csv)
&amp;nbsp;
# Step: Select columns
df = df[['Age', 'Sex']]
&amp;nbsp;
# Step: Bin column
df['Age'] = pd.cut(df['Age'], bins=[0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0], right=False, precision=0)
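# A quick aside on this binning step (a sketch with made-up sample ages, not
# from the original post): with right=False the bins are left-closed, i.e.
# [0, 10), [10, 20), ..., [70, 80), so an age equal to the last edge (80)
# falls outside every bin and becomes NaN.

```python
import pandas as pd

# Same bin edges as the generated code above, with right=False
ages = pd.Series([5.0, 10.0, 79.0, 80.0])
binned = pd.cut(
    ages,
    bins=[0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    right=False,
    precision=0,
)
# A passenger aged exactly 80 is not binned (NaN); extend the last
# edge if that matters for your data.
print(binned.isna().tolist())
```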
&amp;nbsp;
# Step: Group by and aggregate
df = df.groupby(['Age', 'Sex']).agg(Sex_count=('Sex', 'count')).reset_index()
&amp;nbsp;
# Step: Change data type of Age to String/Text
df['Age'] = df['Age'].astype('string')
&amp;nbsp;
import plotly.express as px
fig = px.bar(df, x='Age', y='Sex_count', color='Sex')
fig&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Oct 2022 13:22:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bamboolib-with-databricks-low-code-programming-is-now-available/m-p/26527#M18560</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-10-19T13:22:45Z</dc:date>
    </item>
    <item>
      <title>Re: Bamboolib with databricks, low-code programming is now available on #databricks Now you can prepare your databricks code without ... coding. Low code ...</title>
      <link>https://community.databricks.com/t5/data-engineering/bamboolib-with-databricks-low-code-programming-is-now-available/m-p/26528#M18561</link>
      <description>&lt;P&gt;@Hubert Dudek​&amp;nbsp;Informative article, thanks for creating &lt;/P&gt;</description>
      <pubDate>Wed, 19 Oct 2022 14:13:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bamboolib-with-databricks-low-code-programming-is-now-available/m-p/26528#M18561</guid>
      <dc:creator>karthik_p</dc:creator>
      <dc:date>2022-10-19T14:13:09Z</dc:date>
    </item>
    <item>
      <title>Re: Bamboolib with databricks, low-code programming is now available on #databricks Now you can prepare your databricks code without ... coding. Low code ...</title>
      <link>https://community.databricks.com/t5/data-engineering/bamboolib-with-databricks-low-code-programming-is-now-available/m-p/26529#M18562</link>
      <description>&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Wed, 19 Oct 2022 19:56:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bamboolib-with-databricks-low-code-programming-is-now-available/m-p/26529#M18562</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-10-19T19:56:30Z</dc:date>
    </item>
    <item>
      <title>Re: Bamboolib with databricks, low-code programming is now available on #databricks Now you can prep</title>
      <link>https://community.databricks.com/t5/data-engineering/bamboolib-with-databricks-low-code-programming-is-now-available/m-p/36161#M26063</link>
      <description>&lt;PRE&gt;I have tried to load a parquet file using the bamboolib menu, and I am getting the error below that the path does not exist.&lt;BR /&gt;I can load the same file without any problem using Spark or pandas with the following path:&lt;BR /&gt;citi_pdf = pd.read_parquet(f'/dbfs/mnt/orbify-sales-raw/WideWorldImportersDW/Dimension_City_new.parquet', engine='pyarrow')&lt;BR /&gt;&lt;BR /&gt;Does it work already, or does it still have some bugs?&lt;/PRE&gt;&lt;PRE&gt;AnalysisException: [PATH_NOT_FOUND] Path does not exist: dbfs:/dbfs/mnt/orbify-sales-raw/WideWorldImportersDW/Dimension_City_new.parquet.
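For reference: the doubled prefix in dbfs:/dbfs/... suggests the local fuse-mount path (/dbfs/mnt/...) was handed to the Spark reader, which addresses DBFS directly and prepends the dbfs: scheme itself. The /dbfs prefix only applies to local file APIs such as pandas. A minimal sketch of the conversion (helper name hypothetical, not part of bamboolib):

```python
def to_spark_path(fuse_path: str) -> str:
    # Strip the /dbfs fuse-mount prefix, which is only valid for local
    # file APIs (pandas, open, ...), before handing the path to spark.read.
    path = fuse_path
    if path.startswith("/dbfs/"):
        path = path[len("/dbfs"):]
    return "dbfs:" + path

print(to_spark_path("/dbfs/mnt/orbify-sales-raw/WideWorldImportersDW/Dimension_City_new.parquet"))
# dbfs:/mnt/orbify-sales-raw/WideWorldImportersDW/Dimension_City_new.parquet
```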



Full stack trace:
-----------------------------
Traceback (most recent call last):
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/bamboolib/helper/gui_outlets.py", line 346, in safe_execution
hide_outlet = execute_function(self, *args, **kwargs)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/bamboolib/setup/module_view.py", line 365, in open_parquet
df = exec_code(code, symbols=self.symbols, result_name=df_name)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/bamboolib/helper/utils.py", line 446, in exec_code
exec(code, exec_symbols, exec_symbols)
File "", line 1, in
File "/databricks/spark/python/pyspark/instrumentation_utils.py", line 48, in wrapper
res = func(*args, **kwargs)
File "/databricks/spark/python/pyspark/sql/readwriter.py", line 533, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/databricks/spark/python/pyspark/errors/exceptions.py", line 234, in deco
raise converted from None
pyspark.errors.exceptions.AnalysisException: [PATH_NOT_FOUND] Path does not exist: dbfs:/dbfs/mnt/orbify-sales-raw/WideWorldImportersDW/Dimension_City_new.parquet.&lt;/PRE&gt;</description>
      <pubDate>Thu, 29 Jun 2023 10:52:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bamboolib-with-databricks-low-code-programming-is-now-available/m-p/36161#M26063</guid>
      <dc:creator>Palkers</dc:creator>
      <dc:date>2023-06-29T10:52:12Z</dc:date>
    </item>
  </channel>
</rss>

