cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Databrick select from web address that returns JSON

ozmike
New Contributor II
Hi I'm in a data bricks notebook and want to select from a web site that returns json. 
 
will return some JSON. (example only)
 
Can i do the following or Do you need to use python. (please give an example of py code if so)
Will DB notebooks allow you to do this with out opening anything up.

%sql
select
* from json.`http://ergast.com/api/f1/2004/1/results.json`
`
3 REPLIES 3

Stefan-Koch
Valued Contributor II

Hi @ozmike 

You could do this as example with wget on shell: 

%sh
wget -P /Volumes/demo/raw/files/ergast http://ergast.com/api/f1/2004/1/results.json

Then you can access the json file in a volume.

StefanKoch_0-1740547550570.png

The better way is to use python:

%pip install wget

import wget

url = "http://ergast.com/api/f1/2004/1/results.json"
destination_path = "/Volumes/demo/raw/files/my_results.json"
wget.download(url, out=destination_path)

 

Then you see the content again in the volume:

StefanKoch_1-1740547720860.png

Read the json content into a dataframe:

df = spark.read.json(destination_path)
display(df)

StefanKoch_2-1740547802740.png

You may want to flatten the strucure:

from pyspark.sql.functions import col, explode

df_flat = df.select(
    col("MRData.RaceTable.Races.season").alias("season"),
    col("MRData.RaceTable.Races.round").alias("round"),
    col("MRData.RaceTable.Races.url").alias("race_url"),
    col("MRData.RaceTable.Races.raceName").alias("race_name"),
    col("MRData.RaceTable.Races.Circuit.circuitId").alias("circuit_id"),
    col("MRData.RaceTable.Races.Circuit.url").alias("circuit_url"),
    col("MRData.RaceTable.Races.Circuit.circuitName").alias("circuit_name"),
    col("MRData.RaceTable.Races.Circuit.Location.lat").alias("circuit_lat"),
    col("MRData.RaceTable.Races.Circuit.Location.long").alias("circuit_long"),
    col("MRData.RaceTable.Races.Circuit.Location.locality").alias("circuit_locality"),
    col("MRData.RaceTable.Races.Circuit.Location.country").alias("circuit_country"),
    explode("MRData.RaceTable.Races.Results").alias("result")
).select(
    col("season"),
    col("round"),
    col("race_url"),
    col("race_name"),
    col("circuit_id"),
    col("circuit_url"),
    col("circuit_name"),
    col("circuit_lat"),
    col("circuit_long"),
    col("circuit_locality"),
    col("circuit_country"),
    col("result.number").alias("driver_number"),
    col("result.position").alias("driver_position"),
    col("result.positionText").alias("driver_position_text"),
    col("result.points").alias("driver_points"),
    col("result.Driver.driverId").alias("driver_id"),
    col("result.Driver.permanentNumber").alias("driver_permanent_number"),
    col("result.Driver.code").alias("driver_code"),
    col("result.Driver.url").alias("driver_url"),
    col("result.Driver.givenName").alias("driver_given_name"),
    col("result.Driver.familyName").alias("driver_family_name"),
    col("result.Driver.dateOfBirth").alias("driver_dob"),
    col("result.Driver.nationality").alias("driver_nationality"),
    col("result.Constructor.constructorId").alias("constructor_id"),
    col("result.Constructor.url").alias("constructor_url"),
    col("result.Constructor.name").alias("constructor_name"),
    col("result.Constructor.nationality").alias("constructor_nationality"),
    col("result.grid").alias("grid_position"),
    col("result.laps").alias("laps"),
    col("result.status").alias("status")
)

display(df_flat)

StefanKoch_3-1740547911744.png

May, you adapt the script and dynamically flatten the structure.

ozmike
New Contributor II

Hi Thanks
Can you do a select from this volume you have created ( from SQL not PY )?

SELECT * FROM JSON.'volume:xyz'

Also how do you get  shell  access from the GUI ? or - is this an admin thing.

 

Stefan-Koch
Valued Contributor II

yes, you could use SQL to read the data from the volume:

%sql
SELECT * FROM json.`/Volumes/demo/raw/files/ergast/my_results.json`

StefanKoch_0-1740549044779.png

You still have to flatten the result.

The thing with the shell was just an example. Wouldn't recommend that, just use python

 

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now