To handle a large SQL table (~6 million rows) and convert it into an R dataframe without splitting it into smaller subsets, you can use more efficient strategies and tools that are optimized for large datasets. Here are some recommendations:
**1. Use `dbplyr` for Lazy Loading**
Instead of loading the entire dataset into R memory, you can use the `dbplyr` package to interact with the SQL table directly. This allows you to perform operations on the database without fully loading the data into R. The `collect()` function can then be used to retrieve only the required subset.
```R
library(dplyr)
library(dbplyr)
# Connect to the database
con <- DBI::dbConnect(
  odbc::odbc(),
  Driver   = "SQL Server",
  Server   = "your_server",
  Database = "your_database",
  UID      = "your_username",
  PWD      = "your_password"
)
# Reference the SQL table
sql_table <- tbl(con, "final_lab_mstr")
# Perform operations directly on the database
filtered_data <- sql_table %>%
  filter(some_condition) %>%   # Apply filters if needed
  select(column1, column2)     # Select specific columns if possible
# Collect data into an R dataframe when ready
df <- collect(filtered_data)
```
This approach minimizes memory usage by pushing computations to the database and only pulling the required data into R.
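You can confirm that the filtering and column selection are being pushed down by printing the SQL that `dbplyr` generates before collecting:
```R
# Show the translated SQL; no rows are transferred at this point
filtered_data %>% show_query()
```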
**2. Optimize Memory Usage with `data.table`**
If you need to load the entire dataset into R, consider using the `data.table` package, which is highly optimized for handling large datasets in memory.
```R
library(data.table)
# Convert SQL query result to a data.table directly
df <- setDT(collect(tbl(con, "final_lab_mstr")))
```
The `data.table` package is faster and more memory-efficient than base R data frames.
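As an illustration, grouped summaries on a keyed `data.table` avoid copying the full 6-million-row object (column names are placeholders matching the earlier examples):
```R
library(data.table)

setkey(df, column1)  # index the table by reference for fast subsetting and joins
summary_dt <- df[, .(avg_col3 = mean(column3, na.rm = TRUE)), by = column1]
```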
**3. Increase Memory Efficiency**
Ensure your R session has sufficient memory allocated. If you're using Spark as in your example, configure Spark with higher memory limits (as you've done with `spark.driver.memory` and `spark.executor.memory`). Additionally:
- Use column selection in your SQL query or R script to reduce unnecessary data transfer (see the sketch after this list).
- Avoid using `as.data.frame()` unnecessarily; stick with `data.table` or tibbles.
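For instance, selecting only the needed columns at the SQL level keeps the transfer small, and `setDT()` converts the result in place instead of copying it through `as.data.frame()` (column names are placeholders):
```R
library(DBI)
library(data.table)

# Pull only the columns you actually need instead of SELECT *
df <- dbGetQuery(con, "SELECT column1, column2 FROM final_lab_mstr")
setDT(df)  # convert to data.table by reference, no extra copy
```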
**4. Use Chunked Reading**
If you cannot avoid splitting due to memory constraints, read the table in chunks using SQL queries that return a fixed number of rows at a time and append each chunk with `data.table::rbindlist()` (`data.table::fread()` only helps here if the data has first been exported to a flat file). The `id_column` in the sketch below is a placeholder for a stable, ideally indexed, sort key.
```R
library(DBI)
library(data.table)

# Example: read chunks of 1 million rows at a time
chunk_size <- 1000000L
offset <- 0L
df <- data.table()  # accumulator for the chunks

repeat {
  # SQL Server pagination needs ORDER BY with OFFSET/FETCH;
  # id_column is a placeholder for a stable, ideally indexed, sort column
  query <- paste0(
    "SELECT * FROM final_lab_mstr ORDER BY id_column ",
    "OFFSET ", offset, " ROWS FETCH NEXT ", chunk_size, " ROWS ONLY"
  )
  chunk <- dbGetQuery(con, query)
  if (nrow(chunk) == 0) break
  # Append the chunk to the main data.table
  df <- rbindlist(list(df, setDT(chunk)), use.names = TRUE, fill = TRUE)
  offset <- offset + chunk_size
}
```
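Alternatively, DBI can stream a single result set in batches with `dbSendQuery()` and `dbFetch(n = ...)`, which avoids re-running the query with a new OFFSET for every chunk (a sketch reusing the connection from above; whether batching is honoured depends on the driver's support for incremental fetching):
```R
library(DBI)
library(data.table)

res <- dbSendQuery(con, "SELECT * FROM final_lab_mstr")
chunks <- list()
while (!dbHasCompleted(res)) {
  # Fetch the next batch of up to 1 million rows
  chunks[[length(chunks) + 1]] <- setDT(dbFetch(res, n = 1e6))
}
dbClearResult(res)

# Combine all batches into one data.table
df <- rbindlist(chunks, use.names = TRUE, fill = TRUE)
```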
**5. Consider Using Arrow or Parquet Formats**
For very large datasets, consider exporting the SQL table as a Parquet file and then using the `arrow` package in R for efficient reading.
```R
library(arrow)
# Load Parquet file into an R dataframe
df <- read_parquet("path_to_parquet_file")
```
Parquet files are highly compressed and optimized for large-scale data processing.
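If the export produces one or more Parquet files, `arrow::open_dataset()` can also scan them lazily with dplyr verbs, so only the selected and filtered result is ever materialised in R (a sketch; the path, columns, and filter condition are placeholders):
```R
library(arrow)
library(dplyr)

# Point at a Parquet file or a directory of Parquet files; nothing is read yet
ds <- open_dataset("path_to_parquet_folder")

# Column selection and filtering are evaluated by Arrow, not in R memory
df <- ds %>%
  select(column1, column2, column3) %>%
  filter(column3 > 0) %>%   # placeholder condition
  collect()
```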
**6. Pre-Aggregate Data in SQL**
If possible, perform summarization or filtering directly in SQL before pulling data into R. This reduces both memory usage and transfer time.
```SQL
SELECT column1, column2, AVG(column3) AS avg_col3
FROM final_lab_mstr
GROUP BY column1, column2;
```
Then load the aggregated result into R using `dbGetQuery()` or similar functions.
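For example, the aggregation above can be pulled straight into a dataframe, reusing the connection from earlier:
```R
library(DBI)

agg_query <- "
  SELECT column1, column2, AVG(column3) AS avg_col3
  FROM final_lab_mstr
  GROUP BY column1, column2
"
df_agg <- dbGetQuery(con, agg_query)
```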
By leveraging these techniques, you can handle large datasets more efficiently without splitting them manually. The choice of method depends on your specific use case and the computational resources available.