<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Convert multiple string fields to int or long during streaming in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/convert-multiple-string-fields-to-int-or-long-during-streaming/m-p/54536#M6209</link>
    <description>&lt;P&gt;Thanks for confirming that the readStream.withColumn() approach is the best available option.&amp;nbsp; Unfortunately, this will force me to maintain a separate notebook for each of the event types,&amp;nbsp; but it does work.&amp;nbsp;&amp;nbsp; I was hoping to create just one parameterized notebook that could read from an on-disk schema file for each event type (which would be the parameter).&amp;nbsp;&amp;nbsp; This notebook is my evolution of the work in &lt;A href="https://www.databricks.com/blog/2021/05/20/building-a-cybersecurity-lakehouse-for-crowdstrike-falcon-events.html," target="_blank" rel="noopener"&gt;https://www.databricks.com/blog/2021/05/20/building-a-cybersecurity-lakehouse-for-crowdstrike-falcon-events.html, so the source data has a mix of event types. This stage in processing moves them into event-specific tables.&lt;/A&gt;&lt;/P&gt;&lt;LI-CODE lang="java"&gt;spark.readStream
  .format("delta")
  .option("failOnDataLoss", "true")
  .table(rawTableName)
  .filter('event_simpleName === eventName)  // prod filter
  .withColumn("event", from_json('value,rawEventSchema))  // expand the JSON from raw into a new "event" column.
  .withColumn("_time", col("event.timestamp").cast(LongType)/1000)   // Add a Splunk-friendly timestamp
  .select($"_time", $"event.*")  // flatten into a simple row,  remove the event.* prefix
  .withColumn("timestamp", from_unixtime(col("timestamp").cast(LongType)/1000).cast(TimestampNTZType))  // Create a SQL-friendly timestamp
  //  Paste event-specific columns here, generated this code in Excel
  .withColumn("ContextProcessId",col("ContextProcessId").cast(LongType))
  .withColumn("ContextTimeStamp",from_unixtime(col("ContextTimeStamp").cast(LongType)/1000).cast(TimestampNTZType))
  .withColumn("Protocol",col("Protocol").cast(IntegerType))
  .withColumn("ConnectionFlags",col("ConnectionFlags").cast(LongType))
  .withColumn("LocalPort",col("LocalPort").cast(IntegerType))
  .withColumn("RemotePort",col("RemotePort").cast(IntegerType))
  .withColumn("ConnectionDirection",col("ConnectionDirection").cast(IntegerType))
  .withColumn("IcmpType",col("IcmpType").cast(IntegerType))
  .withColumn("IcmpCode",col("IcmpCode").cast(IntegerType))
  .withColumn("TreeId",col("TreeId").cast(LongType))
  .withColumn("ContextThreadId",col("ContextThreadId").cast(LongType))
  .withColumn("InContext",col("InContext").cast(BooleanType))
  .withColumn("TcpConnectErrorCode",col("TcpConnectErrorCode").cast(IntegerType))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Sun, 03 Dec 2023 16:14:11 GMT</pubDate>
    <dc:creator>hukel</dc:creator>
    <dc:date>2023-12-03T16:14:11Z</dc:date>
    <item>
      <title>Convert multiple string fields to int or long during streaming</title>
      <link>https://community.databricks.com/t5/get-started-discussions/convert-multiple-string-fields-to-int-or-long-during-streaming/m-p/54464#M6207</link>
      <description>&lt;P&gt;Source data looks like:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;{
	"IntegrityLevel": "16384",
	"ParentProcessId": "10972929104936",
	"SourceProcessId": "10972929104936",
	"SHA256Hash": "a26a1ffb81a61281ffa55cb7778cc3fb0ff981704de49f75f51f18b283fba7a2",
	"ImageFileName": "\\Device\\HarddiskVolume3\\Windows\\System32\\conhost.exe",
	"SourceThreadId": "192964614202574",
	"name": "ProcessRollup2V19",
}&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Current streaming read is using an inferred schema, which is rendering fields like ParentProcessId as strings.&lt;/P&gt;&lt;P&gt;I have experimented with a manually-constructed schema using LongType, but this turns the would-be integers to nulls:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;val schema = new StructType()
  .add("ImageFileName", StringType,true)
  .add("ParentProcessId", LongType, true)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If this can't be done via schema, is there an elegant way (avoiding foreachBatch) in&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;readStream&lt;/SPAN&gt;&lt;SPAN&gt;.format(&lt;/SPAN&gt;&lt;SPAN&gt;"delta"&lt;/SPAN&gt;&lt;SPAN&gt;) &lt;/SPAN&gt;&lt;SPAN&gt;or &lt;/SPAN&gt;&lt;SPAN&gt;writeStream&lt;/SPAN&gt;&lt;SPAN&gt;.format("delta")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;to convert a dozen or so fields to a different type?&lt;/P&gt;</description>
      <pubDate>Fri, 01 Dec 2023 20:38:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/convert-multiple-string-fields-to-int-or-long-during-streaming/m-p/54464#M6207</guid>
      <dc:creator>hukel</dc:creator>
      <dc:date>2023-12-01T20:38:57Z</dc:date>
    </item>
    <item>
      <title>Re: Convert multiple string fields to int or long during streaming</title>
      <link>https://community.databricks.com/t5/get-started-discussions/convert-multiple-string-fields-to-int-or-long-during-streaming/m-p/54536#M6209</link>
      <description>&lt;P&gt;Thanks for confirming that the readStream.withColumn() approach is the best available option.&amp;nbsp; Unfortunately, this will force me to maintain a separate notebook for each of the event types,&amp;nbsp; but it does work.&amp;nbsp;&amp;nbsp; I was hoping to create just one parameterized notebook that could read from an on-disk schema file for each event type (which would be the parameter).&amp;nbsp;&amp;nbsp; This notebook is my evolution of the work in &lt;A href="https://www.databricks.com/blog/2021/05/20/building-a-cybersecurity-lakehouse-for-crowdstrike-falcon-events.html," target="_blank" rel="noopener"&gt;https://www.databricks.com/blog/2021/05/20/building-a-cybersecurity-lakehouse-for-crowdstrike-falcon-events.html, so the source data has a mix of event types. This stage in processing moves them into event-specific tables.&lt;/A&gt;&lt;/P&gt;&lt;LI-CODE lang="java"&gt;spark.readStream
  .format("delta")
  .option("failOnDataLoss", "true")
  .table(rawTableName)
  .filter('event_simpleName === eventName)  // prod filter
  .withColumn("event", from_json('value,rawEventSchema))  // expand the JSON from raw into a new "event" column.
  .withColumn("_time", col("event.timestamp").cast(LongType)/1000)   // Add a Splunk-friendly timestamp
  .select($"_time", $"event.*")  // flatten into a simple row,  remove the event.* prefix
  .withColumn("timestamp", from_unixtime(col("timestamp").cast(LongType)/1000).cast(TimestampNTZType))  // Create a SQL-friendly timestamp
  //  Paste event-specific columns here, generated this code in Excel
  .withColumn("ContextProcessId",col("ContextProcessId").cast(LongType))
  .withColumn("ContextTimeStamp",from_unixtime(col("ContextTimeStamp").cast(LongType)/1000).cast(TimestampNTZType))
  .withColumn("Protocol",col("Protocol").cast(IntegerType))
  .withColumn("ConnectionFlags",col("ConnectionFlags").cast(LongType))
  .withColumn("LocalPort",col("LocalPort").cast(IntegerType))
  .withColumn("RemotePort",col("RemotePort").cast(IntegerType))
  .withColumn("ConnectionDirection",col("ConnectionDirection").cast(IntegerType))
  .withColumn("IcmpType",col("IcmpType").cast(IntegerType))
  .withColumn("IcmpCode",col("IcmpCode").cast(IntegerType))
  .withColumn("TreeId",col("TreeId").cast(LongType))
  .withColumn("ContextThreadId",col("ContextThreadId").cast(LongType))
  .withColumn("InContext",col("InContext").cast(BooleanType))
  .withColumn("TcpConnectErrorCode",col("TcpConnectErrorCode").cast(IntegerType))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 03 Dec 2023 16:14:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/convert-multiple-string-fields-to-int-or-long-during-streaming/m-p/54536#M6209</guid>
      <dc:creator>hukel</dc:creator>
      <dc:date>2023-12-03T16:14:11Z</dc:date>
    </item>
  </channel>
</rss>

