Cauchy — Threat Detection and Response at Scale

Visualizing process trees with marimo and anywidget

This work was inspired by a project at DNB’s Cyber Defense Center where we have been exploring the use of visualizations and data apps to help us do incident response more efficiently. The process tree visualization presented here re-envisions those data apps within the notebook environment, demonstrating how similar interactive analysis capabilities can be achieved in computational notebooks. While this specific implementation focuses on teaching core concepts, we hope to share more about our production security visualization work in future posts or presentations.

Process creation event logs are one of the primary data sources when investigating security incidents. However, when treated as a collection of events, we are left with a tabular representation of what is in reality a tree relationship, and it can be difficult for an analyst to get an overview of what is going on. While Microsoft and other EDR vendors provide visualization tools out of the box, they come with some limitations: you can’t customize them, data expires after a while, and they are only available if you pay for premium tiers.

In this post, we will show how to build an interactive process tree visualization by combining:

  • anywidget - a framework for creating custom Jupyter and marimo notebook widgets
  • marimo - a reactive Python notebook
  • ibis - a Python dataframe library that is backend agnostic
  • Apache Spark & Spark Connect - a distributed query engine
  • dependentree - d3 tree visualization library created by Square

The diagram below gives an overview of the system architecture and how the components relate to each other.

[Figure: system architecture diagram. Labels recovered from the diagram: Delta, spark-connect, http://hostname:2718, anywidget(ProcessTreeWidget(events)), events = traitlets.List([]).tag(sync=True), process_id = traitlets.Int(-1).tag(sync=True)]

Overview of the system architecture and data flow. Users create ibis dataframe queries in a marimo app that are executed on a remote Apache Spark cluster. The process creation events are retrieved, a tree structure is created and sent to the anywidget which renders the d3 process tree visualization.

We will not dive deep into these tools here, but they all have great documentation and tutorials for those who want to learn more. In particular, for an introduction to anywidget, check out this presentation by the creator Trevor Manz, or watch his step-by-step tutorial on building a fun widget from scratch.

Process Creation Events

Even though we will use data from Microsoft Defender for Endpoint, the approach can be adapted to logs from any EDR. The MDE process creation events are stored in the DeviceProcessEvents schema. For the process tree use-case the important fields are summarized below.


Field | Description
Timestamp | Date and time when the event was recorded
ReportId | Event identifier based on a repeating counter. To identify unique events, this column must be used in conjunction with the DeviceName and Timestamp columns.
DeviceName | Fully qualified domain name (FQDN) of the device
ProcessId | Process ID (PID) of the newly created process
FileName | Name of the file that the recorded action was applied to
ProcessCreationTime | Date and time the process was created
InitiatingProcessId | Process ID (PID) of the process that initiated the event
InitiatingProcessFileName | Name of the process file that initiated the event; if unavailable, the name of the process that initiated the event might be shown instead
InitiatingProcessCreationTime | Date and time when the process that initiated the event was started
InitiatingProcessParentId | Process ID (PID) of the parent process that spawned the process responsible for the event
InitiatingProcessParentFileName | Name of the parent process that spawned the process responsible for the event
InitiatingProcessParentCreationTime | Date and time when the parent of the process responsible for the event was started


To make the widget easier to re-use with different data sources, we will map the DeviceProcessEvents table to the ProcessEvent schema from the ASIM (Advanced Security Information Model). The Azure Sentinel repository contains ASIM parsers for many data sources. While these parsers are written in KQL (Kusto Query Language), it is straightforward to rewrite them as Ibis expressions.

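from ibis import _

# `events` is an Ibis table expression over the DeviceProcessEvents data
process_creation_events = (
    events
    # keep only process creation events
    .filter(_.ActionType == "ProcessCreated")
    # ReportId is a repeating counter, so de-duplicate on the full event key
    .distinct(on=["ReportId", "Timestamp", "DeviceName"], keep="first")
    .order_by(_.Timestamp)
    # expose the MDE columns under their ASIM ProcessEvent names
    .mutate(
        TargetProcessId=_.ProcessId,
        TargetProcessFilename=_.FileName,
        TargetProcessCreationTime=_.ProcessCreationTime,
        # ...
    )
)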
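To make the mapping concrete without any query backend, the same steps (filter, de-duplicate, sort, rename) can be sketched in plain Python over a list of dicts. This is only an illustration: the MDE_TO_ASIM dictionary, the to_asim helper and the sample rows are assumptions, and the Acting*/Parent* column names follow the naming used elsewhere in this post, so they should be checked against the ASIM ProcessEvent schema before use.

```python
from datetime import datetime

# Assumed MDE -> ASIM column mapping (illustrative, not taken from a parser).
MDE_TO_ASIM = {
    "ProcessId": "TargetProcessId",
    "FileName": "TargetProcessFilename",
    "ProcessCreationTime": "TargetProcessCreationTime",
    "InitiatingProcessId": "ActingProcessId",
    "InitiatingProcessFileName": "ActingProcessFilename",
    "InitiatingProcessCreationTime": "ActingProcessCreationTime",
    "InitiatingProcessParentId": "ParentProcessId",
    "InitiatingProcessParentFileName": "ParentProcessFilename",
    "InitiatingProcessParentCreationTime": "ParentProcessCreationTime",
}


def to_asim(rows):
    """Filter, de-duplicate, sort and rename MDE rows (a list of dicts)."""
    seen = set()
    unique = []
    for row in rows:
        if row.get("ActionType") != "ProcessCreated":
            continue
        key = (row["ReportId"], row["Timestamp"], row["DeviceName"])
        if key in seen:  # keep only the first occurrence of each event
            continue
        seen.add(key)
        unique.append(row)
    unique.sort(key=lambda r: r["Timestamp"])
    # Keep the original columns and add the ASIM-named ones, like mutate().
    return [
        {**row, **{new: row.get(old) for old, new in MDE_TO_ASIM.items()}}
        for row in unique
    ]


# Hypothetical sample rows for illustration only.
rows = [
    {"ActionType": "ProcessCreated", "ReportId": 7, "DeviceName": "host-a",
     "Timestamp": datetime(2024, 11, 1, 0, 3, 42), "ProcessId": 2,
     "FileName": "cmd.exe", "InitiatingProcessId": 1},
    {"ActionType": "ProcessCreated", "ReportId": 7, "DeviceName": "host-a",
     "Timestamp": datetime(2024, 11, 1, 0, 3, 42), "ProcessId": 2,
     "FileName": "cmd.exe", "InitiatingProcessId": 1},  # exact duplicate
    {"ActionType": "SomeOtherAction", "ReportId": 8, "DeviceName": "host-a",
     "Timestamp": datetime(2024, 11, 1, 0, 4, 0), "ProcessId": 3,
     "FileName": "other.exe", "InitiatingProcessId": 2},  # not a creation
]
asim_rows = to_asim(rows)
print(asim_rows[0]["TargetProcessFilename"], asim_rows[0]["ActingProcessId"])
```

Columns missing from a source row come out as None, which makes it easy to spot fields that a given EDR does not provide.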

Ibis and Spark Connect

By using Ibis the same code can run on a remote data system, locally using DuckDB or even in the browser. In a production system, you would typically connect to distributed query engines like Apache Spark, BigQuery or Snowflake.

Spark Connect allows you to execute Apache Spark queries remotely from a notebook environment (or any other client). When you run a query from your notebook, the client sends the operations to the Spark server, which executes them and streams the results back to the client over gRPC in Arrow format. This client-server architecture lets us run intensive queries on powerful remote clusters while maintaining an interactive notebook experience.


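import ibis
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config
from ibis import _

# create a Spark Connect session using the "security" Databricks profile
config = Config(profile="security")
spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

con = ibis.pyspark.connect(spark)
device_process_events = (
    con.table(
        name="device_process_events",
        database=("security_logs", "mde"),
    )
    # the event fields live in the `properties` column; unpack it into columns
    .select(_.properties)
    .unpack("properties")
)

# execute the query remotely and load the result into a local in-memory table
t = ibis.memtable(device_process_events.to_pyarrow())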

An example of connecting to a remote Spark cluster on Databricks using Ibis and Spark Connect and creating an in-memory table backed by DuckDB. This approach allows remote query execution while maintaining local interactivity, though it’s worth noting that data transfer involves conversion through pandas before reaching DuckDB, so it’s not a zero-copy operation.

Retrieving the right set of process creation events is an interesting challenge in itself, especially if you don’t have access to the complete start and end of a system session. One approach is to examine a wide time interval of process creation events and look for system boot markers like the Windows kernel (ntoskrnl.exe) being loaded. When investigating a specific process, you can define session boundaries by identifying when ntoskrnl.exe was loaded - the previous load marking the session start and the next load indicating a reboot and new session. While this information is typically available in dedicated system event logs, it can be reconstructed from process creation events with careful querying.

ntoskrnl_creation_events = (
    events
    .filter(_.ParentProcessFilename == "ntoskrnl.exe")
    .select(_.ParentProcessCreationTime)
    .distinct(on="ParentProcessCreationTime", keep="first")
)

An example of how to identify system boot events by looking for ntoskrnl.exe as a parent process. While not implemented in our demo, this approach can help establish session boundaries when investigating incidents. In practice, you might simply look back a few hours or days from a suspicious event, or use other time-based filtering approaches depending on your investigation needs.
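Once the distinct boot times are available, finding the session around a suspicious event is a sorted lookup. The sketch below assumes the boot times have already been collected into a Python list; session_bounds and the sample timestamps are hypothetical and only illustrate the idea of "previous load marks the start, next load marks the end".

```python
from bisect import bisect_right
from datetime import datetime


def session_bounds(boot_times, event_time):
    """Return (start, end) of the boot session containing event_time.

    start is the latest boot at or before the event and end is the next
    boot; either is None when no such boot exists in the retrieved data.
    """
    boots = sorted(set(boot_times))
    i = bisect_right(boots, event_time)
    start = boots[i - 1] if i > 0 else None
    end = boots[i] if i < len(boots) else None
    return start, end


# Hypothetical boot markers (distinct ntoskrnl.exe creation times).
boots = [
    datetime(2024, 11, 1, 0, 0, 42),
    datetime(2024, 11, 3, 8, 15, 0),
    datetime(2024, 11, 7, 6, 30, 0),
]
start, end = session_bounds(boots, datetime(2024, 11, 4, 12, 0, 0))
print(start, end)
```

A None bound signals that the retrieved interval was too narrow to see the corresponding boot, which is a hint to widen the query window.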

For the remainder of this article, we’ll assume you already have a collection of process events (retrieved using Apache Spark, Splunk, Elastic, or some other query or search engine) and are now ready to create a process tree visualization. The techniques we’ll cover work regardless of how you obtained your process event data.

Building the Tree

Before we can create visualizations, we need to construct a tree structure from the process creation events. We will use treelib, an efficient tree manipulation library with no external dependencies. The library allows nodes to contain arbitrary data, so we define a Process model that holds the fields of a single process creation event.

from datetime import datetime
from typing import ClassVar

from pydantic import BaseModel


class Process(BaseModel):
    # Sentinel for an unknown process ID. The value -1 is an assumption;
    # the original definition is not shown in this post.
    MISSING_PROCESS_ID: ClassVar[int] = -1

    # Process being created
    target_process_id: int
    target_process_filename: str
    target_process_creation_time: datetime

    # Direct parent process
    acting_process_id: int
    acting_process_filename: str
    acting_process_creation_time: datetime

    # Grandparent process
    parent_process_id: int
    parent_process_filename: str
    parent_process_creation_time: datetime

    def identifier(self) -> str:
        return f"{self.target_process_id}|{self.target_process_creation_time}"

    def parent_identifier(self) -> str:
        # processes without a known parent are attached to the root node
        if self.acting_process_id == Process.MISSING_PROCESS_ID:
            return "<root>"

        return f"{self.acting_process_id}|{self.acting_process_creation_time}"

As the root node, we use a placeholder value <root> from which all processes originate. Each process node has a unique identifier created by concatenating its target_process_id and target_process_creation_time values. When adding a node to the tree, we specify both its identifier and its parent’s identifier. The diagram below illustrates a process tree and shows how these fields relate to each other.

[Figure: example process tree. Node labels: <root>, services.exe, word.exe, cmd.exe, pwsh.exe, svchost.exe, updater.exe, rdpclip.exe. Annotated nodes: pid 1 (ParentProcessId = ?, ActingProcessId = ?, TargetProcessId = 1; id=1|2024-11-01 00:00:42), pid 2 (ParentProcessId = ?, ActingProcessId = 1, TargetProcessId = 2; id=2|2024-11-01 00:03:42), pid 3 (ParentProcessId = 1, ActingProcessId = 2, TargetProcessId = 3).]

An example of a process tree structure created by treelib. Nodes with diagonal stripes represent processes whose creation events were not directly available, but were reconstructed using ActingProcess or ParentProcess information from other events.

The nodes shown with diagonal stripes (hatched pattern) represent processes where we do not have the original process creation event. This could be because we either did not retrieve the event or because it was not logged by the EDR. However, we can still partially reconstruct these nodes using information from other events - specifically, the ActingProcess or ParentProcess values.

Next, we define a ProcessTree class that uses treelib to construct our tree structure. When initialized, this class creates a tree with a root node, and provides methods to build out the process hierarchy. The class handles both direct process creation events and reconstructs missing nodes using parent process information.

from treelib import Tree


class ProcessTree:
    def __init__(self, processes: list[Process] | None = None):
        self.tree: Tree = Tree()
        # placeholder root from which all processes originate
        self.root = self.tree.create_node(
            tag="<root>",
            identifier="<root>",
            data=None,
        )

    def insert_or_update(self, process: Process):
        ...

    def insert_process(self, process: Process):
        # reconstruct the grandparent node from the fields of this event
        parent_process = Process(
            target_process_id=process.parent_process_id,
            target_process_filename=process.parent_process_filename,
            target_process_creation_time=process.parent_process_creation_time,
        )

        ...  # construction of acting_process omitted

        # insert top-down so that every node's parent exists before the node
        self.insert_or_update(parent_process)
        self.insert_or_update(acting_process)
        self.insert_or_update(process)

    def create_dependentree_format(self):
        ...

The ProcessTree class builds a tree structure from process events, tracking how processes are created and relate to each other. It can handle both direct process creation events and fill in missing information about parent processes, ensuring we have a complete picture of process relationships.
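Since the bodies of insert_or_update and insert_process are elided, the sketch below shows one way the reconstruction logic could work, using a plain dictionary as a stand-in for treelib. It is not the author's implementation: build_tree, node_id, the shortened event keys and the logged flag (False for the hatched, reconstructed nodes) are all assumptions made for illustration.

```python
from datetime import datetime

ROOT = "<root>"


def node_id(pid, created):
    """Identifier in the same pid|creation-time form used by Process."""
    return f"{pid}|{created}"


def build_tree(events):
    """Return {identifier: node} where node has name, parent and logged."""
    nodes = {ROOT: {"name": ROOT, "parent": None, "logged": True}}

    def insert_or_update(ident, name, parent, logged):
        node = nodes.get(ident)
        if node is None:
            nodes[ident] = {"name": name, "parent": parent, "logged": logged}
            return
        # A reconstructed node hangs off the root until its parent is known.
        if node["parent"] == ROOT and parent != ROOT:
            node["parent"] = parent
        node["logged"] = node["logged"] or logged

    for e in events:
        grandparent = ROOT
        if e.get("parent_id") is not None:
            grandparent = node_id(e["parent_id"], e["parent_created"])
            insert_or_update(grandparent, e["parent_name"], ROOT, False)
        acting = node_id(e["acting_id"], e["acting_created"])
        insert_or_update(acting, e["acting_name"], grandparent, False)
        target = node_id(e["target_id"], e["target_created"])
        insert_or_update(target, e["target_name"], acting, True)
    return nodes


# Hypothetical event: only the creation of cmd.exe was logged.
t1, t2, t3 = (datetime(2024, 11, 1, 0, m, 42) for m in (0, 3, 5))
events = [{
    "target_id": 3, "target_name": "cmd.exe", "target_created": t3,
    "acting_id": 2, "acting_name": "word.exe", "acting_created": t2,
    "parent_id": 1, "parent_name": "services.exe", "parent_created": t1,
}]
nodes = build_tree(events)
for ident, node in nodes.items():
    print(ident, node["name"], node["parent"], node["logged"])
```

A single logged event is enough to place three nodes: the created process, its parent and its grandparent. If the creation event of a reconstructed node shows up later, the same identifier is reused, so the node is updated in place rather than duplicated.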

The create_dependentree_format method (omitted for brevity) transforms the hierarchical process structure into the format required by DependenTree, which is a graph visualization library built using tree layout from D3. The expected format is a list of dictionaries, where each dictionary represents a node (process) in the tree. The only fields required by DependenTree are _name and _deps. However, we want the structure used for the tree also to contain additional fields:


Field | Description
_name | The unique identifier of the process
_deps | A list containing the identifier of the parent process. In our use case there is always exactly one parent, so it is a list of one element.
ProcessName | The filename of the process.
FileName | Name of the file that the recorded action was applied to
ProcessId | The process ID.
ProcessCreationTime | The creation time of the process.

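Because create_dependentree_format is omitted, here is a minimal sketch of what such a conversion could look like. It assumes the tree is available as a dictionary of nodes with name, parent, pid and created keys (a hypothetical shape, not the treelib structure used in the post), leaves out the FileName field, and gives the root an empty _deps list. Creation times are emitted as ISO strings so that the resulting list stays JSON-serializable, which matters once it is synced to the frontend.

```python
import json
from datetime import datetime


def to_dependentree(nodes):
    """Convert {identifier: node} into a list of DependenTree entities."""
    entities = []
    for ident, node in nodes.items():
        parent = node["parent"]
        created = node["created"]
        entities.append({
            "_name": ident,
            # one parent per process; the root node has none
            "_deps": [parent] if parent is not None else [],
            "ProcessName": node["name"],
            "ProcessId": node["pid"],
            "ProcessCreationTime": created.isoformat() if created else None,
        })
    return entities


# Hypothetical tree: the placeholder root plus one process.
created = datetime(2024, 11, 1, 0, 0, 42)
nodes = {
    "<root>": {"name": "<root>", "parent": None, "pid": None, "created": None},
    f"1|{created}": {
        "name": "services.exe", "parent": "<root>", "pid": 1, "created": created,
    },
}
entities = to_dependentree(nodes)
payload = json.dumps(entities)  # raises TypeError if anything is not serializable
print(payload)
```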

The Widget

With the process tree data structure in place, the next step is creating an interactive widget for computational notebooks. anywidget does two things: it provides the tooling for Jupyter-compatible widget creation and implements the Anywidget Front-End Module (AFM) specification based on standard ECMAScript modules.

To create the widget’s frontend, we need to write an ES module that defines lifecycle methods, e.g.,

  • initialize: Sets up the widget’s initial state and event listeners
  • render: Handles the actual rendering of the widget in the notebook

The host platform (like Jupyter or marimo) loads this module and communicates with it through a standardized interface. Here’s the basic structure:

export default {
  initialize({ model }) {
    // Add instance-specific event listeners
    return () => {
      // Clean up event listeners
    }
  },
  render({ model, el }) {
    // Render the widget
    return () => {
      // Clean up event listeners
    }
  },
};

The AFM module defines core widget lifecycle methods for initialization and rendering, each returning cleanup functions. Through synchronized traits, it enables bidirectional Python-JavaScript communication, allowing features like interactive selection and brushing. For details, see reusable widgets for interactive analysis and visualization in computational notebooks.

For our process tree visualization widget, we want to maintain a shared tree structure state between Python and JavaScript, with bidirectional synchronization of both the tree and the currently selected node. This means that when a user clicks a node in the visualization, the selection should be reflected in Python, and when we update the tree structure in Python, the widget should re-render the tree visualization. This bidirectional communication is handled through traitlets: we define an events trait for the tree structure and a process_id trait for tracking the currently selected process.

On the host side, we need to define an anywidget.AnyWidget subclass:

import pathlib
import anywidget
import traitlets

class Widget(anywidget.AnyWidget):
    _esm = pathlib.Path(__file__).parent / "static" / "widget.js"
    events = traitlets.List([]).tag(sync=True)
    process_id = traitlets.Int(0).tag(sync=True)

Process tree widget. The events property is a synchronized list that contains the process tree data. When this list is modified in Python, the changes are automatically reflected in the JavaScript client, triggering a re-render of the visualization. The _esm attribute points to the JavaScript module that implements the frontend.

For the AFM, we need to load and set up DependenTree, insert it into the DOM and pass it the events list that was generated by the create_dependentree_format method in our ProcessTree class. This connects our Python data structure to the JavaScript visualization.

// slightly modified version to allow 
// for node selection and styling
import DependenTree from "https://esm.sh/gh/kyrre/dependentree@dev"

export default {

  render({ model, el }) {
    this.treeDiv = document.createElement("div");
    this.treeDiv.id = "tree";
    this.activePid = null;

    // this callback function is called when the events list
    // is changed on the Python side, so we re-create the visualization
    // with the new data
    model.on("change:events", () => {

      this.tree.removeTree();

      this.tree = new DependenTree(this.treeDiv, options);
      this.tree.addEntities(structuredClone(model.get("events")));
      this.tree.setTree('<root>', 'downstream');

    });


    el.classList.add("process_tree_widget");
    el.appendChild(this.treeDiv);

    const options = {
      // ... 
      // settings omitted for brevity

      // whenever we click a node in tree we update the 
      // process_id value, which is then synced back to 
      // Python via the process_id traitlet

      nodeClick: (node) => {
        model.set("process_id", node.ProcessId);
        model.save_changes();
      }
    };

    // the rendering needs to complete before we create the tree
    // via discord :blessed:
    requestAnimationFrame(() => {
      this.tree = new DependenTree(this.treeDiv, options);
      this.tree.addEntities(structuredClone(model.get("events")));
      this.tree.setTree('<root>', 'downstream');
    });
  }
}

The process tree visualization AFM implements the widget’s frontend logic. It creates a DOM container for the tree, initializes the DependenTree visualization library, and establishes bidirectional communication with Python. When the shared events state changes (triggered from Python), the “change:events” callback recreates the visualization using the new data. Conversely, when a user clicks a node, the widget updates the process_id value, which synchronizes back to Python, enabling interactive exploration.
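On the Python side, the synced process_id can then be used to look up the selected node. The sketch below is an assumption about how that lookup might be written: processes_with_pid is a hypothetical helper, the events follow the list-of-dicts format passed to the widget, and process_id stands for the current value of the widget's process_id trait. Because operating systems reuse PIDs, a PID alone can match several nodes, so the helper returns all of them ordered by creation time.

```python
def processes_with_pid(events, process_id):
    """Return every node whose ProcessId equals the synced process_id."""
    matches = [e for e in events if e.get("ProcessId") == process_id]
    # oldest first, so reused PIDs appear in chronological order
    return sorted(matches, key=lambda e: e["ProcessCreationTime"])


# Hypothetical events in the format passed to the widget; PID 4242 is reused.
events = [
    {"_name": "<root>", "_deps": []},
    {"_name": "4242|2024-11-01T09:30:00", "_deps": ["<root>"],
     "ProcessName": "pwsh.exe", "ProcessId": 4242,
     "ProcessCreationTime": "2024-11-01T09:30:00"},
    {"_name": "4242|2024-11-01T00:03:42", "_deps": ["<root>"],
     "ProcessName": "cmd.exe", "ProcessId": 4242,
     "ProcessCreationTime": "2024-11-01T00:03:42"},
]
selected = processes_with_pid(events, 4242)
print([e["ProcessName"] for e in selected])
```

If an unambiguous selection is needed, syncing the node's _name (PID plus creation time) instead of the bare PID would avoid the problem; that would be a change to the widget shown above, not something it currently does.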

Interactive Demo

With all the components for our process tree visualization in place, we can now build a notebook that showcases how the widget works in practice, allowing you to:

  1. Filter process events by time range by using a marimo datetime slider
  2. Explore the hierarchical process tree structure
  3. Select individual processes to view their details
  4. See the bidirectional communication between Python and JavaScript in action
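The time range filter in the first step boils down to keeping the events inside a window before the tree is rebuilt. A minimal sketch, assuming each event is a dict with a Timestamp value and that the window bounds come from the slider; filter_by_time and the sample data are hypothetical:

```python
from datetime import datetime


def filter_by_time(events, start, end):
    """Keep events with start <= Timestamp < end (half-open interval)."""
    return [e for e in events if start <= e["Timestamp"] < end]


# Hypothetical events for illustration only.
events = [
    {"Timestamp": datetime(2024, 11, 1, 0, 3, 42), "TargetProcessId": 2},
    {"Timestamp": datetime(2024, 11, 1, 9, 30, 0), "TargetProcessId": 3},
    {"Timestamp": datetime(2024, 11, 2, 14, 0, 0), "TargetProcessId": 4},
]
window = filter_by_time(
    events,
    start=datetime(2024, 11, 1, 0, 0, 0),
    end=datetime(2024, 11, 2, 0, 0, 0),
)
print([e["TargetProcessId"] for e in window])
```

Events whose parents were created before the window still end up in a sensible place in the tree, because missing ancestors are reconstructed from the ActingProcess and ParentProcess fields as described earlier.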

Since marimo notebooks can be run entirely in the browser by using Pyodide (CPython ported to WebAssembly), we can generate a static WASM notebook and embed it directly in an iframe. This is great for documentation and for creating examples.

Note: When running in WebAssembly via Pyodide, we need to handle a few additional setup steps: downloading and installing packages through micropip, fetching our Parquet data files via HTTP, converting them to Arrow and creating the in-memory dataframe. While this setup code may look a bit involved, most of the complexity is due to the workarounds needed to run the demo in a browser environment. The core visualization functionality remains the same whether you’re running locally or in WebAssembly.

Below you’ll find an interactive notebook where you can explore the example data. Note that the nodes themselves must be double-clicked to expand. Opening the notebook in a new tab is recommended to better explore the more deeply nested subtrees. The tree appears close to the bottom of the notebook after it has had some time to generate.

The notebook doesn’t work on mobile devices, so in that case only a video is available.

Interactive process tree visualization running entirely in your browser via WebAssembly. This demo showcases the power of bidirectional communication between Python and JavaScript - you can filter the dataset using the time range controls, and clicking on any process node updates the Python state, allowing for detailed inspection of selected processes. The reactive nature of marimo ensures all components stay synchronized as you explore the data.

It’s also clear from the visualization that the EDR wasn’t able to log all the process relationships, which is why not all processes are properly nested under ntoskrnl.exe. This illustrates the challenge we discussed earlier with the hatched nodes in our tree diagram - some process creation events are missing from the logs, requiring us to reconstruct relationships from parent process information. We recommend exploring the subtree ntoskrnl.exe → smss.exe → winlogon.exe → userinit.exe → explorer.exe, which shows a user launching a sequence of applications.

Conclusion

In this post, we demonstrated how to build an interactive process tree visualization widget using:

  • marimo - a reactive Python notebook environment
  • anywidget and AFM - connecting Python and JavaScript for widget creation
  • DependenTree - creating interactive tree visualizations with d3
  • ibis - a backend-agnostic dataframe library

By transforming raw process logs into an interactive tree visualization, this widget helps incident responders understand the chain of process executions when investigating security issues. The bidirectional communication between Python and JavaScript enables interactive analysis: analysts can click nodes in the visualization to select processes of interest, and then query and analyze the selected process data in Python. This integration of visualization and analysis helps explore process relationships and examine details when investigating security incidents.

The solution can work with different EDR data sources by mapping their process events to the ASIM schema, and the visualization can be modified using D3 and other JavaScript libraries or frameworks like React or Vue. Additionally, thanks to Pyodide, the notebook can run directly in the browser via WebAssembly, making it easy to share and demonstrate.

Future Improvements

While the current implementation works well for typical process trees, there are some areas for future enhancement:

  • Handling processes with many children: The visualization can become overwhelming when dealing with processes that spawn hundreds of child processes (like services.exe).
  • Timeline filtering: Adding timeline controls would allow users to focus on specific time intervals, making it easier to analyze process relationships during particular periods of interest.
  • Additional context: Incorporating more process metadata and allowing filtering based on process attributes could provide valuable context during investigations.

References

  • anywidget - Framework for creating custom Jupyter and marimo notebook widgets
  • marimo - Reactive Python notebook
  • ibis - Python dataframe library
  • Apache Spark - Distributed query engine
  • Spark Connect - Spark’s client-server interface
  • dependentree - D3 tree visualization library
  • treelib - Tree data structure manipulation library
  • Pyodide - Python runtime for the browser