Visualizing process trees with marimo and anywidget
This work was inspired by a project at DNB’s Cyber Defense Center where we have been exploring the use of visualizations and data apps to help us do incident response more efficiently. The process tree visualization presented here re-envisions those data apps within the notebook environment, demonstrating how similar interactive analysis capabilities can be achieved in computational notebooks. While this specific implementation focuses on teaching core concepts, we hope to share more about our production security visualization work in future posts or presentations.
Process creation event logs are one of the primary data sources when investigating security incidents. However, when treated as a collection of events, we are left with a tabular representation of what is in reality a tree relationship, and it can be difficult for an analyst to get an overview of what is going on. While Microsoft and other EDR vendors provide visualization tools out of the box, they come with some limitations: you can’t customize them, data expires after a while, and they are only available if you pay for premium tiers.
In this post, we will show how to build an interactive process tree visualization by combining:
- anywidget - a framework for creating custom Jupyter and marimo notebook widgets
- marimo - a reactive Python notebook
- ibis - a Python dataframe library that is backend agnostic
- Apache Spark & Spark Connect - a distributed query engine
- dependentree - d3 tree visualization library created by Square
Below is a diagram showing an overview of the system architecture and how the components relate to each other.
Overview of the system architecture and data flow. Users create ibis dataframe queries in a marimo app that are executed on a remote Apache Spark cluster. The process creation events are retrieved, a tree structure is created and sent to the anywidget which renders the d3 process tree visualization.
We will not dive deep into these tools here, but they all have great documentation and tutorials for those who want to learn more. In particular, for an introduction to anywidget, check out this presentation by the creator Trevor Manz, or watch his step-by-step tutorial on building a fun widget from scratch.
Process Creation Events
Even though we will use data from Microsoft Defender for Endpoint, the approach can be adapted to logs from any EDR. The MDE process creation events are stored in the DeviceProcessEvents schema. For the process tree use-case the important fields are summarized below.
Field | Description |
---|---|
Timestamp | Date and time when the event was recorded |
ReportId | Event identifier based on a repeating counter. To identify unique events, this column must be used in conjunction with the DeviceName and Timestamp columns. |
DeviceName | Fully qualified domain name (FQDN) of the device |
ProcessId | Process ID (PID) of the newly created process |
FileName | Name of the file that the recorded action was applied to |
ProcessCreationTime | Date and time the process was created |
InitiatingProcessId | Process ID (PID) of the process that initiated the event |
InitiatingProcessFileName | Name of the process file that initiated the event; if unavailable, the name of the process that initiated the event might be shown instead |
InitiatingProcessCreationTime | Date and time when the process that initiated the event was started |
InitiatingProcessParentId | Process ID (PID) of the parent process that spawned the process responsible for the event |
InitiatingProcessParentFileName | Name of the parent process that spawned the process responsible for the event |
InitiatingProcessParentCreationTime | Date and time when the parent of the process responsible for the event was started |
To make the widget easier to re-use with different data sources, we will map the DeviceProcessEvents table to the ProcessEvent schema from the ASIM (Advanced Security Information Model). The Azure Sentinel repository contains ASIM parsers for many data sources. While these parsers are written in KQL (Kusto Query Language), it is straightforward to rewrite them as Ibis expressions.
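from ibis import _

process_creation_events = (
    events
    # Keep only process creation events
    .filter(_.ActionType == "ProcessCreated")
    # ReportId is a repeating counter, so deduplicate on the full event key
    .distinct(
        on=["ReportId", "Timestamp", "DeviceName"],
        keep="first",
    )
    .order_by(_.Timestamp)
    # Add ASIM ProcessEvent column names alongside the MDE ones
    .mutate(
        TargetProcessId=_.ProcessId,
        TargetProcessFilename=_.FileName,
        TargetProcessCreationTime=_.ProcessCreationTime,
        # ...
    )
)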
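For readers without an Ibis backend at hand, the same steps (filter, deduplicate, order, map) can be sketched in plain Python on a list of dictionaries. This is only an illustration: the Acting* and Parent* target names are assumptions inferred from names used elsewhere in this post, so the elided part of the expression above may differ, and `dedupe`, `to_process_event` and `process_creation_events` are hypothetical helper names.

```python
from datetime import datetime

# MDE column -> ASIM-style column. The first three pairs come from the
# expression above; the Acting*/Parent* pairs are assumed for illustration.
COLUMN_MAP = {
    "ProcessId": "TargetProcessId",
    "FileName": "TargetProcessFilename",
    "ProcessCreationTime": "TargetProcessCreationTime",
    "InitiatingProcessId": "ActingProcessId",
    "InitiatingProcessFileName": "ActingProcessFilename",
    "InitiatingProcessCreationTime": "ActingProcessCreationTime",
    "InitiatingProcessParentId": "ParentProcessId",
    "InitiatingProcessParentFileName": "ParentProcessFilename",
    "InitiatingProcessParentCreationTime": "ParentProcessCreationTime",
}


def dedupe(events):
    """Keep the first event per (ReportId, Timestamp, DeviceName) key."""
    seen = set()
    unique = []
    for event in events:
        key = (event["ReportId"], event["Timestamp"], event["DeviceName"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


def to_process_event(event):
    """Return a copy of the event with the ASIM-style columns added."""
    mapped = dict(event)
    for source, target in COLUMN_MAP.items():
        if source in event:
            mapped[target] = event[source]
    return mapped


def process_creation_events(events):
    """Filter, deduplicate, order by Timestamp, then map the columns."""
    created = [e for e in events if e.get("ActionType") == "ProcessCreated"]
    ordered = sorted(dedupe(created), key=lambda e: e["Timestamp"])
    return [to_process_event(e) for e in ordered]
```

Keeping the mapping in a single dictionary makes it easy to swap in the column names of another EDR without touching the rest of the pipeline.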
Ibis and Spark Connect
By using Ibis the same code can run on a remote data system, locally using DuckDB or even in the browser. In a production system, you would typically connect to distributed query engines like Apache Spark, BigQuery or Snowflake.
Spark Connect allows you to execute Apache Spark queries remotely from a notebook environment (or any other client). When you run a query from your notebook, the client sends the operations to the Spark server, which executes them and streams the results back to the client over gRPC in Arrow format. This client-server architecture lets us run intensive queries on powerful remote clusters while maintaining an interactive notebook experience.
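import ibis
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config
from ibis import _

# Use the "security" profile from the local Databricks configuration
config = Config(profile="security")
spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
con = ibis.pyspark.connect(spark)

device_process_events = (
    con.table(
        name="device_process_events",
        database=("security_logs", "mde"),
    )
    # The event fields live in a struct column; expand it into columns
    .select(_.properties)
    .unpack("properties")
)

# Pull the result to the client and wrap it in a local in-memory table
t = ibis.memtable(device_process_events.to_pyarrow())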
An example of connecting to a remote Spark cluster on Databricks using Ibis and Spark Connect and creating an in-memory table backed by DuckDB. This approach allows remote query execution while maintaining local interactivity, though it’s worth noting that data transfer involves conversion through pandas before reaching DuckDB, so it’s not a zero-copy operation.
Retrieving the right set of process creation events is an interesting challenge in itself, especially if you don’t have access to the complete start and end of a system session. One approach is to examine a wide time interval of process creation events and look for system boot markers like the Windows kernel (ntoskrnl.exe) being loaded. When investigating a specific process, you can define session boundaries by identifying when ntoskrnl.exe was loaded - the previous load marking the session start and the next load indicating a reboot and new session. While this information is typically available in dedicated system event logs, it can be reconstructed from process creation events with careful querying.
ntoskrnl_creation_events = (
    events
    # Processes started directly by the kernel mark a boot
    .filter(_.ParentProcessFilename == "ntoskrnl.exe")
    .select(_.ParentProcessCreationTime)
    .distinct(on="ParentProcessCreationTime", keep="first")
)
An example of how to identify system boot events by looking for ntoskrnl.exe as a parent process. While not implemented in our demo, this approach can help establish session boundaries when investigating incidents. In practice, you might simply look back a few hours or days from a suspicious event, or use other time-based filtering approaches depending on your investigation needs.
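Once the boot times are known, finding the session that contains a given event is a sorted lookup. The sketch below assumes the distinct ParentProcessCreationTime values from the query above have been collected into a Python list of datetimes; `session_bounds` is a hypothetical helper, not part of the demo.

```python
from bisect import bisect_right
from datetime import datetime


def session_bounds(boot_times, event_time):
    """Return (session_start, session_end) around event_time.

    session_start is the latest boot at or before the event and session_end
    is the first boot after it. Either side is None when no such boot is
    present in boot_times.
    """
    boots = sorted(boot_times)
    index = bisect_right(boots, event_time)
    start = boots[index - 1] if index > 0 else None
    end = boots[index] if index < len(boots) else None
    return start, end


# Example: three boots, and a suspicious event between the second and third
boots = [
    datetime(2024, 5, 1, 8, 0),
    datetime(2024, 5, 3, 7, 30),
    datetime(2024, 5, 6, 9, 0),
]
start, end = session_bounds(boots, datetime(2024, 5, 4, 12, 0))
```

A `None` on either side signals that the retrieved time interval was too narrow to contain the corresponding boot, which is a hint to widen the query.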
For the remainder of this article, we’ll assume you already have a collection of process events (retrieved using Apache Spark, Splunk, Elastic, or some other query or search engine) and are now ready to create a process tree visualization. The techniques we’ll cover work regardless of how you obtained your process event data.
Building the Tree
Before we can create visualizations, we need to construct a tree structure from the process creation events. We will use treelib, an efficient tree manipulation library with no external dependencies. The library allows nodes to contain arbitrary data, so we will define a Process.
from datetime import datetime
from typing import ClassVar

# BaseModel is assumed to be pydantic's
from pydantic import BaseModel


class Process(BaseModel):
    # Sentinel PID for an unknown parent. The original value is not shown,
    # so -1 is a placeholder assumption.
    MISSING_PROCESS_ID: ClassVar[int] = -1

    # Process being created
    target_process_id: int
    target_process_filename: str
    target_process_creation_time: datetime

    # Direct parent process
    acting_process_id: int
    acting_process_filename: str
    acting_process_creation_time: datetime

    # Grandparent process
    parent_process_id: int
    parent_process_filename: str
    parent_process_creation_time: datetime

    def identifier(self) -> str:
        return f"{self.target_process_id}|{self.target_process_creation_time}"

    def parent_identifier(self) -> str:
        if self.acting_process_id == Process.MISSING_PROCESS_ID:
            return "<root>"
        return f"{self.acting_process_id}|{self.acting_process_creation_time}"
As the root node, we use a placeholder value <root> from which all processes originate. Each process node has a unique identifier created by concatenating its target_process_id and target_process_creation_time values. When adding a node to the tree, we specify both its identifier and its parent’s identifier. The diagram below illustrates a process tree and shows how these fields relate to each other.
An example of a process tree structure created by treelib. Nodes with diagonal stripes represent processes whose creation events were not directly available, but were reconstructed using ActingProcess or ParentProcess information from other events.
The nodes shown with diagonal stripes (hatched pattern) represent processes where we do not have the original process creation event. This could be because we either did not retrieve the event or because it was not logged by the EDR. However, we can still partially reconstruct these nodes using information from other events - specifically, the ActingProcess or ParentProcess values.
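The reconstruction idea can be sketched without treelib. In the example below each event carries three (pid, filename, creation time) tuples for the target, acting and parent process; the dictionary layout, the `logged` flag and the helper names are assumptions made for this illustration and are not the implementation used in the demo. Nodes with `logged` set to False correspond to the hatched nodes.

```python
from datetime import datetime

ROOT = "<root>"


def node_id(pid, created):
    """Identifier in the 'pid|creation time' form described above."""
    return f"{pid}|{created}"


def build_nodes(events):
    """Map identifier -> node, adding placeholders for unlogged ancestors."""
    nodes = {}

    def upsert(process, parent_key, logged):
        pid, name, created = process
        key = node_id(pid, created)
        node = nodes.get(key)
        if node is None:
            nodes[key] = {"name": name, "parent": parent_key, "logged": logged}
        elif logged or (node["parent"] == ROOT and not node["logged"]):
            # A real creation event, or a newly learned parent for a
            # placeholder, refines what we knew before
            node["parent"] = parent_key
            node["logged"] = node["logged"] or logged
        return key

    for event in events:
        parent_key = upsert(event["parent"], ROOT, logged=False)
        acting_key = upsert(event["acting"], parent_key, logged=False)
        upsert(event["target"], acting_key, logged=True)
    return nodes


t = datetime(2024, 1, 1, 8, 0)
winlogon = (5, "winlogon.exe", t)
userinit = (10, "userinit.exe", t.replace(minute=1))
explorer = (20, "explorer.exe", t.replace(minute=2))
cmd = (30, "cmd.exe", t.replace(minute=3))

# Creation events exist for cmd.exe and explorer.exe only
nodes = build_nodes([
    {"target": cmd, "acting": explorer, "parent": userinit},
    {"target": explorer, "acting": userinit, "parent": winlogon},
])
```

Here winlogon.exe and userinit.exe end up as placeholder nodes, yet userinit.exe is still attached under winlogon.exe because the second event names winlogon.exe as its parent.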
Next, we define a ProcessTree class that uses treelib to construct our tree structure. When initialized, this class creates a tree with a root node, and provides methods to build out the process hierarchy. The class handles both direct process creation events and reconstructs missing nodes using parent process information.
from treelib import Tree


class ProcessTree:
    def __init__(self, processes: list[Process] | None = None):
        self.tree: Tree = Tree()
        # Placeholder root from which all processes originate
        self.root = self.tree.create_node(
            tag="<root>",
            identifier="<root>",
            data=None,
        )

    def insert_or_update(self, process: Process):
        ...

    def insert_process(self, process: Process):
        # Reconstruct the grandparent from the fields carried by this event
        parent_process = Process(
            target_process_id=process.parent_process_id,
            target_process_filename=process.parent_process_filename,
            target_process_creation_time=process.parent_process_creation_time,
        )
        ...
        # Insert ancestors first so each node's parent already exists
        self.insert_or_update(parent_process)
        self.insert_or_update(acting_process)
        self.insert_or_update(process)

    def create_dependentree_format(self):
        ...
The ProcessTree class builds a tree structure from process events, tracking how processes are created and relate to each other. It can handle both direct process creation events and fill in missing information about parent processes, ensuring we have a complete picture of process relationships.
The create_dependentree_format method (omitted for brevity) transforms the hierarchical process structure into the format required by DependenTree, a graph visualization library built on the D3 tree layout. The expected format is a list of dictionaries, where each dictionary represents a node (process) in the tree. The only fields required by DependenTree are _name and _deps. However, we also want the structure used for the tree to contain additional fields:
Field | Description |
---|---|
_name | The unique identifier of the process |
_deps | A list containing the identifier of the parent processes. In our use-case there is always only one parent so it's a list of one element. |
ProcessName | The filename of the process. |
FileName | Name of the file that the recorded action was applied to |
ProcessId | The process ID. |
ProcessCreationTime | The creation time of the process. |
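To make the format in the table above concrete, here is a minimal sketch of such a conversion. It is not the omitted create_dependentree_format method: the input dictionaries, the `to_dependentree` name, the empty `_deps` list on the root entry and the reuse of the filename for both name fields are all assumptions for illustration.

```python
import json
from datetime import datetime, timedelta

ROOT = "<root>"


def to_dependentree(processes):
    """Build a list of node dictionaries with _name, _deps and extra fields."""
    # Assumption: the root is emitted as an entity without dependencies
    entities = [{"_name": ROOT, "_deps": []}]
    for process in processes:
        entities.append(
            {
                "_name": f"{process['pid']}|{process['created']}",
                # Always exactly one parent in a process tree
                "_deps": [process["parent"]],
                # In this sketch both name fields carry the same value
                "ProcessName": process["filename"],
                "FileName": process["filename"],
                "ProcessId": process["pid"],
                # datetime objects are not JSON serializable by default,
                # so an ISO 8601 string is used instead
                "ProcessCreationTime": process["created"].isoformat(),
            }
        )
    return entities


boot = datetime(2024, 5, 1, 8, 0, 0)
system = {"pid": 4, "filename": "ntoskrnl.exe", "created": boot, "parent": ROOT}
smss = {
    "pid": 412,
    "filename": "smss.exe",
    "created": boot + timedelta(seconds=2),
    "parent": f"4|{boot}",
}

entities = to_dependentree([system, smss])
payload = json.dumps(entities)  # succeeds because every value is JSON-friendly
```

Checking that the result survives `json.dumps` is a cheap way to catch values that could not be sent to the JavaScript side.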
The Widget
With the process tree data structure in place, the next step is creating an interactive widget for computational notebooks. anywidget does two things: it provides the tooling for Jupyter-compatible widget creation and implements the Anywidget Front-End Module (AFM) specification based on standard ECMAScript modules.
To create the widget’s frontend, we need to write an ES module that defines lifecycle methods, e.g.,
- initialize: Sets up the widget’s initial state and event listeners
- render: Handles the actual rendering of the widget in the notebook
The host platform (like Jupyter or marimo) loads this module and communicates with it through a standardized interface. Here’s the basic structure:
export default {
  initialize({ model }) {
    // Add instance-specific event listeners
    return () => {
      // Clean up event listeners
    };
  },
  render({ model, el }) {
    // Render the widget
    return () => {
      // Clean up event listeners
    };
  },
};
The AFM module defines core widget lifecycle methods for initialization and rendering, each returning cleanup functions. Through synchronized traits, it enables bidirectional Python-JavaScript communication, allowing features like interactive selection and brushing. For details, see reusable widgets for interactive analysis and visualization in computational notebooks.
For our process tree visualization widget, we want to maintain a shared tree structure state between Python and JavaScript, with bidirectional synchronization of both the tree and the currently selected node. This means that when a user clicks a node in the visualization, the selection should be reflected in Python, and when we update the tree structure in Python, the widget should re-render the tree visualization. This bidirectional communication is handled through traitlets: we’ll define both an events trait for the tree structure and a process_id trait for tracking the currently selected process.
For the host side we need to define an anywidget.AnyWidget subclass:
import pathlib

import anywidget
import traitlets


class Widget(anywidget.AnyWidget):
    _esm = pathlib.Path(__file__).parent / "static" / "widget.js"
    events = traitlets.List([]).tag(sync=True)
    # Updated from JavaScript when a node is clicked
    process_id = traitlets.Int(0).tag(sync=True)
Process tree widget. The events property is a synchronized list that contains the process tree data. When this list is modified in Python, the changes are automatically reflected in the JavaScript client, triggering a re-render of the visualization. The _esm is the JavaScript side of things.
For the AFM we need to load and set up the DependenTree, insert it into the DOM and pass it the events list which was generated by the create_dependentree_format method in our ProcessTree class. This connects our Python data structure to the JavaScript visualization.
// slightly modified version to allow
// for node selection and styling
import DependenTree from "https://esm.sh/gh/kyrre/dependentree@dev"

export default {
  render({ model, el }) {
    this.treeDiv = document.createElement("div");
    this.treeDiv.id = "tree";
    this.activePid = null;

    // this callback function is called when the events list
    // is changed on the Python side, so we re-create the visualization
    // with the new data
    model.on("change:events", () => {
      this.tree.removeTree();
      this.tree = new DependenTree(this.treeDiv, options);
      this.tree.addEntities(structuredClone(model.get("events")));
      this.tree.setTree('<root>', 'downstream');
    });

    el.classList.add("process_tree_widget");
    el.appendChild(this.treeDiv);

    const options = {
      // ...
      // settings omitted for brevity

      // whenever we click a node in tree we update the
      // process_id value, which is then synced back to
      // Python via the process_id traitlet
      nodeClick: (node) => {
        model.set("process_id", node.ProcessId);
        model.save_changes();
      }
    };

    // the rendering needs to complete before we create the tree
    // via discord :blessed:
    requestAnimationFrame(() => {
      this.tree = new DependenTree(this.treeDiv, options);
      this.tree.addEntities(structuredClone(model.get("events")));
      this.tree.setTree('<root>', 'downstream');
    });
  }
}
The process tree visualization AFM implements the widget’s frontend logic. It creates a DOM container for the tree, initializes the DependenTree visualization library, and establishes bidirectional communication with Python. When the shared events state changes (triggered from Python), the “change:events” callback recreates the visualization using the new data. Conversely, when a user clicks a node, the widget updates the process_id value, which synchronizes back to Python, enabling interactive exploration.
Interactive Demo
With all the components for our process tree visualization in place, we can now build a notebook that showcases how the widget works in practice, allowing you to:
- Filter process events by time range by using a marimo datetime slider
- Explore the hierarchical process tree structure
- Select individual processes to view their details
- See the bidirectional communication between Python and JavaScript in action
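On the Python side, the first and third items in the list above amount to two small lookups. The sketch below assumes the events are dictionaries with ASIM-style keys; in the notebook the time range would come from the marimo slider and the selected PID from the widget's process_id trait. `filter_by_time` and `find_processes` are hypothetical helper names.

```python
from datetime import datetime


def filter_by_time(events, start, end):
    """Keep events whose creation time falls within [start, end]."""
    return [
        event
        for event in events
        if start <= event["TargetProcessCreationTime"] <= end
    ]


def find_processes(events, selected_pid):
    """Return every event matching the selected PID.

    PIDs can be reused over time, so more than one event may match.
    """
    return [e for e in events if e["TargetProcessId"] == selected_pid]


events = [
    {"TargetProcessId": 100, "TargetProcessFilename": "cmd.exe",
     "TargetProcessCreationTime": datetime(2024, 5, 1, 9, 0)},
    {"TargetProcessId": 200, "TargetProcessFilename": "powershell.exe",
     "TargetProcessCreationTime": datetime(2024, 5, 1, 12, 0)},
    {"TargetProcessId": 100, "TargetProcessFilename": "notepad.exe",
     "TargetProcessCreationTime": datetime(2024, 5, 2, 9, 0)},
]

in_range = filter_by_time(
    events, datetime(2024, 5, 1, 0, 0), datetime(2024, 5, 1, 23, 59)
)
selected = find_processes(in_range, 100)
```

Filtering by time before looking up the PID narrows the matches, which matters because a PID alone does not identify a process uniquely.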
Since marimo notebooks can be run entirely in the browser by using Pyodide (CPython ported to WebAssembly), we can generate a static WASM notebook and embed it directly in an iframe. This is great for documentation and for creating examples.
Note: When running in WebAssembly via Pyodide, we need to handle a few additional setup steps: downloading and installing packages through micropip, fetching our Parquet data files via HTTP, converting them to Arrow and creating the in-memory dataframe. While this setup code may look a bit involved, most of the complexity is due to the workarounds needed to run the demo in a browser environment. The core visualization functionality remains the same whether you’re running locally or in WebAssembly.
Below you’ll find an interactive notebook where you can explore the example data. Note that nodes must be double-clicked to expand. Opening the notebook in a new tab is recommended to better explore the more deeply nested subtrees. The tree appears close to the bottom of the notebook after it has had some time to generate.
The notebook doesn’t work on mobile, so in that case there is only a video.
Interactive process tree visualization running entirely in your browser via WebAssembly. This demo showcases the power of bidirectional communication between Python and JavaScript - you can filter the dataset using the time range controls, and clicking on any process node updates the Python state, allowing for detailed inspection of selected processes. The reactive nature of marimo ensures all components stay synchronized as you explore the data.
It’s also clear from the visualization that the EDR wasn’t able to log all the process relationships, which is why not all processes are properly nested under ntoskrnl.exe. This illustrates the challenge we discussed earlier with the hatched nodes in our tree diagram: some process creation events are missing from the logs, requiring us to reconstruct relationships from parent process information. We recommend exploring the subtree ntoskrnl.exe → smss.exe → winlogon.exe → userinit.exe → explorer.exe, which shows a user launching a sequence of applications.
Conclusion
In this post, we demonstrated how to build an interactive process tree visualization widget using:
- marimo - a reactive Python notebook environment
- anywidget and AFM - connecting Python and JavaScript for widget creation
- DependenTree - creating interactive tree visualizations with d3
- ibis - a backend-agnostic dataframe library
By transforming raw process logs into an interactive tree visualization, this widget helps incident responders understand the chain of process executions when investigating security issues. The bidirectional communication between Python and JavaScript makes the analysis interactive: analysts can click nodes in the visualization to select processes of interest, then query and analyze the selected process data in Python. This integration of visualization and analysis helps analysts explore process relationships and examine details during an investigation.
The solution can work with different EDR data sources by mapping their process events to the ASIM schema, and the visualization can be modified using D3 and other JavaScript libraries or frameworks like React or Vue. Additionally, thanks to Pyodide, the notebook can run directly in the browser via WebAssembly, making it easy to share and demonstrate.
Future Improvements
While the current implementation works well for typical process trees, there are some areas for future enhancement:
- Handling processes with many children: The visualization can become overwhelming when dealing with processes that spawn hundreds of child processes (like services.exe).
- Timeline filtering: Adding timeline controls would allow users to focus on specific time intervals, making it easier to analyze process relationships during particular periods of interest.
- Additional context: Incorporating more process metadata and allowing filtering based on process attributes could provide valuable context during investigations.
References
- anywidget - Framework for creating custom Jupyter and marimo notebook widgets
- marimo - Reactive Python notebook
- ibis - Python dataframe library
- Apache Spark - Distributed query engine
- Spark Connect - Spark’s client-server interface
- dependentree - D3 tree visualization library
- treelib - Tree data structure manipulation library
- Pyodide - Python runtime for the browser