Building the Datadog Agent on RISC-V: A Personal Journey
This article details my experience building the Datadog agent from source on a RISC-V processor, specifically the StarFive JH7110 found in the Milk-V CM compute module. This was a personal challenge, as Datadog does not officially support RISC-V, and documentation on this process is non-existent. My only resource was the public Datadog agent repository.
My journey with this board began in 2023. While the processor is adequate for basic tasks, it's not exactly a powerhouse. Initially, I intended to use it in my Uconsole cyberdeck (from ClockworkPi), which normally uses a Raspberry Pi compute module. However, I encountered driver issues that prevented the Milk-V from driving the display – a story for another time.
I explored other uses for the board. My initial ideas, like setting up Nextcloud or Internet in a Box, ran into roadblocks. Nextcloud's complexity and limited RISC-V library support at the time made it impractical. Internet in a Box, while I eventually got it running, performed so poorly on the RISC-V SoC that it was essentially unusable. These projects were shelved in 2023, and the board sat idle until 2024.
Working at Datadog sparked a new idea: could I run the Datadog agent on this RISC-V board? Since I wasn't using it for anything else, running the agent seemed like a worthwhile experiment. This "just because" mentality is what drove me to tackle this project.
I'm comfortable navigating Linux, but I had never compiled a Linux kernel, so my first step was to consult an AI chatbot (Gemini). I asked it for a step-by-step guide on building the Datadog agent from source and installing it on a RISC-V processor. Remember, there's no official Datadog support for RISC-V, so this was uncharted territory, and I was essentially having my hand held by an AI chatbot.
Deep Dive into Go Dependency Hell: Building the Datadog Agent on RISC-V
Building software from source can be a smooth, satisfying experience – or it can plunge you into a labyrinth of cryptic errors, conflicting dependencies, and platform-specific quirks. This article documents a real-world troubleshooting journey: building the Datadog Agent on a VisionFive 2 single-board computer, a RISC-V board running a Debian-based Linux distribution. What started as a seemingly straightforward task turned into a deep dive into Go's module system, C library compatibility, and the intricacies of Git's tagging system. If you're facing similar build issues, especially with Go projects on less common architectures, this might save you some serious head-scratching.
The Initial Goal: A Simple Build
The initial goal seemed simple enough: build the Datadog Agent from source, following the instructions in the official repository's README. The Datadog Agent is a crucial piece of monitoring infrastructure that collects metrics and logs from your systems. The target platform was a RISC-V single-board computer. The documentation suggested using the invoke build system:
git clone https://github.com/DataDog/datadog-agent.git
cd datadog-agent
invoke agent.build --build-exclude=systemd # Example command
Round 1: "No Space Left on Device" - The Misleading Partition
The first attempt immediately hit a wall:
fatal: write error: No space left on device
fatal: fetch-pack: invalid index-pack output
My initial thought was: "What do you mean?!" I had plenty of space on my 32GB microSD card. A quick df -h confirmed plenty of total space. But further inspection with df -h and lsblk revealed the real problem:
- lsblk showed the microSD card (mmcblk0) as 28.9GB (it was partitioned at the beginning).
- df -h showed the root filesystem (/dev/mmcblk0p4) as only 3.7GB!
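The mismatch between the two tools can be turned into a quick sanity check. The sketch below is only an illustration: the helper names are made up for this article, and the two sizes are the lsblk and df -h figures quoted above.

```python
import shutil


def unallocated_fraction(device_gb: float, filesystem_gb: float) -> float:
    """Fraction of the block device that the filesystem does not cover."""
    if device_gb <= 0:
        raise ValueError("device size must be positive")
    return max(0.0, 1.0 - filesystem_gb / device_gb)


def root_filesystem_gib(path: str = "/") -> float:
    """Total size of the filesystem mounted at `path`, in GiB."""
    return shutil.disk_usage(path).total / 1024**3


# Sizes reported by lsblk (whole card) and df -h (root filesystem).
print(f"{unallocated_fraction(28.9, 3.7):.0%} of the card was not part of the root filesystem")
```

Anything far above zero here suggests that the root partition was never grown to fill the card.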
The Linux installation had created a small root partition, leaving the vast majority of the card unused. The git clone operation, which downloads the entire history of a repository (potentially much larger than the current source code), was filling up this small partition.
Solution 1: Expand the Root Filesystem
Since I couldn't easily use a live environment (a limitation of the setup), I opted for the riskier, but necessary, approach of resizing the mounted root filesystem. I used these commands:
sudo parted /dev/mmcblk0 # Replace mmcblk0 if needed
(parted) resizepart
Partition number? 4 # Your root partition number!
End? 100%
(parted) print # Verify the change! Make sure the size of mmcblk0p4 is now ~28.8GB.
(parted) quit
sudo resize2fs /dev/mmcblk0p4
df -h # Verify
To expand to the whole disk, I redid the parted command with a manual specification of the endpoint.
Round 2: Network Nightmares and Shallow Clones
With the partition resized, I tried git clone again. This time, different errors appeared:
error: RPC failed; curl 92 HTTP/2 stream 0 was not closed cleanly: CANCEL (err 8)
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
These pointed to network instability. The Datadog Agent repository is large, and any interruption during the clone can cause these errors.
Solution 2: Shallow Clone and Network Stability
The essential fix was to use a shallow clone:
git clone --depth 1 https://github.com/DataDog/datadog-agent.git
--depth 1 tells Git to download only the latest commit, drastically reducing the amount of data transferred. I also made sure I had a stable internet connection (switching to a wired connection, if possible, is always a good idea for large downloads). I also increased Git's HTTP buffer size just in case, although http.postBuffer mainly affects data that Git sends to the server, so it may not have mattered for the clone:
git config --global http.postBuffer 524288000
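On a flaky connection it can also help to retry the clone automatically. The following is a minimal sketch, not something the Datadog tooling provides: both function names are invented here, and the retry count and delay are arbitrary defaults.

```python
import subprocess
import time


def shallow_clone_command(url, dest, depth=1):
    """Build the argument list for a shallow `git clone`."""
    return ["git", "clone", "--depth", str(depth), url, dest]


def run_with_retries(cmd, attempts=3, delay_s=5.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run `cmd` until it exits with status 0 or `attempts` is used up.

    Returns the number of attempts that were needed and raises
    RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        if runner(cmd).returncode == 0:
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # give the network a moment before retrying
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")
```

A call such as run_with_retries(shallow_clone_command("https://github.com/DataDog/datadog-agent.git", "datadog-agent")) would then repeat the shallow clone up to three times.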
Round 3: pip3: command not found - Missing Python Package Installer
The next hurdle was a simple one, but easily overlooked:
bash: pip3: command not found
pip3, the Python package installer, wasn't installed. This is common on minimal Linux installations.
Solution 3: Install python3-pip
The fix was straightforward, using the distribution's package manager:
sudo apt update # For Debian/Ubuntu
sudo apt install python3-pip
Round 4: "can't find Rust compiler" - The Rust Toolchain
Progress! The clone worked, and pip3 was available. But the build process itself now failed:
error: can't find Rust compiler
The Datadog Agent uses Rust for some of its components, and the build system couldn't find the necessary tools.
Solution 4: Install the Rust Toolchain with rustup
The recommended way to install Rust is using rustup:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env" # Or restart your shell
rustc --version # Verify installation
cargo --version # Verify Cargo (Rust's package manager)
rustup target add riscv64gc-unknown-linux-gnu # For cross compilation
Since I was on the RISC-V architecture, I also added the matching target.
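Whether the extra target is needed depends on where the build runs. When building natively on the board, the default rustup toolchain most likely already targets riscv64gc-unknown-linux-gnu, so the rustup target add step only matters when cross-compiling from another machine. A small sketch of that decision, with helper names invented for illustration:

```python
import platform

# Machine names as reported by `uname -m`, mapped to Rust target triples.
RUST_LINUX_TARGETS = {
    "riscv64": "riscv64gc-unknown-linux-gnu",
    "x86_64": "x86_64-unknown-linux-gnu",
    "aarch64": "aarch64-unknown-linux-gnu",
}


def rust_target_for(machine):
    """Rust target triple for a Linux machine name, or ValueError if unknown."""
    try:
        return RUST_LINUX_TARGETS[machine]
    except KeyError:
        raise ValueError(f"no known Rust target for {machine!r}") from None


def needs_cross_target(host_machine, wanted_target):
    """True when the host's native target differs from the one we want."""
    return RUST_LINUX_TARGETS.get(host_machine) != wanted_target


print(platform.machine(), needs_cross_target(platform.machine(), "riscv64gc-unknown-linux-gnu"))
```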
Round 5: maturin Missing - Build Dependency Woes
The build continued to fail, but with a new error:
FileNotFoundError: [Errno 2] No such file or directory: 'maturin'
maturin is a tool for building and packaging Rust-based Python packages. Even though pip had seemingly installed the build dependencies, maturin (used internally) wasn't available.
Solution 5: python3 -m maturin and Explicit Reinstall
Two steps helped here:
- Verify the maturin installation (diagnostically):
python3 -m maturin --version
If this showed an error, it meant maturin was still not installed.
- Force-reinstall maturin:
pip3 install --force-reinstall maturin
And to be sure, I upgraded some basic tools:
pip3 install --upgrade pip setuptools wheel
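The python3 -m check works because it asks the interpreter, not the shell, whether a package can be found. The same question can be asked for several build tools at once. This is a rough sketch; the list of module names is just the set mentioned in this article.

```python
import importlib.util


def module_available(name):
    """True if `import name` would find a top-level module in this interpreter."""
    return importlib.util.find_spec(name) is not None


def missing_modules(names):
    """Subset of `names` that this interpreter cannot import."""
    return [name for name in names if not module_available(name)]


# Packages this build needed from the Python side.
print(missing_modules(["maturin", "invoke", "setuptools", "wheel"]))
```

An empty list means every package is importable by the same python3 that will run the build.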
Round 6: invoke: command not found - The PATH to Success
Another seemingly simple, yet frustrating error:
bash: invoke: command not found
invoke was installed (I could see it with pip3 show invoke), but the executable (inv) wasn't in my shell's PATH.
Solution 6: python3 -m invoke (and Updating PATH)
Two solutions worked:
- Use python3 -m invoke: This is the most robust and portable way to run invoke (and many other Python-based tools):
python3 -m invoke agent.build --build-exclude=systemd
- Add ~/.local/bin to PATH (for convenience): This is a good long-term solution, making user-installed executables available:
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc # Or ~/.zshrc, etc.
source ~/.bashrc # Or open a new terminal
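To see which of the two situations applies, it helps to check both whether the directory is on PATH and whether the shell can resolve the tool. A small sketch using only the standard library; the function names are made up for this article.

```python
import os
import shutil


def directory_on_path(directory, path_env):
    """True if `directory` is one of the entries in a PATH-style string."""
    wanted = os.path.normpath(os.path.expanduser(directory))
    entries = [os.path.normpath(p) for p in path_env.split(os.pathsep) if p]
    return wanted in entries


def locate(tool):
    """Full path of `tool` as the shell would resolve it, or None."""
    return shutil.which(tool)


print(directory_on_path("~/.local/bin", os.environ.get("PATH", "")))
print(locate("invoke"))
```

If the first line prints False and the second prints None, the package is probably installed under ~/.local/bin but invisible to the shell, which matches the error above.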
Even though I had added the path to my .bashrc, I simply kept using the python3 -m form to save myself the headaches:
python3 -m invoke agent.build --build-exclude=systemd
Round 7: "No idea what '--arch' is!" - Task Definition Mismatch (Gemini hallucination)
Finally, invoke ran! But now the build task itself complained:
No idea what '--arch' is!
This meant the agent.build task, defined in the Datadog Agent's tasks.py file, didn't have an --arch option. I had to check the documentation and the tasks.py and tasks/__init__.py files.
Solution 7: AI can be great, but verifying the code is sometimes greater
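A quick way to verify a suggested flag is to look at what the task actually accepts. Running invoke --help agent.build should list the task's options. The reason this works is that invoke derives a task's flags from the parameters of the Python function behind it. The sketch below imitates that mapping with a stand-in function; the build signature shown is hypothetical, not the agent's real one.

```python
import inspect


def build(ctx, build_exclude=None, rebuild=False):
    """Hypothetical stand-in for a task; not the agent's real signature."""


def accepted_flags(task_func):
    """Long-form flags implied by a task function's parameters."""
    params = list(inspect.signature(task_func).parameters)[1:]  # skip ctx
    return ["--" + name.replace("_", "-") for name in params]


print(accepted_flags(build))
print("--arch" in accepted_flags(build))
```

If a flag does not correspond to a parameter of the task function, invoke has no idea what it is, which is exactly what the error message said.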
The tasks.py file does not have an --arch flag, or at least I could not find one. It seems the AI chatbot hallucinated this option.
Round 8: "No tags can describe..." - The Git Tag Mystery
The build almost worked, but then hit this final, perplexing error:
fatal: No tags can describe '7b722a315a84db8b0ccdbc1d8300df192c61fddb'. Try --always, or create some tags.
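(Newbie mistake) This was a git describe error within the build process. git describe uses tags to generate version strings. The error meant that the current commit (identified by its hash) didn't have any ancestor tags matching the pattern "7.*" (Datadog Agent version 7 tags). I had fetched tags, so this was a surprise. In hindsight, the shallow clone probably played a part: with --depth 1 the history behind the current commit is missing, so git describe cannot walk back to an older tag even when the tags themselves have been fetched.
Solution 8: Building from a Tag, Not the Bleeding Edge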
The root cause was that I was trying to build from the very latest commit on the main branch, which hadn't yet been tagged with a release. The correct solution is to build from a tagged release:
- List tags:
git tag -l | grep "7\."
- Checkout a tag:
git checkout tags/7.61.0 # Replace with a recent tag
- Build:
python3 -m invoke agent.build --build-exclude=systemd
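Picking "a recent tag" by eye from the grep output is error-prone, because a plain text sort puts 7.9.0 after 7.61.0. The sketch below selects the highest plain X.Y.Z tag numerically and skips release candidates. It assumes release tags look like 7.61.0, as in the checkout command above; the function name is invented for illustration.

```python
import re

_RELEASE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def latest_release_tag(tags, major=7):
    """Return the highest plain X.Y.Z tag for `major`, or None if there is none."""
    versions = []
    for tag in tags:
        match = _RELEASE.match(tag.strip())
        if match and int(match.group(1)) == major:
            versions.append(tuple(int(part) for part in match.groups()))
    if not versions:
        return None
    return ".".join(str(part) for part in max(versions))


# Example input in the shape of `git tag -l` output.
print(latest_release_tag(["7.9.0", "7.61.0", "7.61.0-rc.1", "6.53.0", "7.60.1"]))
```

In practice the tag list would come from the lines of git tag -l run inside the repository.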
Alternative (not recommended): Modifying tasks/libs/common/utils.py
I also found where to add the --always flag that was suggested in the error message. This is not the recommended approach, but I'll include it for completeness. The change goes in the get_version function of ~/datadog-agent/tasks/libs/common/utils.py:
latest_tag = ctx.run(
    f'git describe --tags --match "{tag_prefix}*" --abbrev=0 --always',  # Added --always
    hide=True,
    env=get_goenv(ctx),
    encoding='utf-8',
).stdout.strip()
I honestly would not touch it just to prevent breaking more stuff.
Round 9: gopsutil - The Stubborn Dependency
Even after all this, the build still failed with a gopsutil error:
# github.com/DataDog/gopsutil/host
../go/pkg/mod/github.com/!data!dog/gopsutil@v1.2.2/host/host_linux.go:142:22: undefined: sizeOfUtmp
../go/pkg/mod/github.com/!data!dog/gopsutil@v1.2.2/host/host_linux.go:147:14: undefined: sizeOfUtmp
../go/pkg/mod/github.com/!data!dog/gopsutil@v1.2.2/host/host_linux.go:149:9: undefined: utmp
This was the most persistent issue. It turned out to be a combination of a tricky dependency conflict and a platform-specific incompatibility.
Solution 9, sort of: Forcing the Correct gopsutil Version (and Understanding Why)
The Datadog Agent uses three versions of the gopsutil library: one maintained by Datadog (the troublemaker) and two maintained by shirou. The root cause is likely that the Datadog fork (github.com/DataDog/gopsutil) lags behind the community-maintained versions (github.com/shirou/gopsutil/v2 and github.com/shirou/gopsutil/v3). The undefined: sizeOfUtmp and undefined: utmp errors suggest that v1.2.2 of the Datadog fork has no definitions for the utmp structure on this architecture, even though the host package needs them. This incompatibility might stem from differences in how system calls or data structures are handled on RISC-V compared to the architectures Datadog's fork was originally developed and tested for.
Inspecting dependencies confirmed that a transitive dependency was still pulling in the older, problematic code:
go list -m all | grep "github.com/shirou/gopsutil"
go mod graph | grep "github.com/DataDog/gopsutil"
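The grep on go mod graph shows every matching line, but the interesting part is the left-hand column: the module that asks for the fork. Each line of go mod graph has the form "module@version requirement@version". The sketch below extracts the requesting modules; the sample graph is made up purely to show the format.

```python
def dependents_of(graph_text, needle):
    """Modules that directly require anything whose path contains `needle`."""
    found = set()
    for line in graph_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        parent, child = parts
        if needle in child:
            found.add(parent)
    return sorted(found)


# Made-up sample in the `go mod graph` format, for illustration only.
SAMPLE = """\
example.com/app example.com/lib@v0.1.0
example.com/lib@v0.1.0 github.com/DataDog/gopsutil@v1.2.2
example.com/app github.com/shirou/gopsutil/v3@v3.0.0
"""

print(dependents_of(SAMPLE, "github.com/DataDog/gopsutil"))
```

Run against the real output, a list like this points at the packages that would have to change before the fork can be dropped.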
I attempted to force updates of the direct dependency and its transitive dependencies:
go get -u github.com/DataDog/datadog-agent/pkg/network/go # Get the direct dependency.
go get -u github.com/shirou/gopsutil/v3@latest
go mod tidy
go mod verify
goenv invoke agent.clean
Rebuild:
python3 -m invoke agent.build --build-exclude=systemd
But I guess life is not that easy, and now I need to figure out how to bypass DataDog/gopsutil and just use the newer versions maintained by shirou.
Rounds 9, 10, 11, and beyond: The Persistent Dependency Conflicts (gopsutil and go-systemd)
This is where the real troubleshooting began. Even after all the previous steps, the build consistently failed with errors related to gopsutil (and later, go-systemd). This was not a simple missing dependency; it was a conflict between different versions of the same module, caused by Datadog's forked version.
The Core Problem: Go Module Conflicts and Forks
Go uses a module system to manage dependencies. The go.mod file lists these dependencies and their versions. It can also use replace directives to tell Go to use a specific version of a module, or even a local path, instead of the one specified in the dependency tree.
The Datadog Agent uses gopsutil for system information. However, there are multiple versions:
- github.com/DataDog/gopsutil: An older, Datadog-maintained fork.
- github.com/shirou/gopsutil/v2: A community-maintained version.
- github.com/shirou/gopsutil/v3: Another community-maintained version.
The build was trying to use the older, incompatible Datadog fork (v1.2.2), even though the project intended to use the newer shirou/gopsutil/v3. Later, a similar issue appeared with go-systemd.
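Since replace directives can silently redirect a module, it is worth listing them before changing anything. The sketch below reads go.mod text and returns each "old => new" pair, handling both the single-line and the parenthesised block form. The sample file and its module paths are invented for illustration and are not taken from the agent's go.mod.

```python
def replace_directives(go_mod_text):
    """Return (old, new) pairs from the replace directives in go.mod text."""
    pairs = []
    in_block = False
    for raw in go_mod_text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if in_block:
            if line == ")":
                in_block = False
                continue
        elif line == "replace (":
            in_block = True
            continue
        elif line.startswith("replace "):
            line = line[len("replace "):]
        else:
            continue  # some other directive
        if "=>" in line:
            old, new = (side.strip() for side in line.split("=>", 1))
            pairs.append((old, new))
    return pairs


# Invented go.mod content showing both forms of the directive.
SAMPLE = """\
module example.com/app

go 1.23

replace example.com/old => example.com/new v1.0.0

replace (
    example.com/forked => ../local/forked // local checkout
)
"""

for old, new in replace_directives(SAMPLE):
    print(old, "=>", new)
```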
The Solution (So Far): Precise go.mod and go.work Configuration, and Removing All Direct Calls to DataDog/gopsutil
The key to (hopefully) resolving these conflicts is to meticulously configure go.mod and go.work.
- go.work (CRITICAL): It must contain only
go 1.23
use .
No other use directives.
- go.mod (CRITICAL): Remove all mentions of github.com/DataDog/gopsutil and replace them with github.com/shirou/gopsutil/v3 wherever DataDog/gopsutil was being used and accessed directly.
Carefully crafting the go.mod and go.work files to force the use of the correct dependencies and resolve conflicts is the most important part of the entire process.
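Finding every place that still uses the fork directly is tedious by hand. As a rough aid, the sketch below walks a source tree and lists the .go files that contain the module path inside a quoted string. It is a plain text search rather than a real Go parser, so it can report false positives, and the function names are made up for this article.

```python
import os
import re


def imports_module(source, module_path):
    """True if `source` contains `module_path` (or a subpackage) as a quoted string."""
    pattern = re.compile(r'"' + re.escape(module_path) + r'(/[^"]*)?"')
    return pattern.search(source) is not None


def files_importing(root, module_path):
    """Sorted list of .go files under `root` that mention `module_path`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".go"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                if imports_module(handle.read(), module_path):
                    hits.append(path)
    return sorted(hits)
```

Calling files_importing(".", "github.com/DataDog/gopsutil") from the repository root would give the list of files to review.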
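The real solution: Taking the time to define sizeOfUtmp and the utmp struct for RISC-V
The failing code in host_linux.go slices the raw utmp data into fixed-size records:
count := len(buf) / sizeOfUtmp
ret := make([]UserStat, 0, count)
for i := 0; i < count; i++ {
    b := buf[i*sizeOfUtmp : (i+1)*sizeOfUtmp]
    // ... (rest of the loop omitted)
}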
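To make it concrete what such a definition has to capture, here is the same slicing idea in Python. The format string describes glibc's struct utmp as laid out on x86_64 Linux, which is an assumption made for illustration. The correct size and layout for RISC-V would need to be confirmed on the board itself, for example by printing sizeof(struct utmp) from a small C program.

```python
import struct

# Assumed layout: glibc's `struct utmp` as used on x86_64 Linux.
# Fields: type, pid, line[32], id[4], user[32], host[256], exit (2 shorts),
# session, tv_sec, tv_usec, addr_v6[4], 20 reserved bytes.
UTMP_FORMAT = "@hi32s4s32s256shhiii4i20s"
SIZE_OF_UTMP = struct.calcsize(UTMP_FORMAT)  # 384 under the assumption above


def split_records(buf, record_size=SIZE_OF_UTMP):
    """Slice a raw utmp buffer into fixed-size records, dropping any partial tail."""
    count = len(buf) // record_size
    return [buf[i * record_size:(i + 1) * record_size] for i in range(count)]


print(SIZE_OF_UTMP, len(split_records(bytes(2 * SIZE_OF_UTMP + 10))))
```

If the constant is wrong for the platform, every record after the first is read at the wrong offset, which is why the value cannot simply be guessed.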
Why is this the real solution? Because even after I got past this undefined issue and forced the project to use gopsutil v3, I began getting many more dependency issues all over the build process, which prompted me to just take the L instead of sinking more time into this.
Lessons Learned
This troubleshooting journey, while not fully concluding with a successful build due to the unresolved gopsutil issue, provided significant learning opportunities. Here are the key lessons learned:
- Carefully Read Error Messages: Each error message, even the seemingly cryptic ones, contained valuable clues. The progression of errors, from "no space" to "network error" to "Rust compiler" and so on, formed a chain of problems, each requiring a specific solution. Relying on AI chatbots can be helpful for initial guidance, but always verify the suggestions against the actual code and documentation.
- Understand Your Tools: A solid understanding of tools like git clone, df, lsblk, pip3, invoke, go mod, grep, and rustup is crucial for debugging build issues. Specifically, mastering pip3 and the Python ecosystem's build tools is essential when working with Python-based projects.
- Go Modules are Powerful but Complex: Go's module system is designed to manage dependencies, but subtle conflicts can still arise, especially when dealing with forks or when platform-specific issues are involved. Tools like go mod graph, go list -m, go get, go mod tidy, and go mod verify are indispensable for navigating Go dependency management.
- Platform Matters: Building on less common architectures like RISC-V significantly increases the chance of encountering platform-specific compatibility issues. Be prepared for potential challenges related to library availability, system call differences, and toolchain support when targeting less mainstream architectures.
- Build from Tags for Stability: When building software from source, especially for complex projects like the Datadog Agent, it is generally best practice to build from tagged releases. Building from the bleeding edge of the main branch can lead to unexpected errors and inconsistencies, as demonstrated by the "git describe" error.
- Shallow Clone for Large Repositories (with Tag Awareness): For large repositories, using shallow clones (git clone --depth 1) can drastically reduce download times and disk space usage. However, remember to fetch tags separately (git fetch --tags) if the build process relies on tags, as was the case in this project.
- Consult Official Documentation: Always refer to the official documentation for the software you are building. The Datadog Agent's documentation provided valuable clues about the build process and dependencies, and carefully reviewing it can save significant time and effort.
Conclusion
Building the Datadog Agent on RISC-V proved to be a challenging but ultimately insightful journey into the complexities of software builds, dependency management, and platform-specific issues. While the final hurdle of the gopsutil dependency remains unresolved, the process highlighted the importance of troubleshooting, an in-depth understanding of build tools, and the ever-present challenges of targeting less common architectures (I learned all of them). For those venturing into similar projects, persistence, meticulous error analysis, and a willingness to explore unconventional solutions will be your greatest gifts.
This troubleshooting journey was frustrating at times, but each error provided a learning opportunity, ultimately leading to an almost successful build of the Datadog Agent on RISC-V. I still need to figure out the issue with DataDog/gopsutil and the dependency spiral it sent me down, but that is for future me; present me is chilling for now :D