Low-Level Techniques to Emulate Inline Assembly in Haskell
GHC lacks inline assembly support. However, efforts to call special CPU instructions like 64-bit multiplication from Haskell are garnering attention.
Haskell/GHC does not have inline assembly or C-style intrinsics. However, there are many scenarios where developers might want to use special CPU instructions (such as SIMD, cryptographic, or hashing instructions) from Haskell. A technical blog post titled “Low-level Haskell: The cursed way to emulate inline assembly in Haskell/GHC,” published on June 30, 2026, explores several ways to work around this limitation in detail and has sparked discussion within the Haskell community.
This article builds upon the content of that blog, examining methods to call low-level CPU instructions from Haskell and evaluating their practicality.
The Barrier to Low-Level Control in Haskell
Modern CPUs come equipped with numerous specialized instructions tailored for specific purposes, such as SIMD, hashing, and cryptographic processing. While C and C++ can directly utilize these instructions through inline assembly or intrinsics, Haskell prioritizes high-level abstraction and does not provide such low-level mechanisms by default.
One specific example highlighted by the author involves obtaining the upper 64 bits of the 128-bit product of two 64-bit integers. On x86-64, the mulq instruction computes the full 128-bit product and stores the upper 64 bits in %rdx and the lower 64 bits in %rax. However, C’s standard multiplication operator on 64-bit operands only returns the lower 64 bits.
In C, this operation can be implemented in a single line using the __int128 type provided by GCC/Clang:
```c
#include <stdint.h>

/* Widen both operands to 128 bits so the full product is kept. */
unsigned __int128 wideningMul(uint64_t a, uint64_t b) {
    return (unsigned __int128)a * (unsigned __int128)b;
}
```
While inline assembly allows for more explicit implementation, C can handle this problem relatively easily. But what about Haskell?
The Potential of GHC Primitives
GHC actually provides a primitive: timesWord2# :: Word# -> Word# -> (# Word#, Word# #). It multiplies two machine words and returns the upper and lower halves of the double-width product as an unboxed tuple (64 bits each on a 64-bit platform), making it well suited for this purpose. Because the primitive is lowered to machine instructions during GHC’s code generation phase rather than going through a function call, the overhead is minimal.
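As a minimal sketch of how the primitive might be wrapped for ordinary code, the following program unpacks two boxed `Word` values, calls `timesWord2#`, and re-boxes the two halves. The name `wideningMul` is borrowed from the C example for illustration and is not taken from the blog; the sketch uses `Word` rather than `Word64` because `Word` wraps `Word#` directly on all recent GHC versions.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main (main) where

import Data.Bits (finiteBitSize)
import GHC.Exts (Word (W#), timesWord2#)

-- | Full-width product of two machine words, as (high word, low word).
-- Hypothetical wrapper name; only timesWord2# comes from GHC itself.
wideningMul :: Word -> Word -> (Word, Word)
wideningMul (W# a) (W# b) =
  case timesWord2# a b of
    (# hi, lo #) -> (W# hi, W# lo)

main :: IO ()
main = do
  let (hi, lo) = wideningMul maxBound 2
  -- On a 64-bit platform this prints (1,18446744073709551614).
  print (hi, lo)
  -- Cross-check against arbitrary-precision Integer arithmetic.
  let base = 2 ^ finiteBitSize (0 :: Word) :: Integer
  print (toInteger hi * base + toInteger lo == toInteger (maxBound :: Word) * 2)
```

The boxed tuple in the wrapper's result is a convenience for callers; performance-sensitive code would more likely pattern-match on the unboxed result directly.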
However, the issue lies in the fact that such primitives are not comprehensively available. For example, certain cryptographic operations that rely on specialized instructions, such as carry-less multiplication for polynomial multiplication over finite fields, do not have corresponding primitives in GHC. In such cases, developers must resort to alternative methods.
The Reality of Calling C Functions via FFI
For instructions not supported by a GHC primitive, the most common approach is to use FFI (Foreign Function Interface) to call wrapper functions written in C. For instance, packages like cryptonite-openssl have adopted this method to wrap C libraries for use in Haskell.
However, the FFI approach comes with a clear downside: the overhead of function calls. Each time a C function is invoked, the Haskell runtime must convert arguments to comply with C calling conventions and then convert the results back to Haskell data structures after the call.
For short operations (such as assembly blocks spanning only one or two instructions), this overhead can exceed the cost of the operation itself. The blog author has published measurements of this overhead in a repository, comparing FFI-based methods against GHC primitives in benchmarks. The specific figures are available in the blog’s repository; as a rough guide, FFI overhead is typically in the tens of nanoseconds per call.
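To make the FFI route concrete, here is a hedged sketch of what such a binding looks like on the Haskell side. A real wrapper for a special instruction would live in a separate C file; so that the example links on its own, the C standard library function `labs` is used as a stand-in, and the names `c_labs` and `absViaC` are purely illustrative. The `unsafe` keyword selects GHC's cheaper calling path, which is appropriate only for short C functions that neither block nor call back into Haskell.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main (main) where

import Foreign.C.Types (CLong (..))

-- Stand-in for a hand-written C wrapper: binds labs() from <stdlib.h>.
-- "unsafe" skips the bookkeeping a "safe" call performs, reducing overhead.
foreign import ccall unsafe "stdlib.h labs"
  c_labs :: CLong -> CLong

-- | Convert to the C type, call across the FFI boundary, convert back.
absViaC :: Int -> Int
absViaC = fromIntegral . c_labs . fromIntegral

main :: IO ()
main = print (absViaC (-42))  -- prints 42
```

Even with `unsafe`, each call still crosses the language boundary and cannot be inlined by GHC, which is why the cost matters for one- or two-instruction operations.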
Coding Techniques for Returning Multiple Values
The subtitle of the blog, “Methods for Returning Multiple Values from Functions,” highlights the differences in language design between C and Haskell. In C, developers can either return values wrapped in a structure or write results into memory via pointers. In contrast, Haskell natively supports boxed tuples and unboxed tuples ((# #) syntax), so returning several values requires neither a wrapper struct nor an out-pointer.
However, boxed tuples in Haskell may incur heap allocation costs. Unboxed tuples are not heap objects at all: their components are returned directly in registers or on the stack, which is why GHC primitives like timesWord2# return them.
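The difference between the two tuple forms can be sketched with a small quotient-and-remainder helper. Both function names below are hypothetical and chosen for illustration; only `quotInt#` and `remInt#` are GHC primitives. Note that with optimization enabled GHC can often remove the boxed tuple on its own, so the allocation cost of the boxed version is a possibility rather than a certainty.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main (main) where

import GHC.Exts (Int (I#), Int#, quotInt#, remInt#)

-- | Boxed version: the result pair is, in principle, a heap object.
quotRemBoxed :: Int -> Int -> (Int, Int)
quotRemBoxed a b = (a `quot` b, a `rem` b)

-- | Unboxed version: both results are returned without building a tuple
-- on the heap. The unboxed tuple must be consumed by pattern matching.
quotRemUnboxed :: Int# -> Int# -> (# Int#, Int# #)
quotRemUnboxed a b = (# quotInt# a b, remInt# a b #)

main :: IO ()
main = do
  print (quotRemBoxed 17 5)            -- prints (3,2)
  case quotRemUnboxed 17# 5# of
    (# q, r #) -> print (I# q, I# r)   -- prints (3,2)
```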
Practicality and Limitations of Low-Level Optimization
This investigation demonstrates that it is possible to utilize special CPU instructions in Haskell while shedding light on the practical trade-offs involved.
- When GHC primitives exist: They are the optimal choice, offering performance comparable to C intrinsics.
- When no primitives exist: FFI-based C wrapper functions are the next best solution. However, the overhead of function calls cannot be ignored, making this method practical only for sufficiently lengthy operations or batch processing that amortizes the overhead.
- Automatic C code generation: Implementing custom code generation mechanisms, as seen in cryptonite-openssl, is another option but comes with high maintenance costs.
These techniques are most realistically applied to specific bottleneck operations, such as those in cryptographic libraries or hash functions. Introducing low-level optimizations in regular application code often sacrifices Haskell’s core strength, its high level of abstraction, which calls the overall value of such optimizations into question.
Editorial Opinion
In the short term, this article may inspire proposals to add new primitives to GHC. With the rising prominence of languages like Rust and Zig that prioritize low-level control, enhancing Haskell’s intrinsic support for cryptographic processing and SIMD operations is urgently needed to maintain competitiveness.
From a long-term perspective, this issue represents a crossroads for Haskell’s future in systems programming. Balancing the advantages of pure functional programming (safety and automatic memory management) with low-level control is a critical challenge. Meanwhile, developments such as Linux’s Cache Aware Scheduling extension, which reportedly boosts MySQL performance by up to 360%, represent a contrasting approach to low-level optimization, one that may inform discussions around Haskell’s trajectory.
The editorial team poses the question: Is it truly cost-effective to add intrinsics to GHC, and can Haskell provide low-level control without compromising its high-level abstractions? These are pivotal issues for the future of functional programming languages.
References
- Low-level Haskell: The cursed way to emulate inline assembly in Haskell/GHC — Published June 30, 2026
- Author’s GitHub repository (includes benchmarking results) — Referenced in the original article
Frequently Asked Questions
- Is there a standard way to use inline assembly in Haskell?
- GHC does not provide an inline assembly mechanism like C. Instead, developers typically use GHC primitives (such as `timesWord2#`) or call C wrapper functions via FFI. In some cases, crafting custom code generation tools may also be considered.
- How can the upper 64 bits of a 64-bit multiplication be obtained in Haskell?
- The most efficient method is to use GHC’s `timesWord2#` primitive, which returns the upper and lower halves of the product as an unboxed tuple. Alternatively, one can use FFI to call a C function that uses the `__int128` type, though this incurs function call overhead. The blog author's benchmarks favor the primitive in terms of performance.
- Can this approach be applied to cryptographic processing?
- It can be applied selectively. For specific instructions like AES-NI or carry-less multiplication, FFI-based C wrappers are practical solutions. However, because each call carries a fixed overhead, this method is less suitable for latency-sensitive code that invokes a single instruction at a time and is better suited for batch operations or infrequent calls.