########################################
 Bringing Typing Back to the Programmer
########################################

.. contents::
   :local:

Hello! This is the second *FhY* design blog post. If you haven't read our first post and
are interested in what this project is all about, you can find it :doc:`here
<post_2024-04-04>`.

On Wednesday (that would be the day before the first design blog post was released), the
team had a discussion about types and how the type signature of an operation or
procedure would affect how it was mapped to a hardware target. In the current state of
*FhY* (i.e., as represented by the first blog post), sub-routines would be represented
as ``op`` or ``proc`` definitions with a definite type signature. For example, a simple
``op`` definition for a matrix-matrix multiplication might look like this:

.. code-block:: FhY

   op matmul(input int32[m, n], input int32[n, p] B) -> output int32[m, p] {
       temp index[1:m] i;
       temp index[1:p] j;
       temp index[1:n] k;
       temp int32[m, p] C;

       C[i, j] = sum[k](A[i, k] * B[k, j]);
       return C;
   }

This code defines the matrix-matrix multiplication correctly. However, it only defines
it for ``int32`` multi-dimensional arrays. While this might seem innocuous, it could
cause problems when trying to perform operation scheduling that will yield the best
performance.

Why?

Let's say you have at your disposal a CPU on your current machine and an FPGA with a
custom domain-specific accelerator realized on the FPGA. For the sake of this example,
consider the GeneSys NPU (a part of the GeneSys deep learning acceleration system; read
more about it `here <https://actlab-genesys.github.io/>`_) as the hardware realized on
the FPGA. The GeneSys NPU has a systolic array that can perform matrix-matrix
multiplications extremely efficiently in ``int8`` precision.

You are aware that you have this NPU that can perform the matrix-matrix multiplication
operation efficiently in ``int8`` precision; therefore, you define your program's data
as multi-dimensional arrays with ``int8`` elements. However, when you invoke the
``matmul`` operation defined above, the types of the input arrays will be promoted from
``int8`` to ``int32``, as this is a valid type promotion with no information loss. This
means that the compiler must map the ``matmul`` operation to the CPU rather than the
FPGA, as the CPU is the only target that can handle the ``int32`` data. Unfortunately,
the CPU is much slower than the GeneSys NPU at matrix-matrix multiplication (like orders
of magnitude slower).

You, the programmer, could circumvent this issue by defining a new ``matmul`` operation
that takes ``int8`` arrays as input. This would solve the immediate problem. However, we
want to build libraries of operations that can be reused across different projects.
Creating a permutation of every operation for every possible type signature is a
non-starter considering that the logic of the operation is the same regardless of the
type signature.

One solution to this problem would be to introduce extremely weak base data types like
``real`` that could be promoted to whatever base data type the user passes. This would
allow the user to define the operation with a weak type signature and the compiler to
promote the weak types to the appropriate base data type. However, now, that ``real``
base data type can promote to just about anything. This could lead to an adjacent
problem where the programmer has access to a hardware target that can handle an
operation efficiently at a certain precision, but hasn't designed their program around
that precision and has now lost the opportunity to use that hardware target.

Instead, we propose to leverage template types like those found in C++ and Rust (called
Generics). This solution provides the best of both worlds, as the programmer can define
the operation with a weak type signature and the programmer can specify exactly what
base data type they want to use. This allows the programmer to design their program
around the precision they want to use and be notified by the compiler if they are
missing out on an opportunity to use a hardware target that can handle the operation
efficiently in a certain precision.

Here is the same matrix-matrix multiplication operation defined with template types:

.. code-block:: FhY

   op matmul<T>(input T[m, n], input T[n, p] B) -> output T[m, p] {
       temp index[1:m] i;
       temp index[1:p] j;
       temp index[1:n] k;
       temp T[m, p] C;

       C[i, j] = sum[k](A[i, k] * B[k, j]);
       return C;
   }

   proc main(input int8[2, 2] A, input int8[2, 2] B, output int8[2, 2] C) {
       C = matmul<int8>(A, B);
   }

As you can see, the ``matmul`` operation is now defined with a template type ``T``. The
programmer can now specify that they want the operation to be performed with ``int8``
arrays by invoking the operation with ``matmul<int8>``. Therefore, the logic for the
operation is the same regardless of the base data type, and the programmer can design
their program around the precision they want to use. Additionally, if the programmer
attempts to pass a higher precision array to the operation, the compiler with throw an
error for an illegal implicit type promotion. Overall, we feel that this solution
provides an easy way to define operations for many different base data types while also
ensuring that the programmer can tailor their program to the precision they desire.

If you have any thoughts on this solution or have a better solution, please let us know!

-  **Release Date**: Friday April 5th, 2024
-  **Last Updated**: Friday May 24th, 2024
-  **Post Author(s)**: Christopher Priebe