Introduction#

Hello! This is the first design blog post for the Alternative Computing Technology (ACT) Lab’s newest project: FhY.

FhY is a cross-domain language with mathematical foundations that moves beyond the current paradigm of domain-specific languages to enable cross-domain multi-acceleration. To start, what is cross-domain multi-acceleration? A cross-domain application is an application with constituent algorithms from various domains. A good example is an autonomous driving pipeline that might use algorithms from digital signal processing, machine learning, and control theory. Multi-acceleration is the acceleration of an application using many different hardware devices (e.g., CPUs, GPUs, domain-specific accelerations). Therefore, cross-domain multi-acceleration is the acceleration of a cross-domain application with multi-acceleration.

Our objectives with this project are (1) to enable researchers to implement cross-domain programs using familiar notation and (2) to achieve high performance through multi-acceleration. However, instead of presenting to you exactly how we are going to address each of these two objectives, we have decided to take a different approach. We will upload blog posts discussing our ideas for the project and the design decisions we have made. You will be able to provide feedback on our ideas and help us shape the project as it progresses. We call this Glass-House development. Although we have a preliminary vision for the project, we want input from the community and are excited to see how it evolves as we receive feedback. If you want to tell us about an idea or provide feedback, either make a post in the discussion section of the repository or email us at fhy.actlab [at] gmail [dot] com.

Summary#

This week, we discussed the high-level objectives of the project, our initial thoughts on the grammar and semantics of FhY, and the high-level structure of the compiler and the compiler’s front-end.

Achievements#

Proposed FhY Features#

As mentioned earlier, our objective with FhY is to enable researchers to implement cross-domain programs using familiar notation and to achieve high performance through multi-acceleration. To accomplish this, we propose a cross-domain programming language with mathematical foundations, FhY, and a compiler for FhY that can effectively target various hardware devices. Our reasoning behind this is as follows. Mathematical notation is a universal language that researchers across domains are familiar with. Therefore, a programming language that closely resembles mathematical notation will enable researchers to implement algorithms more easily while also specifying the algorithms at a very high level, enabling the compiler to explore various optimization opportunities on its own. Additionally, with the rise of domain-specific accelerators, achieving high performance requires mapping computations to the correct hardware devices that will achieve the best performance for a given algorithm and generating efficient code for those devices. Therefore, a compiler that can target various hardware devices and generate efficient code for those devices is essential for achieving high performance.

The programming language syntax must be in line with mathematical notation while also preserving typical programming language constructs like types and functions. In regards to this, we currently propose that assignment statements in FhY resemble Einstein notation using special index variables. This enables mathematical-like notation in the programming language while also not explicitly defining parallelism and preserving the typical structure of a programming language. Additionally, we need FhY to support modern programming language features like compilation of many source files, pre-compiled library support, and user-defined operations and data types.

As for the compiler, enabling compilation for various hardware devices requires a representation that can represent computation at various granularities. One of the key challenges with targeting various hardware devices is the variability in the compute models of these devices. For example, CPUs generally operate on scalar values, while a domain-specific accelerator could perform an operation on an entire matrix with a single instruction. Therefore, we must use a representation that can represent computation at various granularities. In regards to this, we currently propose a fractalized data-flow graph (f-DFG) intermediate representation (IR) that can represent computation at various granularities. Like other data-flow graph IRs, each node in the f-DFG IR represents a computation and the edges represent the data dependencies between computations. However, each node in the f-DFG IR contains a sub-f-DFG that represents the computation of the node with a series of finer-grained computations. In this way, the f-DFG IR enables the compiler to zoom in and out on the granularity of the computation, allowing it to target various compute models and, therefore, various hardware devices. Another necessary feature of the compiler for FhY is extensibility. Compilers achieve their optimization capabilities through the various passes they perform on the IR. These passes can be both dependent or independent of the hardware. With the expansive ecosystem of compilation passes and hardware targets, the number of passes that need to be implemented and tuned is vast. Therefore, the implementation of the compiler must be extensible and easy to work with to enable community-driven development of optimization passes.

Proposed FhY Language#

While there are still many features we need to discuss, we have an initial design for the FhY language. This current iteration of the language will be described in the following sections.

Pure Example#

We will motivate the basic constructs of our language with an example. Let’s say you have just derived a new operation called matrix-matrix multiplication. You have derived a formula (below) given two input matrices \(\pmb{A} \in \mathbb{R}^{m \times n}\), \(\pmb{B} \in \mathbb{R}^{n \times p}\) and an output matrix \(\pmb{C} \in \mathbb{R}^{m \times p}\).

\[\pmb{C}_{ij} = \sum_{k = 1}^{n} \pmb{A}_{ik} \pmb{B}_{kj}\]

Now, you want to implement your algorithm in a programming language. An implementation of this algorithm in Python might look like this:

def matmul(A, B, m, n, p):
    C = [[0 for j in range(p)] for i in range(m)]
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

Assume that A and B are 2-dimensional arrays represented as lists and that m, n, and p are the dimensions of the arrays. While this code faithfully implements your new algorithm, it looks quite different from the mathematical formula you derived. What if you were not a programmer and did not think in terms of loops, but rather in terms of mathematical notation, a more native language for you? Additionally, this Python code defines a sequential execution of this algorithm, even though matrix-matrix multiplication is a highly parallelizable operation. What if you could write your algorithm in a language that more closely resembles the mathematical formula and implicitly represents many opportunities for parallelism?

To solve this, we propose FhY, a domain-specific language based on mathematical constructs like Einstein notation that allows you to write your algorithm in a way that closely resembles the mathematical formula. Additionally, due to the inherent high-level nature of mathematical notation, opportunities for parallelism are not explicitly defined and can be explored by the compiler. Here is an example of how you might write the matrix-matrix multiplication algorithm in FhY:

proc matmul(input int32[m, n] A, input int32[n, p] B, output int32[m, p] C){
   temp index[1:m] i;
   temp index[1:p] j;
   temp index[1:n] k;
   C[i, j] = sum[k](A[i, k] * B[k, j]);
}

In this FhY code, we define a procedure called matmul that takes two input multi-dimensional arrays, A and B, and an output multi-dimensional array, C. The multi-dimensional arrays each have an associated type qualifier (e.g., input or output in this procedure) that dictates their read and write properties within the procedure. Additionally, they have a type that consists of the base data type (e.g., int32) for the values contained in the multi-dimensional array and the shape of the multi-dimensional array (e.g., [m, n]). FhY is a strongly typed language with multi-dimensional arrays as first-class types, enabling the compiler to perform type-checking and catch bugs early to ease the debugging process. The procedure defines three index variables, i, j, and k, that are used to iterate over the multi-dimensional arrays. Index variables are an important construct in FhY that enables Einstein notation and implicit parallelism. Notice how no explicit loop order is defined in the procedure; the compiler can explore different parallelization strategies to optimize the execution of the procedure using the index variables. The procedure then defines the computation of the output multi-dimensional array, C, using Einstein notation and a summation function. Notice the similarity between the final line of the FhY code and the mathematical formula for matrix-matrix multiplication. This is one of the tenets of FhY; it allows you to write your algorithm in a way that closely resembles the mathematical formula, making it easier to implement and understand.

Non-Pure Example#

This previous example highlights how FhY allows programming in a way that closely resembles mathematical notation. However, FhY must also support stateful computations, control flow, and other constructs that are not present in the previous example to enable practical use. To illustrate this, let’s say we want to implement a gradient descent algorithm to train a simple neural network consisting of one fully-connected layer and a sigmoid activation function. We can define the forward pass of this simple neural network as follows (click to skip the math):

\[\hat{\pmb{y}} = \sigma\left(\pmb{x} \pmb{W}^{\top} + \pmb{b}\right)\]

where \(\hat{\pmb{y}} \in \mathbb{R}^{m}\) is the predicted output by the neural network and \(\pmb{x} \in \mathbb{R}^{n}\) is the input vector. This simple neural network has two trainable parameters: the weight matrix of the fully-connected layer \(\pmb{W} \in \mathbb{R}^{m \times n}\) and the bias vector \(\pmb{b} \in \mathbb{R}^{m}\) of the fully-connected layer. The gradient descent update rules for the weight matrix and bias vector are as follows:

\[\begin{split}\begin{align*} \pmb{W} &\leftarrow \pmb{W} - \alpha \frac{\partial L}{\partial \pmb{W}} \\ \pmb{b} &\leftarrow \pmb{b} - \alpha \frac{\partial L}{\partial \pmb{b}} \\ \end{align*}\end{split}\]

where \(\alpha\) is the learning rate, \(L\) is the loss function, and \(\frac{\partial L}{\partial \pmb{W}}\) and \(\frac{\partial L}{\partial \pmb{b}}\) are the gradients of the loss function with respect to the weight matrix and bias vector, respectively. For this example, \(L\) will be the mean squared error (MSE) loss function which is defined as follows:

\[L\left(\pmb{y}, \hat{\pmb{y}}\right) = \frac{1}{m} \sum_{i = 1}^{m} \left(y_{i} - \hat{y}_{i}\right)^2\]

where \(\pmb{y} \in \mathbb{R}^{m}\) is the ground truth output. Let \(L_{i}\left(y_{i}, \hat{y}_{i}\right) = \left(y_{i} - \hat{y}_{i}\right)^{2}\) be the loss for the \(i^{th}\) dimension of the output such that \(L\left(\pmb{y}, \hat{\pmb{y}}\right) = \frac{1}{m} \sum_{i = 1}^{m} L_{i}\left(y_{i}, \hat{y}_{i}\right)\). Using the chain rule, we get the following expressions for the gradients of the loss function with respect to each element in the weight matrix and bias vector:

\[\begin{split}\begin{align*} \frac{\partial L}{\partial W_{pq}} &= \frac{1}{m} \sum_{i = 1}^{m} \frac{\partial L_{i}}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial W_{pq}} \\ \frac{\partial L}{\partial b_{p}} &= \frac{1}{m} \sum_{i = 1}^{m} \frac{\partial L_{i}}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial b_{p}} \\ \end{align*}\end{split}\]

We can quickly find expressions for the derivatives \(\frac{\partial L_{i}}{\partial \hat{y}_{i}}\), \(\frac{\partial \hat{y}_{i}}{\partial W_{pq}}\), and \(\frac{\partial \hat{y}_{i}}{\partial b_{p}}\) as follows:

\[\begin{split}\begin{align*} \frac{\partial L_{i}}{\partial \hat{y}_{i}} &= -2\left(y_{i} - \hat{y}_{i}\right) \\ \frac{\partial \hat{y}_{i}}{\partial W_{pq}} &= \frac{\partial \sigma\left(\pmb{x} \pmb{W}^{\top} + \pmb{b}\right)_i}{\partial W_{pq}} = \frac{\partial \sigma\left(\sum_{j = 1}^{n} W_{ij} x_{j} + b_{i}\right)}{\partial W_{pq}} \\ &= \sigma\left(\sum_{j = 1}^{n} x_{j} W_{ij} + b_{i}\right) \left(1 - \sigma\left(\sum_{j = 1}^{n} x_{j} W_{ij} + b_{i}\right)\right) \frac{\partial \left(\sum_{j = 1}^{n} x_{j} W_{ij} + b_{i}\right)}{\partial W_{pq}} \\ &= \begin{cases} x_{q} \hat{y}_{i} \left(1 - \hat{y}_{i}\right) & p = i \\ 0 & p \neq i \\ \end{cases} \\ \frac{\partial \hat{y}_{i}}{\partial b_{p}} &= \frac{\partial \sigma\left(\pmb{x} \pmb{W}^{\top} + \pmb{b}\right)_i}{\partial b_{p}} = \frac{\partial \sigma\left(\sum_{j = 1}^{n} W_{ij} x_{j} + b_{i}\right)}{\partial b_{p}} \\ &= \sigma\left(\sum_{j = 1}^{n} x_{j} W_{ij} + b_i\right) \left(1 - \sigma\left(\sum_{j = 1}^{n} x_{j} W_{ij} + b_{i}\right)\right) \frac{\partial \left(\sum_{j = 1}^{n} x_{j} W_{ij} + b_{i}\right)}{\partial b_{p}} \\ &= \begin{cases} \hat{y}_{i} \left(1 - \hat{y}_{i}\right) & p = i \\ 0 & p \neq i \\ \end{cases} \\ \end{align*}\end{split}\]

Therefore, we get the following complete expressions for the gradients of the loss function with respect to the weight matrix and bias vector after substitution and a few additional simplifications:

\[\begin{split}\begin{align*} \frac{\partial L}{\partial W_{pq}} &= \frac{2}{m} x_{q} \hat{y}_{i} \left(\hat{y}_{i} - y_{i}\right) \left(1 - \hat{y}_{i}\right) \\ \frac{\partial L}{\partial b_{p}} &= \frac{2}{m} \hat{y}_{i} \left(\hat{y}_{i} - y_{i}\right) \left(1 - \hat{y}_{i}\right) \\ \end{align*}\end{split}\]

An implementation of this algorithm in Python using NumPy might look like this:

import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def forward_propagation(X, W, b):
    return sigmoid(np.dot(W, X) + b)


def backward_propogation(x, W, b, y):
    y_hat = forward_propagation(x, W, b)
    dW = (2 / m) * x * y_hat * (y_hat - y) * (1 - y_hat)
    db = (2 / m) * y_hat * (y_hat - y) * (1 - y_hat)
    return dW, db


def train(X, Y, learning_rate):
    W = np.random.randn(Y.shape[0], X.shape[0])
    b = np.random.randn(Y.shape[0])
    for i in range(X.shape[0]):
        y_hat = forward_propagation(X[i], W, b)
        dW, db = backward_propagation(X[i], W, b, Y[i])
        W = W - learning_rate * dW
        b = b - learning_rate * db
    return W, b


def main(X, Y, learning_rate):
    W, b = train(X, Y, learning_rate)

Assume that X, Y, W, and b are NumPy NDArrays holding np.float32 where X has shape (examples, n), Y has shape (examples, m), W has shape (m, n), and b has shape (m,). Additionally, assume that learning_rate is a floating point number representing the learning rate. Look at the train function in the Python code. We must update the weights and biases of the neural network based on the current values of the weights and biases. This requires stateful computations that are not present in the previous matrix-matrix multiplication example. Additionally, the Python code initializes the weights and biases randomly, which is not something that FhY can achieve out of the box. Therefore, FhY supports additional constructs like the state type qualifier and op / native routines. The state type qualifier acts in the same way as a static variable in a C/C++ function. The value of a state variable is preserved across invocations of the procedure. Additionally, the native routine is used to define operations that are not implemented in FhY and are implemented in another language (e.g., C/C++). Finally, the op construct in FhY is used to conveniently define pure functions that return one value. op routines are used to define simple computations that are used in more complex procedures.

Using these constructs, we can implement this training loop in FhY as follows:

# --- Declared in another file ---

native initialize_weights(input int32 m, input int32 n) -> output float32[m, n];
native initialize_bias(input int32 m) -> output float32[m];

# --- Declared in primary file ---

op sigmoid(input float32[m] x) -> output float32[m] {
   temp index[1:m] i;
   param float32 e = 2.71828;
   return 1 / (1 + e ** (-x[i]));
}

op forward(input float32[n] x, input float32[m, n] W, input float32[m] b) -> output float32[m] {
   temp index[1:m] i;
   temp index[1:n] j;
   temp float32[m] FC_out;

   FC_out[i] = sum[j](W[i, j] * x[j]) + b[i];
   return sigmoid(FC_out);
}

op backward_propagation(input float32[n] x, input float32[m, n] W, input float32[m] b, input float32[m] y) -> output (float32[m, n], float32[m]) {
   temp index[1:m] i;
   temp index[1:n] j;

   temp float32[m] y_hat = forward(x, W, b);
   temp float32[m, n] dW;
   temp float32[m] db;

   dW[i, j] = (2.0 / m) * x[j] * y_hat[i] * (y_hat[i] - y[i]) * (1.0 - y_hat[i]);
   db[i] = (2.0 / m) * y_hat[i] * (y_hat[i] - y[i]) * (1.0 - y_hat[i]);
   return (dW, db);
}

proc train(input float32[examples, n] X, input float32[examples, m] Y, input float32 learning_rate, output float32[m, n] W, output float32[m] b) {
   state float32[m, n] W_state = initialize_weights(m, n);
   state float32[m] b_state = initialize_bias(m);

   temp index[1:examples] e;

   temp (float32[m, n], float32[m]) derivatives;
   temp float32[m, n] dW;
   temp float32[m] db;

   temp index[1:m] i;
   temp index[1:n] j;

   forall (e) {
      derivatives = backward_propagation(X[e], W_state, b_state, Y[e]);
      dW[i, j] = derivatives.1[i, j];
      db[i] = derivatives.2[i];
      W_state[i, j] = W_state[i, j] - learning_rate * dW[i, j];
      b_state[i] = b_state[i] - learning_rate * db[i];
   }

   W[i, j] = W_state[i, j];
   b[i] = b_state[i];
}

proc main(input float32[examples, n] X, input float32[examples, m] Y, input float32 learning_rate) {
   train(X, Y, learning_rate, W, b);
}

While the FhY code is more verbose than the Python code, programming with FhY provides safety because it is statically typed, enabling the compiler to catch bugs like indexing errors early. Additionally, the FhY code more closely resembles the mathematical notation, making it easier to implement and understand.

FhY Language Overview#

Given these examples, we can now provide an overview of the FhY language. FhY is a strongly typed language that resembles Rust and Python in syntax. A single .fhy file consists of components. Components can be procedure/operation definitions or native routine declarations (as mentioned earlier, we also intend to support compilation of many source files, pre-compiled FhY library support, and user-defined operations and data types; if you have any ideas, please let us know)

Procedures and operations are the primary constructs in FhY. They are like functions in other programming languages. The primary difference between procedures and operations is that procedures can contain stateful variables and output multiple values, while operations are pure functions that return one value. There are No pointers or references in FhY; all data is passed by value. Each line in a procedure or operation is a statement. Each statement either declares the type and type qualifier of a new variable, assigns a value to an existing variable, or represents some more complex structure like a looping structure of a branching structure. The assignment statements in FhY act like for loops when combined with index variables. You can imagine that each statement using some indices is equivalent to that statement nested inside a series of for loops over the ranges defined by those indices. Therefore, the typical structure of a procedure or operation is to define the inputs and outputs and their types, declare the indices necessary to iterate over the inputs and outputs, declare any temporary variables required, and then define the intended computations using the variables and the indices.

FhY also includes a state type qualifier for variables that preserve their values across invocations of the procedure. This is useful when dealing with variables you might normally define as a class member in an object-oriented programming language; for example, the weights of a neural network. Additionally, FhY includes a native routine construct that allows the user to define operations that are not implemented in FhY and are implemented in another language (e.g., C/C++).

With these features, FhY enables the user to write algorithms in a way that closely resembles mathematical notation, making it easier to implement and understand.

Type System#

FhY is a strongly typed language. There are two primary classes of types in FhY: numerical types and index types. Numerical types represent the data moving throughout the program. Index types represent the indices used to iterate over and access the data. Therefore, the type system represents a divide between the data plane and the address plane of the program. FhY also includes some additional types that do not fall into one of the two primary categories and are used to enable more advanced constructs; these will be discussed later.

The numerical types in FhY are multi-dimensional arrays with a base data type and a shape. A multi-dimensional array with zero dimensions is a scalar. Each value of the shape represents the size of the array along that dimension. For example, a 2-dimensional array with m rows and n columns would have a shape of [m, n]. The base data type of the array represents the type of the values stored in the array. For example, a 2-dimensional array of 32-bit integers would have a base data type of int32. For our initial proposal, we intend FhY to support signed integers, unsigned integers, floating point numbers (IEEE 754), and fixed point numbers with various bit widths. In the future, we intend to extend support for custom base data types declared by the user as well as weak types (e.g., an integer with an unspecified bit-width).

The index types in FhY are used to iterate over and access the data in the numerical types. Indices (i.e., variables with an index type) represent a set of values that can be used to index into a numerical type. For example, an index with a range of [1:N] represents the set of integers from 1 to N (i.e., \(\{1, 2, \ldots, N\}\)). Indices can also have a stride, which represents the step size between values in the set. For example, an index with a range of [1:10:2] represents the set of integers from 1 to 10 with a stride of 2 (i.e., \(\{1, 3, 5, 7, 9\}\)). Indices can only be declared in this way. However, using expressions, the user can create indices that represent more complex sets of values. For example, if we have two indices index[1:N] i and index[1:M] j, we can create a new index that is a linear combination of the two indices using the expression 2 * i + 3 * j + 3. This new index would represent the set of values \(\{8, 10, 11, \ldots, 2N + 3M + 3\}\).

FhY also includes a tuple type. Tuples do not fall into the two primary categories of types and are used to group multiple values. A tuple type is defined by the types of the values it contains. For example, a tuple type with two elements, the first element being an integer and the second element being a float, would be defined as (int32, float32).

As mentioned earlier, FhY includes type qualifiers as a means to dictate the access properties of variables. The type qualifiers in FhY are as follows:

Type Qualifier

Variable R/W Permissions

Value Known at Compile Time?

Declared Where?

input

Read only

No

proc / op arguments

output

Write only

No

proc / op arguments

state

Read and Write

No

proc body

param

Read only

Yes

proc / op arguments or body

temp

Read and Write

No

proc / op body

Additionally, recall that state variables are preserved across invocations of the procedure. Variables declared with the other type qualifiers are not preserved across invocations of the procedure and act like local variables in a C/C++ function.

ANTLR grammar#

We have created an initial ANTLR grammar for the FhY language. The full grammar is not included in the repository as of the time of the publication of this blog post. However, once we have settled on a syntax and begun implementing the FhY front-end, we will include the full grammar. Below, we have included the current version of the parsing rules for FhY (other than the literal rule for brevity).

grammar FhY;

/*
* Program Rules
*/

program
   : component*
   ;

component
   : function_declaration
   | function_definition
   ;

/*
* Function Rules
*/

function_declaration
   : function_header SEMICOLON
   ;

function_definition
   : function_header OPEN_BRACE function_body CLOSE_BRACE
   ;

function_header
   : function_type=FUNCTION_KEYWORD name=IDENTIFIER OPEN_PARENTHESES function_args CLOSE_PARENTHESES (ARROW qualified_type)?
   ;

function_args
   : (function_arg (COMMA function_arg)*)?
   ;

function_arg
   : qualified_type (name=IDENTIFIER)?
   ;

function_body
   : statement*
   ;

/*
* Statement Rules
*/

statement
   : declaration_statement
   | expression_statement
   | selection_statement
   | iteration_statement
   | return_statement
   ;

declaration_statement
   : qualified_type name=IDENTIFIER (EQUALS_SIGN expression)? SEMICOLON
   ;

expression_statement
   : (name=IDENTIFIER (OPEN_BRACKET expression_list CLOSE_BRACKET)? EQUALS_SIGN)? expression SEMICOLON
   ;

selection_statement
   : IF OPEN_PARENTHESES expression CLOSE_PARENTHESES OPEN_BRACE statement* CLOSE_BRACE (ELSE OPEN_BRACE statement* CLOSE_BRACE)?
   ;

iteration_statement
   : FORALL OPEN_PARENTHESES expression CLOSE_PARENTHESES OPEN_BRACE statement* CLOSE_BRACE
   ;

return_statement
   : RETURN expression SEMICOLON
   ;

/*
* Type Rules
*/

qualified_type
   : (type_qualifier=TYPE_QUALIFIER)? type
   ;

type
   : tuple_type
   | numerical_type
   | index_type
   ;

tuple_type
   : OPEN_PARENTHESES ((type COMMA) | (type (COMMA type)+))? CLOSE_PARENTHESES
   ;

numerical_type
   : dtype=DTYPE (OPEN_BRACKET shape CLOSE_BRACKET)?
   ;

shape
   : (expression (COMMA expression)*)?
   ;

index_type
   : INDEX OPEN_BRACKET range CLOSE_BRACKET
   ;

range
   : expression COLON expression (COLON expression)?
   ;

/*
* Expression Rules
*/

expression
   : nested_expression=OPEN_PARENTHESES expression CLOSE_PARENTHESES
   | unary_expression=(SUBTRACTION | BITWISE_NOT | LOGICAL_NOT) expression
   | multiplicative_expression=expression (MULTIPLICATION | DIVISION) expression
   | additive_expression=expression (ADDITION | SUBTRACTION) expression
   | shift_expression=expression (LEFT_SHIFT | RIGHT_SHIFT)expression
   | relational_expression=expression (LESS_THAN | LESS_THAN_OR_EQUAL | GREATER_THAN | GREATER_THAN_OR_EQUAL) expression
   | equality_expression=expression (EQUAL_TO | NOT_EQUAL_TO) expression
   | and_expression=expression AND expression
   | exclusive_or_expression=expression EXCLUSIVE_OR expression
   | or_expression=expression OR expression
   | logical_and_expression=expression LOGICAL_AND expression
   | logical_or_expression=expression LOGICAL_OR expression
   | ternary_expression=expression QUESTION_MARK expression COLON expression
   | primary_expression
   ;

expression_list
   : (expression (COMMA expression)*)?
   ;

primary_expression
   : tuple_access_expression=primary_expression DOT INT_LITERAL
   | function_expression=primary_expression (OPEN_BRACKET expression_list CLOSE_BRACKET)? OPEN_PARENTHESES expression_list CLOSE_PARENTHESES
   | tensor_access_expression=primary_expression OPEN_BRACKET expression_list CLOSE_BRACKET
   | atom
   ;

atom
   : tuple=OPEN_PARENTHESES ((expression COMMA) | (expression (COMMA expression)+))? CLOSE_PARENTHESES
   | identifier=IDENTIFIER
   | literal
   ;

Proposed High-Level Flow#

Now that we have outlined our current ideas for the FhY language, we can discuss the proposed high-level flow of the FhY compiler. Below is a diagram that outlines the high-level flow of the FhY compiler.

Proposed high-level flow of FhY

Proposed high-level flow of FhY#

In the current vision, the FhY compiler will contain a front-end for both the FhY language and PyTorch, as PyTorch is a common library for deep learning. However, we plan to include additional front-ends for other frameworks in the future and by user request.

In line with traditional compiler design, the front-ends will convert the source code into an intermediate representation (IR) that is easier to optimize and transform. As mentioned earlier, we will use an IR that resembles a data-flow graph, called the fractalized data-flow graph (f-DFG) IR.

Once converted to the f-DFG IR, the compiler will perform optimizations on the IR to improve the performance of the program. We aim to define an extensible pass infrastructure that will allow us and other users to develop and register optimization passes easily. The objective here is to leverage the power of community-driven development to bring powerful optimizations to the FhY compiler.

Next, an operation scheduling module will leverage the f-DFG IR’s varying granularities of computation to schedule operations to the available hardware devices. This includes scheduling operations that cannot be executed on a domain-specific accelerator or pre-compiled sub-routine calls to the CPU.

To support compilation to CPUs, we aim to generate C/C++ code from the f-DFG IR. For domain-specific accelerators, we aim to leverage prior research from our lab on multi-target compilation through the Codelet compiler. The Codelet compiler enables hardware designers to get code generation for their custom hardware without the having to implement a custom compiler back-end. To achieve this, the Codelet compiler leverages an architectural representation called an Architecture Covenant Graph (ACG) and a library of Codelets, a low-level programming abstraction that specifies the semantics of operations in terms of microarchitectural components of the target hardware architecture. If you are interested in learning more about the Codelet compiler, please see the tutorial slides from either MICRO or HPCA for GeneSys on the compiler. During the initial stages of development, we will lower the f-DFG IR to a form compatible with the existing implementation of the Codelet compiler. However, we plan to rewrite the Codelet compiler in the future to enable improved ease-of-use for hardware developers in adding their hardware to the Codelet compiler.

Proposed Front-End High-Level Flow#

Proposed high-level flow of the FhY front-end

Proposed high-level flow of the FhY front-end#

The goal of the FhY front-end is to convert the FhY source code into the f-DFG IR. The first step in this process is to parse the FhY source code using a lexer and parser. We propose to use ANTLR 4 to generate a lexer and parser for the FhY language as it a powerful and mature parser generator. Next, the generated parse tree will be converted into a typed abstract syntax tree (AST). Next, the AST will be verified correct through a semantic analysis and then finally converted into the f-DFG IR. We plan to specify the details of the FhY AST and the semantic analysis in a future blog post.

Tabled Issues#

  • Are FhY Einstein notation assignment statements blocking or non-blocking?

  • How do we define a construct for adding new reductions to FhY in source files while still maintaining a high-level implementation of the operation’s semantics (i.e., no explicit definition of parallelism)?

Conclusion#

This blog post serves as the first in a series of posts that will outline the design of the FhY language and compiler. While this post provided a lengthy discussion motivating and introducing our project, we intend for further posts to be more brief and highlight specific design decisions and challenges. We are excited to receive feedback from the community and see how the project evolves as we receive feedback.

  • Release Date: Thursday April 4th, 2024

  • Last Updated: Friday May 24th, 2024

  • Post Author(s): Christopher Priebe