For example, the benchmark could ask the AI to port 1000 methods from the mogan/texmacs C++ source to another language, such as Rust.
You could then evaluate the results by running mogan/texmacs with one ported method swapped in at a time and checking whether it appears to compute the same results as the original C++ method.
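
The one-method-at-a-time check could be sketched roughly as follows. This is only a toy illustration, not a real harness for mogan/texmacs: plain Python callables stand in for the original C++ methods and their ports, and the names `run`, `evaluate_ports`, `clamp`, and `abs_diff` are all hypothetical. How a ported method would actually be swapped into the C++ build is not covered here.

```python
from typing import Any, Callable, Dict, List, Tuple

Method = Callable[..., Any]


def run(fn: Method, args: Tuple) -> Tuple[str, Any]:
    """Call fn and normalize the outcome so a crash can be compared like a value."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def evaluate_ports(
    originals: Dict[str, Method],
    ports: Dict[str, Method],
    inputs: Dict[str, List[Tuple]],
) -> Dict[str, bool]:
    """Check each port on its own against the original, on the same inputs."""
    verdicts = {}
    for name, original in originals.items():
        port = ports.get(name)  # a missing port counts as a failure
        verdicts[name] = port is not None and all(
            run(original, args) == run(port, args) for args in inputs[name]
        )
    return verdicts


# Hypothetical stand-ins for two original methods and their ports.
originals = {
    "clamp": lambda x, lo, hi: max(lo, min(x, hi)),
    "abs_diff": lambda a, b: abs(a - b),
}
ports = {
    "clamp": lambda x, lo, hi: lo if x < lo else hi if x > hi else x,
    "abs_diff": lambda a, b: a - b,  # deliberately buggy port
}
inputs = {
    "clamp": [(5, 0, 3), (-1, 0, 3), (2, 0, 3)],
    "abs_diff": [(3, 1), (1, 3)],
}

results = evaluate_ports(originals, ports, inputs)
score = sum(results.values()) / len(results)
print(results)  # {'clamp': True, 'abs_diff': False}
print(score)    # 0.5
```

Agreement on a finite set of inputs only suggests that a port matches the original; it does not prove equivalence, so the score depends heavily on how well the chosen inputs exercise each method.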