[TestCoverage] Failing Titanium Benchmarks for S10 Arch #3105

Open
AlexandreSinger opened this issue Jun 4, 2025 · 7 comments

@AlexandreSinger
Contributor

While working on adding the Titanium benchmarks to the CI, I found that 5 circuits are currently failing through the default VTR flow:

  • ChainNN_LRN_LG
  • ChainNN_ELT_LG
  • ChainNN_BSC_LG
  • pricing
  • matrix_mult

I commented these 5 circuits out of the NightlyTest7 tasks for now, but these benchmarks should be revived.

See #3104 for context.

Initial Placement Issues

The first four benchmarks are all failing in the initial placer:

[screenshots: initial placer failure output]

The problem seems to be coming from how the initial placer handles (or does not handle) logical blocks which can be placed in multiple physical tile types:

[screenshot: initial placer code]

What these 4 circuits have in common is that they all have many logical blocks which can be placed in either LABs or MLABs; however, the devices are very constrained in either physical LABs or physical MLABs.

This could be resolved either by having the auto-device sizer recognize this situation and increase the size of the device, or by updating the initial placer to be aware of this pattern and place the logical blocks which MUST be a LAB or an MLAB before placing the logical blocks which can be either.
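The second option can be sketched as a simple ordering rule: place the least flexible blocks first. This is an illustrative Python sketch, not VTR code; the block and tile names are hypothetical.

```python
# Illustrative sketch (not VTR's placer): order logical blocks so that
# those with fewer compatible physical tile types are placed first.

def placement_order(blocks, compatible_tiles):
    """Sort blocks so LAB-only / MLAB-only blocks precede blocks that
    can be placed in either tile type."""
    return sorted(blocks, key=lambda b: len(compatible_tiles[b]))

blocks = ["alm_0", "mem_0", "flex_0"]
compatible_tiles = {
    "alm_0": {"LAB"},           # must go in a LAB
    "mem_0": {"MLAB"},          # must go in an MLAB
    "flex_0": {"LAB", "MLAB"},  # can go in either
}
order = placement_order(blocks, compatible_tiles)
# constrained blocks come first; "flex_0" is placed last
```

With this ordering the flexible blocks only consume whichever tile type is left over, instead of starving the constrained blocks.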

Circuit Loading Issues

The last circuit, matrix_mult, fails while loading the circuit:

[screenshot: segmentation fault during circuit loading]

This benchmark gets a segmentation fault during loading.

@AlexandreSinger
Contributor Author

@vaughnbetz FYI

@vaughnbetz
Contributor

@SamuelHo10 : I'm wondering if the matrix_mult one could be related to the libarchfpga refactoring (from the stack trace it looks like it was in that region). Perhaps it has two clocks on a block or something else that is exercising some code that isn't exercised too often.

@amin1377 @kimiatkh : wondering if the initial placement failures were always there with auto-sized devices, or if they are new. I think @amin1377 has modified the initial placement code.
Probably we should have a strong ordering if we're nearly out of some kind of block that is competing for tiles with another block type. We could modify the initial placement gain function to prioritize that, either all the time or in the second and beyond placement attempt.
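The gain-function idea above could look something like the following sketch. This is a hypothetical illustration, not VTR's actual initial-placement gain function; all names and the bonus formula are assumptions.

```python
# Hypothetical scarcity-aware gain adjustment: blocks whose best
# remaining tile option is nearly exhausted get a large boost, so they
# are placed before blocks that still have a plentiful fallback.

def adjusted_gain(base_gain, block_tiles, free_tiles, bonus=10.0):
    """block_tiles: tile types this block can occupy;
    free_tiles: remaining free count per tile type."""
    # A block is only as constrained as its *most* available option.
    best = max(free_tiles[t] for t in block_tiles)
    return base_gain + bonus / (best + 1)

free_tiles = {"LAB": 2, "MLAB": 500}
# A LAB-only block competing for the last 2 LABs gets a big boost;
# a LAB-or-MLAB block can fall back on 500 MLABs, so it barely moves.
g_constrained = adjusted_gain(1.0, {"LAB"}, free_tiles)
g_flexible = adjusted_gain(1.0, {"LAB", "MLAB"}, free_tiles)
```

The bonus could be applied all the time, or only on the second and later placement attempts, as suggested above.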

@amin1377
Contributor

amin1377 commented Jun 4, 2025

@vaughnbetz: Just a quick note:
the change I made to initial placement hasn't been merged yet, as it degraded results on large benchmarks. We discussed this a while ago, and it’s likely due to blocks being placed too close together initially, which prevents other connected blocks from being placed nearby. I still need to try out the approach you suggested to see if it improves the QoR. The related PR is #3018.

@vaughnbetz
Contributor

Got it, thanks @amin1377

@AlexandreSinger
Contributor Author

@SamuelHo10 Sorry to steal your job, but I just happened to be looking into this issue and I think I found what was wrong with matrix_mult.

Turning on assert and debug I found the following:

[screenshot: assertion failure with asserts and debug enabled]

The assert that is being tripped is actually my assert. When I updated the Logical Models I added extra asserts to ensure that we never create / use models which do not exist. This appears to have revealed an issue with this blif file.

The code that this is failing at is dead simple:

[screenshot: the code where the assert fires]

It just looks up a sub-circuit by the given name and tries to get information on that model. The code here assumes that the model must have already been created; as such, it is an assert within the logical models class.
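The failing pattern is essentially the following. This is a minimal Python stand-in for the C++ logical-models class; the class and method names are hypothetical, chosen only to mirror the described behavior.

```python
# Minimal stand-in for the logical-models lookup: models are registered
# while parsing the architecture file, and any later lookup of an
# unregistered name (e.g. a quad-port RAM subckt that the S10 arch
# never declared) trips the assert.

class LogicalModels:
    def __init__(self):
        self._models = {}  # model name -> model data

    def create_model(self, name, data):
        self._models[name] = data

    def get_model_by_name(self, name):
        # Assumes the model was created from the architecture file.
        assert name in self._models, f"model '{name}' does not exist"
        return self._models[name]
```

So the assert is doing its job: it surfaces a blif subckt whose model was never defined by the architecture, instead of silently returning garbage.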

The stack trace is telling us the name of the sub-circuit that is being looked up:
fourteennm_ram_block.opmode{quad_port}.output_type{comb}.port_a_address_width{10}.port_a_address2_width{10}.port_b_address_width{10}.port_b_address2_width{10}

From the name, I think this is a type of quad-port ram.

The reason this is crashing is because the Stratix 10 architecture does not have quad-port RAMs (at least according to the architecture file). The issue is that, for some reason, the matrix_mult blif file is trying to use a quad_port RAM.

I used some Linux magic to parse all of the blif files in Titanium, and matrix_mult is the only one that is trying to use quad port memories:
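That scan can be reproduced with a few lines of Python (a sketch, not the exact command used; the sample blif content below is illustrative):

```python
# Scan blif text for .subckt instantiations whose model name mentions
# opmode{quad_port}. In practice this would be run over every .blif
# file in the Titanium benchmark directory.

import re

def quad_port_subckts(blif_text):
    """Return model names of all .subckt lines using quad-port modes."""
    names = re.findall(r"^\.subckt\s+(\S+)", blif_text, re.MULTILINE)
    return [n for n in names if "quad_port" in n]

sample = """\
.subckt fourteennm_ram_block.opmode{quad_port}.output_type{comb} a=x
.subckt fourteennm_ram_block.opmode{dual_port}.output_type{comb} a=y
"""
hits = quad_port_subckts(sample)
```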

[screenshot: scan results over the Titanium blif files]

@vaughnbetz I believe that this is an issue with the benchmark itself, not VTR. Is there ever a case where the blif file uses models which are not described in the architecture?

My theory about what happened here is that the old code may have just been ignoring these models, resulting in undefined behaviour; but honestly I have no idea why this blif file would ever work...

Just to cover my bases, I did re-download the Titanium benchmarks from scratch and confirmed that matrix_mult is using these quad-port models which do not exist in the S10 architecture.

@vaughnbetz What should we do about this?

@kimiatkh
Contributor

kimiatkh commented Jun 5, 2025

@AlexandreSinger When I was writing the Stratix 10 design, I didn't add support for the quad-port RAMs since they were barely used at the time. However, for the VTR 9 paper, I updated the architecture file to include support for quad-port RAMs as well. I'll try to create a pull request with the added support by next week.

@vaughnbetz
Contributor

Nice investigation Alex and thanks Kimia.
