** Description of Simulation Infrastructure **

For CRC-2, we provide simulation framework based on ChampSim simulator. The
framework models a detailed out-of-order core with the following parameters:

* 256-entry reorder buffer with no scheduling restrictions (i.e., any
  instruction that is ready in the scheduler window can be scheduled
  out-of-order).

* Processor has a 6-wide pipeline. A maximum of two loads and a maximum of one
  store can be issued every cycle.

* Simple gshare branch predictor 

* Cache hierarchy models
  - ITLB and DTLB: 16-set, 8-way, 4KB
  - Second-level TLB: 128-set, 12-way, 96KB
  - Private L1 instruction/data cache: 64-set, 8-way, 32KB
  - Private L2 cache (unified): 512-set, 8-way, 256KB
  - Shared LLC: 2048-set, 16-way, 2MB for single core 
                8192-set, 16-way, 8MB for 4-core
  - Cache hierarchy is non-inclusive and the baseline replacement policy is LRU

* All caches use writeback + write-allocate policy for dirty blocks. 
  Writeback bypassing is not allowed.

* All instructions have one-cycle latency except for memory accesses. L1 cache
  hits have a 4 cycle latency. L2 cache hits have an additional 8 cycle latency.
  L3 cache hits have an additional 20 cycle latency. Note that during the
  warmup, all cache latency is set to 1 cycle to accelerate the warmup process.

* Data prefetcher
  - L1 next-line prefetcher
  - L2 PC-based stride prefetcher
  - Prefetches are issued for whole cache blocks
  - The prefetcher is invoked only on demand requests, after its L1/L2 access
    latency, and after it has been determined to be a cache hit or miss.

* Each cache has a read queue for storing outstanding requests to that cache.
  The L1 read queue is processed at a maximum rate of 2 reads/cycle. The L2 and
  L3 read queues are processed at a maximum rate of 1 read/cycle. Prefetch
  requests are stored in a separate prefetch queue. A demand request waiting in
  the read queue has higher priority over the prefetch requests. Within the queue,
  each request is processed in FIFO order.

* The main memory is modeled in some detail (data bus contention, bank
  contention, write-to-read bus turnaround delays, and more). Singe-core
  configuration uses one 64-bit channel of memory and multi-core configuration
  uses two 64-bit channels.

* The main memory read queue can be processed out of order, according to a
  modified Open Row FR-FCFS policy. The DRAM core access latency for row hits is
  approximately 13.5ns, and for row misses is approximately 40.5ns. Many other
  timing constraints, such as tFAW and DRAM refresh, are not modeled.

* There are limited numbers of MSHRs tracking cache misses. If the MSHRs are all
  full, all further misses at that level will be stalled until a request
  tracking register is freed.

* Following functions can be used to get 
  - get_cycle_count(): Global clock cycle count
  - get_instr_count(uint32_t cpu): Number of instructions executed in this cpu
  - get_config_number(): Current configuration number (config1 ~ 6)