Reliable Physical Unclonable Function Design and Algorithm for Authentication and Key Generation

Wei Yan
*University of Connecticut - Storrs, wei.2.yan@uconn.edu*

Follow this and additional works at: https://opencommons.uconn.edu/dissertations

**Recommended Citation**

https://opencommons.uconn.edu/dissertations/1912
Reliable Physical Unclonable Function Design and Algorithms for Authentication and Key Generation

Wei Yan, Ph.D.
University of Connecticut, 2018

Security is becoming an increasing concern in electronic devices recently. Specifically, since the embedded systems and Internet of Things (IoTs) have become necessary parts of our life, more and more vulnerabilities are detected and made use of by attackers. Moreover, as the electronic component supply chain grows more complex due to globalization, with parts coming from a diverse set of suppliers, counterfeit electronics have been a major challenge that calls for immediate solutions. This is because the traditional solutions that using static digital ID and keys can be easily obtained or cloned. The current best practice is to place a secret key in non-volatile memory such as fuses and EEPROM, and use cryptographic primitives to authenticate a device and protect confidential information. To reduce the vulnerability of the systems, we have developed multiple methodologies in this work. The proposed methods include: the design optimization and implementation of ring oscillator physical unclonable function (RO PUF) on field programmable gate arrays (FPGA); a novel phase calibrated RO PUF and
the corresponding authentication solution; a PUF initialization table (PIT) that provides high accurate authentication; a PIT-based floating thresholding algorithm for key generation; a lightweight ring weight algorithm (RWA) that can be applied to the low-cost authentication; an efficient locality sensitive hash function (LSH) for not only similarity search, but also data clustering.
Reliable Physical Unclonable Function Design and
Algorithms for Authentication and Key Generation

Wei Yan
B.S., Nanjing University of Information Science and Technology, 2011
M.S., Suzhou Institute of Nano-Tech and Nano-Bionics, Chinese Academy of Science, 2014

A Dissertation
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Doctor of Philosophy
at the
University of Connecticut
2018
Doctor of Philosophy Dissertation

Reliable Physical Unclonable Function Design and Algorithms for Authentication and Key Generation

Presented by

Wei Yan, M.S.

Major Advisor

John A. Chandy

Associate Advisor

Omer Khan

Associate Advisor

Lei Wang

University of Connecticut

2018
Dedicated to my parents.
ACKNOWLEDGEMENTS

First and foremost, I have to sincerely thank my supervisor, Dr. John Chandy, for his great guidance and support. His steadfast support of my Ph.D study and research was greatly needed and deeply appreciated. His sage advice, insightful criticisms, and patient encouragement aided the writing of this thesis in innumerable ways.

Besides my advisor, I would like to thank the rest of my thesis committee: Dr. Lei Wang, Dr. Omer Khan, Dr. Zhijie Shi, and Dr. Faquir Jain for their insightful comments and encouragement.

Furthermore, it is great honor to work with my friends, Fatemeh Tehranipoor and Chenglu Jin, who have always been a constant source of support and encouragement during the challenges of my Ph.D years. To all my friends, thank you for your understanding and encouragement in many, many moments of crisis. Your friendship makes my life a wonderful experience. I cannot list all the names here, but you are always on my mind and in my heart.

Last but not the least, I want to thank my parents, who have taught me to work hard for the things that I aspire to achieve.
# TABLE OF CONTENTS

1. **Introduction** .................................................. 1
   1.1 Security Vulnerabilities of Embedded Systems and IoTs ................. 2
   1.2 Hardware Security Components ....................................... 2
   1.3 Problem Statement .................................................. 4
   1.4 Related Works and Limitation ....................................... 5
   1.5 Contributions .................................................... 6
      1.5.1 PUF Design Optimization and Implementation ....................... 7
      1.5.2 Post Processing Authentication Algorithm ........................ 7
      1.5.3 Solution for Key Generation Applications ........................ 7
   1.6 Thesis Outline ................................................... 7

2. **Background** ................................................... 9
   2.1 Strong and Weak PUFs ............................................. 10
   2.2 Ring Oscillator PUF ............................................... 10
   2.3 Arbiter PUF ........................................................ 11
   2.4 Bistable RO PUF ................................................... 13
   2.5 Butterfly PUF ...................................................... 14
   2.6 DRAM PUF .......................................................... 14
   2.7 Performance Comparison ........................................... 16
   2.8 Contributions ..................................................... 17
3. PUF Design Optimization and Implementation ............................................. 19
   3.1 Introduction ........................................................................................... 19
   3.2 RO PUF Description and Properties ......................................................... 20
      3.2.1 RO PUF Design Improvement ........................................................... 21
      3.2.2 RO PUF Implementation on an FPGA .............................................. 25
   3.3 Phase Calibrated PUF ............................................................................ 30
      3.3.1 Phase Calibration Process ................................................................. 30
      3.3.2 Analysis of Phase Calibration Efficiency .......................................... 33
      3.3.3 Estimation of Intra-PUF hamming Distance ..................................... 36
   3.4 PCPUF Evaluation ................................................................................. 39
      3.4.1 Experimental Setup ........................................................................... 39
      3.4.2 Hardware Resource Utilization ......................................................... 40
      3.4.3 Randomness ....................................................................................... 40
      3.4.4 Uniqueness ....................................................................................... 40
      3.4.5 Stability ............................................................................................. 43
      3.4.6 Bias .................................................................................................... 44
   3.5 Conclusions ............................................................................................ 45

4. Secure Authentication Solution with PUF Response Instability .................. 46
   4.1 Introduction ............................................................................................ 46
   4.2 Unstable-Stable Response ...................................................................... 46
   4.3 Obfuscation-based Authentication ........................................................... 51
4.4 Performance Evaluation ............................................ 58
   4.4.1 USR Authentication Result .................................. 58
   4.4.2 Comparison with ECC Solutions ................................. 58

4.5 Conclusions .......................................................... 60

5. Ring Weight Algorithm for Lightweight Authentication .......... 61
   5.1 Introduction .......................................................... 61
   5.2 System Architecture ................................................ 62
      5.2.1 Error Correcting and Bit Matching ............................ 63
      5.2.2 Error Correcting/Bloom Filter Matching ...................... 64
      5.2.3 Fault Tolerant Extremum Matching ............................ 66
   5.3 Novel Fault Tolerant Methodology ................................ 66
      5.3.1 Ring Weight Algorithm .......................................... 67
      5.3.2 False Positive and False Negative ............................. 70
      5.3.3 Location Shift Scheme ........................................... 82
      5.3.4 Security Consideration ......................................... 84
   5.4 Evaluation and Discussion ......................................... 87
      5.4.1 Error Correcting/Bloom Filter Matching ...................... 88
      5.4.2 Weight and Location Tolerance ................................. 89
      5.4.3 Location Shift .................................................. 91
      5.4.4 Fault Tolerance Capacity ....................................... 92
      5.4.5 Weight Optimization ............................................ 92
6.5.4 Uniqueness ................................................................. 126
6.5.5 Stability ................................................................. 127
6.5.6 Bias .................................................................. 128
6.5.7 Result of PIT-based Authentication ......................... 129
6.5.8 Result of PIT-based Key Generation ......................... 130
6.6 Conclusions ............................................................... 132

7. Locality Sensitive Hash Function for Similarity Search and Clustering ........................................ 134
7.1 Introduction ............................................................... 134
7.2 Locality Sensitive Hash Function Background ................ 135
7.3 Motivations ............................................................... 136
7.4 FLSH Design and Implementation ............................... 136
7.4.1 LUT-based FLSH Array Design ............................... 137
7.4.2 FLSH Placement .................................................... 139
7.4.3 FLSH Routing ....................................................... 142
7.5 FLSH Application ...................................................... 144
7.5.1 Similarity Search ................................................... 145
7.5.2 Clustering ............................................................. 146
7.6 Evaluation and Discussion .......................................... 147
7.6.1 Convergence Rate ................................................. 148
7.6.2 Randomness ........................................................ 150
7.6.3 Fault Tolerant Capability ......................................... 151
<table>
<thead>
<tr>
<th>Section</th>
<th>Title</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>7.6.4</td>
<td>Similarity Search Result</td>
<td>153</td>
</tr>
<tr>
<td>7.6.5</td>
<td>Clustering Result</td>
<td>154</td>
</tr>
<tr>
<td>7.7</td>
<td>Conclusions</td>
<td>156</td>
</tr>
<tr>
<td>8.0</td>
<td>Conclusions</td>
<td>157</td>
</tr>
<tr>
<td>8.1</td>
<td>PUF Design Optimization and Implementation</td>
<td>157</td>
</tr>
<tr>
<td>8.2</td>
<td>Secure Authentication Solution with PUF Response Instability</td>
<td>158</td>
</tr>
<tr>
<td>8.3</td>
<td>PUF Initialization Table for Robust Authentication and Key Generation</td>
<td>158</td>
</tr>
<tr>
<td>8.4</td>
<td>Ring Weight Algorithm for Lightweight Authentication</td>
<td>159</td>
</tr>
<tr>
<td>8.5</td>
<td>Locality Sensitive Hash Function for Similarity Search and Clustering</td>
<td>159</td>
</tr>
<tr>
<td>8.6</td>
<td>Summary of Conclusions</td>
<td>160</td>
</tr>
<tr>
<td>Bibliography</td>
<td></td>
<td>161</td>
</tr>
</tbody>
</table>
# LIST OF FIGURES

2.1 The architecture of a traditional RO PUF . . . . . . . . . . . . . . . . . . . 12
2.2 The architecture of a traditional Arbiter PUF . . . . . . . . . . . . . . . . . 12
2.3 The architecture of a typical BR PUF . . . . . . . . . . . . . . . . . . . . 13
2.4 The architecture of a typical butterfly PUF . . . . . . . . . . . . . . . . . . 15
2.5 The memory structure of a DRAM array . . . . . . . . . . . . . . . . . . . 16

3.1 An improved RO design for an FPGA . . . . . . . . . . . . . . . . . . . . 22
3.2 Architecture of our RO PUF . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 LUT-based RO placement example . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Improved LUT-based RO placement example . . . . . . . . . . . . . . . . . 26
3.5 RO PUF manual placement overview . . . . . . . . . . . . . . . . . . . . . 27
3.6 RO PUF code generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.7 Intra-PUF hamming distance without ECC . . . . . . . . . . . . . . . . . . 29
3.8 Input delay resource and timing diagram . . . . . . . . . . . . . . . . . . . 31
3.9 Architecture of PCPUF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.10 PUF phase calibration timing . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.11 Distribution of RO rising edges on average . . . . . . . . . . . . . . . . . . 37
3.12 Distribution of 1-bit RO output flipping rate . . . . . . . . . . . . . . . . . 38
3.13 Expected Intra-PUF hamming distance . . . . . . . . . . . . . . . . . . . . 39
5.15 Relationship between error number and $|\Delta L_{\max}|$ ........................................ 81
5.16 Location shift scheme diagram ................................................................. 85
5.17 False negative and false positive rate related to location-tolerance .............. 91
5.18 False negative and false positive rate related to weight-tolerance .............. 91
5.19 False positive of different shift spaces ..................................................... 92
5.20 False negative of different bit errors with different location-tolerance .......... 93
5.21 False negative of different bit errors with different weight-tolerance .......... 93
5.22 False negative of different weight distribution methods ............................. 94
5.23 False positive and false negative rates of different 2048 location shift methods 95
5.24 Effect of response bias on false positive and negative rates ...................... 96
5.25 Performance comparison of different fault tolerant methods ..................... 97
5.26 Overall effect of number of errors ............................................................ 97

6.1 Our PUF-based authentication design using PIT ........................................ 109
6.2 3D distributions of counter values and the floating threshold generation of three states for the RO PUF responses ......................................................... 116
6.3 Possible location distribution of larger states and smaller states based on two groups of RO PUF test results ......................................................... 118
6.4 Helper data generation for our floating thresholding key generation solution 120
6.5 PUF tests with temperature variation instrument ....................................... 124
6.6 Inter-PUF hamming distances ................................................................. 126
6.7 Intra-PUF hamming distance ................................................................. 127
LIST OF TABLES

2.1 Different PUF Performance Comparison ........................................... 18

3.1 FPGA Resource Usage of the Traditional RO PUF and the Optimized PCPUF Implementations ......................................................... 41

3.2 Results of NIST Randomness Tests ..................................................... 42

3.3 Bit Error Rate of RO PUFs and PCPUFs ................................................. 44

3.4 Bias of RO PUFs and PCPUFs ............................................................ 45

4.1 False Positive and False Negative of USR-based Authentication .................. 58

4.2 Performance of ECC-based Authentications ......................................... 59

5.1 Performance of Different Bit Group Selection ....................................... 68

5.2 Security Comparison of Different Schemes - Number of response guesses - Number of response guesses - Number of response guesses required to achieve 50% or 99.9% success rate ........................................... 87

5.3 16 × 8 Bits ECC/BF Authentication Results .......................................... 89

5.4 128 × 1 Bits ECC/BF Authentication Results ......................................... 90

5.5 Performance Comparison of Our Solutions .......................................... 99

5.6 Overall Comparison with Other Solutions .......................................... 100

5.7 Performance of RWA with Real PUFs ................................................ 101

5.8 False Negative of Bit-by-bit Comparison .............................................. 102
<table>
<thead>
<tr>
<th>Section</th>
<th>Title</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>6.1</td>
<td>Parameters of Our PIT Model</td>
<td>108</td>
</tr>
<tr>
<td>6.2</td>
<td>Protocol of Our Authentication Method</td>
<td>110</td>
</tr>
<tr>
<td>6.3</td>
<td>FPGA Resource Usage of One PUF Implementation</td>
<td>125</td>
</tr>
<tr>
<td>6.4</td>
<td>Average Bit Error Rate</td>
<td>128</td>
</tr>
<tr>
<td>6.5</td>
<td>Results of Authentication with PIP</td>
<td>130</td>
</tr>
<tr>
<td>6.6</td>
<td>Overall Comparison of Error Correction and Bit Selection Helper Data Algorithms</td>
<td>131</td>
</tr>
<tr>
<td>6.7</td>
<td>Performance Comparison of Floating Thresholding Solution with Helper Data on Four Different Types of PUFs</td>
<td>133</td>
</tr>
<tr>
<td>7.1</td>
<td>Logic LUT Output of A Random FLSH Cell</td>
<td>141</td>
</tr>
<tr>
<td>7.2</td>
<td>Performance of FLSH-based Similarity Search</td>
<td>154</td>
</tr>
<tr>
<td>7.3</td>
<td>Performance of FLSH-based Clustering</td>
<td>155</td>
</tr>
</tbody>
</table>
Chapter 1

Introduction

In recent times, electronic devices have become so popular in our life, making people rely on the safety and security of those embedded systems and IoTs. When we store our sensitive information on these untrusted devices, the system vulnerabilities offer the attackers great opportunities of obtaining individual information. Unfortunately, though many conventional solutions have been applied to improve the security level, there are always new adversaries developed. As a fundamental reason, the vulnerabilities come from the hardware design and supply chain. To solve the problem, physical unclonable functions (PUF) are recommended as a secure solution of authentication, random number generation and key generation, which provides multiple features against hardware and software attacks. By implementing PUFs in the systems, commercial products overcome the traditional security shortages to a certain degree.
1.1 Security Vulnerabilities of Embedded Systems and IoTs

Not as much work has been done in secure system design despite its importance in establishing security and trust for embedded systems, SoCs, and IoTs. The vulnerabilities of these systems include physical tampering, malware attack, information leakage, and other aspects. The main threat of hardware is the counterfeit chips in the supply chain, which contains processors (ASICs, FPGAs, and microprocessors), nonvolatile memories (Flash memories, ROM, and RAM), IPs, and PCBs. As a consequence, effective protocols are required to authenticate and authorize each component in the system during the initialization. Apart from countermeasures, a secure hardware should also consider some methods against the probing attacks, snooping attacks, hardware Trojans, power analysis attacks, electromagnetic attacks and machine learning attacks. On the software side, there are brute-force attacks, man-in-the-middle attacks, buffer overflow attacks, fault injection attacks and others. To protect the system, people developed secure programming languages with trusted compiler and libraries, architecture and OS based countermeasures, static and dynamic code analyzers, and sandboxing approaches. Nevertheless, a system level design flow with security consideration has not been discussed widely yet.

1.2 Hardware Security Components

As we know, the traditional security solution mainly relies on the cryptography algorithms. Though they are very practical in reality, researchers have successfully devel-
oped side-channel attacks against many hardware cryptography implementations. Instead, trusted platform module has been proposed as a complex security solution in the computer system. However, such a high-cost design cannot be applied to lightweight embedded systems or IoTs. Meanwhile, different true number random generators (TRNGs) have been proposed these years. Hardware RNG can be used for key generation in certain applications, but it does not provide deterministic results. Thus, the application is limited. Another security related module is one-way hash function. Though it is deterministic and stable, complex computation is required.

To overcome the limitation of those existing methods, researchers tried to use the random process variations that are generated during the manufacturing process. Since the components and circuits on the board have their own uncontrollable parameters, it is possible to use the uniqueness to create unpredictable response. With this consideration in mind, physical unclonable functions were proposed by Gassend et al [1]. As a physical component, a PUF is very hard to duplicate or predict due to the results of random uncontrollable variables in the manufacturing process. When a PUF receives a challenge, it reacts with a response, which is known as a Challenge-Response Pair (CRP). An ideal PUF-based CRP provides strong advantages in that each response gives negligible information on the responses from different challenges to the same PUF or even identical challenges on different PUFs [2]. Rather than storing secrets in non-volatile memories, PUFs can provide significantly higher physical security by generating these secrets from unique PUF responses [3]. A PUF can be broadly classified as a “strong
PUF” or a “weak PUF”. The fundamental difference between weak and strong PUFs is the number of unique challenges that the PUF can process [4]. Both contain the same characteristics including randomness, uniqueness, and reliability. A PUF provides random output according to different input by involving the physical variations in the manufacturing process. This is significantly important in random number and secure key generation. Secondly, the output of PUFs must be unique. Given the same input, two PUFs output different values. As a result, it is hard to predict the responses of one PUF even if all the data is tested on another PUF. Finally, an ideal PUF should generate deterministic output, which is the basic requirement of key generation and authentication applications. Therefore, PUFs can be applied as a source of random but reliable data for applications such as generating unique IC identification numbers or encryption keys.

1.3 Problem Statement

Though PUFs have many advantages, it is not a perfect primitive for security applications. The biggest challenge of using PUFs is its instability in the output. It is essential that the PUF responds with consistent results given the same challenge. However, most PUF constructions are not 100% reliable. Because the physical variations have very limited effect on the output, the result sometimes stays in a metastability condition. Meanwhile, temperature, voltage, and other factors affect the stability to some degree. For TRNGs, bit errors may be positive and acceptable. However, authentication and
key generation applications require deterministic values in most scenarios. The high raw bit error rate (RBER) in the responses of PUFs needs strong error correcting codes (ECC) to guarantee a low false negative rate. However, a strong ECC requires large memory space and high calculation complexity. When the decoding demands are large enough, electronics systems with limited computation and storage capabilities may not be able to afford the burden. As a result, traditional PUFs may not offer enough reliability for industrial applications. In this work, we will mainly discuss how we solve the instability of PUFs. Multiply solutions are offered to address the problem under different applications and requirements.

1.4 Related Works and Limitation

To solve the instability issue of PUFs, post-processing methods are widely involved. Since security-critical products require very low bit error rate of their cryptographic keys, those optimized PUFs cannot be used directly as key generators. Therefore, various HDAs have been proposed to maximize the stability in PUF responses. As industrial practical solutions, error correction schemes ensures the key to be reproducible [5]. For example, Bose-Chaudhuri-Hocquenghem (BCH) codes can provide high accuracy by encoding and decoding data with large arrays [6, 7]. However, ECC also means heavy redundancies. When the key length grows, the required codeword can be extremely large while the decoding time becomes significantly long. This redundant data also leaks information about the PUF response itself leading to a potential security
vulnerability. Another disadvantage is that ECC has a strict limitation on the maximum number of bits it can correct. Given a number of errors beyond its capability, ECC will lose its functionality. Alternatively, bit selection schemes become popular recently \cite{5,8,9}. As a lightweight soft-decision coding, the least reliable bits in responses are discarded during the bit selection process. The most intuitive idea is to impose a global threshold and filter out all mismatched bits. Relatively, local thresholding retains only the most reliable bits \cite{5}. However, those algorithms have poor resistance to the variety of models caused by process variation or environmental noise. The reason is that some unstable bits will either cross the threshold or change the sequence, which means former models are no longer suitable. As a result, it is hard to regenerate the same bit set with the original sequence, and no bit errors cannot be guaranteed. According to the HDAs described above, both error correction and bit selection schemes are not satisfying in term of PUF stability improvement.

1.5 Contributions

In this work, we devote the main paragraph to the design and algorithm of PUF stability, which includes the optimization design, implementation, various authentication solution, and key generation method. Specially, the contributions are list as follows:
1.5.1 PUF Design Optimization and Implementation

We optimize the traditional PUF design by replacing the challenge register with a shift register or hash function. We also modify the architecture for timing improvement. Details of implementing PUF on FPGAs are discussed. Furthermore, a phase calibrated PUF is proposed to reduce the bit errors.

1.5.2 Post Processing Authentication Algorithm

We propose several authentication algorithms for different scenarios. Ring weight algorithm is a lightweight solution with limited computation, overhead, and accuracy. PUF initialization table provides very low false negative and false positive, though the calculation amount is much larger than RWA. Unstable-stable response is a trade-off between those two methods.

1.5.3 Solution for Key Generation Applications

By recording the feature of PUF in PITs, a floating thresholding key generation algorithm is discussed in this work. We involve both bit selection and error correction methods in the solution to improve the stability as much as we can. Meanwhile, security problems are also considered since it is a critical factor of the application.

1.6 Thesis Outline

- Instruction
• Background

• PUF Design Optimization and Implementation

• Secure Authentication Solution with PUF Response Instability

• PUF Initialization Table for Robust Authentication and Key Generation

• Ring Weight Algorithm for Lightweight Authentication

• Locality Sensitive Hash Function for Identification and Clustering

• Conclusion
Chapter 2

Background

Physical Unclonable functions map a set of inputs (challenges) to a set of outputs (responses) which are called challenge-response pairs (CRPs). During the PUF design and implementation, we make use of random variations of manufactured structures to derive an uniquely individual behavior for each Integrated Circuit (IC). In order to harness these PUF properties, these individual variations must be extracted and mapped; which is almost impossible since process variations and noises are not clonable. PUFs came to the stage at a time when traditional cryptography failed to stand its ground against physical attacks, side-channel attacks, etc. A significant advantage of PUF-based authentication is that unlike traditional key-based cryptographic systems, a PUF does not require the storage of secret keys in non-volatile memory. Instead, the secret key is hidden within the physical body (e.g. intrinsic properties) of the circuitry which is found to be unclonable and unpredictable. In this chapter, we will introduce several popular PUFs.
2.1 Strong and Weak PUFs

PUFs are usually classified into strong and weak depending on the number of challenge-response pairs (CRP) that they can handle. Weak PUFs support a small number of CRPs which are linearly related to the number of components used to build the PUF. They possess essentially a single, fixed challenge, as for example in the case of SRAM PUFs and DRAM PUFs. Strong PUFs, on the other hand, support a large number of CRPs such that polynomial time attacks become unfeasible. The lack of access restriction mechanisms on strong PUFs is therefore a key difference from weak PUFs. These type of PUFs are usually employed for device authentication applications. As with any embedded system design and IoT devices, there is an inherent trade-off between the constraints of the operational environment and the desired functional capabilities of the device.

2.2 Ring Oscillator PUF

A ring oscillator PUF (RO PUF) is a typical example of delay PUFs whose instability has been heavily studied but remains unsolved. Figure 2.1 shows the traditional RO PUF design, which contains an array of ring oscillators (RO), two multiplexers (MUXes), two counters, and a comparator. Because of manufacturing variations, the wire delay of each RO is not controllable, which leads to different frequencies of the RO output. By selecting two ROs according to the PUF input, we can measure the pulses in a defined unit time with the counters. For example, if the first counter holds
the larger value than the second one, the PUF output is ‘1’. Otherwise the PUF output is ‘0’. Ideally, the output keeps the same bit value giving a certain input, but in reality bit errors and bias are involved in the output. Though the systematic or correlated process variation and the environmental noise caused by the voltage and temperature variations degrade the output stability [10], the bit errors of FPGA-based PUFs are directly generated by a selected pair of ROs with close frequencies, which lead to the unstable measurement in the counters and the flipped output in the comparator. In [11], RO PUFs are characterized over 125 FPGAs. To improve the quality of ROs, the surrounding logic effect on the oscillator frequencies was studied and a strategy was proposed by placing and comparing ROs in a chain-like structure [12]. A reliability-improvement technique was used in pre-quantization phase of RO PUFs to reduce the noise in PUF responses [13]. More optimized approaches were mentioned in [14] by introducing configurable ROs. They also suggested to compare adjacent RO pairs by controlled RO placement, but the FPGA implementations were not clear. A group-based RO PUF was introduced in [15], which described a framework to filter the systematic variation and improve the hardware efficiency. However, its responses still required ECC.

### 2.3 Arbiter PUF

The arbiter PUF (APUF) is another typical delay PUF, which is widely studied and evaluated [16]. An Arbiter PUF extracts the device-specific variation as delay difference between two selector chains. In another word, the generic idea behind arbiter
**Fig. 2.1:** The architecture of a traditional RO PUF

PUF is to race the delay times between two signals. The two signals propagate through various routes depending on the value of the challenge. As shown in Figure 2.2, each challenge bit selects the input path of a MUX pair. Since the location of the MUXes are different, the input wire delay may not be exactly the same. The final value of the 1-bit response is determined by which signal arrives first. Thus, its output should be either ‘0’ or ‘1’, but it is unpredictable. Note that the pattern is also determined by the input length. If the challenge has $n$ bits, the propagation delay time should have $2^n$ combinations. Early work on Arbiter PUF modeling attacks had described successful approaches already, including the standard Arbiter PUF, XOR Arbiter PUF, lightweight PUF, and Feed-Forward Arbiter PUF [17] [18].

**Fig. 2.2:** The architecture of a traditional Arbiter PUF
2.4 Bistable RO PUF

Similar to an SRAM cell that contains a pair of cross-coupled inverters, the bistable RO PUF (BR PUF) is based on the fact that an inverter ring consisting of an even number of inverters has two possible stable states. Figure 2.3 shows an example of a 64-stage BR PUF [19]. Each stage contains two NOR gates, a DEMUX as gate input, and a MUX as gate output. Thus, when a 64-bit challenge is received by this BR PUF, either the upper or the lower paths of the MUX and DEMUX are selected. When we apply different challenges, different combinations of NOR gates and paths are selected as a result. In Figure 2.3, the total number of input patterns is $2^{64}$. In order to generate a response, the whole PUF needs to be reset as “high” first. After the challenge is applied to the bistable rings, we wait until the bistable rings to be ‘0’. A reset of “low” is launched afterwards. Then we wait for some time to let the bistable ring stabilize. Finally, we read out a 1-bit response. To generate more bits for responses, we repeat the progress for certain loops.

Fig. 2.3: The architecture of a typical BR PUF
2.5 Butterfly PUF

A Butterfly PUF (BPUF) includes a cross-coupled circuit, which is brought to the unstable state until it becomes stable. To make it clear, a cross-coupled circuit provides a positive feedback to record the value in a loop. It has two stable state and one unstable state. The unstable state can be easily changed to one of the stable states by a certain input. The behavior of BPUF is similar to the SRAM cell. As Figure 2.4 shown, the structure of 1-bit BPUF is as simple as two latches [20]. When the input signal is set to high, the BPUF becomes unstable at the beginning. If the input is set to low then, the BPUF turns to be one of the stable state, with an output of ‘0’ or ‘1’. The final state only depends on the wire delay of the circuit. Originally, since attackers cannot measure the internal delay of an FPGA or ASIC directly, it is hard to predict which state the BPUF is in. However, as the side-channel attacks become popular, they brings new vulnerabilities to BPUFs.

2.6 DRAM PUF

DRAM PUF (DPUF) was recently proposed as a new type of PUF [21][22], compared to the widely studies of SRAM PUF [8][23]. According to Figure 2.5, there are three approaches to use DRAM as PUFs. The first one makes use of the built-in-self-refresh module of DRAM chips. As we know, DRAM not only requires a power supply to retain data, but must also be periodically refreshed to prevent their data contents from fading away from the capacitors in their integrated circuits. Therefore, we can initialize
Fig. 2.4: The architecture of a typical butterfly PUF

all cells to ‘1’ and then turn off with refresh. Some cells will leak to ‘0’ while others are still ‘1’. Another approach is to use the remanence of DRAM cells. When a DRAM powers off, the data fades away over time in a random way. Thus, the remanence effect can also be applied to PUF. The final method is to use the startup value of DRAM. During startup, the storage capacitor has neither been charged nor discharged. Thus the voltage of the capacitor is equal to the bias voltage. Because of the manufacturing variations, the capacitance will be slight different, which means a random value assignment.
Fig. 2.5: The memory structure of a DRAM array

2.7 Performance Comparison

Table 2.1 shows different types of implemented, and designed, PUFs performance comparison in details. As shown in this Table, the first column demonstrates various performance metrics that need to be considered for each PUF in order to evaluate the effectiveness of each design. Note that the data of RO, abiter, BR, and butterfly PUF are from our implementation and measurement. We also list the memory-based PUF as a comparison. One should notice that the other major difference between the first four presented PUF designs and the fifth (DRAM) PUF is that the “PUF Type” differs between strong (large number of CRPs) and weak (small number of CRPs). The next metric that we evaluated was “Key length”, which was “128 bits” for every one of the PUF designs we examined. The “lookup table” and “flip flop” rows show the number of those elements used to implement each PUF solution on an FPGA. One should note that
DRAM has no values for this, since it is implemented using a commercial off the shelf (COTS) DRAM memory and not an FPGA. The “basic cells” row show the number of cells and type of cells (e.g. multiplexers, ring oscillators, inverters, demultiplexers, flash, memory) that we required to implement each PUF solution. The next metric we evaluated each PUF design using was “bit error rate”; defined as percentage of bit flips that occur. One should note that this is a good representation of the stability of each type of PUF. The “bias” represents the percentage of 1s and 0s in the generated output. Please note that the presented bias is in terms of the number of 0s present in the output.

The following metric, “uniqueness”, is a measure of how uncorrelated the response bits are across multiple implementations of the same PUF design. Ideally the response bit should differ with a probability of 50%. A key point to make on the uniqueness presented is that the RO and DRAM PUFs have a difference of half the bits (e.g. 50%). The last property is “randomness”; a measure of the unpredictability of the response bits. For our proposed PUF, the randomness results show promising capability of our new models. From this table of data we are able to compare and contrast the various solutions to determine those most favorable to any given scenario.

2.8 Contributions

In this chapter, we introduce five types of PUFs: Ring Oscillator PUF, Arbiter PUF, Bi-Stable PUF, Butterfly PUF, and DRAM PUF. The details of design and implementation are discussed in each section. Finally, we compare the performance of these PUFs. The
Table 2.1: Different PUF Performance Comparison

<table>
<thead>
<tr>
<th>PUF Design</th>
<th>Arbiter</th>
<th>RO</th>
<th>BR</th>
<th>Butterfly</th>
<th>DRAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>PUF Category</td>
<td>Delay-based</td>
<td>Delay-based</td>
<td>Delay-based</td>
<td>Delay-based</td>
<td>Memory-based</td>
</tr>
<tr>
<td>PUF Type</td>
<td>Strong</td>
<td>Strong</td>
<td>Strong</td>
<td>Strong</td>
<td>Weak</td>
</tr>
<tr>
<td>Key Length (bit)</td>
<td>128</td>
<td>128</td>
<td>128</td>
<td>128</td>
<td>128</td>
</tr>
<tr>
<td>Lookup Table</td>
<td>97</td>
<td>3316</td>
<td>75</td>
<td>38960</td>
<td>N/A</td>
</tr>
<tr>
<td>Flip Flop</td>
<td>68</td>
<td>2675</td>
<td>30</td>
<td>16912</td>
<td>N/A</td>
</tr>
<tr>
<td>Basic Cells</td>
<td>56 MUXes</td>
<td>256 ROs</td>
<td>28 DE/MUXes</td>
<td>32768 Latches</td>
<td>921 DRAM Memory</td>
</tr>
<tr>
<td>Bit Error Rate</td>
<td>3.31%</td>
<td>7.12%</td>
<td>1.42%</td>
<td>6.80%</td>
<td>1.6%</td>
</tr>
<tr>
<td>Bias</td>
<td>-9% to 8%</td>
<td>-5% to 5%</td>
<td>-8% to 6%</td>
<td>-11% to 9%</td>
<td>-5% to 5%</td>
</tr>
<tr>
<td>Uniqueness</td>
<td>≈ 59/128</td>
<td>≈ 64/128</td>
<td>≈ 51/128</td>
<td>≈ 44/128</td>
<td>≈ 64/128</td>
</tr>
<tr>
<td>Randomness</td>
<td>&gt; 92%</td>
<td>&gt; 98%</td>
<td>&gt; 94%</td>
<td>&gt; 89%</td>
<td>&gt; 98%</td>
</tr>
</tbody>
</table>

hardware cost, bias, uniqueness, randomness, and stability are evaluated.
Chapter 3

PUF Design Optimization and Implementation

3.1 Introduction

Researchers have developed different types of PUFs since the concept was proposed [11, 16, 19, 20, 22]. Among those PUFs, RO PUF is a well-known type with typical performance. Its architecture can be easily implemented on digital circuit. Therefore, in this chapter, we focus on a practical RO PUF implementation on FPGAs. We optimize the architecture of RO PUFs and show the details of implementing RO PUFs on FPGAs. To improve the measurement accuracy of RO frequency, we use phase calibration process for frequency estimation. Meanwhile, We also minimize hardware resources and develop a flexible output data width. Furthermore, we explore the feature of instability in PUF responses and provide a new design that outputs stable bits according to the unstable RO pairs.
3.2 RO PUF Description and Properties

Problem Analysis

Few papers have discussed the impact of bias in RO PUF responses [14], which is the favoring of RO PUFs to either 1 or 0. According to our tests, we found that the counter design in the traditional RO PUF can contribute significantly to this bias. When \textit{CounterA} and \textit{CounterB} hold the same value, the comparator must make a choice of how to treat this case - in our example, the comparator puts this “equivalent” case in the “smaller” group. The “larger” group, on the other hand, can be active only when \textit{CounterA} receives more pulses than \textit{CounterB} in the given clock cycles. Therefore, the output contains more 0’s than 1’s. If ROs are placed more regularly, the frequencies may end up being closer, thus causing more ”equivalent” cases. This is aggravated especially when the counter size is limited. Regardless of the count cycles, the bias caused by the counter design is still as high as 5.03%. As a result, the bit error rate and the bias are intensified by the design fault and inaccurate measurement. Another problem of this RO PUF design is the inefficient usage of hardware resources. To generate just a single bit output, the design requires an \( \log N \)-bit input and \( N \) oscillators. Though we can apply a hash function to generate different challenges, its hardware implementation requires even more FPGA slices than the PUF itself [24]. A transient effect ring oscillator PUF was proposed with a good ratio of PUF response variability to response length, but the intra-device variation increases to 1.7% [25], compared to [11]
with 0.86% intra-device variation. In this chapter, we will discuss a practical phase calibration technique to improve the quality of RO pairs comparison, which shows a trade-off between the cost, stability, and flexibility.

### 3.2.1 RO PUF Design Improvement

Though RO PUFs have been heavily studied, there are still possibilities of improvement particularly with respect to implementations on FPGAs. In this section, we will discuss some practical RO PUF optimizations on the hardware.

**Ring Oscillator Design on FPGAs**

While PUFs have been primarily targeted towards ASIC designs, we explore their potential on FPGAs. According to our test on the Kintex-7 FPGA [26], the frequency of a 5-stage inverter chain RO is approximately 475 MHz when the system clock is 200 MHz. In the crossing timing domain between the ROs and PUF control logic circuit, a high RO frequency adds to the instability of measurements. As with all oscillators, the rate of oscillation is determined by the length of a delay implemented in a loop. Thus, to reduce the frequency, more inverters can be added in the ROs, but it requires more hardware resources. We provide an improved RO design that takes advantage of the lookup tables (LUTs) of the configurable logic blocks (CLBs) and the general purpose interconnect [27] in the FPGA. As shown in Figure 3.1, the first part consists of a 4-LUT delay and a 1-LUT inverter. The reset signal is used to control the enable timing
of all LUTs. In the second part, a single inverter in the loop implements a high gain inverting amplifier. The output frequency is divided by 2 in order to eliminate output glitches. This design only needs six LUTs and a D-type flip-flop (FD).

Fig. 3.1: An improved RO design for an FPGA

Shift Register and Hash Function for Resource Improvement

In order to generate variable length responses, we choose a design as shown in Figure 3.2. This example uses 128 ROs to provide 128-bit CRPs. Once a challenge is received, it is stored in a shift register. We select the first 7 bits from the register as the input of the upper decoder and MUX in order to select an RO from the array. Likewise, another RO is selected using the next 7 bits from the challenge shift register. If the addresses are the same, the second 7-bit address is added by 1 to avoid selecting the same RO for comparison. Next, we shift the challenge register to select a new RO pair. The shift pattern can be complex for security consideration. To make it easy to understand, we shift one bit to the left each time. If the challenge is 128 bits, we can generate a 128-bit response with 127 shift operations.

Alternatively, the shift register can be replaced by a non-cryptographic hash function. Compared to cryptographic hash function, which provides one-way, collision re-
sistant, and deterministic properties, non-cryptographic hash function is significantly faster and low-cost. It is also easy to be implemented on FPGAs. Below is the algorithm of ElfHash that we used in the PUF:

**Fig. 3.2: Architecture of our RO PUF**

### Timing Improvement

In the traditional design, metastability can exist at the input of the counters. It occurs when the counters receive a clock rising edge for comparison. If the RO output happens to flip when a counter value is changing, the result of comparison will be none of larger, smaller, or equal, but a metastable state. We have verified this scenario on our FPGA implementations. As a result, the instability of responses increases. Our solution is to use two registers to store the counter values. The registers and counters are in the same clock domain so that metastabilities are eliminated. The comparison result is recorded in the response register as an output bit.
Algorithm 1 ElfHash process

1: procedure PCP

2: \hspace{1em} hash ← 0

3: \hspace{1em} x ← 0

4: \hspace{1em} count ← 0

5: loop 1:

6: \hspace{1em} hash ← (h << 4) + challenge

7: \hspace{1em} challenge ← challenge + 1

8: \hspace{1em} if x = hash and 0xF0000000 then

9: \hspace{1em} \hspace{1em} hash ← hash xor (x >> 24)

10: \hspace{1em} hash ← hash and (not x)

11: \hspace{1em} count ← count + 1

12: \hspace{1em} if count = 128 then goto loop 1.
3.2.2  RO PUF Implementation on an FPGA

Few papers have clear descriptions on how the RO PUF is implemented on FPGAs [14, 28, 29]. Incorrect HDL coding can lead to logic unit rebuilding during synthesis, which results in functionality failure. For example, inverters built with `inv(1) <= not inv(0);` statements will be optimized out and cannot be placed as RO arrays.

LUT-based RO Placement

To implement the ring oscillators on FPGAs, we use the LUTs directly in our code and add element placement constraints to fix the LUTs with certain delay. In our work, we used the Vivado development kit and the Kintex-7 FPGA provided by Xilinx to show the RO PUF implementation on FPGAs [26, 30]. In the Kintex-7, a CLB contains a pair of slices, and each slice is composed of four 6-input LUTs and other elements [31]. By setting the initial value to “01”, a 1-input LUT (LUT1) can perform the same function as a digital inverter. An example of the traditional RO implementation is shown in Figure 3.3. We place one LUT3 (3-input LUT) in the left slice and four LUT1’s in the right slice. These five LUTs and the white connections form a basic RO. Apart from the feedback input of another inverter, LUT3 also includes a reset input and an enable input. The left LUTs in a CLB are reserved for other logic of the RO.

Figure 3.4 shows the RO placement given by Figure 3.1. Four delay LUT2s are placed in the left slice and the invert LUT locates in the right slice. The output of the first stage oscillator is connected to the FD, which builds another oscillator with another
LUT2 in the right slice. Since the delay LUTs contain no logic, ‘keep’ attributes is required in the design to stop logic optimization by the synthesis tool. Additionally, a combinatorial loop in the FPGA is considered bad design practice in most cases, which increases the number of cycles by infinitely going around the circle in the same path. To avoid unnecessary errors during synthesis and bitstream generation, some constraints are applied to let Vivado ignore these loops.

**Fig. 3.4:** Improved LUT-based RO placement example

**RO Array Placement on FPGA**

Figure 3.5 shows the manual placement of one RO PUF. To maintain the randomness and uniqueness, all the timing critical FFs and LUTs must be placed carefully. 128 ROs
are listed in six columns regularly in order to maximize the manufacturing differences between PUFs. Between columns, two slices are reserved for the relevant logic between ROs and the control unit. LUTs and FDs of the finite-state machine (FSM) are placed at the bottom of the RO array. Other primitives with no effects on the performance are placed by Vivado automatically.

![RO PUF manual placement overview](image)

**Fig. 3.5:** RO PUF manual placement overview

**Automatic Constraints Generation**

However, manual placement of PUFs can be very time-consuming. In our example, each RO array contains 768 LUTs and 128 FFs. If the hardware specification is modified, it may lead to the reconstruction of the architecture. As a result, most of the cells have to be placed again. To make it progress more efficient, we developed an automation code generation process in Figure 3.6. The hardware specification module defines the length of CRPs, performance requirements, and usable platforms. The information
is translated to acceptable format of Shell script so that the code generation script can figure out the scale of RO array. With pre-defined templates, we are able to generate the code and constraints automatically.

![Diagram](image)

**Fig. 3.6:** RO PUF code generation

**Manually Rounting Consideration**

On the other hand, improper routing also affects the uniqueness of PUFs. Though our LUT and FF placement fixes the RO routing path, the other parts are routed by Vivado automatically. The critical paths are the RO array enable and the counter enable. If the RO array enable has a long delay while the counter enable has a short delay, it indicates that the timing to generate RO pulses is postponed and the counter is enabled in advance. In consequence, the window for measurement may not match the RO output well, which leads to potential issues of uniqueness and bias. To eliminate the effect of expected delay, we adjust unsatisfied wire delay by changing paths manually in Vivado implementation and saving it as constraints.
Stability Analysis

A key metric of any PUF is its reliability. In order to evaluate the stability, we have implemented three RO PUFs on the FPGA, which are based on Figure 3.2. We generated 100,000 random challenges and applied the challenge twice to each PUF. The two responses for each PUF were compared and the Hamming distance between the two responses were recorded. Ideally, there should be no difference between the two responses because the challenge was the same and the Hamming distance should be zero. Figure 3.7 shows the Hamming distance distribution for the three PUFs. Without any error correction or fault tolerance methodology, there are about 10 bit errors in each 128-bit response. It is clear that a 7.81% bit error rate is unsatisfactory for authentication. Without very strong ECC, we will have a high false negative rate. To address the instability of PUFs, we will introduce a phase calibration process in the next section.

![Intra-PUF hamming distance without ECC](image)

**Fig. 3.7:** Intra-PUF hamming distance without ECC
3.3 Phase Calibrated PUF

Improving the stability of PUFs requires an accurate measurement of the RO frequency. With the traditional RO PUF, theoretically, one could use a longer measurement time to count RO pulses and thus improve the measurement accuracy. However, we found that extending the measurement time from 16 clock cycles to 512, in order to get larger samples, had limited improvement on the stability. Since the ROs are not driven by the system clock and each enable signal has its own delay, the output can be unstable when the counter is enabled and disabled [12]. As a result, the count may be off by one or two. Previously, we discussed adding a counter register to eliminate the unknown state, it does not address the correctness due to instabilities at the boundaries of the measurement cycle. Due to this unpredictable behavior, even though we tested some “stable” RO pairs 1,000 times and got the same results every time, the next measurement might still produce different value with a small probability. In this section, we propose an efficient solution instead of the repeated testing, which can solve this problem with limited cost.

3.3.1 Phase Calibration Process

Phase Calibration IP Core

The phase calibration process (PCP) is a critical part of our solution, which aims to measure the frequency of ROs fast and accurately. The basic idea is to shift the phase of the RO output signal in order to eliminate asynchronous timing measurement error.
To implement it on FPGAs, we use a primitive that offers a programmable delay function, i.e. Xilinx provides an input delay resource called IDELAYE2 [32]. It can be accessed directly from the FPGA logic and allows incoming signals to be delayed on an individual input pin basis. Figure 3.8 shows the IDELAYE2 primitive, which offers a variable delay mode that can control the delay value after configuration by manipulating the control signals CE and INC. When CE goes High, the increment/decrement operation begins on the next positive clock edge. The programmable delay taps in the IDELAYE2 primitive wrap-around. When the last tap delay is reached (tap 31) a subsequent increment function will return to tap 0. In Figure 3.8, a reset is detected (LD is High) in the first clock event, causing the output DATAOUT to select tap 0 as the output from the 31-tap chain. In the second clock event, a pulse on CE and INC is captured on the rising edge of C. This indicates an increment operation. The output changes without glitches from tap 0 to tap 1. In the third clock event, CE and INC are no longer asserted, thus completing the increment operation. The output remains at tap 1 indefinitely until there is further activity on the LD, CE, or INC pins.

Fig. 3.8: Input delay resource and timing diagram
Architecture of Phase Calibrated PUF

The architecture of our phase calibrated PUF (PCPUF) is shown in Figure 3.9. It mainly consists of a 128-RO array, a control unit, a comparison unit, and pair of decoders, MUXes, IDELAYE2s, and counters. The tap control is to control the delay of IDELAYE2. Since PCP requires a strict RO enable timing, only one selected RO pair is enabled in the measurement period. Otherwise, the control unit resets all the ROs.

Fig. 3.9: Architecture of PCPUF

PCPUF Algorithm

Algorithm 4 shows the details of the PCP. In order to generate a 1-bit response, the first two 7-bit values of the challenge register are loaded to the register $addr.A$ and register $addr.B$. After checking the values, we reset CounterA and CounterB. The tap value is loaded to IDELAYE2 primitives before being added by 1. By default, the tap value is 0. When the delay is set, the control unit enables the selected RO pair and starts testing. Each measurement takes 16 clock cycles. Then the tap value is added by 1 and the new measurement starts. Between two measurements, the selected RO pair is kept...
disabled before the delay is set. When we complete the test from tap 0 to tap 31, a 1-bit output is generated by comparing the counter values. Then the challenge register shifts its value and launches the next response generation. Though the PCP slows down the response generation comparing to the traditional RO PUFs, the PCPUF still keeps an output speed of 625,000 bit/s.

3.3.2 Analysis of Phase Calibration Efficiency

Considering the calculable factors of FPGA, the accuracy of frequency estimation by using the phase calibration process depends on the skew rate, threshold voltage, and the relevant signal cycle. The skew rate and the threshold voltage determine the skew between the system clock and the counters. When the control unit enables or disables the counters at the beginning or in the end of the phase calibration process, the RO outputs may be changing between the voltage thresholds, $V_{OH \ min}$ and $V_{OL \ max}$. This may lead to a metastable state of the counter value. According to the I/O buffer specification (IBIS) model of the Kintex-7, if the FFs are in the same bank, the skew should be 50 ps to 100 ps. The internal signal voltage of the Kintex-7 is designed to swing between -0.5 and 1.1 volts, with anything below 0.4 volts considered a ‘0’, and anything above 0.7 volts considered a ‘1’ [33] [34]. In the worst case, the metastability duration of each rising edge or falling edge is 18.75 ps. Since the RO output phase at the beginning of the window is unpredictable, we can calculate the probability of the metastability occurrence in Figure 3.10 as
Algorithm 2 PUF phase calibration process

1: procedure PCP

2: \( tap \leftarrow 0 \)

3: loop 1:

4: reset CounterA, CounterB

5: load \( addr_A, addr_B \)

6: if \( addr_A = addr_B \) then \( addr_B \leftarrow addr_A + 1 \)

7: loop 2:

8: load \( tap \) to \( IDELAYE2 \)

9: \( tap \leftarrow tap + 1 \)

10: enable \( RO(addr_A), RO(addr_B) \)

11: loop 3:

12: enable CounterA, CounterB

13: read \( RO(addr_A), RO(addr_B) \)

14: if count cycle < 16 then goto loop 3

15: keep CounterA, CounterB

16: disable \( RO(addr_A), RO(addr_B) \)

17: if \( tap < 32 \) then goto loop 2

18: if CounterA > CounterB then

19: output 1

20: else

21: output 0

22: shift Challenge

23: goto loop 1.
$p_{\text{rising\_edge}} = p_{\text{falling\_edge}} = \frac{18.75\, ps \times 2}{4210\, ps} = 0.00891 \quad (3.1)$

in which 4210 ps is the average RO cycle. Though the rising edge and falling edge have the same probability, they are not independent events for any RO within one tap delay. Now we focus on the entire process in Figure 3.10. We create a window of 16 system clock cycles (80 ns), within which the counters of the RO PUF become active. Due to the wire delay, the selected RO pair may not be enabled at the same time. The delay between each RO and the related counter depends on a tap ranging from 0 to 31. Since each taps of IDELAYE2 offers 78 ps delay in the normal environment, we can have as much as 2.418 ns delay of the RO output, which is long enough to find the metastability. Ideally, these 32 taps phase shift will not affect the RO frequency measurement. However, some measurements show different counter values due to the metastability. According to our simulation, one metastability occurrence at one edge of the window has a probability of 0.00775. The probability that two metastability occurrences at both edges of the window is 0.00113. There is no possibility of any more unpredictable states as the window only has two edges. Thus the expected value of metastability is

$$E[X] = 1 \times 0.00775 + 2 \times 0.00113 = 0.01116 \quad (3.2)$$

Therefore, among the 32 taps shift cases, most of the counter values reflect the
frequency of the selected RO correctly. The metastability only happens with a very low possibility.

![PUF phase calibration timing](image)

**Fig. 3.10:** PUF phase calibration timing

Apart from the calculable factors, measurement errors are also generated by the noise margin, which is the amount by which the signal exceeds the threshold for a proper ‘0’ or ‘1’. Being affected by the uncontrollable factors such as temperature, voltage, and aging effects, we can hardly calculate the fault tolerate capability of phase calibration process directly [35]. But we are able to obtain the probability of bit flipping after PCP through large amounts of tests. Next we estimate the efficiency of PCP by computing the intra-PUF hamming distance.

### 3.3.3 Estimation of Intra-PUF hamming Distance

The bit error rate is one of the most critical characteristic of PUFs. To estimate the RBER of PCPUFs, we need to know the intra-PUF hamming distance first, which can be calculated through the distributions of RO frequencies and bit flipping rate (BFR). As we mentioned, when the frequencies of two ROs are closer, temperature, voltage, and aging effects during operation lead to higher BFR. We begin with measuring the rising edges of ROs in the unit time and show the probability distribution in Figure 3.11.
The unit time is set to 80 ns (16 clock cycles). Since we add the measured data from tap 0 to tap 31 and compute the average value, the results may not be integral. A fitting Gaussian curve is generated according to the discrete data, which is used to provide an ideal discrete distribution \( D(f) \) for our computation.

![Fig. 3.11: Distribution of RO rising edges on average](image)

We define a specific bit flipping rate \( \lambda \) according to the difference of the discrete data. Marked as \( f_1 \) and \( f_2 \), the frequencies of two ROs determine the probability of bit flipping. A measured \( \lambda(|f_1 - f_2|) \) in Figure 3.12 shows the statistical BFR results, which follows part of a normal distribution \( 3.61 \times 10^8 N(-12.99, 1.88^2) \). When two ROs have a \( |f_1 - f_2| \) larger than 1, their output is very stable. However, the BFR increases sharply when the difference range is less than 0.5. To further improve the reliability of PUFs, these RO pairs can be blocked from the RO pair selection sets.

For any single bit, the probability of error occurrence can be computed as
Fig. 3.12: Distribution of 1-bit RO output flipping rate

\[ \sum_{f_2=0}^{\infty} \sum_{f_1=0}^{\infty} D(f_1) \times D(f_2) \times \lambda(|f_1 - f_2|) = 0.0029 \]  \hspace{1cm} (3.3)

The probability that 128-bit responses contain exactly \( n \) bit errors is

\[ \binom{128}{n} 0.0029^n (1 - 0.0029)^{128-n} \]  \hspace{1cm} (3.4)

According to Equation 6.5, the ideal intra-PUF hamming distance is shown in Figure 3.13. Therefore, we can calculate the average number of errors, which is 0.3688 in a 128-bit response.
3.4 PCPUF Evaluation

3.4.1 Experimental Setup

In this section, we present the measured data from six implemented PCPUFs on two KC705 boards, as shown in Figure 6.5. The temperature of FPGAs is controlled at 30°C according to the on-chip sensor and the voltage is 1.008V.
3.4.2 Hardware Resource Utilization

Table 6.3 lists the post-implementation primitive utilization of the traditional RO PUF and the optimized PCPUF. We exclude UART and FIFO module in the design to make a fair evaluation. The overall utilization of our PCPUF are 0.58% of the look-up tables and 0.05% of the flip-flops on the Kintex-7 FPGA. This is significantly low-cost compared to the traditional RO PUF design. If we apply the architecture in Figure 2.1, only 35 arrays can be placed on a Kintex-7 FPGA, which means four FPGAs are required to implement a 128-bit RO PUF. Since the shift register or hash function is used in the design, only one array is needed to generate multi-bit responses.

3.4.3 Randomness

We applied 15 NIST randomness tests to evaluate the randomness of our PCPUF responses. The p-values and proportions are listed in Table 3.2. As the p-values are all larger than 0.01, we accept the responses as random [36]. Though we use a pseudorandom number generator to generate challenges, the proportions show that at least 98% sequences pass the tests.

3.4.4 Uniqueness

The evaluation of the uniqueness of PCPUFs is shown in Figure 3.15, which compares three groups of 100,000 128-bit responses. The responses are generated by downloading the same bitstream to two different Kintex-7 FPGAs. The fitting curves have a $\mu$
Table 3.1: FPGA Resource Usage of the Traditional RO PUF and the Optimized PCPUF Implementations

<table>
<thead>
<tr>
<th>Ref Name</th>
<th>RO PUF</th>
<th>PCPUF</th>
<th>Functional Category</th>
</tr>
</thead>
<tbody>
<tr>
<td>LUT1</td>
<td>514</td>
<td>2</td>
<td>LUT</td>
</tr>
<tr>
<td>LUT2</td>
<td>268</td>
<td>857</td>
<td>LUT</td>
</tr>
<tr>
<td>LUT3</td>
<td>175</td>
<td>55</td>
<td>LUT</td>
</tr>
<tr>
<td>LUT4</td>
<td>52</td>
<td>94</td>
<td>LUT</td>
</tr>
<tr>
<td>LUT5</td>
<td>273</td>
<td>30</td>
<td>LUT</td>
</tr>
<tr>
<td>LUT6</td>
<td>293</td>
<td>279</td>
<td>LUT</td>
</tr>
<tr>
<td>MUXF7</td>
<td>0</td>
<td>34</td>
<td>MuxFx</td>
</tr>
<tr>
<td>MUXF8</td>
<td>0</td>
<td>16</td>
<td>MuxFx</td>
</tr>
<tr>
<td>FDCE</td>
<td>20</td>
<td>16</td>
<td>Flop &amp; Latch</td>
</tr>
<tr>
<td>FDRE</td>
<td>430</td>
<td>195</td>
<td>Flop &amp; Latch</td>
</tr>
<tr>
<td>FDSE</td>
<td>1</td>
<td>1</td>
<td>Flop &amp; Latch</td>
</tr>
<tr>
<td>CARRY4</td>
<td>7</td>
<td>9</td>
<td>CarryLogic</td>
</tr>
<tr>
<td>IBUF</td>
<td>4</td>
<td>4</td>
<td>IO</td>
</tr>
<tr>
<td>OBUF</td>
<td>2</td>
<td>2</td>
<td>IO</td>
</tr>
<tr>
<td>BUFG</td>
<td>1</td>
<td>1</td>
<td>Clock</td>
</tr>
<tr>
<td>IDELAYE2</td>
<td>0</td>
<td>2</td>
<td>IO</td>
</tr>
<tr>
<td>IDELAYECTRL</td>
<td>0</td>
<td>1</td>
<td>IO</td>
</tr>
<tr>
<td>Slice</td>
<td>647</td>
<td>365</td>
<td>CLB Slice</td>
</tr>
<tr>
<td>Utilization</td>
<td>1.26</td>
<td>0.71</td>
<td>%</td>
</tr>
</tbody>
</table>
Table 3.2: Results of NIST Randomness Tests

<table>
<thead>
<tr>
<th>Statistical Test</th>
<th>P-value</th>
<th>Proportion</th>
</tr>
</thead>
<tbody>
<tr>
<td>frequency</td>
<td>0.066882</td>
<td>0.98972</td>
</tr>
<tr>
<td>block frequency</td>
<td>0.213309</td>
<td>0.98972</td>
</tr>
<tr>
<td>cumulative sums</td>
<td>0.534146</td>
<td>0.98978</td>
</tr>
<tr>
<td>runs</td>
<td>0.739918</td>
<td>0.99075</td>
</tr>
<tr>
<td>longest run</td>
<td>0.031497</td>
<td>0.98939</td>
</tr>
<tr>
<td>rank</td>
<td>0.219646</td>
<td>0.98988</td>
</tr>
<tr>
<td>FFT</td>
<td>0.392456</td>
<td>0.98991</td>
</tr>
<tr>
<td>non-overlapping template</td>
<td>0.122325</td>
<td>0.98940</td>
</tr>
<tr>
<td>overlapping template</td>
<td>0.062947</td>
<td>1.00000</td>
</tr>
<tr>
<td>universal</td>
<td>0.599114</td>
<td>1.00000</td>
</tr>
<tr>
<td>approximate entropy</td>
<td>0.637119</td>
<td>1.00000</td>
</tr>
<tr>
<td>random excursions</td>
<td>0.778616</td>
<td>0.98902</td>
</tr>
<tr>
<td>random excursions variant</td>
<td>0.137809</td>
<td>0.98905</td>
</tr>
<tr>
<td>serial</td>
<td>0.039329</td>
<td>0.99161</td>
</tr>
<tr>
<td>linear complexity</td>
<td>0.534146</td>
<td>1.00000</td>
</tr>
</tbody>
</table>

converging to 64 - thus, there are 50% bits flipped on average. Therefore, the responses are unique. Moreover, we have compared the responses from six PUFs on two FPGAs and they all follow a normal distribution.
3.4.5 Stability

As shown in Section 3.3, the average number of errors is 0.3688 in 128-bit responses, which means the theoretical bit error rate is 0.29%. To verify that, we generated challenges with a pseudorandom number generator. Each challenge is used twice in order to provide two responses for comparison. The test was repeated 100,000 times. Figure 3.16 shows the measured intra-PUF Hamming distance of our PCPUFs. Compared to the corresponding results with no ECC and phase calibration in Figure 3.7, the fitting curves for PCPUFs have obvious shift to the left and approach the ideal bound in Figure 3.13.

Table 3.3 shows the average bit error rate of three shift register based RO PUFs and six PCPUFs. Compared to RO PUFs, PCPUFs have obvious improvement on the response stability. Though the bit error rates are higher than the ideal bound, all the results are smaller than 1.00%. This is still acceptable if we take the measurement
Fig. 3.16: Intra-PUF Hamming distance with phase calibration

errors into consideration. Since we do not apply ECC in our PUFs, the bit error rate is low enough for authentication purposes. Within the temperature range of normal testing environment, we observed no significant changes of the bit error rate.

Table 3.3: Bit Error Rate of RO PUFs and PCPUFs

<table>
<thead>
<tr>
<th>FPGA 1</th>
<th>BER</th>
<th>FPGA 1</th>
<th>BER</th>
<th>FPGA 2</th>
<th>BER</th>
</tr>
</thead>
<tbody>
<tr>
<td>RO PUF 1</td>
<td>7.10%</td>
<td>PCPUF 1</td>
<td>0.54%</td>
<td>PCPUF 1</td>
<td>0.37%</td>
</tr>
<tr>
<td>RO PUF 2</td>
<td>8.09%</td>
<td>PCPUF 2</td>
<td>0.85%</td>
<td>PCPUF 3</td>
<td>0.70%</td>
</tr>
<tr>
<td>RO PUF 3</td>
<td>7.13%</td>
<td>PCPUF 3</td>
<td>0.62%</td>
<td>PCPUF 2</td>
<td>0.91%</td>
</tr>
</tbody>
</table>

3.4.6 Bias

Ideally, there should be 50% 1’s and 50% 0’s in a response. However, our RO PUF implementation on Kintex-7 FPGAs have a bias ranging from 41% to 58% according
to our tests. As a comparison, we list the bias of PCPUFs in Table 3.4. Each result is based on the mean of 100,000 128-bit responses. Without any post-processing, all six PUFs produce bias within ±2% of ideal, which prove the effectiveness of the phase calibration on improvement the RO PUF bias.

<table>
<thead>
<tr>
<th>FPGA 1</th>
<th>Bias</th>
<th>FPGA 1</th>
<th>Bias</th>
<th>FPGA 2</th>
<th>Bias</th>
</tr>
</thead>
<tbody>
<tr>
<td>RO PUF 1</td>
<td>40.85%</td>
<td>PCPUF 1</td>
<td>49.21%</td>
<td>PCPUF 1</td>
<td>48.72%</td>
</tr>
<tr>
<td>RO PUF 2</td>
<td>47.10%</td>
<td>PCPUF 2</td>
<td>49.77%</td>
<td>PCPUF 3</td>
<td>51.08%</td>
</tr>
<tr>
<td>RO PUF 3</td>
<td>58.42%</td>
<td>PCPUF 3</td>
<td>51.35%</td>
<td>PCPUF 2</td>
<td>50.14%</td>
</tr>
</tbody>
</table>

3.5 Conclusions

In this chapter, we demonstrated a practical design and implementation of RO PUFs on FPGAs. The phase calibration process focuses on improving the frequency measurement techniques in the crossing clock domain area. By involving phase calibration, the 7.81% bit error rate is reduced to less than 1%. We also evaluate other properties of PCPUF, including randomness, uniqueness, and bias.
Chapter 4

Secure Authentication Solution with PUF Response Instability

4.1 Introduction

Authentication is widely applied in the digital world, which poses special problems with electronic communication and supply chain device identification. It has vulnerabilities to man-in-the-middle attacks and counterfeiting. Previously we introduce a stable PCPUF design. Based on that, we will discuss a specific authentication application using the instability of PCPUF in this chapter. Unlike the normal usage of PUF response, this work converts the unstable bits to the stable output. It is noted that the method can be applied to other types of PUF as well. Furthermore, an obfuscation system is provided to enhance the security of authentication solution, which also takes advantage of the unstable response in PUFs.

4.2 Unstable-Stable Response

Modeling and machine learning attacks are well-known for PUF adversary. The successful prediction is due to the accurate calculation and measurement of physical pa-
rameters such as delay and frequency. To achieve a secure PUF-based solution against those attacks, we propose a novel method to generate responses, which uses the instability of PUFs. Since the unstable bits in the RO PUF response are hard to be predicted, they provides better security feature than the stable ones. However, those bits have been only considered in the random number generation application yet. For other scenarios, only stable bits are usually required. In our work, we find the possibility to transfer the unstable bits to stable bits and apply them to PUF responses, which is present as unstable-stable response (USR), a more stable and secure response.

**USR Generation**

Figure 4.1 shows how to generate USR with a PCPUF. We set another hash function module out of the PCPUF, which can provide multiply sub-challenges with only one challenge input. Once a sub-challenge is created, it is sent to the challenge register for temporary storage. By reading the value from the register, PCPUF selects two RO pairs and keeps outputting the 1-bit response to the first address of the unstable bit counter for 1,000 times. The accumulated value in the counter reflects the stability of this output. A number that close to 0 or 1,000 is known to be stable while 500 is regarded as the most unstable response. With the help of its internal hash function module, the model generates a 128-bit raw response and stores it in the counter. Since PCPUFs provide good stability in their responses, the obvious unstable bits are so limited that less than one bit can be found in 128 bits. In the unstable bit counter, we set the unstable bits as
‘1’ and stable ones as ‘0’. The counter records the number of ‘1’s and sends the 128 bits to the response register. Before the bias reaches the ideal fifty-fifty, the counter enables the hash function module in order to launch the next progress with a different internal challenge. Another raw response is generated and transferred to an internal response in the counter. We apply the bitwise OR operation to this internal response and the one that stores in the response register. After updating the register value, the new response should have a higher probability to contain more ‘1’s. By repeating the same process, the response in the register reaches a balance between ‘1’ and ‘0’ finally. Now it is ready to be served as the final response.

Fig. 4.1: Generating stable responses by using unstable raw responses
Proof of Theoretical Threshold

Note that the current design cannot guarantee a static threshold of how many times the process should be executed. The first reason is that the unstable bits selected by the hash function module are not controllable. Secondly, the ‘1’s in two responses may be overlapped. Thus, the amount of ‘1’ is reduced by the OR operation. To estimate the threshold, we define \( u \) as the number of unstable bits in 128-bit internal response. The probability that a certain bit is not set to ‘1’ in one process is \( 1 - \frac{u}{128} \). Since there are \( t \) loops, the probability that the bit is not set to ‘1’ by any loop is

\[
p_0 = (1 - \frac{u}{128})^t \quad (4.1)
\]

The probability that it is ‘1’ is therefore

\[
p_1 = 1 - (1 - \frac{u}{128})^t \quad (4.2)
\]

To eliminate the bias, \( t \) should satisfy

\[
\binom{128}{64} p_1^{64} p_0^{64} = 1 \quad (4.3)
\]

By solving the equation, we are able to set a reasonable threshold in order to control the
bias. When the unstable bits is 0.5%, $t$ is about 177. Though it is possible to control the bias with a dynamic adjustment, the required logic resource will increase significantly.

**USR-based Authentication**

As a basic PUF application, authentication does not require very low bit error rate of the PUF responses. Thus, we can apply our USRs directly to the solution. For the simplest authentication, we compare two responses and check if the mismatched bits are larger than a threshold $t$. In this application, false negative rate and false positive rate are critical for performance evaluation. Given a bit error rate $e$ in 128-bit USRs, the false negative should be

$$f_n = \sum_{n=t+1}^{\min[m,128]} \frac{16384-m}{128-n} \binom{m}{n}, m \approx 128^2 \times e$$

(4.4)

On the other side, the false positive is

$$f_p = \sum_{n=0}^{t} \binom{128}{n} (1-p)^n p^{128-n}$$

(4.5)

in which $p$ is the probability of a random collision in 128 trials. Ideally, the value is close to 0.5.

Apart from the accuracy, we also need to estimate the cost of this solution. One important aspect is the utilization of memory space. In the simplest solution, we do not require ECC as the bit error rate in USRs is very low. Thus, there is no additional
overhead except the CRPs. Given the CRP length of $n$, the memory cost is $2n$ bits per CRP. This is acceptable for small amounts of authentication requests. However, as the requests grow, the memory cost increases linearly. A lightweight solution is to record the instability information of PUFs instead of CRPs. The assumption is that we trust the security of the database, which contains some sensitive information of PUFs. In that case, we can calculate the USRs with the known hash function, instability RO pairs, and the corresponding challenges.

As we mentioned above, an effective solution is based on the assumption that the database is secure. However, we must consider the scenario that the information is disclosed to attackers. The typical solution is to encrypt the database. Alternatively, we can choose a solution that combines the ECC and Bloom filter [37], which is both secure and efficient.

### 4.3 Obfuscation-based Authentication

Since some authentication applications require higher security level, we provide another scheme for our PUFs with a strong obfuscation. Our prime goal is that even if a current response is disclosed, it does not leak non-negligible information about the next responses with the same challenge, which means attackers cannot use replay attack to pass the authentication when a repeated challenge is applied. Secondly, the scheme has strong modeling attack resistance by involving unstable bits in the responses, but the authentication accuracy will not be affected in the database. Furthermore, though probe
attacks can obtain the information, the final response is still unpredictable.

**Obfuscation Architecture and Algorithm**

The scheme starts from the N-bit challenge register in Figure 4.2. Before launching the challenges, a PCPUF stability model is created in the hardware memory. It records the stable RO pairs and unstable ones. After a challenge is received from the database, the register value will not be updated until the final response is generated. In the first loop, the challenge is hashed by a one-way hash function module with an initial seed. The hash value is sent to the secure address generator as a secure challenge. By mapping the generated addresses to the PCPUF stability model, we select $N-1$ stable bits and one very unstable RO pair. We enable the unstable RO pair in order to generate 1-bit output for obfuscation. The position of the inserted unstable bit depends on the hash value. The $N$-bit temporary response is saved in RegisterA. Then we apply exclusive or operation to the value between the RegisterA and RegisterB, and store the result in RegisterB. In the first loop, the values in RegisterB is 0 as it is reset along with a new challenge. Since the secure address generator knows the selected $N-1$ stable bits, they are sent to the one-way hash function module and hashed as the secure challenge of the second loop. Before the hash operation, 1-bit ‘0’ is added to the end of the stable bits to ensure that the input length is $N$-bit. This process repeats $T$ times to generate the final response. In the loop $t$, the secure address generator still select $N-1$ stable bits and one unstable RO pair, which is stored in RegisterA. But there are $t$ accumulated
unstable bits in RegisterB, given the different inserted positions by the secure address generator. Thus, the final response contains $T$ unstable bits.

![Fig. 4.2: Secure obfuscation-based authentication solution](image)

To check the collision, we involve a Bloom filter and its reset control logic module in our scheme. A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. With an obfuscation bit mask from the secure address generator, the $N - T$ stable bits are sent from RegisterB to the Bloom filter. After being added by $T$-bit ‘0’s, the string is hashed by $k$ hash functions and the hashed values are stored in a M-bit non-volatile memory (NVM) Bloom filter. The NVM should be tamper resistant against any malicious reset. Whenever a final response is generated, we check the Bloom filter first to ensure that the response has never been released before. If there is no collision, we update the Bloom Filter with the $k$ hash values and output RegisterB as the final response. However, Once a collision
is found, the final response will be stalled in RegisterB. As a further obfuscation, we rerun this process for $T$ times to generate a new response. However, this time we switch the hash function input to the entire RegisterA, which means the unstable bit is included in the one-way hash input. After $T$ loops, the final response will hardly disclose any information of the PUF state and pattern.

Due to memory space limitation, it is not possible to store infinite hash values in the NVM. Therefore, we use a reset control logic to cleanup the Bloom Filter when the false positive reaches a preset threshold. According to our scheme, the PUF is not allowed to generate the same response twice within the threshold times. Modeling attacks are also inefficient as stable response bits and internal state is obfuscated by the unstable bits, one-way hash function, and the Bloom filter. Even if the Bloom filter is reset, the pattern cannot be detected due to the $T$ times one-way hashing. The complexity of our scheme also makes probe attacks useless since the next final response is unpredictable given a known PCPUF model, Bloom filter content, and the current register value. Since the attackers cannot try all $2^N$ input patterns when $N$ is large, our scheme is secure.

**Database Consideration**

To authenticate our PUFs in a secure way, we avoid recording any data in the database that may disclose the information of the PUFs. Instead, we use Bloom filters for authentication, in which the hash functions provide strong resistance against software reverse
Algorithm 3 Secure authentication scheme

1: procedure OBFUSCATION

2: loop 1:
3:     reset RegisterB, t
4:     ChallengeRegister ← new challenge

5: loop 2:
6:     if \( t = 1 \) then hash(new challenge(N))
7:     else if \( t = T + 1 \) & no collision then goto loop 3
8:     else if \( t = T + 1 \) & collision then output RegisterB goto wait
9:     else if no collision then hash(RegisterA(N − 1) + ‘0’)
10:    else if collision then hash(RegisterA(N))

11: generate \( N − 1 \) stable bits with the PCPUF model
12: generate 1 unstable RO pair with the PCPUF model
13: generate 1 obfuscation bit with the RO array
14: generate inserted position with the hash value
15: RegisterA ← \( N \) bits
16: RegisterB ← RegisterA \( \oplus \) RegisterB
17: \( t \leftarrow t + 1 \)
18: goto loop 2

19: loop 3:
20: receive obfuscation bit mask
21: BloomFilter ← RegisterB(N − T) + T’0’
22: if collision then \( t \leftarrow 1 \) goto loop 2
23: else if no collision then update BloomFilter output RegisterB goto loop 1
engineering. When a PUF is manufactured, we test the PUF and store the PCPUF stability model to the database. Then we redo the response generation process by software, with our challenge set and the known initial seed in the PUF. Since we cannot predict the unstable bits in the response, all the possibilities must be generated by an authentication tree, as shown in Figure [4.3]. The responses of $R_1$ are generated after the first loop of calculations according to the challenge $C_1$. $R_1(0)$ and $R_1(1)$ are based on the assumption that the obfuscating unstable bit is ‘0’ and ‘1’. In the second loop, the responses of $R_2$ are generated with the $N - 1$ stable bits from $R_1$ and the unstable bits lead to four combinations. By following this progress, we obtain $2^T$ possibilities of the final response. With the same challenge $C_1$, the final response of this PUF must belong to the set of $RT$. Thus, all the responses in this set need to be recorded in the Bloom filter with $C_1$ as one-time CRP. As we apply NONCE to the challenge set, it is not necessary to calculate and store the final response set of the second $T$ loop. By mapping the response set with one-way hash functions, we can guarantee the data security in the database.

This security scheme sacrifices hardware resources however. For instance, after generating 1 million responses, the Bloom filter will be reset to maintain a false positive of $10^{-6}$. The size of NVM and the number of hash functions are calculated as:

$$M = -\frac{n \ln p}{(\ln 2)^2} \approx 3.42 \text{ MB} \quad (4.6)$$
Fig. 4.3: PUF authentication tree in the database

$$k = \frac{M}{n} \ln 2 \approx 20 \quad (4.7)$$

In the database, the false positive rate of the Bloom filter affects the authentication accuracy. Thus, it needs to be much lower than that in the PUF. If we set it to $10^{-20}$, the size of Bloom filter for 1 million CRPs is:

$$M' = \frac{n \ln p}{(\ln 2)^2} \approx 11.43 \text{ MB} \quad (4.8)$$

Obviously, the cost is higher than the USR-based authentication solution. In summary, there is always a trade-off between the cost and the security. Those low-cost solutions are suitable for a large amount of chips while this scheme enhances the security level for some critical applications.
4.4 Performance Evaluation

4.4.1 USR Authentication Result

The performance of USR-based authentication is shown in Table 4.1. Compared to the measured raw response of PCPUFs, USRs have lower bit error rate. When we set the threshold to 20, all three PCPUFs have a low false positive and false negative rate in theory. We also verify the bound by measuring 300,000 CRPs on an FPGA. The results show that no false positive or false negative is detected.

<table>
<thead>
<tr>
<th>FPGA 1</th>
<th>BER</th>
<th>$t$</th>
<th>False positive</th>
<th>False negative</th>
</tr>
</thead>
<tbody>
<tr>
<td>PCPUF 1</td>
<td>0.30%</td>
<td>20</td>
<td>$4.30 \times 10^{-16}$</td>
<td>$5.38 \times 10^{-22}$%</td>
</tr>
<tr>
<td>PCPUF 2</td>
<td>0.38%</td>
<td>20</td>
<td>$4.30 \times 10^{-16}$</td>
<td>$1.41 \times 10^{-19}$%</td>
</tr>
<tr>
<td>PCPUF 3</td>
<td>0.31%</td>
<td>20</td>
<td>$4.30 \times 10^{-16}$</td>
<td>$8.82 \times 10^{-22}$%</td>
</tr>
</tbody>
</table>

4.4.2 Comparison with ECC Solutions

Table 4.2 shows the other authentication solutions with Hamming ECC and BCH ECC. Hamming(255,247) can correct one bit error in 128-bit responses while BCH(255,128,15) has a correct capability of as many as 15. However, the overhead of BCH ECC is much larger than that of Hamming ECC. We apply both ECCs to the shift register based RO PUFs and our PCPUFs. Compared to the poor performance on RO PUFs, a lower bit error rate of PCPUFs takes more advantages from ECCs. When those 100,000 CRPs
are tested with BCH ECC, no false negative is detected. Though it is hard to estimate the accurate false negative of BCH ECC based authentication on PCPUFs, we can compare the overhead of these solutions in order to find a better trade-off. The codeword of BCH ECC is almost the same size as the key, while our USR-based solution has no additional overhead at all. Therefore, USR-based authentication is recommended when the memory space becomes a critical bottleneck.

Table 4.2: Performance of ECC-based Authentications

<table>
<thead>
<tr>
<th>PUF type</th>
<th>ECC type</th>
<th>overhead</th>
<th>False negative</th>
</tr>
</thead>
<tbody>
<tr>
<td>RO PUF 1</td>
<td>Hamming(255,247)</td>
<td>8</td>
<td>1</td>
</tr>
<tr>
<td>RO PUF 2</td>
<td>Hamming(255,247)</td>
<td>8</td>
<td>1</td>
</tr>
<tr>
<td>RO PUF 3</td>
<td>Hamming(255,247)</td>
<td>8</td>
<td>1</td>
</tr>
<tr>
<td>RO PUF 1</td>
<td>BCH(255,128,15)</td>
<td>116</td>
<td>3.82 × 10⁻²</td>
</tr>
<tr>
<td>RO PUF 2</td>
<td>BCH(255,128,15)</td>
<td>116</td>
<td>4.84 × 10⁻²</td>
</tr>
<tr>
<td>RO PUF 3</td>
<td>BCH(255,128,15)</td>
<td>116</td>
<td>2.21 × 10⁻²</td>
</tr>
<tr>
<td>PCPUF 1</td>
<td>Hamming(255,247)</td>
<td>8</td>
<td>1.21 × 10⁻²</td>
</tr>
<tr>
<td>PCPUF 2</td>
<td>Hamming(255,247)</td>
<td>8</td>
<td>2.10 × 10⁻²</td>
</tr>
<tr>
<td>PCPUF 3</td>
<td>Hamming(255,247)</td>
<td>8</td>
<td>1.59 × 10⁻²</td>
</tr>
<tr>
<td>PCPUF 1</td>
<td>BCH(255,128,15)</td>
<td>116</td>
<td>&lt; 10⁻⁶</td>
</tr>
<tr>
<td>PCPUF 2</td>
<td>BCH(255,128,15)</td>
<td>116</td>
<td>&lt; 10⁻⁶</td>
</tr>
<tr>
<td>PCPUF 3</td>
<td>BCH(255,128,15)</td>
<td>116</td>
<td>&lt; 10⁻⁶</td>
</tr>
</tbody>
</table>
4.5 Conclusions

In this chapter, we demonstrated the potential of using instability of PUF responses as stable output. The unstable-stable response method transfers unstable bits to stable output. The results show that USR has better performance than the traditional ECC solutions. Another contribution is the obfuscation design of PUF responses. By applying the unstable RO pairs as stable responses and obfuscation, we improved the stability and security of PCPUF-based authentication.
Chapter 5

Ring Weight Algorithm for Lightweight Authentication

5.1 Introduction

As the integrated circuit industry grows, counterfeiting ICs have also been growing as an illegal means of profit taking. In response to this trend, authenticating untrusted ICs has become a critical and pressing issue. Many methods are already available to identify and authenticate ICs [38]. Among them, solutions based on PUFs show the high level of security due to their unpredictable response [1].

The high bit error rate in the responses of PUFs requires strong error correcting codes (ECC) to guarantee a low false negative rate in authentication processes [39]. However, a strong ECC requires large memory space and high calculation complexity. Fuzzy extractor is another solution with higher security level apart from ECC, which involves even larger overhead [8, 40]. As a result, it causes a trade-off between the failure rate and the overhead. Though it is necessary to apply traditional ECC or fuzzy extractor when using PUFs as key generation purpose, IC authentication process does not need to correct the errors in responses, but tolerate them.
In this chapter, we propose a novel algorithm that not only keeps false positive and false negative small, but also reduces the overhead and computation time. This algorithm calculates the density of 128 bits distribution in a response and selects certain information for authentication. By tolerating the weight and location shift of these values, errors in responses affect the accurate rate negligibly. Additionally, the encrypted helper data from each chip improves the performance in a practical and secure way. Thus, a larger set can effectively prevent cloned chips from colliding and being authenticated by mistake.

5.2 System Architecture

Figure 5.1 shows an on-board self-authentication architecture where IC PUFs can be authenticated on the board itself. The system is composed of an authentication device that can issue challenges and receive responses from an IC PUF and a database where the challenge and response pairs are stored. In addition, the data generated by the reliable responses can be sent to the memory for offline usage. In the figure, the authentication device is shown as a FPGA device, but could be implemented with any trusted device with compute capabilities as well. For self-authentication, the database should be stored on the board in some type of nonvolatile memory. If authentication is done offline, the database is stored in a secure server. The database is a critical part of the authentication process and it suffers from three problems – namely size, error tolerance, and security. The size of this database can be a critical issue as it needs to
store all challenge response pairs and the on-board storage may be limited. Since most current PUFs are not reliable – i.e., there is the possibility of bit errors, and we assume that the IC PUF has no built-in error correction, the authentication device and database must handle any errors in the response. Finally, security is a prime concern, because if the database is compromised, attackers can easily obtain the CRPs from the database and clone the chip with faked responses of PUFs.

We show three different approaches to address these issues – standard error correction with encryption, error correction with Bloom filter, and fault tolerant extremum matching. Depending on the approach, the data stored in the database will differ.

![Fig. 5.1: The architecture of three IC authenticating solutions](image)

### 5.2.1 Error Correcting and Bit Matching

The simple solution is to store CRPs and ECC code words in the database for authentication. Since the responses from the PUFs may contain a number of bit errors [41][42], a strong ECC is required to guarantee a low false negative rate. However, this approach requires a larger database in order to hold the ECC code words. Moreover, it requires
encryption of the database which entails more computation effort as well as the difficulty of managing the encryption key.

### 5.2.2 Error Correcting/Bloom Filter Matching

In order to deal with the size of the database, one could use one of a number of lossless compression techniques. However, the database must be uncompressed eventually, meaning, while nonvolatile memory is reduced, the volatile memory must still be large enough to store the CRPs and ECC code words. Instead, for our application, we need a compressed data structure that can be queried without uncompression. A Bloom Filter is such a data structure: a space-efficient probabilistic data structure that can be used to test if an element is a member of a set [43]. The answer to this query can be either “possibly in set” or “definitely not in set”, which means it can have false positives but never false negatives. A Bloom filter is a bit array with \( k \) different hash functions. Feeding an element to these \( k \) hash functions produces \( k \) array positions which can be set to 1. Querying for an element requires feeding it to the \( k \) hash functions and verifying that the resulting bit positions are set to 1. Thus, it provides faster detection of information without resorting to associative lookup buffers [44, 45]. To enhance the information security, one-way hash functions can be used instead [46]. Due to the one-way hash functions, it is computationally hard for attackers to retrieve original responses from the Bloom filter array and code words. In the context of IC/PUF authentication, a Bloom filter can be used to store the challenge-response pairs. As a result, the database is
composed of challenges, ECC code words, and the Bloom filter. While the Bloom filter reduces the size of the database significantly by reducing the size of the response space, the need to store ECC code words still requires a large dataset.

Consider a Bloom filter with a size of $m$ bits and $k$ hash functions. If $n$ is the number of elements to be stored in the filter, the false positive rate, $p_f$, can be calculated as approximately \[ p_f = (1 - e^{-\frac{kn}{m}})^k \] (5.1)

For a given $n$ and a desired $p_f$, the optimal $m$ and $k$ can be calculated by the following equations:

\[ m = -\frac{n \ln p_f}{(\ln 2)^2} \] (5.2)

\[ k = \frac{m}{n} \ln 2 \] (5.3)

As an example with $n = 10^6$, a 28755176-bit array and 20 hash functions are required to keep the false positive rate below $10^{-6}$.

This solution calls for a strong BCH ECC due to the high bit error rate in responses [48]. On the other hand, it sacrifices time and memory space efficiency. For instance, to correct 12 errors in a 128-bit response, eight dimensions of Galois Field
and 92-bit code word are required, which means large overhead and long-time encoding and decoding. Further, if errors are more than the limitation of error correcting capacity, false negatives will occur due to the failure of error correcting and the false mappings into the Bloom filter.

5.2.3 Fault Tolerant Extremum Matching

The ideal solution is to use a strong input fault-tolerant matching algorithm to replace ECC. The fault-tolerant array stores limited information such that even if the array is disclosed, the responses are not available. The algorithm should be efficient while the security level remains high. The overhead should be small while the false positive and false negative are low. This algorithm is the main contribution of this chapter and is described in further details in the next section.

5.3 Novel Fault Tolerant Methodology

To avoid the large overhead of using ECC, we propose a novel fault tolerant authenticating scheme, which does not rely on ECC. It can achieve a better trade-off between false positive rates, false negative rates, the computing time and the memory and area overhead.
5.3.1 Ring Weight Algorithm

The ring weight algorithm (RWA) \cite{37} implements a similar function as the Bloom filter plus ECC, but it requires very limited overhead for large input-fault-tolerance capabilities. We assume a 128-bit response is separated into 32 groups, which form an outer ring as shown in Figure 5.2. Each group contains a 4-bit value ranging from 0 to 15 and is marked with a location value by sequential order. According to our simulation results in Table 5.1, we select 4-bit groups as a trade-off among false positive, false negative, and overhead bits (maximum weight plus maximum location). The 32 location values compose the middle ring. The inner ring contains 32 weights ranging from 10 to 0. Each weight is used only by the corresponding 4-bit group. If these weights are not well-assigned, it will affect the authentication accuracy. Thus, we test linear and non-linear weight distributions and select the optimal assignment according to our simulation results. The overall weight of a location is calculated by multiplying each 4-bit value by the current weight value aligned with it and taking the sum. Clearly, the range of the overall weight is from 0 to 1800, since a response can be set from all 0’s to all 1’s. Figure 5.2 shows the ring in the 1st location and has an overall weight of 849. This overall weight is computed by $10 \times 10(1010) + 9 \times 5(0101) + ... + 0 \times 11(1011) + ... + 9 \times 7(0111)$ and is recorded in the first location of the middle ring. The same process is done for every position as we rotate the weight dial clockwise for the next number. Thus, in the second rotated position, the second 4-bit value (0101) now is aligned with the weight of 10. We follow the previous rules to compute the overall
weight: $10 \times 5(0101) + 9 \times 5(0101) + \ldots + 0 \times 15(1111) + \ldots + 9 \times 10(1010)$. The new overall weight value is recorded in the second location of the middle ring. After rotating the weight dial 360 degrees, we can obtain 32 overall weights, among which the maximum and minimum value are recorded as the peak and valley value of the response. The corresponding peak and valley weight ring locations are also stored in the fault tolerant array as a criterion for judgment.

<table>
<thead>
<tr>
<th>scheme</th>
<th>false positive</th>
<th>false negative</th>
<th>overhead bits</th>
</tr>
</thead>
<tbody>
<tr>
<td>1-bit</td>
<td>31.95%</td>
<td>11.68%</td>
<td>15 + 7</td>
</tr>
<tr>
<td>2-bit</td>
<td>33.68%</td>
<td>10.84%</td>
<td>13 + 6</td>
</tr>
<tr>
<td>4-bit</td>
<td>35.72%</td>
<td>9.80%</td>
<td>11 + 5</td>
</tr>
<tr>
<td>8-bit</td>
<td>51.68%</td>
<td>5.05%</td>
<td>9 + 4</td>
</tr>
<tr>
<td>16-bit</td>
<td>74.56%</td>
<td>1.97%</td>
<td>7 + 3</td>
</tr>
</tbody>
</table>

A simple authentication method is to calculate the peak or valley information of a response and compare it with that in the fault tolerant array. As the errors increase, the overall weight and location value may not be fitted perfectly to the offline array value. Thus, the shifting of the location and weight value should be tolerated to some certain extents. By way of illustration, the offline recorded peak weight is 1138 and the related location is 17. However, the results calculated from the response turn out to be 1105 and 16 due to several bit errors. In this case, the response is rejected, and the device cannot
Fig. 5.2: Main structure of ring weight algorithm

pass the authentication process. In order to allow for errors, we can set a tolerance threshold where we accept deviations in the weight and location value. For example, if ±50 or ±1 are set as the acceptable tolerant range for weight or location value, the response can still be authenticated correctly by weight-tolerance or location-tolerance. However, allowing for tolerance thresholds introduces the possibility of false positives - i.e. responses that are authenticated even though they were not valid. For example, 1105 and 16 could be the weight and location calculated from an invalid response rather than due to errors in a valid response. Reducing the tolerance threshold ranges can reduce the incidence of false positives, but that would increase the probability of false negatives - i.e. responses that are rejected even though they were authentic. Thus, the tension is coming up with optimal values of ranges to minimize both false positives and
false negatives. In the following section, we discuss some of the theoretical foundations behind calculating false positive and false negative rates.

5.3.2 False Positive and False Negative

In this section, we provide the relationship between the parameters of RWA and the authentication results. In RWA, there are two parameters for fault tolerance usage: weight and location. The maximum and the minimum values can be used for authentication. In the aspect of the performance, false positive and false negative are the main factors of authentication process. Thus, three groups of choices generate eight combinations of simple authentication method.

Weight-based False Positive

We start the proof with the relationship between the maximum weight and the probability of false positive. Figure 5.2 shows a weight pattern ranging from 0 to 1800. \{W_1, \ldots, W_n\} is a set that records the maximum weight values of non-error responses. For random distributed responses, the probability density function (PDF) of weight belongs to a certain family of distribution with a vector \(W'\). The measurement error is given as \(W - W'\). Thus, the likelihood function of the joint probability is

\[
L(W') = L(W'; W_1, \ldots, W_n) = f(W_1 - W') \cdots f(W_n - W')
\]  \hspace{1cm} (5.4)
Maximum-likelihood estimation is applied as

$$\frac{d \ln L(W')}{dW'} = \sum_{i=1}^{n} \frac{f'(W_i - W')}{f(W_i - W')} = \sum_{i=1}^{n} g(W_i - W') = 0 \quad (5.5)$$

For each 4-bit group \{r_n, r_{n-1}, r_{n-2}, r_{n-3}\} in a response, the weight value is calculated by

$$W = \sum_{i=1}^{32} (w_i \times (2^3 r_n + 2^2 r_{n-1} + 2^1 r_{n-2} + 2^0 r_{n-3})), n = 4i \quad (5.6)$$

If $\bar{W}$ is the arithmetic mean value of $W$, for the equation (5.6), with $n = 2$, then

$$g(W_1 - \bar{W}) + g(W_2 - \bar{W}) = 0 \quad (5.7)$$

Since the minuend is the arithmetic mean of $W_1$ and $W_2$, the function can be simplified as $g(-W) = -g(W)$. Further, let $W_1 = \ldots = W_m = -W$, $W_{m+1} = mW$, in which $n = m + 1$. An arithmetic mean value of 0 leads to

$$\sum_{i=1}^{n} g(W_i - \bar{W}) = mg(-W) + g(mW) = 0 \quad (5.8)$$

Thus, we have $g(mW) = mg(W)$. The only continuous function that satisfies this equation is $g(W) = cW$, which means the PDF of $W$ is
\[ f(W) = ae^{cw^2} \quad (5.9) \]

Normalized \( f(W) \) follows the general normal distribution. The discrete function is described approximately as

\[ p_{\phi} \approx \frac{1}{\sigma \phi\left(\frac{W}{\sigma}\right)} \quad (5.10) \]

Therefore, the maximum weight follows a normal distribution. To estimate \( \sigma \) and \( \mu \), we ran one million CRPs as a sample set. The bits in these responses are distributed uniformly to guarantee the accuracy. The simulated data with a fitted curve are shown in Figure 5.3. Since the normal distribution curve achieves 0.999 adjusted R-square with the simulation result, we can safely say that it represents how the maximum weight values distribute from 0 to 1800. The parameters of the normal distributions are represented as \( N(1008, 106^2) \). In another words, the probability of maximum weight can be applied to that of any random 128-bit response.

To calculate the maximum weight-based false positive, we apply convolution to the curve in Figure 5.3. The convolution represents the collision rate between an invalid response and the actual valid response. The process is shown in Figure 5.4. The blue curve, representing the invalid response, starts shifting from the left side when \( \mu_{\text{blue}} \) —
$\mu_{\text{red}} = -1800$. At this point, the shaded area is equal to the probability of a collision with a value of $\Delta W_{\text{max}} = -1800$. As the blue curve shifts closer to the red curve, the shaded area increases as a smaller $\Delta W_{\text{max}}$ has a higher probability of collisions. After reaching the peak point $\Delta W_{\text{max}} = 0$, the probability decreases in a symmetrical way. The results of the convolution are shown in Figure 5.5. The figure also shows the result of a simulation of one million CRPs and one million random responses. As can be seen, the simulated values of $\Delta W_{\text{max}}$ are matched with the convolution data. Clearly, if we tolerate a wider range of $\Delta W_{\text{max}}$, the integral and thus, the false positive probability is higher. For example, if the tolerant range is 0, the false positive rate is 0.027. It increases to 0.769 with an interval between $-180$ to 180 in Figure 5.6. A tolerant range between $-500$ to 500 covers most collisions, but also means the probability of a false positive is nearly 1. Thus, the tolerance setting need to be controlled in a reasonable range in order to remain effective at keeping both the false positive and false negative
rates small.

**Fig. 5.4:** Convolution of the maximum weight distribution

**Fig. 5.5:** Collision of the maximum weight difference value

The above proof can also be applied to minimum weight-based false positive. The curves in Figure 5.3 and Figure 5.7 show axial symmetry properties with the axis of $W = 900$. In Figure 5.8, the convolution results match the simulated minimum weight difference. Therefore, minimum weight-based tolerance has the equivalent collision
Fig. 5.6: Probability of the maximum weight based false positive rate as that of the maximum weight-based tolerance.

Fig. 5.7: The minimum weight distribution of RWA

Location-based False Positive

As we mentioned in the RWA section, the peak location $L_{max}$ refers to the maximum location while the valley location $L_{min}$ refers to the minimum location. Since they can
be proved to have similarly effects on location collisions, only the peak location-based false positive is discussed in this section. This authentication is based on tolerating $\Delta L_{\text{max}}$ caused by the bit errors in responses, but it involves false positive as well. The collisions of $\Delta L_{\text{max}}$ are caused by uncertain locations of random responses. In Figure 5.9, we present the relationship between $\Delta L_{\text{max}}$ and the collision rate. Since we define $L_{\text{max}}$ as $\{1, 2, \ldots, 32\}$, $\Delta L_{\text{max}}$ has an interval of $[-31, 31]$. With a fixed original location, the tested response has an equal chance to appear at any place in the ring. Thus, all the accessible locations are marked with the same probability unit ‘1’ while all the unavailable locations are ‘0’. The collision rate can be calculated according to the amount of ‘1’ in the shape of diamond in the diagram. The position of $\Delta L_{\text{max}} = 0$ obtains the highest $p$: 32 out of 1024. On its both sides, there are progressive decreases in probabilities. $\Delta L_{\text{max}} = -31$ and $\Delta L_{\text{max}} = 31$ only appear when $L_{\text{max}}$ is 1 and 32. A comparison of the location collision is shown in Figure 5.10. We simulated one million
CRPs and generated a collision rate set, which overlapped the theoretical probabilities according to Figure 5.9.

**Fig. 5.9:** Distribution of the maximum location difference

<table>
<thead>
<tr>
<th>∆L_{max}</th>
<th>-31</th>
<th>-30-29</th>
<th>-17-16-15</th>
<th>-12-11-10-9</th>
<th>-6-5-4-3-2-1</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>15</th>
<th>16</th>
<th>17</th>
<th>29</th>
<th>30</th>
<th>31</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td>3</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>3</td>
<td>3</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>...</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>15</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>16</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>17</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>...</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>30</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>31</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>32</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>μ</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>15</td>
<td>16</td>
<td>17</td>
<td>30</td>
<td>31</td>
<td>32</td>
<td>31</td>
<td>30</td>
<td>17</td>
<td>16</td>
<td>15</td>
</tr>
</tbody>
</table>

**Fig. 5.10:** Collision of the maximum location difference value

Notice that in the ring, ∆L_{max} follows a rule: −31 = 1, −30 = 2, ..., −16 = 16, ..., −2 = 30, −1 = 31. Thus, in Figure 5.10, the triangle from −31 to −16
can be flipped horizontally to the space from 1 to 16. The triangle from 17 to 31 can be flipped horizontally to the space from $-15$ to $-1$. As shown in Figure 5.11, by transferring $\Delta L_{max}$ into absolute value, the collision rate of location-based RWA becomes flat from $-15$ to 16. When tolerating one more location with the current location range, the collision rate increases by $1/32$. To reduce the false positive, we will discuss an optimization in the location shift scheme section.

![Fig. 5.11: Absolute value of the maximum location difference](image)

**Weight-based False Negative**

We consider an unauthenticated 128-bit response with 12 bit errors, there are $2^{128} \times {128 \choose 12}$ bit distribution patterns. In other words, we need to run the program for $8.07 \times 10^{54}$ times to calculate the true probability of weight-based false negatives. Instead, we estimate the probability by simulating one thousand samples to evaluate the distribution of $\Delta W_{max}$. Each sample is a set of one million simulation results. One set refers to one certain response with average 12 bit errors, but the error is distributed in different patterns. Here, the number of errors is assumed to follow a normal distribution. The
examples are given in Figure 5.12. If a response is all '0', it will only have negative $\Delta W_{max}$. A response with all '1' shows opposite features. However, most of the random responses fit a standard normal distribution with $\mu = 0$. All the sets together form a normal distribution in Figure 5.13. We can safely claim that, though it is difficult to calculate all the probabilities of weight-based false negatives, the results from the samples have statistical significance and can be regarded as a close approximation of the true probabilities. The simulated data matches the statistical curve in Figure 5.13. Most of the error cases can be tolerated with a $\Delta W_{max}$ between $-275$ to $275$. The range is determined mainly by the number of errors, as shown in Figure 5.14. With more errors, the standard deviation, $\sigma$, becomes larger and the probability is less concentrated around 0. Thus, a wider $\Delta W_{min}$ threshold must be applied to reduce the false negative rate. The minimum weight based false negative shows the same result, thus we skip this discussion.

Fig. 5.12: $\Delta W$ distribution for certain responses with random 12 bit errors
Fig. 5.13: Tolerant capabilities of the maximum weight difference value

Fig. 5.14: Relationship between error number and $\sigma$

**Location-based False Negative**

As with the weight-based fault-tolerance scheme, determining the location-based false negative rate is also a hard problem that cannot be solved in the polynomial time. Therefore, we provide the statistical results in Figure 5.15 instead. The results contain seven groups of simulations with different errors in responses. Again, the number of errors follows a normal distribution. In each group, ten millions authentication processes with
different response patterns are simulated. We calculate the average values to evaluate
the probability of $|\Delta L_{\text{max}}|$. To simplify the problem, we use the absolute values of the
location difference. With a limited interval from $-15$ to $16$, the results are more easily
affected by the errors. Thus, the curve shows different features compared to the normal
distribution. Meanwhile, the shapes of the curves are also determined by the weight
distribution in the ring. If the weight values in the ring are given in a linear way, the
curve shows a continuous feature. Otherwise, it is stepped. Compared to the curve
of $4$ bit errors, groups with more errors have higher level of randomness in distribu-
tion. Clearly, when there are $64$ bits flipped in a $128$-bit response, it cannot keep its
own feature any more. Therefore, the curve with $64$ bit errors is as flat as an uniform
distribution.

![Graph](image)

**Fig. 5.15:** Relationship between error number and $|\Delta L_{\text{max}}|$

One way to improve the performance of the algorithm is to use logic operations of
both location-tolerance and weight-tolerance. The intersection of location and weight
set can decrease false negatives but increase false positives. Contrarily, the union of location and weight set decreases the false positive rate but increases the false negative rate. The combination of the peak and valley values plays similar function. As the error rate grows, wider ranges of location and weight value are necessary to keep a low false negative. However, a larger set of values sacrifices the false positive for tolerating more bit errors. In a case that each 128-bit response of PUFs contains 12 random errors, a pure fault tolerant algorithm with very limited overhead cannot keep both false positive and false negative very low in theory. Thus, while false negative is required to be as low as possible, how to avoid the increasing of false positive becomes a critical issue.

5.3.3 Location Shift Scheme

Since fault tolerance is involved in the algorithm, false positive becomes an ineluctable problem for consideration. To decrease false positives, a location shift scheme is proposed. This scheme requires an ID set of at least 32 bits in unauthenticated chips. As shown in Figure 5.16, the low 20 bits record a serial number, which is the basic address of the fault tolerant array. Since we simulate one million chips, it requires at least 20 bits to generate unique IDs. The high 8 bits represent a random number that relates to the offset of the location. The middle 4 bits are encryption bits for data transfer security consideration. When a chip receives a challenge from authenticating system, the 32-bit ID will not be sent to the system until the response is generated. To transfer data securely, the 128-bit response and 32-bit ID are mixed together logically. Certain bits of
the basic address are flipped according to different 4-bit code word patterns. The 8-bit offset address is also encrypted with an 8-bit challenge by XOR operation. Specially, the eight bits of the challenge are selected by the quotient of the serial number division. When the 160-bit data are received by the authentication system, encrypted basic address, code word, encrypted offset address, and response are extracted according to our defined protocol. After receiving the response, ring weight module generates four values: peak weight, peak location, valley weight, and valley location. The locations are further shifted by offset address in order to extend the matching space to the original location value plus the offset value. Meanwhile, four values calculated in advance in the fault tolerant array are selected by the decrypted basic address and are sent to the comparator for authentication.

By adding the offset values, the original locations are extended from 32 to 256. Therefore, it is harder for a cloned chip with a faked ID to point to the correct location. Since the collision probability of locations is reduced, the false positive decreases as a consequence. In Figure 5.16, the offset values added to the peak and valley location are different because the valley location shift method uses revered offset bits, which can further increase the difficulty of the collisions. Another effective way to decrease the false positive is to extend the offset address. When the random bits in the ID are 11 bits, the location shift space is up to 2048. The probability of false positive can be written as

\[ P_{fp}^s = \frac{|\Delta L|}{s} \]  

(5.11)
The variable $s$ is the location shift space. The original false positive is reduced significantly with a set of $s$ between 256 and 2048. Hence, increasing the shift space makes random responses of a cloned PUF harder to find the correct peak and valley location and pass the authentication test. However, expanding the shift space destroys the ring architecture of $\Delta L$. The original ring is cut and extended to a line, which means the $\pm 16$ becomes $\pm 32$. The distribution of location-tolerance will not follow the theoretical equations any more.

### 5.3.4 Security Consideration

Apart from the accuracy rate, security is another important issue for authentication. The underlying PUF is still susceptible to potential modeling/ML attacks and RWA is not designed to address such issues in the PUF. Moreover, commercially it is not feasible for attackers to build many models since our focus is on authenticating a large amount of chips. However, an attacker could potentially generate a response that matches the RWA weight or location values. If we only use one simple RWA, it is not difficult to get two random responses with the same value. RWA has two thresholds related to collisions: $t_{\text{max location}}$ and $t_{\text{max weight}}$. The maximum location of RWA leads to a uniform distribution, as all the results lies evenly in 32 buckets. Here we apply the birthday paradox to evaluate the difficulty in finding a response that passes the maximum location threshold. As adapted from an argument of Paul Halmos, the probability that no two $\Delta L_{\text{max}}$ coincide is:
The above expression shows the interest in which the lower bound of a collision occurs when \( p(n) > 1/2 \), thus the inequality implies \( \bar{p}(n) < 1/2 \). Solving for \( n \) gives:

\[
\bar{p}(n) = \left( \prod_{k=1}^{n-1} \left( 1 - \frac{k}{32} \right) \right)^{n-1} < \left( e^{-k/32} \right) = e^{-n(n-1)/(2 \times 32)}
\]

\[\text{(5.12)}\]

The above expression shows the interest in which the lower bound of a collision occurs when \( p(n) > 1/2 \), thus the inequality implies \( \bar{p}(n) < 1/2 \). Solving for \( n \) gives:

\[
n^2 - n > 2 \times 32 \ln 2
\]

\[\text{(5.13)}\]
This inequality is satisfied when \( n = 8 \). Therefore, 8 responses suffice for ensuring that two hashes match with equal chance. Meanwhile, we assume that an upper bound is reached when \( p(n) > 0.999 \). Solving \( n^2 - n > 2 \times 32 \ln 1000 \) gives an approximate upper bound of \( n = 22 \). A very higher rate of finding a collision pair can be a potential security problem. By expanding the location range from 32 to 2048, the bound limitations are changed to 54 and 169.

The maximum weight of RWA, however, does not follow a uniform distribution. As shown in Figure 5.3, the maximum weight follows a normal distribution \( N(1008, 106^2) \). Assuming \( p_x \) and \( p_y \) are probability density functions for the maximum weight of two random responses, the average probability that two hash values coincide is:

\[
\begin{align*}
    p_{\text{average}} &= \sum_{x=y=0}^{1800} p_x p_y = 0.0027
\end{align*}
\]

This means that one pair of collisions may be found in about 371 pairs of responses. Though the maximum weight should theoretically match the range of weight from 0 to 1800, it barely appears at both sides of the normal distribution, but is centralized near the mean value of the curve. By replacing 32 with 371 in Equation 6.9, the lower bound and upper bound of weight-based collision are given as 24 and 73.

Mixed RWA refers to a combination of 11-bit maximum weight value, 11-bit minimum weight value, 5-bit maximum location value, and 5-bit minimum location value.
value. The theoretical maximum possibility of the mixed RWA value is the multiplication of four parts \(370.37 \times 370.37 \times 32 \times 32\). By taking four factors together, the upper bound and lower bound increase significantly. Table 5.2 shows the number of responses an attacker must try to achieve a 50% or 99.9% probability of success. Compared to the ECC/Bloom filter approach, our mixed RWA scheme is much more difficult for attackers to generate fitting responses and find a collision. The number of guesses can be limited to enhance the security level.

Table 5.2: Security Comparison of Different Schemes - Number of response guesses required to achieve 50% or 99.9% success rate

<table>
<thead>
<tr>
<th>scheme</th>
<th>50%</th>
<th>99.9%</th>
</tr>
</thead>
<tbody>
<tr>
<td>ECC &amp; Bloom filter</td>
<td>128</td>
<td>402</td>
</tr>
<tr>
<td>maximum or minimum location-based RWA</td>
<td>8</td>
<td>22</td>
</tr>
<tr>
<td>maximum or minimum location shift RWA</td>
<td>54</td>
<td>169</td>
</tr>
<tr>
<td>maximum or minimum weight-based RWA</td>
<td>24</td>
<td>73</td>
</tr>
<tr>
<td>mixed RWA</td>
<td>13955</td>
<td>44053</td>
</tr>
</tbody>
</table>

5.4 Evaluation and Discussion

In this section, results of different authentication schemes are compared: no ECC, traditional ECC only, ECC plus Bloom Filter, Ring Weight Algorithm, Helper Data Algorithm, and fuzzy extractor. We provide simulation results of authenticating one
million chips with different schemes. To measure the collision rate, another group of one million cloned chips are simulated with random responses. Since it is not practical to implement one million different real PUFs, we use simulation and assume the PUF responses have robust and uniform bit strings according to [11, 20, 42]. The PUF bit probabilities are independent and there is no significant bias in the PUF responses. Environmental conditions and aging only generate bit errors, but do not change our assumptions. In our simulations, we assume a roughly 10% bit error rate such that the bit errors in each response follow a normal distribution around a mean of 12. These assumptions track with results obtained in the literature [1, 11, 49]. Thus, the simulated results can present the real case of PUFs. Apart from simulation results, we also test the performance of RWA based on the implementation of ring oscillators-based PUFs (RO PUFs) and arbiter PUFs on Xilinx FPGA.

5.4.1 Error Correcting/Bloom Filter Matching

We start with a $16 \times 8$ ECC coding where the 128-bit response is split up into eight groups, each of which has its own BCH code word. Thus, one response with an average of 12 errors will be decoded eight times before authentication. The false positives and false negatives are shown in Table 5.3. $m, t, k$ and $r$ refer to the dimension of Galois Field, number of errors that can be corrected, length of information bit, and number of parity checks. BF counter defines the number of chips that cannot be authenticated. ECC counter presents the number of responses that cannot be corrected by the BCH
code. For the best case, only three chips (out of one million) failed to be correctly identified through the Bloom Filter and error correction. Data rate refers to the efficiency of the error correction scheme - i.e. $\frac{k}{k+r}$, the ratio of number of data bits to number of bits in the code. As can be seen, the overhead for these methods is quite high - to achieve the low error rate, one requires an overhead of 48 bits or three times the data.

Table 5.3: 16 × 8 Bits ECC/BF Authentication Results

<table>
<thead>
<tr>
<th>m</th>
<th>t</th>
<th>k</th>
<th>r</th>
<th>data rate</th>
<th>BF counter</th>
<th>ECC counter</th>
<th>false positive</th>
<th>false negative</th>
</tr>
</thead>
<tbody>
<tr>
<td>6</td>
<td>4</td>
<td>16</td>
<td>24</td>
<td>0.40</td>
<td>42556</td>
<td>42588</td>
<td>2.54 x 10^{-4}</td>
<td>4.26 x 10^{-2}</td>
</tr>
<tr>
<td>6</td>
<td>5</td>
<td>16</td>
<td>28</td>
<td>0.36</td>
<td>4731</td>
<td>4734</td>
<td>2.54 x 10^{-4}</td>
<td>4.73 x 10^{-3}</td>
</tr>
<tr>
<td>6</td>
<td>6</td>
<td>16</td>
<td>36</td>
<td>0.31</td>
<td>356</td>
<td>356</td>
<td>2.54 x 10^{-4}</td>
<td>3.56 x 10^{-4}</td>
</tr>
<tr>
<td>6</td>
<td>7</td>
<td>16</td>
<td>44</td>
<td>0.27</td>
<td>20</td>
<td>20</td>
<td>2.54 x 10^{-4}</td>
<td>2.00 x 10^{-5}</td>
</tr>
<tr>
<td>6</td>
<td>8</td>
<td>16</td>
<td>48</td>
<td>0.25</td>
<td>3</td>
<td>3</td>
<td>2.54 x 10^{-4}</td>
<td>3.00 x 10^{-6}</td>
</tr>
</tbody>
</table>

The overhead can be improved somewhat with a 128 × 1 BCH coding solution as shown in Table 5.4 at the cost of higher false negative rates. However, it still has a nearly 100% overhead in memory space. Also, the system is much more compute intensive due to a larger amount of encoding and decoding calculation.

5.4.2 Weight and Location Tolerance

To simplify the bit error correction processing, the ring weight algorithm is composed of the following parts: peak weight value, peak location value, valley weight value,
and valley location value which are calculated according to the responses. If the value locates in the tolerance range, then the response passes the authentication. Figures 5.17 and 5.18 show the performance of different location and weight tolerance ranges with 12-bit errors in each 128-bit response. For most cases, the simulated false positive and false negative rates match the theoretical result. With wider tolerance ranges, the false negative decreases but the false positive increases.

When using this authentication method, the difference of maximum weight and its probability of false positive should follow a normal distribution. The standard deviation \( \sigma \) depends on the range of weight value. A wider range of weight makes \( \sigma \) larger, which can filter more false responses for the authentication process. The false positive probability related to \( \Delta W \) is given by

\[
P_{fp}^{\Delta W} = \sum_{i=0}^{k} \left( \binom{n}{i} p^i (1-p)^{n-i} \right), \quad p = p_0
\]  

(5.15)
Fig. 5.17: False negative and false positive rate related to location-tolerance

Fig. 5.18: False negative and false positive rate related to weight-tolerance

5.4.3 Location Shift

The effect of the location shift scheme is shown in Figure 5.19. The offset address bits are extended to 11 bits in order to obtain up to 2048 location shift. As can be seen, the false positives drop significantly as peak and valley location value are shifted by offset bits.
5.4.4 Fault Tolerance Capacity

To measure the fault tolerant capacity, the average bit errors in responses is set from 3 to 18. Since bit error does not affect the false positive rate, which refers to a random response with or without bit errors, we only focus on the relationship between the bit error number, tolerance range, and the false negative rate. In Figures 5.20 and 5.21, \([-3, 3]\) location-tolerant range and \([-70, 70]\) weight-tolerant range show distinct shifting, which can help reduce the false negative rate but increase the false positive rate. When the tolerance value continues increasing, the data come to a convergence.

5.4.5 Weight Optimization

To further improve the authentication accuracy, the deviation of weight set \(\{w_1, \ldots, w_{32}\}\) is reduced. Giving up the weight trade-off showed in Figure 5.2, we merge the values to meet the requirement of false negative. Various weight patterns are measured, among
Fig. 5.20: False negative of different bit errors with different location-tolerance

![Graph showing false negative percentage against location-tolerance with different values of e.]

Fig. 5.21: False negative of different bit errors with different weight-tolerance

![Graph showing false negative percentage against weight-tolerance with different values of e.]

which results of five groups are shown in Figure 5.22. By reducing the maximum weight pattern from 16 to 7, the false negative decreases from 1.58% to 0.02%.

5.4.6 Offset Bit Flipping

For the location shift scheme, reversing the offset address bit before shifting the valley location value is an efficient way to further reduce false positives. Since the peak and
Fig. 5.22: False negative of different weight distribution methods

valley location values are added to different shift values, the probability of getting a collision becomes smaller. In Figure 5.23, 150s and 200s refer to same offset bits with ±150 and ±200 weight-tolerant range. 150d and 200d add different location shift values for original peak and valley location. As a result, offset bit flipping reduces the false positive rate greatly.

5.4.7 Application Field

RWA can be applied to different types of PUFs. However, depending on the particular PUF implementation, the distribution of 1’s and 0’s in the PUF response may not be uniformly distributed or may even have a bias to ’1’ or ’0’. Both the weight-tolerance and location-tolerance approaches have some resistance to these variations. We simulated nine groups of CRP authentications and show the worst cases in Figure 5.24.
**Fig. 5.23:** False positive and false negative rates of different 2048 location shift methods

With a 20% bias from the ideal baseline of 50% '1' and 50% '0', the false negative rate is still less than 0.5%. However, if 0's or 1's reach significant majority, any minority occurrence due to the bit error will obtain larger chance to cause the shift of the overall weight and location value. The false positive rate depends on both the bias in the baseline response as well as the bias in the response under test. The *false positive 1* curve shows the false positive rate if both responses have the same bias, and we can see that the false positive rate remains constant at less than $10^{-3}$. The curve of *false positive 2* shows a different situation, under which the bias of the baseline varies from 30% to 70% while the response under test keeps a bias of 50%. As a result, the false positive is reduced when there are more differences between the response pair, which means a collision will be more easily found with the same bias. Therefore, if the response of a PUF shows good randomness and acceptable bias, RWA can archive the lowest false negative. The bias will only help to reduce the false positive when the bits between the
baseline and the response under test have more than 5% difference.

![Graph](image)

**Fig. 5.24:** Effect of response bias on false positive and negative rates

### 5.4.8 Overall Performance Consideration

Figure 5.25 shows the effect of adding various optimization methods to RWA. We start with baseline and then add the location shift optimization, weight optimization and offset bit flipping optimization. By involving these schemes step by step, we can reduce the false positive and false negative significantly with only a slight increase in overhead.

Figure 5.26 shows how the number of errors affects the false negative rate. The effect becomes significant with more than 6 errors per 128 bits. Since the false positive rate is not affected by the number of errors, but is a consequence of the compression due to RWA, the false positive rate curve remains flat.
Fig. 5.25: Performance comparison of different fault tolerant methods

Fig. 5.26: Overall effect of number of errors

Table 5.5 shows the overall comparison of the different solutions. Since the bit error rate follows a normal distribution, the probability that no error occurs in 128 bits becomes infinitesimal. If no ECC is used for direct bit matching, no chip passes authen-
tication. Once BCH ECC is applied, the false negative decreases while the overhead increases. The ECC and Bloom Filter solution provides acceptable false positive and false negative rates at the cost of large computation and memory space. Our novel ring weight algorithm and corresponding optimization shows accurate authentication rate without ECC. We achieve the best performance by setting the absolute value of location-tolerance to 4 firstly. When the difference of two peak locations and the difference of two valley locations are smaller than 4, the response under test is authenticated as passed. If it is not satisfied, we extend the absolute value of location range to 32, but set the absolute value of weight-tolerance to 200. Given a response with its baseline that cannot meet all the constraints of the peak weight, valley weight, peak location, and valley location, it will be regarded as failed. Smaller memory space and less computation are also its advantages compared to other solutions. Meanwhile, it does not cause security issues since the collision is hard to find, even if attackers copy the location shift information.

There has been some prior work to use non-ECC helper data to reduce failure rates - in particular the Helper Data Algorithm (HDA) [8] and the fuzzy extractor [9]. In Table 5.6 we show a comparison of these algorithms with our RWA algorithm. The reported error rates, memory overhead, and failure rates come from the relevant papers. For our RWA algorithm, we show a range of error rates and the respective failure rates. As can be seen, HDA produces 128-bit PUF responses with a failure rate of $10^{-6}$. However, the memory overhead of HDA is 317 times that of RWA. If a nonce
### Table 5.5: Performance Comparison of Our Solutions

<table>
<thead>
<tr>
<th>solution</th>
<th>memory space (MB)</th>
<th>average 10 errors</th>
<th>average 12 errors</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>false positive</td>
<td>false negative</td>
</tr>
<tr>
<td>No ECC</td>
<td>15.26</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>ECC only</td>
<td>38.15</td>
<td>0</td>
<td>1.71×10^{-2}</td>
</tr>
<tr>
<td>ECC &amp; BF</td>
<td>15.85</td>
<td>2.54×10^{-4}</td>
<td>9.88×10^{-4}</td>
</tr>
<tr>
<td>Ring Weight</td>
<td>4.53</td>
<td>8.81×10^{-2}</td>
<td>5.18×10^{-2}</td>
</tr>
<tr>
<td>Ring Weight (Opt.)</td>
<td>5.25</td>
<td>8.20×10^{-4}</td>
<td>1.66×10^{-4}</td>
</tr>
</tbody>
</table>

is applied to the CRPs, the memory space taken by HDA would be a heavy burden to a database. Moreover, RWA can authenticate one million CRPs in 9 seconds on average while HDA takes 206 seconds. The fuzzy extractor algorithm needs 450 ring oscillators to build one 128-bit response with a failure rate of 10^{-6}. As the bit error rate increases, more ring oscillators are required to keep a low failure rate. However, RWA does not rely on hardware. In addition, though not reported in [9], the calculation complexity of the fuzzy extractor is much higher than that of RWA. With much smaller memory space, RWA can be applied to any type of PUF. Therefore, HDA and the fuzzy extractor have good applications in PUF-based key generation where reduced failure rate is important. However, the large size and computational complexity, make it impractical for large-scale chip authentication where a database would need to store multiple challenges for each device to be authenticated. Conversely, RWA exhibits the best trade-off and can be an acceptable PUF post-processing solution for IC authentication.
### Table 5.6: Overall Comparison with Other Solutions

<table>
<thead>
<tr>
<th></th>
<th>HDA</th>
<th>fuzzy extractor</th>
<th>RWA</th>
</tr>
</thead>
<tbody>
<tr>
<td>PUF type</td>
<td>SRAM PUF</td>
<td>RO PUF</td>
<td>any type</td>
</tr>
<tr>
<td>error rate</td>
<td>15%</td>
<td>9%</td>
<td>21%</td>
</tr>
<tr>
<td>helper data memory overhead (bit)</td>
<td>13952</td>
<td>450</td>
<td>770</td>
</tr>
<tr>
<td>PUF bits required for 128 bits</td>
<td>1536</td>
<td>450</td>
<td>770</td>
</tr>
<tr>
<td>failure rate</td>
<td>$10^{-6}$</td>
<td>$10^{-6}$</td>
<td>$10^{-5}$ to $10^{-3}$</td>
</tr>
<tr>
<td>computing time ($\mu s$ per response)</td>
<td>205</td>
<td>not mentioned</td>
<td>9</td>
</tr>
</tbody>
</table>

### 5.4.9 Performance of RWA with PUF Implementations

To take the measurement noise, bit bias, and other factors into consideration, we evaluated the performance of RWA with real CRPs from ring oscillator (RO) PUFs [1] and arbiter PUFs [39]. We implemented three RO PUFs on a Kintex-7 FPGA and applied 128-bit challenges to generate 128-bit responses. All the FPGA LUT-based inverters in the ROs are well-placed to guarantee the equivalent delays between any selected pair. Every challenge, which is generated by a pseudo random number generator, is measured 20 times on each PUF. The first 10 responses are used to generate a baseline while the average value of the last 10 responses is regarded as an unauthenticated response under test. As a strong PUF, each PUF provides 100,000 challenge-response pairs. Table 5.7 shows the response bias, bit error rate, false negative rate and false positive rate. As we mentioned in Figure 5.24, false positive 1 is computed between a baseline and a random response with the same bias while false positive 2 uses a random response with a fixed 50% bias. Note, the computed value of false negative from RO
PUF 3 is shown as 0, but it should be regarded as less than $1.00 \times 10^{-5}$ as we only authenticate 100,000 groups of data. We also evaluated the use of arbiter PUFs, which are also implemented on Kintex-7 FPGA. Though all the MUX cells are well-placed, the responses of arbiter PUFs still have more than 5% bit bias. Compared with the simulation results in Figure 5.24 and Figure 5.26, the measurement errors of both false negative rates and false positive rates are acceptable. These results show that RWA is effective in authenticating real PUFs.

<table>
<thead>
<tr>
<th>PUF type</th>
<th>bias</th>
<th>bit error rate</th>
<th>false negative</th>
<th>false positive 1</th>
<th>false positive 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>RO PUF 1</td>
<td>54.68%</td>
<td>11.56%</td>
<td>$1.09 \times 10^{-3}$</td>
<td>$8.40 \times 10^{-4}$</td>
<td>$8.90 \times 10^{-4}$</td>
</tr>
<tr>
<td>RO PUF 2</td>
<td>49.76%</td>
<td>7.13%</td>
<td>$4.80 \times 10^{-4}$</td>
<td>$8.00 \times 10^{-4}$</td>
<td>$9.00 \times 10^{-4}$</td>
</tr>
<tr>
<td>RO PUF 3</td>
<td>51.06%</td>
<td>3.12%</td>
<td>$&lt; 1.00 \times 10^{-5}$</td>
<td>$8.30 \times 10^{-4}$</td>
<td>$9.10 \times 10^{-4}$</td>
</tr>
<tr>
<td>arbiter PUF 1</td>
<td>41.26%</td>
<td>3.31%</td>
<td>$&lt; 1.00 \times 10^{-5}$</td>
<td>$6.40 \times 10^{-4}$</td>
<td>$5.80 \times 10^{-4}$</td>
</tr>
<tr>
<td>arbiter PUF 2</td>
<td>44.71%</td>
<td>4.22%</td>
<td>$6.00 \times 10^{-5}$</td>
<td>$8.10 \times 10^{-4}$</td>
<td>$9.10 \times 10^{-4}$</td>
</tr>
<tr>
<td>arbiter PUF 3</td>
<td>41.52%</td>
<td>2.89%</td>
<td>$3.00 \times 10^{-5}$</td>
<td>$7.10 \times 10^{-4}$</td>
<td>$7.00 \times 10^{-4}$</td>
</tr>
</tbody>
</table>

If we were willing to tolerate large numbers of bit errors in the responses, one could conceivably improve the false negative rates significantly. The following analysis provides an estimate of bounds on the false negative rates with errors. We collected one million responses from RO PUF 1, and unlike the NO ECC solution in Table 5.5, this
time, a threshold is set as the fault tolerance method. In other words, if the compared responses have fewer unmatched bits than the threshold, the authentication regards the response as a success. Since the bit errors in RO PUF 1 follow a normal distribution, a few responses with 20 or more errors require a large threshold. As shown in Table 5.8, we need to tolerate more than 1/4 fault bits of the response. As a result, even if we set no constraints on the number of errors and storage space, the performance can be hardly improved. Applying ECC or other helper data schemes to improve the performance increases the storage needs significantly which makes them less practical in industry settings. On the other hand, our RWA approach achieves comparable failure rates with reduced storage than this simple threshold-based scheme or other more advanced schemes.

Table 5.8: False Negative of Bit-by-bit Comparison

<table>
<thead>
<tr>
<th>threshold</th>
<th>false negative</th>
<th>threshold</th>
<th>false negative</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 bit</td>
<td>1</td>
<td>20 bits</td>
<td>8.65 × 10^{-2}</td>
</tr>
<tr>
<td>4 bits</td>
<td>9.98 × 10^{-1}</td>
<td>24 bits</td>
<td>1.30 × 10^{-2}</td>
</tr>
<tr>
<td>8 bits</td>
<td>9.47 × 10^{-1}</td>
<td>28 bits</td>
<td>1.06 × 10^{-3}</td>
</tr>
<tr>
<td>12 bits</td>
<td>7.01 × 10^{-1}</td>
<td>32 bits</td>
<td>4.30 × 10^{-5}</td>
</tr>
<tr>
<td>16 bits</td>
<td>3.26 × 10^{-1}</td>
<td>36 bits</td>
<td>1.00 × 10^{-6}</td>
</tr>
</tbody>
</table>
5.5 Conclusions

In this chapter, three solutions are presented for industrial authenticating challenge-response pairs from physically unclonable functions. Our simulation, based on one million PUFs, shows that a practical error correcting and Bloom filter matching scheme achieves $10^{-4}$ false positive rate, $10^{-2}$ false negative rate, and a data rate of 0.58. Our proposed fault tolerant ring weight scheme is implemented using three rings which indicate weight, location and response values. With different types of optimization, it can achieve a false positive rate of $10^{-4}$ and a false negative rate of $10^{-4}$ with a data rate of 0.74. We also apply true CRPs from RO PUFs and arbiter PUFs to show that RWA is effective in practice. Moreover, the new scheme, which does not rely on ECC, requires little computing time and memory space. The lower bound and upper bound of collisions are acceptable in the case of real industry needs. For a given amount of storage space, our scheme provides the best authentication accuracy.
Chapter 6

PUF Initialization Table for Robust Authentication and Key Generation

6.1 Introduction

In the previous chapter, we proposed a novel PUF design. Though the bit error rate has been reduced significantly, post-processing is still required for many applications. In this chapter, we propose a PUF initialization Process (PIP) to record the entire feature of a PUF, which can improve the stability in different applications. To evaluate the practicality, we provide two PIP-based authentication solutions with very low false positive and false negative rate. Additionally, we combine the error correction and bit selection scheme together by proposing a PIP-based key generation solution using our floating thresholding algorithm and helper data, which improves the stability of responses for various PUFs.
6.2 PUF Initialization Process

When PUF applications require reproducible responses with the same challenges, storing CRPs is a direct, high-cost and unsecure storage method. If we take into account the instability feature, which needs a strong ECC for post-processing, the whole solution becomes more time-consuming and memory-wasting. In this section, we provide a simple but practical data structure to record the useful features of PUFs, which is regarded as a fundamental element of improving the bit selection efficiency and PUF stability.

6.2.1 PUF Initialization Table Generation

The PUF initialization table (PIT) is a critical part in our solution, as the basic idea of PIP is to test the stability of all the RO pairs combinations in a PUF in order to generate a PIT with trusted output values. To create the PIT, we use an initialization challenge generator (ICG) to feed our optimized RO PUF with repeated challenges. Algorithm 4 states the details of PIP: We first fix the higher address register addr_A and change the lower address register addr_B from 0 to 127. After 1,000 repeated measurements, the register values of CounterL, CounterS, and CounterE are recorded in the PIT, in which we set the base address to addr_A, and the offset address ranges from 0 to 127. Then the test will restart between the next RO in array A and the 128 ROs in array B. When the PIP is completed, the PIT contains an entire feature copy of a PUF, which can be used for authentication and key generation. It should be noted that other PUFs do not
need to record *CounterE*, as they are unnecessary to generate output by comparing the counter values.

### 6.2.2 Analysis of PIT Model

When we downloaded the same bitstream file into the same FPGA, the generated PIT had minor changes. A solution is to calculate the average value of each parameter among multiple PITs from the same PUF. Here we use 10 samples to build the model. This PIT model presents more accurate features of a PUF compared to a random PIT. Data analysis of the PIT model is given in Table 6.1. Since all 16,384 RO pairs are tested 1,000 times, the sum of the three states is 1,000 for any selected challenge address. Each state has a counter value ranging from 0 to 1,000. In our 3-state PUF, *larger, smaller*, and *equal* state have roughly equal percentages. Among the 16,384 RO pairs, there are more than 3,000 *larger* cases and *smaller* cases with a counter value of 1,000 - in other words all 1000 tests give the same *larger* or *smaller* results. There are no cases where the *equal* state was recorded in all 1,000 measurements. This indicates that \( \frac{3105+3014}{16384} = 37.35\% \) RO pairs can generate very stable responses. We call them the confident set. Though it cannot guarantee the same values completely during the PIT regeneration, the stability is enough to identify a 128-bit response.
Algorithm 4 PUF initialization process

1: procedure PIP

2: reset $Counter_L$, $Counter_S$, $Counter_E$, $i$, $addr_A$, $addr_B$

3: loop 1:

4: reset $Counter_A$, $Counter_B$

5: loop 2:

6: read $RO(addr_A), RO(addr_B)$

7: if $clockcycle < 16$ then goto loop 2

8: if $Counter_A > Counter_B$ then

9: $Counter_L ← Counter_L + 1$

10: else if $Counter_A < Counter_B$ then

11: $Counter_S ← Counter_S + 1$

12: else if $Counter_A = Counter_B$ then

13: $Counter_E ← Counter_E + 1$

14: $i ← i + 1$

15: if $i < 1000$ then goto loop 1

16: output $Counter_L, Counter_S, Counter_E$

17: reset $Counter_L, Counter_S, Counter_E$

18: $i ← 0$

19: $addr_B ← addr_B + 1$

20: if $addr_B < 128$ then goto loop 1

21: $addr_B ← 0$

22: $addr_A ← addr_A + 1$

23: if $addr_A < 128$ then goto loop 1.
Table 6.1: Parameters of Our PIT Model

<table>
<thead>
<tr>
<th>Counter A vs. Counter B</th>
<th>larger</th>
<th>smaller</th>
<th>equal</th>
</tr>
</thead>
<tbody>
<tr>
<td>percentage of each state</td>
<td>33.93%</td>
<td>34.16%</td>
<td>31.91%</td>
</tr>
<tr>
<td>state counter value = 0</td>
<td>7384</td>
<td>7178</td>
<td>7838</td>
</tr>
<tr>
<td>state counter value = 1000</td>
<td>3105</td>
<td>3014</td>
<td>50</td>
</tr>
</tbody>
</table>

6.3 PIT-based Authentication Application

A direct application of PIT is PUF-based authentication, which is shown in Figure 6.1. Before embedding an RO PUF into a chip or a system, the PIP must be invoked once to obtain the unique PIT. The ICG becomes active during PIP and enables RO pairs in our predefined sequences. Meanwhile, CounterL, CounterS, and CounterE in the optimization unit collect the data from CounterA and CounterB in the RO PUF. These counter values are sent to FIFO and will be transferred to a database as a PIT. When the PIP finishes, normal CRP tests are launched for authentication. Ideally, our CRPs will only select RO pairs that have been deemed stable by the PIP. However, since we use a Challenge Shift Register to select RO pairs, it requires every 14-bit output from the 128 shifts to select a stable RO pair and the probability of finding such a challenge is only 37.35%. In order to address this problem, we provide block RAM-based solution and threshold-based solution.
6.3.1 Block RAM-based solution

One solution is to replace the challenge shift register with a block RAM (BRAM), which has a width of 14 bits and a depth of 128. Since there is no constraint on the input pattern, a challenge can be generated in the database by selecting 128 addresses from the confident set randomly. Then the $14 \times 128$ bits challenge is sent to the block RAM in the FPGA. The counter values are read as responses for authentication. Table 6.2 gives a typical protocol of the communication and the authentication. Since the challenges will only select stable RO pairs, a counter value of 1,000 means 100% stable. Other than that, the probability of the counter value follows half of the normal distribution $N(1000, 1)$ from 0 to 1,000 according to our tests. That is to say, when the counter value is located as 999, 998, and less than 998, it roughly stands for a confidence interval of 68%, 27%, and 5% in the normal distribution. Therefore we can simplify the response by using 3-bit coding instead of transferring counter values directly. The highest bit
stands for the response output. The lower two bits present the stability of the highest bit. Every value is multiplied with their corresponding score during authentication.

To decide whether a CRP passes or fails, the sum of these multiplication results is compared with a threshold score. We find a balance between false positive and false negative by testing different scores and thresholds.

<table>
<thead>
<tr>
<th>range of counter value</th>
<th>tolerance interval</th>
<th>output</th>
<th>score</th>
</tr>
</thead>
<tbody>
<tr>
<td>$CounterL = 1000$</td>
<td>$\mu$</td>
<td>111</td>
<td>0</td>
</tr>
<tr>
<td>$CounterL \in [999, 1000)$</td>
<td>$[\mu - \sigma, \mu)$</td>
<td>110</td>
<td>1</td>
</tr>
<tr>
<td>$CounterL \in [998, 999)$</td>
<td>$[\mu - 2\sigma, \mu - \sigma)$</td>
<td>101</td>
<td>3</td>
</tr>
<tr>
<td>$CounterL \in [0, 998)$</td>
<td>$[0, \mu - 2\sigma)$</td>
<td>100</td>
<td>5</td>
</tr>
<tr>
<td>$CounterS = 1000$</td>
<td>$\mu$</td>
<td>011</td>
<td>0</td>
</tr>
<tr>
<td>$CounterS \in [999, 1000)$</td>
<td>$[\mu - \sigma, \mu)$</td>
<td>010</td>
<td>1</td>
</tr>
<tr>
<td>$CounterS \in [998, 999)$</td>
<td>$[\mu - 2\sigma, \mu - \sigma)$</td>
<td>001</td>
<td>3</td>
</tr>
<tr>
<td>$CounterS \in [0, 998)$</td>
<td>$[0, \mu - 2\sigma)$</td>
<td>000</td>
<td>5</td>
</tr>
</tbody>
</table>

To prove the false negative bound, we compare the PIT model with five PITs generated by random measurements. The average occurrence times of score 1, 3, and 5 are marked as $i_m$, $j_m$, and $k_m$. In the random challenge generation process without selecting repeated RO pairs, a set $(i, j, k)$ is chosen for authentication, in which $i \in [0, i_m]$, $j \in [0, j_m]$, $k \in [0, k_m]$. Among all the combinations of $(i, j, k)$, some subsets
\((\hat{i}, \hat{j}, \hat{k})\) fail in the authentication. Therefore, the false negative rate is

\[
f_n = \sum_{(\hat{i}, \hat{j}, \hat{k})} \binom{L+S-i_m-j_m-k_m}{128-i-\hat{j}-\hat{k}} \binom{i_m}{\hat{i}} \binom{j_m}{\hat{j}} \binom{k_m}{\hat{k}}
\tag{6.1}
\]

The subsets also need to meet the following constraints:

\[
\hat{i} + \hat{j} + \hat{k} \leq \text{response bit width} \tag{6.2}
\]

\[
\hat{i} + \hat{j} \times 3 + \hat{k} \times 5 > \text{threshold score} \tag{6.3}
\]

The false positive proof is based on a scenario that the attacker knows our algorithm and only the 111 and 011 outputs are used to maximize the collision rate. Therefore, in the set \((i, j, k)\), both \(i\) and \(j\) are 0. Collision occurs when no more than \(k\) values are mismatched in order to meet the constraint \(k \times 5 \leq \text{threshold score}\). In that case, the false positive is

\[
f_p = \sum_{k=0}^{k_i} \frac{1}{2^{128}} \binom{128}{k}
\tag{6.4}
\]

6.3.2 Threshold-based solution

In another solution, 128-bit challenges are generated randomly without referring to the values in a PIT so that we do not need to modify the Challenge Shift Register in the RO PUFs. Instead, the values of CounterL, CounterS, and CounterE are sent to the
database directly. This is because unstable RO pairs are included by random challenges.

In order to tolerate the increasing instability, the authentication requires more accurate information. We also notice that the distributions of the counter values from unstable RO pairs are not as stable as those of stable RO pairs, which means using the confidence interval of a fixed normal distribution curve is impractical. Therefore, we propose an improved solution in Algorithm 5. We first check if the counter value of the larger state or smaller state has high stability in the corresponding address of the PIT model. Slight floating measurement errors are tolerated. When both states show low stability, we need to check all three states to evaluate the confidence degree. If no constraint is met, it is regarded as a failed RO pair. A CRP will pass authentication only when the failure number is smaller than a threshold $t$.

Algorithm 5 Segment of threshold-based authentication

1: procedure THRESHOLD

2: $score = 0$

3: for $i \in \{0, \ldots, 127\}$ do

4: get PIT $addr$ from the challenge

5: if $PIT(addr).CounterL > 800$ and $\Delta CounterL < 100$ then $n \leftarrow n$

6: else if $PIT(addr).CounterS > 800$ and $\Delta CounterS < 100$ then $n \leftarrow n$

7: else if $\Delta CounterL + \Delta CounterS + \Delta CounterE < 500$ then $n \leftarrow n$

8: else $n \leftarrow n + 1$

9: if $n \leq t$ then CRP passes
To calculate the false negative bound in this solution, we need to know the number of failed RO pairs between the PIT model and an unauthenticated PUF. Marked as \( m \), the number ranging from \( t + 1 \) to \( \min\{ m, 128 \} \) will cause an authentication failure. The false negative rate can be written as

\[
fn = \sum_{n=t+1}^{\min\{ m, 128 \}} \frac{\binom{16384 - m}{128 - n} \binom{m}{n}}{\binom{16384}{128}}
\] (6.5)

The false positive rate is similar to Equation 6.4, but the probability of meeting the constraint is not fixed to \( \frac{1}{2} \). We use \( p \) to represent the probability of a random collision in 128 trials.

\[
fp = \sum_{n=0}^{t} \binom{128}{n} (1 - p)^n p^{128 - n}
\] (6.6)

Traditional PUF applications require significant storage space for CRPs. For the same architecture of our RO PUF, our PIT can generate \( \binom{16384}{128} \) challenges given a fixed 128-bit data length. However, traditional authentication methods need almost infinite memory to store all the CRPs. By sticking to only stable challenges from the confident set, the memory size is reduced to \( \binom{6119}{128} \times 2 \times 128 = 8.23 \times 10^{264} \) MB. It still takes 2.50 MB to store only 10,000 CRPs for one PUF. Meanwhile, a small CRP set is not a secure way for authentication. If all the CRPs in the memory are used, a replay attack will be 100% successful. However, our PIT solution only requires 480 KB memory size to generate all possible CRPs for authentication. All we need to store are the initialized counter values (30 bits) for 128 \( \times \) 128 RO pairs. It can be even smaller (48 KB) if we
replace the counter value with a 3-bit threshold. Another advantage of using PIT is, with a set of $3.37 \times 10^{268}$ stable responses, we can apply a cryptographic nonce to our challenge generation in order to prevent replay attack.

6.4 PIT-based Key Generation Solution

PIT provides sufficient accuracy for authentication purposes. However, PUF-based key generator is much more difficult to be implemented. Given the same input, PUFs are required to consistently produce exactly the same output, which has been proved to be almost impossible due to the instability. To eliminate the bit errors, error correction codes are widely applied as industrial solutions. But the hardware memory of a PUF cannot afford the huge overhead when a large amount of keys are generated. On the other side, bit selection schemes hold a strong capability of avoiding using the most unstable bits but fail to eliminate the remaining errors. With each advantage in mind, we propose a PIT-based key generator in this section by combining these two solutions together.

6.4.1 PIT-based Floating Thresholding Bit Selection Scheme

Instead of selecting the most stable bits only, which reduces the CRP entropies significantly, we focus on finding a floating threshold for the current PUF response to minimize the effect from the systematic variation and the environmental noise. Our target is to reach a balance among the stability, entropies, and resources. The proce-
dure begins with generating five PITs from the same PUF, as shown in Algorithm 6. Before each PIT generation, we launch an entire PUF reset except the processor and the memory in order to enhance the accuracy of the stability estimation. This is because the RO PUF responses measured after physical reset show poorer stability than those running continuously without poweroff according to our tests. All five PITs are stored in the memory temporarily. Then, we fetch the counter values of the larger, smaller, and equal states from the first RO pair of each PIT. These five groups of data can be presented as five points on a 3D surface, which are shown as the red marks in Figure 6.2. Meanwhile, the 16,384 blue marks provide a clear distribution of counter values from a randomly selected PIT. In the next step, the processor calculates the center and radius of an enclosing circle for the five points. As we set the PIT loop to 1,000 \((counter_{\text{larger}} + counter_{\text{smaller}} + counter_{\text{equal}} = 1,000)\), all the points are enclosed in a circle instead of a sphere. The circle can be regarded as the possible range of counter values for the first RO pair. Since the distribution shows a convergence trend to two lines \((counter_{\text{larger}} + counter_{\text{equal}} = 1,000 \text{ and } counter_{\text{smaller}} + counter_{\text{equal}} = 1,000)\), we can simply consider the \(counter_{\text{larger}}\) or \(counter_{\text{smaller}}\) within the overlapped range between the circle and the two lines. After storing the overlapped boundary in the memory, we repeat the process for the remaining 16,383 groups of data and accumulate the new overlapped boundaries to the previous ones. Once completed, we obtain the possible location distributions of the larger state and smaller state for all the RO pairs in a PUF, which are shown as group 1 in Figure 6.3. The cumulative value of 0 is about
7,800 while the cumulative value of 1,000 is about 3,100. For the remaining part, we can find the minimum value and set the x-axis as a floating threshold. Particularly, the locations are 860 and 909 for the larger state and smaller state in group 1. Thus, we connect (860,0,140) and (0,909,91) with a magenta line in Figure 6.2 as the floating threshold of group 1. Now we select 128 points by hashing a challenge plus an offset. If a point is below the floating threshold and very close to either of the two blue lines, it generates a 1-bit response. Otherwise, it is regarded as unstable and will be ignored during the key generation. For either case, the challenge is incremented by 1 to generate a new hash value to select the next point.

Fig. 6.2: 3D distributions of counter values and the floating threshold generation of three states for the RO PUF responses
Algorithm 6 Segment of floating thresholding solution

1: procedure FLOATING THRESHOLDING

2: receive Challenge

3: for $i \in \{1, \ldots, 5\}$ do

4: reset PUF except MicroBlaze and SDRAM

5: generate PIT($i$) and store it in SDRAM

6: for $j \in \{1, \ldots, 16384\}$ do

7: fetch RO pair($j$) counter values from PIT(1, 5)

8: calculate the enclosing circle of the five points

9: cumulate and store the overlapped boundary

10: if Helper Data is invalid then

11: set the minimum boundaries as floating threshold

12: select 128 stable points with hash(Challenge++)

13: generate Helper Data for less stable RO pairs

14: output Response and Helper Data

15: else

16: regenerate the floating threshold with new PITs

17: decode Helper Data

18: select 128 stable points with hash(Challenge++)

19: output Response after error correction
Fig. 6.3: Possible location distribution of larger states and smaller states based on two groups of RO PUF test results

After group 2 is generated in Figure 6.3, a new floating threshold is plotted in Figure 6.2 as the cyan line. In this case, the new floating threshold is very close to the original magenta line. If the counter position of an RO pair changes between the stable and unstable state easily, we define it as a marginal point. Among all 16,384 points, there are about five whose center points stay on different sides of the floating thresholding lines. In the worst case, we assume that any of these five RO pairs can cause a failure when they are selected. Then the upper bound of the failure rate in 128-bit keys is

$$P_{\text{floating threshold}} = 1 - \left( \frac{7424}{128} \right) \left( \frac{7429}{128} \right) = 0.0833$$  \hspace{1cm} (6.7)

in which 7429 is the number of points that are located below the threshold line in group
1. On the other hand, if we use a fixed threshold, the failure rate is

\[ p_{\text{fixed threshold}} = 1 - \left( \frac{\binom{7524}{128}}{\binom{7622}{128}} \right) = 0.8118 \]  

(6.8)

6.4.2 PIT-based Floating Thresholding Error Correction Scheme

If challenges are generated by the PUF itself, we can avoid selecting the RO pairs with marginal points near the thresholds. Then, there is no extra overhead for error correcting. However, for the general case when challenges are provided from another device, helper data is required to correct any potential bit error in the responses. Unlike traditional ECCs, which address the entire binary string, we apply a low-cost helper data that only focuses on the sensitive bits generated by the points near the thresholds. Given the top five RO pairs with potential error correction requirements in our example, we rank them in order of stability. In Figure 6.4, they are located in the 46th, 116th, 47th, 24th, and 42nd bit in a 128-bit key circle with black buckets. To generate the helper data of the 46th bit, we set a start point at 0 and a distance. Now we search for another four locations with the equivalent distance between each, one of which must locate in 46, but none of which locates in 24, 42, 47, and 116. With a distance of 23, these five red buckets in the middle circle contain only one point near the floating threshold. Once a key is generated, we apply XOR operations to the bits located in Set 1 \{0, 23, 46, 69, 92\}, which will generate 1-bit codeword. To generate Set 2 and the codeword for the next bit located in 116, we shift the start point to 1 and repeat the...
previous steps. Since the distance of 81 is too large to generate the remaining three locations, we can use the least positive remainder after divided by 128. Thus the bits in the location of \{1, 82, 35, 116, 69\} generate the second codeword, which are shown as the orange buckets in the middle and external ring. Similarly, we select the green, blue, and purple buckets for the sets related to 47, 24, and 42.

Fig. 6.4: Helper data generation for our floating thresholding key generation solution

Given the start point \((s)\), locations \((L_n)\), and distances \((D_n)\), the helper data generation should follow

\[
\exists ! L_n = s + n - 1 + i \times D_n \quad i \in [0, m - 1] \forall L_n \neq s + n - 1 + i \times D_n \quad i \in [0, m - 1] n \in [1, n_{max}] \quad \bar{n} \in [1, n_{max}] \quad n \neq \bar{n}
\]  

(6.9)
in which the $n_{\text{max}}$ is the number of marginal points and $m$ is the number of buckets selected for XOR operations. In our example, both $n_{\text{max}}$ and $m$ are set to 5. When a key is generated for the first time, the input helper data is all 0 and regarded as invalid. The helper data, including start point, locations, and parity, is generated and output with its response. During the regeneration, the same helper data is received along with the original challenge. While most incorrect RO pair selections are solved by the new floating threshold, the rest ones can be corrected by the decoded helper data. Comparing these two groups of data, few points may stay on one side of the first threshold but the other side of the second threshold. If a marginal point is selected to generate one bit and it stays on the same side of a new threshold during the key regeneration, it can be easily verified by the decoded locations. Once it merges to the other side, there is 1-bit right shift for the binary string after this point. With the information of the recorded marginal points, this error can be detected and corrected easily. Given a marginal point ignored by the original key, it can be aware by the decoded locations during the key regeneration unless there is a location collision. But we are still able to detect an error since the entire locations and the parities are not matched. To correct the potential 1-bit right shift, we can try to ignore the recorded marginal point and check the new response with the helper data. In order to minimize the collision rate, we apply an additional checksum to the original key. For example, the checksum can be a 10-bit parities or hash value generated by the entire key. If there are several bit strings that match the locations and
the parities from the decoded helper data, we can compare the checksum of the new key with the original one to find out which one should be the correct key.

The exploding memory utilization of PUF-based key generation is more critical than other applications because of the highest accuracy requirement. Traditional solutions usually apply BCH ECC to correct as many as $t$ bit errors in 128-bit responses. If a typical RO PUF has a bit error rate of 9% in average, and the error number distribution requires an ECC with 18-bit correction capability. The number of parity-check digits $r$ can be calculated as $r = n - k = 2^m - 1 - 128 \leq mt = 18t$. The smallest $t$ that meets the inequality is 8, which means $r$ is at least 127 bits. According to our measurement, the failure rate is higher than $10^{-2}$. To archive a lower failure rate, the overhead can be much larger than the generated key. The linear expanding storage requirement limits the application. For the floating thresholding solution with a key length of 128 bits, the start point is $\log_2 128$ bits. Since the distances are variable, we fix the length using the largest one $D_{\text{max}}$. $D_{\text{max}}$ is recommended to be set to 127 according to our simulation, as it is large enough to find all the locations. The codeword also includes $n_{\text{max}}$-bit parities and $c$-bit checksum. Thus, the entire codeword has $7 + \log_2 D_{\text{max}} \times n_{\text{max}} + n_{\text{max}} + c$ bits.

### 6.4.3 Security Consideration of PIT-based solutions

Apart from the increasing cost of resources, PUF applications also suffer from being attacked or modeled as the PUF information stored in the database can be easily exposed to adversaries. For our key generation solution, this vulnerability can be safely ignored
as a PIT will be deleted after generating sufficient pairs of challenges and helper data. Unlike ECC, the remaining data will not disclose any information that could be used to recover the relevant responses.

With respect to PUF components, most modeling attacks are based on the assumption that the PUF structure is known [17]. Given $k$ ring oscillators in the RO PUF, two of them are selected and compared in order to produce a single output bit. This structure leads to $k(k - 1)/2$ CRPs. For the worst scenario, the CRP interface is public and accessible. Then one can measure all the RO frequencies $(f_1, ..., f_k)$ in ascending order. The approach requires a low-degree polynomial calculation with a complexity of $O(k \cdot \log k)$. However, it is inefficient and expensive to open the package, probe the circuit, and apply the attack in reality. First of all, one needs to identify the PUF type, which is difficult on an FPGA. Secondly, the interface may not release the original information of CRPs. When a hash function is added between the challenge and the RO input, the learning complexity increases. XOR architecture can also be applied between the RO output and the response, which is proved to be effective against the machine learning attacks [17]. If we take these facts into account, the 128-bit keys in our solution are still secure against attacks with limited cost.
6.5 Data Analysis and Evaluation

6.5.1 Experimental Setup

In this section, we continue to use RO PUFs to evaluate our solution step by step, with results measured from three Kintex-7 FPGA based KC705 boards. Note that each FPGA contains six PUFs. The temperature on the FPGA is controlled at 30°C and the voltage is 1.0V. To evaluate the stability, we also use temperature variation instrument to test the PUF with different temperature ranges.

![PUF tests with temperature variation instrument](image)

**Fig. 6.5:** PUF tests with temperature variation instrument

6.5.2 Hardware Resource Utilization

Here, we estimate the hardware cost of PIT-base PUF. The overall utilization on the Kintex-7 is only 1.75% of the available look-up tables and 0.65% of the flip-flops, which is shown in Table 6.3.
Table 6.3: FPGA Resource Usage of One PUF Implementation

<table>
<thead>
<tr>
<th>ref name</th>
<th>used</th>
<th>functional category</th>
</tr>
</thead>
<tbody>
<tr>
<td>FDRE</td>
<td>2640</td>
<td>Flop &amp; Latch</td>
</tr>
<tr>
<td>LUT6</td>
<td>1718</td>
<td>LUT</td>
</tr>
<tr>
<td>LUT1</td>
<td>1026</td>
<td>LUT</td>
</tr>
<tr>
<td>MUXF7</td>
<td>598</td>
<td>MuxFx</td>
</tr>
<tr>
<td>LUT5</td>
<td>346</td>
<td>LUT</td>
</tr>
<tr>
<td>LUT3</td>
<td>203</td>
<td>LUT</td>
</tr>
<tr>
<td>LUT2</td>
<td>191</td>
<td>LUT</td>
</tr>
<tr>
<td>MUXF8</td>
<td>189</td>
<td>MuxFx</td>
</tr>
<tr>
<td>OBUF</td>
<td>132</td>
<td>IO</td>
</tr>
<tr>
<td>LUT4</td>
<td>94</td>
<td>LUT</td>
</tr>
<tr>
<td>FDCE</td>
<td>20</td>
<td>Flop &amp; Latch</td>
</tr>
<tr>
<td>LDCE</td>
<td>14</td>
<td>Flop &amp; Latch</td>
</tr>
<tr>
<td>IBUF</td>
<td>4</td>
<td>IO</td>
</tr>
<tr>
<td>CARRY4</td>
<td>4</td>
<td>CarryLogic</td>
</tr>
<tr>
<td>FDSE</td>
<td>1</td>
<td>Flop &amp; Latch</td>
</tr>
<tr>
<td>BUFG</td>
<td>1</td>
<td>Clock</td>
</tr>
</tbody>
</table>

6.5.3 Randomness

The NIST statistical test suite is a common technique used for entropy source and cryptographic analysis [50]. Each input file contains 100,000 128-bit responses from one PUF. All 18 PUFs with PIP pass the 15 NIST randomness tests, with cumulative
p-values larger than 0.975.

6.5.4 Uniqueness

The uniqueness of the 18 PUFs is shown in Figure 6.6. The comparison is based on the same PUF placement and bitstream in the three FPGAs. Thus, all three PUFs in each group have no circuit level differences. We compare 100,000 128-bit responses from each PUF with the same challenge set. All six groups have 50% ± 0.97% bits (64 bits) flipped on average. The results are also verified by comparing the values of their PIT models.

Fig. 6.6: Inter-PUF hamming distances
6.5.5 Stability

Figure 6.7 provides the stability comparison among the traditional design, lightweight optimized design, and RO PUF with PIP. The raw bit error rate (RBER) in the traditional RO PUF is 7.13% in average. A few responses contain more than 20 bit errors, which necessitates strong fault tolerance or error correction method. With our optimization, the bit error rate decreases significantly. Further, if we only select the stable RO pairs for output according to the PIT model, 76.73% of the responses have no bit errors. Within the temperature range of a normal testing environment, the observed bit error rates of our PUFs with PIP are less than 0.50%.

![Intra-PUF hamming distance](image)

**Fig. 6.7:** Intra-PUF hamming distance

Table 6.4 lists the average bit error rates from 18 PUFs. Again, we mark the six PUFs on the second FPGA as $PUF_6$ to $PUF_{12}$ and those on the third FPGA as $PUF_{13}$ to $PUF_{18}$. Clearly, the stability of PUFs with PIP has been improved greatly compared
to the traditional RO PUFs.

<table>
<thead>
<tr>
<th></th>
<th>PUF 1</th>
<th>PUF 2</th>
<th>PUF 3</th>
<th>PUF 4</th>
<th>PUF 5</th>
<th>PUF 6</th>
</tr>
</thead>
<tbody>
<tr>
<td>Average Bit Error Rate</td>
<td>0.33%</td>
<td>0.47%</td>
<td>0.43%</td>
<td>0.37%</td>
<td>0.40%</td>
<td>0.35%</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>PUF 7</th>
<th>PUF 8</th>
<th>PUF 9</th>
<th>PUF 10</th>
<th>PUF 11</th>
<th>PUF 12</th>
</tr>
</thead>
<tbody>
<tr>
<td>Average Bit Error Rate</td>
<td>0.31%</td>
<td>0.32%</td>
<td>0.29%</td>
<td>0.36%</td>
<td>0.37%</td>
<td>0.30%</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>PUF 13</th>
<th>PUF 14</th>
<th>PUF 15</th>
<th>PUF 16</th>
<th>PUF 17</th>
<th>PUF 18</th>
</tr>
</thead>
<tbody>
<tr>
<td>Average Bit Error Rate</td>
<td>0.34%</td>
<td>0.43%</td>
<td>0.45%</td>
<td>0.41%</td>
<td>0.31%</td>
<td>0.38%</td>
</tr>
</tbody>
</table>

6.5.6 Bias

The response bias to ‘1’ in the traditional RO PUF design has an offset from −9% to 8% according to our tests - i.e. the PUF produces 41% to 58% 1’s instead of the ideal 50%. Apart from the physical uncontrollability, the close frequency of RO pairs intensifies instability. Using our PIT architecture reduces the bias significantly to 50% ± 1.82%.

There is a slight bias towards ‘0’ in our PITs as shown in Figure 6.8. For example, in the PIT model of PUF 1, there are 3105 0’s and 3014 1’s, which means the bias of a random response is 50% − 0.73%. To achieve a bias of 50%, we can select the value of ‘0’ or ‘1’ in the PIT model when generating challenges. Clearly, the bias is well-controlled in our design.
6.5.7 Result of PIT-based Authentication

The performance of our authentication solutions is tested on our RO PUFs with PIP. In the BRAM-based solution, the threshold score is set to 39 in order to balance the failure rate and the collision rate. The results from six PUFs are given as an example in Table 6.5. There are few mismatched values among all the stable RO pairs. On average, only 29 combinations of RO pairs lead to authentication failures. Using Equations 6.1 and 6.4, we can calculate the false negative and false positive rates and we see that they are smaller than $10^{-26}$. In the threshold-based solution, $m$ is set to 68, the maximum number of failed RO pairs in any PUF. If we set the threshold, $t$, to 30, using equations 6.5 and 6.7, the false negative and false positive rates are smaller than $10^{-47}$. Specially, attacks on the first solution do not reply on the PIT model, so the bound in the BRAM-based false positive rates are the same for all PUFs. However, the collisions in the threshold-based solution is related to the PIT model, which leads to different bound
in threshold-based false positive rates. Finally, we verified both solutions by testing 10 million random CRPs and no failure is detected during the authentication.

### Table 6.5: Results of Authentication with PIP

<table>
<thead>
<tr>
<th></th>
<th>PUF 1</th>
<th>PUF 2</th>
<th>PUF 3</th>
<th>PUF 4</th>
<th>PUF 5</th>
<th>PUF 6</th>
</tr>
</thead>
<tbody>
<tr>
<td>stable RO pairs</td>
<td>6119</td>
<td>6302</td>
<td>6262</td>
<td>6183</td>
<td>6291</td>
<td>6201</td>
</tr>
<tr>
<td><strong>BRAM-based Solution</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>$(i_m, j_m, k_m)$</td>
<td>(17,9,1)</td>
<td>(34,5,0)</td>
<td>(22,7,1)</td>
<td>(28,10,0)</td>
<td>(20,6,2)</td>
<td>(25,3,2)</td>
</tr>
<tr>
<td>$(\hat{i}, \hat{j}, \hat{k})$</td>
<td>29</td>
<td>22</td>
<td>23</td>
<td>70</td>
<td>23</td>
<td>7</td>
</tr>
<tr>
<td>false negative</td>
<td>$3.62 \times 10^{-27}$</td>
<td>$2.08 \times 10^{-45}$</td>
<td>$2.91 \times 10^{-33}$</td>
<td>$4.46 \times 10^{-28}$</td>
<td>$3.53 \times 10^{-30}$</td>
<td>$1.25 \times 10^{-41}$</td>
</tr>
<tr>
<td>false positive</td>
<td>$2.95 \times 10^{-28}$</td>
<td>$2.95 \times 10^{-28}$</td>
<td>$2.95 \times 10^{-28}$</td>
<td>$2.95 \times 10^{-28}$</td>
<td>$2.95 \times 10^{-28}$</td>
<td>$2.95 \times 10^{-28}$</td>
</tr>
<tr>
<td><strong>threshold-based Solution</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>m</td>
<td>53</td>
<td>49</td>
<td>52</td>
<td>68</td>
<td>60</td>
<td>47</td>
</tr>
<tr>
<td>p</td>
<td>0.1315</td>
<td>0.1137</td>
<td>0.1182</td>
<td>0.1250</td>
<td>0.1206</td>
<td>0.1093</td>
</tr>
<tr>
<td>false negative</td>
<td>$3.79 \times 10^{-53}$</td>
<td>$9.68 \times 10^{-55}$</td>
<td>$1.58 \times 10^{-53}$</td>
<td>$1.65 \times 10^{-48}$</td>
<td>$9.00 \times 10^{-51}$</td>
<td>$1.27 \times 10^{-55}$</td>
</tr>
<tr>
<td>false positive</td>
<td>$1.06 \times 10^{-59}$</td>
<td>$1.25 \times 10^{-65}$</td>
<td>$4.82 \times 10^{-64}$</td>
<td>$9.22 \times 10^{-62}$</td>
<td>$3.19 \times 10^{-63}$</td>
<td>$3.03 \times 10^{-67}$</td>
</tr>
</tbody>
</table>

#### 6.5.8 Result of PIT-based Key Generation

Seven key generation HDAs are compared in Table 6.6. According to our tests, the failure rate of the traditional RO PUF is not good enough even after we apply a non-standard BCH(244, 128, 15) ECC. Both [8] and [9] have a failure rate of $10^{-6}$, but they require large amounts of helper data. In [7], BCH(63, 24, 7) ECC is used to reach a failure rate of $10^{-38}$. However, the data rate and key length are still limitations. As a comparison, our floating thresholding solution with helper data has quite a low failure
rate. In addition, the key length is variable while the other methods have fixed key. The helper data for each key requires only $7 + 7 \times n_{\text{max}} + n_{\text{max}} + 10$ bits, which means 7-bit start point, 7-bit location for each marginal point, $n_{\text{max}}$-bit parity, and 10-bit checksum. For more than 90% keys with floating thresholds, helper data is not even necessary. Therefore, the average data rate is very low.

**Table 6.6: Overall Comparison of Error Correction and Bit Selection Helper Data Algorithms**

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>PUF type</td>
<td>RO PUF</td>
<td>RO PUF</td>
<td>SRAM PUF</td>
<td>RO PUF</td>
<td>hybrid</td>
<td>RO PUF</td>
<td>RO PUF</td>
</tr>
<tr>
<td>key length (bit)</td>
<td>128</td>
<td>128</td>
<td>128</td>
<td>128</td>
<td>24</td>
<td>128 (variable)</td>
<td>128 (variable)</td>
</tr>
<tr>
<td>PUF cells</td>
<td>128</td>
<td>128</td>
<td>1536</td>
<td>450</td>
<td>not applied</td>
<td>256</td>
<td>256</td>
</tr>
<tr>
<td>helper data (bit)</td>
<td>0</td>
<td>116</td>
<td>13952</td>
<td>1850</td>
<td>39</td>
<td>0</td>
<td>17 + 8 $\times n_{\text{max}}$</td>
</tr>
<tr>
<td>data rate</td>
<td>1</td>
<td>0.52</td>
<td>0.01</td>
<td>0.06</td>
<td>0.38</td>
<td>1</td>
<td>0.98</td>
</tr>
<tr>
<td>RBER</td>
<td>7.13%</td>
<td>7.13%</td>
<td>15%</td>
<td>9%</td>
<td>30%</td>
<td>7.13%</td>
<td>7.13%</td>
</tr>
<tr>
<td>failure rate</td>
<td>1</td>
<td>$4.08 \times 10^{-3}$</td>
<td>$10^{-4}$</td>
<td>$10^{-4}$</td>
<td>$2.01 \times 10^{-30}$</td>
<td>$8.72 \times 10^{-2}$</td>
<td>$9.38 \times 10^{-8}$</td>
</tr>
</tbody>
</table>

We also evaluate our solution on different types of PUFs, as shown in Table 6.7.

To build a PIT with the same depth of 16384, the minimum number of basic cells for each PUF is given for comparison. As we mentioned, two 128-RO arrays are sufficient to construct the RO PUF PIT by selecting one RO from each array ($128^2$). For Arbiter PUFs and BR PUFs, their entropies benefit from the serial architectures, which require only 14-bit challenge input and related units to build a PIT ($2^{14}$). Thus, the basic cell amount should be as least $14 \times 4$ MUXes for 2-XOR Arbiter PUFs and $14 \times 1$ MUX and
DEMUX for BR PUFs. However, the response bits of Butterfly PUFs are independent as the selected cross-coupled latches do not affect the output of others. As a result, 16384 × 2 latches are used in the Butterfly PUF. We also provide the LUT and FF utilization for implementing a PIT for each PUF. To obtain the failure rate, we randomly generate one billion keys for each PIT with the same seed. For the Butterfly PUF that keeps a similar RBER as the RO PUF, its failure rate is slightly reduced, but still on the order of 10⁻⁸. On the other hand, when the RBER is cut in half as with the Arbiter PUF, the failure rate decreases sharply to 10⁻⁹. If the RBER further drops to the degree as that of the Bistable Ring PUFs, no errors were detected in the key set, which means the false negative is smaller than 10⁻⁹. In summary, the performance of this algorithm is most affected by the raw bit error rate in the PUF responses, but its accuracy is sufficient for key generation on different PUFs.

6.6 Conclusions

In this chapter, we proposed a strong solution for PUF authentication and key generation. We investigated the possibility of PUF post-processing methods. We demonstrated the accuracy of PUF responses with initialization tables, which achieves an average bit error rate below 0.48%. By using PIT, the failure rate and the collision rate of authentication also decreased significantly. For the most critical key generation applications, we use PITs and dynamic floating thresholds to select the stable bits as output. With our helper data correction, the failure rates of four different PUF-based
Table 6.7: Performance Comparison of Floating Thresholding Solution with Helper

Data on Four Different Types of PUFs

<table>
<thead>
<tr>
<th>PUF type</th>
<th>RO PUF</th>
<th>Arbiter PUF</th>
<th>Butterfly PUF</th>
<th>BR PUF</th>
</tr>
</thead>
<tbody>
<tr>
<td>key length</td>
<td>128</td>
<td>128</td>
<td>128</td>
<td>128</td>
</tr>
<tr>
<td>basic cells</td>
<td>256 ROs</td>
<td>56 MUXes</td>
<td>32768 Latches</td>
<td>28 DE/MUXes</td>
</tr>
<tr>
<td>LUT</td>
<td>3316</td>
<td>97</td>
<td>38960</td>
<td>75</td>
</tr>
<tr>
<td>FF</td>
<td>2675</td>
<td>68</td>
<td>16912</td>
<td>30</td>
</tr>
<tr>
<td>RBER</td>
<td>7.13%</td>
<td>3.31%</td>
<td>6.80%</td>
<td>1.42%</td>
</tr>
<tr>
<td>failure rate</td>
<td>$9.30 \times 10^{-8}$</td>
<td>$5.00 \times 10^{-9}$</td>
<td>$8.50 \times 10^{-8}$</td>
<td>$&lt; 1.00 \times 10^{-9}$</td>
</tr>
</tbody>
</table>

key generators are reduced to $10^{-8}$. Finally, we evaluate the overall performance of the hybrid solution. The overhead is extremely low compared to the existing algorithms.
Chapter 7

Locality Sensitive Hash Function for Similarity Search and Clustering

7.1 Introduction

In an $e$-world, the rapid growth of data leads to the impossibility of searching large amounts of data by brute force, which leads to an ever-increasing need for fast indexing and searching of large databases. To make large-scale search practical, researchers have explored multiple approximate similarity search techniques. One novel and interesting approach is called the locality sensitive hash function, which solves the approximate or exact near neighbor search problem in high dimensional spaces. LSH hashes input items so that similar items map to the same “buckets” with high probability. Unlike conventional cryptographic hash functions, LSH aims to maximize the probability of a “collision” for similar items.
7.2 Locality Sensitive Hash Function Background

To define a formal definition of a function $h$ in the general LSH family $F$ [51], it should satisfy the following conditions for any two points $p, q$, which belong to a metric space $(M, d)$: if $d(p, q) \leq R$, then the probability that $h(p) = h(q)$ is at least $P_1$; if $d(p, q) \geq cR$, then the probability that $h(p) = h(q)$ is at most $P_2$. A family is interesting when $P_1 > P_2$. Then $F$ is called $(R, cR, P_1, P_2)$ sensitive [52]. The easiest LSH software approach is bit sampling for Hamming distance, which selects a random bit from the input d-dimensional vectors: $\{0, 1\}^d \rightarrow \{0, 1\}$. It has the following parameters: $P_1 = 1 - R/d$, $P_2 = 1 - cR/d$. In spite of easy computation, bit sampling is not a suitable solution for the high bit error rate scenario, as it causes significant false positive and false negative [53]. Another popular LSH is MinHash [54]. By applying the Jaccard index, MinHash is defined as the size of the intersection divided by the size of the union of the sample sets. Thus, $P_i[h(A) = h(B)] = J(A, B) = |A \cap B|/|A \cup B|$. However, it has the same shortage as that of bit sampling. Given 128-bit input strings with 12 different bits, the identification rate is only 0.83, which is not enough for similarity search applications. To achieve better performance, TLSH is proposed to process a byte string by using a sliding window to populate an array of bucket counts and construct the digest body with different binary values [55]. However, it requires an input byte stream with a minimum length of 512 bytes. As a consequence, TLSH cannot be applied to strings with variable length.
7.3 Motivations

As we have mentioned in the previous chapters, FPGA-based PUFs are suitable for authentication and key generation. We would like to extend the similar architecture to a wider range of applications. To meet the essential requirement of identifying similar items, we propose a novel FPGA-based LSH (FLSH) design in this chapter. This new LSH is designed with strong fault tolerant capability and high identification accuracy in mind. The second advantage is that, since our approach uses asynchronous logic in FPGAs instead of the complex processor calculation, it is low-cost and fast compared to the software based LSHs. On the other side, FLSH can be applied to clustering as well. However, to achieve these goals, the implementation on FPGAs requires strict placement and routing in order to control the internal wire delay. Therefore, we explore the FLSH implementation details with complex placement and routing policies. To prove the reality of FLSH-based similarity search and clustering applications, we evaluate the physical characteristics of FLSHs, including randomness, convergence, and fault tolerant capability.

7.4 FLSH Design and Implementation

We present the details of our FLSH design and implementation in this section. Unlike the general software-based LSHs, FLSH does not rely on processors for hash calculation. Instead, it is mainly constructed with lookup tables (LUTs) and latches. This low-cost design is more flexible and can be easily integrated into other hardware system
designs.

### 7.4.1 LUT-based FLSH Array Design

FLSH is built around a 2D array comprised of $M \times M$ cells. The array scale depends on the available hardware resources and the input length. In our example, we choose an architecture that contains $16 \times 16$ cells, which can be implemented in 256 configurable logic blocks (CLBs) of a Kintex-7 FPGA. There are two types of cells: $(M - 2)^2$ cells in the middle are connected with their neighboring cells by four bidirectional signals from different directions; and $4M - 4$ cells on the boundary receive the inputs from the internal cells as well as a $4M$-bit input data. Figure 7.1 shows the interconnect between 9 cells in a $16 \times 16$ array. Each cell takes in the input data and generates a 4-bit hash output that is connected to its neighbors. Because of the asynchronous cycles in the array, it takes time for the outputs to converge to a stable output which is then saved in a $4M^2$-bit FLSH output register. Theoretically, each of the four input wires to a cell should have identical delays. However, due to the physical randomness in the manufacturing process, each FPGA will generate a unique hardware hash function under this architecture. As a summary of the hashing process, in our $16 \times 16$ array, when a input is ready to broadcast, the system enables the 256-cell array for automatic calculation until most of the bits converge. Then it sends the 1,024 bitmap to the FLSH output register.

Figure 7.2 shows the internal structure of an FLSH cell. The cell input $in[3 : 0]$
Fig. 7.1: Topology of FLSH network

comes from the neighboring cell outputs. It is connected to the main logic LUTs, latch LUTs, and shift LUTs in the cell. The logic LUTs (*LUT6 NORTH, LUT6 WEST, LUT6 SOUTH, LUT6 EAST*) determine the 4-bit output value of a cell. Given the difference of wire delay from the neighboring cells, we use *LUT4 LATCH1* to lock the logic LUT output at a certain time through its LDCE (Transparent Data Latch with Asynchronous Clear and Gate Enable). These outputs (*out_north, out_west, out_south, out_east*) are not only the input of the neighboring cells, but also part of the FLSH output. As we know, a 4-bit input can select a 1-bit output from the 16-bit initial values in a LUT. However, according to our tests, if we only use *in[3 : 0]* as logic
LUT input, the output pattern is too simple in term of a hash function. Thus, shift functions \((LUT6 \text{ SHIFT}1, LUT6 \text{ SHIFT}2, LDCE \text{ SHIFT}1, LDCE \text{ SHIFT}2)\) are added to improve the output complexity. Both \(LUT6 \text{ SHIFT}1\) and \(LUT6 \text{ SHIFT}2\) share the input from \(in[3 : 0]\) and their LDCEs. For further timing consideration, we add another latch \(LUT4 \text{ LATCH}2\) to lock \(LDCE \text{ SHIFT}1\) and \(LDCE \text{ SHIFT}2\) output. When the logic LUTs receive the 2-bit shift instruction from the LDCEs, there are four combinations of logic operations. We show an example in Table 7.1. If the shift bits are “00”, the output will be selected from the original 16-bit initial value. When the shift bits change to “01”, we apply a 1-bit left shift to the original initial value. When they are “10”, a 1-bit right shift is executed instead. For the last case “11”, all the 16 bits are flipped to generate new initial values. In this way, we generate the 64-bit initial lookup table for the logic LUTs.

### 7.4.2 FLSH Placement

Place and route is an essential step to control the wire delay in our FLSH in order to meet the critical timing requirement of LSHs. For each cell, the input signals should keep the same delay in order to involve the physical randomness of FPGAs themselves. To clarify this feature, the same challenge can generate unique hash values on different FPGAs, but one FPGA will always provide identical responses to very similar challenges. Instead of using timing constraints to achieve this goal, it is more reasonable to find some CLBs with equivalent delay path between each other. We select 256 CLBs
Fig. 7.2: Basic FLSH cell structure in a CLB

to build a 16 × 16 array in Figure 7.3. Note that there is a visible distortion that the distance in the middle looks larger than others, but the actual paths between those CLBs are very close to each other and can be adjusted by manual routing. Since the scale of the array may be changed according to the input length, manual placement is not efficient. To make the implementation flow more flexible, scripts are used to generate the placement constraints automatically.

According to our design requirement, a CLB should perform the entire function of an FLSH cell. In Kintex-7 FPGAs, a CLB provides two slices. In each slice, there are four 6-input LUTs and eight flip-flops as well as multiplexers and arithmetic carry logic [31]. Therefore, the primitives in a CLB is sufficient to build one FLSH cell. We
**Table 7.1:** Logic LUT Output of A Random FLSH Cell

<table>
<thead>
<tr>
<th>neighboring cell output</th>
<th>output with shift0 = 0</th>
<th>output with shift0 = 1</th>
<th>output with shift0 = 0</th>
<th>output with shift0 = 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0000</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>0001</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>0010</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0011</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>0100</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>0101</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0110</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>0111</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>1000</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>1001</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>1010</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>1011</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>1100</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>1101</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1110</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>1111</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
provide an example in Figure 7.3, which shows how the LUTs and LDCEs locate in a CLB. On the left side, we place \textit{LUT4 LATCH2}, \textit{LUT4 LATCH1}, \textit{LUT6 SHIFT1}, \textit{LUT6 SHIFT2}, \textit{LUT6 NORTH}, \textit{LUT6 WEST}, \textit{LUT6 SOUTH}, and \textit{LUT6 EAST} from top to bottom. \textit{LDCE SHIFT1}, \textit{LDCE SHIFT2}, \textit{LDCE NORTH}, \textit{LDCE WEST}, \textit{LDCE SOUTH}, and \textit{LDCE EAST} locate on the right side. It is important to note that the wires in Figure 7.4 only show the logical connection, which cannot present the correct delay of the post-routing paths. For some critical paths, manual routing is applied for accurate delay control.

**7.4.3 FLSH Routing**

The most challenging step implementing the FLSH architecture is controlling the wire delay between cells. “Matching delays” is a vain hope in any FPGA, as it is not prac-
Even though the placement constraints have been applied, Vivado may select unexpected paths, which leads to additional delay. From the statistical results of our implementations, more than 90% of the delay meets the timing constraints to some degree while the rest do not follow the desired routing path. Though we routed the wires
manually to keep the delay equivalent, it is hard to eliminate the difference entirely. As shown in Figure 7.5, the delay between the center cell and neighboring cells is not exactly the same. To minimize the difference, we apply fixed routing constraints. On the other hand, manual routing is known to be very time-consuming for large scale design. By fixing the routing paths, the implementation efficiency can be significantly improved.

![Critical wire routing among cells](image)

**Fig. 7.5:** Critical wire routing among cells

### 7.5 FLSH Application

LSH has been applied to several problem domains including near-duplicate detection, digital fingerprinting, image similarity identification, and hierarchical clustering. In
this section, we will introduce two typical applications of FLSH: similarity search and clustering.

### 7.5.1 Similarity Search

As a popular similarity search solution, LSH is widely applied for image databases, document collections, time-series databases, and genome databases. Unlike conventional LSHs that can map similar items to the same “buckets” in memory with high probability, FLSH transfers a string to a unique bitmap, as our asynchronous logic in the FPGA holds capabilities to generate similar bitmaps with similar strings. Thus, FLSH can be applied to similarity search application.

The fundamental issue of this application is how to extract the feature from two items efficiently. To post-process the hash value, we provide a simple but practical algorithm called max-min pooling. As shown in Figure 7.6, we first divide the 16 × 16 cell array into 16 sets. Every set contains 4 × 4 cells, which means 64 output bits. Then we accumulate the number of ‘1’ s in each set. By comparing the sum values, two sets with the maximum and the minimum value are marked as ‘1’ while other sets are ‘0’. If there are equivalent extremums, we set all of them to ‘1’. In this way, we compress the 1024-bit hash output to a 4 × 4 bitmap, which, because of biasing, usually includes 2 ‘1’ s and 14 ‘0’ s. When we compare the 16-bit sets from similar inputs, they can coincide with high probabilities. Though we use 64-bit input in our example, the actual length can be flexible. As a solution of any shorter input, the missing part is filled with
‘0’. If the length is more than 64-bit, we can divide the input into several 64-bit sections and apply Exclusive or (XOR) operation to fix the length to 64-bit. Alternatively, we use scripts to reconstruct the architecture by extending the scale with more cells.

**Fig. 7.6:** Cell reconstruction and hash regeneration of max-min pooling

### 7.5.2 Clustering

Clustering is a task of grouping a set of objects so that objects in the same group are more similar than those in other groups. Since FLSH output has the similar feature, we propose a convolutional weight algorithm to develop this application. Figure 7.7 shows a few FLSH cells with different colors. Each cell contains 4 bits. Thus, the total bitmap size is $32 \times 32$. In this algorithm, we build a red frame that includes 13 bits. Depending on the location, they are set with different weights. For example, $bit(i, j)$ in the center obtains a weight of 7. Then we give a weight of 3 to $bit(i-1, j)$, $bit(i+1, j)$, $bit(i, j-1)$, $bit(i, j+1)$, which are directly connected to the center bit. Next, a weight of 2 is set to $bit(i-1, j-1)$, $bit(i+1, j-1)$, $bit(i-1, j+1)$, $bit(i+1, j+1)$, which locate in the diagonal direction. Finally, the farthest four bits, $bit(i-2, j)$, $bit(i+2, j)$, $bit(i, j-2)$,
bit(i, j + 2), receive the minimum weight of 1. Now we calculate the convolutional weight of \( \text{bit}(i, j) \) by multiplying every bit with the corresponding weight.

When two items are given for clustering, our FLSH generates the \( 32 \times 32 \) bitmap for each input. We shift the center of the red frame from the first \( \text{bit}(0, 0) \) to the last \( \text{bit}(31, 31) \) in order to calculate the convolutional weight for each bit. If a bit locates at the boundary of the bitmap, we will fill the missing parts with ‘0’. After 1,024 shift and calculation, the bitmaps are transferred to two \( 32 \times 32 \) “weightmaps”. Then we compare the weights in the same location. If the absolute difference is smaller than a pre-defined threshold, the corresponding bits are recorded as matched to a degree. Otherwise, they are not in the same cluster. When most of the weight pairs are matched, these two strings are regarded as similar. The similarity coefficient can be used to evaluate the result of the convolutional weight algorithm for classification tasks. Note that we generate larger hash redundancies than its input to improve the performance. Since the similarity coefficient replaces the bitmap as output, it will not cause more overhead than other LSHs.

7.6 Evaluation and Discussion

In this section, we provide on-board measured results of our FLSH implementation. During the tests, the environment is set to the normal condition. The proposed FLSH needs 2048 LUTs, which takes only 1% resources of the XC7K325T-2FFG900C FPGA. Compared to other LSHs that require processors, the hardware utilization of FLSH is
7.6.1 Convergence Rate

As a hash function, the output is required to be stable. However, FLSHs output random data at the beginning, then it converges gradually. Due to our asynchronous circuit design and wire delay policy, the metastability and randomness are reasonable in the output. To obtain stable hashes, we tried different proportions of the logic LUT initial values. Technically, an obvious bias leads to a faster data convergence rate. However, too many ‘1’s or ‘0’s cause the loss of functionality. We have set LUT initial table

**Fig. 7.7:** Weight distribution of convolutional weight algorithm

very lightweight.
with 80% of ‘0’ and 20% of ‘1’ as the best tradeoff. To prove that, we used the Xilinx integrated logic analyzer to get 1,024 samples of the FLSH array output, as shown in Figure 7.8. Each column presents the hex values from 256 cells at a certain clock cycle. From left to right, the random hash becomes stable. Particularly, there is an avalanche convergence after the 400th sample. After that, more than 95% of FLSH cells become stable. The time to reach the convergence also decides the speed of FLSH. From our measurement, hash output converges at the 450th sampling clock cycles on average. Since the sampling clock cycle is 5 ns, the speed of FLSHs is calculated to be 444 KHashes/s or 434 Mibps.

Figure 7.9 provides the instability data trends along with the sampling time. If we collect the hash value before the 400th clock cycle, the convergence rate is less than 93%. However, the amount of unstable bits drops suddenly and the curve becomes flat,
as an avalanche convergence occurs in the interval between the 400th and 450th clock cycle. According to our measurements, 450 clock periods are sufficient for the output convergence.

Fig. 7.9: The relationship between the sampling time and the different bits in the output

7.6.2 Randomness

Conventional hash functions have strict randomness requirement for their hash values. Ideally, even if there is only one bit flipped in the input, 50% of the output should change. However, this concept does not fit the LSH family due to its special functionality. LSH output should not be influenced significantly by a few input bits, but it still produces random outputs when 50% of the input bits flip. Therefore, the NIST test is not a suitable standard to evaluate the randomness of LSH. Instead, we show a visual comparison in Figure 7.10. We display bit ‘1’ as white and ‘0’ as black on the 32 × 32 bitmaps. Given two random inputs, the FLSH generates entirely different hash
values. As we mentioned above, the FLSH output produces more 0 outputs than 1’s.

![Comparison of 32 × 32 bitmap with random inputs on an FLSH](image)

**Fig. 7.10:** Comparison of 32 × 32 bitmap with random inputs on an FLSH

Clearly, one sample cannot demonstrate the randomness of FLSHs. To verify it in a statistical sense, we use a linear feedback shift register (LFSR) based pseudo random number generator (PRNG) to generate 10,000 challenges for testing. By accumulating the number of times a ‘1’ appears in each location, we can calculate the probability and evaluate the randomness. From the 32 × 32 bitmap shown in Figure [7.11] there is no intensive fluctuation on the surface, which means the ‘1’s distribute evenly in the output. Note that the logic LUT input is not always random. It is affected by the shift logic and the initial input values. As a result, the ratio of ‘1’ and ‘0’ in the output may have a bias, compared to the pre-set initial value in Table [7.1].

### 7.6.3 Fault Tolerant Capability

The fault tolerant capability determines the performance of an LSH. A good LSH should be able to find the similarity between two items without being affected by the existing differences. When the same input is hashed, more ‘1’s are expected to be put
Fig. 7.11: Probability of ‘1’ in the bitmap of FLSH in the same buckets. Figure 7.12 compares two bitmaps generated by an FLSH with the same input. Since the convergence rate of FLSHs is smaller than 1, some hash values are not stable. As a result, there are 1.7% mismatched bits. However, this defect can be safely ignored as FLSH has strong fault tolerant capability.

Fig. 7.12: Comparison of 32 × 32 bitmap with the same input on an FLSH

Figure 7.13 shows the relationship between the flipped bits of the input and the mismatched bits of the relevant output. As increasing errors are injected in the hash
input, FLSH outputs more different bits. When there is more than eight bit errors in the input, the unstable output contains more than 100 bits, which means the stability drops below 90% and may affect the fault tolerant capability. However, this result is acceptable as the raw data of hash output. With our post-processing algorithms, we are still able to improve the performance of the FLSH-based applications.

![FLSH output instability with different input errors](image)

**Fig. 7.13:** FLSH output instability with different input errors

### 7.6.4 Similarity Search Result

The performance of FLSH-based similarity search is shown in Table 7.2. As a comparison with the traditional LSHs, the Hamming and Minhash similarity rate are tested with the same data set. Though the max-min pooling output is compressed to only 16 bits, the algorithm still provides a good similarity rate as the Hamming similarity does. Thus, given the 64-bit strings in our example, max-min pooling algorithm can be safely applied to similarity search.
<table>
<thead>
<tr>
<th>different bits in two FLSH 64-bit input</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hamming similarity of 64-bit input</td>
<td>93.75%</td>
<td>92.19%</td>
<td>90.63%</td>
<td>89.06%</td>
<td>87.50%</td>
<td>85.94%</td>
<td>84.38%</td>
</tr>
<tr>
<td>Jaccard similarity of Minhash</td>
<td>88.24%</td>
<td>85.51%</td>
<td>82.86%</td>
<td>80.28%</td>
<td>77.78%</td>
<td>75.34%</td>
<td>73.33%</td>
</tr>
<tr>
<td>similarity rate of max-min pooling algorithm</td>
<td>92.06%</td>
<td>91.22%</td>
<td>90.35%</td>
<td>89.87%</td>
<td>88.01%</td>
<td>85.72%</td>
<td>82.90%</td>
</tr>
</tbody>
</table>

Apart from the ability to detect similarity, we also need to show that FLSH can discriminate between dissimilar inputs. We use PRNG to generate two input data sets, with 10,000 random strings in each group. According to the results of the FLSH and convolutional weight algorithm, there are 437 mismatched bits in the 1,024-bit output on average. If we set the threshold to 100, the application can distinguish two random strings with a probability of almost 100%.

7.6.5 Clustering Result

According to Table 7.3, the number of mismatched bits in the bitmap increases when there are more different bits in the input sets. The same data shown in Figure 7.13 demonstrates the relationship between input bit errors and output bit errors. However,
the amount of mismatched bits grows faster after the fifth column. As a result, the Hamming similarity of FLSH bitmaps drops rapidly. When we apply the convolutional weight algorithm, the mismatched bits are significantly reduced. We set the weight threshold to 8 in our tests. Even though there are 10 different bits in the 64-bit input, our solution can generate 1,012 matched bits. The similarity rate is improved to 98.83%, which means only 1.17% inputs cannot be classified correctly. While the max-min pooling algorithm fits similarity search well, the convolutional weight algorithm performs better for clustering application.

<table>
<thead>
<tr>
<th>different bits in two 64-bit input sets</th>
<th>4</th>
<th>6</th>
<th>8</th>
<th>10</th>
<th>12</th>
<th>14</th>
<th>16</th>
</tr>
</thead>
<tbody>
<tr>
<td>mismatched bits among FLSH bitmaps</td>
<td>53</td>
<td>75</td>
<td>100</td>
<td>126</td>
<td>155</td>
<td>190</td>
<td>227</td>
</tr>
<tr>
<td>Hamming similarity of FLSH bitmaps</td>
<td>94.82%</td>
<td>92.68%</td>
<td>90.23%</td>
<td>87.70%</td>
<td>84.86%</td>
<td>81.44%</td>
<td>77.83%</td>
</tr>
<tr>
<td>matched bits of convolutional weight algorithm</td>
<td>1024</td>
<td>1021</td>
<td>1018</td>
<td>1012</td>
<td>996</td>
<td>960</td>
<td>909</td>
</tr>
<tr>
<td>similarity rate of convolutional weight algorithm</td>
<td>100.00%</td>
<td>99.71%</td>
<td>99.41%</td>
<td>98.83%</td>
<td>97.27%</td>
<td>93.75%</td>
<td>88.77%</td>
</tr>
</tbody>
</table>
As was noted earlier, with a 200 MHz clock on the FPGA and a convergence time of 450 samples, the FLSH can produce over 444 KHashes/s. The convolutional weight algorithm can be pipelined with the FLSH process, thus keeping the same hash rate of 444 KHashes/s. As a point of comparison, a Python software implementation of Minhash delivers only 3279 hashes/s on an Intel i7-4770 processor, roughly 135 times slower than the FPGA-based FLSH implementation.

7.7 Conclusions

In this chapter, an FPGA-based locality sensitive hash function is proposed to identify similar strings. We present the details of FLSH architecture, implementation, convergence, randomness, and stability. When there are bit errors in the hash input, our FLSH can tolerate them and keep most output unchanged. In the similarity search and clustering application, our algorithms provide reasonable performance and improvement. In summary, this FPGA based locality sensitive hash function is an efficient solution for identification and classification.
Chapter 8

Conclusions

Though many security solutions have been developed recently for the increasing needs of embedded systems and IoTs, there is still space for further improvement in the performance, cost, and security level of these systems. The trade-off becomes significant, since quite a few devices can not support very strong encryption with large computational complexity, but still require a certain degree of security. Therefore, designers must consider all aspects and find a balance between performance and cost. Physical unclonable functions, thus, become an ideal security primitive with strong advantages. In this thesis, we developed new architectures and novel techniques to overcome some of the main shortcomings of PUFs. The major contributions of our work will be summarized in this chapter.

8.1 PUF Design Optimization and Implementation

Physical unclonable functions are known as a security primitive for various applications. Its instability limits the usage as a strong error correction code is usually required,
which increases the total hardware resource utilization significantly. In chapter 2, several optimizations on the hardware architecture are proposed in order to improve the stability and reduce the system cost. With a shift register or hash function implementation on the FPGA, the required number of ROs decreases to only 128. By involving the IDELAY2 module, we are able to adjust the timing of the RO output and improve the stability.

8.2 Secure Authentication Solution with PUF Response Instability

For most PUF research, unstable bits generated by RO pairs are usually regarded as a negative part that should be eliminated or ignored for output. However, we demonstrate its positive aspect in chapter 3. By converting the unstable bits to stable output, using unstable-stable responses is proved to be a practical solution for authentication. Compared to the traditional ECC methods, USR achieves better performance. We also propose another algorithm using the instability of PUFs. The obfuscation design can guarantee a high security level of PUF responses against modeling attacks.

8.3 PUF Initialization Table for Robust Authentication and Key Generation

For some applications that need high accuracy, the bit error rate in PUF responses should be as low as possible. In chapter 4, we comprehensively evaluate the lower bound of PUF BER on FPGAs. We propose a floating thresholding solution to reduce the bit error rate in PUF responses and improve the performance of PUF-based appli-
cations. We combine the error correction scheme and bit selection scheme together by using an initialization table and unique helper data. Our results show that this novel approach improves the stability and data rate of various PUFs significantly.

8.4 Ring Weight Algorithm for Lightweight Authentication

Counterfeit Integrated Circuits (IC) can be very harmful to the security and reliability of critical applications. Physical Unclonable Functions (PUF) have been proposed as a mechanism for uniquely identifying ICs and thus reducing the prevalence of counterfeit. However, maintaining large databases of PUF challenge response pairs (CRPs) and dealing with PUF errors make it difficult to use PUFs reliably. To satisfy the authentication requirement of low-cost devices, we proposed a novel fuzzy identification scheme in chapter 5. It presents an innovative approach to authenticate CRPs on PUF-based ICs. The proposed method can tolerate considerable bit errors from responses of PUFs without the use of error correcting codes. Different types of optimization methods are applied to improve the overall performance.

8.5 Locality Sensitive Hash Function for Similarity Search and Clustering

Locality sensitive hash functions (LSHs) are a widely used approach to identify similar items or nearest neighbors in a data set. However, LSH algorithms have rarely been explored in hardware implementations. In chapter 6, we propose a novel locality sensitive hash function design and implementation using FPGAs. By using the unique
hardware delay generated during the fabrication process, we can identify the bitmaps from similar inputs. Our results have verified the efficiency and accuracy of our LSH for similarity search and clustering applications.

8.6 Summary of Conclusions

This thesis is devoted to a PUF-based trade-off design and algorithms for authentication and key generation applications on embedded systems and IoTs.
Bibliography


[34] *Kintex-7 FPGAs Data Sheet: DC and AC Switching Characteristics*. Xilinx, 11 2015. Rev. 2.15.


