# Question

Formatted question description: https://leetcode.ca/all/393.html

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

1. For a 1-byte character, the first bit is a 0, followed by its Unicode code.
2. For an n-byte character, the first n bits are all ones, the (n + 1)-th bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

       Number of Bytes  |        UTF-8 Octet Sequence
                        |              (binary)
    --------------------+-----------------------------------------
             1          |   0xxxxxxx
             2          |   110xxxxx 10xxxxxx
             3          |   1110xxxx 10xxxxxx 10xxxxxx
             4          |   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


x denotes a bit in the binary form of a byte that may be either 0 or 1.
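To see the table in practice, the following minimal sketch prints the bit patterns that Python's built-in encoder produces for one sample character of each length. The helper name `to_bits` and the choice of sample characters are assumptions made for illustration only.

```python
def to_bits(data):
    """Render the low 8 bits of each integer as an 8-character binary string."""
    return [format(b & 0xFF, "08b") for b in data]


# One sample character per encoded length (1 to 4 bytes); chosen for illustration.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    encoded = list(ch.encode("utf-8"))
    print(len(encoded), " ".join(to_bits(encoded)))
```

Each printed line starts with the byte count, and the lead byte of each sequence begins with the prefix listed for that count in the table.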

Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.
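As a small illustration of the note, a mask with `0xFF` keeps only the least significant 8 bits of each integer. The helper name `low_bytes` is hypothetical, and under the stated constraint `0 <= data[i] <= 255` the mask changes nothing; it only matters if larger integers are passed in.

```python
def low_bytes(values):
    """Keep only the least significant 8 bits of each integer."""
    return [v & 0xFF for v in values]


# 453 is 0b1_11000101; dropping the ninth bit leaves 0b11000101 = 197.
print(low_bytes([453, 386, 257]))  # [197, 130, 1]
```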

Example 1:

Input: data = [197,130,1]
Output: true
Explanation: data represents the octet sequence: 11000101 10000010 00000001.
It is a valid UTF-8 encoding for a 2-byte character followed by a 1-byte character.


Example 2:

Input: data = [235,140,4]
Output: false
Explanation: data represents the octet sequence: 11101011 10001100 00000100.
The first 3 bits are all ones and the 4th bit is 0, which means it is a 3-byte character.
The next byte is a continuation byte that starts with 10, which is correct.
But the second continuation byte does not start with 10, so it is invalid.
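The continuation check described in this explanation can be reproduced in a few lines. This is only a sketch; the helper name `is_continuation` is an assumption introduced here.

```python
def is_continuation(byte):
    """True when the two most significant bits of the byte are 10."""
    return byte >> 6 == 0b10


data = [235, 140, 4]
# 235 is 11101011: a 3-byte lead, so the next two bytes must be continuations.
checks = [is_continuation(b) for b in data[1:]]
print(checks)  # [True, False]
```

The second entry is `False` because 4 is 00000100, which does not start with 10.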


Constraints:

• 1 <= data.length <= 2 * 10^4
• 0 <= data[i] <= 255

# Algorithm

For any byte B in a UTF-8 encoding, the leading bits determine its role:

If the first bit of B is 0, then B represents a character on its own (an ASCII character);
If the first bit of B is 1 and the second bit is 0, then B is a continuation byte, i.e. a non-leading byte inside a multi-byte character;
If the first two bits of B are 1 and the third bit is 0, then B is the first byte of a character represented by two bytes;
If the first three bits of B are 1 and the fourth bit is 0, then B is the first byte of a character represented by three bytes;
If the first four bits of B are 1 and the fifth bit is 0, then B is the first byte of a character represented by four bytes.
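The case analysis above can be sketched as a small classifier. The function name `classify` and the returned labels are assumptions made for illustration; the shifts assume each value is already in the range 0 to 255.

```python
def classify(byte):
    """Classify one byte (0..255) by its leading bits."""
    if byte >> 7 == 0b0:
        return "single"        # 0xxxxxxx
    if byte >> 6 == 0b10:
        return "continuation"  # 10xxxxxx
    if byte >> 5 == 0b110:
        return "lead-2"        # 110xxxxx
    if byte >> 4 == 0b1110:
        return "lead-3"        # 1110xxxx
    if byte >> 3 == 0b11110:
        return "lead-4"        # 11110xxx
    return "invalid"           # 11111xxx


print([classify(b) for b in [197, 130, 1]])  # ['lead-2', 'continuation', 'single']
```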


Therefore, for any byte in a UTF-8 encoding:

• The first bit tells whether it is an ASCII character;
• The first two bits (if the first bit is 1) tell whether the byte is a continuation byte (10) or the first byte of a multi-byte character (11);
• The first three to five bits (if the first two bits are both 1) tell how many bytes the character occupies: 110 means two, 1110 means three, and 11110 means four;
• If the first five bits are all 1, the byte cannot appear in a valid encoding, which indicates an error in the encoding or in the data transmission process.

The solutions below scan the array once: at each lead byte they determine how many continuation bytes must follow, verify that each of those bytes starts with 10, and reject input that ends before a character is complete.
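An equivalent way to apply these rules is to count the leading 1 bits of a byte and map that count to a character length. The sketch below assumes byte values in the range 0 to 255; the names `leading_ones` and `expected_length` are hypothetical.

```python
def leading_ones(byte):
    """Count consecutive 1 bits starting from the most significant bit."""
    count = 0
    mask = 0x80
    while mask and (byte & mask):
        count += 1
        mask >>= 1
    return count


def expected_length(byte):
    """Total bytes of the character this byte starts, or None if it cannot start one."""
    ones = leading_ones(byte)
    if ones == 0:
        return 1      # ASCII
    if 2 <= ones <= 4:
        return ones   # lead byte of a 2-, 3-, or 4-byte character
    return None       # 1 = continuation byte, 5 or more = invalid


print([expected_length(b) for b in [0x41, 197, 235, 240, 130, 248]])
```

A count of exactly one identifies a continuation byte, which is only valid after a lead byte and never at the start of a character.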

# Code

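• Java

    class Solution {
        public boolean validUtf8(int[] data) {
            // n is the number of continuation bytes still expected
            int n = 0;
            for (int v : data) {
                if (n > 0) {
                    // Inside a multi-byte character: the byte must be 10xxxxxx
                    if (v >> 6 != 0b10) {
                        return false;
                    }
                    --n;
                } else if (v >> 7 == 0) {
                    n = 0; // 0xxxxxxx: single-byte character
                } else if (v >> 5 == 0b110) {
                    n = 1; // 110xxxxx: one continuation byte follows
                } else if (v >> 4 == 0b1110) {
                    n = 2; // 1110xxxx: two continuation bytes follow
                } else if (v >> 3 == 0b11110) {
                    n = 3; // 11110xxx: three continuation bytes follow
                } else {
                    // 10xxxxxx at the start of a character, or 11111xxx
                    return false;
                }
            }
            // The input must not end in the middle of a character
            return n == 0;
        }
    }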

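• C++

    #include <vector>
    using namespace std;

    class Solution {
    public:
        bool validUtf8(vector<int>& data) {
            int n = data.size();
            for (int i = 0; i < n; ++i) {
                // 0xxxxxxx: single-byte character
                if (data[i] < 0b10000000) continue;
                // Count the leading 1 bits of the lead byte
                int cnt = 0;
                for (int j = 7; j >= 1; --j) {
                    if ((data[i] >> j) & 1) ++cnt;
                    else break;
                }
                // cnt == 1: stray continuation byte; cnt > 4: not a valid lead byte;
                // cnt > n - i: the input ends before the character is complete
                if (cnt == 1 || cnt > 4 || cnt > n - i) return false;
                // The next cnt - 1 bytes must all be 10xxxxxx
                for (int j = i + 1; j < i + cnt; ++j) {
                    if (data[j] > 0b10111111 || data[j] < 0b10000000) return false;
                }
                i += cnt - 1;
            }
            return true;
        }
    };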

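• Python

    from typing import List


    class Solution:
        def validUtf8(self, data: List[int]) -> bool:
            n = 0  # continuation bytes still expected
            for v in data:
                if n > 0:
                    # Inside a multi-byte character: the byte must be 10xxxxxx
                    if v >> 6 != 0b10:
                        return False
                    n -= 1
                elif v >> 7 == 0:
                    n = 0  # 0xxxxxxx
                elif v >> 5 == 0b110:
                    n = 1  # 110xxxxx
                elif v >> 4 == 0b1110:
                    n = 2  # 1110xxxx
                elif v >> 3 == 0b11110:
                    n = 3  # 11110xxx
                else:
                    return False
            # The input must not end in the middle of a character
            return n == 0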

############

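    class Solution(object):
        def validUtf8(self, data):
            """
            :type data: List[int]
            :rtype: bool
            """
            # (mask, expected leading bits, number of continuation bytes that follow)
            features = [(0x80, 0x00, 0), (0xe0, 0xc0, 1), (0xf0, 0xe0, 2), (0xf8, 0xf0, 3)]
            new = True    # True when the next byte must start a new character
            followed = 0  # continuation bytes still expected
            i = 0
            while i < len(data):
                if new:
                    followed = -1
                    for mask, pattern, count in features:
                        if (data[i] & mask) == pattern:
                            followed = count
                            break
                    if followed == -1:
                        # 10xxxxxx or 11111xxx cannot start a character
                        return False
                    elif followed != 0:
                        new = False
                    else:
                        new = True
                else:
                    if (data[i] & 0xc0) != 0x80:
                        return False
                    followed -= 1
                    if followed == 0:
                        new = True
                i += 1

            return followed == 0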


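• Go

    func validUtf8(data []int) bool {
        // n is the number of continuation bytes still expected
        n := 0
        for _, v := range data {
            if n > 0 {
                // Inside a multi-byte character: the byte must be 10xxxxxx
                if v>>6 != 0b10 {
                    return false
                }
                n--
            } else if v>>7 == 0 {
                n = 0 // 0xxxxxxx
            } else if v>>5 == 0b110 {
                n = 1 // 110xxxxx
            } else if v>>4 == 0b1110 {
                n = 2 // 1110xxxx
            } else if v>>3 == 0b11110 {
                n = 3 // 11110xxx
            } else {
                return false
            }
        }
        // The input must not end in the middle of a character
        return n == 0
    }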
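
As a quick sanity check on the state-machine approach, the sketch below restates it as a stand-alone Python function and runs it over the two examples plus a few edge cases. The function name `valid_utf8` and the extra test cases are assumptions added for illustration; they are not part of the problem statement.

```python
def valid_utf8(data):
    """Stand-alone validator mirroring the state-machine solutions (illustrative)."""
    pending = 0  # continuation bytes still expected
    for value in data:
        if pending > 0:
            if value >> 6 != 0b10:
                return False
            pending -= 1
        elif value >> 7 == 0:
            pending = 0
        elif value >> 5 == 0b110:
            pending = 1
        elif value >> 4 == 0b1110:
            pending = 2
        elif value >> 3 == 0b11110:
            pending = 3
        else:
            return False
    return pending == 0


cases = [
    ([197, 130, 1], True),         # Example 1
    ([235, 140, 4], False),        # Example 2
    ([240, 159, 152, 128], True),  # one 4-byte character
    ([197], False),                # lead byte with no continuation byte
    ([130], False),                # continuation byte with no lead byte
    ([255], False),                # 11111111 is never a valid lead byte
]
for data, expected in cases:
    assert valid_utf8(data) == expected, data
```

The truncated and stray-byte cases are the ones most often missed, which is why the final `pending == 0` check matters.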