Friday, March 11, 2022

Rearrange String k Distance Apart very easy O(n) solution

Problem description:

Given a non-empty string s and an integer k, rearrange the string such that the same characters are at least distance k from each other.
All input strings are given in lowercase letters. If it is not possible to rearrange the string, return an empty string "".
 
Example 1:
Input: s = "aabbcc", k = 3
Output: "abcabc"
Explanation:
The same letters are at least distance 3 from each other.
Example 2:
Input: s = "aaabc", k = 3
Output: ""
Explanation:
It is not possible to rearrange the string.
Example 3:
Input: s = "aaadbbcc", k = 2
Output: "abacabcd"
Explanation:
Another possible answer is: "abcabcda"
The same letters are at least distance 2 from each other.
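
A quick way to sanity-check any candidate answer against the statement is a small checker. This is only a sketch: the name isValidRearrangement is made up for illustration, and it assumes "distance" means the difference between the indices of two equal characters.

```cpp
#include <algorithm>
#include <array>
#include <string>

// Hypothetical helper (not part of the original problem): returns true when
// `out` uses exactly the characters of `s` and any two equal characters in
// `out` are at least k positions apart.
bool isValidRearrangement(const std::string& s, const std::string& out, int k) {
    // Same multiset of characters?
    std::string a = s, b = out;
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    if (a != b) return false;

    // Track the last index at which each byte value was seen.
    std::array<int, 256> last;
    last.fill(-1);
    for (int i = 0; i < static_cast<int>(out.size()); ++i) {
        const unsigned char c = static_cast<unsigned char>(out[i]);
        if (last[c] >= 0 && i - last[c] < k) return false;
        last[c] = i;
    }
    return true;
}
```

Calling it on an example's input, output and k tells you whether the output really is a permutation of the input with the required spacing.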
 

 (Update: There is a much simpler construction, which I posted here. I'll update the article when I have time.)

After I solved this problem, it was funny to see that people used all kinds of complicated data structures for it, e.g. heap (priority queue), hash map, etc.

In fact, there is an extremely simple algorithm that doesn't need any of them and runs in O(n) time by construction.

Let's look at the problem. We must separate equal characters by a distance of at least k, so let's simply cut the output into n/k pieces (rounded down), each of length k. If n%k != 0, there is one more piece at the end, of length n%k. Stack the pieces as rows: the result is a grid with k columns in which the first n%k columns are one cell taller than the rest.

The most problematic character is, of course, the one that shows up the most. So first we count the frequency of each character and sort the characters in descending order of frequency (sorting is O(n lg n) in general, but since the number of distinct characters is bounded by a constant, 26 here, it counts as O(1)).

We simply put the first character in the front of each substring. If the frequency is more than the number of substrings, then it's impossible.

Next, we simply put the second character in the next empty place, continuing from where the last character was put. E.g, if we have 6 of 'x' and 5 of 'y', and 8 substrings, then after filling in the first two characters, the substrings are

xy
xy
xy
x
x
x
y
y

It would violate the condition if the last 'y' reached or went past the substring just above the first 'y'. But that won't happen: the frequency of the second character is at most that of the first one, so if 'y' wrapped around that far, the first character would have wrapped around too. Since we check whether the frequency of the first character is more than the number of substrings, we would already have noticed that it's impossible. And if all the characters so far have frequency equal to the column height, n/k (or n/k+1 in the taller columns), then 'y' just fills a whole column from the first substring to the last, so there's no problem either!

Now we can fill in the characters from top to bottom, when it's full we move to the right. If n%k==0, that's all we need to do!

It would be a little more complicated if n%k!=0. Let's see what happens.

When we reach the point where the first n%k columns are full and part of the next column is occupied, a next character with frequency n/k causes a failure: its last copy lands one row above its first copy and one column to the right, so the distance between them is k-1.
So, how do we prevent this?
A character with frequency n/k+1 at this point makes the task impossible (we simply return the empty string), so we can assume that this character has frequency at most n/k. If we start it at the top of a new column, there is no such problem, just like in the beginning! And we can bring the last few characters back to the cells we skipped!
Does this work? Yes, if there are enough columns. But if not, it may still cause a failure.
Suppose we have filled the first n%k columns, skipped the partially filled column, and filled the empty column next to it with the next character, x. Then we keep going with the next character, y, which is the last one. If we now bring the rest of y back to the skipped cells, those copies end up to the left of the first few y's, and the condition breaks again!

So, how can we fix it?

Notice that this happens because the last few characters are brought from the top to the bottom, while the first few characters are already at the bottom. Maybe we can change the direction! If we go from bottom to top, does that fix the problem?
Yes! This time, even if the character has frequency n/k, its tail won't reach its head, by the same reasoning as in the beginning. And we can simply move the last part to the front! Even if there's only one extra column, it doesn't cause any trouble.

But hey, remember that the rows are strings. It seems a little too complicated to fill in the first n%k+1 columns, skip some of the characters, fill the columns after that, and then come back and fill in the last few characters where we skipped. Can we do it in a better way?

Since we already reversed the up-down direction, what if we also reverse the left-right direction?
Now, after filling the first n%k columns, we simply continue from the right-most column and fill it from bottom to top! There is no conflict, by the same reasoning as for the left side, and we end up where we left off in the (n%k+1)th column.

Even better, we don't need to do it backwards at all - we can simply reverse the order of the characters! After filling the first n%k columns, we check whether the next character has frequency <= n/k. If not, we return the empty string. If so, we fill the remaining cells starting with the character with the lowest frequency, and keep taking the characters in ascending order of frequency!

This way, we don't need any fancy data structures or complicated algorithms. It's all straightforward.
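
Here is a minimal sketch of the construction in C++, under my reading of the steps above. The function name rearrangeString and the variable names are made up for illustration. One detail is an assumption on my part: the character that is being placed when the first n%k columns become full is finished (spilling into the next column) before the order flips to ascending frequency. Lowercase input is assumed, as in the problem statement.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Sketch only: column-by-column filling of a grid with k columns whose rows,
// read top to bottom, form the answer. Returns "" when no arrangement exists.
std::string rearrangeString(const std::string& s, int k) {
    const int n = static_cast<int>(s.size());
    if (k <= 1 || n == 0) return s;  // nothing to separate

    std::array<int, 26> freq{};
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');  // lowercase input assumed
        ++freq[c - 'a'];
    }
    std::vector<std::pair<int, char>> chars;  // (frequency, letter)
    for (int i = 0; i < 26; ++i)
        if (freq[i] > 0) chars.emplace_back(freq[i], static_cast<char>('a' + i));
    std::sort(chars.rbegin(), chars.rend());  // descending frequency

    const int full = n / k;  // height of the short columns
    const int rem = n % k;   // number of columns that are one cell taller
    if (chars.front().first > full + 1) return "";

    // Phase 1: descending order until the tall columns are covered
    // (the character in progress is finished, see the assumption above).
    std::string seq;  // characters in the order the cells are filled
    seq.reserve(n);
    const int tallCells = rem * (full + 1);
    std::size_t next = 0;
    while (next < chars.size() && static_cast<int>(seq.size()) < tallCells) {
        seq.append(static_cast<std::size_t>(chars[next].first), chars[next].second);
        ++next;
    }

    // The most frequent remaining character must fit in a short column.
    if (next < chars.size() && chars[next].first > full) return "";

    // Phase 2: the remaining characters in ascending order of frequency.
    for (std::size_t i = chars.size(); i-- > next;)
        seq.append(static_cast<std::size_t>(chars[i].first), chars[i].second);

    // Write seq into the grid column by column; cell (row, col) is index row*k+col.
    std::string out(n, ' ');
    int p = 0;
    for (int col = 0; col < k && p < n; ++col) {
        const int height = full + (col < rem ? 1 : 0);
        for (int row = 0; row < height; ++row) out[row * k + col] = seq[p++];
    }
    return out;
}
```

With this sketch, rearrangeString("aabbcc", 3) gives "abcabc" and rearrangeString("aaabc", 3) gives "". For "aaadbbcc" with k = 2 it gives "dcbabaca", which is a different valid answer from the one in the example.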

Wednesday, March 9, 2022

Finding flaws in an algorithm and constructing the counterexamples

Here is the current fastest solution on the submissions page for Leetcode 49. Group Anagrams.

Given an array of strings strs, group the anagrams together. You can return the answer in any order.

class Solution {
public:
    vector<vector<string>> groupAnagrams(vector<string>& strs)
    {
        unordered_map<uint64_t, std::vector<std::string>> map;
        for (auto& word : strs)
            map[encode(word)].emplace_back(std::move(word));
       
        std::vector<std::vector<std::string>> result;
        for (auto& pair : map)
            result.emplace_back(std::move(pair.second));
           
        return result;
    }

    uint64_t encode(const std::string& word)
    {
        uint16_t counter[26] = {0};
        for (char c : word) ++counter[c - 'a'];
       
        uint64_t result = 0;
        uint64_t degree = 1;
       
        for (int i = 0; i < 26; ++i)
        {
            result += degree * (counter[i]);
            degree *= 26;
        }
        return result;
    }
};

Take a look at the "encode" function, which serves as a hash function from a string to an integer. Do you see any flaws?

I explained the flaw and gave a counterexample and a patched solution here. I spotted the problem as soon as I saw the code, because I had thought about something similar when I was solving the problem myself, and I had noticed that it wouldn't work unless I could find a workaround. But somehow, the solution above passed all the test cases. So, yes, in most cases it works correctly, but on some inputs it fails, and it's guaranteed to fail once someone analyzes it and targets it.
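
One collision that follows directly from treating the counts as base-26 digits (it is not necessarily the same counterexample as in the linked post): a count of 26 carries into the next digit, so 26 copies of 'a' encode to the same value as a single 'b'. The sketch below copies the quoted encode as a free function; the helper name collides is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Copy of the encode() quoted above, as a free function.
std::uint64_t encode(const std::string& word) {
    std::uint16_t counter[26] = {0};
    for (char c : word) {
        assert(c >= 'a' && c <= 'z');  // lowercase input assumed
        ++counter[c - 'a'];
    }
    std::uint64_t result = 0;
    std::uint64_t degree = 1;
    for (int i = 0; i < 26; ++i) {
        result += degree * counter[i];
        degree *= 26;  // 26^14 no longer fits in 64 bits, so this wraps around
    }
    return result;
}

// Hypothetical helper: true when two words get the same key.
bool collides(const std::string& a, const std::string& b) {
    return encode(a) == encode(b);
}
```

Here collides(std::string(26, 'a'), "b") is true even though the two words are obviously not anagrams, so they would be put into the same group. The wrap-around of degree is a second, independent source of collisions.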

Here's a solution to a different problem which passed all the test cases but is also wrong: link. The problem is called "Flip Game II".
Description
You are playing the following Flip Game with your friend: Given a string that contains only these two characters: + and -, you and your friend take turns to flip two consecutive "++" into "--". The game ends when a person can no longer make a move and therefore the other person will be the winner.
Write a function to determine if the starting player can guarantee a win.
The solution in question is
class Solution {
public:
    bool canWin(string &s) {
        int _true = 0, _false = 0;
        if(s.length() < 2) {
            return false;
        }
        else {
            int i = 0, t = 0;
            while(i < s.length()) {
                if(!(s[i] == '+' && s[i + 1] == '+')) {
                    i++;
                }
                else {
                    t++;
                    int j = i;
                    for(; j < s.length(); ++j) {
                        if(s[j] != '+') {
                            break;
                        }
                    }
                    int len = j - i;
                    i = j;
                    bool tmp = true;
                    if(len != 3 && (len % 2)) {
                        tmp = false;
                    }
                    if(tmp) {
                        _true++;
                    }
                    else {
                        _false++;
                    }
                }
            }
        }
        if(_true == 2) {
            return false;
        }
        else if(_true > 0 || _false > 1) {
            return true;
        }
        else {
            return false;
        }
    }
};
This is linear in the length of the string, and it passed all the test cases, so no wonder it's in first place on the leaderboard... except that, unfortunately, it's wrong.
I noticed because I was looking for a linear solution myself, and I had seen that odd-length and even-length substrings have different properties. But I figured that the real rule is much more complicated than what the code above computes. Can you give a counterexample for which the code above gives a wrong answer?
(Spoiler below)












I found interesting patterns in the lengths of the "+...+" substrings, especially the even-length ones. Substrings of odd length or of length 2 are easy to deal with, but even lengths >= 4 are much more complicated. It seems that if the length mod 8 is 0 or 2, it's also simple, but if it's 4 or 6, it's complicated. (I wonder if the number of substrings with length mod 8 == 4 or 6 determines the result.)
E.g., if you give your opponent two length-4 substrings, you win, but if you give him one of length 4 and one of length 6, you lose; with 6 and 6 you win, and with 4 and 8 you lose.

So, take another look at the solution above. Do you see what he did?
He counted the "+" substrings by whether their length is odd or even (with length 3 treated like the even ones), and made the decision from those counts. But I had already thought about this, and it didn't work.
He decided that if the number of even-length substrings is 2, then player 1 loses for sure. But I already gave several examples with two even-length substrings, and they don't all behave the same. Here's a counterexample: 4 6. If the string has one "+" substring of length 4 and one of length 6, then player 1 should win.

Testing it with input "++++-++++++" gives the wrong answer, as expected.
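
For cross-checking small inputs, here is a sketch that uses the standard Sprague-Grundy approach, which is a different technique from the linear solution quoted above: each run of '+' is an independent game, a move inside a run of length L splits it into runs of lengths i and L-2-i, and the first player wins exactly when the XOR of the runs' Grundy values is nonzero. The names grundyTable and canWinGrundy are made up for illustration, and the table takes quadratic time, so this is a reference checker rather than the linear algorithm I was looking for.

```cpp
#include <string>
#include <vector>

// Grundy values g[0..maxLen] for a single run of '+' of each length.
std::vector<int> grundyTable(int maxLen) {
    std::vector<int> g(maxLen + 1, 0);  // runs of length 0 and 1 have no moves
    for (int len = 2; len <= maxLen; ++len) {
        // A run of length len has len-1 moves, so its mex is at most len-1.
        std::vector<bool> seen(len + 1, false);
        for (int left = 0; left + 2 <= len; ++left) {
            const int v = g[left] ^ g[len - 2 - left];
            if (v < static_cast<int>(seen.size())) seen[v] = true;
        }
        int mex = 0;
        while (seen[mex]) ++mex;
        g[len] = mex;
    }
    return g;
}

// True when the player to move can force a win from position s.
bool canWinGrundy(const std::string& s) {
    const std::vector<int> g = grundyTable(static_cast<int>(s.size()));
    int nim = 0;
    int run = 0;  // length of the current run of '+'
    for (char c : s) {
        if (c == '+') {
            ++run;
        } else {
            nim ^= g[run];
            run = 0;
        }
    }
    nim ^= g[run];
    return nim != 0;
}
```

On "++++-++++++" this returns true, and it also agrees with the other cases listed above: runs of lengths 4 and 4 or 6 and 6 are losses for the player to move, while 4 and 8 is a win.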
I wonder how the solution passed all the test cases... Is my counterexample a minimal one? What's the probability that the solution above fails on a random "+-" string? I'll leave these questions for the future... if I have time...

Update 230827, another counter-example to the top solution to H403. Frog Jump

Monday, March 7, 2022

Efficient Solution to Range Minimum Query (RMQ) and Lowest Common Ancestor (LCA) problems with O(1) query and O(n) preprocessing

I found this interesting article Range Minimum Query and Lowest Common Ancestor after I read about BIT on the same site (it's the best explanation of how BIT works compared to other articles imo, so I looked for some other interesting stuff).

After I read through it, I decided to implement it. I posted it here.

It's a nice article, except that the last section "AN <O(N), O(1)> ALGORITHM FOR THE RESTRICTED RMQ" is very confusing. The last two sentences of the first paragraph say that
"It’s obvious that elements in A can be just +1 or -1. Notice that the old value of A[i] is now the sum of A[1], A[2] … A[i] plus the old A[0]. However, we won’t need the old values from now on."

But then in the next paragraph, it says

"Let A’[i] be the minimum value for the i-th block in A and B[i] be the position of this minimum value in A."

It doesn't make much sense: since A is now just a +1/-1 array, what's the point of finding the minimum in this array? I thought this must be a mistake, so I searched for some clarification.

There aren't many discussions on this topic, probably because it's not very practical. But from the few pages that I found, I confirmed my suspicion and figured out how it works, and my implementation worked as expected.

Is it overkill? Most likely. An O(log N) speedup, and only in the preprocessing time, is not a big gain, and the algorithm is much more complicated than a simple ST.
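
For comparison, this is roughly what the simple ST (sparse table) baseline looks like: O(N log N) preprocessing and O(1) per query. It is a generic sketch, not the code from the article or from my implementation, and the class name SparseTable is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Sparse table for range-minimum queries on a fixed array.
class SparseTable {
public:
    explicit SparseTable(const std::vector<int>& a) {
        const int n = static_cast<int>(a.size());
        // log_[len] = floor(log2(len)) for 1 <= len <= n
        log_.assign(n + 1, 0);
        for (int i = 2; i <= n; ++i) log_[i] = log_[i / 2] + 1;

        // table_[j][i] = minimum of a[i .. i + 2^j - 1]
        table_.push_back(a);
        for (int j = 1; (1 << j) <= n; ++j) {
            const int half = 1 << (j - 1);
            std::vector<int> row(n - (1 << j) + 1);
            for (int i = 0; i + (1 << j) <= n; ++i)
                row[i] = std::min(table_[j - 1][i], table_[j - 1][i + half]);
            table_.push_back(std::move(row));
        }
    }

    // Minimum of a[lo..hi], both ends inclusive.
    int query(int lo, int hi) const {
        assert(0 <= lo && lo <= hi && hi < static_cast<int>(table_[0].size()));
        const int j = log_[hi - lo + 1];
        // Two overlapping blocks of length 2^j cover the whole range.
        return std::min(table_[j][lo], table_[j][hi - (1 << j) + 1]);
    }

private:
    std::vector<int> log_;
    std::vector<std::vector<int>> table_;
};
```

The O(1) query relies on min being idempotent, so the two overlapping blocks are harmless. The <O(N), O(1)> algorithm only removes the log factor from the preprocessing step.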

Nonetheless, it's already surprising that it can be done at all! And I learned something new.

Someone mentioned that the algorithm can be simplified. Well, it already took me a weekend, so I won't spend more time on it. But I'll take a look at that paper when I have some spare time...

The limit of an averaging sequence

Consider this sequence of length $n$: $P_n=(p_0,p_1,p_2,\dots,p_{n-1})$, where $p_0=2$, and each time the next number $p_k$ can be any numbe...