Tuesday, December 19, 2023

Maximum Product Difference Between Two Pairs, but with a twist

Yesterday's daily problem, E1913. Maximum Product Difference Between Two Pairs, is a very easy one.

Description:
The product difference between two pairs (a, b) and (c, d) is defined as (a * b) - (c * d). For example, the product difference between (5, 6) and (2, 7) is (5 * 6) - (2 * 7) = 16. Given an integer array nums, choose four distinct indices w, x, y, and z such that the product difference between pairs (nums[w], nums[x]) and (nums[y], nums[z]) is maximized. Return the maximum such product difference. Constraints: $4 \leq nums.length \leq 10^4$, $1 \leq nums[i] \leq 10^4$
This problem is very easy: one just needs to find the two largest numbers and the two smallest. This works because all the elements are positive. So, naturally, I wondered: what if 0 and negative numbers are allowed? That is, what if the second constraint becomes $-10^4 \leq nums[i] \leq 10^4$? At first I thought this would be just a little more complicated, but as I attempted to solve it, I found that the problem has quite a few special situations that must be taken care of. It takes a lot of patience and care. Here is the test program that I wrote in C++. Are you able to solve this problem correctly? (The solution must still be $O(n)$.) (Let me know if you translate it into a different language; I'll add or link your version.)
```cpp
#include <bits/stdc++.h>
using namespace std;
int yourSolution(vector<int> &v){
    //input your solution here
    
    
    return 0;//placeholder so the harness compiles - replace with your answer
}
//brute force over all pairs of pairs; i1..i4 return the element values
//(not indices) of the best pairs found
int bruteForce(const vector<int> &v,int &i1,int &i2,int &i3,int &i4){
    const int sz=v.size();
    int r=0,e1,e2,e3,e4,t;
    for(int i=0;i<sz;++i){
        e1=v[i];
        for(int j=i+1;j<sz;++j){
            e2=v[j];
            for(int k=j+1;k<sz;++k){
                e3=v[k];
                for(int l=k+1;l<sz;++l){
                    e4=v[l];
                    t=abs(e1*e2-e3*e4);
                    if(t>r) r=t,i1=e1,i2=e2,i3=e3,i4=e4;
                    t=abs(e1*e3-e2*e4);
                    if(t>r) r=t,i1=e1,i2=e3,i3=e2,i4=e4;
                    t=abs(e1*e4-e3*e2);
                    if(t>r) r=t,i1=e1,i2=e4,i3=e3,i4=e2;
                }
            }
        }
    }
    return r;
}
default_random_engine rg(chrono::system_clock::now().time_since_epoch().count());
uniform_int_distribution<int> rollEle(1,10000);
const int optSize=12;
const int maxSize=32;
uniform_int_distribution<int> rollPos(0,maxSize-1);
uniform_int_distribution<int> roll0(0,1);
uniform_int_distribution<int> rollExtreme(0,1);
void noNeg(vector<int> &v){
    v.resize(maxSize);
    for(int i=0;i<maxSize;++i) v[i]=rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
}
void noPos(vector<int> &v){
    v.resize(maxSize);
    for(int i=0;i<maxSize;++i) v[i]=-rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
}
void mixed(vector<int> &v){
    v.resize(maxSize);
    for(int i=0;i<maxSize/2;++i) v[i]=rollEle(rg);
    for(int i=maxSize/2;i<maxSize;++i) v[i]=-rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
}
void neg2posN(vector<int> &v){
    if(rollExtreme(rg)){
        v=vector<int>{1,1,1,1,1,3,1,-1000,1,1,1,1,1,1,1,-1000,1,1,1,1,2,1,1,1,1};
        if(rollExtreme(rg)) v=vector<int>{1,1,1,1,1,3,1,1000,1,1,1,1,1,1,1,-1000,1,1,1,1,-2,1,1,1,1};
        if(roll0(rg)) v[0]=0;
        return;
    }
    v.resize(maxSize);
    for(int i=2;i<maxSize;++i) v[i]=rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
    v[0]=-rollEle(rg);
    v[1]=-rollEle(rg);
}
void neg1posN(vector<int> &v){
    v.resize(maxSize);
    for(int i=2;i<maxSize;++i) v[i]=rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
    v[rollPos(rg)]=-rollEle(rg);
}
void pos2negN(vector<int> &v){
    if(rollExtreme(rg)){
        v=vector<int>{-1,-1,-1,-1,-1,-3,1,-1000,-1,-1,-1,-1,-1,-1,-1,1000,-1,-1,-1,-1,-2,-1,-1,-1,-1};
        if(rollExtreme(rg)) v=vector<int>{1,1,1,1,1,3,1,-1000,1,1,1,1,1,1,1,-1000,1,1,1,1,2,1,1,1,1};
        if(roll0(rg)) v[0]=0;
        return;
    }
    v.resize(maxSize);
    for(int i=2;i<maxSize;++i) v[i]=-rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
    v[0]=rollEle(rg);
    v[1]=rollEle(rg);
}
void pos1negN(vector<int> &v){
    v.resize(maxSize);
    for(int i=2;i<maxSize;++i) v[i]=-rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
    v[rollPos(rg)]=rollEle(rg);
}
void pos2neg2(vector<int> &v){
    v.resize(4);
    v[0]=rollEle(rg);
    v[1]=-rollEle(rg);
    v[2]=-rollEle(rg);
    v[3]=rollEle(rg);
}
void pos2neg2and0(vector<int> &v){
    v=vector<int>(maxSize,0);//fill with zeros; resize() alone would keep stale values from the previous test
    v[0]=rollEle(rg);
    v[1]=-rollEle(rg);
    v[2]=-rollEle(rg);
    v[3]=rollEle(rg);
}
void pos2neg1and0(vector<int> &v){
    v=vector<int>(maxSize,0);
    v[0]=rollEle(rg);
    v[1]=-rollEle(rg);
    v.back()=rollEle(rg);
}
void neg2pos1and0(vector<int> &v){
    v=vector<int>(maxSize,0);
    v[0]=-rollEle(rg);
    v[1]=rollEle(rg);
    v.back()=-rollEle(rg);
}
void neg1pos1and0(vector<int> &v){
    v=vector<int>(maxSize,0);
    v[1]=rollEle(rg);
    v.back()=-rollEle(rg);
}
void (*updateTest[optSize]) (vector<int>&)={noNeg,noPos,mixed,neg2posN,neg1posN,pos2negN,pos1negN,pos2neg2,pos2neg2and0,pos2neg1and0,neg2pos1and0,neg1pos1and0};
int main(){
    int i1=0,i2=0,i3=0,i4=0;//initialized in case bruteForce never updates them
    int repeat=10000;
    vector<int> test;
    uniform_int_distribution<int> rollOpt(0,optSize-1);
    for(int i=0;i<repeat;++i){
        updateTest[rollOpt(rg)](test);
        int result=yourSolution(test);
        int maxi=bruteForce(test,i1,i2,i3,i4);
        if(result!=maxi){
            cout<<"iteration number:"<<i<<endl;
            cout<<"input is ";
            for(auto e:test) cout<<e<<",";
            cout<<endl;
            cout<<"Your answer is "<<result<<", but the maximum is abs("<<i1<<"*"<<i2<<"-"<<i3<<"*"<<i4<<")="<<maxi<<".\n";
            return 0;
        }
    }
    cout<<"All passed!\n";
    return 0;
}
```
For simplicity and speed, the test cases are not randomly shuffled, but that probably doesn't matter much in this context. A bigger problem is that the extreme cases are too rare, so I added some of them manually. If you think there are other extreme cases that should be included, let me know!

I wrote my own solution and it passed all the test cases, but I'm still not 100% sure it's correct. Let me know if you write a better test case generator that covers more corner cases!
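By the way, here's a minimal sketch of the $O(n)$ solution to the original, all-positive version (just track the two largest and the two smallest values in one pass). It deliberately does not handle the twist with zeros and negatives - that part is for you:
```cpp
#include <bits/stdc++.h>
using namespace std;
//Original problem (all elements >= 1): the answer is max1*max2 - min1*min2.
int maxProductDifference(const vector<int> &nums){
    int max1=0,max2=0;             //two largest values (valid since nums[i]>=1)
    int min1=INT_MAX,min2=INT_MAX; //two smallest values
    for(int x:nums){
        if(x>max1){max2=max1;max1=x;}
        else if(x>max2) max2=x;
        if(x<min1){min2=min1;min1=x;}
        else if(x<min2) min2=x;
    }
    return max1*max2-min1*min2;
}
```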

IBM Ponder This 2023-11 solution

Problem description is here. The solution is given here.

When I saw the problem, I wondered whether the diagonals include just the two main diagonals or the broken diagonals too. There's no explicit method to construct all magic squares of order $\geq 4$, but if we require that the broken diagonals also add up to the magic constant, i.e. we consider the pandiagonal magic squares, there is an explicit construction of all order 4 pandiagonal magic squares. So I just did that, and once I confirmed that the shortest path length for them is less than 50, I pretty much forgot about the more general magic squares. Well, I like the pandiagonal ones better anyway, because of their higher symmetry.

Once the matrix is constructed, the first thing to do is to check its parity, because we need that to determine which target it will connect with. Another thing we may want to know is a heuristic distance from the matrix to the target. One simple choice is the Manhattan distance, although I doubt its effectiveness. Anyway, here's my implementation of the order 4 pandiagonal magic squares:
```cpp
struct matrix{
    array<array<int,4>,4> mat;
    coor coors[16];
    int dis,to_target;
    int perm=0;
    matrix(int b,int c,int d,int e,int tranX,int tranY){//tranX and tranY from 0 to 3
        coors[mat[tranX][tranY]=0]=coor{tranX,tranY};
        coors[mat[tranX][(tranY+1)%4]=b+c+e]=coor{tranX,(tranY+1)%4};
        coors[mat[tranX][(tranY+2)%4]=c+d]=coor{tranX,(tranY+2)%4};
        coors[mat[tranX][(tranY+3)%4]=b+d+e]=coor{tranX,(tranY+3)%4};
        coors[mat[(tranX+1)%4][tranY]=b+c+d]=coor{(tranX+1)%4,tranY};
        coors[mat[(tranX+1)%4][(tranY+1)%4]=d+e]=coor{(tranX+1)%4,(tranY+1)%4};
        coors[mat[(tranX+1)%4][(tranY+2)%4]=b]=coor{(tranX+1)%4,(tranY+2)%4};
        coors[mat[(tranX+1)%4][(tranY+3)%4]=c+e]=coor{(tranX+1)%4,(tranY+3)%4};
        coors[mat[(tranX+2)%4][tranY]=b+e]=coor{(tranX+2)%4,tranY};
        coors[mat[(tranX+2)%4][(tranY+1)%4]=c]=coor{(tranX+2)%4,(tranY+1)%4};
        coors[mat[(tranX+2)%4][(tranY+2)%4]=b+c+d+e]=coor{(tranX+2)%4,(tranY+2)%4};
        coors[mat[(tranX+2)%4][(tranY+3)%4]=d]=coor{(tranX+2)%4,(tranY+3)%4};
        coors[mat[(tranX+3)%4][tranY]=c+d+e]=coor{(tranX+3)%4,tranY};
        coors[mat[(tranX+3)%4][(tranY+1)%4]=b+d]=coor{(tranX+3)%4,(tranY+1)%4};
        coors[mat[(tranX+3)%4][(tranY+2)%4]=e]=coor{(tranX+3)%4,(tranY+2)%4};
        coors[mat[(tranX+3)%4][(tranY+3)%4]=b+c]=coor{(tranX+3)%4,(tranY+3)%4};
        getDis();
    }
    void getDis(){
        dis=0;
        to_target=0;
        for(int i=0;i<16;++i) dis+=abs(coors[i].x-target0coors[i].x)+abs(coors[i].y-target0coors[i].y);
        if(checkPerm()&1){
            to_target=1;
            dis-=abs(coors[14].x-target0coors[14].x)+abs(coors[14].y-target0coors[14].y);
            dis-=abs(coors[15].x-target0coors[15].x)+abs(coors[15].y-target0coors[15].y);
            dis+=abs(coors[14].x-target1coors[14].x)+abs(coors[14].y-target1coors[14].y);
            dis+=abs(coors[15].x-target1coors[15].x)+abs(coors[15].y-target1coors[15].y);
        }
    }
    int checkPerm(){//permutation parity (15-puzzle solvability style): move the blank home first, then count transpositions
        int arr[16];
        for(int i=0;i<4;++i)for(int j=0;j<4;++j) arr[4*i+j]=mat[i][j];
        perm=0;
        if(arr[15]!=0){for(int i=0;i<15;++i){if(arr[i]==0){perm+=(i/4)+(i%4)-1;swap(arr[i],arr[15]);break;}}}
        for(int i=1;i<=15;++i){
            if(arr[i-1]!=i){
                perm++;
                for(int j=0;j<15;++j){
                    if(arr[j]==i){swap(arr[j],arr[i-1]);break;}
                }
            }
        }
        return perm;
    }
    void print(ostream& output)const{
        output<<"distance is "<<dis<<endl;
        output<<"[";
        for(int i=0;i<4;++i){output<<"[";for(int j=0;j<3;++j) output<<mat[i][j]<<",";output<<mat[i][3]<<"]";if(i!=3) output<<",";else output<<"];";output<<endl;}
        output<<"-------------------\n";
    }
    void printDis(ostream& output)const{
        output<<"to_target="<<to_target<<endl<<"number:";
        for(int i=0;i<16;++i) output<<i<<" "; output<<endl<<"distance:";
        if(to_target){
            for(int i=0;i<16;++i) output<<(abs(coors[i].x-target1coors[i].x)+abs(coors[i].y-target1coors[i].y))<<" ";
            output<<endl;
        }
        else{
            for(int i=0;i<16;++i) output<<(abs(coors[i].x-target0coors[i].x)+abs(coors[i].y-target0coors[i].y))<<" ";
            output<<endl;
        }
    }
    void printPerm(ostream& output)const{output<<"perm="<<perm<<endl;}
    mr1 to_mr1()const{
        ull conf=0,co=0;
        for(int i=0;i<16;++i){co|=(ull)coors[i].x<<(4*i);co|=(ull)coors[i].y<<(4*i+2);}
        for(int i=0;i<4;++i)
            for(int j=0;j<4;++j)
                conf|=(ull)mat[i][j]<<(4*(4*i+j));
        return mr1{conf,co,(int8_t)dis};
    }
    mr0 to_mr0()const{
        ull conf=0,co=0;
        for(int i=0;i<16;++i){co|=(ull)coors[i].x<<(4*i);co|=(ull)coors[i].y<<(4*i+2);}
        for(int i=0;i<4;++i)
            for(int j=0;j<4;++j)
                conf|=(ull)mat[i][j]<<(4*(4*i+j));
        return mr0{conf,co,(int8_t)dis};
    }
};
```
I need to be able to find the element at given coordinates and also the coordinates of a given element, so I decided to store both. Next, I can build the graph and try to find a path. To save space, I noticed that an order 4 magic square can be expressed as a single 64-bit integer - either by configuration or by coordinates - so I use that expression as a "reduced matrix". That's what "mr0" and "mr1" are, corresponding to the two parities.

There are 384 order 4 pandiagonal magic squares, 192 for each parity.

Before implementing the reduced matrices, I tried to find a solution manually to see what the best result I could get was. I used this implementation and changed it a bit. First, I needed to be able to set the parity. (I could have computed the parity from the matrix input, but I decided to skip that. I could also have created a plain text input for the initial matrix, but since I just wanted to see the result, I kept it simple and pasted the matrix into the source code.) Next, I needed the list of elements that I clicked, in reverse order.
(An interactive board with a parity selector is embedded here in the original post.)

I found a solution of path length 90 manually.

Now, getting back to the reduced matrices and graph building algorithm...
```cpp
unordered_map<ull,array<ull,4>> graph;
struct mr0{//reduced matrix
    ull conf,co;
    int8_t dis=0;
    mr0(ull conf1,ull co1,int8_t dist):conf(conf1),co(co1),dis(dist){}
    mr0 getMr(){
        ull confs[4];//left,right,up,down
        ull coos[4];
        int8_t dis[4];
        double prob[4]={0.0};
        if((co&3)!=0){//l
            confs[0]=conf;
            coos[0]=co;
            int x=(co-1)&3,y=(co>>2)&3;
            coos[0]--;
            int locOfe=4*(4*x+y);
            ull e=(conf>>locOfe)&15;
            coos[0]+=1ULL<<(e*4);
            confs[0]&=~(15ULL<<locOfe);
            confs[0]|=e<<(4*(4*(x+1)+y));
            dis[0]=getDis(coos[0]);
            addg(confs[0]);
            if(dis[0]==0) return mr0{confs[0],coos[0],dis[0]};
            prob[0]=exp(-dis[0]*T);
            //cout<<"left config:\n";print(confs[0]);cout<<"distance="<<(int)dis[0]<<endl;
        }
        if((co&3)!=3){//r
            confs[1]=conf;
            coos[1]=co;
            int x=(co+1)&3,y=(co>>2)&3;
            coos[1]++;
            int locOfe=4*(4*x+y);
            ull e=(conf>>locOfe)&15;
            coos[1]-=1ULL<<(e*4);
            confs[1]&=~(15ULL<<locOfe);
            confs[1]|=e<<(4*(4*(x-1)+y));
            dis[1]=getDis(coos[1]);
            addg(confs[1]);
            if(dis[1]==0) return mr0{confs[1],coos[1],dis[1]};
            prob[1]=exp(-dis[1]*T);
            //cout<<"right config:\n";print(confs[1]);cout<<"distance="<<(int)dis[1]<<endl;
        }
        if((co&12)!=0){//u
            confs[2]=conf;
            coos[2]=co;
            int x=co&3,y=((co>>2)-1)&3;
            coos[2]-=4;
            int locOfe=4*(4*x+y);
            ull e=(conf>>locOfe)&15;
            coos[2]+=4ULL<<(e*4);
            confs[2]&=~(15ULL<<locOfe);
            confs[2]|=e<<(4*(4*x+y+1));
            dis[2]=getDis(coos[2]);
            addg(confs[2]);
            if(dis[2]==0) return mr0{confs[2],coos[2],dis[2]};
            prob[2]=exp(-dis[2]*T);
            //cout<<"up config:\n";print(confs[2]);cout<<"distance="<<(int)dis[2]<<endl;
        }
        if((co&12)!=12){//d
            confs[3]=conf;
            coos[3]=co;
            int x=co&3,y=((co>>2)+1)&3;
            coos[3]+=4;
            int locOfe=4*(4*x+y);
            ull e=(conf>>locOfe)&15;
            coos[3]-=4ULL<<(e*4);
            confs[3]&=~(15ULL<<locOfe);
            confs[3]|=e<<(4*(4*x+y-1));
            dis[3]=getDis(coos[3]);
            addg(confs[3]);
            if(dis[3]==0) return mr0{confs[3],coos[3],dis[3]};
            prob[3]=exp(-dis[3]*T);
            //cout<<"down config:\n";print(confs[3]);cout<<"distance="<<(int)dis[3]<<endl;
        }
        for(int i=1;i<4;++i) prob[i]+=prob[i-1];
        for(int i=0;i<4;++i) prob[i]/=prob[3];
        //cout<<"Accumulated probabilities are:"; for(int i=0;i<4;++i) cout<<prob[i]<<","; cout<<endl;
        double p=roll1(generator);
        int l=0;
        for(;l<4;++l){if(p<prob[l]) break;}
        return mr0{confs[l],coos[l],dis[l]};
    }
    void addg(ull nei){
        auto& arr=graph[conf];
        for(int i=0;i<4;++i){
            if(arr[i]==nei) return;
            if(arr[i]==0){arr[i]=nei;break;}
        }
        auto& arr1=graph[nei];
        for(int i=0;i<4;++i){
            if(arr1[i]==0){arr1[i]=conf;return;}
        }
    }
    int getDis(ull coor){
        int dis=0;
        for(int i=0;i<16;++i) dis+=abs(int((coor>>4*i)&3)-target0coors[i].x)+abs(int((coor>>(4*i+2))&3)-target0coors[i].y);
        return dis;
    }
};
```
Each reduced matrix stores the configuration and coordinates of the matrix and the Manhattan distance to its target. The "getMr()" method generates a neighboring matrix with probability proportional to $e^{-\text{distance}\cdot T}$, where $T$ is a constant parameter. Each newly generated matrix is added to a graph, and once the target is reached, I do a BFS on that graph to find the shortest path.

Because the shortest path may start from any of the magic squares that have the same parity as the target, I simply start the BFS from the target and look for any of the initial matrices.
```cpp
unordered_map<ull,int> layers;
unordered_set<ull> inits;
queue<ull> q;
ull bfs(ull start){
    layers.clear();//start fresh; stale layers from a previous search would corrupt getSteps()
    while(!q.empty()) q.pop();
    q.emplace(start);
    int l=0;
    layers[start]=l;
    while(!q.empty()){
        int sz=q.size();
        l++;
        while(sz--!=0){
            const auto &arr=graph[q.front()];
            q.pop();
            for(int i=0;i<4;++i){
                ull t=arr[i];
                if(t==0) break;
                if(layers.find(t)!=layers.end()) continue;
                layers[t]=l;
                if(inits.count(t)){//cout<<"layers="<<l<<endl;
                    while(!q.empty()) q.pop(); return t;}
                q.emplace(t);
            }
        }
    }
    return 0;
}
vector<ull> getSteps(ull fin,ull tar){
    vector<ull> r{fin};
    ull nxt=fin;
    int l=layers[fin]-1;
    //cout<<"finished layers="<<l<<endl;
    while(nxt!=tar){
        const auto& arr=graph[nxt];
        for(int i=0;i<4;++i){
            if(arr[i]==0) break;
            if(layers[arr[i]]==l){nxt=arr[i];r.emplace_back(nxt);break;}
        }
        l--;
    }
    return r;
}
```
Since it's a randomized algorithm, the first path it finds is usually quite long, on the order of 10,000, with large fluctuations. But as the graph gets bigger, there's a higher chance that a short path will be found, and the shortest path length decreases. The length quickly drops to about 100, then it hardly moves, until the program crashes from using too much memory, because the graph has grown too big. The best result I got with this algorithm is 59.

So I searched for a better solver, and I found one here. Again, I needed to change the source code a little to enable different parities.

For many of the inputs, it gives a result very fast, but for the rest it also used too much memory (more than 12GB) and crashed. The implementation is quite general and not optimized for order 4 matrices.

Anyway, the best result I got from that program is 42! The answer to life, the universe, and everything! Coincidence? I think not!

Start:
1 2 3 4
5 6 7 8
9 10 11 12
13 15 14 0

Sequence: 
[12, 8, 7, 11, 10, 6, 5, 9, 6, 10, 8, 12, 14, 8, 11, 3, 4, 7, 12, 11, 3, 5, 10, 15, 8, 3, 5, 4, 2, 10, 9, 1, 10, 9, 4, 2, 9, 4, 15, 8, 3, 14]

End:
10 4 9 7
1 15 2 12
6 8 5 11
13 3 14 0 
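(Here's a small sketch to check the sequence: it applies the moves to the start board and prints the resulting board, for comparison with the end position above. It assumes each entry in the sequence is the tile that slides into the empty cell - an interpretation that at least matches the first few moves:)
```cpp
#include <cstdio>
int main(){
    int b[4][4]={{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,15,14,0}};
    int seq[]={12,8,7,11,10,6,5,9,6,10,8,12,14,8,11,3,4,7,12,11,
               3,5,10,15,8,3,5,4,2,10,9,1,10,9,4,2,9,4,15,8,3,14};
    int br=3,bc=3;//current position of the empty cell
    for(int t:seq){
        int tr=-1,tc=-1;//locate the tile to slide
        for(int r=0;r<4;++r)for(int c=0;c<4;++c)if(b[r][c]==t){tr=r;tc=c;}
        b[br][bc]=t;b[tr][tc]=0;br=tr;bc=tc;//slide it into the empty cell
    }
    for(int r=0;r<4;++r){for(int c=0;c<4;++c)printf("%d ",b[r][c]);printf("\n");}
    return 0;
}
```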

Now that I look at it again, I notice that their distance function skips the empty cell, which I didn't do in my distance calculation; I think that's why mine didn't work as well... Also, I think I only changed the destination for the odd parities but forgot to change the distance function accordingly. Maybe that's why there were more configurations it was unable to solve, and for those odd-parity cases the path found is not guaranteed to be the shortest, either. Anyway, it's not hard to fix, but I've got to move on...

One issue with the $A^*$ algorithm is that it only searches in one direction, and if the distance function is not helpful, in the worst case it reduces to a BFS in this context, where the radius to explore equals the length of the shortest path. If, on the other hand, we start from both ends, we only need to explore two balls of radius $l/2$, where $l$ is the shortest path length. This makes a huge difference when the volume grows exponentially with the radius.

So another way to solve it is this: since we know that a solution of length $\leq 50$ exists, we can simply explore a ball of radius 25 from each side, stop when the two balls touch, and read off the result. If they don't touch, we move on to the next candidate.

Each configuration has 2 to 4 neighbors, i.e. 1 to 3 children. If 1 and 3 children are equally likely, there are at most $2^{25}$ configurations in each sphere, which is about 32M. Including the surface it would be 48M, and two of them make 96M. Each configuration can be stored in 64 bits plus one byte for the distance, so that's about 864MB. Considering the extra space needed for the hashmap and other overhead, it should theoretically be doable within 2GB of memory... I wonder how much memory it actually uses.
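Here's a minimal sketch of that meet-in-the-middle search on packed 64-bit configurations. I use my own row-major packing here (4 bits per cell), which may differ from the layout in the code above:
```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>
using ull=uint64_t;

//All configurations one blank-move away; cell (r,c) sits at bits 4*(4*r+c).
std::vector<ull> neighbors(ull conf){
    int bp=0;//position of the blank (value 0)
    for(int p=0;p<16;++p) if(((conf>>(4*p))&15ULL)==0){bp=p;break;}
    int r=bp/4,c=bp%4;
    std::vector<ull> res;
    const int dr[4]={0,0,-1,1},dc[4]={-1,1,0,0};
    for(int k=0;k<4;++k){
        int nr=r+dr[k],nc=c+dc[k];
        if(nr<0||nr>3||nc<0||nc>3) continue;
        int np=4*nr+nc;
        ull tile=(conf>>(4*np))&15ULL;
        ull nxt=conf&~(15ULL<<(4*np));//clear the moved tile (leaves 0 = blank)
        nxt|=tile<<(4*bp);            //place it where the blank was
        res.push_back(nxt);
    }
    return res;
}

//Grow a ball of radius <= maxRadius around each end; return a path length
//once the two balls touch, or -1 if they never do.
int meetInTheMiddle(ull start,ull target,int maxRadius=25){
    std::unordered_map<ull,int> distS{{start,0}},distT{{target,0}};
    std::vector<ull> frontS{start},frontT{target};
    auto step=[&](std::vector<ull>&front,std::unordered_map<ull,int>&mine,
                  const std::unordered_map<ull,int>&other,int r)->int{
        std::vector<ull> next;
        for(ull c:front)for(ull n:neighbors(c)){
            if(mine.count(n)) continue;
            mine[n]=r;
            auto it=other.find(n);
            if(it!=other.end()) return r+it->second;//the balls touched
            next.push_back(n);
        }
        front=std::move(next);
        return -1;
    };
    for(int r=1;r<=maxRadius;++r){
        int d=step(frontS,distS,distT,r);if(d>=0) return d;
        d=step(frontT,distT,distS,r);if(d>=0) return d;
    }
    return -1;
}
```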

So the bonus problem is easier to solve in this sense; if the path length were just a little longer, this method would not be feasible.
My randomized algorithm, on the contrary, finds a solution from any starting configuration very fast, but it's very hard to get the shortest path out of it. It's pretty good for the main problem, though: it finds a solution of length $\leq 150$ pretty fast.

Now consider the general magic squares; the only difference is how we generate them. Apparently there are 7040 of them, or 880 if we exclude reflections and rotations. We could take a list that's available on the internet, like here, and apply the rotations and reflections to generate the others, or we could solve the 10 linear equations, 9 of which are independent, and iterate through the permutations of the 7 independent elements. Because of the symmetry, we can fix one element, say 1, to 3 positions, so the total number of choices to consider is $3\cdot 15\cdot 14\cdot 13\cdot 12\cdot 11\cdot 10$. Then we apply the reflections and rotations to get all 7040 matrices. After we have all of them, the rest is pretty much the same.
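(As a sanity check on those counts, here's a small brute-force sketch - not the linear-equation approach described above - that enumerates all 4×4 magic squares with entries 0..15 and magic sum 30 by row-major backtracking. It should print 7040, and takes a few seconds:)
```cpp
#include <cstdio>
const int S=30;//0+1+...+15 = 120, so 30 per row/column/diagonal
int g[16];bool used[16];long long cnt=0;
void rec(int p){
    if(p==16){//rows and columns already checked; verify the two diagonals
        if(g[0]+g[5]+g[10]+g[15]==S && g[3]+g[6]+g[9]+g[12]==S) ++cnt;
        return;
    }
    for(int v=0;v<16;++v){
        if(used[v]) continue;
        g[p]=v;
        if(p%4==3 && g[p-3]+g[p-2]+g[p-1]+g[p]!=S) continue;//completed row
        if(p>=12 && g[p-12]+g[p-8]+g[p-4]+g[p]!=S) continue; //completed column
        used[v]=true; rec(p+1); used[v]=false;
    }
}
int main(){rec(0);printf("%lld magic squares\n",cnt);return 0;}
```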

Tuesday, November 7, 2023

IBM Ponder This 2023-09 solution

The problem can be found here.
Since the solution has been posted on the website, here's my solution:
Q1 is simple; just compute the sequence directly:
```cpp
#include <bits/stdc++.h>
using namespace std;
int gcd(int a,int b){
    if(a<b) swap(a,b);
    while(b){
        a=a%b;
        swap(a,b);
    }
    return a;
}
int main(){
    int a=531;
    //for(int i=2;i<1000;++i) {int d=gcd(i,a);cout<<d<<", ";a=a+d;}
    int cnt=0, i=2;
    while(cnt<10) {int d=gcd(i,a);if(d==5){++cnt;cout<<"(i="<<i<<",a="<<a<<"),";} a=a+d;i++;}
    return 0;
}
```
The second question takes a little observation, though. It's easy to see that the result must be an odd number; 9 and 15 are ruled out by a short argument, and 21 works. Working backwards, it's not hard to find that when $a_1=801$, $d_{21}=21$. (Somehow the solution page omitted the answer to this question. Probably an oversight.)
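(A quick check of that claim, with the same iteration as in the code above - $d_n=\gcd(n,a_n)$, then $a$ grows by $d_n$ - starting from $a_1=801$:)
```cpp
#include <bits/stdc++.h>
using namespace std;
int main(){
    long long a=801;//a_1
    for(int n=2;n<=21;++n){
        long long d=__gcd((long long)n,a);//d_n
        if(n==21) cout<<"d_21="<<d<<endl;//prints d_21=21
        a+=d;
    }
    return 0;
}
```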

The bonus question is much harder to solve without the right tools, because it requires factoring many very large integers. Here's the code that I used to solve it:
```python
import primefac as pf
import os
from datetime import datetime
file=open("result2309.txt","a")
start=datetime.now()
print("starting time:",start)
file.write("starting time:"+str(start)+"\n")
def gcd(a,b):
    while(b>0):
        a=a%b
        a,b=b,a
    return a
a=531
i=2
cnt=0
while(cnt<200):
  g=gcd(a,i)
  if g!=1:
    if g==5:
      cnt+=1
      print(f"#{cnt}: i={i},a={a}")
      file.write(f"#{cnt}: i={i},a={a}\n")
      file.flush()
      os.fsync(file)
    a+=g
    i+=1
    continue
  # While gcd(a,i)==1, both a and i advance by 1, so n=a-i stays constant.
  # The next nontrivial gcd happens at the smallest index i+d such that some
  # prime p dividing n also divides i+d, i.e. d=p-(i%p); find the prime that
  # gives the smallest jump.
  n=a-i
  m=a
  d=0
  f=0
  factors=pf.primefac(n, verbose=False,methods=(pf.ecm,))
  for factor in factors:
    d=factor-(i%factor)
    if d<m:
      m=d
      f=factor
  if f==5:
    cnt+=1
    print(f"#{cnt}: i={i},a={a}")
    file.write(f"#{cnt}: i={i},a={a}\n")
    file.flush()
    os.fsync(file)
  # jump m steps ahead (the gcd stays 1 along the way), then apply the
  # nontrivial step at the new index: a grows by the factor f, i by 1
  a+=m
  i+=m
  #print(f"next one: (i={i},a={a})")
  a+=f
  i+=1
end=datetime.now()
print("ending time:",end)
print(str(end-start))
file.write("ending time:"+str(end)+"\n")
file.write(str(end-start)+"\n")
file.close()
```
It requires the package "primefac" to be installed.
The "ecm" method (elliptic curve method) is much faster than the default "pollardrho_brent" method for extremely large numbers: the default method took around 10 hours, while ECM took about 17 minutes.

I'll update the logic behind the code when I have more time... Or you may find the paper linked on their solution page helpful.

Sunday, October 29, 2023

Probability of line segments intersecting - A surprisingly simple result

I was drawing random stuff on a piece of paper when suddenly this problem came to mind:
If I draw a length 1 line segment randomly, then draw another one, what's the probability that they'll intersect?
Of course this description is ambiguous. To give a better description, we need to define things more precisely. First, we should limit the area in which we can draw the line segments. Second, we should define what "drawing a line segment randomly" means. To avoid annoying boundary situations, let's use periodic boundary conditions. But then the size can't be too small, otherwise we may encounter complicated situations like crossing multiple times. Apparently, if the second line segment intersects the first, both endpoints of the second line segment must lie within the envelope curve at distance 1 from the first line segment. This shape is called a "stadium", with $a=r=1$ in our case. If we choose a rectangular area $w\times l$ with $w\geq 2$ and $l\geq 2$, multiple intersections won't happen. The formal definition of the problem is as follows:
Consider a rectangular area of size $w\times l$ with periodic boundary conditions, where $w\geq 2$ and $l\geq 2$. Draw two line segments of length 1 at random. Here, "at random" means: first choose a point uniformly over the rectangular area, then choose a second point uniformly on the circle of radius 1 centered at the first point. These two points are the endpoints of the segment. Question: what's the probability that the two line segments intersect?
Let's solve this problem! Because of the periodic boundary conditions, translations don't matter, so we can move the first line segment to the center of the rectangle. The probability density is only nonzero if the second line segment lies within the stadium centered on the first line segment. It doesn't matter how the stadium is oriented, so we can assume the first line segment lies horizontally. To find the probability, we integrate, over all possible positions of the first endpoint of the second segment, the probability that the second segment intersects the first; the area outside the stadium has zero probability density. By the symmetry of the shape, we only need to consider a quarter of the stadium and multiply the answer by 4. The situation differs depending on which part of the stadium the first endpoint is in; we can split it into 4 parts:
(Actually, $p_1$ and $p_3$ can be combined in the calculation in polar coordinates; I started in Cartesian, so I didn't notice until later.) The probability is $p=\frac{1}{A}\times 4\times\frac{1}{2\pi}(p_1+p_2+p_3+p_4)$, where

$$p_1=\int_0^1 r\,dr\int_0^{\frac{\pi}{2}}\left(\theta-\sin^{-1}(r\sin\theta)\right)d\theta$$

$$p_2=\int_{\frac{1}{2}}^1 dx\int_0^{\sqrt{1-x^2}}\left(\tan^{-1}\frac{x}{y}+\tan^{-1}\frac{1-x}{y}\right)dy=\frac{1}{2}\int_0^1 r\,dr\int_{\sin^{-1}\frac{r}{2}}^{\frac{\pi}{2}}\left(\theta+\tan^{-1}\frac{1-r\sin\theta}{r\cos\theta}\right)d\theta$$

$$p_3=\int_0^1 r\,dr\int_0^{\sin^{-1}\frac{r}{2}}\left(\theta+\cos^{-1}(r\cos\theta)\right)d\theta$$

$$p_4=\int_0^{\frac{1}{2}}dx\int_{\sqrt{1-x^2}}^1 2\cos^{-1}y\,dy$$

(We may combine $p_1$ and $p_3$ and just calculate $\int_0^1 r\,dr\int_{-\frac{\pi}{2}}^{\sin^{-1}\frac{r}{2}}\left(\theta+\cos^{-1}(r\cos\theta)\right)d\theta$ instead.) I'm not sure whether Cartesian or polar coordinates make these easier to integrate by hand... I put them into Wolfram Alpha, and here are the results: $p_1=\frac{1}{4}$, $p_2=0.583664$, $p_3=0.155138$, $p_4=0.011197=\frac{1}{72}(27 - 3 \sqrt 3 \pi - \pi^2)$. Adding them together, the sum is... 0.999999. Well, this can't be a coincidence, can it? I'll work out the exact result when I have more time... So the probability is $\frac{2}{\pi A}$, where $A$ is the area that the line segment "occupies". I quickly wrote a program to simulate the problem in a $3\times 3$ square; averaging 100000000 runs, the estimated probability is 0.0707416, while the theoretical value is $\frac{2}{9\pi}=0.07073553$. This reminds me of Buffon's needle problem, which has probability $\frac{2l}{\pi t}$, where $l$ is the length of the needle and $t$ is the spacing between the lines, $l\leq t$. Our result can be taken as a generalization of Buffon's needle problem. It can be described as:
Assume there is a uniformly random distribution of length 1 line segments on the 2D plane with density $\frac{1}{A}$. More precisely: the midpoints of the segments are uniformly distributed on the plane with density $\frac{1}{A}$, and the angles are uniformly distributed from 0 to $2\pi$. Then, if we draw another line segment of length 1 at random, as described at the beginning, the expected number of intersections between this segment and the segments in the "background" is $\frac{2}{\pi A}$.
I wonder if this can be further generalized. An obvious first step is to change the length of the line segments, which will probably give something similar to the result of Buffon's needle problem. How else can we generalize this?

2023.10.30 Update: I think Buffon's needle problem can indeed be seen as a special case of this problem. If we view each line as an infinite chain of length 1 segments, and the distance between the lines is $t$, then each segment occupies an area $t$, so the expected number of intersections between a length 1 needle and the lines is $\frac{2}{\pi A}=\frac{2}{\pi t}$. By the additivity of expectation, a length $l$ needle has expectation $\frac{2l}{\pi t}$. When $l\leq t$, there can be at most one intersection, so this is also the probability of intersecting.

Regarding the integrals, Claude Leibovici gives exact expressions for them, except for $p_3$, which is still based on (admittedly very strong) numerical evidence. It looks very complicated, so I thought it might be easier to integrate in the other order. If we integrate over $r$ first, the interval of $\theta$ is $[0,\frac{\pi}{6}]$ and the interval of $r$ is $[2\sin\theta,1]$. The integral is

$$p_3=\int_0^{\frac{\pi}{6}}d\theta\int_{2\sin\theta}^1 \left(\theta+\cos^{-1}(r\cos\theta)\right)r\,dr$$
$$=\int_0^{\frac{\pi}{6}}\frac{\theta}{2}(1-4\sin^2\theta)+\left[\frac{r^2\cos^2\theta}{2}\cos^{-1}(r\cos\theta)+\frac{1}{4}\sin^{-1}(r\cos\theta)-\frac{r\cos\theta}{4}\sqrt{1-r^2\cos^2\theta}\right]_{2\sin\theta}^1\frac{1}{\cos^2\theta}d\theta$$
$$=\int_0^{\frac{\pi}{6}}\frac{\theta}{2}(1-4\sin^2\theta)+\left(\theta\frac{\cos^2\theta}{2}+\frac{1}{4}\left(\frac{\pi}{2}-\theta\right)-\frac{\sin\theta\cos\theta}{4}-\frac{\sin^2 2\theta}{2}\left(\frac{\pi}{2}-2\theta\right)-\frac{1}{4}(2\theta)+\frac{\sin 2\theta \cos 2\theta}{4}\right)\frac{1}{\cos^2\theta}d\theta$$

I got lazy again and just fed it to Wolfram Alpha. This time it gives the exact result, $\frac{1}{36}(9+3\sqrt 3\pi-2\pi^2)$. Combined with the result $p_2=\frac{1}{72}(9-3\sqrt 3\pi+5\pi^2)$ and the other two known results, I think the proof that the probability equals $\frac{2}{\pi A}$ is complete.
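(For reference, here's a minimal Monte Carlo sketch of the kind of simulation mentioned above - not the original program I used. Wraparound is handled by testing the 8 periodic images of the second segment, which suffices since the box side is at least 2:)
```cpp
#include <cstdio>
#include <cmath>
#include <random>
using namespace std;
struct P{double x,y;};
double cross(P a,P b,P c){return (b.x-a.x)*(c.y-a.y)-(b.y-a.y)*(c.x-a.x);}
bool intersects(P a,P b,P c,P d){//proper crossing test; touching has measure zero
    return cross(a,b,c)*cross(a,b,d)<0 && cross(c,d,a)*cross(c,d,b)<0;
}
int main(){
    const double L=3.0;//3x3 periodic box
    mt19937_64 rg(20231029);
    uniform_real_distribution<double> pos(0.0,L),ang(0.0,2*M_PI);
    long long hits=0,N=10000000;//10M samples here (the post averaged 100M)
    for(long long t=0;t<N;++t){
        P a{pos(rg),pos(rg)};double u=ang(rg);P b{a.x+cos(u),a.y+sin(u)};
        P c{pos(rg),pos(rg)};double v=ang(rg);P d{c.x+cos(v),c.y+sin(v)};
        bool hit=false;
        for(int i=-1;i<=1&&!hit;++i)for(int j=-1;j<=1&&!hit;++j){
            P c2{c.x+i*L,c.y+j*L},d2{d.x+i*L,d.y+j*L};//periodic image
            hit=intersects(a,b,c2,d2);
        }
        hits+=hit;
    }
    printf("estimate=%.7f, theory 2/(9*pi)=%.7f\n",(double)hits/N,2/(9*M_PI));
    return 0;
}
```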

Wednesday, October 18, 2023

The long journey from O(n^2) to O(n^1.5), Leetcode H1269. Number of Ways to Stay in the Same Place After Some Steps

H1269 description:
You have a pointer at index 0 in an array of size arrLen. At each step, you can move 1 position to the left, 1 position to the right in the array, or stay in the same place (The pointer should not be placed outside the array at any time). Given two integers steps and arrLen, return the number of ways such that your pointer is still at index 0 after exactly steps steps. Since the answer may be too large, return it modulo $10^9 + 7$. Constraints: $1\leq steps\leq 500$, $1\leq arrLen\leq 10^6$
An $O(steps\cdot\min(steps,arrLen))$ time, $O(\min(steps,arrLen))$ space solution is not hard to find; it's the typical approach and is featured in the Editorial.

Basically, we build up a table from the first step to the last, where each row records the number of ways to be at each position after that step. It's easy to see that one can move at most $steps/2$ positions to the right, so if $arrLen>steps/2$ we only need to track $steps/2+1$ positions. And since the counts after each step only depend on the counts at the previous step, one can use just two arrays and swap them after each step.

Here is a slightly improved version based on that approach:
```cpp
#pragma GCC target("avx,mmx,sse2,sse3,sse4")
auto _=[]()noexcept{ios::sync_with_stdio(0);cin.tie(0);cout.tie(0);return 0;}();
unsigned int cnt[251];
const int mod=1000000007;
class Solution {
public:
    int numWays(int steps, int arrLen){
        if(arrLen<=2){
            if(arrLen==1) return 1;
            int p=(steps-1)/63,q=(steps-1)%63;
            unsigned long long r=(1ULL<<q)%mod,c=(1ULL<<63)%mod;
            while(p--) r=(r*c)%mod;
            return r;
        }
        int b=min(steps/2+1,arrLen),tmp2;
        unsigned long long tmp1;
        memset(cnt,0,b*sizeof(cnt[0]));
        cnt[0]=1;
        for(int i=0;i<steps;++i){
            tmp1=cnt[0];
            cnt[0]=(cnt[0]+cnt[1])%mod;
            int s=min(min(i+1,steps-i),b-1);
            for(int j=1;j<s;++j){tmp2=cnt[j]; cnt[j]=(tmp1+cnt[j]+cnt[j+1])%mod; tmp1=tmp2;}
            cnt[s]=(tmp1+cnt[s])%mod;
        }
        return cnt[0];
    }
};
```
Since each new count only depends on 3 array elements from the previous step - a fixed number - we don't really need a second array: just use two integers to hold the two old elements and roll them forward while updating a single array in place. This saves half the space.

Also, the relevant elements of the table form a triangle (the first row has one non-zero element, the second has two, ...; after the halfway point we must head back, so the farther positions become irrelevant and there's no need to compute them - the second-to-last row has two relevant elements, the last has one), so we can skip those unnecessary calculations, which saves at most half of the time.

When $arrLen==2$, the result has a very simple form, $2^{steps-1}$, so we just need $2^{steps-1} \bmod (10^9+7)$. A power function takes $O(\log n)$, but since $steps\leq 500$, if we factor $2^{steps-1}$ into powers of $2^{63}$ times a remainder, it can be done in at most 8 multiplications, which is probably more efficient than a general power function.

This is an optimization to the typical approach. So far so good.

But, can this be optimized further?

Let's see what actually happens when we evolve the array after each step. $a_{j+1}[0]= a_j[0]+a_j[1]$, $a_{j+1}[1]= a_j[0]+a_j[1]+a_j[2]$, ...
If we write the array as a column vector, the new vector is the old one multiplied by a matrix:
$$\mathbf a_{j+1}=\mathbf H \mathbf a_j$$
where $\mathbf H$ is
```
1 1 0 0 0 ... 0
1 1 1 0 0 ... 0
0 1 1 1 0 ... 0
0 0 1 1 1 ... 0
...
0 ... 0 1 1 1 0
0 ... 0 0 1 1 1
0 ... 0 0 0 1 1
```
After $s$ operations, the resulting vector is $\mathbf a_s=\mathbf H^s\mathbf a_0$, where $\mathbf a_0$ is a vector whose first element equals one and the rest zero. The answer to the problem is just $\mathbf a_s[0]$, which equals the product of the first row of $\mathbf H^s$ and $\mathbf a_0$. Since the only non-zero element of $\mathbf a_0$ is the first element which is one, the answer is just $\mathbf H^s[0][0]$.

A simple implementation of matrix multiplication has $O(n^3)$ complexity. The Strassen algorithm brings it down to $O(n^{\log_2 7})\approx O(n^{2.81})$, and recent research has brought the bound as low as $O(n^{2.372})$, but those are considered galactic algorithms. The matrix $\mathbf H$ is at most 500×500; even the not-so-complicated Strassen algorithm probably wouldn't accelerate it much. Furthermore, the cost must be multiplied by $O(\log s)$ for computing the power $s$. So this does not improve the performance at all.

But noticing that the matrix is a symmetric tridiagonal Toeplitz matrix, maybe we can exploit this symmetry and find a better algorithm.

A Hermitian matrix is always diagonalizable: $\mathbf H=\mathbf P \mathbf D \mathbf P^\dagger$, where $\mathbf D$ is a diagonal matrix whose entries are the eigenvalues of $\mathbf H$, and $\mathbf P$ is a unitary matrix, $\mathbf P^{-1}=\mathbf P^\dagger$. If $\mathbf H$ is real symmetric, $\mathbf P$ is orthogonal, $\mathbf P^{-1}=\mathbf P^T$. This property makes it very easy to calculate powers of Hermitian/symmetric matrices, since all the $\mathbf P$ and $\mathbf P^{-1}$ in the middle cancel out: $\mathbf H^s=\mathbf P \mathbf D^s \mathbf P^\dagger$, and $\mathbf D^s$ is very easy to calculate - just raise every diagonal entry to the power $s$.

This property alone doesn't help much, since eigendecomposition typically has the same complexity as matrix multiplication. The only thing it improves is that the power becomes easy, so the $\log s$ factor is gone. That's not good enough.
  
But this matrix is not only symmetric, it's also Toeplitz. A Toeplitz matrix can be decomposed in $O(n^2)$. Better yet, a tridiagonal Toeplitz matrix has known eigenvalues, namely $a+2\sqrt{bc}\cos\left(\frac{k\pi}{n+1}\right), k=1,\dots,n$, where $a=b=c=1$ in our case.

Since $\mathbf D^s$ is diagonal, it's easy to see that $\mathbf H^s[0][0]=\sum_i \mathbf P[0][i] \mathbf D^s[i][i] \mathbf P^T[i][0] = \sum_i \mathbf P[0][i]^2 \mathbf D^s[i][i]$. We already have the eigenvalues; if we can also find the first row of the matrix $\mathbf P$ without actually decomposing the matrix, we'd have an $O(n)$ algorithm!

Here are the eigenvalues and the first rows of $\mathbf P$ for various sizes of $\mathbf H$. In each table below, the first row gives the size, the second the eigenvalues, and the third the first row of the corresponding $\mathbf P$:
(If the table below is all messed up, try refreshing. For speed (I guess) or whatever reason, sometimes browsers don't load the scripts in order, which messes up the table.)
| 3*3 |  |  |
|:--:|:--:|:--:|
| $1+\sqrt{2}$ | $1$ | $1-\sqrt{2}$ |
| $1/2$ | $1/\sqrt{2}$ | $1/2$ |

|4*4| | | |
|:--:|:--:|:--:|:--:|
|$(3+\sqrt{5})/2$|$(1+\sqrt{5})/2$|$(1-\sqrt{5})/2$|$(3-\sqrt{5})/2$|
|$1/\sqrt{5+\sqrt{5}}$|$1/\sqrt{5-\sqrt{5}}$|$1/\sqrt{5+\sqrt{5}}$|$1/\sqrt{5-\sqrt{5}}$|

|5*5| | | | |
|:--:|:--:|:--:|:--:|:--:|
|$1+\sqrt{3}$|$2$|$1$|$1-\sqrt{3}$|$0$|
|$1/2\sqrt{3}$|$1/2$|$1/\sqrt{3}$|$1/2\sqrt{3}$|$1/2$|

|6*6| | | | | |
|:--:|:--:|:--:|:--:|:--:|:--:|
|2.80194|2.24698|1.44504|-0.801938|0.554958|-0.24698|
|0.231921|-0.417907|0.521121|0.231921|0.521121|0.417907|

|7*7| | | | | | |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|2.84776|2.41421|1.76537|1|-0.847759|-0.414214|0.234633|
|0.191342|0.353553|0.46194|0.5|0.191342|0.353553|0.46194|
Noticing the symmetry of the elements of $\mathbf P$ and the corresponding eigenvalues, and remembering that the eigenvalues are $1+2\cos\theta$, the pattern is easier to see if we map the eigenvalues back to their $\cos$ values, i.e., subtract 1 and divide by 2:
|3*3|  |  |
|:--:|:--:|:--:|
|$\sqrt{2}/2$|$0$|$-\sqrt{2}/2$|
|$1/2$|$1/\sqrt{2}$|$1/2$|

|4*4| | | |
|:--:|:--:|:--:|:--:|
|$(1+\sqrt{5})/4$|$(-1+\sqrt{5})/4$|$(-1-\sqrt{5})/4$|$(1-\sqrt{5})/4$|
|$1/\sqrt{5+\sqrt{5}}$|$1/\sqrt{5-\sqrt{5}}$|$1/\sqrt{5+\sqrt{5}}$|$1/\sqrt{5-\sqrt{5}}$|

|5*5| | | | |
|:--:|:--:|:--:|:--:|:--:|
|$\sqrt{3}/2$|$1/2$|$0$|$-\sqrt{3}/2$|$-1/2$|
|$1/2\sqrt{3}$|$1/2$|$1/\sqrt{3}$|$1/2\sqrt{3}$|$1/2$|

|6*6| | | | | |
|:--:|:--:|:--:|:--:|:--:|:--:|
|1.80194/2|1.24698/2|0.44504/2|-1.801938/2|-0.44504/2|-1.24698/2|
|0.231921|-0.417907|0.521121|0.231921|0.521121|0.417907|

|7*7| | | | | | |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|1.84776/2|1.41421/2|0.76537/2|0|-1.847759/2|-1.414214/2|-0.76537/2|
|0.191342|0.353553|0.46194|0.5|0.191342|0.353553|0.46194|
I noticed that for the 3*3, 5*5, and 7*7 matrices, the elements corresponding to $0$ (i.e. eigenvalue 1) are $1/\sqrt{2}$, $1/\sqrt{3}$, $1/\sqrt{4}$ respectively, and that for the 4*4 all elements have a factor of $\sqrt{5}$, so there could be a factor of $1/\sqrt{n+1}$. I also guessed that it might have something to do with $\sin\theta$, since the element value increases as the absolute value of the shifted eigenvalue decreases. And I figured it out!
The formula is:
$$\theta_i=\frac{i\pi}{n+1},\quad i=1,2,\dots,n$$
$$e_i=1+2\cos\theta_i$$
$$p_i=\sqrt{\frac{2}{n+1}}\sin\theta_i$$
$$ans=\sum_i e_i^{s}p_i^2,\quad s=steps$$
The following code implements this algorithm. The only problem now is that the result can be very large, so how do we find its remainder $\bmod\ 10^9+7$ accurately?
```cpp
#pragma GCC target("avx,mmx,sse2,sse3,sse4")
auto _=[]()noexcept{ios::sync_with_stdio(0);cin.tie(0);cout.tie(0);return 0;}();
unsigned int cnt[251];
const int mod=1000000007;
class Solution {
public:
    int numWays(int steps, int arrLen){
        if(arrLen<=2){
            if(arrLen==1) return 1;
            int p=(steps-1)/63,q=(steps-1)%63;
            unsigned long long c=(1ULL<<63)%mod,r=(1ULL<<q)%mod;
            while(p--) r=(r*c)%mod;
            return r;
        }
        int b=min(steps/2+1,arrLen),tmp2;

        //new algorithm
        double r=0.0;
        for(int i=1;i<=b;++i){
            double c=cos(M_PI*i/(b+1));
            r+=pow(1+2*c,steps)*(1-pow(c,2));
        }
        r=r*2/(b+1);
        if(r<mod) return round(r);

        //may not work due to rounding errors if result is too large, so back to old algorithm
        unsigned long long tmp1;
        memset(cnt,0,b*sizeof(cnt[0]));
        cnt[0]=1;
        for(int i=0;i<steps;++i){
            tmp1=cnt[0];
            cnt[0]=(cnt[0]+cnt[1])%mod;
            int s=min(min(i+1,steps-i),b-1);
            for(int j=1;j<s;++j){tmp2=cnt[j]; cnt[j]=(tmp1+cnt[j]+cnt[j+1])%mod; tmp1=tmp2;}
            cnt[s]=(tmp1+cnt[s])%mod;
        }
        return cnt[0];
    }
};
```
Simply changing to 
```cpp
        r=remainder(r,mod);
        return round(r);
```
gives numeric errors:
```
Input
steps = 47
arrLen = 38
Output 318653590
Expected 318671228
```
(In this example the result is "r=1.149565e+20", which is accurate only to about 15 significant digits.
That means we can safely take the remainder when the result is less than $10^{15}$, which is about $2^{50}$, so we can change
```cpp
if(r<mod) return round(r);
```
to
```cpp
if(r<(1ULL<<50)) return round(fmod(r,mod));//fmod() avoids negative results that remainder() may return
```
and the result is precise enough.)

So it requires too much precision for very large outputs. But in general, it gives a very good approximation in $O(n)$ complexity. Amazing!

But can we improve the complexity for the original problem? Let's go down this path further and see where it leads us!

If we apply the trivial result that the answer equals 1 when $steps=1$, we get an interesting identity:
$$\sum_i(1+2\cos\theta_i)\sin^2\theta_i=\frac{n+1}{2},$$
where $\theta_i=\frac{i\pi}{n+1}, i=1,2,\dots,n$.
One way to prove it: simply write $\cos\theta_k=(e^{i\theta_k}+e^{-i\theta_k})/2$ and $\sin\theta_k=(e^{i\theta_k}-e^{-i\theta_k})/2i$ and expand the expression; most of the sums cancel out, leaving just the constant $1/2$ summed $n+1$ times, which completes the proof.

There is an easier way to prove it, though. Noticing that $\cos(\pi-\theta)=-\cos\theta$ and $\sin(\pi-\theta)=\sin\theta$, the $\cos\theta_k\sin^2\theta_k$ terms cancel out! We're left with $\sum_k \sin^2\theta_k=\sum_k\frac{1-\cos(2\theta_k)}{2}=\frac{n}{2}-\frac{1}{2}\sum_k\cos(2\theta_k)$. Each $\cos(2\theta_k)$ is the real part of $e^{i2\theta_k}$, with $\theta_k=\frac{k\pi}{n+1}, k=1,\dots,n$, so the sum becomes $\sum_{k=1}^n e^{i\frac{2k\pi}{n+1}}$. Including $k=0$, which contributes an extra 1, would close the full circle and make the sum 0, so the sum must equal $-1$. This gives the expected result $\frac{n+1}{2}$.
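(A quick numeric check of the identity for a few small $n$:)
```cpp
#include <cstdio>
#include <cmath>
int main(){
    for(int n=2;n<=8;++n){
        double s=0;
        for(int i=1;i<=n;++i){
            double th=M_PI*i/(n+1);
            s+=(1+2*cos(th))*sin(th)*sin(th);//summand of the identity
        }
        printf("n=%d: sum=%.12f  (n+1)/2=%.1f\n",n,s,(n+1)/2.0);
    }
    return 0;
}
```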

Can we generalize this to the general case with power $s$? Let's find out!

The function is zero for $\theta=0$, so adding $k=0$ doesn't change the result.
Then, if we also include $\theta_k=\frac{k\pi}{n+1}$ for $k=n+1,n+2,\dots,2n+1$, those terms just repeat the values for $k=0,1,\dots,n$, so the result is doubled.
Applying these changes and expressing the $\sin$ and $\cos$ as exponentials, we get
$$ans=-\frac{2}{n+1}\sum_{k=0}^{2n+1}(e^{i\theta_k}+e^{-i\theta_k}+1)^s(e^{2i\theta_k}+e^{-2i\theta_k}-2)/8$$
If we expand it as a polynomial in $e^{i\theta_k}$, $ans=\sum_k \sum_m a_m e^{im\theta_k}$, most of the terms cancel out after the summation over $k$, since the geometric sum equals $(1-e^{i\frac{(2n+2)m\pi}{n+1}})/(1-e^{i\frac{m\pi}{n+1}})$, which is zero unless $\frac{m}{n+1}$ is a multiple of 2, in which case the sum is $2n+2$ (it's 1 repeated $2n+2$ times).

Thus, only the terms whose power $m$ is a multiple of $2n+2$ contribute to the result, and each of them contributes $a_m(2n+2)$. How do we find $a_m$? The coefficient of the term $e^{im\theta_k}$ in $(e^{i\theta_k}+e^{-i\theta_k}+1)^s$ is easy to find, namely $\sum_{j=0}^{\frac{s-m}{2}}\binom{s}{j}\binom{s-j}{j+m}$; let's denote it by $c(m)$. Multiplying by $e^{2i\theta_k}+e^{-2i\theta_k}-2$ shifts it to the powers $m+2$, $m-2$, and $m$ respectively.
Thus, the result is
$$ans=\frac{1}{2}(\sum_{m} 2 c(m)-c(m+2)-c(m-2))$$
where $c(m)=\sum_{j=0}^{\frac{s-m}{2}}\binom{s}{j}\binom{s-j}{j+m}$.

Then, what are the valid $m$s? Since the power can be at most $s$ in absolute value, and $m$ must be a multiple of $2n+2$, the multiplier must lie within $[-\frac{s}{2n+2},\frac{s}{2n+2}]$; we can just iterate through that range and multiply by $2n+2$. This completes the algorithm.

This algorithm has complexity $O(steps\cdot\max(steps/arrLen,1))$. Interestingly, one might expect the complexity to grow with $arrLen$, but with this algorithm it's the opposite. When $arrLen>steps$, the old algorithm at the beginning takes $O(steps^2)$ while this one takes $O(steps)$! If we combine this algorithm with the old one, which has complexity $O(steps\cdot\min(steps,arrLen))$ - i.e., use the new algorithm for large $arrLen$ and the old one for small $arrLen$, switching at $arrLen>\sqrt{steps}$ - we get an $O(steps^{1.5})$ algorithm!

(Building the binomial table has complexity $O(n^2)$, but it's not included in the complexity above, since it only needs to be built once; over multiple requests the amortized cost can be arbitrarily low. Explicitly, if the function is called $k$ times while the maximum number of steps is $N$, the amortized complexity of table-building is $\frac{N^2}{k}$; if $k\geq \sqrt N$, the amortized complexity is still $O(N^{1.5})$.)

The following code implements this algorithm:
```cpp
#pragma GCC target("avx,mmx,sse2,sse3,sse4")
auto _=[]()noexcept{ios::sync_with_stdio(0);cin.tie(0);cout.tie(0);return 0;}();
unsigned int cnt[251];
const int mod=1000000007;
vector<vector<int>> binom{{1}};
void buildBinom(int n){
    for(int i=binom.size()-1;i<=n;++i){
        vector<int> v;
        v.reserve(i+2);
        v.emplace_back(1);
        const auto& b=binom.back();
        for(int j=0;j<i;++j) v.emplace_back((b[j]+b[j+1])%mod);
        v.emplace_back(1);
        binom.emplace_back(move(v));
    }
}
int getCoe(int s,int j,int t){
    if(j+t<0||s<2*j+t) return 0;
    return (long long)binom[s-j][j+t]*binom[s][j]%mod;
}
class Solution {
public:
    int numWays(int steps, int arrLen){
        if(arrLen<=2){
            if(arrLen==1) return 1;
            int p=(steps-1)/63,q=(steps-1)%63;
            unsigned long long c=(1ULL<<63)%mod,r=(1ULL<<q)%mod;
            while(p--) r=(r*c)%mod;
            return r;
        }
        if(steps==1) return 1;
        if(arrLen>steps||arrLen*arrLen>steps){
            buildBinom(steps);
            long long t=0;
            int tmp=2*arrLen+2;
            for(int m=-steps/tmp*tmp;m<=steps;m+=tmp){
                int upper=(steps-m)/2+2;
                for(int j=0;j<=upper;++j){
                    t=(t+2*getCoe(steps,j,m)-getCoe(steps,j,m+2)-getCoe(steps,j,m-2))%mod;
                }
            }
            return ((t*500000004)%mod+mod)%mod;
        }
        int tmp2;
        unsigned long long tmp1;
        memset(cnt,0,arrLen*sizeof(cnt[0]));
        cnt[0]=1;
        for(int i=0;i<steps;++i){
            tmp1=cnt[0];
            cnt[0]=(cnt[0]+cnt[1])%mod;
            for(int j=1;j<arrLen-1;++j){tmp2=cnt[j]; cnt[j]=(tmp1+cnt[j]+cnt[j+1])%mod; tmp1=tmp2;}
            cnt[arrLen-1]=(tmp1+cnt[arrLen-1])%mod;//last valid position is arrLen-1
        }
        return cnt[0];
    }
};
```
Ah, one more thing: the number 500000004 here acts as $2^{-1}$, because we can't simply divide $t$ by 2 under the modulus. To divide modulo $q$, we must multiply by the modular inverse of the element - indeed $2\times 500000004 = 10^9+8 \equiv 1 \pmod{10^9+7}$. Now the $O(n^{1.5})$ algorithm is complete.

To get rid of this $2^{-1}$, notice the symmetry of the coefficients, $c(m)=c(-m)$: the terms corresponding to $m$ and $-m$ contribute the same amount. With this, the middle part can be written as
```cpp
        if(arrLen>steps||arrLen*arrLen>steps){
            buildBinom(steps);
            long long t=0;
            for(int j=0;j<=steps/2;++j) t=(t+getCoe(steps,j,0)-getCoe(steps,j,2))%mod;
            int tmp=2*arrLen+2;
            for(int m=tmp;m<=steps;m+=tmp){
                int upper=(steps-m)/2+2;
                for(int j=0;j<=upper;++j){
                    t=(t+2*getCoe(steps,j,m)-getCoe(steps,j,m+2)-getCoe(steps,j,m-2))%mod;
                }
            }
            return (t+mod)%mod;
        }
```
which also saves half of the time.


I really enjoyed the journey to this surprising result. It reminds me of the first time I learned about the Strassen algorithm - the fact that "this is achievable!" and seeing it unfold in front of me is so satisfying!

(ps: I just learned that the solution for the case $arrLen\geq steps/2+1$ is given by the Motzkin numbers, which satisfy the recurrence $M_n=\frac{2n+1}{n+2}M_{n-1}+\frac{3n-3}{n+2}M_{n-2}$. This gives an $O(n)$ algorithm for this special case without the need to build the binomial table. The only downside is that we must find $(i+2)^{-1} \bmod (10^9+7)$ for $i=1,\dots,steps$, which takes about 30 operations each by computing $(i+2)^{10^9+5}$. But it's still a really good algorithm. I wonder if this can be generalized to the "correction terms" above, i.e., whether there is a recurrence relation for $M_{n,m}=c(m)-(c(m-2)+c(m+2))/2$; the Motzkin numbers are the $M_{n,0}$ here. If so, the binomial table would not be necessary at all.)
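(Here's a minimal sketch of that Motzkin-recurrence idea, with the inverses computed via Fermat's little theorem; for $arrLen\geq steps/2+1$ the answer should equal $M_{steps}\bmod (10^9+7)$:)
```cpp
#include <cstdio>
typedef long long ll;
const ll MOD=1000000007;
ll powmod(ll a,ll e){ll r=1;a%=MOD;for(;e;e>>=1){if(e&1)r=r*a%MOD;a=a*a%MOD;}return r;}
//M_0=1, M_1=1, M_n=((2n+1)M_{n-1}+(3n-3)M_{n-2})/(n+2), division done mod p
ll motzkin(int n){
    if(n==0) return 1;
    ll m2=1,m1=1;
    for(int i=2;i<=n;++i){
        ll m=((2*i+1)*m1%MOD+(ll)(3*i-3)*m2%MOD)%MOD*powmod(i+2,MOD-2)%MOD;
        m2=m1;m1=m;
    }
    return m1;
}
int main(){//first few Motzkin numbers: 1,1,2,4,9,21,51,...
    for(int n=0;n<=6;++n) printf("M_%d=%lld\n",n,motzkin(n));
    return 0;
}
```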

Wednesday, August 2, 2023

Leetcode M808 Soup Servings Convergence Analysis

M808 description:
There are two types of soup: type A and type B. Initially, we have $n$ ml of each type of soup. There are four kinds of operations: 1. Serve 100 ml of soup A and 0 ml of soup B, 2. Serve 75 ml of soup A and 25 ml of soup B, 3. Serve 50 ml of soup A and 50 ml of soup B, and 4. Serve 25 ml of soup A and 75 ml of soup B. When we serve some soup, we give it to someone, and we no longer have it. Each turn, we will choose from the four operations with an equal probability 0.25. If the remaining volume of soup is not enough to complete the operation, we will serve as much as possible. We stop once we no longer have some quantity of both types of soup. Note that we do not have an operation where all 100 ml's of soup B are used first. Return the probability that soup A will be empty first, plus half the probability that A and B become empty at the same time. Answers within $10^{-5}$ of the actual answer will be accepted. Constraints: $0\leq n\leq 10^9$
It's not difficult to find the algorithm, if one notices the recursive nature of the problem: State $(A,B) \rightarrow (A-100,B)$ or $(A-75,B-25)$ or $(A-50,B-50)$ or $(A-25,B-75)$.
But the upper bound of $n$ is $10^9$. Are we supposed to build a $10^9$ by $10^9$ table? How much time would that take?

It's easy to see that we can simplify the conditions a little. If the quantity is not a multiple of 25, the situation is the same as for the next multiple of 25, so let's convert $n$ to the ceiling of its ratio to 25 (e.g. $n=301$ becomes 13).
Now we can change the 4 operations to: $(m,n) \rightarrow (m-4,n-0)$ or $(m-3,n-1)$ or $(m-2,n-2)$ or $(m-1,n-3)$.
So the desired probability from the state $(m,n)$ satisfies $p(m,n)=\frac{1}{4}(p(m-4,n-0)+p(m-3,n-1)+p(m-2,n-2)+p(m-1,n-3))$.
The base cases are, $p(m,n)=\begin{cases} 0 & \text{if $m>0,n\leq0$}\\ 1 & \text{if $m\leq 0,n>0$}\\ 0.5 & \text{if $m\leq 0,n\leq 0$}\end{cases}$.

We could apply this recurrence recursively until we reach one of the base cases, but that would be very inefficient. The first important thing is to memoize the intermediate results. We can do that in multiple ways; I chose the bottom-up approach, i.e. just build the entire table. And since the conditions don't change, once I have the table I can simply look up the value for any input.

But still, that would be a $4\times 10^7$ by $4\times 10^7$ table. Can we do better than that?
Here's the second crucial thing to notice: the larger $n$ is, the higher the probability that soup A will be empty first, because at each step we expect to serve more soup A than soup B, so soup A should decrease faster than soup B. If the probability keeps increasing with $n$, we can stop at the first $n$ such that $p(n,n)>1-10^{-5}$: for larger inputs, the difference between the result and 1 is always less than $10^{-5}$, so we can simply return 1!

The following program, which I submitted, is based on this intuition. It took 0ms, pretty fast.
```cpp
vector<vector<double>> tab;
bool first=true;
int oneTh=INT_MAX;
double check(int i,int j){
    if(i<0) return (1.0+(j>=0))/2;
    if(j<0) return 0;
    return tab[i][j];
}
double avg(int i,int j){
    if(i==-1){if(j==-1) return 0.5;return 1.0;}
    if(j==-1) return 0.0;
    return 0.25*(check(i-4,j)+check(i-3,j-1)+check(i-2,j-2)+check(i-1,j-3));
}
void init(){
    first=false;
    for(int n=1;n<=40000000;++n){
        for(int i=0;i<n-1;++i) tab[i].emplace_back(avg(i,n-1));
        if(1.0-avg(n-1,n-1)<0.00001){oneTh=n;break;}
        vector<double> tmp;
        for(int i=0;i<n;++i) tmp.emplace_back(avg(n-1,i));
        tab.emplace_back(move(tmp));
        // cout<<"n="<<n<<endl;
        // for(auto& v:tab){
        //     for(auto e:v) cout<<e<<",";
        //     cout<<endl;
        // }
    }
}
class Solution {
public:
    double soupServings(int n) {
        if(first) init();
        if((n=(n+24)/25)>=oneTh) return 1.0;
        if(n==0) return 0.5;
        return tab[n-1][n-1];
    }
};
```
(In the submission page, all the sample submissions that take less than 100ms use some magic number like "if(n>5000)" or "if(n>10000)", but the threshold could be easily determined during the computation.)
The main point of this article is not the program, though, so let's go back to the assumption. It's based on our intuition, but can we prove it rigorously? Let's think about it!
The editorial of the problem on Leetcode invokes the law of large numbers. But the fact that the probability converges to 1 doesn't justify the claim that after we first encounter a value $\geq 1-0.00001$, all the values afterwards satisfy it too.
Consider the function $f(x)=e^{-x/100}\cos^2(x/100)$: it converges to 0 as $x\rightarrow\infty$, but that doesn't mean we can stop the first time we encounter a value below the criterion, because the values keep increasing and decreasing alternately! So what we really need is a proof that the sequence increases monotonically, or an upper bound such that even if it starts to increase again, it never exceeds that bound.

The probabilities that we calculate form a table, which we can represent as a grid graph. To make it easier to see, let's rotate it by $45^{\circ}$:
The value at each cell is the average of 4 other cells, and if those cells are outside the grid, the values are 0, 1, or 0.5 respectively in the regions shown in the figure:
(From here, we immediately find two optimizations to the program. First, it's unnecessary to calculate half of the numbers in the grid: because we only need the values at $(n,n)$, all the elements we'll ever use are either at $(odd,odd)$ cells or $(even,even)$ ones. Second, we only need one layer of values to construct the next layer, so instead of saving the entire table we could use just two vectors for the current and next layers, which saves a lot of space - do that for all odd $n$s, then again to fill in the even ones. Another possible optimization: to find the last $p(n,n)$, we only need the values of the cells between these two lines, so we could start shrinking the vectors halfway through, which cuts the space by half again - but only if we already know the last value of $n$, and it would be a little more complicated to implement.)

(From here on, "next layer" means two rows downwards, e.g. from coordinate (1,1) to (3,3) is one layer down.)

What we want is a proof that the values at the cells in the middle increase monotonically as we go downward, or an upper bound on their distance from 1 as a function of the number of steps.

Since we've seen the law of large numbers, maybe we can continue down that path?
The central limit theorem states that the suitably rescaled average of i.i.d. variables converges in distribution to a normal distribution.
The problem can be seen as a random walk: we start at the origin, and at each step we move 1 left, stay put, move 1 right, or move 2 right, each with probability 1/4; we want the probability of being at location $l$ after $n$ steps. The expectation of a single step is $\mu_0=(-1+0+1+2)/4=1/2$, and the variance is $\sigma_0^2=(1+0+1+4)/4 - 1/4=5/4$.

If we start from $(n,n)$, we move out of the grid in at most $\frac{n}{2}$ steps, since each serving removes 4 units in total. By then, our final location lies in $[-\frac{n}{2},n]$, and the expectation of the sum is $\mu=\frac{n}{4}$. Let's mark these on the figure:
If you're thinking "wait a minute - once we reach either one of the boundaries, we can't keep going, so how can we map the probability distribution along the boundaries onto the line?", that's good thinking! Indeed, it's not a one-to-one map. But we only care about the sum of the probabilities, and notice that if our final location is in the 0 or 1 region, we must have come through the corresponding boundary, and once we've crossed a boundary, we never leave that region. That means the sums of the probabilities from $-\frac{n}{2}$ to 0 and from 0 to $n$ correspond to the probabilities of B empty and A empty respectively. There is a discrepancy, though: if we reach the first 3 rows inside the grid, we may enter the 0.5 region. But as $n$ increases, that probability converges to zero.
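As a quick sanity check on all of this, here's a throwaway Monte Carlo sketch of mine (not part of any solution): simulate the process directly with the scaled servings $(4,0),(3,1),(2,2),(1,3)$, counting a simultaneous-empty outcome as 0.5:
```cpp
#include <bits/stdc++.h>
using namespace std;

int main(){
    mt19937 rng(12345);
    uniform_int_distribution<int> op(0, 3);
    const int da[4] = {4,3,2,1}, db[4] = {0,1,2,3}; // the four serving operations
    for(int n : {1, 8, 40}){                        // scaled remaining amounts
        const int trials = 200000;
        double hit = 0;
        for(int t = 0; t < trials; ++t){
            int a = n, b = n;
            while(a > 0 && b > 0){ int k = op(rng); a -= da[k]; b -= db[k]; }
            hit += (a <= 0) ? ((b <= 0) ? 0.5 : 1.0) : 0.0; // A empty first, or tie
        }
        printf("n=%2d  p ~ %.4f\n", n, hit/trials);
    }
}
```
For $n=1,8,40$ this lands near 0.625, 0.7969, and 0.9766, matching tab[0][0], tab[7][7], and tab[39][39] below.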

So, let's compute the probability distribution of the final location on that horizontal line. By the central limit theorem, it should converge to a normal distribution with expectation $\mu=\frac{n}{4}$ and variance $\sigma^2=\frac{5n}{8}$. The probability of A being empty is
$$P=1-\int_{-\infty}^0 \mathcal{N}\left(\frac{n}{4},\sqrt{\frac{5n}{8}}\right) dx$$
We can change variables to reduce it to an integral over the standard normal distribution:
$$P=1-\sqrt{\frac{5n}{8}}\int_{-\infty}^{-\frac{n}{4}\sqrt{\frac{8}{5n}}} \mathcal{N}(0,1) dx$$
$$=1-\sqrt{\frac{5n}{8}}\frac{1+\text{erf}(-\sqrt{\frac{n}{20}})}{2}$$

Now we can easily find out the threshold based on this approximation, in one line:
```cpp
    for(int i=1;i<2000;++i) if((1+erf(-sqrt(i/20.0)))*sqrt(5*i/32.0)<0.00001) {cout<<"i="<<i;break;}
```
We get $i=230$, which corresponds to an original value of $230\times 25=5750$.
The threshold I got from my program is $oneTh=179$, which corresponds to an original value of $4475$.
How close is this approximation? Let's find out the differences:
```cpp
    for(size_t i=1;i<tab.size();++i){
        double f=1-(1+erf(-sqrt(i/20.0)))*sqrt(5*i/32.0);
        cout<<setw(8)<<tab[i][i]<<" - "<<setw(8)<<f<<" = "<<setw(8)<<tab[i][i]-f<<endl;
    }
```
Output:
   0.625 - 0.702813 = -0.0778132
 0.65625 -    0.634 = 0.0222501
 0.71875 - 0.600243 = 0.118507
0.742188 - 0.583299 = 0.158888
0.757812 - 0.576178 = 0.181635
0.785156 - 0.575349 = 0.209808
0.796875 - 0.578759 = 0.218116
0.817871 - 0.585105 = 0.232766
0.827637 - 0.593511 = 0.234126
0.844849 - 0.603362 = 0.241487
0.852173 - 0.614214 = 0.237959
0.866699 - 0.625739 =  0.24096
0.872559 -  0.63769 = 0.234868
0.884827 - 0.649881 = 0.234946
0.889633 - 0.662167 = 0.227466
0.900076 - 0.674438 = 0.225637
0.904058 - 0.686609 =  0.21745
0.913005 - 0.698613 = 0.214392
0.916344 -   0.7104 = 0.205944
0.924045 - 0.721932 = 0.202114
 0.92687 - 0.733179 =  0.19369
0.933526 - 0.744121 = 0.189404
 0.93593 - 0.754743 = 0.181187
0.941703 - 0.765035 = 0.176668
0.943762 - 0.774991 = 0.168771
0.948783 - 0.784609 = 0.164174
0.950555 - 0.793889 = 0.156667
0.954934 - 0.802832 = 0.152102
0.956464 - 0.811443 = 0.145021
0.960291 - 0.819727 = 0.140564
0.961618 -  0.82769 = 0.133927
0.964968 -  0.83534 = 0.129628
0.966121 - 0.842684 = 0.123438
0.969061 -  0.84973 = 0.119331
0.970066 - 0.856487 = 0.113579
0.972648 - 0.862964 = 0.109684
0.973525 - 0.869169 = 0.104356
0.975797 - 0.875113 = 0.100684
0.976565 - 0.880803 = 0.0957618
0.978566 - 0.886249 = 0.0923166
0.979239 -  0.89146 = 0.0877792
0.981004 - 0.896444 = 0.0845595
0.981595 - 0.901211 = 0.0803844
0.983153 - 0.905767 = 0.0773854
0.983673 - 0.910123 = 0.0735498
0.985049 - 0.914285 = 0.0707642
0.985506 - 0.918261 = 0.0672452
0.986724 -  0.92206 = 0.0646639
0.987127 - 0.925688 = 0.0614394
0.988204 - 0.929152 = 0.0590524
 0.98856 -  0.93246 = 0.0561005
0.989515 - 0.935617 = 0.0538971
0.989829 - 0.938631 = 0.0511974
0.990675 - 0.941508 = 0.0491667
0.990953 - 0.944253 = 0.0466995
0.991703 - 0.946872 = 0.0448306
0.991949 - 0.949371 = 0.0425775
0.992614 - 0.951755 = 0.0408595
0.992832 - 0.954029 = 0.0388033
0.993423 - 0.956197 = 0.0372257
0.993616 - 0.958266 = 0.0353502
0.994141 - 0.960238 = 0.0339031
0.994312 - 0.962119 = 0.0321933
0.994779 - 0.963912 = 0.0308669
 0.99493 - 0.965622 = 0.0293089
0.995346 - 0.967251 = 0.0280942
 0.99548 - 0.968805 = 0.0266751
 0.99585 - 0.970286 = 0.0255634
0.995969 - 0.971698 = 0.0242713
0.996298 - 0.973043 = 0.0232546
0.996404 - 0.974325 = 0.0220786
0.996697 - 0.975547 = 0.0211493
0.996791 - 0.976712 = 0.0200793
0.997052 - 0.977821 = 0.0192303
0.997136 - 0.978879 = 0.018257
0.997368 - 0.979886 = 0.0174819
0.997443 - 0.980846 = 0.0165967
 0.99765 - 0.981761 = 0.0158893
0.997716 - 0.982632 = 0.0150846
0.997901 - 0.983462 = 0.0144392
 0.99796 - 0.984252 = 0.0137078
0.998125 - 0.985006 = 0.0131192
0.998177 - 0.985723 = 0.0124545
0.998324 - 0.986406 = 0.011918
0.998371 - 0.987057 = 0.0113141
0.998502 - 0.987677 = 0.0108252
0.998544 - 0.988268 = 0.0102766
0.998661 -  0.98883 = 0.00983115
0.998699 - 0.989366 = 0.00933291
0.998803 - 0.989876 = 0.00892724
0.998836 - 0.990362 = 0.00847478
 0.99893 - 0.990824 = 0.00810543
0.998959 - 0.991265 = 0.00769461
0.999043 - 0.991684 = 0.0073584
0.999069 - 0.992084 = 0.00698545
0.999144 - 0.992464 = 0.00667947
0.999167 - 0.992826 = 0.00634094
0.999234 - 0.993171 = 0.00606255
0.999255 -   0.9935 = 0.0057553
0.999314 - 0.993812 = 0.00550206
0.999333 -  0.99411 = 0.00522323
0.999386 - 0.994393 = 0.00499291
0.999403 - 0.994663 = 0.00473991
0.999451 -  0.99492 = 0.00453047
0.999466 - 0.995165 = 0.00430092
0.999508 - 0.995398 = 0.00411051
0.999522 - 0.995619 = 0.00390226
 0.99956 -  0.99583 = 0.00372917
0.999572 - 0.996031 = 0.00354026
0.999606 - 0.996223 = 0.00338295
0.999616 - 0.996405 = 0.0032116
0.999647 - 0.996578 = 0.00306864
0.999656 - 0.996743 = 0.00291323
0.999684 -   0.9969 = 0.00278334
0.999692 -  0.99705 = 0.0026424
0.999717 - 0.997192 = 0.0025244
0.999724 - 0.997328 = 0.00239658
0.999746 - 0.997457 = 0.00228939
0.999753 - 0.997579 = 0.00217349
0.999772 - 0.997696 = 0.00207614
0.999779 - 0.997808 = 0.00197105
0.999796 - 0.997913 = 0.00188264
0.999802 - 0.998014 = 0.00178736
0.999817 -  0.99811 = 0.00170708
0.999822 - 0.998202 = 0.0016207
0.999836 - 0.998288 = 0.0015478
0.999841 - 0.998371 = 0.0014695
0.999853 -  0.99845 = 0.00140331
0.999857 - 0.998525 = 0.00133233
0.999868 - 0.998596 = 0.00127225
0.999872 - 0.998664 = 0.00120791
0.999882 - 0.998729 = 0.00115337
0.999885 -  0.99879 = 0.00109505
0.999894 - 0.998849 = 0.00104556
0.999897 - 0.998904 = 0.000992697
0.999905 - 0.998957 = 0.000947776
0.999908 - 0.999008 = 0.000899868
0.999915 - 0.999056 = 0.000859103
0.999917 - 0.999101 = 0.000815685
0.999924 - 0.999145 = 0.000778694
0.999926 - 0.999186 = 0.000739347
0.999932 - 0.999226 = 0.000705783
0.999933 - 0.999263 = 0.000670127
0.999939 - 0.999299 = 0.000639675
 0.99994 - 0.999333 = 0.000607365
0.999945 - 0.999365 = 0.000579738
0.999946 - 0.999396 = 0.00055046
0.999951 - 0.999425 = 0.000525398
0.999952 - 0.999453 = 0.000498869
0.999956 -  0.99948 = 0.000476135
0.999957 - 0.999505 = 0.000452098
 0.99996 - 0.999529 = 0.000431477
0.999961 - 0.999552 = 0.000409699
0.999964 - 0.999573 = 0.000390995
0.999965 - 0.999594 = 0.000371264
0.999968 - 0.999614 = 0.0003543
0.999969 - 0.999632 = 0.000336424
0.999971 -  0.99965 = 0.00032104
0.999972 - 0.999667 = 0.000304845
0.999974 - 0.999683 = 0.000290893
0.999975 - 0.999699 = 0.000276222
0.999977 - 0.999713 = 0.00026357
0.999977 - 0.999727 = 0.000250279
0.999979 -  0.99974 = 0.000238807
 0.99998 - 0.999753 = 0.000226767
0.999981 - 0.999765 = 0.000216365
0.999982 - 0.999776 = 0.000205459
0.999983 - 0.999787 = 0.000196027
0.999984 - 0.999798 = 0.000186147
0.999985 - 0.999807 = 0.000177596
0.999985 - 0.999817 = 0.000168647
0.999986 - 0.999826 = 0.000160895
0.999987 - 0.999834 = 0.000152789
0.999988 - 0.999842 = 0.00014576
0.999988 -  0.99985 = 0.000138418
0.999989 - 0.999857 = 0.000132047
0.999989 - 0.999864 = 0.000125396

Indeed, the numerical evidence of convergence is very convincing.

But wait a minute... Does this count as a proof? Is it rigorous?

The simple answer is, no.
Because we can tell that it converges to a normal distribution as $n$ goes to infinity, but we still don't know how fast it converges! We can tell that for the normal distribution we can stop at a certain $n$, but we have no proof that the normal distribution is close enough to the actual distribution!
Wikipedia states that
If the third central moment $E[(X_1-\mu)^3]$ exists and is finite, then the speed of convergence is at least on the order of $1/\sqrt{n}$ (see Berry–Esseen theorem). Stein's method can be used not only to prove the central limit theorem, but also to provide bounds on the rates of convergence for selected metrics.
Apparently the third central moment is finite for our distribution, but a convergence speed of $1/\sqrt{n}$ is too slow for our problem. The criterion is $10^{-5}$, which would require $10^{10}$ steps, corresponding to $n=2\times 10^{10}$, larger than the maximum of the constraint! I haven't looked into this "Stein's method"; I wonder if it provides a better bound on the speed. Also notice that we not only need to bound the difference between the probability densities, we also need to bound the difference between the cumulative distribution functions, since what we need is the value of the integral. We can also tell that the value approximated by the normal distribution seems to be a lower bound of the actual value, except for the first one. But unless we can prove it, it's just an observation.

What if we don't use an approximation? What if we just sum up the discrete probabilities? If we apply the recurrence twice, we get:
$$p(m,n)=\frac{1}{4}(p(m-4,n)+p(m-3,n-1)+p(m-2,n-2)+p(m-1,n-3))$$
$$=\left(\frac{1}{4}\right)^2(p(m-8,n)+2p(m-7,n-1)+3p(m-6,n-2)+4p(m-5,n-3)+3p(m-4,n-4)+2p(m-3,n-5)+p(m-2,n-6))$$
And the pattern continues. The coefficients go like this:

1 1 1 1
1 2 3 4 3 2 1
1 3 6 10 12 12 10 6 3 1
1 4 10 20 31 40 44 40 31 20 10 4 1

(more rows up to the 11th:
1,5,15,35,65,101,135,155,155,135,101,65,35,15,5,1,
1,6,21,56,120,216,336,456,546,580,546,456,336,216,120,56,21,6,1,
1,7,28,84,203,413,728,1128,1554,1918,2128,2128,1918,1554,1128,728,413,203,84,28,7,1,
1,8,36,120,322,728,1428,2472,3823,5328,6728,7728,8092,7728,6728,5328,3823,2472,1428,728,322,120,36,8,1,
1,9,45,165,486,1206,2598,4950,8451,13051,18351,23607,27876,30276,30276,27876,23607,18351,13051,8451,4950,2598,1206,486,165,45,9,1,
1,10,55,220,705,1902,4455,9240,17205,29050,44803,63460,82885,100110,112035,116304,112035,100110,82885,63460,44803,29050,17205,9240,4455,1902,705,220,55,10,1,
1,11,66,286,990,2882,7282,16302,32802,59950,100298,154518,220198,291258,358490,411334,440484,440484,411334,358490,291258,220198,154518,100298,59950,32802,16302,7282,2882,990,286,66,11,1,)

On each row, each number is the sum of 4 numbers of the previous row: the one 2 positions to the left of the one above it, the one 1 position to the left, the one directly above it, and the one to the right; numbers outside the boundary count as 0. Overall, there's an extra factor of $\frac{1}{4^t}$ to normalize the total probability to one. Each row has one more element on the left and two more on the right than the previous one. The vertical line in our figure corresponds to the vertical line of {1, 3, 10, 31, ...}.

If we take the 4th layer of the even cells (coordinate (8,8) in the table below) as an example, the total probability is
$$\frac{1}{4^4}(31\times 0.5 + 40 + 44 + 40 + 31 + 20 + 10 + 4 + 1)=0.802734375$$
(For odd layers, the recurrence reaches back to the row above the line in the figure, so there are 3 elements multiplied by 0.5. Just a little more complicated.)

This is larger than the actual result 0.796875. The reason is simple: when we enter the 0 region, there's a chance that we land on the boundary (coordinate (4,-1)); from there, there's a 1/4 chance of entering the 0.5 region, which is impossible since we should already have stopped once we were outside the grid. There are 3 ways to reach that cell, which gives an extra value of $0.5\left(3\times \frac{1}{4^3} \times \frac{1}{4}\right)$. Removing this discrepancy gives the correct result. And as $n$ increases, the discrepancy should converge to 0 exponentially.
So, if we can prove that the sum of the numbers before the middle line converges to zero monotonically, then we're done. The first few sums are
$$(1 + 0.5)/4 =0.375$$
$$(1+2+3\times 0.5)/4^2=0.28125$$
$$(1+3+6+10\times 0.5)/4^3=0.234375$$
$$(1+4+10+20+31\times 0.5)/4^4=0.197265625$$
If we could find a closed form of this summation, it would be very helpful. It's easy to find formulas for the first few terms of each row from the recurrence: the first is always 1, the second is $t$, the third is $1+\dots+t = t(t+1)/2$, the fourth is $t(t+1)(t+2)/6$, based on the fact that $f_4(t)-f_4(t-1)=t(t+1)/2$; the fifth satisfies the recurrence $f_5(t)-f_5(t-1)=(t-1)(t^2+4t+6)/6$ and grows as $t^4/24$. So in general, the $k$th term is a polynomial in $t$ of degree $k-1$ whose leading coefficient is $\frac{1}{(k-1)!}$. It's quite similar to the binomial coefficients, just more complicated. Maybe the techniques for bounding partial sums of binomial coefficients can be applied here, too. We're basically trying to bound the sum of the first $t+1$ (or $t+2$ for odd layers) terms of these unnamed coefficients, where the length of the row is $3t+1$. Given that this sum appears to stay below the Gaussian approximation, maybe the Chernoff bound can be applied here, too. This seems to be the most promising direction. I'll take a look at these when I have more time...
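Incidentally, these rows look like (and, as far as I can tell, are) the quadrinomial coefficients, i.e. the coefficients of $(1+x+x^2+x^3)^t$, since multiplying by $1+x+x^2+x^3$ is exactly the "sum of 4 numbers of the previous row" rule. Here's a small throwaway sketch (names are mine) that rebuilds the rows and prints the partial sums $S(t)$ up to the middle line:
```cpp
#include <bits/stdc++.h>
using namespace std;

int main(){
    // row t holds the coefficients of (1+x+x^2+x^3)^t, built with the
    // "each number is the sum of 4 numbers of the previous row" rule
    vector<long long> row{1};
    for(int t = 1; t <= 8; ++t){
        vector<long long> next(row.size() + 3, 0); // each row is 3 entries longer
        for(size_t k = 0; k < next.size(); ++k)
            for(size_t d = 0; d < 4; ++d)
                if(k >= d && k - d < row.size()) next[k] += row[k - d];
        row = move(next);
        // S(t): entries before the middle line, plus half of the middle entry
        double s = 0;
        for(int k = 0; k < t; ++k) s += row[k];
        s = (s + 0.5 * row[t]) / pow(4.0, t);
        printf("t=%d  S(t)=%.9f\n", t, s);
    }
}
```
The first four values are 0.375, 0.28125, 0.234375, 0.197265625, matching the sums above, and $1-S(4)=0.802734375$ is exactly the $(8,8)$ example.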
[2023 08 03 update]

I figured out how to prove the upper bound after giving it a little more thought!
So, basically the idea is (thanks to user2575jO on leetcode): we can write one step as two steps merged together. Choosing from $\{-1,0,1,2\}$ each with $p=1/4$ is equivalent to first choosing from $\{0,2\}$ with $p=1/2$, then choosing from $\{-1,0\}$ with $p=1/2$, and adding the results together. The distributions are exactly the same.
Now if we move $t$ steps, the probability that we're at a location $\leq 0$ is
$$P(t)=\frac{1}{4^t}\sum_{l=0}^t\left(\binom t l\sum_{r=0}^{l/2}\binom t r\right)$$
where $l$ is the number of $-1$ moves and $r$ is the number of $+2$ moves. $l$ can be from 0 to $t$, and $r$ must be between $0$ and $\frac{l}{2}$, so that the final location $-l+2r\leq 0$.

Since $r$ is a Binomial$\left(t,\frac{1}{2}\right)$ variable with mean $\frac{t}{2}$, we can apply the Chernoff bound directly:
$$Pr\left(r<\frac{l}{2}\right)=Pr\left(r<\frac{t}{2}-\left(\frac{t}{2}-\frac{l}{2}\right)\right)\leq\exp{\left(-2t\left(\frac{\frac{t}{2}-\frac{l}{2}}{t}\right)^2\right)}$$
$$Pr\left(r<\frac{l}{2}\right)\leq\exp{\left(-\frac{(t-l)^2}{2t}\right)}$$

Replacing $2^{-t}\sum_{r=0}^{l/2}\binom t r$ with the upper bound above, we now have
$$P(t)\leq 2^{-t}\sum_{l=0}^t\binom t l\exp{\left(-\frac{(t-l)^2}{2t}\right)}$$
To simplify this expression, noticing the binomial coefficient, we really want the last factor to be of the form $x^l$. Let's expand it:
$$P(t)\leq 2^{-t}\sum_{l=0}^t\binom t l\exp{\left(-\frac{t^2-2lt+l^2}{2t}\right)}=2^{-t}e^{-t/2}\sum_{l=0}^t\binom t l e^l e^{-\frac{l^2}{2t}}$$
We are almost there, except for the annoying $e^{-\frac{l^2}{2t}}$ factor. It's a Gaussian in $l$, centered at 0. If we can bound it by some function of the form $a e^{-bl}$, then we can fold it into the power of $l$. And that seems doable!
$$e^{-\frac{l^2}{2t}}\leq ae^{-bl}$$
$$e^{-\frac{l^2}{2t}+bl}\leq a$$
$$e^{-\frac{1}{2t}(l-bt)^2+b^2t/2}\leq a$$
If we choose $a=e^{b^2t/2}$, the inequality always holds. Applying it with an arbitrary $b$, we get
$$P(t)\leq 2^{-t}e^{-t/2}\sum_{l=0}^t\binom t l e^l e^{b^2t/2} e^{-bl}$$
$$=2^{-t}e^{-t/2}e^{b^2t/2}\sum_{l=0}^t\binom t l e^l e^{-bl}$$
$$=2^{-t}e^{\frac{b^2-1}{2}t}\left(1+e^{1-b}\right)^t$$
$$=\left(\frac{1}{2}\left(e^{\frac{b^2-1}{2}}+e^{\frac{(b-1)^2}{2}}\right)\right)^t$$
Now we want to choose $b$ to minimize this expression. Taking the derivative, we arrive at the equation
$$be^{b-1}+b-1=0$$
Solving it numerically, we find
$$b=0.598942$$
Putting it back into the inequality above, we get
$$P(t)\leq 0.9047174^t$$
The threshold is at
$$0.9047174^t=10^{-5}$$
$$t=114.97\dots$$
Remember that $t$ is the number of steps, so this corresponds to $n=2t=230$, the same result as the Gaussian approximation above! (Probably not a coincidence: even though the two are very different for small $n$, when I change $10^{-5}$ to $5\times 10^{-6}$ they still give the same result, $n=244$. I think this bound has the same asymptotic behavior as the Gaussian approximation.) But this time the bound is proved.
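For completeness, here's a tiny sketch that reproduces these numbers; since $g(b)=be^{b-1}+b-1$ is increasing on $(0,1)$ with $g(0)<0<g(1)$, plain bisection works:
```cpp
#include <bits/stdc++.h>
using namespace std;

int main(){
    // solve b*e^(b-1) + b - 1 = 0 by bisection on (0,1)
    auto g = [](double b){ return b*exp(b-1.0) + b - 1.0; };
    double lo = 0.0, hi = 1.0;
    for(int it = 0; it < 60; ++it){
        double mid = 0.5*(lo + hi);
        (g(mid) < 0 ? lo : hi) = mid;
    }
    double b = 0.5*(lo + hi);
    // the base of the bound P(t) <= base^t, and the threshold it implies
    double base = 0.5*(exp((b*b - 1)/2) + exp((b - 1)*(b - 1)/2));
    double t = log(1e-5)/log(base);
    printf("b=%.6f  base=%.7f  t=%.2f  n=%d\n", b, base, t, 2*(int)ceil(t));
}
```
It prints $b=0.598942$, base $=0.9047174$, $t=114.97$, and $n=230$.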

(Strictly, the $l$ should be replaced by $l+1$ or $l+2$ in a few places, because the value on the middle line is 0.5, not 1, and odd layers recurse back to the layer with three 0.5s, so we must shift one more step. But these are small details; they shouldn't change the answer by much...)

[update end]
What else can we do, then? In the previous calculations we were looking for an upper bound: with one, we can decide where to stop. E.g. if we can prove that the normal-distribution estimate bounds the tail from above, we can use the threshold $n=230$. But if we can prove monotonicity, we can stop exactly where the value first exceeds the criterion, i.e. $n=179$, which is a stronger statement.

Let's go back to the grid, maybe.
The proof would be trivial if all the values increased monotonically along every vertical line. But unfortunately that is not the case. The elements in the top $10 \times 10$ corner of the grid are

     0.625,      0.75,     0.875,         1,         1,         1,         1,         1,         1,         1,
       0.5,     0.625,      0.75,   0.90625,    0.9375,   0.96875,         1,         1,         1,         1,
     0.375,       0.5,   0.65625,    0.8125,     0.875,    0.9375,  0.976562,  0.984375,  0.992188,         1,
      0.25,   0.40625,    0.5625,   0.71875,    0.8125,  0.890625,    0.9375,  0.960938,  0.984375,  0.994141,
   0.15625,    0.3125,   0.46875,     0.625,  0.742188,  0.828125,  0.890625,    0.9375,  0.966797,  0.980469,
     0.125,      0.25,     0.375,   0.53125,   0.65625,  0.757812,   0.84375,  0.902344,    0.9375,  0.960938,
   0.09375,    0.1875,  0.304688,  0.453125,  0.578125,    0.6875,  0.785156,  0.851562,  0.900391,  0.941406,
    0.0625,  0.140625,      0.25,  0.382812,       0.5,  0.617188,   0.71875,  0.796875,  0.863281,  0.912109,
 0.0390625,  0.109375,  0.203125,    0.3125,  0.429688,  0.546875,  0.652344,  0.742188,  0.817871,   0.87207,
   0.03125, 0.0859375,   0.15625,  0.253906,  0.367188,  0.480469,  0.585938,  0.683594,  0.763672,  0.827637,
   
The main diagonal is indeed increasing, but the third diagonal has the elements 0.875, 0.90625, 0.875, 0.890625, 0.890625, 0.902344, 0.900391, 0.912109..., which increase and decrease alternately. And it is not the only diagonal that behaves this way. Take the 5th diagonal as an example; its first few elements are

       1,       1,0.984375,0.984375,0.980469,0.980469,0.978516,0.979492,0.978271,0.979492,
0.979004,0.980286,0.980179,0.981461,0.981567,0.982796, 0.98303, 0.98418,0.984484,0.985547,
0.985884,0.986855,0.987203,0.988086, 0.98843,0.989229, 0.98956,0.990281,0.990595,0.991244,
0.991538, 0.99212,0.992393,0.992915,0.993167,0.993635,0.993865,0.994284,0.994495, 0.99487,
0.995062,0.995397,0.995572,0.995872, 0.99603,0.996298,0.996441,0.996681, 0.99681,0.997024,

It's not easy to tell if it increases or decreases. Let's see their differences:

           0,   -0.015625,           0, -0.00390625,           0, -0.00195312, 0.000976562,  -0.0012207,   0.0012207,-0.000488281,
  0.00128174,-0.000106812,  0.00128174, 0.000106812,  0.00122833, 0.000234604,  0.00115013, 0.000303984,  0.00106215, 0.000337005,
 0.000971675, 0.000347495, 0.000883162, 0.000343844, 0.000799075, 0.000331543, 0.000720632, 0.000314187,  0.00064834,  0.00029413,
 0.000582274, 0.000272915, 0.000522257, 0.000251551,  0.00046797,  0.00023068, 0.000419023, 0.000210703, 0.000374994, 0.000191853,

It alternates about 12 times, then keeps increasing.
Remembering that there are two independent sets of layers, we should take the differences within each set. The differences within the odd layers are

   -0.015625, -0.00390625, -0.00195312,-0.000244141, 0.000732422,  0.00117493,  0.00138855,  0.00146294,  0.00145411,  0.00139916,
  0.00131917,  0.00122701,  0.00113062,  0.00103482,  0.00094247,  0.00085519, 0.000773808, 0.000698651, 0.000629726, 0.000566848,
  
and the even layers

   -0.015625, -0.00390625,-0.000976562,           0, 0.000793457,  0.00117493,  0.00133514,  0.00138474,  0.00136614,  0.00130868,
  0.00123066,  0.00114292,  0.00105218, 0.000962528, 0.000876404, 0.000795173, 0.000719521, 0.000649703, 0.000585698, 0.000527314,

Now they no longer alternate.
Is it possible to prove that it can only decrease a finite number of times before it starts to increase monotonically?
Let's take a look at a diagonal line that's further away. Here are the differences of the elements on the 10th diagonal:
odd layers:
           0,-0.000488281,-0.000488281,-0.000366211,-0.000267029,-0.000171661,-9.20296e-05, -3.0756e-05,   1.508e-05, 4.81904e-05,
 7.11586e-05, 8.62759e-05, 9.54153e-05, 0.000100072, 0.000101421, 0.000100375,  9.7634e-05, 9.37348e-05,  8.9083e-05, 8.39835e-05,
even layers:
           0,-0.000366211,-0.000427246,-0.000335693,-0.000244141,-0.000161171,-9.08375e-05,-3.57032e-05, 5.90086e-06, 3.63067e-05,
  5.7715e-05, 7.20862e-05, 8.10546e-05, 8.59372e-05, 8.77771e-05, 8.73906e-05, 8.54098e-05, 8.23208e-05, 7.84945e-05, 7.42116e-05,

It seems that the further a diagonal is from the middle, the more times its values decrease before they turn around.

Let's see if we can find out more about this pattern! In the following two figures (first the odd layers, second the even layers), a cell $(m,n)$ is a black square if $p(m,n)-p(m-2,n-2)$ is negative, a white square if positive, and a dash if zero.
There is definitely a pattern here. The boundary between increasing cells and decreasing cells shifts further and further away from the middle diagonal. If this could be proved, it would immediately follow that the middle diagonal has no decreasing cells. But I haven't found a proof of that.

How about we consider something more solid - something that we can actually prove?
If we look at the table again, it's easy to see that the values increase rightward and decrease downward. Can we do better than that?
It's easy to see that the numbers on the left boundary decrease to 1/4 of the value 4 steps (or 1 layer) above them. We can also tell that there are more and more 1s on the right boundary. Let's mark them on the figure:
The value of the light green cell is 1/4 of the green cell, whose value is 1/4 of the deep green cell.
With the same reasoning, we can see a similar pattern on the right boundary. There is one more cell of value 1 after each layer, and the value of the cell next to the 1 (the blue cells) increases monotonically downward: the difference between 1 and the blue cell is 1/4 of the difference between 1 and the light blue cell, and the difference between 1 and the deep blue cell is 1/4 of the previous one. So the values of the cells along this line converge to 1. By induction, it's easy to show that any cell's value increases along that line downward.

So we have two lines that are guaranteed to increase or decrease downward. Is there a line of equal values, then? If so, any direction between the increasing line and the equal line should be increasing, and the other directions should be decreasing. But the angle of the equal line depends on where the cell is: if the cell already has value 1, the equal line coincides with the increasing line. So as the cell moves from the rightmost position to the leftmost, the angle between the equal line and the increasing line grows from 0 to some positive value. The two figures above seem to support this statement. If we can prove that for a cell in the middle the equal line always points to the lower left, then along the middle diagonal the value increases monotonically.

Another thing I noticed: the sum of the values on the next layer is always 2.5 larger than on the current layer, since every value appears exactly 4 times in the averages, the extra 0s contribute nothing, and the 1s contribute exactly $(4+3+2+1)/4=2.5$ to the sum of the next layer. I don't know how to use this fact in a proof, though.

Anyway, this summarizes my thoughts on this problem so far. If anyone finds a rigorous proof of this problem, I'd be glad to hear it!

Friday, June 9, 2023

Leetcode M1318 Minimum Flips to Make a OR b Equal to c, E1351 Count Negative Numbers in a Sorted Matrix

M1318 description:
  Given 3 positive numbers a, b and c. Return the minimum flips required in some bits of a and b to make ( a OR b == c ) (bitwise OR operation).
  A flip operation consists of changing any single bit from 1 to 0, or from 0 to 1, in their binary representation.
Constraints:
  1 <= a <= $10^9$
  1 <= b <= $10^9$
  1 <= c <= $10^9$
This problem is very easy; a typical algorithm goes through each bit position of the three numbers. If $c$ at that position is 1 but $a|b$ at that position is 0, we add one flip; if $c$ is 0, we add one flip if $a$ is 1 and another if $b$ is 1 at that position. (Since the numbers are positive, we don't need to check the 32nd bit.)

Putting these together, let's make a table (per bit position):
c 1, a|b 0 : add 1
c 0, a&b 1 : add 2
c 0, a|b 1 (but a&b 0) : add 1
We can do the bitwise operations to the numbers beforehand, instead of checking each digit in the loop. So here's the first version that I submitted:
```cpp
class Solution {
public:
    int minFlips(int a, int b, int c) {
        int r=0;
        int t=1;
        int abc=(a|b)^c,aAb=a&b;
        for(int i=0;i<31;++i){
            if(abc&t){
                r++;
                if(aAb&t) r++;
            }
            t<<=1;
        }
        return r;
    }
};
```
Then I noticed that I can also fold the '&' operation in to get rid of the branching. The branch-free version reads:
```cpp
class Solution {
public:
    int minFlips(int a, int b, int c) {
        int r=0;
        int abc=(a|b)^c,aAb=a&b&abc;
        for(int i=0;i<31;++i) r+=(abc>>i&1)+(aAb>>i&1);
        return r;
    }
};
```
At this point, it's obvious that we are simply counting the number of '1's in the binary representations of the two numbers, which is known as the Hamming weight or population count. GCC has a built-in function for that, "__builtin_popcount(int)", so we can write it like this:
```cpp
class Solution {
public:
    int minFlips(int a, int b, int c) {
        unsigned int abc=(a|b)^c,aAb=a&b&abc;
        #ifdef __GNUG__
        return __builtin_popcount(abc)+__builtin_popcount(aAb);
        #else
        int r=0;
        for(int i=0;i<31;++i) r+=(abc>>i&1)+(aAb>>i&1);
        return r;
        #endif
    }
};
```
I did some research and noticed that LLVM also supports the function "__builtin_popcount", and MSVC has the functions "__popcnt16, __popcnt, __popcnt64". We can pick among these according to the compiler.
(Update: starting from C++20, there's an "std::popcount()" function in the header <bit>.)
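(For what it's worth, a C++20 version using std::popcount might read as follows; a sketch, assuming the judge compiles with C++20:)
```cpp
#include <bit>  // std::popcount, C++20

class Solution {
public:
    int minFlips(int a, int b, int c) {
        unsigned abc = (a | b) ^ c, aAb = a & b & abc;  // std::popcount wants an unsigned type
        return std::popcount(abc) + std::popcount(aAb);
    }
};
```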

Then I found this webpage, which lists the best bit hacks currently known for many computational tasks. I'm fascinated by the esoteric beauty of the algorithms and magic numbers. Using the population-count algorithm from the list, we can write the solution as
```cpp
#ifndef __GNUG__
int popC_32(unsigned int v){
    v = v - ((v >> 1) & 0x55555555);
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
    return ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}
#endif
class Solution {
public:
    int minFlips(int a, int b, int c) {
        unsigned int abc=(a|b)^c,aAb=a&b&abc;
        #ifdef __GNUG__
        return __builtin_popcount(abc)+__builtin_popcount(aAb);
        #else
        return popC_32(abc)+popC_32(aAb);
        #endif
    }
};
```
But since many CPUs support the instruction "POPCNT", it's probably better to use the built-in functions if they are available.
---
E1351 description:
  Given a m x n matrix grid which is sorted in non-increasing order both row-wise and column-wise, return the number of negative numbers in grid.
Constraints:
  m == grid.length
  n == grid[i].length
  1 <= m, n <= 100
  -100 <= grid[i][j] <= 100
An O(m+n) algorithm is easy to find,
```cpp
class Solution {
public:
    int countNegatives(vector<vector<int>>& grid) {
        const int m=grid.size(),n=grid[0].size();
        int k=n-1,r=0;
        for(int i=0;i<m;++i){
            while(k>=0&&grid[i][k]<0) k--;
            if(k<0){r+=(m-i)*n;return r;}
            r+=n-1-k;
        }
        return r;
    }
};
```
We may try to use binary search to increase speed,
```cpp
class Solution {
public:
    int countNegatives(vector<vector<int>>& grid) {
        const int m=grid.size(),n=grid[0].size();
        int k=n,r=0;
        for(int i=0;i<m;++i){
            k=upper_bound(grid[i].begin(),grid[i].begin()+k,0,greater())-grid[i].begin();
            if(k==0){r+=(m-i)*n;return r;}
            r+=n-k;
        }
        return r;
    }
};
```
But now it takes O(m*log(n)), which may be slower than O(m+n).
Also, if $m>n$, it would be better to binary search down the columns, in which case we must write our own upper-bound function. And because we'd be accessing elements across different vectors, the overhead may be larger.

Let's consider the complexity. For m*log(n) to be less than m+n, we need $n>m(\log(n)-1)$. Considering that $n\leq 100$, this can be satisfied if $n>6m$. But also considering the overhead and the constant coefficient, we may want to use a larger criterion, something like this:
```cpp
// binary search down column i among rows [0,k): returns the index of the first negative row
int upperB(vector<vector<int>>& grid,int i,int k){
    if(grid[0][i]<0) return 0;
    int l=0,r=k,mid;
    while(r-l>1){
        mid=(l+r)/2;
        if(grid[mid][i]>=0) l=mid;
        else r=mid;
    }
    return r;
}
class Solution {
public:
    int countNegatives(vector<vector<int>>& grid) {
        const int m=grid.size(),n=grid[0].size();
        if(n>8*m){
            int k=n,r=0;
            for(int i=0;i<m;++i){
                k=upper_bound(grid[i].begin(),grid[i].begin()+k,0,greater())-grid[i].begin();
                if(k==0){r+=(m-i)*n;return r;}
                r+=n-k;
            }
            return r;
        }
        else if(m>64*n){
            int k=m,r=0;
            for(int i=0;i<n;++i){
                k=upperB(grid,i,k);
                if(k==0){r+=(n-i)*m;return r;}
                r+=m-k;
            }
            return r;
        }
        else{
            int k=n-1,r=0;
            for(int i=0;i<m;++i){
                while(k>=0&&grid[i][k]<0) k--;
                if(k<0){r+=(m-i)*n;return r;}
                r+=n-1-k;
            }
            return r;
        }
    }
};
```
But considering that the size of the input is quite small, it won't show much difference in this case, unless there are a lot of test cases where one dimension of the input is much larger than the other.

Then I thought, instead of doing binary search vertically or horizontally, what if we do it... diagonally?
The good thing about it is that we are guaranteed to get rid of at least half of the input after each step: if we choose any point on the diagonal and cut the rectangle into four smaller ones, the total area of the top-left and bottom-right parts must be at least half of the original rectangle (with the cut at fraction $\alpha$ along both sides, their combined share is $\alpha^2+(1-\alpha)^2\geq\frac{1}{2}$), and those two parts are settled immediately.

The following code implements this algorithm:
```cpp
int negCount(int tl_r,int tl_c,int br_r,int br_c,vector<vector<int>>& grid){
    //cout<<"tl is {"<<tl_r<<","<<tl_c<<"}, br is {"<<br_r<<","<<br_c<<"},";
    if(tl_r==br_r||tl_c==br_c) return 0;
    if(br_r-tl_r==1){
        if(grid[tl_r][tl_c]<0){
            //cout<<"\nadded col count="<<br_c-c1<<endl;
            return br_c-tl_c;
        }
        int c1=tl_c,c2=br_c,cm;
        while(c2-c1>1){
            cm=(c1+c2)/2;
            if(grid[tl_r][cm]>=0)c1=cm;
            else c2=cm;
        }
        //cout<<"\nadded col count="<<br_c-c2<<endl;
        return br_c-c2;
    }
    if(br_c-tl_c==1){
        if(grid[tl_r][tl_c]<0){
            //cout<<"\nadded row count="<<br_r-r1<<endl;
            return br_r-tl_r;
        }
        int r1=tl_r,r2=br_r,rm;
        while(r2-r1>1){
            rm=(r1+r2)/2;
            if(grid[rm][tl_c]>=0)r1=rm;
            else r2=rm;
        }
        //cout<<"\nadded row count="<<br_r-r2<<endl;
        return br_r-r2;
    }
    int r1=tl_r,r2=br_r,c1=tl_c,c2=br_c,rm,cm;
    if(grid[r1][c1]<0){r2=r1;c2=c1;}
    else{
        while(r2-r1>1&&c2-c1>1){
            rm=(r1+r2)/2;
            cm=(c1+c2)/2;
            if(grid[rm][cm]>=0){r1=rm;c1=cm;}
            else{r2=rm;c2=cm;}
        }
        while(r2-r1>1){
            rm=(r1+r2)/2;
            if(grid[rm][cm]>=0)r1=rm;
            else r2=rm;
        }
        while(c2-c1>1){
            cm=(c1+c2)/2;
            if(grid[rm][cm]>=0)c1=cm;
            else c2=cm;
        }
    }
    //cout<<"r1 is "<<r1<<",c1 is "<<c1<<"\n";
    //cout<<"r2 is "<<r2<<",c2 is "<<c2<<"\nadded count="<<(br_r-r2)*(br_c-c2)<<endl;
    return (br_r-r2)*(br_c-c2)+negCount(tl_r,c2,r2,br_c,grid)+negCount(r2,tl_c,br_r,c2,grid);
}
class Solution {
public:
    int countNegatives(vector<vector<int>>& grid) {
        const int m=grid.size(),n=grid[0].size();
        return negCount(0,0,m,n,grid);
    }
};
```
What's the complexity of this algorithm? Let's take a look. Without loss of generality, assume $m>n$. The first call takes log(n) steps until c1 and c2 meet, then another log(m/n) until r1 and r2 meet, so O(log(m)) in total. Then the complication starts: the time depends on the shapes of the two rectangles produced by the first call, so it's hard to tell. If the point where the positive and negative values meet on the diagonal has index (m1,n1), then the cost of the next two calls is log(max(m-m1,n1))+log(max(m1,n-n1)). The maximum is log(m)+log(n), attained when the point is at the lower right corner - but then we only have two thin strips left and just need two more 1D upper-bound searches. Considering this, the worst case is probably when the point is at the center. Let's work out the complexity in that scenario.

Each time the rectangle is cut into 4 equal smaller ones, and two of them become the inputs of the next calls. The complexity is $\log(m)+2\log(m/2)+4\log(m/4)+8\log(m/8)+\dots$, but it doesn't go all the way down to $2^{\log m}\log(m/m)$; instead it stops at $2^{\log n}\log(m/2^{\log n})$, because at that point we only have 1D arrays left to search.

Summing up the terms, I get a complexity of O(f(m,n)) where $f(m,n)=2n\log(m/n)+2n-\log m -2$ for $m\geq n$ (the summation is carried out below). When $m\gg n$, this is approximately O(nlog(m/n)), and when $m=n$, it becomes O(m+n). I think it's a pretty good algorithm, although it's hard to tell with such small input sizes.
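To spell out the summation (assuming $m$ and $n$ are powers of two so the logarithms are integers, and using the identity $\sum_{k=0}^{K}k\,2^k=(K-1)2^{K+1}+2$):
$$\sum_{k=0}^{\log n} 2^k\log\frac{m}{2^k}=\log m\,(2n-1)-\left((\log n-1)\,2n+2\right)=2n\log\frac{m}{n}+2n-\log m-2$$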

A variation of this algorithm: instead of finding the exact cell that separates the positive and negative values, we can stop at the separating line of the smaller dimension, so a call takes O(log(min(m,n))) instead of O(log(max(m,n))). Then the two rectangles passed to the next calls overlap in the larger dimension by about m/n. The complexity analysis of that variant would be even more complicated. Anyway, I wonder how the performance would be at a larger scale.
