Tuesday, December 19, 2023

Maximum Product Difference Between Two Pairs, but with a twist

Yesterday's daily problem, E1913. Maximum Product Difference Between Two Pairs, is a very easy one.

Description:
The product difference between two pairs (a, b) and (c, d) is defined as (a * b) - (c * d). For example, the product difference between (5, 6) and (2, 7) is (5 * 6) - (2 * 7) = 16. Given an integer array nums, choose four distinct indices w, x, y, and z such that the product difference between pairs (nums[w], nums[x]) and (nums[y], nums[z]) is maximized. Return the maximum such product difference. Constraints: 4 <= nums.length <= 10^4, 1 <= nums[i] <= 10^4
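For the original constraints, where every element is positive, a single pass that keeps the two largest and the two smallest values is enough. Here is a minimal sketch of that easy version (my own, not part of the test program below):
#include <bits/stdc++.h>
using namespace std;
//All elements are positive, so the answer is always
//(largest * second largest) - (smallest * second smallest).
int maxProductDifference(const vector<int> &nums){
    int max1=0,max2=0;            //two largest seen so far
    int min1=INT_MAX,min2=INT_MAX;//two smallest seen so far
    for(int x:nums){
        if(x>max1){max2=max1;max1=x;}
        else if(x>max2) max2=x;
        if(x<min1){min2=min1;min1=x;}
        else if(x<min2) min2=x;
    }
    return max1*max2-min1*min2;
}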
This problem is very easy (see the sketch above): one just needs to find the two largest numbers and the two smallest. This works because all the elements are positive. So, naturally, I wondered: what if 0 and negative numbers are allowed? I.e., what if the second constraint becomes -10^4 <= nums[i] <= 10^4? At first I thought this would be just a little more complicated, but as I attempted to solve it, I found that it has quite a few special situations that must be taken care of. It takes a lot of patience and carefulness. Here is the test program that I wrote in C++. Are you able to solve this problem correctly? (The solution must still be O(n).) (Let me know if you translated it into a different language. I'll add them or link them.)
#include <bits/stdc++.h>
using namespace std;
int yourSolution(vector<int> &v){
    //input your solution here






}
int bruteForce(const vector<int> &v,int &i1,int &i2,int &i3,int &i4){
    const int sz=v.size();
    int r=0,e1,e2,e3,e4,t;
    for(int i=0;i<sz;++i){
        e1=v[i];
        for(int j=i+1;j<sz;++j){
            e2=v[j];
            for(int k=j+1;k<sz;++k){
                e3=v[k];
                for(int l=k+1;l<sz;++l){
                    e4=v[l];
                    t=abs(e1*e2-e3*e4);
                    if(t>r) r=t,i1=e1,i2=e2,i3=e3,i4=e4;
                    t=abs(e1*e3-e2*e4);
                    if(t>r) r=t,i1=e1,i2=e3,i3=e2,i4=e4;
                    t=abs(e1*e4-e3*e2);
                    if(t>r) r=t,i1=e1,i2=e4,i3=e3,i4=e2;
                }
            }
        }
    }
    return r;
}
default_random_engine rg(chrono::system_clock::now().time_since_epoch().count());
uniform_int_distribution<int> rollEle(1,10000);
const int optSize=12;
const int maxSize=32;
uniform_int_distribution<int> rollPos(0,maxSize-1);
uniform_int_distribution<int> roll0(0,1);
uniform_int_distribution<int> rollExtreme(0,1);
void noNeg(vector<int> &v){
    v.resize(maxSize);
    for(int i=0;i<maxSize;++i) v[i]=rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
}
void noPos(vector<int> &v){
    v.resize(maxSize);
    for(int i=0;i<maxSize;++i) v[i]=-rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
}
void mixed(vector<int> &v){
    v.resize(maxSize);
    for(int i=0;i<maxSize/2;++i) v[i]=rollEle(rg);
    for(int i=maxSize/2;i<maxSize;++i) v[i]=-rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
}
void neg2posN(vector<int> &v){
    if(rollExtreme(rg)){
        v=vector<int>{1,1,1,1,1,3,1,-1000,1,1,1,1,1,1,1,-1000,1,1,1,1,2,1,1,1,1};
        if(rollExtreme(rg)) v=vector<int>{1,1,1,1,1,3,1,1000,1,1,1,1,1,1,1,-1000,1,1,1,1,-2,1,1,1,1};
        if(roll0(rg)) v[0]=0;
        return;
    }
    v.resize(maxSize);
    for(int i=2;i<maxSize;++i) v[i]=rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
    v[0]=-rollEle(rg);
    v[1]=-rollEle(rg);
}
void neg1posN(vector<int> &v){
    v.resize(maxSize);
    for(int i=2;i<maxSize;++i) v[i]=rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
    v[rollPos(rg)]=-rollEle(rg);
}
void pos2negN(vector<int> &v){
    if(rollExtreme(rg)){
        v=vector<int>{-1,-1,-1,-1,-1,-3,1,-1000,-1,-1,-1,-1,-1,-1,-1,1000,-1,-1,-1,-1,-2,-1,-1,-1,-1};
        if(rollExtreme(rg)) v=vector<int>{1,1,1,1,1,3,1,-1000,1,1,1,1,1,1,1,-1000,1,1,1,1,2,1,1,1,1};
        if(roll0(rg)) v[0]=0;
        return;
    }
    v.resize(maxSize);
    for(int i=2;i<maxSize;++i) v[i]=-rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
    v[0]=rollEle(rg);
    v[1]=rollEle(rg);
}
void pos1negN(vector<int> &v){
    v.resize(maxSize);
    for(int i=2;i<maxSize;++i) v[i]=-rollEle(rg);
    if(roll0(rg)) for(int i=0;i<3;++i) v[rollPos(rg)]=0;
    v[rollPos(rg)]=rollEle(rg);
}
void pos2neg2(vector<int> &v){
    v.resize(4);
    v[0]=rollEle(rg);
    v[1]=-rollEle(rg);
    v[2]=-rollEle(rg);
    v[3]=rollEle(rg);
}
void pos2neg2and0(vector<int> &v){
    v=vector<int>(maxSize,0);//reset to all zeros; a plain resize would keep stale values from the previous test case
    v[0]=rollEle(rg);
    v[1]=-rollEle(rg);
    v[2]=-rollEle(rg);
    v[3]=rollEle(rg);
}
void pos2neg1and0(vector<int> &v){
    v=vector<int>(maxSize,0);
    v[0]=rollEle(rg);
    v[1]=-rollEle(rg);
    v.back()=rollEle(rg);
}
void neg2pos1and0(vector<int> &v){
    v=vector<int>(maxSize,0);
    v[0]=-rollEle(rg);
    v[1]=rollEle(rg);
    v.back()=-rollEle(rg);
}
void neg1pos1and0(vector<int> &v){
    v=vector<int>(maxSize,0);
    v[1]=rollEle(rg);
    v.back()=-rollEle(rg);
}
void (*updateTest[optSize]) (vector<int>&)={noNeg,noPos,mixed,neg2posN,neg1posN,pos2negN,pos1negN,pos2neg2,pos2neg2and0,pos2neg1and0,neg2pos1and0,neg1pos1and0};
int main(){
    int i1,i2,i3,i4;
    int repeat=10000;
    vector<int> test;
    uniform_int_distribution<int> rollOpt(0,optSize-1);
    for(int i=0;i<repeat;++i){
        updateTest[rollOpt(rg)](test);
        int result=yourSolution(test);
        int maxi=bruteForce(test,i1,i2,i3,i4);
        if(result!=maxi){
            cout<<"iteration number:"<<i<<endl;
            cout<<"input is ";
            for(auto e:test) cout<<e<<",";
            cout<<endl;
            cout<<"Your answer is "<<result<<", but the maximum is abs("<<i1<<"*"<<i2<<"-"<<i3<<"*"<<i4<<")="<<maxi<<".\n";
            return 0;
        }
    }
    cout<<"All passed!\n";
    return 0;
}
For simplicity and speed, the elements in the generated test cases are not shuffled, but that probably doesn't matter too much in this context. A bigger problem is that the extreme cases are too rare, so I added some of them manually. If you think there are other extreme cases that should be included, let me know!

I wrote my own solution and it passed all test cases, but I'm still not 100% sure it's correct. Let me know if you wrote a better test case generator that covers more corner cases!

IBM Ponder this 2023 11 solution

Problem description is here. The solution is given here.

When I saw the problem, I wondered if the diagonals include just the two main diagonals or the broken diagonals too. There's no explicit method to construct all magic squares of order 4, but if we require that the broken diagonals also add up to the magic constant, i.e. we consider the pandiagonal magic squares, there is an explicit construction of all order 4 pandiagonal magic squares. So I just did that, and once I confirmed that the shortest path length of them is less than 50, I pretty much forgot about the more general magic squares. Well, I like the pandiagonal ones better anyway, because of their higher symmetry.

Once the matrix is constructed, the first thing to do is to check its parity, because we need that to determine which target it'll connect with. Another thing we may want to know is, a heuristic distance from this matrix to the target. One simple choice is the Manhattan distance, although I doubt its effectiveness. Anyway, here's my implementation of the order 4 pandiagonal magic squares:
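//(Context assumed from the full program, which is not shown in the post:
//   using ull = unsigned long long;          // 64-bit packed representation used later
//   struct coor{int x,y;};                   // a cell coordinate
//   coor target0coors[16], target1coors[16]; // coordinates of each value 0-15 in the two
//                                            // target configurations, one per parity)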
struct matrix{
    array<array<int,4>,4> mat;
    coor coors[16];
    int dis,to_target;
    int perm=0;
    matrix(int b,int c,int d,int e,int tranX,int tranY){//tranX and tranY from 0 to 3
        coors[mat[tranX][tranY]=0]=coor{tranX,tranY};
        coors[mat[tranX][(tranY+1)%4]=b+c+e]=coor{tranX,(tranY+1)%4};
        coors[mat[tranX][(tranY+2)%4]=c+d]=coor{tranX,(tranY+2)%4};
        coors[mat[tranX][(tranY+3)%4]=b+d+e]=coor{tranX,(tranY+3)%4};
        coors[mat[(tranX+1)%4][tranY]=b+c+d]=coor{(tranX+1)%4,tranY};
        coors[mat[(tranX+1)%4][(tranY+1)%4]=d+e]=coor{(tranX+1)%4,(tranY+1)%4};
        coors[mat[(tranX+1)%4][(tranY+2)%4]=b]=coor{(tranX+1)%4,(tranY+2)%4};
        coors[mat[(tranX+1)%4][(tranY+3)%4]=c+e]=coor{(tranX+1)%4,(tranY+3)%4};
        coors[mat[(tranX+2)%4][tranY]=b+e]=coor{(tranX+2)%4,tranY};
        coors[mat[(tranX+2)%4][(tranY+1)%4]=c]=coor{(tranX+2)%4,(tranY+1)%4};
        coors[mat[(tranX+2)%4][(tranY+2)%4]=b+c+d+e]=coor{(tranX+2)%4,(tranY+2)%4};
        coors[mat[(tranX+2)%4][(tranY+3)%4]=d]=coor{(tranX+2)%4,(tranY+3)%4};
        coors[mat[(tranX+3)%4][tranY]=c+d+e]=coor{(tranX+3)%4,tranY};
        coors[mat[(tranX+3)%4][(tranY+1)%4]=b+d]=coor{(tranX+3)%4,(tranY+1)%4};
        coors[mat[(tranX+3)%4][(tranY+2)%4]=e]=coor{(tranX+3)%4,(tranY+2)%4};
        coors[mat[(tranX+3)%4][(tranY+3)%4]=b+c]=coor{(tranX+3)%4,(tranY+3)%4};
        getDis();
    }
    void getDis(){
        dis=0;
        to_target=0;
        for(int i=0;i<16;++i) dis+=abs(coors[i].x-target0coors[i].x)+abs(coors[i].y-target0coors[i].y);
        if(checkPerm()&1){
            to_target=1;
            dis-=abs(coors[14].x-target0coors[14].x)+abs(coors[14].y-target0coors[14].y);
            dis-=abs(coors[15].x-target0coors[15].x)+abs(coors[15].y-target0coors[15].y);
            dis+=abs(coors[14].x-target1coors[14].x)+abs(coors[14].y-target1coors[14].y);
            dis+=abs(coors[15].x-target1coors[15].x)+abs(coors[15].y-target1coors[15].y);
        }
    }
    int checkPerm(){
        int arr[16];
        for(int i=0;i<4;++i)for(int j=0;j<4;++j) arr[4*i+j]=mat[i][j];
        perm=0;
        if(arr[15]!=0){for(int i=0;i<15;++i){if(arr[i]==0){perm+=(i/4)+(i%4)-1;swap(arr[i],arr[15]);break;}}}
        for(int i=1;i<=15;++i){
            if(arr[i-1]!=i){
                perm++;
                for(int j=0;j<15;++j){
                    if(arr[j]==i){swap(arr[j],arr[i-1]);break;}
                }
            }
        }
        return perm;
    }
    void print(ostream& output)const{
        output<<"distance is "<<dis<<endl;
        output<<"[";
        for(int i=0;i<4;++i){output<<"[";for(int j=0;j<3;++j) output<<mat[i][j]<<",";output<<mat[i][3]<<"]";if(i!=3) output<<",";else output<<"];";output<<endl;}
        output<<"-------------------\n";
    }
    void printDis(ostream& output)const{
        output<<"to_target="<<to_target<<endl<<"number:";
        for(int i=0;i<16;++i) output<<i<<" "; output<<endl<<"distance:";
        if(to_target){
            for(int i=0;i<16;++i) output<<(abs(coors[i].x-target1coors[i].x)+abs(coors[i].y-target1coors[i].y))<<" ";
            output<<endl;
        }
        else{
            for(int i=0;i<16;++i) output<<(abs(coors[i].x-target0coors[i].x)+abs(coors[i].y-target0coors[i].y))<<" ";
            output<<endl;
        }
    }
    void printPerm(ostream& output)const{output<<"perm="<<perm<<endl;}
    mr1 to_mr1()const{
        ull conf=0,co=0;
        for(int i=0;i<16;++i){co|=(ull)coors[i].x<<(4*i);co|=(ull)coors[i].y<<(4*i+2);}
        for(int i=0;i<4;++i)
            for(int j=0;j<4;++j)
                conf|=(ull)mat[i][j]<<(4*(4*i+j));
        return mr1{conf,co,(int8_t)dis};
    }
    mr0 to_mr0()const{
        ull conf=0,co=0;
        for(int i=0;i<16;++i){co|=(ull)coors[i].x<<(4*i);co|=(ull)coors[i].y<<(4*i+2);}
        for(int i=0;i<4;++i)
            for(int j=0;j<4;++j)
                conf|=(ull)mat[i][j]<<(4*(4*i+j));
        return mr0{conf,co,(int8_t)dis};
    }
};
I need to be able to find the element at given coordinates and also to find the coordinates of a given element, so I decided to store both. Next, I can build the graph and try to find a path. To save space, I noticed that an order 4 magic square can be expressed by a 64 bit integer - either configuration or coordinates. So I used that expression as a "reduced matrix". That's what the "mr0" and "mr1" are, corresponding to the different parities.

There are 384 order 4 pandiagonal magic squares, 192 for each parity.
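For reference, here is a sketch (mine, not from the original post) of one way those 384 squares can be enumerated with the constructor above: assign {1,2,4,8} to (b,c,d,e) in each of the 24 possible orders, and apply the 16 translations.
vector<matrix> allPandiagonal(){
    vector<matrix> all;
    array<int,4> v{1,2,4,8};                 //the 16 entries are the subset sums of these four values
    do{
        for(int tx=0;tx<4;++tx)
            for(int ty=0;ty<4;++ty)
                all.emplace_back(v[0],v[1],v[2],v[3],tx,ty);
    }while(next_permutation(v.begin(),v.end()));
    return all;                              //24*16 = 384 squares, 192 of each parity
}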

Before implementing the reduced matrices, I tried to find a solution manually and see what the best result I could get was. I used this implementation and changed it a bit. First, I needed to be able to set the parity (I could've computed the parity from the matrix input, but I decided to skip that. I could've also created a plain-text input for the initial matrix, but it was not that important since I just wanted to see the result, so I kept it simple and just pasted the matrix into the source code). Second, I needed the elements that I clicked, in reverse order.

I found a solution of path length 90 manually.

Now, getting back to the reduced matrices and graph building algorithm...
unordered_map<ull,array<ull,4>> graph;
struct mr0{//reduced matrix
    ull conf,co;
    int8_t dis=0;
    mr0(ull conf1,ull co1,int8_t dist):conf(conf1),co(co1),dis(dist){}
    mr0 getMr(){
        ull confs[4];//left,right,up,down
        ull coos[4];
        int8_t dis[4];
        double prob[4]={0.0};
        if((co&3)!=0){//l
            confs[0]=conf;
            coos[0]=co;
            int x=(co-1)&3,y=(co>>2)&3;
            coos[0]--;
            int locOfe=4*(4*x+y);
            ull e=(conf>>locOfe)&15;
            coos[0]+=1ULL<<(e*4);
            confs[0]&=~(15ULL<<locOfe);
            confs[0]|=e<<(4*(4*(x+1)+y));
            dis[0]=getDis(coos[0]);
            addg(confs[0]);
            if(dis[0]==0) return mr0{confs[0],coos[0],dis[0]};
            prob[0]=exp(-dis[0]*T);
            //cout<<"left config:\n";print(confs[0]);cout<<"distance="<<(int)dis[0]<<endl;
        }
        if((co&3)!=3){//r
            confs[1]=conf;
            coos[1]=co;
            int x=(co+1)&3,y=(co>>2)&3;
            coos[1]++;
            int locOfe=4*(4*x+y);
            ull e=(conf>>locOfe)&15;
            coos[1]-=1ULL<<(e*4);
            confs[1]&=~(15ULL<<locOfe);
            confs[1]|=e<<(4*(4*(x-1)+y));
            dis[1]=getDis(coos[1]);
            addg(confs[1]);
            if(dis[1]==0) return mr0{confs[1],coos[1],dis[1]};
            prob[1]=exp(-dis[1]*T);
            //cout<<"right config:\n";print(confs[1]);cout<<"distance="<<(int)dis[1]<<endl;
        }
        if((co&12)!=0){//u
            confs[2]=conf;
            coos[2]=co;
            int x=co&3,y=((co>>2)-1)&3;
            coos[2]-=4;
            int locOfe=4*(4*x+y);
            ull e=(conf>>locOfe)&15;
            coos[2]+=4ULL<<(e*4);
            confs[2]&=~(15ULL<<locOfe);
            confs[2]|=e<<(4*(4*x+y+1));
            dis[2]=getDis(coos[2]);
            addg(confs[2]);
            if(dis[2]==0) return mr0{confs[2],coos[2],dis[2]};
            prob[2]=exp(-dis[2]*T);
            //cout<<"up config:\n";print(confs[2]);cout<<"distance="<<(int)dis[2]<<endl;
        }
        if((co&12)!=12){//d
            confs[3]=conf;
            coos[3]=co;
            int x=co&3,y=((co>>2)+1)&3;
            coos[3]+=4;
            int locOfe=4*(4*x+y);
            ull e=(conf>>locOfe)&15;
            coos[3]-=4ULL<<(e*4);
            confs[3]&=~(15ULL<<locOfe);
            confs[3]|=e<<(4*(4*x+y-1));
            dis[3]=getDis(coos[3]);
            addg(confs[3]);
            if(dis[3]==0) return mr0{confs[3],coos[3],dis[3]};
            prob[3]=exp(-dis[3]*T);
            //cout<<"down config:\n";print(confs[3]);cout<<"distance="<<(int)dis[3]<<endl;
        }
        for(int i=1;i<4;++i) prob[i]+=prob[i-1];
        for(int i=0;i<4;++i) prob[i]/=prob[3];
        //cout<<"Accumulated probabilities are:"; for(int i=0;i<4;++i) cout<<prob[i]<<","; cout<<endl;
        double p=roll1(generator);
        int l=0;
        for(;l<4;++l){if(p<prob[l]) break;}
        return mr0{confs[l],coos[l],dis[l]};
    }
    void addg(ull nei){
        auto& arr=graph[conf];
        for(int i=0;i<4;++i){
            if(arr[i]==nei) return;
            if(arr[i]==0){arr[i]=nei;break;}
        }
        auto& arr1=graph[nei];
        for(int i=0;i<4;++i){
            if(arr1[i]==0){arr1[i]=conf;return;}
        }
    }
    int getDis(ull coor){
        int dis=0;
        for(int i=0;i<16;++i) dis+=abs(int((coor>>4*i)&3)-target0coors[i].x)+abs(int((coor>>(4*i+2))&3)-target0coors[i].y);
        return dis;
    }
};
They store the configuration and coordinates of the matrix and the Manhattan distance to their target. The "getMr()" method generates a neighboring matrix with probability proportional to $e^{-d\,T}$, where $d$ is the neighbor's distance and $T$ is a constant parameter. The newly generated matrix is stored in a graph, and once it reaches the target, I did a BFS to find the shortest path.

Because the shortest path may start from any of the magic squares that have the same parity as the target, I just started the BFS from the target and looked for any of the initial matrices.
unordered_map<ull,int> layers;
unordered_set<ull> inits;
queue<ull> q;
ull bfs(ull start){
    while(!q.empty()) q.pop();
    q.emplace(start);
    int l=0;
    layers[start]=l;
    while(!q.empty()){
        int sz=q.size();
        l++;
        while(sz--!=0){
            const auto &arr=graph[q.front()];
            q.pop();
            for(int i=0;i<4;++i){
                ull t=arr[i];
                if(t==0) break;
                if(layers.find(t)!=layers.end()) continue;
                layers[t]=l;
                if(inits.count(t)){//cout<<"layers="<<l<<endl;
                    while(!q.empty()) q.pop(); return t;}
                q.emplace(t);
            }
        }
    }
    return 0;
}
vector<ull> getSteps(ull fin,ull tar){
    vector<ull> r{fin};
    ull nxt=fin;
    int l=layers[fin]-1;
    //cout<<"finished layers="<<l<<endl;
    while(nxt!=tar){
        const auto& arr=graph[nxt];
        for(int i=0;i<4;++i){
            if(arr[i]==0) break;
            if(layers[arr[i]]==l){nxt=arr[i];r.emplace_back(nxt);break;}
        }
        l--;
    }
    return r;
}
Since it's a randomized algorithm, the first path that it finds is usually quite long, on the order of 10,000, with large fluctuations. But as the graph gets bigger, there's a higher chance that a short path will be found, and the shortest path length decreases. It quickly drops to about 100, then it hardly moves, until the program crashes due to using too much memory, because the graph gets too big. The best result that I got with this algorithm is 59.

So, I searched for a better solver, and I found one here. Again, I need to change the source code a little bit to enable different parities.

For many of the inputs, this gives a result very fast, but for the rest, it also used too much memory (more than 12GB) and crashed. The implementation is quite general and not optimized for order 4 matrices.

Anyway, the best result I got from that program is 42! The answer to life, the universe and everything! Coincidence? I think not!

Start:
1 2 3 4
5 6 7 8
9 10 11 12
13 15 14 0

Sequence: 
[12, 8, 7, 11, 10, 6, 5, 9, 6, 10, 8, 12, 14, 8, 11, 3, 4, 7, 12, 11, 3, 5, 10, 15, 8, 3, 5, 4, 2, 10, 9, 1, 10, 9, 4, 2, 9, 4, 15, 8, 3, 14]

End:
10 4 9 7
1 15 2 12
6 8 5 11
13 3 14 0 

Now that I look at it again, I notice that their distance function skips the empty cell, which I didn't do in my distance calculation. I think that's why mine didn't work too well... Also, I think I only changed the destination for the odd parities but forgot to change the distance function accordingly. Maybe that's why there are more configurations it was unable to solve, and due to that, the paths found for those odd-parity cases are not guaranteed to be the shortest either. Anyway, it's not hard to fix, but I've got to move on...

One issue with the A* algorithm is that it only searches in one direction, and if the distance function is not helpful, in this context it reduces to a BFS in the worst case, where the radius to explore equals the length of the shortest path. On the other hand, if we start from both ends, we just need to explore two balls of radius l/2, where l is the shortest path length. This makes a huge difference when the volume grows exponentially with the radius.

So, another way to solve it is: since we know that a solution of length 50 exists, we can simply explore a ball of radius 25 from both sides, stop when the two balls touch, and read off the result, as sketched below. If they don't touch, we move on to the next starting configuration.
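Here is a rough sketch of that idea (my own; the post doesn't include code for it). neighbors() is assumed to enumerate the 2-4 legal moves of a packed configuration, the same kind of move generation that getMr() does above:
#include <bits/stdc++.h>
using namespace std;
using ull=unsigned long long; //packed configuration, as in the code above
//All configurations within the given radius of src, with their BFS distances.
unordered_map<ull,int> ball(ull src,int radius,const function<vector<ull>(ull)>& neighbors){
    unordered_map<ull,int> dist{{src,0}};
    queue<ull> q; q.push(src);
    while(!q.empty()){
        ull u=q.front(); q.pop();
        if(dist[u]==radius) continue;               //don't expand past the radius
        for(ull v:neighbors(u))
            if(!dist.count(v)){ dist[v]=dist[u]+1; q.push(v); }
    }
    return dist;
}
//Shortest path length between start and target if it is at most 50, otherwise -1.
int shortestWithin50(ull start,ull target,const function<vector<ull>(ull)>& neighbors){
    auto a=ball(start,25,neighbors), b=ball(target,25,neighbors);
    int best=INT_MAX;
    for(const auto& [conf,d]:a){                    //best meeting point of the two balls
        auto it=b.find(conf);
        if(it!=b.end()) best=min(best,d+it->second);
    }
    return best==INT_MAX?-1:best;
}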

Each configuration has from 2 to 4 neighbors, or 1 to 3 children. If 1 child and 3 children occur with the same probability, there are roughly 2^25 configurations in each sphere, which is about 32M. Including the surface it would be about 48M; two of them is 96M. Each configuration can be stored in 64 bits, plus one byte to store the distance, so it would be about 864MB. Considering that more space is needed for the hash map and other overheads, theoretically it can be done within 2GB of memory... I wonder how much memory it actually uses.
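In numbers, using the estimates above (about 48M configurations per ball, 8 bytes for the configuration plus 1 byte for the distance):

$$2\times 48\times 10^{6}\times (8+1)\ \text{bytes}=864\times 10^{6}\ \text{bytes}\approx 864\ \text{MB}$$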

So the bonus problem is easier to solve in this sense. If the path length were just a little longer, this method would not be feasible.
In contrast, my randomized algorithm can find a solution for every starting configuration very fast, but it's very hard to get the shortest path out of it. It's pretty good for the main problem, though: it finds a solution of length 150 pretty fast.

Now consider the general magic squares; the only difference is how we generate them. Apparently there are 7040 of them, or 880 if we exclude reflections and rotations. We could take a list that's available on the internet, like here, and apply the rotations and reflections to generate the others, or we could solve the 10 linear equations, 9 of which are independent, and iterate through the permutations of the 7 independent elements. Because of the symmetry, we can fix one element, say 1, to 3 positions, so the total number of choices to consider is 3*15*14*13*12*11*10. Then we do the reflections and rotations to get all 7040 matrices. After we have all of them, the rest is pretty much the same.

Tuesday, November 7, 2023

IBM Ponder this 2023 09 solution

The problem can be found here.
Since the solution has been posted on the website, here's my solution:
Q1 is simple, just calculate them directly:
#include <bits/stdc++.h>
using namespace std;
int gcd(int a,int b){
    if(a<b) swap(a,b);
    while(b){
        a=a%b;
        swap(a,b);
    }
    return a;
}
int main(){
    int a=531;
    //for(int i=2;i<1000;++i) {int d=gcd(i,a);cout<<d<<", ";a=a+d;}
    int cnt=0, i=2;
    while(cnt<10) {int d=gcd(i,a);if(d==5){++cnt;cout<<"(i="<<i<<",a="<<a<<"),";} a=a+d;i++;}
    return 0;
}
The second question takes a little observation, though. It's easy to see that the result must be an odd number. 9 and 15 can be ruled out by reasoning, and 21 works. Working backwards, it's not hard to find that when a_1 = 801, d_21 = 21. (Somehow the solution page omitted the answer to this question. Probably overlooked.)
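As a quick sanity check of that claim (my own snippet, using the same recurrence as the Q1 code above, just with a different starting value):
#include <bits/stdc++.h>
using namespace std;
int main(){
    long long a=801;                          //a_1 = 801
    for(int n=2;n<=21;++n){
        long long d=__gcd(a,(long long)n);    //d_n = gcd(n, a_{n-1})
        a+=d;
        if(d!=1) cout<<"d_"<<n<<"="<<d<<"\n"; //prints d_17=17, d_20=5, d_21=21
    }
    return 0;
}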

The bonus question is much harder to solve without the right tools, because it requires factoring many very large integers. Here's the code that I used to solve it:
import primefac as pf
import os
from datetime import datetime
file=open("result2309.txt","a")
start=datetime.now()
print("starting time:",start)
file.write("starting time:"+str(start)+"\n")
def gcd(a,b):
    while(b>0):
        a=a%b
        a,b=b,a
    return a
a=531
i=2
cnt=0
while(cnt<200):
  g=gcd(a,i)
  if g!=1:
    if g==5:
      cnt+=1
      print(f"#{cnt}: i={i},a={a}")
      file.write(f"#{cnt}: i={i},a={a}\n")
      file.flush()
      os.fsync(file)
    a+=g
    i+=1
    continue
  n=a-i
  m=a
  d=0
  f=0
  factors=pf.primefac(n, verbose=False,methods=(pf.ecm,))
  for factor in factors:
    d=factor-(i%factor)
    if d<m:
      m=d
      f=factor
  if f==5:
    cnt+=1
    print(f"#{cnt}: i={i},a={a}")
    file.write(f"#{cnt}: i={i},a={a}\n")
    file.flush()
    os.fsync(file)
  a+=m
  i+=m
  #print(f"next one: (i={i},a={a})")
  a+=f
  i+=1
end=datetime.now()
print("ending time:",end)
print(str(end-start))
file.write("ending time:"+str(end)+"\n")
file.write(str(end-start)+"\n")
file.close()
It requires the package "primefac" installed.
The "ECM" is much faster than the default "pollardrho_brent" method for extremely large numbers. The default method took around 10 hours, while the ECM took about 17 minutes.

I'll update the logic behind the code when I have more time... Or you may find the paper linked on their solution page helpful.

Sunday, October 29, 2023

Probability of line segments intersecting - A surprisingly simple result

I was drawing random stuff on a piece of paper, suddenly this problem came into my mind:
If I draw a length 1 line segment randomly, then draw another one, what's the probability that they'll intersect?
Of course this description is ambiguous. To give a better description, we need to define things more accurately. First, we should limit the area that we can draw the line segments in. Second, we should define what is meant by "drawing a line segment randomly". To avoid annoying boundary situations, let's use periodic boundary conditions. But then, the size can't be too small, otherwise we may encounter complicated situations like crossing multiple times. Apparently, if the second line segment intersects the first, both of the end points of the second line segment must lie within the envelope curve at distance 1 from the first line segment. This shape is called a "stadium", with a=r=1 in this case. If we choose a rectangular area w×l where w>=2 and l>=2, multiple intersections won't happen. The formal definition of the problem is as follows:
Consider a rectangular area of size w×l with periodic boundary conditions, where w>=2, l>=2. Draw two line segments of length 1 randomly. Here, by randomly we mean: first choose a random point with uniform distribution over the rectangular area, then, from the circle centered at the first point with radius one, randomly choose a point with uniform distribution. These two points are the end points of the line segment. Question: what's the probability that the two line segments intersect?
Let's solve this problem! Because of the periodic boundary conditions, linear translations don't matter. We can move the first line segment to the center of the rectangular area. Then the probability is only nonzero if the second line segment lies within the stadium shape centered at the first line segment. It doesn't matter how the stadium shape is oriented, so we can assume that the first line segment lies horizontally. To find the probability, we can just integrate the probability that the second line segment intersects the first one over all possible first end points of the second line segment. The area outside the stadium has zero probability density. Due to the symmetry of the shape, we only need to consider a quarter of the stadium and multiply the answer by 4. The situations are different depending on which part of the stadium shape the first end point is in. We can split it into 4 parts, P1 to P4:
(Actually, P1 and P3 can be combined in the calculations in polar coordinates. I started with Cartesian, so I didn't notice until later.)
The probability is
$$p=\frac{1}{A}\times 4\times\frac{1}{2\pi}\left(p_1+p_2+p_3+p_4\right),$$
where
$$p_1=\int_0^1 r\,dr\int_0^{\frac{\pi}{2}}\left(\theta-\sin^{-1}(r\sin\theta)\right)d\theta$$
$$p_2=\int_{\frac{1}{2}}^{1}dx\int_0^{\sqrt{1-x^2}}\left(\tan^{-1}\frac{x}{y}+\tan^{-1}\frac{1-x}{y}\right)dy=\frac{1}{2}\int_0^1 r\,dr\int_{\sin^{-1}\frac{r}{2}}^{\frac{\pi}{2}}\left(\theta+\tan^{-1}\frac{1-r\sin\theta}{r\cos\theta}\right)d\theta$$
$$p_3=\int_0^1 r\,dr\int_0^{\sin^{-1}\frac{r}{2}}\left(\theta+\cos^{-1}(r\cos\theta)\right)d\theta$$
$$p_4=\int_0^{\frac{1}{2}}dx\int_{\sqrt{1-x^2}}^{1}2\cos^{-1}y\,dy$$
(We may combine $p_1$ and $p_3$ and just calculate $\int_0^1 r\,dr\int_{-\frac{\pi}{2}}^{\sin^{-1}\frac{r}{2}}\left(\theta+\cos^{-1}(r\cos\theta)\right)d\theta$ instead.)
I'm not sure whether the Cartesian or the polar coordinate system is easier for integrating them manually... I put them on Wolfram Alpha, and here's the result: $p_1=\frac{1}{4}$, $p_2=0.583664$, $p_3=0.155138$, $p_4=0.011197=\frac{1}{72}\left(27-3\sqrt{3}\pi-\pi^2\right)$. Then I added them together, and the sum is... 0.999999. Well, this can't be a coincidence, can it? I'll get the exact result when I have more time... So, the probability is $\frac{2}{\pi A}$, where $A$ is the area that the line segment "occupies". I quickly wrote a program to simulate the problem in a 3×3 square. Averaging 100000000 results, the probability is 0.0707416, where the theoretical value is $\frac{2}{9\pi}=0.07073553$. This reminds me of the result of Buffon's needle problem, which has probability $\frac{2l}{\pi t}$, where $l$ is the length of the needles and $t$ is the spacing between the lines, with $l\le t$. This result can be taken as a generalization of Buffon's needle problem. It can be described as
Assume there is a uniform random distribution of line segments of length 1 on the 2D plane, with density $\frac{1}{A}$. More precisely, the middle points of the line segments are uniformly distributed on the plane with density $\frac{1}{A}$, and the angle is also uniformly distributed from 0 to $2\pi$. Then, if we draw another line segment of length one randomly, as described in the beginning, the expected number of intersections between this line segment and the line segments in the "background" is $\frac{2}{\pi A}$.
I wonder if this can be further generalized. An obvious first step is changing the length of the line segments, which probably gives something similar to the result of Buffon's needle problem. How else can we generalize this?
2023.10.30 Update: I think, indeed, Buffon's needle problem can be seen as a special case of this problem. If we see each line as an infinite chain of length 1 line segments, and the distance between the lines is $t$, then each line segment occupies an area $t$, thus the expected number of intersections between a length-1 needle and the lines is $\frac{2}{\pi A}=\frac{2}{\pi t}$. Due to the additivity of expectation, a needle of length $l$ will have expectation $\frac{2l}{\pi t}$. When $l\le t$, there can be at most one intersection, thus this is the probability of them intersecting. Regarding the integrations, Claude Leibovici gives the exact expressions for them, except for $p_3$, which is still based on numerical evidence (albeit a very strong one). It seems very complicated, so I thought maybe it would be easier to do it in the other direction. If we integrate $r$ first instead, the interval of $\theta$ is $\left[0,\frac{\pi}{6}\right]$ and the interval of $r$ is $[2\sin\theta,1]$. The integration is
$$p_3=\int_0^{\frac{\pi}{6}}d\theta\int_{2\sin\theta}^{1}\left(\theta+\cos^{-1}(r\cos\theta)\right)r\,dr$$
$$=\int_0^{\frac{\pi}{6}}\left(\frac{\theta}{2}\left(1-4\sin^2\theta\right)+\left[\frac{r^2\cos^2\theta}{2}\cos^{-1}(r\cos\theta)+\frac{1}{4}\sin^{-1}(r\cos\theta)-\frac{r\cos\theta}{4}\sqrt{1-r^2\cos^2\theta}\right]_{2\sin\theta}^{1}\frac{1}{\cos^2\theta}\right)d\theta$$
$$=\int_0^{\frac{\pi}{6}}\left(\frac{\theta}{2}\left(1-4\sin^2\theta\right)+\left(\frac{\theta\cos^2\theta}{2}+\frac{1}{4}\left(\frac{\pi}{2}-\theta\right)-\frac{\sin\theta\cos\theta}{4}-\frac{\sin^2 2\theta}{2}\left(\frac{\pi}{2}-2\theta\right)-\frac{2\theta}{4}+\frac{\sin 2\theta\cos 2\theta}{4}\right)\frac{1}{\cos^2\theta}\right)d\theta$$
I got lazy again and just lumped it into Wolfram Alpha. This time it gives the exact result, $\frac{1}{36}\left(9+3\sqrt{3}\pi-2\pi^2\right)$. Combined with the result $p_2=\frac{1}{72}\left(9-3\sqrt{3}\pi+5\pi^2\right)$ and the other two known results, I think the proof that the probability equals $\frac{2}{\pi A}$ is complete.
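For what it's worth, here is a minimal Monte Carlo sketch of the 3×3 simulation mentioned above (my own re-implementation, not the original program). By translation invariance the first segment is pinned at the center, and for a 3×3 torus a unit segment cannot wrap around and still reach the central segment, so a plain (non-periodic) intersection test suffices:
#include <bits/stdc++.h>
using namespace std;
double cross(double ax,double ay,double bx,double by){return ax*by-ay*bx;}
//Proper-crossing test via orientation signs; degenerate (collinear/touching)
//configurations have probability zero here, so they are ignored.
bool intersect(double x1,double y1,double x2,double y2,
               double x3,double y3,double x4,double y4){
    double d1=cross(x2-x1,y2-y1,x3-x1,y3-y1);
    double d2=cross(x2-x1,y2-y1,x4-x1,y4-y1);
    double d3=cross(x4-x3,y4-y3,x1-x3,y1-y3);
    double d4=cross(x4-x3,y4-y3,x2-x3,y2-y3);
    return d1*d2<0&&d3*d4<0;
}
int main(){
    const double PI=acos(-1.0);
    mt19937_64 rg(12345);
    uniform_real_distribution<double> pos(-1.5,1.5),ang(0.0,2*PI);
    const long long N=100000000;
    long long hit=0;
    for(long long i=0;i<N;++i){
        double px=pos(rg),py=pos(rg),a=ang(rg);
        //first segment pinned from (-0.5,0) to (0.5,0); second segment of length 1 from (px,py)
        if(intersect(-0.5,0,0.5,0,px,py,px+cos(a),py+sin(a))) ++hit;
    }
    cout<<"estimate="<<(double)hit/N<<"  theory 2/(9*pi)="<<2/(9*PI)<<endl;
    return 0;
}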

Wednesday, October 18, 2023

The long journey from O(n^2) to O(n^1.5), Leetcode H1269. Number of Ways to Stay in the Same Place After Some Steps

H1269 description:
You have a pointer at index 0 in an array of size arrLen. At each step, you can move 1 position to the left, 1 position to the right in the array, or stay in the same place (The pointer should not be placed outside the array at any time). Given two integers steps and arrLen, return the number of ways such that your pointer is still at index 0 after exactly steps steps. Since the answer may be too large, return it modulo 10^9+7. Constraints: 1 <= steps <= 500, 1 <= arrLen <= 10^6
An O(steps·min(steps,arrLen)) time complexity, O(min(steps,arrLen)) space complexity solution is not hard to find, which is the typical approach and featured in the Editorial.

Basically we build up a table from the first step to the last step, where each row records the number of ways to be at each position after that step. It's easy to see that one can move at most steps/2 positions to the right, so if arrLen > steps/2 we just need to calculate steps/2+1 positions. And since the positions after each step only depend on the positions at the previous step, one can use just two arrays to represent them and swap them after each step.

Here is a slightly improved version based on that approach:
#pragma GCC target("avx,mmx,sse2,sse3,sse4")
auto _=[]()noexcept{ios::sync_with_stdio(0);cin.tie(0);cout.tie(0);return 0;}();
unsigned int cnt[251];
const int mod=1000000007;
class Solution {
public:
    int numWays(int steps, int arrLen){
        if(arrLen<=2){
            if(arrLen==1) return 1;
            int p=(steps-1)/63,q=(steps-1)%63;
            unsigned long long r=(1ULL<<q)%mod,c=(1ULL<<63)%mod;
            while(p--) r=(r*c)%mod;
            return r;
        }
        int b=min(steps/2+1,arrLen),tmp2;
        unsigned long long tmp1;
        memset(cnt,0,b*sizeof(cnt[0]));
        cnt[0]=1;
        for(int i=0;i<steps;++i){
            tmp1=cnt[0];
            cnt[0]=(cnt[0]+cnt[1])%mod;
            int s=min(min(i+1,steps-i),b-1);
            for(int j=1;j<s;++j){tmp2=cnt[j]; cnt[j]=(tmp1+cnt[j]+cnt[j+1])%mod; tmp1=tmp2;}
            cnt[s]=(tmp1+cnt[s])%mod;
        }
        return cnt[0];
    }
};
Since each new value only depends on 3 array elements from the previous step, which is a fixed number, we don't really need a second array to store that. Just use two integers to hold two of the previous elements and roll them forward while updating a single array in place. This saves half of the space.

Also, noticing that the shape of the relevant elements in the array forms a triangle (the first row has one non-zero element, the second row has two, ...; after the halfway point we must head back, so positions further to the right are irrelevant and there's no need to compute them; the second-to-last row has two relevant elements, the last has one), we can skip those unnecessary calculations, which saves at most half of the time.

When arrLen==2, the result has a very simple form, 2^(steps-1), so we just need to find 2^(steps-1) mod 10^9+7. The power function can be done in O(log n), but here, since steps <= 500, if we just factor 2^(steps-1) into powers of 2^63 and a remainder, it can be done in at most 8 operations, which is probably more efficient than calling a power function.

This is an optimization to the typical approach. So far so good.

But, can this be optimized further?

Let's see what actually happens when we evolve the array after each step: $a_{j+1}[0]=a_j[0]+a_j[1]$, $a_{j+1}[1]=a_j[0]+a_j[1]+a_j[2]$, ...
If we write the array as a column vector, the new vector is the old one multiplied by a matrix:
$$a_{j+1}=Ha_j$$
where H is
1 1 0 0 0 ... 0
1 1 1 0 0 ... 0
0 1 1 1 0 ... 0
0 0 1 1 1 ... 0
...
0 ... 0 1 1 1 0
0 ... 0 0 1 1 1
0 ... 0 0 0 1 1
After $s$ operations, the resulting vector is $a_s=H^s a_0$, where $a_0$ is a vector whose first element equals one and the rest are zero. The answer to the problem is just $a_s[0]$, which equals the product of the first row of $H^s$ and $a_0$. Since the only non-zero element of $a_0$ is the first element, which is one, the answer is just $H^s[0][0]$.

A simple implementation of matrix multiplication has O(n^3) complexity. The Strassen algorithm can bring it down to O(n^(log2 7)) ≈ O(n^2.81), and recent research brought the bound to as low as O(n^2.372). But those are considered galactic algorithms. The matrix H is at most 500*500; even the not-so-complicated Strassen algorithm probably won't accelerate it by much. Furthermore, it must be multiplied by O(log s) to include the complexity of computing the power s. So this does not improve the performance at all.

But noticing that the matrix is a symmetric tridiagonal Toeplitz matrix, maybe we can exploit this symmetry and find a better algorithm.

A Hermitian matrix is always diagonalizable, $H=PDP^\dagger$, where $D$ is a diagonal matrix whose elements are the eigenvalues of $H$, and $P$ is a unitary matrix, $P^{-1}=P^\dagger$. If $H$ is symmetric, $P$ is orthogonal, $P^{-1}=P^T$. This property makes it very easy to calculate powers of Hermitian/symmetric matrices, since all the $P$ and $P^{-1}$ in the middle cancel out, $H^s=PD^sP^\dagger$, and $D^s$ is very easy to calculate - just raise every element on the diagonal to the power $s$.

This property alone doesn't help much, since eigendecomposition typically has the same complexity as matrix multiplication. The only thing it improves is that the power becomes easy to compute, so the log s factor is gone. This is not good enough.
  
But this matrix is not only symmetric, it's also Toeplitz. A Toeplitz matrix can be decomposed in O(n^2). A tridiagonal Toeplitz matrix has known eigenvalues, namely $a+2\sqrt{bc}\,\cos\left(\frac{k\pi}{n+1}\right),\ k=1,\dots,n$, where $a=b=c=1$ in this case.

Since $D^s$ is diagonal, it's easy to see that $H^s[0][0]=\sum_i P[0][i]\,D^s[i][i]\,P^T[i][0]=\sum_i P[0][i]^2\,D^s[i][i]$. We already have the eigenvalues; now, if we can find the first row of the matrix $P$ without actually decomposing the matrix, it would be an O(n) algorithm!

Here are the eigenvalues and the first rows of P for various sizes of H that I found. For each size, the first line below it gives the eigenvalues and the second line gives the first row of the corresponding P:
(If the table below is all messed up, try refreshing. For speed (I guess) or whatever reason, sometimes browsers don't load the scripts in order, which messes up the table.)
3*3
eigenvalues: $1+\sqrt{2}$, $1$, $1-\sqrt{2}$
first row of P: $1/2$, $1/\sqrt{2}$, $1/2$
4*4
eigenvalues: $(3+\sqrt{5})/2$, $(1+\sqrt{5})/2$, $(1-\sqrt{5})/2$, $(3-\sqrt{5})/2$
first row of P: $1/\sqrt{5+\sqrt{5}}$, $1/\sqrt{5-\sqrt{5}}$, $1/\sqrt{5+\sqrt{5}}$, $1/\sqrt{5-\sqrt{5}}$
5*5
eigenvalues: $1+\sqrt{3}$, $2$, $1$, $1-\sqrt{3}$, $0$
first row of P: $1/(2\sqrt{3})$, $1/2$, $1/\sqrt{3}$, $1/(2\sqrt{3})$, $1/2$
6*6
eigenvalues: 2.80194, 2.24698, 1.44504, -0.801938, 0.554958, -0.24698
first row of P: 0.231921, -0.417907, 0.521121, 0.231921, 0.521121, 0.417907
7*7
eigenvalues: 2.84776, 2.41421, 1.76537, 1, -0.847759, -0.414214, 0.234633
first row of P: 0.191342, 0.353553, 0.46194, 0.5, 0.191342, 0.353553, 0.46194
Noticing the symmetry of the elements in P and the corresponding eigenvalues, and remembering that the eigenvalues are $1+2\cos\theta$, it's easier to see the pattern by converting the eigenvalues back to the cos values, i.e., subtracting 1 from them and then dividing by 2:
3*3
cos values: $\sqrt{2}/2$, $0$, $-\sqrt{2}/2$
first row of P: $1/2$, $1/\sqrt{2}$, $1/2$
4*4
cos values: $(1+\sqrt{5})/4$, $(-1+\sqrt{5})/4$, $(-1-\sqrt{5})/4$, $(1-\sqrt{5})/4$
first row of P: $1/\sqrt{5+\sqrt{5}}$, $1/\sqrt{5-\sqrt{5}}$, $1/\sqrt{5+\sqrt{5}}$, $1/\sqrt{5-\sqrt{5}}$
5*5
cos values: $\sqrt{3}/2$, $1/2$, $0$, $-\sqrt{3}/2$, $-1/2$
first row of P: $1/(2\sqrt{3})$, $1/2$, $1/\sqrt{3}$, $1/(2\sqrt{3})$, $1/2$
6*6
cos values: 1.80194/2, 1.24698/2, 0.44504/2, -1.801938/2, -0.44504/2, -1.24698/2
first row of P: 0.231921, -0.417907, 0.521121, 0.231921, 0.521121, 0.417907
7*7
cos values: 1.84776/2, 1.41421/2, 0.76537/2, 0, -1.847759/2, -1.414214/2, -0.76537/2
first row of P: 0.191342, 0.353553, 0.46194, 0.5, 0.191342, 0.353553, 0.46194
I noticed that for the 3*3, 5*5 and 7*7 matrices, the elements corresponding to the cos value 0 are $1/\sqrt{2}$, $1/\sqrt{3}$, $1/\sqrt{4}$ respectively, and for the 4*4 all the elements have a factor of $\sqrt{5}$ in the denominator, so there could be a factor of $\frac{1}{\sqrt{n+1}}$. I also guessed that it may have something to do with $\sin\theta$, since as the absolute value of the eigenvalue decreases, the element value increases. And I figured it out!
The formula is:
$$\theta_i=\frac{i\pi}{n+1},\quad i=1,2,\dots,n$$
$$e_i=1+2\cos\theta_i$$
$$p_i=\sqrt{\frac{2}{n+1}}\,\sin\theta_i$$
$$\mathrm{ans}=\sum_i e_i^s\,p_i^2,\quad s=\text{steps}$$
The following code implements this algorithm. The only problem now is that the result can be very large, so how do we find its remainder mod 10^9+7 accurately?
#pragma GCC target("avx,mmx,sse2,sse3,sse4")
auto _=[]()noexcept{ios::sync_with_stdio(0);cin.tie(0);cout.tie(0);return 0;}();
unsigned int cnt[251];
const int mod=1000000007;
class Solution {
public:
    int numWays(int steps, int arrLen){
        if(arrLen<=2){
            if(arrLen==1) return 1;
            int p=(steps-1)/63,q=(steps-1)%63;
            unsigned long long c=(1ULL<<63)%mod,r=(1ULL<<q)%mod;
            while(p--) r=(r*c)%mod;
            return r;
        }
        int b=min(steps/2+1,arrLen),tmp2;

        //new algorithm
        double r=0.0;
        for(int i=1;i<=b;++i){
            double c=cos(M_PI*i/(b+1));
            r+=pow(1+2*c,steps)*(1-pow(c,2));
        }
        r=r*2/(b+1);
        if(r<mod) return round(r);

        //may not work due to rounding errors if result is too large, so back to old algorithm
        unsigned long long tmp1;
        memset(cnt,0,b*sizeof(cnt[0]));
        cnt[0]=1;
        for(int i=0;i<steps;++i){
            tmp1=cnt[0];
            cnt[0]=(cnt[0]+cnt[1])%mod;
            int s=min(min(i+1,steps-i),b-1);
            for(int j=1;j<s;++j){tmp2=cnt[j]; cnt[j]=(tmp1+cnt[j]+cnt[j+1])%mod; tmp1=tmp2;}
            cnt[s]=(tmp1+cnt[s])%mod;
        }
        return cnt[0];
    }
};
Simply changing to 
        r=remainder(r,mod);
        return round(r);
gives numeric errors:
Input
steps = 47
arrLen = 38
Output 318653590
Expected 318671228
(In this example, the result is "r=1.149565e+20"; it's accurate to about 15 significant digits.
This means we can safely do that when the result is less than 10^15, which is about 2^50, so we can change
if(r<mod) return round(r);
to
if(r<(1ULL<<50)) return round(fmod(r,mod));//fmod() avoids negative results that remainder() may return
and the result is precise enough.)

So it requires too much precision for very large outputs. But in general, it gives a very good approximation in O(n) complexity. Amazing!

But can we improve the complexity for the original problem? Let's go down this path further and see where it leads us!

If we apply the trivial result that the answer equals 1 when steps=1, we get an interesting identity,
$$\sum_i\left(1+2\cos\theta_i\right)\sin^2\theta_i=\frac{n+1}{2},$$
where $\theta_i=\frac{i\pi}{n+1},\ i=1,2,\dots,n$.
One way to prove it is to simply write $\cos\theta_k=(e^{i\theta_k}+e^{-i\theta_k})/2$ and $\sin\theta_k=(e^{i\theta_k}-e^{-i\theta_k})/(2i)$ and expand the expression; most of the summations cancel out, only the number 1/2 repeats n+1 times, which completes the proof.

There is an easier way to prove it, though. Noticing that $\cos(\pi-\theta)=-\cos\theta$ and $\sin(\pi-\theta)=\sin\theta$, the terms $\cos\theta_k\sin^2\theta_k$ cancel out! Now we're left with $\sum_k\sin^2\theta_k=\sum_k\frac{1-\cos(2\theta_k)}{2}=\frac{n}{2}-\frac{1}{2}\sum_k\cos(2\theta_k)$. $\cos(2\theta_k)$ can be taken as the real part of $e^{i2\theta_k}$, where $\theta_k=\frac{k\pi}{n+1},\ k=1,\dots,n$. So it becomes $\sum_{k=1}^{n}e^{ik\frac{2\pi}{n+1}}$. It would make a full circle if we included $k=0$, which contributes an extra 1, thus the summation must equal -1. This gives the expected result $\frac{n+1}{2}$.
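A quick numerical check of the identity (mine, not from the original post):
#include <bits/stdc++.h>
using namespace std;
int main(){
    for(int n=2;n<=10;++n){
        double sum=0;
        for(int i=1;i<=n;++i){
            double th=M_PI*i/(n+1);
            sum+=(1+2*cos(th))*sin(th)*sin(th); //(1+2cos)sin^2 summed over the eigen-angles
        }
        cout<<"n="<<n<<": sum="<<sum<<", (n+1)/2="<<(n+1)/2.0<<"\n";
    }
    return 0;
}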

Can we generalize this to the general case with power s? Let's find out!

The function is zero for $\theta=0$, so adding $k=0$ doesn't change the result.
Then, if we also include $\theta_k=\frac{k\pi}{n+1}$ where $k=n+1,n+2,\dots,2n+1$, they just repeat the values for $k=0,1,\dots,n$, so the result is doubled.
Applying these changes and expressing the sin and cos in exponentials, we get
$$\mathrm{ans}=\frac{2}{n+1}\sum_{k=0}^{2n+1}\left(e^{i\theta_k}+e^{-i\theta_k}+1\right)^s\cdot\frac{-\left(e^{2i\theta_k}+e^{-2i\theta_k}-2\right)}{8}$$
If we expand it as a polynomial in $e^{i\theta_k}$, $\mathrm{ans}=\sum_k\sum_m a_m e^{im\theta_k}$, most of the terms will cancel out after the summation over $k$, since that summation becomes $\left(1-e^{i(2n+2)\frac{m\pi}{n+1}}\right)/\left(1-e^{i\frac{m\pi}{n+1}}\right)$, which is zero unless $\frac{m}{n+1}$ is a multiple of 2, in which case the summation is $2n+2$ since it's 1 repeated $2n+2$ times.

Thus, only the terms with power m being a multiple of 2n+2 contributes to the result. Each one of those contributs am(2n+2). How do we find am? The coefficient of the term eimθk in (eiθk+eiθk+1)s is easy to find, namely j=0sm2(sj)(sjj+m). Let's denote that by c(m). Multiplying it by e2iθk+e2iθk2 will shift it to the powers m+2, m2 and m respectively.
Thus, the result is
ans=12(m2c(m)c(m+2)c(m2))
where c(m)=j=0sm2(sj)(sjj+m).

Then, what are the ms? Since the power can be at most s, and m must be a multiple of 2n+2, the ratio must be within [s2n+2,s2n+2]. We can then just iterate through that and multiply by 2n+2. This completes the algorithm.

This algorithm has complexity O(steps·max(steps/arrLen, 1)). Interestingly, one would expect the complexity to grow with arrLen, but with this algorithm it's the opposite. When arrLen > steps, the old algorithm at the beginning takes O(steps^2) while this one takes O(steps)! If we combine this algorithm with the old one, which has complexity O(steps·min(steps, arrLen)), i.e., we use the new algorithm for large arrLen and the old one for small arrLen, we get an O(steps^1.5) algorithm! We choose the new one if arrLen > sqrt(steps) and the old one otherwise.

(Building the binomial table has complexity O(N^2), but it is not included in the complexity above, since it only needs to be built once. For multiple requests the amortized cost can be arbitrarily low. Explicitly, if the function is called k times, while the maximum number of steps is N, then the amortized table cost per call is N^2/k. If k >= sqrt(N), the amortized complexity is still O(N^1.5).
On the other hand, one could use a factorial and inverse-factorial table instead, which can be built in O(N), and $\binom{a}{b}$ can be calculated as $a!\,(b!)^{-1}\,((a-b)!)^{-1}$, at the cost of 2 extra multiplication operations on each call to binom. But this makes it exactly O(N^1.5).
)
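For reference, a minimal sketch of that factorial / inverse-factorial variant (my own; it reuses the same mod constant as the solution code below and relies on mod being prime, so the inverse comes from Fermat's little theorem):
long long modpow(long long a,long long e,long long m){
    long long r=1; a%=m;
    while(e){ if(e&1) r=r*a%m; a=a*a%m; e>>=1; }
    return r;
}
vector<long long> fact{1},invFact{1};
void buildFact(int n){
    fact.resize(n+1); invFact.resize(n+1);
    for(int i=1;i<=n;++i) fact[i]=fact[i-1]*i%mod;
    invFact[n]=modpow(fact[n],mod-2,mod);               //one O(log mod) inversion
    for(int i=n;i>0;--i) invFact[i-1]=invFact[i]*i%mod; //the rest filled in O(n)
}
long long binomFact(int a,int b){                       //C(a,b) in O(1) per call once the table is built
    if(b<0||b>a) return 0;
    return fact[a]*invFact[b]%mod*invFact[a-b]%mod;
}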

The following code implements this algorithm:
#pragma GCC target("avx,mmx,sse2,sse3,sse4")
auto _=[]()noexcept{ios::sync_with_stdio(0);cin.tie(0);cout.tie(0);return 0;}();
unsigned int cnt[251];
const int mod=1000000007;
vector<vector<int>> binom{{1}};
void buildBinom(int n){
    for(int i=binom.size()-1;i<=n;++i){
        vector<int> v;
        v.reserve(i+2);
        v.emplace_back(1);
        const auto& b=binom.back();
        for(int j=0;j<i;++j) v.emplace_back((b[j]+b[j+1])%mod);
        v.emplace_back(1);
        binom.emplace_back(move(v));
    }
}
int getCoe(int s,int j,int t){
    if(j+t<0||s<2*j+t) return 0;
    return (long long)binom[s-j][j+t]*binom[s][j]%mod;
}
class Solution {
public:
    int numWays(int steps, int arrLen){
        if(arrLen<=2){
            if(arrLen==1) return 1;
            int p=(steps-1)/63,q=(steps-1)%63;
            unsigned long long c=(1ULL<<63)%mod,r=(1ULL<<q)%mod;
            while(p--) r=(r*c)%mod;
            return r;
        }
        if(steps==1) return 1;
        if(arrLen>steps||arrLen*arrLen>steps){
            buildBinom(steps);
            long long t=0;
            int tmp=2*arrLen+2;
            for(int m=-steps/tmp*tmp;m<=steps;m+=tmp){
                int upper=(steps-m)/2+2;
                for(int j=0;j<=upper;++j){
                    t=(t+2*getCoe(steps,j,m)-getCoe(steps,j,m+2)-getCoe(steps,j,m-2))%mod;
                }
            }
            return ((t*500000004)%mod+mod)%mod;
        }
        int tmp2;
        unsigned long long tmp1;
        memset(cnt,0,arrLen*sizeof(cnt[0]));
        cnt[0]=1;
        for(int i=0;i<steps;++i){
            tmp1=cnt[0];
            cnt[0]=(cnt[0]+cnt[1])%mod;
            for(int j=1;j<arrLen;++j){tmp2=cnt[j]; cnt[j]=(tmp1+cnt[j]+cnt[j+1])%mod; tmp1=tmp2;}
            cnt[arrLen]=(tmp1+cnt[arrLen])%mod;
        }
        return cnt[0];
    }
};
Ah, one more thing. The number 500000004 here acts as $2^{-1}$, because we can't simply divide t by 2. To divide by a number mod q, we must multiply by its modular inverse. Now the O(n^1.5) algorithm is complete.

To get rid of this $2^{-1}$, notice the symmetry of the coefficients, $c(m)=c(-m)$: it's easy to see that the sums of the terms corresponding to $m$ and $-m$ are the same. With this, the middle part can be written as
        if(arrLen>steps||arrLen*arrLen>steps){
            buildBinom(steps);
            long long t=0;
            for(int j=0;j<=steps/2;++j) t=(t+getCoe(steps,j,0)-getCoe(steps,j,2))%mod;
            int tmp=2*arrLen+2;
            for(int m=tmp;m<=steps;m+=tmp){
                int upper=(steps-m)/2+2;
                for(int j=0;j<=upper;++j){
                    t=(t+2*getCoe(steps,j,m)-getCoe(steps,j,m+2)-getCoe(steps,j,m-2))%mod;
                }
            }
            return (t+mod)%mod;
        }
which also cuts the time in half.


I really enjoyed the journey to this surprising result. It reminds me of the first time I learned about the Strassen algorithm. The fact that "this is achievable!" and seeing it unfold in front of me is so satisfying!

(ps: I just learned that the solution for the case arrLen >= steps/2+1 is known as the Motzkin numbers, which satisfy the recurrence relation $M_n=\frac{2n+1}{n+2}M_{n-1}+\frac{3n-3}{n+2}M_{n-2}$. This gives an O(n) algorithm without the need to build the binomial table for this special case. The only downside is that we must find $(i+2)^{-1}\bmod(10^9+7)$ for $i=1,\dots,\text{steps}$, which takes about 30 operations for each of them by computing $(i+2)^{10^9+5}$. But it's still a really good algorithm. I wonder if this can be generalized to the "correction terms" above, i.e., whether there is a recurrence relation for $M_{n,m}=c(m)-\left(c(m-2)+c(m+2)\right)/2$. The Motzkin numbers are the $M_{n,0}$ here. If so, the binomial table would not be necessary.)

Wednesday, August 2, 2023

Leetcode M808 Soup Servings Convergence Analysis

M808 description:
There are two types of soup: type A and type B. Initially, we have n ml of each type of soup. There are four kinds of operations: 1. Serve 100 ml of soup A and 0 ml of soup B, 2. Serve 75 ml of soup A and 25 ml of soup B, 3. Serve 50 ml of soup A and 50 ml of soup B, and 4. Serve 25 ml of soup A and 75 ml of soup B. When we serve some soup, we give it to someone, and we no longer have it. Each turn, we will choose from the four operations with an equal probability 0.25. If the remaining volume of soup is not enough to complete the operation, we will serve as much as possible. We stop once we no longer have some quantity of both types of soup. Note that we do not have an operation where all 100 ml's of soup B are used first. Return the probability that soup A will be empty first, plus half the probability that A and B become empty at the same time. Answers within 10^-5 of the actual answer will be accepted. Constraints: 0 <= n <= 10^9
It's not difficult to find the algorithm, if one notices the recursive nature of the problem: State (A,B) → (A-100,B) or (A-75,B-25) or (A-50,B-50) or (A-25,B-75).
But the upper bound of n is 10^9. Are we supposed to build a 10^9 by 10^9 table? How much time would that take?

It's easy to see that we can simplify the conditions a little bit. Apparently if the quantity is not a multiple of 25, the situation is the same as when the quantity is the next multiple of 25. So let's convert n to the ceiling of its ratio to 25.
Now we can change the 4 operations to: (m,n) → (m-4,n-0) or (m-3,n-1) or (m-2,n-2) or (m-1,n-3).
So it's easy to tell that the probability to win from the state (m,n) is $p(m,n)=\frac{1}{4}\left(p(m-4,n-0)+p(m-3,n-1)+p(m-2,n-2)+p(m-1,n-3)\right)$.
The base cases are
$$p(m,n)=\begin{cases}0 & \text{if } m>0,\ n\le 0\\ 1 & \text{if } m\le 0,\ n>0\\ 0.5 & \text{if } m\le 0,\ n\le 0\end{cases}$$

We could apply this subroutine recursively until we reach one of the base cases, but that would be very inefficient. The first important thing is we should memoize the intermediate results. We can do that in multiple ways. I chose the bottom up approach, i.e. just build the entire table. And since the conditions don't change, once I have the table, I can simply take a value from the table for any input.

But still, that would be a 4×10^7 by 4×10^7 table. Can we do better than that?
Here's the second crucial thing to notice: the larger n is, the higher the probability that soup A will be empty first becomes, because there's a higher chance that we serve more soup A than soup B in each step, so we would expect soup A to decrease faster than soup B. If this probability keeps increasing with n, we can stop at the n such that p(n,n) > 1-10^-5, because for larger inputs the difference between the result and 1 is always less than 10^-5, so we can simply return 1!

The following program, which I submitted, is based on this intuition. It took 0ms, pretty fast.
vector<vector<double>> tab;
bool first=true;
int oneTh=INT_MAX;
double check(int i,int j){
    if(i<0) return (1.0+(j>=0))/2;
    if(j<0) return 0;
    return tab[i][j];
}
double avg(int i,int j){
    if(i==-1){if(j==-1) return 0.5;return 1.0;}
    if(j==-1) return 0.0;
    return 0.25*(check(i-4,j)+check(i-3,j-1)+check(i-2,j-2)+check(i-1,j-3));
}
void init(){
    first=false;
    for(int n=1;n<=40000000;++n){
        for(int i=0;i<n-1;++i) tab[i].emplace_back(avg(i,n-1));
        if(1.0-avg(n-1,n-1)<0.00001){oneTh=n;break;}
        vector<double> tmp;
        for(int i=0;i<n;++i) tmp.emplace_back(avg(n-1,i));
        tab.emplace_back(move(tmp));
        // cout<<"n="<<n<<endl;
        // for(auto& v:tab){
        //     for(auto e:v) cout<<e<<",";
        //     cout<<endl;
        // }
    }
}
class Solution {
public:
    double soupServings(int n) {
        if(first) init();
        if((n=(n+24)/25)>=oneTh) return 1.0;
        if(n==0) return 0.5;
        return tab[n-1][n-1];
    }
};
(In the submission page, all the sample submissions that take less than 100ms use some magic number like "if(n>5000)" or "if(n>10000)", but the threshold could be easily determined during the computation.)
The main point of this article is not the program, though, so let's go back to the assumption. The assumption is based on our intuition, but can we prove it rigorously? Let's think about it!
The editorial of that problem on Leetcode demonstrates the law of large numbers. But the fact that the probability converges to 1 doesn't justify the statement that after we first encounter a value greater than 1-0.00001, all the values afterwards satisfy it too.
Consider the function $f(x)=e^{-x/100}\cos^2(x/100)$. We know that it converges to 0 as $x\to\infty$, but that doesn't mean we can stop when we encounter a value less than the criterion for the first time, because the values keep increasing and decreasing alternately! So what we really need is a proof that it increases monotonically, or an upper bound such that even if it starts to increase again, it will never exceed the upper bound.

The probabilities that we calculate form a table, which we can represent as a grid graph. To make it easier to see, let's rotate it by 45°:
The value at each cell is the average of another 4 cells, and if the cells are outside the grid, the value is 0 or 1 or 0.5 respectively in the regions shown on the figure:
(From here, immediately we find two optimizations to our program: First, it's unnecessary to calculate half of the numbers in the grid. Because we only need the values at (n,n), all the elements that we'll ever use are either in (odd,odd) cells or (even,even) ones. Second, we only need one layer of the values to construct the next layer. So instead of saving the entire table, we could just use two vectors to store the current layer and the next layer, which saves a lot of space. That will do for all odd ns, then we do the same to fill in all the even ones. Another possible optimization is, to find the last p(n,n), we just need the values of the cells between these two lines, so we can start to shrink the vectors after half way through, which cuts the space complexity by half again. But that's only if we already know the last value of n, and it's probably going to be a little more complicated to implement.)

(From here on, "next layer" means two rows downwards, e.g. from coordinate (1,1) to (3,3) is one layer down.)

What we want is a proof that the values at the cells in the middle increase monotonically as we go downward, or an upper bound on those values as a function of the number of steps.

Since we've seen the law of large numbers, maybe we can continue on that path?
The central limit theorem states that with i.i.d. variables, their average will converge to a normal distribution.
The problem can be seen as a random walk: we start at the origin; at each step we may take 1 step left, no move, 1 step right, or 2 steps right, each with probability 1/4, and we want to find the probability that we are at location l after n steps. It's easy to find the expectation of each step, $\mu_0=(-1+0+1+2)/4=1/2$, and the variance, $\sigma_0^2=(1+0+1+4)/4-1/4=5/4$.

If we start from (n,n), we will move out of the grid in at most n/2 steps. By then, the range of our final location is [-n/2, n]. The expectation of the sum is $\mu=\frac{n}{4}$. Let's show them on the figure:
If you're thinking "wait a minute - once we reach either one of the boundaries, we can't keep going, so how can we map the probability distribution along the boundaries to the line?", that's good thinking! Because indeed, it's not a one-to-one map. But we only care about the sum of the probabilities, and notice that if our final location is in the 0 or 1 region, we must have come through the corresponding boundary, and if we come through either one of the boundaries, we won't be able to leave that region. That means the sums of the probabilities from -n/2 to 0 and from 0 to n correspond to the probability of B being empty first and A being empty first, respectively. There is a discrepancy, though, because if we reach the first 3 rows inside the grid, we may enter the 0.5 region. But as n increases, that probability converges to zero.

So, let's compute the probability distribution of our final location on that horizontal line. The central limit theorem states that the probability distribution should converge to a normal distribution with expectation $\mu=\frac{n}{4}$ and variance $\sigma^2=\frac{5n}{8}$. The probability of A being empty is
$$P=1-\int_{-\infty}^{0}N\!\left(\frac{n}{4},\frac{5n}{8}\right)dx$$
We can manipulate the variable a little to convert it to an integration over the standard normal distribution:
$$P=1-\sqrt{\frac{5n}{8}}\int_{-\infty}^{-\frac{n}{4}\sqrt{\frac{8}{5n}}}N(0,1)\,dx$$
$$=1-\sqrt{\frac{5n}{8}}\cdot\frac{1+\operatorname{erf}\left(-\sqrt{\frac{n}{20}}\right)}{2}$$

Now we can easily find out the threshold based on this approximation, in one line:
    for(int i=1;i<2000;++i) if((1+erf(-sqrt(i/20.0)))*sqrt(5*i/32.0)<0.00001) {cout<<"i="<<i;break;}
We get i=230, which corresponds to an original value of 230×25=5750.
The threshold I got from my program is oneTh=179, which corresponds to an original value of 4475.
How close is this approximation? Let's find out the differences:
    for(int i=1;i<tab.size();++i) cout<<setw(8)<<tab[i][i]<<" - "<<setw(8)<<(f=1-(1+erf(-sqrt(i/20.0)))*sqrt(5*i/32.0))<<" = "<<setw(8)<<tab[i][i]-f<<endl;
Output:
   0.625 - 0.702813 = -0.0778132
 0.65625 -    0.634 = 0.0222501
 0.71875 - 0.600243 = 0.118507
0.742188 - 0.583299 = 0.158888
0.757812 - 0.576178 = 0.181635
0.785156 - 0.575349 = 0.209808
0.796875 - 0.578759 = 0.218116
0.817871 - 0.585105 = 0.232766
0.827637 - 0.593511 = 0.234126
0.844849 - 0.603362 = 0.241487
0.852173 - 0.614214 = 0.237959
0.866699 - 0.625739 =  0.24096
0.872559 -  0.63769 = 0.234868
0.884827 - 0.649881 = 0.234946
0.889633 - 0.662167 = 0.227466
0.900076 - 0.674438 = 0.225637
0.904058 - 0.686609 =  0.21745
0.913005 - 0.698613 = 0.214392
0.916344 -   0.7104 = 0.205944
0.924045 - 0.721932 = 0.202114
 0.92687 - 0.733179 =  0.19369
0.933526 - 0.744121 = 0.189404
 0.93593 - 0.754743 = 0.181187
0.941703 - 0.765035 = 0.176668
0.943762 - 0.774991 = 0.168771
0.948783 - 0.784609 = 0.164174
0.950555 - 0.793889 = 0.156667
0.954934 - 0.802832 = 0.152102
0.956464 - 0.811443 = 0.145021
0.960291 - 0.819727 = 0.140564
0.961618 -  0.82769 = 0.133927
0.964968 -  0.83534 = 0.129628
0.966121 - 0.842684 = 0.123438
0.969061 -  0.84973 = 0.119331
0.970066 - 0.856487 = 0.113579
0.972648 - 0.862964 = 0.109684
0.973525 - 0.869169 = 0.104356
0.975797 - 0.875113 = 0.100684
0.976565 - 0.880803 = 0.0957618
0.978566 - 0.886249 = 0.0923166
0.979239 -  0.89146 = 0.0877792
0.981004 - 0.896444 = 0.0845595
0.981595 - 0.901211 = 0.0803844
0.983153 - 0.905767 = 0.0773854
0.983673 - 0.910123 = 0.0735498
0.985049 - 0.914285 = 0.0707642
0.985506 - 0.918261 = 0.0672452
0.986724 -  0.92206 = 0.0646639
0.987127 - 0.925688 = 0.0614394
0.988204 - 0.929152 = 0.0590524
 0.98856 -  0.93246 = 0.0561005
0.989515 - 0.935617 = 0.0538971
0.989829 - 0.938631 = 0.0511974
0.990675 - 0.941508 = 0.0491667
0.990953 - 0.944253 = 0.0466995
0.991703 - 0.946872 = 0.0448306
0.991949 - 0.949371 = 0.0425775
0.992614 - 0.951755 = 0.0408595
0.992832 - 0.954029 = 0.0388033
0.993423 - 0.956197 = 0.0372257
0.993616 - 0.958266 = 0.0353502
0.994141 - 0.960238 = 0.0339031
0.994312 - 0.962119 = 0.0321933
0.994779 - 0.963912 = 0.0308669
 0.99493 - 0.965622 = 0.0293089
0.995346 - 0.967251 = 0.0280942
 0.99548 - 0.968805 = 0.0266751
 0.99585 - 0.970286 = 0.0255634
0.995969 - 0.971698 = 0.0242713
0.996298 - 0.973043 = 0.0232546
0.996404 - 0.974325 = 0.0220786
0.996697 - 0.975547 = 0.0211493
0.996791 - 0.976712 = 0.0200793
0.997052 - 0.977821 = 0.0192303
0.997136 - 0.978879 = 0.018257
0.997368 - 0.979886 = 0.0174819
0.997443 - 0.980846 = 0.0165967
 0.99765 - 0.981761 = 0.0158893
0.997716 - 0.982632 = 0.0150846
0.997901 - 0.983462 = 0.0144392
 0.99796 - 0.984252 = 0.0137078
0.998125 - 0.985006 = 0.0131192
0.998177 - 0.985723 = 0.0124545
0.998324 - 0.986406 = 0.011918
0.998371 - 0.987057 = 0.0113141
0.998502 - 0.987677 = 0.0108252
0.998544 - 0.988268 = 0.0102766
0.998661 -  0.98883 = 0.00983115
0.998699 - 0.989366 = 0.00933291
0.998803 - 0.989876 = 0.00892724
0.998836 - 0.990362 = 0.00847478
 0.99893 - 0.990824 = 0.00810543
0.998959 - 0.991265 = 0.00769461
0.999043 - 0.991684 = 0.0073584
0.999069 - 0.992084 = 0.00698545
0.999144 - 0.992464 = 0.00667947
0.999167 - 0.992826 = 0.00634094
0.999234 - 0.993171 = 0.00606255
0.999255 -   0.9935 = 0.0057553
0.999314 - 0.993812 = 0.00550206
0.999333 -  0.99411 = 0.00522323
0.999386 - 0.994393 = 0.00499291
0.999403 - 0.994663 = 0.00473991
0.999451 -  0.99492 = 0.00453047
0.999466 - 0.995165 = 0.00430092
0.999508 - 0.995398 = 0.00411051
0.999522 - 0.995619 = 0.00390226
 0.99956 -  0.99583 = 0.00372917
0.999572 - 0.996031 = 0.00354026
0.999606 - 0.996223 = 0.00338295
0.999616 - 0.996405 = 0.0032116
0.999647 - 0.996578 = 0.00306864
0.999656 - 0.996743 = 0.00291323
0.999684 -   0.9969 = 0.00278334
0.999692 -  0.99705 = 0.0026424
0.999717 - 0.997192 = 0.0025244
0.999724 - 0.997328 = 0.00239658
0.999746 - 0.997457 = 0.00228939
0.999753 - 0.997579 = 0.00217349
0.999772 - 0.997696 = 0.00207614
0.999779 - 0.997808 = 0.00197105
0.999796 - 0.997913 = 0.00188264
0.999802 - 0.998014 = 0.00178736
0.999817 -  0.99811 = 0.00170708
0.999822 - 0.998202 = 0.0016207
0.999836 - 0.998288 = 0.0015478
0.999841 - 0.998371 = 0.0014695
0.999853 -  0.99845 = 0.00140331
0.999857 - 0.998525 = 0.00133233
0.999868 - 0.998596 = 0.00127225
0.999872 - 0.998664 = 0.00120791
0.999882 - 0.998729 = 0.00115337
0.999885 -  0.99879 = 0.00109505
0.999894 - 0.998849 = 0.00104556
0.999897 - 0.998904 = 0.000992697
0.999905 - 0.998957 = 0.000947776
0.999908 - 0.999008 = 0.000899868
0.999915 - 0.999056 = 0.000859103
0.999917 - 0.999101 = 0.000815685
0.999924 - 0.999145 = 0.000778694
0.999926 - 0.999186 = 0.000739347
0.999932 - 0.999226 = 0.000705783
0.999933 - 0.999263 = 0.000670127
0.999939 - 0.999299 = 0.000639675
 0.99994 - 0.999333 = 0.000607365
0.999945 - 0.999365 = 0.000579738
0.999946 - 0.999396 = 0.00055046
0.999951 - 0.999425 = 0.000525398
0.999952 - 0.999453 = 0.000498869
0.999956 -  0.99948 = 0.000476135
0.999957 - 0.999505 = 0.000452098
 0.99996 - 0.999529 = 0.000431477
0.999961 - 0.999552 = 0.000409699
0.999964 - 0.999573 = 0.000390995
0.999965 - 0.999594 = 0.000371264
0.999968 - 0.999614 = 0.0003543
0.999969 - 0.999632 = 0.000336424
0.999971 -  0.99965 = 0.00032104
0.999972 - 0.999667 = 0.000304845
0.999974 - 0.999683 = 0.000290893
0.999975 - 0.999699 = 0.000276222
0.999977 - 0.999713 = 0.00026357
0.999977 - 0.999727 = 0.000250279
0.999979 -  0.99974 = 0.000238807
 0.99998 - 0.999753 = 0.000226767
0.999981 - 0.999765 = 0.000216365
0.999982 - 0.999776 = 0.000205459
0.999983 - 0.999787 = 0.000196027
0.999984 - 0.999798 = 0.000186147
0.999985 - 0.999807 = 0.000177596
0.999985 - 0.999817 = 0.000168647
0.999986 - 0.999826 = 0.000160895
0.999987 - 0.999834 = 0.000152789
0.999988 - 0.999842 = 0.00014576
0.999988 -  0.99985 = 0.000138418
0.999989 - 0.999857 = 0.000132047
0.999989 - 0.999864 = 0.000125396

Indeed, the numeric evidence of convergence is very convincing.

But wait a minute... Does this count as a proof? Is it rigorous?

The simple answer is, no.
Because we can tell that it converges to a normal distribution as n goes to infinity, but we still don't know how fast it converges! We can tell that for the normal distribution we can stop at a certain n, but we don't have a proof that the normal distribution is close enough to the actual distribution!
Wikipedia states that
If the third central moment E[(X_1 - μ)^3] exists and is finite, then the speed of convergence is at least on the order of 1/√n (see Berry–Esseen theorem). Stein's method can be used not only to prove the central limit theorem, but also to provide bounds on the rates of convergence for selected metrics.
Apparently the third central moment is finite for our distribution, but a convergence speed of 1/√n is too slow for our problem. The criterion is 10^-5; that would require 10^10 steps, which corresponds to n = 2×10^10, larger than the maximum of the constraint! I haven't looked into this "Stein's method"; I wonder if it provides a better bound on the rate. Also notice that we not only need to bound the difference between the probability densities, we also need to bound the difference between the cumulative distribution functions, since what we need is the value of the integration. We can also tell that the value approximated by the normal distribution seems to be a lower bound of the actual value, except for the first one. But unless we can prove it, it's just an observation.

What if we don't use an approximation? What if we just sum up the discrete probabilities? If we use the recursive relationship twice, we get:
p(m,n) = (1/4)*(p(m-4,n) + p(m-3,n-1) + p(m-2,n-2) + p(m-1,n-3))
= (1/4)^2*(p(m-8,n) + 2p(m-7,n-1) + 3p(m-6,n-2) + 4p(m-5,n-3) + 3p(m-4,n-4) + 2p(m-3,n-5) + p(m-2,n-6))
And the pattern continues. The coefficients go like this:
1 1 1 1
1 2 3 4 3 2 1
1 3 6 10 12 12 10 6 3 1
1 4 10 20 31 40 44 40 31 20 10 4 1
(more rows up to the 11th:
1, 5, 15, 35, 65, 101, 135, 155, 155, 135, 101, 65, 35, 15, 5, 1,
1, 6, 21, 56, 120, 216, 336, 456, 546, 580, 546, 456, 336, 216, 120, 56, 21, 6, 1,
1, 7, 28, 84, 203, 413, 728, 1128, 1554, 1918, 2128, 2128, 1918, 1554, 1128, 728, 413, 203, 84, 28, 7, 1,
1, 8, 36, 120, 322, 728, 1428, 2472, 3823, 5328, 6728, 7728, 8092, 7728, 6728, 5328, 3823, 2472, 1428, 728, 322, 120, 36, 8, 1,
1, 9, 45, 165, 486, 1206, 2598, 4950, 8451, 13051, 18351, 23607, 27876, 30276, 30276, 27876, 23607, 18351, 13051, 8451, 4950, 2598, 1206, 486, 165, 45, 9, 1,
1, 10, 55, 220, 705, 1902, 4455, 9240, 17205, 29050, 44803, 63460, 82885, 100110, 112035, 116304, 112035, 100110, 82885, 63460, 44803, 29050, 17205, 9240, 4455, 1902, 705, 220, 55, 10, 1,
1, 11, 66, 286, 990, 2882, 7282, 16302, 32802, 59950, 100298, 154518, 220198, 291258, 358490, 411334, 440484, 440484, 411334, 358490, 291258, 220198, 154518, 100298, 59950, 32802, 16302, 7282, 2882, 990, 286, 66, 11, 1)
On each row, each number is the sum of 4 numbers from the previous row: the one 2 positions to the left of the cell above it, the one 1 position to the left, the one directly above it, and the one to its right; anything outside the boundary counts as 0. Overall, there's an extra coefficient of 1/4^t to normalize the probabilities to one. Each row has one more element on the left and two more on the right than the previous one.
The vertical line in our figure corresponds to the vertical line of {1, 3, 10, 31, ...}. If we take the 4th layer of the even cells (coordinate (8,8) in the table below) as an example, the total probability is
(1/4^4)×(31×0.5 + 40 + 44 + 40 + 31 + 20 + 10 + 4 + 1) = 0.802734375
(For odd layers, it recurses back to the row above the line in the figure, so there will be 3 elements multiplied by 0.5. Just a little more complicated.) This is larger than the actual result, 0.796875. The reason is simple: when we enter the 0 region, there's a chance that we land on the boundary (coordinate (4,-1)), and then there's a 1/4 chance that we enter the 0.5 region, which is impossible since we should already have stopped once we're outside the grid. There are 3 ways to get to that cell, which gives an extra value of 0.5×(3×(1/4^3)×(1/4)). Removing this discrepancy gives the correct result. But as n increases, the discrepancy should converge to 0 exponentially. So, if we can prove that the sum of the numbers before the middle line converges to zero monotonically, then we're done.
The first few sums are
(1 + 0.5)/4 = 0.375
(1 + 2 + 3×0.5)/4^2 = 0.28125
(1 + 3 + 6 + 10×0.5)/4^3 = 0.234375
(1 + 4 + 10 + 20 + 31×0.5)/4^4 = 0.197265625
If we can find a closed form of this summation, it would be very helpful. It's easy to find the equations for the first 4 terms from the recurrence relationship. The first is always 1, the second is t, the third is 1 + 2 + ... + t = t(t+1)/2, the fourth is t(t+1)(t+2)/6 based on the fact f4(t) - f4(t-1) = t(t+1)/2, and the fifth has the recurrence relation f5(t) - f5(t-1) = (t-1)(t^2+4t+6)/6 and grows like t^4/24. So in general, the kth term is a polynomial of degree k-1 whose leading coefficient is 1/(k-1)!. It's quite similar to the binomial coefficients, just more complicated. Maybe the techniques to bound the partial sums of binomial coefficients can be applied here, too. We're basically trying to bound the sum of the first t+1 (or t+2 for odd layers) terms of these unnamed coefficients, where the length of the row is 3t+1. Based on the fact that it's less than the Gaussian approximation, maybe the Chernoff bound can be applied here, too. This seems to be the most promising direction. I'll take a look at these when I have more time...
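(Here is a small program of mine - not the one used for the table above - that builds these coefficient rows by convolving with {1,1,1,1}, which is equivalent to the 4-term rule described earlier, and prints the partial sums just listed; it reproduces 0.375, 0.28125, 0.234375, 0.197265625, ...)
#include <bits/stdc++.h>
using namespace std;
int main(){
    vector<long double> row{1,1,1,1};              // row t=1 has 3t+1 = 4 elements
    for(int t=1;t<=8;++t){
        long double s=0;
        for(int k=0;k<t;++k) s+=row[k];            // the terms strictly before the middle line
        s+=0.5L*row[t];                            // the element on the middle line counts half
        cout<<"t="<<t<<"  sum="<<s/powl(4,t)<<"\n";
        vector<long double> nxt(row.size()+3,0);   // next row: convolve with {1,1,1,1}
        for(size_t i=0;i<row.size();++i)
            for(int d=0;d<4;++d) nxt[i+d]+=row[i];
        row=move(nxt);
    }
}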
[2023 08 03 update]

I figured out how to prove the upper bound after giving it a little more thought!
So, basically the idea is (thanks to user2575jO on leetcode): we can write one step as two steps merged together. Choosing from {-1, 0, 1, 2} each with p = 1/4 is equivalent to first choosing from {0, 2} with p = 1/2, then choosing from {-1, 0} with p = 1/2, and adding the two results together. The distributions are exactly the same.
Now if we move t steps, the probability that we're at a location ≤ 0 is
P(t) = (1/4^t) * Σ_{l=0}^{t} [ C(t,l) * Σ_{r=0}^{⌊l/2⌋} C(t,r) ]
where l is the number of -1 moves and r is the number of +2 moves, and C(t,k) is the binomial coefficient. l can be anything from 0 to t, and r must be between 0 and l/2, so that the final location 2r - l ≤ 0.

Knowing that r follows a binomial distribution with mean t/2, we can apply the Chernoff bound directly (with p = 1/2):
Pr(r < l/2) = Pr(r < t/2 - (t/2 - l/2)) ≤ exp(-2t*((t/2 - l/2)/t)^2)
Pr(r < l/2) ≤ exp(-(t-l)^2/(2t))

Replace 2^(-t) * Σ_{r=0}^{⌊l/2⌋} C(t,r) with the upper bound above, and now we have
P(t) ≤ 2^(-t) * Σ_{l=0}^{t} C(t,l) * exp(-(t-l)^2/(2t))
To simplify this expression, noticing the binomial coefficient, we really want to write the last factor as something raised to the power l. Let's expand it:
P(t) ≤ 2^(-t) * Σ_{l=0}^{t} C(t,l) * exp(-(t^2 - 2lt + l^2)/(2t)) = 2^(-t) * e^(-t/2) * Σ_{l=0}^{t} C(t,l) * e^l * e^(-l^2/(2t))
We are almost there, except for the annoying e^(-l^2/(2t)) term. It's a normal distribution, centered at 0. If we can bound it by some function like a*e^(-bl), then we can write the term as a power of l. And that seems doable!
e^(-l^2/(2t)) ≤ a*e^(-bl)
e^(-l^2/(2t) + bl) ≤ a
e^(-(l - bt)^2/(2t) + b^2*t/2) ≤ a
If we choose a = e^(b^2*t/2), the inequality is always true. Applying this inequality with an arbitrary b, we get
P(t) ≤ 2^(-t) * e^(-t/2) * Σ_{l=0}^{t} C(t,l) * e^l * e^(b^2*t/2) * e^(-bl)
= 2^(-t) * e^(-t/2) * e^(b^2*t/2) * Σ_{l=0}^{t} C(t,l) * e^l * e^(-bl)
= 2^(-t) * e^((b^2-1)*t/2) * (1 + e^(1-b))^t
= ( (e^((b^2-1)/2) + e^((b-1)^2/2)) / 2 )^t
Now we want to choose b to minimize this expression. Taking the derivative, we find this equation:
b*e^(b-1) + b - 1 = 0
Solving it numerically, we find
b=0.598942
Putting it back to the inequality above, we get
P(t) ≤ 0.9047174^t
The threshold is at
0.9047174^t = 10^-5
t = 114.97
Remember that t is the number of steps, so this corresponds to n = 2t = 230, the same result as the Gaussian approximation above! (Probably not a coincidence, because even though they are very different for small n, when I change 10^-5 to 5×10^-6, they still give the same result, n = 244. I think this has the same asymptotic behavior as the Gaussian approximation.) But this time the bound is proved.
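(The numbers above are easy to reproduce. The following small companion program is mine, not part of the post: it solves b*e^(b-1)+b-1=0 by bisection, evaluates the base of the bound and the threshold t, and spot-checks the bound against the exact P(t) from the double binomial sum.)
#include <bits/stdc++.h>
using namespace std;
// exact P(t) = 4^(-t) * sum_l C(t,l) * sum_{r<=floor(l/2)} C(t,r)
double exactP(int t){
    vector<double> pmf(t+1);                    // Bin(t,1/2) probabilities
    pmf[0]=pow(0.5,t);
    for(int k=1;k<=t;++k) pmf[k]=pmf[k-1]*(t-k+1)/k;
    vector<double> cdf(t+1);
    partial_sum(pmf.begin(),pmf.end(),cdf.begin());
    double p=0;
    for(int l=0;l<=t;++l) p+=pmf[l]*cdf[l/2];
    return p;
}
int main(){
    auto g=[](double b){return b*exp(b-1)+b-1;};
    double lo=0,hi=1;                           // g(0)<0, g(1)>0, so bisect on [0,1]
    for(int it=0;it<200;++it){double mid=0.5*(lo+hi);(g(mid)<0?lo:hi)=mid;}
    double b=0.5*(lo+hi);
    double base=0.5*(exp((b*b-1)/2)+exp((b-1)*(b-1)/2));
    cout<<"b = "<<b<<", base = "<<base<<"\n";            // ~0.598942, ~0.9047174
    cout<<"threshold t = "<<log(1e-5)/log(base)<<"\n";   // ~114.97
    for(int t:{5,10,20,40,80})
        cout<<"t="<<t<<"  P(t)="<<exactP(t)<<"  bound="<<pow(base,t)<<"\n";
}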

(The l should be replaced by l+1 or l+2 somewhere, because the value in the middle is 0.5, not 1, and in odd layers it recurses back to the layer with three 0.5s, so we must shift it one more step. But these are just small details; they shouldn't change the answer by much...)

[update end]
What else can we do, then? In the previous calculations, we were trying to find an upper bound. If we have an upper bound, we can use it to decide where to stop, e.g. if we can prove that the normal distribution is an upper bound, we can use the threshold n = 230. But if we can prove the monotonicity, we can use the threshold at exactly the point where the value exceeds the criterion, i.e. n = 179, which is a stronger statement.

Let's go back to the grid, maybe.
The proof would be trivial if all the values increase monotonically along any vertical line. But unfortunately that is not the case. The elements in the top 10×10 grid are

     0.625,      0.75,     0.875,         1,         1,         1,         1,         1,         1,         1,
       0.5,     0.625,      0.75,   0.90625,    0.9375,   0.96875,         1,         1,         1,         1,
     0.375,       0.5,   0.65625,    0.8125,     0.875,    0.9375,  0.976562,  0.984375,  0.992188,         1,
      0.25,   0.40625,    0.5625,   0.71875,    0.8125,  0.890625,    0.9375,  0.960938,  0.984375,  0.994141,
   0.15625,    0.3125,   0.46875,     0.625,  0.742188,  0.828125,  0.890625,    0.9375,  0.966797,  0.980469,
     0.125,      0.25,     0.375,   0.53125,   0.65625,  0.757812,   0.84375,  0.902344,    0.9375,  0.960938,
   0.09375,    0.1875,  0.304688,  0.453125,  0.578125,    0.6875,  0.785156,  0.851562,  0.900391,  0.941406,
    0.0625,  0.140625,      0.25,  0.382812,       0.5,  0.617188,   0.71875,  0.796875,  0.863281,  0.912109,
 0.0390625,  0.109375,  0.203125,    0.3125,  0.429688,  0.546875,  0.652344,  0.742188,  0.817871,   0.87207,
   0.03125, 0.0859375,   0.15625,  0.253906,  0.367188,  0.480469,  0.585938,  0.683594,  0.763672,  0.827637,
   
The diagonal line is indeed increasing, but the third diagonal line has elements 0.875, 0.90625, 0.875, 0.890625, 0.890625, 0.902344, 0.900391, 0.912109... which increases and decreases alternately. And that is not the only diagonal line that shows this behavior. Let's take the 5th diagonal line as an example, the first few elements are

       1,       1,0.984375,0.984375,0.980469,0.980469,0.978516,0.979492,0.978271,0.979492,
0.979004,0.980286,0.980179,0.981461,0.981567,0.982796, 0.98303, 0.98418,0.984484,0.985547,
0.985884,0.986855,0.987203,0.988086, 0.98843,0.989229, 0.98956,0.990281,0.990595,0.991244,
0.991538, 0.99212,0.992393,0.992915,0.993167,0.993635,0.993865,0.994284,0.994495, 0.99487,
0.995062,0.995397,0.995572,0.995872, 0.99603,0.996298,0.996441,0.996681, 0.99681,0.997024,

It's not easy to tell if it increases or decreases. Let's see their differences:

           0,   -0.015625,           0, -0.00390625,           0, -0.00195312, 0.000976562,  -0.0012207,   0.0012207,-0.000488281,
  0.00128174,-0.000106812,  0.00128174, 0.000106812,  0.00122833, 0.000234604,  0.00115013, 0.000303984,  0.00106215, 0.000337005,
 0.000971675, 0.000347495, 0.000883162, 0.000343844, 0.000799075, 0.000331543, 0.000720632, 0.000314187,  0.00064834,  0.00029413,
 0.000582274, 0.000272915, 0.000522257, 0.000251551,  0.00046797,  0.00023068, 0.000419023, 0.000210703, 0.000374994, 0.000191853,

It alternates about 12 times, then it keeps increasing.
Remembering that there are two independent sets of layers, we should find the differences within each set. The differences within the odd layers are

   -0.015625, -0.00390625, -0.00195312,-0.000244141, 0.000732422,  0.00117493,  0.00138855,  0.00146294,  0.00145411,  0.00139916,
  0.00131917,  0.00122701,  0.00113062,  0.00103482,  0.00094247,  0.00085519, 0.000773808, 0.000698651, 0.000629726, 0.000566848,
  
and the even layers

   -0.015625, -0.00390625,-0.000976562,           0, 0.000793457,  0.00117493,  0.00133514,  0.00138474,  0.00136614,  0.00130868,
  0.00123066,  0.00114292,  0.00105218, 0.000962528, 0.000876404, 0.000795173, 0.000719521, 0.000649703, 0.000585698, 0.000527314,

Now they no longer alternate.
Is it possible to prove that it can only decrease a finite number of times before it starts to increase monotonically?
Let's take a look at a diagonal line that's further away. Here are the differences of the elements on the 10th diagonal:
odd layers:
           0,-0.000488281,-0.000488281,-0.000366211,-0.000267029,-0.000171661,-9.20296e-05, -3.0756e-05,   1.508e-05, 4.81904e-05,
 7.11586e-05, 8.62759e-05, 9.54153e-05, 0.000100072, 0.000101421, 0.000100375,  9.7634e-05, 9.37348e-05,  8.9083e-05, 8.39835e-05,
even layers:
           0,-0.000366211,-0.000427246,-0.000335693,-0.000244141,-0.000161171,-9.08375e-05,-3.57032e-05, 5.90086e-06, 3.63067e-05,
  5.7715e-05, 7.20862e-05, 8.10546e-05, 8.59372e-05, 8.77771e-05, 8.73906e-05, 8.54098e-05, 8.23208e-05, 7.84945e-05, 7.42116e-05,

It seems that the further away from the middle, the more times it would decrease.

Let's see if we can find out more about this pattern! In the following two figures (first the odd layers, second the even layers), a cell (m,n) is a black square if p(m,n) - p(m-2,n-2) is negative, a white square if positive, and a dash if zero.
There is definitely a pattern here. The boundary between increasing cells and decreasing cells shifts further and further away from the middle diagonal line. If this can be proved, then it's obvious that the diagonal line doesn't have decreasing cells. But I haven't found a proof of that.

How about we consider something more solid - something that we can actually prove?
If we look at the table again, it's easy to see that the values increase rightward and decrease downward. Can we do better than that?
It's easy to see that the values on the left boundary decrease to 1/4 of the value 4 steps (or 1 layer) above. We can also tell that there are more and more 1s on the right boundary. Let's mark them on the figure:
The value of the light green cell is 1/4 of the green cell, whose value is 1/4 of the deep green cell.
With the same reasoning, we can tell that there's a similar pattern on the right boundary. There is one more cell of value 1 after each layer, and the value of the cell next to the 1 (the blue cells) increases monotonically downward. The difference between 1 and the value of the blue cell is 1/4 of the difference between 1 and the light blue cell, and the difference between 1 and the deep blue cell is 1/4 of the previous one. So the value of the cells along this line converges to 1. By induction, it's easy to show that any cell's value increases along that line downward.

So we have two lines that are guaranteed to increase or decrease downward. Is there a line of equal values, then? If so, any direction between the increasing line and the equal line should be increasing, and the other directions should be decreasing. But the angle of the equal line depends on where the cell is, because if the cell already has value 1, the equal line is the same as the increasing line. So as the location of the cell changes from the rightmost to the leftmost, the angle between the equal line and the increasing line increases from 0 to some positive value. The two figures above seem to support this statement. If we can prove that when the cell is in the middle, the equal line always points to the lower left, then we can tell that along the middle diagonal, the value increases monotonically.

Another thing that I noticed is that the summation of the values on the next layer is always 2.5 larger than that of the current layer, since every value appears exactly 4 times in the averages, the extra 0s don't contribute, and the 1s contribute exactly (4+3+2+1)/4 = 2.5 to the summation of the next layer. I don't know how to use this fact in a proof, though.

Anyway, this summarizes my thoughts on this problem so far. If anyone finds a rigorous proof of this problem, I'd be glad to hear it!

Friday, June 9, 2023

Leetcode M1318 Minimum Flips to Make a OR b Equal to c, E1351 Count Negative Numbers in a Sorted Matrix

M1318 description:
  Given 3 positive numbers a, b and c. Return the minimum flips required in some bits of a and b to make ( a OR b == c ) (bitwise OR operation).
  A flip operation consists of changing any single bit 1 to 0 or changing a bit 0 to 1 in their binary representation.
Constraints:
  1 <= a <= 10^9
  1 <= b <= 10^9
  1 <= c <= 10^9
This problem is very easy, a typical algorithm is to go through each bit position of the three numbers. If c at that position is 1 but a|b at that position is 0, we add one flip, and if c is 0 we add one if a is 1 and another one if b is 1 at that position. (Since the numbers are positive, we don't need to check the 32nd bit.)

But putting these together, let's make a table:
c 1, a|b 0 : add 1
c 0, a&b 1 : add 2
c 0, a|b 1 (but a&b 0) : add 1
We can do the bitwise operations to the numbers beforehand, instead of checking each digit in the loop. So here's the first version that I submitted:
class Solution {
public:
    int minFlips(int a, int b, int c) {
        int r=0;
        int t=1;
        int abc=(a|b)^c,aAb=a&b;
        for(int i=0;i<31;++i){
            if(abc&t){
                r++;
                if(aAb&t) r++;
            }
            t<<=1;
        }
        return r;
    }
};
Then I noticed that, I can also combine the '&' operation to get rid of the branching. The version without the branching reads,
class Solution {
public:
    int minFlips(int a, int b, int c) {
        int r=0;
        int abc=(a|b)^c,aAb=a&b&abc;
        for(int i=0;i<31;++i) r+=(abc>>i&1)+(aAb>>i&1);
        return r;
    }
};
At this point, it's obvious that we are simply counting the number of '1's in the two numbers in binary, which is known as the Hamming weight or population count. There is a built-in function for that in the GCC compiler, "__builtin_popcount(int)", so we can write it like this:
class Solution {
public:
    int minFlips(int a, int b, int c) {
        uint abc=(a|b)^c,aAb=a&b&abc;
        #ifdef __GNUG__
        return __builtin_popcount(abc)+__builtin_popcount(aAb);
        #else
        int r=0;
        for(int i=0;i<31;++i) r+=(abc>>i&1)+(aAb>>i&1);
        return r;
        #endif
    }
};
I did some research and noticed that llvm also supports the function "__builtin_popcount", and MSVC has these functions "__popcnt16, __popcnt, __popcnt64". We can use these functions according to the compiler.
(Update: starting from C++20, there's an "std::popcount()" function in the header <bit>.)
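(A brief sketch of what that looks like, assuming a C++20 compiler; note that std::popcount requires an unsigned argument:)
#include <bit>
class Solution {
public:
    int minFlips(int a, int b, int c) {
        unsigned abc=(a|b)^c, aAb=a&b&abc;
        return std::popcount(abc)+std::popcount(aAb);
    }
};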

Then I found this webpage, which lists the best bit hacks currently known for many computation tasks. I'm fascinated by the esoteric beauty of the algorithms and magic numbers. Using the algorithm in the list, we can write the solution as
#ifndef __GNUG__
int popC_32(uint v){
    v = v - ((v >> 1) & 0x55555555);
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
    return ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}
#endif
class Solution {
public:
    int minFlips(int a, int b, int c) {
        ios::sync_with_stdio(false);cin.tie(0);cout.tie(0);
        uint abc=(a|b)^c,aAb=a&b&abc;
        #ifdef __GNUG__
        return __builtin_popcount(abc)+__builtin_popcount(aAb);
        #else
        return popC_32(abc)+popC_32(aAb);
        #endif
    }
};
But since many CPUs support the instruction "POPCNT", it's probably better to use the built-in functions if they are available.

E1351 description:
  Given a m x n matrix grid which is sorted in non-increasing order both row-wise and column-wise, return the number of negative numbers in grid.
Constraints:
  m == grid.length
  n == grid[i].length
  1 <= m, n <= 100
  -100 <= grid[i][j] <= 100
An O(m+n) algorithm is easy to find,
class Solution {
public:
    int countNegatives(vector<vector<int>>& grid) {
        const int m=grid.size(),n=grid[0].size();
        int k=n-1,r=0;
        for(int i=0;i<m;++i){
            while(k>=0&&grid[i][k]<0) k--;
            if(k<0){r+=(m-i)*n;return r;}
            r+=n-1-k;
        }
        return r;
    }
};
We may try to use binary search to increase speed,
class Solution {
public:
    int countNegatives(vector<vector<int>>& grid) {
        const int m=grid.size(),n=grid[0].size();
        int k=n,r=0;
        for(int i=0;i<m;++i){
            k=upper_bound(grid[i].begin(),grid[i].begin()+k,0,greater())-grid[i].begin();
            if(k==0){r+=(m-i)*n;return r;}
            r+=n-k;
        }
        return r;
    }
};
But now it takes O(m*log(n)), which may be slower than O(m+n).
Also, if m>n, it would be better to do the binary search in the columns, in which case we must write our own upper-bound function; and because we'll be accessing elements in different vectors, the overhead may be larger.

Let's consider the complexity. For m*log(n) to be less than m+n, we need n > m*(log(n) - 1). Considering that n <= 100, this can be satisfied if n > 6m. But also considering the overhead and the constant coefficient, we may try to use a larger criterion, something like this:
int upperB(vector<vector<int>>& grid,int i,int k){
    if(grid[0][i]<0) return 0;
    int l=0,r=k;
    while(r-l>1){
        k=(l+r)/2;
        if(grid[k][i]>=0) l=k;
        else r=k;
    }
    return r;
}
class Solution {
public:
    int countNegatives(vector<vector<int>>& grid) {
        const int m=grid.size(),n=grid[0].size();
        if(n>8*m){
            int k=n,r=0;
            for(int i=0;i<m;++i){
                k=upper_bound(grid[i].begin(),grid[i].begin()+k,0,greater())-grid[i].begin();
                if(k==0){r+=(m-i)*n;return r;}
                r+=n-k;
            }
            return r;
        }
        else if(m>64*n){
            int k=m,r=0;
            for(int i=0;i<n;++i){
                k=upperB(grid,i,k);
                if(k==0){r+=(n-i)*m;return r;}
                r+=m-k;
            }
            return r;
        }
        else{
            int k=n-1,r=0;
            for(int i=0;i<m;++i){
                while(k>=0&&grid[i][k]<0) k--;
                if(k<0){r+=(m-i)*n;return r;}
                r+=n-1-k;
            }
            return r;
        }
    }
};
But considering that the size of the input is quite small, it won't show much difference in this case, unless there're a lot of test cases where the input has one dimension much larger than the other.

Then I thought, instead of doing binary search vertically or horizontally, what if we do it... diagonally?
The good thing about it is, we can be sure that we can get rid of at least half of the inputs after each step, because if we choose any point on the diagonal and cut the rectangle into four smaller ones, the total area of the top left and the bottom right must be at least half of the area of the original rectangle.

The following code implements this algorithm:
int negCount(int tl_r,int tl_c,int br_r,int br_c,vector<vector<int>>& grid){
    //cout<<"tl is {"<<tl_r<<","<<tl_c<<"}, br is {"<<br_r<<","<<br_c<<"},";
    if(tl_r==br_r||tl_c==br_c) return 0;
    if(br_r-tl_r==1){
        if(grid[tl_r][tl_c]<0){
            //cout<<"\nadded col count="<<br_c-c1<<endl;
            return br_c-tl_c;
        }
        int c1=tl_c,c2=br_c,cm;
        while(c2-c1>1){
            cm=(c1+c2)/2;
            if(grid[tl_r][cm]>=0)c1=cm;
            else c2=cm;
        }
        //cout<<"\nadded col count="<<br_c-c2<<endl;
        return br_c-c2;
    }
    if(br_c-tl_c==1){
        if(grid[tl_r][tl_c]<0){
            //cout<<"\nadded row count="<<br_r-r1<<endl;
            return br_r-tl_r;
        }
        int r1=tl_r,r2=br_r,rm;
        while(r2-r1>1){
            rm=(r1+r2)/2;
            if(grid[rm][tl_c]>=0)r1=rm;
            else r2=rm;
        }
        //cout<<"\nadded row count="<<br_r-r2<<endl;
        return br_r-r2;
    }
    int r1=tl_r,r2=br_r,c1=tl_c,c2=br_c,rm,cm;
    if(grid[r1][c1]<0){r2=r1;c2=c1;}
    else{
        while(r2-r1>1&&c2-c1>1){
            rm=(r1+r2)/2;
            cm=(c1+c2)/2;
            if(grid[rm][cm]>=0){r1=rm;c1=cm;}
            else{r2=rm;c2=cm;}
        }
        while(r2-r1>1){
            rm=(r1+r2)/2;
            if(grid[rm][cm]>=0)r1=rm;
            else r2=rm;
        }
        while(c2-c1>1){
            cm=(c1+c2)/2;
            if(grid[rm][cm]>=0)c1=cm;
            else c2=cm;
        }
    }
    //cout<<"r1 is "<<r1<<",c1 is "<<c1<<"\n";
    //cout<<"r2 is "<<r2<<",c2 is "<<c2<<"\nadded count="<<(br_r-r2)*(br_c-c2)<<endl;
    return (br_r-r2)*(br_c-c2)+negCount(tl_r,c2,r2,br_c,grid)+negCount(r2,tl_c,br_r,c2,grid);
}
class Solution {
public:
    int countNegatives(vector<vector<int>>& grid) {
        const int m=grid.size(),n=grid[0].size();
        return negCount(0,0,m,n,grid);
    }
};
What's the complexity of this algorithm? Let's take a look. Without loss of generality, let's assume m>n. The first function call takes log(n) iterations until c1 and c2 meet, then log(m/n) more until r1 and r2 meet, so it takes O(log(m)). Next, the complication starts. The time depends on the shapes of the two rectangles that result from the first call, so it's hard to tell. If the index of the point where the positive and negative values meet on the diagonal is (m1,n1), then the cost of the next function calls is log(max(m-m1,n1))+log(max(m1,n-n1)). The maximum this can reach is log(m)+log(n), which is when the point is at the lower right corner - but that also means we only have two rows left and just need two more 1D upper-bound searches. Considering this, the worst case is probably when the point is at the center. Let's find out the complexity in this scenario.

Each time the rectangle will be cut into 4 equal smaller ones, two of which become the input of the next calls. The complexity is log(m)+2log(m/2)+4log(m/4)+8log(m/8)+..., but it doesn't go all the way down to the 2^log(m)*log(m/m) term; instead it stops at the 2^log(n)*log(m/2^log(n)) term, because at that point we only have 1D arrays to search.
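(One way to sum this series - a quick derivation of my own: write K = log(n), so 2^K = n, and use Σ_{k=0}^{K} k*2^k = (K-1)*2^(K+1) + 2.)
Σ_{k=0}^{K} 2^k*log(m/2^k) = log(m)*(2^(K+1) - 1) - Σ_{k=0}^{K} k*2^k
= (2n - 1)*log(m) - ((K-1)*2^(K+1) + 2)
= 2n*log(m) - log(m) - 2n*(log(n) - 1) - 2
= 2n*log(m/n) + 2n - log(m) - 2
which matches the f(m,n) below.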

Summing up the terms, I got the complexity O(f(m,n)) where f(m,n) = 2n*log(m/n) + 2n - log(m) - 2, with m ≥ n. When m >> n, it's approximately O(n*log(m/n)), and when m = n, it becomes O(m+n). I think it's a pretty good algorithm, although it's hard to tell with such small input sizes.

A variation of this algorithm is, instead of finding the exact cell that separates the positive and negative values, we can stop at the separation line of the smaller dimension, so the search takes O(log(min(m,n))) instead of O(log(max(m,n))). Then the two rectangles that become the input of the next call will overlap in the larger dimension by about m/n. The complexity analysis of this one would be even more complicated. Anyway, I wonder how the performance would be on a larger scale.

A Recursive Geometric Proof for 1962 IMO Problem 6

Youtube recommended a video to me, where the following problem is solved: Consider an isosceles triangle. Let R be the radius of its circ...