Sunday, December 29, 2024

A Recursive Geometric Proof for 1962 IMO Problem 6

YouTube recommended a video to me in which the following problem is solved:
Consider an isosceles triangle. Let $R$ be the radius of its circumscribed circle and $r$ the radius of its inscribed circle. Prove that the distance $d$ between the centers of these two circles is $d=\sqrt{R(R-2r)}$.
As usual, I tried to solve it myself before watching the rest of the video, and I found a really nice solution.

Let the triangle be $\triangle ABC$, where $AB=AC$. Let $O$ be the center of the circumscribed circle, and $I$ be the center of the inscribed circle. So, $d=IO=|AI-AO|=|AI-R|$. What we want to prove becomes \begin{align} d^2&=R(R-2r)\\ (AI-R)^2&=R(R-2r)\\ AI^2-2AI\cdot R+R^2&=R^2-2rR\\ AI^2&=2R(AI-r)\\ \frac{AI}{2R}&=\frac{AI-r}{AI} \end{align} Let the inscribed circle touch $AB$ at $P$ and $AC$ at $Q$, and intersect $AI$ at $X$ (the intersection closer to $A$). Then $AI-r=AX$, and we want to show that $\frac{AI}{2R}=\frac{AX}{AI}$.

Notice that $2R$ is the diameter of the circumcircle; I wonder if $AI$ can be a diameter too... Since $IP\perp AB$ and $IQ\perp AC$, if we create a circle with diameter $AI$, it will pass through the points $P$ and $Q$, so it's the circumcircle of $\triangle APQ$. By symmetry, we must have $AP=AQ$, so $\triangle APQ \sim \triangle ABC$. If $X$ is the incenter of $\triangle APQ$, then $\frac{AI}{2R}=\frac{AX}{AI}$ follows immediately from the similarity. And indeed it is: by the tangent-chord angle at $P$, $\angle APX=\frac{1}{2}\angle AIP=\frac{1}{2}\angle B=\frac{1}{2}\angle APQ$, so $\angle APX=\angle XPQ$; together with $\angle PAX=\angle QAX$, this shows that $X$ is the incenter of $\triangle APQ$, and the proof is complete.

This shows an interesting property of isosceles triangles: $AI$ is a diameter of the circumcircle of $\triangle APQ$, and $X$ is its incenter. We can apply this recursively to $\triangle APQ$ and get an even smaller version of this problem. On the other hand, we can also extend $AO$ to intersect the circumcircle at $D$, and draw the tangent line at $D$ (perpendicular to the diameter $AD$); since $AD$ is a diameter, $DB\perp AB$ and $DC\perp AC$. If we extend $AD$ to $E$ so that $DE=DB=DC$, and create a line $l\perp AE$ at $E$ which intersects $AB$ at $M$ and $AC$ at $N$, then $D$ is the incenter of $\triangle AMN$. This can be continued infinitely as well. Each of $X$, $I$, $D$, ... is both the incenter of the larger triangle and the endpoint (opposite $A$) of the diameter of the smaller triangle.

After I found this cool recursive geometric proof, I went back to watch the rest of the video. Again, I was disappointed - basically it used trigonometry to "brute force" the proof. It's not wrong, of course, but boring. It did mention that this is an IMO problem, so I wondered which one it was. I searched for it, and it turned out to be Problem 6 of the 1962 IMO, and here is another solution. That solution not only solves this problem, but also proves that the formula holds not only for isosceles triangles, but for ALL triangles! The solution on that page is an elegant geometric proof, but since it doesn't demonstrate the recursive structure of the problem on isosceles triangles, I think my proof is a nice alternative solution.

Thursday, December 26, 2024

Traversing a Tree with Adjacency Sums - LeetCode H2872. Maximum Number of K-Divisible Components, H3203. Find Minimum Diameter After Merging Two Trees

Here's a simple question: How do you traverse a tree, which is represented as a set of edges in the form of pairs of vertices?
My first reaction used to be a typical graph traversal algorithm like BFS or DFS - the most common algorithms for traversing graphs. But last week I found an algorithm particularly suitable for trees that is more efficient than the typical BFS or DFS approach.

I invented this algorithm while solving the daily problem "H2872. Maximum Number of K-Divisible Components" on Dec 24, so let's use it as an example to see how this algorithm works.

2872. Maximum Number of K-Divisible Components:

There is an undirected tree with n nodes labeled from 0 to n - 1. You are given the integer n and a 2D integer array edges of length n - 1, where edges[i] = [ai, bi] indicates that there is an edge between nodes ai and bi in the tree.

You are also given a 0-indexed integer array values of length n, where values[i] is the value associated with the ith node, and an integer k.

A valid split of the tree is obtained by removing any set of edges, possibly empty, from the tree such that the resulting components all have values that are divisible by k, where the value of a connected component is the sum of the values of its nodes.

Return the maximum number of components in any valid split.

Constraints:
  • 1 <= n <= 3 * 10^4
  • 0 <= values[i] <= 10^9
  • 1 <= k <= 10^9
  • Sum of values is divisible by k.
  • ...
Let's look at the example:
Input: n = 5, edges = [[0,2],[1,2],[1,3],[2,4]], values = [1,8,1,4,4], k = 6
Output: 2
We can cut it into two parts, where the sum of the values of each part - 6 and 12 in this example - is a multiple of $k=6$. It's impossible to cut it into more parts without violating the condition. Fairly easy to understand.

Before jumping right into the solution, I have a comment on what happens if it's not limited to trees: on an arbitrary graph, it can be NP-Hard. Example: a complete graph. You can choose any set of vertices, and they are all connected. So, basically the question becomes: there is a multiset of values, and you must find the maximum number of non-overlapping subsets that you can separate it into, where the sum of each of them is a multiple of $k$. You're guaranteed that the sum of the multiset is a multiple of $k$, but it doesn't really matter: it can be an arbitrary multiset, and to make the sum a multiple of $k$, just add another number $-sum$ to it (or $mk-sum$ with a large enough $m$, if it has to be positive), where $sum$ is the sum of the original multiset. Determining whether it's possible to separate it into two such subsets is already NP-Complete, see the subset sum problem. Finding the maximum number of subsets should be harder, so I believe it's NP-Hard.

Now it's time to solve this problem. First, we look for just one edge to cut. Where do we start? The situation is simplest if the edge is in a long chain which ends at a leaf. In that case, we just need the sum of the values from the leaf to the cut to be a multiple of $k$. And if we find such an edge, we can simply cut it and keep going. The tree becomes smaller, and we make progress.

But what happens when the next vertex is not on a chain, but instead has at least 3 branches? If we can cut the edge that connects the current vertex and the branching one, then it's easy: this branch is simply removed, and the branching vertex has its degree decreased by one. What happens if we can't cut that edge? That means the branch must stay connected to that vertex forever - the branch is basically "attached" to it. In that case, we can simply "absorb" the branch into the vertex - in this problem, we absorb the sum of the values in the branch into the value of the vertex, and remove the branch. So, either way, we can remove branch after branch, until the branching vertex has degree 2, which turns it into a chain vertex. Eventually the tree is reduced to a single chain, and when we reach another leaf, the whole tree is processed. I came up with this algorithm myself; a few days later, @Sergei mentioned that this is called "Kahn's algorithm", and the editorial calls it "Topological Sort / Onion Sort".

At this point, the algorithm is pretty straightforward. The only thing left is to implement it. First, we must construct the tree from the list of edges so that we can traverse it. There are many ways to store a graph. A typical approach is an adjacency list, where for each vertex we store an array of its neighboring vertices. A 2D C-array would be fast, but it takes $O(N_{max}^2)$ space, which may not be feasible. To save space, typically a vector of vectors or an array of vectors is used, but that requires a dynamic allocation for each vertex. The next problem is that I want to be able to remove a branch when I reach a branching vertex - in other words, I want to be able to modify the adjacency list to remove edges.
This can be done with a vector of vectors, i.e., in the vector of neighbors of the branching vertex, find the vertex that we came from and delete it by swapping it with the last element, then pop_back(). But it's not very efficient (although the complexity is the same). A better way may be to use a hash set to store the neighbors, but that's also a bit complicated. So, I thought, is there any simpler way to remove a neighbor of a vertex from the stored data? Imagine if we could just subtract the vertex id from a total over the neighbors...

And that's where things start to become interesting. I explored this idea further: if, for every vertex, I only store the sum of the ids of its neighbors, how much can I achieve? We start from a leaf, and the sum gives us its only neighbor. Next, if we are at a vertex of degree 2, the sum is over its two neighbors; if we subtract the previous vertex from the sum, the result is the vertex that we need to visit next! But what happens if the next vertex has degree greater than 2? In that case, we can't tell where to go next. But we don't need to go anywhere! The branch that we came from should be removed, and that can be done by subtracting the current vertex from the sum of the neighbors of the branching vertex, then decreasing its degree by one. Then we simply start from another leaf, and repeat this process. Let's walk through the example with this implementation.
The numbers inside the circles are the vertex ids; to the right are the values. The blue numbers to the left are the sums of neighbors. Suppose we start from the leaf #4. The current value is 4, so we can't cut this edge. The sum of its neighbors is 2, so we check vertex #2. Its degree is 3, greater than 2, so we proceed by absorbing the current value into its value, i.e., updating its value to 1+4=5 (black number). Then we want to remove this branch. We do this by subtracting the vertex that we came from, 4, from the sum of neighbors of that vertex, 5, so the updated sum of neighbors of #2 becomes 5-4=1 (blue number). We also decrease the degree of #2 by one, so now its degree is 2. This branch is removed, so we start from another leaf, either #0 or #3. Now the tree is reduced to a chain, and it's easy to traverse it and get the result.

Basically, the algorithm goes like this:

1. Prepare the adjacency sums representation of the tree, i.e., an array of pairs of unsigned integers, one for the sum of a vertex's neighbors, the other for its degree, both initialized to 0. Read the list of edges, and update the sums and degrees accordingly.
2. Go through the adjacency sums until we find a degree-one vertex - a leaf. Set "current vertex" to its id, and "previous vertex" to 0. Add its value to the sum of values. Check if the sum of values mod $k$ is 0; if so, add 1 to the result. Check its neighbor:
  - If it has degree 1, it's over: add 1 to the result and return it.
  - If it has degree 2, we're in a chain. Assign "current vertex" to "previous vertex", and subtract "current vertex" from the sum of neighbors of that vertex to get where to go next. Set that neighbor as "current vertex". Basically we moved one step along the chain.
  - If its degree is higher, we add our current sum of values to the value of that vertex. Then we subtract our "current vertex" from the sum of neighbors of that vertex, and decrease its degree by one. Then we go back to Step 2, and look for the next leaf to start from.

This way, we only need a single array of pairs of unsigned integers to process the entire tree! Although the complexity is still O(n), it's extremely fast and space efficient, because we need only one dynamic allocation - or even zero, if we know the maximum input size and simply use a C-array. Also, the space is contiguous, which gives a higher cache hit rate. The following is my solution:
```cpp
const int nmax=30000;
struct pa{int sn,d;};   // sn: sum of neighbor ids, d: degree
pa adjl[nmax];          // adjacency sums representation
int d1vs[nmax];         // stack of degree-1 vertices (leaves)
class Solution {
public:
    int maxKDivisibleComponents(int n,const vector<vector<int>>& edges, vector<int>& values, int k) {
        if(k==1) return n;  // every single vertex is a valid component
        memset(adjl,0,n*sizeof(adjl[0]));
        for(const auto &v:edges){
            auto &v0=adjl[v[0]];
            v0.sn+=v[1];
            v0.d++;
            auto &v1=adjl[v[1]];
            v1.sn+=v[0];
            v1.d++;
        }
        int d1sz=0;
        for(int i=0;i<n;++i) if(adjl[i].d==1) d1vs[d1sz++]=i;
        int cnt=0;
        int pre=0;
        while(1){
            int cur=d1vs[--d1sz],nei=adjl[cur].sn-pre;  // pick a leaf to start from
            int sum=values[cur]%k;
            while(sum){                          // walk until the component sum is divisible by k
                auto &neis=adjl[nei];
                if(neis.d==1) return cnt+1;      // reached the last leaf: done
                if(neis.d==2){                   // chain vertex: step forward
                    sum=(sum+values[nei])%k;
                    neis.sn-=cur;
                    pre=cur;
                    cur=nei;
                    nei=neis.sn;                 // the remaining neighbor
                    continue;
                }
                values[nei]=(values[nei]+sum)%k; // branching vertex: absorb this branch
                neis.sn-=cur;                    // ...and detach it
                neis.d--;
                break;
            }
            if(!sum){                            // divisible: cut here, one more component
                cnt++;
                if(adjl[nei].d==1) return cnt+1;
                if(adjl[nei].d==2) {d1vs[d1sz++]=nei;pre=cur;}  // neighbor becomes a new leaf
                else adjl[nei].sn-=cur,adjl[nei].d--,sum=0,pre=0;
            }
            else sum=0,pre=0;
        }
        return 0x1337C0DE;                       // unreachable
    }
};
```
The previous runtime record was 70ms. The best runtime I got with this program was... 3ms. A 23 times speed up. And the memory record before was 158.6MB. My submission was 139.79MB. A huge difference. Here's my solution page.
You may have noticed that the algorithm I used in this solution is slightly different from the one described above. That's because I later noticed that I didn't really need to store all the degree-one vertices in a stack ("d1vs"); I can simply go through the leaves one by one in the adjacency sums. Also, checking the sum of the values can be done before checking the neighbor, which simplifies the logic. Anyway, it's easy to simplify it.

There's another thing that can be improved. Notice that here I used a pair of "int". It's fine when the maximum vertex id is 3*10^4-1, because the sum must be smaller than 10^9. But if the maximum is larger, like in the next problem, it may overflow. Anyway, let's take a look at the problem first.

3203. Find Minimum Diameter After Merging Two Trees:

There exist two undirected trees with n and m nodes, numbered from 0 to n - 1 and from 0 to m - 1, respectively. You are given two 2D integer arrays edges1 and edges2 of lengths n - 1 and m - 1, respectively, where edges1[i] = [ai, bi] indicates that there is an edge between nodes ai and bi in the first tree and edges2[i] = [ui, vi] indicates that there is an edge between nodes ui and vi in the second tree.

You must connect one node from the first tree with another node from the second tree with an edge.

Return the minimum possible diameter of the resulting tree.

The diameter of a tree is the length of the longest path between any two nodes in the tree.

Constraints: 1 <= n, m <= 10^5
The intuition is, the vertices that we want to choose to connect the two trees should be the "center"s of them. But how do we define this "center", and how do we find it?
The center vertex should be the vertex that is the furthest away from the leaves. To find it, we can simply remove the vertices layer by layer, where the first layer comprises all the leaves.
But what happens when we approach a branching point?
Notice that the diameter of the tree is the longest path. We want to find the vertex that has the smallest radius, which is the maximum distance from it to any leaf. So during the process of removing the vertices layer by layer, if we encounter a branching point, the first branch that approaches it - meaning it's shorter than the others - should just be ignored, because only the longest branch counts.
That means, this problem is perfect for the adjacency sums representation! When we encounter a branching point, we simply remove this branch, and continue with another leaf.

Here's the algorithm:

0. If the edge list is empty, return 0 as its diameter.
1. Prepare the adjacency sums.
2. Put all the leaves in an array. Initialize the layer count to be 0.
3. While the size of the array is greater than 2:
  Increment layer count by 1
  Go through the elements in the array:
    If its neighbor has degree 2, we move forward: replace this element with its next neighbor and update the array.
    Otherwise, we subtract this vertex from the neighbor's sum of neighbors and decrease its degree, then remove this vertex from the array by overwriting it with the last element, then pop_back.
4. Finally the tree has only two leaves, meaning that it's reduced to a chain. The diameter of the tree is the layer count times two, plus the length of this chain, which is easy to compute: we just go from one leaf to the other and count the steps.
5. Now we can find the diameters of both trees. After we connect their "center" vertices, if the new longest path goes through the new edge, the diameter is the sum of the radii of the two trees plus one (the radius of a tree with diameter $d$ is $\lceil d/2\rceil$). But it's also possible that one tree has a much smaller radius than the other, so the diameter of the new tree may simply be that of the larger tree. We take the maximum of all 3 possibilities.

Here is my implementation:
```cpp
struct pa{uint sn,d;} adjl[100000];  // sn: sum of neighbor ids, d: degree
struct pr{uint ind,pre;} q[100000];  // walk fronts: current vertex and the previous one
class Solution {
    int getD(const vector<vector<int>>& edges){
        const uint n=edges.size()+1;
        if(n==1) return 0;
        memset(adjl,0,n*sizeof(adjl[0]));
        for(const auto &v:edges){
            adjl[v[0]].sn+=v[1];
            adjl[v[0]].d++;
            adjl[v[1]].sn+=v[0];
            adjl[v[1]].d++;
        }
        int qsz=0,lay=0;
        for(uint i=0;i<n;++i) if(adjl[i].d==1) q[qsz++]=pr{i,0};  // all leaves
        while(qsz>2){                // peel the tree layer by layer
            lay++;
            int i=0;
            while(i<qsz){
                int nei=adjl[q[i].ind].sn-q[i].pre;  // next vertex on this front
                if(adjl[nei].d<=2){                  // chain: step forward
                    q[i].pre=q[i].ind;
                    q[i].ind=nei;
                    ++i;
                    continue;
                }
                adjl[nei].sn-=q[i].ind;              // branching vertex: drop this branch
                adjl[nei].d--;
                q[i]=q[--qsz];                       // remove this front (swap with last)
            }
        }
        // only a chain remains between the two surviving fronts; measure it
        int l=0,ind=q[0].ind,pre=q[0].pre,tar=q[1].ind;
        while(ind!=tar){
            l++;
            int nxt=adjl[ind].sn-pre;
            pre=ind;
            ind=nxt;
        }
        return 2*lay+l;
    }
public:
    int minimumDiameterAfterMerge(const vector<vector<int>>& edges1,const vector<vector<int>>& edges2) {
        int d1=getD(edges1),d2=getD(edges2);
        // either a path inside one tree, or the new path through the bridge
        return max(max(d1,d2),((d1+1)>>1)+((d2+1)>>1)+1);
    }
};
```
Here, I used "uint" for the sum. You may wonder, isn't it possible to overflow, since the maximum number of vertices is 10^5 this time, and the sum can be at most 1+2+...+99999=99999*50000 which is about 5*10^9? Or is it simply because this doesn't happen in the test cases?
It's true that the test cases don't trigger any overflows, because I tested with "int" as well. But later, I realized that it doesn't even matter if it overflows.
The reason is that we don't use the value until the degree is one. As we accumulate the sum, an overflow simply means the sum is taken mod $2^{32}$. As the program runs, we subtract neighbor ids from this sum and decrease the degree; unsigned arithmetic is exact mod $2^{32}$. When the degree reaches one, the sum must be the id of the remaining neighbor mod $2^{32}$ - which is simply its id, as long as it's under $2^{32}$! So, it works as long as the vertex ids are unique numbers within the range of "uint".
Here is my solution page.
As @Suboptimal Coder mentioned in the comments, we may use "xor" instead of the sum. In modern processors, both should take one clock cycle, but there may be situations where "xor" is faster or more feasible on certain hardware, since it requires fewer transistors. So, it's probably a good idea to replace the addition with xor.
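To illustrate, here's a minimal sketch of the xor variant (the names are mine, not from my submission). Removing a neighbor and reading off the last remaining one work exactly as in the sum version, and overflow is no longer even a question:
```cpp
#include <cstdint>

struct Adj { uint32_t nx, d; };   // nx: xor of neighbor ids, d: degree

void addEdge(Adj* adj, uint32_t a, uint32_t b){
    adj[a].nx ^= b; adj[a].d++;
    adj[b].nx ^= a; adj[b].d++;
}

void removeNeighbor(Adj* adj, uint32_t v, uint32_t nei){
    adj[v].nx ^= nei;             // was "-=" in the sum version
    adj[v].d--;
}
// Whenever adj[v].d == 1, adj[v].nx is exactly v's only remaining neighbor,
// because every other id has been xored in and back out.
```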

The previous runtime record was 209ms. An optimized implementation of BFS by @Sergei, which used a custom queue, had a runtime of 16ms. At the time of writing this article, the best runtime of my algorithm is 3ms, as shown on the submission page. The old memory record was 282.3MB, while my program used 217.75MB, again a huge improvement.

If the edges of the tree are directed instead, the logic is actually simpler. We again start from the leaves, go up the chain, and remove the branch when encountering a branching vertex. For each vertex, we only need to store its parent vertex and its out-degree. It's more straightforward than the undirected version.

I don't know if this data structure / implementation has ever been proposed. Anyway, I'll just call it "Jiang's Adjacency Sums Algorithm for Tree Traversing" for now, until someone points out that it's already invented by someone else, XD.

Wednesday, May 8, 2024

The limit of an averaging sequence

Consider this sequence of length $n$: $P_n=(p_0,p_1,p_2,\dots,p_{n-1})$, where $p_0=2$, and each next number $p_k$ can be any integer within $[2,1+p_{k-1}]$ for $k\in [1,n-1]$. Consider the set of all such sequences of length $n$, $\{P_n\}$. The first few sets are:
\begin{align*}n=1:& (2)\\
n=2:& (2,2),(2,3)\\
n=3:& (2,2,2),(2,2,3),(2,3,2),(2,3,3),(2,3,4)\\
n=4:& (2,2,2,2),(2,2,2,3),(2,2,3,2),(2,2,3,3),(2,2,3,4),(2,3,2,2),(2,3,2,3),(2,3,3,2),\\
 &(2,3,3,3),(2,3,3,4),(2,3,4,2),(2,3,4,3),(2,3,4,4),(2,3,4,5)
\end{align*}
Let $P_{nj}$ be any sequence (path) in $\{P_n\}$, and $\prod P_{nj}$ be the product of all the numbers in that sequence. The question is: what is the limit $\lim_{n\rightarrow\infty} \sum_{P_{nj}\in \{P_n\}} (\prod P_{nj})^{-1}$?

The first few sums are

$$
\renewcommand\arraystretch{1.5}
\begin{matrix}
    n=1:& \frac{1}{2}\\
    n=2:& \frac{1}{2}\frac{1}{2}+\frac{1}{2}\frac{1}{3}\\
    n=3:& \frac{1}{2}\frac{1}{2}\frac{1}{2}+\frac{1}{2}\frac{1}{2}\frac{1}{3}+\frac{1}{2}\frac{1}{3}\frac{1}{2}+\frac{1}{2}\frac{1}{3}\frac{1}{3}+\frac{1}{2}\frac{1}{3}\frac{1}{4}
\end{matrix}$$

Answer: The limit is $e^{-1}$.
How do you prove that it converges to this value?
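Not a proof, but here is a quick numerical check (my own sketch; the recursion $T$ and the cutoff $N=60$ are my assumptions, not from the post). Writing $T(m,v)$ for the sum over all length-$m$ continuations given the previous entry $v$, we have $T(0,v)=1$, $T(m,v)=\sum_{x=2}^{v+1}T(m-1,x)/x$, and the sum above is $S(n)=T(n-1,2)/2$:
```cpp
#include <cstdio>
#include <vector>
using namespace std;

int main(){
    const int N = 60;  // maximum sequence length to check
    // T[v] holds T(m, v) for the current m; T(0, v) = 1 for every v.
    vector<double> T(N + 3, 1.0);
    for (int m = 1; m <= N - 1; ++m){
        vector<double> nt(N + 3, 0.0);
        for (int v = 2; v <= N + 1; ++v)
            for (int x = 2; x <= v + 1; ++x) nt[v] += T[x] / x;  // T(m, v)
        T = nt;
        int n = m + 1;                     // S(n) = T(n-1, 2) / 2
        if (n == 2 || n == 5 || n == 10 || n == 20 || n == 40 || n == 60)
            printf("S(%2d) = %.10f\n", n, T[2] / 2);
    }
    printf("1/e   = %.10f\n", 0.3678794411714423);
}
```
The printed values should drift down from $S(2)=5/12\approx 0.4167$ toward $1/e\approx 0.3679$.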

Monday, May 6, 2024

IBM Ponder this 2024 04 solution

Problem description, and the solution.
Hey, this time I got first place! Although there's not much to talk about.
The problem looks like a finite state machine problem. The states can be represented by a length $n$ ternary number, where the lowest digit represents the location of the largest disk, the second digit is the location of the second largest one, and so on. After each step, it transits to the next state, depending on the move.
The solution can be found by, like the solution page says, finding the periods of the sequences. Both the state and the move must be considered.

Or, just simulate the entire process. Out of all the problems that I've solved so far, this one is the most brute-forcible, if that's a word.
Each game is represented by 3 integer arrays, which correspond to the disks on the 3 rods, and an index of the move. Each time, read the move and change the arrays accordingly.
Usually the simulation of the periodic increment of the index is implemented as something like
  i=(i+1)%size;
or
  i=i+1;
  if(i==size) i=0;
But modulo and branching are costly operations, and the routine is called many times. To make it faster, I replaced it with a lookup table. The speedup is not huge, 90s to 63s, but still quite noticeable.
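A minimal sketch of the trick (names hypothetical, not from my actual program): build the successor table once, and the hot loop then does a single load instead of a modulo or a branch.
```cpp
constexpr int SIZE = 6;   // hypothetical period of the move sequence
int nextIdx[SIZE];

void initNext(){
    for (int i = 0; i < SIZE; ++i) nextIdx[i] = i + 1;
    nextIdx[SIZE - 1] = 0;   // the wrap-around is baked into the table
}

// in the hot loop, i = (i + 1) % SIZE becomes:
//     i = nextIdx[i];
```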

Wednesday, April 24, 2024

A nice identity of the sum of a certain fraction series

I saw this question on stackexchange, and I was intrigued. The equation in question is,
$$\sum_{k=1}^n\frac{2^{2k-1}}{k}\frac{\binom{2n-2k}{n-k}}{\binom{2n}{n}}=\sum_{k=1}^n\frac{1}{2k-1}$$
Are you able to prove this identity without looking at the proof that I gave under the question?





After simplifying the expression, it turned into
$$\sum_{k=1}^n\frac{1}{2k}\frac{2n}{2n-1}\frac{2n-2}{2n-3}\dots\frac{2n-2k+2}{2n-2k+1}=\sum_{k=1}^n\frac{1}{2k-1}$$
The first few $n$ gives:
$$\frac{1}{2}\frac{2}{1}=1$$
$$\frac{1}{2}\frac{4}{3}+\frac{1}{4}\frac{4}{3}\frac{2}{1}=1+\frac{1}{3}$$
$$\frac{1}{2}\frac{6}{5}+\frac{1}{4}\frac{6}{5}\frac{4}{3}+\frac{1}{6}\frac{6}{5}\frac{4}{3}\frac{2}{1}=1+\frac{1}{3}+\frac{1}{5}$$
$$\frac{1}{2}\frac{8}{7}+\frac{1}{4}\frac{8}{7}\frac{6}{5}+\frac{1}{6}\frac{8}{7}\frac{6}{5}\frac{4}{3}+\frac{1}{8}\frac{8}{7}\frac{6}{5}\frac{4}{3}\frac{2}{1}=1+\frac{1}{3}+\frac{1}{5}+\frac{1}{7}$$
By grouping the terms, I came up with this expression:
$$\frac{8}{7}\left(\frac{1}{2}+\frac{6}{5}\left(\frac{1}{4}+\frac{4}{3}\left(\frac{1}{6}+\frac{2}{1}\left(\frac{1}{8}\right)\right)\right)\right)=1+\frac{1}{3}+\frac{1}{5}+\frac{1}{7}$$
An interesting thing that I noticed is, if I replace the numbers $\frac{1}{2},\frac{1}{4},\frac{1}{6},\frac{1}{8}$ with ones, it suddenly becomes obvious. We get this identity:
$$\frac{8}{7}+\frac{8}{7}\frac{6}{5}+\frac{8}{7}\frac{6}{5}\frac{4}{3}+\frac{8}{7}\frac{6}{5}\frac{4}{3}\frac{2}{1}=\frac{8}{7}\left(1+\frac{6}{5}\left(1+\frac{4}{3}\left(1+\frac{2}{1}1\right)\right)\right)=8=2+2+2+2$$
Then I wondered, what if I invert their order? And I found that
$$\frac{1}{8}\frac{8}{7}+\frac{1}{6}\frac{8}{7}\frac{6}{5}+\frac{1}{4}\frac{8}{7}\frac{6}{5}\frac{4}{3}+\frac{1}{2}\frac{8}{7}\frac{6}{5}\frac{4}{3}\frac{2}{1}=\frac{8}{7}\left(\frac{1}{8}+\frac{6}{5}\left(\frac{1}{6}+\frac{4}{3}\left(\frac{1}{4}+\frac{2}{1}\left(\frac{1}{2}\right)\right)\right)\right)=\left(1+\frac{1}{1}\right)\left(1+\frac{1}{3}\right)\left(1+\frac{1}{5}\right)\left(1+\frac{1}{7}\right)-1$$
which is very easy to prove.
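For completeness, here is one way to see it (my sketch, by induction). Write $Q_m=\prod_{j=1}^m\frac{2j}{2j-1}=\prod_{j=1}^m\left(1+\frac{1}{2j-1}\right)$ and let $L_m$ denote the nested sum. Peeling off the outermost factor gives $L_m=\frac{2m}{2m-1}\left(\frac{1}{2m}+L_{m-1}\right)$, so if $L_{m-1}=Q_{m-1}-1$, then
$$L_m=\frac{2m}{2m-1}\cdot\frac{1}{2m}+\frac{2m}{2m-1}\left(Q_{m-1}-1\right)=\frac{1}{2m-1}+Q_m-\frac{2m}{2m-1}=Q_m-1,$$
and the base case $L_1=\frac{2}{1}\cdot\frac{1}{2}=1=Q_1-1$ closes the induction.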

The proof of the first equation, though, is much harder to construct, because the coefficient changes in the opposite direction of $n$ from the inside to the outside, so each time $n$ changes, the previous expression is no longer a part of the new expression, making the induction argument difficult to find. But after some manipulation, I found the proof, which I posted on the stackexchange page, so I won't repeat it here.
I wonder if more simple sequences can be constructed with other sequences as the coefficients.

This reminds me of continued fractions. I think maybe this should have a name, like "continued product"? But it seems that this name has already been taken by something different. So, maybe "rolling product"?
A similar concept seems to be Engel expansion, but here we have a way more general form, $a_1(b_1+a_2(b_2+a_3(b_3+\dots)))$, where $a_i$ and $b_j$ are rational, while the result is the sum of a different series. Or maybe this can further be generalized to any real sequences, or even complex sequences? I wonder...

Wednesday, April 10, 2024

IBM Ponder this 2024 03 solution

Problem description can be found here. The solution is posted here.
Not much to talk about this time. The result from $X_1$ to $X_{100}$ is,

1,8,9,9,15,15,15,24,24,24,90,90,90,105,105,105,114,114,114,114,
114,114,225,225,225,225,225,225,264,264,264,264,264,264,300,300,300,300,300,300,
300,300,300,300,300,300,945,945,945,945,945,945,945,945,945,945,945,945,945,945,
945,945,945,945,945,945,945,945,945,945,945,945,945,945,945,945,945,945,945,945,
945,945,945,945,945,945,945,945,5349,5349,5349,5349,5349,5349,5349,5349,5349,5349,5349,5349,

Recording only when the result changes, the indices and values of $X_i$ for $i\leq 300$ are,

inds:
0,1,2,4,7,10,13,16,22,28,34,46,88,100,103,124,157,247,274,283,
vals:
1,8,9,15,24,90,105,114,225,264,300,945,5349,7035,11739,17280,35475,46914,190365,351645,
idiffs:
1,1,2,3,3,3,3,6,6,6,12,42,12,3,21,33,90,27,9,
vdiffs:
7,1,6,9,66,15,9,111,39,36,645,4404,1686,4704,5541,18195,11439,143451,161280,

where "inds" are the indices (0-indexed, so it's $i-1$), "vals" are the values $X_i$, "idiffs" are the differences in the indices, and "vdiffs" are the differences in the values.
There's no obvious pattern here. So, the only way is to test every number.

To find $X_{1000}$, it's alright to use the slow way of generating a list of primes, i.e., testing whether each number $n$ is divisible by any prime $\leq \sqrt n$. The result is 115192665.

But this method is way too slow for the bonus question.
To test the primality of a large range of numbers, the most efficient way is also the most ancient way, by sieving. With sieving, the larger the range, the more efficient it is. But what range should I use?
If I use one bit for each number, 8GB RAM can hold $2^{36}$ numbers. Is that enough?

I decided to try that with CUDA. And I found this very nice GPU prime number generating algorithm. But it turns out that the result is beyond $2^{36}$. 

Let's take a look at the trend of the values. The following plot is the values vs indices in the result above plus the point (1000, 115192665) in log-log scale:
It seems that it still curves up, so at $n=2000$ it seems that it should be about $10^{11}$, which is about $2^{36.5}$. But it's hard to tell if that's enough, it may very well be much larger. What can I do?
Notice that with a list of primes up to $2^n$ as seed, I can sieve primes up to $2^{2n}$. So, since I can already generate primes up to $2^{36}$, theoretically I can use that to generate primes up to $2^{72}$. But there's not enough RAM or even storage to hold that many numbers.

Well, I can't hold all the numbers, but I can cut the range into sections, and sieve them one by one.

To save space, I only kept the odd numbers. The shifts, $i(i-1)/2$, are also separated into two lists, the odd ones and the even ones, for quick lookup. It's not hard to come up with the formulas, i.e.

odd_shift[i]=(2*i+1)*(i+1-(i&1)), where i $\in$ [0, n/2+((n&3)==3) )
even_shift[i]=(2*i+1)*(i+(i&1)), where i $\in$ [0, n/2+((n&3)==1) )

So if the current number is odd, I only check even shifts, and odd shifts for even numbers.

The algorithm goes like this:

-Generate primes list up to $2^k$ on CPU.
-Using that as seed, generate primes list up to $2^{2k}$ with CUDA.
-Copy that to GPU. With that as seed, I can sieve primes up to $2^{4k}$.
-Determine the sizes of the bit arrays "good_numbers", "sieve_buffer", and the integer arrays "odd_shift" and "even_shift", and allocate them on the GPU. The sieve range must be greater than the good_numbers range plus $n(n-1)/2$. The larger the range, the more efficient it is. Initialize the bit arrays to all-1 bits.
-Use the GPU prime sieving algorithm to sieve each range [sieve_start, sieve_end). Use binary search to find the starting index in the primes list. Also, since the minimum number to multiply is 3, and only odd numbers are considered, the number of threads is only 1/6 of the range.
-Check if the end of the range of good_numbers plus $n(n-1)/2$ is beyond the sieve range. If so, change sieve_start to the current starting point of good_numbers, and sieve this new range.
-For each bit index in good_numbers (offset by the current starting point), check if the number shifted by odd_shift or even_shift is a prime. If it hits a prime, mark the bit to 0.
-Use cub::DeviceReduce::Max() to check if there's any non-zero values in good_numbers. If not, move current start point to the previous end point. Otherwise, the answer is found, return the location of the first 1 bit.
(If cub::DeviceReduce has something like an ArgReduce() function which takes a custom reduce function, this could be done more efficiently, but I didn't find any. I've submitted a feature request, don't know if it'll be added one day.)
(Update: I was informed that the function that matches this description is thrust::find_if. There's no need to worry about the "thrust::device" part; just pass the raw C arrays and it works, like this:

```cpp
struct is_non_zero{
    __host__ __device__
    bool operator()(uint64_t x){return x!=0;}
};

auto iter=thrust::find_if(thrust::device,good_num_device,good_num_device+good_num_size,is_non_zero());
int byteshift=iter-good_num_device;
```

This solves the problem at hand, but still, I think it'd be nice to have an argument version of reduce. Maybe it'll be useful somewhere else.
)

When I wrote this program, I fell into a pitfall. I tested the program for $n\leq 300$ and $n=1000$, it worked pretty well. But when I tried $n=2024$, it just kept going, until my GPU runtime got disconnected. Then I switched to the CPU version, and very quickly, an answer popped out, 1412159754600. I was surprised that the answer wasn't found by the GPU, but it's right after where it left off. I didn't think too much about it, so I just submitted it. Since you've already seen the answer, you probably noticed that this is not the correct answer. Indeed, this is about 12 times larger than the actual answer.

So, where did it go wrong? It took me quite some time to find out where the bug was. It turned out, it was not in the logic. It was not even in my implementation. It was in this line:

uint64_t i = (blockIdx.x * blockDim.x + threadIdx.x)*2+3;

It turned out, there were a lot of indices that $i$ didn't reach, meaning that many composite numbers were not sieved, instead they were left as prime. That's why the program never ended, it saw prime numbers everywhere, and every number hit a prime within only a few steps.
So what's wrong with this line? When I finally tried to output the indices "blockIdx.x, blockDim.x, threadIdx.x", a warning caught my attention. It said that I was trying to print an unsigned int type as an unsigned long long type. So, that's it. The indices, blockIdx.x etc., are all 32-bit unsigned int, but the number of threads can be greater than $2^{32}$. When that happens, there is no runtime error of any kind; the product just quietly wraps around. Very sneaky. Since these variables are not defined by me, it's not easy to check what type they have.
I unconsciously neglected this possibility because I saw "blockIdx.x * blockDim.x + threadIdx.x" as the index in an array in many CUDA program examples, so I naturally thought, "they must be in size_t type, so that it can be used to represent any index in an array." Well, how wrong I was.
Anyway, at least I learned my lesson. Never assume the type of a variable not defined by myself; always check. And: CUDA indices are 32-bit unsigned int. Guess I will remember this for the rest of my life...
Anyway, converting blockIdx.x to uint64_t fixed it.
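For reference, the fixed line presumably looks like this, with the cast applied before the multiplication so the product is computed in 64 bits:
```cpp
// blockIdx.x and blockDim.x are both 32-bit; cast first so the whole
// index computation happens in 64 bits.
uint64_t i = ((uint64_t)blockIdx.x * blockDim.x + threadIdx.x) * 2 + 3;
```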

With seed primes up to $2^{20}$, sieve_range=$2^{34}$, good_numbers range=$2^{22}$, the program finds the solution in 2m40s.



Miscellaneous thoughts:

Primality test:
While I was debugging the program, I looked for primality test algorithms to double check the number of steps.
There are a few primality test algorithms available. I heard about the AKS test 10 years ago. This time, I learned more about other algorithms, and I realized that although the AKS algorithm is the first deterministic primality test that runs in polynomial time, in practice we may use a faster probabilistic test. The Baillie–PSW primality test is a good candidate: even though it's probabilistic, no counter-examples have ever been found. I found the FLINT implementation of the BPSW algorithm and learned how to install and use it. It was pretty easy to use (although the installation is very slow...).

Generalization to other sequences:
In general, we may consider the function
$f(a_0)=\min\{n \mid a_0+s_n \text{ is prime}\}$
We get a different function $f$ for every different sequence $s_n$. The problem above is the case where $s_n=0+1+\dots+(n-1)=n(n-1)/2$.
A similar problem is when $s_n=(n-1)^2$.
A simpler sequence, $s_n=n-1$, results in $f(a_0)$ being the distance from $a_0$ to the next prime number, which gives this list.
We may also consider exponential sequences, like $s_n=2^{n-1}$.

An interesting sequence is $s_n=(n-1)!$. Here, 0! is better defined as 0, I think. The biggest difference between this sequence and those above is: for any $a_0$, the function $f(a_0)$ with the $s_n$ given above should be finite (except when $s_n=k^{n-1}$ and $\gcd(a_0,k)\neq 1$, or both $a_0$ and $k$ are odd), but it's not at all clear where it will stop until we find the solution. With the factorial sequence, it's the opposite: we must have $f(a_0)<m(a_0)$ or $f(a_0)=\infty$, where $m(1)=1$ and for $x\neq 1$, $m(x)$ is the minimum prime factor of $x$ (once $(n-1)!$ contains the factor $m(a_0)$, the sum $a_0+(n-1)!$ is divisible by $m(a_0)$ and hence composite).

I did some calculations on what numbers will give infinite $f(a_0)$, grouped by their minimum prime factor.
min prime factor = 2: 8,14,20,24,26,... This is obvious: they are the numbers that equal an odd composite minus one.
min prime factor = 3: 33,63,75,...
min prime factor = 5: 295,445,505,665,... I didn't find any up to 100 by hand, so I wrote a program to find them.

To calculate this sequence efficiently, we need a sieve algorithm that leaves the minimum prime factor of each number. This is not hard to construct: just initialize the values to uint(-1), and use the min() function in the sieving. The problem is when the composite number is a product of very large primes, in which case we must test the primality of $n+k!$ for $k$ from 1 to $p-1$, where $p$ is $n$'s minimum prime factor. The numbers can be very large and there are a lot of tests to do.
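A minimal sketch of such a smallest-prime-factor sieve (my code, not from the original solution):
```cpp
#include <algorithm>
#include <cstdint>
#include <vector>
using namespace std;

// mf[x] ends up as the smallest prime factor of x (entries 0 and 1 are
// left at UINT32_MAX, i.e. uint(-1), meaning "no prime factor found").
vector<uint32_t> minFactorSieve(uint32_t n){
    vector<uint32_t> mf(n + 1, UINT32_MAX);
    for (uint64_t p = 2; p <= n; ++p)
        if (mf[p] == UINT32_MAX)                 // never stamped: p is prime
            for (uint64_t m = p; m <= n; m += p)
                mf[m] = min(mf[m], (uint32_t)p); // keep the smallest factor
    return mf;
}
```
Since p runs upward, the first stamp is already the smallest, so the min() mainly documents the intent.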

Another similar sequence is the primorial sequence.

I wonder if these functions give any interesting result...

Thursday, March 14, 2024

IBM Ponder this 2024 02 solution

Problem description can be found here. The solution is posted here.
The first thing to do is finding the probabilities that A wins a round and that B wins a round. The range of the sum of the numbers is [5,59]. There are $total=$4*6*8*10*12*20=460800 combinations in total. Now I just need to find out how many combinations result in each sum, which can be done by repeatedly shifting and adding. Then the counts of combinations for A to win are summed, and the same is done for B. The result is countA=114399, countB=230400, so the probability that A wins a round is $p_a$=countA/total, and for B it is $p_b$=countB/total. (For the bonus question, countB=116001.)

Next is a Markov process, with 2n+1 states: $N$ - the neutral state, $A_1$ - A wins one round, $A_2$ - A wins two rounds, ..., $A_n$ - A wins $n$ rounds, and $B_1$ to $B_n$ with similar meanings. The transition matrix is not hard to construct: $A_n$ and $B_n$ transition to themselves no matter the result (probability=1); for the other $A_i$s, the probability of going to $A_{i+1}$ is $p_a$, to $B_1$ is $p_b$, and to $N$ is $1-p_a-p_b$; similarly for the $B_i$s and $N$.
The goal is to find out the probability that it ends up at $A_n$ or $B_n$ eventually.

The standard way to solve this type of problem is to do an eigendecomposition of the matrix, remove any eigenvalues that are less than one, then multiply the matrices back to get the probability. The reason is simple: after each round, the new probability vector is the old one multiplied by the Markov matrix, and the goal is finding the result after multiplying by the matrix infinitely many times, i.e., taking the limit of the power of the matrix to infinity. With the eigendecomposition, the pairs of inverse matrices cancel out, only the eigenvalues are powered, and any eigenvalue less than one becomes zero as the power is taken to infinity.

By doing that, the solution to the first question is easy to find, which is about 0.000167566680157.

The bonus question, though, cannot be solved with this technique. The reason is that there is not enough numerical precision. The method above gives nonsensical results, like probabilities larger than 1 and a sum of probabilities on the order of $10^{-17}$. It started to break down at around $n=26$, let alone $n=300$. What should I do, then?

I thought, the reason for the eigen decomposition is to find out the limit of the power of the matrix, but what if I just power the matrix itself and see how that goes? Numerical precision might still be an issue, but it was worth a try.
The naive matrix powering algorithm that repeatedly multiplies by the matrix is too slow. The efficient matrix powering algorithm has $O(\log(n))$ complexity, where $n$ is the power. Since I just need to take the limit to infinity, I simply repeatedly squared the matrix. After $n$ squarings, the transition matrix $M$ is raised to $M^{2^n}$.
After 65 squaring operations, the result $p=prA/(prA+prB)$ converges to 0.01525753293679526. But the numerical error still appears, because $prA+prB=2.861639890862583e-163$ instead of 1. I wasn't sure if this invalidated the result. Maybe I could normalize the sum every time after squaring, so that it remains a Markov matrix, I thought. And this time, it converged to $p=0.01525753293679551$ after 603 iterations. Using the "long double" type in Python doesn't change the result by much.

At this point I believed that the result is accurate enough, so I submitted it. Then I got a reply from Ponder This, which informed me that there was a formula of the exact solution. So I took another look and figured it out!

A typical approach to finding the final state probabilities is finding a martingale. A martingale, as I understand it, is like a conserved quantity in physics. In physics, when we want to know the final state of an object, sometimes we don't need to know the dynamics at every moment in the process. Instead, we just need to use the conservation laws, which state that certain quantities must remain the same between the initial state and the final state, which provides a shortcut for the calculation. A martingale in a stochastic process is a conservation of expectation. If the expectation of a quantity doesn't change after one step in the process, then the expectation at the beginning must be the same as that at the end, which makes it very suitable for solving for final probabilities.

We may rewrite this problem as a random walk problem. Let's use point 0 to represent the neutral state, and negative points to represent $A_i$, positive points for $B_i$. We start from point 0, and if we roll an A, we move to -1, and if we roll a B, we move to 1, and we stay at 0 if we roll a draw. At any positive (B) point except the last one which is the absorbing state, if we roll an A, we go straight to -1, and if we roll a B, we move to the right by 1, and a draw moves us to 0. Similar for negative points. At the absorbing states, we don't move.

I wasn't able to find a martingale in this model, because the distance that we travel depends on where we are, so the expectation of our location can't be conserved. But then I thought: the expectation of the location depends on both the probabilities and the distances. The probabilities are fixed, but I'm free to change the distances any way I want!
Now, if I want the expectation to be conserved, the expectation of the change in the location at any point must be zero. Let the coordinates of $A_1,A_2,\dots,A_n$ be $-a_1,-a_2,\dots,-a_n$ respectively, and similarly for $B_i$. Denote $p_a,p_b,p_n$ for the probability that $A$ wins the round, $B$ wins, and a draw respectively. The expectation after one step from point 0 is $p_a(-a_1)+0+p_b(b_1)$ which must equal zero. At $A_1$, the expectation of the displacement after one step is $p_a(a_1-a_2)+p_na_1+p_b(a_1+b_1)$ which again must be zero. Similarly we get $p_a(a_{i-1}-a_i)+p_na_{i-1}+p_b(a_{i-1}+b_1)=0$. To get the equations for $B_i$, simply exchange $a_i$ and $b_i$, $p_a$ and $p_b$. At the two end points $-a_n$ and $b_n$, we never move, so it's guaranteed that the expectation doesn't change.

Now, our coordinate is a martingale, because no matter where we are, the expectation of our coordinate after one step is always the same. Which means, if we start from 0, the expectation of our coordinate after infinite steps is still 0. Since we must be either at $-a_n$ with probability $prA$ or at $b_n$ with probability $prB$, we must have $prA\cdot(-a_n)+prB\cdot b_n=0$. Now we just need to find the coordinates of the two end points, which can be done by solving the recursive equations.

The only constraint on $a_1$ and $b_1$ is the equation at point 0. So, as long as $a_1:b_1=p_b:p_a$, we can scale them by any constant. For simplicity, let's set $a_1=p_b, b_1=p_a$. With $p_a+p_b+p_n=1$, we can simplify the equations to $p_aa_i=a_{i-1}+p_ap_b$.
Let $D_i=a_i-a_{i-1}$, so we have $p_aD_i=D_{i-1}$.
$D_i=\frac{D_1}{p_a^{i-1}}$.
Here we define $D_1=a_1=p_b$. We then have $D_i=\frac{p_b}{p_a^{i-1}}$.
Summing it up,
$$a_n=\sum_{i=1}^n D_i = p_b\frac{\frac{1}{p_a^{n}}-1}{\frac{1}{p_a}-1}$$
The final result, the probability that $A$ first wins $n$ consecutive rounds, is
$$prA=\frac{b_n}{a_n+b_n}$$
$$=\frac{1}{\frac{a_n}{b_n}+1}$$
$$=\frac{1}{\frac{p_b}{p_a}\frac{\frac{1}{p_a^n}-1}{\frac{1}{p_a}-1}\frac{\frac{1}{p_b}-1}{\frac{1}{p_b^n}-1} +1}$$
$$=\frac{1}{\frac{1-p_b}{1-p_a}\frac{\frac{1}{p_a^n}-1}{\frac{1}{p_b^n}-1} +1}$$
Plugging in $p_a,p_b,n$, we get the result.
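The closed form is easy to evaluate directly; here's a small check (my sketch; the counts are the ones quoted earlier in this post, and the streak length $n=13$ for the main question is inferred from the quoted probability, so treat it as an assumption - $n=300$ for the bonus is from the text above):
```cpp
#include <cmath>
#include <cstdio>

// prA = 1 / ( (1-pb)/(1-pa) * (1/pa^n - 1)/(1/pb^n - 1) + 1 )
long double prA(long double pa, long double pb, int n){
    long double ra = powl(1.0L/pa, n) - 1.0L;
    long double rb = powl(1.0L/pb, n) - 1.0L;
    return 1.0L/((1.0L - pb)/(1.0L - pa)*ra/rb + 1.0L);
}

int main(){
    const long double total = 460800.0L;
    // main question: should print ~0.000167566680157
    printf("%.15Lf\n", prA(114399.0L/total, 230400.0L/total, 13));
    // bonus question (countB=116001, n=300): should print ~0.015257532936795
    printf("%.15Lf\n", prA(114399.0L/total, 116001.0L/total, 300));
}
```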

Tuesday, February 20, 2024

A Formula for Accumulated XOR Operations in a Continuous Range (LC E268)

I came up with this formula when I was solving E268. Missing Number again.

Description:
Given an array nums containing n distinct numbers in the range [0, n], return the only number in the range that is missing from the array.
Constraints:
  • n == nums.length
  • 1 <= n <= $10^4$
  • 0 <= nums[i] <= n
  • All the numbers of nums are unique.
Here is my original solution:
```
class Solution {
public:
    int missingNumber(vector<int>& nums) {
        int s=0,l=nums.size();
        for(int &i:nums)s+=i;
        return l*(l+1)/2-s;
    }
};
```
And of course there's the explicit sum solution, without using the formula:
```
class Solution {
public:
    int missingNumber(vector<int>& nums) {
        int result = nums.size();
        for (int i = 0; i < nums.size(); ++i) {
            result += i;
            result -= nums[i];
        }
        return result;
    }
};
```
But it should be slower than using the formula, since there are n additions, instead of one multiplication and one division.

And then I saw the bitwise operation solution, like this:
```
class Solution {
public:
    int missingNumber(vector<int>& nums) {
        int result = nums.size(),i=0;
        for(int num:nums){
            result ^= num;
            result ^= i;
            i++;
        }
        return result;
    }
};
```
Which avoids possible integer overflows. But it suffers from the same issue as the explicit summation solution: it takes O(n) operations to find the accumulated result.
So, I wonder, is there an easier way to get the accumulated result?
And I found a surprisingly simple form of it. I'm surprised nobody mentioned it so far (at least not in the top solutions).
Let's see what the accumulated results are for n=0,1,2,...
      n   Xor from 0 to n
      0    0
      1    1
     10   11
     11    0
    100  100
    101    1
    110  111
    111    0
   1000 1000
   
There is clearly a pattern. After giving it some thought, it's not hard to see why the result is simply 1 or 0 when n is odd: if you xor an odd number with its preceding even number, you just get a 1, so the whole accumulation collapses to $(n+1)/2$ ones xored together. It only depends on how many 1s there are: the result is 1 if n=4k+1, and 0 if n=4k+3.
When n is even, we just need to xor it with its previous accumulated result, which ends with an odd number.

So, we can get the accumulated result in O(1), similar to the summation with the formula.
```
#pragma GCC target("avx,mmx,sse2,sse3,sse4")
auto _=[]()noexcept{ios::sync_with_stdio(0);cin.tie(0);cout.tie(0);return 0;}();
class Solution {
public:
    int missingNumber(const vector<int>& nums) {
        int s,n=nums.size();
        switch(n&3){            // s = xor of 0..n, via the pattern above
            case 0:s=n;break;
            case 1:s=1;break;
            case 2:s=n^1;break; // n is even here, so n^1 == n+1
            case 3:s=0;
        }
        for(int i:nums)s^=i;    // xor out the given numbers; the missing one remains
        return s;
    }
};
```
This can easily be generalized to the accumulated xor of an arbitrary continuous range of non-negative integers [a,b]: it's just XorAccumulated(a-1)^XorAccumulated(b).
Actually, it works for negative ranges as well, given the nice properties of the two's complement representation. So this works for all integer ranges.
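As a small helper (my naming, not from the solution above), the O(1) range xor for non-negative ranges looks like this:
```cpp
// xor of 0..n in O(1), following the pattern table above
unsigned xorUpTo(unsigned n){
    switch(n & 3){
        case 0: return n;
        case 1: return 1;
        case 2: return n ^ 1;   // n is even here, so n^1 == n+1
        default: return 0;      // n = 4k+3
    }
}

// xor of a..b for 0 <= a <= b
unsigned xorRange(unsigned a, unsigned b){
    return (a ? xorUpTo(a - 1) : 0) ^ xorUpTo(b);
}
```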

Sunday, February 11, 2024

IBM Ponder this 2024 01 solution

Problem description can be found here. The solution is posted here.

The first question is quite similar to finding the solutions to the magic squares in Ponder this 2023 11. I didn't implement the solver back then, but this time I got the chance to do it.
The equations are easy to construct:

  x1 + x2 - x3 - x4 = 5
  x5 + x6 + x7 - x8 = 10
  x9 - x10 + x11 + x12 = 9
  x13 - x14 + x15 - x16 = 0
  x1 + x5 + x9 - x13 = 17
  x2 + x6 - x10 - x14 = 8
  x3 - x7 - x11 + x15 = 11
  x4 + x8 + x12 + x16 = 48
  x1 + ... + x16 = 136

There are 16 variables and 9 equations. Now I just need to assign 7 of them as free variables, and express the other 9 as linear combinations of them.

I iterated through all possible ways to assign different numbers between 1 and 16 to the 7 free variables, and checked whether each one yields a solution. There are $\binom{16}{7}=11440$ combinations, not too many. It's very fast to solve.

The complete list of 84 solutions is shown below.
  5 13 2 11 15 10 1 16 3 7 4 9 6 8 14 12
  14 8 2 15 4 16 1 11 5 9 3 10 6 7 13 12
  12 6 2 11 10 15 1 16 4 5 3 7 9 8 13 14
  11 8 1 13 4 16 2 12 7 10 3 9 5 6 15 14
  12 8 4 11 13 10 2 15 1 3 5 6 9 7 14 16
  8 16 6 13 12 5 2 9 1 10 7 11 4 3 14 15
  8 13 2 14 7 11 1 9 5 12 6 10 3 4 16 15
  11 16 7 15 13 2 5 10 1 4 3 9 8 6 12 14
  13 8 7 9 6 16 2 14 1 12 5 15 3 4 11 10
  6 16 8 9 14 7 2 13 1 12 5 15 4 3 10 11
  13 14 7 15 6 11 2 9 1 12 4 16 3 5 10 8
  12 13 5 15 6 9 3 8 1 10 7 11 2 4 16 14
  12 9 5 11 6 16 3 15 1 10 4 14 2 7 13 8
  7 11 1 12 13 10 3 16 5 4 2 6 8 9 15 14
  7 15 5 12 13 10 3 16 1 8 2 14 4 9 11 6
  15 9 5 14 11 8 3 12 1 2 4 6 10 7 13 16
  12 9 5 11 14 8 3 15 1 2 4 6 10 7 13 16
  8 13 5 11 10 14 1 15 3 12 2 16 4 7 9 6
  13 16 10 14 9 5 3 7 1 11 4 15 6 2 8 12
  7 16 8 10 13 3 9 15 1 5 2 11 4 6 14 12
  15 10 8 12 13 7 1 11 3 5 2 9 14 4 6 16
  13 7 10 5 9 14 3 16 1 11 4 15 6 2 8 12
  13 9 10 7 15 8 3 16 1 5 2 11 12 4 6 14
  13 12 6 14 5 9 4 8 1 10 7 11 2 3 16 15
  6 15 5 11 12 10 4 16 1 9 3 14 2 8 13 7
  15 6 5 11 10 12 4 16 1 2 3 7 9 8 13 14
  15 9 5 14 1 12 8 11 4 7 2 10 3 6 16 13
  12 16 8 15 10 5 1 6 4 11 3 13 9 2 7 14
  12 6 8 5 10 15 1 16 4 11 3 13 9 2 7 14
  15 6 11 5 10 12 4 16 1 8 3 13 9 2 7 14
  15 9 11 8 13 6 5 14 1 4 2 10 12 3 7 16
  15 10 8 12 5 7 11 13 1 3 2 9 4 6 16 14
  13 7 10 5 6 11 9 16 1 8 4 12 3 2 14 15
  13 9 10 7 6 8 12 16 1 5 2 11 3 4 15 14
  6 16 2 15 12 9 3 14 4 7 1 11 5 10 13 8
  12 9 2 14 6 16 3 15 4 7 1 11 5 10 13 8
  8 16 4 15 12 9 3 14 2 7 1 13 5 10 11 6
  16 9 5 15 2 13 7 12 3 6 1 11 4 8 14 10
  6 16 8 9 12 7 2 11 3 14 5 15 4 1 10 13
  6 10 3 8 7 16 2 15 9 14 1 13 5 4 11 12
  7 16 3 15 6 10 2 8 9 14 1 13 5 4 11 12
  15 6 9 7 10 14 2 16 3 8 1 13 11 4 5 12
  14 16 10 15 6 3 9 8 2 7 1 13 5 4 11 12
  8 16 7 12 11 2 10 13 3 4 1 9 5 6 15 14
  16 10 8 13 12 7 2 11 3 4 1 9 14 5 6 15
  16 12 8 15 2 7 10 9 3 6 1 11 4 5 14 13
  2 15 4 8 12 7 5 14 9 11 1 10 6 3 13 16
  2 14 4 7 12 8 5 15 9 11 1 10 6 3 13 16
  14 8 7 10 2 16 4 12 6 13 1 15 5 3 9 11
  6 16 9 8 12 2 10 14 4 7 1 11 5 3 13 15
  10 12 9 8 16 6 2 14 4 7 1 11 13 3 5 15
  12 16 9 14 6 2 10 8 4 7 1 11 5 3 13 15
  16 6 9 8 2 12 10 14 4 7 1 11 5 3 13 15
  16 6 9 8 10 12 2 14 4 7 1 11 13 3 5 15
  16 12 9 14 2 6 10 8 4 7 1 11 5 3 13 15
  16 12 9 14 10 6 2 8 4 7 1 11 13 3 5 15
  16 10 8 13 4 7 11 12 2 3 1 9 5 6 15 14
  7 15 11 6 13 4 9 16 2 8 1 14 5 3 10 12
  12 8 9 6 14 10 2 16 4 7 1 11 13 3 5 15
  13 8 7 9 2 16 6 14 5 12 1 15 3 4 11 10
  5 14 6 8 9 11 2 12 7 16 3 15 4 1 10 13
  9 14 6 12 5 11 2 8 7 16 3 15 4 1 10 13
  14 12 8 13 2 6 11 9 5 7 1 10 4 3 15 16
  12 9 5 11 3 16 6 15 4 10 1 14 2 7 13 8
  11 10 4 12 5 15 6 16 3 8 1 13 2 9 14 7
  15 9 5 14 11 8 3 12 4 2 1 6 13 7 10 16
  12 9 5 11 14 8 3 15 4 2 1 6 13 7 10 16
  16 13 10 14 3 5 9 7 4 8 1 12 6 2 11 15
  9 13 11 6 15 4 5 14 3 8 2 12 10 1 7 16
  6 16 7 10 9 8 4 11 3 14 5 15 1 2 13 12
  6 16 8 9 10 7 4 11 3 14 5 15 2 1 12 13
  12 10 4 13 11 7 6 14 3 1 2 5 9 8 15 16
  16 7 10 8 6 14 3 13 4 11 1 15 9 2 5 12
  16 13 10 14 6 8 3 7 4 11 1 15 9 2 5 12
  12 9 10 6 14 8 3 15 4 7 1 11 13 2 5 16
  11 16 7 15 3 6 9 8 5 10 1 13 2 4 14 12
  6 15 9 7 10 8 5 13 3 14 4 16 2 1 11 12
  6 15 9 7 10 5 11 16 3 8 1 13 2 4 14 12
  7 10 8 4 9 11 6 16 3 12 5 13 2 1 14 15
  8 12 6 9 4 14 5 13 7 15 1 16 2 3 11 10
  4 16 8 7 10 6 9 15 5 11 1 14 2 3 13 12
  10 12 9 8 5 6 13 14 4 7 1 11 2 3 16 15
  4 15 8 6 10 7 9 16 5 11 1 14 2 3 13 12
  4 14 8 5 10 9 7 16 6 13 1 15 3 2 11 12

To solve the bonus problem, a straightforward algorithm is to simply iterate through the combinations of the assignments of the operators '+' or '-', such that they differ in at least 12 positions from the original assignment, for all pairs of the solutions. There are 84*83/2*(1+2+2^2+...+2^12)=28553826 combinations to check. It's simple to implement, but it took a little while to get the results.

To iterate through the assignments of operators, I expressed each assignment as an unsigned integer, where each bit in the binary representation is 1 for '+' and 0 for '-'. To get an assignment that differs in $k$ positions from the original one, I just took an integer that has $k$ 1s in its binary form and xored it with the original assignment. To iterate through all binary numbers that have $k$ 1s, I wrote a binNextPerm() function, which is basically the binary equivalent of next_permutation(); with the __builtin_ctz() function, each iteration is O(1).
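The post doesn't show binNextPerm() itself; the standard bit-twiddling version of this idea is Gosper's hack, which looks like the sketch below (my version; the actual one may differ in details):
```cpp
// Next larger unsigned integer with the same number of set bits (Gosper's hack).
unsigned binNextPerm(unsigned v){
    unsigned t = v | (v - 1);  // fill in all bits below the lowest run of 1s
    // (~t & -~t) isolates the lowest 0 bit of t; the shifted tail re-deposits
    // the remaining 1s at the bottom.
    return (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
}
// e.g. with two set bits: 0b0011 -> 0b0101 -> 0b0110 -> 0b1001 -> 0b1010 -> 0b1100
```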

A better algorithm is, to find the results that are partially correct and continue checking the further rows/cols. For that, I needed to map each solution of each row/col to the solution index. So, I used a hash map to store the results and mapped them to a hash set. The solutions must be on the same row/col, with the same assignment of operators, so the key must be the combination of these three. The result must be between -63 and 63, which takes at most 7 bits. There are 8 rows/cols combined, and 8 ways to assign the operators, so 13 bits are enough to store the key.

After the mapping is done, a recursive algorithm is used to find the results. Given two indices, it starts from the first row and looks for the indices of solutions that give the same result for that row with the same assignment of the operators, and checks whether those indices include the second index. If it's found, it continues to the next row/col; otherwise it continues to the next assignment of operators. When it gets through all 8 rows and columns, it adds the result, a triplet consisting of the two indices and the assignment of the operators, to the answer. This is way more efficient and solved the problem in a second. The answer is checked against the original assignment of the operators with the xor operator to see if there are at least 12 differences, and is excluded if not.

There are 3 solutions listed on the website, but actually there are 6, although 4 of them are equivalent. But they are not identical, so I consider them different solutions. They are listed below.
  i=1,j=23
  14,8,2,15,4,16,1,11,5,9,3,10,6,7,13,12,
  13,12,6,14,5,9,4,8,1,10,7,11,2,3,16,15,
  mask:010110110001101100100010
  [-,+,-,+,+,+,-,+,+,-,-,-,-,+,+,+,-,+,-,-,-,-,-,+]
  12
  i=17,j=58
  8,13,5,11,10,14,1,15,3,12,2,16,4,7,9,6,
  12,8,9,6,14,10,2,16,4,7,1,11,13,3,5,15,
  mask:001101110100110001111111
  [-,-,+,+,-,+,+,+,-,+,+,-,+,+,+,+,-,-,+,+,+,+,-,-]
  12
  i=20,j=55
  15,10,8,12,13,7,1,11,3,5,2,9,14,4,6,16,
  16,12,9,14,10,6,2,8,4,7,1,11,13,3,5,15,
  mask:001000110001110001010111
  [-,-,+,+,-,-,+,-,-,-,+,-,+,+,+,+,-,-,+,-,+,-,-,+]
  12
  i=20,j=55
  15,10,8,12,13,7,1,11,3,5,2,9,14,4,6,16,
  16,12,9,14,10,6,2,8,4,7,1,11,13,3,5,15,
  mask:001000110001110001111111
  [-,-,+,+,-,+,+,-,-,-,+,-,+,+,+,+,-,-,+,+,+,-,-,+]
  12
  i=20,j=55
  15,10,8,12,13,7,1,11,3,5,2,9,14,4,6,16,
  16,12,9,14,10,6,2,8,4,7,1,11,13,3,5,15,
  mask:001000110100110001010111
  [-,-,+,+,-,-,+,-,-,-,+,-,+,+,+,+,-,-,+,-,+,+,-,-]
  12
  i=20,j=55
  15,10,8,12,13,7,1,11,3,5,2,9,14,4,6,16,
  16,12,9,14,10,6,2,8,4,7,1,11,13,3,5,15,
  mask:001000110100110001111111
  [-,-,+,+,-,+,+,-,-,-,+,-,+,+,+,+,-,-,+,+,+,+,-,-]
  12

Here the $i$ and $j$ are my indices of the 84 solutions, you can ignore them. "mask" is the assignment of the signs in binary form, which is translated to the form that they stated in the problem.
The last 4 solutions are equivalent, because what follows the '+' or '-' signs are expressions that evaluate to 0, so of course one can choose either sign. But still, they are different assignments of operators.

Tuesday, January 16, 2024

IBM Ponder This 2023-12 Solution

Since the official solution was posted today, I'll share my solution below.
The problem description is here.

When I saw this problem, the first thing that came to my mind was to find a proper coordinate system. It's natural to use an oblique coordinate system whose $x$ and $y$ axes make a $60^\circ$ angle. The distance from a point at $(i,j)$ to the origin is then $d=\sqrt{i^2+j^2-2ij\cos(120^\circ)}=\sqrt{i^2+j^2+ij}$.
Next, considering the symmetry, we only need to consider the points in a single $30^\circ$ sector; take the one from $30^\circ$ to $60^\circ$ off the $x$ axis (the sector within $30^\circ$ of the axis is equivalent by the $i\leftrightarrow j$ reflection), i.e. $j\geq i$.
So, our goal is to find all the $r$s such that
  $i\geq 1$, $j\geq i$,
  $d^2=i^2+ij+j^2$ is not a perfect square,
  $\lfloor d\rfloor=r$
has exactly $10^6$ distinct values of $d$.

Let's estimate the number of solutions for a given $r$. The density of the points is $2/\sqrt 3$ (each lattice point owns a rhombic unit cell of area $\sqrt 3/2$), and the area of the ring between $r$ and $r+1$ is close to $2\pi r$, so the ring contains about $4\pi r/\sqrt 3$ points. If we only consider $1/12$ of the ring, there are $\pi r/(3\sqrt 3)\approx 0.6r$ points. But a nonzero proportion of the points lie exactly on integer-radius circles, and different points may share the same distance, so the actual ratio should be smaller than $0.6$.

To calculate exactly how many distinct distances there are inside the ring of radius $r$: the larger coordinate $j$ ranges from $\lceil r/\sqrt 3\rceil$ to $r$, and for each $j$ we can find the range of $i$ by solving the quadratic inequality. Iterating through them and using a hash set to exclude the duplicate distances, we get the result.
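A minimal C++ sketch of this counting routine (my own reconstruction, not the original code; note that the only perfect square $d^2$ in $[r^2,(r+1)^2)$ is $r^2$ itself):

  #include <cmath>
  #include <unordered_set>

  // Count distinct non-integer distances d with r <= d < r+1,
  // over the sector j >= i >= 1.
  long long countDistinct(long long r) {
      std::unordered_set<long long> seen;
      const long long lo2 = r * r, hi2 = (r + 1) * (r + 1);
      const long long jMin = (long long)std::ceil((double)r / std::sqrt(3.0));
      for (long long j = jMin; j <= r; ++j) {
          // Smallest i with i^2 + i*j + j^2 >= r^2 (quadratic formula).
          double disc = 4.0 * (double)lo2 - 3.0 * (double)(j * j);
          long long i = (long long)std::ceil((std::sqrt(disc) - (double)j) / 2);
          if (i < 1) i = 1;
          for (; i <= j; ++i) {
              long long d2 = i * i + i * j + j * j;
              if (d2 >= hi2) break;
              if (d2 <= lo2) continue;  // d2 == lo2 means d == r, on the circle
              seen.insert(d2);
          }
      }
      return (long long)seen.size();
  }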

The first few numbers show that the ratio around $r=300$ is about $0.4$. As $i$ and $j$ increase, intuitively, the probability that multiple $(i,j)$ pairs give the same $d$ may increase, so the ratio should keep decreasing slowly. At $r=1000$, the ratio is down to $0.342$.

But we're not sure how fast it decreases, so it's hard to estimate exactly where we should start. Typically this kind of estimation can be handled with a binary search. A simple binary search gives:
  rupper=4313299,rlower=4313298,cnt=1000021
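The search itself is straightforward (the count only trends upward, with local fluctuations, so this merely localizes the crossing; the initial bracket is arbitrary):

  // Find where countDistinct() first crosses 10^6, assuming rough monotonicity.
  long long lo = 1, hi = 100000000;
  while (hi - lo > 1) {
      long long mid = (lo + hi) / 2;
      if (countDistinct(mid) >= 1000000) hi = mid;  // hi ends as "rupper"
      else lo = mid;                                // lo ends as "rlower"
  }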
The ratio at this point is about $0.232$. So, the next question is, where exactly should we start? Let's try $4310000$. The first few results are,
  r=4310000,cnt=999403
  r=4310001,cnt=999559
  r=4310002,cnt=999355
  r=4310003,cnt=999195
  r=4310004,cnt=998994
  r=4310005,cnt=999449
  r=4310006,cnt=999391
  r=4310007,cnt=999828
  r=4310008,cnt=998551
  r=4310009,cnt=999375
  r=4310010,cnt=999272
  r=4310011,cnt=999473
  r=4310012,cnt=999175
  r=4310013,cnt=999477
It seems quite safe to start from here: although the count fluctuates, it stays more or less far from 1000000.
But is that true? Let's look at a few more results...
  r=4310052,cnt=999996
  r=4310062,cnt=1000173
Damn! This shows the possibility that we are missing some solutions below 4310000.

Using a larger range, I got
  4305000 -> 998172
  4320000 -> 1001878
A margin of about 2000 in the count seems safe enough, although it's not guaranteed. Let's use this range to start with.

Since the result fluctuates a lot, finding a single solution is not much different from finding all the solutions: we simply check all the radii in the range. But with the hint, it's much easier to find the palindrome solution, since there are far fewer candidates - we just need to check 430?034 and 431?134. The result is 4313134.

To find the other solutions, we must check all 15000 radii. Now, performance becomes an issue: the Python program I first wrote would take about 13 hours to complete.
Then I realized that the numbers are not that huge: the coordinates fit in C++'s "int" type, and the squares fit in a "long long". And indeed, the C++ version is much faster.

Then I thought, can I accelerate it further with parallelization? I was thinking of OpenMP, but while browsing relevant information, I learned that the C++17 STL has parallel execution policies. So, I figured out how to use "std::execution::par" to run the functions in parallel, which sped up the program by a factor of the number of cores, which is 4.
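The parallel driver can be as simple as this sketch (with GCC it needs linking against TBB; the bounds are the ones chosen above):

  #include <algorithm>
  #include <execution>
  #include <numeric>
  #include <vector>

  long long countDistinct(long long r);  // the routine sketched earlier

  std::vector<long long> runAll() {
      std::vector<long long> radii(15000);
      std::iota(radii.begin(), radii.end(), 4305000LL);
      std::vector<long long> counts(radii.size());
      // One countDistinct() call per radius, distributed across cores.
      std::transform(std::execution::par, radii.begin(), radii.end(),
                     counts.begin(), countDistinct);
      return counts;
  }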

With some simplification of the logic, so that "sqrt()" is called only once per x-coordinate and radius and the rest are all integer operations, I managed to get the result in 5038.41s. That is 5038.41s/15000 = 0.336s per radius - really fast. Even without the multi-core parallelization, it would still be about 1.34s per radius, much faster than the speed they show on the solution page.

I also considered GPU acceleration, but apparently I can't simply put an STL hash set inside a CUDA kernel. There are open-source hash set implementations for CUDA, but they're probably not easy to integrate into Google Colab. Anyway, I'll look into them when it becomes necessary.

Since the computation was pretty fast, I ran it again to do some statistics. The results are,

  4310930,4311298,4312919,4313134,4313718,
  First over:r=4308795,cnt=1000034
  Min in range:r=4309083,cnt=998488
  Max in range:r=4315454,cnt=1001567
  Last under:r=4316001,cnt=999942

Here the first row lists the solutions; "First over" is the first radius whose count is greater than or equal to 1000000; "Min in range" is the minimum count after "First over"; and "Max in range" is the maximum count before "Last under", the last radius whose count is below 1000000.

So, this supports the choice of starting at 4305000 and ending at 4320000: there are fairly large gaps between these bounds and "First over" and "Last under" respectively, so it's unlikely that we're missing a solution outside the range.

I made a histogram of counts in the range from "First over" to "Last under" and fit it to a normal distribution:
It looks close, but there are definitely visible discrepancies. The sudden decrease near the boundaries may be due to the cut-off: if I extended the boundaries, more numbers would fill in. So it's probably better to exclude the numbers close to the boundaries. But even without them, the fit is still somewhat off. I wonder what kind of distribution it is...

Update with CUDA (2024 01 21):

At first, I considered running multiple radii and sorting the results, but that's about 5 times slower (about 7 hours), which is expected, because sorting is slow. Then, I thought about counting sort again. It takes much more space and time than a hash set when the numbers cover only a small portion of the range, but it can be easily parallelized. The ratio of squared radii to the range is about $\frac{10^6}{(4.3\times 10^6+1)^2-(4.3\times 10^6)^2}\approx\frac{1}{8.6}$. That's not too bad.

Next is to determine how many radii I can run at the same time. With the long long type and 128 radii, the range is about $128*2*4320000\approx 2^{30}$ long long numbers, which takes about 8GB. But I just need to store 0s and 1s in the range, and later sum over them to get the counts, which reach about $10^6*128$, still far from the maximum of the int type. So, with the int type, I can run 200 radii in parallel, which takes about 6.4GB.

Next is to assign the points to the kernel and mark their squared radii with 1. The points lie in a long arc of a narrow ring, so if I only wanted to process the points inside the region, there would be no cheap way for a thread to know which point it should handle. So, at first I just precomputed all the coordinates; then each thread can check one of them, since they're now in a one-dimensional array.

Finally, to find the number of distinct squared radii between two consecutive radii, I need to sum the marks in that range, so the count at radius $r$ is count[$r$]=sum[$(r+1)^2-1$]-sum[$r^2$]. (Conveniently, this difference skips the slot at $r^2$ itself, which is exactly where points at integer distance $r$ land, so the perfect squares are excluded automatically.) I wondered if there's an efficient algorithm to do this in parallel, but at first I just did a simple summation on the CPU.

This version of the program took 5313.97s - it's actually slower, because most of the work was still done on the CPU, and there's extra overhead in copying data from RAM to the GPU and back. While the program was running, I thought about how to improve it.

First, I thought about how to do a prefix sum in parallel. Within a minute, I came up with an algorithm taking O(log n) steps - but it requires O(n) threads running in parallel, which is probably not feasible in many situations. I figured there must already be research on this, so I searched for it. Apparently what I had found was the "naive parallel scan". I looked for an implementation of the efficient algorithm (the bank-conflict-free Blelloch scan) and noticed that the cub library has this function, and it's included in CCCL.

Next, to compute the coordinates of the points in parallel, I must put them in a parallelogram slanted 60 degrees to the left or to the right. It's impossible to fit them all in the same parallelogram - there would be too much empty space, since they lie in a very narrow ring. So, I can only take a small portion at a time and find the parallelogram that covers it entirely. The smaller the height, the less wasted space, but each piece still needs enough points for the parallelism to be beneficial.

After all of the above were integrated into the program, this is how it works (a condensed code sketch follows the outline):
--First I cut the range 4305000 -> 4320000 into steps of size 200.
----Each region, which is 1/12 of the ring between radius r and r+200, is further cut horizontally into pieces whose heights are in steps of 128.
------Each piece is extended to a parallelogram; a CUDA kernel computes the coordinates of each point inside the parallelogram, and if the squared radius is in range, it marks the data at that location as 1.
----When all the pieces for a radius step are done, the cub prefix-sum function is called to get the summations. Then another CUDA kernel reads the summations at the two squared radii for each radius, subtracts them, and writes the counts.
--Finally the counts are copied to the host and I get the results.
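Here is a condensed CUDA sketch of one radius step. The kernel names, the exact slant of the parallelogram, and the host glue are my illustrative reconstruction; only the overall structure (mark, cub scan, difference) follows the outline above:

  #include <cub/device/device_scan.cuh>
  #include <cuda_runtime.h>

  // marks has one int slot per squared radius in [base2, top2),
  // where base2 = r0^2 and top2 = (r0+200)^2 for the current step.
  __global__ void markKernel(int* marks, long long base2, long long top2,
                             long long i0, long long j0, int width, int height) {
      int tx = blockIdx.x * blockDim.x + threadIdx.x;
      int ty = blockIdx.y * blockDim.y + threadIdx.y;
      if (tx >= width || ty >= height) return;
      long long j = j0 + ty;            // one plausible 60-degree slant:
      long long i = i0 + tx + ty;       // each row shifts by one column
      if (i < 1 || i > j) return;       // keep only the sector j >= i >= 1
      long long d2 = i * i + i * j + j * j;
      if (d2 >= base2 && d2 < top2) marks[d2 - base2] = 1;
  }

  // After an inclusive prefix sum of marks into sums:
  // count[r] = sum[(r+1)^2 - 1] - sum[r^2]. The difference skips the slot
  // at r^2 itself, so points at integer distance are excluded automatically.
  __global__ void countKernel(const int* sums, long long base, long long base2,
                              int numRadii, int* counts) {
      int k = blockIdx.x * blockDim.x + threadIdx.x;
      if (k >= numRadii) return;
      long long r = base + k;
      counts[k] = sums[(r + 1) * (r + 1) - 1 - base2] - sums[r * r - base2];
  }

  // CUB's two-call pattern: the first call only queries temp storage size.
  void scanMarks(const int* d_marks, int* d_sums, int n) {
      void* d_temp = nullptr;
      size_t tempBytes = 0;
      cub::DeviceScan::InclusiveSum(d_temp, tempBytes, d_marks, d_sums, n);
      cudaMalloc(&d_temp, tempBytes);
      cub::DeviceScan::InclusiveSum(d_temp, tempBytes, d_marks, d_sums, n);
      cudaFree(d_temp);
  }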

Guess how long this version of the program took? 88.859s!!! More than a 56x speedup - 5.92ms per radius on average. It used about 1GB of system RAM and about 13GB of GPU RAM at peak, because the prefix-sum calculation needs the same amount of space for its destination, which is another 6.4GB.

I learned a lot in the process of writing this program, e.g. the Blelloch algorithm for prefix sums, how to debug CUDA functions, etc. And the result is very satisfying!
