Cartesian tree

In computer science, a Cartesian tree is a binary tree derived from a sequence of numbers. The smallest number in the sequence is at the root of the tree; its left and right subtrees are constructed recursively from the subsequences to the left and right of this number. When all numbers are distinct, the Cartesian tree is uniquely defined from the properties that it is heap-ordered and that a symmetric (in-order) traversal of the tree returns the original sequence.

A sequence of numbers and the Cartesian tree derived from them.

Introduced by Vuillemin (1980) in the context of geometric range searching data structures, Cartesian trees have also been used in the definition of the treap and randomized binary search tree data structures for binary search problems, in comparison sort algorithms that perform efficiently on nearly-sorted inputs, and as the basis for pattern matching algorithms. A Cartesian tree for a sequence may be constructed in linear time.

Definition

The Cartesian tree for a sequence of distinct numbers is defined by the following properties:

  1. The Cartesian tree for a sequence is a binary tree with one node for each number in the sequence. Each node is associated with a single sequence value.
  2. A symmetric (in-order) traversal of the tree results in the original sequence. That is, the left subtree consists of the values earlier than the root in the sequence order, while the right subtree consists of the values later than the root, and a similar ordering constraint holds at each lower node of the tree.
  3. The tree has the heap property: the parent of any non-root node has a smaller value than the node itself.[1]

Based on the heap property, the root of the tree must be the smallest number in the sequence. From this, the tree itself may also be defined recursively: the root is the minimum value of the sequence, and the left and right subtrees are the Cartesian trees for the subsequences to the left and right of the root value.[2]

For a sequence of distinct numbers, the three properties above determine a unique Cartesian tree. If a sequence of numbers contains repetitions, a Cartesian tree may be determined for it by following a consistent tie-breaking rule before applying the above construction. For instance, the first of two equal elements may be treated as the smaller of the two.[2]
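
For concreteness, the recursive definition can be written directly as a short program. The following Python sketch (the function name and the nested-tuple representation are illustrative choices, not taken from the cited sources) builds a Cartesian tree for a sequence of distinct numbers; it takes quadratic time in the worst case, unlike the linear-time constructions of the next section.

```python
def cartesian_tree(seq):
    """Recursive definition: the root is the minimum of the sequence, and the
    left and right subtrees are the Cartesian trees of the subsequences to its
    left and right.  Returns nested (value, left, right) tuples."""
    if not seq:
        return None
    i = seq.index(min(seq))          # position of the smallest value
    return (seq[i], cartesian_tree(seq[:i]), cartesian_tree(seq[i + 1:]))
```

For example, cartesian_tree([5, 3, 8, 1, 6]) produces a tree rooted at 1 whose symmetric (in-order) traversal returns the original sequence 5, 3, 8, 1, 6.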

Efficient construction

A Cartesian tree may be constructed in linear time from its input sequence. One method is to simply process the sequence values in left-to-right order, maintaining the Cartesian tree of the nodes processed so far, in a structure that allows both upwards and downwards traversal of the tree. To process each new value x, start at the node representing the value prior to x in the sequence and follow the path from this node to the root of the tree until finding a value y smaller than x. The node for x becomes the right child of y, and the previous right child of y becomes the new left child of x. The total time for this procedure is linear, because the time spent searching for the parent of each new node can be charged against the number of nodes that are removed from the rightmost path in the tree.[3]
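
A minimal Python sketch of this left-to-right construction follows; the array-based representation (parent, left, and right indices per sequence position, with -1 meaning "no node") and the function name are illustrative choices rather than anything specified in the source. Ties are broken as in the Definition section: an earlier equal element is treated as the smaller one.

```python
def build_cartesian_tree(seq):
    """Build a min-rooted Cartesian tree in linear time by scanning left to
    right and walking up from the previously inserted node along the
    rightmost path.  Assumes a non-empty sequence; returns
    (root, parent, left, right) as position indices."""
    n = len(seq)
    parent, left, right = [-1] * n, [-1] * n, [-1] * n
    root = 0
    for i in range(1, n):
        j, last = i - 1, -1          # start at the node for the previous value
        while j != -1 and seq[j] > seq[i]:
            last, j = j, parent[j]   # climb until a smaller value (or past the root)
        left[i] = last               # nodes passed on the way up become i's left subtree
        if last != -1:
            parent[last] = i
        parent[i] = j
        if j != -1:
            right[j] = i             # i becomes the right child of the smaller value
        else:
            root = i                 # no smaller value found: i is the new root
    return root, parent, left, right
```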

An alternative linear-time construction algorithm is based on the all nearest smaller values problem. In the input sequence, one may define the left neighbor of a value x to be the value that occurs prior to x, is smaller than x, and is closer in position to x than any other smaller value. The right neighbor is defined symmetrically. The sequence of left neighbors may be found by an algorithm that maintains a stack containing a subsequence of the input. For each new sequence value x, the stack is popped until it is empty or its top element is smaller than x, and then x is pushed onto the stack. The left neighbor of x is the top element at the time x is pushed. The right neighbors may be found by applying the same stack algorithm to the reverse of the sequence. The parent of x in the Cartesian tree is either the left neighbor of x or the right neighbor of x, whichever exists and has a larger value. The left and right neighbors may also be constructed efficiently by parallel algorithms, so this formulation may be used to develop efficient parallel algorithms for Cartesian tree construction.[4]
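
The stack-based scan and the parent rule can be sketched as follows. The function names and the index-based representation are illustrative, and distinct values are assumed so that ties need no special handling.

```python
def nearest_smaller(seq, order):
    """For each position, taken in the given scan order, record the most
    recently scanned position holding a smaller value (-1 if none), using
    the stack algorithm described above."""
    nearest, stack = {}, []
    for i in order:
        while stack and seq[stack[-1]] >= seq[i]:
            stack.pop()                      # pop until the top is smaller
        nearest[i] = stack[-1] if stack else -1
        stack.append(i)
    return nearest

def parents_from_neighbors(seq):
    """Cartesian-tree parent of each position: whichever of its left and
    right nearest smaller neighbors exists and holds the larger value."""
    n = len(seq)
    left = nearest_smaller(seq, range(n))              # scan left to right
    right = nearest_smaller(seq, reversed(range(n)))   # scan right to left
    parent = [-1] * n
    for i in range(n):
        l, r = left[i], right[i]
        if l == -1:
            parent[i] = r          # may remain -1, in which case i is the root
        elif r == -1:
            parent[i] = l
        else:
            parent[i] = l if seq[l] > seq[r] else r
    return parent
```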

Another linear-time algorithm for Cartesian tree construction is based on divide-and-conquer. The algorithm recursively constructs the tree on each half of the input, and then merges the two trees by taking the right spine of the left tree and the left spine of the right tree (both of which are paths whose root-to-leaf order sorts their values from smallest to largest) and performing a standard merging operation, replacing these two paths in the two trees by a single path that contains the same nodes. In the merged path, the successor in the sorted order of each node from the left tree is placed in its right child, and the successor of each node from the right tree is placed in its left child, the same position that was previously used for its successor in the spine. The left children of nodes from the left tree and right children of nodes from the right tree are left unchanged. The algorithm is also parallelizable: on each level of recursion, the two sub-problems can be computed in parallel, and the merging operation can be efficiently parallelized as well.[5]
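
A sequential sketch of this divide-and-conquer construction is given below; the parallel scheduling is omitted, and the node class and function names are illustrative. The spine merge interleaves the two paths in sorted order exactly as described above, leaving the subtrees that hang off the spines unchanged.

```python
class Node:
    def __init__(self, value):
        self.value, self.left, self.right = value, None, None

def merge_spines(a, b):
    """Merge two Cartesian trees whose in-order sequences are adjacent (all of
    `a` precedes all of `b`), by merging the right spine of `a` with the left
    spine of `b`; subtrees hanging off the spines are left unchanged."""
    if a is None:
        return b
    if b is None:
        return a
    if a.value <= b.value:
        a.right = merge_spines(a.right, b)   # a's successor on the merged path
        return a
    b.left = merge_spines(a, b.left)         # b's successor on the merged path
    return b

def build_by_halves(seq, lo=0, hi=None):
    """Recursively build Cartesian trees for each half of the input,
    then merge them along their spines."""
    if hi is None:
        hi = len(seq)
    if lo >= hi:
        return None
    if hi - lo == 1:
        return Node(seq[lo])
    mid = (lo + hi) // 2
    return merge_spines(build_by_halves(seq, lo, mid),
                        build_by_halves(seq, mid, hi))
```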

It is possible to maintain the Cartesian tree of a dynamic input, subject to insertions of elements and lazy deletion of elements, in logarithmic amortized time per operation. Here, lazy deletion means that up to a constant fraction of the elements in the current tree may be marked as deleted, rather than actually removed. When too large a fraction of elements are marked, the tree is rebuilt.[6]

Applications

Range searching and lowest common ancestors

Two-dimensional range-searching using a Cartesian tree: the bottom point (red in the figure) within a three-sided region with two vertical sides and one horizontal side (if the region is nonempty) may be found as the nearest common ancestor of the leftmost and rightmost points (the blue points in the figure) within the slab defined by the vertical region boundaries. The remaining points in the three-sided region may be found by splitting it by a vertical line through the bottom point and recursing.

Cartesian trees may be used as part of an efficient data structure for range minimum queries, a range searching problem involving queries that ask for the minimum value in a contiguous subsequence of the original sequence.[7] In a Cartesian tree, this minimum value may be found at the lowest common ancestor of the leftmost and rightmost values in the subsequence. For instance, in the subsequence (12,10,20,15) of the sequence shown in the first illustration, the minimum value of the subsequence (10) forms the lowest common ancestor of the leftmost and rightmost values (12 and 15). Because lowest common ancestors may be found in constant time per query, using a data structure that takes linear space to store and that may be constructed in linear time, the same bounds hold for the range minimization problem.[8]
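
This correspondence can be illustrated in a few lines of Python, using the parent array produced by the construction sketch earlier (or any Cartesian-tree parent array). The lowest common ancestor is found here by a naive walk up the parent pointers, purely for illustration; the constant-time bound stated above requires the specialized LCA structures cited in the text.

```python
def range_minimum(seq, parent, i, j):
    """Minimum of seq[i..j]: the value at the lowest common ancestor of
    positions i and j in the Cartesian tree (naive LCA for illustration)."""
    ancestors = set()
    v = i
    while v != -1:                   # collect all ancestors of i, including i
        ancestors.add(v)
        v = parent[v]
    v = j
    while v not in ancestors:        # climb from j until the paths meet
        v = parent[v]
    return seq[v]
```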

Bender & Farach-Colton (2000) reversed this relationship between the two data structure problems by showing that data structures for range minimization could also be used for finding lowest common ancestors. Their data structure associates with each node of the tree its distance from the root, and constructs a sequence of these distances in the order of an Euler tour of the (edge-doubled) tree. It then constructs a range minimization data structure for the resulting sequence. The lowest common ancestor of any two vertices in the given tree can be found as the minimum distance appearing in the interval between the initial positions of these two vertices in the sequence. Bender and Farach-Colton also provide a method for range minimization that can be used for the sequences resulting from this transformation, which have the special property that adjacent sequence values differ by ±1. As they describe, for range minimization in sequences that do not have this form, it is possible to use Cartesian trees to reduce the range minimization problem to lowest common ancestors, and then to use Euler tours to reduce lowest common ancestors to a range minimization problem with this special form.[9]

The same range minimization problem may also be given an alternative interpretation in terms of two dimensional range searching. A collection of finitely many points in the Cartesian plane may be used to form a Cartesian tree, by sorting the points by their x-coordinates and using the y-coordinates in this order as the sequence of values from which this tree is formed. If S is the subset of the input points within some vertical slab defined by the inequalities L ≤ x ≤ R, p is the leftmost point in S (the one with minimum x-coordinate), and q is the rightmost point in S (the one with maximum x-coordinate), then the lowest common ancestor of p and q in the Cartesian tree is the bottommost point in the slab. A three-sided range query, in which the task is to list all points within a region bounded by the three inequalities L ≤ x ≤ R and y ≤ T, may be answered by finding this bottommost point b, comparing its y-coordinate to T, and (if the point lies within the three-sided region) continuing recursively in the two slabs bounded between p and b and between b and q. In this way, after the leftmost and rightmost points in the slab are identified, all points within the three-sided region may be listed in constant time per point.[3]
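
The recursive reporting procedure can be sketched directly on a list of points sorted by x-coordinate. In this sketch the bottommost point of a slab is found by a linear scan standing in for the constant-time lowest common ancestor query, and the function and variable names are illustrative.

```python
from bisect import bisect_left, bisect_right

def three_sided_query(points, L, R, T):
    """Report all points with L <= x <= R and y <= T.  `points` must be a list
    of (x, y) pairs sorted by x; each reported point costs one slab-minimum query."""
    xs = [x for x, _ in points]

    def report(i, j):                        # slab given by index range [i, j]
        if i > j:
            return
        b = min(range(i, j + 1), key=lambda k: points[k][1])   # bottommost point
        if points[b][1] <= T:                # inside the three-sided region?
            yield points[b]
            yield from report(i, b - 1)      # recurse on the slab left of b
            yield from report(b + 1, j)      # recurse on the slab right of b

    yield from report(bisect_left(xs, L), bisect_right(xs, R) - 1)
```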

The same construction, of lowest common ancestors in a Cartesian tree, makes it possible to construct a data structure with linear space that allows the distances between pairs of points in any ultrametric space to be queried in constant time per query. The distance within an ultrametric is the same as the minimax path weight in the minimum spanning tree of the metric.[10] From the minimum spanning tree, one can construct a Cartesian tree, the root node of which represents the heaviest edge of the minimum spanning tree. Removing this edge partitions the minimum spanning tree into two subtrees, and Cartesian trees recursively constructed for these two subtrees form the children of the root node of the Cartesian tree. The leaves of the Cartesian tree represent points of the metric space, and the lowest common ancestor of two leaves in the Cartesian tree is the heaviest edge between those two points in the minimum spanning tree, which has weight equal to the distance between the two points. Once the minimum spanning tree has been found and its edge weights sorted, the Cartesian tree may be constructed in linear time.[11]

As a binary search tree

Because a Cartesian tree is a binary tree, it is natural to use it as a binary search tree for an ordered sequence of values. However, defining a Cartesian tree based on the same values that form the search keys of a binary search tree does not work well: the Cartesian tree of a sorted sequence is just a path, rooted at its leftmost endpoint, and binary searching in this tree degenerates to sequential search in the path. Instead, it is possible to generate more-balanced search trees by generating priority values for each search key that are different from the key itself, sorting the inputs by their key values, and using the corresponding sequence of priorities to generate a Cartesian tree. This construction may equivalently be viewed in the geometric framework described above, in which the x-coordinates of a set of points are the search keys and the y-coordinates are the priorities.[12]
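
In code, this amounts to sorting by key and building a Cartesian tree of the corresponding priorities. The following sketch uses the quadratic recursive construction for brevity (the linear-time methods above apply equally), and the dictionary node representation is an illustrative choice.

```python
def search_tree_from_priorities(pairs):
    """Given (key, priority) pairs, build a binary search tree on the keys
    that is heap-ordered on the priorities: sort by key, then build the
    Cartesian tree of the priority sequence."""
    pairs = sorted(pairs)                    # key order = in-order traversal order
    def build(items):
        if not items:
            return None
        i = min(range(len(items)), key=lambda k: items[k][1])  # smallest priority
        key, priority = items[i]
        return {"key": key, "priority": priority,
                "left": build(items[:i]), "right": build(items[i + 1:])}
    return build(pairs)
```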

This idea was applied by Seidel & Aragon (1996), who suggested the use of random numbers as priorities. The data structure resulting from this random choice is called a treap, due to its combination of binary search tree and binary heap features. An insertion into a treap may be performed by inserting the new key as a leaf of an existing tree, choosing a priority for it, and then performing tree rotation operations along a path from the node to the root of the tree to repair any violations of the heap property caused by this insertion; a deletion may similarly be performed by a constant amount of change to the tree followed by a sequence of rotations along a single path in the tree.[12] A variation on this data structure called a zip tree uses the same idea of random priorities, but simplifies the random generation of the priorities, and performs insertions and deletions in a different way, by splitting the sequence and its associated Cartesian tree into two subsequences and two trees and then recombining them.[13]
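
A compact treap-insertion sketch in this style is shown below; the rotation helpers and class name are conventional choices rather than Seidel and Aragon's own presentation, and deletion (which reverses this process) is omitted.

```python
import random

class TreapNode:
    def __init__(self, key):
        self.key = key
        self.priority = random.random()      # random priority chosen once, at insertion
        self.left = None
        self.right = None

def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    return x

def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y

def treap_insert(root, key):
    """Insert as in an ordinary binary search tree, then rotate the new node
    upward wherever its priority violates the (min-)heap property."""
    if root is None:
        return TreapNode(key)
    if key < root.key:
        root.left = treap_insert(root.left, key)
        if root.left.priority < root.priority:
            root = rotate_right(root)
    else:
        root.right = treap_insert(root.right, key)
        if root.right.priority < root.priority:
            root = rotate_left(root)
    return root
```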

If the priorities of each key are chosen randomly and independently once whenever the key is inserted into the tree, the resulting Cartesian tree will have the same properties as a random binary search tree, a tree computed by inserting the keys in a randomly chosen permutation starting from an empty tree, with each insertion leaving the previous tree structure unchanged and inserting the new node as a leaf of the tree. Random binary search trees had been studied for much longer, and are known to behave well as search trees (they have logarithmic depth with high probability); the same good behavior carries over to treaps. It is also possible, as suggested by Aragon and Seidel, to reprioritize frequently-accessed nodes, causing them to move towards the root of the treap and speeding up future accesses for the same keys.[12]

In sorting

Pairs of consecutive sequence values (shown as the thick red edges) that bracket a sequence value (the darker blue point). The cost of including this value in the sorted order produced by the Levcopoulos–Petersson algorithm is proportional to the logarithm of this number of bracketing pairs.

Levcopoulos & Petersson (1989) describe a sorting algorithm based on Cartesian trees. They describe the algorithm as based on a tree with the maximum at the root,[14] but it may be modified straightforwardly to support a Cartesian tree with the convention that the minimum value is at the root. For consistency, it is this modified version of the algorithm that is described below.

The Levcopoulos–Petersson algorithm can be viewed as a version of selection sort or heap sort that maintains a priority queue of candidate minima, and that at each step finds and removes the minimum value in this queue, moving this value to the end of an output sequence. In their algorithm, the priority queue consists only of elements whose parent in the Cartesian tree has already been found and removed. Thus, the algorithm consists of the following steps:[14]

  1. Construct a Cartesian tree for the input sequence
  2. Initialize a priority queue, initially containing only the tree root
  3. While the priority queue is non-empty:
    • Find and remove the minimum value in the priority queue
    • Add this value to the output sequence
    • Add the Cartesian tree children of the removed value to the priority queue

As Levcopoulos and Petersson show, for input sequences that are already nearly sorted, the size of the priority queue will remain small, allowing this method to take advantage of the nearly-sorted input and run more quickly. Specifically, the worst-case running time of this algorithm is O(n log k), where n is the sequence length and k is the average, over all values in the sequence, of the number of consecutive pairs of sequence values that bracket the given value. They also prove a lower bound stating that, for any n and (non-constant) k, any comparison-based sorting algorithm must use Ω(n log k) comparisons for some inputs.[14]
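
A self-contained Python sketch of the whole procedure is given below. It reuses the left-to-right construction sketched earlier and Python's heapq module as the priority queue; the names and array representation are illustrative.

```python
import heapq

def cartesian_tree_sort(seq):
    """Sort by building a min-rooted Cartesian tree, then repeatedly removing
    the smallest value from a priority queue that contains only nodes whose
    parent has already been output."""
    n = len(seq)
    if n == 0:
        return []
    # Step 1: left-to-right Cartesian tree construction (as sketched earlier).
    parent, left, right = [-1] * n, [-1] * n, [-1] * n
    root = 0
    for i in range(1, n):
        j, last = i - 1, -1
        while j != -1 and seq[j] > seq[i]:
            last, j = j, parent[j]
        left[i], parent[i] = last, j
        if last != -1:
            parent[last] = i
        if j != -1:
            right[j] = i
        else:
            root = i
    # Steps 2-3: the queue holds candidate minima, starting with the root.
    output, queue = [], [(seq[root], root)]
    while queue:
        value, node = heapq.heappop(queue)          # find and remove the minimum
        output.append(value)                        # append it to the output sequence
        for child in (left[node], right[node]):     # its children become candidates
            if child != -1:
                heapq.heappush(queue, (seq[child], child))
    return output
```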

In pattern matching

The problem of Cartesian tree matching has been defined as a generalized form of string matching in which one seeks a substring (or in some cases, a subsequence) of a given string that has a Cartesian tree of the same form as a given pattern. Fast algorithms for variations of the problem with a single pattern or multiple patterns have been developed, as well as data structures analogous to the suffix tree and other text indexing structures.[15]

History

Cartesian trees were introduced and named by Vuillemin (1980). The name is derived from the Cartesian coordinate system for the plane: in one version of this structure, as in the two-dimensional range searching application discussed above, a Cartesian tree for a point set has the sorted order of the points by their x-coordinates as its symmetric traversal order, and it has the heap property according to the y-coordinates of the points. Vuillemin (1980) described both this geometric version of the structure, and the definition here in which a Cartesian tree is defined from a sequence. Using sequences instead of point coordinates provides a more general setting that allows the Cartesian tree to be applied to non-geometric problems as well.[2]

References

  • Bender, Michael A.; Farach-Colton, Martin (2000), "The LCA problem revisited", Proceedings of the 4th Latin American Symposium on Theoretical Informatics, Lecture Notes in Computer Science, vol. 1776, Springer-Verlag, pp. 88–94
  • Berkman, Omer; Schieber, Baruch; Vishkin, Uzi (1993), "Optimal doubly logarithmic parallel algorithms based on finding all nearest smaller values", Journal of Algorithms, 14 (3): 344–370, doi:10.1006/jagm.1993.1018
  • Bialynicka-Birula, Iwona; Grossi, Roberto (2006), "Amortized rigidness in dynamic Cartesian trees", in Durand, Bruno; Thomas, Wolfgang (eds.), STACS 2006, 23rd Annual Symposium on Theoretical Aspects of Computer Science, Marseille, France, February 23-25, 2006, Proceedings, Lecture Notes in Computer Science, vol. 3884, Springer, pp. 80–91, doi:10.1007/11672142_5
  • Demaine, Erik D.; Landau, Gad M.; Weimann, Oren (2009), "On Cartesian trees and range minimum queries", Automata, Languages and Programming, 36th International Colloquium, ICALP 2009, Rhodes, Greece, July 5-12, 2009, Lecture Notes in Computer Science, vol. 5555, pp. 341–353, doi:10.1007/978-3-642-02927-1_29, hdl:1721.1/61963, ISBN 978-3-642-02926-4
  • Fischer, Johannes; Heun, Volker (2006), "Theoretical and Practical Improvements on the RMQ-Problem, with Applications to LCA and LCE", Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 4009, Springer-Verlag, pp. 36–48, doi:10.1007/11780441_5, ISBN 978-3-540-35455-0
  • Fischer, Johannes; Heun, Volker (2007), "A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array.", Proceedings of the International Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, Lecture Notes in Computer Science, vol. 4614, Springer-Verlag, pp. 459–470, doi:10.1007/978-3-540-74450-4_41, ISBN 978-3-540-74449-8
  • Gabow, Harold N.; Bentley, Jon Louis; Tarjan, Robert E. (1984), "Scaling and related techniques for geometry problems", STOC '84: Proc. 16th ACM Symp. Theory of Computing, New York, NY, USA: ACM, pp. 135–143, doi:10.1145/800057.808675, ISBN 0-89791-133-4, S2CID 17752833
  • Harel, Dov; Tarjan, Robert E. (1984), "Fast algorithms for finding nearest common ancestors", SIAM Journal on Computing, 13 (2): 338–355, doi:10.1137/0213024
  • Hu, T. C. (1961), "The maximum capacity route problem", Operations Research, 9 (6): 898–900, doi:10.1287/opre.9.6.898, JSTOR 167055
  • Kim, Sung-Hwan; Cho, Hwan-Gue (2021), "A compact index for Cartesian tree matching", in Gawrychowski, Pawel; Starikovskaya, Tatiana (eds.), 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, LIPIcs, vol. 191, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, pp. 18:1–18:19, doi:10.4230/LIPIcs.CPM.2021.18, ISBN 9783959771863
  • Leclerc, Bruno (1981), "Description combinatoire des ultramétriques", Centre de Mathématique Sociale. École Pratique des Hautes Études. Mathématiques et Sciences Humaines (in French) (73): 5–37, 127, MR 0623034
  • Levcopoulos, Christos; Petersson, Ola (1989), "Heapsort - Adapted for Presorted Files", WADS '89: Proceedings of the Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science, vol. 382, London, UK: Springer-Verlag, pp. 499–509, doi:10.1007/3-540-51542-9_41
  • Nishimoto, Akio; Fujisato, Noriki; Nakashima, Yuto; Inenaga, Shunsuke (2021), "Position heaps for Cartesian-tree matching on strings and tries", in Lecroq, Thierry; Touzet, Hélène (eds.), String Processing and Information Retrieval - 28th International Symposium, SPIRE 2021, Lille, France, October 4-6, 2021, Proceedings, Lecture Notes in Computer Science, vol. 12944, Springer, pp. 241–254, arXiv:2106.01595, doi:10.1007/978-3-030-86692-1_20, S2CID 235313506
  • Oizumi, Tsubasa; Kai, Takeshi; Mieno, Takuya; Inenaga, Shunsuke; Arimura, Hiroki (2022), "Cartesian Tree Subsequence Matching", in Bannai, Hideo; Holub, Jan (eds.), 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, June 27-29, 2022, Prague, Czech Republic, LIPIcs, vol. 223, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, pp. 14:1–14:18, doi:10.4230/LIPIcs.CPM.2022.14, ISBN 9783959772341, S2CID 246679910
  • Park, Sung Gwan; Amir, Amihood; Landau, Gad M.; Park, Kunsoo (2019), "Cartesian tree matching and indexing", in Pisanti, Nadia; Pissis, Solon P. (eds.), 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, June 18-20, 2019, Pisa, Italy, LIPIcs, vol. 128, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, pp. 16:1–16:14, arXiv:1905.08974, doi:10.4230/LIPIcs.CPM.2019.16, ISBN 9783959771030, S2CID 162168587
  • Park, Sung Gwan; Bataa, Magsarjav; Amir, Amihood; Landau, Gad M.; Park, Kunsoo (2020), "Finding patterns and periods in Cartesian tree matching", Theoretical Computer Science, 845: 181–197, doi:10.1016/j.tcs.2020.09.014, S2CID 225227284
  • Seidel, Raimund; Aragon, Cecilia R. (1996), "Randomized Search Trees", Algorithmica, 16 (4/5): 464–497, doi:10.1007/s004539900061
  • Schieber, Baruch; Vishkin, Uzi (1988), "On finding lowest common ancestors: simplification and parallelization", SIAM Journal on Computing, 17 (6): 1253–1262, doi:10.1137/0217079
  • Shun, Julian; Blelloch, Guy E. (2014), "A Simple Parallel Cartesian Tree Algorithm and its Application to Parallel Suffix Tree Construction", ACM Transactions on Parallel Computing, 1: 1–20, doi:10.1145/2661653, S2CID 1912378
  • Song, Siwoo; Gu, Geonmo; Ryu, Cheol; Faro, Simone; Lecroq, Thierry; Park, Kunsoo (2021), "Fast algorithms for single and multiple pattern Cartesian tree matching", Theoretical Computer Science, 849: 47–63, doi:10.1016/j.tcs.2020.10.009, S2CID 225142815
  • Tarjan, Robert E.; Levy, Caleb C.; Timmel, Stephen (2021), "Zip trees", ACM Transactions on Algorithms, 17 (4): 34:1–34:12, arXiv:1806.06726, doi:10.1145/3476830, S2CID 49298052
  • Vuillemin, Jean (1980), "A unifying look at data structures", Communications of the ACM, New York, NY, USA: ACM, 23 (4): 229–239, doi:10.1145/358841.358852, S2CID 10462194