@smilence 2020-03-12T06:12:42.000000Z 字数 8746 阅读 1985

第八部分：搜索与排序

算法面试指南

8.1 排序的基础知识和模式

a.Merge Sort
Runtime: $O(nlogn)$ for average and worst case.Memory: $O(n)$
典型的D&C策略算法。将线性数据结构（如array, vector或list )分成两部分，假定两部分都可以得到解决，合并两部分。
由于Merge Sort不依赖于随机读写，因此具有很强的普适性，比如适合用于sort Linked List结构。且与heapsort及部分改进型quick sort相区别，merge sort一般是stable sort。

def mergeSort(arr: List[int]) -> None:
    tmp = [0] * len(arr)
    def impl(start: int, end: int) -> None:
        nonlocal arr
        nonlocal tmp
        if start >= end:
            return
        mid = start + (end - start) // 2
        # sort left
        impl(start, mid)
        # sort right
        impl(mid+1, end)
        for i in range(start, end+1):
            tmp[i] = arr[i]
        # merge left and right
        l, r, i = start, mid + 1, start
        while l <= mid and (r <= end):
            if tmp[l] <= tmp[r]:
                arr[i] = tmp[l]
                l += 1
            else:
                arr[i] = tmp[r]
                r += 1
            i += 1
        while l <= mid:
            arr[i] = tmp[l]
            i += 1
            l += 1
    impl(0, len(arr) - 1)

b.Quick Sort
Runtime： $O(nlogn)$ average, $O(n^2)$ worst case.Memory: $O(logn)$
典型的D&C策略算法，将数组用pivot值分成两部分，然后默认这两部分会自己解决sorting问题。

def qSort(arr: List[int]) -> None:
    def impl(start: int, end: int) -> None:
        if start >= end:
            return
        pivot = arr[start] + (arr[end] - arr[start]) / 2
        l, r = start, end
        while l < r:
            if arr[l] <= pivot:
                l += 1
            elif arr[r] > pivot:
                r -= 1
            else:
                arr[l], arr[r] = arr[r], arr[l]
                l += 1
                r -= 1
        impl(start, l-1)
        impl(r+1, end)
    impl(0, len(arr)-1)

c.Heap Sort
Runtime： $O(nlogn)$ in all cases.Memory: $O(1)$ （除Quick Sort之外，另一个in place的排序算法）

vector<int> v;
std::make_heap (v.begin(),v.end());
int max = v.front();    // get the max;
std::pop_heap (v.begin(),v.end()); v.pop_back();  // pop the max;
std::sort_heap (v.begin(),v.end());

e.g. Merge k sorted linked lists and return it as one sorted list. (Leetcode: Merge k Sorted Lists);

d.Bin Sort, Bucket Sort 与 Radix Sort
Bin Sort： 利用table，将key的值直接作为index，将元素存入"bin"中。

for( i = 0; i < n; i++ )
    B[ getKey( A[i] ) ].append( A[i] );

Bucket Sort : 与Bin Sort类似，区别是将key在某一范围内的元素都存入一个"bin"中，再寻求后期处理。

Radix Sort : 如果元素的key可以看做一个有限位进制数（如string, vector<int>,int），那么可以按照从低位到高位（适用数值序）或高位到低位（适用词典序），根据key每一位的值进行Bin Sort，在维持其相对顺序的情况下进一步排序，直到遍历所有位。Runtime： $O(kN)$ in all cases.Memory: $O(N + k)$

相比其他Sorting方式，Bucket Sort与Radix Sort几乎不需要交换与比较。

2.Quick Select Algorithm
在无序数组内，在 $O(n)$ 时间内找到 k-th smallest ( 或 k-smallest ) item(s) 的方法(average case)，但局部无序。基本算法与Quick Sort一致，核心在于partition。

int quick_select(int *array, int left, int right, int k)
{
    if ( left == right ) return input[left];
    int index = partition(array, left, right);
    int size = index - left + 1;
    if ( size == k ) return index;  // the current index of the median
    else if ( size > k ) return quick_select(input, left, index - 1, k);
    else  return quick_select(input, index + 1, right , k - size);
}
int partition( arr, left ,right );

e.g.1 Compute the k-th largest element in an array A that runs in $O(n)$ expected time.( EPI: 11.13)
e.g.2 There are n points on a 2D plan, find the k points that are closest to origin ( x= 0, y= 0). (Facebook Interview)

3.External Sort
D&C策略。Divide the file into chunks, sort each chunk, and merge them.

Two options for merging:

a.Merge every two chunks, until there is only one file.

b.Form a min heap from the first element of all the k arrays. Extract the min element. It is the min element among all the arrays. Now take the next element from the array from which the min element has come, and put it in the min heap. Repeat the procedure.

8.2 二分搜索规律全解

D&C策略的搜索方式。对于已排序的有序container，如果不改变数据结构，Binary search几乎总是最优的检索方式，即将数据分割为两部分，再根据当前节点的分析进行剪枝。在使用binary search时，总是应该检查逼近方式与结束条件的关系：

def binary_search(A, target):
   lo = 0, hi = size(A) - 1
   while lo <= hi:
      mid = lo + (hi-lo)/2
      if A[mid] == target:
         return mid            
      else if A[mid] < target: 
         lo = mid+1
      else:
         hi = mid-1
   // target was not found

逼近方式的原则就是不能漏过有效值，并且每次不能“原地不动”。

求解给定值的一个位置，则可以用 low<=high 作为搜索条件，根据A[mid] 与目标值的关系取 low = mid +1或high = mid - 1逼近; 如果没有找到给定值，那么给定值的index最终一定在[high, low]之间 (high一定小于low), 注意区间可能会超出边界比如high < 0或者low >= length.
对于局部有序的数据，也可以根据其有序局部的特性，尽可能地逼近、剪枝，使用变种Binary Search进行检索。

e.g.1 A magic index is defined to be an index such that A[i] = i.Find a magic index in a sorted array. ( CtCI 9.3 ) { -1, 0, 1, 2, 4, 10 }

int getMagicIndex(int array[], int low, int high){
    if( low > high )
        return -1;
    int mid = low + (high-low)/2;
    if ( array[mid] == mid )
        return mid;
    if( mid > array[mid] )
        return getMagicIndex(array, mid+1, high);
    else
        return getMagicIndex(array, low, mid-1);
}

e.g.2 Find an element in a rotated array( was originally sorted).( CtCI 11.3， Leetcode: Search in Rotated Sorted Array II）

对于求解问题（无论是给定值还是范围）的下界的情况，应该在不符合条件时取low = mid+1，对应地，mid = low + (high-low)/2以保证每次循环都是“逼近”的；求解问题上界，在不符合条件时取high = mid-1，对应地，mid = high - (high-low)/2以保证“逼近”。在求上下界的情形下，都应该以 low<high 为结束条件，结束时 low==high 一定成立。

e.g.3 Given a sorted array of integers, find the starting and ending position of a given target value.( Leetcode: Search for a range ) [5, 7, 7, 8, 8, 10]

vector<int> searchRange(int A[], int n, int target) {
    vector<int> res(2,-1);
    if(!n)  return res;
    int low = 0, high = n -1;
    while(low < high){
        int mid = low + (high-low)/2; // closer to "low"
        if( A[mid] < target )
            low = mid +1; 
        else
            high = mid; // high >= low
    }
    if( low >= n || A[low]!= target)
        return res;
    res[0] = low;
    low = 0, high = n-1;
    while(low < high){
        int mid = high - (high-low)/2;
        if( A[mid] > target)
            high = mid - 1;
        else
            low = mid;
    }
    res[1] = high;
    return res;
}

e.g.4 Given a M x N matrix where each row and column is sorted in ascending order.Find an element.( CtCI 11.6)
e.g.5 Implement a function which takes as input a float varaible x and return its square root.( EPI: 11.9)

// 0 means equal, -1 means smaller, and 1 means larger.
int compare(double a, double b) {
  // Use normalization for precision problem.
  double diff = (a - b) / b;
  return diff < -numeric_limits<double>::epsilon()
             ? -1
             : diff > numeric_limits<double>::epsilon();
}
double square_root(double x) {
  // Decide the search range according to x.
  double l, r;
  if (compare(x, 1.0) < 0) {  // x < 1.0.
    l = x, r = 1.0;
  } else {  // x >= 1.0.
    l = 1.0, r = x;
  }
  // Keep searching if l < r.
  while (compare(l, r) == -1) {
    double m = l + 0.5 * (r - l);
    double square_m = m * m;
    if (compare(square_m, x) == 0) {
      return m;
    } else if (compare(square_m, x) == 1) {
      r = m;
    } else {
      l = m;
    }
  }
  return l;
}

e.g.6 There are two sorted arrays A and B of size m and n respectively. Find the median of the two sorted arrays. The overall run time complexity should be $O(log (m+n))$ . (Leetcode: Median of Two Sorted Arrays)

double findKthHelper(int A[], int m, int B[], int n, int k){
    if(m > n)   return findKthHelper(B, n, A, m, k);  //keep the size of A smaller than B
    if(m == 0)  return B[k-1];
    if(k == 1)  return min( A[0], B[0]);
    int aNum = min(m , k/2 );
    int bNum = k - aNum;
    if( A[aNum-1] == B[bNum-1])
        return A[aNum-1];
    else if( A[aNum-1] < B[bNum-1])
        return findKthHelper( A + aNum,m - aNum, B, n, k - aNum );
    else
        return findKthHelper( A,m, B + bNum, n - bNum, k - bNum);
}
double findMedianSortedArrays(int A[], int m, int B[], int n) {
    if( (m+n)%2 )
        return findKthHelper(A,m,B,n,(m+n)/2+1);
    else
        return (findKthHelper(A,m,B,n,(m+n)/2) + findKthHelper(A,m,B,n,(m+n)/2+1) )/2;
}

8.3 动态数据结构的维护

BST 可以看一个动态维护的一个sorted array。建立BST（对应std::map )通常可以满足有序要求，将检索对象作为TreeNode的key，将检索内容作为TreeNode的val。利用BST维护的稳定性，BST的节点可以方便地记录其子树的信息，并在节点间传递全局信息，方便检索。

Heap结构能够很方便地维护动态数据（stream）的最大值、最小值或中位数。 priority_queue相比简单的queue适用于需要优先级的情境。与queue类似的快速push和提取，但不方便update和检索（部分有序）。
Linked List: 快速的update、插入和删除，但不方便检索。可以注意到Linked List同时也可以是queue的底层实现方式，因此适合使用queue的情境，一般也可以使用更为强大的list。
Table: 更快速的检索（ $O(1)$ )，如unordered_map以及array， set，bitset<int>。

LRU Cache: 对时间敏感的动态问题，往往可以采用这种强大的数据结构。用Linked List存储，用Hashtable辅助检索，节点按照插入或更新顺序排列，支持快速的update、删除与检索。可以想象成一个可以随时提取或加入信息并保持全局时间有序的"Cache Pool"。在JAVA语言中，用LinkedHashMap实现。http://stackoverflow.com/questions/2504178/lru-cache-design/

e.g.1 Design an algorithm for computing the k-th largest element in a sequence of elements, one presented at a time.(EPI: 11.14)
e.g.2 Implement data structures to look up the rank of a number x( the number of values less than or equal to x) (CtCI 11.8)
e.g.3 If you were designing a web crawler, how would you avoid getting into infinite loops.( CtCI 10.5 )
e.g.4 Design a caching mechanism for the most recent queires of a search engine.( CtCI 10.7 )
e.g.5 Given a server that has requests coming in. Design a data structure such that you can fetch the count of the number requests in the last second, minute and hour.( Google interview )
http://stackoverflow.com/questions/18364490/implementation-of-a-hits-in-last-second-minute-hour-data-structure

e.g.6 N nodes, each node consists of a couple fields and methods. These are:

int id; //every node has an ID. All of these IDs are sequential, and begin with 0. 
int val; //every node has a value  
int max; //max = N. Every node knows how many nodes are in the system.  
void send(int idTo, int payload);  
int recv(int idFrom);

Write a single piece of code which runs on every node simultaneously, such that when it is finished running every node in the system knows the sum of the values of all the nodes in the system.( Ebay interview )

int getSum(){    //as member function of the node
    int parent = ( id - 1)/2;
    int left = 2*id + 1;
    int right = 2*id + 2;
    // DFS
    int subsum = val;
    if( right < max ) subsum += recv(right);
    if( left < max ) subsum += recv(left);
    if(parent >= 0 ) send( parent, subsum);
    int sum = 0;
    if(parent >= 0) sum = recv( parent);
    else sum = subsum;
    send( left, sum);
    send(right, sum );
    return sum;
}

8.4 扩展性和分布式系统问题

相对单机或者内存，数据量太大的问题，都是采用分治法D&C策略，即对问题进行预处理，将问题的输入进行分割、归类（sorting），放入相应的Bucket（单机上的某一块Chunk，或者分布式系统中的一台单机），再对每个Bucket进行后期处理，最后合并结果。一般来说，这两类问题处理起来没有任何差异，无非是把大的问题，分割成已知怎么解决的小问题，再合并小问题的解。

整个过程中应该用hash提供全局lookup办法，对于Memory Limits问题用简单的hash function来表示数据到不同硬盘文件(区块)之间的映射；对Scalability问题一般可以用hash table来记录输入对象的key与machine之间的映射。 用Hash来预处理的一个好处在于保证相同或者同类的元素出现在同一个bucket中（比如对大数据量求TOP10数据)。

e.g.1 Find all documents that contain a list of words ( CtCI Chapter 10 Example )
e.g.2 Find the top 10 popular search strings from large volume of search string log.
e.g.3 For 10 million strings, remove the duplicate ones from them.
e.g.4 Design the data structures for a very large social network and an algorithm to show the connection between two people.( CtCI 10.2)
e.g.5 Given an input file with one billion distinct non-negative integers, provide an algorithm to generate an integer not contained in the file.Assume you have only 10MB of Memory.(CtCI 10.3)
e.g.6 Find 10 largest numbers from one billion integers.