您的当前位置：首页正文

Golang 深度剖析 -- Map的实现原理

来源：易榕旅网

Golang标准库实现了Map容器，在很多业务场景都会用到这个容器，但是使用不当有可能会出现问题，这里我们来探究Map底层到底是如何实现的。

Map的实现在runtime/map.go，首先有这么一段介绍:

// A map is just a hash table. The data is arranged
// into an array of buckets. Each bucket contains up to
// 8 key/elem pairs. The low-order bits of the hash are
// used to select a bucket. Each bucket contains a few
// high-order bits of each hash to distinguish the entries
// within a single bucket.
// If more than 8 keys hash to a bucket, we chain on
// extra buckets.
// When the hashtable grows, we allocate a new array
// of buckets twice as big. Buckets are incrementally
// copied from the old bucket array to the new bucket array

map内部实现是一个哈希表，内部维护了一个buckets数组，每个buckets最多包含8个key-value对，每个key的哈希值的低位是buckets的index,高位会放到bucket中用来区分。如果超过8个元素，便用额外的buckets链在后面，也就是俗称的"拉链法"。当扩容时，会重新分配一个比当前空间大2倍的空间，前一个Buckets里的数据会缓慢复制到新的存储空间中。

map是这样的一个结构：

type hmap struct {
  count     int // 当前大小
  flags     uint8
  B         uint8  // 表示能够容纳  2^B 个 bucket
  noverflow uint16 // 溢出 buckets 的数量
  hash0     uint32 // 哈希种子
  buckets    unsafe.Pointer // buckets 的数组指针 
  oldbuckets unsafe.Pointer // 用于扩容时 的 buckets 数组
  nevacuate  uintptr        // rehash 的进度 
  extra *mapextra // optional fields
}

map 底层是一个 hmap的结构，hmap维护了一个buckets 数组，buckets数组的元素是一个bmap结构，真正的数据存放到bmap中：

// A bucket for a Go map.
type bmap struct {
  // tophash generally contains the top byte of the hash value
  // for each key in this bucket. If tophash[0] < minTopHash,
  // tophash[0] is a bucket evacuation state instead.
  tophash [bucketCnt]uint8   // bucketCnt = 8
}

bmap是一个长度为8的数组，map的key经过hash算法后会被放到bmap中，bmap只存hash value的 top byte，uint8类型表示hash value的高8位，这样的好处是每次查找key的时候不要比较key的每个字符，从而增加查找效率。

在bmap中数据并不是以 keyvaluekeyvalue 的形式存储的，而是通过keykey1value1value2 的形式存储的，这样的好处是更合理的利用空间。避免内存对齐造成空间浪费。

插入数据

插入的函数是mapassign，以下是部分实现：


func mapassign(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer {
  if h == nil {
    panic(plainError("assignment to entry in nil map"))
  }
  if h.flags&hashWriting != 0 { //检测是否竞争读写
    throw("concurrent map writes")
  }
  hash := t.hasher(key, uintptr(h.hash0)) // 计算哈希值
  h.flags ^= hashWriting // 置标记位，表示正在写
  if h.buckets == nil { 
    h.buckets = newobject(t.bucket) // newarray(t.bucket, 1)
  }
again:   // 如果触发 rehash，需要重试
  bucket := hash & bucketMask(h.B)
  if h.growing() {
    growWork(t, h, bucket)
  }
  // 
  b := (*bmap)(unsafe.Pointer(uintptr(h.buckets) + bucket*uintptr(t.bucketsize)))
  top := tophash(hash)
}
bucketloop: 
  for {
    for i := uintptr(0); i < bucketCnt; i++ { //便利 单个 bucket 
      if b.tophash[i] != top {
          if isEmpty(b.tophash[i]) && inserti == nil {
               //如果找到一个 空位置，就直接插入数据，并退出循环
          }
        if b.tophash[i] == emptyRest {
          break bucketloop
        }
        continue
      }
      // 如果找到相同的 key ，那么久覆盖value
      k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
      if t.needkeyupdate() {
        typedmemmove(t.key, k, key)
      }
      elem = add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
      goto done
    }
    ovf := b.overflow(t)
    if ovf == nil {
      break
    }
    b = ovf
  }
  // 判断如果增加元素触发扩容，则重复到原理流程
  if !h.growing() && (overLoadFactor(h.count+1, h.B) || tooManyOverflowBuckets(h.noverflow, h.B)) {
    hashGrow(t, h)
    goto again // Growing the table invalidates everything, so try again
  }
  // 如果所有buckets满了，就重新分配一个
  if inserti == nil {
    newb := h.newoverflow(t, b)
    inserti = &newb.tophash[0]
    insertk = add(unsafe.Pointer(newb), dataOffset)
    elem = add(insertk, bucketCnt*uintptr(t.keysize))
  }

根据源代码，可知对map进行插入会有这么几个步骤:

这里有个细节就是插入数据时只要遇到空位就会进行插入，而不是需要全部遍历一遍判断是否有相同的，这样做的好处就是减少的插入的时间开销，但是删除的时候需要遍历所有的buckets进行删除。

扩容

当map内数据大到一定程度就会触发map的扩容机制，这个机制被称为rehash ，map的扩容受LoadFactor的限制，在map.go中有两个参数来控制，其默认大小为6.5 ：

  // Maximum average load of a bucket that triggers growth is 6.5.
  // Represent as loadFactorNum/loadFactDen, to allow integer math.
  loadFactorNum = 13
  loadFactorDen = 2

map需要扩容的条件是通过overLoadFactor函数判断的：

func overLoadFactor(count int, B uint8) bool {
  return count > bucketCnt && uintptr(count) > loadFactorNum*(bucketShift(B)/loadFactorDen)
}

当 map内数据数量 / bucket 数量大于6.5时，便会触发扩容，扩容函数是hashGrow，下面是部分实现。

func hashGrow(t *maptype, h *hmap) {
  bigger := uint8(1)
  if !overLoadFactor(h.count+1, h.B) {
    bigger = 0
    h.flags |= sameSizeGrow
  }
  oldbuckets := h.buckets
  newbuckets, nextOverflow := makeBucketArray(t, h.B+bigger, nil)

  flags := h.flags &^ (iterator | oldIterator)
  if h.flags&iterator != 0 {
    flags |= oldIterator
  }
  h.B += bigger
  h.flags = flags
  h.oldbuckets = oldbuckets
  h.buckets = newbuckets
  h.nevacuate = 0
  h.noverflow = 0
 }

扩容会分配一个当前大两倍的空间，并把之前的Buckets置位现在的oldbuckets，另外一些相关参数也会更新，但是分配完后数据并不会马上复制到新的buckets中，而是通过惰性加载的方式，当访问到时候才会进行rehahs到新的buckets中。

查找数据

查找有多个实现，以mapaccess1为例:

unc mapaccess1(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer {
  if raceenabled && h != nil {  //判断是否竞争
    callerpc := getcallerpc()
    pc := funcPC(mapaccess1)
    racereadpc(unsafe.Pointer(h), callerpc, pc)
    raceReadObjectPC(t.key, key, callerpc, pc)
  }
  if msanenabled && h != nil {
    msanread(key, t.key.size)
  }
  if h == nil || h.count == 0 { // 如果map为nil，返回零值
    if t.hashMightPanic() {
      t.hasher(key, 0) // see issue 23734
    }
    return unsafe.Pointer(&zeroVal[0])
  }
  if h.flags&hashWriting != 0 {
    throw("concurrent map read and map write")
  }
  hash := t.hasher(key, uintptr(h.hash0))
  m := bucketMask(h.B)
  b := (*bmap)(add(h.buckets, (hash&m)*uintptr(t.bucketsize)))
  if c := h.oldbuckets; c != nil { // 如果oldbuckets不为空，首先会在oldbuckets找
    oldb := (*bmap)(add(c, (hash&m)*uintptr(t.bucketsize)))
    if !evacuated(oldb) {
      b = oldb
    }
  }
  top := tophash(hash)
bucketloop:
  for ; b != nil; b = b.overflow(t) {
    for i := uintptr(0); i < bucketCnt; i++ {
      if b.tophash[i] != top {
        if b.tophash[i] == emptyRest {
          break bucketloop
        }
        continue
      }
      k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
      if t.indirectkey() {
        k = *((*unsafe.Pointer)(k))
      }
      if t.key.equal(key, k) {
        e := add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
        if t.indirectelem() {
          e = *((*unsafe.Pointer)(e))
        }
        return e
      }
    }
  }
  return unsafe.Pointer(&zeroVal[0])
}

根据上图代码，可以看出，如果oldbuckets存在，会在oldbuckets中寻找，如果已经evacuated就到新的buckets找。找到后返回对应的value。

在读map的代码中可以看到，如果对同一个map进行并发读写，会panic的，原因map并不是一个线程安全的容器，如果要避免这个情况，需要在用户层进行锁控制，还可以使用标准库中的sync.Map。

在Map的代码中可以看出作者对Map做了很大的优化，比如key经过哈希后低位用来定位buckets，而高位用来对buckets中做key寻址，这样大大缩小了key的查找范围。

另外还有一点不足的是在rehash的过程中，oldbuckets到newbuckets是值拷贝，这样开销会比较大，尽管开销做了均摊，但是如果改为指针拷贝代价会小很多。

更多干货请关注公众号：黑客的成长秘籍

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文