HashMap分析

前言

Map 这样的 Key Value 在软件开发中是非常经典的结构，常用于在内存中存放数据。Java 中最常用的 Map 有两种，首先是 HashMap ，其次是 ConcurrentHashMap 。

HashMap

从整个 HashMap 的声明可以看出它内部是基于数组 + 链表实现的，不过在 jdk1.7 和 1.8 中具体实现稍有不同。

基于 1.7

HashMap 在 jdk1.7 中的数据结构图：

首先来看 jdk1.7 中的实现：

static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; //初始化桶大小，因为底层是数组，所以这是数组默认的大小。默认大小16
static final int MAXIMUM_CAPACITY = 1 << 30;    //桶最大容量
static final float DEFAULT_LOAD_FACTOR = 0.75f; //默认的负载因子
static final Entry<?,?>[] EMPTY_TABLE = {};
transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE; //真正存放数据的数组
transient int size;      //存放key-value元素的个数
int threshold;           //桶容量大小，可在初始化时显式指定，扩容判定threshold = capacity * loadFactor，默认为0.75 * 16 = 12
final float loadFactor;  //负载因子，可在初始化时显式指定。

// 这两个属性是在抽象类AbstractMap中定义的
transient volatile Set<K>        keySet = null;
transient volatile Collection<V> values = null;

Map 在使用过程中不断的往里面存放数据，当数量达到了 threshold 就需要将当前容量进行扩容，而扩容这个过程涉及到 rehash、复制数据等操作，所以非常消耗性能。因此通常建议能提前预估 HashMap 的大小最好，尽量的减少扩容带来的性能损耗。

根据代码可以看到其实真正存放数据的是

1	transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;

这个数组，那么它又是如何定义的呢？

// Entry 是 HashMap 中的一个内部类
static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;     // key，写入时的键
    V value;         // value，值
    Entry<K,V> next; // 用于实现链表结构，当有hash冲突，存储的下一个元素
    final int hash;  // 当前 key 的 hashcode

    /**
     * Creates new entry.
     */
    Entry(int h, K k, V v, Entry<K,V> n) {
        value = v;
        next = n;
        key = k;
        hash = h;
    }
    
    public final K getKey() {return key;}
    public final V getValue() {return value;}
    public final V setValue(V newValue) {
      V oldValue = value;
      value = newValue;
      return oldValue;
    }
}

以上即为 HashMap 的基本结构，接下来来看写入和获取方法：

put 方法

public V put(K key, V value) {
    if (table == EMPTY_TABLE) { // 判断当前数组是否需要初始化。
        inflateTable(threshold);
    }
    if (key == null) // 如果 key 为空，则 put 一个空值进去
        return putForNullKey(value); // 
    int hash = hash(key); // 计算根据 key 计算出 hash 值------
    int i = indexFor(hash, table.length); // 根据计算出的 hash 值定位出所在桶
    for (Entry<K,V> e = table[i]; e != null; e = e.next) { // 如果桶是一个链表则，需要遍历判断
        Object k;
        // hash 值、key 是否和传入 key 相等，如果相等则进行覆盖，并返回原来的值。
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    // 桶是空的，说明当前位置没有数据存入
    modCount++;
    addEntry(hash, key, value, i); // 新增一个 Entry 对象写入当前位置
    return null;
}

static int indexFor(int h, int length) {
    // 初始容量是一个偶数，当 length-1 的时候，这个数的有效二进制位都是1。
    // 只要保证了 h 的分散性就行。&用来取mod运算，效率比%高。
    return h & (length-1);
}

void addEntry(int hash, K key, V value, int bucketIndex) {
    if ((size >= threshold) && (null != table[bucketIndex])) { // 判断是否需要扩容
        resize(2 * table.length); // 两倍扩充
        hash = (null != key) ? hash(key) : 0; // 当前的 key 重新 hash
        bucketIndex = indexFor(hash, table.length); // 重新定位
    }
    createEntry(hash, key, value, bucketIndex);
}
void createEntry(int hash, K key, V value, int bucketIndex) {
    // 将当前位置的桶传入到新建的桶中，如果当前桶有值就会在位置形成链表
    Entry<K,V> e = table[bucketIndex];
    table[bucketIndex] = new Entry<>(hash, key, value, e);
    size++;
}

get 方法

public V get(Object key) {
    if (key == null)
        return getForNullKey();
    Entry<K,V> entry = getEntry(key);
    return null == entry ? null : entry.getValue();
}
final Entry<K,V> getEntry(Object key) {
    if (size == 0) {
        return null;
    }
    int hash = (key == null) ? 0 : hash(key); // 根据 key 计算出 hash 值
    for (Entry<K,V> e = table[indexFor(hash, table.length)]; 
         e != null;
         e = e.next) { // 定位到具体的桶中
        Object k;
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k)))) 
            return e; //  key 及 hash 值相等时候就返回对应的值
    }
    return null; // 返回 null 
}

链表死循环

HashMap由并发引起的链表死循环

基于 1.8

HashMap 在 jdk1.7 的实现中有个明显缺点：

当 Hash 冲突严重时，在桶上形成的链表会变的越来越长，这样在查询时的效率就会越来越低；时间复杂度为 O(N)。

因此 1.8 中对大链表做了优化，修改为红黑树之后查询效率直接提高到了 O(logn)。

话不多说，上🐎：

static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
static final int MAXIMUM_CAPACITY = 1 << 30;
static final float DEFAULT_LOAD_FACTOR = 0.75f;
static final int TREEIFY_THRESHOLD = 8;
transient Node<K,V>[] table;
transient Set<Map.Entry<K,V>> entrySet;
transient int size;

和 1.7 大体上都差不多，还是有几个重要的区别：

TREEIFY_THRESHOLD 用于判断是否需要将链表转换为红黑树的阈值。
Entry 修改为 Node。

Node 的核心组成其实也是和 1.7 中的 Entry 一样，存放的都是 key value hashcode next 等数据。

put 方法

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length; // 当前桶为空，进行初始化（resize 中会判断是否进行初始化）
        if ((p = tab[i = (n - 1) & hash]) == null) // 根据当前 key 的 hash 值定位到具体的桶中
            tab[i] = newNode(hash, key, value, null); // 为空表明没有 Hash 冲突就直接在当前位置创建一个新桶
        else { // 不为空，有 Hash 冲突
            Node<K,V> e; K k;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                e = p; // 比较当前桶中的 key、key 的 hash 值与写入的 key 是否相等，相等就赋值给 e
            else if (p instanceof TreeNode) // 如果当前桶为红黑树，那就要按照红黑树的方式写入数据
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
                // 如果是个链表，就需要将当前的 key、value 封装成一个新节点写入到当前桶的后面（形成链表）
                for (int binCount = 0; ; ++binCount) {
                    if ((e = p.next) == null) {
                        p.next = newNode(hash, key, value, null);
                        // 判断当前链表的大小是否大于预设的阈值，大于时就要转换为红黑树
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            treeifyBin(tab, hash);
                        break;
                    }
                    // 如果在遍历过程中找到 key 相同时直接退出遍历
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            if (e != null) { // 存在相同的 key ，需要将值覆盖
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        ++modCount;
        if (++size > threshold) // 判断是否需要进行扩容
            resize();
        afterNodeInsertion(evict);
        return null;
    }

get 方法

public V get(Object key) {
        Node<K,V> e;
        return (e = getNode(hash(key), key)) == null ? null : e.value;
    }

    final Node<K,V> getNode(int hash, Object key) {
        Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
        // 将 key hash 之后取得所定位的桶
        if ((tab = table) != null && (n = tab.length) > 0 &&
            (first = tab[(n - 1) & hash]) != null) {
            // 判断桶的第一个位置的 key 是否为查询的 key，是就直接返回 value。
            if (first.hash == hash && // always check first node
                ((k = first.key) == key || (key != null && key.equals(k))))
                return first;
            if ((e = first.next) != null) {
                if (first instanceof TreeNode) // 红黑树
                    // 按照树的查找方式返回值
                    return ((TreeNode<K,V>)first).getTreeNode(hash, key);
                do { // 链表
                    // 按照链表的方式遍历匹配返回值
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        return e;
                } while ((e = e.next) != null);
            }
        }
        return null; // 桶为空则直接返回 null 
    }

线程不安全

HashMap 在并发时可能出现的问题主要有三个方面：

如果多个线程同时使用 put 方法添加元素，而且假设正好存在两个 put 的 key 发生了碰撞（根据 hash 值计算的 bucket 一样），那么根据 HashMap 的实现，这两个 key 会添加到数组的同一个位置，这样最终就会发生其中一个线程 put 的数据被覆盖。
如果多个线程同时检测到元素个数超过 threshold，这样就会发生多个线程同时对 Node 数组进行扩容，都在重新计算元素位置以及复制数据，但是最终只有一个线程扩容后的数组会赋给 table，也就是说其他线程的都会丢失，并且各自线程 put 的数据也丢失。

遍历方式

还有一个值得注意的是 HashMap 的遍历方式，通常有以下几种：

Iterator<Map.Entry<String, Integer>> entryIterator = map.entrySet().iterator();
while (entryIterator.hasNext()) {
    Map.Entry<String, Integer> next = entryIterator.next();
}
        
Iterator<String> iterator = map.keySet().iterator();
while (iterator.hasNext()){
    String key = iterator.next();
}

强烈建议使用第一种 EntrySet 进行遍历。第一种可以把 key value 同时取出，第二种还得需要通过 key 取一次 value，效率较低。

无论是 1.7 还是 1.8 其实都能看出 JDK 没有对它做任何的同步操作，所以并发会出问题，甚至 1.7 中出现死循环导致系统不可用（1.8 已经修复死循环问题）。

因此 JDK 推出了专项专用的 ConcurrentHashMap ，该类位于 java.util.concurrent 包下，专门用于解决并发问题。