Serialization

Notes on Effective Java: mostly code and the original English text, supplemented by annotations and translations.

Item 85: Prefer alternatives to Java serialization

Other approaches are preferable to Java serialization.

In summary, serialization is dangerous and should be avoided.

If you are designing a system from scratch, use a cross-platform structured-data representation such as JSON or protobuf instead.

Do not deserialize untrusted data. If you must do so, use object deserialization filtering, but be aware that it is not guaranteed to thwart all attacks.
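
As a rough illustration of object deserialization filtering, the sketch below installs an allow-list filter on a single stream with java.io.ObjectInputFilter (Java 9 or later). The class name FilterDemo, the helper names, and the particular pattern string are assumptions made for this example, not something from the book.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// Hypothetical demo class; requires Java 9+ for ObjectInputFilter.
final class FilterDemo {
    // Allow ArrayList and java.lang types; "!*" rejects every other class.
    static final ObjectInputFilter ALLOW_LIST =
            ObjectInputFilter.Config.createFilter("java.util.ArrayList;java.lang.*;!*");

    static byte[] serialize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // Returns true if the stream was read, false if the filter rejected it.
    static boolean acceptedByFilter(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            in.setObjectInputFilter(ALLOW_LIST);
            in.readObject();
            return true;
        } catch (InvalidClassException rejected) {
            return false; // raised when the filter reports REJECTED
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(List.of("a", "b"));
        System.out.println("ArrayList accepted: " + acceptedByFilter(serialize(names)));
        System.out.println("HashMap accepted:   " + acceptedByFilter(serialize(new HashMap<>())));
    }
}
```

An allow-list is used here rather than a deny-list because it fails closed for classes nobody thought about; as the item warns, even this is not guaranteed to stop every attack.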

Avoid writing serializable classes. If you must do so, exercise great caution.

Item 86: Implement Serializable with great caution

Be very careful when implementing Serializable.

The costs of implementing the Serializable interface

decreases the flexibility to change a class’s implementation once it has been released

Once a class has been released, the freedom to change its implementation drops sharply.

A simple example of the constraints on evolution imposed by serializability concerns stream unique identifiers, more commonly known as serial version UIDs. Every serializable class has a unique identification number associated with it.

If you do not specify this number by declaring a static final long field named serialVersionUID, the system automatically generates it at runtime by applying a cryptographic hash function (SHA-1) to the structure of the class.

This value is affected by the names of the class, the interfaces it implements, and most of its members, including synthetic members generated by the compiler.

If you change any of these things, for example, by adding a convenience method, the generated serial version UID changes. If you fail to declare a serial version UID, compatibility will be broken, resulting in an InvalidClassException at runtime.

increases the likelihood of bugs and security holes

Bugs and security holes become more likely.

Normally, objects are created with constructors; serialization is an extralinguistic mechanism for creating objects.

Whether you accept the default behavior or override it, deserialization is a “hidden constructor” with all of the same issues as other constructors. Because there is no explicit constructor associated with deserialization, it is easy to forget that you must ensure that it guarantees all of the invariants established by the constructors and that it does not allow an attacker to gain access to the internals of the object under construction.

Relying on the default deserialization mechanism can easily leave objects open to invariant corruption and illegal access.
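
To make the "hidden constructor" point concrete, here is a minimal sketch in which readObject repeats the constructor's validity check. Percentage, HiddenConstructorDemo, and the forge helper are hypothetical names for this example. The forging step relies on an observation about this particular class: its single int field occupies the last four bytes of the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class HiddenConstructorDemo {
    static byte[] serialize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // Simulates a tampered stream by overwriting the low byte of the int field,
    // which for this class is the final byte of the serialized data.
    static byte[] forge(byte[] valid, int newLowByte) {
        byte[] forged = valid.clone();
        forged[forged.length - 1] = (byte) newLowByte;
        return forged;
    }

    // Returns true if the bytes deserialize, false if readObject rejects them.
    static boolean deserializes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            in.readObject();
            return true;
        } catch (InvalidObjectException invalid) {
            return false;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] valid = serialize(new Percentage(50));
        System.out.println("valid stream accepted:  " + deserializes(valid));
        System.out.println("forged stream accepted: " + deserializes(forge(valid, 200)));
    }
}

final class Percentage implements Serializable {
    private static final long serialVersionUID = 1L;
    private final int value; // invariant: 0 <= value <= 100

    Percentage(int value) {
        if (value < 0 || value > 100) throw new IllegalArgumentException("out of range: " + value);
        this.value = value;
    }

    // Deserialization bypasses the constructor, so the same check is repeated here.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        if (value < 0 || value > 100) throw new InvalidObjectException("out of range: " + value);
    }
}
```

Without the readObject method, the forged stream would yield a Percentage holding 200, an object the constructor could never have produced.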

increases the testing burden associated with releasing a new version of a class.

Every new release of the class carries a heavier testing burden.

When a serializable class is revised, it is important to check that it is possible to serialize an instance in the new release and deserialize it in old releases, and vice versa.

The amount of testing required is thus proportional to the product of the number of serializable classes and the number of releases, which can be large. You must ensure both that the serialization-deserialization process succeeds and that it results in a faithful replica of the original object.

Implementing Serializable is not a decision to be undertaken lightly.

Deciding to implement the Serializable interface is not something to do casually.

Historically, value classes such as BigInteger and Instant implemented Serializable, and collection classes did too.

For example: String, Date, Double, ArrayList, LinkedList, HashMap, TreeMap, and so on.

Classes representing active entities, such as thread pools, should rarely implement Serializable.

Inner classes should not implement Serializable.

An inner class should not implement the Serializable interface.

They use compiler-generated synthetic fields to store references to enclosing instances and to store values of local variables from enclosing scopes. How these fields correspond to the class definition is unspecified, as are the names of anonymous and local classes. Therefore, the default serialized form of an inner class is ill-defined.

A static member class can, however, implement Serializable.
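
The terms above can be made visible with reflection. The following sketch lists the fields the compiler actually emits for an inner class and for a static member class. Outer, Inner, Nested, and SyntheticFieldDemo are hypothetical names. The inner class deliberately reads a field of its enclosing instance, since newer compilers may drop the hidden reference when it is unused.

```java
import java.lang.reflect.Field;

final class SyntheticFieldDemo {
    // True if the compiler added at least one field that is not in the source code.
    static boolean hasSyntheticField(Class<?> type) {
        for (Field f : type.getDeclaredFields()) {
            if (f.isSynthetic()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        for (Field f : Outer.Inner.class.getDeclaredFields()) {
            // The name of the hidden field is a compiler detail, not part of any specification.
            System.out.println("Inner field: " + f.getName() + ", synthetic=" + f.isSynthetic());
        }
        System.out.println("Inner has synthetic field:  " + hasSyntheticField(Outer.Inner.class));
        System.out.println("Nested has synthetic field: " + hasSyntheticField(Outer.Nested.class));
    }
}

final class Outer {
    private final String name;

    Outer(String name) { this.name = name; }

    // Inner (non-static) class: every instance belongs to an enclosing Outer instance.
    final class Inner {
        String describe() { return "inner of " + name; } // reads the enclosing instance
    }

    // Static member class: no enclosing instance, so no hidden reference is needed.
    static final class Nested {
        String describe() { return "nested"; }
    }
}
```

The "enclosing instance" is the Outer object that an Inner was created from; the synthetic field is where the compiler keeps that reference, and its exact name and layout are what the item calls unspecified.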

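P.S. I could not follow compiler-generated synthetic fields, enclosing instances, and enclosing scopes here at all, and I am not sure which background I am missing 🤷‍♀️, so I am just taking careful notes to come back to later. A blogger I started following a few days ago said that learning is inherently a process of repetition. Everyone knows this in principle, but it is hard to practice and easy to forget, and reading those words at this moment woke me up. On reflection I have been too impatient: when I failed to understand something after a single read, I assumed I was not suited to the computing industry, when in fact I had put in only a little effort. One reading is far from enough for many classic books; only repeated reading leads to real mastery.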

Item 87: Consider using a custom serialized form

Think about using a custom serialized form.

an object’s physical representation is identical to its logical content

The default serialized form of an object is a reasonably efficient encoding of the physical representation of the object graph rooted at the object. (P.S. This is a point I did not understand in the Chinese translation.)

In other words, it describes the data contained in the object and in every object that is reachable from this object.

It also describes the topology by which all of these objects are interlinked.

The ideal serialized form of an object contains only the logical data represented by the object. It is independent of the physical representation.

The default serialized form is likely to be appropriate if an object’s physical representation is identical to its logical content.

For example, the default serialized form would be reasonable for the following class, which simplistically represents a person’s name:

// Good candidate for default serialized form
public class Name implements Serializable {
    /**
    * Last name. Must be non-null.
    * @serial
    */
    private final String lastName;

    /**
    * First name. Must be non-null.
    * @serial
    */
    private final String firstName;

    /**
    * Middle name, or null if there is none.
    * @serial
    */
    private final String middleName;
    ... // Remainder omitted
}

Logically speaking, a name consists of three strings that represent a last name, a first name, and a middle name. The instance fields in Name precisely mirror this logical content.

an object’s physical representation differs substantially from its logical data content

Near the opposite end of the spectrum from Name, consider the following class, which represents a list of strings (ignoring for the moment that you would probably be better off using one of the standard List implementations):

// Awful candidate for default serialized form
public final class StringList implements Serializable {
    private int size = 0;
    private Entry head = null;
    private static class Entry implements Serializable {
        String data;
        Entry next;
        Entry previous;
    }
    ... // Remainder omitted
}

Logically speaking, this class represents a sequence of strings.

Physically, it represents the sequence as a doubly linked list. If you accept the default serialized form, the serialized form will painstakingly mirror every entry in the linked list and all the links between the entries, in both directions.

Using the default serialized form when an object’s physical representation differs substantially from its logical data content has four disadvantages:

It permanently ties the exported API to the current internal representation.
It can consume excessive space.
It can consume excessive time.
It can cause stack overflows.
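
The space cost can be measured directly. The sketch below serializes the same 100 strings twice, once through a doubly linked list that accepts the default serialized form and once through one that writes only the count and the strings. DefaultFormList, CustomFormList, and FormSizeDemo are hypothetical names, and both lists add a tail pointer that the book's StringList does not show.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class FormSizeDemo {
    static int serializedSize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        DefaultFormList a = new DefaultFormList();
        CustomFormList b = new CustomFormList();
        for (int i = 0; i < 100; i++) { a.add("s" + i); b.add("s" + i); }
        System.out.println("default form: " + serializedSize(a) + " bytes");
        System.out.println("custom form:  " + serializedSize(b) + " bytes");
    }
}

// Default serialized form: every Entry and every link goes into the stream.
final class DefaultFormList implements Serializable {
    private static final long serialVersionUID = 1L;
    private int size = 0;
    private Entry head = null;
    private Entry tail = null;

    private static class Entry implements Serializable {
        private static final long serialVersionUID = 1L;
        String data;
        Entry next;
        Entry previous;
    }

    void add(String s) {
        Entry e = new Entry();
        e.data = s;
        e.previous = tail;
        if (tail == null) head = e; else tail.next = e;
        tail = e;
        size++;
    }
}

// Custom serialized form: only the element count and the strings are written.
final class CustomFormList implements Serializable {
    private static final long serialVersionUID = 1L;
    private transient int size = 0;
    private transient Entry head = null;
    private transient Entry tail = null;

    private static class Entry {
        String data;
        Entry next;
        Entry previous;
    }

    void add(String s) {
        Entry e = new Entry();
        e.data = s;
        e.previous = tail;
        if (tail == null) head = e; else tail.next = e;
        tail = e;
        size++;
    }

    private void writeObject(ObjectOutputStream s) throws IOException {
        s.defaultWriteObject();
        s.writeInt(size);
        for (Entry e = head; e != null; e = e.next) s.writeObject(e.data);
    }

    private void readObject(ObjectInputStream s) throws IOException, ClassNotFoundException {
        s.defaultReadObject();
        int n = s.readInt();
        for (int i = 0; i < n; i++) add((String) s.readObject());
    }
}
```

The default form also recurses once per entry while walking the next links, which is where the stack-overflow risk comes from on long lists; the custom form iterates instead.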

A reasonable serialized form for StringList is simply the number of strings in the list, followed by the strings themselves. This constitutes the logical data represented by a StringList, stripped of the details of its physical representation.

Here is a revised version of StringList with writeObject and readObject methods that implement this serialized form.

The transient modifier indicates that an instance field is to be omitted from a class’s default serialized form.

// StringList with a reasonable custom serialized form
public final class StringList implements Serializable {
    private transient int size = 0;
    private transient Entry head = null;
  
    // No longer Serializable!
    private static class Entry {
        String data;
        Entry next;
        Entry previous;
    }
  
    // Appends the specified string to the list
    public final void add(String s) { ... }

    /**
    * Serialize this {@code StringList} instance.
    *
    * @serialData The size of the list (the number of strings
    * it contains) is emitted ({@code int}), followed by all of
    * its elements (each a {@code String}), in the proper
    * sequence.
    */
    private void writeObject(ObjectOutputStream s) throws IOException {
        s.defaultWriteObject();
        s.writeInt(size);
        // Write out all elements in the proper order.
        for (Entry e = head; e != null; e = e.next)
            s.writeObject(e.data);
    }

    private void readObject(ObjectInputStream s) throws IOException, ClassNotFoundException {
        s.defaultReadObject();
        int numElements = s.readInt();
        // Read in all elements and insert them in list
        for (int i = 0; i < numElements; i++)
            add((String) s.readObject());
    }

    ... // Remainder omitted
}

The first thing writeObject does is to invoke defaultWriteObject, and the first thing readObject does is to invoke defaultReadObject, even though all of StringList’s fields are transient.

You may hear it said that if all of a class’s instance fields are transient, you can dispense with invoking defaultWriteObject and defaultReadObject, but the serialization specification requires you to invoke them regardless. The presence of these calls makes it possible to add nontransient instance fields in a later release while preserving backward and forward compatibility.

If an instance is serialized in a later version and deserialized in an earlier version, the added fields will be ignored. Had the earlier version’s readObject method failed to invoke defaultReadObject, the deserialization would fail with a StreamCorruptedException.

Note that there is a documentation comment on the writeObject method, even though it is private. This is analogous to the documentation comment on the private fields in the Name class. This private method defines a public API, which is the serialized form, and that public API should be documented.

Like the @serial tag for fields, the @serialData tag for methods tells the Javadoc utility to place this documentation on the serialized forms page.

any object whose invariants are tied to implementation-specific details

While the default serialized form would be bad for StringList, there are classes for which it would be far worse.

For StringList, the default serialized form is inflexible and performs badly, but it is correct in the sense that serializing and deserializing a StringList instance yields a faithful copy of the original object with all of its invariants intact.

This is not the case for any object whose invariants are tied to implementation-specific details.

the case of a hash table

For example, consider the case of a hash table. The physical representation is a sequence of hash buckets containing key-value entries. The bucket that an entry resides in is a function of the hash code of its key, which is not, in general, guaranteed to be the same from implementation to implementation. In fact, it isn’t even guaranteed to be the same from run to run.

Therefore, accepting the default serialized form for a hash table would constitute a serious bug. Serializing and deserializing the hash table could yield an object whose invariants were seriously corrupt. (P.S. The Chinese edition translates "invariants" as 约束关系, roughly "constraint relationships".)
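
java.util.HashMap sidesteps this problem with a custom serialized form: its bucket array is transient, and entries are re-inserted when the map is read back. The sketch below checks that a lookup still works after a round trip even though the key type inherits the identity-based Object.hashCode(). Token, HashTableDemo, and the helper names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

final class HashTableDemo {
    // Key type that does not override hashCode(), so its hash code is identity
    // based and should not be expected to survive serialization.
    static final class Token implements Serializable {
        private static final long serialVersionUID = 1L;
        final String label;
        Token(String label) { this.label = label; }
    }

    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializes a map together with one of its keys, then looks the key up in the copy.
    static boolean lookupSurvivesRoundTrip() {
        Token key = new Token("k");
        Map<Token, String> map = new HashMap<>();
        map.put(key, "value");
        // Writing both in one stream keeps the key inside the map and the
        // stand-alone key as the same object after deserialization.
        Object[] copy = (Object[]) roundTrip(new Object[] {map, key});
        Map<?, ?> mapCopy = (Map<?, ?>) copy[0];
        return "value".equals(mapCopy.get(copy[1]));
    }

    public static void main(String[] args) {
        System.out.println("lookup works after round trip: " + lookupSurvivesRoundTrip());
    }
}
```

Had the bucket array been written verbatim, the restored key would sit in a bucket chosen by its old hash code, and the lookup could miss.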

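Java's mechanisms for customizing serialization

The value of some fields may be tied to a memory location or to object identity, such as the return value of the default hashCode() method. Once the object has been restored, it is a different object somewhere else in memory, so a value based on the original location is meaningless. Other fields may be tied to the current time, such as a field recording when the object was created; saving and restoring such a field is incorrect.

The default serialized form is also unsuitable when the fields of a class represent implementation details rather than logical information. Why? Because the serialized format is a kind of contract: it should describe the logical structure of the class rather than being bound to implementation details. Binding it to implementation details makes the class hard to change and breaks encapsulation.

LinkedList, one of the collection classes, is an example for which the default serialized form is unsuitable. Why? A LinkedList represents a List, so its logical information is the length of the list and each object in it, but the fields of the LinkedList class describe the implementation details of a linked list: pointers to the head and tail nodes and, for each node, pointers to its predecessor and successor.

Java provides several mechanisms for customizing serialization. The two main ones are:

the transient keyword
implementing the writeObject and readObject methods

When a field is declared transient, the default serialization mechanism ignores it and neither saves nor restores it.

For example, the instance fields of the LinkedList class are all declared transient: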

public class LinkedList<E>
    extends AbstractSequentialList<E>
    implements List<E>, Deque<E>, Cloneable, java.io.Serializable
{
    transient int size = 0;

    /**
     * Pointer to first node.
     */
    transient Node<E> first;

    /**
     * Pointer to last node.
     */
    transient Node<E> last;

    @java.io.Serial
    private static final long serialVersionUID = 876323262645176354L;
    
}

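Declaring a field transient does not mean the field can never be saved. It tells Java's default serialization mechanism not to save the field automatically; the class can still save it itself by implementing writeObject and readObject.

LinkedList serializes the logical data of the list with the following code: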

/**
 * Saves the state of this {@code LinkedList} instance to a stream
 * (that is, serializes it).
 *
 * @serialData The size of the list (the number of elements it
 *             contains) is emitted (int), followed by all of its
 *             elements (each an Object) in the proper order.
 */
@java.io.Serial
private void writeObject(java.io.ObjectOutputStream s)
    throws java.io.IOException {
    // Write out any hidden serialization magic
    s.defaultWriteObject();

    // Write out size
    s.writeInt(size);

    // Write out all elements in the proper order.
    for (Node<E> x = first; x != null; x = x.next)
        s.writeObject(x.item);
}

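Note that the line s.defaultWriteObject(); is required. It invokes the default serialization mechanism, which saves every field not declared transient. Even if all of a class's fields are transient, this line should still be written, because Java serialization saves not only the plain data but also hidden information such as metadata descriptions, and this hidden information is an important part of what makes serialization work so seamlessly.

The deserialization code of LinkedList is: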

/**
 * Reconstitutes this {@code LinkedList} instance from a stream
 * (that is, deserializes it).
 */
@SuppressWarnings("unchecked")
@java.io.Serial
private void readObject(java.io.ObjectInputStream s)
    throws java.io.IOException, ClassNotFoundException {
    // Read in any hidden serialization magic
    s.defaultReadObject();

    // Read in size
    int size = s.readInt();

    // Read in all elements in the proper order.
    for (int i = 0; i < size; i++)
        linkLast((E)s.readObject());
}

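Note that s.defaultReadObject(); is required as well.

Summary

1. If the fields of a class are exactly its logical information, as in the Name class above, the default serialization mechanism can be used simply by declaring that the class implements Serializable.

2. Otherwise, as with LinkedList, use the transient keyword and implement writeObject and readObject to customize the serialization process.

3. Java's serialization mechanism automatically handles cases such as several references to the same object and circular references.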
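
Point 3 can be checked with a small round trip. In this sketch two nodes refer to each other, and one node appears twice in an array. Node, GraphDemo, and roundTrip are hypothetical names used only for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class GraphDemo {
    static final class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        Node peer; // may point back at another node, forming a cycle
        Node(String name) { this.name = name; }
    }

    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node a = new Node("a");
        Node b = new Node("b");
        a.peer = b;
        b.peer = a; // a -> b -> a is a cycle

        Node copy = (Node) roundTrip(a);
        System.out.println("cycle preserved: " + (copy.peer.peer == copy));

        Node[] twice = (Node[]) roundTrip(new Node[] {a, a});
        System.out.println("shared reference preserved: " + (twice[0] == twice[1]));
    }
}
```

Each object is written once per stream and later occurrences are written as back-references, which is why both the cycle and the sharing are restored rather than duplicated.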

P.S. Parts of the above are based on the book 《Java编程的逻辑》.

transient VS nontransient

Therefore, every instance field that can be declared transient should be.

This includes derived fields, whose values can be computed from primary data fields, such as a cached hash value.

It also includes fields whose values are tied to one particular run of the JVM, such as a long field representing a pointer to a native data structure.

Before deciding to make a field nontransient, convince yourself that its value is part of the logical state of the object.

If you use a custom serialized form, most or all of the instance fields should be labeled transient, as in the StringList example above.
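
A cached hash value is a typical derived field. The sketch below marks the cache transient and recomputes it lazily after deserialization. PhoneNumber here is a simplified, hypothetical class written for this example, as are CachedHashDemo and the hashIsCached helper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class CachedHashDemo {
    static final class PhoneNumber implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int areaCode;
        private final int lineNumber;
        // Derived from the two fields above, so it is not part of the logical state.
        private transient int cachedHash; // 0 means "not computed yet"

        PhoneNumber(int areaCode, int lineNumber) {
            this.areaCode = areaCode;
            this.lineNumber = lineNumber;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof PhoneNumber)) return false;
            PhoneNumber other = (PhoneNumber) o;
            return areaCode == other.areaCode && lineNumber == other.lineNumber;
        }

        @Override public int hashCode() {
            int h = cachedHash;
            if (h == 0) {
                h = 31 * Integer.hashCode(areaCode) + Integer.hashCode(lineNumber);
                cachedHash = h;
            }
            return h;
        }

        boolean hashIsCached() { return cachedHash != 0; }
    }

    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PhoneNumber original = new PhoneNumber(707, 8675309);
        original.hashCode(); // fills the cache
        PhoneNumber copy = (PhoneNumber) roundTrip(original);
        System.out.println("cache restored from stream: " + copy.hashIsCached());
        System.out.println("hash codes agree: " + (copy.hashCode() == original.hashCode()));
    }
}
```

Because the transient field comes back as 0, the lazy-initialization path doubles as the recovery path and no readObject method is needed.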

impose synchronization on object serialization

Whether or not you use the default serialized form, you must impose any synchronization on object serialization that you would impose on any other method that reads the entire state of the object.

So, for example, if you have a thread-safe object that achieves its thread safety by synchronizing every method and you elect to use the default serialized form, use the following writeObject method:

// writeObject for synchronized class with default serialized form
private synchronized void writeObject(ObjectOutputStream s) throws IOException {
    s.defaultWriteObject();
}

If you put synchronization in the writeObject method, you must ensure that it adheres to the same lock-ordering constraints as other activities, or you risk a resource-ordering deadlock.

declare an explicit serial version UID

Regardless of what serialized form you choose, declare an explicit serial version UID in every serializable class you write. This eliminates the serial version UID as a potential source of incompatibility. There is also a small performance benefit. If no serial version UID is provided, an expensive computation is performed to generate one at runtime.

Declaring a serial version UID is simple. Just add this line to your class:

private static final long serialVersionUID = randomLongValue;

If you write a new class, it doesn’t matter what value you choose for randomLongValue. You can generate the value by running the serialver utility on the class, but it’s also fine to pick a number out of thin air. It is not required that serial version UIDs be unique.

If you modify an existing class that lacks a serial version UID, and you want the new version to accept existing serialized instances, you must use the value that was automatically generated for the old version. You can get this number by running the serialver utility on the old version of the class—the one for which serialized instances exist.
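
Besides the serialver tool, the value can be read programmatically through java.io.ObjectStreamClass, which may be handy in a test. In the sketch below, Declared, Undeclared, and uidOf are hypothetical names; for a class without a declared UID, the reported number is the one computed from the class structure and should match what serialver prints for the same compiled class.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

final class SerialVersionUidDemo {
    static final class Declared implements Serializable {
        private static final long serialVersionUID = 42L;
        int x;
    }

    static final class Undeclared implements Serializable {
        int x; // no serialVersionUID: the runtime computes one from the class structure
    }

    // Returns the serial version UID the serialization runtime uses for the class.
    static long uidOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println("declared:   " + uidOf(Declared.class));
        System.out.println("undeclared: " + uidOf(Undeclared.class));
    }
}
```

To keep accepting old serialized instances of a class like Undeclared, the number printed for the old compiled version is the one to paste into the new source as its explicit serialVersionUID.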