
You Are What You Is: Defining Object Identity

July 27, 2006







Object identity can be considered a rather academic topic, with "academic" taken in its negative sense. In this article, I try to show why having a strong understanding of what makes up your objects' identity can help you avoid a number of problems in your design and some tricky-to-find bugs.

What is identity?

Wikipedia's entry on the philosophical meaning of identity starts:

In philosophy, identity is whatever makes an entity definable and recognizable, in terms of possessing a set of qualities or characteristics that distinguish it from entities of a different type. Or, in layman's terms, identity is whatever makes stuff the same or different.

What makes up the identity of an object in an object-oriented system? A common answer is that "object identity" is defined by being in the same spot in memory, which is called "reference equality" and matches Java's == operator. One consequence is that if two references are identical, a change made through one is always visible through the other. This distinguishes identity from the notion of "being equal," which can change over time. In fact, multiple notions of being equal are possible, such as String.equals() and String.equalsIgnoreCase().

Using reference equality as the only notion of object identity seems a good solution at first, but unfortunately this point of view is a little naive. A number of scenarios exist where this approach does not fit the intended semantics. I'll discuss some of them.

Keeping identity when your program stops

What happens with objects that are needed after you shut down a program and restart it? Objects usually get serialized and then get deserialized when they are needed again. If you define their identity through reference equality you have to consider them different objects every time you load them. Of course that makes sense from a technical perspective and it matches scenarios such as loading an object twice, but it is hardly practical when explaining what happens in terms of business logic.
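The effect can be reproduced in a few lines. The following is a minimal sketch, assuming an in-memory byte array as a stand-in for the file or database that would bridge a real restart; the class name SerializationIdentity and the helper roundTrip() are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

class SerializationIdentity {
  // Hypothetical helper: serializes an object into a byte array and reads
  // it back, standing in for "shut down the program and restart it".
  static Object roundTrip(Object original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(original);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    List<String> original = new ArrayList<>();
    original.add("Hello");

    Object restored = roundTrip(original);

    System.out.println(original == restored);      // false: a new object was created
    System.out.println(original.equals(restored)); // true: same content
  }
}
```

Under reference equality, the restored list is a stranger to the original, even though nobody on the business side would call it a different list.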

Having objects represent external data

Objects can come from external sources, such as a relational database. The structures on which these objects are based often define their own identities such as using primary keys to identify rows in a table. If a program maps the same data onto two objects in memory, you get two distinct objects that refer to identical parts of the database and are equal in that sense. Technically this is a correct view, but is it useful when writing an application? Shouldn't both objects be considered the same?

Having identifiers in the business world

Quite similar to the last example, what happens if an application does not depend on external data, but the business side already has a strong notion of how to identify entities? Order numbers, passport IDs, and social security numbers are examples of this. If object identity is based solely on reference equality ("reference" in the object-oriented sense), then an application can have two objects with the same identifier from the business world. Unless extra measures are taken, these can have different values, which is most likely not acceptable.

I propose that you make sure the way a program understands the identity of objects matches the notion of identity that comes out of the business context. In fact, even core parts of the JDK do not use reference equality to define object identity, as I will show.

Object.equals() implements identity

At first, Java seems to follow the common approach of defining object identity through references. Running a program like this:

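public class StringIdentity {
  public static void main(String[] args) {
    // new String(...) forces two distinct objects with the same content
    String a = new String("Hello");
    String b = new String("Hello");

    System.out.println(a == b);       // reference equality
    System.out.println(a.equals(b));  // value equality
  }
}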

prints this:

false
true

The two String objects are not the same reference, but they are considered equal because they contain the same value. So value equality can be checked, but it is reference equality that appears to define the objects' identity. Is this always the case?

One of the consequences of having distinct identities is that objects can be collected in a set. A set is a data structure that holds at most one instance of each item; for example, you cannot have a set that contains an identical number twice.

So how do sets relate to strings? I've established that a and b are not considered identical in Java, so you should be able to put them both into a set:

import java.util.HashSet;
import java.util.Set;

public class StringIdentity {
  public static void main(String[] args) {
    String a = new String("Hello");
    String b = new String("Hello");

    // Two distinct String objects go in; the set decides how many stay
    Set<String> set = new HashSet<>();
    set.add(a);
    set.add(b);
    System.out.println(set.size());
  }
}

The result of running the code above is 1, not the 2 you would expect if the objects' identities were truly based on reference equality. If you swap the implementation from HashSet to TreeSet, the result will stay the same.
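The TreeSet variant can be sketched as follows. TreeSet detects duplicates through the elements' ordering (compareTo() for String) rather than through hashCode() and equals(), but since String's ordering is consistent with equals(), the outcome is the same. The class name TreeSetIdentity and the helper distinctCount() are invented for this example.

```java
import java.util.Set;
import java.util.TreeSet;

class TreeSetIdentity {
  // Hypothetical helper: adds both strings to a TreeSet and reports
  // how many elements the set kept.
  static int distinctCount(String a, String b) {
    Set<String> set = new TreeSet<>();
    set.add(a);
    set.add(b);
    return set.size();
  }

  public static void main(String[] args) {
    String a = new String("Hello");
    String b = new String("Hello");

    System.out.println(a == b);              // false: two distinct objects
    System.out.println(distinctCount(a, b)); // 1: the set treats them as one element
  }
}
```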

Are the implementations of HashSet and TreeSet broken? Not at all. The JavaDoc of the Set interface actually starts like this:

A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.

HashSet and TreeSet behave correctly; the two objects a and b are equal according to their equals() method, so only one of them goes into the set.

The JavaDoc also states that the Set interface models the mathematical set abstraction. A set is a data structure that cannot contain the same element twice, but it can contain two elements that are considered equal otherwise; an example of this is a set of toys with two balls of the same shade and size. Because the JDK defines the Set interface in terms of equals(), the identity of the elements is defined by the equality relation that equals() implements; for String and other classes that override equals(), that means value identity.

As far as Set and similar collections are concerned, object identity is not always reference identity; it is determined by equals(), which in Object defaults to standard object-oriented reference equality. A class that overrides equals() can instead implement value equality or another equality relation, for example equality based on an external reference such as a database primary key or an identifier from the business world.
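As a rough sketch of identity based on an external key, consider a hypothetical Customer class whose equals() and hashCode() look only at an immutable id. The class, its fields, and the id value are assumptions made for this illustration, not part of any real schema.

```java
import java.util.HashSet;
import java.util.Set;

class KeyIdentityDemo {
  public static void main(String[] args) {
    // Two in-memory objects mapped from the same (hypothetical) database row
    Customer first = new Customer(42L, "Smith");
    Customer second = new Customer(42L, "Smith-Jones");

    Set<Customer> set = new HashSet<>();
    set.add(first);
    set.add(second);

    System.out.println(first == second);      // false: distinct references
    System.out.println(first.equals(second)); // true: same primary key
    System.out.println(set.size());           // 1: the set sees one customer
  }
}

// Hypothetical entity: identity is the primary key, everything else is state.
final class Customer {
  private final long id;  // never changes, so the hash code never changes
  private String name;    // mutable state, deliberately ignored by equals()

  Customer(long id, String name) {
    this.id = id;
    this.name = name;
  }

  String getName() {
    return name;
  }

  void setName(String name) {
    this.name = name;
  }

  @Override
  public boolean equals(Object obj) {
    if (obj == null || obj.getClass() != getClass()) {
      return false;
    }
    return ((Customer) obj).id == id;
  }

  @Override
  public int hashCode() {
    return Long.hashCode(id);
  }
}
```

Note that the two objects still hold different names; equals() only states that they stand for the same customer, not that their state agrees.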

Why think about identity?

How does identity relate to developing software? Having a solid understanding of what makes two objects identical can be very important to getting an implementation to match the requirements; if two distinct objects in a system represent something the customer considers to be one, problems can arise. A simple example: a user changes one of the objects, but the program later retrieves the other one. The changes would seem to be lost, or, even worse, they may disappear and reappear depending on the way in which the objects were retrieved.
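A minimal sketch of that lost-change scenario follows. Everything in it is hypothetical: the TABLE map stands in for a database table, and load() imitates a naive mapper that creates a fresh object on every call.

```java
import java.util.HashMap;
import java.util.Map;

class LostUpdateDemo {
  // Hypothetical stand-in for a database table: primary key -> stored name
  static final Map<Integer, String> TABLE = new HashMap<>();

  // Hypothetical record type with reference identity (no equals() override)
  static final class CustomerRecord {
    final int id;
    String name;

    CustomerRecord(int id, String name) {
      this.id = id;
      this.name = name;
    }
  }

  // Maps the row onto a NEW object on every call, as a naive loader would
  static CustomerRecord load(int id) {
    return new CustomerRecord(id, TABLE.get(id));
  }

  public static void main(String[] args) {
    TABLE.put(1, "Smith");

    CustomerRecord edited = load(1);
    CustomerRecord displayed = load(1);

    edited.name = "Jones";  // the user changes one copy

    System.out.println(edited == displayed); // false: two objects for one customer
    System.out.println(displayed.name);      // Smith: the change appears to be lost
  }
}
```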

Even without looking at any requirements or customer expectations, you can construct a scenario where not thinking about identity properly can break your code. I'll implement a simple class for points in a plane. Two instances of this class should be considered the same if they describe the same point in the plane; in short, I want to use value identity. Here is an implementation:

public class Point {
  // Mutable coordinates; both take part in equals() and hashCode()
  private int x;
  private int y;

  public Point(int x, int y) {
    this.x = x;
    this.y = y;
  }

  public int getX() {
    return x;
  }

  public void setX(int x) {
    this.x = x;
  }

  public int getY() {
    return y;
  }

  public void setY(int y) {
    this.y = y;
  }

  // Value equality: same class and same coordinates
  @Override
  public boolean equals(Object obj) {
    if (obj == null) {
      return false;
    }
    if (obj.getClass() != this.getClass()) {
      return false;
    }
    Point other = (Point) obj;
    return (other.x == this.x) && (other.y == this.y);
  }

  // Computed from the same fields as equals(), so equal points
  // always have equal hash codes
  @Override
  public int hashCode() {
    int hash = 7;
    hash = 31 * hash + this.x;
    hash = 31 * hash + this.y;
    return hash;
  }
}

This class implements value identity and at the same time is mutable. This means the object can change identity; whenever a setter changes one of the members, effectively it becomes a different object. If you add the following main method to the Point class (together with imports for java.util.HashSet and java.util.Set) and run the code:

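  // Requires: import java.util.HashSet; import java.util.Set;
  public static void main(String[] args) {
    Point a = new Point(5,5);
    Set<Point> set = new HashSet<>();

    set.add(a);                           // stored under the hash code of (5,5)
    a.setX(8);                            // hash code changes while a is in the set
    System.out.println(set.contains(a));  // looks up using the hash code of (8,5)

    set.add(a);                           // the same object is added a second time
    System.out.println(set.size());

    set.remove(a);                        // removes only the entry stored under (8,5)
    System.out.println(set.size());
    set.remove(a);                        // the entry stored under (5,5) is not found
    System.out.println(set.size());
  }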

the result will be (with a probability very close to 1):

false
2
1
1

What happened? Did I break HashSet?

I actually did break it. After the second add(), the set contains two items for which not only e1.equals(e2) holds, but which are even the same object in terms of reference identity, which means one of the invariants of the Set interface is broken. In addition, you can run into the following problems:

  • Hard-to-find bugs since asking if the object is in the set produces an unexpected result
  • Memory leaks since remove() fails and the set keeps references to objects that can no longer be removed
  • Performance issues if you use a hash structure as cache for such objects

What happens is that HashSet, like any hash structure, puts the objects in matching buckets. These buckets are determined by the hash code, that is, whatever hashCode() returns. When looking for an object during the execution of contains(), the same approach is used: Calculate hashCode() and look in the matching bucket. However, if the hash code changes in the meantime (as it did in my example), then the wrong bucket is checked (unless coincidentally the hash buckets match, which can happen but is not likely). This results in the behavior you have seen.
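To make the bucket mismatch concrete, the sketch below repeats the formula from Point.hashCode() and applies a deliberately simplified bucket rule: hash code modulo the number of buckets. That rule and the bucket count of 16 are assumptions for illustration only; the actual layout is an internal detail of HashSet.

```java
class HashCodeDrift {
  // Same arithmetic as Point.hashCode() in the listing above
  static int pointHash(int x, int y) {
    int hash = 7;
    hash = 31 * hash + x;
    hash = 31 * hash + y;
    return hash;
  }

  // Simplified bucket model (an assumption, not the real HashSet internals)
  static int bucketFor(int hashCode, int bucketCount) {
    return Math.floorMod(hashCode, bucketCount);
  }

  public static void main(String[] args) {
    int before = pointHash(5, 5);  // hash code when the point was added
    int after = pointHash(8, 5);   // hash code after setX(8)

    System.out.println(before + " -> bucket " + bucketFor(before, 16)); // 6887 -> bucket 7
    System.out.println(after + " -> bucket " + bucketFor(after, 16));   // 6980 -> bucket 4
  }
}
```

The point was filed under one bucket and is later searched for in another, so contains() and remove() never get to compare it with itself.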

There are a number of ways to fix this:

  • Do not use hash structures. This means losing a lot of performance, and structures based on binary search suffer from similar problems.
  • Add callbacks so HashSet can update its layout on changes. This approach is feasible but complex; everything would need to be a proper JavaBean with PropertyChangeListeners or something similar.
  • Replace HashSet with something that uses System.identityHashCode() and == instead of hashCode() and equals(), therefore reverting to reference identity.
  • Get HashSet to rehash all of its entries before each lookup, but its performance will be gone completely.
  • Implement equals(), but not hashCode(). This action generates its own set of troubles, as mentioned in Effective Java, Chapter 3 (see Resources).
  • Do not access any mutable fields in the implementation of equals().
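The option of reverting to reference identity can be sketched with classes from the JDK: java.util.IdentityHashMap compares keys with == and hashes them with System.identityHashCode(), and Collections.newSetFromMap() turns such a map into a set. The XY class and the newIdentitySet() helper are hypothetical names for this example, and the resulting set intentionally deviates from the general Set contract quoted earlier.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class IdentitySetDemo {
  // Hypothetical helper: a set that compares elements by reference only
  static <T> Set<T> newIdentitySet() {
    return Collections.newSetFromMap(new IdentityHashMap<T, Boolean>());
  }

  public static void main(String[] args) {
    Set<XY> set = newIdentitySet();
    XY a = new XY(5, 5);

    set.add(a);
    a.x = 8;                             // value-based hash code changes
    System.out.println(set.contains(a)); // true: lookup ignores hashCode()/equals()

    set.add(new XY(8, 5));               // equal by value, but a different object
    System.out.println(set.size());      // 2

    set.remove(a);
    System.out.println(set.size());      // 1
  }
}

// Hypothetical mutable class with value-based equals() and hashCode()
class XY {
  int x;
  int y;

  XY(int x, int y) {
    this.x = x;
    this.y = y;
  }

  @Override
  public boolean equals(Object obj) {
    if (obj == null || obj.getClass() != getClass()) {
      return false;
    }
    XY other = (XY) obj;
    return other.x == x && other.y == y;
  }

  @Override
  public int hashCode() {
    return 31 * (31 * 7 + x) + y;
  }
}
```

This repairs the one collection, but as a local fix it does nothing about other code that still relies on equals().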

Only the last approach is problem-free. Skipping the hashCode() implementation generally is a bad idea, and the earlier approaches all suffer from the problem that they are local fixes. If the same type of usage pattern appears somewhere else, the same problems will arise again. In any case, the semantics of the scenario described are not clear: According to the Set documentation my objects have value identity, but what I expect in my little main method is reference identity; under value identity, the answer "false" for the call to contains() would be right.

It is feasible to leave equals() alone, keeping reference identity consistent throughout a program, and to implement other equality relations in parallel under different names. While it is nearly impossible to keep anyone from using equals() in unexpected ways (it is declared on Object, after all), it is rather easy to call something else in the business logic of an application.
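One way this could look is sketched below. The class Marker and the method name samePosition() are invented for this illustration; the only point is that equals() and hashCode() stay as inherited from Object.

```java
import java.util.HashSet;
import java.util.Set;

class NamedEqualityDemo {
  public static void main(String[] args) {
    Marker a = new Marker(5, 5);
    Marker b = new Marker(5, 5);

    System.out.println(a.equals(b));       // false: reference identity, inherited from Object
    System.out.println(a.samePosition(b)); // true: explicit, differently named relation

    Set<Marker> set = new HashSet<>();
    set.add(a);
    a.setX(8);                             // mutation does not affect the hash code
    System.out.println(set.contains(a));   // true
  }
}

// Hypothetical mutable class that does NOT override equals() or hashCode()
class Marker {
  private int x;
  private int y;

  Marker(int x, int y) {
    this.x = x;
    this.y = y;
  }

  void setX(int x) {
    this.x = x;
  }

  // Value comparison under its own name, for use in business logic
  boolean samePosition(Marker other) {
    return other != null && other.x == x && other.y == y;
  }
}
```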

Value objects

If value identity is wanted for some objects, it can be quite useful to completely distinguish state objects and value objects. A state object is an object that stores mutable state and is identified through reference equality, very much a standard Java object with getters and setters. A value object is always immutable and is identified through value equality. In Java that means overriding equals() (and hashCode() to match) and using only private, preferably final, members with no setters. These members also have to be value objects themselves.

This way the semantics tend to be clearer and value objects have a number of additional advantages:

  • They can be persisted and restored without having to think about accidental duplicates.
  • They can be sent across a network without any further need for remote calls.
  • They can be cached and shared whenever it seems suitable (as String.intern() does).
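A value object along these lines might be sketched as follows. ImmutablePoint and its withX() method are hypothetical names chosen for this example; a "change" produces a new object instead of modifying an existing one.

```java
import java.util.HashSet;
import java.util.Set;

class ValueObjectDemo {
  public static void main(String[] args) {
    ImmutablePoint a = new ImmutablePoint(5, 5);
    Set<ImmutablePoint> set = new HashSet<>();
    set.add(a);

    ImmutablePoint moved = a.withX(8);  // a itself is untouched

    System.out.println(set.contains(a));                        // true
    System.out.println(set.contains(new ImmutablePoint(5, 5))); // true: value identity
    System.out.println(set.contains(moved));                    // false
    System.out.println(set.size());                             // 1
  }
}

// Hypothetical value object: final class, final fields, no setters
final class ImmutablePoint {
  private final int x;
  private final int y;

  ImmutablePoint(int x, int y) {
    this.x = x;
    this.y = y;
  }

  int getX() {
    return x;
  }

  int getY() {
    return y;
  }

  // Returns a new value instead of mutating this one
  ImmutablePoint withX(int newX) {
    return new ImmutablePoint(newX, y);
  }

  @Override
  public boolean equals(Object obj) {
    if (obj == null || obj.getClass() != getClass()) {
      return false;
    }
    ImmutablePoint other = (ImmutablePoint) obj;
    return other.x == x && other.y == y;
  }

  @Override
  public int hashCode() {
    return 31 * (31 * 7 + x) + y;
  }
}
```

Because the fields can never change, the hash code under which an instance was stored stays valid for as long as the instance lives.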

Conclusion

The notion of "identity" seems trivial at first, but it can be important for the design and, consequently, the correct behavior of an object-oriented application. By implementing equals(), the Java programmer has the option to define a specific type of identity, a very powerful but dangerous thing to do. Whenever you implement equals(), consider the consequences carefully and avoid relying on any mutable members; that includes members whose own value can change as well as members that refer to mutable objects.

Resources

Peter Becker works as a consultant for iteratec GmbH in Munich, Germany.