Object identity can be considered a rather academic topic, with "academic" taken in its negative sense. In this article, I try to show why having a strong understanding of what makes up your objects' identity can help you avoid a number of problems in your design and some tricky-to-find bugs.
What is identity?
Wikipedia's entry on the philosophical meaning of identity starts:
In philosophy, identity is whatever makes an entity definable and recognizable, in terms of possessing a set of qualities or characteristics that distinguish it from entities of a different type. Or, in layman's terms, identity is whatever makes stuff the same or different.
What makes up the identity of an object in an object-oriented system? A common answer is that "object identity" means occupying the same spot in memory; this is called "reference equality" and corresponds to Java's == operator. A consequence is that if two references denote the identical object, a change made through one is always visible through the other. This distinguishes identity from the notion of "being equal," which can change over time. In fact, multiple notions of being equal are possible, such as String.equals() and String.equalsIgnoreCase().
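A minimal sketch of the difference between being identical and being equal. The class name AliasingDemo is made up for illustration, and StringBuilder merely stands in for any mutable object:

```java
// Hypothetical demo class; StringBuilder stands in for any mutable object.
class AliasingDemo {
    // Returns what the second reference sees after a change made through the first.
    static String changeThroughAlias() {
        StringBuilder first = new StringBuilder("Hello");
        StringBuilder second = first;   // no copy: both names refer to one object
        first.append(", world");        // change made through one reference...
        return second.toString();       // ...is visible through the other
    }

    public static void main(String[] args) {
        System.out.println(changeThroughAlias());              // Hello, world
        // Two different notions of "being equal" for the same pair of strings:
        System.out.println("Hello".equals("HELLO"));           // false
        System.out.println("Hello".equalsIgnoreCase("HELLO")); // true
    }
}
```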
Using reference equality as the only notion of object identity seems a good solution at first, but unfortunately this point of view is a little naive. A number of scenarios exist where this approach does not fit the intended semantics. I'll discuss some of them.
Keeping identity when your program stops
What happens with objects that are needed after you shut down a program and restart it? Objects usually get serialized and then get deserialized when they are needed again. If you define their identity through reference equality you have to consider them different objects every time you load them. Of course that makes sense from a technical perspective and it matches scenarios such as loading an object twice, but it is hardly practical when explaining what happens in terms of business logic.
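The effect can be reproduced without actually restarting a program. The following sketch assumes standard Java serialization and uses an in-memory byte array in place of a file; the class and method names are invented for this example:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; an in-memory byte array stands in for a file on disk.
class SerializationIdentityDemo {
    // Serializes the object and immediately reads it back.
    static Object roundTrip(Object original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        List<String> original = new ArrayList<>();
        original.add("order-1");
        Object restored = roundTrip(original);
        System.out.println(restored == original);      // false: a new object in memory
        System.out.println(restored.equals(original)); // true: same contents
    }
}
```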
Having objects represent external data
Objects can come from external sources, such as a relational database. The structures on which these objects are based often define their own identities such as using primary keys to identify rows in a table. If a program maps the same data onto two objects in memory, you get two distinct objects that refer to identical parts of the database and are equal in that sense. Technically this is a correct view, but is it useful when writing an application? Shouldn't both objects be considered the same?
Having identifiers in the business world
Quite similar to the last example, what happens if an application does not depend on external data, but the business side already has a strong notion of how to identify entities? Order numbers, passport IDs, and social security numbers are examples of this. If object identity is based solely on reference equality ("reference" in the object-oriented sense), then an application can have two objects with the same identifier from the business world. Unless extra measures are taken, these can have different values, which is most likely not acceptable.
What I propose is you make sure that the way a program understands the identity of objects is the same as the notion of identity that comes out of the business context. In fact, even core parts of the JDK do not use reference equality to define object identity, as I will show.
Object.equals() implements identity
At first, Java seems to follow the common approach to define object identity through its references. Running a program like this:
public class StringIdentity {
    public static void main(String[] args) {
        String a = new String("Hello");
        String b = new String("Hello");
        System.out.println(a == b);      // compares references
        System.out.println(a.equals(b)); // compares contents
    }
}
returns this:
false
true
The two String objects do not have the same references, but they are considered equal because they contain the same values. Value equality can be checked, but reference equality is used to define the objects' identity. Is this always the case?
One of the consequences of having distinct identities is that objects can be collected in a set. A set is a data structure that holds at most one instance of each item; for example, a set of numbers cannot contain the same number twice.
So how do sets relate to strings? I've established that a and b are not considered identical in Java, so you should be able to put them both into a set:
import java.util.HashSet;
import java.util.Set;

public class StringIdentity {
    public static void main(String[] args) {
        String a = new String("Hello");
        String b = new String("Hello");
        Set<String> set = new HashSet<>();
        set.add(a);
        set.add(b);
        System.out.println(set.size());
    }
}
The result of running the code above is 1, not the 2 you would expect if the objects' identities were truly based on reference equality. If you swap the implementation from HashSet to TreeSet, the result will stay the same.
Are the implementations of HashSet and TreeSet broken? Not at all. The JavaDoc of the Set interface actually starts like this:
A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
HashSet and TreeSet behave correctly; the two objects a and b are equal according to their equals() method, so only one of them goes into the set.
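For completeness, here is a sketch of the TreeSet variant mentioned above (the class name is made up). Note that TreeSet decides on duplicates through compareTo() rather than equals(); for String the two agree, which is why the result does not change:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class: the same experiment with TreeSet instead of HashSet.
class TreeSetStringIdentity {
    static int sizeAfterAddingTwoEqualStrings() {
        String a = new String("Hello");
        String b = new String("Hello");
        Set<String> set = new TreeSet<>();
        set.add(a);
        set.add(b); // compareTo() returns 0, so b is treated as a duplicate
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeAfterAddingTwoEqualStrings()); // 1
    }
}
```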
The JavaDoc also states that the Set interface models the mathematical set abstraction. A set is a data structure that cannot contain the same element twice, but it can contain two elements that are considered equal otherwise; an example of this is a set of toys with two balls the same shade and size. By using equals() in the definition of the Set interface as it is done in the JDK, the objects' identity is defined by the equality relation implemented by equals(), which means value identity in the case of String and other classes overriding equals().
As far as Set and similar collections are concerned, object identity is not always reference identity but is determined by Object.equals(), which defaults to standard object-oriented reference equality. It can also implement value equality or other equality relations, for example, reference equality based on external references such as a primary key of a database or an identifier from the business world.
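As an illustration of an equality relation based on an external reference, the following sketch identifies an object by a business key. The class CustomerOrder, its fields, and the order number format are assumptions made for this example only:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical entity: identified by a business key that never changes.
class CustomerOrder {
    private final String orderNumber; // identity, immutable
    private String status;            // mutable state, not part of identity

    CustomerOrder(String orderNumber, String status) {
        this.orderNumber = Objects.requireNonNull(orderNumber);
        this.status = status;
    }

    void setStatus(String status) { this.status = status; }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj == null || obj.getClass() != getClass()) return false;
        return orderNumber.equals(((CustomerOrder) obj).orderNumber);
    }

    @Override
    public int hashCode() { return orderNumber.hashCode(); }

    public static void main(String[] args) {
        Set<CustomerOrder> orders = new HashSet<>();
        CustomerOrder first = new CustomerOrder("A-1001", "NEW");
        orders.add(first);
        orders.add(new CustomerOrder("A-1001", "SHIPPED")); // same key: treated as duplicate
        first.setStatus("PAID");                    // does not touch equals() or hashCode()
        System.out.println(orders.size());          // 1
        System.out.println(orders.contains(first)); // true
    }
}
```

Because equals() and hashCode() read only the final orderNumber field, changing the status never moves the object to a different position in a hash structure.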
Why think about identity?
How does identity relate to developing software? Having a solid understanding of what makes two objects identical can be very important to getting an implementation to match the requirements; if two distinct objects in a system represent something the customer considers to be one, problems can arise. A simple example would be one of the objects in the program getting changed by a user but retrieving the other one later. The changes would seem to be lost, or, even worse, they may disappear and appear depending on the way in which the objects were retrieved.
Even without looking at any requirements or customer's expectations you can construct a scenario where not thinking about identity properly can break your code. I'll implement a simple little class for points in a plane. Two instances of this class should be considered the same if they describe the same point in the plane; in short, I want to use value identity. Here is an implementation:
public class Point {
    private int x;
    private int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int getX() {
        return x;
    }

    public void setX(int x) {
        this.x = x;
    }

    public int getY() {
        return y;
    }

    public void setY(int y) {
        this.y = y;
    }

    // Value equality: two points are equal if their coordinates match.
    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (obj.getClass() != this.getClass()) {
            return false;
        }
        Point other = (Point) obj;
        return (other.x == this.x) && (other.y == this.y);
    }

    // Derived from the same fields as equals(), as the contract requires.
    @Override
    public int hashCode() {
        int hash = 7;
        hash = 31 * hash + this.x;
        hash = 31 * hash + this.y;
        return hash;
    }
}
This class implements value identity and at the same time is mutable. This means the object can change identity; whenever a setter changes one of the members, effectively it becomes a different object. If you add the following main method to the Point class and run the code:
// Requires: import java.util.HashSet; import java.util.Set;
public static void main(String[] args) {
    Point a = new Point(5, 5);
    Set<Point> set = new HashSet<>();
    set.add(a);
    a.setX(8); // changes hashCode() while a is in the set
    System.out.println(set.contains(a));
    set.add(a);
    System.out.println(set.size());
    set.remove(a);
    System.out.println(set.size());
    set.remove(a);
    System.out.println(set.size());
}
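the result will be the following (it could differ only if the hash codes before and after the change happened to collide, which they do not for these values):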
false
2
1
1
What happened? Did I break HashSet?
I actually did break it. After the second call to add(), the set holds two entries that are not merely equal according to equals() but are the very same object in terms of reference identity, which means one of the invariants of the Set interface is broken. In addition, you can run into the following problems:
Hard-to-find bugs since asking if the object is in the set produces an unexpected result
Memory leaks since remove() fails and the set keeps holding references to objects you believe you removed
Performance issues if you use a hash structure as cache for such objects
What happens is HashSet puts the objects in matching buckets, as does any hash structure. These buckets are determined by the hash code, that is, whatever hashCode() returns. When looking for an object during the execution of contains() the same approach is used: Calculate hashCode() and look in the matching bucket. However, if the hash code changes in the meantime (as it did in my example), then the wrong bucket is checked (unless coincidentally the hash buckets match, which can happen but is not likely). This results in the behavior you have seen.
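The bucket mismatch can be retraced by hand. The sketch below repeats the hashCode() formula of the Point class and uses a simplified bucket selection. The table size of 16 and the masking step are assumptions for illustration; real HashMap implementations also mix the hash bits before masking:

```java
// Hypothetical helper class that retraces the hash codes from the example.
class BucketDemo {
    // Same formula as Point.hashCode() in the article.
    static int pointHash(int x, int y) {
        int hash = 7;
        hash = 31 * hash + x;
        hash = 31 * hash + y;
        return hash;
    }

    // Simplified bucket selection for a table whose size is a power of two.
    static int bucket(int hash, int tableSize) {
        return hash & (tableSize - 1);
    }

    public static void main(String[] args) {
        int before = pointHash(5, 5); // hash code when the point was added
        int after = pointHash(8, 5);  // hash code after setX(8)
        System.out.println(before + " -> bucket " + bucket(before, 16)); // 6887 -> bucket 7
        System.out.println(after + " -> bucket " + bucket(after, 16));   // 6980 -> bucket 4
    }
}
```

The object was stored under its old hash code but is looked up under the new one, so contains() and remove() search in the wrong place.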
There are a number of ways to fix this:
Do not use hash structures. This means losing a lot of performance, and structures based on binary search suffer from similar problems.
Add callbacks so HashSet can update its layout on changes. This approach is feasible but complex; everything would need to be a proper JavaBean with PropertyChangeListeners or something similar.
Replace HashSet with something that uses System.identityHashCode() and == instead of hashCode() and equals(), such as a set backed by java.util.IdentityHashMap, therefore reverting to reference identity.
Get HashSet to rehash all of its elements before every lookup, but its performance will be gone completely.
Implement equals(), but not hashCode(). This action generates its own set of troubles, as mentioned in Effective Java, Chapter 3 (see Resources).
Do not access any mutable fields in the implementation of equals().
Only the last approach is problem-free. Skipping the hashCode() implementation generally is a bad idea, and the earlier approaches all suffer from the problem that they are local fixes. If the same type of usage pattern appears somewhere else, the same problems will arise again. In any case, the semantics of the scenario described are not clear: According to the Set documentation my objects have value identity, but what I expect in my little main method is reference identity, otherwise the answer "false" for the call to contains() would be right.
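For comparison, here is a sketch of the option that reverts to reference identity, using a set view of java.util.IdentityHashMap. The nested MutablePoint class is a shortened stand-in for Point, written for this example. Such a set deliberately ignores equals(), so it does not follow the general Set contract:

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical demo of a set based on reference identity.
class IdentitySetDemo {
    // Shortened stand-in for the article's mutable Point class.
    static final class MutablePoint {
        int x;
        int y;
        MutablePoint(int x, int y) { this.x = x; this.y = y; }
        @Override public boolean equals(Object o) {
            return o instanceof MutablePoint
                && ((MutablePoint) o).x == x && ((MutablePoint) o).y == y;
        }
        @Override public int hashCode() { return 31 * (31 * 7 + x) + y; }
    }

    static boolean containsAfterMutation() {
        Set<MutablePoint> set = Collections.newSetFromMap(new IdentityHashMap<>());
        MutablePoint a = new MutablePoint(5, 5);
        set.add(a);
        a.x = 8;                // the overridden hashCode() changes, but it is not used
        return set.contains(a);
    }

    static int sizeWithTwoEqualPoints() {
        Set<MutablePoint> set = Collections.newSetFromMap(new IdentityHashMap<>());
        set.add(new MutablePoint(5, 5));
        set.add(new MutablePoint(5, 5)); // equal by value, but a different object
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(containsAfterMutation());  // true
        System.out.println(sizeWithTwoEqualPoints()); // 2
    }
}
```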
It is feasible to leave equals() alone, keeping reference identity consistent throughout a program, and to implement other equality relations in parallel under different names. While it is nearly impossible to keep anyone from using equals() in unexpected ways (it is declared on Object, after all), it is rather easy to call a differently named method in the business logic of an application.
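A sketch of this idea, assuming a hypothetical Location class and a made-up method name sameCoordinatesAs(); equals() and hashCode() are intentionally left untouched:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical class: equals() and hashCode() are deliberately NOT overridden,
// so collections keep using reference identity.
class Location {
    private int x;
    private int y;

    Location(int x, int y) { this.x = x; this.y = y; }

    void setX(int x) { this.x = x; }

    // Separately named equality relation for the business logic (made-up name).
    boolean sameCoordinatesAs(Location other) {
        return other != null && other.x == x && other.y == y;
    }

    public static void main(String[] args) {
        Location a = new Location(5, 5);
        Location b = new Location(5, 5);
        Set<Location> set = new HashSet<>();
        set.add(a);
        a.setX(8);
        System.out.println(set.contains(a));        // true: identity hash code is unaffected
        System.out.println(a.sameCoordinatesAs(b)); // false: a has moved to (8, 5)
        b.setX(8);
        System.out.println(a.sameCoordinatesAs(b)); // true
        System.out.println(a.equals(b));            // false: still two distinct objects
    }
}
```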
Value objects
If value identity is wanted for some objects, it can be quite useful to completely distinguish state objects and value objects. A state object is an object that stores mutable state and is identified through reference equality, very much a standard Java object with getters and setters. A value object is always immutable and is identified through value equality. In Java that means overriding equals() and hashCode() and using only private members with no setters. These members also have to be value objects themselves.
This way the semantics tend to be clearer and value objects have a number of additional advantages:
They can be persisted and restored without having to think about accidental duplicates.
They can be sent across a network without any further need for remote calls.
They can be cached and shared whenever it seems suitable (such as String.intern() does).
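A minimal sketch of a value object, written as an immutable variant of the Point class. The name ImmutablePoint and the withX() method are assumptions for illustration, not part of the original example:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical immutable variant of the article's Point class.
final class ImmutablePoint {
    private final int x;
    private final int y;

    ImmutablePoint(int x, int y) { this.x = x; this.y = y; }

    int getX() { return x; }
    int getY() { return y; }

    // Instead of a setter: returns a new value, leaving this one untouched.
    ImmutablePoint withX(int newX) { return new ImmutablePoint(newX, y); }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof ImmutablePoint)) return false;
        ImmutablePoint other = (ImmutablePoint) obj;
        return other.x == x && other.y == y;
    }

    @Override
    public int hashCode() { return 31 * (31 * 7 + x) + y; }

    public static void main(String[] args) {
        ImmutablePoint a = new ImmutablePoint(5, 5);
        Set<ImmutablePoint> set = new HashSet<>();
        set.add(a);
        ImmutablePoint moved = a.withX(8);       // a itself never changes
        System.out.println(set.contains(a));     // true
        System.out.println(set.contains(moved)); // false
        System.out.println(set.size());          // 1
    }
}
```

Since no field can change after construction, the hash code of an element never drifts away from the bucket it was stored in.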
Conclusion
The notion of "identity" seems trivial at first but it can be important for the design and, consequently, the correct behavior of an object-oriented application. Through implementing equals(), the Java programmer has the option to define a specific type of identity, a very powerful but dangerous thing to do. Whenever implementing equals() the consequences should be well considered and the implementation should avoid any mutable members, including members that change value themselves and those that refer only to mutable objects.
Does this article make you rethink your approach to object equality?
is it necessary to have hashcode() and equals() method overridable??
2008-01-23 16:23:18 eligetiv
I still don't understand the main point of allowing hashCode() and equals() to be overridden.
Would it not be beneficial to make hashCode() and equals() final and introduce another modifier (maybe "equals") so that the fields to be compared in equals() can be marked with it? I haven't seen many cases where methods are used in equals() for comparison. This way at least the programmer need not worry about overriding hashCode(). And hashCode() could be modified to check all fields in the object's subtree that carry this special modifier.
Some ancient history
2006-08-01 09:05:08 bazzargh
Henry Baker's classic paper Equal Rights for Functional Objects or, The More Things Change, The More They Are the Same (from ACM OOPS, 1993) is probably in the mind of any Lispers reading this article. It's roughly the same ideas, but couched in Lisp-isms; however Baker is making the more powerful argument - that a generic object identity routine can be written (i.e. you never need to override equals), with predictable properties, if you can distinguish whether objects are mutable or immutable.
Quoting from the paper: "Briefly, if an object is mutable, then we compare it using an "address-like" comparison, while if an object is immutable, then we recursively compare its components. This recursion is only used to define the semantics of [identity]; a given implementation may not need to recurse.". In Java terms, the immutability required here is slightly stronger than "value objects", it's introspectable immutability, i.e. all non-transient fields of immutable objects are final.
It's an interesting read, anyway.
Some ancient history
2006-08-02 08:32:19 peterbecker
Thanks for the link, I'm looking forward to the Lisp perspective on the topic. The description sounds quite like the distinction of "value objects" and "state objects" done in Lava. Lava's value objects can contain nothing but value objects themselves. The Lava page is linked in the resources, but didn't fit into the article anymore.
The difference between identity and equality is discussed in depth in almost every serious Java fundamentals book. So what?
Even the side effects of the implementations of equals(), hashCode() and compareTo() when using those objects in collections and maps are discussed in the Javadocs.
If you run into this, your program is broken
2006-07-27 09:32:05 mernst
"equals-but-no-hash" and "rehash" are gross hacks and should not be taken seriously. For me, the underlying pattern to follow is: "Define (maybe context-dependent) the point in an object's lifetime, after which it is illegal to alter those fields that make up the object's "value-identity" (as implemented by #equals/#hashcode). At that point your object has "settled" on its identity. Only after this point, it's ok to put this object into a hash structure."
A software that violates this rule is not sound. It's not just about hashing. The semantics are broken. The above definition gives you a number of interpretations about the when's, how's and who's of enforcing it.
If you run into this, your program is broken
2006-07-27 13:23:40 peterbecker
I agree that this is a valid approach to avoid the issue, and I also couldn't agree more that it starts as a semantic issue.
The approach you describe can actually be enforced easily. You add a flag that this particular point in the object's lifecycle is reached and then throw an IllegalStateException on each call to relevant setters. The extra costs at runtime are reasonably low -- a new boolean member and some checks in methods that should not be called often; after all that is what the checks enforce.
I did not intend to say that the problem can not be fixed, it just requires attention. More could be said on this topic, but the space in an article is limited.
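A minimal sketch of the flag approach described in this comment. The class SettlingPoint and the method settle() are hypothetical names chosen for this example:

```java
// Hypothetical class illustrating the "settled" flag described above.
class SettlingPoint {
    private int x;
    private int y;
    private boolean settled; // false until settle() is called

    SettlingPoint(int x, int y) { this.x = x; this.y = y; }

    // Marks the point in the lifecycle after which identity fields are frozen.
    void settle() { settled = true; }

    void setX(int x) { checkNotSettled(); this.x = x; }
    void setY(int y) { checkNotSettled(); this.y = y; }

    private void checkNotSettled() {
        if (settled) {
            throw new IllegalStateException("identity fields are frozen");
        }
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof SettlingPoint)) return false;
        SettlingPoint other = (SettlingPoint) obj;
        return other.x == x && other.y == y;
    }

    @Override
    public int hashCode() { return 31 * (31 * 7 + x) + y; }

    public static void main(String[] args) {
        SettlingPoint p = new SettlingPoint(5, 5);
        p.setX(6);      // still allowed
        p.settle();     // from here on it is safe to put p into a hash structure
        try {
            p.setX(8);  // rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The flag only guards the setters; nothing in this sketch stops a caller from adding the object to a hash structure before settle() has been called.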
Re: If you run into this, your program is broken
2006-07-28 12:30:30 breilly
Don't forget about this condition:
At that point your object has "settled" on its identity. Only after this point, it's ok to put this object into a hash structure.
This is difficult to enforce with the flag approach. The only alternative that I know of is to not allow the object to exist and not be in its settled state, which would mean getting it to a settled state by the time the constructor is finished.
Exception: if object construction happens entirely behind a factory or DAO. This can be difficult to accomplish. For example, what scope do you give the constructor so that the factory/DAO can access it but client code can't (including subclasses of the factory/DAO if not final). Then, you're left with the task of carefully reviewing the factory/DAO code to make sure it doesn't violate the rules. It certainly can be done, and I'm sure there are cases where it's necessary, but I would tend to avoid it if possible.
Re: If you run into this, your program is broken
2006-07-28 23:56:58 peterbecker
I believe if an object has a lifecycle that has this pre-initialized stage, then you are better off making sure that the object in that stage is accessible only within a limited scope. The idea is that you have some library code that initializes the object completely, sets the flag and then passes the reference. No one but the library should have access to the object before initialization is complete. This pattern can be used e.g. in a factory method.
Of course the library code can still do the wrong thing, but that's your fault, then ;-)
If the initialization has to be done in client side code, you can go the other path and use mutable companion classes, such as StringBuffer/StringBuilder -- another aspect that didn't fit into the article. That pattern is very safe, but it has the drawback of requiring two classes and two objects for each creation. But in my opinion that is a reasonable price for the safety you can get this way and the optimizations you can do later on the immutable objects can easily make up for it (on a side note: some optimizations are possible only on completely immutable objects, such as copying it across a wire instead of using remote callbacks).
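The companion-class pattern mentioned in the previous comment can be illustrated with the JDK's own pair of StringBuilder and String. The demo class and the key value are made up for this sketch:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo: StringBuilder is the mutable companion of the immutable String.
class CompanionDemo {
    static boolean stillFoundAfterBuilderChanges() {
        StringBuilder builder = new StringBuilder("order-");
        builder.append(42);               // all mutation happens on the builder
        String key = builder.toString();  // immutable snapshot
        Set<String> set = new HashSet<>();
        set.add(key);
        builder.append("-changed");       // affects only the builder, not the snapshot
        return key.equals("order-42") && set.contains(key);
    }

    public static void main(String[] args) {
        System.out.println(stillFoundAfterBuilderChanges()); // true
    }
}
```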
write the equals() differently
2006-07-27 08:35:17 ahabra
In my equals() methods, I usually put this line at the beginning
if (this == obj) return true;
Would this not fix the problem you showed?
write the equals() differently
2006-07-27 13:14:53 peterbecker
This would only improve performance by avoiding more expensive checks, but it would not change the return value of the method and thus it would not fix the problem.
i'm really having a hard time...
2006-07-27 06:16:30 ilazarte
understanding why you would ever use identity. maybe performance? is it somehow tied to immutability? immutability to me continues to be a far off concept that is accepted as a "best practice", but the reasons as to why that is are buried, vague or strawman-ish. "it will prevent bugs".
eclipse generates equals and hashcode implementations and i think that is simply the cleaner route. what am i missing that i should do more research on? i just am not interested in dealing with identity related issues in day to day programming. It seems equality is the way to go in terms of being practical and keeping complexity to a minimum.
i'm really having a hard time...
2006-08-01 17:04:36 shammah
Object identity is good when you want to know that the object is the exact same object.
- Caches, which may return invalid data when two instances of the same data are interchangeable.
- Related to above - multiple users editing the same information - normal equals methods will clobber one's information silently.
equals() is overrated except for value objects where there is a strict identity. I find I want objects to be equal in different ways in different circumstances.
Hence I am at least a little annoyed that Comparator is understood well, but there is no concept of an Equator.
i'm really having a hard time...
2006-08-02 09:03:13 peterbecker
Note that there is not only Comparator but also the Comparable interface -- both implementing the same basic idea of a partial order. Object.equals() is the equivalent of Comparable.compareTo() -- both create a relation (equivalence relation or total order) defined on the object class itself. But one you always implement, the other you can decide yourself.
The drawback of modelling the relations on the classes themselves is that there can be only one -- you can't change the way the comparisons work depending on context. And if you have multiple ways of comparing objects you can ask the question which one is to prefer.
The notion of an "Equator" as you call it (what about "EquivalenceRelation"?) is really lacking since it would allow defining a set's concept of identity in the same way I can use Comparator to define an order inside a SortedSet or similar structure -- a classic example for applying a Strategy pattern. Of course that would also pave the way to more accidents due to different understandings of an object's identity. :-) Just imagine getting a Set instance through a parameter and not being able to tell if it considers two of your objects as equal or not.
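The Comparator side of this analogy already works in the JDK today. A small sketch (the demo class is made up) shows how the Comparator passed to a TreeSet decides which elements count as the same:

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo: the Comparator acts as a pluggable notion of "sameness".
class ComparatorSetDemo {
    static int sizeWith(Comparator<String> order) {
        Set<String> set = new TreeSet<>(order);
        set.add("Hello");
        set.add("HELLO");
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeWith(String.CASE_INSENSITIVE_ORDER)); // 1
        System.out.println(sizeWith(Comparator.naturalOrder()));     // 2
    }
}
```

This also shows the hazard mentioned in the comment: a method that receives such a set as a plain Set cannot tell from the type which of the two behaviors it will get.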
i'm really having a hard time...
2006-07-27 06:58:50 peterbecker
You raise a number of questions and issues; I'll try to address them one per paragraph.
Performance can be a good reason to use reference identity: compared to the simple comparison of two references a call to equals() and/or hashCode() can be much more expensive.
Immutability does not mean you can use reference identity instead of equality. If you compare two equal String instances you are not guaranteed that they are the same reference -- only if they both are interned, but that is hard to guarantee.
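A short sketch of the String case (the demo class is made up for illustration):

```java
// Hypothetical demo: equal String instances are not necessarily the same reference.
class InternDemo {
    static boolean sameReference() {
        String a = new String("Hello");
        String b = new String("Hello");
        return a == b;                   // two separate objects
    }

    static boolean sameReferenceAfterIntern() {
        String a = new String("Hello");
        String b = new String("Hello");
        return a.intern() == b.intern(); // both map to one canonical instance
    }

    public static void main(String[] args) {
        System.out.println(sameReference());            // false
        System.out.println(sameReferenceAfterIntern()); // true
    }
}
```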
Immutable objects have the advantage that they can be used directly without fearing side effects. Imagine a class modelling a time interval between two dates. One invariant of that class should be that the start date is before the end date. If you use java.util.Date as start and end date and store the objects given in a constructor in your class, your invariant can break since Date is mutable and someone could change the start or end date so they do not fulfil your requirement. That's why you need defensive copies, which you don't need with immutable objects. See item 24 in Bloch's book for details. These types of bugs are extremely hard to find.
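A sketch of the interval example with defensive copies. The class name DateInterval and its accessors are assumptions for illustration:

```java
import java.util.Date;

// Hypothetical interval class that protects its invariant with defensive copies.
final class DateInterval {
    private final Date start;
    private final Date end;

    DateInterval(Date start, Date end) {
        // Copy first, then validate the copies, so callers cannot change
        // the dates between the check and the assignment.
        this.start = new Date(start.getTime());
        this.end = new Date(end.getTime());
        if (!this.start.before(this.end)) {
            throw new IllegalArgumentException("start must be before end");
        }
    }

    // Accessors hand out copies as well, because Date is mutable.
    Date getStart() { return new Date(start.getTime()); }
    Date getEnd() { return new Date(end.getTime()); }

    public static void main(String[] args) {
        Date start = new Date(1000L);
        Date end = new Date(2000L);
        DateInterval interval = new DateInterval(start, end);
        start.setTime(5000L); // would break the invariant without the copy
        System.out.println(interval.getStart().getTime()); // 1000
    }
}
```

With an immutable date type the copies would be unnecessary, which is the advantage the comment describes.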
Another thing that would not work without immutability is the way String.intern() works: this type of caching relies on the fact that two elements that were equal will always be equal.
I believe that in the end the object model you use should be as close as possible to the common understanding of the domain your program resides in. Identity is a core aspect of that and thus I think trying to avoid issues with identity in your object design leads to larger problems later. And different types of identity can be useful from a domain perspective; I'd consider pure value identity as one of the more unusual cases -- external references are very common, and they often coincide with the internal references in which case the latter can be used for efficiency.
Immutability through Finalisation?
2006-08-18 02:04:37 temple_cloud
Yes... immutability is sometimes tricky to understand for me too especially in a non functional language like Java; you really need to know your stuff. I think the example in this interesting article can help highlight this.
My knowledge of Java is not that great but I think your example provides a good case for immutability as "best practice" from my current level of understanding. From my knowledge of Java I would "finalise" the state attributes in the objects I am placing into the set; and also remove the "setter" method (that allows the side effect to be produced). So something like this:
public class Point {
    final int x;
    final int y;

    public Point(final int x1, final int y1) {
        x = x1;
        y = y1;
    }
    // Other methods
    ....
}
The class is "immutable"; once created it cannot be changed so the issues outlined in the hashSet problem cannot occur. So immutability can often be considered a "best practice".
However, of course, as mentioned in this article, solutions depend strongly on the domain of the problem being solved; but if amenable I believe this is a good solution to the HashSet problem (and many others).
This is my current thinking, and I use this pattern (when I can) if I am working with a HashSet. I would be interested to know of any pitfalls with this, as I find Java can be a very subtle language!
Immutability through Finalisation?
2006-08-18 02:59:49 peterbecker
Yes, Java can be a very subtle language and it offers its share of pitfalls.
The approach you describe is the standard one and is usually safe -- unless someone deliberately breaks it using reflection, see e.g. http://www.javaspecialists.co.za/archive/Issue014.html . So if you really want to be safe you need to set up the SecurityManager right. It allows these tweaks by default.
Note that the final modifier is relevant only for internal safety and as hint for compiler optimizations -- the lack of a setter method is sufficient for the external view if the access level is set appropriately. Most likely the members should be private in your example anyway -- unless you want direct read access on the package level.
Immutability through Finalisation?
2006-08-18 03:19:00 temple_cloud
Thanks for pointing out my silly mistake Peter! Indeed those variables were actually intended to be private.
((Thank you also for the article, the example and links; it is nice to see that such fundamental issues can be described in such a concise and intuitive way.))
I guess since my objects have usually very short-lived life cycles, I'm not too worried about the value of a reference being changed over time. I can see how that would be a major issue in a long-lived application where instances might frequently be reused.
From your perspective, it seems that immutability prevents the burden of defensive copies (or maybe it doesn't?) but having that allows you to trust references to be the only test you need for the notion of sameness to another object. Is that correct?
I have read Josh's book, it's a great work; I can tell there are things in there whose import I will only really learn down the road.