A new tool for organizing and visualizing collections of electronic mail has been created by researchers at USC.
The tool is designed to help legal researchers, historians, archivists and others dealing with large e-mail archives.
Called “eArchivarius,” the new system uses sophisticated search software developed for Internet search engines such as Google. It detects important relationships between messages and people by taking advantage of clues inherent in e-mail collections.
Anton Leuski of the USC School of Engineering’s Information Sciences Institute will demonstrate the new system July 30 at the Association for Computing Machinery Special Interest Group conference on Information Retrieval in Toronto, Ontario.
Here are three practical scenarios in which eArchivarius could be used:
A large corporation has just received a subpoena for all e-mail messages on a specific question. Traditional keyword searches return an enormous volume of mail that must be scanned by lawyers and paralegals for applicability. In the same way, the recipients of the subpoenaed data must analyze it. Can this process be sped up and made more efficient?
A historian is analyzing the history of a government decision, using an e-mail archive. Reading the entire text gives a great deal of information about the decision, but only careful notes can keep track of such events as shifts over time in the distribution of information, and even then subtle changes are hard to catch. Can software help?
A library has just received a donation of a famous scientist’s e-mail correspondence. Beyond a simple listing of titles, addresses and dates, is there a way the information in the archive can be made more immediately useful and comprehensible to users?
The tool is capable of automatically creating a vivid and intuitive visual interface, using spheres grouped in space to represent the relationships it discovers. The display, based on the Lighthouse system, can shuffle the connections to bring different elements to the fore.
In one display configuration, each sphere represents an author in the system. The spheres are visualized in a two- or three-dimensional space in which the distance between them reflects the number of messages exchanged over a given period, with frequent correspondents placed closer together.
For one collection used as an experimental exercise, exchanges of e-mail among Reagan administration national security officials, this visualization immediately shows some recipients packed into a tight cluster near the center with their most frequent correspondents. Others can be seen at a glance to be literally out of the loop, far out on the periphery.
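The idea of letting distance stand for message volume can be illustrated with a short Python sketch. It is not the eArchivarius or Lighthouse code. The function names, the rule that turns a message count into a target distance, the simple spring-style layout loop and the sample names are all assumptions made for illustration.

```python
import math
import random
from collections import Counter
from itertools import combinations


def exchange_counts(messages):
    """Count messages exchanged between each unordered pair of people."""
    counts = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:
                counts[frozenset((sender, recipient))] += 1
    return counts


def target_distance(count):
    # Assumption: 1 / (1 + count) turns "more mail" into "closer together".
    return 1.0 / (1.0 + count)


def layout(people, counts, steps=500, rate=0.05, seed=0):
    """Place people in 2-D so distances approximate target_distance()."""
    rng = random.Random(seed)
    pos = {p: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for p in people}
    pairs = list(combinations(people, 2))
    for _ in range(steps):
        for a, b in pairs:
            want = target_distance(counts[frozenset((a, b))])
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            have = math.hypot(dx, dy) or 1e-9
            # Nudge both points along the line joining them toward `want`.
            shift = rate * (have - want) / have
            pos[a][0] -= shift * dx
            pos[a][1] -= shift * dy
            pos[b][0] += shift * dx
            pos[b][1] += shift * dy
    return pos


def distance(pos, a, b):
    return math.dist(pos[a], pos[b])


# Invented sample data: (sender, recipients) pairs.
messages = (
    [("alice", ["bob"])] * 5
    + [("bob", ["alice"])] * 4
    + [("carol", ["alice", "bob"])]
)
people = ["alice", "bob", "carol", "dave"]
pos = layout(people, exchange_counts(messages))
for other in ("bob", "carol", "dave"):
    print(other, round(distance(pos, "alice", other), 2))
```

In this toy data, "dave" exchanges no mail with anyone, so the layout pushes that sphere away from the others, much like the officials described as out of the loop.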
The spheres representing people also can be arranged under the influence of other factors: the content of the authored messages, for example. The resulting configuration shows existing communities of people who converse on the same topic and the relationships among those communities.
Selecting any e-mail recipient can open up another window, which provides a list of all the people with whom the selected person exchanged mail, and a time-graphed record that shows when the exchanges took place.
“For a historian trying to understand the process by which a decision was made over a course of months, this kind of access will be extremely valuable,” said Leuski, a research associate at ISI.
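The per-person view, a list of correspondents plus a record of when exchanges took place, might be computed along the following lines. This is a minimal Python sketch, not the system's actual code. The message format, the function name `correspondent_timeline`, the choice of monthly buckets and the sample data are assumptions.

```python
from collections import Counter, defaultdict
from datetime import date


def correspondent_timeline(messages, person):
    """Map each correspondent of `person` to message counts per (year, month)."""
    timeline = defaultdict(Counter)
    for sent_on, sender, recipients in messages:
        if sender == person:
            others = [r for r in recipients if r != person]
        elif person in recipients:
            others = [sender]
        else:
            continue  # this message does not involve the selected person
        for other in others:
            timeline[other][(sent_on.year, sent_on.month)] += 1
    return timeline


# Invented sample data: (date sent, sender, recipients).
sample = [
    (date(2003, 1, 5), "alice", ["bob", "carol"]),
    (date(2003, 1, 20), "bob", ["alice"]),
    (date(2003, 3, 2), "carol", ["dave"]),
]
timeline = correspondent_timeline(sample, "alice")
for other, months in sorted(timeline.items()):
    print(other, sorted(months.items()))
```

Plotting each correspondent's monthly counts on a shared time axis would give one possible form of the time-graphed record.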
The same interface can also instantly return and display individual pieces of mail, with links to the people who sent and received each one and to similar messages.
“Similar messages” can be defined in terms of recipients, text keywords, or both.
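One simple way to score message similarity by recipients, keywords, or both is a weighted blend of two set-overlap measures. The Python sketch below uses Jaccard overlap, which is an assumption and not necessarily the measure eArchivarius uses. The dictionary layout and the `recipient_weight` parameter are also hypothetical.

```python
import re


def jaccard(a, b):
    """Share of items two collections have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def keywords(text):
    """Lower-cased words in a message body, as a set."""
    return set(re.findall(r"[a-z']+", text.lower()))


def message_similarity(m1, m2, recipient_weight=0.5):
    """Blend recipient overlap and keyword overlap.

    recipient_weight=1.0 compares recipients only, 0.0 compares
    keywords only, and values in between use both.
    """
    by_recipients = jaccard(m1["recipients"], m2["recipients"])
    by_keywords = jaccard(keywords(m1["text"]), keywords(m2["text"]))
    return recipient_weight * by_recipients + (1 - recipient_weight) * by_keywords


# Invented sample messages.
first = {"recipients": ["alice", "bob"], "text": "Budget meeting"}
second = {"recipients": ["alice", "bob"], "text": "Lunch plans"}
print(message_similarity(first, second, recipient_weight=1.0))
print(message_similarity(first, second, recipient_weight=0.0))
print(message_similarity(first, second))
```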
The spheres also can be colored to show other relationships. Topic similarity, for example, meaning the likelihood that a message is about a particular topic, can be shown by more or less intense color.
Different colors indicate different topics, creating a map of how the information is distributed among the messages.
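The coloring scheme, one hue per topic with intensity tracking likelihood, can be sketched in a few lines of Python. The palette, the function names and the choice to fade toward white at low likelihood are illustrative assumptions, not details taken from eArchivarius.

```python
def shade(base_rgb, likelihood):
    """Fade from white (likelihood 0) to the full base color (likelihood 1)."""
    likelihood = min(1.0, max(0.0, likelihood))
    return tuple(round(255 + (c - 255) * likelihood) for c in base_rgb)


def sphere_color(topic_likelihoods, palette):
    """Pick the most likely topic's hue and scale its intensity."""
    topic = max(topic_likelihoods, key=topic_likelihoods.get)
    return shade(palette[topic], topic_likelihoods[topic])


# Hypothetical palette and topic scores for one sphere.
palette = {"budget": (255, 0, 0), "travel": (0, 0, 255)}
print(sphere_color({"budget": 1.0, "travel": 0.2}, palette))
print(sphere_color({"budget": 0.1, "travel": 0.7}, palette))
```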
“What we have in effect is a four-dimensional display, with color added to the three spatial dimensions,” said Douglas Oard, an associate professor of computer science at the University of Maryland’s College of Information Studies.
Leuski and Oard have demonstrated the ability to find interesting patterns in collections as small as a few hundred e-mails, and the techniques they have developed are now being applied to thousands of e-mails sent and received by a single individual over a span of 18 years.
Scaling up to process millions of e-mails involving thousands of people will be the next challenge.
The elements of eArchivarius’ flexible and highly useful interface, Oard said, may someday find their way into e-mail client software.
“E-mail has become a major element of modern life, and the raw material of history,” said Oard. “We believe that eArchivarius offers a way into the e-mail labyrinth for researchers of all kinds.”
Contact Anton Leuski at (310) 448-8261.