Review: Senzing uncovers relationships hiding within big data

Used to combat fraud or uncover accidental data duplication, Senzing is a powerful yet lightweight tool with an artificial intelligence that is actually extremely smart.

Most of the time, when organizations are thinking about cybersecurity, they look at ways to monitor the connections between devices and programs, or even information and users. But the human aspects are often ignored, which can allow threat actors to go unnoticed when they launch new campaigns, or enable humans who are potential insider threats to permeate deep inside a targeted organization unnoticed.

Senzing began life in 2016 after being spun out from IBM. The goal of Senzing is to provide deep data analytics on potentially millions of records, without costing millions of dollars.

The program itself is deceptively simple in appearance. The entire thing can be downloaded for free from the Senzing website, so go ahead and try it out. There are both Mac and Windows versions of the program, and both can run on moderately powerful machines. Once installed, it no longer needs to connect to the internet, so it can even be used with air-gapped networks for total data security.

Using the program for any purpose to examine up to 10,000 records is completely free. The price scales up from there, to a high of $55,000 per month if you want to process a billion records or more through the system.

We tested Senzing using three databases with several thousand records each. When deployed to look for relationships within one of our three databases, the matching capabilities of the system became readily apparent. Data tends to get accidentally corrupted for different reasons. For example, people often make typos when entering data, so Michelle Jones can become Michele Jones, which can make it a separate database entry even though it represents the same person.

Senzing Single DB John Breeden II/IDG

Although this is the least powerful way to employ Senzing, finding duplicate or similar records within a single database is extremely quick and highly intuitive.

Senzing was able to find quite a few instances like that in a single database. It does that by looking at the supplemental data that is attached to each entry. So if Michele and Michelle have the same address and phone number, it’s a safe bet that they are the same person. But Senzing is a lot smarter than that. If there is a Michelle Jones and a Bridget Jones who share everything except their name, then you might be looking at a mother and daughter. It’s also possible that Bridget is a nickname for Michelle, so Senzing files it as a possible match until it can learn more.
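To make the idea concrete, here is a minimal sketch of that kind of reasoning in Python. It is not Senzing's algorithm or API, which we did not inspect. The field names, the 0.85 threshold and the three labels are assumptions chosen for illustration, and the name comparison uses the standard library's difflib.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_pair(rec_a: dict, rec_b: dict, name_threshold: float = 0.85) -> str:
    """Label two records 'match', 'possible match' or 'no match'.

    Assumption for this sketch: records are dicts with 'name', 'address'
    and 'phone' keys, and the supplemental fields must agree exactly.
    """
    same_details = all(rec_a[f] == rec_b[f] for f in ("address", "phone"))
    if not same_details:
        return "no match"
    if name_similarity(rec_a["name"], rec_b["name"]) >= name_threshold:
        # Near-identical names plus identical details: likely a typo.
        return "match"
    # Same details but a clearly different name: relative, alias or nickname.
    return "possible match"


# Hypothetical sample records, invented for this example.
michelle = {"name": "Michelle Jones", "address": "12 Oak Street", "phone": "555-0100"}
michele = {"name": "Michele Jones", "address": "12 Oak Street", "phone": "555-0100"}
bridget = {"name": "Bridget Jones", "address": "12 Oak Street", "phone": "555-0100"}

for other in (michele, bridget):
    print(other["name"], "->", classify_pair(michelle, other))
```

A production system would weigh many more fields and tolerate small differences in them. The point here is only the three-way outcome: confident match, possible match held for more evidence, or no match.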

This would make Senzing invaluable for complying with Europe’s new General Data Protection Regulation (GDPR), which requires that customers who ask to be removed from databases are removed in all instances. In that case, the company may be required to know about, and remove, both Michelle and Michele Jones, since they are the same person, and perhaps even Bridget Jones if there is enough of a match to suggest that she is also the same person.

Senzing Compare John Breeden II/IDG

Finding duplicate records in multiple databases, even if those records have differences based on mistakes or have been deliberately obfuscated, is a core strength of Senzing.

The real power of the program becomes apparent when adding a second database and comparing it to the first. As Senzing gets more information, and more fields to examine, it goes back and reexamines all of its previous assumptions. If there is a Terry Jones in the second database who uses the same e-mail address as Michelle, Bridget and Michele, Senzing could change its assumptions about all their relationships. Senzing may build out a family group, or suggest that they all might be an alias for one person. Whatever it decides, it fully explains its logic, so humans can check to see if they agree.
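One simple way to picture that reexamination is to regroup every record each time new data arrives. The sketch below is our own illustration, not Senzing's implementation. It links records that share any e-mail address, phone number or address, using a small union-find structure, and all record IDs and values are made up.

```python
from collections import defaultdict


def group_by_shared_values(records, fields=("email", "phone", "address")):
    """Cluster records that share any non-empty value in the given fields.

    Assumption for this sketch: each record is a dict with a unique 'id'.
    Returns a sorted list of sorted ID lists, one list per group.
    """
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent links to the group's root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # (field, value) -> ID of the first record holding it
    for r in records:
        for f in fields:
            value = r.get(f)
            if not value:
                continue
            key = (f, value)
            if key in first_seen:
                parent[find(r["id"])] = find(first_seen[key])
            else:
                first_seen[key] = r["id"]

    groups = defaultdict(set)
    for r in records:
        groups[find(r["id"])].add(r["id"])
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data: two unrelated-looking records in the first database...
db1 = [
    {"id": "A1", "name": "Michelle Jones", "email": "mj@example.com", "phone": "555-0100"},
    {"id": "A2", "name": "Bridget Jones", "phone": "555-0199"},
]
# ...and one record in the second database that touches both of them.
db2 = [
    {"id": "B1", "name": "Terry Jones", "email": "mj@example.com", "phone": "555-0199"},
]

before = group_by_shared_values(db1)
after = group_by_shared_values(db1 + db2)
print(before)
print(after)
```

Rerunning the grouping over the combined data is what lets the new record revise the earlier conclusion. Merging on any single shared value is far cruder than what the review describes, so treat the result as a list of leads to check, not as a verdict.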

Senzing Match Similar John Breeden II/IDG

Unlike most programs that employ artificial intelligence, Senzing explains all of its logic in calling out relationships between the records that it is examining.

Beyond making very good decisions, Senzing stands out because very few, if any, commercial data examination programs are designed by default to go back and retest all of their previous assumptions based on new information. When we added a third database to the test pool, Senzing was able to learn even more about the information in our lists, including things like different street names that are actually the same, even though they were listed as either Street, Court or Road in the database. It also identified a ZIP code for one duplicate record in the third database that was not present in the first two, and added that information to the set of data that it could search on moving forward.

In this way, Senzing is always self-correcting, and gets more accurate as more information is added. This is the opposite of how many big data tools operate: they tend to become corrupted over time as more data, including inaccurate data, is added to the system. That sometimes necessitates a reset of those programs, whereas with Senzing, more data simply makes the results more accurate.
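The street-name case can be pictured as building a comparison key that ignores the street type. The following is a rough sketch under our own assumptions, not a description of how Senzing normalizes addresses. The list of street types is deliberately tiny and the sample addresses are invented.

```python
import re

# Assumption for this sketch: a very small set of street-type words.
# Note that "st" can also mean "Saint", so a real normalizer needs more care.
STREET_TYPES = {"street", "st", "court", "ct", "road", "rd"}


def address_key(address: str) -> str:
    """Return a lowercase key with punctuation and street-type words removed."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(t for t in tokens if t not in STREET_TYPES)


def may_be_same_address(a: str, b: str) -> bool:
    """True when two addresses differ only by street type or punctuation."""
    return address_key(a) == address_key(b)


print(address_key("12 Oak Street"))
print(may_be_same_address("12 Oak Street", "12 Oak Ct."))
print(may_be_same_address("12 Oak Street", "14 Oak Road"))
```

Ignoring the street type can also merge addresses that really are different, which is why a key like this is only safe as supporting evidence alongside other matching fields such as name, phone number or ZIP code.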

Senzing Match Learning John Breeden II/IDG

As more records are added to the system, Senzing will learn new details, and go back and check every assumption that it has already made to see if the new information should change any previous conclusions.

To test Senzing in cybersecurity, the company provided us with a fictitious database of employees to work with. We took one of those employees, a database administrator, and fired her for cause, running the scenario that she was engaging in criminal activity. Comparing her record to all others, Senzing discovered that she shared an address with two other employees, both in higher-level areas within the company. This did not prove collusion between them, as they might just be roommates sharing a place to control expenses. But it's a good place for investigators to start if they are looking for potential insider threats, and certainly worth a conversation with those roommates.
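The underlying check is easy to picture. Here is a small sketch of it, assuming employee records are plain dictionaries. The names, roles and addresses are invented, and this is our illustration, not the query Senzing runs.

```python
def shares_address_with(employees, flagged_id):
    """Return names of other employees whose address matches the flagged one.

    Assumption for this sketch: each employee is a dict with 'id', 'name'
    and 'address' keys; addresses are compared ignoring case and
    surrounding whitespace.
    """
    def norm(address):
        return address.strip().lower()

    flagged = next(e for e in employees if e["id"] == flagged_id)
    return [
        e["name"]
        for e in employees
        if e["id"] != flagged_id and norm(e["address"]) == norm(flagged["address"])
    ]


# Hypothetical staff list, invented for this example.
staff = [
    {"id": 1, "name": "Dana Reed", "role": "Database administrator", "address": "7 Pine Court"},
    {"id": 2, "name": "Sam Ortiz", "role": "Finance director", "address": "7 pine court"},
    {"id": 3, "name": "Lee Park", "role": "VP, operations", "address": " 7 Pine Court "},
    {"id": 4, "name": "Ana Cole", "role": "Help desk", "address": "31 Birch Road"},
]

print(shares_address_with(staff, flagged_id=1))
```

As in our test, a hit only tells investigators where to look next. It says nothing about whether the people involved did anything wrong.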

It’s worth noting that all of these tests used databases with thousands of records, ran on a fairly standard laptop, and still finished within seconds. The longest test, which compared all three test databases, finished in under 30 seconds. A more powerful machine could decrease that time significantly.

Senzing has also been deployed to combat fraud. Although we did not test this capability live, company officials demonstrated how Senzing was used to uncover fraudulent activity at a financial institution. It did this by matching several records with different names, addresses and e-mails, and identifying them as belonging to the same person. Senzing is quite good at seeing through obvious attempts at obfuscation, tying those purposely incorrect records to others using data points that nobody would probably think to examine as part of an investigation. And Senzing can do it at scale, across an entire database or a series of databases at the same time.

Used to combat fraud or uncover accidental data duplication, Senzing is a powerful yet lightweight tool with an artificial intelligence that is actually extremely smart. It can be deployed at a reasonable price – especially compared to many other big data analytics programs and services. The inclusion of a new graphical dashboard takes this technology out of the realm of the data scientist and places it in the hands of anyone with a complex set of data that requires detailed analysis.

Copyright © 2018 IDG Communications, Inc.
