Monday, February 8, 2010

Temporal Weaknesses in Modern Data Mining

I read an interesting article by Jeff Jonas, a guy I met in Cozumel who also did the Ironman. He talked about the temporal disparity in the availability of information and how it let the terrorist slip through the cracks recently. He specifically highlighted the problem of tracking a "person" or identity across borders and languages, and through scenarios like people changing their names (meaning you'd have to track the name change and TRANSFORM queries based on the search TIME). I just wrote him with some comments and thoughts. I thought his article, along with some thoughts on the technology required to solve the problem, would be interesting to ponder.

You could certainly store a query along with some locking structure, like a stored procedure that enqueues the data or just a link to the modified record. But then you have a whole set of problems, like deciding how often to process that data. Real time would certainly be too expensive.
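
To make that concrete, here is a rough sketch of the batch idea in Python. Everything in it is my own illustration rather than anything from Jeff's article: the `BatchMatcher` and `StandingQuery` names are made up, records are just strings, and "matching" means every query term appears as a word in the record. A real system would use a database and a job scheduler instead of in-memory structures.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StandingQuery:
    owner: str        # who should be told about a hit (hypothetical field)
    terms: frozenset  # lower-cased terms; all must appear in the record


@dataclass
class BatchMatcher:
    records: dict = field(default_factory=dict)   # record_id -> text
    queries: list = field(default_factory=list)   # saved standing queries
    pending: deque = field(default_factory=deque)  # links to modified records

    def save_query(self, owner, *terms):
        """Store a query so it can be re-run against future changes."""
        self.queries.append(
            StandingQuery(owner, frozenset(t.lower() for t in terms))
        )

    def upsert_record(self, record_id, text):
        """Write the record and enqueue only a link (its id) for later."""
        self.records[record_id] = text
        self.pending.append(record_id)

    def run_batch(self):
        """Drain the queue (say, weekly) and return (owner, record_id) hits."""
        hits = []
        seen = set()
        while self.pending:
            record_id = self.pending.popleft()
            if record_id in seen:  # record changed more than once this period
                continue
            seen.add(record_id)
            words = set(self.records[record_id].lower().split())
            for query in self.queries:
                if query.terms <= words:
                    hits.append((query.owner, record_id))
        return hits
```

The queue holds only record ids, so a record that changes several times between runs is checked once, against its latest text. How often to call `run_batch` is exactly the scheduling question raised above.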

Still, even if you did something weekly, I bet you'd catch a lot of interesting stuff and possibly prevent some problems. Much like a human domain expert would remember some references, it would be good to keep interesting terms, and certainly high-priority items, in a distributed state where the local DB updates the requestor (include these fields in a protocol, say one based on SIP or some other protocol, to track locations/addresses, priority, and longevity, and pass the information in XML). That would give you distributed short-term intelligence. Then you also have to think about how long to maintain searches, likely again in the context of resources. And then migration of infrastructure...
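
As a sketch of what such a message might carry, here is a small Python example that builds and parses an XML payload with Python's standard `xml.etree.ElementTree` module. The element names (`interest`, `term`, `requester`, `priority`, `ttl_days`) are placeholders I invented for illustration; this is only the payload, not SIP or any other real protocol.

```python
import xml.etree.ElementTree as ET


def build_interest(term, requester, priority, ttl_days):
    """Serialize one 'interesting term' registration as an XML string."""
    root = ET.Element("interest")
    ET.SubElement(root, "term").text = term
    ET.SubElement(root, "requester").text = requester    # address to notify
    ET.SubElement(root, "priority").text = str(priority)
    ET.SubElement(root, "ttl_days").text = str(ttl_days)  # longevity
    return ET.tostring(root, encoding="unicode")


def parse_interest(xml_text):
    """Parse the XML produced by build_interest back into a dict."""
    root = ET.fromstring(xml_text)
    return {
        "term": root.findtext("term"),
        "requester": root.findtext("requester"),
        "priority": int(root.findtext("priority")),
        "ttl_days": int(root.findtext("ttl_days")),
    }
```

A local DB receiving this would know what to watch for, whom to update, how urgent it is, and when it can forget about it.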

Still, those problems will be surmounted eventually, especially if the records are all controlled by one party (not advocating gov control, just that a lot of info is always in the gov's hands). Certainly people could also provide services that crawl Google only as records change, then parse the results, etc. Engines that are open databases, like the new Wolfram search engine, could be used this way and profit from it. Hmmm. Cool technology and a business model.

Some comments came back from Jeff after I asked him about my thoughts. He implied it could be done with well-thought-out stored procedures:

Anyway, about your question .... Nope. 

Data finds queries must be real-time.  And yes this scales to billions of rows.  The trick is to store the queries, as if data, with the data.  It is all data. 

In fact, real-time scales better than batch: 
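
To get my head around "store the queries, as if data, with the data," here is a minimal Python sketch of how I picture it. This is my own guess, not Jeff's implementation: the `SharedIndex` class is hypothetical, features are plain strings, and two items "match" if they share any one feature. The point is only that queries and records sit in the same index, so each new arrival does a single lookup instead of waiting for a batch job.

```python
from collections import defaultdict


class SharedIndex:
    """One index holding both records and queries, keyed by feature."""

    def __init__(self):
        # feature -> list of (kind, item_id), where kind is "record" or "query"
        self.by_key = defaultdict(list)

    def _add(self, kind, item_id, features):
        matches = set()
        for key in {f.lower() for f in features}:
            # Anything of the other kind already filed under this feature
            # is a match for the item arriving now.
            for other_kind, other_id in self.by_key[key]:
                if other_kind != kind:
                    matches.add(other_id)
            self.by_key[key].append((kind, item_id))
        return sorted(matches)

    def add_query(self, query_id, features):
        """Store a query; returns records that already match (a normal search)."""
        return self._add("query", query_id, features)

    def add_record(self, record_id, features):
        """Store a record; returns stored queries it satisfies (data finds query)."""
        return self._add("record", record_id, features)
```

With this layout, a query asked last month and a record arriving today meet at the moment the record is written, which is the temporal gap the article was about.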
