How do you implement a data collection scheme that lets a manufacturer collect data from its end devices without knowing exactly which device the data came from? This is one of the challenges for the second Blockchain Hackathon (part of LongHash Cryptocon Vol2) in Berlin on May 18-19 this year. More details here.
As an advantage to all developers, blockchain enthusiasts and crypto geeks who are aching to solve this challenge, here
Taraxa is a fast, scalable, and device-friendly public ledger designed to help IoT ecosystems become more trusted, autonomous, and valuable, with the mission of building IoT’s trust anchor. Taraxa is built from the ground up by a team of accomplished engineers and academics headquartered in Silicon Valley, hailing from prestigious academic institutions such as Stanford, Princeton, Brown, and Berkeley, all with a passion for enabling the machine-to-machine (M2M) economy of the future. Edited excerpts of the interview:
Blockchain sounds like a generic infrastructural technology, what about Taraxa? Is it IoT-specific?
A large subset of IoT applications are stateless. Data anchoring, for example, where IoT devices make periodic commitments into the blockchain about the data they’ve collected, is a stateless operation. This means each transaction is not dependent on past or future transactions. Our protocol has two layers, the top DAG layer and the bottom finalization layer. The DAG layer gives information about transaction inclusion. For an IoT device performing a stateless transaction, it is enough that the transaction is included, since such transactions are not affected by ordering. Our unique design allows IoT devices to obtain this information much faster and earlier than in other protocols, where inclusion, finalization and execution are all tightly bound into a single step.
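To make the idea of a stateless commitment concrete, here is a minimal Python sketch. It is not Taraxa's API: the function names, the payload fields and the choice of SHA-256 over a JSON payload are all assumptions made for illustration. The point is only that each commitment is computed from that period's data alone, so it does not matter in which order the commitments are recorded.

```python
import hashlib
import json
from typing import Sequence


def make_commitment(device_id: str, readings: Sequence[float], period: int) -> str:
    """Return a SHA-256 digest committing to one period's readings.

    Hypothetical payload layout; a real deployment would define its own.
    """
    payload = json.dumps(
        {"device": device_id, "period": period, "readings": list(readings)},
        sort_keys=True,  # fixed key order so the digest is reproducible
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify_commitment(commitment: str, device_id: str,
                      readings: Sequence[float], period: int) -> bool:
    """Recompute the digest from the raw data and compare it to the anchored one."""
    return make_commitment(device_id, readings, period) == commitment


# Each commitment depends only on its own period, never on earlier ones,
# so the two digests below can be anchored in either order.
c1 = make_commitment("sensor-1", [21.5, 21.7], period=1)
c2 = make_commitment("sensor-1", [21.9, 22.0], period=2)
```

Only the digest would go on the ledger. The device (or its owner) keeps the raw readings and can later show that they match what was anchored.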
Many IoT devices are resource-constrained and cannot run full nodes. That means they cannot independently store the entirety of the blockchain’s history, and they lack the computational resources to verify transactions or the bandwidth to synchronize with the network. These are usually called light nodes. Current light node designs rely completely on a full node to update their state, giving that specific full node the opportunity to deceive or corrupt any light node it is connected to. Taraxa has a built-in mechanism whereby a light node can query a random subset of the network to re-validate what it has been told by the full node it usually communicates with, allowing it to remain far more independent and trustless than current conventional designs allow.
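The cross-checking idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Taraxa's actual mechanism: peers are modelled as plain callables that return the block hash they believe is correct, and the two-thirds agreement threshold is a made-up parameter.

```python
import random
from typing import Callable, Optional, Sequence


def cross_check(claimed_hash: str,
                peers: Sequence[Callable[[], str]],
                sample_size: int,
                rng: Optional[random.Random] = None,
                threshold: float = 2 / 3) -> bool:
    """Ask a random subset of peers whether they agree with `claimed_hash`.

    `peers`, `sample_size` and `threshold` are hypothetical; a real light
    node would query remote nodes over the network instead of calling
    local functions.
    """
    rng = rng or random.Random()
    sample = rng.sample(list(peers), min(sample_size, len(peers)))
    if not sample:
        return False  # nobody to ask, so the claim cannot be confirmed
    agreeing = sum(1 for ask in sample if ask() == claimed_hash)
    return agreeing / len(sample) >= threshold


# Nine honest peers and one dishonest peer.
peers = [lambda: "0xabc"] * 9 + [lambda: "0xbad"]
accepted = cross_check("0xabc", peers, sample_size=5)
rejected = cross_check("0xbad", peers, sample_size=5)
```

Because the sample is drawn at random on every check, the full node a light node usually talks to cannot know in advance which peers will be asked to confirm its answers.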
Why is there a need for anonymized data collection?
With the rapid proliferation of digitized technologies, the public at large has become increasingly aware of the omnipresence of data-collecting sensors, and increasingly concerned about how they’re being used. Recent scandals involving Facebook’s and Google’s mishandling of user data sparked concerns worldwide among the public as well as regulators. The EU’s General Data Protection Regulation (GDPR), which came into effect in May of 2018, further placed privacy and data ownership at the center of civil discourse. These regulatory trends, however, are still extremely limited in scope: they mostly require user consent upon visiting a website, which only acknowledges the problem without fundamentally solving it. These concerns are especially thorny in the case of IoT devices, as they have increasingly become embedded directly into our environments without our knowledge, tracking everything from location and movement to voice and video. Much of this also involves numerous third parties whose activities are difficult to track, and it happens across political jurisdictions, each with its own regulatory requirements, further complicating social concerns.
If IoT as a technology is to continue to proliferate, it must address data privacy concerns head-on and provide socially acceptable solutions that guarantee secure data ownership and usage without triggering innovation-killing regulatory backlashes.
Are there any successful machine learning applications for anonymized data collection?
The short answer is yes: any machine learning application can run on this type of data, since the data itself is in plain text.
There are two types of anonymity – the anonymization of data source, and the anonymization of the data itself. This challenge is the former, but I will talk about both.
Anonymizing the data source means exactly that – you don’t know where the data came from, but you know that it is real, valid data. In this case, it is simply raw data like anything else, and you may run any machine learning algorithms over them to build applications.
Anonymizing the data itself is much more complex. It is usually done in one of two ways, in software or in hardware. The software method involves what’s called homomorphic encryption, which allows an algorithm to perform operations directly on encrypted data without knowing what that data is. Fully homomorphic encryption, which supports arbitrary operations, is incredibly slow, roughly 50,000 to 100,000 times slower than normal execution. The hardware solution involves a trusted execution environment (TEE), which cordons off a section of a processor that requires specific permissions (via cryptographic signatures) to access, effectively preventing unauthorized or malicious programs from accessing restricted memory. Much of the key storage, signing and validation is also hard-wired into the hardware, which makes that process very hard to tamper with.
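To give a feel for what computing on encrypted data means, here is a toy Python sketch of the Paillier cryptosystem. Note the swap: Paillier is only additively homomorphic (it can add encrypted numbers), not fully homomorphic, and it is used here because it fits in a few lines. The primes are far too small to be secure, and the function names are invented for this example.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Build a toy Paillier key pair from two distinct primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (lam, mu)   # public key n, private key (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt 0 <= m < n using generator g = n + 1 and fresh randomness r."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, private_key, c: int) -> int:
    """Recover the plaintext from ciphertext c."""
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)


n, private_key = keygen(1009, 1013)  # toy primes, insecure on purpose
c_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
print(decrypt(n, private_key, c_sum))  # prints 42
```

Whoever computes `c_sum` never sees 20 or 22; only the holder of the private key can read the result. Fully homomorphic schemes extend this to arbitrary computation, which is where the heavy slowdown comes from.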
What are some of the examples of devices the manufacturer produces? Why do they have to be cryptographically-guaranteed?
Any device that generates data which may be sensitive.
A consumer example would be smart speakers that respond to your voice commands. One persistent concern is whether companies like Google or Amazon are recording all our conversations. They tell us they are not, but it’s difficult to know for sure, and the machines often mistake conversation for commands, which results in large segments of these conversations being sent to central servers. While companies need to collect data in order to offer us services, that data does not need to tie directly into our personal identities. It’s OK to know that “user X just asked about ways to cure an STD”; it is not OK to know that “user X is John Smith living at 123 Main Street”. The membership proof ensures that companies can collect the data they need to offer the right service, while they cannot associate that data with a person or entity.
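The interview does not say which construction the membership proof uses. One well-known way to prove "this came from one of our devices" without revealing which one is a ring signature, so the Python sketch below uses a Schnorr-style ring signature (in the style of Abe, Ohkubo and Suzuki) purely as an illustration. The group parameters are deliberately tiny and insecure, and all names are made up for this example.

```python
import hashlib
import secrets

P = 2039  # toy safe prime, P = 2*Q + 1 (far too small for real use)
Q = 1019  # prime order of the subgroup generated by G
G = 4     # generator of the order-Q subgroup


def _challenge(ring, message, commitment) -> int:
    """Hash the ring, the message and a commitment to a number mod Q."""
    data = repr((tuple(ring), message, commitment)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q


def keygen():
    """Return (secret_key, public_key) for one device."""
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)


def sign(message, ring, signer_index, secret_key):
    """Sign on behalf of the whole ring without revealing signer_index."""
    n = len(ring)
    e, s = [0] * n, [0] * n
    alpha = secrets.randbelow(Q)
    i = (signer_index + 1) % n
    e[i] = _challenge(ring, message, pow(G, alpha, P))
    while i != signer_index:
        # Simulate every other member with a random response.
        s[i] = secrets.randbelow(Q)
        commitment = (pow(G, s[i], P) * pow(ring[i], e[i], P)) % P
        i = (i + 1) % n
        e[i] = _challenge(ring, message, commitment)
    # Only the real signer can close the ring, using the secret key.
    s[signer_index] = (alpha - secret_key * e[signer_index]) % Q
    return e[0], s


def verify(message, ring, signature) -> bool:
    """Walk the ring once and check that the challenge chain closes."""
    e0, s = signature
    e = e0
    for public_key, s_i in zip(ring, s):
        e = _challenge(ring, message, (pow(G, s_i, P) * pow(public_key, e, P)) % P)
    return e == e0


devices = [keygen() for _ in range(4)]          # the manufacturer's devices
ring = [public_key for _, public_key in devices]
signature = sign("temp=21.5", ring, 2, devices[2][0])
is_valid = verify("temp=21.5", ring, signature)
```

The manufacturer can check `is_valid` using only the list of public keys, so it knows the reading came from one of its devices, but nothing in the signature says it was device number 2.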
How can the end user stop even the anonymized data collection on demand?
This could easily be done if manufacturers build in such functionality, which they will do once users become privacy-conscious enough to bring about regulations that require it. We are already seeing this happen across major software platforms; it is only a matter of time (a very brief amount of time) before hardware platforms come under the same regulatory standards.
How is the user informed about the data collection?
The device manufacturer needs to build in functionality that allows the users to monitor such data collection (see previous question).
What can also be done independently of the manufacturer is to use packet-sniffing software that analyzes real-time network traffic to understand what types of data are being sent and received. This type of software is usually used by network administrators or security professionals to protect their systems.
Does this challenge interest you? Apply here for the Hackathon before the 12th of May. The event is free for all developers who apply. Also, there is more. If you are a developer or aspiring entrepreneur in the blockchain/crypto space and want to hear investment perspectives from top Asian and European funds in the blockchain segment, or learn about business use cases in real-world adoption, get your free tickets for Hash Talk, an afternoon-long summit curated by LongHash Germany and focused on discussion and insights on investment, business, and tech in blockchain. More details here.