As data volume continues to grow, AI has made it possible to extract sustainable value from non-standardized data and to act promptly on high-level predictions. At the same time, massive amounts of counterfeit data and technologies that deceive AI threaten the foundation on which data-driven value rests. Securing the reliability of data and the sustainability of AI is key to future data utilization.
What do you think when you hear the term big data? While some people might believe it describes reality in the business world, others think it is a buzzword for a concept that has fallen far short of expectations. Either way, the situation has changed dramatically from a few years ago. An increase in both the amounts and types of data generated, and the development of AI technologies, have enabled big data to fulfill its promise of creating new value.
The number of data files generated is expected to approximately double in two years. Moreover, the total amount of data generated worldwide per year is anticipated to reach 163 zettabytes by 2025 [1]. One cause of this dramatic increase is the large number of video files produced and viewed as part of an upsurge in video marketing. Another is the spread of personal assistants on mobile phones and home-based smart speakers, which gather large amounts of voice data. Internet of Things (IoT) devices, which are predicted to exceed 20 billion in 2020 [2], will continue to increase both the amount and the types of data, including large amounts of unstructured data.
In the past, human experts called data scientists were the only ones able to create knowledge, by formulating hypotheses about cause-and-effect relationships in data and then validating them. With the emergence of deep learning, however, AI has evolved to the point where it can generate knowledge from data and make decisions. At the same time, the amount of information used for analysis is increasing dramatically and now includes the movements of objects and people extracted from images, and emotions estimated from voice data. For example, it is now possible to analyze fashion or travel trends from images on social media. In the future, many different types of unstructured data will be used to create more value.
[1] Data Age 2025 (IDC): https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf
[2] Gartner, 2017: https://www.gartner.com/newsroom/id/3598917
The development of AI technologies has enabled the automatic generation of valuable content such as news articles, videos and financial statements. Although such services have already been commercialized, a newer technology called a GAN [3] is significantly improving AI content generation. A GAN consists of a generator and a discriminator. The generator produces data that closely resembles the training data. The discriminator then judges whether each sample is real or generated. Both improve as they compete with each other. For example, this process enables video content in which the actual participants are replaced by individuals who are not present, creating an avatar in real time.
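The contest between generator and discriminator can be sketched in a few lines. The following is a toy illustration only, not a production GAN. It assumes one-dimensional "real" data drawn from a normal distribution with mean 4, a linear generator and a logistic discriminator. These choices, the function name train_toy_gan and all parameter values are hypothetical and were picked purely to keep the example small.

```python
import math
import random


def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.01, batch=32, seed=0):
    """Alternate discriminator and generator updates on 1-D toy data."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator (assumed linear): fake = a * z + b
    w, c = 0.1, 0.0  # discriminator (assumed logistic): D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        reals = [rng.gauss(4.0, 1.0) for _ in range(batch)]  # assumed real data
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fakes = [a * z + b for z in noise]

        # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
        dw = dc = 0.0
        for x_r, x_f in zip(reals, fakes):
            s_r = sigmoid(w * x_r + c)
            s_f = sigmoid(w * x_f + c)
            dw += (s_r - 1.0) * x_r + s_f * x_f
            dc += (s_r - 1.0) + s_f
        w -= lr * dw / batch
        c -= lr * dc / batch

        # Generator step: minimize -log D(fake), the non-saturating loss.
        da = db = 0.0
        for z in noise:
            s_f = sigmoid(w * (a * z + b) + c)
            da += (s_f - 1.0) * w * z
            db += (s_f - 1.0) * w
        a -= lr * da / batch
        b -= lr * db / batch
    return (a, b), (w, c)


if __name__ == "__main__":
    generator, discriminator = train_toy_gan()
    print("generator (a, b):", generator)
    print("discriminator (w, c):", discriminator)
```

Real GANs replace both linear models with deep neural networks and are usually trained with a deep learning framework. The alternating update pattern is the part this sketch is meant to show, and GAN training of any size may oscillate rather than settle.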
Such automatic content generation can streamline tasks and offer detailed and precise information that manual content creation does not provide. In addition, extreme personalization could provide content with greater impact. For instance, imagine how personalized videos that feature your image could influence your decision making.
[3] GAN: abbreviation for generative adversarial network
Fake news is being used to shape public opinion and has become a major social issue. As social networks penetrate society, false information transmitted by individuals can easily proliferate. Unfortunately, AI automatic content generation technology has the capacity to produce enormous numbers of fraudulent articles that appear real, including images and voices that look and sound authentic. Some predict that by 2020, the amount of AI-driven fake content generated will exceed AI’s ability to detect it. In addition, it is predicted that by 2022 a majority of people in advanced economies will consume more false than true information [4].
To solve this problem, technologies have been developed to detect fake news based on the expressions used in texts and the patterns by which information spreads throughout the world. Companies that generate revenue from online advertising are particularly susceptible to fraudulent content and are moving quickly to address this issue. For example, Google and Facebook, which account for more than 60% of the world’s online advertising, are currently exploring a variety of defensive measures. These include using machine learning to identify fake news, cooperating with third-party organizations that perform fact checks, and leveraging user evaluations of articles as collective intelligence to assess reliability.
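As a rough illustration of detection based on expressions in text, the sketch below trains a bag-of-words naive Bayes classifier with Laplace smoothing. It is only an assumption about what a very simple text-based detector might look like, not a description of any company's system. The labels "suspect" and "credible", the four training sentences in TOY_ARTICLES and the function names are all invented for this example.

```python
import math
from collections import Counter

# Made-up training data for illustration only.
TOY_ARTICLES = [
    ("shocking miracle cure doctors hate", "suspect"),
    ("you will not believe this shocking secret", "suspect"),
    ("city council approves annual budget", "credible"),
    ("central bank holds interest rates steady", "credible"),
]


def train_naive_bayes(labeled_texts):
    """Count words per label; returns a small model dictionary."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for text, label in labeled_texts:
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return {"words": word_counts, "docs": doc_counts, "vocab": vocab}


def classify(model, text):
    """Return the label with the highest log posterior for the text."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["docs"].items():
        counts = model["words"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in text.lower().split():
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train_naive_bayes(TOY_ARTICLES)
    print(classify(model, "shocking secret cure"))
```

A practical detector would need far more training data and would likely combine such a text score with the other signals mentioned above, such as fact-check results and user evaluations.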
[4] Gartner: https://www.gartner.com/technology/research/predicts/
Data quality has a great impact on AI because its decisions depend on the data it is given. For example, if AI uses a biased statement made by a user, it might make a decision that is discriminatory or biased. AI can recognize objects and voices with an accuracy that exceeds that of humans. However, deliberately deceptive inputs can fool AI into interpreting information incorrectly. Data crafted to deceive AI in this way are called adversarial examples, and they pose a grave threat to companies and people using AI. For example, if street signs are manipulated with malicious intent, humans will likely consider it a nuisance and still interpret the signs without difficulty. The AI in a self-driving vehicle, however, may misread the altered signs, which in turn could become life threatening. In another example, a surveillance camera with AI could be fooled by a person wearing glasses with a special printed pattern, resulting in the person being misidentified.
One solution being tested uses a dataset of pseudo-adversarial examples to pinpoint vulnerabilities in advance. In addition to information accuracy, AI’s robustness against malicious data will become increasingly important in the future.
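One way to picture testing with pseudo-adversarial examples is the sketch below. It applies the fast gradient sign method (FGSM), a well-known technique for generating adversarial examples, to a small logistic classifier and reports which correctly classified samples flip under a bounded perturbation. The model, the function names and the sample values are assumptions made for illustration. The source does not say which method the solutions under test actually use.

```python
import math


def predict_proba(weights, bias, x):
    """Probability of class 1 under an assumed logistic model."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def sign(value):
    # Returns -1, 0 or 1.
    return (value > 0) - (value < 0)


def fgsm(weights, bias, x, label, eps):
    """Move each feature by eps in the direction that increases the loss."""
    p = predict_proba(weights, bias, x)
    # Gradient of the cross-entropy loss with respect to the input x.
    grad = [(p - label) * w for w in weights]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def fragile_samples(weights, bias, samples, eps):
    """List inputs that are classified correctly but flip after the attack."""
    fragile = []
    for x, label in samples:
        clean = predict_proba(weights, bias, x) >= 0.5
        attacked = predict_proba(weights, bias, fgsm(weights, bias, x, label, eps)) >= 0.5
        if clean == bool(label) and attacked != clean:
            fragile.append(x)
    return fragile


if __name__ == "__main__":
    # Hypothetical model and data, chosen only for the demonstration.
    weights, bias = [2.0, -1.0], 0.0
    samples = [([1.0, 0.5], 1), ([3.0, 0.0], 1), ([-1.0, 1.0], 0)]
    print(fragile_samples(weights, bias, samples, eps=0.6))
```

Inputs reported as fragile mark regions where a small, deliberate change is enough to alter the decision, which is the kind of vulnerability such testing aims to find before deployment.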
Data holds the promise of generating limitless new value. However, the existence of fraudulent, biased and AI-deceiving data might jeopardize its use. So what is necessary to use data in a sustainable way? From a technological perspective, technologies that detect false and biased data, together with more robust AI, will be necessary. These technologies are expected to keep evolving indefinitely, much like the attack and defense cycles in cybersecurity. For this reason, systems must be built on the assumption that malicious data is present. Even with uncorrupted data, AI’s misjudgment can be a problem. For example, in 2015, Google’s image recognition system caused controversy when it recognized a human as a gorilla. (Google addressed this problem by deleting tags related to primates.) As long as AI is not 100% accurate, operation-level solutions will be necessary depending on the application.
There is a limit to how many data quality checks one company can perform. Moreover, confining collected data, such as data from IoT devices, to a single company will greatly limit its value. Perhaps society needs a system in which everyone shares data and guarantees its reliability. To accelerate this data revolution, it may be necessary to develop structures that distribute data, such as data marketplaces or information banks, as well as incentives for companies and individuals who offer data with guaranteed quality.
If society as a whole works together to improve the reliability of both AI and data, the creation of value will improve significantly. To do this, it may be necessary to enhance society’s perception of the value of data.