Text mining helps to look into the future
Text mining makes it possible to forecast the development of technologies and markets, assess the level of experts’ competence, and identify the most promising R&D trends. Ilya Kuzminov, leading expert at the HSE Institute for Statistical Studies and Economics of Knowledge, spoke about the latest developments in text mining at the International Conference “Foresight and STI Policy”.
Map of the future
Text mining is a Big Data technique for extracting relevant information from collections of unstructured full-text documents using natural language processing and machine learning tools. Text mining software is applied in a variety of areas ranging from marketing and monitoring attitudes in social networks to business intelligence and science and technology foresight, noted Ilya Kuzminov in his presentation entitled “Text mining: analysing fulltext sources and building ontologies for application in foresight studies”.
“Suppose we need to understand the market prospects of certain high-tech products in 2020”, said Kuzminov, speaking about applying text mining in foresight studies. “The software analyses a large number of expert reports containing relevant syntactic constructions, marker words, and terms with similar meanings. The algorithm identifies all contexts which say, for example, that the authors of a study believe a certain indicator will reach a certain value by a certain year. We can even place on one page all forecasts by reputable agencies, e.g. regarding oil prices, to see the full range of conflicting opinions. And this provides data for scientific analysis”.
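The kind of context search described here can be illustrated with a minimal Python sketch. It is not the HSE software: the marker pattern `FORECAST_RE` and the function `extract_forecasts` are hypothetical names, and a single regular expression stands in for what would in practice be a much richer set of syntactic constructions and marker words.

```python
import re

# Hypothetical marker pattern: "will reach/hit/exceed/be <value> ... by <year>".
# A real system would use many such constructions, not one regular expression.
FORECAST_RE = re.compile(
    r"\bwill\s+(?:reach|hit|exceed|be)\s+(?P<value>\$?\d+(?:\.\d+)?)\b"
    r".*?\bby\s+(?P<year>(?:19|20)\d{2})\b",
    re.IGNORECASE,
)


def extract_forecasts(text):
    """Return one record per sentence that contains a forecast-like statement."""
    # Naive sentence split: punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    forecasts = []
    for sentence in sentences:
        match = FORECAST_RE.search(sentence)
        if match:
            forecasts.append({
                # Everything before the marker, i.e. who forecasts what.
                "context": sentence[:match.start()].strip(),
                "value": match.group("value"),
                "year": int(match.group("year")),
            })
    return forecasts


reports = (
    "We expect that oil prices will reach $60 per barrel by 2020. "
    "The market was flat last year. "
    "Agency B believes oil prices will be $45 by 2020."
)
for forecast in extract_forecasts(reports):
    print(forecast)
```

Collecting the records for one indicator and one year side by side gives the “one page” view of conflicting forecasts mentioned above.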
HSE has developed and tested its own software tools for making forecasting estimates and identifying milestones in 31 subject areas, and assembled a database which contains tens of thousands of documents. The emphasis is on the quality of the full-text sources subjected to mining, not on their quantity. “We could have collected millions or even tens of millions of documents from open sources by scanning the web, but there would be a lot of garbage there, so we concentrate on selecting high-quality sources and subjecting them to subsequent expert validation”, explained Kuzminov.
According to the expert, glossaries of highly specific (marker) words and phrases, together with their sets of synonyms, are compiled to build a module for subject-based machine classification of full-text sources. Work is also under way to automate production of a radically new product of the Foresight Centre: structured timelines of S&T development. “In a way it’s a map of the future which describes events we’re going to experience in the next 30 years”, noted the speaker. Two ministries and a number of corporate customers have already expressed interest in this applied product.
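Glossary-based subject classification of the kind mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_by_glossary`, the glossary layout (subject mapped to a list of synonym sets) and the sample terms are all hypothetical, and the scoring rule (one point per synonym set found in the text) is a simplification chosen for the example.

```python
import re


def term_in_text(term, lowered_text):
    """True if the term occurs as a whole word or phrase in the lower-cased text."""
    pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
    return re.search(pattern, lowered_text) is not None


def classify_by_glossary(text, glossaries):
    """Score each subject by the number of its synonym sets found in the text.

    glossaries maps a subject name to a list of synonym sets; each synonym
    set is a list of interchangeable marker terms and counts at most once.
    Returns (best_subject_or_None, scores_by_subject).
    """
    lowered = text.lower()
    scores = {}
    for subject, synonym_sets in glossaries.items():
        scores[subject] = sum(
            1
            for synonyms in synonym_sets
            if any(term_in_text(term, lowered) for term in synonyms)
        )
    if not scores:
        return None, scores
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


# Hypothetical glossaries for two subject areas.
glossaries = {
    "energy": [["photovoltaic", "solar cell"], ["smart grid"]],
    "agriculture": [["crop yield", "harvest"], ["fertiliser", "fertilizer"]],
}
print(classify_by_glossary(
    "New solar cell designs feed power into the smart grid.", glossaries
))
```

Counting each synonym set once keeps a document from scoring highly merely because it repeats one marker term many times.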
Another text mining function in foresight studies is identifying so-called “weak signals” in scientific texts – i.e. information about events currently seen as insignificant or indeterminate, but potentially capable of radically changing the future.
One way to identify weak signals is to search for neologisms. This requires a full list of words in a language: a dictionary which includes proper names, names of geographic locations, chemical substances and biological species, as well as typical misprints and grammatical errors. Comparing words and word combinations from a scientific journal or conference proceedings with such a huge dictionary, and applying a number of specially designed filters, makes it possible to identify potential neologisms, i.e. words that are just emerging in the language. By analysing the meaning of such new words one can foresee the emergence of new industries with the potential to completely change the future. The expert recalled that the words “aviator” and “robot” appeared in literature even before the relevant concepts were implemented in practice.
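The dictionary comparison can be sketched as follows. This is a minimal illustration, not the HSE tool: the function `find_candidate_neologisms` is a hypothetical name, the tiny `known_words` set stands in for the full dictionary described above, and the two filters (minimum frequency and minimum length) are assumed stand-ins for the “specially designed filters”.

```python
import re
from collections import Counter


def find_candidate_neologisms(text, known_words, min_count=2, min_length=4):
    """Return words absent from the dictionary, most frequent first.

    min_count and min_length are illustrative filters: they drop one-off
    typos and short fragments that are unlikely to be real new terms.
    """
    # Lower-case alphabetic tokens, allowing internal hyphens.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter(tokens)
    candidates = [
        word
        for word, count in counts.items()
        if word not in known_words
        and count >= min_count
        and len(word) >= min_length
    ]
    # Sort by descending frequency, then alphabetically for stable output.
    return sorted(candidates, key=lambda word: (-counts[word], word))


# Toy dictionary; the real one would hold the full vocabulary of the language.
known_words = {
    "we", "study", "in", "new", "materials", "are", "used", "sensors", "a", "is",
}
abstracts = (
    "We study memristors in new materials. "
    "Memristors are used in sensors. A qubit is new."
)
print(find_candidate_neologisms(abstracts, known_words))
print(find_candidate_neologisms(abstracts, known_words, min_count=1))
```

The candidates still need human review: the article stresses that the meaning of the new words is what points to emerging industries.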
Practical application of technologies is no less important than their development. Text mining makes it possible to determine which scientific concepts flow from theory into management practice, and which do not. To give a simplified example, two sets of sources are taken: scientific papers on the one hand, and forecasting, analytical and programme documents prepared by international organisations and national government agencies supervising the development of specific industries, on the other. Comparing them may reveal that a certain cluster of interconnected concepts has been actively discussed in scientific literature for a decade but is still very rarely mentioned in current global- or national-level strategic decision-making documents. This may be evidence of insufficiently active dialogue between science and practice in the relevant area. On the other hand, if a system of concepts which emerged only last year is already actively cited in government documents, it means that the relevant research area commands the close attention of decision-makers.
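The two-corpus comparison in this simplified example can be sketched in Python. The function names `mention_share` and `find_uptake_gaps`, the sample documents and the thresholds are assumptions made for illustration; the article does not say how the HSE tools measure how actively a concept is discussed.

```python
def mention_share(documents, term):
    """Share of documents that mention the term (case-insensitive substring)."""
    if not documents:
        return 0.0
    term = term.lower()
    return sum(term in doc.lower() for doc in documents) / len(documents)


def find_uptake_gaps(terms, science_docs, policy_docs,
                     science_min=0.3, policy_max=0.05):
    """Terms widely discussed in science but almost absent from policy texts.

    science_min and policy_max are illustrative thresholds, not values
    taken from the article.
    """
    gaps = []
    for term in terms:
        science_share = mention_share(science_docs, term)
        policy_share = mention_share(policy_docs, term)
        if science_share >= science_min and policy_share <= policy_max:
            gaps.append((term, science_share, policy_share))
    return gaps


science_docs = [
    "Precision farming uses soil sensors.",
    "Soil sensors and gene editing.",
    "Gene editing in crops.",
]
policy_docs = [
    "The strategy supports precision farming.",
    "Subsidies for precision farming and irrigation.",
]
terms = ["precision farming", "gene editing", "soil sensors"]
for term, science_share, policy_share in find_uptake_gaps(
    terms, science_docs, policy_docs
):
    print(f"{term}: science {science_share:.2f}, policy {policy_share:.2f}")
```

Swapping the two thresholds (low in science, high in policy) would flag the opposite case, where a very recent concept is already prominent in government documents.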
What experts don’t talk about
Text mining techniques can be applied not just to predict the future but also to deal with existing problems. One of the promising areas HSE experts are working on is assessing experts’ professional profiles using text mining. In the simplest case, this requires a “model” set of texts in a particular area, e.g. agriculture, which is processed with a computer application to produce a list of, say, the one hundred most commonly used words, word combinations, or phrases that are highly specific to the area (to put it scientifically, “n-grams”). Then an expert is asked to make a ranked list of the most important word combinations which, in their opinion, describe this area. The two lists are compared with each other. If they are close and the expert has provided the most commonly used words, we have a broad specialist with a general understanding of the industry who does not go into detail. If the word combinations selected by the expert refer to a specialised field, we have a “skewed” sample; it means that the expert is a narrow specialist who sees the industry through the prism of their particular niche.
Finally, if the person suggests words and word combinations which have no relation to the list generated by the computer from the verified model set of industry-specific documents, the expert may not be sufficiently qualified in the area.
HSE experts have already developed working software for this kind of task, together with the relevant algorithms, the speaker noted. Several tens of thousands of documents have been processed and “completely broken down into phrases and word combinations”. “The number may not seem too large: certain systems cover tens of millions of documents”, commented the HSE expert. “But we are being very thorough in our selection. We do not launch a robot programme which would pick up from the internet everything available in open access; we use hard-to-access sources, including internal sources of the Higher School of Economics”. A user-friendly interface is currently being developed, so that any user will be able to download the data they need as a file and get customised search results.
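The three-way comparison of an expert’s list with the computer-generated one (broad specialist, narrow specialist, little relation to the field) can be sketched as follows. This is a simplified illustration: the function `profile_expert`, the thresholds, and the rule that treats the less frequent tail of the reference list as “specialised” terms are all assumptions, not a description of the HSE software.

```python
def profile_expert(expert_terms, reference_terms, top_n=20, min_overlap=0.5):
    """Classify an expert by comparing their terms with a reference list.

    reference_terms must be ordered from most to least frequent in the
    model corpus. top_n and min_overlap are illustrative thresholds.
    """
    reference = [term.lower() for term in reference_terms]
    expert = {term.lower() for term in expert_terms}
    if not expert:
        raise ValueError("expert_terms must not be empty")

    matched = expert & set(reference)
    if len(matched) / len(expert) < min_overlap:
        # Most suggested terms do not appear in the reference list at all.
        return "little overlap"

    # Assumption: the head of the list holds the general, most common terms.
    head = set(reference[:top_n])
    head_share = len(matched & head) / len(matched)
    return "broad" if head_share >= 0.5 else "narrow"


# Hypothetical reference list for agriculture, most frequent terms first.
reference = [
    "crop yield", "fertiliser", "irrigation", "livestock", "soil",
    "harvest", "drip irrigation", "soil salinity", "canal lining", "water table",
]
print(profile_expert(["crop yield", "soil", "livestock", "irrigation"],
                     reference, top_n=5))
print(profile_expert(["drip irrigation", "soil salinity", "canal lining",
                      "irrigation"], reference, top_n=5))
print(profile_expert(["blockchain", "neural network", "soil"],
                     reference, top_n=5))
```

A result of “little overlap” is only a signal for closer review; as the article puts it, such an expert may not be sufficiently qualified, not necessarily is not.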
Similarly, text mining makes it possible to screen the resumes of candidates for particular jobs. After all, each type of activity has a specific set of relevant words and phrases. Resumes of, publications by, and other information about potential candidates can therefore be tested for the presence and frequency of specific marker words. One could draw up “word specifications” for particular positions.
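A “word specification” check of this kind might look like the sketch below. The function names `marker_counts` and `rank_resumes`, the sample marker terms and the ranking rule (coverage first, then total frequency) are hypothetical choices made for the example.

```python
import re


def marker_counts(text, marker_terms):
    """Count whole-word occurrences of each marker term in the text."""
    lowered = text.lower()
    counts = {}
    for term in marker_terms:
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        counts[term] = len(re.findall(pattern, lowered))
    return counts


def rank_resumes(resumes, marker_terms):
    """Rank resumes by presence, then frequency, of marker terms.

    resumes maps a candidate id to resume text. Returns a list of
    (candidate, terms_covered, total_mentions), best match first.
    """
    scored = []
    for candidate, text in resumes.items():
        counts = marker_counts(text, marker_terms)
        covered = sum(1 for count in counts.values() if count > 0)
        scored.append((candidate, covered, sum(counts.values())))
    return sorted(scored, key=lambda row: (-row[1], -row[2], row[0]))


# Hypothetical word specification for a data engineering position.
spec = ["sql", "data pipeline", "etl"]
resumes = {
    "a": "Built a data pipeline with SQL. Tuned SQL queries and ETL jobs.",
    "b": "Managed a retail store and trained staff.",
    "c": "Wrote SQL reports.",
}
for row in rank_resumes(resumes, spec):
    print(row)
```

Ranking by coverage before raw frequency avoids rewarding a resume that simply repeats a single marker word.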
Text mining is also a good basic analysis tool which makes it possible to rank resumes by the number of grammatical errors, or by the use of slang or expressions unacceptable in business communications. “Up to several thousand resumes are submitted for certain positions”, explained Ilya Kuzminov. “Text mining makes it possible to select the faultless ones in seconds, the ones people have worked hard to compose, weighing their every word”.
By Vlad Grinkevich, for OPEC.ru