Process-based machine learning to analyze HEI compliance

Ildikó Borbásné Szabó
Corvinus University of Budapest

Katalin Ternai
Corvinus University of Budapest

Szabina Fodor
Corvinus University of Budapest

1 Introduction
In the era of the fourth industrial revolution, educational institutions have a responsibility to educate future employees. They need predictions about future competences so that they can start modifying and updating curricula in time, because students take 3-5 years to graduate. Predictions of competences are usually based on experts' opinions [1][2][3][4], but the causal relationships are not formalized in these cases. Models and statistical analyses can discover these relationships behind the data. Statistical analyses require time series data to detect trends. Data warehouses can consolidate such subject-oriented, non-volatile and integrated time series data. Job ads contain the competences needed by a labor market per region, time and occupation. An ETL process extracts them from the streaming data of job portals, cleanses them and loads them into the data warehouse (DW). Business analytics facilitates the analysis of trends. These trends are used to check the compliance of the educational portfolios of higher education institutions (HEIs) and their learning outcomes with future competence needs. Learning analytics helps to analyze student learning processes in order to improve their knowledge acquisition activities.
1.1 Contributions
Our research aims at creating a data warehouse to assess future job competences by collecting time series data from job portals and transforming them into the data warehouse. This requires developing a process-based machine learning algorithm to extract competence patterns from job descriptions.
2 Context
Data derived from different data sources are consolidated in a data warehouse to provide an enterprise-wide business view of all information. A data warehouse contains connected data tables determined by business analytical purposes. Dimension tables represent qualitative data, while fact tables store mainly quantitative data. These tables can be arranged in a star schema or a snowflake schema to represent multidimensionality and to provide opportunities for analyzing data along different dimensions or their hierarchies [11]. Eckerson, cited in Turban et al. [12], distinguished four types of architectures. A data mart is a small data warehouse that meets departmental requirements instead of enterprise-wide ones. The virtual, distributed and federated type is not a data warehouse in the traditional sense: a middleware layer is created to handle, organize and transform data from different data sources in order to provide analytical capabilities. The Hub-and-Spoke and Enterprise DW architectures define an enterprise-centric approach, including common data definitions, formats etc., for managing the business in a consolidated way. In the Hub-and-Spoke case, data marts are additionally created from the data warehouse to facilitate departmental decision making, but this results in redundant data and additional maintenance cost. An ETL (extract, transform, load) process is required for data warehouses to provide a consolidated, enterprise-wide view of the business. During this process data are extracted from different sources (operational systems, flat files etc.), transformed into a predefined, agreed data structure and loaded into the data warehouse [11].
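The three ETL steps described above can be illustrated with a minimal Python sketch. It is not the authors' ETL implementation: the field names (`job_title`, `town`, `title`, `city`), the two example sources and the plain list standing in for the data warehouse are all assumptions made for illustration.

```python
import csv
import io


def extract(csv_text, records):
    """Extract raw rows from two illustrative sources:
    a flat file (CSV text) and an operational system (list of dicts)."""
    flat_rows = list(csv.DictReader(io.StringIO(csv_text)))
    return ([("flat_file", row) for row in flat_rows]
            + [("operational", row) for row in records])


def transform(raw_rows):
    """Map source-specific field names onto one agreed structure
    (hypothetical target fields: source, title, city)."""
    transformed = []
    for source, row in raw_rows:
        title = row.get("title") or row.get("job_title") or ""
        city = row.get("city") or row.get("town") or ""
        transformed.append({
            "source": source,
            "title": title.strip().lower(),
            "city": city.strip().title(),
        })
    return transformed


def load(rows, warehouse):
    """Append the transformed rows to the target store.
    A plain list stands in for the data warehouse here."""
    warehouse.extend(rows)
    return len(rows)
```

A call such as `load(transform(extract(csv_text, records)), warehouse)` runs the three steps in order; in a real setting each step would be replaced by source-specific connectors and by the warehouse's own loading mechanism.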
3 Problem
Our purpose is to create a data warehouse for analyzing time series and regional data about competences, in order to monitor labor market needs and to detect changes caused by the technological innovations driving the fourth industrial revolution.
4 Solution

Approximately 1000 job advertisements were downloaded from the Indeed job portal in September 2019. The job vacancies are from the manufacturing industry. They contain data about the country or city where the competence needs emerged, the company and occupation requiring these competences, a job description containing information about them, the date when these needs appeared on the web, and the salary. The collected data will be organized into the following dimension and fact tables. The dimension tables are listed below; every table has a primary key, which is not mentioned here.
• Location table contains Location ID, Country, Region and City data.
• Company table includes Company ID, Occupation, Company, its related Industry and Industrial code from a national framework (like NACE, SOC etc.).
• Competence table has Competence ID, Professional, Personal, Social and Methodological Competence fields.
• Date table contains the year, quarter, month and publishing date. All of them are part of the key field.
The related fact table includes data about salary and the IDs of these dimension tables. To create the data warehouse, the following problems must be solved by the algorithms of the ETL process:
• job portals do not publish region and city data separately, hence they have to be separated;
• names of occupations vary widely, so they have to be classified;
• competences have to be extracted from job descriptions and assigned to one of the four categories.
The ETL process is under development. This paper presents only the competence extraction process.
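The dimension and fact tables listed above can be sketched as a star schema. The following Python example uses the standard `sqlite3` module purely for illustration; it is not the SAP HANA schema of the project. All table and column names, the `"City, Region"` input format assumed by `split_location`, and the example trend query are assumptions.

```python
import sqlite3

# Hypothetical star schema following the tables described in the text.
SCHEMA = """
CREATE TABLE dim_location (
    location_id INTEGER PRIMARY KEY, country TEXT, region TEXT, city TEXT);
CREATE TABLE dim_company (
    company_id INTEGER PRIMARY KEY, occupation TEXT, company TEXT,
    industry TEXT, industry_code TEXT);
CREATE TABLE dim_competence (
    competence_id INTEGER PRIMARY KEY, professional TEXT, personal TEXT,
    social TEXT, methodological TEXT);
CREATE TABLE dim_date (
    year INTEGER, quarter INTEGER, month INTEGER, publishing_date TEXT,
    PRIMARY KEY (year, quarter, month, publishing_date));
CREATE TABLE fact_job_ad (
    location_id INTEGER REFERENCES dim_location(location_id),
    company_id INTEGER REFERENCES dim_company(company_id),
    competence_id INTEGER REFERENCES dim_competence(competence_id),
    year INTEGER, quarter INTEGER, month INTEGER, publishing_date TEXT,
    salary REAL,
    FOREIGN KEY (year, quarter, month, publishing_date)
        REFERENCES dim_date(year, quarter, month, publishing_date));
"""


def build_warehouse():
    """Create an in-memory database holding the illustrative star schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def split_location(raw):
    """Separate city and region, assuming the portal publishes 'City, Region'."""
    city, _, region = raw.partition(",")
    return city.strip(), region.strip()


def avg_salary_by_quarter(conn):
    """Example trend query: average advertised salary per year and quarter."""
    query = ("SELECT year, quarter, AVG(salary) FROM fact_job_ad "
             "GROUP BY year, quarter ORDER BY year, quarter")
    return conn.execute(query).fetchall()
```

The composite key of the date table mirrors the statement that all date fields are part of the key; a surrogate date ID would be an equally plausible design.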
5 Implementation
Streaming data coming from job portals can be stored in SAP HANA running on the HPI Future SOC Lab architecture. SAP HANA Text Analytics can be used to implement the developed process-based machine learning algorithm. This algorithm uses process ontologies transformed from business process models and uses the semantics of the relationships between process model elements as open queries. The first version of this algorithm was presented in [5][6][7].
The first phase is to create an Adonis process model of a selected manufacturing process (1) and to transform it into the Reference Process Ontology (RPO) (3) using a Java transformation program.
Our approach is in principle transferable to other semi-formal modeling languages. Semantic annotation that explicitly specifies the semantics of the tasks and decisions in the process flow is important in our method. The second phase is to discover new process elements within job descriptions with the help of semantic text mining. The algorithm focuses on finding patterns shaped into open queries. Relations are regarded as ordered pairs. The algorithm assumes that certain expressions can represent a given relation within the document. For example, the produces_output(Process_step, Document) relation suggests that something must happen to a document, e.g. it is submitted or signed. Therefore the algorithm looks for the pattern x submit y within the document, where y is a document. It searches for the term "submit" and collects a few words after it.
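The pattern search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the SAP HANA Text Analytics implementation: the `RELATION_TRIGGERS` mapping, the function name `find_candidates` and the window of three following words are assumptions.

```python
import re

# Hypothetical mapping from an ontology relation to the expressions
# assumed to signal it in a job description.
RELATION_TRIGGERS = {"produces_output": ["submit", "sign"]}


def find_candidates(text, relation, window=3):
    """Find each trigger term of the relation and collect up to `window`
    words that follow it; these words are candidates for the second
    element of the ordered pair (e.g. the Document)."""
    results = []
    for trigger in RELATION_TRIGGERS[relation]:
        # Prefix match so that simple inflections (submits, submitted)
        # are found; this also matches unrelated words sharing the prefix
        # (e.g. "signal"), which a real system would have to filter out.
        pattern = re.compile(
            r"\b" + re.escape(trigger) + r"\w*\b"
            + r"((?:\W+\w+){0," + str(window) + r"})",
            re.IGNORECASE,
        )
        for match in pattern.finditer(text):
            following_words = re.findall(r"\w+", match.group(1))
            results.append((trigger, following_words))
    return results
```

For a sentence such as "The operator submits the inspection report to the supervisor", the sketch returns the trigger `submit` together with the words `the`, `inspection`, `report`, which would then be checked against the ontology to decide whether they denote a document.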
7 Conclusion
This paper presented a solution for consolidating data about labor market needs into a data warehouse. There are other solutions for analyzing job market needs, but they mainly match job seekers' profiles with current labor market requirements and do not store these data for analyzing trends in competence needs. Such analyses require time series data that are unified and integrated in both syntax and semantics, which a data warehouse can provide.