Combined Transport Magazine met with Dr. Hannah Richta of DB Cargo to speak about the benefits and challenges of Big Data and how DB Cargo wants to make more data-driven decisions in the future.
Matthias: DB Cargo is well known as one of the important traditional big European players for rail freight solutions. Currently, Big Data is a hot topic in the logistics market. German industry and politicians coined the buzzword Industry 4.0 and, derived from it, Logistics 4.0, referring to the fourth industrial revolution. We are interested in how Big Data will change the decision-making processes at DB Cargo. How does DB Cargo’s Big Data strategy differ from that of global tech companies like Google, Apple or Uber?
Hannah: According to studies, more than 90 percent of today’s global data has been created since 2010 alone. And the data stock continues to grow: the data volume doubles each year – or, citing a 2013 figure of the Internet economy association, 2.5 trillion new data points are generated every day. While the first thought could be that this development concerns only Internet companies, it quickly becomes clear that railway operations, too, generate huge amounts of data every day. A few obvious examples are ticket sales or freight transport bookings, timetables, GPS positions and sensor information of locomotives, wagons and loading units, or the operating data of the escalators in station buildings. Finally, there are also sources such as weather, traffic and economic data, which are relevant for railway operations. All these data and challenges are not new. What is new to us is that the technical possibilities have advanced rapidly and that improved methods for storing, analysing and using the data are available. While the global tech companies you mentioned have these aspects in their DNA, we are still at the beginning. The starting point for DB Cargo to utilise this large amount of data by means of the new possibilities is therefore to set up a Big Data strategy comprising five essential building blocks, which I will illustrate in the following.
Matthias: With such a huge amount of data, you can quickly lose the overview. What is DB Cargo’s specific approach?
Hannah: At the beginning and at the end of each analysis and of each Big Data strategy, there are one or more business questions that need to be answered. This seems trivial, but the endless opportunities of Big Data make strategy development difficult. The business questions that Big Data can answer are usually divided into three categories: 1) descriptive analysis, 2) classification and 3) prediction. The aim of descriptive analysis is to prepare a data set comprehensively and to support relevant stakeholders in understanding the data. Data visualisations are frequently used in this context. Descriptive analysis helps to make connections visible and to reveal hidden patterns. It allows us to look at patterns, while the actual pattern recognition is left to the human viewer. The categories classification and prediction differ in that either a case is assigned to a category or a numerical value is estimated. Typical classification applications are, for example, grouping customers by their probability of changing supplier, or sorting trains into those with high and low failure probability. Prediction, on the other hand, involves estimating a numerical value, such as the future market share of rail transport or the probable demand on a particular route. Both approaches can be applied to future developments as well as to the analysis of past values. Depending on the context in which classification and prediction are used, a distinction is made between diagnostic, predictive, prescriptive and preventive analysis.
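To make the distinction concrete, the following minimal Python sketch (with invented data and column names, not DB Cargo’s actual systems) shows how the same table can feed a classification task (sorting trains into high and low failure probability) and a prediction task (estimating a numerical demand value):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    # Invented example data: one row per train journey.
    trains = pd.DataFrame({
        "gross_tonnes":   [1200, 800, 1600, 950, 1400, 700],
        "distance_km":    [420, 150, 610, 300, 520, 90],
        "loco_age_years": [12, 4, 25, 8, 19, 3],
        "failed":         [0, 0, 1, 0, 1, 0],                # category to be assigned
        "demand_tonnes":  [900, 600, 1300, 700, 1100, 500],  # numerical value to be estimated
    })
    features = trains[["gross_tonnes", "distance_km", "loco_age_years"]]

    # Classification: assign each journey to a category (failure yes/no).
    clf = RandomForestClassifier(random_state=0).fit(features, trains["failed"])

    # Prediction: estimate a numerical value (expected demand).
    reg = RandomForestRegressor(random_state=0).fit(features, trains["demand_tonnes"])

    new_journey = pd.DataFrame([[1000, 400, 15]], columns=features.columns)
    print(clf.predict(new_journey), reg.predict(new_journey))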
Matthias: Could you share some real-world examples from DB Cargo with us?
Hannah: Concrete applications can be identified quickly for DB Cargo. For example, reports on the operational situation, which are common in every company, fall into the area of descriptive analysis. However, even in this simple field of application, clear developments related to Big Data can be observed. On the one hand, the possibility of combining ever more data sources offers the opportunity to convey an increasingly comprehensive picture to decision-makers through reporting. The resulting increase in the complexity of the information to be conveyed, however, simultaneously leads to growing requirements for processing and visualisation. On the other hand, the technical possibilities also increase the demand for alternative forms of reporting. Cockpits, also known as dashboards, are a particular trend. They can provide the information online and make it accessible to field workers via a mobile app. DB Cargo is working on the development of several of these applications.
In addition to descriptive evaluation, numerous applications were identified at DB Cargo, in particular in the areas of diagnosis and forecasting. For example, the analysis of the impact of construction sites on transport quality can be assigned to diagnosis. Forecasts are used, for example, with regard to the future development of rail’s share of the freight transport market, capacity planning, or the identification of future bottlenecks in the single-wagon network.
Matthias: In which form must the data be available for optimal evaluation?
Hannah: When working with data, it is important to keep in mind that most computer data is stored in one form: tables. In these tables, the individual cases under consideration are usually in the rows and the characteristics in the columns. In such a table, for example, train journeys may be in the rows, while the columns contain the departure and destination stations, the series of the traction unit, and the punctuality or delay of the train. However, the data are often not available in such tables. Examples of such non-tabular data are texts or log files. In these cases, the data must be prepared before the analysis, for example by using text-processing algorithms. Closely linked to the storage of data in tables is the relational data model, which makes it possible to establish relationships between individual tables. The basis for this is the use of unique keys. A key can be any piece of data that occurs in only a single row of a table. Identifying these keys is an essential step in combining data from different sources. In the above example, a second table could be provided in which each traction unit series is listed with its maximum speed and failure probability. A top speed and a failure probability of the locomotive can then be unambiguously assigned to each train journey via the series. Meanwhile, new providers – such as Hunk – promise a schema-on-the-fly principle that works even without a previously defined relational model. This partially relieves the user of establishing the relationships between the tables and is more flexible than static relationships defined in advance. However, even in this on-the-fly procedure, it must still be possible at the time of the analysis to connect the data of the individual tables as described above. Summing up, one of our core tasks is to identify relevant data, prepare them according to our analytical needs, and find ways to combine data sets from different sources by finding the right keys.
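As a small illustration of the relational idea, here is a Python sketch with invented tables: the traction unit series serves as the key that links each train journey to the attributes of its locomotive series.

    import pandas as pd

    # Invented example tables; the column "loco_series" acts as the key.
    journeys = pd.DataFrame({
        "train_id":    ["T1", "T2", "T3"],
        "origin":      ["Hamburg", "Mannheim", "Duisburg"],
        "destination": ["Munich", "Hamburg", "Basel"],
        "loco_series": ["185", "152", "185"],
        "delay_min":   [4, 0, 17],
    })
    series_info = pd.DataFrame({
        "loco_series":   ["185", "152"],
        "max_speed_kmh": [140, 140],
        "failure_prob":  [0.02, 0.05],
    })

    # Join on the key so each journey receives the maximum speed and
    # failure probability of its locomotive series.
    enriched = journeys.merge(series_info, on="loco_series", how="left")
    print(enriched)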
Matthias: What challenges does DB Cargo have to deal with in order to extract the right answers from the data accumulated over recent years?
Hannah: Particular attention must also be paid to the nature of the data. Are they numbers or categories? Do the numbers imply a ranking? Or is there a text that needs to be evaluated? All of this influences the possibilities of the evaluation. Time-series data are a special case; dealing with them requires additional considerations and decisions. For which period are the data available? At what frequency are they collected? Are all the data to be used for the analysis of a time series? If yes, do the measurement periods fit together? And if not, how can the time-series data be combined with the static characteristics – if possible without losing any relevant information? All these questions can only be answered case by case and in the context of the specific business question. The answers must always be documented in a data model.
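The following small Python sketch (invented sensor data) illustrates two of these decisions: bringing a 15-minute sensor series to a daily frequency and then combining it with a static characteristic of the locomotive.

    import pandas as pd

    # Invented example: a temperature sensor on locomotive "L42", read every 15 minutes.
    sensor = pd.DataFrame({
        "timestamp": pd.date_range("2017-03-01", periods=8, freq="15min"),
        "loco_id":   ["L42"] * 8,
        "temp_c":    [61, 63, 64, 70, 72, 71, 69, 68],
    })

    # Bring the readings to a daily frequency (here: the daily maximum).
    daily = (sensor.set_index("timestamp")
                   .groupby("loco_id")["temp_c"]
                   .resample("D")
                   .max()
                   .reset_index())

    # Combine the time-series aggregate with a static characteristic.
    static = pd.DataFrame({"loco_id": ["L42"], "loco_series": ["185"]})
    print(daily.merge(static, on="loco_id"))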
Matthias: So how do you find the right data model for answering the questions?
Hannah: When developing the data model, two further questions are particularly relevant: how many cases and how many characteristics should be considered? The answer to the first question is very simple: more is better. The answer to the second question is somewhat more difficult, as additional characteristics yield diminishing returns. It is therefore necessary to make a skilful selection. The basis for this is good management of the data and an explorative analysis of the data set. Finally, there is also the possibility of feature engineering, which means calculating new features from the available ones in order to reduce the number of characteristics. For a freight train journey, for example, the gross tonne-kilometre indicator combines information about the quantity transported and the distance travelled – however, at the price that it is no longer clear whether a heavy load was transported over a fairly short distance or a light load over a long one. Weighing these costs and benefits against the background of the business question is therefore an essential part of all considerations about combining characteristics. In addition to the development of the data model, the question of data availability also arises. Are the data internal? Or do they come from external sources? Do they have to be procured first – whether through measurements or by requesting them from a neighbouring department? Is it possible to procure them at all? Are the data collected only once, or does the data record have to be updated on an ongoing basis? And are restrictions to be observed, such as data protection requirements when working with personal data? It is also important that data records used for training algorithms contain information about the characteristic to be determined. In order to train an algorithm to classify trains into those with high and low failure probability, data sets are required that contain both the characteristics to be used for classification and the information whether the train has failed or not.
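A minimal Python sketch of the gross tonne-kilometre example (with invented figures) shows both the reduction in the number of characteristics and the information that is lost in the process:

    import pandas as pd

    # Invented example: a heavy load over a short distance and a light load over a long one.
    journeys = pd.DataFrame({
        "gross_tonnes": [1200, 300],
        "distance_km":  [100, 400],
    })

    # Feature engineering: combine two characteristics into one.
    journeys["gross_tonne_km"] = journeys["gross_tonnes"] * journeys["distance_km"]

    # Both journeys now show 120000 gross tonne-kilometres; the combined feature
    # no longer reveals whether the load was heavy and the distance short, or vice versa.
    print(journeys[["gross_tonne_km"]])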
Matthias: Are the DB Cargo data already available in the quality necessary to populate the data model correctly?
Hannah: Even if all data are available and summarised according to the data model, they are almost never directly usable. The reason for this is that real-world records always contain missing values, outliers and nonsensical data rows. The root cause can be technical errors in the collection of the data as well as incorrect input by humans during collection or processing. It is therefore often said in specialist circles that 50 to 80, or even 90, percent of the work with data is not the analysis but the meaningful consolidation and clean-up of the data. Today, the data still have to be prepared for each report. In the long term, the data quality is to be improved iteratively at the source.
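As an illustration (invented data, not a DB Cargo data set), a few typical clean-up steps in Python might look like this:

    import pandas as pd

    # Invented raw data with a missing value, a nonsensical sentinel and an outlier.
    raw = pd.DataFrame({
        "train_id":  ["T1", "T2", "T3", "T4"],
        "delay_min": [4, None, -999, 12],
        "tonnes":    [1200, 950, 800, 1_000_000],
    })

    clean = raw.copy()
    clean.loc[clean["delay_min"] < 0, "delay_min"] = None  # treat nonsensical values as missing
    clean = clean.dropna(subset=["delay_min"])             # drop rows with missing delays
    clean = clean[clean["tonnes"] < 10_000]                # remove implausible outliers
    print(clean)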
Matthias: How do you make the data accessible inside your organisation and which tools are deployed to manage the data?
Hannah: Our approach combines a data lake and a more traditional data warehouse. Data in our data lake are unstructured; the benefit is that data scientists and programmers can access original data from different sources easily and in an unfiltered format. In contrast to the data lake, our data warehouses structure and filter the data for reports and analytics. The data warehouse allows very fast, clearly governed access to standardised and well-prepared information for many users. The framework for deciding on administration, i.e. which storage and hardware will be used, is based on the data model. Technical aspects are the volume and nature of the data. In addition, the requirements for processing speed, the technical availability of the hardware, and the retention period of the data arising from the respective context must also be considered. Likewise, technical requirements follow from whether the system is to be used only for the preparation of a one-time analysis or for the continuous provision of information, for example via a cockpit, and from whether a data set is used as a one-off snapshot or is regularly reloaded via (largely) automated procedures. Such interfaces, which follow the ETL (extract, transform, load) pattern, must be developed anew for each application into which new data are integrated. Finally, access concepts are often derived from organisational and data protection considerations. Common systems for managing and storing the data range from MS Excel tables and MS Access databases on desktop computers or network drives, via databases such as SQL Server, to distributed computing systems such as Hadoop clusters or Apache Spark.
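A deliberately simple Python sketch of such an ETL interface (file names, table and columns are invented; a real interface would target the respective warehouse system):

    import sqlite3
    import pandas as pd

    def etl(source_csv: str, target_db: str) -> None:
        # Extract: read the raw data from the source system.
        raw = pd.read_csv(source_csv)
        # Transform: clean and reshape the data so it fits the target data model.
        raw["departure"] = pd.to_datetime(raw["departure"])
        transformed = raw.dropna(subset=["train_id"])
        # Load: write the prepared data into the analytical store.
        with sqlite3.connect(target_db) as conn:
            transformed.to_sql("train_journeys", conn, if_exists="append", index=False)

    # In regular operation this function would be scheduled to run automatically:
    # etl("daily_journeys.csv", "warehouse.db")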
Matthias: What kind of skills and capabilities are needed for data experts at DB Cargo?
Hannah: Our data scientists, data engineers and programmers work on new analysis projects, access data in real time, and investigate the data with regard to previously unexplored questions. Depending on their role in the project, employees need different skills, but in general they should bring knowledge in the fields of statistics, databases, programming and communication, and last but not least domain expertise in rail and road transportation. As very few people possess deep expertise in all of the above-mentioned fields at the same time, the ability to collaborate successfully in interdisciplinary teams is very important.
Matthias: Please describe the daily tasks of an interdisciplinary team of Big Data experts at DB Cargo.
Hannah: As I mentioned before, due to the wide range of requirements regarding knowledge of statistical procedures and business contexts, the methods are usually applied not by individuals but by teams of experts who collectively have the necessary knowledge. In many cases, there is also a lively exchange with experts from other organisations in order to draw on as many different methods and on the creative potential of as many individuals and teams as possible. Nowadays, companies often make data and/or methods publicly available in order to enable a better exchange with the external community. Examples are the open-source software centre of Netflix or the community platform Kaggle. Another tool for exchanging information with the external community are hackathons, which are now organised by a number of companies.
Matthias: Are you planning to provide DB Cargo information as open data, and what business opportunities could arise for the railway industry?
Hannah: The open data portal of Deutsche Bahn publishes passenger and freight transport data for external parties and is partnering with the mCloud initiative of the German Ministry of Transport. DB Mindbox in Berlin organises events during which the attendees receive data from the company and develop new, creative solutions, the best of which are usually awarded prizes. As a rule, both the company’s own employees and external experts such as freelancers or students can participate. Deutsche Bahn has already had good experiences with this format, and DB Cargo plans to take part in it in the future. With this initiative, DB Cargo wants to find new use cases, attract new customer groups and generate new business.
Matthias: Which one of the big data initiatives is the most successful at DB Cargo?
Hannah: Cockpits and dashboards are used more and more. Users can navigate through the prepared data and assemble the information relevant to them. From the one-page cockpit, they can access the underlying data directly. Thanks to the responsive design of the dashboards, the information is available online and on mobile devices via an app, is always up to date, and can be presented in a modern or classical report design. Frequently used software packages for the development of such cockpits are Qlik, RShiny and Business Objects, as well as Oracle’s Business Intelligence Suite. In other cases, the developed algorithms are integrated directly into existing IT processes or web pages. Finally, there is also the possibility of developing specialised applications to answer business questions; for example, C#, Python, Visual Basic or Unity are used for this.
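As an illustration of a small specialised application in Python, the following sketch builds a one-page cockpit with the Plotly Dash library (which is not one of the packages named above; the data and figures are invented):

    import pandas as pd
    import plotly.express as px
    from dash import Dash, dcc, html

    # Invented monthly punctuality figures for the cockpit.
    data = pd.DataFrame({
        "month":           ["Jan", "Feb", "Mar", "Apr"],
        "punctuality_pct": [92, 94, 91, 93],
    })

    app = Dash(__name__)
    app.layout = html.Div([
        html.H3("Punctuality cockpit (illustrative data)"),
        dcc.Graph(figure=px.line(data, x="month", y="punctuality_pct")),
    ])

    if __name__ == "__main__":
        app.run(debug=True)  # serves the dashboard in the browser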
Matthias: Thank you very much for your time and the interesting insights of the transformation process in DB Cargo.
Hannah is a dedicated manager for performance analytics at DB Cargo AG in Frankfurt am Main, Germany. As a data scientist with a doctorate in knowledge and learning for international companies, she is driving the digitisation of DB Cargo.