An intern’s perspective on the use of Data Science in the MT workplace

Author: Jack Boylan, Data Science Intern at Iconic (March – September) through Dublin City University

Iconic has happily hosted several interns over the past number of years. Iconic grew out of university research, and its strong ties with academia have fostered many of these relationships. Our most recent intern, Jack Boylan, was our first Data Science intern. In fact, Jack is part of the first class studying for a new undergraduate degree in Data Science at DCU. We were happy to explore how a new Data Science perspective could benefit our work with neural machine translation. Here he shares his experience of working with us in the midst of the pandemic. Enjoy!

I began studying Data Science as an undergraduate at DCU in 2017. In 2019, we had numerous presentations given by experts in fields related to Data Science – these included Finance, Sports Performance and Natural Language Processing. One class showing the power of Neural Machine Translation – led by Prof Andy Way – caught my interest, so when the opportunity for an INTRA placement at Iconic appeared, I knew I had to apply.

I began my placement at Iconic on Monday 2nd March. Looking forward to meeting new people and working in a different environment, I paid almost no attention to news of a virus outbreak happening far away. Little did I know that this virus would soon change everyone’s lives for the foreseeable future. Shortly thereafter, while I was busy learning everybody’s names, reinstalling Ubuntu because my WiFi wasn’t connecting, and figuring out who Thor, Megatron and Loki were (servers!), I learned that the country was preparing to enter total lockdown.

On 27th March, the government imposed a stay-at-home order, banning all non-essential travel and contact with people outside one’s home (including family and partners). This meant that all work that could be done from home should be done from home. It felt strange heading back home to carry out my INTRA placement, but the team at Iconic took it in their stride.

Every morning, the Development and MT team met to discuss what they had been working on the previous day and what they would be doing that day. For me, this was a great opportunity to meet some of Iconic’s employees from across the world, who were already working remotely. I got the chance to hear what projects others were working on, any issues they had and how they went about solving them. Everyone was exceptionally friendly, despite the big changes we were all experiencing.

During my time at Iconic, I have built upon skills learned while completing my Data Science degree. My ability to code in multiple languages, primarily Python and Java, has improved significantly, and I have also picked up useful lessons in Perl and Bash. Being proficient in these areas is crucial for data collection, data processing, machine learning and visualisation, all of which are integral to Data Science.

An important issue facing the world today is that of efficient and effective communication. The exchange of ideas and information between countries has driven dramatic societal and technological improvements in the past. In the age of COVID-19, communication is essential to inform the public, ensure regulations are being met, and aid in the creation of a vaccine. To this end, fields such as Data Science and Machine Learning are tackling this problem by collecting good-quality data and developing high-performing machine translation systems that allow information to be exchanged seamlessly from anywhere in the world. Iconic is doing its part by working on projects like PRINCIPLE, an effort to improve the quality of translation services for under-resourced languages such as Irish, Croatian, Icelandic and Norwegian.

Working on the PRINCIPLE Project, I performed code reviews for scripts involving corpus preparation, engine training and testing. These reviews provided an opportunity for me to understand how an NMT engine can run on a server for multiple clients, and how various corpora from different sources can be combined to train an engine, particularly for languages such as Irish, which do not have a huge amount of training data widely available.

With the help of other members of the Iconic team, I also found a way to integrate TensorBoard functionality into the training of engines using Fairseq. This feature allows MT scientists to quickly and easily visualise important metrics that represent the current state of a model in real time, without stopping training or reading through pages of logs. It can also help to spot overfitting early, or to identify any issues that may hinder optimal engine performance. Some platforms, such as TensorFlow, come with built-in visualisation options (TensorBoard), but others – like Fairseq – require a workaround.
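To give a rough idea of what such a workaround can look like, here is a minimal sketch in Python. It is not the integration used at Iconic. It assumes a hypothetical log format in which each metrics line ends with a JSON object containing a `num_updates` field, and the names `parse_log_line`, `forward_to_writer` and `RecordingWriter` are made up for this illustration. The idea is simply to pull numeric metrics out of the training log and hand them to any object that offers an `add_scalar(tag, value, step)` method.

```python
import json
import re

# Hypothetical log line format, assumed for illustration only:
#   2020-03-02 10:00:00 | INFO | train | {"epoch": 1, "num_updates": 100, "loss": "7.5"}
LOG_LINE = re.compile(r"\|\s*(train|valid)\s*\|\s*(\{.*\})\s*$")


def parse_log_line(line):
    """Return (split, step, metrics) for a metrics line, or None for any other line."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    if "num_updates" not in payload:
        return None
    step = int(payload.pop("num_updates"))
    metrics = {}
    for name, value in payload.items():
        try:
            metrics[name] = float(value)
        except (TypeError, ValueError):
            continue  # skip fields that are not numeric
    return match.group(1), step, metrics


def forward_to_writer(lines, writer):
    """Send every numeric metric to writer.add_scalar and return how many were sent."""
    sent = 0
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue
        split, step, metrics = parsed
        for name, value in metrics.items():
            # Tags such as "train/loss" and "valid/loss" keep the two curves apart.
            writer.add_scalar(f"{split}/{name}", value, step)
            sent += 1
    return sent


class RecordingWriter:
    """Stand-in for a TensorBoard writer: it only remembers what it was given."""

    def __init__(self):
        self.scalars = []

    def add_scalar(self, tag, value, step):
        self.scalars.append((tag, value, step))


if __name__ == "__main__":
    demo_lines = [
        "loading data...",
        '2020-03-02 10:00:00 | INFO | train | {"epoch": 1, "num_updates": 100, "loss": "7.5"}',
        '2020-03-02 10:05:00 | INFO | valid | {"epoch": 1, "num_updates": 100, "loss": "8.0"}',
    ]
    writer = RecordingWriter()
    forward_to_writer(demo_lines, writer)
    for record in writer.scalars:
        print(record)
```

If PyTorch is installed, a `torch.utils.tensorboard.SummaryWriter` could be passed in place of `RecordingWriter`, since it also provides `add_scalar(tag, scalar_value, global_step)`. Plotting the training and validation loss side by side in this way is what makes overfitting visible: the validation curve starts to rise while the training curve keeps falling.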

Exposure to Natural Language Processing (NLP) at Iconic has given me invaluable skills for my final year, particularly our Semester 2 module on NLP. Working with machine learning platforms such as TensorFlow and Fairseq has shown me what is possible in this area, and has given me the knowledge and confidence to make use of these tools in my final year project, which I hope to base on Machine Translation.

I would like to take this opportunity to thank everyone at Iconic for an excellent experience! With other parts of the world put on hold, I was fortunate enough to be able to continue learning and working with a fantastic group for the past six months.