How do online media companies get ahead with AI?
An online media company we recently worked with needed an automated tagging solution: more consistent content management, less time spent by authors and editors classifying articles, better search results bringing visitors to their websites, and ultimately more readers finding the articles they want to read. The challenge was to develop a machine-learning model that assigns the six most relevant labels to each article from a pool of several thousand.

Ideally, a task like this requires at least 100 examples per label. The final dataset we were given contained about 100,000 articles; nevertheless, one challenge was that most labels occurred at most 50 times. Another was noise: many articles were tagged inappropriately, which made it harder for the model to learn the correct associations. In such cases we can suggest which articles the client should review. Alternatively, we offer a data annotation service, in which we manually go through a dataset, check whether the labels are correct, and amend the ones that aren't.
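Surfacing rare labels and the articles that use them is straightforward to sketch. The snippet below is illustrative only, with a toy dataset and a hypothetical support threshold (in the project itself the threshold was 100 examples per label):

```python
from collections import Counter

# Hypothetical toy dataset: each article maps to its list of labels.
articles = {
    "a1": ["politics", "economy"],
    "a2": ["politics"],
    "a3": ["sports", "football"],
    "a4": ["football"],
}

# Count how often each label occurs across the corpus.
label_counts = Counter(label for labels in articles.values() for label in labels)

# Flag labels below a minimum support threshold (100 in the real project;
# 2 here so the toy data produces something).
MIN_EXAMPLES = 2
rare_labels = {label for label, n in label_counts.items() if n < MIN_EXAMPLES}

# Articles carrying a rare label are candidates for editorial review.
to_review = [aid for aid, labels in articles.items()
             if any(label in rare_labels for label in labels)]
```

In practice the same pass also reveals near-duplicate labels that are candidates for merging.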

The dataset was structured as follows: for each article we were provided with the title, the abstract, and the body, along with its labels and its source (the media company runs several online portals from which they sent us articles). There are many ways to approach such a problem. Since we were asked to return at most six labels per article, we set up separate streams for the title, the abstract, and the body, where each stream returns the two most probable labels for its input text. We applied a recent open-source model from Google to transform the text into a form the model can handle, a process called embedding. We then trained a deep neural network to map the embedding vector (i.e. the transformed textual information) to the nearly 5,000 available labels. In short, the model assigns a probability to each label, and we take the two most probable labels per stream as explained above.
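The three-stream, top-2-per-stream design can be sketched as follows. This is not our production code: the embedding model is replaced by a deterministic stand-in, and the classifier head is an untrained linear-plus-softmax layer; the point is the shape of the pipeline, not the weights.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
NUM_LABELS = 5000   # size of the label pool in the project
EMB_DIM = 512       # a typical sentence-embedding dimensionality (assumption)

def embed(text: str) -> np.ndarray:
    """Stand-in for the sentence-embedding model: a deterministic dummy vector."""
    seed = zlib.crc32(text.encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(EMB_DIM)

# Stand-in for the trained classifier head: one linear layer feeding a softmax.
W = rng.standard_normal((EMB_DIM, NUM_LABELS)) * 0.01

def label_probs(text: str) -> np.ndarray:
    logits = embed(text) @ W
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def top2(text: str) -> list:
    """Indices of the two most probable labels for one stream."""
    p = label_probs(text)
    return list(np.argsort(p)[-2:][::-1])

def tag_article(title: str, abstract: str, body: str) -> list:
    # One stream per field, two labels each; de-duplicate, preserving order.
    candidates = top2(title) + top2(abstract) + top2(body)
    seen, labels = set(), []
    for lab in candidates:
        if lab not in seen:
            seen.add(lab)
            labels.append(lab)
    return labels[:6]  # at most six labels per article
```

Because the three streams may agree, the final list can contain fewer than six labels, which matched the client's "at most six" requirement.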

Given that articles in the dataset carried only about two labels each on average, we evaluated the model both by checking whether the target labels were among those the model returned and by manually checking whether the returned labels were appropriate. Based on this, we could further advise the client on which articles to review, so that we can build more accurate models, and on which labels to merge with others for a more consistent label set and better site navigation.
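The automated part of that evaluation, checking whether the target labels appear among the returned ones, amounts to a per-article recall averaged over the dataset. A minimal sketch, with hypothetical toy data:

```python
def hit_rate(predictions: dict, ground_truth: dict) -> float:
    """Average fraction of an article's ground-truth labels that appear
    among the labels the model returned for it."""
    scores = []
    for aid, true_labels in ground_truth.items():
        if not true_labels:
            continue  # skip unlabeled articles
        predicted = set(predictions.get(aid, []))
        hits = sum(1 for lab in true_labels if lab in predicted)
        scores.append(hits / len(true_labels))
    return sum(scores) / len(scores)

# Toy example: two articles with two target labels each.
truth = {"a1": ["politics", "economy"], "a2": ["sports", "football"]}
preds = {"a1": ["politics", "economy", "world"], "a2": ["sports", "tennis"]}
score = hit_rate(preds, truth)  # (2/2 + 1/2) / 2 = 0.75
```

Articles scoring zero under this metric are exactly the ones worth flagging for manual review, since either the model or the original tags are wrong.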

The final solution assigns accurate labels to articles in seconds, whether they are one paragraph or several pages long. This improves the consistency and quality of article tagging, resulting in better navigation on the publisher's websites, better search results, and fewer headaches for authors and editors. The solution can be customized to the client's requirements: we can ship a containerized API in a Docker image that your IT or data engineering department can integrate into its operations, or a standalone web application with a graphical user interface for your editors and authors.

If you are interested in working with us, please complete this form, or send an email to hello@neuralmachines.co.uk.
