One of the most essential aspects of any recommender system is personalization: how acceptable the recommendations are from the user's perspective. However, in many real-world applications, there are other stakeholders whose needs and interests should be taken into account. In this work, we define the problem of multistakeholder recommendation and focus on algorithms for a special case in which the recommender system itself is also a stakeholder. In addition, we explore the idea of incrementally incorporating system-level objectives into the recommender over time, to address the limitations of existing optimization techniques that optimize each individual user's list in isolation.
With billions of users, social networks have become the go-to platform through which news media outlets diffuse information. Lately, certain entities (users and/or organizations) have been active in generating misinformation in order to attract users to their websites, generate online advertisement revenue, increase followers, create political instability, etc. With the growing presence of misinformation on social networks, it is becoming increasingly difficult not only to distinguish information from misinformation, but also to identify the source(s) of misinformation propagation. This paper reviews my doctoral research on identifying the source(s) of misinformation propagation. In particular, I utilize the mathematical concept of Identifying Codes to uniquely identify users who become active in propagating misinformation. I formally present the computation of the Minimum Identifying Code Set (MICS) as a novel variation of the traditional Graph Coloring problem, and I present an Integer Linear Program for computing the MICS. I apply the technique to various anonymized Facebook network datasets and show the effectiveness of the approach.
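The identifying-code property underlying MICS can be illustrated with a small brute-force sketch: a code is a vertex subset that gives every vertex a non-empty, unique "signature" (the intersection of its closed neighborhood with the code). The exhaustive search, graph, and function names below are illustrative assumptions for a toy example, not the paper's ILP formulation, which scales far better.

```python
from itertools import combinations

def closed_neighborhood(adj, v):
    # N[v]: the vertex together with its neighbors
    return frozenset(adj[v]) | {v}

def is_identifying_code(adj, code):
    # Every vertex must have a non-empty signature N[v] & code,
    # and no two vertices may share the same signature.
    sigs = {}
    for v in adj:
        sig = closed_neighborhood(adj, v) & code
        if not sig or sig in sigs.values():
            return False
        sigs[v] = sig
    return True

def minimum_identifying_code(adj):
    # Brute force: try all vertex subsets in order of increasing size.
    verts = list(adj)
    for k in range(1, len(verts) + 1):
        for cand in combinations(verts, k):
            if is_identifying_code(adj, frozenset(cand)):
                return set(cand)
    return None  # twin vertices exist, so no identifying code exists
```

For example, on the 4-vertex path 0-1-2-3, no 2-vertex subset separates all signatures, so the minimum identifying code has size 3.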
The scientific literature is a large information network linking various actors (laboratories, companies, institutions, etc.). The vast amount of data generated by this network constitutes a dynamic heterogeneous attributed network (HAN), in which new information is constantly produced and from which it is increasingly difficult to extract content of interest. In this article, I present the first works of my thesis, carried out in partnership with an industrial company, Digital Scientific Research Technology. The latter offers a scientific watch tool, Peerus, addressing various issues, such as the real-time recommendation of newly published papers or the search for active experts to start new collaborations. To tackle this diversity of applications, a common approach consists in learning representations of the nodes and attributes of this HAN and using them as features for a variety of recommendation tasks. However, most works on attributed network embedding pay too little attention to textual attributes and do not fully take advantage of recent natural language processing techniques. Moreover, proposed methods that jointly learn node and document representations do not provide a way to effectively infer representations for new documents for which network information is missing, which is crucial in real-time recommender systems. Finally, the interplay between textual and graph data in text-attributed heterogeneous networks remains an open research direction.
Complex networks are a powerful paradigm to model complex systems. Specific network models, e.g., multilayer networks, temporal networks, and signed networks, enrich the standard network representation with additional information to better capture real-world phenomena. Despite the keen interest in a variety of problems, algorithms, and analysis methods for these types of networks, the problem of extracting cores and dense structures still has unexplored facets.
In this work, we advance the state of the art by introducing novel definitions and algorithms for the extraction of dense structures, mainly cores, from complex networks. First, we define core decomposition in multilayer networks together with a series of applications built on top of it, i.e., the extraction of only the maximal multilayer cores, the densest subgraph in multilayer networks, the speed-up of the extraction of frequent cross-graph quasi-cliques, and the generalization of community search to the multilayer setting. Then, we introduce the concept of core decomposition in temporal networks; in this case too, we are interested in extracting only the maximal temporal cores. Finally, in the context of discovering polarization in large-scale online data, we study the problem of identifying polarized communities in signed networks.
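As a point of reference, the single-layer core decomposition that the multilayer and temporal definitions generalize can be sketched via the classic peeling procedure: repeatedly remove a minimum-degree vertex, and record the running maximum of removal degrees as the core number. The code below is an illustrative sketch under that textbook formulation, not the optimized algorithms evaluated in this work.

```python
def core_decomposition(adj):
    # adj: dict mapping each vertex to a list of its neighbors.
    # Returns the core number of every vertex via min-degree peeling.
    deg = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    core = {}
    k = 0
    while alive:
        v = min(alive, key=deg.get)   # vertex of minimum residual degree
        k = max(k, deg[v])            # core numbers are non-decreasing
        core[v] = k
        alive.remove(v)
        for u in adj[v]:              # peeling v lowers its neighbors' degrees
            if u in alive:
                deg[u] -= 1
    return core
```

For instance, in a triangle with one pendant vertex attached, the triangle vertices get core number 2 and the pendant gets core number 1.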
The proposed methodologies are evaluated on a large variety of real-world networks against naïve approaches, non-trivial baselines, and competing methods. In all cases, they show effectiveness, efficiency, and scalability. Moreover, we showcase the usefulness of our definitions in concrete applications and case studies, i.e., the temporal analysis of contact networks, and the identification of polarization in debate networks.
The goal of this work is to systematically extract information from hacker forums, whose content would generally be described as unstructured: the text of a post does not necessarily follow any writing rules. By contrast, many security initiatives and commercial entities harness readily available public information, but they seem to focus on structured sources. Here, we focus on the problem of analyzing text content in security forums. A key novelty is that we use user profiles and contextual features, along with transfer learning and embedding spaces, to identify and refine information that trivial analysis of a security forum could not surface. We collect a wealth of data from five different security forums. The contribution of our work is twofold: (a) we develop a method to automatically identify malicious IP addresses mentioned in the forums, and (b) we propose a systematic method to identify and classify user-specified threads of interest into four categories. We further showcase how this information can inform knowledge extraction from the forums. As cyber-wars become more intense, early access to useful information becomes ever more imperative to remove the hackers' first-mover advantage, and our work is a solid step in this direction.
The study of information-sharing cascades has been a constant endeavor since the emergence of social networks. Internet memes, which mostly consist of catchphrases, viral images, or short videos shared over a social network, are notorious for attracting users' attention and spreading through the web quickly. Misinformation propagators latch their message onto a meme to maximize the influence and spread of the false news. As a result, the diffusion of misleading content has become a force to be reckoned with in the field of information warfare, as foreign actors seek to change opinions, manipulate ideologies, and create conflicts. In this study, we analyze the rapid dissemination of misinformation, a.k.a. misinformation cascades, focusing on cascade temporal behavior and multi-cascade influence relationships. The Twitter data used in this study contains only information associated with the Russian Internet Research Agency (IRA) and the Iranian Cyber Army (ICA). Our study focuses on analyzing temporal patterns of information dynamics created by these foreign actors for the sole purpose of spreading misinformation. We explore dividing temporal cascades into phases, where each phase differs from the previous one in the number and characteristics of its information bursts. For this preliminary study, we focus on the #Trump and #USA hashtags used by the ICA. By studying the dynamics behind each phase, the forces behind the transition from one phase to another, and the influence relationships between cascades and their phases, we expect to shed some light on the timely question of how to identify and protect society from information manipulation campaigns.
The rise of big data frameworks has given website administrators the ability to track user clickstream data in more detail than ever before. These clickstreams can represent the user's intent and purpose in visiting the site. While existing work has explored methods for predicting future user actions, these methods are limited: they focus solely on one task at a time, ignore the graph structure inherent in clickstreams, or model the conversion of the entire clickstream session, overlooking complexities such as multiple conversions in a single session. In this work, we formulate the novel problem of simultaneously predicting multiple future user actions given a user's clickstream history. We argue that clickstream data contains an important signal for predicting future user actions. To tackle this new problem, we propose a novel method called ClickGraph, a recurrent neural network that encodes the graph structure of user click trajectories in the learned representations of web pages. We conduct experiments on a real-world dataset and demonstrate that this multitask learning approach is effective at improving the prediction of form-fill conversions over strong baselines. In particular, we demonstrate that the ClickGraph model is effective at reducing false positive rates, increasing F1 scores, and improving recall.
Advancements in technology have enabled society to become increasingly globalised, both with regard to physical migration and through the use of information and communication technologies (ICTs) to maintain transnational ties. In particular, transient migration in the area of higher education has seen an increasing number of students migrate overseas for the purpose of their studies. However, research has shown that these international students are often disconnected from their host culture and society, with local-international friendships proving to be uncommon (Baldassar & McKenzie, 2016). Based on interviews with over 200 international students in Australia, Sawir et al. (2008) revealed that two-thirds of them had suffered or were suffering from loneliness, of which Sawir et al. identified three kinds: personal loneliness from loss of contact with family, social loneliness from the loss of networks, and cultural loneliness from the change in environment. This raises the question of how these students may be better supported, and the international student experience improved upon.
Social media has often been positioned as a tool through which users become connected and communities are formed. One of the most popular social media platforms, Facebook, has become an established part of many lives in modern society. Media and culture have always been interconnected; however, the dominance of the Web in everyday life means that the role media plays in cross-cultural communication is more significant than ever and must be researched for a better understanding of this phenomenon. While current research has examined the issues relating to the construction of online identities for communication within established social networks, new issues have emerged in relation to collapsed contexts and imagined audiences in today's globalised world, especially as multiple cultures are introduced onto the same platform as a result of migration or relocation. Insufficient research has been done into the influence of technology on transient migration and its potential to support cross-cultural communication. The question, then, is how exactly social media may help transient migrants overcome issues of isolation and loneliness, and provide them with support during their time abroad.
This study looks to address the issue of student isolation within host societies by examining how social media may provide spaces for support, self-expression and cross-cultural communication. Through a visual internet ethnographic study, it examines the profile pages of international students on Facebook to better understand their positions in these home and host societies. This research study is supplemented by semi-structured interviews for a thorough examination of international students’ use of social media.
As we examine how the Web has developed and changed over the last 30 years, it must be acknowledged that this change cannot be solely attributed to technological advancements, but is also influenced by the actual human users of the Web who participate in it. Web users utilise online tools to produce content of their own, tailoring their online experiences accordingly. While advancements in technology have created a more globalised society, the globalised users within this society have had their own impact on technology. This research looks to create a more in-depth understanding of the ways in which social networking platforms are used by transient migrants to navigate transnational cultural settings. It is aimed at enabling a deeper understanding of the complex inter-linkages between cultures, engendering new insights into transnational identities. This is essential to address the global nature of today's society and the role of social media platforms in the spaces they create for transient migrants.
There are many online courses and learning materials on the web, so in principle each learner can find and choose the one that suits them best. However, many online courses remain poorly accessible due to the limits of web search engines. The advent of intelligent systems, and online chatbots in particular, has brought improvements in various fields. Educational chatbots improve communication, increase productivity, and simplify learning interactions. This study aims to provide an intelligent educational chatbot with a high level of customization for learners with different needs. This way, they can dynamically find their personalized learning path and customized content without excessive time and effort. This is precisely what e-learning needs, given the enormous amount of material on the web.
The Entity Linking (EL) task is concerned with linking entity mentions in a text collection with their corresponding knowledge-base entries. Despite the progress made in the evaluation of EL systems, much work remains to be done; this Ph.D. research tackles issues concerning EL evaluation. Among these issues, we stress (a) the lack of consensus about the definition of "entity" and the lack of evaluation metrics that allow for different notions of entities, (b) the lack of datasets that allow for cross-language comparison, and (c) the focus on evaluating high-level systems rather than low-level techniques. By addressing these challenges and better understanding the performance of EL systems, our hypothesis is that we can create a more general, more configurable EL framework that can be better adapted to the needs of a particular application. In the early stages of this PhD work, we have identified these problems and begun to address (a) and (b), publishing initial results that constitute a significant step forward in our investigation. However, there are still further challenges that must be addressed before we reach our goal. Our next steps thus involve proposing a more fluid definition of "entity" adaptable to different applications, the definition of quality measures that allow for comparing EL approaches targeting different types of entities, as well as the creation of a customizable EL framework that allows for composing and evaluating individual techniques as appropriate to a particular task.
It is a streaming world: a new generation of Web applications is pushing the Web infrastructure to evolve and process data as soon as they arrive. However, the Web of Data does not yet appeal to the growing number of Web applications that need to tame Data Velocity. To solve these issues, we need to introduce new key abstractions, i.e., streams and events, and investigate how to identify, represent, and process streams and events on the Web. In this paper, we discuss why taming Velocity on the Web of Data matters. We present a Design Science research plan that builds on the state of the art in Stream Reasoning and RDF Stream Processing. Finally, we present our research results on representing and processing streams and events on the Web.
The potentially detrimental effects of cyberbullying have led to the development of numerous automated, data-driven approaches, with an emphasis on classification accuracy. Cyberbullying, as a form of abusive online behavior, although not well-defined, is a repetitive process, i.e., a sequence of aggressive messages sent from a bully to a victim over a period of time with the intent to harm the victim. Existing work has focused on aggression (i.e., using profanity to classify toxic comments independently) as an indicator of cyberbullying, disregarding the repetitive nature of this harassing process. However, raising a cyberbullying alert immediately after an aggressive comment is detected can lead to a high number of false positives. At the same time, three key practical challenges remain unaddressed: (i) detection timeliness, which is necessary to support victims as early as possible; (ii) scalability to the staggering rates at which content is generated in online social networks; and (iii) reliance on high-quality annotations from human experts for training highly accurate supervised classifiers.
To overcome the challenges associated with cyberbullying detection in online social networks, my PhD thesis focuses on a novel formulation of the online classification problem as sequential hypothesis testing that seeks to drastically reduce the number of features used while maintaining high classification accuracy. To reduce the dependency on labeled datasets, I seek to develop efficient semi-supervised methods that extrapolate from a small seed set of expert annotations. Preliminary results are very encouraging, showing significant improvements over the state of the art.
Eye tracking provides an effective window into users' attention, interest, and engagement. While gaze estimation based on a standard camera can be versatile, achieving an accurate, robust, and scalable solution on mobile devices remains challenging. In this talk, I will describe three studies that aim to address these challenges. Specifically, 1) we found that the screen reflection on the user's cornea can be leveraged for gaze estimation, which considerably improves the practicability of indoor eye tracking; 2) we exploited gaze-hand coordination and applied interaction data for implicit calibration while a user naturally interacts with the computer, sparing users a tedious and intrusive calibration procedure; and 3) we proposed training a multi-device, person-specific gaze estimator to accelerate implicit calibration, adapting data from different personal devices to learn the shared mapping from user appearance to eye gaze. Taken together, these studies identify indicative eye-gaze features, alleviate user calibration effort, and thus pave the way for scalable eye tracking in daily use.
Visual attention and eye movements in primates have been widely shown to be guided by a combination of stimulus-dependent or 'bottom-up' cues, as well as task-dependent or 'top-down' cues. Both the bottom-up and top-down aspects of attention and eye movements have been modeled computationally. Yet, it is not until recent work which I will describe that bottom-up models have been strictly put to the test, predicting significantly above chance the eye movement patterns, functional neuroimaging activation patterns, or most recently neural activity in the superior colliculus of human or monkey participants inspecting complex static or dynamic scenes. In recent developments, models that increasingly attempt to capture top-down aspects have been proposed. In one system which I will describe, neuromorphic algorithms of bottom-up visual attention are employed to predict, in a task-independent manner, which elements in a video scene might more strongly attract attention and gaze. These bottom-up predictions have more recently been combined with top-down predictions, which allowed the system to learn from examples (recorded eye movements and actions of humans engaged in 3D video games, including flight combat, driving, first-person, or running a hot-dog stand that serves hungry customers) how to prioritize particular locations of interest given the task. Pushing deeper into real-time, joint online analysis of video and eye movements using neuromorphic models, we have recently been able to predict future gaze locations and intentions of future actions when a player is engaged in a task. In a similar approach where computational models provide a normative gold standard against a particular individual's gaze behavior, machine learning systems have been demonstrated which can predict, from eye movement recordings during as little as only 5 minutes of watching TV, whether a person has ADHD or other neurological disorders. 
Together, these studies suggest that it is possible to build fully computational models that coarsely capture some aspects of both bottom-up and top-down visual attention.
In this paper, we study the effectiveness of personalized persuasive interventions in changing urban travelers' mobility behavior and nudging them towards more sustainable transport choices. More specifically, we embed a set of persuasive design elements in a route planning application and investigate how they affect users' travel choices. The design elements take into consideration the style, intensity, and target of the persuasive interventions, as well as the users' characteristics and the trip purpose. Our results show evidence that our proposed approach motivates users on a personal level to change their mobility behavior and make more sustainable choices. Furthermore, the effects of the persuasive interventions can be increased by personalizing them: considering combinations of intervention styles (in our case, messages and visualizations) and adjusting the intensity of the interventions according to the trip purpose and the transport modes of the routes toward which the user is nudged.
MaaS as a mobility model rests on the idea that the gap between private and public transport systems needs to be bridged, at the city, intercity, national, and supranational levels. The current situation is perceived as problematic due to the fragmented tools and services, often organized in silos, that a traveler must use to undertake a trip. One of the major concerns in designing any platform system like Mobility as a Service is where to start modeling and how to express the notion of the platform system in a language that is understandable to all of its stakeholders. Understandability underpins stakeholders' expectations as to whether a given design will implement the intended platform services, enabling users to actually buy and/or use the platform system for whatever purpose. Building on the economic theories of two-sided markets and mechanism design, we introduce the concept of value nets, extending the Contract Net Protocol. Value net modeling offers a precise abstract representation that provides the detailed informational requirements in a canonical form, and it connects, i.e., implements, the abstract notion of Service-Oriented Architecture characterizing such systems without loss of crucial informational elements.
Looking ahead, in about 20 years, there is likely to be urban air mobility in larger cities across the globe. If economic predictions come true, thousands of air taxi flights will take place daily in capital cities, and not only in megacities. Noise generated by urban flight mobility has been identified as a critical factor in this development. A concept is proposed to help raise the tolerance level for urban air noise among communities as well as individual residents by means of transparency. This concept views residents as stakeholders in urban air mobility and widens the call for continuous noise measurements of vertical take-off and landing operations on an individual-site basis through residents' voluntary on-site data collection, enabled by smartphone-based participatory noise sensing (PNS). In presenting this concept, this discussion paper describes important aspects of the social acceptance of urban air mobility.
Tourist trip recommender systems (RSs) support travelers in identifying the most attractive points of interest (POIs) and combine the POIs along a route for single- or multi-day trips. Most RSs consider only the quality of POIs when searching for the best recommendation. In this work, we introduce a novel approach that also considers the attractiveness of the routes between POIs. For this purpose, we identify a list of important attributes of route attractiveness and explain how to implement our approach using three exemplary attributes. We develop a web application for demonstration purposes and apply it in a small preliminary user study with 16 participants. The results show that the integration of route attractiveness attributes leads most people to choose a more attractive route over the shortest path between two POIs. This paper highlights how tourist trip RSs can support smart tourism. Our work aims to encourage further discussion on collecting and providing environmental data in cities to enable such applications.
Sustainable mobility is one of the main goals of both European and United Nations plans for 2030. The concept of Smart Cities has arisen as a way to achieve this goal by leveraging IoT interconnected devices to collect and analyse large quantities of data. However, several works have pointed out the importance of including the human factor, and in particular citizens, to make sense of the collected data and ensure their engagement along the data value chain. This paper presents the design and implementation of two end-to-end hybrid human-machine workflows for solving two mobility problems: modal split estimation and mapping mobility infrastructure. For modal split, we combine the use of i-Log, an app to collect data and interact with citizens, with reinforcement learning classifiers that continuously improve classification accuracy, aiming at reducing the required interactions from citizens. For mobility infrastructure, we developed a system that uses remote crowdworkers to explore the city looking for Points of Interest, which is more scalable than sending agents into the field. Crowdsourced maps are then fused with existing maps (if available) to create a final map that is then validated in the field by citizens engaged through the i-Log app.
Managing the ever-increasing road traffic congestion caused by enormous vehicular growth is a major concern all over the world. Severe air pollution and the loss of valuable time and money are common consequences of traffic congestion in urban areas. An IoT-based Intelligent Transportation System (ITS) can help manage road traffic congestion efficiently. Estimating and classifying the traffic congestion state of different road segments is one of the important aspects of intelligent traffic management: it helps the traffic management authority optimize the traffic regulation of a transportation system, and commuters can decide their best possible route to a destination based on the congestion state of different road segments. This paper aims to estimate and classify the traffic congestion state of road segments within a city by analyzing road traffic data captured by in-road stationary sensors. An Artificial Neural Network (ANN) based system is used to classify traffic congestion states. Based on the congestion status, the ITS automatically updates traffic regulations, such as changing the queue length at a traffic signal or suggesting alternate routes. It also helps the government devise policies regarding the construction of flyovers or alternate routes for better traffic management.
Cities have entered the age of the sensor, deploying sensors everywhere over and under the city. The sensors monitor a host of factors that reflect city operations and life, such as air quality, noise, city services, and traffic. Further, sensors have "gone mobile", with announcements of situation-aware mobile sensor platforms designed for city-level security and public safety. These wearable sensor platforms combine video, audio, and location data with Internet of Things (IoT) capabilities. However, the many sensors and functional platforms have not yet made the cities employing them truly Smart. We analyze why success toward the Smart city is limited, or late in coming. The explanations for the constrained effectiveness are assigned to many factors, but one of significance can be teased from a long-accepted explanation that associates data, information, and knowledge. Smart Cities need to effectively use the sensor data, and the information assembled from these interpreted and organized data, to create knowledge that serves the city and its people by answering and resolving key problems and questions. But the systems and analytic models needed to associate these data from many sensors have yet to be designed, constructed, and proven in the complex cities of today. Thus, the data (and the information from the diverse sensors) lack crucial integration and coordination for decisions and sense-making. While these sensor-based systems were, and in many cases are, meeting some intended functionally discrete goals, they appear to be better described as data collection tools feeding centralized analytical engines. They are point solutions with specialized or targeted sensors feeding specialized solutions. This is a significant limiting factor in a city's drive to improve the quality of life and the efficiency of the services a city provides to its stakeholders.
In this paper we present current trends in Smart City development and emerging issues with data and complexity growth, and propose a means to leverage advancing technologies to address the integration problem.
Smart technology advancements, competition from emerging markets, and sustainability needs have radically changed the tourism and transport sectors. The key features of this change are the exploitation of evolving Big Data in the business intelligence context, the development of customized services tailored to the needs of consumers with the purpose of improving their experience, and the development of new business models based on the interaction between businesses and consumers. Because smart transport technologies can integrate customer sensing, we design and outline a novel framework aimed at: i) developing personalized transport services in the tourism sector and ii) creating and delivering patterns of tourist consumer behavior according to specific target groups and market segments at the tourist destination or country level. The proposed "TOMI" framework exploits tour data analytics in order to enable the deployment of personalized tour services that will benefit tour operators, travellers, and other interested parties (local stakeholders, tourism entrepreneurs, etc.). The exploitation of the TOMI framework for organizing tours in a city is also addressed through a case study on the city of Thessaloniki.
Travel time estimates are highly useful in planning urban mobility events. This paper investigates the quality of travel time estimates in the Indian capital city of Delhi and the National Capital Region (NCR). Using the Uber mobile and web applications, we collect data about 610 trips from 34 Uber users. We empirically show the unpredictability of travel time estimates for Uber cabs. We also discuss the adverse effects of such unpredictability on passengers waiting for cabs, leading to a whopping 28.4% of the requested trips being cancelled. Our empirical observations differ significantly from the high accuracies reported in the travel time estimation literature. These pessimistic results will hopefully trigger useful future investigations into why travel time estimates fall short of the accuracy levels reported in the literature: (a) is it a lack of training data for developing countries, (b) an algorithmic shortcoming that cannot capture the (lack of) historical patterns in developing-region travel times, or (c) a conscious policy decision by the Uber platform or Uber drivers to misreport correctly predicted travel time estimates and increase cab cancellation fees? In the context of smartphone apps extensively generating and utilizing travel time information for urban commutes, this paper identifies and discusses the important problem of travel time estimation inaccuracies in developing countries.
With the increasing availability of mobility-related data, such as GPS traces, Web queries, and climate conditions, there is a growing demand to utilize this data to better understand and support urban mobility needs. However, data available from the individual actors, such as providers of information, navigation, and transportation systems, is mostly restricted to isolated mobility modes, whereas holistic data analytics over integrated data sources is not sufficiently supported. In this paper we present our ongoing research in the context of holistic data analytics to support urban mobility applications in the Data4UrbanMobility (D4UM) project. First, we discuss challenges in urban mobility analytics and present the D4UM platform we are currently developing to facilitate holistic urban data analytics over integrated heterogeneous data sources, along with the data sources currently available. Second, we present the MiC app, a tool we developed to complement available datasets with intermodal mobility data (i.e. data about journeys that involve more than one mode of mobility) using a citizen science approach. Finally, we present selected use cases and discuss our future work.
Spam bots have become a threat to online social networks with their malicious behavior, posting misinformation and influencing online platforms to fulfill their motives. As spam bots have become more advanced over time, creating algorithms to identify them remains an open challenge. Learning low-dimensional embeddings for nodes in graph-structured data has proven useful in various domains. In this paper, we propose a model based on graph convolutional neural networks (GCNNs) for spam bot detection. Our hypothesis is that to better detect spam bots, the social graph must be taken into consideration in addition to a feature set. GCNNs are able to leverage both the features of a node and the aggregated features of the node’s neighborhood. We compare our approach with two methods: one that works solely on a feature set and one that works solely on the structure of the graph. To our knowledge, this work is the first attempt to use graph convolutional neural networks in spam bot detection.
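To make the aggregation step concrete, here is a minimal, self-contained sketch of one GCN propagation layer in plain Python; the toy adjacency matrix, feature vectors, and weights are invented for illustration and are not the paper’s architecture or data:

```python
import math

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def gcn_layer(adj, H, W):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).
    Each node's new features mix its own features with its neighbors'."""
    n = len(adj)
    # Add self-loops so a node keeps its own features.
    A = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A]
    # Symmetric degree normalization.
    A_hat = [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    Z = matmul(A_hat, matmul(H, W))
    return [[max(0.0, x) for x in row] for row in Z]  # ReLU

# Toy graph of 3 accounts: nodes 0 and 1 are connected, node 2 is isolated.
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # per-node feature vectors
W = [[1.0, 0.0], [0.0, 1.0]]               # identity weights, for clarity
H2 = gcn_layer(adj, H, W)
```

With identity weights, each connected node’s output is the symmetrically normalized average of its own and its neighbors’ features, which is the intuition behind combining a feature set with the social graph.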
Recent years have witnessed a surge of manipulation of public opinion and political events by malicious social media actors. These users are referred to as “Pathogenic Social Media (PSM)” accounts. PSMs are key users in spreading misinformation in social media to viral proportions. These accounts can be either controlled by real users or automated bots. Identification of PSMs is thus of utmost importance for social media authorities. The burden usually falls to automatic approaches that can identify these accounts and protect social media reputation. However, lack of sufficient labeled examples for devising and training sophisticated approaches to combat these accounts is still one of the foremost challenges facing social media firms. In contrast, unlabeled data is abundant and cheap to obtain thanks to massive user-generated data. In this paper, we propose a semi-supervised causal inference PSM detection framework, SemiPsm, to compensate for the lack of labeled data. In particular, the proposed method leverages unlabeled data in the form of manifold regularization and only relies on cascade information. This is in contrast to the existing approaches that use exhaustive feature engineering (e.g., profile information, network structure, etc.). Evidence from empirical experiments on a real-world ISIS-related dataset from Twitter suggests promising results of utilizing unlabeled instances for detecting PSMs.
Social media, once hailed as a vehicle for democratization and the promotion of positive social change across the globe, are under attack for becoming a tool of political manipulation and spread of disinformation. A case in point is the alleged use of trolls by Russia to spread malicious content in Western elections. This paper examines the Russian interference campaign in the 2016 US presidential election on Twitter. Our aim is twofold: first, we test whether predicting users who spread trolls’ content is feasible in order to gain insight into how to contain their influence in the future; second, we identify features that are most predictive of users who either intentionally or unintentionally play a vital role in spreading this malicious content. We collected a dataset with over 43 million election-related posts shared on Twitter between September 16 and November 9, 2016, by about 5.7 million users. This dataset includes accounts associated with the Russian trolls identified by the US Congress. The proposed models are able to very accurately identify users who spread the trolls’ content (average AUC score of 96%, using 10-fold cross-validation). We show that political ideology, bot likelihood scores, and some activity-related account metadata are the most predictive features of whether a user spreads trolls’ content or not.
Cyberbullying poses serious threats to preteens and teenagers; therefore, understanding the incentives behind cyberbullying is critical to preventing it and mitigating its impact. Most existing work on cyberbullying detection has focused on accuracy and overlooked the causes of the outcome. Discovering the causes of cyberbullying from observational data is challenging due to the existence of confounders, variables that can lead to spurious causal relationships between covariates and the outcome. This work studies the problem of robust cyberbullying detection with causal interpretation and proposes a principled framework to identify and block the influence of plausible confounders, i.e., p-confounders. The de-confounded model is causally interpretable and is more robust to changes in the data distribution. We test our approach using a state-of-the-art evaluation method, causal transportability. The experimental results corroborate the effectiveness of our proposed algorithm. The purpose of this study is to provide a computational means of understanding cyberbullying behavior from observational data. This improves our ability to predict cyberbullying and to facilitate effective strategies or policies that proactively mitigate its impact.
How useful is the information that a security analyst can extract from a security forum? We focus on threads of interest, which we define as: (i) alerts of worrisome events, such as attacks, (ii) offerings of malicious services and products, (iii) hacking information for performing malicious acts, and (iv) useful security-related experiences. The analysis of security forums is in its infancy despite several promising recent works. Here, we leverage our earlier work on thread analysis and ask the question: what kind of information do these malicious threads provide? Specifically, we analyze threads along three dimensions: (a) temporal characteristics, (b) user-centric characteristics, and (c) content-centric properties. We study threads pulled from three security forums spanning the period 2012-2016. First, we show that, on average, 53% of the users asking for or selling malicious services have 3 posts, initiate 1 thread, and have a lifetime of 1 day. Second, we argue that careful analysis can help identify emerging threats reported in security forums through Services and Alerts threads and potentially help security analysts prevent attacks. We see this study as a first attempt to argue for the wealth and type of information that can be extracted from security forums.
Cyberbullying is a major issue on online social platforms and can have a prolonged negative psychological impact on both the bullies and their targets. Users can be characterized by their involvement in cyberbullying according to different social roles, including victim, bully, and victim supporter. In this work, we propose a social role detection framework to understand cyberbullying on online social platforms, and select a dataset that contains users’ records on both Instagram and Ask.fm as a case study. We refine the traditional victim-bully framework by constructing a victim-bully-supporter network on Instagram. These social roles are automatically identified via ego comment networks and linguistic cues of comments. Additionally, we analyze the consistency of users’ social roles within Instagram and compare users’ behaviors on Ask.fm. Our analysis reveals the inconsistency of social roles both within and across platforms, which suggests that social roles in cyberbullying are not invariant across conversations, people, or social platforms.
Digital threats such as backdoors, trojans, info-stealers, and bots can be especially damaging nowadays, as they actively steal information or allow remote control for nefarious purposes. A common attribute among such malware is the need for network communication, and many of them use domain generation algorithms (DGAs) to pseudo-randomly generate numerous domains for communication and thus avoid being taken down by blacklisting. DGAs are constantly evolving, and the generated domains are mixed with benign queries in network communication traffic each day, which raises a high demand for an efficient real-time DGA classifier over the domains in DNS logs. Previous works either rely on group contextual/statistical features or extra host-based information and thus need a long time window, or depend on lexical features extracted from domain strings to build real-time classifiers, or directly build an end-to-end deep neural network that makes predictions from domain strings. Each approach has its pros and cons in experiments. In this paper, we propose several new real-time detection models and frameworks that utilize meta-data generated from domains and combine the advantages of a deep neural network model and a lexical-features-based model using the ensemble technique. To the best of the authors’ knowledge, our proposed model outperforms all state-of-the-art methods so far, with both precision and recall at 99.8% on a widely used public dataset.
Public opinion manipulation is a serious threat to society, potentially influencing elections and the political situation even in established democracies. The prevalence of online media and the opportunity for users to express opinions in comments magnifies the problem. Governments, organizations, and companies can exploit this situation to bias opinions. Typically, they deploy a large number of pseudonyms to create the impression of a crowd that supports specific opinions. Side channel information (such as IP addresses or browser identities) often allows a reliable detection of pseudonyms managed by a single person. However, while spoofing and anonymizing the data that links these accounts is simple, linking them without such information is very challenging.
In this paper, we evaluate whether stylometric features allow the detection of such doppelgängers within comment sections on news articles. To this end, we adapt a state-of-the-art doppelgänger detector to work on small texts (such as comments) and apply it to three popular news sites in two languages. Our results reveal that detecting potential doppelgängers based on linguistics is a promising approach even when no reliable side channel information is available. Preliminary results from an application in the wild show indications of doppelgängers in real-world data sets.
The deep and dark web (d2web) refers to limited-access web sites that require registration, authentication, or more complex encryption protocols to access them. These web sites serve as hubs for a variety of illicit activities: trading drugs, stolen user credentials, and hacking tools, and coordinating attacks and manipulation campaigns. Despite its importance to cyber crime, the d2web has not been systematically investigated. In this paper, we study a large corpus of messages posted to 80 d2web forums over a period of more than a year. We identify topics of discussion using LDA and use a non-parametric HMM to model the evolution of topics across forums. Then, we examine the dynamic patterns of discussion and identify forums with similar patterns. We show that our approach surfaces hidden similarities across different forums and can help identify anomalous events in this rich, heterogeneous data.
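As a hedged illustration of the LDA step, here is a toy collapsed Gibbs sampler in plain Python; it shows what topic discovery over forum messages amounts to, but it is not the tooling used in the paper, and the example documents are invented:

```python
import random

def lda_gibbs(docs, K, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents.
    Returns doc-topic counts, topic-word counts, and the vocabulary."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    z = [[rng.randrange(K) for _ in d] for d in docs]  # token-topic assignments
    ndk = [[0] * K for _ in docs]                      # doc-topic counts
    nkw = [[0] * V for _ in range(K)]                  # topic-word counts
    nk = [0] * K                                       # per-topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] += 1; nkw[t][widx[w]] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, v = z[d][i], widx[w]
                ndk[d][t] -= 1; nkw[t][v] -= 1; nk[t] -= 1  # remove token
                # Conditional topic weights given all other assignments.
                wts = [(ndk[d][k] + alpha) * (nkw[k][v] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
                r, acc, t = rng.random() * sum(wts), 0.0, K - 1
                for k in range(K):
                    acc += wts[k]
                    if r <= acc:
                        t = k
                        break
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][v] += 1; nk[t] += 1
    return ndk, nkw, vocab

# Invented toy corpus of two "forum threads".
docs = [["drug", "market", "drug"], ["exploit", "hack", "exploit"]]
ndk, nkw, vocab = lda_gibbs(docs, K=2, iters=50)
```

Per-forum topic proportions of this kind, tracked over time, are the input one would feed to a downstream dynamics model such as the HMM mentioned in the abstract.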
It has been widely recognized that automated bots may have a significant impact on the outcomes of national events. It is important to raise public awareness about the threat of bots on social media during these important events, such as the 2018 US midterm election. To this end, we deployed a web application to help the public explore the activities of likely bots on Twitter on a daily basis. The application, called Bot Electioneering Volume (BEV), reports on the level of likely bot activities and visualizes the topics targeted by them. With this paper we release our code base for the BEV framework, with the goal of facilitating future efforts to combat malicious bots on social media.
Over the past couple of years, anecdotal evidence has emerged linking coordinated campaigns by state-sponsored actors with efforts to manipulate public opinion on the Web, often around major political events, through dedicated accounts, or “trolls.” Although they are often involved in spreading disinformation on social media, there is little understanding of how these trolls operate, what type of content they disseminate, and most importantly their influence on the information ecosystem.
In this paper, we shed light on these questions by analyzing 27K tweets posted by 1K Twitter users identified as having ties with Russia’s Internet Research Agency and thus likely state-sponsored trolls. We compare their behavior to a random set of Twitter users, finding interesting differences in terms of the content they disseminate, the evolution of their accounts, as well as their general behavior and use of Twitter. Then, using Hawkes Processes, we quantify the influence that trolls had on the dissemination of news on social platforms like Twitter, Reddit, and 4chan. Overall, our findings indicate that Russian trolls managed to stay active for long periods of time and to reach a substantial number of Twitter users with their tweets. When looking at their ability to spread news content and make it viral, however, we find that their effect on social platforms was minor, with the significant exception of news published by the Russian state-sponsored news outlet RT (Russia Today).
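The influence quantification relies on Hawkes processes, self-exciting point processes in which each event temporarily raises the rate of future events. A minimal univariate sketch with an exponential kernel, simulated via Ogata's thinning algorithm, is below; it is illustrative only (the paper fits multivariate processes to real cross-platform cascades, and all parameter values here are invented):

```python
import math
import random

def simulate_hawkes(mu, alpha, beta, T, seed=0):
    """Simulate a univariate Hawkes process on [0, T] with intensity
    lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))
    using Ogata's thinning algorithm."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while t < T:
        # Intensity at the current time upper-bounds the intensity until
        # the next event, because the exponential kernel only decays.
        lam_bar = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        t += rng.expovariate(lam_bar)
        if t >= T:
            break
        lam_t = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        if rng.random() <= lam_t / lam_bar:  # accept with prob lambda(t)/lam_bar
            events.append(t)
    return events

# Invented parameters; alpha/beta < 1 keeps the process stable.
events = simulate_hawkes(mu=0.5, alpha=0.4, beta=1.0, T=100.0, seed=42)
```

Fitted parameters of this kind are what allow attributing a share of the events on one platform to prior activity on another, which is how the trolls' influence is quantified.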
Since 2016, through an association between Telefónica R&D and the Institute of Data Science in Chile, a group of researchers and I have been working with trillions of digital traces left behind when people use their mobile phones. All of this work has been done under the general umbrella term of "data science for social good", and we have worked on everything from population displacement after external events like earthquakes, to how people started using public spaces after the introduction of a popular mobile game, to the actual social inclusion of people of different socio-economic backgrounds mixing in shopping malls or reading certain kinds of news, to patterns arising from gendered data sets. We will show how data in the private sector taught us important social lessons, such as how parks became more secure when people went out to play Pokemon Go, how certain malls are hubs of social inclusion, how gender segregates the city, and how different demographics keep themselves in their own informational filter bubble. However, despite all these benefits, the relationship with industry has never been fluid, and involves a lot of small and not-so-small compromises and "battles". In this talk, I will present a technical history of the work we've done with X/CDRs for social good, including practical aspects of accessing and sharing data, the balance of research and industrial innovation, and issues of transaction costs while still providing value for the company itself, government, the university, and society. I will also recount experiences about what it meant for a company like Telefónica and a research university like ours to travel together in a very interesting context of huge data, incredible insights, privacy considerations, money, corporate interests, university expectations, and data-driven discovery.
In most emergencies, people use social media platforms to publicly share information. Such data, from multiple sources, is extremely useful for emergency response and public safety: the more knowledge that is gathered, the better the response can be. When an emergency event first occurs, getting the right information as quickly as possible is critical to saving lives. When an emergency event is ongoing, information on what is happening can be critical for making decisions that keep people safe and take control of the particular situation unfolding. In both cases, first responders have to quickly make decisions that include what resources to deploy and where. In this talk, I will describe challenges in emergency response and how a computational platform that leverages public data can address them. A platform to detect emergency situations and deliver the right information to first responders has to deal with ingesting thousands of data points per second: sifting through and identifying relevant information, from different sources, in different formats, with varying levels of detail, in real time, so that first responders and others can be alerted at the right level and at the right time. I will describe technical challenges in processing vast amounts of heterogeneous data in real time, highlighting the importance of interdisciplinary research and a human-centered approach to address problems in emergency response. I will give specific examples and discuss relevant research topics in Machine Learning, NLP, Information Retrieval, Computer Vision and other fields.
Human service providers play a critical role in improving well-being in the United States. However, little is known about (i) how service seekers find the services they are looking for by navigating among available service providers, and (ii) how such organizations collaborate to meet human needs. In this paper, we report the first outcomes of our ongoing project. Specifically, we first describe a data acquisition engine, designed around the particular challenges of capturing, maintaining, and updating data pertaining to human service organizations from semi-structured Web sources. We then proceed to illustrate the potential of the resulting comprehensive repository of human service providers through a case study showcasing a mobile app prototype designed to provide a one-stop shop for human service seekers.
One of the hallmarks of a free and fair society is the ability to conduct a peaceful and seamless transfer of power from one leader to another. Democratically, this is measured in a citizen population’s trust in the electoral system of choosing a representative government. In view of the well documented issues of the 2016 US Presidential election, we conducted an in-depth analysis of the 2018 US Midterm elections looking specifically for voter fraud or suppression. The Midterm election occurs in the middle of a 4-year presidential term. For the 2018 midterms, 35 Senate seats and all 435 seats in the House of Representatives were up for election; thus, every congressional district and practically every state had a federal election. In order to collect election-related tweets, we analyzed Twitter during the month prior to, and the two weeks following, the November 6, 2018 election day. In a targeted analysis to detect statistical anomalies or election interference, we identified several biases that can lead to wrong conclusions. Specifically, we looked for divergence between actual voting outcomes and instances of the #ivoted hashtag on election day. This analysis highlighted three states of concern: New York, California, and Texas. We repeated our analysis discarding malicious accounts, such as social bots. Upon further inspection and against a backdrop of collected general election-related tweets, we identified some confounding factors, such as population bias or errors in bot and political ideology inference, that can lead to false conclusions. We conclude by providing an in-depth discussion of the perils and challenges of using social media data to explore questions about election manipulation.
Psychological, political, cultural, and even societal factors are entangled in the reasoning and decision-making process towards vaccination, rendering vaccine hesitancy a complex issue. Here, administering a series of surveys via a Facebook-hosted application, we study the worldviews of people who “Liked” vaccine-supportive or vaccine-hesitant Facebook Pages. In particular, we assess differences in political viewpoints, moral values, personality traits, and general interests, finding that those sceptical about vaccination appear to trust the government less, are less agreeable, and place more emphasis on anti-authoritarian values. Exploring the differences in moral narratives as expressed in the linguistic descriptions of the Facebook Pages, we see that pages that defend vaccines prioritise the value of family, while the vaccine-hesitant pages focus on the value of freedom. Finally, creating embeddings based on health-related likes on Facebook Pages, we explore common, latent interests of vaccine-hesitant people, showing a strong preference for natural cures. This exploratory analysis probes the potential of a social media platform to act as a sensing tool, providing researchers and policymakers with insights drawn from digital traces that can help design communication campaigns that build confidence, based on values that also appeal to people’s socio-moral criteria.
The role of social networks during natural disasters is becoming crucial for sharing relevant information and coordinating relief actions. With the reach of social networks, any user around the world has the possibility of interacting in crisis events as they unfold. A large part of the information posted during a disaster uses the native language of the place where the disaster occurred. However, there are also users from other parts of the world who comment about the event, often in another language. In this work, we conducted a study of crisis-related tweets about the earthquake that occurred in Ecuador in April 2016. To that end, we introduce a new annotated dataset in both Spanish and English with approximately 8K tweets, half of which belong to conversations. We evaluate several neural architectures to identify crisis-related tweets in a multi-lingual setting and find that deep contextual multi-lingual embeddings outperform other strong baseline models. We then explore the types of conversations that occur from the perspective of different languages. The results show that certain types of conversations occur more in the native language and others in a foreign language. Conversations from foreign countries seek to gather situational awareness and give emotional support, while in the affected country the conversations aim mainly at humanitarian aid.
A map of the potential prevalence of Chagas disease (ChD) with high spatial disaggregation is presented. It aims to detect areas outside the Gran Chaco ecoregion (hyperendemic for ChD) that are characterized by high affinity with ChD and high health vulnerability.
To quantify potential prevalence, we developed several indicators. The first is an Affinity Index, which quantifies the degree of linkage between endemic areas of ChD and the rest of the country. We also studied favorable habitability conditions for Triatoma infestans, looking for areas where the predominant materials of floors, roofs, and internal ceilings favor the presence of the disease vector.
We studied determinants of a more general nature that can be encompassed under the concept of Health Vulnerability Index. These determinants are associated with access to health providers and the socio-economic level of different segments of the population.
Finally, we constructed a Chagas Potential Prevalence Index (ChPPI), which combines the affinity index, the health vulnerability index, and the population density. We show and discuss the maps obtained. These maps are intended to assist public health specialists, decision makers of public health policies, and public officials in the development of cost-effective strategies to improve access to diagnosis and treatment of ChD.
To capitalize on the benefits associated with word embeddings, researchers working with data from domains such as medicine, sentiment analysis, or finance, have dedicated efforts to either taking advantage of popular, general-purpose embedding-learning strategies, such as Word2Vec, or developing new ones that explicitly consider domain knowledge in order to generate new domain-specific embeddings. In this manuscript, we instead propose a mixed strategy to generate enriched embeddings specifically designed for the educational domain. We do so by leveraging FastText embeddings pre-trained using Wikipedia, in addition to established educational standards that serve as structured knowledge sources to identify terms, topics, and subjects for each school grade. The results of an initial empirical analysis reveal that the proposed embedding-learning strategy, which infuses limited structured knowledge currently available for education into pre-trained embeddings, can better capture relationships and proximity among education-related terminology. Further, these results demonstrate the advantages of using domain-specific embeddings over general-purpose counterparts for capturing information that pertains to the educational area, along with potential applicability implications when it comes to text processing and analysis for K–12 curriculum-related tasks.
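One simple way to infuse structured knowledge into pre-trained vectors is retrofitting, which pulls each term toward the terms it is grouped with in the knowledge source while staying anchored to its pre-trained position. The sketch below illustrates that general idea, not necessarily the authors' method, and the toy terms, groupings, and vectors are invented:

```python
def retrofit(emb, lexicon, iters=10, beta=1.0):
    """Pull each term's vector toward the terms it is linked to in a
    structured knowledge source (e.g., terms sharing a curriculum topic),
    balancing the pre-trained vector against the neighbors' vectors."""
    new = {w: list(v) for w, v in emb.items()}
    for _ in range(iters):
        for w, nbrs in lexicon.items():
            nbrs = [n for n in nbrs if n in emb]
            if w not in emb or not nbrs:
                continue
            d = len(emb[w])
            denom = 1.0 + beta * len(nbrs)
            # Weighted average of the original vector and current neighbors.
            new[w] = [(emb[w][j] + beta * sum(new[n][j] for n in nbrs)) / denom
                      for j in range(d)]
    return new

# Invented toy vectors: "fraction" should move toward "numerator"/"denominator",
# which an educational standard might group under the same grade-level topic.
emb = {"fraction": [1.0, 0.0], "numerator": [0.0, 1.0], "denominator": [0.0, 1.0]}
lex = {"fraction": ["numerator", "denominator"]}
enriched = retrofit(emb, lex)
```

After retrofitting, "fraction" sits closer to its curriculum-mates while terms outside the lexicon keep their pre-trained vectors, which mirrors the abstract's goal of capturing proximity among education-related terminology.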
In this extended abstract, we present an algorithm that learns a similarity measure between documents from the network topology of a structured corpus. We leverage the Scaled Dot-Product Attention, a recently proposed attention mechanism, to design a mutual attention mechanism between pairs of documents. To train its parameters, we use the network links as supervision. We provide preliminary experimental results with a citation dataset on two prediction tasks, demonstrating the capacity of our model to learn a meaningful textual similarity.
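A hedged sketch of what mutual attention between two documents can look like in plain Python; the pooling and similarity choices below are toy assumptions for illustration, not the trained model from the abstract:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(A, B):
    """Scaled dot-product attention of document A's word vectors over B's:
    each row of A is re-expressed as a weighted mix of B's rows."""
    d = len(A[0])
    out = []
    for a in A:
        scores = [sum(x * y for x, y in zip(a, b)) / math.sqrt(d) for b in B]
        w = softmax(scores)
        out.append([sum(wi * b[j] for wi, b in zip(w, B)) for j in range(d)])
    return out

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

def mutual_similarity(A, B):
    """Symmetric document similarity: compare each document to the attended
    view of itself computed against the other document, then average."""
    sa = sum(cosine(a, c) for a, c in zip(A, attend(A, B))) / len(A)
    sb = sum(cosine(b, c) for b, c in zip(B, attend(B, A))) / len(B)
    return (sa + sb) / 2

# Invented word vectors: a document compared against itself vs. an "opposite".
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[-1.0, 0.0], [0.0, -1.0]]
s_same = mutual_similarity(A, A)
s_diff = mutual_similarity(A, B)
```

In the actual model, a score of this kind would be trained so that linked document pairs (e.g., citing and cited papers) receive higher similarity than unlinked pairs.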
Recently, considerable research attention has been paid to graph embedding, a popular approach for constructing representations of vertices in a latent space. Due to the curse of dimensionality and sparsity in graphical datasets, this approach has become indispensable for machine learning tasks over large networks. The majority of the existing literature has considered this technique under the assumption that the network is static. However, in many applications, including social networks, collaboration networks, and recommender systems, nodes and edges accrue to a growing network as a stream. A small number of very recent results have addressed the problem of embedding for dynamic networks. However, they either rely on knowledge of vertex attributes, suffer from high time complexity, or need to be re-trained without a closed-form expression. Thus, adapting the existing methods designed for static or dynamic networks to the streaming environment faces non-trivial technical challenges.
These challenges motivate developing new approaches to the problem of streaming graph embedding. In this paper, we propose a new framework that is able to generate latent representations for new vertices with high efficiency and low complexity under specified iteration rounds. We formulate a constrained optimization problem for the modification of the representation resulting from a stream arrival. We show this problem has no closed-form solution and instead develop an online approximation solution. Our solution follows three steps: (1) identify the vertices affected by the newly arrived ones, (2) generate latent features for the new vertices, and (3) update the latent features of the most affected vertices. The new representations are guaranteed to be feasible in the original constrained optimization problem. Meanwhile, the solution only brings about a small change to existing representations and only slightly changes the value of the objective function. Multi-class classification and clustering on five real-world networks demonstrate that our model can efficiently update vertex representations and simultaneously achieve comparable or even better performance compared with model retraining.
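The three steps can be sketched as follows; the toy objective (pulling connected vertices together) and all names are illustrative assumptions, not the paper's constrained optimization:

```python
def stream_update(emb, adj, new_v, new_edges, steps=20, lr=0.1):
    """Handle one streaming arrival: (1) the affected vertices are the new
    vertex's neighbors, (2) initialize the new vertex as the mean of its
    neighbors' embeddings, and (3) nudge only the new vertex and the
    affected neighbors so that connected pairs move closer (a toy
    first-order objective). All other representations stay untouched."""
    d = len(next(iter(emb.values())))
    adj.setdefault(new_v, set())
    for u in new_edges:
        adj[new_v].add(u)
        adj.setdefault(u, set()).add(new_v)
    affected = list(adj[new_v])                           # step (1)
    emb[new_v] = [sum(emb[u][j] for u in affected) / len(affected)
                  for j in range(d)]                      # step (2)
    for _ in range(steps):                                # step (3)
        for u in affected + [new_v]:
            nbrs = adj[u]
            grad = [sum(emb[u][j] - emb[n][j] for n in nbrs) for j in range(d)]
            emb[u] = [emb[u][j] - lr * grad[j] / max(1, len(nbrs))
                      for j in range(d)]
    return emb
```

The key property this sketch shares with the paper's solution is locality: a stream arrival changes only the new vertex and its most affected neighbors, leaving the rest of the embedding intact.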
In dialogue systems, discourse coherence is an important concept that measures the semantic relevance between an utterance and its context. It plays a critical role in detecting inappropriate replies of dialogue systems with regard to a given dialogue context. In this paper, we present a novel framework for evaluating discourse coherence by seamlessly integrating Bayesian and neural networks. The Bayesian network corresponds to Coherence-Pivoted Latent Dirichlet Allocation (cpLDA). cpLDA concentrates on generating fine-grained topics from dialogue data and takes both local and global semantics into account. The neural network corresponds to the Multi-Hierarchical Coherence Network (MHCN). Coupled with cpLDA, MHCN quantifies the discourse coherence between an utterance and its context by comprehensively utilizing the original texts, topic distribution, and topic embedding. Extensive experiments show that the proposed framework yields superior performance compared with state-of-the-art methods.
Graph representation learning for static graphs is a well-studied topic. Recently, a few studies have focused on learning temporal information in addition to the topology of a graph. Most of these studies have relied on learning to represent nodes and substructures in dynamic graphs. However, the representation learning problem for entire graphs in a dynamic context is yet to be addressed. In this paper, we propose an unsupervised representation learning architecture for dynamic graphs, designed to learn both the topological and temporal features of graphs that evolve over time. The approach consists of a sequence-to-sequence encoder-decoder model embedded with gated graph neural networks (GGNNs) and long short-term memory (LSTM) networks. The GGNN is able to learn the topology of the graph at each time step, while the LSTMs are leveraged to propagate the temporal information among the time steps. Moreover, an encoder learns the temporal dynamics of an evolving graph and a decoder reconstructs the dynamics over the same period of time using the encoded representation provided by the encoder. We demonstrate that our approach is capable of learning the representation of a dynamic graph through time by applying the embeddings to dynamic graph classification using a real-world dataset of animal behaviour.
Online review systems enable users to submit reviews about products. However, the openness of the Internet and the monetary rewards for crowdsourcing tasks stimulate a large number of fraudulent users to write fake reviews and post advertisements that interfere with the ranking of apps. Existing methods for detecting spam reviews have been successful, but they usually aim at e-commerce (e.g. Amazon, eBay) and recommendation (e.g. Yelp, Dianping) systems. Since the behaviors of fraudulent users are complex and vary across different review platforms, existing methods are not suitable for fraudster detection in online app review systems.
To shed light on this question, we are among the first to analyze the intentions of fraudulent users on different review platforms and categorize them by utilizing characteristics of contents (similarity, special symbols) and behaviors (timestamps, device, login status). With a comprehensive analysis of spamming activities and the relationships between normal and malicious users, we design and present FdGars, the first graph convolutional network approach for fraudster detection in an online app review system. We then evaluate FdGars on a real-world large-scale dataset (with 82,542 nodes and 42,433,134 edges) from Tencent App Store. The results demonstrate that the F1-score of FdGars reaches 0.938+, outperforming several baselines and state-of-the-art fraudster detection methods. Moreover, we deploy FdGars on the Tencent Beacon Anti-Fraud Platform to show its effectiveness and scalability. To the best of our knowledge, this is the first work to use graph convolutional networks for fraudster detection in a large-scale online app review system. It is worth mentioning that FdGars can uncover malicious accounts even when labeled data are lacking in anti-spam tasks.
When the meaning of a phrase cannot be inferred from the individual meanings of its words (e.g., hot dog), that phrase is said to be non-compositional. Automatic compositionality detection in multi-word phrases is critical in any application of semantic processing, such as search engines; failing to detect non-compositional phrases can hurt system effectiveness notably. Existing research treats phrases as either compositional or non-compositional in a deterministic manner. In this paper, we operationalize the viewpoint that compositionality is contextual rather than deterministic, i.e., that whether a phrase is compositional or non-compositional depends on its context. For example, the phrase “green card” is compositional when referring to a green colored card, whereas it is non-compositional when meaning permanent residence authorization. We address the challenge of detecting this type of contextual compositionality as follows: given a multi-word phrase, we enrich the word embedding representing its semantics with evidence about its global context (terms it often collocates with) as well as its local context (narratives where that phrase is used, which we call usage scenarios). We further extend this representation with information extracted from external knowledge bases. The resulting representation incorporates both localized context and more general usage of the phrase and allows us to detect its compositionality in a non-deterministic and contextual way. Empirical evaluation of our model on a dataset of phrase compositionality, manually collected by crowdsourcing contextual compositionality assessments, shows that our model outperforms state-of-the-art baselines notably on detecting phrase compositionality.
In this talk we consider the domain of voice shopping in Alexa, Amazon’s voice assistant. In this domain, search scenarios are an integral part of shopping sessions, where users seek a product to buy or some information about a product. The fact that in voice search both the input and output are spoken involves many challenges in automatic speech recognition, natural language understanding, question answering, and new user experiences. We will elaborate on customers’ behavior in voice shopping, where we have observed an interesting and surprising phenomenon: many customers purchase or engage with irrelevant search results. The term “irrelevance” may mislead, since a relevant item is typically interpreted as “anything that satisfies the user needs”. Thus, the title of this work may look like an oxymoron: the purchase of a product is a strong signal of relevance to the customer. In the context of this work we take a simplified approach: we mark product items as relevant or irrelevant to the user query based on the relevance judgments of several human annotators. However, even in the context of objective relevance judgments, it is still surprising that so many customers engage with irrelevant results. We will analyze this phenomenon and demonstrate its significance. We will offer several hypotheses as to the reasons behind customers’ purchase of and engagement with irrelevant results, including customers’ personal preferences, the trendiness of the products and their relatedness to the query, the query intent, and the product price.
Size selection is a critical step when purchasing fashion products. Unlike offline shopping, in online fashion shopping customers do not have the luxury of trying on a product and have to rely on product images and size charts to select a product that fits well. As a result of this gap, online shopping yields a large percentage of returns due to size and fit. Hence, providing size recommendations for customers enhances their buying experience and also reduces the operational costs incurred during exchanges and returns. In this paper, we present a robust personalized size recommendation system which predicts the most appropriate size for users based on their order history and product data. We embed both users and products in a size-and-fit space using a skip-gram based Word2Vec model and employ a GBM classifier to predict the fit likelihood. We describe the architecture of the system and the challenges we encountered while developing it. Further, we analyze the performance of our system through extensive offline and online testing, compare our technique with another state-of-the-art technique, and share our findings.
Recent theoretical and practical advances have led to the emergence of review-based recommender systems, where user preference data is encoded in at least two dimensions: the traditional rating scores on a predefined discrete scale and the user-generated reviews in the form of free text. The main contribution of this work is the presentation of a new technique for incorporating those reviews into collaborative filtering matrix factorization algorithms. The text of each review, of arbitrary length, is mapped to a continuous feature space of fixed length using neural language models, more specifically the Paragraph Vector model. Subsequently, the resulting feature vectors (the neural embeddings) are used in combination with the rating scores in a hybrid probabilistic matrix factorization algorithm, based on maximum a posteriori estimation. The proposed methodology is then compared to three other similar approaches on six datasets in order to assess its performance. The obtained results demonstrate that the new formulation outperforms the other systems on two metrics, thereby indicating the robustness of the idea.
We describe a system that organizes search results in the context of an exploratory product search session where the user is researching goods. Compared to existing approaches that use predefined categories to filter results by attributes, we organize information needs based on queries instead of documents. The idea is to organize queries around the same topic and produce a hierarchical representation of intents that describe information about a product from different perspectives. We present a prototype implementation using a real-world data set of 24M queries.
Demand generation and assortment planning are two critical components of running a retail business. Traditionally, retail companies use historical sales data for modeling and optimizing assortment selection, and they use a marketing strategy for demand generation. However, today most retail businesses have e-commerce sites with rapidly growing online sales. An e-commerce site typically has to maintain a large amount of digitized product data, and it also keeps a vast amount of historical customer interaction data that includes search, browse, click, purchase, and many other interactions. In this paper, we show how this digitized product data and the historical search logs can be used to understand and quantify the gap between the supply and demand sides of a retail market. This gap helps in making an effective strategy for both demand generation and assortment selection. We construct topic models of the historical search queries and of the digitized product data from the catalog. We use the former to model customer demand and the latter to model the supply side of the retail business. We then create a tool to visualize the topic models to understand the differences between the supply and demand sides. We also quantify the supply-demand gap by defining a metric based on the Kullback-Leibler (KL) divergence of the topic distributions of queries and products. The quantification helps us identify the topics with excess or insufficient demand and thereby design effective strategies for demand generation and assortment selection. Application of this work by e-commerce retailers can result in the development of product innovations that can be utilized to achieve economic equilibrium. We can identify excess demand and provide insight to the teams responsible for improving assortment and catalog quality. Similarly, we can identify excess supply and provide that intelligence to the teams responsible for demand generation.
Tools of this nature can be developed to systematically drive efficiency in achieving better economic gains for the entire e-commerce engine. We conduct several experiments collecting data from Walmart.com to validate the effectiveness of our approach.
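The KL-divergence gap metric described above is straightforward to compute once the topic distributions exist. A minimal sketch, where the 4-topic demand and supply distributions are made-up illustrations, not Walmart data:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) between two topic distributions (lists summing to 1).

    eps smoothing avoids log(0) when a topic is absent from Q.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical 4-topic distributions: demand (from search queries)
# versus supply (from the product catalog).
demand = [0.40, 0.30, 0.20, 0.10]
supply = [0.10, 0.30, 0.20, 0.40]
gap = kl_divergence(demand, supply)  # larger value = bigger supply-demand mismatch
```

Topics where `demand[i]` greatly exceeds `supply[i]` dominate the divergence, which is how such a metric points assortment teams at under-supplied topics.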
Product pages on e-commerce websites often overwhelm their customers with a wealth of data, making discovery of relevant information a challenge. Motivated by this, we present a novel framework to answer both factoid and non-factoid user questions on product pages. We propose several question-answer matching models leveraging both deep learned distributional semantics and semantics imposed by a structured resource like a domain-specific ontology. The proposed framework supports the use of a combination of these models, and we show, through empirical evaluation, that a cascade of these models does much better in meeting the high precision requirements of such a question-answering system. Evaluation on user-asked questions shows that the proposed system achieves 66% higher precision compared to an IDF-weighted average of word vectors baseline.
With the recent proliferation of e-commerce services, online shopping has become more and more popular among customers. To recommend suitable items to customers and improve recommendation accuracy, high-performance recommender systems are required. However, current recommender systems are mainly based on information from their own domain, resulting in less accurate recommendations for customers with limited purchase histories; accuracy suffers from this lack of information. In order to use information from other domains, it is necessary to associate the behaviors of behaviorally related users across domains. This paper presents a preliminary analysis of matching the behaviors of behaviorally related users in different domains. The results show that our approach achieves a better prediction rate than linear regression.
For an e-commerce website like Walmart.com, search is one of the most critical channels for engaging customers. Most existing work on search is composed of two steps: a retrieval step, which obtains the candidate set of matching items, and a re-rank step, which focuses on fine-tuning the ranking of the candidate items. Inspired by recent work in neural information retrieval (NIR), we discuss our exploration of various product retrieval models trained on search log data. We present a set of lessons learned in our empirical results section; these results can be applied to any product search engine that aims to learn a good product retrieval model from search log data.
Automated detection of text with misrepresentations, such as fake reviews, is an important task for online reputation management. We present a dataset of customer complaints: emotionally charged texts which are very similar to reviews and include descriptions of problems customers experienced with certain businesses. It contains 2746 complaints about banks and provides clear ground truth, based on available factual knowledge about the financial domain. Among them, 400 texts were manually tagged. Initial experiments were performed to explore the links between implicit cues of the rhetorical structure of texts and the validity of arguments, and also how truthful or deceptive these texts are.
A current research question in the area of entity resolution (also called link discovery or duplicate detection) is whether and in which cases embeddings and deep neural network based matching methods outperform traditional symbolic matching methods. The problem with answering this question is that deep learning based matchers need large amounts of training data. The entity resolution benchmark datasets that are currently available to the public are too small to properly evaluate this new family of matching methods. The WDC Training Dataset for Large-Scale Product Matching fills this gap. The English language subset of the training dataset consists of 20 million pairs of offers referring to the same products. The offers were extracted from 43 thousand e-shops which provide schema.org annotations including some form of product ID such as a GTIN or MPN. We also created a gold standard by manually verifying 2200 pairs of offers belonging to four product categories. Using a subset of our training dataset together with this gold standard, we are able to publicly replicate the recent result of Mudgal et al. that embeddings and deep neural network based matching methods outperform traditional symbolic matching methods on less structured data.
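As a concrete illustration of the "traditional symbolic matching" baseline that such benchmarks compare deep matchers against, here is a minimal token-overlap (Jaccard) matcher on offer titles. The offer strings and threshold are invented for illustration, not taken from the WDC dataset:

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two offer titles."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match(offer1, offer2, threshold=0.5):
    """Declare two offers the same product if title overlap is high enough."""
    return jaccard(offer1, offer2) >= threshold

# Invented offer titles: the first two refer to the same product.
o1 = "Apple iPhone 7 32GB black smartphone"
o2 = "Apple iPhone 7 32GB black"
o3 = "Samsung Galaxy S8 64GB"
```

Symbolic matchers like this work well on clean, structured titles but degrade on the "less structured data" the abstract mentions, which is where embedding-based matchers gain their edge.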
The quality of e-commerce services largely depends on the accessibility of product content as well as its completeness and correctness. Nowadays, many sellers target cross-country and cross-lingual markets via active or passive cross-border trade, fostering the desire for seamless user experiences. While machine translation (MT) is very helpful for crossing language barriers, automatically matching existing items for sale (e.g., the smartphone in front of me) to the same product (all smartphones of the same brand/type/colour/condition) can be challenging, especially because the seller’s description can often be erroneous or incomplete. We refer to this task as item alignment in multilingual e-commerce catalogues. To facilitate this task, we develop a pipeline of tools for item classification based on cross-lingual text similarity, exploiting recurrent neural networks (RNNs) with and without pre-trained word embeddings. Furthermore, we combine our language-agnostic RNN classifiers with an in-domain MT system to further reduce the linguistic and stylistic differences between the investigated data, aiming to boost performance. The quality of the methods as well as their training speed is compared on an in-domain dataset of English–German products.
Analyzing commercial pages to infer the products or services being offered by a web-based business is a task central to product search, product recommendation, ad placement, and other e-commerce tasks. What makes this task challenging is that there are two types of e-commerce product pages. One is the single-product (SP) page, where one product is featured primarily and users are able to buy that product or add it to the cart on the page. The other is the multi-product (MP) page, where users are presented with multiple (often 10-100) choices of products within the same category, often with thumbnail pictures and brief descriptions; users browse through the catalogue until they find a product they want to learn more about, and subsequently purchase the product of their choice on a corresponding SP page. In this paper, we take a two-step approach to identifying product phrases from commercial pages. First we classify whether a commercial web page is an SP or MP page. To that end, we introduce two different image recognition based models to differentiate between these two types of pages. If the page is determined to be SP, we identify the main product featured on that page. We compare the two types of image recognition models in terms of the trade-offs between accuracy and latency, and empirically demonstrate the efficacy of our overall approach.
Many current applications use recommender systems to predict user preferences, aiming to improve user experience and increase sales and the usage time that users spend on the application. However, it is not easy to recommend items to new users accurately because of the user cold-start problem: recommendation performance degrades for users with little interaction, particularly for latent users who have never used the service before. In this work, we combine an online shopping domain with information from an ad platform and then apply deep learning to build a cross-domain recommender system based on the shared users of these two domains, to alleviate the user cold-start problem. Experimental results show the effectiveness of our deep cross-domain recommender system in handling the user cold-start problem. With our framework, it is possible to recommend products more accurately to users of the other domain through ad distribution, and to increase online shopping sales.
For a product of interest, we propose a search method to surface a set of reference products. The reference products can be used as candidates to support downstream modeling tasks and business applications. The search method consists of product representation learning and fingerprint-type vector searching. The product catalog information is transformed into a high-quality embedding of low dimensions via a novel attention auto-encoder neural network, and the embedding is further coupled with a binary encoding vector for fast retrieval. We conduct extensive experiments to evaluate the proposed method, and compare it with peer services to demonstrate its advantage in terms of search return rate and precision.
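The fingerprint-type vector search described above can be illustrated with sign-binarization plus Hamming-distance ranking. The attention auto-encoder itself is not reproduced here; the catalog ids and embedding values below are invented for illustration:

```python
def binarize(embedding):
    """Turn a real-valued embedding into a 0/1 fingerprint by sign."""
    return tuple(1 if x > 0 else 0 for x in embedding)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return sum(x != y for x, y in zip(a, b))

def nearest(query_emb, catalog):
    """Rank catalog ids by Hamming distance to the query fingerprint."""
    q = binarize(query_emb)
    return sorted(catalog, key=lambda pid: hamming(q, binarize(catalog[pid])))

# Invented catalog: two similar mugs and one unrelated lamp.
catalog = {
    "mug-blue": [0.9, -0.2, 0.4, -0.7],
    "mug-red":  [0.8, -0.1, 0.5, -0.6],
    "lamp-led": [-0.3, 0.6, -0.8, 0.2],
}
ranked = nearest([0.7, -0.3, 0.6, -0.5], catalog)
```

Binary fingerprints trade some precision for very cheap comparisons, which is why such encodings are coupled with the learned embedding for fast retrieval.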
ProductNet is a collection of high-quality product datasets for better product understanding. Motivated by ImageNet, ProductNet aims at supporting product representation learning by curating product datasets of high quality with a properly chosen taxonomy. In this paper, the two goals of building high-quality product datasets and learning product representations support each other in an iterative fashion: the product embedding is obtained via a multi-modal deep neural network (master model) designed to leverage product image and catalog information; and in return, the embedding is utilized via active learning (local model) to vastly accelerate the annotation process. For the labeled data, the proposed master model yields high categorization accuracy (94.7% top-1 accuracy for 1240 classes), which can be used for search indices, partition keys, and input features for machine learning models. The product embedding, as well as the fine-tuned master model for a specific business task, can also be used for various transfer learning tasks.
Emojis have quickly become a universal language that is used by worldwide users, for everyday tasks, across language barriers, and in different apps and platforms. The prevalence of emojis has quickly attracted great attention from various research communities such as natural language processing, Web mining, ubiquitous computing, and human-computer interaction, as well as other disciplines including social science, arts, psychology, and linguistics.
This talk summarizes the recent efforts made by my research group and our collaborators on analyzing large-scale emoji data. The usage of emojis by worldwide users presents interesting commonality as well as divergence. In our analysis of emoji usage by millions of smartphone users in 212 countries, we show that the different preferences and usage of emojis provide rich signals for understanding the cultural differences of Internet users, which correlate with Hofstede's cultural dimensions.
Emojis play different roles when used alongside text. Through jointly learning the embeddings and topological structures of words and emojis, we reveal that emojis present both complementary and supplementary relations to words. Based on the structural properties of emojis in the semantic spaces, we are able to untangle several factors behind the popularity of emojis.
This talk also highlights the utility of emojis. In general, emojis have been used by Internet users as text supplements to describe objects and situations, express sentiments, or express humor and sarcasm; they are also used as communication tools to attract attention, adjust tones, or establish personal relationships. The benefit of using emojis goes beyond these intentions. In particular, we show that including emojis in the description of an issue report on GitHub results in the issue being responded to by more users and resolved sooner.
Large-scale emoji data can also be utilized by AI systems to improve the quality of Web mining services. In particular, a smart machine learning system can infer the latent topics, sentiments, and even demographic information of users based on how they use emojis online. Our analysis reveals a considerable difference between female and male users of emojis, large enough for a machine learning algorithm to accurately predict the gender of a user. In Web services that are customized for gender groups, gender inference models built upon emojis can complement those based on text or behavioral traces with fewer privacy concerns.
Emojis can also be used as an instrument to bridge Web mining tasks across language barriers, especially to transfer sentiment knowledge from a language with rich training labels (e.g., English) to languages that have been difficult for advanced natural language processing tasks. Through this bridge, developers of AI systems and Web services are able to reduce the inequality in the quality of services received by international users, an inequality caused by the imbalance of available human annotations across languages.
In general, emojis have evolved from visual ideograms to a brand-new world language in the era of AI and a new Web. The popularity, roles, and utility of emojis have all gone beyond people’s original intentions, which have created a huge opportunity for future research that calls for joint efforts from multiple disciplines.
Emojis are increasingly being used in today’s social communication - both formally in team messaging systems as well as informally via text messages on phones. Besides being used in social communication, emojis might also be a suitable mechanism for emotion (self-)assessment. Indeed, emojis can be expected to be familiar to people of different social groups and do not depend on the mastery of a specific language. However, emojis could be interpreted very differently from their actual intent. In order to determine whether people interpret emojis (specific to emotional states) in a consistent manner, we conducted an online survey on nine emojis with 386 people. The results show that the emojis representing anger, sadness, joy, surprise, and neutral state are interpreted as they were intended, independent of age and gender. Interpretations of other emojis such as Unamused Face and Face Screaming in Fear depend on age, and thus are not as useful for probing for emotion in a study setting unless all participants belong to the same age category. The Face with Rolling Eyes emoji is interpreted differently by gender and finally the Nauseated Face emoji resulted in no conclusive interpretation.
Currently, to support gender-inclusive codepoints (for example, gender-inclusive can be defined as male/female to an equal degree, as neither confidently identifiable as male nor female, etc.), all major platforms default to a male or a female design. So, if someone were to send the text “Love a good mansplain” followed by the facepalm emoji from a Microsoft device, their friend, reading on an iPhone, will see a differently gendered rendering of the same emoji, even though both designs map to U+1F926. This creates all kinds of cross-platform inconsistencies and in some cases reinforces stereotypes.
Focusing on a Chinese social media platform, this study adopts computer-mediated discourse analysis to examine how users employ emoji sequences to construct their personal identity through the expression of stance and engagement. Seven types of linguistic elements were identified by conducting stance and engagement analysis on emoji sequences in posts by social media influencers. Stance was more frequent than engagement. Attitude markers were the most common element used to convey stance, whereas directive was the most prevalent element used to express engagement. In addition, emoji sequences that did not convey stance and engagement were coded as n/a. This study also observed creative usages in the composition of emoji sequences that compensate for the lack of a prescribed emoji sequence grammar. Based on these findings, it advances recommendations for the design of emoji and of social media platforms grounded in linguistic principles.
Most NLP and computer vision tasks are limited by the scarcity of labelled data. In social media emotion classification and other related tasks, hashtags have been used as indicators to label data. With the rapid increase in emoji usage on social media, emojis are used as an additional feature for major social NLP tasks. However, this is less explored in the case of multimedia posts on social media, where posts are composed of both image and text. At the same time, we have seen a surge of interest in incorporating domain knowledge to improve machine understanding of text. In this paper, we investigate whether domain knowledge for emoji can improve the accuracy of the emotion classification task. We exploit the importance of different modalities from social media posts for the emotion classification task using state-of-the-art deep learning architectures. Our experiments demonstrate that the three modalities (text, emoji, and images) encode different information to express emotion and can therefore complement each other. Our results also demonstrate that emoji sense depends on the textual context, and that emoji combined with text encode better information than either considered separately. The highest accuracy of 71.98% is achieved with training data of 550k posts.
In the last two decades, Emoji have become a mainstay of digital communication, allowing ordinary people to convey ideas, concepts, and emotions with just a few Unicode characters. While emoji are most often used to supplement text in digital communication, they comprise a powerful and expressive vocabulary in their own right. In this paper, we study the affordances of “emoji-first” communication, in which sequences of emoji are used to describe concepts without textual accompaniment.
To investigate the properties of emoji-first communication, we built and released Opico, a social media mobile app that allows users to create reactions — sequences of between one and five emoji — and share them with a network of friends. We then leveraged Opico to collect a repository of more than 3700 emoji reactions from more than 1000 registered users, each tied to one of 2441 physical places.
We describe the design and architecture of the Opico app, present a qualitative and quantitative analysis of Opico’s reaction dataset, and discuss the implications of Emoji-first communication for future social platforms.
Chinese characters predate the introduction of digital emoji by approximately 3,000 years. Despite the temporal gap, there are striking parallels between the canonical set of Chinese radicals and the set of emoji that have currently been approved by the Unicode Consortium. Comparing the 214 Kangxi Chinese radicals with the 3,019 emojis in the Unicode 12.0 set can reveal semantic gaps and provide directions for new emoji. Our analysis found that 72.4% of radicals have reasonable emoji equivalents, while only 17.8% of radicals lack any emoji equivalent that we could determine.
Emojis have gained widespread acceptance, globally and cross-culturally. However, Emoji use may also be nuanced due to differences across cultures, which can play a significant role in shaping emotional life. In this paper, we a) present a methodology to learn latent emotional components of Emojis, b) compare Emoji-Emotion associations across cultures, and c) discuss how they may reflect emotion expression in these platforms. Specifically, we learn vector space embeddings with more than 100 million posts from China (Sina Weibo) and the United States (Twitter), quantify the association of Emojis with 8 basic emotions, demonstrate correlation between visual cues and emotional valence, and discuss pairwise similarities between emotions. Our proposed Emoji-Emotion visualization pipeline for uncovering latent emotional components can potentially be used for downstream applications such as sentiment analysis and personalized text recommendations.
In this study, we aim to predict the most likely emoji given only a short text as input. We extract a Hebrew political dataset of user comments for emoji prediction. Then, we investigate highly sparse n-gram representations as well as denser character n-gram representations for emoji classification. Since comments on social media are usually short, we also investigate four dimension reduction methods, which associate similar words with similar vector representations. We demonstrate that the common word embedding dimension reduction method is not optimal. We also show that the character n-gram representations outperform all the other representations for the task of emoji prediction in the Hebrew political domain.
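The character n-gram representation discussed above can be sketched in a few lines of plain Python. The example comments below are invented (and in English rather than Hebrew) purely for illustration:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts; ^ and $ padding marks the boundaries."""
    padded = f"^{text}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1)
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5
    return dot / (norm(c1) * norm(c2)) if c1 and c2 else 0.0

# Invented comments: shared character patterns yield higher similarity,
# even when the exact word forms differ.
a = char_ngrams("great speech")
b = char_ngrams("great speaker")
c = char_ngrams("total failure")
```

Because character n-grams share mass across inflected forms of a word, they remain informative on short comments where word-level features are too sparse, which matches the abstract's finding.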
This paper discusses two multi-experiment studies using the ERP methodology to investigate neural correlates of processing linguistic emojis. The first study examined the use of wink emojis used to mark irony and found the same ERP response complex that has been found in response to word-generated irony. Contingent upon individual differences in interpretation, these emojis are processed the same way as ironic words. The second study investigated the prediction of non-face emojis substituted for nouns. When predictability was high, unexpected emojis elicited the same ERP response patterns as words. Overall, the results of these two studies suggest that emojis used linguistically are processed in the same way as words and that individuals can integrate input from multiple modalities into a holistic representation of a single utterance.
The sharing of data is at the core of many technology companies. Data sharing is also increasingly important for government decision-making, as stated by the Commission on Evidence-Based Policymaking, which led to the Foundations for Evidence-Based Policymaking Act. However, in many instances, data used for decision-making is generated by people and needs to be explicitly shared by the data subject with those wanting to use it. The decision-making process behind sharing (private) information needs to be understood to assess (and circumvent) potential biases in the resulting data. When assessing bias in algorithmic decision-making, awareness of biases in the training data is essential. This presentation will review social science theories behind data-sharing decision-making, highlight a series of experimental studies designed to affect sharing decisions, and present a framework designed to detect sources of bias in various data sources.
As freelance work keeps growing almost everywhere, due to a sharp decrease in communication costs and to the spread of Internet-based labour marketplaces (e.g., guru.com, freelancer.com, mturk.com, upwork.com), many researchers and practitioners have started exploring the benefits of outsourcing and crowdsourcing [13, 14, 16, 23, 25, 29]. Since employers often use these platforms to find a group of workers to complete a specific task, researchers have focused their efforts on the study of team formation and matching algorithms and on the design of effective incentive schemes [2, 3, 4, 17]. Nevertheless, just recently, several concerns have been raised about possibly unfair biases introduced through the algorithms used to carry out these selection and matching procedures. For this reason, researchers have started studying the fairness of algorithms related to these online marketplaces [8, 19], looking for intelligent ways to overcome the algorithmic bias that frequently arises. Broadly speaking, the aim is to guarantee that, for example, the process of hiring workers through the use of machine learning and algorithmic data analysis tools does not discriminate, even unintentionally, on grounds of nationality or gender. In this short paper, we define the Fair Team Formation problem as follows: given an online labour marketplace where each worker possesses one or more skills, and where all workers are divided into two or more non-overlapping classes (for example, men and women), we want to design an algorithm that finds a team with all the skills needed to complete a given task and that has the same number of people from each class. We provide inapproximability results for the Fair Team Formation problem together with four algorithms for the problem itself. We also tested the effectiveness of our algorithmic solutions by performing experiments using real data from an online labour marketplace.
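The abstract does not describe the paper's four algorithms, so the following is only a hedged greedy sketch of the balanced-coverage idea: each round adds one worker from each class, choosing the tuple that covers the most still-uncovered skills. The worker names and skills are invented, and greedy selection is not optimal (the paper proves the problem is hard to approximate):

```python
from itertools import product

def fair_team(workers, task):
    """Greedy sketch of Fair Team Formation.

    workers: {name: (class_label, set_of_skills)}; task: set of required skills.
    Returns a team covering all skills with equal counts per class, or None.
    """
    classes = sorted({cls for cls, _ in workers.values()})
    team, covered = [], set()
    available = dict(workers)
    while not task <= covered:
        # Candidate tuples: one still-unused worker from each class.
        pools = [[w for w in available if available[w][0] == c] for c in classes]
        if any(not p for p in pools):
            return None  # cannot keep the team balanced
        best = max(product(*pools),
                   key=lambda tup: len(set().union(*(available[w][1] for w in tup))
                                       - covered))
        gain = set().union(*(available[w][1] for w in best)) - covered
        if not gain:
            return None  # remaining workers add no new skills
        for w in best:
            team.append(w)
            covered |= available[w][1]
            del available[w]
    return team

# Invented two-class marketplace.
workers = {
    "ann":  ("F", {"python", "ml"}),
    "bea":  ("F", {"design"}),
    "carl": ("M", {"sql"}),
    "dave": ("M", {"design", "sql"}),
}
team = fair_team(workers, {"python", "ml", "sql", "design"})
```

By construction every returned team has one member per class per round, so the class counts are always equal; coverage quality is what the greedy heuristic may sacrifice.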
Unintended bias in Machine Learning can manifest as systemic differences in performance for different demographic groups, potentially compounding existing challenges to fairness in society at large. In this paper, we introduce a suite of threshold-agnostic metrics that provide a nuanced view of this unintended bias, by considering the various ways that a classifier’s score distribution can vary across designated groups. We also introduce a large new test set of online comments with crowd-sourced annotations for identity references. We use this to show how our metrics can be used to find new and potentially subtle unintended bias in existing public models.
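One concrete instance of a threshold-agnostic measurement in this spirit is a subgroup AUC, computed from raw classifier scores with no decision threshold. The sketch below is my own minimal illustration on made-up data, not the paper's full metric suite.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

def subgroup_auc(examples, group):
    """AUC restricted to comments that reference the given identity group."""
    sub = [e for e in examples if group in e["groups"]]
    pos = [e["score"] for e in sub if e["label"] == 1]
    neg = [e["score"] for e in sub if e["label"] == 0]
    return auc(pos, neg)

# Hypothetical scored comments: a high-scoring negative example mentioning g1
# drags g1's subgroup AUC down even though g2 looks perfectly separated.
examples = [
    {"score": 0.90, "label": 1, "groups": {"g1"}},
    {"score": 0.95, "label": 0, "groups": {"g1"}},
    {"score": 0.80, "label": 0, "groups": {"g1"}},
    {"score": 0.70, "label": 1, "groups": {"g2"}},
    {"score": 0.20, "label": 0, "groups": {"g2"}},
    {"score": 0.10, "label": 0, "groups": {"g2"}},
]
```

Comparing such per-group score-distribution summaries, rather than accuracy at a single threshold, is what makes the bias measurement threshold-agnostic.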
An ever-increasing number of decisions affecting our lives are made by algorithms. For this reason, algorithmic transparency is becoming a pressing need: automated decisions should be explainable and unbiased. A straightforward solution is to make the decision algorithms open-source, so that everyone can verify them and reproduce their outcomes. However, in many situations, the source code or the training data of algorithms cannot be published for industrial or intellectual property reasons, as they are the result of long and costly experience (as is typically the case in banking or insurance). We present an approach whereby the individual subjects of automated decisions can elicit, in a collaborative and privacy-preserving manner, a rule-based approximation of the model underlying the decision algorithm, based on limited interaction with the algorithm or even only on how they have been classified. Furthermore, being rule-based, the approximation thus obtained can be used to detect potential discrimination. We present empirical work to demonstrate the practicality of our ideas.
The European General Data Protection Regulation (GDPR) brings new challenges for companies, who must demonstrate that their systems and business processes comply with usage constraints specified by data subjects. However, due to the lack of standards, tools, and best practices, many organizations struggle to adapt their infrastructure and processes to ensure and demonstrate that all data processing is in compliance with users’ given consent. The SPECIAL EU H2020 project has developed vocabularies that can formally describe data subjects’ given consent, as well as methods that use this description to automatically determine whether processing of the data according to a given policy is compliant with that consent. Whereas this makes it possible to determine whether processing was compliant or not, integration of the approach into existing line-of-business applications and ex-ante compliance checking remain open challenges. In this short paper, we demonstrate how the SPECIAL consent and compliance framework can be integrated into Linked Widgets, a mashup platform, in order to support privacy-aware ad-hoc integration of personal data. The resulting environment makes it possible to create data integration and processing workflows out of components that inherently respect the usage policies of the data being processed and are able to demonstrate compliance. We provide an overview of the necessary metadata and orchestration towards a privacy-aware linked data mashup platform that automatically respects subjects’ given consent. The evaluation results show the potential of our approach for ex-ante usage policy compliance checking within the Linked Widgets platform and beyond.
Nowadays, human trajectories are enriched with semantic information having multiple aspects, such as background geographic information, user-provided data from location-based social media, and data coming from various kinds of sensing devices. This new type of multiple-aspect representation of personal movements, as sequences of places visited by a person during his/her movement, poses even greater privacy violation threats. This paper provides the blueprint of a semantic-aware Moving Object Database (MOD) engine for privacy-aware sharing of such enriched mobility data and introduces an attack prevention mechanism in which all potential privacy breaches that may occur when answering a query are prevented through an auditing methodology. Towards enhancing the user-friendliness of our approach, we propose a mechanism whose objective is to modify the user queries that cannot be answered due to possible privacy violations into ‘similar’ queries that can be answered without exposing sensitive information.
In this paper, we analogize the practice of trolling to the practice of hacking. Just as hacking often involves the discovery and exploitation of vulnerabilities in a computer security landscape, trolling frequently involves the discovery and exploitation of vulnerabilities in a media or attention landscape to amplify messages and direct attention. As with hacking, we consider the possibility of a range of trolling personas: from black hat trolls who push an agenda that is clearly counter to the interests of the target, to gray hat trolls who exploit vulnerabilities to draw critical attention to unaddressed issues, and white hat trolls who could help proactively disclose vulnerabilities so that the attack surface can be reduced. We discuss a variety of trolling techniques, from dogpiling to sockpuppetry, as well as a range of possible interventions.
The prevalence of misinformation on online social media has tangible empirical connections to increasing political polarization and partisan antipathy in the United States. Ranking algorithms for social recommendation often encode broad assumptions about network structure (e.g., homophily) and group cognition (e.g., that social action is largely imitative). Assumptions like these can be naïve and exclusionary in the era of fake news and ideological uniformity towards the political poles. We examine these assumptions with the aid of the user-centric framework of trustworthiness in social recommendation. The constituent dimensions of trustworthiness (diversity, transparency, explainability, disruption) highlight new opportunities for discouraging dogmatization and building decision-aware, transparent news recommender systems.
Voice-based assistants are becoming increasingly widespread all over the world. However, the performance of these assistants when interacting with users who speak the languages and accents of developing countries is not yet clear. Potential bias against the specific languages or accents of different groups of people in developing countries may be a factor that widens the digital divide in these countries. Our research aims to analyse the presence of bias in interaction via audio. We carried out experiments to verify the quality of the recognition of phrases spoken by different groups of people. We evaluated the behaviour of Google Assistant and Siri for groups of people formed according to gender and to regions that have different accents. Preliminary results indicate that accent and mispronunciation due to regional differences are not being properly handled by the assistants we analyzed.
Recent awareness of the impacts of bias in AI algorithms raises the risk for companies that deploy such algorithms, especially because the algorithms may not be explainable in the same way that non-AI algorithms are. Even with careful review of the algorithms and data sets, it may not be possible to eliminate all unwanted bias, particularly because AI systems learn from historical data, which encodes historical biases. In this paper, we propose a set of processes that companies can use to mitigate and manage three general classes of bias: those related to mapping the business intent into the AI implementation, those that arise due to the distribution of samples used for training, and those that are present in individual input samples. While there may be no simple or complete solution to this issue, best practices can be used to reduce the effects of bias on algorithmic outcomes.
As the presence of Online Social Networks (OSNs) continues to grow as a form of mass communication, tensions regarding their usage and perception by different social groups are reaching a turning point. The number of messages exchanged between users in these environments is vast, and this has brought a trust problem: it is difficult to know whether the information comes from a real person and whether what was said is true. Automated users (bots) are part of this issue, as they may be used to spread false and/or harmful messages through an OSN while pretending to be a person. New attempts to automatically identify bots are in constant development, but so are the mechanisms to elude detection. We believe that teaching users to identify a bot message is an important step in maintaining the credibility of content on social media. In this study, we developed an analysis tool, based on media literacy considerations, that helps the ordinary user to recognize a bot message using only textual features. Instead of simply classifying a user as a bot or human, this tool presents an interpretable reasoning path that helps to educate the user in recognizing suspicious activity. Experimental evaluation is conducted to test the tool’s primary effectiveness (classification), and results are presented. The secondary effectiveness (interpretability) is discussed in qualitative terms.
In this work, we introduce a novel metric for auditing group fairness in ranked lists. Our approach offers two benefits compared to the state of the art. First, we offer a blueprint for modeling user attention. Rather than assuming a logarithmic loss in importance as a function of the rank, we can account for varying user behaviors through parametrization. For example, we expect a user to see more items during a viewing of a social media feed than when they inspect the results list of a single web search query. Second, we allow non-binary protected attributes, both to enable investigating inherently continuous attributes (e.g., political alignment on the liberal-to-conservative spectrum) and to facilitate measurements across aggregated sets of search results, rather than separately for each result list. By combining these two elements into our metric, we are able to better address the human factors inherent in this problem. We measure the whole sociotechnical system, consisting of a ranking algorithm and the individuals using it, instead of exclusively focusing on the ranking algorithm. Finally, we use our metric to perform three simulated fairness audits. We show that determining the fairness of a ranked output necessitates knowledge (or a model) of the end-users of the particular service. Depending on their attention distribution function, a fixed ranking of results can appear biased both in favor of and against a protected group.
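As a sketch of the role parametrized attention plays, the following toy computation (my own; the paper's exact attention model and metric may differ) scores the exposure of a continuous protected attribute under a geometric attention curve whose parameter distinguishes a patient feed-browsing user from an impatient single-query searcher.

```python
def exposure(attr_values, p):
    """Attention-weighted mean of a continuous attribute over a ranked list.
    Rank k receives weight p**k: a higher p models a more patient user who
    keeps scrolling (e.g., a feed) instead of stopping at the top results."""
    weights = [p ** k for k in range(len(attr_values))]
    return sum(w * a for w, a in zip(weights, attr_values)) / sum(weights)

# Hypothetical political alignment of ranked items, in [-1, 1]
# (liberal .. conservative), with liberal items ranked higher.
ranking = [-0.8, -0.5, 0.1, 0.6, 0.9]
population_mean = sum(ranking) / len(ranking)

patient = exposure(ranking, p=0.9)    # feed-style browsing reaches deep items
impatient = exposure(ranking, p=0.3)  # single query: attention stays at the top
```

For this fixed ranking, the impatient user experiences a strongly liberal-leaning exposure while the patient user's exposure sits near the population mean, which is exactly the audit conclusion described above: the same ranking can appear biased or fair depending on the modeled attention distribution.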
Algorithms for social influence maximization have been extensively studied for the purpose of strategically choosing an initial set of individuals in a social network from which information gets propagated. With many applications in advertisement, news spread, vaccination, and online trend-setting, this problem is central to understanding how information flows in a network of individuals. As human networks may encode historical biases, algorithms operating on them might capture and reproduce such biases when automating outcomes.
In this work, we study the social influence maximization problem for the purpose of designing fair algorithms for diffusion, aiming to understand the effect of communities on the creation of disparate impact among network participants based on demographic attributes (gender, race, etc.). We propose a set of definitions and models for assessing the fairness-utility tradeoff in designing algorithms that maximize influence, through a mathematical model of diffusion and an empirical analysis of a dataset collected from Instagram. Our work shows that being feature-aware can lead to more diverse outcomes in outreach and seed selection, as well as better efficiency, than being feature-blind.
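To illustrate what feature-aware seed selection can look like, here is a toy sketch (my own construction; the edge probability, quota scheme, and network are hypothetical, not the authors' model) of greedy influence maximization under the independent cascade model with a per-group quota on seeds.

```python
import random

def simulate_ic(graph, seeds, p=0.2, runs=200, rng=None):
    """Monte Carlo estimate of expected spread under the independent cascade model."""
    rng = rng or random.Random(0)
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in graph.get(u, []):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / runs

def fair_greedy_seeds(graph, groups, k, per_group_quota):
    """Greedy seed selection constrained by a per-group quota (feature-aware)."""
    seeds = set()
    counts = {g: 0 for g in set(groups.values())}
    for _ in range(k):
        base = simulate_ic(graph, seeds)
        best, best_gain = None, float("-inf")
        for v in graph:
            if v in seeds or counts[groups[v]] >= per_group_quota[groups[v]]:
                continue
            gain = simulate_ic(graph, seeds | {v}) - base  # marginal spread
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            break
        seeds.add(best)
        counts[groups[best]] += 1
    return seeds

# Hypothetical two-community network: nodes 0-2 are group A, 3-4 are group B.
graph = {0: [1, 2], 1: [0], 2: [0], 3: [4], 4: [3]}
groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
seeds = fair_greedy_seeds(graph, groups, k=2, per_group_quota={"A": 1, "B": 1})
```

The quota forces one seed into each community, whereas a feature-blind greedy would tend to spend both seeds on the larger community.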
As today’s media landscape is shaped by social media endorsements and built on automated recommendations, both are often criticized for inducing vicious dynamics, such as filter bubbles, echo chambers, or polarization. We introduce a new model featuring a mild version of homophily and two well-known popularity dynamics. These broadly reproduce organic activity and algorithmic filtering, respectively; the latter is now commonplace within social media and other online services. Surprisingly, we show this is all that is needed to create hegemony: a single viewpoint (or side) not only receives undue attention, but also captures all the attention given to “top trending” items.
The Fourth Industrial Revolution (4IR) is characterized by a fusion of technologies, which is blurring the lines between the physical, digital, and biological spheres. In this context, two fundamental characteristics emerge: transparency and privacy. On one side, transparency can be seen as the quality that allows participants of a community to know which particular processes are being applied, by which agents, and on which data items. It is generally regarded as a means to enable checks and balances within this community, so as to provide a basis for trust among its participants. Privacy, on the other side, essentially refers to the right of an individual to control how information about her/him is used by others. The issue of public transparency versus individual privacy has long been discussed, and within already existing 4IR scenarios, it has become clear that the free flow of information fostered by transparency efforts poses serious conflicts with privacy assurance. In order to deal with the myriad of often conflicting cross-cutting concerns, Internet applications and systems must incorporate adequate mechanisms to ensure compliance with both ethical and legal principles. In this paper, we use the OurPrivacy Framework as a conceptual framework to precisely characterize where in the design process decisions must be made to handle both transparency and privacy concerns.
Machine learning algorithms are used to make decisions in various applications. These algorithms rely on large amounts of sensitive individual information to work properly. Hence, there are societal concerns about machine learning algorithms on matters like privacy and fairness. Currently, many studies focus only on protecting individual privacy or on ensuring the fairness of algorithms. However, how to meet both privacy and fairness requirements simultaneously in machine learning algorithms remains underexplored. In this paper, we focus on one classic machine learning model, logistic regression, and develop differentially private and fair logistic regression models by combining the functional mechanism and decision boundary fairness in a joint form. Theoretical analysis and empirical evaluations demonstrate that our approaches effectively achieve both differential privacy and fairness while preserving good utility.
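The following toy sketch (pure Python, hypothetical data) illustrates the fairness half of this combination: logistic regression trained with a penalty on the covariance between the sensitive attribute and the signed distance to the decision boundary. For the privacy half it merely adds Laplace noise to the learned weights (output perturbation, with an illustrative noise scale), a simplified stand-in for the functional mechanism described above, which instead perturbs the objective function itself.

```python
import math
import random

def laplace(scale, rng):
    """Laplace sample via the inverse CDF (u kept strictly inside (-0.5, 0.5))."""
    u = max(min(rng.random() - 0.5, 0.499999), -0.499999)
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def boundary_cov(w, X, s):
    """Covariance between sensitive attribute s and signed boundary distance w.x."""
    n, s_mean = len(X), sum(s) / len(s)
    return sum((si - s_mean) * sum(wj * xj for wj, xj in zip(w, xi))
               for si, xi in zip(s, X)) / n

def train_fair_lr(X, y, s, lam=0.0, eps=None, lr=0.1, steps=2000, rng=None):
    """Gradient descent on logistic loss + lam * boundary_cov(w, X, s)**2.
    If eps is given, Laplace noise (illustrative scale 1/eps) is added to w."""
    rng = rng or random.Random(0)
    n, d = len(X), len(X[0])
    s_mean = sum(s) / n
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):                    # logistic loss gradient
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        cov = boundary_cov(w, X, s)                 # fairness penalty gradient
        for j in range(d):
            dcov = sum((si - s_mean) * xi[j] for si, xi in zip(s, X)) / n
            grad[j] += 2 * lam * cov * dcov
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    if eps is not None:
        w = [wj + laplace(1.0 / eps, rng) for wj in w]
    return w

# Hypothetical data where feature 0 coincides with the sensitive attribute.
X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
y = [1, 1, 0, 0]
s = [1, 1, 0, 0]
w_plain = train_fair_lr(X, y, s, lam=0.0)
w_fair = train_fair_lr(X, y, s, lam=50.0)
w_private = train_fair_lr(X, y, s, lam=50.0, eps=1.0)  # noisy, fair weights
```

The penalty drives the boundary covariance toward zero, trading some accuracy on the sensitive-correlated feature for fairness, which mirrors the privacy/fairness/utility tension the abstract describes.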
The online game-with-a-purpose Phrase Detectives (https://www.phrasedetectives.com) has been collecting decisions about anaphoric coreference in human language for over 10 years (4 million judgements from 40,000 players). The game was originally designed to collect multiple valid solutions for a single task, which complicated aggregation but created a very rich (and noisy) dataset. Analysis of the ambiguous player decisions highlights the need for understanding and resolving the disagreement that is inherent in language interpretation. This talk will present some of the interesting cases of ambiguity found by the players of Phrase Detectives (a dataset that will be made available to the research community later this year) and discuss the statistical methods we have been working on to harness crowds that disagree with each other [4, 5].
Many real-world analytics problems examine multiple entities or classes that may appear in a corpus. For example, in a customer satisfaction survey analysis there are over 60 categories of (somewhat overlapping) concerns. Each of these is backed by a lexicon of terminology associated with the concern (e.g., “Easy, user friendly process” or “Process confusing, too many handoffs”). These categories need to be expanded by a subject matter expert, as the terminology is not always straightforward (e.g., “handoffs” may also include “ping-pong” and “hot potato” as relevant terms).
But given that subject matter expert time is costly, which of the 60+ lexicons should we expand first? We propose a metric for evaluating an existing set of lexicons and providing guidance on which are likely to benefit most from human-in-the-loop expansion. Using our ranking results, we achieved an ≈4× improvement in impact when expanding the first few lexicons from our suggested list, as compared to a random selection.
Data exploration is a task that inherently requires high human interaction. The subject matter expert looks at the data to identify a hypothesis, potential questions, and where to look for answers in the data. Virtually all data exploration scenarios can benefit from a tight human-in-the-loop paradigm, where data can be visualized and reshaped, but also augmented with missing semantic information that the subject matter expert can supplement in itinere. In this demo we show a novel graph-based data exploration model in which the subject matter expert can annotate and maneuver the data to answer specific questions. This demo specifically focuses on the task of migrating data centers, logically and/or physically, where the subject matter expert needs to identify the function of each node in the data center (a server, a virtual machine, a printer, etc.), which is not necessarily directly available in the data, and to plan a safe switch-off and relocation of a cluster of nodes. We show how the novel human-in-the-loop data exploration and enrichment paradigm helps in designing the data center migration plan.
Electronic publishers and other web companies are starting to collect user feedback on ads with the aim of using this signal to maintain the quality of ads shown on their sites. However, users are not randomly sampled to provide feedback on ads, but targeted. Furthermore, some users who provide feedback may be prone to dislike ads more than the general user. This raises questions about the reliability of ad feedback as a signal for measuring ad quality and about whether it can be used in ad ranking. In this paper, we start by gaining insights into such signals by analyzing the feedback event logs attributed to users of a popular mobile news app. We then propose a model to reduce potential biases in ad feedback data. Finally, we conclude by comparing the effectiveness of reducing the bias in ad feedback data using existing ad ranking methods along with a novel approach we propose that takes revenue considerations into account.
Collaborative creation of knowledge is an approach that has been successfully demonstrated by crowdsourcing projects like Wikipedia. Similar techniques have recently been adopted for the creation of collaboratively generated Knowledge Graphs such as Wikidata. While such an approach enables the creation of high-quality structured content, it also comes with the challenge of introducing contributors’ implicit bias into the generated Knowledge Graph. In this paper, we investigate how paid crowdsourcing can be used to understand contributor bias for controversial facts to be included in collaborative Knowledge Graphs. We propose methods to trace the provenance of crowdsourced fact checking, thus enabling bias transparency rather than aiming at eliminating bias from the Knowledge Graph.
Loneliness is becoming a global epidemic. As many as 33% of Americans report being chronically lonely, with similar percentages reported in countries around the world, and this percentage has risen in recent years. Many are turning to online forums as a way to connect with others about their feelings of loneliness and to begin to reduce these feelings. However, posts often receive no response and online conversations do not take place, perhaps because those conversing did not find a connection with each other, potentially leaving the poster feeling even lonelier. This paper introduces a human-in-the-loop approach by which computers can mediate online interactions about loneliness and facilitate more intimate interactions. We also discuss ways to mitigate bias when creating this system. The artificial intelligence in this approach takes into account the homophilous characteristics of the conversations taking place online by examining the homophily of the participants. We present initial findings on the correlation between homophily and successful conversations about loneliness on Reddit, and lay the groundwork for facilitating the discovery of optimal conversation partners for those who are feeling lonely.
Time series are one of the most common data types in nature. Given this fact, dozens of query-by-sketching, query-by-example, and query-algebra systems have been proposed to allow users to search large time series collections. However, none of these systems has seen widespread adoption. We argue that there are two reasons why this is so. The first is that these systems are often complex and unintuitive, requiring the user to master complex syntax or interfaces to construct high-quality queries. The second reason is less well appreciated: the expressiveness of most query-by-content systems is surprisingly limited. There are well-defined, simple queries that cannot be answered by any current query-by-content system, even one using a state-of-the-art distance measure such as Dynamic Time Warping. In this work, we propose a natural language search mechanism for searching time series. We show that our system is expressive and intuitive and requires little space and time overhead. Because our system is text-based, it can leverage decades of research in text retrieval, including ideas such as relevance feedback. Moreover, we show that our system subsumes both motif/discord discovery and most existing query-by-content systems in the literature. We demonstrate the utility of our system with case studies in domains as diverse as animal motion studies, medicine, and industry.
Recommender Systems (RSs) are widely used to help online users discover products, books, news, music, movies, courses, restaurants, etc. Because a traditional recommendation strategy always shows the most relevant items (i.e., those with the highest predicted ratings), traditional RSs are expected to make popular items even more popular and non-popular items even less popular, which in turn further divides the haves (popular) from the have-nots (unpopular). A major problem with RSs is therefore that they may introduce biases affecting the exposure of items, creating a popularity divide during the feedback loop that occurs with users, and this may lead the RS to make increasingly biased recommendations over time. In this paper, we view the RS environment as a chain of events resulting from interactions between users and the RS. Based on that, we propose several debiasing algorithms along this chain of events and evaluate how these algorithms impact the predictive behavior of the RS, as well as trends in the popularity distribution of items over time. We also propose a novel blind-spot-aware matrix factorization (MF) algorithm to debias the RS. Results show that propensity matrix factorization achieved a certain level of debiasing of the RS, while active learning combined with the propensity MF achieved a higher debiasing effect on recommendations.
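As a concrete illustration of propensity-based debiasing, here is a generic sketch of inverse-propensity-weighted matrix factorization (my own, not the blind-spot-aware algorithm proposed above): each observed rating's update is reweighted by the inverse of its estimated observation propensity, so feedback on frequently exposed popular items counts less during training. The ratings and propensities below are made up.

```python
import random

def propensity_mf(ratings, propensity, n_users, n_items,
                  k=2, lr=0.01, reg=0.1, epochs=1500, rng=None):
    """SGD matrix factorization with inverse-propensity-weighted updates.
    ratings: {(user, item): rating}; propensity[(user, item)] estimates the
    probability that this rating was observed at all (high for popular items)."""
    rng = rng or random.Random(0)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            w = 1.0 / propensity[(u, i)]          # inverse-propensity weight
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (w * err * qi - reg * pu)
                Q[i][f] += lr * (w * err * pu - reg * qi)
    return P, Q

# Hypothetical ratings: item 0 is popular (often shown, high propensity),
# while item 1 is rarely shown, so its single rating is up-weighted.
ratings = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 4.0}
propensity = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.9}
P, Q = propensity_mf(ratings, propensity, n_users=2, n_items=2)
```

In expectation, this kind of reweighting counteracts the exposure imbalance of the feedback loop: the model is no longer fit predominantly to ratings of items the RS already over-recommends.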
This paper explores the intersection between microservices and Multi-Agent Systems (MAS), introducing the notion of a new approach to building MAS known as Multi-Agent MicroServices (MAMS). Our approach is illustrated through a worked example of a Vickrey Auction implemented as a microservice.
The industrial domain offers a high degree of standardization, a variety of very specialized use cases, and an abundance of resources. These characteristics provide perfect conditions for Digital Companion systems. A Digital Companion is a cognitive agent that assists human users by taking on three roles: guardian, assistant or mentor, and partner. This paper describes the characteristics, conceptual architecture, use cases, and open challenges regarding Digital Companions for industry.
Deploying context management systems at a global scale comes with a number of challenges and requirements. We argue that the hypermedia model and the agent-oriented paradigm help achieve the vision of Context-as-a-Service. We categorize challenges according to context processing concerns and use a scenario to exemplify how the proposed architectural principles help overcome the challenges.
Decentralised data solutions bring their own sets of capabilities, requirements, and issues not necessarily present in centralised solutions. In order to compare the properties of different approaches or tools for the management of decentralised data, it is important to have a common evaluation framework. We present a set of dimensions relevant to data management in decentralised contexts and use them to define principles extending the FAIR framework, initially developed for open research data. By characterising a range of different data solutions and approaches by how TRusted, Autonomous, Distributed, and dEcentralised they are, in addition to how Findable, Accessible, Interoperable, and Reusable, we show that our FAIR TRADE framework is useful for describing and evaluating the management of decentralised data solutions, and we aim to contribute to the development of best practice in a developing field.
This panel will focus on industry applications of knowledge graphs and showcase how knowledge graphs are transforming conventional and unconventional industries for the new era of AI, with examples ranging from innovations in medicine and healthcare, literature search, e-commerce, and professional connections to getting a ride.
Panelists are:
- Senior Software Engineer/Research Scientist at Uber, co-founder of Tinkerpop, specialized in real-time semantics, RDF streams, and graph databases.
- Head of AI at Genentech, passionate about modeling and currently developing a general medical inference engine that can be applied to a wide variety of areas, from point-of-care decision support and triage to insurance risk management, to name a few.
- Senior Staff Engineer/Director at the Business Platform Unit, Alibaba, leading the Product Knowledge Graph (PKG) team and the Business Platform AI team. He and his team have built a huge PKG with 10 billion entities.
- Data Scientist at Numedii, previously a postdoctoral research fellow at Stanford University School of Medicine, developing novel methods to integrate and explore a broad set of biological and clinical data for scientific reproducibility and biomedical discovery.
- An accomplished technical scientist, innovator, and R&D leader in cutting-edge technology research and product development, with a proven track record of success (20+ years of successful professional career in leading R&D organizations) in rapid technological advancement, innovation, and highly competitive environments; a broad range of skills, from initiating research breakthroughs to achieving marketable product development; and renowned expertise and technological vision in the fields of enterprise middleware, cloud computing, data-centric computing, workload-optimized systems and appliances, business analytics, big data, social media and multimedia, and speech and natural language processing.
- Technical leader in systems design for the LinkedIn Economic Graph, Google Search, and Google Research, with breadth and depth of expertise in building data systems and platforms, and with technical leadership and management experience in software development in both start-ups and large companies.
- A principal research staff member and senior manager at IBM Almaden Research Center, managing the information management department, which works on HTAP (hybrid transactional and analytical processing) systems, large-scale machine learning, and natural language querying of data.
- A team lead and senior scientist at Bloomberg. He holds a PhD in computer science from the University of Amsterdam and has an extensive track record in artificial intelligence, information retrieval, knowledge graphs, natural language processing, and machine learning. Before joining Bloomberg he worked at Yahoo Labs on semantic search at web scale using the Yahoo Knowledge Graph. At Bloomberg he leads the team responsible for leveraging knowledge graph technology to drive advanced financial insights.
This panel will focus on cutting-edge computational methods that can be applied to knowledge graphs, such as the latest NLP technologies for extracting entities and relationships to build knowledge graphs, machine learning and deep learning methods for mining knowledge graphs, and intelligent search and recommendations powered by knowledge graphs.
Panelists are:
- Professor and Vice Chair of the Department of Computer Science and Technology (DCST) of Tsinghua University. I obtained my Ph.D. from the DCST of Tsinghua University in 2006. My research interests include artificial intelligence, data mining, social networks, machine learning, and knowledge graphs, with an emphasis on designing new algorithms for mining social and knowledge networks.
- Associate Professor of Computer Science at Stanford University. My research focuses on mining and modeling large social and information networks, their evolution, and the diffusion of information and influence over them. The problems I investigate are motivated by large-scale data, the web, and online media.
- Managing Director of MSR Outreach, an organization with the mission to serve the research community. In addition to applying intelligent technologies to make Bing and Cortana smarter in gathering and serving academic knowledge, we are also starting an experimental website, academic.microsoft.com (powered by the Academic API), and mobile apps dedicated to exploring new service scenarios for active researchers like myself.
- A leading expert in knowledge representation and reasoning languages and systems who has worked on ontology creation and evolution environments for over 20 years. Most recently, McGuinness is best known for her leadership role in semantic web research, and for her work on explanation, trust, and applications of semantic web technology, particularly for scientific applications.
- Director of Artificial Intelligence at Amazon Web Services, passionate about opening up new markets and opportunities through the smart application of Artificial Intelligence and application-driven research in AI, in particular in language technology, speech processing, computer vision, and computational reasoning.
Deep neural networks have achieved promising results in stock trend prediction. However, most of these models share two drawbacks: (i) they are not sensitive enough to abrupt changes in stock trend, and (ii) their forecasts are not interpretable to humans. To address these two problems, we propose a novel Knowledge-Driven Temporal Convolutional Network (KDTCN) for stock trend prediction and explanation. First, we extract structured events from financial news and utilize external knowledge from a knowledge graph to obtain event embeddings. Then, we combine event embeddings and price values to forecast the stock trend. We evaluate prediction accuracy to show how knowledge-driven events help with abrupt changes. We also visualize the effect of events and the linkage among events based on the knowledge graph, to explain why knowledge-driven events are common sources of abrupt changes. Experiments demonstrate that KDTCN can (i) react to abrupt changes much faster and outperform state-of-the-art methods on stock datasets, and (ii) facilitate the explanation of predictions, particularly around abrupt changes.
We present WebProtégé, a tool to develop ontologies represented in the Web Ontology Language (OWL). WebProtégé is a cloud-based application that allows users to collaboratively edit OWL ontologies, and it is available for use at https://webprotege.stanford.edu. WebProtégé currently hosts more than 68,000 OWL ontology projects and has over 50,000 user accounts. In this paper, we detail the main new features of the latest version of WebProtégé.
Content-based news recommendation systems need to recommend news articles based on the topics and content of articles, without using user-specific information. Many news articles describe the occurrence of specific events and named entities, including people, places or objects. In this paper, we propose a graph traversal algorithm as well as a novel weighting scheme for cold-start, content-based news recommendation utilizing these named entities. Seeking to create a higher degree of user-specific relevance, our algorithm computes the shortest distance between named entities, across news articles, over a large knowledge graph. Moreover, we have created a new human-annotated data set for evaluating content-based news recommendation systems. Experimental results show that our method is suitable for tackling the hard cold-start problem and produces a stronger Pearson correlation with human similarity scores than other cold-start methods. Our method is also complementary, and combining it with conventional cold-start recommendation methods may yield significant performance gains. The dataset, CNRec, is available at: https://github.com/kevinj22/CNRec
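The graph-traversal idea above can be sketched as a breadth-first search over a toy knowledge graph. The entities, edges, and the 1/(1+d) similarity mentioned in the comment are illustrative assumptions, not the paper's exact weighting scheme.

```python
from collections import deque

def shortest_distance(graph, source, target):
    """BFS shortest-path distance between two entities in an undirected KG."""
    if source == target:
        return 0
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for neigh in graph.get(node, ()):
            if neigh == target:
                return dist + 1
            if neigh not in seen:
                seen.add(neigh)
                frontier.append((neigh, dist + 1))
    return float("inf")  # entities not connected in the KG

# Toy knowledge graph over entities mentioned across news articles.
kg = {
    "Paris": ["France", "Eiffel Tower"],
    "France": ["Paris", "Europe"],
    "Europe": ["France", "Germany"],
    "Germany": ["Europe", "Berlin"],
    "Berlin": ["Germany"],
    "Eiffel Tower": ["Paris"],
}

# Article similarity could then be a decreasing function of the
# distances between their entity pairs, e.g. 1 / (1 + d).
d = shortest_distance(kg, "Paris", "Berlin")
print(d)  # 4: Paris -> France -> Europe -> Germany -> Berlin
```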
Historically, most of the focus in the knowledge graph community has been on the support for web, social network, or product search applications. This paper describes some of our experience in developing a large-scale applied knowledge graph for a more technical audience with more specialized information access and analysis needs – the air traffic management community. We describe ATMGRAPH, a knowledge graph created by integrating various sources of structured aviation data, provided in large part by US federal agencies. We review some of the practical challenges we faced in creating this knowledge graph.
Automatic extraction of information from text and its transformation into a structured format is an important goal in both Semantic Web research and computational linguistics. Knowledge Graphs (KG) serve as an intuitive way to provide structure to unstructured text. A fact in a KG is expressed in the form of a triple, which captures entities and their interrelationships (predicates). Multiple triples extracted from text can be semantically identical, but a vocabulary gap between them can lead to an explosion in the number of redundant triples. Hence, to bridge this vocabulary gap, triples need to be mapped to a homogeneous namespace. In this work, we present an end-to-end KG construction system, which identifies and extracts entities and relationships from text and maps them to the homogeneous DBpedia namespace. For predicate mapping, we propose a deep learning architecture to model semantic similarity. This mapping step is computationally heavy, owing to the large number of triples in DBpedia. We identify and prune unnecessary comparisons to make this step scalable. Our experiments show that the proposed approach constructs a richer KG at a significantly lower computation cost than previous work.
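A minimal sketch of the predicate-mapping step: the paper uses a learned deep similarity model, but the map-then-prune logic can be illustrated with cosine similarity over hypothetical predicate embeddings. The vectors, property names, and threshold below are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings for DBpedia properties (illustrative values).
dbpedia = {
    "dbo:birthPlace": [0.9, 0.1, 0.0],
    "dbo:deathPlace": [0.7, 0.2, 0.1],
    "dbo:spouse":     [0.0, 0.1, 0.9],
}

def map_predicate(embedding, candidates, threshold=0.8):
    """Map an extracted predicate to its most similar DBpedia property,
    pruning candidates below the similarity threshold."""
    best, best_sim = None, threshold
    for prop, vec in candidates.items():
        sim = cosine(embedding, vec)
        if sim > best_sim:
            best, best_sim = prop, sim
    return best  # None if every candidate was pruned

print(map_predicate([0.85, 0.15, 0.05], dbpedia))  # dbo:birthPlace
```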
Linked Open Data and the RDF format have become the premier method of publishing structured data representing entities and facts. Specifically, media organizations, such as the New York Times and the BBC, have embraced Linked Open Data as a way of providing structured access to traditional media content, including articles, images, and video. To ground RDF entities and predicates in existing Linked Open Data sources, dataset curators provide links for some entities to existing general purpose repositories, such as YAGO and DBpedia, using entity extraction and linking tools. However, these state-of-the-art tools rely on the entities to exist in the knowledge base. How much of the information is actually new and thus unable to be grounded is unclear. In this work, we empirically investigate the prevalence of new entities in news feeds with respect to both public and commercial knowledge graphs.
This paper presents the KG Usage framework, which allows the introduction of KG features to address Trust, Privacy and Transparency concerns regarding the use of KG contents by applications. A real-world example is presented and used to illustrate how the framework can be applied.
Extracting entities and relations is critical to understanding massive text corpora. Recently, neural joint models have shown promising results for this task. However, these joint models do not use entity features effectively. In this paper, we propose an approach that utilizes implicit entity features in the joint model, and we show that these features can facilitate the joint extraction task. In particular, we use the hidden-layer vectors extracted from a pre-trained named entity recognition model as the entity features. Thus, our method does not need hand-designed entity features and can benefit from new developments in named entity recognition. In addition, we introduce an attention mechanism into our model, which can select the parts of the input sentence that are informative for prediction. We conduct a series of experiments on a public dataset, and the results show the effectiveness of our model.
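The attention idea, selecting informative parts of the sentence, can be sketched with plain dot-product attention over token hidden vectors. The toy vectors and query below are invented; the paper's actual attention parameterization may differ.

```python
import math

def attention_pool(hidden, query):
    """Dot-product attention: weight each token's hidden vector by its
    relevance to a query vector, then return the weighted sum."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return weights, pooled

# Three token hidden vectors; the second carries the relation-bearing signal.
hidden = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 0.5]
weights, pooled = attention_pool(hidden, query)
print(max(range(3), key=lambda i: weights[i]))  # 1: the informative token
```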
Misinformation dissemination is a topic that has gained considerable attention from academia and the media in general. Despite a rich literature on strategies to detect and mitigate this phenomenon, the problem persists, with impact on several sectors of society. In this talk, I will discuss the problem, review existing approaches, and discuss the challenges of properly addressing it. I will also discuss recent results of our group on the investigation of misinformation spread on WhatsApp.
In this talk we refer to bias in its everyday sense, as a prejudice against a person or a group, and ask whether an algorithm, particularly a ranking algorithm, can be biased. We begin by defining under which conditions this can happen. Next, we describe key results from research on algorithmic fairness, much of which studies automatic classification by a supervised learning method. Finally, we attempt to map these concepts to rankings and to introduce new, ranking-specific ways of looking at algorithmic bias.
News websites are currently one of the main sources of information. Like traditional media, these sources can have a bias in how they report news. This media bias can influence how people perceive events, political decisions, or discussions. In this paper, we describe a link-based approach to identify news websites with the same political orientation, i.e., characterize the bias of news websites, using network analysis techniques. After constructing a graph from a few seeds with previously known bias, we show that a community detection algorithm can identify groups formed by sources with the same political orientation.
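The pipeline above (build a link graph from seeds, then detect communities) can be sketched with a simple synchronous label-propagation step on a toy link graph. The site names and edges are invented, and a real system would typically use an off-the-shelf community detection algorithm such as Louvain.

```python
from collections import Counter

def label_propagation(adj, max_iter=20):
    """Synchronous label propagation: each node adopts the most frequent
    label among its neighbours (ties broken by the smallest label)."""
    labels = {v: v for v in adj}
    for _ in range(max_iter):
        new = {}
        for v, neighbours in adj.items():
            counts = Counter(labels[u] for u in neighbours)
            top = max(counts.values())
            new[v] = min(l for l, c in counts.items() if c == top)
        if new == labels:  # converged
            break
        labels = new
    return labels

# Two tightly linked clusters of news sites joined by a single cross link.
adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
labels = label_propagation(adj)
communities = {}
for v, l in labels.items():
    communities.setdefault(l, set()).add(v)
print(sorted(map(sorted, communities.values())))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```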
We study the workload of an online invoicing application with clients in the Andean region of South America. The application is offered to clients under a Software-as-a-Service model, has a microservices architecture, and runs in a containerized environment on a public cloud provider. The cloud application workload described in this paper can be used as part of a workload suite comprising different application workloads when evaluating microservices architectures. To the best of our knowledge, this is a novel workload in the web domain, and it complements other publicly available workloads. Though we make no claim of the general applicability of this workload as a “microservices benchmark”, its inclusion in evaluations could help researchers and practitioners enrich their evaluations with tests based on a real microservices-based web application. Finally, we provide some insights regarding best practices in microservice design, derived from the observed workload.
In this work, we analyze the content and structure of the Twitter trending topic #cuentalo with the purpose of providing a visualization of the movement. A supervised learning methodology is used to train classification algorithms on hand-labeled observations. The methodology allows us to classify each tweet according to its role in the movement.
The Bitcoin protocol and its underlying cryptocurrency have started to shape the way we view digital currency and have opened up a long list of new and interesting challenges. Among them, we focus on the question of how the price of digital currencies is affected, a natural question especially when considering the price rollercoaster we witnessed for bitcoin in 2017-2018. We work under the hypothesis that the price is affected by the web footprint of influential people, whom we refer to as crypto-influencers.
In this paper we provide neural models for predicting the bitcoin price. We compare what happens when the model is fed only with recent price history versus when it is fed, in addition, with a measure of the positivity or negativity of the statements of these influencers, obtained through sentiment analysis of their Twitter posts. We show preliminary evidence that Twitter data should indeed help to predict the price of bitcoin, even though the measures we use in this paper leave much room for refinement. In particular, we also discuss the challenges of correctly measuring the sentiment of these posts, and discuss the work that should help improve our findings even further.
The world of video games has changed considerably over recent years. Its diversification has dramatically increased the number of users engaged in online communities of this entertainment area and, consequently, the number and types of games available. This context of information overload underpins the development of recommender systems that could leverage the information that video game platforms collect, thus following the trend of new games coming out every year. In this work we test the potential of state-of-the-art recommender models based respectively on Factorization Machines (FM), deep neural networks (DeepNN) and a mixture of both (DeepFM), chosen for their ability to receive multiple inputs as well as different types of input variables. We evaluate our results by measuring the ranking accuracy of the recommendations and the diversity/novelty of the recommendation lists. All the algorithms achieve better results than a baseline based on implicit feedback (an Alternating Least Squares model). The best-performing algorithm is DeepNN, suggesting that high-order interactions are more important than low-order ones for this recommendation task. We also analyze the effect of the sentiment extracted directly from game reviews and find that it is not as relevant for recommendation as one might expect. We are the first to study the aforementioned recommender systems in the context of online video game platforms, reporting novel results which could be used as a baseline in future works.
When making purchasing decisions, customers usually rely on information from two types of sources: product specifications, provided by manufacturers, and reviews, posted by other customers. Both kinds of information are often available on e-commerce websites. While researchers have demonstrated the importance of product specifications and reviews as separate and valuable sources to support purchase decision-making, the relationship between these two kinds of information remains largely uninvestigated. In this paper we present an empirical study of the use, in customer reviews, of direct and indirect mentions of canonical product attributes, that is, those defined by manufacturers in product specifications. For this study, we analyzed more than 1,100,000 opinionated sentences from about 650,000 user reviews on Amazon.com across five product categories. Our results indicate that user opinions are indeed guided by the attributes from product specifications and highlight the influence of canonical attributes on user reviews.
Complex human behaviors related to crime require multiple sources of information to be understood. Social media is a place where people share opinions and news, which allows events in the physical world, such as crimes, to be reflected there. In this paper we study crime from the perspective of social media, specifically car theft and Twitter. We use data from car theft reports on Twitter and from car insurance companies in Chile to perform a temporal analysis. We found an increasing correlation in recent years between the number of car theft reports on Twitter and the data collected from insurance companies. We performed yearly, monthly, daily and hourly analyses. Though Twitter is an unstructured and very noisy source, it allows one to estimate the volume of thefts reported to insurers. We experimented with a moving average to predict the trend in the number of car thefts reported to insurers using Twitter data, and found that one month is the best time window for prediction.
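The moving-average predictor is simple enough to sketch directly; the monthly report counts below are invented for illustration.

```python
def moving_average_forecast(series, window):
    """Predict the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("not enough history for this window")
    return sum(series[-window:]) / window

# Hypothetical monthly counts of car-theft reports.
monthly_reports = [120, 135, 128, 140, 150, 144]

# With a one-month window (the best setting found in the paper),
# the forecast is simply the most recent month's count.
print(moving_average_forecast(monthly_reports, 1))  # 144.0
print(moving_average_forecast(monthly_reports, 3))  # (140 + 150 + 144) / 3
```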
Migration is a worldwide phenomenon that may generate different reactions in the population. Attitudes vary from those that support multiculturalism and communion between locals and foreigners, to contempt and hatred toward immigrants. Since anti-immigration attitudes are often materialized in acts of violence and discrimination, it is important to identify the factors that characterize these attitudes. However, doing so is expensive and impractical, as traditional methods require enormous data collection efforts. In this paper, we propose to leverage Twitter to characterize local attitudes toward immigration, with a case study on Chile, where the immigrant population has drastically increased in recent years. Using semi-supervised topic modeling, we situated 49K users on a spectrum ranging from in favor of to against immigration. We characterized both sides of the spectrum in two aspects: the emotions and lexical categories relevant to each attitude, and the structure of the discussion network. We found that the discussion is mostly driven by Haitian immigration; that there are temporal trends in the tendency and polarity of the discussion; and that assortative behavior on the network differs with respect to attitude. These insights may inform policy makers on how people feel about migration, with potential implications for the communication of policy and the design of interventions to improve inter-group relations.
Recent decades have shown a growing interest, across different fields, in how interactive art transforms the position of the viewer into that of a participant, and in how audiences engage and relate to interactive artwork. This article presents a visual analysis of the content shared on Instagram by the audience of Default, an interactive art installation presented in Santiago, Chile in 2017. The analysis shows that people reacted and engaged differently with various aspects of the installation, as shown by the strategies they used to share it. We argue that analyzing the visual content of Instagram posts opens avenues for understanding the relationship between installation and audience, giving clues about the audience's experience and, therefore, providing feedback for developers, who could use it in the design process of future installations.
Currently, there is a limited understanding of how data privacy concerns vary across the world. The Cambridge Analytica scandal triggered a wide-ranging discussion on social media about user data collection and use practices. We conducted a cross-language study of this online conversation to compare how people speaking different languages react to data privacy breaches. We collected tweets about the scandal written in Spanish and English between April and July 2018. We used the Meaning Extraction Method in both datasets to identify their main topics. They reveal a similar emphasis on Zuckerberg’s hearing in the US Congress and the scandal’s impact on political issues. However, our analysis also shows that while English speakers tend to attribute responsibilities to companies, Spanish speakers are more likely to connect them to people. These findings show the potential of cross-language comparisons of social media data to deepen the understanding of cultural differences in data privacy perspectives.
The best practices described in the Data on the Web Best Practices (DWBP) document encourage and enable the continued expansion of the Web as a medium for the exchange of data. In this context, this paper focuses on two cases of implementing the DWBP. The first concerns data published by the Regional Center for Studies on the Development of the Information Society (Cetic.br) of the Brazilian Network Information Center (NIC.br). The second use case describes the experience of the Judiciary Department of Costa Rica (Justicia Abierta) in applying the DWBP Recommendation to publish their data on the Web.
Recent work suggests that certain places can be more attractive for car theft based on how many people regularly visit them, as well as on other factors. In this sense, we must also consider the city or district where vehicles are stolen. All cities have different cultural and socioeconomic characteristics that influence car theft patterns. In particular, the distribution of public services and places that attract large crowds could play a key role in the occurrence of car theft. Santiago, a city that displays drastic socioeconomic differences among its districts, presents increasingly high car theft rates. This represents a serious issue for the city, as for any other major city, yet, at least for Santiago, it has not been analyzed in depth using quantitative approaches. In this work, we present a preliminary study of how places that create social interest, such as restaurants, bars, schools, and shopping malls, increase car theft frequency in Santiago. We also study whether some types of places are more attractive than others for this type of crime. To evaluate this, we propose to analyze car theft points (CTP) from insurance companies and their relationship with places of social interest (PSI) extracted from Google Maps, using a proximity-based approach. Our findings show a high correlation between CTP and PSI for all of the social interest categories we studied across the districts of Santiago. In particular, our work contributes to the understanding of the social factors associated with car theft.
The Entity Linking (EL) task identifies entity mentions in a text corpus and associates them with a corresponding unambiguous entry in a Knowledge Base. The evaluation of EL systems relies on the comparison of their results against gold standards. A common format used to represent gold standard datasets is the NLP Interchange Format (NIF), which uses RDF as a data model. However, creating gold standard datasets for EL is a time-consuming and error-prone process. In this paper we propose a tool called NIFify to help manually generate, curate, visualize and validate EL annotations; the resulting tool is useful, for example, in the creation of gold standard datasets. NIFify also serves as a benchmark tool that enables the assessment of EL results. Using the validation features of NIFify, we further explore the quality of popular EL gold standards.
Decentralized web applications do not offer fine-grained access controls over users' data, which potentially creates openings for data breaches. For software companies that need to comply with Brazil's General Data Protection Law (LGPD), data breaches not only might harm application users but could also expose the companies to hefty fines. In this context, engineering fine-grained authorization controls that comply with the LGPD into decentralized web applications requires creating audit trails, possibly in the source code. Although the literature offers some solutions, they are scattered. We present Esfinge Guardian, an authorization framework that completely separates authorization from other concerns, which increases compliance with the LGPD. We conclude the work with a brief discussion.
Decentralised data solutions bring their own sets of capabilities, requirements and issues not necessarily present in centralised solutions. In order to compare the properties of different approaches or tools for the management of decentralised data, it is important to have a common evaluation framework. We present a set of dimensions relevant to data management in decentralised contexts and use them to define principles extending the FAIR framework, initially developed for open research data. By characterising a range of different data solutions and approaches by how TRusted, Autonomous, Distributed and dEcentralised they are, in addition to how Findable, Accessible, Interoperable and Reusable, we show that our FAIR TRADE framework is useful for describing and evaluating the management of decentralised data solutions, and we aim to contribute to the development of best practice in a developing field.
This paper is a progress report on our recent work on two applications that use Linked Data and Distributed Ledger technologies and aim to transform the Greek public sector into a decentralized, trusted, intelligent and linked organization. The first application is a re-engineering of Diavgeia, the Greek government portal for open and transparent public administration. The second application is Nomothesia, a new portal that we have built, which makes Greek legislation available on the Web as linked data to enable its effective use by citizens, legal professionals and software developers who would like to build new applications that utilize Greek legislation. The presented applications have been implemented without funding from any source and are available for free to any part of the Greek public sector that may want to use them. An important goal of this paper is to present the lessons learned from this effort.
The Linked Open Data (LOD) cloud has been around since 2007. Throughout the years, this prominent depiction has served as the epitome of Linked Data and acted as a starting point for many. In this article we perform a number of experiments on the dataset metadata provided by the LOD cloud, in order to better understand whether the currently visualised datasets are accessible and openly licensed. Furthermore, we perform a quality assessment using 17 metrics over the accessible datasets that are part of the LOD cloud. These experiments were compared with previous experiments performed on older versions of the LOD cloud. The results show that there has been no improvement on previously identified problems. Based on our findings, we therefore propose a strategy and architecture for a potential collaborative and sustainable LOD cloud.
There is no guarantee of credibility for the information provided on the Web; in most cases, information cannot be checked for accuracy. Semantic Web technologies aim to give structure and meaning to information published on the Web and to provide a machine-readable format for interlinked data. However, Semantic Web standards do not offer a way to represent and attach uncertainty to such data that allows reasoning over it. Moreover, uncertainty is context-dependent and may be represented by multiple theories which apply different calculi. In this paper, we present a new vocabulary and a framework for handling generic uncertainty representation and reasoning. The meta-Uncertainty vocabulary offers a way to represent uncertainty theories and to annotate Linked Data with uncertainty information. We provide tools to represent the uncertainty calculi linked to these theories using the LDScript function scripting language. Moreover, we describe the semantics of contexts in uncertainty reasoning with meta-uncertainty. We describe the mapping between RDF triples and their uncertainty information, and we demonstrate the effect on the query writing process in Corese. We discuss the translatability of uncertainty theories and, finally, the negotiation of an answer annotated with uncertainty information.
To help in making sense of the ever-increasing number of data sources available on the Web, in this article we tackle the problem of enabling automatic discovery and querying of data sources at Web scale. To pursue this goal, we suggest to (1) provision rich descriptions of data sources and query services thereof, (2) leverage the power of Web search engines to discover data sources, and (3) rely on simple, well-adopted standards that come with extensive tooling. We apply these principles to the concrete case of SPARQL micro-services that aim at querying Web APIs using SPARQL. The proposed solution leverages SPARQL Service Description, SHACL, DCAT, VoID, Schema.org and Hydra to express a rich functional description that allows a software agent to decide whether a micro-service can help in carrying out a certain task. This description can be dynamically transformed into a Web page embedding rich markup data. This Web page is both a human-friendly documentation and a machine-readable description that makes it possible for humans and machines alike to discover and invoke SPARQL micro-services at Web scale, as if they were just another data source. We report on a prototype implementation that is available on-line for test purposes, and that can be effectively discovered using Google’s Dataset Search engine.
In this keynote, I will present some results we have obtained using location data from mobile phones interacting with information (news) websites, mobile apps like Pokemon GO and Twitter, and large physical spaces like shopping malls. I will draw some conclusions and discuss more generally the properties of mobile phone data for location-based research, closing with some remarks about privacy and data security.
The amount of information available in social media and specialized blogs has become useful for users planning a trip. However, users are quickly overwhelmed by the list of possibilities offered to them, making their search complex and time-consuming. Recommender systems aim to provide personalized suggestions to users by leveraging different types of information, thus assisting them in their decision-making process. Recently, the use of neural networks and knowledge graphs has proven to be efficient for item recommendation. In our work, we propose an approach that leverages contextual, collaborative and content information in order to recommend personalized destinations to travelers. We compare our approach with a set of state-of-the-art collaborative filtering methods and deep learning based recommender systems.
Linked Data (LD) is a technology for publishing structured data on the web so that it may be interlinked. Building Information Modelling (BIM) is a key enabler for integrating building data across the building life cycle (BLC). LD can therefore provide better access to, and more semantically useful querying of, BIM data. The integration of BIM into the geospatial domain provides much-needed contextual information about a building and its surroundings, and can support geospatial querying over BIM data. Creating GeoSPARQL queries can be a challenge for users who are not experts in semantic web technologies. In this paper we present a visualization tool, built upon HTML5 and WebGL technologies, that supports queries over linked data without the need to understand the resulting SPARQL queries. The interactive web interface can be quickly extended to support new use cases, for example, related to 3D geometries. The paper discusses the underlying data management, the methodology for uplifting several open data sources into the Resource Description Framework (RDF), and the front-end implementation tested on a sample use case. Finally, some discussion and future work are given, with a focus on how this tool can potentially support BIM integration.
This paper studies for the first time the usage and propagation of hashtags in a new and fundamentally different type of social media that is i) without profiles and ii) location-based, showing only nearby posted content. Our study is based on analyzing the mobile-only Jodel microblogging app, which has an established user base in several European countries and Saudi Arabia. All posts are user-to-user anonymous (i.e., no user handles are displayed) and are only shown in the proximity of the user's location (up to 20 km). Jodel thereby forms local communities and opens the question of how information propagates within and between these communities. We tackle this question by applying established metrics for Twitter hashtags to a ground-truth data set of Jodel posts within Germany that spans three years. We find that the usage of hashtags in Jodel differs from Twitter: despite embracing local communication in its design, Jodel hashtags are mostly used country-wide.
Billions of dollars in financial securities exchange hands every day in independent continuous double auctions. Although the auctions are automated, fast, open 24-7, and have worldwide scope and massive scale, the underlying auction rules have not changed much for over 100 years. Advertisement auctions, on the other hand, have rapidly evolved, incorporating optimization and machine learning directly into their allocation rules. The downside is a less-transparent auction, but the upsides for efficiency and expressiveness are tremendous. The trend toward smarter markets will expand into finance and well beyond, pervading how markets are designed. I will discuss markets that optimize and learn, using prediction markets and advertising markets as key examples.
Much of classical auction theory has been developed from the standpoint of the seller, trying to understand how to optimize auctions to maximize seller revenue for instance. This is still a source of very active current research.
Billions of auctions are now run on the Internet every day between the same sellers and bidders, and this creates a need to better understand auctions from the bidders' perspective.
In this talk we will present some recent results on this question. We show, for instance, that auctions reputed to be truthful are no longer truthful when the seller optimizes the auction format based on bidders' past bids, and we provide explicit, simple-to-implement shading strategies that improve bidders' utility (on and off equilibrium) and are robust to various forms of estimation error and mechanism changes. We will also discuss various equilibrium questions.
We take a mostly functional analytic point of view on these problems. If time permits, we will discuss ongoing work on a machine-learning-based perspective.
Joint work with Thomas Nedelec, Marc Abeille, Clément Calauzènes, Benjamin Heymann and Vianney Perchet while doing research at Criteo.
Motivated by the online advertising market, we consider a seller who repeatedly sells ex ante identical items via the second-price auction. Buyers’ valuations for each item are drawn i.i.d. from a distribution F that is unknown to the seller. We find that if the seller attempts to dynamically update a common reserve price based on the bidding history, this creates an incentive for buyers to shade their bids, which can hurt revenue. When there is more than one buyer, incentive compatibility can be restored by using personalized reserve prices, where the personal reserve price for each buyer is set using the historical bids of other buyers. In addition, we use a lazy allocation rule, so that buyers do not benefit from raising the prices of their competitors. Such a mechanism asymptotically achieves the expected revenue obtained under the static Myerson optimal auction for F. Further, if valuation distributions differ across bidders, the loss relative to the Myerson benchmark is only quadratic in the size of such differences. We extend our results to a contextual setting where the valuations of the buyers depend on observed features of the items.
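As an illustrative sketch (the function name and the simplifications are mine, not the paper's), the lazy allocation rule with personalized reserves works as follows: the highest bid wins regardless of reserves, and only the winner's own reserve is then applied, so no buyer can raise a competitor's price through the reserve mechanism.

```python
def lazy_second_price(bids, reserves):
    """Lazy second-price auction with personalized reserve prices.

    bids, reserves: dicts mapping bidder id -> bid / personal reserve.
    Returns (winner, price), or (None, None) if the item is not sold.
    The allocation is 'lazy': the winner is chosen by bid alone, and the
    reserve is applied only afterwards, so a bidder cannot influence a
    competitor's allocation through that competitor's reserve.
    """
    winner = max(bids, key=bids.get)           # highest bid wins, reserves ignored
    if bids[winner] < reserves[winner]:        # reserve applied only to the winner
        return None, None
    runner_up = max(b for i, b in bids.items() if i != winner)
    return winner, max(reserves[winner], runner_up)
```

For example, with bids {a: 10, b: 7} and personal reserves {a: 5, b: 9}, bidder a wins and pays the runner-up bid of 7, since it exceeds a's reserve.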
We consider the problem of the optimization of bidding strategies in prior-dependent revenue-maximizing auctions, when the seller fixes the reserve prices based on the bid distributions. Our study is done in the setting where one bidder is strategic. Using a variational approach, we study the complexity of the original objective and we introduce a relaxation of the objective functional in order to use gradient descent methods. Our approach is simple, general and can be applied to various value distributions and revenue-maximizing mechanisms. The new strategies we derive yield massive uplifts compared to the traditional truthfully bidding strategy.
With the success of machine learning (ML) techniques, ML has already demonstrated tremendous potential to impact the foundations, algorithms, and models of several data management tasks, such as error detection, data quality assessment, data cleaning, and data integration. In Knowledge Graphs, part of the data preparation and cleaning processes, such as data linking, identity disambiguation, or missing value inference and completion, could be automated by making an ML model “learn” and predict the matches routinely with different degrees of supervision. This talk will survey the recent trends of applying machine learning solutions to improve and facilitate Knowledge Graph curation and enrichment, one of the most critical tasks impacting Web search and query answering. Finally, the talk will discuss the next research challenges in the convergence of machine learning and the management of Knowledge Graph evolution and preservation.
This report describes a way to represent and operate on an RDF dataset such that it behaves as an instance of a conflict-free replicated data type (CRDT). In this industry presentation, we describe how we accomplish this for the Dydra RDF graph storage service in a manner compatible with the SPARQL Graph Store HTTP Protocol (GSP). The standard GSP concerns the current store state only. Dydra retains previous store states as active addressable aspects analogous to named graphs in a quad store. It incorporates and addresses arbitrary revisions of target datasets according to ETag and Content-Disposition specifications in HTTP headers. Appropriate interpretation of these arguments makes it possible to replicate datasets among cooperating participants.
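The presentation does not spell out Dydra's internal datatype, but the flavor of a conflict-free replicated set of RDF statements can be sketched with a simple state-based 2P-Set, whose merge is a set union and therefore commutative, associative, and idempotent (the class below is an illustration of the general CRDT idea, not Dydra's implementation):

```python
class TwoPhaseTripleSet:
    """State-based 2P-Set over RDF triples: additions and removals are
    tracked in separate grow-only sets, and replicas merge by set union,
    which makes merging commutative, associative, and idempotent."""

    def __init__(self):
        self.added = set()
        self.removed = set()   # 'tombstones': a removed triple cannot return

    def insert(self, triple):
        self.added.add(triple)

    def delete(self, triple):
        self.removed.add(triple)

    def triples(self):
        return self.added - self.removed   # currently visible store state

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed
```

Two replicas that apply concurrent, conflicting edits and then exchange states converge to the same set of triples regardless of merge order.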
The Semantic Web is about collaboration and exchange of information. While the data on the Semantic Web is constantly evolving and meant to be collaboratively edited, there is no practical transactional concept or method to control concurrent writes to a dataset and avoid conflicts. Thus, we pursue the question: how can we ensure a controlled state of a SPARQL store when performing non-transactional write operations? Based on the Distributed Version Control System for RDF data implemented in the Quit Store, we present the Quit Editor Interface Concurrency Control (QEICC). QEICC provides a protocol on top of the SPARQL 1.1 standard to identify, avoid, and resolve conflicts. The strategies reject, branch, and merge are presented to allow different levels of control over the conflict resolution. While the reject strategy gives full control to the client, with branch and merge it is even possible to postpone the conflict resolution and integrate it into the data engineering process.
Apart from documents, datasets are gaining more attention on the World Wide Web. An increasing number of the datasets on the Web are available as Linked Data, also called the Linked Open Data Cloud or Giant Global Graph. Collaboration of people and machines is a major aspect of the World Wide Web and of the Semantic Web as well. Currently, access to RDF data on the Semantic Web is possible by applying the Linked Data principles and the SPARQL specification, which enables clients to access and retrieve data stored and published via SPARQL endpoints. RDF resources in the Semantic Web are interconnected and often correspond to previously created vocabularies and patterns. This way of reusing existing knowledge facilitates the modeling and representation of information and can considerably reduce the development costs of a knowledge base. As a result of the collaborative reuse process, structural and content interferences as well as varying models and contradictory statements are inevitable.
As the Web of Data grows, so does the need to establish the quality and trustworthiness of its contents. Increasing numbers of libraries are publishing their metadata as Linked Data (LD). As these institutions are considered authoritative sources of information, it is likely that library LD will be treated with increased credibility over data published by other sources. However, in order to establish this trust, the provenance of library LD must be provided.
In 2018 we conducted a survey which explored the position of Information Professionals (IPs), such as librarians, archivists and cataloguers, with regard to LD. Results indicated that IPs find the process of LD interlinking particularly challenging. In order to publish authoritative interlinks, provenance data for the description and justification of the links is required. As such, the goal of this research is to provide a provenance model for the LD interlinking process that meets the requirements of library metadata standards. Many current LD technologies are not accessible to non-technical experts or attuned to the needs of the library domain. By designing a model specifically for libraries, with input from IPs, we aim to facilitate this domain in the process of creating interlink provenance data.
Some facts in the Web of Data are only valid within a certain time interval. However, most of the knowledge bases available on the Web of Data do not provide temporal information explicitly. Hence, the relationship between facts and time intervals is often lost. A few solutions have been proposed in this field; most of them concentrate on extracting facts together with time intervals rather than on mapping existing facts to time intervals. This paper studies the problem of determining the temporal scopes of facts, that is, deciding the time intervals in which a fact is valid. We propose a generic approach which addresses this problem by curating temporal information of facts in knowledge bases. Our proposed framework, Temporal Information Scoping (TISCO), exploits evidence collected from the Web of Data and the Web. The evidence is combined within a three-step approach which comprises matching, selection and merging. This is the first work employing matching methods that consider either a single fact or a group of facts at a time. We evaluate our approach using a corpus of facts as input and different parameter settings for the underlying algorithms. Our results suggest that we can detect temporal information for facts from DBpedia with an f-measure of up to 80%.
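As a toy illustration of the selection and merging stages (the threshold and the merge rule here are assumptions of mine, not TISCO's actual algorithm), candidate time intervals gathered during matching can be filtered by evidence weight and then fused when they overlap:

```python
def scope_fact(candidates, keep_ratio=0.5):
    """Toy select-and-merge pipeline for temporal scoping of one fact.

    candidates: list of ((start, end), weight) interval evidence, e.g.
    collected from the Web.  Selection keeps intervals whose evidence
    clears a fraction of the best-supported one; merging fuses
    overlapping intervals into a single temporal scope.
    """
    if not candidates:
        return []
    top = max(w for _, w in candidates)
    kept = sorted(iv for iv, w in candidates if w >= keep_ratio * top)
    merged = [list(kept[0])]
    for start, end in kept[1:]:
        if start <= merged[-1][1]:                      # overlap: extend scope
            merged[-1][1] = max(merged[-1][1], end)
        else:                                           # gap: new scope
            merged.append([start, end])
    return [tuple(iv) for iv in merged]
```

For instance, two well-supported overlapping candidates (2002-2006) and (2005-2009) fuse into a single scope (2002-2009), while a weakly supported outlier is discarded during selection.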
Knowledge graphs are dynamic in nature; new facts about an entity are added or removed over time. Therefore, multiple versions of the same knowledge graph exist, each of which represents a snapshot of the knowledge graph at some point in time. Entities within the knowledge graph undergo evolution as new facts are added or removed. Automatically generating a summary out of different versions of a knowledge graph is a long-studied problem. However, most of the existing approaches are limited to a pairwise version comparison, which makes it difficult to capture the complete evolution across several versions of the same knowledge graph. To overcome this limitation, we envision an approach to create a summary graph capturing the temporal evolution of entities across different versions of a knowledge graph. The entity summary graphs may then be used for documentation generation, profiling or visualization purposes. First, we take different temporal versions of a knowledge graph and convert them into RDF molecules. Secondly, we perform Formal Concept Analysis on these molecules to generate summary information. Finally, we apply a summary fusion policy in order to generate a compact summary graph which captures the evolution of entities.
For better traffic flow and better policy decisions, the city of Antwerp is connecting traffic lights to the Internet. The live “time to green” tells only part of the story: the historical values also need to be preserved and made accessible to everyone. We propose (i) an ontology for describing the topology of an intersection and the signal timing of traffic lights, (ii) a specification to publish these historical and live data with Linked Data Fragments and (iii) a method to preserve the published data in the long term. We showcase the applicability of our specification with the opentrafficlights.org project, where an end-user can see the live count-down as well as a line chart showing the historic “time to green” of a traffic light. We found that publishing traffic light data as time-sorted Linked Data Fragments allows synchronizing and reusing an archive to retrieve historical observations. Long-term preservation with tape storage becomes feasible when archives shift from byte preservation to knowledge preservation by combining Linked Data Fragments.
This is a print-version of a paper first written for the Web. The Web-version is available at https://brechtvdv.github.io/Article-Open-Traffic-Lights
Credibility research in the social sciences has a long history that may be particularly informative to today's efforts to combat misinformation and disinformation online. This keynote address will discuss the ways in which credibility has been studied in the disciplines of communication and psychology, including both how this notion has been conceptually and operationally defined. Key research findings will be presented with the aim of understanding what kinds of computational algorithms, tools, systems, and applications for tackling misinformation and disinformation are more versus less likely to be effective. This will help to answer a very important yet often overlooked question, which is: To what extent is misinformation a problem of information or a problem of human information processing? The answer to this question is crucial for both software developers and designers of educational intervention efforts to minimize the negative societal impacts of fabricated news and information flowing over the Internet and on social media.
Fake news and misinformation have been increasingly used to manipulate popular opinion and influence political processes. To better understand fake news, how it propagates, and how to counter its effects, it is necessary to first identify it. Recently, approaches have been proposed to automatically classify articles as fake based on their content. An important challenge for these approaches comes from the dynamic nature of news: as new political events are covered, topics and discourse constantly change and thus, a classifier trained using content from articles published at a given time is likely to become ineffective in the future. To address this challenge, we propose a topic-agnostic (TAG) classification strategy that uses linguistic and web-markup features to identify fake news pages. We report experimental results using multiple data sets which show that our approach attains high accuracy in the identification of fake news, even as topics evolve over time.
The spread of ‘fake’ health news is a big problem with even bigger consequences. In this study, we examine a collection of health-related news articles published by reliable and unreliable media outlets. Our analysis shows that there are structural, topical, and semantic patterns which are different in contents from reliable and unreliable media outlets. Using machine learning, we leverage these patterns and build classification models to identify the source (reliable or unreliable) of a health-related news article. Our model can predict the source of an article with an F-measure of 96%. We argue that the findings from this study will be useful for combating the health disinformation problem.
A frequent journalistic fact-checking scenario is concerned with the analysis of statements made by individuals, whether in public or in private contexts, and the propagation of information and hearsay (“who said/knew what when”). Inspired by our collaboration with fact-checking journalists from Le Monde, France’s leading newspaper, we describe here a Linked Data (RDF) model, endowed with formal foundations and semantics, for describing facts, statements, and beliefs. Our model combines temporal and belief dimensions to trace propagation of knowledge between agents along time, and can answer a large variety of interesting questions through RDF query evaluation. A preliminary feasibility study of our model incarnated in a corpus of tweets demonstrates its practical interest.
Automatic fact-checking systems detect misinformation, such as fake news, by (i) selecting check-worthy sentences for fact-checking, (ii) gathering information related to the sentences, and (iii) inferring the factuality of the sentences. Most prior research on (i) uses hand-crafted features to select check-worthy sentences, and does not explicitly account for the recent finding that the top weighted terms in check-worthy and non-check-worthy sentences actually overlap. Motivated by this, we present a neural check-worthiness sentence ranking model that represents each word in a sentence by both its embedding (aiming to capture its semantics) and its syntactic dependencies (aiming to capture its role in modifying the semantics of other terms in the sentence). Our model is an end-to-end trainable neural network for check-worthiness ranking, which is trained on large amounts of unlabelled data through weak supervision. Thorough experimental evaluation against state-of-the-art baselines, with and without weak supervision, shows our model to be superior at all times (+13% in MAP and +28% at various Precision cut-offs over the best baseline, with statistical significance). Empirical analysis of the use of weak supervision, word embedding pretraining on domain-specific data, and the use of syntactic dependencies in our model reveals that check-worthy sentences contain notably more identical syntactic dependencies than non-check-worthy sentences.
This study explores an online fact-checking community called politicalfactchecking on reddit.com that relies on crowdsourcing to find and verify check-worthy facts relating to U.S. politics. The community embodies a network journalism model in which the process of finding and verifying check-worthy facts through crowdsourcing is coordinated by a team of moderators. Applying the concepts of connective journalism, this study analyzed the posts (N = 543) and comments (N = 10,221) on the community’s Reddit page to understand differences in the roles of the community members and the moderators. A mixed-method approach was used to analyze the data. The authors also developed an automated argument classification model to analyze the contents and identify ways to automate parts of the process. The findings suggest that a model consisting of crowds, professionals, and computer-assisted analysis could increase efficiency and decrease costs in news organizations that perform fact-checking.
Recent research brought awareness of the issue of bots on social media and the significant risks of mass manipulation of public opinion in the context of political discussion. In this work, we leverage Twitter to study the discourse during the 2018 US midterm elections and analyze social bot activity and interactions with humans. We collected 2.6 million tweets for 42 days around the election day from nearly 1 million users. We use the collected tweets to answer three research questions: (i) Do social bots lean and behave according to a political ideology? (ii) Can we observe different strategies among liberal and conservative bots? (iii) How effective are bot strategies in engaging humans?
We show that social bots can be accurately classified according to their political leaning and behave accordingly. Conservative bots share most of the topics of discussion with their human counterparts, while liberal bots show less overlap and a more inflammatory attitude. We studied bot interactions with humans and observed different strategies. Finally, we measured bot embeddedness in the social network and the extent of human engagement with each group of bots. Results show that conservative bots are more deeply embedded in the social network and more effective than liberal bots at exerting influence on humans.
There are rising concerns over the spread of misinformation in WhatsApp groups and its potential impact on political polarization, the hindrance of public debate, and the fostering of acts of political violence. As social media use becomes increasingly widespread, it becomes imperative to study how these platforms can be used as a tool to spread propaganda and manipulate audience groups ahead of important political events. In this paper, we present a grounded typology to classify links to news sources obtained from public WhatsApp groups into different categories, including ‘junk’ news sources that deliberately publish or aggregate misleading, deceptive or incorrect information packaged as real news about politics, economics or culture. Further, we examine a sample of 200 videos and images extracted from a sample of WhatsApp groups and develop a new typology to classify this media content. For our analysis, we have used data from 130 public WhatsApp groups in the period leading up to the two rounds of the 2018 Brazilian presidential elections.
How should an organized response to disinformation proceed in a 21st century democratic society? At the highest level, what strategies are available? This paper attempts to answer these questions by looking at what three contemporary counter-disinformation organizations are actually doing, then analyzing their tactics. The EU East StratCom Task Force is a contemporary government counter-propaganda agency. Facebook has made numerous changes to its operations to try to combat disinformation, and is a good example of what platforms can do. The Chinese information regime is a marvel of networked information control, and provokes questions about what a democracy should and should not do. The tactics used by these organizations can be grouped into six high level strategies: refutation, exposure of inauthenticity, alternative narratives, algorithmic filter manipulation, speech laws, and censorship. I discuss the effectiveness and political legitimacy of these approaches when used within a democracy with an open Internet and a free press.
State actors, private influence operators and grassroots groups are all exploiting the openness and reach of the Internet to manipulate populations at a distance, extending their decades-long struggle for “hearts and minds” via propaganda, influence operations and information warfare. Computational propaganda fueled by AI makes matters worse.
The structure and propagation patterns of these attacks have many similarities to those seen in information security and computer hacking. The Credibility Coalition's MisinfosecWG working group is analyzing those similarities, including information security frameworks that could give the truth-based community better ways to describe, identify and counter misinformation-based attacks. Specifically, we place misinformation components into a framework commonly used to describe information security incidents. We anticipate that our work will give responders the ability to transfer other information security principles to the misinformation sphere, and to plan defenses and countermoves.
Online Social Networks (OSNs) represent a fertile field to collect real user data and to explore OSN user behavior. Recently, two topics have been drawing the attention of researchers: the evolution of online social roles and the question of participation inequality. In this work, we bring these two fields together to study and characterize the behavioral evolution of OSN users according to the quantity and the typology of their social interactions. We found that online participation on the microblogging platform can be categorized into four different activity levels. Furthermore, we empirically verified that the 90-9-1 rule of thumb about participation inequality is not an accurate representation of reality. Findings from our analysis reveal that lurkers are fewer than expected: they are not 9 out of 10 as suggested by Nielsen, but 3 out of 4. This significant result can give new insights into how users relate to social media and how its use is evolving towards a more active interaction with the new generation of consumers.
Understanding the mechanisms driving link formation in dynamic social networks is a long-standing problem with implications for understanding social structure as well as for link prediction and recommendation. Social networks exhibit a high degree of transitivity, which explains the successes of common-neighbor-based methods for link prediction. In this paper, we examine mechanisms behind link formation from the perspective of an ego node. We introduce the notion of personalized degree for each neighbor node of the ego, which is the number of other neighbors a particular neighbor is connected to. From empirical analyses on four online social network datasets, we find that neighbors with higher personalized degree are more likely to lead to new link formations when they serve as common neighbors with other nodes, both in undirected and directed settings. This is complementary to the finding of Adamic and Adar that neighbor nodes with higher (global) degree are less likely to lead to new link formations. Furthermore, on directed networks, we find that personalized out-degree has a stronger effect on link formation than personalized in-degree, whereas global in-degree has a stronger effect than global out-degree. We validate our empirical findings through several link recommendation experiments and observe that incorporating both personalized and global degree into link recommendation greatly improves accuracy.
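The personalized degree defined above can be computed directly from adjacency sets; a minimal sketch for the undirected case (the representation as a dict of neighbor sets is an assumption of this illustration):

```python
def personalized_degrees(adj, ego):
    """Personalized degree of each neighbor v of `ego`: the number of
    *other* neighbors of `ego` that v is connected to.

    adj: dict mapping node -> set of neighbors (undirected graph)."""
    nbrs = adj[ego]
    return {v: len(adj[v] & (nbrs - {v})) for v in nbrs}
```

For an ego with neighbors a, b, c where only a and b know each other, a and b each have personalized degree 1 and c has 0, even though all three have the same global degree relative to the ego.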
The detection of anomalies and exceptional patterns in social interaction networks is a prominent research direction in data mining and network science. For anomaly detection, typically two questions need to be addressed and defined: (1) What is an anomaly? (2) How do we identify an anomaly? This paper discusses model-based approaches and methods for addressing and formalizing these issues in the context of feature-rich social interaction networks. It provides a categorization of model-based approaches and provides perspectives and first promising directions for its implementation.
The ability to track and monitor relevant and important news in real-time is of crucial interest in multiple industrial sectors. In this work, we focus on cryptocurrency news, which recently became of emerging interest to the general and financial audience. In order to track popular news in real-time, we (i) match news from the web with tweets from social media, (ii) track their intraday tweet activity and (iii) explore different machine learning models for predicting the number of article mentions on Twitter after its publication. We compare several machine learning models, such as linear extrapolation, linear and random forest autoregressive models, and a sequence-to-sequence neural network.
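As a minimal example of the simplest baselines mentioned, an autoregressive fit of order one to an intraday tweet-count series can be written in a few lines (this least-squares AR(1) is an illustration, not the authors' exact configuration):

```python
def fit_ar1(series):
    """Least-squares AR(1) fit y[t] = a * y[t-1] + b on a count series."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

def predict_next(series):
    """One-step-ahead prediction of the next mention count."""
    a, b = fit_ar1(series)
    return a * series[-1] + b
```

On a series that doubles every step, the fit recovers a = 2, b = 0 and predicts the next doubling.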
User-generated video systems like YouTube and Twitch.tv have been a major Internet phenomenon. They have attracted a vast user base with the many and varied contents provided by their users, and a series of social features tailored for online viewing. Hoping to build a livelier community and encourage content creators to share more, many such systems have recently introduced crowdsourcing mechanisms wherein creators get tangible rewards through user donations. User donation is a very special form of user relationship: it influences user engagement in the community and has a great impact on the success of these systems. However, user donations and donation relationships remain trade secrets for most enterprises and to date are still unexplored. It is not clear at what scale donations occur or how users donate in these systems. In this work, we attempt to fill this gap. We obtain and provide a publicly available dataset on user donations in BiliBili, a popular user-generated video system in China with 76.4 million average monthly active users. Based on detailed information on over 5 million videos, over 700 thousand content creators, and over 1.5 million user donations, we quantitatively reveal the characteristics of user donations, examine their correlations with the upload behavior and content popularity of the creators, and adopt machine-learned classifiers to accurately predict the creators who will receive donations and the users who will donate in the future.
Content polluters increasingly post malicious information in Online Social Networks (OSNs), a growing problem that poses a serious threat to privacy, account security, user experience, etc. They continuously simulate the behaviors of legitimate accounts in various ways and evade the detection systems deployed against them. In this paper, we focus on one kind of content polluter, namely the collective content polluter (hereinafter referred to as CCP). Existing works either focus on individual polluters or require long periods of data records for detection, making their detection methods less robust and lagging behind. It is thus necessary to analyze the characteristics of collective content polluters and study methods for early detection. This paper proposes a CCP early detection method called CrowdGuard. It analyzes the crowd behaviors of collective content polluters and legitimate accounts, extracts distinctive features, and leverages the Gaussian Mixture Model (GMM) method to cluster the two groups of accounts (legitimate users and polluters) to achieve early detection. Using a public dataset including thousands of collective content polluters on Twitter around a political election, we design an experimental scenario simulating early detection and evaluate the performance of CrowdGuard. The results show that CrowdGuard outperforms existing methods and is adequate for early detection.
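CrowdGuard's feature set is richer than this, but the clustering core can be illustrated in one dimension: fit a two-component Gaussian mixture by EM over a single behavioral feature and split accounts by component responsibility. The initialization and the single-feature simplification below are mine:

```python
import math

def fit_gmm_1d(xs, iters=100):
    """EM for a two-component 1-D Gaussian mixture (a toy stand-in for
    the GMM clustering step: one component per group of accounts)."""
    mu = [min(xs), max(xs)]            # crude initialization at the extremes
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            d = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in (0, 1)]
            s = d[0] + d[1]
            resp.append((d[0] / s, d[1] / s))
        # M-step: re-estimate weights, means, variances
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
            pi[k] = nk / len(xs)
    return mu, var, pi
```

With two well-separated behavioral clusters, the fitted means land near the two group centers, and each account can then be assigned to the component with the higher responsibility.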
Emotion Classification (EC) aims at assigning an emotion label to a textual document with two inputs – a set of emotion labels (e.g. anger, joy, sadness) and a document collection. The best performing approaches for EC are dictionary-based and suffer from two main limitations: (i) the out-of-vocabulary (OOV) keywords problem and (ii) they cannot be used across heterogeneous domains. In this work, we propose a way to overcome these limitations with a supervised approach based on TF-IDF indexing and Multinomial Linear Regression with Elastic-Net regularization to extract an emotion lexicon and classify short documents from diversified domains. We compare the proposed approach to state-of-the-art methods for document representation and classification by running an extensive experimental study on two shared and heterogeneous data sets.
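The TF-IDF indexing step (without the elastic-net classifier on top) can be sketched as follows; the toy emotion corpus is illustrative:

```python
import math
from collections import Counter

def tfidf_index(docs):
    """TF-IDF vectors for a corpus of tokenized documents (the indexing
    step only; the supervised classifier on top is omitted here)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))         # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors
```

A term that occurs in every document (here "day") gets weight 0, while an emotion-bearing term confined to one document keeps a positive weight, which is exactly what lets the downstream model pick out lexicon terms.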
Information diffusion mechanisms based on social influence models are mainly studied using the likelihood of adoption when active neighbors expose a user to a message. The problem arises primarily from the fact that, for the most part, this explicit information of who-exposed-whom among a group of active neighbors in a social network, before a susceptible node is infected, is not available. In this paper, we attempt to understand the diffusion process through information cascades by studying the temporal network structure of the cascades. In doing so, we accommodate the effect of exposures from active neighbors of a node through a network pruning technique that leverages network motifs to identify potential infectors responsible for exposures from among those active neighbors. We evaluate the effectiveness of the components used in modeling cascade dynamics and especially whether the additional exposure information is useful. Following this model, we develop an inference algorithm, namely InferCut, that uses parameters learned from the model and the exposure information to predict the actual parent node of each potentially susceptible user in a given cascade. Empirical evaluation on a real-world dataset from the Weibo social network demonstrates the significance of incorporating exposure information in recovering the exact parents of the exposed users at the early stages of the diffusion process.
Influence diffusion has been widely studied in social networks for applications such as service promotion and marketing. There are two challenging issues here: (1) how we measure people’s influence on others; (2) how we predict who would be influenced by a particular person and when people would be influenced. Existing works have not captured the temporal and structural characteristics of influence diffusion in Twitter. In this paper, we first develop a model to learn influence probabilities between users in Twitter from their action history; second, we introduce diffusion models that are used to predict how information is propagated in Twitter. Experiment results show that our proposed models outperform existing models in terms of balanced precision and recall.
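One simple way to learn influence probabilities from an action history, in the spirit of the first step above, is a frequency estimate: p(v, u) is the fraction of v's actions that u repeated afterwards. This sketch is a simplification of my own, not necessarily the authors' estimator, and it credits every earlier actor rather than only network neighbors:

```python
def influence_probabilities(actions):
    """Frequency-based influence probabilities from an action log.

    actions: dict mapping action id -> list of (user, timestamp), one
    entry per user who performed the action.  p(v, u) is estimated as the
    number of v's actions that u performed later, divided by the total
    number of actions v performed.
    """
    performed = {}    # user -> number of actions performed
    copied = {}       # (v, u) -> number of v's actions that u repeated later
    for log in actions.values():
        log = sorted(log, key=lambda e: e[1])      # order by timestamp
        for i, (v, _) in enumerate(log):
            performed[v] = performed.get(v, 0) + 1
            for u, _ in log[i + 1:]:
                copied[(v, u)] = copied.get((v, u), 0) + 1
    return {(v, u): c / performed[v] for (v, u), c in copied.items()}
```

If v performed two actions and u repeated one of them afterwards, the estimated probability that v influences u is 0.5.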
Most recommendation algorithms produce results without humans-in-the-loop. Combining algorithms with expert human curation can make recommendations much more effective, especially in hard-to-quantify domains like fashion. But it also makes things more complicated, introducing new sources of statistical bias and challenges for traditional approaches to training and evaluating algorithms. Humans and machines can also disagree, further complicating the design of production systems. In this talk I'll share lessons from combining algorithms and human judgement for personal styling recommendations at Stitch Fix, an online personal styling service that commits to its recommendations through the physical delivery of merchandise to clients.
To understand why and how subjectivity and disagreement in label collection matter or don't matter, I examine the history of systems evaluations and measurement performed by humans and trace the roots of human computation/crowdsourcing and the context in which it arose. Before we can begin fruitful discussions about subjectivity and disagreement, we need to ask ourselves what/who it is that human raters are supposed to represent. I offer multiple different perspectives and scenarios that showcase just how varied and ill-defined the role of a human rater can be. I will conclude with some practical recommendations with respect to the questions researchers and practitioners ought to ask themselves before employing human raters, and some challenges with both the methodology of such data collection and the subsequent analysis of such data.
Discussing things you care about can be difficult, especially via online platforms, where sharing your opinion leaves you open to the real and immediate threats of abuse and harassment. Due to these threats, people stop expressing themselves and give up on seeking different opinions. Recent research efforts focus on examining the strengths and weaknesses (e.g., potential unintended biases) of using machine learning as a support tool to facilitate a safe space for online discussions; for example, through detecting various types of negative online behaviors such as hate speech, online harassment, or cyberbullying. Typically, these efforts build upon sentiment analysis or spam detection in text. However, the toxicity of the language can be a strong indicator of the intensity of the negative behavior. In this paper, we study the topic of toxicity in online conversations by addressing the problems of subjectivity, bias, and ambiguity inherent in this task. We start with an analysis of the characteristics of subjective assessment tasks (e.g., relevance judgment, toxicity judgment, sentiment assessment). Whether we perceive something as relevant or as toxic can be influenced by an almost infinite amount of prior or current context, e.g., culture, background, experiences, and education. We survey recent work that tries to understand this phenomenon, and we outline a number of open questions and challenges which shape the research perspectives in this multi-disciplinary field.
Crowdsourcing systems increasingly rely on users to provide subjective ground truth for intelligent systems, e.g., ratings, aspects of quality, and perspectives on how expensive or lively a place feels. We focus on the ubiquitous implementation of online ordinal user voting (e.g., 1–5, 1–4 stars) on some aspect of an entity, to extract a relative truth measured by a selected metric such as vote plurality or mean. We argue that this methodology can aggregate results that yield little information to the end user. In particular, ordinal user rankings often converge to an indistinguishable rating. This is demonstrated by the trend in certain cities for the majority of restaurants to have a 4-star rating. Similarly, the rating of an establishment can be significantly affected by a few users. User bias in voting is not spam, but rather a preference that can be harnessed to provide more information to users. We explore notions of both global skew and user bias. Leveraging these bias and preference concepts, we suggest explicit models for better personalization and more informative ratings.
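One simple way to operationalize global skew and user bias is to model each user's bias as their mean deviation from the global mean and subtract it back out when scoring an item. The function names and data layout below are illustrative assumptions, not the paper's exact models.

```python
def rating_biases(ratings):
    """Decompose ordinal ratings into a global mean and per-user bias.

    ratings: list of (user, item, score) triples.
    Returns (global_mean, {user: bias}), where bias is the user's mean
    deviation from the global mean: positive for generous raters,
    negative for harsh ones.
    """
    scores = [s for _, _, s in ratings]
    global_mean = sum(scores) / len(scores)
    per_user = {}
    for user, _, s in ratings:
        per_user.setdefault(user, []).append(s)
    bias = {u: sum(v) / len(v) - global_mean for u, v in per_user.items()}
    return global_mean, bias

def debiased_item_mean(ratings, item):
    """Item mean after removing each rater's bias, which can be more
    informative than the raw mean when raters skew differently."""
    _, bias = rating_biases(ratings)
    vals = [s - bias[u] for u, i, s in ratings if i == item]
    return sum(vals) / len(vals)
```

If a generous rater gives an item 5 and a harsh rater gives it 3, both debiased scores land at 4, revealing agreement that the raw votes hide.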
Machine learning problems are often subjective or ambiguous. That is, humans solving the same problems might come to legitimate but completely different conclusions, based on their personal experiences and beliefs. In supervised learning, particularly when using crowdsourced training data, multiple annotations per data item are usually reduced to a single label representing ground truth. This hides a rich source of diversity and subjectivity in opinions about the labels. Label distribution learning associates with each data item a probability distribution over the labels for that item, and can thus preserve the diversity that conventional learning hides or ignores. We introduce a strategy for learning label distributions with only five to ten labels per item by aggregating human-annotated labels over multiple, semantically related data items. Our results suggest that specific label aggregation methods can help provide reliable representative semantics at the population level.
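The aggregation idea can be sketched as follows: with only five to ten labels per item, pooling labels across semantically related items yields a smoother per-item label distribution. The names and data format below are assumptions for the sketch, not the paper's exact aggregation method.

```python
from collections import Counter

def label_distribution(labels):
    """Turn a small set of human labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}

def pooled_distribution(item_labels, related_items):
    """Estimate an item's label distribution from the pooled labels of a
    group of semantically related items, so each estimate rests on more
    than the item's own handful of annotations."""
    pooled = []
    for item in related_items:
        pooled.extend(item_labels.get(item, []))
    return label_distribution(pooled)
```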
Developing instructions for microtask crowd workers takes time and care to ensure that workers interpret them consistently. Even with substantial effort, workers may still misinterpret the instructions due to ambiguous language and structure in the task design. Prior work demonstrated methods for facilitating iterative improvement with help from the requester. However, any participation by the requester reduces the time saved by delegating the work—and hence the utility of crowdsourcing. We present TaskMate, a system for facilitating worker-led refinement of task instructions with minimal involvement by the requester. Small teams of workers search for ambiguities and vote on the interpretation they believe the requester intended. This paper describes the workflow, our implementation, and our preliminary evaluation.
Group-based discussion among human graders can be a useful tool to capture sources of disagreement in ambiguous classification tasks and to adjudicate any resolvable disagreements. Existing workflows for panel-based adjudication, however, capture graders’ arguments and rationales in a free-form, unstructured format, limiting the potential for automatic analysis of the discussion contents. We designed and implemented a structured adjudication system that collects graders’ arguments in a machine-readable format without limiting graders’ abilities to provide free-form justifications for their classification decisions. Our system enables graders to cite instructions from a set of labeling guidelines, specified in the form of discrete classification rules and conditions that need to be met in order for each rule to be applicable. In the present work, we outline the process of designing and implementing this adjudication system, and report preliminary findings from deploying our system in the context of medical time series analysis for sleep stage classification.
Crowdsourcing is a great tool for conducting subjective user studies with large numbers of users. Collecting reliable annotations about the quality of speech stimuli is challenging: the task itself is highly subjective, and users in crowdsourcing work without supervision. This work investigates the intra- and inter-listener agreement within a subjective speech quality assessment task. To this end, a study was conducted both in the laboratory and in crowdsourcing, in which listeners were asked to rate speech stimuli with respect to their overall quality. Ratings were collected on a 5-point scale in accordance with ITU-T Rec. P.800 and P.808, respectively. The speech samples were taken from the database ITU-T Rec. P.501 Annex D, and were presented four times to the listeners. Finally, the crowdsourcing results were contrasted with the ratings collected in the laboratory. A strong and significant Spearman’s correlation was achieved when contrasting the ratings collected in the two environments. Our analysis shows that while the inter-rater agreement increased the more often the listeners conducted the assessment task, the intra-rater reliability remained constant. Our study setup helped to overcome the subjectivity of the task, and we found that disagreement can, to some extent, represent a source of information.
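The correlation analysis mentioned above relies on Spearman's rank correlation, which is simply the Pearson correlation computed on ranks. A self-contained version, using average ranks to handle ties, looks like this:

```python
def _ranks(xs):
    """Average ranks (handles ties), 1-based."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)
```

Because it works on ranks, any monotone agreement between laboratory and crowdsourcing ratings scores 1.0 even if the two rating scales are stretched differently.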
Twitter and Facebook continue to be top destinations for information consumption on the Internet. The ever-expanding social graph enables the implementation of traditional features like item recommendation and selection of trending content that rely on human input and other behavioral data. However, given the enormous amount of human sensing in the world at any given moment on any platform, there is a lot of untapped potential that goes beyond simple applications on top of atomic-level content like a post or tweet. In this talk we describe a social knowledge graph that discovers relationships as they occur over time, and how it can be used to capture the evolution of events or stories.
Knowledge graphs enriched with temporal information are becoming more and more common. As an example, the Wikidata KG contains millions of temporal facts associated with validity intervals (i.e., start and end time) covering a variety of domains. While these facts are interesting in themselves, computing temporal relations between their intervals makes it possible to discover temporal relations holding between facts (e.g., “football players that got divorced after moving from one team to another”). In this paper we study the problem of computing different kinds of interval joins in temporal KGs. In principle, interval joins can be computed by resorting to query languages like SPARQL. However, this language is not optimized for such a task, which makes it hard to answer real-world queries. For instance, the query “find players that were married while being members of a team” times out on Wikidata. We present efficient algorithms to compute interval joins for the main Allen’s relations (e.g., before, after, during, meets). We also address the problem of interval coalescing, which is used for merging contiguous or overlapping intervals of temporal facts, and propose an efficient algorithm for it. We integrate our interval join and coalescing algorithms into a light SPARQL extension called iSPARQL. We evaluated the performance of our algorithms on real-world temporal KGs.
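Allen's interval relations have standard definitions, so the predicates below follow those semantics. The nested-loop join, however, is only the quadratic baseline that the paper's optimized algorithms are designed to beat, and the data layout is an assumption of this sketch.

```python
# A selection of Allen's relations between closed intervals (s, e), s <= e.
ALLEN = {
    "before":   lambda a, b: a[1] < b[0],
    "after":    lambda a, b: b[1] < a[0],
    "meets":    lambda a, b: a[1] == b[0],
    "during":   lambda a, b: b[0] < a[0] and a[1] < b[1],
    "overlaps": lambda a, b: a[0] < b[0] < a[1] < b[1],
}

def interval_join(facts_a, facts_b, relation):
    """Join two lists of (subject, (start, end)) temporal facts on an
    Allen relation, e.g. marriages that are 'during' team memberships.
    Naive O(|A|*|B|) scan; real systems use sort/merge-style algorithms."""
    rel = ALLEN[relation]
    return [(sa, sb) for sa, ia in facts_a
                     for sb, ib in facts_b if rel(ia, ib)]
```

For example, a marriage interval (2005, 2010) is `during` a team membership (2000, 2015), matching the "married while being a member of a team" query.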
Knowledge bases (KBs) contain huge amounts of facts about entities, their properties, and relations between them. They are thus a key asset in any intelligent system for tasks such as structured search and question answering. However, due to dynamics in the real world, properties and relations change over time, and stored knowledge may become outdated. While KB information evolves steadily, there is no information on whether a KB property is likely to change or likely to be stable. Systems exploiting KB information, however, could benefit greatly if they had access to this kind of information. In this paper, we analyze and predict the stability of KB entries, which allows us to accompany entries with stability scores. Our predictive model exploits entity-based features and learns from historical data. A particular challenge in determining stability scores is that KB entries are added or modified not only due to real-world changes but also to reduce the incompleteness of KBs in general. Nevertheless, our evaluation on sample properties demonstrates the effectiveness of our method for predicting the one-year stability of KB properties.
Temporal information extracted from texts and normalized to some standard format has been exploited in a variety of tasks such as information retrieval and question answering. Classifying documents into categories using temporal features has not yet been tried. Such a method might be particularly valuable when classifying sensitive texts such as patient records, i.e., whenever the pure content of the documents should not be used for the classification. In this paper, we describe, as a proof-of-concept, our work on classifying news articles exploiting only features defined over extracted and normalized temporal expressions. Our evaluation of two classification models on large German and English news archives shows promising results and demonstrates the discriminative power of temporal features for topically classifying text documents.
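As a proof-of-concept of purely temporal features, a document feature vector can be derived from the years mentioned in its text. Real systems would use properly extracted and normalized temporal expressions rather than this crude regex, and all names here are invented for the sketch.

```python
import re

# Four-digit years 1800-2099; a stand-in for normalized temporal expressions.
YEAR = re.compile(r"\b(1[89]\d\d|20\d\d)\b")

def temporal_features(text, ref_year=2020):
    """Content-free features over the years mentioned in a document:
    how many temporal references, how wide their span, how old on average."""
    years = [int(y) for y in YEAR.findall(text)]
    if not years:
        return {"n_years": 0, "span": 0, "mean_age": 0.0}
    return {
        "n_years": len(years),
        "span": max(years) - min(years),
        "mean_age": ref_year - sum(years) / len(years),
    }
```

Such a vector says nothing about the words in the document, which is the point when classifying sensitive texts: a history article and a sports recap differ sharply in these features without exposing their content.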
As the video game industry grows rapidly, many users issue queries related to video games on a daily basis. While there have been a few attempts to understand their behavior, little is known about how video game-related searches are done. Digesting and analyzing this search behavior is thus an important step for search engines toward providing better results and search services for their users. To overcome this lack of knowledge and to gain more insight into how video game searches are done, we analyze in this paper a number of game search queries submitted to a general search engine named Parsijoo. The analysis was performed on 372,508 game search records extracted from the query logs, within 253,516 different search sessions. Different aspects of video game searches are studied, including their temporal distribution, game version specification, popular game categories, popular game platforms, game search sessions, and clicked pages. Overall, the experimental analysis of video game searches shows that the current retrieval methods used by traditional search engines cannot be applied as-is to game searches; thus, different retrieval and search services should be considered for these searches in the future.
In this paper we present a proof-of-concept of a visual navigation tool for a personalized “sandbox” of Wiki pages. The navigation tool considers multiple groups of algorithmic parameters and adapts to user activity via graphical user interfaces. The output is a 2D map of a subset of the Wikipedia page network, which provides a different and broader visual representation – a map – of the neighborhood (according to some metric) of the pages around the page currently displayed in a browser. The representation scheme makes the algorithmic parameters affecting the landscape visualization transparent to the user, which in turn enables the delivery of a personalized canvas designed by the user. A case study shows the combination of four different sourcing rules (i.e., rules for identifying and extracting the neighboring pages) and three layouts over the same Wikipedia subnetwork. The basic schema is readily adapted to other search experiences and contexts.
Wikipedia serves as a good example of how editors collaborate to form and maintain an article. The relationship between editors, derived from their sequence of editing activity, results in a directed network structure called the revision network, which potentially holds valuable insights into editing activity. In this paper we create revision networks to assess differences between controversial and non-controversial articles, as labelled by Wikipedia. Drawing on complex network analysis, we apply motif analysis, which determines the under- or over-representation of induced sub-structures, in this case triads of editors. We analyse 21,631 Wikipedia articles in this way, and use principal component analysis to consider the relationship between their motif subgraph ratio profiles. Results show that a small number of induced triads play an important role in characterising relationships between editors, with controversial articles having a tendency to cluster. This provides useful insight into editing behaviour and interaction, capturing counter-narratives, without recourse to semantic analysis. It also provides a potentially useful feature for future prediction of controversial Wikipedia articles.
Wikipedia is a rich and invaluable source of information. Its central place on the Web makes it a particularly interesting object of study for scientists. Researchers from different domains have used various complex datasets related to Wikipedia to study language, social behavior, knowledge organization, and network theory. While a scientific treasure trove, the large size of the dataset hinders pre-processing and may be a challenging obstacle for potential new studies. This issue is particularly acute in scientific domains where researchers may not be savvy in large-scale data processing. On the one hand, the size of Wikipedia dumps is large, making the parsing and extraction of relevant information cumbersome. On the other hand, the API is straightforward to use but restricted to a relatively small number of requests. The middle ground is the mesoscopic scale, where researchers need a subset of Wikipedia ranging from thousands to hundreds of thousands of pages, but no efficient solution exists at this scale.
In this work, we propose an efficient data structure for requesting and accessing subnetworks of Wikipedia pages and categories. We provide convenient tools for accessing and filtering viewership statistics, or “pagecounts”, of Wikipedia web pages. The dataset organization leverages principles of graph databases, allowing rapid and intuitive access to subgraphs of Wikipedia articles and categories. The dataset and deployment guidelines are available on the LTS2 website https://lts2.epfl.ch/Datasets/Wikipedia/.
Recently much progress has been made in entity disambiguation and linking systems (EDL). Given a piece of text, EDL links words and phrases to entities in a knowledge base, where each entity defines a specific concept. Although extracted entities are informative, they are often too specific to be used directly by many applications. These applications usually require text content to be represented with a smaller set of predefined concepts or topics, belonging to a topical taxonomy, that matches their exact needs. In this study, we aim to build a system that maps Wikidata entities to such predefined topics. We explore a wide range of methods that map entities to topics, including GloVe similarity, Wikidata predicates, Wikipedia entity definitions, and entity-topic co-occurrences. These methods often predict entity-topic mappings that are reliable, i.e., have high precision, but tend to miss most of the mappings, i.e., have low recall. Therefore, we propose an ensemble system that effectively combines individual methods and yields much better performance, comparable with human annotators.
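One of the high-precision individual mappers described above, embedding similarity, and the ensemble union can be sketched as follows. The pure-Python cosine, the names, and the threshold are assumptions; the actual system combines its methods more carefully than a plain union.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_mapper(entity_vecs, topic_vecs, threshold=0.5):
    """One high-precision mapper: assign an entity to every topic whose
    embedding is close enough to the entity's embedding."""
    return {ent: {t for t, tv in topic_vecs.items()
                  if cosine(ev, tv) >= threshold}
            for ent, ev in entity_vecs.items()}

def ensemble(mappings):
    """Union several (precise but low-recall) mappers to recover recall."""
    out = {}
    for m in mappings:
        for ent, topics in m.items():
            out.setdefault(ent, set()).update(topics)
    return out
```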
Wikipedia is the largest online collaborative encyclopedia, containing information from a plethora of fields, including medicine. It has been shown that Wikipedia is one of the top sites visited by readers looking for information on this topic. The large reliance on Wikipedia for this type of information drives research towards the analysis of the quality of its articles. In this work, we evaluate and compare the quality of medicine-related articles in the English and Portuguese Wikipedia. To do so, we use metrics such as authority, completeness, complexity, informativeness, consistency, currency, and volatility, together with domain-specific measurements. We conclude that the English articles score better than the Portuguese ones across most metrics.
The Thanks feature on Wikipedia, also known as “Thanks”, is a tool with which editors can quickly and easily send one another positive feedback. The aim of this project is to better understand this feature: its scope, the characteristics of a typical “Thanks” interaction, and the effects of receiving a thank on individual editors. We study the motivational impact of “Thanks” because maintaining editor engagement is a central problem for crowdsourced repositories of knowledge such as Wikimedia. Our main findings are that most editors have not been exposed to the Thanks feature (meaning they have never given nor received a thank), that thanks are typically sent upwards (from less experienced to more experienced editors), and that receiving a thank is correlated with having high levels of editor engagement. Though the prevalence of “Thanks” usage varies by editor experience, the impact of receiving a thank seems mostly consistent for all users. We empirically demonstrate that receiving a thank has a strong positive effect on short-term editor activity across the board, and provide preliminary evidence that thanks could compound to have long-term effects as well.
Developing a deeper understanding of the travel domain is helpful for presenting users with consistent and reliable information, yet few data sources make this possible. Further, such information can serve as background knowledge for evaluating machine learning algorithms. In this paper, we present part of our work towards developing such an understanding. We demonstrate a simple extraction technique and show how the extracted data can be used to evaluate an unsupervised embedding model built on search queries with travel intent.
Online advertising platforms in partnership with media companies typically have access to an online user’s history of viewed articles. If a concerned brand (advertiser) plans to run advertisement campaigns on users exposed to negative articles, it is essential to first identify articles with negative sentiment about the brand. For an advertising platform, scalable identification of such articles with little human-annotation effort is necessary for launching campaigns soon after an advertiser signs up. In this context, generic sentiment analysis tools suffer from a lack of contextual world knowledge associated with the advertiser, while human annotation of articles for supervised approaches is laborious and painstaking. To address these problems, we propose the use of publicly available Wikipedia footnote references for an advertiser, and propagate their sentiment to articles related to the advertiser. In particular, our proposed approach has three components: (i) automatically find Wikipedia references which have negative sentiment about an advertiser, (ii) learn distributed representations (doc2vec) of the article texts referred to in footnotes and of other unlabeled articles, and (iii) infer sentiment in unlabeled articles using label propagation (from the references) in the doc2vec space. Our experiments spanning three real brands, with data from a major advertising platform (Yahoo Gemini), show significant lifts in sentiment inference compared to existing baselines. In addition, we share valuable insights on how article sentiment influences the online activities of a user with respect to a brand.
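Label propagation in an embedding space can be sketched with a clamped iterative scheme: labeled reference articles keep their sentiment, and each unlabeled article repeatedly absorbs a similarity-weighted average of its neighbors' scores. The function, data layout, and dense all-pairs weighting are illustrative assumptions, not the paper's exact algorithm.

```python
def propagate_labels(vectors, seed_labels, steps=10):
    """Spread seed sentiment scores (e.g. -1/+1) from labeled documents
    to unlabeled ones over a cosine-similarity graph; seeds stay clamped.

    vectors:     {doc_id: embedding (list of floats)}
    seed_labels: {doc_id: score} for the labeled subset
    """
    def cos(u, v):
        d = sum(a * b for a, b in zip(u, v))
        n = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
        return d / n if n else 0.0

    scores = {k: seed_labels.get(k, 0.0) for k in vectors}
    for _ in range(steps):
        new = {}
        for k, vk in vectors.items():
            if k in seed_labels:          # clamp labeled docs
                new[k] = seed_labels[k]
                continue
            num = den = 0.0
            for j, vj in vectors.items():
                if j == k:
                    continue
                w = max(cos(vk, vj), 0.0)  # ignore dissimilar directions
                num += w * scores[j]
                den += w
            new[k] = num / den if den else 0.0
        scores = new
    return scores
```

An unlabeled article close to a negative seed ends up with a negative score, which is the behavior the campaign-targeting use case needs.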
News agencies produce thousands of multimedia stories describing events happening in the world that are either scheduled, such as sports competitions, political summits and elections, or breaking, such as military conflicts, terrorist attacks, natural disasters, etc. When writing up those stories, journalists refer to contextual background and compare with similar past events. However, searching for precise facts described in stories is hard. In this paper, we propose a general method that leverages the Wikidata knowledge base to produce semantic annotations of news articles. Next, we describe a semantic search engine that supports both keyword-based search in news articles and structured data search, providing filters for properties belonging to specific event schemas that are automatically inferred.
The increased availability of online learning resources in the form of courses, videos, and tutorials has created new opportunities for independent learners, but it has also increased the difficulty of planning a course of study. Where should the learner start? What should the learner know before tackling a new course? Manually identifying these prerequisite relations between learning resources or concepts is expensive in terms of time and expertise, and it is particularly difficult to do so for new or rapidly changing areas of knowledge. To address this challenge, we present a new method for identifying prerequisite relations based on naturally occurring data, namely the navigation patterns of users on the Wikipedia online encyclopedia. Our supervised learning approach shows that the navigation network structure can be used to identify dependencies among concepts in several domains.
Increased polarization and partisanship have become a consistent state of politics, media, and society, especially in the United States. As many news publishers are perceived as “biased” and some others have come under attack as being “fake news”, efforts to make such labels stick have increased too. In some cases (e.g., InfoWars), the use of such labels is legitimate, because some online publishers deliberately spread conspiracy theories and false stories. Other news publishers are perceived as partisan and biased, in ways that damage their reporting credibility. Whether political bias affects journalism standards appears to be a debated topic with no clear consensus. Meanwhile, labels such as “far-left” or “alt-right” are highly contested and may become cause for prolonged edit wars on the Wikipedia pages of some news sources. In this paper, we try to shine a light on this phenomenon and its extent, in order to start a conversation within the Wikipedia community about transparent processes for assigning political orientation and journalistic reliability labels to news sources, especially to unfamiliar ones, which users would be more likely to verify by looking them up. As more of Wikipedia’s content is used outside Wikipedia’s “container” (e.g., in search results or by voice personal assistants), the issue of where certain statements appear in the Wikipedia page and their verifiability becomes an urgent one to consider, not only for Wikipedia editors but also for third-party information providers.
Understanding how various external campaigns or events affect readership on Wikipedia is important to efforts aimed at improving awareness and access to its content. In this paper, we consider how to build time-series models aimed at predicting page views on Wikipedia with the goal of detecting whether there are significant changes to the existing trends. We test these models on two different events: a video campaign aimed at increasing awareness of Hindi Wikipedia in India and the page preview feature roll-out—a means of accessing Wikipedia content without actually visiting the pages—on English and German Wikipedia. Our models effectively estimate the impact of page preview roll-out, but do not detect a significant change following the video campaign in India. We also discuss the utility of other geographies or language editions for predicting page views from a given area on a given language edition.
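The detection task can be illustrated with a deliberately simple pre/post comparison: fit the pre-intervention window and flag the post-intervention window if its mean deviates by more than a few standard errors. The paper's actual time-series models are more sophisticated; the function name and threshold below are invented for this sketch.

```python
def detect_change(series, split, z_thresh=3.0):
    """Crude intervention check on a page-view series.

    series: daily page views; series[:split] is pre-intervention,
    series[split:] post-intervention. Returns (z, flagged): the z-score
    of the post-period mean against the pre-period baseline, and
    whether it exceeds z_thresh.
    """
    pre, post = series[:split], series[split:]
    n = len(pre)
    mean = sum(pre) / n
    var = sum((x - mean) ** 2 for x in pre) / (n - 1)
    se = (var / len(post)) ** 0.5 if var > 0 else float("inf")
    post_mean = sum(post) / len(post)
    z = (post_mean - mean) / se
    return z, abs(z) >= z_thresh
```

A real deployment would model trend and weekly seasonality before testing, which is exactly why the paper builds full time-series models rather than a mean-shift test.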
Information Extraction (IE) techniques enable us to distill knowledge from the abundantly available unstructured content. Some basic IE methods include automatically extracting relevant entities from text (e.g., places, dates, people), understanding the relations among them, building semantic resources (dictionaries, ontologies) to inform the extraction tasks, and connecting extraction results to standard classification resources. IE techniques cannot be decoupled from human input: at a bare minimum, some of the data needs to be manually annotated by a human so that automatic methods can learn patterns to recognize certain types of information. The human-in-the-loop paradigm applied to IE focuses on how to better take advantage of human annotations (the recorded observations), and on how much interaction with the human is needed for each specific extraction task.
Data science is an emerging discipline that offers both promise and peril. Responsible data science refers to efforts that address both the technical and societal issues in emerging data-driven technologies. How can machine learning and AI systems reason effectively about complex dependencies and uncertainty? Furthermore, how do we understand the ethical and societal issues involved in data-driven decision-making? There is a pressing need to integrate algorithmic and statistical principles, social science theories, and basic humanist concepts so that we can think critically and constructively about the socio-technical systems we are building. In this talk, I will overview this emerging area.
As one of the largest communities that search for online resources, children are introduced to the Web at increasingly young ages. However, popular search tools are not explicitly designed with children in mind, nor do their retrieved results explicitly target children. Consequently, many young users struggle to complete successful searches, especially since most search engines (SEs) do not directly support, or offer only weak support for, children’s inquiry approaches. Even though children, as inexperienced users, struggle with describing their information needs in a concise query, they still expect SEs to retrieve relevant information in response to their requirements. As part of their capabilities, SEs often suggest queries to aid users in better defining their information needs. In fact, a recent study conducted by Gossen et al. shows that children pay more attention to suggested queries than adults. Unfortunately, these suggestions are not specifically tailored towards children and thus need improvement. While there exist multiple query suggestion modules, only a few specifically target children. To address this problem, along with the need for more children-related tools, we rely on ReQuIK (Recommendations based on Query Intention for Kids), a query suggestion module tailored towards 6-to-13-year-old children (introduced in ). ReQuIK informs its suggestion process by applying (i) a strategy based on search intent to capture the purpose of a query, (ii) a ranking strategy based on a wide-and-deep neural network that considers both raw text and traits commonly associated with kid-related queries, (iii) a filtering strategy based on the readability levels of documents potentially retrieved by a query, to favor suggestions that trigger the retrieval of documents matching children’s reading skills, and (iv) a content-similarity strategy to ensure diversity among suggestions.
To assess the quality of the system, we conducted initial offline and online experiments based on 591 queries written by 97 children, ages 6 to 13. The results of this assessment verified the correctness of ReQuIK’s recommendation strategy, the fact that it provides suggestions that appeal to children, and its ability to recommend queries that lead to the retrieval of materials with readability levels that match children’s reading skills.
To the best of our knowledge, ReQuIK is the only available system that can be coupled with SEs to generate query recommendations for children, favoring those that lead to easier-to-read, child-related resources, which can improve SEs’ performance. The design of the proposed tool explicitly considers the different patterns children use while searching the Web, in order to adequately capture the intended meaning of their original queries. For example, if a child submits the query “elsa”, ReQuIK aims to prioritize query suggestions such as “elsa coloring papers” or “elsa dress up games”, which correlate better with topics of interest to children, rather than “elsa pataky”, as suggested by Google, which is more appealing to mature users. Other contributions of our work include a novel ranking model inspired by a wide-and-deep architecture that, while successfully applied for ranking purposes, has never been used in the query suggestion domain; a strategy to overcome the lack of queries written by children by taking advantage of general-purpose children-oriented phrases; and a newly created dataset.
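Readability-based filtering of the kind ReQuIK applies can be illustrated with the classic Flesch reading-ease formula. The naive syllable counter and the `child_friendly` threshold below are assumptions of this sketch, not necessarily the readability metric the system uses.

```python
import re

def count_syllables(word):
    """Naive syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Flesch reading-ease score; higher means easier to read.
    206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

def child_friendly(text, min_score=80.0):
    """Filter: keep only documents easy enough for young readers
    (the 80.0 cutoff is an arbitrary choice for illustration)."""
    return flesch_reading_ease(text) >= min_score
```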
In this lightning talk paper, we present a dataset of jokes in Russian and a deep learning model for the humor recognition task. The new large dataset was collected from various online resources and carefully complemented with unfunny texts having similar lexical properties. In total, there are more than 300,000 short texts, which is significantly larger than any previous humor-related corpus. Manual annotation of 2,000 items confirmed the reliability of the corpus construction approach. Further, we applied language-model fine-tuning for text classification and obtained an F1 score of 0.91, which constitutes a considerable gain over baseline methods.
Predicting signed links in social networks often faces the problem of signed link data sparsity, i.e., only a small percentage of signed links are given. The problem is exacerbated when the number of negative links is much smaller than that of positive links. Boosting signed link prediction necessitates additional information to compensate for data sparsity. According to psychology theories, one rich source of such information is a user's personality, such as optimism and pessimism, which can help determine their propensity to establish positive and negative links. In this study, we investigate how personality information can be obtained, and whether it can help alleviate the data sparsity problem for signed link prediction. We propose a novel signed link prediction model that enables empirical exploration of user personality via social media data. We evaluate our proposed model on two datasets of real-world signed link networks. The results demonstrate the complementary role of personality information in the signed link prediction problem. Experimental results also indicate the effectiveness of different levels of personality information for the signed link data sparsity problem.
It is estimated that merely 4% of the world's population resides on US soil. Remarkably, 43% of all prominent websites are hosted in the United States (Fig. 1). Even though most data content on the Web is unstructured, the US government has contributed substantially by producing and actively releasing structured datasets in fields such as health, education, safety, and finance.
The aforementioned datasets are referred to as Open Government Data (OGD) and aim to increase the structured data pool while promoting government transparency and accountability. In this paper, we present a new system, “OGDXplor”, which processes raw OGD through a well-defined procedure leveraging machine learning algorithms and produces meaningful insights.
The novelty of this work lies in the collective approach used to develop the system and tackle its challenges. First, we address the challenges that arise when data are collected and aggregated from heterogeneous sources, data that would otherwise be impossible to acquire as a comprehensive unit. Moreover, classification and comparisons are drawn at a much finer level, which we refer to as the zone level. Zones are the areas encompassed and defined by zip codes, and are seldom used for classification and insight extraction as presented here. OGDXplor facilitates comparing and classifying zones located in different cities, or zones within an individual city.
The system is presented to end-users as a web application that allows users to select zones and features relevant to their use case. Results are presented in both chart and map formats, aiding the decision-making process.
Dialogue systems and conversational agents are becoming increasingly popular in modern society, but building an agent capable of holding intelligent conversations with its users is a challenging problem for artificial intelligence. In this talk, we share challenges and learnings from our journey of building a deep learning based conversational social agent called “Ruuh” (m.me/Ruuh), developed by a team at Microsoft India to converse on a wide range of topics. The authors are co-creators of Ruuh, and the original paper was presented in the NeurIPS 2018 Demonstration Track by two of the authors. As a social agent, Ruuh needs to think beyond the utilitarian notion of merely generating “relevant” responses and meet a wider range of user social needs. The agent also needs to detect and respond to abusive language, sensitive topics, and trolling behavior of users. Some of these objectives pose significant research challenges in the areas of NLP, IR, and AI. Our agent has interacted with over 2 million real-world users to date, generating over 150 million user conversations. We intend to walk the audience through our journey of overcoming several research challenges to become the most popular social agent in India.
Many crowd-sourced review platforms, such as Yelp, TripAdvisor, and Foursquare, have sprung up to provide a shared space for people to write reviews and rate local businesses. Given the substantial impact of businesses’ online ratings on their sales, many businesses add themselves to multiple websites to be discovered more easily. Some might also engage in reputation management, which could range from rewarding customers for a favorable review to running a complex review campaign, where armies of accounts post reviews to influence a business’s average review score.
Most previous work uses supervised machine learning and focuses only on textual and stylometric features [1, 3, 4, 7]. The ground truth data obtained in these works is neither large nor comprehensive [4, 5, 6, 7, 8, 10]. These works also assume a limited threat model; e.g., an adversary’s activity is assumed to be found near sudden shifts in the data, or the focus is restricted to positive campaigns.
We propose OneReview, a system for finding fraudulent content on a crowd-sourced review site by leveraging correlations with other independent review sites and using textual and contextual features. We assume that an attacker may not be able to exert the same influence over a business’s reputation on several websites, due to the increased cost. OneReview focuses on isolating anomalous changes in a business’s reputation across multiple review sites to locate malicious activity without relying on specific patterns. Our intuition is that a business’s reputation should not differ greatly across review sites; e.g., if a restaurant changes its chef or manager, the impact of these changes should appear in reviews across all the websites. OneReview applies change point analysis to the reviews of each business independently on each website, and then uses our proposed Change Point Analyzer to evaluate change points, detect those that do not match across websites, and flag them as suspicious. It then uses supervised machine learning, with a combination of textual and metadata features, to locate fraudulent reviews among the suspicious ones.
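The cross-site intuition can be sketched in a few lines. The data, the mean-shift statistic, and the thresholds below are all hypothetical simplifications; OneReview's actual Change Point Analyzer is considerably more elaborate:

```python
# Minimal sketch of cross-site change-point comparison (hypothetical data).
# A change point on one site with no counterpart on the other is suspicious.

def mean_shift_changepoint(series):
    """Return the index best splitting the series into two mean levels,
    together with the size of the mean shift (a crude statistic)."""
    best_idx, best_shift = None, 0.0
    for i in range(2, len(series) - 2):
        left = sum(series[:i]) / i
        right = sum(series[i:]) / (len(series) - i)
        if abs(left - right) > best_shift:
            best_idx, best_shift = i, abs(left - right)
    return best_idx, best_shift

def suspicious_changepoints(site_a, site_b, min_shift=0.5, window=2):
    """Flag a change point on site A with no matching change on site B."""
    idx_a, shift_a = mean_shift_changepoint(site_a)
    idx_b, shift_b = mean_shift_changepoint(site_b)
    if idx_a is None or shift_a < min_shift:
        return []            # no notable change on site A
    if shift_b >= min_shift and abs(idx_a - idx_b) <= window:
        return []            # matched across sites: likely a real-world cause
    return [idx_a]           # unmatched: candidate for fraud inspection

# Monthly average ratings: site A jumps at month 6, site B stays flat.
site_a = [3.1, 3.0, 3.2, 3.1, 3.0, 3.1, 4.6, 4.7, 4.5, 4.6]
site_b = [3.2, 3.1, 3.1, 3.0, 3.2, 3.1, 3.1, 3.0, 3.2, 3.1]
print(suspicious_changepoints(site_a, site_b))  # → [6]
```

If both series shift at the same point, the change is attributed to a real-world cause and nothing is flagged.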
We evaluated our approach using data from two review websites, Yelp and TripAdvisor, to find fraudulent activity on Yelp. We obtained Yelp reviews through the Yelp Data Challenge and used our Change Point Analyzer to correlate them with data crawled from TripAdvisor. Since realistic and varied ground truth data is not currently available, we used a combination of our change point analysis and crowd-labeling to create a set of 5,655 labeled reviews. We used k-fold cross-validation (k=5) on our ground truth and obtained 97% (+/- 0.01) accuracy, 91% (+/- 0.03) precision, and 90% (+/- 0.06) recall. Applying the model to the suspicious reviews classified 61,983 reviews, about 8% of all reviews, as fraudulent.
We further detected fraudulent campaigns that are actively initiated by, or targeted toward, specific businesses. We identified 3,980 businesses with fraudulent reviews, as well as 14,910 suspected spammers, for whom at least 40% of their reviews are classified as fraudulent. We also used community detection algorithms to locate several large astroturfing campaigns. These results show the effectiveness of OneReview in detecting fraudulent campaigns.
The potentially detrimental effects of cyberbullying have led to the development of numerous automated, data-driven approaches, with emphasis on classification accuracy. Cyberbullying, as a form of abusive online behavior, although not well-defined, is a repetitive process, i.e., a sequence of aggressive messages sent from a bully to a victim over a period of time with the intent to harm the victim.
Existing work has focused on harassment (i.e., using profanity to classify toxic comments independently) as an indicator of cyberbullying, disregarding the repetitive nature of the harassing process. However, raising a cyberbullying alert immediately after an aggressive comment is detected can lead to a high number of false positives. At the same time, two key practical challenges remain unaddressed: (i) timeliness: the state-of-the-art relies on a fixed set of features learned during training for offline detection (i.e., after all correspondence has become available), hindering the ability to respond to cyberbullying events in a timely manner (i.e., as soon as possible); (ii) scalability: the scalability of existing methods to the staggering rates at which content is generated (e.g., 95 million photos and videos are shared on Instagram per day) has largely remained unaddressed.
In my lightning talk, I will introduce CONcISE, a novel approach for timely and accurate Cyberbullying detectiON on Instagram media SEssions, which has been accepted for presentation at the main conference. Specifically, I will present a novel two-stage online approach (illustrated in Figure 1) designed to reduce the time to raise a cyberbullying alert by (i) sequentially examining comments as they become available over time, and (ii) minimizing the number of feature evaluations necessary for a decision to be made on each comment. By formalizing the problem as a sequential hypothesis testing problem, a novel algorithm has been developed that satisfies four key properties: accuracy, repetitiveness, timeliness, and efficiency.
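The sequential-testing idea can be illustrated with the classical Wald sequential probability ratio test, which decides between a "bullying session" and a "benign session" hypothesis as per-comment evidence arrives. This is the textbook test the formulation builds on, not the paper's exact algorithm; the probabilities, error rates, and aggression scores below are hypothetical:

```python
import math

# Classical Wald SPRT: decide between H1 ("bullying session", aggressive
# comments likely) and H0 ("benign session") as comments arrive over time.

def sprt(observations, p0=0.2, p1=0.6, alpha=0.05, beta=0.05):
    """Each observation is 1 (aggressive comment) or 0 (benign comment).
    Returns ('H1', n), ('H0', n), or ('undecided', n), with n comments seen."""
    upper = math.log((1 - beta) / alpha)       # accept H1 above this
    lower = math.log(beta / (1 - alpha))       # accept H0 below this
    llr = 0.0                                  # running log-likelihood ratio
    for n, x in enumerate(observations, start=1):
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "H1", n      # raise a cyberbullying alert early
        if llr <= lower:
            return "H0", n      # stop monitoring this session early
    return "undecided", len(observations)

print(sprt([1, 1, 0, 1, 1, 1]))   # repeated aggression → ('H1', 5)
print(sprt([0, 0, 0, 0, 0, 0]))   # benign session → ('H0', 5)
```

The appeal for timeliness is that the test stops as soon as the accumulated evidence crosses a threshold, rather than waiting for the whole session.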
Extensive experiments on a real-world Instagram dataset with ∼4M users and ∼10M comments demonstrate the effectiveness of the proposed approach with respect to accuracy, timeliness, efficiency, and robustness, and show that it consistently outperforms the state-of-the-art, often by a considerable margin.
The explosive growth of fake news and its erosion of democracy, journalism, and the economy has increased the demand for fake news detection. Efficient and explainable fake news detection requires an interdisciplinary approach, relying on scientific contributions from various disciplines, e.g., the social sciences and engineering, among others. Here, we illustrate how such multidisciplinary contributions can help detect fake news by improving feature engineering or by providing well-justified machine learning models. We demonstrate how news content, news propagation patterns, and users’ engagement with news can help detect fake news.
Many data-intensive applications that use machine learning or artificial intelligence techniques depend on humans providing the initial dataset, enabling algorithms to process the rest or other humans to evaluate the performance of such algorithms. There are, however, practical issues with the adoption of human computation and crowdsourcing at scale in the real world. Building data processing pipelines that require crowd computing remains difficult. In this tutorial, we present practical considerations for designing and implementing tasks that use humans and machines in combination, with the goal of producing high-quality labels.
In this tutorial, we introduce a novel crowdsourcing methodology called CrowdTruth [1, 9]. The central characteristic of CrowdTruth is harnessing the diversity of human interpretation to capture the wide range of opinions and perspectives, and thus provide more reliable, realistic, and inclusive real-world annotated data for training and evaluating machine learning components. Unlike other methods, we do not discard dissenting votes, but incorporate them into a richer and more continuous representation of truth. CrowdTruth is a widely used crowdsourcing methodology adopted by industrial partners and public organizations such as Google, IBM, The New York Times, Cleveland Clinic, Crowdynews, the Sound and Vision archive, and the Rijksmuseum, and in a multitude of domains such as AI, news, medicine, social media, cultural heritage, and the social sciences. The goal of this tutorial is to introduce the audience to a novel approach to crowdsourcing that takes advantage of the diversity of opinions and perspectives inherent to the Web, as methods that deal with disagreement and diversity in crowdsourcing have become increasingly popular. Creating this more complex notion of truth contributes directly to the larger discussion on how to make the Web more reliable, diverse, and inclusive.
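The disagreement-preserving idea can be sketched as follows. This is a simplified illustration with hypothetical votes, not the full CrowdTruth metrics:

```python
from collections import Counter

# Instead of collapsing worker annotations into a single majority label,
# keep the full distribution of votes as a soft, continuous label.

def majority_vote(annotations):
    """Hard label: the single most frequent annotation."""
    return Counter(annotations).most_common(1)[0][0]

def soft_label(annotations):
    """Soft label: the normalized distribution over all annotations."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}

votes = ["spam", "spam", "spam", "borderline", "not_spam"]
print(majority_vote(votes))  # 'spam': discards 2 of the 5 opinions
print(soft_label(votes))     # keeps the disagreement signal intact
```

A downstream classifier trained on such soft labels sees that annotators genuinely disagreed on this item, instead of treating the dissenting votes as noise.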
Machine learning algorithms increasingly affect both our online and offline experiences. Researchers and policymakers, however, have rightfully raised concerns that these systems might inadvertently exacerbate societal biases. We provide an introduction to fair machine learning, beginning with a general overview of algorithmic fairness, and then discussing these issues specifically in the context of the Web.
To measure and mitigate potential bias from machine learning systems, there has recently been an explosion of competing mathematical definitions of what it means for an algorithm to be fair. Unfortunately, as we show, many of the most prominent definitions of fairness suffer from subtle shortcomings that can lead to serious adverse consequences when used as an objective. To illustrate these complications, we draw on a variety of classical and modern ideas from statistics, economics, and legal theory.
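These tensions can be made concrete on a toy example with hypothetical predictions: the same classifier can satisfy some prominent fairness definitions while violating others, which is one reason no single definition suffices as an objective:

```python
# Toy illustration (hypothetical data): competing fairness criteria can
# disagree about the same classifier.

def rate(preds, cond):
    """Fraction of positive predictions among records matching cond."""
    sel = [p for p in preds if cond(p)]
    return sum(p["yhat"] for p in sel) / len(sel)

# Each record: group g, true label y, classifier prediction yhat.
preds = [
    {"g": "A", "y": 1, "yhat": 1}, {"g": "A", "y": 1, "yhat": 1},
    {"g": "A", "y": 0, "yhat": 0}, {"g": "A", "y": 0, "yhat": 0},
    {"g": "B", "y": 1, "yhat": 1}, {"g": "B", "y": 0, "yhat": 1},
    {"g": "B", "y": 0, "yhat": 0}, {"g": "B", "y": 0, "yhat": 0},
]

# Demographic parity: P(yhat=1) equal across groups — satisfied (0.5 each).
par_a = rate(preds, lambda p: p["g"] == "A")
par_b = rate(preds, lambda p: p["g"] == "B")
# Equal opportunity: P(yhat=1 | y=1) equal across groups — satisfied (1.0 each).
tpr_a = rate(preds, lambda p: p["g"] == "A" and p["y"] == 1)
tpr_b = rate(preds, lambda p: p["g"] == "B" and p["y"] == 1)
# But false positive rates differ, so equalized odds is violated.
fpr_a = rate(preds, lambda p: p["g"] == "A" and p["y"] == 0)
fpr_b = rate(preds, lambda p: p["g"] == "B" and p["y"] == 0)
print(par_a, par_b, tpr_a, tpr_b, fpr_a, fpr_b)
```

Group B's innocent members are wrongly flagged at a higher rate even though both parity and equal opportunity hold, illustrating how optimizing one definition can mask harm under another.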
We further discuss the equity of machine learning algorithms in the specific context of the Web, focusing on search engines and e-commerce websites. We expose the different sources of bias on the Web and how they impact fairness. They include not only data bias, but also biases produced by data sampling, the algorithms per se, user interaction, and feedback loops resulting from user personalization and content creation. All of these lead to a vicious cycle that affects everybody.
The content of this tutorial is mainly based on the work of the authors [1, 2, 3, 4].
Researchers and practitioners from different disciplines have highlighted the ethical and legal challenges posed by the use of machine learned models and data-driven systems, and the potential for such systems to discriminate against certain population groups, due to biases in algorithmic decision-making systems. This tutorial aims to present an overview of algorithmic bias / discrimination issues observed over the last few years and the lessons learned, key regulations and laws, and evolution of techniques for achieving fairness in machine learning systems. We will motivate the need for adopting a “fairness-first” approach (as opposed to viewing algorithmic bias / fairness considerations as an afterthought), when developing machine learning based models and systems for different consumer and enterprise applications. Then, we will focus on the application of fairness-aware machine learning techniques in practice, by highlighting industry best practices and case studies from different technology companies. Based on our experiences in industry, we will identify open problems and research challenges for the data mining / machine learning community.
The Internet and the general digitalization of products and operations provide an unprecedented opportunity to accelerate innovation while applying a rigorous and trustworthy methodology for supporting key product decisions. Developers of connected software, including web sites, applications, and devices, can now evaluate ideas quickly and accurately using controlled experiments, also known as A/B tests. From front-end user-interface changes to backend algorithms, from search engines (e.g., Google, Bing, Yahoo!) to retailers (e.g., Amazon, eBay, Etsy) to social networking services (e.g., Facebook, LinkedIn, Twitter) to travel services (e.g., Expedia, Airbnb, Booking.com) to many startups, online controlled experiments are now utilized to make data-driven decisions at a wide range of companies. The theory of a controlled experiment is simple, but for the practitioner the deployment and evaluation of online controlled experiments at scale (hundreds of concurrently running experiments) across a variety of web sites, mobile apps, and desktop applications present many pitfalls and new research challenges. In this tutorial, we will introduce the overall A/B testing methodology, walk through use cases using real examples, and then focus on practical and research challenges in scaling experimentation. We will share key lessons learned from scaling experimentation at Microsoft to thousands of experiments per year and outline promising directions for future work.
Machine Learning is increasingly employed to make consequential decisions for humans. In response to the ethical issues that may ensue, an active area of research in ML has been dedicated to the study of algorithmic unfairness. This tutorial introduces fair-ML to the web conference community and offers a new perspective on it through the lens of the long-established economic theories of distributive justice. Based on our past and ongoing research, we argue that economic theories of equality of opportunity, inequality measurement, and social choice have a lot to offer—in terms of tools and insights—to data scientists and practitioners interested in understanding the ethical implications of their work. We overview these theories and discuss their connections to fair-ML.
User engagement plays a central role in companies operating online services, such as search engines, news portals, e-commerce sites, entertainment services, and social networks. A main challenge is to leverage collected knowledge about the daily online behavior of millions of users to understand what engages them in the short term and, more importantly, in the long term. Two critical steps in improving user engagement are defining metrics and optimizing them. The most common way engagement is measured is through various online metrics acting as proxy measures of user engagement. This tutorial will review these metrics, their advantages and drawbacks, and their appropriateness to various types of online services. Once metrics are defined, how to optimize them becomes the key issue. We will survey methodologies, including machine learning models and experimental designs, that are utilized to optimize these metrics in direct or indirect ways. As case studies, we will focus on four types of services: news, search, entertainment, and e-commerce. We will end with lessons learned and a discussion of the most promising research directions.
In the field of web mining and web science, as well as data science and data mining, there has been a lot of interest in the analysis of (social) networks. With the growing complexity of heterogeneous data, feature-rich networks have emerged as a powerful modeling approach: they capture data and knowledge at different scales from multiple heterogeneous data sources, and allow mining and analysis from different perspectives. The challenge is to devise novel algorithms and tools for the analysis of such networks.
This tutorial provides a unified perspective on feature-rich networks, focusing on different modeling approaches, in particular multiplex and attributed networks. It outlines important principles, methods, tools and future research directions in this emerging field.
Preserving privacy of users is a key requirement of web-scale data mining applications and systems such as web search, recommender systems, crowdsourced platforms, and analytics applications, and has witnessed a renewed focus in light of recent data breaches and new regulations such as GDPR. In this tutorial, we will first present an overview of privacy breaches over the last two decades and the lessons learned, key regulations and laws, and evolution of privacy techniques leading to differential privacy definition / techniques. Then, we will focus on the application of privacy-preserving data mining techniques in practice, by presenting case studies such as Apple’s differential privacy deployment for iOS / macOS, Google’s RAPPOR, LinkedIn Salary, and Microsoft’s differential privacy deployment for collecting Windows telemetry. We will conclude with open problems and challenges for the data mining / machine learning community, based on our experiences in industry.
The inclusion of tracking technologies in personal devices opened the doors to the analysis of large sets of mobility data like GPS traces and call detail records. This tutorial presents an overview of both modeling principles of human mobility and machine learning models applicable to specific problems. We review the state of the art of five main aspects in human mobility: (1) the human mobility data landscape; (2) key measures of individual and collective mobility; (3) generative models at the level of the individual, the population, and a mixture of the two; (4) next location prediction algorithms; (5) applications for social good. For each aspect, we show experiments and simulations using the Python library “scikit-mobility” developed by the presenters of the tutorial.
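As an example of the key measures in aspect (2), the radius of gyration of an individual's trace can be computed directly. The sketch below uses hypothetical points and treats coordinates as planar for simplicity; real analyses (and scikit-mobility) use proper geodesic distances:

```python
import math

# Radius of gyration: the typical distance of an individual's visited
# locations from their center of mass, a standard individual mobility measure.

def radius_of_gyration(points):
    """points: list of (x, y) coordinates of recorded positions."""
    n = len(points)
    cx = sum(x for x, _ in points) / n          # center of mass
    cy = sum(y for _, y in points) / n
    return math.sqrt(
        sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    )

trace = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
print(radius_of_gyration(trace))  # → sqrt(2) ≈ 1.414
```

A small radius indicates a "returner" who moves within a confined area, while a large radius indicates an "explorer" with spatially dispersed activity.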
Understanding and extracting knowledge contained in text and encoding it as linked data for the Web is a highly complex task that poses several challenges, requiring expertise from fields such as conceptual modeling, natural language processing, and web technologies, including web mining, linked data generation and publishing, etc. In the scholarly domain, transforming human-readable research articles into machine-comprehensible knowledge bases is considered highly important and necessary today due to the explosion of scientific publications in every major discipline, which makes it increasingly difficult for experts to maintain an overview of their domain or relate ideas from different domains. This situation could be significantly alleviated by knowledge bases capable of supporting queries such as: find all papers that address a given problem; how was the problem solved; which methods are employed by whom in addressing particular tasks; etc. Such queries currently cannot be answered by commonly used search engines such as Google Scholar or Semantic Scholar.
This tutorial addresses the above challenge by introducing the participants to methods required in order to model knowledge regarding a given domain, extract information from available texts using advanced machine learning techniques, associate it with other information mined from the web in order to infer new knowledge and republish everything as linked open data on the Web. To this end, we will use a specific use case – that of the scholarly domain, and will show how to model research processes, extract them from research articles, associate them with contextual information from article metadata and other linked repositories and create knowledge bases available as linked data. Our aim is to show how methodologies from different computer science fields, namely natural language processing, machine learning and conceptual modeling, can be combined with Web technologies in a single meaningful workflow.
Recommender systems are widely used in online applications to help users find items of interest and help them deal with information overload. In this tutorial, we discuss the class of sequence-aware recommender systems. Differently from the traditional problem formulation based on a user-item rating matrix, the input to such systems is a sequence of logged user interactions. Likewise, sequence-aware recommender systems implement alternative computational tasks, such as predicting the next items a user will be interested in within an ongoing session or creating entire sequences of items to present to the user. We propose a problem formulation, sketch a number of computational tasks, review existing algorithmic approaches, and finally discuss evaluation aspects of sequence-aware recommender systems.
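One of the computational tasks mentioned above, next-item prediction from logged interaction sequences, can be sketched with a first-order Markov baseline. The session data is hypothetical and this is a simple illustrative baseline, not a method proposed in the tutorial:

```python
from collections import Counter, defaultdict

# Next-item prediction from logged sessions via first-order transition
# counts: which item most often follows the one just interacted with?

def fit_transitions(sessions):
    """Count item-to-item transitions over all logged sessions."""
    trans = defaultdict(Counter)
    for session in sessions:
        for a, b in zip(session, session[1:]):
            trans[a][b] += 1
    return trans

def predict_next(trans, current_item, k=2):
    """Return the top-k most likely next items after current_item."""
    return [item for item, _ in trans[current_item].most_common(k)]

sessions = [
    ["shoes", "socks", "laces"],
    ["shoes", "socks", "insoles"],
    ["shirt", "shoes", "socks"],
]
trans = fit_transitions(sessions)
print(predict_next(trans, "shoes"))  # → ['socks']
```

Session-based methods surveyed in the tutorial, such as nearest-neighbor and recurrent neural approaches, can be seen as progressively richer replacements for this transition table.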
Subgraph counting is a fundamental problem in graph analysis that finds use in a wide array of applications. The basic problem is to count or approximate the occurrences of a small subgraph (the pattern) in a large graph (the dataset). Subgraph counting is a computationally challenging problem, and the last few years have seen a rich literature develop around scalable solutions for it. However, these results have thus far appeared as a disconnected set of ideas that are applied separately by different research groups. We observe that there are a few common algorithmic building blocks that most subgraph counting results build on. In this tutorial, we attempt to summarize current methods through distilling these basic algorithmic building blocks. The tutorial will also cover methods for subgraph analysis on “big data” computational models such as the streaming model and models of parallel and distributed computation.
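One of the basic algorithmic building blocks, counting occurrences of a small pattern by intersecting neighbor sets, can be illustrated with exact triangle counting on a toy undirected graph (the edge list is hypothetical):

```python
# Exact triangle counting: for each edge (u, v), every common neighbor of
# u and v closes one triangle; each triangle is then counted once per edge.

def triangle_count(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for u, v in edges:
        count += len(adj[u] & adj[v])   # common neighbors close a triangle
    return count // 3                    # each triangle seen from 3 edges

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 0)]
print(triangle_count(edges))  # → 2, the triangles {0,1,2} and {0,2,3}
```

Scalable methods covered in the tutorial, including streaming and distributed ones, typically approximate or parallelize exactly this kind of per-edge neighborhood intersection.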
Deep learning has shown significant results in various domains. In this tutorial, we provide a conceptual understanding of embedding methods, Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs). We present a fashion use case and apply these techniques to model image, text, and sequence data in order to derive user profiles and give personalized recommendations tailored to changing user taste and interest. Given the image of a fashion item, recommending complementary matches is a challenge. Users’ taste evolves over time and depends on persona. Humans relate objects based on their appearance and on non-visual factors of lifestyle merchandise, which further complicates the recommendation task. Composing outfits additionally necessitates that constituent items be compatible: similar in some but different in other aspects.
Network representation learning offers a revolutionary paradigm for mining and learning with network data. In this tutorial, we will give a systematic introduction for representation learning on networks. We will start the tutorial with industry examples from Alibaba, AMiner, Microsoft Academic, WeChat, and XueTangX to explain how network analysis and graph mining on the Web are benefiting from representation learning. Then we will comprehensively introduce both the history and recent advances on network representation learning, such as network embedding and graph neural networks. Uniquely, this tutorial aims to provide the audience with the underlying theories in network representation learning, as well as our experience in translating this line of research into real-world applications on the Web. Finally, we will release public datasets and benchmarks for open and reproducible network representation learning research. The tutorial accompanying page is at https://aminer.org/nrl_www2019.
This half-day tutorial provides a comprehensive introduction to web stream processing, including the fundamental stream reasoning concepts, as well as an introduction to practical implementations and how to use them in concrete web applications. To this end, we intend to (1) survey existing research outcomes from Stream Reasoning / RDF Stream Processing that arise in querying, reasoning on, and learning from a variety of highly dynamic data, (2) introduce deductive and inductive stream reasoning techniques as powerful tools to use when addressing a data-centric problem characterized both by variety and velocity, (3) present a relevant use case, which requires addressing data velocity and variety simultaneously on the web, and guide the participants in developing a web stream processing application.
As language technologies have become increasingly prevalent in analyzing online data, there is a growing awareness that the decisions we make about our data, methods, and tools often have an immense impact on people and societies. This tutorial will provide an overview of real-world applications of Natural Language Processing technologies and their potential ethical implications. We intend to provide researchers with an overview of tools to ensure that the data, algorithms, and models that they build are socially responsible. These tools will include a checklist of common pitfalls that one should avoid, as well as methods to mitigate these issues. Issues of bias, ethics, and impact are often not clear-cut; this tutorial will also discuss the complexities inherent in this area.
The Web as the world’s largest information system has largely settled on a solid foundation of HTTP-based connectivity, and the representation of User Interface (UI) information resources through a mix of HTML and scripting. In contrast, the similarly rapidly evolving “Web of Services” is still based on a more diverse and more quickly evolving set of approaches and technologies. This can make architectural decisions harder when it comes to choosing on how to expose information and services through an Application Programming Interface (API). This challenge becomes even more pronounced when organizations are faced with developing strategies for managing constantly growing and evolving API landscapes.
This tutorial takes participants through two different journeys. The first one is a journey discussing API styles and API technologies, comparing and contrasting them as a way to highlight the fact that there is no such thing as the one best choice. The goal of this first journey is to provide an overview of how APIs are used nowadays in research and in industry. The second journey discusses the question of how to define an API strategy, which focuses both on helping teams to make effective choices about APIs in a given context, and on how to manage that context over time when large organizations nowadays have thousands of APIs, which will continue to evolve constantly.
The tutorial is based on our long-term research on open domain conversation and rich hands-on experience in the development of Microsoft XiaoIce. We will summarize the recent achievements made by both academia and industry on chatbots, and give a thorough and systematic introduction to state-of-the-art methods for open domain conversation modeling, including both retrieval-based and generation-based methods. In addition, our tutorial will also cover new research trends for chatbots, such as how to design a reasonable evaluation metric for open domain dialogue generation, how to build conversation models with multiple modalities, and how to conduct dialogue management in open domain conversation systems.
Explainable recommendation and search attempt to develop models or methods that generate not only high-quality recommendation or search results, but also intuitive explanations of the results for users or system designers, where the explanations are either post-hoc or come directly from an explainable model. Explainable recommendation and search can help to improve system transparency, persuasiveness, trustworthiness, and effectiveness. This is even more important in personalized search and recommendation scenarios, where users would like to know why a particular product, web page, news report, or friend suggestion appears in their own search and recommendation lists. The tutorial focuses on the research and application of explainable recommendation and search algorithms, as well as their application in real-world systems such as search engines, e-commerce sites, and social networks. The tutorial aims at introducing and communicating explainable recommendation and search methods to the community, as well as gathering researchers and practitioners interested in this research direction for discussions, idea communication, and research promotion.