Analytics Used to Detect Online Harassment

The internet can spread ideas, connect communities, and serve as the foundation for thriving businesses. But as many people who use the internet or social media know, it is also home to trolls, harassers, "doxers," and others with less noble intentions, as we are reminded today on Internet Safety Day.

For example, here's one of the things an anonymous poster said to a woman editor on Wikipedia in March 2015: "What you need to understand as you are doing the ironing is that Wikipedia is no place for a woman."

That post was left on one of Wikipedia's "talk pages," which are pages attached to every Wikipedia article and user page on the platform. It demonstrates that these discussions are not always good-faith collaboration and exchange of ideas.

(Image: Pixabay)

In conjunction with attention to this problem and defenses against it, the Wikimedia Foundation has released two large data sets to the public. The first is a collection of over one million annotations of Wikipedia talk page edits, in which 4,000 crowd workers judged whether each edit constituted a personal attack and, if so, who its target was. Each edit was rated by 10 judges, whose opinions were aggregated and used to train the model. The Wikimedia Foundation said it believes this is the largest public annotated data set of personal attacks available today.
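The aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: the edit IDs and ratings below are invented, and the real data set has its own schema.

```python
# Sketch: aggregating multiple crowd ratings per talk-page edit into a
# single training label. Ratings: 1 = judged a personal attack, 0 = not.
from collections import defaultdict

# Hypothetical (edit_id, worker_rating) pairs; the real corpus has
# roughly ten ratings per edit.
annotations = [
    ("edit_1", 1), ("edit_1", 1), ("edit_1", 0),
    ("edit_2", 0), ("edit_2", 0), ("edit_2", 1),
]

votes = defaultdict(list)
for edit_id, rating in annotations:
    votes[edit_id].append(rating)

# Aggregate as the fraction of judges who saw an attack; a hard label
# could instead be taken by majority vote (fraction >= 0.5).
labels = {eid: sum(r) / len(r) for eid, r in votes.items()}
print(labels)
```

A soft label (the fraction of judges) preserves disagreement between annotators, which matters for borderline comments; thresholding it recovers a conventional binary label.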

The second data set contains all 95 million user and article talk comments made between 2001 and 2015. Both data sets have been released publicly to support further research.

The Wikimedia Foundation said the model was inspired by research at Yahoo on detecting abusive language: fragments of text from the Wikipedia edits are extracted as features and fed into a simple machine learning algorithm, logistic regression.
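The approach the article describes can be sketched with off-the-shelf tools. The snippet below is a toy illustration of text fragments feeding a logistic regression classifier, assuming character n-gram features and scikit-learn; the example comments and labels are invented, and the real model was trained on the annotated Wikipedia corpus.

```python
# Sketch: character n-gram features into logistic regression, which
# outputs a probability that a comment is a personal attack.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = [
    "Thanks for fixing the citation format.",
    "You are an idiot and should never edit again.",
    "I merged the two sections as discussed.",
    "Get lost, nobody wants you here.",
]
labels = [0, 1, 0, 1]  # 1 = personal attack

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(),
)
model.fit(comments, labels)

# predict_proba returns the probability estimate the article mentions;
# column 1 is the probability of the "attack" class.
print(model.predict_proba(["You should be blocked."])[:, 1])
```

Character n-grams are a common choice for abuse detection because they are robust to misspellings and obfuscation ("id1ot"), which word-level features miss.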

"The model this produces gives a probability estimate of whether an edit is a personal attack," the Wikimedia Foundation said in a statement announcing the data set availability. "What surprised us was the effectiveness of this model: a fully trained model achieves better performance than the combined average of three human crowd workers."

The research also revealed the following insights about online harassment of Wikipedia editors:

  • Only 18% of attacks that the algorithm discovered were followed by a warning or block of the offending user.
  • While anonymous users are responsible for a disproportionate number of attacks, registered users still account for almost 67% of the attacks on Wikipedia.
  • While half of all attacks come from editors who make fewer than five edits per year, a full one-third of attacks come from registered users with more than 100 edits per year.

There's still plenty of work to be done. While the researchers now understand more about this kind of behavior, they have yet to determine the best ways to mitigate it. The data is currently English-only, so the model only understands English, and the Wikimedia Foundation acknowledges that the model is still not very good at identifying threats.

The Wikimedia Foundation worked with Jigsaw, Alphabet's technology incubator, on this research, and invites others to join the future research efforts by getting in touch via the project's wiki page.

Jessica Davis, Senior Editor, Enterprise Apps, Informationweek

Jessica Davis has spent a career covering the intersection of business and technology at titles including IDG's Infoworld, Ziff Davis Enterprise's eWeek and Channel Insider, and Penton Technology's MSPmentor. She's passionate about the practical use of business intelligence, predictive analytics, and big data for smarter business and a better world. In her spare time she enjoys playing Minecraft and other video games with her sons. She's also a student and performer of improvisational comedy. Follow her on Twitter: @jessicadavis.


Re: Trolling trolls with AI analytics
  • 2/21/2017 3:47:27 PM

I would suppose reporting problems may very well help. I'm sure lots of social media providers have differing attitudes, if not vastly different resources, for deciding just how they handle it. It should all come down to the return on money and time invested vs. the social good of alleviating distressful situations for both staff and users.

Re: Trolling trolls with AI analytics
  • 2/19/2017 7:40:03 PM

I've found that reporting abusers does help. Facebook is quick to act on posts that have been reported as abusive or fraudulent.

Also, I think it's important to educate kids on the realities of social media (e.g., you don't really have 600 friends, and this is how people behave online) so that they don't take things so personally. Teach them that they can report harassment and block those who are doing the harassing.

Re: Trolling trolls with AI analytics
  • 2/15/2017 10:26:15 AM

It would seem from the data presented that perhaps psychologists or sociologists might have some solution for preventing, or at least reducing, the number of attacks. In light of our current divisions in political conversation, I would only guess that the bad behavior might even grow over the next months or years unless some education program gets going on the problem.

Re: Trolling trolls with AI analytics
  • 2/11/2017 7:38:47 AM

@SaneIT: Which simply goes to show how technologically far along Tay was as AI.  It's the same thing with a human.  Influence them while they're young with whatever ideas you have and that's how they'll grow up.

Middle out
  • 2/11/2017 7:37:30 AM

> While half of all attacks come from editors who make fewer than five edits per year, a full one-third of attacks come from registered users with more than 100 edits per year.

I suppose this only makes sense.  The majority would be those who aren't as invested in the platform, while the second-highest plurality would come from people who are highly invested in the platform.  You wouldn't get as much from the middle.

Re: Trolling trolls with AI analytics
  • 2/10/2017 1:54:10 PM

@SaneIT. The common explanation for trolling on the Internet is that it allows us to be somewhat anonymous, so we can say anything we want. However, I think that explains only some of the negative activity, considering how many of those posters are easily identified (often using their real names). They say things they would never say face-to-face with their subjects/targets, even if they had such access.

Is it merely false bravado fueled by a desire to stand out from the crowd, to be a contrarian, justified by the fact that they have a forum that they think compares with the platform (movie, TV, sports arena, political office) that their target has?
Re: Trolling trolls with AI analytics
  • 2/10/2017 8:15:18 AM

@ T Sweeney, I think we only need to look at Microsoft's leap into bot technology with Tay.  They intended for it to be more human in its interactions.  Sadly humans messing with it turned the bot into an angry Nazi sympathizer.  It might have worked in a controlled environment but something about the internet makes people want to mess with or break things and trolling seems to be the usual way to do this. 

Re: Trolling trolls with AI analytics
  • 2/9/2017 4:44:26 PM

At some point software might just be good enough, but I agree with you; we are not near that point today.

Re: Trolling trolls with AI analytics
  • 2/9/2017 10:34:46 AM

@Terry Good point about both the machines and some humans. The words themselves are only a fraction of the whole communication experience. I was wondering if that trolling problem, and the challenge of keeping it under control without human moderation, is what led IMDb to remove the message boards section from its site. Though I don't contribute, I find the discussions entertaining and sometimes even informative, as they touch on aspects of a film you may not easily find anywhere else. Here's what IMDb wrote:

IMDb is the world's most popular and authoritative source for movie, TV and celebrity content. As part of our ongoing effort to continually evaluate and enhance the customer experience on IMDb, we have decided to disable IMDb's message boards on February 20, 2017. This includes the Private Message system. After in-depth discussion and examination, we have concluded that IMDb's message boards are no longer providing a positive, useful experience for the vast majority of our more than 250 million monthly users worldwide. The decision to retire a long-standing feature was made only after careful consideration and was based on data and traffic.

[the next paragraph is mostly loaded with links to its social media pages that it says can offer a place for discussions]

Because IMDb's message boards continue to be utilized by a small but passionate community of IMDb users, we announced our decision to disable our message boards on February 3, 2017 but will leave them open for two additional weeks so that users will have ample time to archive any message board content they'd like to keep for personal use. During this two-week transition period, which concludes on February 19, 2017, IMDb message board users can exchange contact information with any other board users they would like to remain in communication with (since once we shut down the IMDb message boards, users will no longer be able to send personal messages to one another). We regret any disappointment or frustration IMDb message board users may experience as a result of this decision.

IMDb is passionately committed to providing innovative ways for our hundreds of millions of users to engage and communicate with one another. We will continue to enhance our current offerings and launch new features in 2017 and beyond that will help our customers communicate and express themselves in meaningful ways while leveraging emerging technologies and opportunities.

Re: Trolling trolls with AI analytics
  • 2/8/2017 2:47:30 PM

Machine learning is notoriously obtuse where things like irony are concerned (which, come to think of it, mirrors some humans!). Human editors are still going to excel where detection of snotty asides and sarcasm is concerned.

And sometimes there's a fine line between blowing off some verbal steam and outright trolling. Is it really reasonable to expect software to make this distinction accurately most of the time?
