Faculty Mentor: Gary Weiss (Fordham)
Approximately 70% of global email consists of unsolicited spam. By conservative estimates, spam wastes about 20 hours per person per year, while industries that rely more heavily on email may see closer to 70 hours wasted per employee per year. This imposes an immense cost on businesses. Furthermore, some spam is more than an annoyance: it may be a phishing attempt to obtain confidential information, or an attempt to get the reader to download harmful programs (e.g., spyware or ransomware). To combat spam, many companies and individuals rely on spam filters; however, distinguishing between spam and legitimate email is not always easy, and blocking legitimate email can be very costly. In this project, text mining methods will be used to automatically build a classifier from emails labeled as “spam” or “legitimate”, so that the classifier can be used to distinguish between the two.
This project is appropriate for participants who have little background in computer science or programming, since a data mining tool can be used to perform the necessary tasks. Participants with a more advanced background can also take part and may choose a more programming-intensive approach.
Objectives and Learning Goals
The participants in this project will achieve the following:
- Gain a basic understanding of data mining, and more specifically, of classification.
- Learn about a variety of classification algorithms, including how they work and about their strengths and weaknesses.
- Learn how to use a data mining tool to automatically build a classifier from data.
- Learn about text mining and the steps required to preprocess textual data for use in data mining.
- Experience building and evaluating a model to classify spam email.
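
For participants who choose the more programming-intensive route, the overall workflow (preprocess the text, build a classifier from labeled emails, then classify new messages) might look roughly like the sketch below. It is only an illustration under stated assumptions: the project description does not name a specific algorithm or tool, so multinomial Naive Bayes is used here as one common choice, and the function names (`tokenize`, `train`, `classify`) and the four-email toy dataset are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Preprocessing step: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_emails):
    """Build a bag-of-words model from (text, label) pairs."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training emails
    for text, label in labeled_emails:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest Naive Bayes log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Start from the class prior, then add one term per known word.
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only (not a real spam corpus).
TRAINING = [
    ("Win a free prize now, click here", "spam"),
    ("Cheap loans, claim your free money today", "spam"),
    ("Meeting moved to Tuesday at noon", "legitimate"),
    ("Please review the attached project report", "legitimate"),
]

model = train(TRAINING)
print(classify("Claim your free prize", model))
print(classify("Project meeting report", model))
```

A real experiment would use a much larger labeled dataset and would evaluate the model on emails held out from training, paying particular attention to legitimate emails misclassified as spam, since those errors are the costly ones.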