    CMSC131, Spring 2016

    Project 7: Extracting meaning from data

    • Due Apr 27, 2016 by 11pm
    • Points 5
    • Available after Apr 11, 2016 at 12:50pm

    Theme

    In today’s world, we are constantly generating new data, from real-time weather updates to personalized fitness trackers. How can we extract meaningful insights about human behavior from these large amounts of data? This project allows you to explore and develop your own method of analyzing real-world datasets. There will be a few required methods, but we want you to get creative with your analysis of the data, so you will also make up your own analyses to run.

    Project Overview

    This project is broken down into three phases: I. Project Proposal, II. Baseline Methods, and III. Creative Methods. You can choose to work with one of the following datasets: (1) College Admissions, (2) Montgomery County Traffic Violations, (3) Baseball Statistics, or (4) NY High School SAT Scores.

    Each dataset can be downloaded from ELMS following the links above as a Comma-Separated Value (CSV) file. You can open up the CSV files in any spreadsheet software (Excel, OpenOffice), look through the data, and then determine which dataset interests you the most. Once you have chosen your dataset and are ready to start coding, we will give you code which reads in the files for you (so don’t worry about parsing the data files).

    I. Project Proposal (20%)                                            Due: Friday, April 15th

    Before you begin working, it’s important that you plan out not only which dataset you will be using, but how you will organize your project. Complete and submit the 1-page Project Proposal sheet at the end of this Project 7 specification. Submit this form through ELMS by Friday, April 15th at noon. The TAs will look over your design specification and grade it for feasibility, use of appropriate data structures, and creativity. You can come into office hours, or use Piazza to talk about your ideas for the project.

    NOTE: DO NOT START CODING THIS WEEK! The design phase is incredibly important and you should think your decisions through thoroughly. Project code will be released to you at a later date.

    Again - submit the proposal here before the Friday deadline: Project 7 Proposal

    II. Baseline Methods (40%)                                        Due: Monday, April 25

    Regardless of which dataset you select, each CSV file has several columns of numerical data (ints or doubles) and several columns of Strings. The first column of each dataset has a unique ID number for each data entry, starting at 1 and incrementing by 1 per row. Because of the way we designed the datasets, the baseline methods you will implement should work regardless of dataset.

    The code that you must implement will consist of a class named StatAnalysis that contains at least the methods described below, as well as fields that help store the data found in the file you will analyze.  This class must include the following methods:

    public void readData(String name) – reads the data from the file named name.  We will provide you with further instructions and code for reading data from a file by the end of this week.

    public double average(int columnNumber) - this method takes a single parameter that represents the column number in the CSV file. If the data in that column is numerical, the method returns the average of the values in the column. If the data in that column is not numerical, or if the column number refers to a non-existent column, it should return -1.

    public double median(int columnNumber) – same as average, but computes the median of the corresponding column. It likewise returns -1 if the column is not numerical or does not exist.

    public String mode(int columnNumber) – this method returns the mode (the most frequent element) of a column. IMPORTANT: Updated! Treat the column as a String even if the data contained within it is numerical. If the column number is incorrect, the method should return the special value null.
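As a rough illustration of the mode computation described above, here is a minimal sketch. It assumes the rows are already in memory as a list of String arrays and that column numbers are 0-based; neither is stated in this specification. The class name ModeSketch and the tie-breaking rule (the value seen first wins) are hypothetical choices, not requirements.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ModeSketch {
    // Hypothetical helper: rows holds one String[] per data line; columns are 0-based (assumption).
    static String mode(List<String[]> rows, int columnNumber) {
        // An empty table is treated like an invalid column here (assumption).
        if (rows.isEmpty() || columnNumber < 0 || columnNumber >= rows.get(0).length) {
            return null;
        }
        // Count how often each value appears, remembering first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] row : rows) {
            counts.merge(row[columnNumber], 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) { // strict ">" keeps the earliest value on ties
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"1", "MD", "3.5"},
                new String[] {"2", "VA", "3.5"},
                new String[] {"3", "MD", "4.0"});
        System.out.println(mode(rows, 1)); // MD
        System.out.println(mode(rows, 2)); // 3.5 (compared as text, not as a number)
        System.out.println(mode(rows, 7)); // null
    }
}
```

Because values are compared as Strings, "3.5" and "3.50" would count as different elements, which matches the instruction to treat the column as text.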

    When deciding how to structure your code and which data structures to use (e.g. arrays, ArrayLists, individual primitives), think about the fact that you will need to write methods which take as a parameter int columnNumber and then compute information about a specific column of data. You should also think about keeping your data organized in some fashion (perhaps using the provided IDs).
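One possible layout, shown only as a sketch: keep every row as a String array in an ArrayList, store the column types alongside, and let each statistic look up its column by index. The class name TableSketch, the addRow method, and the 0-based column numbering are assumptions made for this example; your own design may differ.

```java
import java.util.ArrayList;
import java.util.List;

class TableSketch {
    private final String[] columnTypes;                  // "numeric" or "string" per column
    private final List<String[]> rows = new ArrayList<>(); // one String[] per data line

    TableSketch(String[] columnTypes) {
        this.columnTypes = columnTypes;
    }

    void addRow(String[] row) {
        rows.add(row);
    }

    // True only for an existing column whose type is "numeric".
    private boolean isNumeric(int columnNumber) {
        return columnNumber >= 0
                && columnNumber < columnTypes.length
                && "numeric".equals(columnTypes[columnNumber]);
    }

    // Returns the column average, or -1 for a missing or non-numeric column.
    // An empty table also returns -1 here (assumption).
    double average(int columnNumber) {
        if (!isNumeric(columnNumber) || rows.isEmpty()) {
            return -1;
        }
        double sum = 0;
        for (String[] row : rows) {
            sum += Double.parseDouble(row[columnNumber]);
        }
        return sum / rows.size();
    }

    public static void main(String[] args) {
        TableSketch table = new TableSketch(new String[] {"numeric", "string", "numeric"});
        table.addRow(new String[] {"1", "MD", "3.5"});
        table.addRow(new String[] {"2", "VA", "4.5"});
        System.out.println(table.average(2)); // 4.0
        System.out.println(table.average(1)); // -1.0 (string column)
        System.out.println(table.average(9)); // -1.0 (no such column)
    }
}
```

Storing rows as Strings defers parsing until a method needs numbers, which keeps one structure usable for both numeric and text columns.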

    III. Creative Methods (40%)                              Due: Wednesday, April 27

    You will need to implement one or more interesting methods which go beyond the scope of the baseline methods in order to further analyze your dataset. We expect you to use your creativity to provide some interesting new information about the data. See the two examples below for some basic ideas of what we are looking for. If you are unsure whether your idea is too simple or too complex, ask a TA sometime this week. You will also receive feedback on your project proposal, which should let you know if you’re on the right track.

    You will submit a short write-up demonstrating the functionality and output of your creative methods and explaining what data structures or algorithms were used and why your results are interesting. Additionally, you will reflect upon your initial design choices and write about how your approach (either conceptual or in code) changed as you began coding or reached the creative methods section of the project.

    Example 1. Visualization

    Let’s say you chose to work with the NY High School SAT Scores. Using only printouts to the console (no external libraries or code bases unless you want to challenge yourself), you can create a simple text-based visualization of the distribution of SAT scores. When printed in the console, the dataset could look something like this:

          +
          + +
      +   +++
      ++ ++++++++   +
    ++++++++++++++  ++
    0     1200    2400

    The simple graph above shows the distribution of scores, over the range 0 to 2400, among students in the NY High School SAT Scores dataset.
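A console graph of this kind could be produced along the following lines. This is only a sketch: the class name HistogramSketch, the bucket count, and the scaling rule (the tallest bucket fills the full height) are illustrative assumptions, and the sample scores in main are made up rather than taken from the dataset.

```java
class HistogramSketch {
    // Builds a vertical text histogram: one column per bucket, taller stacks of '+' for fuller buckets.
    static String render(double[] values, double min, double max, int buckets, int maxHeight) {
        int[] counts = new int[buckets];
        double width = (max - min) / buckets;
        for (double v : values) {
            if (v < min || v > max) {
                continue; // ignore values outside the plotted range
            }
            int b = Math.min((int) ((v - min) / width), buckets - 1); // v == max goes in the last bucket
            counts[b]++;
        }
        int largest = 1;
        for (int c : counts) {
            largest = Math.max(largest, c);
        }
        StringBuilder out = new StringBuilder();
        for (int level = maxHeight; level >= 1; level--) { // print the top row first
            for (int c : counts) {
                // Scale so the fullest bucket reaches maxHeight; any non-empty bucket gets at least one '+'.
                int height = (int) Math.ceil((double) c * maxHeight / largest);
                out.append(height >= level ? '+' : ' ');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        double[] scores = {900, 1100, 1150, 1200, 1250, 1300, 1500, 1800, 2100}; // made-up sample
        System.out.print(render(scores, 0, 2400, 12, 5));
        System.out.println("0    1200   2400");
    }
}
```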

    Example 2. Statistical Analysis

    You already know that you will be asked to compute the average for columns of numerical data. However, you know that an average doesn’t tell you very much about the data, so you decide to additionally compute the standard deviation, 25th percentile, and 75th percentile for each column of numerical data in order to provide a more complete understanding of the data set.
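The extra statistics mentioned in this example might be computed as in the sketch below. It assumes the column has already been converted to a non-empty double array. The class name SpreadSketch is hypothetical, the standard deviation is the population form (dividing by n), and the percentile uses the nearest-rank definition; other definitions are equally valid and give slightly different values.

```java
import java.util.Arrays;

class SpreadSketch {
    // Population standard deviation (divides by n, not n - 1). Assumes a non-empty array.
    static double standardDeviation(double[] values) {
        double mean = 0;
        for (double v : values) {
            mean += v;
        }
        mean /= values.length;
        double sumSquares = 0;
        for (double v : values) {
            sumSquares += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumSquares / values.length);
    }

    // Nearest-rank percentile: the smallest value with at least p percent of the data at or below it.
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(standardDeviation(data)); // 2.0
        System.out.println(percentile(data, 25));    // 4.0
        System.out.println(percentile(data, 75));    // 5.0
    }
}
```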

    Reading data from files

    We have added to the project directories a Java library called ReadCSV that helps you read a comma-separated file in a simple way.  The documentation for the library is available here: ReadCSV.pdf.

    Briefly, to use this library to read from a file you can use the statement:

    ReadCSV reader = new ReadCSV("file.csv");

    The resulting object allows you to access the header line as an array of Strings:

    String[] header = reader.getHeader();

    as well as the program's best guess at the type of data stored in each column:

    String[] columnType = reader.colTypes();

    Each of the elements of the resulting array contains the value "numeric" or "string" indicating the type guessed by the CSV reader.  For the purpose of this project the guessing procedure should be accurate, but be careful if you want to use this library in other projects.

    You can read the actual data in the file, line by line, using the pattern:

    String[] data;
    while ((data = reader.getLine()) != null) {
        // do something with the data
    }

    Lines are read one by one until the end of the file at which point subsequent calls to reader.getLine() return null.  The lines are automatically split, as strings, into the corresponding values for each column.

    If you want to convert the String data to one of the number types, you can use the parse* methods from the wrapper classes:

    double d = Double.parseDouble(data[5]); // converts the 6th column into a double
    int i = Integer.parseInt(data[2]); // converts the 3rd column into an int
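Both parse methods throw a NumberFormatException when the text is not a valid number, so it can be worth guarding the conversion. The helper below is a minimal sketch of one way to do that; the class name ParseSketch, the method name parseOrDefault, and the idea of a fallback value are illustrative and not part of the provided library.

```java
class ParseSketch {
    // Returns the parsed value, or fallback when the text is missing or not a valid number.
    static double parseOrDefault(String text, double fallback) {
        if (text == null) {
            return fallback;
        }
        try {
            return Double.parseDouble(text.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("3.5", -1)); // 3.5
        System.out.println(parseOrDefault("N/A", -1)); // -1.0
        System.out.println(parseOrDefault(null, -1));  // -1.0
    }
}
```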

    IMPORTANT:  you may want to save the ReadCSV object as a private member of your class, or at least save the header line and column type information within the class and use this information within your methods.

    Indicating the file used

    The code provided to you includes one static class variable for the class StatAnalysis called FILE. You should modify this variable to reflect the file you are planning to work on, from among the four files made available to you. The baseline methods can be written in a way that is independent of the actual file, but for the creative part of the project you will in all likelihood want to create custom structures that will not work for every file. For this reason, you should keep track of the file you expect to use. Knowing this file will also help our testing code.
