CS 212 Software Development

CS 212-01, CS 212-02 • Fall 2019

Project 4b Search Engine

Associated Assignments: Project 4 Search Engine

For this project, you will extend your previous project to create a fully functional search engine. This project is split into two main components: (1) a multithreaded web crawler using a work queue to build the index from a seed URL, and (2) a search engine web interface using embedded Jetty and servlets to search that index.

This writeup is for the search engine functionality only. See the general Project 4 Writeup for more details.

Functionality

You will lose 5 points if you do not protect against XSS attacks and another 5 points if you do not protect against SQL injection attacks in your servlets. This means ANYTIME you get data from either the HTTP request or the database, you must escape it before including it in your HTML page. And ANYTIME you store data in your database you must use prepared statements.
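As a rough illustration of both requirements, the sketch below escapes text before it is written into HTML and stores a value through a bound parameter. The class name, method names, and the search_history table are hypothetical and not part of any provided template; a library escaper would serve equally well.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical helper; all names here are illustrative assumptions. */
class SafeOutput {
    /** Replaces the characters that are significant in HTML text and attribute values. */
    static String escapeHtml(String text) {
        StringBuilder escaped = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': escaped.append("&amp;"); break;
                case '<': escaped.append("&lt;"); break;
                case '>': escaped.append("&gt;"); break;
                case '"': escaped.append("&quot;"); break;
                case '\'': escaped.append("&#39;"); break;
                default: escaped.append(c);
            }
        }
        return escaped.toString();
    }

    /**
     * Stores a query using a bound parameter instead of string concatenation.
     * The table and column names are assumed for this sketch.
     */
    static void storeQuery(Connection db, String query) throws SQLException {
        String sql = "INSERT INTO search_history (query) VALUES (?)";
        try (PreparedStatement statement = db.prepareStatement(sql)) {
            statement.setString(1, query);
            statement.executeUpdate();
        }
    }
}
```

A servlet would call SafeOutput.escapeHtml(...) on every request parameter or database value at the point where it is written into the response, not when the value is first read.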

The functionality for this project is broken into 2 parts: core functionality and extra features. You must complete the core functionality before attempting extra features.

Search Engine Core Functionality

You must implement the following core features using embedded Jetty and servlets for a total of 30 points:

  • Search (15 points): Display a webpage with a text box where users may enter a multi-word search query, and a button that submits the query to your search engine.

  • Results (15 points): Your search engine should perform a partial search from an inverted index generated by your web crawler, and return an HTML page with sorted and clickable links to the search results.
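The sketch below shows one way the Results feature could look up a multi-word query, assuming partial search means matching every indexed word that starts with a query word. The index layout (word, then location, then match count) and all names are assumptions for illustration; your inverted index from the earlier projects will differ, and this version ranks by raw match count only rather than by your project's score.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of partial (prefix) search over a sorted index. */
class PartialSearchSketch {
    /**
     * Returns locations with their total match counts, highest count first.
     * Ties keep alphabetical order by location because the sort is stable.
     */
    static List<Map.Entry<String, Integer>> partialSearch(
            TreeMap<String, Map<String, Integer>> index, String queryLine) {
        // Collect each matching index word once, even if several query words match it.
        TreeSet<String> matched = new TreeSet<>();
        for (String query : queryLine.toLowerCase().trim().split("\\s+")) {
            if (query.isEmpty()) {
                continue;
            }
            // tailMap starts at the first word >= query, so prefix matches are contiguous.
            for (String word : index.tailMap(query).keySet()) {
                if (!word.startsWith(query)) {
                    break;
                }
                matched.add(word);
            }
        }

        // Sum the match counts per location across all matched words.
        Map<String, Integer> totals = new TreeMap<>();
        for (String word : matched) {
            index.get(word).forEach((location, count) -> totals.merge(location, count, Integer::sum));
        }

        List<Map.Entry<String, Integer>> results = new ArrayList<>(totals.entrySet());
        results.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return results;
    }
}
```

Each returned location would then be escaped and written into the response as a clickable link, in the order given.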

You should not begin working on extra features until the core functionality is working properly.

Search Engine Extra Functionality

You must implement a minimum of 70 points worth of additional features. These features are broken into three categories: user tracking, database support, and extra functionality. You may choose any combination of features from these categories.

  • User Tracking: The following features require you to use session tracking and/or cookies to store per-user information.

    • Search History (10 points): Store a history of all search queries entered by a user, and allow the user to view and clear that history.

    • Visited Results (10 points): Store a history of all search results visited by a user, and allow the user to view and clear that history. Hint: Modify the search result links to direct back to your search engine, so that you may first store that the link was visited and then redirect the user to the link selected.

    • Favorite Results (10 points): Allow a user to save favorite search results, and allow the user to view and clear those favorites. Hint: Add a special link to each result that saves it as a favorite, but consider how to do this in the least disruptive way for the user.

    • Time Stamps (5 points): Add timestamps to each item stored per user. For example, add timestamps to the user’s search history. You are expected to implement this for all related features to earn full credit. For example, if you implement search history and visited results, timestamps should be added to BOTH features for full credit.

    • Private Search (5 points): Allow users to set an option that turns off all tracking of per-user data. You are expected to implement this for all related features to earn full credit. For example, if you implement search history and visited results, tracking should be turned off for BOTH features for full credit.

    • Partial Search Toggle (5 points): Allow the user to toggle on/off partial versus exact search.

    • Last Login Time (5 points): Track and display the last time the user logged into your search engine. If you are not implementing user accounts, track the last time the user visited your search engine instead.

  • Database Support: The following features require you to connect your search engine to a database over JDBC.

    • User Accounts (20 points): Allow users to register, log in to, and log out of an account with your search engine. You will need to securely store usernames and passwords in a database for this feature, and use some form of session tracking to determine if a user is logged in to the search engine.

    • User Data (15 points): Store all per-user data persistently in a database instead of in cookies. You are expected to implement this for all related features to earn full credit. For example, if you implement search history and visited results, data from BOTH should be stored in a database for full credit.

    • Change Password (10 points): Allow users to change and/or reset their password.

    • Logged In Users (5 points): Track and display the last 5 users to log into your search engine.

    • Page Snippets (15 points): During a crawl, store short snippets of every webpage found in a database and display these snippets whenever that page is returned as a result.

    • Last Crawled (5 points): This feature requires you implement the “Page Snippets” feature. When crawling pages and storing page snippets, also store a timestamp of when that page was crawled in the database. Whenever that page and snippet is returned as a result, display the crawled date as well.

    • Popular Queries, Database Edition (15-20 points): Every time a search is conducted, parse/clean/optimize that query and store the number of times that query has been searched for in a database. Allow users to see the top 5 most popular queries on your search page. The base functionality of this feature is worth 15 points. You can earn an additional 5 points if you make those queries clickable such that when clicked, the results for that query are displayed (i.e. clicking a popular query conducts a search for that query). Note: You cannot implement both this and the non-database “Suggested Queries” feature.

    • Most Visited and/or Favorited Results (10-15 points): This feature requires you implement at least one of the “Visited Results” or “Favorite Results” features. When storing the visit or favorite in the user session, also increment the number of times that page has been visited or favorited in a database. When displaying the user’s visit or favorite history, also show the top 5 visited or favorited pages overall from the database. If you implement this for just one feature (visited -or- favorited results, but not both) it is worth 10 points. If you implement this for both the visited and favorited results features, it is worth 15 points.

    • Reset Database (5 points): This feature requires you implement at least one database feature. Allow users with an administrator password to clear all the tables in the database associated with your search engine.

  • Extra Functionality: The following features allow you to customize the functionality of your search engine.

    • New Crawl (10 points): Allow a user to enter a new seed URL to crawl. The results should be added to your inverted index (not replace the already existing results).

    • Suggested Queries (10 points): Provide users five suggested queries based on either the latest queries made by other users -or- the most popular queries made by other users.

    • Graceful Shutdown (10 points): Allow an administrator to trigger a graceful shutdown of your search engine without calling System.exit(). You will need to create a special servlet for this feature.

    • Index Browser (5 points): Allow users to browse your inverted index as an HTML page with clickable links to all of the indexed URLs.

    • Location Browser (5 points): Allow users to browse all of the locations and their word counts stored by your inverted index as an HTML page with clickable links to all of the indexed URLs.

    • I’m Feeling Lucky Button (5 points): Add a new button to your search page (in addition to the normal search button) that automatically redirects the user to the first search result instead of listing all of the search results. This is similar to the “I’m Feeling Lucky” button that Google Search includes on its page. You have to consider what to do if there are no search results!

    • StringTemplate (5 points): Use StringTemplate to generate your HTML instead of several println() statements. See http://www.cs.usfca.edu/~parrt/course/601/lectures/stringtemplate.html for more information on StringTemplate.

    • Page Statistics (5 points): In addition to providing a clickable link for each search result (i.e. web page), display the page title (via the <title> tag in HTML), word count, and content length (via the Content-Length HTTP header). This information can be stored in-memory (no database connectivity required) by your web crawler (except word count, which is already stored by your inverted index).

    • Search Statistics (5 points): Display the total number of results along with the time it took to calculate and fetch those results, and display the score and number of matches per search result listed.

    • Web Framework (5 points): Design a search engine using any popular CSS/style framework to create a consistent style for all the web pages. For example, consider using Bulma, Bootstrap (Twitter), Pure.css, Material (Google), Semantic UI, or one of many others.

    • Search Brand (5 points): Design a search engine with a distinct brand and tagline. This includes creating a logo and tagline, and including it on all of the web pages.
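The hint for the “Visited Results” feature above can be sketched as follows: each result link points back at the search engine and carries the real destination as an encoded parameter. The /visit path, the url parameter name, and the class name are hypothetical choices for this example.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for building result links that route through the search engine. */
class VisitLinks {
    /** Builds a link to an assumed /visit servlet, with the destination safely encoded. */
    static String trackingLink(String resultUrl) {
        return "/visit?url=" + URLEncoder.encode(resultUrl, StandardCharsets.UTF_8);
    }

    /** Recovers the destination from the encoded parameter value (shown for completeness). */
    static String decodeTarget(String encodedValue) {
        return URLDecoder.decode(encodedValue, StandardCharsets.UTF_8);
    }
}
```

Inside a servlet, request.getParameter("url") already returns the decoded value, so the handler would record the visit and then call response.sendRedirect(...) with it. One design consideration: redirecting only to URLs that actually appear in your index avoids turning the servlet into an open redirect.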

You may implement more extra features than necessary to receive extra credit on this project. The overall project category grade will be capped to 115% at the end of the semester.

Have a feature idea? You can propose an extra feature in a public post on Piazza. If approved, the instructor will post the number of points that feature will be worth on the final project.

Input

Your main method must be placed in a class named Driver. The Driver class should accept the following additional command-line arguments:

  • -port num where -port indicates the next argument is the port the web server should use to accept socket connections. Use 8080 as the default value if it is not provided.

    If the -port flag is provided, your code should enable multithreading with the default number of worker threads even if the -threads flag is not provided.

The command-line flag/value pairs may be provided in any order, and that order does not determine the order in which you perform the operations (i.e. always build the index before performing search, even if the flags are provided in the other order).

Your code should support all of the command-line arguments from the previous project as well.
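A minimal sketch of order-independent flag handling for -port is shown below. It assumes the default of 8080 applies whenever the flag is present without a usable value; the class and method names are hypothetical, and your existing argument parser from earlier projects may already cover this.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of parsing flag/value pairs in any order. */
class PortArgument {
    static final int DEFAULT_PORT = 8080;

    /** A flag starts with a dash followed by a non-digit, so "-1" is treated as a value. */
    static boolean isFlag(String arg) {
        return arg.length() > 1 && arg.charAt(0) == '-' && !Character.isDigit(arg.charAt(1));
    }

    /** Maps each flag to its value, or to null when no value follows it. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (isFlag(args[i])) {
                boolean hasValue = i + 1 < args.length && !isFlag(args[i + 1]);
                flags.put(args[i], hasValue ? args[++i] : null);
            }
        }
        return flags;
    }

    /** Returns the requested port, falling back to the default for a missing or invalid value. */
    static int port(Map<String, String> flags) {
        String value = flags.get("-port");
        if (value == null) {
            return DEFAULT_PORT;
        }
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            return DEFAULT_PORT;
        }
    }
}
```

Because all flags are collected into a map first, Driver can check flags.containsKey("-port") after the index has been built, regardless of where the flag appeared on the command line.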

Output

The majority of the output for this project will be in the form of HTTP responses to a browser. Only output the inverted index or search results to a file if the necessary flags are provided.

Testing

No tests will be provided for this project. Instead, you will demonstrate your search engine functionality to the instructor during your final code review appointment during finals week.