Stop Name Cleaning Project at Sakay.ph
Introduction
Through regular interaction with the Sakay database, I noticed several kinds of errors in the names of stops. These range from spelling errors (Calcoocan instead of Caloocan) and confusing city descriptions (Antipolo City, Manila) to various inconsistencies (Bgy. vs. Brgy. vs. Bgry.). Seeing these errors made me wonder how varied and widespread the errors and inconsistencies might be across all of the roughly 8,000 stop names in the database. Some stops name only the landmark, some name the landmark and street, while others are so specific that they describe stops as being in "Metro Manila" or even the "Philippines". The inconsistency reflects the lack of a clearly defined guideline on how to name the stops. Additionally, given the informal nature of paratransit, stops too often have flexible markers ("sa tapat ng poste sa may Quezon Ave.") or vernacular names like "Ligaya". This project set out to take a thorough first look at all of the stop names in the database, many of which have not been corrected since the first iteration of Sakay's database.
The task began with a focus on spotting spelling errors and logical errors that name a stop as being in the wrong city. From there, more observations on the quality of the stop names were made along the way, and the tools were adapted to handle these errors and inconsistencies.
Methodology
Checking the Spelling
1. Building the dictionary
The first step in building a spellchecker tool was building the dictionary which the spellchecker tool will reference. I combined a comma-separated list of all the words in an English dictionary with a list of all the words in a Filipino dictionary.
Then, my spellchecker tool reads each word from all of the stop names in the feed and checks if it is present in my custom dictionary. It flags every word that is not present in the dictionary and looks for the dictionary word that is closest to it. To find that closest word, the flagged word must be compared to each word in the dictionary. There are many ways to compare two words; the method used in this tool is the Jaro-Winkler distance. The Jaro-Winkler distance scores two words by how many characters they share at roughly the same positions and how many of those shared characters are out of order. A lower distance means that the words are more similar to each other, while a higher distance means the words are more dissimilar. Jaro-Winkler distances lie on a scale of 0 to 1, with 0 meaning that the words being compared are exactly the same and 1 meaning that the two words are completely different. The measure also treats words that share their first few letters as more similar, so a mismatch near the start of a word costs more than one near the end. This suits spellchecking because typographical errors tend to be made towards the middle or end of a word and less often at the start. In the tool, all words in the dictionary and from the stop names were converted to lowercase before being compared, because a majority of words in the dictionary were in lowercase while a majority of words in the stop names were capitalized.
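The lookup described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function names (`jaro`, `jaro_winkler_distance`, `suggest`) and the tiny dictionary are made up for the example, and the prefix weight of 0.1 over at most four letters is the commonly used default rather than a setting confirmed by this project.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1 means identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_distance(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Distance in [0, 1]; 0 means identical. Shared prefixes lower the distance."""
    similarity = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    similarity += prefix * prefix_weight * (1 - similarity)
    return 1 - similarity


def suggest(word: str, dictionary: set):
    """Return None if the word is known, else the closest dictionary word."""
    lowered = word.lower()  # compare in lowercase, as the tool does
    if lowered in dictionary:
        return None
    return min(dictionary, key=lambda cand: jaro_winkler_distance(lowered, cand))


# Toy dictionary for illustration only.
toy_dictionary = {"caloocan", "muntinlupa", "general", "city"}
print(suggest("Calcoocan", toy_dictionary))
print(suggest("City", toy_dictionary))
```

Comparing a flagged word against every dictionary entry is slow for a large dictionary, which is why reducing the number of words that need this comparison matters later on.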
As many location and road names in and around Metro Manila are proper names not commonly found in either the English or the Filipino dictionary, a whitelist of words to be added to the dictionary was created. Typical "words" that appear in stop names but not in the dictionary were added first. These include numbers (1, 2, 3, 19, 20, 30), ordinals (1st, 2nd, 3rd, 7th, 23rd), initials (A., B., C., D., G., Z.), and abbreviations of words you would find on a map (Rd., Blvd., Ave.). Then, using a political map of the Philippines, the names of the provinces around Metro Manila and of the major cities and municipalities in and around Metro Manila were also added to the whitelist. As many major roads are named after national heroes, the names of Philippine heroes and past presidents were added as well. Then, with this dictionary composed of the English and Filipino wordlists and the whitelist, each word in each stop name was iteratively checked against it. Flagged words that were not found in the dictionary but represent actual places in the Philippines were added to the tool's dictionary. Many of these whitelisted words are names (Tolentino, Sanchez, Almeda, Evangelista, etc.), specific places (Bagumbayan, Kalantiaw, Habay), and brands (Jollibee, 7/11, Petron). Once the whitelist included most of the recurring words that were incorrectly flagged as spelling errors, the remaining, manageable list of errors was checked individually to confirm whether each refers to an actual location and stop.
2. Building the architecture of the Spellchecker
Now, with the confidence of this robust dictionary, the architecture of the spellchecker tool was further improved by incorporating the bag-of-words model. Imagine how many times the words "Restaurant", "City", and "Rizal" are used to describe places around Metro Manila.
If the tool confirms that "City" is indeed a correctly spelled word, it should not need to check this again every time the word "City" appears in a different stop. The bag-of-words model prevents this redundancy by making a bag of the unique words found across all of the stop names in the database. Then, instead of checking whether each word in each stop name is in the dictionary, each word in the bag was checked against the dictionary once. If a word was not found in the dictionary, it was compared with each word in the tool's dictionary, and a list of words that could likely replace it was created. If any of these candidate words were found in other stop names, then it is dubbed a "transit word" and given priority among the correction recommendations. The tool is considerably faster after incorporating the bag-of-words model: it takes just under a minute to check almost 9,000 stop names instead of the 45 minutes it used to take.
3. Types of errors detected
Checking the City
1. Generating stop and city geometries
This city-verifying tool was initially planned around checking stop names that incorrectly label stops as being in Manila (e.g. 'Taytay, Manila', 'Antipolo City, Manila', 'Bocaue, Manila'). First, the tool had to be able to verify whether those stops were geographically located in the City of Manila or even in Metro Manila. This was done by first converting each stop's latitude and longitude into a point geometry. Then, geospatial shapefiles of the bounds of Metro Manila and of the City of Manila were converted into a format that could be compared with the point geometries of the stops. Each stop in the database was checked to see if it was located in Metro Manila and/or in the City of Manila.
2. Building queries to find errors
With the previous step ready, finding stops that incorrectly list 'Manila' in their stop names despite not being within the bounds of the City of Manila was a matter of writing a database query that tests for and finds all the relevant stops. Stops that had phrases such as 'Taytay, Manila' or 'Quezon City, Manila' were flagged with error type 'not in City of Manila'. Some may argue that it is not wrong to refer to cities within Metro Manila, such as Quezon City or Pasay City, as 'Quezon City, Manila' or 'Pasay City, Manila', since 'Manila' here may be referring to 'Metro Manila'. To avoid this ambiguity, another query searched for all stops within the City of Manila that had 'Manila' in their stop names. These were flagged so that 'Manila' in the stop name can later be changed to 'City of Manila'. Errors of this type were labeled in the CSV file as 'indicate City of Manila instead of Manila'.
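The two checks above can be illustrated with a small self-contained sketch. It is not the project's implementation: the original work used shapefiles and database queries, while this example swaps in a plain ray-casting point-in-polygon test, and the rectangle standing in for the City of Manila is a rough placeholder, not the real boundary. The names `point_in_polygon`, `flag_stop`, and `ROUGH_MANILA_BOX`, as well as the rule for which mentions of 'Manila' count, are assumptions made for illustration.

```python
import re

# Placeholder rectangle of (lon, lat) vertices, NOT the real city boundary.
ROUGH_MANILA_BOX = [(120.94, 14.55), (121.03, 14.55),
                    (121.03, 14.64), (120.94, 14.64)]

# Assumed rule: a bare 'Manila' that is not already part of
# 'Metro Manila', 'Metropolitan Manila', or 'City of Manila'.
BARE_MANILA = re.compile(
    r"(?<!metro )(?<!metropolitan )(?<!city of )\bmanila\b")


def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many edges a ray going east from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge spans the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def flag_stop(name, lon, lat, city_polygon=ROUGH_MANILA_BOX):
    """Return an error label for the stop name, or None if nothing is flagged."""
    if not BARE_MANILA.search(name.lower()):
        return None
    if point_in_polygon(lon, lat, city_polygon):
        return "indicate City of Manila instead of Manila"
    return "not in City of Manila"


print(flag_stop("Taytay, Manila", 121.13, 14.56))
print(flag_stop("Quiapo, Manila", 120.98, 14.60))
```

With real boundary data, a geospatial library or a spatial database would normally replace the hand-written polygon test; the flagging logic stays the same.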
There were some stops that were very descriptive and included phrases such as 'Quezon City, Metro Manila' or 'Valenzuela City, Philippines'. Although these are not wrong, we determined that they are too specific for the purposes served by the stop names. A commuter using Sakay.ph in the Philippines is likely aware that the stops shown in the app are located in the Philippines. If they are commuting in or around Metro Manila, it is also likely that if they see a stop labeled as being in Quezon City, Caloocan City, or Pasig, they would know that it is within Metro Manila. Another query was then built to identify stop names that include 'Philippines', 'Metro Manila', or 'Metropolitan Manila', and these were flagged in the CSV file as 'extraneous descriptor'.
Finally, the last type of error detected by the city-verifying tool covers very specific "typographical slip" cases that were noticed through interaction with the DOTr transit feed. Stops located in Taytay, Rizal were misspelled in the stop name as 'Tatay' instead of 'Taytay'. Since 'tatay' is a word in the Filipino dictionary, it was, reasonably, not flagged as a spelling error. A query was made to search for stops not within Metro Manila that had 'Tatay' in their stop names. Unfortunately, there was no shapefile data for Taytay, so the stop's location within Taytay could not be verified. However, each flagged stop was double-checked, and none is a false positive. Similarly, stops on 'Juan Luna' in Manila were incorrectly labeled as 'Juan Luma'. These were also corrected so as not to disgrace the name of the most celebrated Filipino artist.
3. Types of errors detected
To summarize, the table below lists the types of city or logic errors.
While checking each stop name's location against Metro Manila and the City of Manila, the city-checker also looked for inconsistencies. Some stops share the same stop name but sit on the border of Metro Manila or of the City of Manila, so the corrections to be made are ambiguous. The stop names and stop IDs involved in such a situation are listed in a separate file and labeled as a 'border case'.
As some stops may need more than one correction, either from more than one spelling error in the stop name or from a combination of spelling and city-related errors, a CSV file entitled changes.csv lists each stop name together with all corrections that need to be incorporated into it. The CSV file lists the original stop name, the stop IDs with this stop name, the errors, the fixes to be made, the suggested stop name, the confidence level, and a 'yes'/'no' column on whether this change should be integrated into the database.
Correcting the Stop Names
The next step, integrating changes into the database, relies on the include column in the changes.csv file. If the value is 'yes', the recommended change will be made in the database; if it is 'no', the change will not be made. The tool automatically designates high-confidence suggestions as 'yes' and medium- to low-confidence suggestions as 'no'. The CSV file can be edited manually to improve the suggestions and to change the include value of any suggestion.
Discussion
Checking the Spelling
Some of the most common errors were 'Montinlupa' instead of 'Muntinlupa', which was spotted in 100 stops, 'Calcoocan' instead of 'Caloocan' in 70 stops, and 'Genteral' instead of 'General' in 40 stops. Another common error was confusion over the use of ñ. There were 45 instances of 'Paranaque' used instead of 'Parañaque', 14 cases of 'Osmena' instead of 'Osmeña', and 10 cases of 'Binan' instead of 'Biñan'.
From the list of recommended suggestions for each spelling error, the tool was able to determine "likely" corrections for some of the errors. A correction is considered likely either when it is a word found in other stop names or when the error appears in a list of common spelling errors. Out of these "likely" corrections, only 20 (7.0%) had wrong suggestions.
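The bag-of-words pass and the preference for words found in other stop names ("transit words") can be sketched like this. It is a simplified illustration under stated assumptions: the standard library's `difflib.get_close_matches` stands in for the Jaro-Winkler comparison the tool actually used, and the function names, the toy dictionary, and the sample stop names are all hypothetical.

```python
import difflib
from collections import Counter


def build_bag(stop_names):
    """Collect every unique lowercase word across all stop names, with counts."""
    bag = Counter()
    for name in stop_names:
        bag.update(word.lower() for word in name.replace(",", " ").split())
    return bag


def check_bag(stop_names, dictionary, n=3):
    """Check each unique word once; return {flagged word: ranked candidates}."""
    bag = build_bag(stop_names)
    known = sorted(dictionary)  # fixed order keeps the output deterministic
    results = {}
    for word in bag:
        if word in dictionary:
            continue
        candidates = difflib.get_close_matches(word, known, n=n, cutoff=0.6)
        # Stable sort: candidates that also occur in other stop names
        # ("transit words") move to the front, the rest keep their order.
        candidates.sort(key=lambda cand: cand not in bag)
        results[word] = candidates
    return results


# Toy inputs for illustration only.
toy_stops = ["Ayla Ave", "Ayala Ave", "Ayala Triangle"]
toy_dictionary = {"ayala", "layla", "ave", "triangle"}
print(check_bag(toy_stops, toy_dictionary))
```

Here 'ayla' is the only word flagged, and 'ayala' is ranked ahead of 'layla' because it also appears in other stop names. Because each unique word is checked once rather than once per stop, the number of expensive dictionary comparisons depends on the vocabulary size, not on the number of stops.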
Checking the City
The most common error flagged by the city-checking part of the tool was stops that wrongly indicate 'Manila' when they are not within the City of Manila. Some examples are 'San Mateo, Manila' and 'Quezon City, Manila'. These account for almost half of all of the stops in the system and may be confusing for first-time commuters. Of these, 159 stops seem to be located on the borders of Manila and may have been flagged due to the inaccuracy of the Metro Manila and City of Manila shapefiles used. These 159 stops were labeled as likely false positives in the output CSV file for further detailed examination. As mentioned earlier, there are also inconsistencies among stops of the same name that are located along the borders of Metro Manila and the City of Manila, so these flagged errors must be reviewed by a person.
Throughout the entire database, there was much inconsistency in how cities were represented. For example, 'Pasay City' was also written in other stops as just 'Pasay', as 'Lungsod ng Pasay', or as 'City of Pasay'. This may not confuse a commuter, but it could be improved in the future to maintain consistency.
The two special cases of 'Tatay' listed instead of 'Taytay' and 'Juan Luma' instead of the correct 'Juan Luna' were also addressed by the city-checker. 46 instances of 'Tatay' and 26 instances of 'Juan Luma' were found, and the appropriate corrections were made.
Summary and Conclusion
In summary, the spell-checking and city-checking tools were able to spot and correct a wide range of errors in the stop names in the database. Out of the 8157 stops in the database, 5795 (71.04% of all of the stops) had at least one type of error flagged. Of these stops, the tool was able to make high-confidence corrections for 5146 stops, and these were applied to Sakay's database.
This tool is the first internal approach towards systematically finding and documenting errors in the stop names after several years of building the database. Its development is an important step in maintaining the health of our database, as it documents the current and future quality of the stop names and calls certain conventions (such as abbreviations and spelling) to our attention. The tool also systematically determines the corrections to be made and measures the confidence of these corrections. This makes it easier to evaluate how well the tool performs these corrections and how further improvements could be made.