Fair weather forecasting? The shortcomings of big data for sustainable development, a case study from Hubballi-Dharwad, India

Sustainable urban mobility is an essential component of sustainable development but requires careful planning in rapidly growing urban areas. This paper investigates the value and limitations of Big Data for evaluating transport policies, plans, and projects in Hubballi-Dharwad, India. Results show how Big Data can enable the outcomes of transport interventions to be evaluated more readily than conventional transport analysis. However, the analysis also found that this data may be less able to detect the impacts of travel behaviours in informal settlements and of extreme weather events. These potential shortcomings, as well as a lack of transparency around the methodology and data sources used by Big Data providers, could generate unintended consequences and biases in transport planning. Reflecting on these challenges, and the wider implications for urban governance, we conclude that there is an urgent need for Big Data and other technical advances in urban modelling to be seen as complements to, rather than substitutes for, wider methods of knowledge generation in urban areas.

interventions, particularly those that discourage private transport options, and to draw lessons from completed projects. Such assessments, however, face a range of challenges, from overly optimistic modelling processes to problems with data access, to the complexity of modelling urban mobility processes (Gouldson, Sudmant, Khreis, & Papargyropoulou, 2020). Consequently, high-quality ex-post assessments of transport interventions that could yield important insights are rare even in high-income nations (Driscoll, 2014; Flyvbjerg, Skamris Holm, & Buhl, 2003; Nicolaisen & Driscoll, 2014). In urban areas in low-income countries, where demand for mobility is rising quickly and public resources face competing demands, addressing this challenge has particular urgency (Cabannes & Lipietz, 2018; Colenbrander et al., 2017).
In this context, tremendous potential is thought to exist from harnessing Big Data: the vast amounts of information coming from mobile phones and other connected devices that are increasingly ubiquitous in our lives. Real-time, geolocated, high-frequency, and (in many cases) low-cost applications of Big Data for transport, including Google Maps, Waze, Apple Maps, TomTom, and a host of other services, are already used by billions on a daily basis, ostensibly demonstrating that they are valued by individuals and businesses.
The value of such data in a public policy context, however, does not naturally follow from such services being widely used by individuals and firms (Khan et al., 2020). Big Data sources generally provide a restricted number of variables, requiring assessments to draw inferences with explanatory characteristics (Hu & Jin, 2017). Datasets can be biased, blinding policymakers to the impact of policies on particular populations (Kwan, 2018). The way algorithms capture, sort, clean, and pass on data can alter our understanding of phenomena in ways that policymakers (and sometimes information providers) are not aware of (Zou & Schiebinger, 2018). Modes of governance informed by such data can incentivise city governments to prioritise a narrow set of metrics (Hughes, Giest, & Tozer, 2020) and to discount wider means of urban knowledge generation (Coletta & Kitchin, 2017). And what data is available, for whom, and under what circumstances remains a legally and ethically contentious question, with a number of authors reminding us that it would be naive to assume that the interests of private firms automatically align with the interests of the wider public (Albino, Berardi, & Dangelico, 2015; Docherty, Marsden, & Anable, 2018; Wang & Ma, 2021). Questions surrounding the value of Big Data for policymaking thus extend from the specificities of data collection techniques and the ways algorithms are developed to overarching logics and rationalities and their implications for governmentalities (Bissell, 2018; Coletta & Kitchin, 2017; Kitchin, Lauriault, & McArdle, 2015).
Nonetheless, "Smart Cities" relying heavily on Big Data have become a national policy objective in a number of countries worldwide.
In India, the Smart Cities Mission was launched in 2015 to support sustainable development through the application of information and communication technologies (Dwevedi, Krishna, & Kumar, 2018). The Smart Cities Mission is focused on cities with a population between 1 and 4 million ("second tier" cities), and particular opportunities are thought to exist in the transport sector and from new sources of Big Data, including Google Maps (Jindal, Kumar, & Singh, 2020; Rakesh, Heeks, Chattapadhyay, & Foster, 2018; Rizwan, Suresh, & Rajasekhara Babu, 2016). Hubballi-Dharwad, the city on which this case study focuses, is among the 100 cities in the "Smart Cities Mission", and "Smart Mobility" is recognised as a key area for intervention (Hubballi-Dharwad, 2013).
First, by focusing on a smaller urban centre in the Global South, this research considers an underexplored context. In contrast with the larger and often wealthier urban centres that are the focus of much existing research, smaller urban centres frequently have very high urban growth rates and are yet to invest significantly in public transport networks. Such urban centres are also more likely to face capacity issues in government due to smaller budgets and less established institutional structures, possibly leading them to be more attracted to "smart innovations" using Big Data. Cities with these characteristics are anticipated to be the source of the majority of urban population growth over the coming decades (UN DESA 2019) and are, therefore, critical to the achievement of the Sustainable Development Goals.
Sustainable urban mobility plans (SUMPs) are a focus of a growing body of literature (Okraszewska et al., 2018); however, low-income regions of the world continue to be underrepresented.
The potential value of Big Data for transport planning in Hubballi-Dharwad, and "second-tier" global cities more generally, is considered in the first and second sections of the Results. We first analyse the extent to which Big Data derived from Google Maps can provide information on key attributes of the transport system, including hourly and weekly travel times on key routes, before assessing the impact of a new bus rapid transit (BRT) line on travel times to key locations in the city.
The second way this analysis adds to the existing literature is by probing some of the specific potential shortcomings of Big Data raised by previous authors. The third section of the Results assesses the possibility that algorithms may capture, sort, clean, and pass on data in ways that alter our understanding of phenomena (Zou & Schiebinger, 2018) by focusing on a major rainstorm event that affected Hubballi-Dharwad in June 2018. Finally, the fourth section of the Results assesses potential biases in the data (Kwan, 2018) by comparing the quality of the data provided for informal settlements and wealthier parts of the city.

| STUDY AREA
A rapidly expanding urban population and sprawling cities are placing increasing pressure on transport systems in India. At the same time, partly as a result of increasing incomes, there is a growing trend towards private transport. The transport sector contributes about 15% of CO2 emissions in India, a share that has been increasing over time (Gupta & Garg, 2020), and congestion, air pollution, and road traffic accidents are common in urban areas, at great cost to society and the economy (Rajasekaran, Rajasekaran, & Vaishya, 2021).
The National Urban Transport Policy (NUTP) of 2006 emphasised the need to give greater priority to public transport, and the Sustainable Urban Transport Programme (SUTP) was designed to support and demonstrate the principles of the NUTP. Following the adoption of these policies, a Bus Rapid Transit (BRT) scheme connecting the twin cities of Hubballi and Dharwad was chosen as a demonstration project. The engineering study completed before construction contains many of the "best practice" elements for BRT networks. For example, the system is designed with a dedicated roadway, raised platforms, a limited number of stations, and an electronic payment system. Importantly, the document also highlights that reducing congestion along the main corridors of the city is a key justification for the project (CEPT, 2013). Assessing private vehicle travel times along the route is, therefore, seen as an indirect means of assessing the success of the project and its overall impact on the city's transport network.
Whether Google Maps travel time estimates (or other Big Data sources) can be used in this way has relevance beyond Hubballi-Dharwad. BRT systems are considered an important tool for climate change mitigation due to their potential to provide an important public service while also contributing to global emissions reduction targets (Sudmant, Mi, et al., 2020; Sudmant, Verlinghieri, et al., 2020). Studies from several cities with well-established BRT systems, such as Bogota, Johannesburg, and Mexico City (Ingvardson & Nielsen, 2018), substantiate this. In addition, BRTs contribute to the reduction of air pollutants such as carbon monoxide and particulate matter, primarily through reducing the total number of vehicle kilometres travelled and by encouraging the replacement of older, smaller vehicles with newer, cleaner high-capacity buses (Stankov et al., 2020). Research also suggests that BRT systems can contribute to equity objectives by providing low-income groups with greater access to public transport, travel time and cost savings, and safety benefits (Venter, Jennings, Hidalgo, & Pineda, 2018). This is significant in the context of this research since informal settlements have fewer vehicles that Google can track to determine travel times, and residents may have fewer devices from which Google can collect data. A recent study from Karnataka's capital, Bangalore, shows that fewer than 1% of informal settlement households own a car (Roy et al., 2018), compared with more than 70% of wealthier households in the same city (Bansal, Kockelman, Schievelbein, & Schauer-West, 2018). Assuming this is representative of cities in India, this could make it challenging for Google to estimate travel times from these areas, a matter we investigate at the end of the Results section. Although Google Maps' estimated time of arrival algorithm is not public, it is understood that Google uses different features to assess live travel times. These include official speed limits, recommended speeds, information on road types and topography, and real-time traffic information. A combination of these data sources is processed to refine the algorithm.

| RESULTS
Results are presented in four sections. First, Google Maps travel time estimates are used to identify key attributes of the transport network, including hourly and weekly traffic variation across the city. In Figure 4, the morning and evening congestion periods are presented more clearly by assessing travel speeds across different routes.

| Traffic variation in Hubballi-Dharwad
Combining these by day of the week reveals that Saturday has the most traffic and Sunday has the least traffic congestion. The effect of time of day is seen to be significantly more important than the day of the week for the level of congestion given the much larger differences in travel speeds.
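As an illustration only, the aggregation behind this kind of comparison can be sketched in a few lines of Python. The observations, their field layout, and the function names below are hypothetical and are not taken from the study's dataset or code; the sketch simply converts each travel time estimate into an implied speed and averages it by hour of the day and by day of the week.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def speed_kmh(distance_km, travel_time_min):
    """Average speed implied by one travel time estimate."""
    return distance_km * 60.0 / travel_time_min


def mean_speed_by(observations, key):
    """Group observations by key(timestamp) and average the implied speeds."""
    groups = defaultdict(list)
    for timestamp, distance_km, travel_time_min in observations:
        groups[key(timestamp)].append(speed_kmh(distance_km, travel_time_min))
    return {k: mean(v) for k, v in sorted(groups.items())}


# Hypothetical observations: (query time, route distance in km, estimated minutes)
observations = [
    (datetime(2018, 6, 2, 9, 0), 6.0, 18.0),   # Saturday morning
    (datetime(2018, 6, 2, 18, 0), 6.0, 24.0),  # Saturday evening
    (datetime(2018, 6, 3, 9, 0), 6.0, 12.0),   # Sunday morning
    (datetime(2018, 6, 3, 18, 0), 6.0, 15.0),  # Sunday evening
]

by_hour = mean_speed_by(observations, lambda t: t.hour)
by_weekday = mean_speed_by(observations, lambda t: WEEKDAYS[t.weekday()])
```

Comparing the spread of `by_hour` with the spread of `by_weekday` is one simple way to judge whether time of day or day of the week matters more for congestion in a given sample.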

| The impact of the bus rapid transit network
In order to understand the effect of the BRT on travel times, we assess travel times before and after the BRT began operation and compare routes that are parallel to and perpendicular to the BRT.
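One conventional way to formalise a before-and-after comparison between affected and unaffected routes is a difference-in-differences calculation. The sketch below is an assumption about how such a comparison could be computed, not a description of the estimator used in this study; all travel times and names are hypothetical.

```python
from statistics import mean


def mean_change(before, after):
    """Change in mean travel time (minutes) between two periods."""
    return mean(after) - mean(before)


def difference_in_differences(parallel_before, parallel_after,
                              perpendicular_before, perpendicular_after):
    """Change on routes parallel to the BRT minus change on perpendicular routes.

    Perpendicular routes act as the comparison group: any city-wide shift in
    travel times should appear on both sets of routes and cancel out.
    """
    return (mean_change(parallel_before, parallel_after)
            - mean_change(perpendicular_before, perpendicular_after))


# Hypothetical travel times in minutes
parallel_before = [30.0, 32.0, 28.0]
parallel_after = [27.0, 29.0, 25.0]
perpendicular_before = [20.0, 22.0, 18.0]
perpendicular_after = [20.0, 21.0, 19.0]

effect = difference_in_differences(parallel_before, parallel_after,
                                   perpendicular_before, perpendicular_after)
```

A negative `effect` would indicate that travel times fell more on parallel routes than on perpendicular ones; as with the analysis itself, it says nothing about the mechanism behind the change.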
The hypothesis behind this approach is that trips parallel to (or along) the BRT line will be affected by the new transport option, while trips perpendicular to the BRT should not be affected. To provide clarity, the city is divided into regions, as shown in Figure 5.
had the effect of easing congestion, one of the stated goals of the project. However, whether this effect arises from moving drivers from cars onto the bus, from discouraging drivers from taking this route, or from another means is beyond the scope of this analysis to determine. Further, who is taking the BRT, and how the specific trips they are taking have been affected, is information not available using this data set and approach.
the largest impact (each observation is one data point, representing an estimation of the travel time and distance between a cell and a centre or a centre and a cell), 14 showed faster times on June 4th and 9 showed slower times. Of these, only five routes were 10% faster or slower than usual.

| Comparing Google Maps estimates with a simple transport model
Following the results in the previous section, we were curious to explore the extent to which Google data is adding on-the-ground information to its estimates. In the absence of detailed information on the way Google Maps estimates are calculated, we develop a model of travel times that is based on a set of characteristics seen to have an important role in predicting travel times: the hour of the day, the day of the week, population density, and the distance of the trip (Table 1).
Using linear regression, this model explains 85% of the variation across all 3.2 million trips in our dataset, suggesting that 15% of the variation in Google's estimated travel times is related to other factors.
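To make the idea of "variation explained" concrete, the following is a minimal sketch of an ordinary least squares fit and its R², written with the Python standard library only. The eight trips, the way hour and day are reduced to two 0/1 indicators, and all function names are hypothetical simplifications; the specification in Table 1 may encode these characteristics differently.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(rows, y, coef):
    """Share of the variance in y explained by the fitted model."""
    fitted = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - f) ** 2 for yi, f in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical trips: [distance_km, density (thousand people per km2),
#                      peak hour (0/1), weekend (0/1)]
trips = [
    [2.0, 5.0, 1, 0], [4.0, 8.0, 0, 0], [6.0, 3.0, 1, 1], [8.0, 9.0, 0, 1],
    [3.0, 6.0, 0, 0], [5.0, 4.0, 1, 0], [7.0, 7.0, 1, 1], [9.0, 2.0, 0, 0],
]
# Hypothetical estimated travel times in minutes for the trips above
minutes = [15.0, 17.5, 25.0, 29.0, 14.5, 22.0, 30.0, 30.5]

coefficients = fit_ols(trips, minutes)
r2 = r_squared(trips, minutes, coefficients)
```

Here `1 - r2` plays the role of the share of variation in the estimates that the basic characteristics leave unexplained.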
We assume that a significant portion of this 15% of additional variation comes from Google's ability to collect real-time travel information on actual travel conditions, relating, for example, to the weather, traffic accidents, or other events that are too rare or uncertain to be included in the characteristics of our model.
The extent to which Google is able to capture this real-time information may not be the same across the city, particularly in informal settlement areas due to a lower concentration of mobile devices. To test this hypothesis, we can compare the fit of our model for trips starting from informal settlement areas versus the fit of our model for a trip starting from non-informal settlement areas.
If a subset of the dataset (informal settlement or non-informal settlement originating trips) shows a lower R² in our model, this suggests that Google might have more real-time information, allowing Google to provide more bespoke travel time estimates that differ from the ones in the "basic model." If the R² is higher, this suggests Google travel time estimates are more likely to be based on a set of characteristics similar to those in our model, implying that they may not have more information to improve their estimates. This effect should be magnified for shorter trips. Longer trips will frequently converge onto the same routes and over the course of a longer trip, drivers will have more opportunity to change their route to avoid traffic. We would therefore expect the R² to be higher for relatively long trips compared with shorter trips.
Results find that the model of travel times we apply explains a higher proportion of all variation in trip times from informal settlement areas compared with the remaining grid cells. This phenomenon exists across all grid cells and also when we restrict our analysis to the "finer" 1 km square cells. Results also show a higher R² as the minimum trip length is increased, in line with our assumption about longer trips.
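The subset comparison can be illustrated with a deliberately reduced sketch. For brevity it regresses travel time on trip distance alone (the model described above uses four characteristics), and the trips, group labels, and function names are hypothetical. It fits each group separately and allows a minimum trip length to be imposed.

```python
def r_squared_simple(x, y):
    """R² of a one-predictor least squares line y = a + b * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("need variation in both distance and travel time")
    return (sxy * sxy) / (sxx * syy)


def r_squared_by_group(trips, min_distance_km=0.0):
    """Fit each origin group separately, keeping trips of at least the given length."""
    grouped = {}
    for group, distance_km, minutes in trips:
        if distance_km >= min_distance_km:
            xs, ys = grouped.setdefault(group, ([], []))
            xs.append(distance_km)
            ys.append(minutes)
    return {g: r_squared_simple(xs, ys) for g, (xs, ys) in grouped.items()}


# Hypothetical trips: (origin type, distance in km, estimated minutes)
trips = [
    ("informal", 1.0, 4.0), ("informal", 2.0, 9.0),
    ("informal", 3.0, 12.0), ("informal", 4.0, 17.0),
    ("formal", 1.0, 6.0), ("formal", 2.0, 5.0),
    ("formal", 3.0, 14.0), ("formal", 4.0, 13.0),
]

all_trips = r_squared_by_group(trips)
longer_trips = r_squared_by_group(trips, min_distance_km=2.0)
```

In this toy sample the basic relationship fits the "informal" group more closely than the "formal" group, which is the pattern that, under the reasoning above, would be read as the estimates for that group carrying less additional real-time information.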
These findings could be a result of fundamental aspects of transport in Hubballi-Dharwad. Travel times from informal settlement areas may be more predictable due to geography or the configuration of the travel network. This would be despite the fact that informal settlement areas are found across Hubballi-Dharwad, including adjacent to formal settlement areas. However, without detailed information on the raw data Google Maps is using, or the way that data is processed before it is passed through Google Maps, we cannot rule out that the data we are being provided with is more detailed outside of informal settlement areas. Analysis of the BRT suggests Google Maps data may also be able to support ex-post assessment, a process that is critical for learning but often not undertaken due to the cost and challenge of accessing data (Nicolaisen & Driscoll, 2014). Results here, which show a relatively modest change in travel times along the BRT compared with routes perpendicular to the BRT, also highlight the value of the large datasets accessible with Google Maps, which allow for a level of statistical robustness that would be challenging with other methods.

| DISCUSSION
A transport department that completed these analyses could easily replicate them in the future. And since policymakers in other urban areas also using Google Maps would have access to data of the same types and format, knowledge sharing and learning could be radically increased. These realisations have enchanted academics who forecast the beginning of a fundamental shift in our epistemological approach to transport planning led by data analysis rather than the development of hypotheses (Kitchin, 2014; Rabari & Storper, 2015) and suggest the private sector could play an important role supporting sustainable low carbon development (Colenbrander, Sudmant, Chilundika, & Gouldson, 2019; Scheyvens, Banks, & Hughes, 2016; Sudmant, Colenbrander, Gouldson, & Chilundika, 2017).
The extent to which such a shift in the nature of urban transport and urban transport policymaking is on the horizon is beyond the scope of this paper. However, the third and fourth analyses in the Results section were undertaken with the intention of exploring how Google data might contribute to more novel analysis of the kind that has been associated with this transition in transport planning (cf. Kitchin, 2014).
The speed with which data can be collected and assessed is a key feature of Big Data and has clear value for transport policymakers.
Rapid analysis can help in identifying transport hotspots and responding to emergencies. In contrast with our personal experience with Google Maps in other urban contexts during periods of disruption, however, we were surprised to find no clear impact
In order to probe the characteristics of this underlying algorithm, we developed a simple model of the transport network. Across the entire dataset, results show that characteristics including time of day, day of the week, the distance of a trip, and the density of the urban area travelled through explain 85% of the variation in travel times.
This suggests that either these variables, or factors correlated with them, are constituents of the model used by Google. This also suggests that 15% of the variation in estimated travel times may be attributable to other variables or information captured by Google-connected devices. Wider factors might include topography, road quality, and speed limits, while information collected from connected devices might include traffic caused by a car breaking down, a slow driver, or weather.
In this context, we would assume that data captured by Google-connected devices would override the estimates of the model.
Described another way, if Google has information that a road is poor quality, on a steep hill, and that it is the busiest day of the week and time of the day (implying that a road is likely to be relatively slow for vehicles according to the model), but connected devices are reporting that vehicles are travelling quickly, we would assume that Google would eventually conclude that this is a relatively fast route for cars and provide estimates accordingly. Similarly, for the opposite case, data from connected devices should allow Google to correctly predict slower travel times on roads even if an ex-ante estimate suggested relatively fast travel speeds. Over a long period of time, during which many data points are collected, Google estimates should improve significantly by this means.
Importantly, the extent to which Google can account for certain unpredictable events (e.g., a car breaking down) will likely still depend on timely data from connected devices. All else constant, this factor will be most prominent for shorter trips where there are fewer opportunities for alternative routes to avoid such events. We would, therefore, expect that the difference between a basic model of the transport network and a more complicated (and, by assumption, more accurate) model, such as that used by Google, would be largest for shorter trips and smallest for longer trips.
The failure of this hypothesis for trips starting from informal settlement areas, with the model we have developed providing a similar degree of accuracy for shorter and longer trips, may be explained in three (non-exclusive) ways. First, as with any statistical analysis, there is the potential that these results are a statistical artefact. This is mitigated to some degree by the number of observations and by the different specifications of the model presented. Second, the elements of the basic model may be a better fit for trips from informal settlements over shorter distances. In other words, characteristics left out of our model, including topography, weather, and car accidents, may only have a small effect on travel times from informal settlement areas. This seems unlikely as the informal settlements are in different parts of the city and adjacent in many cases to wealthier areas (see Figure 2). Further, one would not expect some of these factors (a car breaking down or an unexpected rainstorm) to be significantly correlated with the wealth of the neighbourhood a car is passing through.
Finally, Google travel times from the informal settlement areas of the city may not include the same amount and quality of on-the-ground data as Google is able to access from wealthier areas, forcing Google to provide less accurate estimates. We would emphasise that these results call for further research to be verified. However, there is reason to think the third of these explanations could be the cause of these results. Only approximately one-third of the population in India had a smartphone in 2019 (Statista, 2019). The vast majority of these devices are Android, but ownership is skewed towards the wealthier population (ibid). And among the poorer population, some share a device or leave it at home for safety purposes, further reducing their visibility in data collected. These factors suggest that there is a causal pathway that could lead to lower quality travel time estimates from poorer areas.
While concerns around systematic biases in Big Data sets are well established (Batty et al., 2012; Kwan, 2018), a number of authors have implicitly made the assumption that these biases are not large enough to be a concern in analyses of Google data. In addition, the exact nature of these biases remains poorly explored. Here, we find some evidence to suggest the existence of spatial and temporal limitations of Google data, which may have a social consequence: reduced quality of travel time data for informal settlement populations, with implications for urban policymaking and inclusive urban development.
It should be noted that on-the-ground assessment to confirm these findings, or comparison with a city-based transport model, was not possible. Nonetheless, these latter analyses raise wider concerns about the use of Big Data for informing urban policies, plans, and programmes. If there is no transparency around the quality of data and the way it has been processed, there may be significant limits to the extent that surprising results can be explained, leading to concerns about datasets as a whole. This issue is particularly evident in our findings around the days Hubballi-Dharwad faced flooding but applies also to the findings on the differences between informal and non-informal settlements, and the impact of the BRT. And since the data available is wide but thin, that is, massive in the quantity of information but lacking in number of variables, corroborating the results with other datasets is challenging.
Important in this context is that the potential for errors in the data is known, but the nature of these errors is not. This is in contrast with conventional transport modelling methods where the exact nature of errors is unknown, but comparisons with other datasets can be used to determine confidence levels and indications of bias. Big Data sources rarely come with a detailed methodology, quality assurance, or user manual of any kind. On the contrary, Big Data is often described as speaking for itself (Villanueva et al., 2016). But if the data is of questionable validity, and therefore does not speak for itself, there may be some irony in using it for ex-post analysis.
For policymakers, the key concern in this context regards unintended consequences. An individual's travel app that does not work during poor weather may lead to a dangerous travel decision, but more likely leads only to a lengthy commute. A transport planner who bases a policy or project on data that only considers fair weather, by contrast, may leave a city gridlocked for the course of the monsoon.
For the academic community, the specific aspects of urban life that are misrepresented or that fall between the columns of ever more impressive datasets may be a secondary, if critical, issue. Faced with a new age of seemingly limitless information, more fundamental questions may consider the ways algorithmic governance expands the capacity to govern by replacing or crowding out other forms of knowledge and power.
Reflecting on waves of enthusiasm for more "scientific" approaches to urban planning over recent decades, Duminy and Parnell (2020) remind us that the debates between "interpretivists" and

| CONCLUSIONS
Google Maps and other sources of Big Data present an emerging opportunity for policymaking in transport and more widely. The extent to which these approaches can be relied upon, however, depends on the value they add to analysis weighed against the new limitations and sources of uncertainty they generate. To date, quantitative analysis has placed a much greater focus on the opportunities.
Here, we contribute to what we hope will be a growing field of analysis assessing the quantitative shortcomings of Big Data approaches for informing policymaking, and how these may be overcome, where efforts are made to understand the lived realities behind the data and the complementarities between Big Data and wider methods of knowledge generation in urban areas.
Future analysis in this field can be targeted to three areas. First, analysis can explore the existence and extent of disparities between the value of Big Data for populations from different socio-economic backgrounds. This analysis is essential to understand the extent and possible consequences for sustainable development, especially in rapidly growing urban areas where the vast majority of infrastructure is yet to be built. Second, analysis is needed to "truth" the proliferation of Big Data sources with on-the-ground realities. This can help to determine the key areas where new data sources have shortcomings and advantages relative to established sources of information and methods of analysis. Finally, interdisciplinary work that explores, both conceptually and in practical terms, the ways empirical and qualitative urban data sources can be integrated is needed to ensure wider methods of knowledge generation in urban areas can complement the growing proliferation of Big Data.

ACKNOWLEDGEMENT
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article. This work was supported by funding from the Department for International Development grant 113550.