Stanford Literary Lab – Reflection 2

While the evolving New Yorker project (February 13 meeting) seemed like the perfect opportunity to begin the internship and follow the entire development of a DH project in literature, sorting out the different interests and subgenre plans is proving to be a complex and time-consuming endeavour. Thus, at this point, my opportunity as an intern to work on a task in this project has been delayed.

However, the leading group of professors involved in other projects suggested that I sit in on the Microgenre project, which has been ongoing for some time. The project explores the “discursive inter-disciplinarity of novels, using machine learning to identify points at which authors incorporate the language and style of other contemporary disciplines into their narratives.” The team is looking for moments in a wide range of novels across genres and time periods to determine the way authors signal the shift between narrative and history, philosophy or natural science. Some of the questions they are hoping to answer: “Do these signaling practices change with time or with discipline? Akin to what Bakhtin terms “heteroglossia,” these stylistic shifts indicate not only the historically contingent ways that novels are assembled from heterogeneous discourses, but they also shed light on the practices of disciplinary knowledge itself.” Since the number of disciplines at universities exploded after 1870, the project is examining novels and journals between 1880 and 1930.

The first meeting I attended, on February 14, looked at an extensive spreadsheet containing the disciplinary breakdown of journals found in JSTOR’s database. The goal was to narrow down the number of active journals in each discipline during the target years. The discussion and some quick on-the-spot Internet research yielded results for science, literary, religious, philosophy, and psychology journals. Besides JSTOR’s metadata, Wikipedia’s was also considered in the search. At the end of the meeting, I was asked to research law as a discipline in the period 1880-1930: specifically, to identify the premier journals in the field, among those in the JSTOR holdings, in both the U.S. and the U.K., and to give a brief, qualitative sense of the field at this time (the top schools, the difficulty of obtaining a degree, the major questions in the field). My research results are contained in the PDF file “Law as a Discipline” and were briefly discussed at the following meeting on February 28, along with the other major journals. During that meeting, duplicate journals were discounted because they span multiple genres and could confuse the data read by the DFA classifier.

The discriminant function analysis (DFA) program was created by Mark Algee-Hewitt. In it, the groups are our various disciplines (anthropology, philosophy, history, etc.), and the team is training and running the classifier on 100-sentence excerpts from the corpus of texts. As J. D. Porter explained to me, classification in literary DH is often done with words, but the team is trying to capture style and avoid “aboutness”, so for the variables they mostly used part-of-speech tags instead, plus a few other things such as sentence length, number of clauses, and numbers of named people and places. The results are therefore unusual, in that the classifier doesn’t know any semantic content. Nonetheless, it performs well above chance, and for some categories remarkably well (e.g., it correctly identified >60% of the passages in the history, novels, and psychology categories, where chance would have gotten ~12% of them right). This seemed a bit confusing to me, since I have not seen the previous results or heard much about this tool.
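To make the idea of "style without aboutness" more concrete for myself, here is a minimal sketch of how a tagged excerpt might be turned into classifier variables. It is only my own illustration, not the Lab's code: the tag list `TAGS`, the function name `style_features`, and the input format (sentences as lists of token/tag pairs) are all assumptions, and it covers just two of the variable types mentioned above (part-of-speech proportions and sentence length).

```python
from collections import Counter

# Hypothetical tag inventory; the project's actual tag set is not given here.
TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP"]


def style_features(sentences):
    """Turn a POS-tagged excerpt into a dictionary of style variables.

    `sentences` is a list of sentences, each a list of (token, tag) pairs.
    Only the tags are used, so the words themselves never reach the
    classifier.
    """
    tags = [tag for sentence in sentences for _token, tag in sentence]
    total = len(tags)
    counts = Counter(tags)

    # Relative frequency of each part of speech in the excerpt.
    features = {
        f"pos_{tag}": (counts[tag] / total if total else 0.0) for tag in TAGS
    }
    # Mean sentence length, measured in tokens.
    features["mean_sentence_length"] = (
        total / len(sentences) if sentences else 0.0
    )
    return features


# A toy two-sentence excerpt (the real excerpts are 100 sentences long).
excerpt = [
    [("The", "DET"), ("doctor", "NOUN"), ("smiled", "VERB")],
    [("He", "PRON"), ("spoke", "VERB"), ("slowly", "ADV")],
]
features = style_features(excerpt)
```

Vectors like this, one per excerpt and labelled by discipline, would then be what a discriminant analysis is trained on. Because no word ever appears among the variables, any success the classifier has must come from style rather than subject matter.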

Other items discussed in the 2/28 meeting included the following:

  • How to measure fiction:
    • Sample sizes and what they will signal (big corpus vs. small corpus)
  • Article length (what should be a median length)
  • Journalism Corpus:
    • articles vs. newspapers
    • where to find newspapers
    • British Periodicals
    • anthology of yellow journalism
  • Literary Criticism
  • Book reviews
  • Reviews of Pedagogical practices
  • What journals should be included in Politics
  • Discipline of Theology/Religion
  • Do disciplines cohere?
  • Classification Model
  • Outlier Slices

The Microgenre Project meeting on March 7 was highly anticipated because Mark promised to bring and share the newly run novel chunks and their colorful bar graphs indicating the disciplinary breakdown of genres and hopefully pointing to shifts within the writing.

A Study in Scarlet by Arthur Conan Doyle

Features of the DH element:

  • graph of disciplines
  • 73% success rate for 100 chunks
  • Values(100)/discipline
  • The bigger the chunk the better the classifier
  • Just parts of speech! – a continuous surprise
  • Microgenres Master feature
  • smaller chunks – division points
  • Posterior Probability/Position
  • 10 most accurate in each discipline/50 sentence chunks
  • Random samples

The meeting concluded with an agreement to sample about 200 sentence chunks and examine them on a sliding scale. In addition, the already graphed novel chunks will be reviewed by the members of the project and matched with the predictions of the graphs to look for shifts.
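As I understood it, examining chunks "on a sliding scale" means moving an overlapping window through a novel, so that a shift in discipline shows up at a position in the text. The sketch below is one possible reading, under my own assumptions: that the chunks are 200 sentences long, and that the window advances by a fixed step. The function name `sliding_chunks` and the step size are hypothetical and were not discussed in the meeting.

```python
def sliding_chunks(sentences, size=200, step=50):
    """Yield (start_index, chunk) pairs of overlapping sentence windows.

    `size` and `step` are illustrative defaults, not the project's settings.
    A text shorter than `size` yields a single, shorter chunk.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    last_start = max(len(sentences) - size, 0)
    for start in range(0, last_start + 1, step):
        yield start, sentences[start:start + size]


# Stand-in for a novel of 500 sentences; integers replace real sentences.
novel = list(range(500))
windows = list(sliding_chunks(novel, size=200, step=100))
starts = [start for start, _chunk in windows]
```

Each window would be classified separately, and plotting the result against its starting position would show where in the novel the predicted discipline changes.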

My Questions:

  • Which novels are included?
  • Could I have access to any of the sheets/graphs/novel chunks?

While it was helpful to finally see the actual graphs depicting novel chunks and the interplay of various other disciplines within the writing, at the same time the amount of data was overwhelming to take in without the actual reference. I would have liked to go over some of the graphs beforehand to study these elements. I was familiar with most of the novels mentioned in the study; however, I do not have the descriptive list of novels included. Once again, I left the meeting without any particular task to complete. Because of the formally closed nature of the projects, I do not have access to the data or findings of the project before publication. Thus, it seems futile to attend the upcoming meeting on March 15, since I am not able to study the data or provide further research or input in any part of the project. It has been fascinating to learn about the numerous ways the Literary Lab examines literature and prepares for its Computational Criticism in the field. However, as far as the DH internship is concerned, I feel that I need to search for a new assignment where I am able to participate and learn about the use of tools and strategies applied in the field of Digital Humanities.